IPFS Content Identifiers

2018-08-06 • 7 min read

If you’ve used IPFS or watched a talk or tutorial about it, you know that IPFS generates hashes for the data that’s added to the network. While it might be obvious why (IPFS uses those hashes to identify content in the network), it’s less obvious how those hashes are put together.

Sure, there’s probably some hash algorithm applied to the data. It turns out, however, that IPFS goes far beyond that: it combines several Multiformat protocols to create its content identifiers and keep them future-proof, which I think is quite smart, as we’ll soon learn.

One of the Multiformat protocols is Multihash. If you haven’t heard about it before, I recommend heading over to my post on future-proof cryptographic hashes, which talks about what the protocol is and why it exists. This post builds on top of it.

The anatomy of CIDs

Let’s start by getting a hash from IPFS for the content “Hello World” by piping it through ipfs add (you can also create a file with the content first and call ipfs add on it, but I prefer to skip this step):

$ echo "Hello World" | ipfs add -n
added QmWATWQ7fVPP2EFGu71UkfnqhYXDYH566qy47CnJDgvs8u QmWATWQ7fVPP2EFGu71UkfnqhYXDYH566qy47CnJDgvs8u

The -n option is a cool trick to get a hash from IPFS without actually writing the data to disk. ☝🏼 Also notice that IPFS outputs the hash twice. This is because it outputs both the hash of the content and the path of the file it resides in. If there’s no path, like in our case, it uses the hash as the path as well.

Alright, QmWATWQ7fVPP2EFGu71UkfnqhYXDYH566qy47CnJDgvs8u it is. This content identifier, or CID for short, for “Hello World” will always be the same as long as the content stays the same. While this is not too special, let’s see what happens if we do the same thing with different content. “Hello other world”, for example, will output QmcaHpwn3bs9DaeLsrk9ZvVxVcKTPXVWiU1XdrGNW9hpi3. Go on, try it yourself! 🙂

Noticed something? Right, both hashes start with Qm. This is an important characteristic to be aware of - all hashes generated by IPFS with its default settings start with Qm. Spoiler alert: this will change once IPFS has moved to CIDv1b32, which was not the case at the time of writing this post, but more on that later.

So what’s this Qm prefix all about? The first thing we need to know is that there are multiple versions of CIDs. There’s CIDv0 and CIDv1. We’ll touch on what the exact differences are in a minute, but whenever we see a hash generated by IPFS that starts with Qm we know we’re dealing with a CIDv0 hash. This is because a CIDv0 is a Multihash encoded in Base58.

Uhm… so why exactly do we know it’s a CIDv0? Now it’s useful to know what a Multihash is (again, if this is new to you read this post first and come back 🙃)!

Since we know it’s a Multihash, we also know that the first two bytes of the decoded CID represent the hash algorithm that was used to hash the original data and the length of the resulting hash, respectively, while the rest of the CID is the actual hash of the data.

<hash-type> - <hash-length> - <hash-digest>

We can verify this by transforming our “Hello World” hash from Base58 to Base16 (or hexadecimal) and comparing the first byte with the Multihash Table:

122074410577111096cd817a3faed78630f2245636beded412d3b212a2e09ba593ca

or, to make it clearer:

12 - 20 - 74410577111096cd817a3faed78630f2245636beded412d3b212a2e09ba593ca
<hash-type> - <hash-length> - <hash-digest>
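The Base58-to-Base16 conversion above can be sketched in a few lines of Python. This is only an illustration, not how IPFS itself does it: the helper names b58decode and split_multihash are made up for this post, and the split assumes that hash type and hash length fit into one byte each, which is the case for the CID we are looking at.

```python
# Bitcoin-style Base58 alphabet (no 0, O, I or l), as used by CIDv0 strings.
B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def b58decode(text: str) -> bytes:
    """Decode a Base58 string into raw bytes (hypothetical helper)."""
    number = 0
    for char in text:
        number = number * 58 + B58_ALPHABET.index(char)
    body = number.to_bytes((number.bit_length() + 7) // 8, "big")
    # Each leading '1' in Base58 stands for one leading zero byte.
    leading_zeros = len(text) - len(text.lstrip("1"))
    return b"\x00" * leading_zeros + body


def split_multihash(raw: bytes):
    """Split a multihash into (hash type, hash length, hash digest).

    Assumption: both header fields are a single byte each.
    """
    return raw[0], raw[1], raw[2:]


cid_v0 = "QmWATWQ7fVPP2EFGu71UkfnqhYXDYH566qy47CnJDgvs8u"
hash_type, hash_length, digest = split_multihash(b58decode(cid_v0))
print(hex(hash_type), hex(hash_length), digest.hex())
```

The three printed values correspond to the hash-type, hash-length and hash-digest parts shown above.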

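The first byte, 0x12, represents sha2-256 in the Multihash table, which means the hash digest is a SHA-256 hash! The second byte, 0x20, says that the digest is 32 bytes long. Because every CID built this way starts with the same two bytes, its Base58 form always starts with Qm. But it turns out this is not the whole story. If we try to verify this and use a SHA2-256 function to get a hash for “Hello World”, we’ll notice it doesn’t really add up…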

$ echo Hello World | shasum -a 256
d2a84f4b8b650937ec8f73cd8be2c74add5a911ba64df27458ed8229da804a26

We get d2a84f4b8b650937ec8f73cd8be2c74add5a911ba64df27458ed8229da804a26 instead of 74410577111096cd817a3faed78630f2245636beded412d3b212a2e09ba593ca… so what’s going on there?
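If you don’t have shasum at hand, the same comparison can be sketched with Python’s hashlib. One thing to keep in mind is that echo appends a newline, so the bytes that get hashed are “Hello World\n”. The variable names are just for illustration.

```python
import hashlib

# `echo` appends a newline, so these are the bytes shasum actually saw.
content = b"Hello World\n"
plain_digest = hashlib.sha256(content).hexdigest()

# Digest embedded in the CID, copied from its hexadecimal form.
cid_digest = "74410577111096cd817a3faed78630f2245636beded412d3b212a2e09ba593ca"

print(plain_digest)
print("matches the CID digest:", plain_digest == cid_digest)
```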

After a bit of research I’ve learned that IPFS is doing a little bit more work than just hashing the data and creating a multihash out of that. David Dias, team lead and researcher of the IPFS project, puts it quite nicely in this StackOverflow question:

IPFS chunks the given file into 256 KiB pieces - this is not really relevant for us right now, as we’re dealing with very little data.
Each chunk goes into a so-called “DAG” (Directed Acyclic Graph) node inside a so-called “Unixfs protobuf”. More on that in a second.
Another DAG is created with links to all the chunks.

…it does what?!

Yeah, that was my reaction too. Lots of words in there that need to be demystified. I’ll go into more detail on DAGs and protocol buffers in future articles, but to keep it very simple for now: IPFS serializes some metadata (is it a file? Is it a directory? What’s the size in bytes?) along with our original data. It then serializes that with more metadata (what other data is our data linking to?), multihashes the result using sha2-256 (which just happens to be the default hashing algorithm at the time of writing this post) and puts that in the resulting CID we see on the screen. The APIs mentioned here can be used to play with that stuff.

So in order to get the hash digest that resides in the CID, we have to get that serialized data and pipe that through sha2-256. IPFS' block API lets us do exactly that. First let’s inspect that serialized data:

$ ipfs block get QmWATWQ7fVPP2EFGu71UkfnqhYXDYH566qy47CnJDgvs8u | sed -n l
$
\022\b\002\022\fHello World$
\030\f$

This already shows us that there’s more than just the “Hello World\n” data. Now, let’s pipe this through sha2-256 and see what happens:

$ ipfs block get QmWATWQ7fVPP2EFGu71UkfnqhYXDYH566qy47CnJDgvs8u | shasum -a 256
74410577111096cd817a3faed78630f2245636beded412d3b212a2e09ba593ca
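For the curious, the whole path from “Hello World\n” to the CID can be rebuilt by hand without a running IPFS node. The following Python sketch rests on a few assumptions that are not spelled out in this post: the protobuf field numbers are read off the bytes that sed printed above (type, data and file size inside the Unixfs message, which is then wrapped in the data field of the outer DAG node), and it only covers a file small enough to fit in a single chunk with no links. The helper names are made up for illustration.

```python
import hashlib

# Bitcoin-style Base58 alphabet, as used by CIDv0 strings.
B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def wrap_small_file(content: bytes) -> bytes:
    """Rebuild the block of a single-chunk file (assumed layout, no links)."""
    unixfs = (
        b"\x08" + varint(2)                          # field 1, type: 2 = file
        + b"\x12" + varint(len(content)) + content   # field 2, the data itself
        + b"\x18" + varint(len(content))             # field 3, file size
    )
    # Outer DAG node: field 1 (data) carries the Unixfs message.
    return b"\x0a" + varint(len(unixfs)) + unixfs


def b58encode(raw: bytes) -> str:
    """Encode raw bytes as a Base58 string."""
    number = int.from_bytes(raw, "big")
    encoded = ""
    while number:
        number, remainder = divmod(number, 58)
        encoded = B58_ALPHABET[remainder] + encoded
    leading_zeros = len(raw) - len(raw.lstrip(b"\x00"))
    return "1" * leading_zeros + encoded


block = wrap_small_file(b"Hello World\n")
digest = hashlib.sha256(block).digest()
# Multihash: 0x12 (sha2-256), digest length, then the digest.
multihash = bytes([0x12, len(digest)]) + digest

print(digest.hex())
print(b58encode(multihash))
```

If the assumptions hold, the first line printed is the digest from the shasum call above and the second line is the CID we started with.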

There it is! At this point I’d like to give a shout-out to Alan and Steven, who helped me figure out this last bit.

Also, if this was a bit hard to follow, no problem. It’s not required to know what’s going on behind the scenes to use the tool. Just keep an eye on future articles here that will take a closer look at these things 🙃.

Alright, now that we’ve got a better idea of where those Qm* hashes come from, let’s quickly talk about how CIDv0 differs from CIDv1.

Differences in v1

Version 1 content identifiers encode even more information than their predecessor, in a very compact format. While a CIDv0 encodes hash type, hash length and hash digest, a CIDv1 can be represented as:

<mb><version><mc><mh>

Where

mb is a Multibase code - very similar to multihashes, multibase codes encode the base encoding type (base16, base32, base58, etc.) and the base-encoded data itself into a single string.
version is the CID version in use - this makes CIDs upgradable.
mc is a Multicodec type - similar to multihashes and multibases, just for codec types.
mh is a Multihash - basically what we’re getting in CIDv0.

It turns out that these are very powerful properties. For example, since the base encoding type is part of the CID, one can easily convert a CID to a different base encoding, such as one that produces shorter strings.
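To make the mb, version, mc and mh layout a bit more tangible, here is a rough Python sketch that wraps the multihash of our “Hello World” CID into a CIDv1 string. It relies on values that do not appear in this post and come from the Multibase and Multicodec tables as I understand them: b as the prefix for lowercase base32 without padding, and 0x70 as the codec for the kind of DAG node IPFS creates for files. Treat both as assumptions and double-check them against the tables.

```python
import base64

# Multihash of our "Hello World" CID: 0x12, 0x20, then the digest.
multihash = bytes.fromhex(
    "1220" "74410577111096cd817a3faed78630f2245636beded412d3b212a2e09ba593ca"
)

CID_VERSION = 0x01      # <version>
DAG_PB_CODEC = 0x70     # <mc>, assumed codec for IPFS file nodes
BASE32_PREFIX = "b"     # <mb>, assumed prefix for lowercase base32, no padding

# Both version and codec are small enough to fit in one byte each here.
binary_cid = bytes([CID_VERSION, DAG_PB_CODEC]) + multihash
encoded = base64.b32encode(binary_cid).decode("ascii").lower().rstrip("=")
cid_v1 = BASE32_PREFIX + encoded

print(cid_v1)
```

Swapping the base is then just a matter of encoding binary_cid differently and changing the one-character prefix; the bytes underneath stay the same.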

If you want to know what base encoding you’re dealing with, take a look at the Multibase Table provided by the spec. Can you guess what encoding this hash uses? 🙂

MTXVsdGliYXNlIGlzIGF3ZXNvbWUhIFxvLw==
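Once you’ve made your guess, you can check it with a small Python sketch (spoiler ahead). The lookup table below is only a tiny excerpt of the Multibase Table, written down from memory, so treat it as an assumption and compare it with the spec. Only the base64 variant with padding is actually decoded here.

```python
import base64

# Tiny excerpt of the Multibase Table (assumed; see the spec for the full list).
MULTIBASE_PREFIXES = {
    "f": "base16",
    "b": "base32",
    "z": "base58btc",
    "m": "base64",
    "M": "base64pad",
}

encoded = "MTXVsdGliYXNlIGlzIGF3ZXNvbWUhIFxvLw=="
prefix, payload = encoded[0], encoded[1:]
encoding = MULTIBASE_PREFIXES[prefix]
print(encoding)

decoded = b""
if encoding == "base64pad":
    # Standard base64 with '=' padding.
    decoded = base64.b64decode(payload)
    print(decoded.decode("ascii"))
```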

Moving to Base32

At the time of writing this post there was an ongoing effort to migrate to CIDv1 with base32 as the default encoding (CIDv1b32), while staying backwards compatible with CIDv0. According to the linked issue, this is motivated by security considerations with subdomains, an experimental protocol handler API, and support for legacy URIs.

This probably raises even more questions, but for now we’ll stop here and let everything we’ve just learned sink in.
