At its core, IPFS is a distributed system for storing and accessing files, websites, applications, and data. It is transport-layer agnostic, which means it can communicate over a variety of transports, including Transmission Control Protocol (TCP), uTP, UDT, QUIC, Tor, and even Bluetooth. Compared to HTTP, IPFS can deliver content faster because it locates files by their hash rather than by server location. Once you have the hash, you ask the network "who has the content with this hash?", then connect to the corresponding nodes and download it directly. This forms a point-to-point overlay that provides fast, wide-reaching, ready-to-use routing.

IPFS is essentially a P2P system for retrieving and sharing IPFS objects. An IPFS object is a data structure with two fields: Data, a blob of unstructured binary data, and Links, an array of link structures that point to other IPFS objects. A link structure has three data fields:

- Name: the name of the link
- Hash: the hash of the linked IPFS object
- Size: the cumulative size of the linked IPFS object, including the objects reached by following its links

IPFS objects are usually referenced by their Base58-encoded hash. For example, the IPFS command-line tool can be used to view the IPFS object with the hash QmarHSr9aSNaPSR6G9KFPbuLV9aEqJfTk1y9B8pdwqK4Rq.

You may notice that all hashes start with "Qm". This is because the hash is actually a multihash: the first two bytes specify the hash function and the length of the hash. In the example above, the first two bytes in hexadecimal are 1220, where 12 indicates the SHA256 hash function and 20 indicates the length of the hash in bytes (hexadecimal 20, that is, 32 bytes).

The Data and named Links give a collection of IPFS objects the structure of a Merkle DAG. DAG means directed acyclic graph, and Merkle indicates that it is a cryptographically authenticated data structure that uses cryptographic hashes to address content.
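The multihash prefix described above can be checked by hand. The following Python sketch decodes the Base58 string from the example and splits off the first two bytes. The helper names base58_decode and split_multihash are illustrative and not part of any IPFS library, and the sketch assumes that the function code and the length each fit in a single byte, which holds for the SHA256 case discussed here.

```python
from typing import Tuple

# Bitcoin-style Base58 alphabet (no 0, O, I, or l).
BASE58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def base58_decode(text: str) -> bytes:
    """Decode a Base58 string into raw bytes."""
    number = 0
    for char in text:
        # str.index raises ValueError for characters outside the alphabet.
        number = number * 58 + BASE58_ALPHABET.index(char)
    body = number.to_bytes((number.bit_length() + 7) // 8, "big")
    # Each leading '1' encodes one leading zero byte.
    leading_zeros = len(text) - len(text.lstrip("1"))
    return b"\x00" * leading_zeros + body


def split_multihash(encoded: str) -> Tuple[int, int, bytes]:
    """Return (hash function code, digest length, digest).

    Assumption: code and length are single bytes (true for SHA256: 0x12, 0x20).
    """
    raw = base58_decode(encoded)
    code, length, digest = raw[0], raw[1], raw[2:]
    if len(digest) != length:
        raise ValueError("digest length does not match the multihash header")
    return code, length, digest


code, length, digest = split_multihash("QmarHSr9aSNaPSR6G9KFPbuLV9aEqJfTk1y9B8pdwqK4Rq")
print(hex(code), hex(length), len(digest))  # 0x12 0x20 32
```

Because every SHA256 multihash begins with the same two bytes and has the same total length, the Base58 encoding always begins with the same characters, which is why the hashes share the "Qm" prefix.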
To visualize the graph structure, we draw IPFS objects as a graph with the Data in the nodes and the Links as edges pointing to other IPFS objects, where the name of each Link is the label on its edge. We will now give examples of the various data structures that can be represented by IPFS objects.

IPFS can easily represent a file system consisting of files and directories. The representation of files breaks down into the following cases.

A small file (< 256 kB) is represented by an IPFS object whose Data is the file contents (plus a small header and footer) and which has no links, i.e. the Links array is empty. Note that the file name is not part of the IPFS object, so two files with different names and the same contents have the same IPFS object representation and therefore the same hash. A small file can be added to IPFS with the ipfs add command, its contents can be read back with ipfs cat, and the underlying object structure can be inspected with the ipfs object command.

A large file (> 256 kB) is represented by a list of links pointing to file chunks that are each < 256 kB, with only minimal Data specifying that the object represents a large file. The links to the file chunks have the empty string as their name.

A directory is represented by a list of links pointing to IPFS objects that represent files or other directories. The names of the links are the names of the files and directories. For example, consider a directory test_dir in which the files hello.txt and my_file.txt both contain the string Hello World!\n and the file testing.txt contains the string Testing 123\n. When this directory structure is represented as IPFS objects, the file containing Hello World!\n is automatically deduplicated: its data is stored in only one logical location in IPFS, addressed by its hash.
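The file and directory representation described above can be imitated with a small in-memory model. The Python sketch below is a toy, not the real IPFS format: it assumes SHA-256 over a JSON encoding in place of the actual serialization and multihash, uses the placeholder Data values "large-file" and "directory", and assumes that test_dir contains hello.txt and a subdirectory my_dir holding my_file.txt and testing.txt. All function names are hypothetical.

```python
import hashlib
import json

CHUNK_SIZE = 256 * 1024  # chunk limit taken from the text (256 kB)

# Toy content-addressed store: hash -> object. NOT the real IPFS format.
store = {}


def put_object(data: bytes, links: list) -> str:
    """Store an object with Data and Links fields and return its hash."""
    obj = {"Data": data.hex(), "Links": links}
    # Assumption: SHA-256 over canonical JSON stands in for real IPFS hashing.
    key = hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()
    store[key] = obj
    return key


def cumulative_size(key: str) -> int:
    """Size of an object's Data plus everything reachable through its links."""
    obj = store[key]
    return len(obj["Data"]) // 2 + sum(link["Size"] for link in obj["Links"])


def make_link(name: str, key: str) -> dict:
    """Build a link structure with the Name, Hash, and Size fields."""
    return {"Name": name, "Hash": key, "Size": cumulative_size(key)}


def add_file(content: bytes) -> str:
    """Small files become one object; large files become a list of chunk links."""
    if len(content) <= CHUNK_SIZE:
        return put_object(content, [])
    chunks = [content[i:i + CHUNK_SIZE] for i in range(0, len(content), CHUNK_SIZE)]
    # Links to file chunks carry the empty string as their name.
    links = [make_link("", put_object(chunk, [])) for chunk in chunks]
    return put_object(b"large-file", links)


def add_directory(entries: dict) -> str:
    """A directory is a list of named links to files or other directories."""
    links = [make_link(name, key) for name, key in sorted(entries.items())]
    return put_object(b"directory", links)


hello = add_file(b"Hello World!\n")
my_file = add_file(b"Hello World!\n")
testing = add_file(b"Testing 123\n")
my_dir = add_directory({"my_file.txt": my_file, "testing.txt": testing})
test_dir = add_directory({"hello.txt": hello, "my_dir": my_dir})

print(hello == my_file)  # True: same contents, same hash
print(len(store))        # 4: two distinct files and two directories
```

Since file names live only in the directory links, hello.txt and my_file.txt resolve to a single stored object, which is the deduplication the text describes.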
The IPFS command-line tool can seamlessly follow directory link names to traverse the file system.

IPFS can also represent the data structures that Git uses for versioned file systems. The Git commit object is described in the Git Book. The main properties of a commit object are that it has one or more links, with names like parent0, parent1, etc., pointing to previous commits, and a link named object (called a tree in Git) that points to the file system structure that the commit refers to.

Take the earlier directory structure as an example, now with two commits: the first commit is the original structure, and in the second commit the file my_file.txt has been updated to say another world instead of the original Hello World. Deduplication happens automatically here as well, so the only new objects in the second commit are the main directory, the new directory my_dir, and the updated file my_file.txt.

Blockchains have a natural DAG structure because past blocks are always linked by hash from the blocks that succeed them. More advanced blockchains such as the Ethereum blockchain also have an associated state database with a Merkle-Patricia tree structure, which can also be emulated using IPFS objects.

Assume a simple blockchain model in which the blocks and the state database are modeled as IPFS objects. Putting the state database on IPFS again gives us deduplication: between two blocks, only the state entries that have changed need to be explicitly stored, rather than the entire state (which would significantly increase the data burden).

An interesting point here is the difference between storing data on the blockchain and storing a hash of the data on the blockchain. On the Ethereum platform, storing data in the associated state database carries a high fee, which is meant to minimize the bloat of the state database.
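The two-commit example described above can be reproduced with a toy content-addressed store to see which objects the second commit actually adds. This Python sketch is illustrative only: it assumes SHA-256 over a JSON encoding in place of real IPFS hashing, the directory layout (hello.txt next to a my_dir subdirectory) and the exact replacement text "Another World!\n" are assumptions, and put_object and add_tree are hypothetical names.

```python
import hashlib
import json

store = {}  # toy content-addressed store: hash -> object


def put_object(data: str, links: dict) -> str:
    """Store an object and return its hash. links maps link name -> hash."""
    obj = {"Data": data, "Links": links}
    # Assumption: SHA-256 over canonical JSON stands in for real IPFS hashing.
    key = hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()
    store[key] = obj
    return key


def add_tree(my_file_text: str) -> str:
    """Add the example directory tree and return the hash of its root."""
    hello = put_object("Hello World!\n", {})
    my_file = put_object(my_file_text, {})
    testing = put_object("Testing 123\n", {})
    my_dir = put_object("directory", {"my_file.txt": my_file, "testing.txt": testing})
    return put_object("directory", {"hello.txt": hello, "my_dir": my_dir})


# First commit: only an "object" link to the file system structure.
first_commit = put_object("commit 1", {"object": add_tree("Hello World!\n")})
objects_before = set(store)

# Second commit: a "parent0" link to the previous commit plus the updated tree.
second_commit = put_object(
    "commit 2",
    {"parent0": first_commit, "object": add_tree("Another World!\n")},
)
new_objects = set(store) - objects_before

# Everything except the commit object itself: root directory, my_dir, my_file.txt.
new_tree_objects = new_objects - {second_commit}
print(len(new_tree_objects))  # 3
```

The unchanged files hello.txt and testing.txt hash to the same keys in both commits, so they are stored once and shared between the two trees.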
Therefore, a common design pattern for larger data is to store not the data itself but the IPFS hash of the data in the state database.

Typically, blockchains distinguish between what is in the global ledger that every miner replicates (that is, the data stored in the chain itself) and data that may be referenced in the chain but is not replicated among all nodes. If a blockchain with an associated state database is already represented in IPFS, the distinction between storing a hash on the blockchain and storing data on the blockchain becomes blurred, since everything is stored in IPFS anyway and the hash of the block only needs the hash of the state database. In this case, if someone stores an IPFS link in the blockchain, we can seamlessly follow that link to access the data as if the data were stored in the blockchain itself.

However, we can still distinguish between on-chain and off-chain data storage by looking at what miners need to process when creating a new block. In the current Ethereum network, miners need to process transactions that update the state database. To do this, they need access to the full state database so that they can update it wherever it changes. Therefore, in a blockchain state database represented in IPFS, we still need to mark data as "on-chain" or "off-chain". For miners, "on-chain" data is essential for local mining, and this data is directly affected by transactions. "Off-chain" data has to be updated by users and is not touched by miners.
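The pattern of keeping large data off-chain and recording only its hash on-chain can be sketched in a few lines of Python. This is a simplified model under stated assumptions: two dictionaries stand in for the state database and for IPFS, a plain SHA-256 hex digest stands in for an IPFS hash, and the function names are hypothetical.

```python
import hashlib

state_db = {}         # stands in for the on-chain state database
off_chain_store = {}  # stands in for IPFS: content hash -> data


def store_off_chain(data: bytes) -> str:
    """Store data under its content hash and return that hash."""
    key = hashlib.sha256(data).hexdigest()
    off_chain_store[key] = data
    return key


def record_document(name: str, data: bytes) -> None:
    """Keep the data off-chain and record only its hash on-chain."""
    state_db[name] = store_off_chain(data)


def resolve_document(name: str) -> bytes:
    """Follow the on-chain hash to the off-chain data and verify it."""
    key = state_db[name]
    data = off_chain_store[key]
    if hashlib.sha256(data).hexdigest() != key:
        raise ValueError("off-chain data does not match the on-chain hash")
    return data


document = b"x" * 1_000_000
record_document("report", document)
print(len(state_db["report"]))                 # 64 hex characters, whatever the document size
print(resolve_document("report") == document)  # True
```

The on-chain entry stays the same size no matter how large the document is, and because the entry is the hash of the content, anyone following the link can check that the off-chain data has not been altered.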