Resumption: How Internet Computer Nodes Quickly Catch Up to the Blockchain’s Latest State

An explanation of how replication enables nodes on the Internet Computer blockchain to resume and catch up to the latest state of the protocol.

By Manu Drijvers, Engineering Manager (Consensus) | DFINITY

The Internet Computer blockchain is created by hundreds of nodes — eventually scaling to millions over the next decade — in independent data centers located around the world that are running the Internet Computer Protocol (ICP). This provides a secure and reliable way to run canister smart contracts to build dapps, DeFi platforms, NFTs, websites, and internet services directly on the open internet. The Internet Computer is the first scalable blockchain, where the network can infinitely increase its capacity by adding new subnets (aka blockchains) where groups of canisters run together.

Within each subnet, the Internet Computer makes sure that canisters run securely and reliably:

Securely, meaning that the state of the canister only changes according to the rules of the canisters, and that it cannot be tampered with such that the state of the canister does not agree with its code.
Reliably, meaning that canisters in the subnets do not suddenly stop running.

Canisters get this security and reliability through an approach called replication. If you were to look under the hood of a subnet, you would see that it is powered by multiple replicas. Each replica holds the state of all the canisters on the subnet and processes all the messages addressed to those canisters. The Internet Computer ensures that even if some of the replicas powering a subnet are offline or even malicious, the subnet continues to process messages for its canisters. More precisely, as long as less than a third of the replicas are offline or malicious (i.e., more than two-thirds of them are online, available, and participating in the protocol), the subnet continues to make progress. To keep the replicas in agreement, the network uses a consensus protocol, so that all of the replicas process the same messages via a blockchain.

Each replica on an Internet Computer subnet has its own view of the blockchain, and the replicas exchange artifacts via a gossip network. But sometimes some replicas might be unavailable. For example, in a four-replica subnet consisting of Replicas One, Two, Three, and Four, perhaps Replica Four is temporarily disconnected from the internet. Because the Internet Computer Protocol is fault tolerant, the remaining three replicas should be able to make progress even while Replica Four is offline: more than two-thirds of the nodes are available, so the blockchain should continue to grow and the subnet should continue processing messages. If a second replica became unavailable at the same time, this would no longer work, because more than a third of the replicas would be offline. When Replica Four comes back online, it needs to be able to fully catch up to the latest state of the protocol.
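The four-replica example can be checked with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the condition simply restates the “less than a third” rule from the text.

```python
def max_faulty(n: int) -> int:
    """Largest f with 3 * f < n, i.e. strictly fewer than a third of n replicas."""
    return (n - 1) // 3


def can_make_progress(n: int, unavailable: int) -> bool:
    """True while strictly fewer than a third of the replicas are offline or malicious."""
    return 3 * unavailable < n


# The four-replica subnet from the example above.
print(max_faulty(4))            # prints 1
print(can_make_progress(4, 1))  # Replica Four offline: prints True
print(can_make_progress(4, 2))  # a second replica drops out: prints False
```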

Additionally, the network might at some point want to add a node to a subnet, perhaps to increase the subnet’s fault tolerance, but this node joins without knowing anything about the state of the subnet. Here again, the node needs to be able to fully catch up to the latest state of the subnet so that it can participate in the protocol. Whenever more than two-thirds of the nodes of a subnet are online and available, the blockchain must keep growing, so that the canisters on the subnet run reliably even if some of the nodes powering it are offline or malicious. A consequence is that an honest replica must always be able to catch up to the latest state, no matter how far behind it is.

Conventional blockchains like Bitcoin and Ethereum achieve this property by relying on their blocks being stored forever: all of their blocks are needed to cryptographically verify transactions and participate in the blockchain. For example, the Bitcoin blockchain is currently estimated to be roughly 350 GB in size, and Ethereum is roughly 900 GB, so fully syncing them takes considerable time and resources. The Internet Computer’s subnets are designed to process substantially more data with less latency, so it is not feasible to require that all participants have the full blockchain to operate on the network. This poses a challenge: what if Replica One is so far behind the other replicas of the subnet that they have all already deleted the parts of the blockchain that Replica One is looking for?

Another complication is that the nodes of a given subnet can change over time. Since our protocol relies on verifying signatures from the nodes of a subnet, it’s very difficult for a replica that’s behind to know which signatures to trust, because it doesn’t know which nodes are currently powering the subnet.

The Internet Computer therefore implements a novel approach to resumption, building on Chain Key cryptography, that allows nodes to fully participate (and do all the required cryptographic validation) without requiring all historical blocks. Every subnet (i.e., every blockchain) has one fixed public key that does not change over time. The corresponding secret key is not held by any single node; it is shared among the nodes of the subnet. If the membership of the subnet changes, the secret shares are securely redistributed to the new nodes. The nodes of a subnet can collaboratively sign an artifact on behalf of the subnet, and such artifacts can be verified using the fixed subnet key.

This makes it very easy to verify objects that were signed by the subnet, because all that’s needed is one fixed public key that does not change over time. Even a replica that’s very far behind and doesn’t know the current nodes of the subnet can verify such a signature.
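To build intuition for how a secret can be shared among nodes and redistributed when membership changes, here is a toy sketch using Shamir secret sharing over a prime field. This is an assumption made for illustration, not the scheme the Internet Computer actually uses; the modulus, node ids, and function names are all hypothetical. The point it shows is that new nodes can receive fresh shares of the same secret without anyone reconstructing it, which is why the corresponding public key can stay fixed.

```python
import random

P = 2**127 - 1  # toy prime field modulus (assumption, chosen for convenience)


def share(secret, threshold, ids, rng):
    """Give each id a point on a random degree-(threshold-1) polynomial with f(0) = secret."""
    coeffs = [secret] + [rng.randrange(P) for _ in range(threshold - 1)]
    return {i: sum(c * pow(i, k, P) for k, c in enumerate(coeffs)) % P for i in ids}


def lagrange_at_zero(i, ids):
    """Lagrange coefficient of share i when interpolating at x = 0."""
    num, den = 1, 1
    for j in ids:
        if j != i:
            num = num * (-j) % P
            den = den * (i - j) % P
    return num * pow(den, -1, P) % P


def recover(shares):
    """Interpolate f(0). Used here only to check the toy; real nodes never do this."""
    ids = list(shares)
    return sum(shares[i] * lagrange_at_zero(i, ids) for i in ids) % P


def reshare(old_shares, threshold, new_ids, rng):
    """Each old holder sub-shares its own share; new nodes combine the sub-shares.
    The secret itself is never reconstructed along the way."""
    old_ids = list(old_shares)
    sub = {i: share(old_shares[i], threshold, new_ids, rng) for i in old_ids}
    return {
        n: sum(lagrange_at_zero(i, old_ids) * sub[i][n] for i in old_ids) % P
        for n in new_ids
    }


rng = random.Random(0)
secret = 123456789
old = share(secret, 3, [1, 2, 3, 4], rng)                            # nodes 1..4
new = reshare({i: old[i] for i in (1, 2, 3)}, 3, [2, 3, 4, 5], rng)  # node 1 leaves, node 5 joins
print(recover({i: new[i] for i in (3, 4, 5)}) == secret)            # prints True
```

In a real threshold signature scheme the nodes would use their shares to produce signature shares rather than ever recovering the key; `recover` exists only to confirm that the toy resharing preserves the secret.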

To solve the resumption problem, the Internet Computer introduces a special artifact called the catchup package (CUP). This CUP allows a replica to securely jump to a newer height, skipping parts of the blockchain. It is signed using this fixed subnet key, such that even replicas that are behind can always verify its authenticity and catch up to the latest state. An honest replica makes sure it always has a CUP available, which allows other replicas to always fully catch up to the latest state of the protocol.

The remaining question is: What exactly must go into this CUP, such that a replica can securely skip ahead? To answer that, let’s first look into the details of the protocol. The Internet Computer blockchain orders the messages that should be executed on the subnet, resulting in the replicated state of the subnet, which contains all the memory of the canisters. Using the previous replicated state and the messages in the next block, every replica can compute the next replicated state. For a replica to catch up, it needs to have one of these replicated states, such that it can compute the next one on its own. Additionally, blocks refer to these replicated states, and their validity is checked against parts of the replicated state and the blocks that follow that state in the blockchain. This allows the Internet Computer to require, for example, that one message does not appear multiple times in the blockchain and that it is not processed multiple times. To do such checks, the replica not only needs to have the right replicated state, but also needs to have some part of the blockchain, which is another ingredient needed for a CUP.
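The idea that the next replicated state is a deterministic function of the previous state and the next block can be sketched in a few lines. Everything here is a simplification made for illustration: the state is a dictionary of per-canister counters, messages are plain dictionaries, and the set of already executed message ids lives inside the state, whereas the text explains that the real check also draws on recent blocks.

```python
from dataclasses import dataclass, field


@dataclass
class ReplicatedState:
    height: int = 0
    counters: dict = field(default_factory=dict)  # toy stand-in for canister memory
    seen_ids: set = field(default_factory=set)    # ids of messages already executed


def validate_block(state, block):
    """A block is valid only if no message id repeats, within it or relative to the state."""
    ids = [m["id"] for m in block]
    return len(ids) == len(set(ids)) and not (set(ids) & state.seen_ids)


def apply_block(state, block):
    """Deterministically compute the next state from the previous state and a block."""
    if not validate_block(state, block):
        raise ValueError("block repeats an already processed message")
    counters = dict(state.counters)
    for m in block:
        counters[m["canister"]] = counters.get(m["canister"], 0) + m["amount"]
    new_ids = {m["id"] for m in block}
    return ReplicatedState(state.height + 1, counters, state.seen_ids | new_ids)


block = [{"id": "m1", "canister": "c1", "amount": 5}]
replica_a = apply_block(ReplicatedState(), block)
replica_b = apply_block(ReplicatedState(), block)
print(replica_a == replica_b)            # two replicas agree: prints True
print(validate_block(replica_a, block))  # replaying the same message: prints False
```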

Finally, there is the random beacon. At every height, there is a random-looking artifact called a random beacon, which is an unpredictable value, and the previous random beacon is needed to verify the next one. The random beacon is also required to verify the consensus artifacts, because it is used to select the roles in the consensus protocol. A replica therefore needs to have a random beacon to fully catch up.
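The chaining property of the random beacon, where each beacon can only be checked against its predecessor, can be illustrated with a toy model. This sketch makes a loud assumption: it uses an HMAC under a hypothetical shared key as a stand-in for the subnet’s collaborative signature. Unlike the real protocol, verifying an HMAC requires the secret key, so this shows only the chaining and role-selection ideas, not the actual cryptography.

```python
import hashlib
import hmac

SUBNET_KEY = b"toy-subnet-key"  # assumption: stand-in for the subnet's signing capability


def next_beacon(prev: bytes, height: int) -> bytes:
    """Derive the beacon for `height` from the previous beacon (unpredictable without the key)."""
    msg = height.to_bytes(8, "big") + prev
    return hmac.new(SUBNET_KEY, msg, hashlib.sha256).digest()


def verify_beacon(prev: bytes, height: int, candidate: bytes) -> bool:
    """A candidate beacon can only be checked if the previous beacon is known."""
    return hmac.compare_digest(next_beacon(prev, height), candidate)


def rank_replicas(beacon: bytes, replica_ids):
    """Toy role selection: order replicas pseudo-randomly based on the beacon."""
    return sorted(
        replica_ids,
        key=lambda r: hashlib.sha256(beacon + str(r).encode()).digest(),
    )


genesis = bytes(32)
b1 = next_beacon(genesis, 1)
b2 = next_beacon(b1, 2)
print(verify_beacon(b1, 2, b2))       # correct predecessor: prints True
print(verify_beacon(genesis, 2, b2))  # wrong predecessor: prints False
print(rank_replicas(b2, [1, 2, 3, 4]))
```

A replica that has no beacon at all has nothing to check `b2` against, which is why a beacon is one of the ingredients a lagging replica needs.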

With this, the CUP is ready to be defined. Replicas regularly create CUPs (e.g., every 200 heights). Once Height 200 is reached, replicas check whether they can create a CUP: they need a random beacon, a block, and the replicated state at that height. Additionally, they check that the blockchain has advanced to a point where states older than the one at Height 200 are no longer required. If all of this is satisfied, the replicas are ready to create a CUP for Height 200. They group the replicated state, the block, and the random beacon together into a single artifact, which they collaboratively sign using their shares of the subnet secret key, so that it can be verified with the fixed subnet public key.
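The conditions for creating a CUP can be written down as a single predicate. This is a minimal sketch under stated assumptions: the containers are plain dictionaries keyed by height, the interval of 200 is the example value from the text, and “older states are no longer required” is modeled, as a simplification, by a hypothetical `finalized_height` having reached the CUP height.

```python
CUP_INTERVAL = 200  # example interval from the text


def ready_for_cup(height, beacons, blocks, states, finalized_height):
    """True when every ingredient for a CUP at `height` is available."""
    return (
        height > 0
        and height % CUP_INTERVAL == 0  # it is a CUP height
        and height in beacons           # random beacon available
        and height in blocks            # block at that height available
        and height in states            # replicated state available
        and finalized_height >= height  # assumption: older states no longer needed
    )


beacons = {200: b"beacon"}
blocks = {200: b"block"}
states = {200: {"c1": 5}}
print(ready_for_cup(200, beacons, blocks, states, finalized_height=200))  # prints True
print(ready_for_cup(200, beacons, blocks, {}, finalized_height=200))      # prints False
```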

Now this artifact should be sufficient for other replicas to catch up. Note that the replicated state is actually too large to be included in the CUP, so only a hash of the replicated state is included. (The full state can be obtained by a separate state sync protocol that is out of scope for this blog post.) Whenever the replicas succeed in creating such a CUP, they can throw away all older artifacts, because they know that other replicas will be able to catch up from the CUP. If a replica that is far behind the others obtains a CUP over the network, it can verify the artifact’s authenticity and trust its contents, because the CUP is signed under the fixed subnet public key, which the replica still knows.
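The shape of a CUP, with a state hash in place of the full state, might be sketched as follows. The field names, the JSON-based hashing, and the `prune` helper are assumptions for illustration, and the subnet signature is omitted entirely; the sketch shows only how a hash lets a replica check a state obtained separately and how older artifacts become disposable.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class CatchUpPackage:
    height: int
    block: bytes
    random_beacon: bytes
    state_hash: bytes  # hash only: the full state is too large to include


def hash_state(state: dict) -> bytes:
    """Hash a canonical JSON encoding of the toy state."""
    return hashlib.sha256(json.dumps(state, sort_keys=True).encode()).digest()


def make_cup(height, block, beacon, state) -> CatchUpPackage:
    return CatchUpPackage(height, block, beacon, hash_state(state))


def state_matches(cup: CatchUpPackage, fetched_state: dict) -> bool:
    """Check a state obtained separately (e.g. via state sync) against the CUP."""
    return hash_state(fetched_state) == cup.state_hash


def prune(artifacts: dict, cup: CatchUpPackage) -> dict:
    """Keep only artifacts at or above the CUP height."""
    return {h: a for h, a in artifacts.items() if h >= cup.height}


cup = make_cup(200, b"block-200", b"beacon-200", {"c1": 5, "c2": 7})
print(state_matches(cup, {"c2": 7, "c1": 5}))  # same state: prints True
print(state_matches(cup, {"c1": 6, "c2": 7}))  # tampered state: prints False
print(sorted(prune({198: "a", 199: "b", 200: "c", 201: "d"}, cup)))  # prints [200, 201]
```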

Let’s consider how a replica can use a CUP to catch up to the latest state of the blockchain. From the CUP, it obtains a random beacon, a block, and the hash of a replicated state, which it trusts because they are authenticated via Chain Key cryptography; the hash lets it check the full replicated state it fetches. With that, the replica can follow the random beacon chain, because it now has a previous beacon to verify the next beacon against. The replica also has a block, so it can follow the subsequent blocks and verify the notarizations and the finalizations. Since it has a replicated state and the subsequent blocks, the replica can compute the next states. The replica now has the latest blockchain and random beacon artifacts and the latest replicated state, meaning it’s fully up to date and can participate in the protocol.
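Putting the steps together, the catch-up procedure could look roughly like this. It is a sketch under heavy assumptions: an HMAC with a hypothetical key stands in for the subnet signature (real verification needs only the public key), the state is a dictionary of counters, and beacon, notarization, and finalization checks are left out to keep the example short.

```python
import hashlib
import hmac
import json

SUBNET_KEY = b"toy-subnet-key"  # assumption: stand-in for the subnet's signing capability


def digest(obj) -> bytes:
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).digest()


def sign_cup(cup: dict) -> bytes:
    return hmac.new(SUBNET_KEY, digest(cup), hashlib.sha256).digest()


def catch_up(cup: dict, signature: bytes, fetched_state: dict, later_blocks: list) -> dict:
    # 1. Authenticate the CUP before trusting anything in it.
    if not hmac.compare_digest(sign_cup(cup), signature):
        raise ValueError("CUP signature is invalid")
    # 2. Check the separately fetched state against the hash in the CUP.
    if digest(fetched_state).hex() != cup["state_hash"]:
        raise ValueError("fetched state does not match the CUP")
    # 3. Replay the blocks after the CUP height to compute the latest state.
    state, height = dict(fetched_state), cup["height"]
    for block in later_blocks:
        if block["height"] != height + 1:
            raise ValueError("gap in the blockchain")
        for canister, delta in block["messages"]:
            state[canister] = state.get(canister, 0) + delta
        height = block["height"]
    return {"height": height, "state": state}


state_200 = {"c1": 5}
cup = {"height": 200, "state_hash": digest(state_200).hex()}
blocks = [
    {"height": 201, "messages": [["c1", 2]]},
    {"height": 202, "messages": [["c2", 1]]},
]
print(catch_up(cup, sign_cup(cup), state_200, blocks))
# prints {'height': 202, 'state': {'c1': 7, 'c2': 1}}
```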

In summary, the Internet Computer runs canisters in a reliable way, even if some of the machines that power the subnets are offline. More precisely, whenever more than two-thirds of the replicas on a subnet are honest and online, the subnet should make progress and the blockchain should grow. As a result, every honest replica must always be able to catch up to the latest state, no matter how far behind it is. At the same time, the blockchain can grow very large, so replicas must eventually delete old parts of the blockchain. A special artifact called the catchup package (CUP) is authenticated using the fixed subnet public key. The CUP contains everything that a replica needs to securely skip parts of the blockchain and jump ahead to a recent state, such that it can fully participate in the protocol.
____

Start building at smartcontracts.org and join our developer community at forum.dfinity.org.

Resumption: How Internet Computer Nodes Quickly Catch Up to the Blockchain’s Latest State was originally published in The Internet Computer Review on Medium, where people are continuing the conversation by highlighting and responding to this story.

Publication date

10/20/2021 - 18:09