Restoring the truth | Who was surprised by the Filecoin chain outage?

On December 19, 2020, the Filecoin network experienced a chain outage: for a period of time, new blocks could not be added to the chain because miners could not reach consensus on the resulting state, with each computing a different value. Thanks to the rapid response of community members, miners, and developers, a fix was released within four hours, and the network achieved full recovery within seven hours.

This article describes the issue in detail and explains the impact of the outage, the rapid response, and the follow-up work.
01
Cause of the outage

The underlying issue was potentially non-deterministic iteration over a map in the storage miner actor implementation. The actor is implemented in Go, and the iteration order of Go maps is known to be non-deterministic.

The actors code always sorts the results of such an iteration before using them (this is enforced by static analysis). Unfortunately, a bug in the comparison function used when sorting two such maps resulted in an invalid ordering (see #1335). As a result, different nodes processed the map entries in different orders, leading to different results and different gas consumption.
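The sort-before-use pattern described above can be sketched in Go as follows. This is a minimal illustration, not the actor code: the `partitionSectors` type, the function names, and the sample data are all hypothetical, and the sketch does not reproduce the comparison bug from #1335. It only shows why the comparison function matters: the sorted order is the sole thing that makes the processing order independent of Go's map iteration order.

```go
package main

import (
	"fmt"
	"sort"
)

// partitionSectors is a hypothetical stand-in for the actor's data:
// a map from partition index to the sector numbers affected in it.
type partitionSectors map[uint64][]uint64

// sortedKeys collects the map keys and sorts them in ascending order.
// The less function is a strict "less than", as sort.Slice expects.
func sortedKeys(m partitionSectors) []uint64 {
	keys := make([]uint64, 0, len(m))
	for k := range m { // Go does not specify this iteration order
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool { return keys[i] < keys[j] })
	return keys
}

// processInOrder visits entries in sorted key order and returns a trace
// of what it visited, so the processing order can be inspected.
func processInOrder(m partitionSectors) string {
	out := ""
	for _, k := range sortedKeys(m) {
		out += fmt.Sprintf("%d:%v;", k, m[k])
	}
	return out
}

func main() {
	m := partitionSectors{3: {30, 31}, 1: {10}, 2: {20, 21, 22}}
	fmt.Println(processInOrder(m)) // prints 1:[10];2:[20 21 22];3:[30 31];
}
```

If the less function does not define a consistent ordering, the sorted result can depend on the order in which the keys were collected, and the non-determinism of the map iteration leaks through again.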

This code path can only be reached by (a) a miner declaring multiple sectors terminated at once, or (b) a miner recovering from faults across multiple partitions at once. (Two other code paths also reach this point, but they are extremely unlikely in practice.) Neither of these paths had been exercised on mainnet before with multiple sectors or partitions, which is the condition that exposes the non-determinism. The simultaneous termination of multiple sectors triggered this stall.

The Filecoin actor tests cover the code in question, but they do not include a mechanism to verify deterministic execution across different test runs. The integration tests of the Lotus node implementation do not cover terminating multiple sectors.
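One way such a determinism check could look is sketched below in Go. This is an assumption about what a check of this kind might do, not a description of the tests the Filecoin or Lotus teams actually wrote. The helper names (`digest`, `sameAcrossRuns`, `sortedTrace`, `unsortedTrace`) are hypothetical. The idea is to run the same computation many times and compare a hash of each result.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// digest hashes a result so that runs can be compared cheaply.
func digest(s string) string {
	sum := sha256.Sum256([]byte(s))
	return hex.EncodeToString(sum[:])
}

// sameAcrossRuns executes run n times and reports whether every
// execution produced the same digest as the first one.
func sameAcrossRuns(run func() string, n int) bool {
	if n < 1 {
		return true
	}
	first := digest(run())
	for i := 1; i < n; i++ {
		if digest(run()) != first {
			return false
		}
	}
	return true
}

// unsortedTrace ranges over the map directly, so its output depends on
// Go's unspecified map iteration order.
func unsortedTrace(m map[uint64]uint64) string {
	out := ""
	for k, v := range m {
		out += fmt.Sprintf("%d=%d;", k, v)
	}
	return out
}

// sortedTrace sorts the keys first, so its output is order-independent.
func sortedTrace(m map[uint64]uint64) string {
	keys := make([]uint64, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool { return keys[i] < keys[j] })
	out := ""
	for _, k := range keys {
		out += fmt.Sprintf("%d=%d;", k, m[k])
	}
	return out
}

func main() {
	m := map[uint64]uint64{}
	for i := uint64(0); i < 16; i++ {
		m[i] = i * i
	}
	fmt.Println("sorted stable:  ", sameAcrossRuns(func() string { return sortedTrace(m) }, 1000))
	fmt.Println("unsorted stable:", sameAcrossRuns(func() string { return unsortedTrace(m) }, 1000))
}
```

The sorted variant always reports stable output. The unsorted variant will very likely report unstable output over many runs, although a check like this can only make order-dependent behavior probable to surface, not guarantee it.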
02
Impact of the outage

Most importantly, it should be emphasized that no data was lost during the outage. While the inability to create new blocks temporarily prevented transactions on the network, all data held by storage providers remained safe and was available once the network was back up and running. In addition, it is worth noting that the Filecoin protocol specification provides for data retrieval even in the event of a chain outage. In other words, on-chain transactions were not possible for the duration of the event, but the core functionality of the Filecoin network remained intact.

Additionally, the fix that was put in place ensured that miners themselves were not penalized for the downtime; instead, consensus slashing was temporarily reduced in order to prioritize and encourage network recovery.
03
Quick response

The speed with which the underlying issue was detected, diagnosed, and fixed, and with which the fix was deployed, was also notable:

1. Automatic monitoring triggered an alarm within 15 minutes of the incident.

2. Within thirty minutes, miners and implementation developers came together to respond.

3. Within four hours, the developers identified the issue and released a fix.

4. Within seven hours, enough nodes had adopted the fix to exceed the power threshold for majority consensus, putting the network on the path to recovery.

This is an incredibly fast response for a young decentralized network. Even established blockchains experience chain halts and forks, and the time it took Filecoin to resolve this event is comparable to that of blockchains that have been running for years. The entire community should be proud of the speed with which this event was handled.

This recovery was only possible due to the collaborative efforts of multiple groups around the world. Parties across the ecosystem worked together to make it happen: miners detected the issue and reported it to developers; engineering teams coordinated to develop and release a peer-reviewed patch for the underlying issue while communicating its status through community channels; and network participants around the world worked hard to apply the patch and bring the network back online as quickly as possible. While this is an emergency we don’t want to repeat, it was an impressive display of engagement and focus within the Filecoin ecosystem.
04
What's next

Building a blockchain is like building a rocket. There are so many complex technologies involved that it’s hard to get everything right on the first try. As with a real rocket, unexpected events are hard to anticipate. When they do happen, it’s important to have the infrastructure in place to resolve the issue as quickly as possible, minimize the impact, and reduce the likelihood of the problem happening again.

To this end, multiple teams worked on writing post-mortems and acting on them: identifying gaps in test coverage for actors, and improving alerting, issue escalation, and communication for network infrastructure to help mitigate future incidents.

With the concerted efforts of the entire Filecoin community, this new technology will keep improving. We believe that the network will grow stronger through discovering and solving problems like this one, and will eventually become a stable and reliable "launchable" platform.
