Professionalism and focus, win-win cooperation

Recently, the Filecoin development team published a review of the brief network outage from December last year, noting that multiple teams have begun writing and acting on post-mortems to improve test coverage of the actors and Lotus, along with other improvements to network infrastructure and to alerting and communication, to reduce the chance of the problem happening again. Today, let's review the causes and consequences of that outage and the improvement plan.

On December 19, 2020, the Filecoin network experienced a chain stall: new blocks continued to be created for a period of time, but nodes could not reach consensus on the resulting state, because different nodes computed different state values. Thanks to a quick response by community members, block producers, and developers, a fix was released within four hours, and the network was fully restored within seven hours. This post describes the issue encountered, the impact of the outage, the rapid response, and next steps.

The underlying issue was potentially non-deterministic iteration over a map of objects in the storage miner actor implementation, which is written in Go. Because iteration order over a Go map is well known to be non-deterministic, the actors follow a pattern of always sorting a map's keys before iterating over them (a rule enforced by static analysis). Unfortunately, a bug in the comparison function used for one such sort made the sort invalid (see #1335). As a result, different nodes processed the map entries in different orders, producing different results and different gas consumption. This code path is in practice only reachable by (a) a miner terminating multiple sectors at once, or (b) a miner recovering from faults spanning multiple partitions at once. (Two other code paths also reach this point, but are extremely unlikely in practice.) Prior to this incident, neither of these paths had been exercised on mainnet with enough sectors/partitions to expose the non-determinism; the termination of multiple sectors at the same time is what triggered the outage. The Filecoin actors' tests cover the code in question, but did not include a mechanism to verify deterministic execution across different test runs, and integration testing of the Lotus node implementation did not cover terminating multiple sectors. (Two minimal sketches, one of the sorted-iteration pattern and one of a determinism check of this kind, follow at the end of this section.)

Impact of the network outage

Most importantly, it should be stressed that no data was lost during this outage. While on-chain transactions were temporarily held up because nodes could not agree on new blocks, all data held by storage providers remained safe and was available again once the network was back up and running. It is also worth noting that the Filecoin protocol allows data retrieval even during a chain outage, so while on-chain transactions were not possible for the duration of the outage, the core functionality of the Filecoin network remained intact. Meanwhile, the fixes put in place ensure that miners themselves are not penalized for the downtime; slashing was temporarily reduced in order to deprioritize penalties and encourage network recovery.
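To make the failure mode concrete, here is a minimal, self-contained Go sketch of the pattern described above: collect a map's keys, sort them, then iterate in sorted order so every node processes entries in the same sequence. The types, values, and comparator are hypothetical illustrations rather than the actual storage miner actor code; the point is only that an invalid comparison function (one that does not define a strict ordering) leaves the final order dependent on Go's randomized map iteration, which is the class of bug the post-mortem describes.

```go
// Illustrative sketch only; not the actual actor code.
package main

import (
	"fmt"
	"sort"
)

// expiration groups sectors by the epoch at which they expire (hypothetical type).
type expiration struct {
	epoch   int64
	sectors []uint64
}

func main() {
	byEpoch := map[int64]expiration{
		100: {epoch: 100, sectors: []uint64{1, 2}},
		200: {epoch: 200, sectors: []uint64{3}},
		50:  {epoch: 50, sectors: []uint64{4, 5}},
	}

	// Iterating with `for k := range byEpoch` directly would visit keys in an
	// unspecified order that can differ between nodes, so the keys are collected
	// and sorted first.
	keys := make([]int64, 0, len(byEpoch))
	for k := range byEpoch {
		keys = append(keys, k)
	}

	// Correct comparator: a strict "less than" yields one canonical order everywhere.
	sort.Slice(keys, func(i, j int) bool { return keys[i] < keys[j] })

	// An invalid comparator, for example one that is not a strict ordering such as
	//   sort.Slice(keys, func(i, j int) bool { return keys[i] != keys[j] })
	// leaves the result depending on the map's random iteration order, so different
	// nodes would process entries differently and diverge in state and gas.

	for _, k := range keys {
		fmt.Println(k, byEpoch[k].sectors)
	}
}
```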
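The post-mortem also notes that existing tests did not verify deterministic execution across runs. A hypothetical regression test along the following lines is one way such a check could look: run the same logic many times on freshly built inputs and require identical output. The function and test names are illustrative and are not taken from the actual actors or Lotus test suites.

```go
// Illustrative determinism check; names are hypothetical.
package determinism_test

import (
	"fmt"
	"sort"
	"testing"
)

// applyTerminations stands in for the actor logic under test: it walks a map of
// partition -> sectors and returns a canonical string describing the result.
func applyTerminations(byPartition map[uint64][]uint64) string {
	keys := make([]uint64, 0, len(byPartition))
	for k := range byPartition {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool { return keys[i] < keys[j] })

	out := ""
	for _, k := range keys {
		out += fmt.Sprintf("%d:%v;", k, byPartition[k])
	}
	return out
}

func TestTerminationsAreDeterministic(t *testing.T) {
	// Build the same input from scratch each time so every map gets its own
	// randomized iteration order; a missing or invalid sort shows up as a mismatch.
	build := func() map[uint64][]uint64 {
		return map[uint64][]uint64{3: {7, 8}, 1: {1, 2, 3}, 2: {5}}
	}
	first := applyTerminations(build())
	for i := 0; i < 100; i++ {
		if got := applyTerminations(build()); got != first {
			t.Fatalf("non-deterministic result: %q vs %q", got, first)
		}
	}
}
```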
The speed with which the underlying issue was discovered, diagnosed, fixed, and deployed was also notable:
➊ Within fifteen minutes of the incident, automatic monitoring triggered an alarm;
➋ Within thirty minutes, node operators and implementation developers had come together to work on the issue;
➌ Within four hours, developers had identified the problem and released a fix;
➍ Within seven hours, enough nodes had adopted the fix to exceed the power threshold for majority consensus, putting the network on the path to recovery.

This is an incredibly fast response for a young distributed storage network. Even established blockchains experience chain pauses, and compared with those, the entire community should be proud of the speed with which this incident was handled. Recovery from this kind of incident is only possible through the collaborative efforts of multiple groups around the world, and parties across the ecosystem worked together to make it happen: node operators detected and reported the issue, bringing it to the attention of developers; engineering teams coordinated to develop and release a peer-reviewed patch for the underlying issue while communicating its status through community channels; and network participants around the world applied the patch to get the network back online as quickly as possible. While one would hope such a response is rarely needed, it was an impressive demonstration of the level of engagement and focus within the Filecoin ecosystem.

Building a blockchain is like building a software rocket: it is a very complex technology, and it is difficult to get everything right on the first try. Just like a real rocket, things can go wrong in unexpected ways. When that happens, it is important to have the infrastructure in place to resolve the problem as quickly as possible, minimize the impact, and reduce the likelihood of it happening again. To this end, multiple teams are writing and acting on post-mortems, identifying improvements to test coverage for the actors and Lotus, as well as to network infrastructure, alerting, and issue escalation, to help mitigate future incidents. Thanks to the patience, hard work, and commitment of the Filecoin community, the rough edges of this novel technology are steadily being addressed. As issues are discovered and resolved, the network will continue to mature into a stable, reliable, and flight-proven platform.