An article to understand the Filecoin chain suspension event

Recently, the Filecoin development team published a review of the brief Filecoin network outage in December last year. According to the review, multiple teams have begun writing and carrying out post-mortems to identify gaps in actors/lotus test coverage, along with other improvements, such as to network infrastructure and communication alerting, that would reduce the likelihood of the problem recurring. Today, let's review the causes and consequences of that outage and the improvement plan.
On December 19, 2020, the Filecoin network experienced a chain stall: for a period of time new blocks could still be created, but the nodes on the chain could not reach consensus on the resulting state, because each node computed a different value.
Thanks to a quick response between community members, block producers, and developers, a fix was released within four hours, and the network was fully restored within seven hours. This post describes the issues encountered, the impact of the outage, the quick response, and next steps.
01
Cause
The underlying issue was potentially non-deterministic iteration over a map of objects in the storage node actor implementation. The actor is implemented in Go, where the iteration order of a map is well known to be unspecified, so the actor follows a pattern of always sorting the iteration results before using them (a pattern enforced by static analysis).
Unfortunately, a bug in the comparison function used to sort two such maps made the sort ineffective (see #1335). As a result, different nodes processed the map entries in different orders, producing different results and different gas consumption.
This code path is in practice reachable only when (a) a node terminates multiple sectors at once, or (b) a node declares recovery at once from a fault that spans multiple partitions. (Two other code paths also reach this point, but they are extremely unlikely in practice.)
Before the incident, neither of these two paths had been exercised on mainnet with the multiple sectors or partitions needed to expose the non-determinism. The termination of multiple sectors at the same time is what triggered the network outage.
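The failure mode described above can be sketched in a few lines of Go. This is only an illustration under stated assumptions: the `entry` type, its fields, and both comparison functions are hypothetical, and `byDeadlineOnly` is a made-up bug, not the actual defect fixed in #1335. Two nodes receive the same entries in different map-iteration orders; a correct comparison brings both to the same processing order, while a comparison that cannot tell the entries apart leaves each node with whatever order it started from.

```go
package main

import (
	"fmt"
	"sort"
)

// entry is a hypothetical stand-in for one map entry the actor must process.
type entry struct {
	Deadline  uint64
	Partition uint64
}

// processInOrder copies the entries, sorts the copy with less, and returns the
// partitions in processing order. sort.SliceStable keeps tied elements in
// their input order, so a comparison that treats all entries as equal leaves
// the incoming order untouched.
func processInOrder(in []entry, less func(a, b entry) bool) []uint64 {
	es := append([]entry(nil), in...)
	sort.SliceStable(es, func(i, j int) bool { return less(es[i], es[j]) })
	out := make([]uint64, len(es))
	for i, e := range es {
		out[i] = e.Partition
	}
	return out
}

// byPartition is the intended ordering: every entry has a distinct key.
func byPartition(a, b entry) bool { return a.Partition < b.Partition }

// byDeadlineOnly is a hypothetical bug: it compares a field that has the same
// value for every entry below, so it never reorders anything.
func byDeadlineOnly(a, b entry) bool { return a.Deadline < b.Deadline }

// sameOrder reports whether two processing orders are identical.
func sameOrder(x, y []uint64) bool {
	if len(x) != len(y) {
		return false
	}
	for i := range x {
		if x[i] != y[i] {
			return false
		}
	}
	return true
}

// The same three entries, as two nodes might receive them from ranging over a map.
var nodeA = []entry{{1, 3}, {1, 1}, {1, 2}}
var nodeB = []entry{{1, 2}, {1, 3}, {1, 1}}

func main() {
	// Correct comparison: both nodes process partitions 1, 2, 3.
	fmt.Println(sameOrder(processInOrder(nodeA, byPartition), processInOrder(nodeB, byPartition))) // true
	// Faulty comparison: each node keeps its own incoming order.
	fmt.Println(sameOrder(processInOrder(nodeA, byDeadlineOnly), processInOrder(nodeB, byDeadlineOnly))) // false
}
```

With a single entry the two orders cannot differ, which is consistent with the problem only surfacing once multiple sectors or partitions were involved.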
The Filecoin actor tests cover the code in question, but they include no mechanism for verifying that execution is deterministic across different test runs. Integration testing of the Lotus node implementation does not cover terminating multiple sectors.
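One way such a determinism check could look is sketched below. It is a minimal illustration, not the approach the Filecoin teams adopted: `checkDeterministic`, `state`, and both digest functions are hypothetical names. The helper runs the same computation many times and reports whether every run produced the same result; a computation that depends on raw Go map iteration order will almost certainly fail the check, while one that sorts its keys first passes.

```go
package main

import (
	"fmt"
	"sort"
)

// checkDeterministic runs f the given number of times and reports whether
// every run returned the same value as the first run.
func checkDeterministic(runs int, f func() string) bool {
	first := f()
	for i := 1; i < runs; i++ {
		if f() != first {
			return false
		}
	}
	return true
}

// state is hypothetical test data with several entries, since a single entry
// could never expose an ordering problem.
var state = map[int]string{1: "a", 2: "b", 3: "c", 4: "d", 5: "e", 6: "f"}

// unsortedDigest builds an order-sensitive string in raw map iteration order.
func unsortedDigest() string {
	s := ""
	for k, v := range state {
		s += fmt.Sprintf("%d=%s;", k, v)
	}
	return s
}

// sortedDigest collects the keys, sorts them, and only then builds the string.
func sortedDigest() string {
	keys := make([]int, 0, len(state))
	for k := range state {
		keys = append(keys, k)
	}
	sort.Ints(keys)
	s := ""
	for _, k := range keys {
		s += fmt.Sprintf("%d=%s;", k, state[k])
	}
	return s
}

func main() {
	fmt.Println("sorted deterministic:", checkDeterministic(200, sortedDigest))
	// Map iteration order is randomized, so this is expected to report false,
	// although no single run is guaranteed to differ from the previous one.
	fmt.Println("unsorted deterministic:", checkDeterministic(200, unsortedDigest))
}
```

A repeated-run check like this only catches non-determinism that shows up within one process; comparing results across separate nodes or builds would need additional tooling.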
02
Impact of network outages
Most importantly, no data was lost during this network outage. While transactions on the network were temporarily halted because the chain could not advance, all data held by storage providers remained safe and was available once the network was back up and running.
Additionally, the Filecoin protocol specification provides for data retrieval even in the event of a chain outage. Therefore, while on-chain transactions were not possible for the duration of the outage, the core functionality of the Filecoin network remained unchanged.
Meanwhile, the fix put in place ensured that mining nodes themselves were not penalized for the downtime; instead, consensus slashing was temporarily reduced to encourage network recovery.
03
Quick response
The speed with which the underlying issue was discovered, diagnosed, fixed, and the fix deployed was also notable:
➊Within fifteen minutes of the incident, the automatic monitoring triggered an alarm;
➋Within thirty minutes, the node and implementation developers came together to resolve the issue;
➌Within four hours, the developers identified and released a fix for this issue;
➍Within seven hours, enough nodes adopted the fix to exceed the power threshold for majority consensus, putting the network on the path to recovery.
This is an incredibly fast response for a young distributed storage network. Even established blockchains experience chain pauses, and by that comparison the entire community should be proud of the speed with which this incident was handled.
Recovery from this kind of incident is only possible with the collaborative efforts of multiple groups around the world.
Parties across the ecosystem collaborated to make this happen: nodes detected and reported the issue, bringing it to the attention of developers; engineering teams coordinated to develop and release a peer-reviewed patch for the underlying issue, while communicating the status of this fix through community channels; and network participants around the world worked to apply the patch and get the network back online as quickly as possible.
While a recovery effort of this kind should rarely be needed, it is an impressive demonstration of the level of engagement and focus within the Filecoin ecosystem.
04
What's next
Building a blockchain is like building a software rocket. Blockchains are very complex technology, so it is difficult to get everything right on the first try. As with a real rocket, things can go wrong in unexpected ways. When this happens, it is important to have the infrastructure in place to resolve the problem as quickly as possible, minimize the impact, and reduce the likelihood of the problem happening again.
To this end, multiple teams are writing and carrying out post-mortems to identify gaps in actors/lotus test coverage, along with other improvements to alerting and issue escalation for network infrastructure and communications, to help mitigate future incidents.
Thanks to the patience, hard work, and commitment of the Filecoin community, the shortcomings of this novel technology are constantly being addressed. As issues are discovered and resolved, the network will continue to develop into a stable, reliable, and flight-proven platform.
