Professionalism and focus, win-win cooperation

Recently, the Filecoin development team published a review of the brief network outage from December last year, noting that multiple teams have begun writing and acting on post-mortems to improve test coverage of the actors and Lotus, along with other improvements to network infrastructure and to alerting and communication, to reduce the chance of the problem happening again. Today, let's review the causes and consequences of that outage and the improvement plan.

On December 19, 2020, the Filecoin network experienced a chain stall: new blocks continued to be created for a period of time, but nodes could not reach consensus on the resulting state, because different nodes computed different state values. Thanks to a quick response by community members, block producers, and developers, a fix was released within four hours, and the network was fully restored within seven hours. This post describes the issue encountered, the impact of the outage, the rapid response, and next steps.

The underlying issue was potentially non-deterministic iteration over a map of objects in the storage miner actor implementation, which is written in Go. Because iteration order over a Go map is well known to be non-deterministic, the actors follow a pattern of always sorting a map's keys before iterating over them (a rule enforced by static analysis). Unfortunately, a bug in the comparison function used for one such sort made the sort invalid (see #1335). As a result, different nodes processed the map entries in different orders, producing different results and different gas consumption. This code path is in practice only reachable by (a) a miner terminating multiple sectors at once, or (b) a miner recovering from faults spanning multiple partitions at once. (Two other code paths also reach this point, but are extremely unlikely in practice.) Prior to this incident, neither of these paths had been exercised on mainnet with enough sectors/partitions to expose the non-determinism; the termination of multiple sectors at the same time is what triggered the outage. The Filecoin actors' tests cover the code in question, but did not include a mechanism to verify deterministic execution across different test runs, and integration testing of the Lotus node implementation did not cover terminating multiple sectors. (Two minimal sketches, one of the sorted-iteration pattern and one of a determinism check of this kind, follow at the end of this section.)

Impact of the network outage

Most importantly, it should be stressed that no data was lost during this outage. While on-chain transactions were temporarily held up because nodes could not agree on new blocks, all data held by storage providers remained safe and was available again once the network was back up and running. It is also worth noting that the Filecoin protocol allows data retrieval even during a chain outage, so while on-chain transactions were not possible for the duration of the outage, the core functionality of the Filecoin network remained intact. Meanwhile, the fixes put in place ensure that miners themselves are not penalized for the downtime; slashing was temporarily reduced in order to deprioritize penalties and encourage network recovery.
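To make the failure mode concrete, here is a minimal, self-contained Go sketch of the pattern described above: collect a map's keys, sort them, then iterate in sorted order so every node processes entries in the same sequence. The types, values, and comparator are hypothetical illustrations rather than the actual storage miner actor code; the point is only that an invalid comparison function (one that does not define a strict ordering) leaves the final order dependent on Go's randomized map iteration, which is the class of bug the post-mortem describes.

```go
// Illustrative sketch only; not the actual actor code.
package main

import (
	"fmt"
	"sort"
)

// expiration groups sectors by the epoch at which they expire (hypothetical type).
type expiration struct {
	epoch   int64
	sectors []uint64
}

func main() {
	byEpoch := map[int64]expiration{
		100: {epoch: 100, sectors: []uint64{1, 2}},
		200: {epoch: 200, sectors: []uint64{3}},
		50:  {epoch: 50, sectors: []uint64{4, 5}},
	}

	// Iterating with `for k := range byEpoch` directly would visit keys in an
	// unspecified order that can differ between nodes, so the keys are collected
	// and sorted first.
	keys := make([]int64, 0, len(byEpoch))
	for k := range byEpoch {
		keys = append(keys, k)
	}

	// Correct comparator: a strict "less than" yields one canonical order everywhere.
	sort.Slice(keys, func(i, j int) bool { return keys[i] < keys[j] })

	// An invalid comparator, for example one that is not a strict ordering such as
	//   sort.Slice(keys, func(i, j int) bool { return keys[i] != keys[j] })
	// leaves the result depending on the map's random iteration order, so different
	// nodes would process entries differently and diverge in state and gas.

	for _, k := range keys {
		fmt.Println(k, byEpoch[k].sectors)
	}
}
```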
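The post-mortem also notes that existing tests did not verify deterministic execution across runs. A hypothetical regression test along the following lines is one way such a check could look: run the same logic many times on freshly built inputs and require identical output. The function and test names are illustrative and are not taken from the actual actors or Lotus test suites.

```go
// Illustrative determinism check; names are hypothetical.
package determinism_test

import (
	"fmt"
	"sort"
	"testing"
)

// applyTerminations stands in for the actor logic under test: it walks a map of
// partition -> sectors and returns a canonical string describing the result.
func applyTerminations(byPartition map[uint64][]uint64) string {
	keys := make([]uint64, 0, len(byPartition))
	for k := range byPartition {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool { return keys[i] < keys[j] })

	out := ""
	for _, k := range keys {
		out += fmt.Sprintf("%d:%v;", k, byPartition[k])
	}
	return out
}

func TestTerminationsAreDeterministic(t *testing.T) {
	// Build the same input from scratch each time so every map gets its own
	// randomized iteration order; a missing or invalid sort shows up as a mismatch.
	build := func() map[uint64][]uint64 {
		return map[uint64][]uint64{3: {7, 8}, 1: {1, 2, 3}, 2: {5}}
	}
	first := applyTerminations(build())
	for i := 0; i < 100; i++ {
		if got := applyTerminations(build()); got != first {
			t.Fatalf("non-deterministic result: %q vs %q", got, first)
		}
	}
}
```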
The speed with which the underlying issue was discovered, diagnosed, fixed, and deployed was also notable:
➊ Within fifteen minutes of the incident, automatic monitoring triggered an alarm;
➋ Within thirty minutes, node operators and implementation developers had come together to work on the issue;
➌ Within four hours, developers had identified the problem and released a fix;
➍ Within seven hours, enough nodes had adopted the fix to exceed the power threshold for majority consensus, putting the network on the path to recovery.

This is an incredibly fast response for a young distributed storage network. Even established blockchains experience chain pauses, and compared with those, the entire community should be proud of the speed with which this incident was handled. Recovery from this kind of incident is only possible through the collaborative efforts of multiple groups around the world, and parties across the ecosystem worked together to make it happen: node operators detected and reported the issue, bringing it to the attention of developers; engineering teams coordinated to develop and release a peer-reviewed patch for the underlying issue while communicating its status through community channels; and network participants around the world applied the patch to get the network back online as quickly as possible. While one would hope such a response is rarely needed, it was an impressive demonstration of the level of engagement and focus within the Filecoin ecosystem.

Building a blockchain is like building a software rocket: it is a very complex technology, and it is difficult to get everything right on the first try. Just like a real rocket, things can go wrong in unexpected ways. When that happens, it is important to have the infrastructure in place to resolve the problem as quickly as possible, minimize the impact, and reduce the likelihood of it happening again. To this end, multiple teams are writing and acting on post-mortems, identifying improvements to test coverage for the actors and Lotus, as well as to network infrastructure, alerting, and issue escalation, to help mitigate future incidents. Thanks to the patience, hard work, and commitment of the Filecoin community, the rough edges of this novel technology are steadily being addressed. As issues are discovered and resolved, the network will continue to mature into a stable, reliable, and flight-proven platform.