This article was originally written by Joss from IPFS Force Zone.

For most people, the scope of operation and maintenance is vague. Especially in the field of distributed storage, "operation and maintenance" tends to appear alongside terms such as "computer room" and "IDC". As a result, many people outside the industry still understand it at the physical level only: moving machines, racking and unracking servers, managing network and power, and keeping machines running the way a network administrator would.

In fact, operation and maintenance engineers fall into many categories according to how they work: operation and maintenance engineers, operation and maintenance development engineers, operation and maintenance platform R&D engineers, database engineers, database R&D engineers, and so on. Across these roles, what best reflects operation and maintenance capability is the development of customized, native tools for the business. Operation and maintenance engineers are responsible for maintaining the whole service and ensuring its high availability, while continuously optimizing the system architecture to improve deployment efficiency, make better use of resources, and increase overall ROI.

As Filecoin's mainnet approaches, the industry has largely reached a consensus on the importance of "operation and maintenance". As the volume of information and industry education grows, more and more investors are entering a period of rational analysis, both of the Filecoin project's periodic progress and of their choice of mining service provider.

On the eve of the mainnet launch, are operation and maintenance engineers, the most important people at this stage, ready?

Getting started with operations and maintenance
1.1 What is Operation and Maintenance?

Operation and maintenance generally refers to Internet operation and maintenance, one of the four major technical departments alongside R&D, testing, and system management. More specifically, its technical directions include service monitoring, service fault management, service capacity management, service performance optimization, global traffic scheduling, task scheduling, security assurance, data transmission, automated release and deployment, cluster management, cost optimization, database management, platform development, and the development and optimization of distributed storage platforms. Among these, distributed systems are clearly an essential skill.

In day-to-day work, operation and maintenance personnel have to cover both the broad picture and the fine detail: Web servers, monitoring, automated deployment, configuration management, load balancing, transfer tools, backup tools, databases, distributed platforms, distributed databases, containers, virtualization, security, and problem tracing.

Operation and maintenance engineers use software or the command line to synchronize data with third-party systems in real time, so that the visual monitoring platform integrates seamlessly with the other systems. This keeps system data accurate and stable, lets alarms be handled promptly, and makes power and environment monitoring more efficient to manage.

Power and environment monitoring (sometimes translated as "dynamic environment monitoring") has been around for twelve years, since 2008, and mainly includes the following modules:

Power distribution system: UPS and DC power supply, self-contained generator, power distribution cabinet, lightning protection detection, etc.
Environmental system: air conditioning, temperature and humidity monitoring, water leakage monitoring, gas monitoring, etc.
Fire protection system: smoke sensor, temperature sensor, early warning system, other fire protection equipment, etc.
Security system: image monitoring, access control monitoring, infrared detection, glass breakage detection, etc.
IT network management monitoring: network equipment, PC servers, operating systems, databases and applications, etc.

In addition, operation and maintenance engineers also need to pay attention to:

Linkage control: electronic switch, linkage video recording, data storage, motion control, etc.
Event records: operation records, status records, exception records, confirmation records, etc.
Abnormal alarm: sound and light alarm, voice broadcast, telephone alarm, SMS alarm, email notification, etc.

An operation and maintenance engineer's day starts with clocking in and ends with clocking out, a busy cycle that repeats over and over, and the "007" work mode (on call from midnight to midnight, seven days a week) is common.

1.2 History of Operation and Maintenance

The job of operation and maintenance is to make the variables in a running system controllable. However, the heterogeneity and complexity of operating environments have driven the manpower and time cost of daily work ever higher. The evolution from basic operation and maintenance to today's intelligent operation and maintenance went through four stages:

The script era
The tool era
The era of automation
The era of intelligence

Two years ago, "intelligent operation and maintenance" began to attract widespread attention. With the rise and gradual maturity of technologies such as big data analysis, APM (application performance management), intelligent anomaly detection, and machine learning, operation and maintenance has gradually moved toward automation and intelligence.
The significance of automation

2.1 Automated operation and maintenance method

Automation is the prerequisite for intelligence. Automated operation and maintenance covers the automatic management of hardware and networks, the automatic management of virtual machines, and the automatic installation and configuration of operating systems and software. The word "management" comes up a lot here. The significance of automation is, on the one hand, to improve efficiency, reduce cost, make better use of resources, and add flexibility, freeing people up to do other things; on the other hand, it is to standardize operation and maintenance results so that they can be replicated.

Of course, getting from tooling to automation is not easy. Across the industry, most of the work so far is still exploratory. IPFS Force Zone, which has worked on distributed storage for many years and on the Filecoin source code for nearly two years, is one of the few providers of automated operation and maintenance services. We would like to pay tribute to the technical leaders who have worked hard to advance the industry.

Back to operation and maintenance: in the Filecoin field, developing standard operation and maintenance software differs from developing more traditional operation and maintenance software. Take Alibaba's transition from tooling to automation as an example. I think the tooling stage is a relatively small challenge: even traditional operation and maintenance personnel can easily write some tools, for instance building tool systems in Python. But if those tools are eventually to support automation, the requirements on them rise steadily, starting with the quality of the tools themselves.
If the tools that developers write often fail and cannot withstand load at large scale, then, human nature being what it is, people gradually lose trust in them, and in the end the transition is hard to complete.

Once automated O&M covers monitoring, problem diagnosis, visualization, and so on, only a few manual tasks remain for O&M personnel, including disaster recovery switchover, emergency operations, application deployment, and starting and stopping services. The energy this frees up can be devoted to O&M development, giving users a better service experience.

2.2 Ways to achieve automation

A complete, integrated power and environment monitoring system can collect and monitor, in real time, the operating status of distributed, independent power equipment, the computer room environment, security monitoring, and so on, and record and process the relevant data. It detects faults promptly, performs the necessary remote control and adjustment operations, and notifies on-site and remote operation and maintenance staff in time. This makes it possible to run a computer room with few or even no people on duty, and to monitor, maintain, and manage power supply and air conditioning centrally. It improves the reliability of the power supply system and the safety of communication equipment, and provides strong technical support for automated and even intelligent management of the computer room and for sound decision-making.

However, automated operation and maintenance has not yet been widely implemented in the Filecoin field, and outstanding, natively customized operation and maintenance systems are rarer still. The Force Mining Pool is one of them.

The importance of operations to Filecoin

3.1 DevOps Concept

In the DevOps model, development teams and operations teams are no longer "isolated".
They will collaborate with each other throughout the entire life cycle of the application (from development and testing to deployment and operation) and develop a series of skills that are not limited to a single function. These teams will use practical experience to automate the slow processes that were previously done manually, and use technical systems and tools that can help them operate and develop applications quickly and reliably, further improving the team's work speed.

3.1.2 Cultural Concept of DevOps

The transition to DevOps requires a change in culture and mindset. The purpose of DevOps is to eliminate the barriers between two traditionally isolated teams. They strive for frequent communication, increased efficiency, and improved customer service. They have full control over their services and often go beyond the traditional scope of their established roles or functions to think about the needs of end users and solve these needs.

3.1.3 DevOps Practice Notes

There are some important practices that can help organizations innovate faster by automating and simplifying software development and infrastructure management processes, and most of these practices require the right tools.

One of the basic practices is to make small, frequent updates. This is an effective way for organizations to deliver innovation to customers quickly. Such updates are often more incremental in nature than the occasional updates of traditional release practices. Frequent, small updates reduce the risk of each deployment. They help teams deal with bugs more quickly because they can identify the most recent deployment that caused the bug. While the cadence and size of updates may vary, organizations using a DevOps model will update more frequently than those using traditional software deployment practices.

Additionally, organizations can use microservices architecture to increase the flexibility of applications, thereby accelerating the pace of innovation.
Microservices architecture breaks large, complex systems into simple, independent projects. Applications are broken down into many individual components (services), each limited to a single purpose or function, and these services can run independently of their peers or with the application as a whole. This architecture reduces the coordination overhead of updating applications, and organizations can develop faster when each service is owned by a small, agile team.

However, microservices combined with a high release frequency can greatly increase the number of deployments, which creates operational challenges. DevOps practices such as continuous integration and continuous delivery help address this, allowing organizations to deliver quickly, securely, and reliably. Infrastructure automation practices, such as infrastructure as code and configuration management, help keep computing resources elastic and responsive to frequent changes. In addition, monitoring and logging practices help engineers track the performance of applications and infrastructure so that they can respond quickly when problems arise.

3.2 Differences between Filecoin O&M and Traditional O&M

Operating and maintaining Filecoin miners is several times, or even dozens of times, harder than traditional Internet operation and maintenance. This is mainly a consequence of the mining model. For example, when standalone all-in-one machines are simply run in series or in parallel, the difficulty is limited to keeping the program stable on a single machine's firmware. With a clustered or distributed mining pool, however, the high standards required for scheduling all kinds of requests and for minute-level deployment across clusters are a major challenge for operation and maintenance engineers.
When the Filecoin network's demand for computing power rises sharply, a clustered mining pool can still be operated with ease, whereas other models need a great deal of manpower and resources to cope.

As for the specific differences between Filecoin operations and traditional operations, here are a few examples:

Physical layer: Because service providers such as Alibaba Cloud do not offer standardized support for Filecoin, Filecoin cloud services need to pay closer attention to the underlying architecture and require customized, self-built IDCs, which goes far beyond hardware alone.
SaaS: The Filecoin software service layer also needs a large number of operational tools to support data visualization. Operation and maintenance therefore has to turn its development work into platforms and give its tools visual interfaces, and in doing so it takes on a lot of R&D work.
Operation and maintenance process: Traditional operation and maintenance involves few participants and simple logic. Services are mostly accessed through Web interfaces; monitoring the ports well and acting on the feedback is enough to keep most variables under control, so the process is simple. Filecoin, by contrast, has a complex process, many modules to maintain, automation that is hard to build, and monitoring data that is both complex and high-frequency. Above all, the penalty mechanism hangs over miners like the sword of Damocles, a constant reminder that even at this level of difficulty, mistakes are not allowed. Serving customers' data storage needs is Filecoin's top priority.
Accuracy: Operation and maintenance requires physical layer monitoring, but for Filecoin, monitoring dimensions such as block time, block rate, computing power trend, and Lotus synchronization accuracy matters no less than the operating status of the physical layer. In traditional operation and maintenance, handling an exception may be allowed to take hours, and for application services with tens of millions of users, minutes. On the Filecoin network, however, an abnormality in any single parameter can cost miners heavily in lost revenue and collateral penalties, and can easily turn a profit into a loss.

In addition, Force Zone operation and maintenance has to optimize the official Lotus code, which tests both the team's development capability and the stability of the program's results, and has to implement health checks, automatic restart on failure, self-healing, and so on. All of this is aimed at raising CPU utilization and, with it, computing power, block output, and revenue, improving efficiency by 2-3 times.

3.3 Differences in project releases

3.3.1 Frequency

Traditional Internet projects are released on a fixed schedule, for example every Wednesday, and mostly to fix bugs or add new features. Given the current state of the Filecoin network, however, Force Zone operation and maintenance has to deploy flexibly, deploy at any time, and update the chain version at any time, so that an existing cluster can be torn down and the whole pipeline redeployed immediately. Every adjustment, however small, has to be reviewed and tested dozens of times so that the response to network changes is fast. This is also a prerequisite for keeping mining returns optimal in real time.
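The health checks, automatic restarts, and self-healing mentioned under "Accuracy" above can be illustrated with a small supervisor. This is only a minimal sketch, not Force Zone's actual implementation: the `Supervisor` class, its `is_healthy` probe, its `restart` action, and the `max_restarts` limit are all hypothetical names chosen for illustration. The idea is that a failed check triggers a restart, and repeated failures are escalated to a human instead of restarting forever.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Supervisor:
    """Poll a health check and restart the service when it fails (illustrative only)."""
    name: str
    is_healthy: Callable[[], bool]  # hypothetical probe, e.g. "is the node in sync?"
    restart: Callable[[], None]     # hypothetical restart action
    max_restarts: int = 3           # after this many consecutive restarts, escalate
    consecutive_failures: int = 0
    events: List[str] = field(default_factory=list)

    def tick(self) -> str:
        """Run one check cycle and return the action taken."""
        if self.is_healthy():
            # A healthy check resets the failure counter.
            self.consecutive_failures = 0
            action = "ok"
        elif self.consecutive_failures < self.max_restarts:
            # Self-healing: count the failure and try a restart.
            self.consecutive_failures += 1
            self.restart()
            action = "restarted"
        else:
            # Restarts are not helping; hand over to an engineer.
            action = "escalated"
        self.events.append(f"{self.name}: {action}")
        return action
```

In practice `tick()` would be called on a timer, and the `events` list would feed the alarm channels (SMS, email, and so on) that the monitoring platform already provides.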
3.3.2 Granularity

Beyond visualizing data monitoring and application status monitoring, the Filecoin mining service is also monitored at very fine granularity. For example, while a sector is being processed, the status and return value of each step from P1 to P7 are monitored automatically.

Once Force Zone operation and maintenance engineers have made data, automation, and the back-end platform fine-grained enough, automated deployment at home and abroad can be completed in minutes. All servers can be managed with one click from the back end, new code can be deployed in parallel, and automated tools compress deployment time, saving 99% of the time. Minute-level remote deployment of packages hundreds of megabytes in size is a challenge for the industry, but it is a leap forward for Filecoin mining pool technology.

Automated large-scale operation and maintenance
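The parallel, one-click deployment described in 3.3.2 could look roughly like the following sketch. It is an assumption-laden illustration rather than the Force Zone tooling: `deploy_all` and the `deploy_one` callback are hypothetical names, and the actual work of copying a package to a host and restarting services is left to whatever `deploy_one` does. The sketch only shows the fan-out pattern, where every host is handled concurrently and one failing host does not abort the rollout.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def deploy_all(hosts: Iterable[str],
               deploy_one: Callable[[str], None],
               max_workers: int = 32) -> Dict[str, str]:
    """Run deploy_one against every host in parallel and collect per-host results."""

    def attempt(host: str) -> str:
        try:
            deploy_one(host)  # hypothetical: push the package and restart services
            return "ok"
        except Exception as exc:  # a single bad host must not stop the others
            return f"failed: {exc}"

    host_list = list(hosts)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with host_list.
        results = list(pool.map(attempt, host_list))
    return dict(zip(host_list, results))
```

The returned dictionary maps each host to either "ok" or a failure message, which makes it straightforward to retry only the failed hosts or to show the rollout status on a dashboard.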
The "5PB" large miner standard originally set in the Filecoin large miner test no longer looks like a challenge for the industry. It is estimated that the Filecoin mainnet will reach 1000PB within 3-6 months of its launch. That figure sets a problem that Filecoin operation and maintenance engineers need to prepare for in advance: how to operate and maintain at large scale?

(Image caption: Behind the door is the Force Pool team at 3 a.m.)

At present, automated operation and maintenance is the only way to handle large-scale clusters, and it is also the biggest challenge facing operation and maintenance engineers. Managing services on hundreds of thousands of servers while keeping them highly available requires the ability to replicate clusters, yet compared with traditional operation and maintenance projects, replicated deployment is dozens of times more complex.

It takes 24-hour on-site maintenance to get equipment racked; high-frequency scheduled environment monitoring and logging to keep equipment running; automated distributed deployment and distributed monitoring systems to keep systems running; core network monitoring to safeguard connectivity between equipment, systems, and applications; a mature 1-to-5 redundancy strategy to keep data safe; attack and defense protection; and more. Even this is far from enough. The road of operation and maintenance is long and has no end...
When the long-awaited mainnet arrives, Filecoin's development engineers will gradually step back, and the development of Filecoin will eventually be handed over to the community. The state of the network will then depend on the skill of the large miners' operation and maintenance engineers. In this final stage of the space race, their hard work will be rewarded. The details are still unknown, so let's wait and see on August 25th.

Statement: This article is an original article from IPFS Force Zone. The copyright belongs to IPFS Force Zone. It may not be reproduced without authorization. Violators will be held accountable according to law.

Tip: Investment is risky, so be cautious when entering the market. This article is not intended as investment or financial advice.