Is It Edge or Just a Piece of a Large Distributed System?

This article is part of a series of blog posts published by the OpenInfra Edge Computing Group, to address some of the thornier aspects of edge computing as it enters the mainstream adoption cycle.

December 6, 2022

For years, the many challenges and promises of edge computing have been a topic of seemingly endless discussions, none of which has produced a good definition of what the term ‘edge’ really means. The term is often used simply as a marketing label, yet it also refers to both resource-constrained environments and massive, geographically distributed systems. At this point, it seems better to leave the arguments to the logophiles and go directly to the challenges, solutions and innovative ideas that are needed to turn edge into reality. The OpenInfra Edge Computing Group is doing just that in its weekly meetings, where it has been focusing on infrastructure software, automation and how it all integrates with the underlying hardware.

Distributed systems have existed for decades, and yet there is still more work to be done to operate and maintain them efficiently and cost-effectively. Meanwhile, new and emerging use cases and workloads put high demands on the infrastructure building blocks. The telecommunications industry is a good example of the need to pivot from the traditional ways of doing things to support the new demands. During the October 2022 PTG, Beth Cohen (New Product Strategist at Verizon) gave a presentation about the company’s Virtual Network Services product journey, in which she pointed out how the backbone infrastructure is changing to address increased expectations for easier manageability and robustness. Recent technology trends have been pointing towards microservices, event-driven architecture and hardware/software disaggregation, all of which are designed to address the rigidity of monolithic solutions. These changes deliver the highly desired flexibility, but one always has to be careful what one wishes for. While everybody is working towards designing simplified systems, the reality is that resilient distributed systems are not simple. We all need to put more thought into architecture, automation and orchestration to be successful.

Beth’s presentation covered the Services Product Development lifecycle concept, often referred to as Day 0, Day 1 and Day 2 operations, which summarizes the challenges of designing, developing, deploying and operating large-scale network infrastructure and services. The concept originates from the telecommunications industry, but it is applicable to all market segments, as well as to the application software stack. The overall process can be broken down into three stages as follows:

Day 0: The planning and development phase, which can also involve pre-loading images to help with delivery and deployment automation.
Day 1: The deployment phase, where the infrastructure gets built out and connected. Automation is key during this step because it reduces errors; however, it also prevents customization.
Day 2: The last phase, where the system and its applications are up and running. Monitoring, reporting and orchestration are the most desired features.

New applications and services, as well as new versions of existing ones, follow agile software development methods to speed up delivery, but this requires more efficient mechanisms to package, test and deploy these components. Once the infrastructure and workloads are in place, the number of challenges just keeps on growing. During the delivery phase, complications arise when the same service is deployed with slight variations to multiple customers. The best way to automate the process is to use blueprints and templates. But what is the right level of granularity to use? Or do you just stop allowing customization at all? Does one size really fit all?
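One possible middle ground between full customization and none is a shared blueprint with a small, explicit set of fields that may vary per customer. The following is a minimal sketch of that idea, not a description of any particular product: the blueprint contents, the field names and the `render_blueprint` helper are all hypothetical and chosen only for illustration.

```python
from copy import deepcopy

# Hypothetical base blueprint shared by every deployment of a service.
BASE_BLUEPRINT = {
    "service": "virtual-router",
    "cpu_cores": 4,
    "memory_gb": 8,
    "uplinks": 2,
}

# Assumption: only these fields may vary per customer; the rest stay fixed.
CUSTOMIZABLE_FIELDS = {"cpu_cores", "memory_gb"}


def render_blueprint(overrides):
    """Return a copy of the base blueprint with the allowed overrides applied."""
    rejected = set(overrides) - CUSTOMIZABLE_FIELDS
    if rejected:
        raise ValueError(f"fields not customizable: {sorted(rejected)}")
    blueprint = deepcopy(BASE_BLUEPRINT)
    blueprint.update(overrides)
    return blueprint


# A customer that needs more memory but is otherwise standard.
customer_blueprint = render_blueprint({"memory_gb": 16})
```

The granularity question then becomes a question of which fields belong in the customizable set: the larger the set, the more flexible the offering and the more variants there are to test and operate.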

With access to the data that monitoring services generate throughout a deployment, it should be easy to identify an outage; at the very least, the system should generate an alarm. Once the issue is fixed, the monitoring and reporting systems have the information needed to calculate the length of the outage.
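As a rough illustration of how outage length might be derived from monitoring data, the sketch below pairs each "down" event with the next "up" event. The event format, a list of `(timestamp, state)` tuples, is an assumption made for this example; real monitoring systems expose this information in their own formats.

```python
from datetime import datetime, timedelta


def outage_durations(events):
    """Pair each 'down' event with the next 'up' event and return the durations.

    `events` is a list of (timestamp, state) tuples, where state is
    "down" or "up". Repeated "down" events during an outage are ignored,
    and an outage that has not ended yet is not counted.
    """
    durations = []
    down_since = None
    for timestamp, state in sorted(events):
        if state == "down" and down_since is None:
            down_since = timestamp
        elif state == "up" and down_since is not None:
            durations.append(timestamp - down_since)
            down_since = None
    return durations


# Hypothetical event stream with one 45-minute outage.
events = [
    (datetime(2022, 10, 18, 9, 0), "down"),
    (datetime(2022, 10, 18, 9, 10), "down"),  # repeated alarm, same outage
    (datetime(2022, 10, 18, 9, 45), "up"),
]
durations = outage_durations(events)
```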

For more sophisticated providers and customers, Service Level Agreements (SLAs) are a useful way to measure service quality, as long as the data is available to measure! One way to use automation is to issue service credits automatically when the SLA is missed. Although the customer experienced an outage, their satisfaction remains high because they didn’t need to talk to anybody to get the credit. At the same time, the automation still needs to be able to tell the difference between planned maintenance and an outage.
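The distinction between planned maintenance and a real outage can be sketched as follows. This is only an illustration under stated assumptions: outages and maintenance windows are given as `(start, end)` pairs, maintenance windows do not overlap each other, and the 99.9% target and the function names are hypothetical rather than taken from any actual SLA.

```python
from datetime import datetime, timedelta


def overlap(start_a, end_a, start_b, end_b):
    """Length of the overlap between two time intervals (zero if disjoint)."""
    latest_start = max(start_a, start_b)
    earliest_end = min(end_a, end_b)
    return max(earliest_end - latest_start, timedelta(0))


def unplanned_downtime(outages, maintenance_windows):
    """Total outage time that falls outside planned maintenance windows.

    Assumes the maintenance windows do not overlap each other.
    """
    total = timedelta(0)
    for out_start, out_end in outages:
        planned = sum(
            (overlap(out_start, out_end, m_start, m_end)
             for m_start, m_end in maintenance_windows),
            timedelta(0),
        )
        total += (out_end - out_start) - planned
    return total


def service_credit_due(outages, maintenance_windows, period, availability_target):
    """True when measured availability over `period` falls below the target."""
    downtime = unplanned_downtime(outages, maintenance_windows)
    availability = 1 - downtime / period
    return availability < availability_target


# A two-hour outage, 90 minutes of which fell inside planned maintenance.
outages = [(datetime(2022, 11, 5, 1, 0), datetime(2022, 11, 5, 3, 0))]
maintenance = [(datetime(2022, 11, 5, 1, 0), datetime(2022, 11, 5, 2, 30))]
month = timedelta(days=30)

credit_with_maintenance = service_credit_due(outages, maintenance, month, 0.999)
credit_without_maintenance = service_credit_due(outages, [], month, 0.999)
```

In this example the same two-hour outage triggers a credit only when no maintenance window is recorded, because just 30 minutes of it count as unplanned once the window is taken into account.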

Operating large-scale, especially geographically distributed, systems and applications is hard! The simplest way to address the problem is to ensure the highest level of automation combined with a good end-to-end orchestration system. In the next installment of this article, we will dig deeper into how edge is affected by, and can provide solutions to, sustainability and other challenges worldwide.

If you are curious about what was discussed during our PTG session, you can find the recording on the OpenInfra Edge Computing Group wiki. If you have an edge orchestration story or challenge to share and discuss, we want to hear it. Join our conversations on our weekly calls or mailing list!

About OpenInfra Edge Computing Group:

The Edge Computing Group is a working group composed of architects and engineers from large enterprises, telecoms and technology vendors working to define and advance edge cloud computing. The focus is on open infrastructure technologies, not exclusive to OpenStack.

Get Involved and Join Our Discussions: