What does it take to bring and operate your edge in production? - Day 2

This article is the continuation of the OpenInfra Edge Computing Group’s discussions at the PTG in April 2022 about the challenges and practices of bringing edge use cases into production.

If you haven’t yet, check out the segments about ‘Day 0’ and ‘Day 1’ first!

‘Day 2’ – Operation and Maintenance in Production

The Day 2 sessions focused on operational issues to keep edge systems running.

A major pain point is license management. Many licensing schemes require the edge site to check in periodically with a licensing server in a remote location. Many companies like to have better control over these servers, so they keep them centralized. As not every edge site has connectivity to the internet, this model sometimes needs extra attention and clever workarounds to be sustainable.

This triggered a broader conversation about handling changes, updates and upgrades. Delivering patches to a large deployment or doing full-blown upgrades is hard! New versions always have to be tested and verified first, which can cause delays. This is a mix of ‘Day 1’ and ‘Day 2’ challenges, as you need to test the new version of a software component that a vendor supplies before you apply it to your already running edge sites. The extra testing and verification cycles require time and resources, which makes them costly, and there is not much that can be done to improve that. However, many communities around the infrastructure software space have some level of certification or conformance programs that help to get at least the core set of APIs and interfaces tested.

Backups can be used to roll back to an operating version, preferably one that is not too old. If you have a big enough site with some redundancy, then you can at least upgrade one server at a time while the other one is handling the load. Backward-compatible APIs and interfaces are key! Getting back to automation, the purpose of Infrastructure as Code is to have your systems break in a predictable way. With automation you can get a smooth process for upgrades and changes that require getting a new image up at your edge site. In a pipeline, you should also include testing among the steps to set up the system.
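To make the one-server-at-a-time idea a bit more concrete, here is a minimal sketch in Python. It is not something that was presented at the sessions. The Server class, the rolling_upgrade function and the health check are hypothetical stand-ins, and a real pipeline would deploy an image and restore from a backup where this sketch only changes a version string.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Server:
    """Hypothetical model of one server at an edge site."""
    name: str
    version: str


def rolling_upgrade(
    servers: List[Server],
    new_version: str,
    health_check: Callable[[Server], bool],
) -> bool:
    """Upgrade servers one at a time; roll back and stop on the first failure."""
    for server in servers:
        previous = server.version      # remember the known-good version
        server.version = new_version   # stand-in for bringing up a new image
        if not health_check(server):   # the test step of the pipeline
            server.version = previous  # stand-in for restoring from a backup
            return False               # untouched servers keep their old version
    return True


# Example: the second server fails its check, so it is rolled back.
site = [Server("edge-a", "1.0"), Server("edge-b", "1.0")]
ok = rolling_upgrade(site, "1.1", health_check=lambda s: s.name != "edge-b")
print(ok, [(s.name, s.version) for s in site])
```

Stopping at the first failure leaves the site in a mixed-version state, which is one reason why backward-compatible APIs and interfaces matter so much.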

No matter if you are a vendor, an operator or a developer in an open source project, one thing is always certain: users will use your software in ways you’ve never thought they would. One example is running encrypted voice traffic over a firewall. It is probably not the best idea, and it is a path that will most likely result in surprises. Sometimes you just have to tell your users that what they are doing might not deliver the performance and throughput they expect.

You need to be aware of all the overhead that your system will have once it’s fully up and running. For instance, monitoring overhead tends to get high, and, in an ideal world at least, your system should also be able to file tickets automatically based on what it monitors. This concept applies to applications just as much as it does to infrastructure software! You will probably want to have some level of local as well as central monitoring to catch anomalies and drift as early as possible. If the system detects an issue, you can automate a fallback to a previous, known-good version. Be careful what you wish for, though, as you need to think twice about what triggers this behavior.
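One way to think twice about the trigger is to require several failed health probes in a row before falling back, so that a single blip does not roll back a healthy site. The sketch below illustrates that idea only. The RollbackTrigger class and its threshold of three are assumptions made for the example and were not discussed at the event.

```python
from dataclasses import dataclass


@dataclass
class RollbackTrigger:
    """Hypothetical trigger that fires after N consecutive failed probes."""
    threshold: int = 3             # assumed value, tune per deployment
    consecutive_failures: int = 0

    def observe(self, healthy: bool) -> bool:
        """Record one health probe; return True when a rollback should start."""
        if healthy:
            self.consecutive_failures = 0  # any success resets the count
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold


trigger = RollbackTrigger(threshold=3)
probes = [True, False, True, False, False, False]
decisions = [trigger.observe(p) for p in probes]
print(decisions)  # only the last probe triggers a rollback
```

The isolated failure early in the sequence is ignored because the next probe succeeds, while the run of three failures at the end fires the trigger.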

To stay with practical matters, the attendees talked a bit about a recent desire and effort in the OpenStack community, which is to have some edge-specific documentation in a central place. This can help with finding relevant information on a higher level, along with pointers to service-specific docs to configure or troubleshoot the system. The community is currently discussing options for where to store this new set of documents.

Finally, we looked at GitOps as a way to drive automation. This model comes out of the Kubernetes ecosystem, and the main idea behind it is to use a version control system to initiate, review and approve a configuration change. This is a developer-friendly method that keeps developers at a safe distance from operations. The idea is that the infrastructure can pull changes from Git repositories, or use webhooks to apply changes in the system once they get merged into a repo. This results in a change-controlled system while still keeping the human factor in the workflow along with automation. It can be set up with multiple repos for more flexibility. While GitOps sounds very appealing, it may not be directly applicable to edge just yet, as there might be a large number of slightly different edge site configurations. Webhooks also become harder to manage, even when centralized, because their failures get harder to detect. The attendees talked a bit about what the difference is between GitOps and pipelines, and the main takeaway was: “There is no free lunch in any of this!”.
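As a rough illustration of the pull model and of the many-slightly-different-sites problem, the sketch below merges a shared base configuration with per-site overrides and then works out what a site would have to change to match the repo. The function names, keys and values are made up for the example. A real setup would read these files from a Git checkout and apply them with proper tooling.

```python
from typing import Dict, Optional

Config = Dict[str, str]


def render_site_config(base: Config, overrides: Config) -> Config:
    """Merge shared settings with one site's overrides (overrides win)."""
    return {**base, **overrides}


def reconcile(desired: Config, applied: Optional[Config]) -> Config:
    """Return the settings that must change for the site to match the repo.

    Keys that were removed from the repo are not handled in this sketch.
    """
    applied = applied or {}
    return {k: v for k, v in desired.items() if applied.get(k) != v}


# Hypothetical repo content: one base file plus a small per-site override.
base = {"ntp_server": "ntp.example.org", "image": "edge-os:1.1"}
overrides = {"image": "edge-os:1.0"}  # this site is pinned to an older image

desired = render_site_config(base, overrides)
running = {"ntp_server": "ntp.example.org", "image": "edge-os:0.9"}
print(reconcile(desired, running))
```

Keeping the overrides small and explicit is one way to stop hundreds of site variants from turning into hundreds of unrelated configurations, although it does not make the lunch free.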

The main takeaways of the three days were:

Automate as much as you can to create a system that is sustainable over the long-term
Testing, verification and certification are essential for a robust deployment and in some industry segments they are an absolute must
You can test a lot of things in a virtual environment, but you always need to make sure to test functions on real hardware as well
Setting up an edge deployment is hard, but operating it is even harder

If you missed the event and would like to listen to the sessions, you can access the recordings on the OpenInfra Edge Computing Group wiki. You can also find notes on the etherpad that we used during the event. The group already has a list of follow-up discussions and presentations scheduled! Check out our lineup and join our weekly meetings on Mondays to get involved!

About OpenInfra Edge Computing Group:

The Edge Computing Group is a working group comprised of architects and engineers across large enterprises, telecoms and technology vendors working to define and advance edge cloud computing. The focus is open infrastructure technologies, not exclusive to OpenStack.

Get Involved and Join Our Discussions: