Edge Infrastructures and Why They Are Not a Done Deal Yet?!

The OpenInfra Edge Computing Group participated in the recent Project Teams Gathering (PTG) event to discuss critical topics that are on the must-solve list to build a robust, flexible and maintainable edge infrastructure for your or someone else’s workload.

Building infrastructure has never been easy, but it got under control while the world was organized in large data centers. As edge computing takes the cloud out of these large and somewhat centralized locations and moves it closer to the edge, it puts a new spin on distributed systems from both the requirements and the solutions perspective. The Edge Working Group recognizes this challenge and has been working on collecting the new requirements and coming up with solutions. During the PTG in October 2021, the group used the available time to dive deep into key topics including portable automation, trust & verification, networking at the edge, containers and more. This article provides a quick look into the discussions that the attendees had on three consecutive days at the event.

In case you would like to listen to the whole conversation, you can check out the session recordings. You can also find the notes on the etherpad the group used during the event.

Day 1

To avoid being opinionated about which challenge is the most important, the article covers the topics in the chronological order in which they were discussed. So let’s start with ‘portable automation’!

Portable Automation

Automation is key and is a driving force in every industry segment today, and yet it is also one of the most challenging goals to achieve. The group focused on automation from the perspective of the edge sites. How do we create automation for the edge that is reusable and portable?

The first challenge is to face and accept that edge environments are heterogeneous, and that starts with hardware. In many cases it is not easy to get the hardware out to the edge sites, and it is a component that does not get upgraded often for one reason or another. This results in a system that has a lot of hardware variations to onboard and manage, which calls for versioning on the software side, and that is not as easy as it sounds. The attendees mentioned Redfish often during this segment, but it is not a solution in itself. When it comes to standards and specifications it seems to be business as usual: two people who read the same spec will not implement it in the same way. This is made even harder by the time factor. Vendor implementations take time, and your part, integrating what they came up with, only comes after. Getting to that “after” can take more than a year. You need to learn how to deal with technical debt.

When it comes to versioning, dependencies still need to be kept under control, which can be easier with version numbers. Keeping immutability in mind is important for deployments, as it keeps the automation tools and processes the same in the field even if some configuration values differ from site to site.
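The idea of one immutable, versioned artifact combined with per-site configuration values can be sketched in a few lines of Python. This is only an illustration and not something presented at the PTG: the class names, the `edge-agent` release, the digest placeholder and the `ntp_server` setting are all hypothetical.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Mapping


@dataclass(frozen=True)
class Release:
    """An immutable, versioned deployment artifact (hypothetical model)."""
    name: str
    version: str        # pinned version number, never edited in place
    image_digest: str   # placeholder for a content hash of the exact build


@dataclass(frozen=True)
class SiteDeployment:
    """Pairs the shared release with the values that differ per site."""
    site: str
    release: Release
    config: Mapping[str, str]


def deploy_to_sites(release, site_configs):
    # Every site receives the same release object; only config values vary.
    return [
        SiteDeployment(site, release, MappingProxyType(dict(cfg)))
        for site, cfg in sorted(site_configs.items())
    ]


release = Release("edge-agent", "1.4.2", "sha256:0f3a")  # made-up values
deployments = deploy_to_sites(release, {
    "site-a": {"ntp_server": "10.0.0.1"},
    "site-b": {"ntp_server": "10.1.0.1"},
})
for d in deployments:
    print(d.site, d.release.version, dict(d.config))
```

Because the dataclasses are frozen, changing a release means creating a new version rather than patching a site in place, which is one possible way to keep the automation identical across sites.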

The group agreed that software should be able to take care of itself as much as possible, but this always comes with complexity. Complexity is something that we always try to reduce, but it cannot be eliminated even with the best of efforts, so you need to learn how to handle it.

And then comes the portability part, which has two sides to it: site-to-site and team-to-team. The human factor cannot be eliminated either, so it is important to avoid solutions that cannot be transferred between teams. To put this in context a little more, compute and networking are two areas that have been distinct in terms of teams as well as tools and processes, but when it comes to creating automation for the edge this separation might be a luxury. And it is about time to let go!

As usual, we are our own worst enemies, and that is no different at the hyperscalers either. The assumption is always that the core systems are available to extend into, and fault tolerance is often an afterthought, but these are not patterns to follow.

Trust and Verification

Automation is not the only challenging item at the edge. With that in mind, the group moved on to the ‘trust & verification’ topic for the second hour. If you would like to catch this in the recording, it starts at 00:57:00. This segment started with the statement that “security is always the elephant in the room”. It seems to be an afterthought, possibly because things need to be operational and connected first, and because it can be a hard challenge to tackle. But that shouldn’t hold anyone back from addressing security requirements!

Due to the highly distributed nature of edge, it can be difficult to verify the software and hardware pieces that get delivered and deployed at remote locations and connected to the controllers. A big chunk of the discussion revolved around TPM as a widely used solution; for instance, it is a preferred method in SD-WAN systems to validate the trustworthiness of a device. A TPM is essentially a crypto-processor that can be used in a variety of ways to build trusted relationships between the components of a large system.
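One way a TPM supports this kind of verification is by keeping a running hash of everything that was loaded, where each new measurement is folded into the previous value. The Python sketch below is a simplified software model of that "extend" idea, not code that talks to a real TPM; the component names are made up, and real devices add multiple registers, hash banks and signed attestation on top.

```python
import hashlib


def extend(register: bytes, component: bytes) -> bytes:
    """Model of a TPM-style extend: new = SHA-256(old || SHA-256(component))."""
    measurement = hashlib.sha256(component).digest()
    return hashlib.sha256(register + measurement).digest()


def measure_chain(components):
    register = bytes(32)  # the register starts at all zeros in this model
    for component in components:
        register = extend(register, component)
    return register


# Hypothetical boot components of an edge node.
expected = measure_chain([b"firmware-v1", b"bootloader-v1", b"kernel-v1"])
tampered = measure_chain([b"firmware-v1", b"bootloader-modified", b"kernel-v1"])

print("matches expected value:", expected == tampered)
```

Since every step depends on the previous value, a changed or reordered component leads to a different final value, which a controller could compare against the value it expects for a known-good node.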

The discussions in the wider industry tend to focus on issues starting on day 2, once the system is all set up, but it is never that easy! One of the big challenges is how to combine automation with trust and verification at the same time. Once hardware is delivered to the edge it needs to be connected and set up without the fear of it having been tampered with on the way to the remote location. And weirdly or naturally, depending on who you ask, the preference is to remove human involvement from the process as much as possible. This brings up the question “Can the supply chain be trusted?” The answer is usually no, which turns the discussion towards zero-trust.

During the session the attendees touched on the application layer and APIs for a bit. Applications often need to operate on untrusted networks; just think about when you travel and the hotel WiFi does not ask for a password. Following this thought, the conversation focused on how existing APIs and systems can be secured, as opposed to coming up with new APIs. When it comes to securing systems, the group spent some time discussing signatures and ways to protect keys at the edge, which has been a challenge for a long time. Even today, getting a signature for a machine is manual labor that requires a phone call or sending out a person to handle it. Also, you might have secret keys in place, but a backdoor is still implemented using a master key. How can we do better?

In the end, the main takeaways of the session were TPM, SmartNICs and VPN as means to access nodes securely on an untrusted network.

Day 2

DDI – DHCP, DNS, IPAM

The second day focused on an essential part of edge, which is connectivity! Or rather, the network underneath and the parts that are often overlooked or just blindly assumed to be there. The day covered two main topics. First, the group touched on DDI, in other words DHCP, DNS and IPAM (IP Address Management). The second big topic looked at the network on a higher level, moving beyond the tools and focusing on what networks are used for and how.

The day started with DDI to take advantage of the fact that people from the OpenStack Neutron and Ironic teams were able to attend for a joint session. Big chunks of the conversation revolved around DHCP, since it seems to be a challenging piece when it comes to setting up edge sites and connecting new devices. The heart of the challenge seems to be the fact that DHCP was originally designed and used to function on Local Area Networks (LAN). It is also important to mention that DHCP is not exactly IPv6 friendly. In edge environments it is used for discoverability as well as for bootstrapping, including PXE-booting hardware equipment.

When it comes to Wide Area Networks (WAN), DHCP has issues with provisioning and handoffs. To overcome these challenges you can use fixed IP addresses and register them with the DHCP server for transparency, so that you are always on top of what is happening in the environment. However, fixed IPs can become challenging in some cases. Another way is to set up VPN connections and run the DHCP provisioning and management steps within that network.
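The fixed-address approach boils down to keeping a record of which device holds which IP. A minimal sketch of such a record, using only Python's standard `ipaddress` module, could look like the following. The `FixedIpRegistry` class, the documentation subnet and the MAC addresses are hypothetical, and a real deployment would push these records into its DHCP or IPAM service instead of keeping them in memory.

```python
import ipaddress


class FixedIpRegistry:
    """Toy IPAM: hands out fixed addresses and records them per MAC address."""

    def __init__(self, cidr: str):
        self.network = ipaddress.ip_network(cidr)
        self._free = iter(self.network.hosts())  # usable host addresses
        self.by_mac = {}

    def reserve(self, mac: str):
        mac = mac.lower()
        if mac in self.by_mac:  # idempotent: same device, same address
            return self.by_mac[mac]
        try:
            address = next(self._free)
        except StopIteration:
            raise RuntimeError(f"no free addresses left in {self.network}") from None
        self.by_mac[mac] = address
        return address

    def reservations(self):
        # The records a DHCP server would need as static host entries.
        return [(mac, str(ip)) for mac, ip in sorted(self.by_mac.items())]


registry = FixedIpRegistry("192.0.2.0/29")
first = registry.reserve("52:54:00:AA:BB:01")
second = registry.reserve("52:54:00:aa:bb:02")
print(registry.reservations())
```

Asking for the same MAC address twice returns the same IP, so the registry stays the single source of truth even if a device is re-provisioned.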

To bypass DHCP altogether, booting from virtual media has become an intriguing method to look into. The virtual media image can sit on the system controller or other local sources, or a bootable image can be retrieved from the BMC, which is often seen in high-security systems. This method also has some difficulties that need to be solved. Vendor support is a big one, as only a few vendors support media attach, and it is certainly a no-go with ARM at the moment.

This session also touched on DNS, as systems need names to serve as unique identifiers. While it is tempting to use the IP or MAC address for this purpose, at the end of the day those are just not viable options. DNS is essential for certificate management, SSH and more, while it is also a component that can become challenging from a trust perspective. One way to handle these issues is to have a small DNS service at the edge. Beyond that, DNS hijacking has become a more frequent issue and needs to be handled as well.

Before moving on to the next session the attendees also checked on the progress of making Neutron better equipped for the edge. The service is prepared to handle connection loss between the edge and the central site and keep the data network up, while synchronization with the controller can happen once the connection is available again. A few more areas, such as improvements to managing network segments, need to be revisited to see if there are more steps to take.

What is your Network Doing at the Edge?

With a better view of some of the tooling and components to use in the underlying network, the group moved on to take a look at networks on a higher level. The first thing highlighted during the session was a quick overview of activities forming within MEF in the area of edge computing. There are two project proposals that are about to kick off. Our PTG session highlighted the one that aims to define a set of APIs to allow connectivity to the edge from Service Providers as well as Consumers. This is a telecom-focused effort where items such as performance metrics will play an important role. To mention an example, timeouts are becoming an issue in edge environments in the fixed-wireless case, as it behaves differently from what you would expect on a mobile device. The work under MEF is looking into addressing challenges like this while collaborating with adjacent communities such as the OpenInfra Edge Computing Group and Anuket.

While the OpenInfra Edge Computing Group does not have the application layer in its scope, it is still important to take a look at how workloads can be better supported by the underlying network. The conclusion of the discussion was that applications are somewhat disconnected from the network and expect it to just be there. Connectivity is usually handled through VPN and IPSec tunnels, as that is what SD-WAN tunnels practically are. This results in a disconnect between the telecommunication sector and application developers, which becomes a challenge as expectations towards the network grow. Namely, the network should have some level of awareness of the application traffic to increase performance through techniques such as mapping. Prioritizing network traffic is very challenging and sometimes overlooked. Voice data is an example that enjoys higher priority, but with new use cases appearing, such as telemedicine and other real-time applications that have safety concerns, these should now move to the top of the list as well.
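To make the prioritization idea more concrete, here is a minimal sketch of a strict-priority queue in Python. It is only an illustration under assumptions: the traffic classes, their relative order (telemedicine ahead of voice, bulk data last) and the packet names are invented for the example, and real networks rely on mechanisms in the network equipment rather than application code like this.

```python
import heapq
import itertools

# Hypothetical priority table: a lower number is served first.
PRIORITY = {"telemedicine": 0, "voice": 1, "bulk": 2}


class PriorityScheduler:
    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps FIFO within a class

    def enqueue(self, traffic_class: str, packet: str):
        entry = (PRIORITY[traffic_class], next(self._order), packet)
        heapq.heappush(self._heap, entry)

    def dequeue(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


scheduler = PriorityScheduler()
scheduler.enqueue("bulk", "backup-chunk-1")
scheduler.enqueue("voice", "call-frame-1")
scheduler.enqueue("telemedicine", "vitals-1")
scheduler.enqueue("voice", "call-frame-2")

sent = [scheduler.dequeue() for _ in range(len(scheduler))]
print(sent)
# ['vitals-1', 'call-frame-1', 'call-frame-2', 'backup-chunk-1']
```

The sketch also shows why this is hard in practice: the scheduler can only do its job if something tells it which class each packet belongs to, which is exactly the application awareness the network often lacks.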

The last few minutes of the discussion were spent speculating about the business models and solutions that the hyperscalers are attempting. All we know is that regulated networks are off the table for them, but what is still on, and how are they going to make money on it?

Day 3

Containers at the Edge – OpenYurt

The third and last day started with a very popular topic, containers. To break the tradition of the previous two days, the first session included a presentation about OpenYurt, which is a sandbox project in the CNCF ecosystem that aims to extend Kubernetes out to the edge. Contributors to the project joined the edge PTG session to describe the motivations behind the project and give some details about their solutions to the challenges that they are facing.

Some of these challenges resonated with the findings of the edge working group with respect to the centralized and distributed control plane architecture models. Having control services in a central site while the worker nodes out on the edge are connected through public internet creates problems. As mentioned before, the connection between the sites can be unreliable which makes managing devices and pools of resources non-trivial to say the least. OpenYurt was created to solve these issues by adding services that can be integrated with upstream Kubernetes as extensions. With the new components the project provides autonomy on the edge sites, a more reliable cloud-edge communication as well as taking care of device and resource pool management. The presentation also contained a slide to compare OpenYurt, KubeEdge and K3s. The questions and discussions following the presentation were focused on areas such as autonomy, resource pools and cluster management.

Edge Architectures

After a look into the container-management space the day continued with a session about edge architectures (at 01:07:00 in the day-3 recording), which was a natural follow-up to some of the challenges that the OpenYurt presentation highlighted in this area.

When it comes to the end-to-end edge infrastructure, the question always seems to be whether it should follow a centralized or a distributed control plane model. Or, to make the space look even more confusing, whether it should be a hierarchical configuration, follow a federated model, or contain a stretched control plane. All these terms came up during the hour-long chat we had about the topic during the PTG.

The group was in agreement that a stretched control plane does not provide autonomy and that cloud patterns are crucial to follow and push out to the edge. Creating something that is very edge specific will neither scale nor have long-term viability. The attendees also questioned how applications should be structured and scheduled throughout edge sites. A pod and the pieces of an application should not span beyond the boundaries of a single edge site, but as always, there are exceptions. For instance, an NFV application that consists of multiple microservices often needs the different pieces to be placed on different sites to provide full functionality.

The session also touched on further important items such as namespaces and naming conventions, distributed database solutions and the need to standardize IoT device lifecycle management. The attendees agreed to keep unpacking this topic and go into further details on upcoming OpenInfra Edge Computing Group weekly calls.

APIs for Edge – Anything Missing?

Last but not least, the final edge topic at the PTG was APIs (starting at 02:12:00 in the day-3 recording). More specifically, do we need new APIs to better serve the needs of edge computing use cases and workloads? At the very beginning of the session the attendees reflected on the work that MEF is doing in this area, but did not go into further discussion about it at this time.

During this hour the attendees took another look at the application space to make sure that all requirements and aspects are taken into account. An interesting angle that came up was how to handle large amounts of data that need to be filtered and processed at the edge while some of it also needs to be sent back to a regional or central site. The use case example for the discussion was an oil field and the sensor data that is collected there. The conclusion of this topic was to focus on APIs that deal with the resources that applications need, as opposed to looking into the functions that the applications need to perform.
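The filter-at-the-edge pattern from the oil field example can be sketched roughly as follows. This is a minimal illustration under assumptions and not a design from the session: the reading format, the sensor names and the thresholds are made up, and the function only decides what would stay local and what would be forwarded.

```python
from statistics import mean


def process_at_edge(readings, low, high):
    """Split raw sensor readings into a compact summary and alerts to forward."""
    # Readings outside the [low, high] band are treated as alerts.
    alerts = [r for r in readings if not low <= r["value"] <= high]
    values = [r["value"] for r in readings]
    summary = {
        "count": len(values),
        "mean": mean(values) if values else None,
        "alerts": len(alerts),
    }
    return summary, alerts


# Hypothetical pressure readings collected at one site.
readings = [
    {"sensor": "pump-1", "value": 100.0},
    {"sensor": "pump-1", "value": 98.0},
    {"sensor": "pump-2", "value": 141.0},
]
summary, to_forward = process_at_edge(readings, low=90.0, high=110.0)
print(summary)
print(to_forward)
```

In this sketch only the small summary and the out-of-range readings would travel to the regional or central site, while the bulk of the raw data stays at the edge.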

The session’s main outcome was to avoid creating new, edge-specific APIs. However, existing APIs need to evolve to support the new needs and demands of edge use cases and the industry also needs to focus on aspects such as policies, handling metadata and support for use cases with real-time implications.

This was the last set of conclusions for the edge discussions at the PTG. If you missed the event and would like to listen to the sessions you can access the recordings on the edge working group wiki and you can also find notes on the etherpad that we used during the event. The group has a list of follow-up discussions and presentations scheduled already! Check out our lineup and join our weekly meetings on Mondays to get involved!

About OpenInfra Edge Computing Group:

The Edge Computing Group is a working group comprised of architects and engineers across large enterprises, telecoms and technology vendors working to define and advance edge cloud computing. The focus is open infrastructure technologies, not exclusive to OpenStack.

Get Involved and Join Our Discussions:
