The latest episode of OpenInfra Live, the weekly show hosted by the OpenInfra Foundation, is now available on-demand.


OpenInfra Live is a new, weekly, hour-long interactive show streaming to the OpenInfra YouTube channel every Thursday at 14:00 UTC (9:00 AM CT). Episodes feature OpenInfra release updates, user stories, community meetings, and more open infrastructure stories.

One of the reasons workloads moved to virtualization and clouds was to avoid having underutilized resources. But as demand for resources goes up and down, the cloud itself can now have a lot of spare capacity. How do OpenStack-based large scale clouds manage their spare capacity? Join this live discussion between operators of private and public clouds with InMotion Hosting, CERN, Verizon Media, City Network and Open Telekom Cloud.

Enjoyed this week’s episode and want to hear more about OpenInfra Live? Let us know what other topics or conversations you want to hear from the OpenInfra community this year, and help us to program OpenInfra Live! If you are running OpenStack at scale or helping your customers overcome the challenges discussed in this episode, join the OpenInfra Foundation to help guide OpenStack software development and to support the global community.

How OpenStack Large Clouds Manage their Spare Capacity 

Belmiro Moreira, cloud architect at CERN, kicked off the discussion by talking about what spare capacity is and why cloud computing environments use it. 

Moving workloads to the cloud has several benefits for users, including scalability and a ‘pay for what you use’ model.

“One of the promises of cloud computing is that users can start small and when they are ready, they can massively scale their workloads,” he said. “This is the illusion of unlimited available resources.” 

In this scenario, he says, users don’t need to care about the underlying infrastructure, manage the hardware, go through expensive hardware purchases and, in the end, buy more capacity than they need.

But anticipating the need for cloud resources does not come without challenges. 

“However, as operators of large scale infrastructure, we know that providing the illusion of infinite resources to users is a big challenge,” he said. “Infrastructure needs to be ready for the ups and downs in demand from our users, so the problem of over provisioned capacity that was in the past on the end user side has now been transferred to the cloud providers.” 

Moreira says this can lead to the cloud providers having spare capacity, meaning that the resources are not efficiently used. On today’s episode of OpenInfra Live, cloud operators discussed how different OpenStack clouds manage this challenge.  

Do you have spare capacity in your cloud? 

Viktor Molnar, cloud architect at Open Telekom Cloud, said that as a public cloud provider, they promise endless resources and that users can use whatever they need when they need it. 

“It’s very important to have spare capacity. To be able to provide spare capacity, we need to think about how much we are keeping as spare capacity and how we can reduce our costs,” he said. “It’s really important to not only think about wasting resources, but if you waste the resources, you also waste the operation costs.” 

He emphasized that it’s critical to serve any customer request, so they defined a baseline for how much time is needed before capacity for a customer request can be delivered to their cloud. This includes time to order, deliver, and install hardware. Based on this, they figured out how much capacity is necessary. When thinking about OpenStack specifically, there are a lot of different parts to consider, such as Ironic bare metal servers, dedicated hosts, and the upper-level services that rely on the infrastructure-as-a-service (IaaS) solutions.

He said that they try to keep as much spare capacity on hand as possible, because the most important thing is to serve the customers. 
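As a rough illustration of the lead-time baseline Molnar described, the sketch below estimates how much spare capacity would cover demand growth while new hardware is ordered, delivered, and installed. It is not Open Telekom Cloud’s actual method, which the panel did not detail. It assumes linear demand growth, and every name and number in it (LeadTime, required_spare_vcpus, the day counts, the growth rate, the safety factor) is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LeadTime:
    """Days before new hardware can serve customers (hypothetical values)."""
    order_days: int
    delivery_days: int
    install_days: int

    @property
    def total_days(self) -> int:
        # Baseline: time to order, deliver, and install hardware.
        return self.order_days + self.delivery_days + self.install_days


def required_spare_vcpus(daily_growth_vcpus: float,
                         lead_time: LeadTime,
                         safety_factor: float = 1.0) -> float:
    """Spare vCPUs needed to absorb demand until new hardware is in service.

    Assumes demand grows linearly at daily_growth_vcpus per day; the
    safety factor is an illustrative buffer, not a recommended value.
    """
    if daily_growth_vcpus < 0 or safety_factor < 0:
        raise ValueError("growth and safety factor must be non-negative")
    return daily_growth_vcpus * lead_time.total_days * safety_factor


# Hypothetical inputs for illustration only.
lead_time = LeadTime(order_days=14, delivery_days=30, install_days=7)
spare = required_spare_vcpus(daily_growth_vcpus=40.0,
                             lead_time=lead_time,
                             safety_factor=1.5)
print(f"Lead time: {lead_time.total_days} days, spare needed: {spare:.0f} vCPUs")
```

The point of the sketch is the trade-off the panelists raised: a longer lead time or a larger safety buffer means more idle hardware, while a shorter one raises the risk of turning a customer away.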

Erik Johansson, senior systems engineer at City Network, agreed with Molnar, as City Network has seven public data centers as well as several private cloud instances. 

“We have more spare capacity in the public regions because we need it,” he said. “We try to solve it by having spare hardware in the different data centers and we have some places where we run some of that hardware in pre-production environments, which makes it easier to ship or move it within the same data center into a production environment when we see the need.” 

Other spare capacity use cases at City Network include internal testing and sandbox systems, and an education branch that runs self-paced courses, which can benefit from shifting resources to other public cloud regions depending on what the capacity looks like at the moment.

Chris Bermudez, engineering manager at InMotion Hosting, said their spare capacity strategy was a bit different. 

“For us it’s a bit different because we are a hosting provider with multiple product lines, and our main goal is to build a unified hardware line to place hardware wherever it is needed,” he said. “We see what the demand is and allocate hardware accordingly.”

Brendan Conlan, DevOps manager at Verizon Media, says his team makes the same promises to its customers, who just happen to be at the same company.

Echoing Moreira’s earlier comments around spare capacity, Conlan said that balancing utilization versus having enough spare capacity available at any time for anyone to use is a huge problem to solve. 

“We have almost 40 different clusters and 40 different control planes running different instances on different networks, so the fragmentation is sometimes difficult to manage,” he said. “It’s not free to move things even within a data center, so being on top of the capacity, balancing the utilization and never letting your user base have any issues due to capacity is the key.” 

Moreira then elaborated on the spare capacity strategy for CERN, where the private cloud has around 6,000 compute nodes, manages 8,000 bare metal nodes with OpenStack Ironic and has around 30,000 virtual machines. 

“More than 80% of our capacity is to process data from the different experiments. We strive for more capacity to run all of this processing power, so less than 20% of cloud capacity is reserved for services,” he said. “We need to have capacity for users to create this in a dynamic way, so we try to use techniques to reduce spare capacity.” 

The panelists also took a few questions live from the audience:

How to balance spare capacity and demand fluctuation?
How do you measure your capacity?
Controlling cloud capacity using machine learning
How to mitigate/monetize spare capacity?
The opportunity of adding spot instances or pre-emptible instances features to OpenStack
How can automation leverage capacity management?
Is storage triggering challenges that are different from compute?
The role of collaboration amongst OpenStack cloud service providers in different geolocations to enable utility computing

Next Episode on #OpenInfraLive

The Kata Containers community is going to host community members from AMD, Ant Group, Apple, IBM, and Huawei to share how they’re running Kata. Kata Containers Architecture Committee members and upstream contributors will provide an update on community development and the project roadmap.

Tune in on Thursday, July 22 at 14:00 UTC (9:00 AM CT) to watch this #OpenInfraLive episode: Kata Containers Use Cases.

You can watch this episode live on YouTube, LinkedIn and Facebook. The recording of OpenInfra Live will be posted on OpenStack WeChat after each live stream!

Like the show? Join the community! 

Catch up on previous OpenInfra Live episodes on the OpenInfra Foundation YouTube channel, and subscribe to the Foundation’s email communications to hear more OpenInfra updates!

Allison Price