Tales from the Trenches: The Good, the Bad, and the Ugly of OpenStack Operations

Craig Tracey, OpenStack Team Leader at Blue Box, and Jesse Keating, OpenStack Engineer at Blue Box, both bring unique perspectives to the world of OpenStack cloud operations. Craig has worked at Blue Box for over a year selling Private Cloud as a Service (PCaaS), where clouds range from small to midsize. Jesse, on the other hand, comes from a background at Rackspace, where he most recently dealt with a small number of very large clouds.

In this session, the duo shares stories from the trenches. Their breadth of knowledge and experience has uncovered different approaches, tips and tricks to achieve success. They provide a general operational overview and then detail their measured successes, notable failures, and solutions to problems within their production infrastructure.

“It was difficult to come up with really bad problems that we’ve had with OpenStack — and for the most part, I’ve been sleeping. That’s a good thing.” — Craig Tracey, Blue Box

They cover four main topics in their presentation: installation, operations, scale challenges, and "gotchas" that they have both encountered over time.

Installation

As an operator, one of the primary tasks that you’re faced with is installation. While some believe that installation is a solved problem, experience has proven otherwise for both Craig and Jesse. Luckily, there are tools available to address installation woes: Chef, Puppet, Salt, Ansible and Ursula, to name a few.

Installation is very personal to your business and to how you deliver your cloud. As such, projects like the Chef cookbooks are typically very opinionated. The community is also building out Ansible playbooks for creating an OpenStack environment; that work is happening on StackForge.

Convergence

Despite the differences in experience across use cases, there is convergence in the OpenStack community. People are starting to deliver projects with artifacts and containers.

Over time, it has become apparent that trying to run more than one service in the same environment is a hassle, particularly as each of the OpenStack services becomes more independent, more opinionated about its dependencies, and operates on its own schedule. The services are only loosely coupled.

Shipping Bits

Some of the takeaways that Craig and Jesse have gathered along the way with regard to shipping include:

Distro packages are not an option. There will be many instances where Blue Box will want to add their own changes to OpenStack — it’s eternally evolving, and they don’t want to be beholden to the cadence of any type of distro.
Deploying from source is not an option. It has proven difficult for a variety of reasons, primarily dependencies.
Giftwrap is the direction they’re moving toward in shipping their bits. For context, Giftwrap is a new project designed to package OpenStack projects. It is designed to run on the OS that you want to build the package for. It works by creating a virtualenv of the selected packages, then building a package in the current platform’s packaging format in your current working directory. This package can then be installed on any system of the same platform.
Striker is a project that intends to provide a common CLI for developers to use while working with a Packaging and Release workflow from upstream OpenStack to their cloud environment.
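
The virtualenv-then-package flow described for Giftwrap can be illustrated with a minimal Python sketch. This is not Giftwrap's code or interface: the helper names build_artifact and list_members, the project name, and the use of a tarball in place of a real platform package are all assumptions made for illustration, and the step that installs the selected packages into the virtualenv is left out.

```python
import tarfile
import venv
from pathlib import Path


def build_artifact(project, workdir):
    """Create an isolated virtualenv for `project` and archive it.

    Hypothetical helper: a tarball stands in for the platform's native
    package format, and no packages are installed into the virtualenv.
    """
    env_dir = Path(workdir) / f"{project}-venv"
    # with_pip=False keeps the sketch fast and offline; a real build would
    # install the selected OpenStack packages into this environment.
    venv.EnvBuilder(with_pip=False).create(env_dir)

    artifact = Path(workdir) / f"{project}.tar.gz"
    with tarfile.open(artifact, "w:gz") as tar:
        # Store paths relative to the virtualenv's own directory name.
        tar.add(env_dir, arcname=env_dir.name)
    return artifact


def list_members(artifact):
    """Return the file names stored in the artifact."""
    with tarfile.open(artifact, "r:gz") as tar:
        return tar.getnames()
```

Calling build_artifact("nova", some_directory) would leave a nova.tar.gz in that directory. The point of the design is that the dependencies travel inside the artifact, so each service can carry its own set without conflicting with the others.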

Operations

Architecture

One of the things that got the Blue Box team where they needed to be, almost immediately, was unifying their architecture. A couple of useful tips:

Make sure that you aren’t building snowflakes.
Improvisation is key.
Always make sure that you separate control plane and data plane operations. You will undoubtedly, no matter how big or small you are, be asked to add more capacity to your cloud. By separating the control plane and data plane paths, you will be able to operate on a single node as opposed to touching the entire stack.

Upgrades

Installation is, ironically, the easy part. Once you’re up and running, you have to upgrade. Otherwise, you keep living with bugs that have already been fixed upstream and miss out on new features.

Upgrading early and often is recommended, although it can be hard to execute — especially when customers are using your cloud simultaneously.

In order to achieve (near) zero downtime for upgrades, it’s crucial to dive into nitty-gritty details such as, “How do I run one Nova API with version A and another Nova API with version B in the same cloud?” That scenario can get tricky.

Operators also have to deal with database migrations, and with when and where to restart their services. At Rackspace, they had a very challenging deployment that spanned about a month of change, and it took about 4.5 to 5 hours to migrate their production database. With Nova, you can’t have any services running while that migration is happening. That hurt a lot, and it motivated their team to check how long migrations are going to take whenever changes are made upstream.

By planning ahead, you will avoid unnecessary headaches. Also, be sure to prune your data before moving forward with a migration: the fewer rows a migration has to touch, the faster it finishes.
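
Both pieces of advice, pruning data and knowing how long a migration will take, can be rehearsed on a copy of the database before touching production. The sketch below uses an in-memory SQLite database so that it is self-contained. The instances table, its deleted and deleted_at columns, the flavor_id column added by the sample migration, and all helper names are illustrative assumptions, not Nova's actual schema or tooling.

```python
import sqlite3
import time


def prune_deleted(conn, cutoff):
    """Delete soft-deleted rows older than `cutoff` (ISO date string)."""
    cur = conn.execute(
        "DELETE FROM instances WHERE deleted = 1 AND deleted_at < ?",
        (cutoff,),
    )
    conn.commit()
    return cur.rowcount


def timed_migration(conn, statements):
    """Run the migration statements and return the elapsed seconds."""
    start = time.perf_counter()
    for statement in statements:
        conn.execute(statement)
    conn.commit()
    return time.perf_counter() - start


def make_sample_db():
    """Build a throwaway database that stands in for a production copy."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE instances ("
        "id INTEGER PRIMARY KEY, deleted INTEGER, deleted_at TEXT)"
    )
    # Two soft-deleted rows (one old, one recent) and one live row.
    rows = [(1, "2014-01-15"), (1, "2014-09-01"), (0, None)]
    conn.executemany(
        "INSERT INTO instances (deleted, deleted_at) VALUES (?, ?)", rows
    )
    conn.commit()
    return conn


conn = make_sample_db()
pruned = prune_deleted(conn, "2014-06-01")
elapsed = timed_migration(
    conn, ["ALTER TABLE instances ADD COLUMN flavor_id INTEGER"]
)
print(f"pruned {pruned} row(s); migration took {elapsed:.4f}s")
```

Running the same timing against a recent copy of production data each time upstream changes a migration gives an early estimate of the outage window.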

User Experience

An important question to ask yourself is, how do users interact with the cloud you’ve built? Ideally, OpenStack should be easy to consume.

Common questions/issues that Craig and Jesse get from users include:

How do I create images? Many users need help with image creation. Craig recommends using tools like Packer or diskimage-builder to make it a self-service experience.
CLIs. There is a notorious lack of consistency across CLIs for both users and operators.
Horizon issues. People with a legacy IT background who want to use Horizon as their interface to OpenStack run into functionality problems.
Error messages. Messages such as “no valid host” are unclear and often difficult to analyze/diagnose.

Craig’s First Law of Systems

When systems fail, they will fail dramatically.

“Loosely coupled distributed systems tend to fail in loosely fantastic ways.”

You need someone on your team who can dive in and figure out what the problem is across the entire stack, which is not an easy feat.

Scale Challenges

Nova-compute

Nova-compute runs on every one of your hypervisors, which makes it difficult to scale when it’s sitting on top of 7,000+ hypervisors with hundreds of VMs. Craig secured better automation to allow for faster deployment, and the resulting load ultimately overwhelmed their database.

Glance

Glance API acts as an intermediary between image storage and the hypervisor that wants to use that image. If you’re fetching a ton of images, you have to scale out your API nodes.

Rotating passwords can also be difficult.

New feature introduction cost

In an effort to make upgrades better, Jesse made use of Conductor for Compute. This is very useful for rolling upgrades, but comes at a steep cost.

Gotchas

Other “gotchas” that Craig and Jesse have run into along their journey:

Make sure DNS resolution and hostnames are always correct.
Service ordering must be correct. If you don’t have that right, you won’t know until you reboot that host. It’s fixable, but not optimal.
Logs for a lot of user operations are not optimal. Use a log aggregator.
It’s difficult to trace actions across a stack.
RPC failures make debugging hard. Use nova-manage, because you’re going directly to the source of truth.
Make sure that database backups are cleaned up.
Database modifications are not optimal.
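
The first gotcha in the list, keeping DNS resolution and hostnames correct, lends itself to an automated check. What follows is a minimal sketch, assuming a hypothetical helper named find_unresolvable and made-up hostnames. The resolver is passed in as a parameter so the check can be exercised without a network; by default it uses the standard library's socket.gethostbyname.

```python
import socket


def find_unresolvable(hostnames, resolver=socket.gethostbyname):
    """Return the hostnames that `resolver` fails to resolve."""
    failures = []
    for name in hostnames:
        try:
            resolver(name)
        except OSError:
            # socket.gaierror and socket.herror are subclasses of OSError.
            failures.append(name)
    return failures


def fake_resolver(name):
    """Stand-in resolver that only knows a fixed set of hosts."""
    known = {"compute-01.example.com": "10.0.0.11"}
    if name not in known:
        raise socket.gaierror(f"cannot resolve {name}")
    return known[name]


hosts = ["compute-01.example.com", "compute-02.example.com"]
print(find_unresolvable(hosts, resolver=fake_resolver))
```

Run against the real resolver from each node before and after a deployment, a check like this surfaces a bad hostname before a service trips over it.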

Feel free to reach out to Craig and/or Jesse on Twitter at @craig_tracey and @iamjkeating to learn more.

The full video of the presentation at the Paris Summit is also available.