What Do OpenStack Operators Do All Day?

This article first appeared on Matt Fischer’s blog. A principal engineer at Time Warner Cable, he’s also a brewer of beer and a hiker of mountains in his spare time.

>__What Do Operators Do All Day?__

When I was a kid, Richard Scarry’s book, [“What Do People Do All Day”](http://childrensbooksguide.com/reviews/what-do-people-do-all-day-by-richard-scarry) was one of my favorites. I saw this book at my parents’ house while I was thinking about how to categorize everything I’ve worked on in the past few months, and this post is the result: “What Do Operators Do All Day?”

Being an operator means that, unless you’re at a very large provider, you need to be a jack-of-all-trades. Over the past six months, I’ve worked on almost every piece of our cloud, and in almost every case I’ve learned something new and grown my skill set (which is my favorite part of working on OpenStack).

###Collecting My Data

Over the last six months or so, I’ve resolved 106 JIRA issues, and looking back at these provides a decent picture of where I spend my tracked work time. I’ve also done upstream reviews and commits, for which stackalytics will provide good details. Using this information, I’ll present where I do my work in order from most time to least.

###Puppet Automation

I spend most of my time these days working on puppet modules or configuring services with puppet. Some of this work includes:

- Fixing/configuring/enabling new features in services like Keystone/Nova/etc
- Upgrading our puppet branches from Icehouse to master
- Configuring build server & infra (cobbler, puppet, package repos etc)
- Configuring/deploying Icinga, or writing new checks
- Refactoring and cleanup, like moving all our keystone roles/users to YAML so that they’re simpler to add

###Ansible

A close second is Ansible automation. We use Ansible to manage our internode dependencies and also to drive our deployments. One example of what we’d use Ansible for is upgrading MySQL one node at a time, managing state between nodes while doing so. Over the past six months I’ve written Ansible jobs to:

- Deploy a new hand-built version of OVS
- Perform a live upgrade of MySQL from 5.5 to 5.6
- Upgrade OpenStack services from Icehouse to Juno or Juno to Kilo
- Improve our weekly deployment process
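
To make the "one node at a time" idea concrete, here is a minimal Python sketch of a rolling upgrade loop. It is not the actual Ansible job described above; the function `rolling_upgrade`, the callbacks, and the node names are all hypothetical, and an in-memory dictionary stands in for a real cluster. In Ansible itself, a similar effect is usually achieved by limiting a play to one host per batch with the `serial` keyword.

```python
from typing import Callable, List, Sequence


def rolling_upgrade(
    nodes: Sequence[str],
    upgrade: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Upgrade nodes one at a time and stop at the first unhealthy node.

    Returns the nodes that were upgraded and passed the health check.
    Both callbacks are hypothetical hooks supplied by the caller.
    """
    done: List[str] = []
    for node in nodes:
        upgrade(node)
        # Verify this node before touching the next one, so that a bad
        # upgrade never affects more than a single node.
        if not is_healthy(node):
            raise RuntimeError(
                f"{node} is unhealthy after upgrade; completed so far: {done}"
            )
        done.append(node)
    return done


# Illustration only: a dictionary plays the role of the cluster state.
cluster_versions = {"db1": "5.5", "db2": "5.5", "db3": "5.5"}


def fake_upgrade(node: str) -> None:
    cluster_versions[node] = "5.6"


def fake_is_healthy(node: str) -> bool:
    return cluster_versions[node] == "5.6"


upgraded = rolling_upgrade(["db1", "db2", "db3"], fake_upgrade, fake_is_healthy)
print(upgraded)
```

The point of the design is that the health check runs between nodes, so the remaining nodes keep serving on the old version if anything goes wrong.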

###Misc.

Some of these tasks don’t show up in Jira, but they do take a good amount of my time.

- Travel/training: OpenStack conf, RabbitMQ training, etc
- Planning: sprint planning, feature planning, expansion planning, etc
- Mentoring and on-boarding: we’ve grown a lot and this one should not be underestimated. I do about 5-10 code reviews per day even when I’m not answering questions
- Working on Ubuntu packaging for OpenStack; we roll (some of) our own
- MySQL/Galera DBA-esque work

###On Call/Issues

Every few months, I do an on-call rotation for a week. These can be quiet or not, depending on the shape of our monitoring and cloud; whether an on-call week is good or bad is usually of our own doing, however. Even when I’m not on call, I deal with issues: although we do our best, we occasionally have problems. When you get enough nodes, you’ll get failures. They could be hardware failures, kernel issues, or even simply software failures; we’ve had them all. I could do an entire post on the issues we see here, but the ones that stick out to me as focus areas for software are OVS, MySQL, and RabbitMQ. Those are probably the three most complex and most important pieces of our software stack, and so they get lots of my attention.

###Upstream

I think that the community is one of the best things about OpenStack, so I spend what little time I have left here. I participate in IRC and mailing list discussions as a part of the Operators and Puppet-OpenStack communities. I also do reviews and submit fixes, primarily for puppet-openstack but also for OpenStack itself. Although my commits to OpenStack itself have slowed, I’ve earned my 3rd ATC for Vancouver, and I think it’s important to participate in this process.

###Summary

One of the first concerns I expressed when interviewing for this job was that we’d have OpenStack set up in a year and then we’d be done. That has been far from the truth. In reality, the life of an OpenStack operator is always interesting. There’s no shortage of things to fix, things to improve, and things to learn, and that’s why I love it. Although each release of OpenStack generally makes things easier and more robust, it also always adds more features along the edges to keep us busy.

Does this list match what everyone else spends their time on?

Let me know in the comments.