How to move a massive hyperscale infrastructure to OpenStack

What do you do when you work for one of the original internet giants, Oath (formerly Yahoo! and AOL), and you want to update its infrastructure? Migrating a hyperscale enterprise from a legacy, bespoke system to OpenStack may seem like a daunting task. James Penick managed this transition, from the initial demonstrations of the power of OpenStack to the nitty-gritty of implementation. His team manages hundreds of thousands of compute resources with OpenStack software, which comes to about 70 percent of Oath’s infrastructure. Penick spoke at the recent OpenStack Summit in Sydney, Australia, about his team’s migration strategy in a talk called “Migrating hyperscale enterprise to OpenStack.”

Moving such a massive infrastructure to OpenStack is not an overnight transition. “I liken it to be like a tug boat sitting at the front of a ship pushing an aircraft carrier to turn around,” Penick said. “It takes time, but if you’re relentless, and you push and you push and you push, you will get there.” Because the company was one of the first huge internet companies, the tools needed to manage its infrastructure didn’t exist yet. That led to a lot of manual labor and an attitude of defending ownership against all comers. “I actually joked that it (was) humans-as-a-service,” Penick said. “If you needed to requisition compute resources, you file a ticket, and people would scurry around and take care of it for you.”

This sort of IT environment led to hoarding behavior, too. If specific resources, hardware or software, were created there, people tended to hang on to them and take them along when they left. “The day I leave the company,” said Penick, “I’m gonna tuck that rack-mounted server under my arm and step out the door.”

The initial infrastructure at Oath consisted of dozens of layer-two backplanes in every data center, which split the entire system across many small pools of compute. “This made it very difficult to transition to more of a commodity compute approach,” Penick said, “because where do you go if you want to build a pool of compute resources when everyone’s separated into little puddles?”

The first thing Penick had to do, of course, was convince the powers that be that OpenStack was the right solution. He said that he needed to be honest with the decision makers and tell them what the platform can and cannot do. “When you’re honest and direct with people and you’re trying to influence them to move to this new thing,” he said, “they are a lot more compelled to do that when you actually show them the warts and acknowledge that they’re there.”

In addition, Penick said, you need to address the root needs of the organization rather than focus on how OpenStack is supposed to work. Don’t try to solve every problem up front, either. “Premature optimization is the death of good software,” said Penick. “Don’t try and solve every problem right out of the gate. Focus on the big rocks in the jar.”

The first step to building a cloud-based, private infrastructure with a technology like OpenStack is to figure out the what and the why of the project. Penick says that his own reason was to make it easy for people to do their jobs, and to try and make the organization more agile to meet its business goals. “I want to save money, which (is) all (it) really comes down to: you want to make money and save money,” he said. Focusing on these basic tenets helps teams make decisions along the way, too. Every choice can be centered on meeting business goals.

The next step is to find friendly customers willing to pilot the new infrastructure. Ask these customers what they want. Some of the things they will ask for will be reasonable; others will not. Integrating with an existing internal tool is a good example of a reasonable request, while wanting dedicated hypervisors that they can have root access on may not be. “It’s important to pick the things that matter,” said Penick. “But hold your ground on the areas that don’t actually meet the use cases you’re trying to solve. You’re trying to present a pool, like a private cloud, for the organization, not for a specific team. So it’s important to stick to your guns on that.”

Now you’re ready to build a cluster and let your pilot customers in, letting them boot and use compute resources. “If your pilot customer is actually in a production kind of security zone and you can serve traffic, that’s great,” said Penick. “Make sure that you can work with your customers, get that data, learn how they’re using it, and then use it to evangelize other users throughout the organization.”

It’s likely you’ll see some challenging user behavior, so be ready for it. “You’re gonna give them quotas, you’re gonna build this new cluster, and I’ve got my pilot properties and I’m gonna give them quota, and they’ve run out and they’ve booted as many VMs as they can to make sure that they’ve earmarked all these compute resources so no one will take it away from them,” Penick said. Working through it, and being able to measure what’s going on, will help you document your success.

Next, you’ll want to move forward, creating more clusters, building up your infrastructure and adding in additional users. You’ll want to focus on the perception of what you’re doing here. “Don’t try and run out the door and say, ‘I’m gonna save so much money. I’m gonna run massive overcommit. I’m gonna try to show how much money I can save,’” Penick said. “Focus on building a positive perception of using infrastructure as a service, because that’s really one of the big problems we try and solve when you move a huge company, because you’re trying to get people to buy into the idea that they can actually do their day to day jobs with an API and it’s better and faster. Saving money is something that is sort of a side effect of that.”

Penick was able to get more users on board with five clusters around the world, which they call Openhouse. “Every single person at the company can boot up to five VMs in this environment,” he said. “This allows them to build development environments on the fly. As a side effect of this, we have actually managed to kill Windows laptops and desktops under the desk. They’re gone. Except for mine. I still have mine because I use it as a footrest, so I didn’t want to give it up.” The implementation has effectively saved tons of money on hardware just by making employees’ lives better.
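The per-person limit Penick describes can be pictured as a simple cap enforced at boot time. The sketch below is only an illustration of that idea, not Oath’s implementation or any OpenStack API: the class DevPool, the exception QuotaExceeded and the constant MAX_VMS_PER_USER are hypothetical names, and only the limit of five comes from the talk.

```python
from collections import defaultdict

# The limit of five VMs per person is the figure quoted in the talk.
# Everything else here (names, structure) is a hypothetical illustration.
MAX_VMS_PER_USER = 5


class QuotaExceeded(Exception):
    """Raised when a user tries to boot more VMs than the cap allows."""


class DevPool:
    """Toy model of a shared pool with a fixed per-user VM cap."""

    def __init__(self, limit=MAX_VMS_PER_USER):
        self.limit = limit
        self._vms = defaultdict(list)  # user -> list of VM names

    def boot(self, user, name):
        # Refuse the request once the user already holds `limit` VMs.
        if len(self._vms[user]) >= self.limit:
            raise QuotaExceeded(f"{user} already has {self.limit} VMs")
        self._vms[user].append(name)
        return name

    def delete(self, user, name):
        # Deleting a VM frees one slot for that user.
        self._vms[user].remove(name)

    def count(self, user):
        return len(self._vms[user])
```

A user who deletes a VM gets the slot back, which is what makes a small fixed cap workable for throwaway development environments.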

Now people at the organization were using VMs for their sandbox environments. They were doing development work, innovating, seeing some agility. The company, however, was still focused on (“addicted to”) bare metal. Penick told them that his team would support bare metal, but only by integrating with the existing processes that the company used to acquire it. “It took about thirty to ninety days to get physical compute resources, so we’re gonna slide into that and we’re gonna become a part of that,” he said. “And this means that you have to sit and you have to work very closely with supply chain teams, property architects, (and) application architects to influence them and convince them to want to use this thing.” No matter what, said Penick, there needed to be an OpenStack API between the user and the infrastructure.

Ultimately, what convinced people was the way Penick’s OpenStack implementation made it even easier to use bare metal. “If you need to requisition compute resources, what we’re going to do is take the ten most commonly used configurations of hardware and we’re gonna build those out into the network backplanes we want to see you in,” Penick said. “We’re going to build large pools of that hardware that you can come to a form and say, ‘Hey, I want five of these things.’ Approved. Done. You’re off and moving.”

This encouraged the right sort of behavior by offering people a simple path to it, said Penick. Eighty to ninety percent of individual hardware requests now come through a software tool that asks only a few questions. “The other form that we offered actually had something like forty-six to sixty questions, most of which could actually send you off onto a pretty dark path,” joked Penick. “So we’ve reduced this down to: who are you, what do you want, where do you want it, how many do you want?”

The new tool shows requesters how many machines of each configuration are available. “Open and honest,” said Penick. “We’re all part of the same company. We all trust each other. Give people the information so they can make an informed decision so they’re not making a stab in the dark.”
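The four-question request and the visible inventory can be sketched as a small data model. This is a hedged illustration only: HardwareRequest, HardwarePool, and the configuration and location names used with them are hypothetical, and nothing here reflects Oath’s actual tool. Only the four questions and the idea of showing availability come from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HardwareRequest:
    """The four questions from the talk, as fields (names are hypothetical)."""
    requester: str  # who are you
    config: str     # what do you want
    location: str   # where do you want it
    count: int      # how many do you want


class HardwarePool:
    """Toy inventory of pre-built machines in standard configurations."""

    def __init__(self, inventory):
        # inventory maps (config, location) -> number of free machines
        self._free = dict(inventory)

    def available(self):
        # Show requesters what is on hand before they ask.
        return dict(self._free)

    def request(self, req):
        # Approve only if enough machines of that configuration are free
        # in the requested location; otherwise leave inventory untouched.
        key = (req.config, req.location)
        free = self._free.get(key, 0)
        if req.count <= 0 or req.count > free:
            return False
        self._free[key] = free - req.count
        return True
```

For example, a pool created with eight machines of a made-up configuration "standard-a" in "dc1" would approve a request for five and then report three remaining.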

While using OpenStack infrastructure was optional at first, the managed transition led to a point where Penick could make it required. “But at this point,” said Penick, “we’ve actually had sufficient momentum to really encourage people to do the right thing.”

Finally, Penick and his team took the existing hardware and began to manage it with OpenStack. He calls this “horizontal migration,” in which all hardware can be deleted or re-imaged through the OpenStack system. Oath is now about seventy percent on OpenStack. “The remaining thirty percent, we’ll have most of that covered by the end of 2018,” he asserted.

It took Penick and his team about five years to manage this hyperscale transition. Now, even the hardcore folks that were initially against OpenStack automatically think of it when planning new projects. “If we were to do it all over again,” said Penick, “it would probably only take two. There’s been a lot of enhancements made to OpenStack, and there (are) a lot of perceptions that have changed in the industry.”

“But be aware,” Penick concluded, “you’re not gonna get this done in six months. And if you do, that’s amazing. But have realistic expectations of what you’ll be able to accomplish.”
