Open Telekom Cloud is a major OpenStack-powered public cloud in Europe. It is operated for Deutsche Telekom Group by its subsidiary T-Systems International GmbH.
Artem Goncharov, Open Telekom Cloud architect, shares why Open Telekom Cloud chose Zuul, the open source CI tool, and how they use it with GitHub and OpenStack.
How did your organization get started with Zuul?
We started using Zuul for the development of OpenStack client components like SDKs, CLIs, and other internal operational software components. After we managed to get some changes merged into Zuul, we deployed it in production as our continuous integration system. Today it is our CI system for the development of all open source tooling we offer to our clients. Furthermore, we currently use Zuul to monitor the quality of our platform services by periodically executing a set of tests. This includes continuously monitoring our RefStack compliance.
We are preparing to offer Zuul as an internal service to other departments inside Deutsche Telekom, beyond our own projects. We run Zuul on our own public cloud, the Open Telekom Cloud, and also spawn the VMs there. We are all-in OpenStack!
Describe how you’re using Zuul
Currently, we have Zuul working on a public domain interacting with GitHub. Although the CI workflow with Gerrit is very powerful, we observed that some users struggle with its complexity. We therefore decided to stay with GitHub to allow more people in our community to participate in the development of our projects. Nodepool spins up the virtual machines for the jobs using its OpenStack driver.
What is your current scale?
We have a five-node ZooKeeper cluster and one each of scheduler, nodepool-builder, and nodepool-launcher. At present, two Zuul executors satisfy our needs. We have about ten projects managed by Zuul but plan to increase this number to 50 soon. On average we run 50 builds a day.
What benefits has your organization seen from using Zuul?
First, we are now prepared for growth. Today, our projects are manageable in size and complexity, but we expect the complexity to grow. We are therefore relieved to have gating in place, ensuring all software is tested and consistent at all times. That allows us to scale the number of projects we cover.
Second, we have better control over where and how the build and test processes take place. Since we are testing real-life cloud scenarios, there are credentials for and access to actual cloud resources involved. With Zuul and Nodepool we have better control over these virtual machines and the stored data.
Last but not least, we have a rather complex integration and deployment workflow. It is not just software that we build and package; we also create other artifacts like documentation, PyPI packages, and a lot more that requires extra steps. We like the flexibility of having Ansible playbooks define those workflows.
What have the challenges been (and how have you solved them)?
It is important for us to test all aspects of our public cloud. This functional testing obviously includes logging into domains, creating resources, and dealing with all aspects of credentials. Since this setup is connected to GitHub and thus indirectly accessible to the public, we felt a bit uneasy about running the Zuul setup on the same platform where we conducted the actual tests and builds. Eventually, we segregated those scopes by means of several dedicated OpenStack domains to which only Zuul has API access. So in the worst case, should credentials ever leak, we just have to clean up and reset one of our test domains, while the Zuul infrastructure itself remains unaffected. For that we use the “project cleanup” feature of the OpenStack SDK, to which we also contributed.
We also found that functional tests or RefStack verification runs often leave a lot of debris behind that is not cleaned up by the code, sometimes even because of failing API calls of OpenStack itself. We use “project cleanup” to mitigate this behavior as well.
Zuul also publishes a lot of information in log files to publicly readable Swift containers. Our security teams complain about that, even if most of the information is harmless. In some cases, we patched Zuul or its jobs so this data does not accumulate in the first place.
For both operational and security reasons, we’d like to containerize all workloads as much as possible. Zuul comes with a set of Docker containers. Unfortunately, the nodepool-builder in particular needs a lot of privileges, which is hard to implement with plain old Docker. Our approach is to use Podman as an alternative for that.
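As a rough illustration of the “project cleanup” idea, the sketch below asks the OpenStack SDK which resources it would delete in the project a connection is scoped to, without deleting anything. It is a minimal sketch, not our actual tooling: the helper name `list_cleanup_candidates` is hypothetical, and it assumes the SDK's `Connection.project_cleanup` accepts `dry_run` and `status_queue` arguments and reports the resources it finds through that queue.

```python
import queue
from typing import Any, List


def list_cleanup_candidates(conn: Any, wait_timeout: int = 120) -> List[Any]:
    """Return the resources that project cleanup reports for this project.

    `conn` is assumed to be an openstacksdk Connection scoped to a
    dedicated test project. With dry_run=True nothing is deleted.
    """
    status_queue: "queue.Queue[Any]" = queue.Queue()
    conn.project_cleanup(
        dry_run=True,
        wait_timeout=wait_timeout,
        status_queue=status_queue,
    )

    # Drain whatever the SDK reported into a plain list.
    candidates = []
    while not status_queue.empty():
        candidates.append(status_queue.get_nowait())
    return candidates
```

With the openstacksdk package installed, `conn` could come from `openstack.connect(cloud="...")`, where the cloud name refers to an entry in a local `clouds.yaml`. Running the dry run first makes it possible to review what would be removed before calling the cleanup for real.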
What are your future plans with Zuul?
The Gerrit code review system implements a sophisticated role model, which enables users to do code reviews, promote revisions, or to authorize the eventual merges. It is a real challenge to implement these access control features just with GitHub. As a workaround for the time being we use “/merge” comments on the pull requests.
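The logic behind such a comment-based workaround can be sketched as follows. This is only an illustration under stated assumptions, not how Zuul or our configuration implements it: the function name, the `maintainers` allow-list, and the rule that the command must stand on its own line are all hypothetical.

```python
import re
from typing import Set

# The command must be the only content on its line (assumption).
MERGE_COMMAND = re.compile(r"^[ \t]*/merge[ \t]*$", re.MULTILINE)


def is_merge_request(comment_body: str, author: str, maintainers: Set[str]) -> bool:
    """Return True if an authorized user asked for a merge in this comment."""
    if author not in maintainers:
        # Only users on the (hypothetical) allow-list may trigger a merge.
        return False
    return MERGE_COMMAND.search(comment_body) is not None
```

For example, `is_merge_request("LGTM\n/merge", "alice", {"alice"})` is true, while the same comment from a user outside the allow-list is ignored.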
Even though Zuul’s prime directive is to automate, sometimes it’s nice to be able to manually intervene. Unfortunately, there’s currently not really a UI for administrative tasks like re-building some artifacts. That would come in handy to migrate even more Jenkins jobs.
The operation of Zuul is complex and we currently don’t have a dedicated ops team. We reduce the operational effort by implementing Ansible playbooks for it, but this is an ongoing effort.
We are working on turning Zuul into an internal offering for other Deutsche Telekom subsidiaries and projects, so they can start using it as well. We’re also very interested in enabling Kubernetes and OpenShift to act as an operations platform for Zuul. Here the challenge stems from the multi-cloud issues that high availability entails.
Are there specific Zuul features that drew you to Zuul?
Zuul fuels the development of OpenStack, which is a remarkable job and a considerable responsibility. We are impressed by how scalable and flexible it is and have even adapted its architecture to internal projects. We’re confident that there is more to come.