In a very general sense, bioinformatics is a field of science in which computers are used to study biological data and to organize that information into databases and knowledge resources, such as the European COVID-19 Data Portal (https://www.covid19dataportal.org/).
Bioinformatics describes how cells use genetic information to build larger biomolecular structures and whole organisms, and how those organisms interact with one another. Tracking and organizing this data is the task of bioinformatics, and it places ever greater demands on computing infrastructures.
The origins of bioinformatics can be traced to the 1970s or earlier, when biologists realized that open data sharing and a systematic approach to organizing and sharing information about molecular structures, such as proteins, were necessary. The need for large-scale computing in bioinformatics exploded when cheap gene sequencing became available. Genes carry “digital information” between generations of life forms on Earth.
A large subset of bioinformatics research focuses on health, which means handling sensitive data gathered from human samples. The data ranges from genetics to imaging and clinically gathered registers of lifestyle habits, and it supports the study of a multitude of diseases. Due to the sensitivity of this data and the increasing regulations around it in Europe and globally, it has been a challenge to use traditional compute clusters and supercomputers to analyze it: they are simply not designed to handle sensitive data workloads.
At CSC, we provide scientific IT services for the Finnish research ecosystem, which means we need to serve a user base covering all sciences. Bioinformatics and the social sciences present a particular challenge: ensuring that data stays in the hands of researchers with the appropriate permissions to analyze it. We see a plethora of important use cases (for use cases in bioinformatics from Finland, see https://zenodo.org/communities/elixir-fi).
We can’t build unique isolated environments for each of them.
This is where OpenStack jumps in. OpenStack’s multitenancy design makes it possible to serve bioinformatics workloads handling sensitive data at scale, and OpenStack is extremely flexible in how it can be deployed. This means we can add all the controls necessary to provide secure, scalable resources to researchers who would otherwise be hard to serve. Data protection by default and by design have been driving principles of the architecture of our computing services (https://commission.europa.eu/law/law-topic/data-protection/reform/rules-business-and-organisations/obligations/what-does-data-protection-design-and-default-mean_en).
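To make the multitenancy point concrete, here is a minimal sketch of how a cloud operator might carve out an isolated project for a research group using the standard OpenStack CLI. The project name, user, and quota values are purely illustrative assumptions, not CSC’s actual configuration:

```shell
# Create an isolated project (tenant) for a research group.
# Resources in one project are invisible to users of other projects.
openstack project create --domain default \
    --description "Sensitive-data bioinformatics group" biogroup

# Grant a researcher access to that project only ("alice" is a
# hypothetical user; "member" is the default member role name).
openstack role add --project biogroup --user alice member

# Cap the project's resource consumption (example values).
openstack quota set --instances 10 --cores 64 --ram 262144 biogroup

# Give the project its own private network, isolated from other tenants.
openstack network create --project biogroup biogroup-net
```

Project-level isolation like this is what lets a single OpenStack deployment host many research groups without building a separate environment for each.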
We have been running our OpenStack instance targeted at sensitive-data research for close to 10 years; some git archeology found commits mentioning the Icehouse version of OpenStack. The hardware has been replaced and extended over the years, and we’ve added capabilities to the platform. A lot of bioinformatics workloads tend to be memory-hungry, so we have had large-memory nodes almost from the beginning. Machine learning and heavy parallel computation have also been a growing trend, so we have been providing GPU resources for quite a while. And when you talk about bioinformatics, it’s hard not to mention data storage and data transfer requirements. For instance, the COVID-19 sequence dataset from EMBL-EBI exceeds 1 PB, and millions of files need to be made computationally accessible before meaningful data analysis can take place. As a result of these kinds of scientific needs, the storage available from our Ceph clusters has steadily grown, complemented by NFS shares from NetApp appliances. We’ve been able to grow the platform in the directions we need while providing special capabilities to a wide range of users.
During these 10 years, we’ve enabled an enormous variety of bioinformatics research while serving a constant influx of new researchers. Because we can serve this large user base with one service, the cost of onboarding new users has also been kept in check. A recent example is single-cell technology, which measures the activity of all genes separately in each cell and produces a vast amount of information on diseases such as cancer. There are millions of cancer cells in a tumor, and data analysis requires ever more computing capacity and efficient algorithms as the number of cells and samples analyzed increases.
The resources offered by OpenStack are quite low-level, which is a challenge for some users. However, since the platform is already designed for handling sensitive data, OpenStack is also an excellent foundation for building higher-level services for our users. While OpenStack remains heavily in use, we do our best to shift users toward easier, higher-level services – built on OpenStack.
To learn more about how CSC is using OpenStack, check out the full case study!
- Bioinformatics on OpenStack at CSC - November 25, 2023