GBC’s Program Manager reflects on our deepening understanding of the global biodata infrastructure

Jan 22, 2024

The term infrastructure is widely used, but it has multiple meanings and instances, some of which, especially in the context of the life sciences, are not well known or understood. A dictionary definition describes infrastructure as “the basic physical and organisational structures and facilities (e.g., buildings, roads, power supplies) needed for the operation of a society or enterprise.” Societal infrastructures include the electrical grid, road networks, water works, and many others. We are also familiar with big scientific infrastructures such as the Large Hadron Collider at CERN and large telescopes, whether on land, such as Keck 1 and Keck 2 in Hawaii, or in space, such as Hubble. These infrastructures, whether societal or scientific, share a common feature: they are physical entities that we can see and perceive as constructed objects.

The life sciences also have an infrastructure, in the form of databases that archive primary research data and knowledgebases that store and share data generated by researchers around the world while adding value through analysis. Collectively, we term these components biodata resources. As GBC Executive Director Guy Cochrane noted in a previous blog, life scientists, biologists, and biomedical researchers are critically dependent on these resources for their work.

Unlike the physical infrastructures I have described, this infrastructure is distributed and largely hidden: its physical components are racks of computers and disks hosted in out-of-the-way data centres. It is managed by staff scattered around the world in offices and laboratories, within institutes and universities. To date, however, we do not have an accurate idea of how the infrastructure is financially supported. Large physical scientific infrastructures are typically the result of a centrally planned project with bespoke, well-described public funding allocated to support them. By contrast, the global biodata infrastructure is funded by thousands of entities, large and small, distributed throughout the globe. This opacity adds to the challenge of understanding the infrastructure's scale and of devising more sustainable ways of supporting it.

To arrive at a basic overview of its scope, the GBC has recently completed a reproducible global inventory of biodata resources that provides a baseline description of the infrastructure: the number of resources in existence, where they are located, and who funds them.

Past efforts to describe the infrastructure have relied either on self-reporting by resource managers or curation by various funding bodies, professional bodies, or interested researchers. Both approaches are valuable, but have proven difficult to sustain: resource managers are overstretched and challenged to find time to report their resources, and efforts to actively collect resources as they are developed are inherently unsystematic.

We sought to develop a method of assembling an inventory that was not dependent on self-reporting or on active curation, but was nevertheless reproducible at relatively low effort. This was achieved by taking advantage of the very large, open, and searchable corpus of life sciences literature in EuropePMC to develop a machine learning-enabled method to produce an inventory of resources that can be repeated periodically and does not require a continuous funding stream.
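As a rough illustration of the general idea (query an open literature corpus, then filter the hits for articles that describe a resource), here is a minimal Python sketch. It is not the workflow described in the manuscripts: the function names are hypothetical, the keyword filter is only a crude stand-in for a trained machine learning classifier, and the query parameters are assumptions based on Europe PMC's public REST search service.

```python
from urllib.parse import urlencode

# Europe PMC's public REST search endpoint (assumed here as the entry point).
EUROPE_PMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_search_url(query: str, page_size: int = 100, cursor_mark: str = "*") -> str:
    """Build a search URL for one page of results (hypothetical helper)."""
    params = {
        "query": query,
        "format": "json",
        "pageSize": page_size,
        "cursorMark": cursor_mark,  # "*" requests the first page
    }
    return f"{EUROPE_PMC_SEARCH}?{urlencode(params)}"


def looks_like_resource(
    title: str,
    keywords: tuple = ("database", "knowledgebase", "repository"),
) -> bool:
    """Crude keyword filter standing in for a trained classifier (assumption)."""
    lowered = title.lower()
    return any(keyword in lowered for keyword in keywords)


if __name__ == "__main__":
    # Print the URL one would fetch; no network request is made here.
    print(build_search_url("database"))
    print(looks_like_resource("ExampleDB: a database of example records"))
```

Because a sketch like this needs no manual curation, it could in principle be rerun at intervals, which is the property that makes a literature-based inventory cheap to repeat.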

The details of the methodology are described in two manuscripts (references 1 and 2) and summarised in the figure. We identified 3112 biodata resources described in EuropePMC. They are distributed throughout the world but, unsurprisingly, are concentrated in higher-income countries. This is certainly an undercount, since resources that have not published a description of themselves are not in our inventory.

While this one-of-a-kind study is not an end in itself, it provides a jumping-off point for further exploration. We invite others to download the files and continue, or possibly expand, the analysis to help build our collective understanding of the global infrastructure, thereby serving our overall efforts towards biodata sustainability.

Figure: Inventory methodology

1 Imker HJ, Schackart KE III, Istrate A-M, Cook CE. 2023. A machine learning-enabled open biodata resource inventory from the scientific literature. PLoS ONE 18(11): e0294812.
2 Imker HJ, Schackart KE III, Istrate A-M, Cook CE. 2023. Detailed Implementation of a Reproducible Machine Learning-Enabled Workflow. Preprint: