The GBC has completed an inventory of global biodata resources as part of our work to describe this crucial infrastructure.
Biodata resources form a highly distributed infrastructure with thousands of resources around the world of varying scale, supported by hundreds of funding bodies, institutions, and charitable foundations. This infrastructure has grown dramatically over the past three or four decades as technological advances, such as those in nucleotide sequencing, have enabled exponential increases in the amount and types of data generated.
This growth has been organic, as researchers and institutions independently established each individual resource. Because of its distributed, organic nature, this infrastructure is difficult to describe or monitor over time: neither the number of resources nor their location has been systematically explored. There are many highly visible and firmly established biodata resources, but others are relatively small or serve specific research communities. A better understanding of the scale of the infrastructure will help funders and other stakeholders address the sustainability challenges it faces.
The Global Biodata Resource Inventory project
The inventory uses open bibliometric data from Europe PMC to identify resources through the articles that resource owners publish to alert the broader scientific community to the availability and utility of their biodata resources. Because the methodology is fully open, others can reproduce this and similar work, and the inventory can be rerun in future to monitor changes in the infrastructure over time.
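As an illustration of how open bibliometric data of this kind can be retrieved, the following is a minimal sketch that builds a search request for the public Europe PMC REST search service and pulls article identifiers and titles out of a JSON response. It is not the query or code used for the inventory (that is in the GBC GitHub repository); the function names and any query string passed in are hypothetical, chosen only for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public Europe PMC REST search endpoint.
EUROPE_PMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_search_url(query, page_size=25, cursor_mark="*"):
    """Build a Europe PMC search URL (hypothetical helper for this sketch)."""
    params = {
        "query": query,             # illustrative query, not the inventory's own
        "format": "json",
        "resultType": "lite",
        "pageSize": page_size,
        "cursorMark": cursor_mark,  # "*" requests the first page of results
    }
    return f"{EUROPE_PMC_SEARCH}?{urlencode(params)}"


def extract_records(payload):
    """Return (id, title) pairs from a decoded Europe PMC JSON response."""
    results = payload.get("resultList", {}).get("result", [])
    return [(record.get("id"), record.get("title")) for record in results]


def fetch_candidates(query, page_size=25):
    """Fetch one page of candidate articles (requires network access)."""
    with urlopen(build_search_url(query, page_size)) as response:
        return extract_records(json.load(response))
```

Calling fetch_candidates with a title or abstract query would return one page of candidate articles. In a workflow like the one described here, those candidates would then be passed to later classification and name-extraction steps rather than being treated as resources directly.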
In collaboration with experts at the Chan Zuckerberg Initiative, openly available machine learning models were tested and trained to classify articles and extract resource names in order to build the initial inventory. Metadata from associated articles were then gathered to identify additional information, such as author countries and resource funders. The inventory is now complete, with two preprints describing the work available on the GBC Zenodo Community and all code and data available in the GitHub repository. Both manuscripts have been submitted for publication in peer-reviewed journals. All are linked below.
A Machine-Learning Enabled Open Inventory of Global Biodata Resources from the Scientific Literature
A detailed Implementation of a Reproducible Machine Learning Enabled Workflow
GBC GitHub repository