The GBC is undertaking an inventory of global biodata resources as part of our work to describe this crucial infrastructure.
Biodata resources form a highly distributed infrastructure: thousands of resources of varying scale around the world, supported by hundreds of funding bodies, institutions, and charitable foundations. This infrastructure has grown dramatically over the past three or four decades as technological advances, such as those in nucleotide sequencing, have enabled exponential increases in the amount and types of data generated.
This growth has been organic, as researchers and institutions established each resource independently. Because of its distributed, organic nature, this infrastructure is difficult to describe or monitor over time: neither the number of resources nor their locations have been systematically explored. Many biodata resources are highly visible and firmly established, but others are relatively small or serve specific research communities. A better understanding of the scale of the infrastructure will help funders and other stakeholders address the sustainability challenges it faces.
The Global Biodata Resource Inventory project
To build the initial inventory, we worked with experts at the Chan Zuckerberg Initiative to test and train openly available machine learning models that classify articles and extract resource names. Metadata from the associated articles were then gathered to identify additional information, such as author countries and resource funders. The inventory is in the final stages of development. Preprints describing the work will be available in early 2023, and all code and data are or will be available in the GitHub repository linked below.