Two preprints describing our recent work provide the first reproducible inventory of global biodata resources, identifying 3112 unique resources described in the scientific literature between 2011 and 2021.
The Global Biodata Coalition (GBC) is striving to develop mechanisms to sustain the distributed global biodata infrastructure more efficiently and to ensure its health and that of its component resources. A primary requirement for achieving this is to understand the breadth and scope of the infrastructure. With this work, we have advanced our understanding of the resources that comprise the global biodata landscape.
Resource managers are aware that their individual resources form part of a global ecosystem of heterogeneous biodata resources, many of which are interdependent. To date, however, they and their respective funders have had only a partial understanding of the scale and distribution of the entire resource landscape worldwide.
Working as consultants for the GBC, Heidi Imker, from the University of Illinois, Urbana-Champaign, and Ken Schackart, from the University of Arizona, Tucson, in collaboration with the Chan Zuckerberg Initiative and the GBC Secretariat, developed a comprehensive and reproducible new understanding of the overall global biodata landscape. This work provides a fuller picture of the scope of global resources: their number, their geographical location and the funding organisations that support them.
Our approach
Our goal was to produce a comprehensive inventory of biodata resources using a methodology that allows the inventory to be updated periodically with minimal human intervention. This methodology applies machine learning-enabled natural language processing techniques to open data from the scientific literature, available in the Europe PMC literature database. It identifies biodata resources by means of the articles that resource owners publish to describe their resource to the broader scientific community.
Openly available machine learning models were tested and trained to classify articles and extract resource names in order to build the initial inventory. Metadata from associated articles were then gathered to identify additional information, such as geolocation of the resource and the names of funders.
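As a rough illustration of the steps above (classify articles, extract resource names, then gather metadata per resource), the following is a minimal sketch in Python. It is not the code used for the inventory: the keyword check and the title-splitting rule are simple stand-ins for the trained machine learning models, and the article records, field names and funder names are hypothetical examples rather than Europe PMC data.

```python
from collections import defaultdict

# Hypothetical article records. The field names are illustrative only and
# do not reflect the Europe PMC schema.
ARTICLES = [
    {"id": "A1", "title": "FooBase: a database of foo proteins",
     "countries": ["US"], "funders": ["Funder X"]},
    {"id": "A2", "title": "FooBase in 2020: an updated database of foo proteins",
     "countries": ["US", "GB"], "funders": ["Funder Y"]},
    {"id": "A3", "title": "A study of foo protein folding",
     "countries": ["DE"], "funders": ["Funder Z"]},
    {"id": "A4", "title": "BarAtlas: an online resource for bar genomes",
     "countries": ["JP"], "funders": []},
]

# Assumption: a keyword list stands in for the trained article classifier.
RESOURCE_CUES = ("database", "resource", "repository", "knowledgebase")


def describes_resource(article):
    """Stand-in classifier: does the title look like a resource description?"""
    title = article["title"].lower()
    return any(cue in title for cue in RESOURCE_CUES)


def extract_name(article):
    """Stand-in name extractor: text before the first colon, minus ' in <year>'."""
    head = article["title"].split(":", 1)[0]
    return head.split(" in ")[0].strip()


def build_inventory(articles):
    """Group resource-describing articles by resource name and merge metadata."""
    inventory = defaultdict(
        lambda: {"articles": [], "countries": set(), "funders": set()}
    )
    for article in articles:
        if not describes_resource(article):
            continue  # skip articles that do not describe a resource
        entry = inventory[extract_name(article)]
        entry["articles"].append(article["id"])
        entry["countries"].update(article["countries"])
        entry["funders"].update(article["funders"])
    return dict(inventory)


inventory = build_inventory(ARTICLES)
for name, entry in sorted(inventory.items()):
    print(name, entry["articles"], sorted(entry["countries"]), sorted(entry["funders"]))
```

Merging on the extracted name is what allows several articles about the same resource to count as one unique resource, with locations and funders accumulated across all of them.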
The inventory and the methodology for reproducing it are described in these preprints.
Open and reproducible
This fully open methodology, together with the publication of the code for this work on GitHub, allows others to repeat this and similar work and ensures that the inventory can be reproduced in future to monitor changes in the infrastructure over time.