In providing a global foundation for data in the life and biomedical sciences, the world’s biodata resources operate in a very real way across borders. On their journey towards knowledge, biodata can cross borders many times: an experiment may be conducted in one part of the world, data generated and shared through a database in another, with those data then used by scientists in a third country or region.
Since the scientists and technical experts who run biodata resources strive for the greatest possible openness of their services, they typically seek to minimise friction for those who wish to discover and reuse data. There are many anecdotes about impactful cross-border data flows, but that very openness means that, frustratingly, there is little systematic tracking of how data are used after they are accessed.
Every now and then, though, a study comes along that enables a more systematic look. Working with colleagues from a number of European institutions, the German research institute, IPK Gatersleben, provides a helpful tool* that tracks geographical provenance of sequenced biological samples, as well as geographical patterns of citation of these sequence records within the scholarly literature. It offers just a small sample, as it tracks only one biodata type and can only mine citations of sequence accessions from publications that are open-access and available for text mining, but it is capable of providing some relative numbers and can reveal important patterns.
Take the examples of France, South Africa and South Korea. As we might expect, we see substantial usage of data generated on samples from each country by scientists within that country. Alongside this, however, we also see considerable usage of data from these samples by scientists outside each of the countries. This is matched by sizeable use of data from foreign samples by scientists within each of the countries. Add to this the fact that the biodata resources providing access to the data are typically outside the provider or user country in each case, and it becomes clear that complex webs of cross-border biodata movement underpin the scientific advances described in these publications.
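For readers who want to see the three patterns described above in concrete terms, here is a minimal sketch of how such flows could be tallied for one country. It is not the IPK Gatersleben tool's implementation: the function names, the two-letter country codes and the record layout (sample country, citing country) are assumptions made for illustration, and the records are invented placeholders rather than results from the study.

```python
from collections import Counter

def classify_flow(sample_country, citing_country, focus):
    """Label one citation of a sequence record relative to a focus country."""
    if sample_country == focus and citing_country == focus:
        return "domestic"   # focus-country samples used by focus-country scientists
    if sample_country == focus:
        return "outbound"   # focus-country samples used by scientists elsewhere
    if citing_country == focus:
        return "inbound"    # foreign samples used by focus-country scientists
    return "other"          # neither side involves the focus country

def tally_flows(records, focus):
    """Count citations per flow type for the chosen focus country."""
    return Counter(classify_flow(sample, citing, focus) for sample, citing in records)

# Invented placeholder records: (country of sample, country of citing authors).
records = [("FR", "FR"), ("FR", "ZA"), ("KR", "FR"), ("ZA", "KR"), ("FR", "FR")]

flows = tally_flows(records, focus="FR")
print(flows)
```

Running the same tally with a different focus country gives that country's view of the same records, which is what makes the relative sizes of the three categories comparable.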
These findings add further weight to the already strong case that the collective efforts now underway to determine new models for sustaining the world’s invaluable biodata resources, coordinated by the GBC in tandem with life science funders and the managers of biodata resources, must be truly cross-border in nature. Scientific progress hinges on the frictionless flow of data across national boundaries, and the IPK Gatersleben tool provides the means to evidence this.
Paper: Lange M, Alako BT, Cochrane G, et al. Quantitative monitoring of nucleotide sequence data from genetic resources in context of their citation in the scientific literature. https://doi.org/10.1093/gigascience/giab084
For assistance in analysis and interpretation for this blog, I thank:
Matthias Lange, Jorge Garcia Brizuela, Genuar Nuñez Vega, Amber Hartmann-Scholz and Jens Freitag
I note that developing the interpretation has only been possible because the authors of the study provided open data services on their findings and had access to open data resources upon which to build their study.