Imagine it’s early 2020 and there are signs of a major infectious disease outbreak. It’s a virus; you have taken samples and generated a sequence. Your first step is to compare your new sequence to what’s already in the databases. Fortunately, you have matches: over the last decade or so, before SARS-CoV-2 emerged, scientists diligently shared some 30,000 coronavirus sequences in public databases. This gives you rapid access to some key tools to tackle the outbreak: you can identify the evolutionary lineage of the virus, and what it is closest to; you can make inferences about the biology, transmission and infection processes of the new virus, based on what is known about the viruses already sequenced; and you can start to design therapeutic and vaccine pipelines.
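To make that first step concrete, here is a minimal sketch of such a database comparison using Biopython’s web BLAST interface. The file name novel_virus.fasta is hypothetical, and a real outbreak response would typically rely on dedicated pipelines and curated viral databases; this simply illustrates how a new sequence is matched against what others have already shared.

```python
from Bio import SeqIO
from Bio.Blast import NCBIWWW, NCBIXML

# Hypothetical input: the newly assembled viral sequence, in FASTA format.
query = SeqIO.read("novel_virus.fasta", "fasta")

# Submit a nucleotide search against NCBI's public nt database.
# This is a network call and can take several minutes.
result_handle = NCBIWWW.qblast("blastn", "nt", str(query.seq))

# Parse the results and report the closest public matches: the starting
# point for placing the new virus on an evolutionary lineage.
blast_record = NCBIXML.read(result_handle)
for alignment in blast_record.alignments[:5]:
    hsp = alignment.hsps[0]
    identity = hsp.identities / hsp.align_length
    print(f"{alignment.title[:70]}  identity: {identity:.1%}")
```

Every hit this returns exists only because someone deposited their sequence before it was obviously needed.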
But imagine a different scenario in which previous scientists had not made their coronavirus sequences available. You would be working in the dark, unable to identify and characterise the pathogen you were facing. The public health response would be delayed and less targeted, and work towards therapeutics and vaccines would be delayed by months at best.
In this example, the scientists who submitted the earlier coronavirus sequences could not have known how or when their data would be useful, yet those data turned out to be vital. This is a general pattern: those releasing scientific data openly cannot predict future use, but submission to public databases is what makes future use possible. The investment is now; the return on investment is later.
There’s a behavioural phenomenon known as present bias: our tendency to prioritise actions with immediate benefits over those that pay off in the future. It risks playing out powerfully in the life sciences. Depositing data in community-endorsed repositories takes time and does not always deliver instant rewards. There are often no strong incentives to publish well-curated data unless it is explicitly required, for example by funding bodies or journals. As a result, data are at risk of being lost or left inaccessible.
The problem is that future research depends on today’s data. Tomorrow’s reference data and software tools (the models, predictors and diagnostics) will only be as good as the data we collect today. If we build only with outdated, incomplete or skewed data, our science will be lacking.
Take the example of an ecologically damaging invasive species. If someone identifies a new species or variant in a region but fails to register the key features that allow its identification, or to record its presence there publicly, future researchers may be unable to recognise it when it reappears. The onward risk is that the chance to intervene, track its spread or understand its impact is missed, simply because the essential data weren’t available when they were most needed.
This is why comprehensive, continual data deposition must become a valued and recognised part of research practice. Shared data doesn’t just help others; it helps the scientist who shares it, their future self, and their field. It preserves the value of experimentation and ensures that today’s science remains actionable tomorrow.
We must align incentives, reduce friction, and instil a culture in which sharing data is the norm, not the exception. Simply put, if your data aren’t in the system, the system can’t work for you or the scientists around you.