DNA is the new oil (or bacon?), and as the amount of genomic data continues to increase exponentially, technical, ethical and legal challenges, along with security and privacy concerns, arise by the dozens. The Medical Futurist believes now is the time for concerted, community-wide planning for the genomic data challenges of the next decade.

The amount of genomic data is soaring – and the challenges are growing

Imagine genes as sentences and genomes as entire books consisting of tens of thousands of chains of words. Interpreting the whole book for the first time by completing the Human Genome Project took more than 15 years. It’s fascinating how far humanity has come, and even more so if we look at the evolution of genome sequencing since 2006, the signature year when the DNA double helix revealed its secrets for the very first time. The time and cost of sequencing a genome were reduced by a factor of 1 million in less than 10 years. Anyone can have a genetic test within weeks. Moreover, anyone can get their whole genome sequenced for some hundreds of dollars, and the price is expected to fall further. Veritas Genetics has been hailed as the first company able to sequence, analyze and interpret whole human genome data for less than $1,000.

Already three years ago, OmicsMaps showed that there were more than 2,500 high-throughput instruments located in nearly 1,000 sequencing centers in 55 countries, in universities, hospitals, and other research laboratories. Researchers estimate that between 100 million and as many as 2 billion human genomes could be sequenced by 2025, representing four to five orders of magnitude of growth in ten years and far exceeding the growth of many other big data domains.

Alongside the exponential growth of interest in genomic research, several challenges have come to light. First, consider the technological demands of genome sequencing. Like a four-headed dragon, the area has four subfields with entirely different requirements and trajectories: the acquisition, storage, distribution, and analysis of large datasets, each of which requires massive and novel technological solutions. Beyond the technical difficulties, however, come the genuinely hard questions of data privacy and security. Who can have access to data, and how do research groups, companies or governments ensure that genomic data doesn’t end up in the wrong hands?

Furthermore, how will it end up in anyone’s hands? Could a genomic data market evolve? What will regulators and policy-makers make of genomic data? How could governments benefit from the technology and the information originating from it? These burning questions need responses as soon as possible – in spite of their complex and multilayered nature.

How to collect, store and share genomic data?

As a single human genome takes up 100 gigabytes of storage space, and more and more genomes are sequenced, storage needs will grow from gigabytes to petabytes to exabytes. By 2025, an estimated 40 exabytes of storage capacity will be required for human genomic data.

Moreover, for every 3 billion bases of the human genome sequence, 30-fold more data (~100 gigabases) must be collected because of errors in sequencing, base calling, and genome alignment. This means that as much as 2–40 exabytes of storage capacity will be needed by 2025 for human genomes alone. Some researchers anticipate the creation of genomics archives for storing millions of sequenced genomes. Others believe that cloud computing is the only storage model that can provide the elastic scale needed for DNA sequencing.
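
To get a feel for these orders of magnitude, the figures quoted above can be multiplied out in a few lines of Python. This is only a back-of-the-envelope sketch: the function names are made up for illustration, decimal units are assumed (1 exabyte = 10^18 bytes), and the 100 gigabytes per genome is treated as a flat, uncompressed figure. The naive product does not reproduce the 2–40 exabyte range quoted above, which presumably rests on different per-genome assumptions, so the result should be read as a rough upper bound rather than a forecast.

```python
# Back-of-the-envelope arithmetic using the figures quoted in the article.
# Assumptions (not from any official source): decimal units and a flat,
# uncompressed 100 GB per sequenced genome.

BASES_PER_GENOME = 3_000_000_000      # ~3 billion bases in a human genome
COVERAGE = 30                         # 30-fold oversampling
BYTES_PER_GENOME = 100 * 10**9        # 100 gigabytes per genome
EXABYTE = 10**18                      # bytes in one (decimal) exabyte


def raw_bases_collected(genome_bases=BASES_PER_GENOME, coverage=COVERAGE):
    """Number of bases that must be read to cover one genome."""
    return genome_bases * coverage


def storage_exabytes(n_genomes, bytes_per_genome=BYTES_PER_GENOME):
    """Naive, uncompressed storage need in exabytes for n_genomes."""
    return n_genomes * bytes_per_genome / EXABYTE


if __name__ == "__main__":
    print(f"Bases read per genome: {raw_bases_collected():,}")
    for n in (100_000_000, 2_000_000_000):
        print(f"{n:,} genomes -> {storage_exabytes(n):.0f} EB (uncompressed)")
```

With 30-fold coverage, one genome requires reading 90 billion bases, which is where the "~100 gigabases" figure comes from.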

Tech giants are also betting on the latter, especially considering the distribution capabilities of cloud computing. In March 2014, Google launched Google Genomics, a cloud computing service for companies to store genomic data. The search giant has built an interface, or API, that lets scientists and researchers move DNA data into its server farms and run experiments there using the same database technology that indexes the Web and tracks billions of Internet users. Microsoft has a similar service dubbed Microsoft Genomics, while genome sequencing giant Illumina uses Amazon Web Services. Increasing amounts of genomic data will soon be available through these platforms, and the questions of who will have access, and how, arise almost automatically.

Genomic data – confidential and secure?

A person’s genetic and, holistically speaking, genomic data represents the most private information about the past, present, and future of an individual. While keeping this information package safe and confidential should be of utmost importance, it is trickier than with other types of data.

To guard against data breaches and hacks, the majority of genetic testing companies use encryption and data platforms. And although significant hacks of genetic data have not happened so far, it’s unfortunately just a matter of time. DNA testing service MyHeritage revealed that hackers had breached 92 million of its accounts. They did not reach genetic information, only profiles and passwords, but what if they had?

As genomic data requires special care, firms like EncrypGen, Nebula Genomics, Luna DNA and Zenome have turned to the blockchain to secure sensitive DNA records. The technology might be a good fit for securing genomic data, as it creates immutable distributed ledgers where any change becomes immediately evident.
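
The "any change becomes immediately evident" property can be illustrated with a toy hash chain in Python. This is a minimal sketch of the general idea, not the implementation of any of the companies named above: it runs on a single machine, has no distribution or consensus, and the function names and payload strings are hypothetical stand-ins. Each block stores the SHA-256 hash of its own content together with the hash of the previous block, so altering an earlier record breaks verification.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first block


def block_hash(index, payload, prev_hash):
    """SHA-256 over a canonical JSON encoding of the block's content."""
    record = json.dumps(
        {"index": index, "payload": payload, "prev_hash": prev_hash},
        sort_keys=True,
    )
    return hashlib.sha256(record.encode("utf-8")).hexdigest()


def append_block(chain, payload):
    """Append a new block that is linked to the current last block."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    index = len(chain)
    chain.append({
        "index": index,
        "payload": payload,
        "prev_hash": prev_hash,
        "hash": block_hash(index, payload, prev_hash),
    })


def is_valid(chain):
    """Recompute every hash and check every link; False if anything changed."""
    prev_hash = GENESIS_HASH
    for position, block in enumerate(chain):
        if block["index"] != position or block["prev_hash"] != prev_hash:
            return False
        expected = block_hash(block["index"], block["payload"], block["prev_hash"])
        if block["hash"] != expected:
            return False
        prev_hash = block["hash"]
    return True


if __name__ == "__main__":
    ledger = []
    append_block(ledger, "access granted: record A -> lab X")  # hypothetical entries
    append_block(ledger, "access revoked: record A -> lab X")
    print(is_valid(ledger))                 # chain is intact
    ledger[0]["payload"] = "access granted: record A -> lab Y"
    print(is_valid(ledger))                 # tampering is detected
```

In a sketch like this, the ledger would typically hold access records or fingerprints of the data rather than the raw genome itself; that design choice is an assumption here, not something the companies have confirmed.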

However, the problem is not only data security but also confidentiality and the data management practices of actors with access to large genetic or genomic datasets, such as direct-to-consumer (DTC) companies. Although testing establishments use anonymized data, once the DNA of a million people has been collected, it becomes possible to infer almost anyone’s identity from the genetic data. According to research by computational biologist Yaniv Erlich, published in October 2018 in Science, more than 60 percent of Americans with European ancestry can be identified through their DNA using open-access genetic genealogy databases, regardless of whether they’ve ever sent in a spit kit. That’s worrying, or at least it should make people more cautious and conscious when deciding whether to take a genetic test. For example, it should urge them to assess a DTC company’s data privacy policies carefully.

Who benefits from genomic big data?

All the more so since consumer companies like 23andMe and Ancestry have so far created genetic profiles for more than 12 million people, according to recent industry estimates, and they are willing to monetize the data.

For research facilities, pharma companies and other medical enterprises, the knowledge sitting in small DNA base pairs might prove to be invaluable – or at least valuable enough to pay huge sums. For example, in July 2018, GlaxoSmithKline decided to invest $300 million in 23andMe and forge an exclusive drug development deal with the Silicon Valley consumer genetics company to research and develop innovative new medicines and potential cures, using human genetics as the basis for discovery. Ancestry, which maintains a 5-million-person consumer database of genetic information, once partnered with Google’s stealthy life-extension spinoff Calico to study aging.

Caitlin Curtis, a research fellow at the University of Queensland, estimates that 23andMe made around $130 million from selling access to about a million genotypes before the GSK deal, implying an average price of around $130. That means that if you purchased 23andMe’s genetic test for $100-150, your genetic information could have been sold for another $130 on average, leaving plenty of profit for the company. That’s worrisome news, and not (solely) because of the money part. Beyond being concerned about whether the results are reliable, or whether the outcome will change as science advances, users also have to worry about whether companies sell their personal data to the highest bidder.
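
The arithmetic behind that estimate is simple enough to check in a couple of lines. The sketch below just divides the two figures quoted above; the function name is made up for illustration, and both inputs are estimates rather than reported financials.

```python
def average_price_per_genotype(total_revenue_usd, genotypes_sold):
    """Average revenue per genotype, given total revenue and number sold."""
    if genotypes_sold <= 0:
        raise ValueError("genotypes_sold must be positive")
    return total_revenue_usd / genotypes_sold


if __name__ == "__main__":
    # Figures quoted in the article: ~$130 million for ~1 million genotypes.
    price = average_price_per_genotype(130_000_000, 1_000_000)
    print(f"Average price per genotype: ${price:.0f}")
```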

If you are uncomfortable with enterprises sharing your raw DNA data with research facilities or selling it to companies, you should definitely inform yourself about these issues: who owns your genetic data if you take a DNA test? Who will have access to that information? Could you delete your data or have the company erase your genetic information?

What happens on the other side: Genomic big data markets

A big reason many genetics-testing companies share data with third parties is for research – both by academic institutions and for-profit enterprises. The average customer who chooses to let 23andMe share their data for research contributes to more than 230 studies on topics including asthma, lupus, and Parkinson’s disease, the company says.

The exponential rise in the amount of genomic data, coupled with the sharing and purchasing practices of DTC companies, already foreshadows the birth of genomic big data markets. The Nebula Genomics white paper states that the opportunities around personal genome sequencing will soon create a genomic data market worth billions of dollars.

The team believes that this market will have similar characteristics and challenges as any market where data is involved, and it aims to show people how to avoid its pitfalls. For example, the company wants users to take back ownership and sell their DNA data on their own: instead of other for-profit actors, they would be the ones capitalizing on their own genetic information. Another company, EncrypGen, promises to help people sell the information in their DNA for cryptocurrency tokens. The startup says it is building a unique blockchain platform, the “Amazon of genetic material”, to enable people to share genomic data safely and securely on a new emerging market, the genomic data market, in exchange for cryptocurrency.

Although these companies are small in size and possess only a fraction of the genetic information that companies such as 23andMe, Ancestry or Helix do, their vision and steps offer an alternative way for people to deal with their genetic data. Unlike in the case of these huge genetic companies, here, you could be the one deciding about the fate of your own data. Where is the regulator or the policy-maker, you ask?

Population genomics and politics

Many privacy experts are concerned that the only law in the US currently covering genetic privacy, the Genetic Information Nondiscrimination Act (also known as GINA), is too narrow in its focus on banning employers and insurance companies from accessing this information. Beyond that, there is next to nothing: the genetic information space is in many respects still uncharted legislative territory, and consumers are taking the genetic testing companies at their word.

That should worry consumers in many respects, some of which are described above. Without proper legislation, companies could use and re-use genomic data for many purposes out of the consumer’s sight, and governments would also have more room to act upon such data. Law enforcement units and courts could utilize genetic databases, similarly to how they were used to catch the Golden State Killer after decades.

Perhaps most worrying of all is the possibility of politics meddling in genomics, and there are already examples of that, too. In April 2018, news outlets reported that one of the biggest states in India, Andhra Pradesh, will secure the DNA data of 50 million citizens through the blockchain. On 20 March 2018, Estonia launched the first stage of a national state-sponsored genetic testing and information service providing 100,000 of its 1.3 million residents with information on their genetic risk for certain diseases. Already in 2015, MIT Technology Review reported that a genetics company in Iceland named DeCode Genetics had collected full DNA sequences on 10,000 individuals. And since the population of Iceland totals around 320,000 citizens, who are relatively closely related, DeCode said it could extrapolate to accurately guess the DNA makeup of nearly the whole population of the country, including those who never participated in its studies.

Researchers estimate that, with the world’s population projected to top 8 billion by 2025, as many as 25% of the people in developed nations and half of that share in less-developed countries could have their genomes sequenced (comparable to the current worldwide distribution of Internet users). But what might happen to genomic data in the hands of governments? How safe would it be for citizens to let authorities know about their innermost biological secrets alongside their health risks and future prospects? Not to mention the possibility of governments selling their citizens’ genetic data if that serves their own interests.

The technical questions of how to store, share, distribute and analyze genetic and genomic data seem to be nothing compared to the haunting security and data privacy issues that come with DNA testing. Not to mention the possibility of an already existing “shadow” genetic big data market and the expected emergence of an entirely real one soon, or the concern that law enforcement, authorities, regulators or policy-makers might have access to or control over citizens’ genetic information in the future. As the accumulation of DNA base pairs won’t stop for a minute in the coming years, these issues will become more and more burdensome if we don’t come up with (at least) some medium-term solutions.