Why Precision Medicine Demands a Completely New Type of Database

Precision medicine is revolutionizing healthcare by tailoring prevention and treatment to particular groups of patients, especially through genetic or molecular profiling. This approach spans a range of health disciplines, including disease prediction, prevention and the resulting treatments, prenatal and newborn diagnostic screening, and screening for rare diseases and cancer types. For life sciences, the ability to peer into high-resolution molecular profiles has transformed the way we hunt for new drugs and targets.

As exciting as these possibilities are, precision medicine relies on being able to handle and analyze vast amounts of complex data with great agility. With faster, more affordable sequencing technologies, organizations are generating hundreds of terabytes of genomic data per month, and it’s estimated that genomics alone will generate up to 40 exabytes of data per year by 2025. The explosion of variant data from public and private biobanks, and the advent of multi-omics in particular, are rendering existing solutions for storing and analyzing data impractical.

A prime example is Phenomic AI, a pioneering biotech company targeting the tumor stroma, a complex barrier that surrounds cancer cells and stops medicines from working. At the core of Phenomic AI’s work is a target discovery platform where scientists apply machine learning to single-cell data amassed at scale, enabling the discovery of novel stromal targets. Within a short period, the company’s single-cell datasets grew from two million to over 30 million cells, which placed tremendous strain on its database system and slowed its data scientists’ ability to query and analyze the data. With a new database approach, however, Phenomic AI is now successfully identifying game-changing new treatments for patients with stroma-rich cancers, reaching many more patients more comprehensively.

Furthermore, this data increasingly needs to be analyzed faster than ever to enable ‘on the spot’ precision medicine in situations where speed matters. Sequencing throughput is remarkably high and the industry keeps pushing its limits, yet data is still being generated faster than it can be analyzed. Precision medicine – for both life sciences and healthcare – needs better solutions, as current database approaches are too limiting. There are several reasons why:

  • Current database solutions are often bespoke systems that handle only one data type. Databases designed for e-commerce or general data science are a poor fit for the cloud-based sparse data and varied, real-time retrieval patterns characteristic of genomics. Furthermore, these databases often don’t address other omics modalities, which ultimately limits their long-term utility. Data becomes inherently richer and more insightful when it can be juxtaposed and analyzed together in one place – for example, housing genomic data, electronic health records (EHR), and imaging data side by side. This helps answer questions such as: What is the phenotype of the affected individuals? What clusters of physical symptoms are typically associated with a deleterious variant? There needs to be a more efficient way to bring all this information together in a common, shareable and efficient data format that can be analyzed in aggregate.
  • Data volumes are so huge that current databases can’t handle them with the nimbleness precision medicine demands. VCF files are monolithic, and downloading and accessing them all can take far too long. Furthermore, adding a new sample can introduce new variants, forcing every existing sample to be re-interrogated (this is known as the N+1 problem). Most databases lack the processing capacity that genomics demands, which only exacerbates the slowness. As a result, data scientists end up spinning up compute cycles instead of focusing where they should be: on data analysis.
  • Maintaining multiple separate databases is a fragmented approach that presents serious governance challenges in an environment as hardened and security-focused as healthcare and life sciences. Such an approach is not ‘enterprise quality’: it neither meets hospitals’ stringent governance mandates nor facilitates seamless access across distributed teams. Governance becomes far easier when there is one common, shareable database.
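The N+1 problem described above can be sketched with a toy example (hypothetical data layouts for illustration only, not TileDB’s actual storage format or any real VCF toolchain): in a dense sample-by-variant layout, a new sample that introduces a new variant site forces every existing sample’s record to be revisited, while a sparse layout that stores only observed cells makes the same addition a pure append.

```python
# Toy sketch of the N+1 problem with variant data (hypothetical layouts,
# not TileDB's actual storage format or a real VCF toolchain).

# Dense layout: every sample must hold a value for every known variant site.
dense = {
    "sampleA": {"chr1:100": "A>T", "chr1:250": "G>C"},
    "sampleB": {"chr1:100": "A>T"},
}

def add_sample_dense(matrix, name, calls):
    """Add a sample; if it introduces new sites, every sample's row must
    be revisited to backfill a no-call ("./.") at the missing sites."""
    matrix[name] = dict(calls)
    all_sites = set().union(*matrix.values())
    rewrites = 0
    for row in matrix.values():
        for site in all_sites:
            if site not in row:
                row[site] = "./."  # the expensive part: touching old records
                rewrites += 1
    return rewrites

# Sparse layout: store only observed (sample, site, genotype) cells,
# so adding a sample appends new cells and leaves old samples untouched.
sparse = [("sampleA", "chr1:100", "A>T"), ("sampleA", "chr1:250", "G>C"),
          ("sampleB", "chr1:100", "A>T")]

def add_sample_sparse(cells, name, calls):
    cells.extend((name, site, gt) for site, gt in calls.items())
    return len(calls)  # cost is proportional to the new data only

print(add_sample_dense(dense, "sampleC", {"chr1:500": "T>A"}))    # 5 backfills
print(add_sample_sparse(sparse, "sampleC", {"chr1:500": "T>A"}))  # 1 append
```

The dense update touches rows that have nothing to do with the new sample, which is why it grows with cohort size; the sparse append does not, which is the property that makes sparse array storage attractive for ever-growing variant collections.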

The future of precision medicine – where all types of data are housed and analyzed in aggregate, at scale, and safely shared among a growing number of constituents to diagnose and cure diseases through faster, more comprehensive insights – requires a completely new type of database. Ideally, it will be one that handles multiple data modalities and is fast, unified, and universally shareable.

About Jeremy Leipzig

Jeremy Leipzig, PhD, is the Product Manager of Population Genomics at TileDB, a modern database that integrates all data modalities, code and compute in a single product. Leipzig has significant experience as a bioinformatics software engineer and product manager across academia, industry, and diagnostic, therapeutic and platform startups.