Integrating Big Data with DOCKET

The “first mile” problem of translational research is how to integrate the multitude of dynamic, small-to-large data sets produced by the research and clinical communities, which reside in different locations, are processed in different ways, and come in formats that may not be mutually interoperable. Integrating these data sets requires significant manual work: downloading, reformatting, parsing, indexing, and analyzing each one in turn. The technical and ethical challenges of accessing diverse collections of big data, efficiently selecting information relevant to different users’ interests, and extracting the underlying knowledge remain unsolved. This project aims to address these challenges by leveraging lessons distilled from our previous and ongoing big data analysis projects to develop a highly automated tool that removes these bottlenecks, enabling researchers to analyze and integrate many valuable data sets with ease and efficiency while making the data Findable, Accessible, Interoperable, and Reusable (FAIR).

We are analyzing and extracting knowledge from rich real-world biomedical data sets in the domains of wellness, cancer, and large-scale clinical records. We are formalizing these methods to develop the Dataset Overview, Comparison and Knowledge Extraction Tool (DOCKET), a novel tool for onboarding and integrating data from multiple domains. We are also working with other teams to adapt DOCKET to additional knowledge domains.

Example questions DOCKET will allow us to address include:

Wellness: Which clinical analytes, metabolites, proteins, microbiome taxa, etc. are significantly correlated, and which changing analytes predict transition to which disease?

Cancer: Which gene mutations in any of X pathways are associated with sensitivity or resistance to any of Y drugs, in cell lines from Z tumor types?

All data sets: Which data set entities are similar to this one? Are there significant clusters, and what distinguishes them? What significant correlations of attributes can be observed? How can this set of entities be expanded by adding similar ones? How do these N versions of this data set differ, and how stable is each knowledge edge as the data set changes over time? (A minimal sketch of these generic analyses appears below.)
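To make the “all data sets” questions concrete, the following is a minimal sketch, not part of DOCKET itself, of how attribute correlation, entity clustering, and nearest-neighbor queries might be answered for a generic entities-by-attributes table. The input table, entity and attribute names, and thresholds are all hypothetical, and standard Python libraries (pandas, SciPy) stand in for whatever machinery DOCKET actually uses.

```python
# Sketch only: generic analyses over an entities-x-attributes table.
# In practice the table would come from a DOCKET-ingested data set;
# here it is filled with random numbers purely for illustration.
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform

# Hypothetical input: rows = entities (e.g., samples), columns = numeric attributes.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(100, 6)),
                    index=[f"entity_{i}" for i in range(100)],
                    columns=[f"attr_{j}" for j in range(6)])

# Which attributes are significantly correlated? (threshold chosen arbitrarily)
corr = data.corr(method="spearman")
strong_pairs = [(a, b, corr.loc[a, b])
                for i, a in enumerate(corr.columns)
                for b in corr.columns[i + 1:]
                if abs(corr.loc[a, b]) > 0.5]

# Are there clusters of entities, and what distinguishes them?
dist = pdist(data.values, metric="euclidean")
clusters = fcluster(linkage(dist, method="ward"), t=3, criterion="maxclust")
cluster_profiles = data.groupby(clusters).mean()  # per-cluster attribute means

# Which entities are most similar to a given one?
sim = 1.0 / (1.0 + squareform(dist))  # simple similarity derived from distances
neighbors = (pd.Series(sim[0], index=data.index)
             .drop("entity_0")
             .nlargest(5))

print(strong_pairs)
print(cluster_profiles)
print(neighbors)
```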

Current Project Leads:

Gustavo Glusman
Ilya Shmulevich
Jennifer Hadlock