Genotype, Genome and Data Fingerprints
Soon, millions of individual human genomes with rich phenotype data will be available for analysis, posing a data management challenge and offering significant discovery opportunities. Rich genomic and phenomic knowledge will help improve our understanding of genome structure, function and evolution, and will translate into actionable opportunities for improving health and wellness.
We have developed several algorithms and methods for studying and visualizing personal genome data in family, cohort and population context. In particular, our ‘genome fingerprinting’ method enables ultrafast and private genome comparisons in very large cohorts, and our ‘data fingerprinting’ method offers a fast, semantically and structurally agnostic way to analyze electronic health records (e.g., in FHIR format). Our locality-sensitive hashing strategies summarize complex data into highly compressed representations from which the details of the original data cannot be recreated, yet which simplify and greatly accelerate the comparison and clustering of data records by preserving similarity relationships. Applications include detection of duplicates, clustering and classification, which support higher-level goals including summarizing large and complex data sets, analyzing cohort structure, quality assessment, evaluating methods for generating simulated patient data, and data mining.
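To illustrate the general idea of a lossy, similarity-preserving summary, here is a minimal sketch based on signed feature hashing. It is not the published genome or data fingerprinting algorithm: the function names, the variant string format and the default vector length of 64 are all assumptions made for this illustration.

```python
import hashlib
import math


def fingerprint(features, length=64):
    """Fold string features into a fixed-length vector of signed counts.

    Hypothetical helper: each feature is hashed to one bin and one sign,
    so many features share a bin and the input cannot be read back.
    """
    vec = [0.0] * length
    for feature in features:
        digest = hashlib.sha256(feature.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % length  # which bin
        sign = 1.0 if digest[4] & 1 else -1.0               # +1 or -1
        vec[index] += sign
    return vec


def similarity(a, b):
    """Cosine similarity of two fingerprints (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Illustrative input only: made-up variant identifiers, not real data.
genome_a = [f"chr1:{pos}:A>G" for pos in range(1000, 1200)]
genome_b = genome_a[:180] + [f"chr2:{pos}:C>T" for pos in range(20)]
print(similarity(fingerprint(genome_a), fingerprint(genome_b)))
```

In a sketch like this, comparing two records costs a fixed number of operations regardless of how many features each record had, and records that share most of their features tend to receive a high similarity score. Only the fixed-length vectors need to be exchanged for a comparison, which is the property that makes such summaries attractive for private comparisons.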
In one example, we used genome fingerprints to evaluate multiple versions of the 1000 Genomes data set, which were mapped to different genome reference versions and processed in various ways. Our analyses revealed discrepancies of several different types, including individuals added or lost, unannotated family relationships, significant changes in SNV counts, decreased heterozygosity, and poor genotype concordance between versions of the same genome. We also found two sex-discordant genomes with nearly identical autosomes. Our observations suggest that the quality of processes as complex as sequence mapping and variant calling, applied to a large number of samples, cannot be extrapolated from benchmarking of a single sample against a gold standard. When detailed comparison cannot be applied to all samples, a rapid, approximate evaluation of the kind provided by genome fingerprinting can identify additional, unexpected quality issues, and support the goal of providing the community with resources that meet a high quality standard.
Beyond genomes and electronic health records, our approach is applicable to any domain in which semi-structured data (e.g., in JSON or XML formats) are commonly used. It also offers a convenient way to analyze structured but sparse data, such as single-cell RNA-seq data.
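As a rough illustration of what a structurally agnostic treatment of semi-structured data can look like, the sketch below flattens a parsed JSON record into path/value features and compares two records by the overlap of those features. This is an assumed, simplified stand-in rather than the data fingerprinting method itself; `flatten`, `jaccard` and the example records are hypothetical.

```python
import json


def flatten(node, path=""):
    """Yield one 'path=value' string per leaf of a parsed JSON document."""
    if isinstance(node, dict):
        for key in sorted(node):
            yield from flatten(node[key], f"{path}/{key}")
    elif isinstance(node, list):
        for item in node:
            # List positions are dropped, so reordering a list
            # does not change the resulting feature set.
            yield from flatten(item, f"{path}[]")
    else:
        yield f"{path}={json.dumps(node)}"


def jaccard(a, b):
    """Jaccard similarity of two feature collections (1.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Made-up records for illustration; no schema knowledge is required.
record_1 = json.loads('{"id": "p1", "codes": ["A", "B"], "vitals": {"hr": 72}}')
record_2 = json.loads('{"id": "p2", "codes": ["B", "A"], "vitals": {"hr": 72}}')
print(sorted(flatten(record_1)))
print(jaccard(flatten(record_1), flatten(record_2)))
```

Because the features are derived only from the shape and content of each record, the same code applies unchanged to records from different sources. The resulting feature strings could in turn be fed to any hashing scheme that produces fixed-length summaries.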
More detailed information and resources related to these projects can be found at:
Recent publications describing this work include:
Quality control of large genome datasets using genome fingerprints. Max Robinson, Gustavo Glusman. bioRxiv 600254; doi: https://doi.org/10.1101/600254
Fast and simple comparison of semi-structured data, with emphasis on electronic health records. Max Robinson, Jennifer Hadlock, Jiyang Yu, Alireza Khatamian, Aleksandr Y. Aravkin, Eric W. Deutsch, Nathan D. Price, Sui Huang, Gustavo Glusman. bioRxiv 293183; doi: https://doi.org/10.1101/293183
Genotype fingerprints enable fast and private comparison of genetic testing results for research and direct-to-consumer applications. Max Robinson, Gustavo Glusman. bioRxiv 208025; doi: https://doi.org/10.1101/208025
Ultrafast Comparison of Personal Genomes via Precomputed Genome Fingerprints. Glusman G, Mauldin DE, Hood LE, Robinson M. Front Genet. 2017 Sep 26;8:136. doi: 10.3389/fgene.2017.00136. eCollection 2017. PMID: 29018478
Key Project Personnel:
Max Robinson, Gustavo Glusman