GTRAC FAST R ETRIEVAL FROM C OMPRESSED C OLLECTIONS OF G ENOMIC VARIANTS. Kedar Tatwawadi Mikel Hernaez Idoia Ochoa Tsachy Weissman

Similar documents
A Tightly-coupled XML Encoderdecoder For 3D Data Transaction

On the Fixed Parameter Tractability and Approximability of the Minimum Error Correction problem

Variant visualisation and quality control

Dictionary Matching in Elastic-Degenerate Texts with Applications in Searching VCF Files On-line

On Compressing and Indexing Repetitive Sequences

Complementary Contextual Models with FM-index for DNA Compression

Indexing LZ77: The Next Step in Self-Indexing. Gonzalo Navarro Department of Computer Science, University of Chile

Nature Genetics: doi: /ng Supplementary Figure 1. Imputation server overview.

De novo assembly and genotyping of variants using colored de Bruijn graphs

arxiv: v2 [cs.ds] 29 Jan 2014

Alphabet Friendly FM Index

Using a CUDA-Accelerated PGAS Model on a GPU Cluster for Bioinformatics

On-line String Matching in Highly Similar DNA Sequences

Introduction to PLINK H3ABionet Course Covenant University, Nigeria

Succinct 2D Dictionary Matching with No Slowdown

A Faster Grammar-Based Self-Index

Fast Hash-Based Algorithms for Analyzing Tens of Thousands of Evolutionary Trees

CS4800: Algorithms & Data Jonathan Ullman

GBS Bioinformatics Pipeline(s) Overview

GBS Bioinformatics Pipeline(s) Overview

Explore SNP polymorphism data. A. Dereeper, Y. Hueber

Centrifuge: rapid and sensitive classification of metagenomic sequences

Practical Indexing of Repetitive Collections using Relative Lempel-Ziv

Lecture 6 - Raster Data Model & GIS File Organization

Self-Indexed Grammar-Based Compression

An Optimized Interestingness Hotspot Discovery Framework for Large Gridded Spatio-temporal Datasets

Small-Space Dictionary Matching (Dissertation Proposal)

Predictive Genome Analysis Using Partial DNA Sequencing Data

International Journal of Scientific & Engineering Research, Volume 6, Issue 2, February ISSN

Linear-Space Alignment

Rank and Select Operations on Binary Strings (1974; Elias)

Self-Indexed Grammar-Based Compression

Smaller and Faster Lempel-Ziv Indices

Bioinformatics and BLAST

SIGNAL COMPRESSION Lecture 7. Variable to Fix Encoding

Lecture 18 April 26, 2012

Applied Cartography and Introduction to GIS GEOG 2017 EL. Lecture-2 Chapters 3 and 4

Data Compression Techniques

The File Geodatabase API. Craig Gillgrass Lance Shipman

arxiv: v1 [cs.ds] 19 Apr 2011

Sequence comparison by compression

/home/thierry/columbia/msongsdb/tutorials/tutorial4/tutorial4.py January 25,

Data Compression Techniques

Bayesian Clustering of Multi-Omics

Technical Report TR-INF UNIPMN. A Simple and Fast DNA Compressor

QuickScorer: a fast algorithm to rank documents with additive ensembles of regression trees

Succincter text indexing with wildcards

Calculation of IBD probabilities

TASM: Top-k Approximate Subtree Matching

A Fully Compressed Pattern Matching Algorithm for Simple Collage Systems

Learning Your Identity and Disease from Research Papers: Information Leaks in Genome-Wide Association Study

Read Mapping. Burrows Wheeler Transform and Reference Based Assembly. Genomics: Lecture #5 WS 2014/2015

Wavelets for Efficient Querying of Large Multidimensional Datasets

Part A. P (w 1 )P (w 2 w 1 )P (w 3 w 1 w 2 ) P (w M w 1 w 2 w M 1 ) P (w 1 )P (w 2 w 1 )P (w 3 w 2 ) P (w M w M 1 )

Nearest Neighbor Search with Keywords: Compression

Die Nadel im Heuhaufen

Reduce the amount of data required to represent a given quantity of information Data vs information R = 1 1 C

Browsing Genomic Information with Ensembl Plants

Probabilistic Hashing for Efficient Search and Learning

Department of Forensic Psychiatry, School of Medicine & Forensics, Xi'an Jiaotong University, Xi'an, China;

Geodatabase An Introduction

Hybrid CPU/GPU Acceleration of Detection of 2-SNP Epistatic Interactions in GWAS

3F1 Information Theory, Lecture 3

Araport, a community portal for Arabidopsis. Data integration, sharing and reuse. sergio contrino University of Cambridge

Genotyping By Sequencing (GBS) Method Overview

Fixed-Length-Parsing Universal Compression with Side Information

Regression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning)

Regression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning)

CS6220: DATA MINING TECHNIQUES

A fast algorithm for the Kolakoski sequence

Simple Compression Code Supporting Random Access and Fast String Matching

COMP9319 Web Data Compression and Search. Lecture 2: Adaptive Huffman, BWT

Succinct Data Structures for the Range Minimum Query Problems

A Repetitive Corpus Testbed

LEA: An R package for Landscape and Ecological Association Studies. Olivier Francois Ecole GENOMENV AgroParisTech, Paris, 2016

Course overview. Heikki Mannila Laboratory for Computer and Information Science Helsinki University of Technology

Motivation for Arithmetic Coding

Overview of IslandPick pipeline and the generation of GI datasets

Algorithmic methods of data mining, Autumn 2007, Course overview0-0. Course overview

Package genphen. March 18, 2019

Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts

LZ77-like Compression with Fast Random Access

A General-Purpose Counting Filter: Making Every Bit Count. Prashant Pandey, Michael A. Bender, Rob Johnson, Rob Patro Stony Brook University, NY

arxiv:cs.ds/ v1 13 Mar 2002

At the end of the exam you must copy that folder onto a USB stick. Also, you must your files to the instructor.

A New Space for Comparing Graphs

' Characteristics of DSS Queries: read-mostly, complex, adhoc, Introduction Tremendous growth in Decision Support Systems (DSS). with large foundsets

SIGNAL COMPRESSION Lecture Shannon-Fano-Elias Codes and Arithmetic Coding

The Hash Function JH 1

COMPRESSED INDEXING DATA STRUCTURES FOR BIOLOGICAL SEQUENCES

Number Representation and Waveform Quantization

Cloud Performance Benchmark Series

Bioinformatics I, WS 16/17, D. Huson, January 30,

Image and Multidimensional Signal Processing

Computing Techniques for Parallel and Distributed Systems with an Application to Data Compression. Sergio De Agostino Sapienza University di Rome

Approximate Query Processing Using Wavelets

Detecting Anomalous and Exceptional Behaviour on Credit Data by means of Association Rules. M. Delgado, M.D. Ruiz, M.J. Martin-Bautista, D.

A Browser for Pig Genome Data

OECD QSAR Toolbox v.4.1. Tutorial illustrating new options for grouping with metabolism

Data preprocessing. DataBase and Data Mining Group 1. Data set types. Tabular Data. Document Data. Transaction Data. Ordered Data

Transcription:

GTRAC FAST R ETRIEVAL FROM C OMPRESSED C OLLECTIONS OF G ENOMIC VARIANTS Kedar Tatwawadi Mikel Hernaez Idoia Ochoa Tsachy Weissman

Overview Introduction Results Algorithm Details Summary & Further Work

Introduction GTRAC ( G ENOT YPE R ANDOM A CCESS C OMPRESSOR)

Introduction The dramatic increase in the generation of sequenced data, and thus need of a compressed representation Sequenced data is processed and stored in the form of a variants, which constitutes the variant dataset. As Variant datasets (VCF format) are constantly queried, we need a way to efficiently access data GTRAC aims at achieving a compressed representation for the variant dataset, while supporting fast access to parts of the dataset.

Sample VCF File #CHROM POS REF ALT SAMP1 SAMP2 SAMP3 20 14370 G A 1 0 1 1 0 0 20 17340 T TA 0 1 1 0 1 1 20 112500 A. 0 1 0 1 0 0 20 112800 A G 0 1 0 1 0 1 20 201215 T. 0 1 1 0 1 1 20 251513 T G 1 1 0 0 0 0

Per-Variant Random Access #CHROM POS REF ALT SAMP1 SAMP2 SAMP3 20 14370 G A 1 0 1 1 0 0 20 17340 T A 0 1 1 0 1 1 20 112500 A. 0 1 0 1 0 0 20 112800 A G 0 1 0 1 0 1 20 201215 T. 0 1 1 0 1 1 20 251513 T G 1 1 0 0 0 0

Per-Genome Random Access #CHROM POS REF ALT SAMP1 SAMP2 SAMP3 20 14370 G A 1 0 1 1 0 0 20 17340 T A 0 1 1 0 1 1 20 112500 A. 0 1 0 1 0 0 20 112800 A G 0 1 0 1 0 1 20 201215 T. 0 1 1 0 1 1 20 251513 T G 1 1 0 0 0 0

Single Genotype extraction #CHROM POS REF ALT SAMP1 SAMP2 SAMP3 20 14370 G A 1 0 1 1 0 0 20 17340 T A 0 1 1 0 1 1 20 112500 A. 0 1 0 1 0 0 20 112800 A G 0 1 0 1 0 1 20 201215 T. 0 1 1 0 1 1 20 251513 T G 1 1 0 0 0 0

GTRAC workflow Per-Genome Random Access VCF File GTRAC Compressed VCF Per-Variant Random Access

Results H.SAPIENS D ATASETS

H.Sapiens Results Experiments with the 1000 Genome Project [1] dataset containing 1092 genome samples (2184 haplotypes) For comparison, we also show results with: - 7zip : 7zip generic compression algorithm - TGC : Thousand Genome Compressor [2] We compare: - Compression - Per-Variant Random Access - Per-Genome Random Access [1] http://www.1000genomes.org/: A global reference for human genetic variantion [2] Genome compression: A novel approach for large collections 2013, Deorowicz et.al

H.Sapiens Results Input dataset file size (VCF) : 170 GB TGC 7zip GTRAC Compressed dataset size 422 MB 706 MB 1.1 GB Per Variant decompression time 7 sec 9 sec 17 ms Per Variant decompression memory 1 GB 1 GB 90 MB Per Genome decompression time 7 sec 9 sec 1 sec Single Variant decompression time 7 sec 9 sec 3 ms

H.Sapiens Results Input dataset file size (VCF) : 170 GB TGC 7zip GTRAC Compressed dataset size 422 MB 706 MB 1.1 GB Per Variant decompression time 7 sec 9 sec 17 ms Per Variant decompression memory 1 GB 1 GB 90 MB Per Genome decompression time 7 sec 9 sec 1 sec Single Variant decompression time 7 sec 9 sec 3 ms

H.Sapiens Results Input dataset file size (VCF) : 170 GB TGC 7zip GTRAC Compressed dataset size 422 MB 706 MB 1.1 GB Per Variant decompression time 7 sec 9 sec 17 ms Per Variant decompression memory 1 GB 1 GB 90 MB Per Genome decompression time 7 sec 9 sec 1 sec Single Variant decompression time 7 sec 9 sec 3 ms

H.Sapiens Results Input dataset file size (VCF) : 170 GB TGC 7zip GTRAC Compressed dataset size 422 MB 706 MB 1.1 GB Per Variant decompression time 7 sec 9 sec 17 ms Per Variant decompression memory 1 GB 1 GB 90 MB Per Genome decompression time 7 sec 9 sec 1 sec Single Genotype extraction time 7 sec 9 sec 3 ms

H.Sapiens Results Input dataset file size (VCF) : 170 GB TGC 7zip GTRAC Compressed dataset size 422 MB 706 MB 1.1 GB Per Variant decompression time 7 sec 9 sec 17 ms Per Variant decompression memory 1 GB 1 GB 90 MB Per Genome decompression time 7 sec 9 sec 1 sec Single Variant decompression time 7 sec 9 sec 3 ms

Implementation Paper & Implementation available at:https://github.com/kedartatwawadi/gtrac

Algorithm Details C OMPRESSION & R ANDOM A CCESS

Preprocessing. Variant Dataset (VCF File) mn 2 6 4 Binary Matrix 0 1 ::: 1 ::: 0 1 0 1 ::: 0 ::: 1 1....... 1 1 ::: 0 ::: 1 0....... 0 1 ::: 1 ::: 0 0 0 1 ::: 1 ::: 0 1 V 3 7 5 Variant Dic5onary... 5 1385 DEL 1 6 1348 SNP C 7 3072 SNP T 8 4602 INS TGT 9 4790 INS A... 1000 77646 SNP A 1001 77814 SNP G...

Parsing. Consider the genotype binary matrix H, with V = 20, N = 3, and m = 1 H = 0 0 1 1 0 1 1 0 1 0 0 0 1 1 0 1 0 0 1 0 0 0 1 1 0 0 1 1 1 0 0 0 0 1 0 1 0 0 1 1 0 0 1 1 0 0 1 1 1 0 0 1 1 1 0 1 0 0 0 1 K = 2 H K = 1 2 3 4 5 6 7 8 9 10 0 3 1 2 2 0 3 1 0 2 0 3 0 3 2 0 1 1 0 3 0 3 0 3 2 1 3 1 0 1 1 2 3

Parsing Every row represented in terms of substrings known as phrases Each phrase represented as [match, new symbol] Restriction on phrases: - Positional aligned - match-ending coincides with a phrase-ending. H K = 1 2 3 4 5 6 7 8 9 10 0 3 1 2 2 0 3 1 0 2 0 3 0 3 2 0 1 1 0 3 0 3 0 3 2 1 3 1 0 1 1 2 3

Parsing Every row represented in terms of substrings known as phrases Each phrase represented as [match, new symbol] Restriction on phrases: - Positional aligned - match-ending coincides with a phrase-ending in the source row. H K = 1 2 3 4 5 6 7 8 9 10 0 3 1 2 2 0 3 1 0 2 0 3 0 3 2 0 1 1 0 3 0 3 0 3 2 1 3 1 0 1 1 2 3 match new symbol

Parsing Every row represented in terms of substrings known as phrases Each phrase represented as [match, new symbol] Restriction on phrases: - Positional aligned - match-ending coincides with a phrase-ending in the source row. H K = 1 2 3 4 5 6 7 8 9 10 0 3 1 2 2 0 3 1 0 2 0 3 0 3 2 0 1 1 0 3 0 3 0 3 2 1 3 1 0 1 1 2 3 match new symbol

Phrase Parameters We represent the phrase with parameters: (s,c,e) - source row-id (s) - phrase ending position (e) - new symbol added at the end (C). H K = 1 2 3 4 5 6 7 8 9 10 0 3 1 2 2 0 3 1 0 2 0 3 0 3 2 0 1 1 0 3 0 3 0 3 2 1 3 1 0 1 1 2 3 (s,c,e) = (2,2,5)

Encoding & Random Access. Phrase parameters are encoded using - succinct bitvectors - variable-byte encoding More details in the paper!

Row-wise (per-genome) extraction Recursively extract the last symbol of a phrase Linear complexity of extraction (time & memory): O(L) Supports Parallel Extraction. H K = 1 2 3 4 5 6 7 8 9 10 0 3 1 2 2 0 3 1 0 2 0 3 0 3 2 0 1 1 0 3 0 3 0 3 2 1 3 1 0 1 1 2 3 Recursively extracted

Row-wise (per-genome) extraction Recursively extract the last symbol of a phrase Linear complexity of extraction (time & memory): O(L) Supports Parallel Extraction. H K = 1 2 3 4 5 6 7 8 9 10 0 3 1 2 2 0 3 1 0 2 0 3 0 3 2 0 1 1 0 3 0 3 0 3 2 1 3 1 0 1 1 2 3 Recursively extracted

Row-wise (per-genome) extraction Recursively extract the last symbol of a phrase Linear complexity of extraction (time & memory): O(L) Supports Parallel Extraction. H K = 1 2 3 4 5 6 7 8 9 10 0 3 1 2 2 0 3 1 0 2 0 3 0 3 2 0 1 1 0 3 0 3 0 3 2 1 3 1 0 1 1 2 3 Recursively extracted

Row-wise (per-genome) extraction Recursively extract the last symbol of a phrase Linear complexity of extraction (time & memory): O(L) Supports Parallel Extraction. H K = 1 2 3 4 5 6 7 8 9 10 0 3 1 2 2 0 3 1 0 2 0 3 0 3 2 0 1 1 0 3 0 3 0 3 2 1 3 1 0 1 1 2 3 Recursively extracted

Row-wise (per-genome) extraction Recursively extract the last symbol of a phrase Linear complexity of extraction (time & memory): O(L) Supports Parallel Extraction. H K = 1 2 3 4 5 6 7 8 9 10 0 3 1 2 2 0 3 1 0 2 0 3 0 3 2 0 1 1 0 3 0 3 0 3 2 1 3 1 0 1 1 2 3 Recursively extracted

Row-wise (per-genome) extraction Recursively extract the last symbol of a phrase Linear complexity of extraction (time & memory): O(L) Supports Parallel Extraction. H K = 1 2 3 4 5 6 7 8 9 10 0 3 1 2 2 0 3 1 0 2 0 3 0 3 2 0 1 1 0 3 0 3 0 3 2 1 3 1 0 1 1 2 3 Recursively extracted

Row-wise (per-genome) extraction Recursively extract the last symbol of a phrase Linear complexity of extraction (time & memory): O(L) Supports Parallel Extraction. H K = 1 2 3 4 5 6 7 8 9 10 0 3 1 2 2 0 3 1 0 2 0 3 0 3 2 0 1 1 0 3 0 3 0 3 2 1 3 1 0 1 1 2 3 Recursively extracted

Summary & Future Work

Summary Presented GTRAC, a variant dataset compressor which provides efficient random access (~100x compression, ~millisec decompression) GTRAC supports parallel decompression, and thus can be made faster by using higher number of cores. Further Work Supporting dynamic addition and removal of genomes/variants. One step closer to a compressed dynamic variant database Supporting fast computations Adapting GTRAC algorithm for other genomic datasets

Idoia Ochoa Mikel Hernaez Thank You! Q UESTIONS? Tsachy Weissman

Implementation Paper & Implementation available at:https://github.com/kedartatwawadi/gtrac