Streaming lossy compression of biological sequence data using probabilistic data structures
Dr. Titus Brown
11am Friday, September 7th, 2012
EB3105
Abstract:
In recent years, next-generation DNA sequencing capacity has far outstripped our ability to computationally digest the resulting volume of data. Driven by the need to actually analyze these data, our lab has developed a suite of novel data structures and algorithms for graph compression and data reduction. In addition to being efficient in their own right, our approaches use probabilistic data structures that require substantially less memory than the best possible exact approach. Using these approaches we have been able to scale de novo assembly down to cloud computing infrastructure, and we have also completed some of the largest de novo metagenome assemblies to date. Last but not least, these approaches point the way toward essentially unbounded de novo assembly of environmental microbial sequence data.
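For context, the core idea behind the probabilistic data structures mentioned above can be illustrated with a Bloom-filter-style k-mer membership sketch: memory is fixed up front, and queries may return occasional false positives but never false negatives. The Python sketch below is purely illustrative; the class name, table size, and hash scheme are assumptions made for this announcement, not the lab's actual implementation.

    import hashlib

    class KmerBloomFilter:
        """Minimal Bloom-filter-style sketch for k-mer presence queries.

        Memory use is fixed by the table size, independent of how many
        distinct k-mers are stored; the trade-off is a tunable
        false-positive rate.
        """

        def __init__(self, num_bits=1_000_000, num_hashes=4):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(num_bits // 8 + 1)

        def _positions(self, kmer):
            # Derive several table positions from independent hashes.
            for i in range(self.num_hashes):
                h = hashlib.sha256(f"{i}:{kmer}".encode()).digest()
                yield int.from_bytes(h[:8], "big") % self.num_bits

        def add(self, kmer):
            for pos in self._positions(kmer):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, kmer):
            # May report false positives, never false negatives.
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(kmer))

    def kmers(seq, k=20):
        """Yield all overlapping k-mers of a sequence."""
        for i in range(len(seq) - k + 1):
            yield seq[i:i + k]

    if __name__ == "__main__":
        bf = KmerBloomFilter()
        read = "ACGTACGTACGTACGTACGTACGT"
        for km in kmers(read, k=20):
            bf.add(km)
        print("ACGTACGTACGTACGTACGT" in bf)   # True: this k-mer was stored
        print("TTTTTTTTTTTTTTTTTTTT" in bf)   # almost certainly False

Because the bit table never grows, the false-positive rate rises gracefully as more k-mers are added, which is the memory/accuracy trade-off the talk explores at much larger scale.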
Biography:
Titus Brown received his BA in Math from Reed College in 1997 and his PhD in Developmental Biology from Caltech in 2006. He has worked in digital evolution, climate measurement, molecular and evolutionary developmental biology, and both regulatory genomics and transcriptomics. His current focus is on using novel computer science data structures and algorithms to explore large sequencing data sets from metagenomics and transcriptomics.