CSE Colloquium Series
Making Sense of Genome-Scale Data
James Taylor
Courant Institute of Mathematical Sciences
New York University
Monday, April 14, 2008
9:45 a.m.-10:45 a.m.
3105 Engineering
Host: Charles Ofria
Abstract
High-throughput data production technologies are revolutionizing modern biology. Translating this experimental data into discoveries of relevance to human health relies on sophisticated computational tools that can handle large-scale data.
One area of rapid, ongoing data generation is whole-genome sequencing. Comparisons between sequenced genomes can be a powerful tool for understanding functional genomic regions, because they go beyond the primary sequence to capture patterns in how functional regions evolve. Using data generated by the ENCODE project, we will demonstrate the power of genome comparisons to distinguish cis-regulatory elements (critical for the control of gene expression). We will then describe a machine learning approach that goes beyond sequence conservation to capture broader and more informative sequence and evolutionary patterns that better distinguish different classes of elements. This approach has proven successful for a variety of classification problems. In particular, the "Regulatory Potential Score" has been used to identify putative regulatory elements with high rates of experimental validation.
Sophisticated methods for the analysis of biological data are of little value if they are not accessible. Powerful analysis tools, data warehouses, and browsers exist, but for the average experimental biologist with limited computer expertise, making effective use of these resources is still out of reach. We have developed "Galaxy", which solves this problem by providing an integrated web-based workspace that bridges the gap between different tools and data sources. For computational tool developers, Galaxy eliminates the repetitive effort involved in creating high-quality user interfaces, while letting them offer their tools in an integrated environment. For experimental biologists, Galaxy allows running complex analyses on huge datasets with nothing more than a web browser, without needing to worry about details such as resource allocation and format conversions. Galaxy makes high-end computational biology more accessible, efficient, and reproducible.