CSE Colloquium Series
Designing Filtering Strategies for Faster Protein and RNA Annotation
Yanni Sun
Thursday, April 17, 2008
9:45 a.m.-10:45 a.m.
3105 Engineering
Host: Rong Jin
Abstract
With the availability of sequenced genomes for multiple species, an urgent
task today is to decipher the biological functions of these sequences.
Annotating genomic sequence function helps us understand the genetic background
of complex diseases and thus aids drug design. The state-of-the-art method for
function annotation is to compare a query sequence against a database of
sequences with known functions. However, the high computational cost of
comparison algorithms and the sheer volume of
genomic data pose a great challenge for genome function analysis. For example,
it takes several CPU months to compare a bacterial genome with a database of
noncoding RNA sequence families.
In this talk, I will present systematic filter design methods for accelerating
protein and noncoding RNA function annotation. A filter excludes the large
portion of the database that is unlikely to be related to the query, so that
expensive comparisons are conducted only on regions likely to share functional
similarity. The computational challenge lies in designing filters with an
optimal tradeoff between sensitivity and specificity from a large design
space. I will first present our filters based on regular expression patterns
and weight matrices for protein annotation. Then, I will focus on designing
secondary structure profiles to accelerate noncoding RNA annotation. Our
experiments demonstrate that, with our designed filters, a protein sequence
annotation program based on a profile hidden Markov model can achieve a 20- to
35-fold speedup, and a noncoding RNA annotation program based on a stochastic
context-free grammar can achieve a speedup of more than 100-fold on average. I
will conclude with an overview of my research interests and plans for future
work.
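
To make the filtering idea concrete, here is a minimal sketch in Python of a
regular-expression filter and of how its sensitivity and specificity could be
measured. It is an illustration only, not the speaker's method: the toy
database, the pattern C..HC, and the function name evaluate_filter are all
hypothetical, and a real pipeline would pass the surviving sequences to a
profile HMM or SCFG comparison.

```python
import re

# Hypothetical toy database: (name, sequence, is_true_family_member).
# The membership labels are invented for illustration.
DATABASE = [
    ("seq1", "MKTAYCAAHCGGKLV", True),
    ("seq2", "MKTWWPLLNNEEQRS", False),
    ("seq3", "GGCSTHCLLPQRSTV", True),
    ("seq4", "AAAAAGGGGGLLLLL", False),
    ("seq5", "MCPPHCAAAAAAAAA", False),
    ("seq6", "MKLVCGGGGHAAAAC", True),
]

# Hypothetical filter pattern: C, any two residues, H, then C.
FILTER_PATTERN = re.compile(r"C..HC")


def evaluate_filter(pattern, database):
    """Apply the filter and report which sequences pass, plus its
    sensitivity and specificity against the known labels."""
    tp = fp = tn = fn = 0
    passed = []
    for name, seq, is_member in database:
        hit = pattern.search(seq) is not None
        if hit:
            passed.append(name)  # only these would reach the costly comparison
        if hit and is_member:
            tp += 1
        elif hit and not is_member:
            fp += 1
        elif not hit and is_member:
            fn += 1
        else:
            tn += 1
    # Guard against empty classes in very small test sets.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return passed, sensitivity, specificity


passed, sensitivity, specificity = evaluate_filter(FILTER_PATTERN, DATABASE)
fraction_kept = len(passed) / len(DATABASE)
print(passed, sensitivity, specificity, fraction_kept)
```

In this toy run the filter halves the number of expensive comparisons but
misses one true family member and admits one unrelated sequence. A stricter
pattern would remove more of the database and lose more true members, while a
looser one would do the opposite; that is the tradeoff the abstract describes.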