Research:
Graduate studies:
Undergraduate studies:
Lab information:
Photographs
Contact:
Bioinformatics Group
School of
Computer Science
University of Waterloo
200 University Ave W
Waterloo, ON N2L 3G1
Canada
E-mail:
Dan Brown

|
Bioinformatics Group Research Projects
Protein Structure Prediction
In the post-genomic era, new protein sequences are produced much
faster than it is possible to experimentally determine their 3D
structure. Protein structure prediction plays a significant role in
bridging this gap.
It has been estimated that there are only about 1000 different protein
folds in nature. Protein threading involves
searching for the best structure for a given protein sequence among already
known protein structures.
We have developed the fast and memory-efficient
protein threading tool RAPTOR, based on integer programming and support
vector machines. RAPTOR was assessed to be one of the best methods
in the CASP 5 and CASP 6
structure prediction competitions. We are investigating many
additional problems in structure prediction, such as side chain
packing and loop structure, and the special case of structure
prediction for transmembrane proteins. We also are investigating the
combination of multiple predictions from different predictors.
- Jinbo Xu, Ming Li.
Assessment of RAPTORs linear programming approach in CAFASP3.
Proteins: Structure, Function, and Genetics,
2003.
Details
- Jinbo Xu, Ming Li, Guohui Lin, Dongsup Kim, Ying Xu.
Protein Threading by Linear Programming.
In Biocomputing 2003, Proceedings of the Pacific Symposium (PSB),
2003. Details
- Jinbo Xu.
Rapid Protein Side-Chain Packing via Tree Decomposition.
In RECOMB 2005,
2005. Details
- Jinbo Xu, Ming Li, Dongsup Kim, Ying Xu.
RAPTOR: Optimal Protein Threading by Linear programming.
Journal of Bioinformatics and Computational Biology,
2003.
Details
View more papers on: protein folding |
Spaced Seeds for Homology Search
Homology search is the task of finding pairs of similar substrings in
large DNA or protein sequence databases. Such sequence alignments
highlight areas that may be preserved by evolution and help to determine their
biological function. BLAST, a popular heuristic program for this task,
is a household name among biologists.
In 2002, we introduced the technique of spaced seeds for
increasing the accuracy and speed of BLAST-like algorithms. We have
studied mathematical properties of spaced seeds, related algorithmic
problems, and selection of appropriate spaced seeds for particular
tasks. The resultant commercial tool, PatternHunter, was used to compare
the human and mouse genomes by the Mouse Genome Sequencing
Consortium. Most recently, we are combining homology search with gene finding
to find homologs of a known gene in other organisms.
- Bin Ma, John Tromp, Ming Li.
PatternHunter: faster and more sensitive homology search.
Bioinformatics,
2002.
Details
- Mouse Genome Sequencing Consortium.
Initial sequencing and comparative analysis of the mouse genome.
Nature,
2002.
(Daniel Brown and Ming Li are members of MGSC).
Details
- Daniel G. Brown, Ming Li, Bin Ma.
A tutorial of recent developments in the seeding of local alignment.
Journal of Bioinformatics and Computational Biology,
2004.
Details
- Brona Brejova, Daniel G. Brown, Tomas Vinar.
Vector seeds: an extension to spaced seeds.
Journal of Computer and System Sciences,
2005.
Early version appeared in WABI 2003.
Details
- Ming Li, Bin Ma, Zhang Louxin.
Superiority and complexity of the spaced seeds.
In Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithms (SODA 2006),
2006. Details
View more papers on: spaced seeds, homology search |
Sequence Annotation and Hidden Markov Models In sequence annotation, we want to
segment a given DNA or protein sequence into biologically relevant
regions. For example, in gene finding, we want to find the
protein-coding regions of the genome. In transmembrane protein
topology prediction, we want to distinguish transmembrane helixes,
internal, and external loops. Additional sources of biolo gical
information can improve predictions over those derived from sequence
alone. With this observation in mind, we have developed two practical
tools, ExonHunter and AHMM, for these two tasks.
We also study theoretical concepts underlying the use of hidden Markov
models fo r sequence annotation. We have shown that it is important to
consider a tradeoff between prediction accuracy and running time, as a
simple change in the model may change running time from linear to
exponential.
- Brona Brejova, Daniel G. Brown, Tomas Vinar.
The Most Probable Labeling Problem in HMMs and Its Application to Bioinformatics.
In Algorithms in Bioinformatics, 4th International Workshop (WABI),
2004. Details
- Brona Brejova, Daniel G. Brown, Ming Li, Tomas Vinar.
ExonHunter: a comprehensive approach to gene finding.
Bioinformatics,
2005.
ISMB 2005.
Details
- Emily Wei Xu, Paul Kearney, Daniel G. Brown.
The use of functional domains to improve transmembrane protein topology
prediction.
Journal of Bioinformatics and Computational Biology,
2006.
Details
View more papers on: HMMs, gene finding, transmembrane proteins |
Haplotype Inference
In diploid organisms, such as humans, the identification of maternal
and paternal inheritance is important for the mapping of disease
genes. Technological limitations make this problem very difficult. It
is currently very expensive to experimentally determine these parental
haplotypes, and instead the genotype (the conflation of the two
haplotypes) is identified. The process of going from genotypes to
haplotypes is called haplotype inference. A surprising fact is that while most
problems in this area are very hard in theory, real instances of these
problems turn out to be very easy to solve in practice. We have
proved theorems explaining this phenomenon for simple models of
populations, and are currently moving to more complicated models.
By understanding how changes to the population model affect the complexity of the
problem, we hope to develop more efficient and effective techniques
for haplotype inference.
- Daniel G. Brown, Ian M. Harrower.
A New Integer Programming Formulation for the Pure Parsimony Problem in Haplotype Analysis.
In Algorithms in Bioinformatics, 4th International Workshop (WABI),
2004. Details
- Daniel G. Brown, Ian M. Harrower.
Integer Programming Approaches to Haplotype Inference by Pure Parsimony.
IEEE/ACM Transactions on Computational Biology and Bioinformatics,
2006.
Details
- Daniel G. Brown, Ian M. Harrower.
Toward an Algebraic Understanding of Haplotype Inference by Pure Parsimony.
In Proceedings of Computational Systems Bioinformatics Conference,
2006. Details
View more papers on: haplotypes |
Kernel Methods in Drug Design
A drug is typically a small molecule that interacts with the binding
site of some target protein. Drug design involves the optimization of
this interaction so that the drug effectively binds with the target
protein while not binding with other proteins (an event that could
produce dangerous side effects). Computational drug design involves
the geometric modeling of drug molecules, with the goal of generating
similar molecules that will be more effective drug candidates. It
is necessary that algorithms incorporate strategies to measure
molecular similarity by comparing molecular descriptors that may
involve dozens to hundreds of attributes. We use kernel-based
methods to define these measures of similarity. Kernels are general
functions that can be used to formulate similarity comparisons. The
main goal of this research is to develop kernel functions that will
facilitate those similarity comparisons that are most critical for a
particular binding site. |
Multiple Sequence Alignment
Comparing genomic sequences from multiple species is one of the most
essential tasks in bioinformatics. Such comparisons allow us to
investigate evolution across long time distances, and let us explore
which parts of a sequence may be partly responsible for visible
differences among a set of organisms of the same species.
The problem of finding a multiple alignment is challenging,
because for long sequences, the time needed to align many sequences
grows extremely quickly. We use a new approach to find anchoring
points that must be homologous in our produced alignment, which
exploits new probabilistic models to limit the number of false anchors
identified to well below 1. This allows us to efficiently compute
alignments of extremely high quality.
- Daniel G. Brown, Alexander K. Hudek.
New Algorithms for Multiple DNA Sequence Alignment.
In Algorithms in Bioinformatics, 4th International Workshop (WABI),
2004. Details
- Alexander K. Hudek, Daniel G. Brown.
Ancestral sequence alignment under optimal conditions.
BMC Bioinformatics,
2005.
Details
View more papers on: multiple alignment |
Motif Finding
Expression of genes is regulated by proteins binding to short sequence
motifs located close to their regulated genes. Motif finding strives to
discover unknown motifs by examining the regulatory regions of genes
regulated by the same protein, looking for substrings with many
approximate occurrences.
We have studied several abstractions of this problem formulated as
optimization problems. We have shown NP-hardness and approximation
algorithms for several variants of the problem. Bounds on the
approximation ratio are tight, but they do not reflect the good
performance of the algorithms in practice. However, recently we have
shown tighter bounds for strong motifs, which are more likely to be
biologically relevant. We have also studied the problem of finding
protein sequence segments conserved in both sequence and structure.
- Ming Li, Bin Ma, Lusheng Wang.
Finding Similar Regions in Many Strings.
In Proceedings of the thirty-first annual ACM symposium on Theory of computing (STOC),
1999. Details
- Thomas Tang, Jinbo Xu, Ming Li.
Discovering Sequence-Structure Motifs from Protein Segments and Two Applications.
In Biocomputing 2005, Proceedings of the Pacific Symposium (PSB),
2005. Details
- Brona Brejova, Daniel G. Brown, Ian M. Harrower, Alejandro Lopez-Ortiz, Tomas Vinar.
Sharper Upper and Lower Bounds for an Approximation Scheme for Consensus-Pattern.
In Combinatorial Pattern Matching, 16th Annual Symposium (CPM 2005),
2005. Details
- Brona Brejova, Daniel G. Brown, Ian M. Harrower, Tomas Vinar.
New bounds for motif finding in strong instances.
In Combinatorial Pattern Matching, 17th Annual Symposium (CPM 2006),
2006. Details
View more papers on: motifs |
Structural Comparison of Proteins
Structural comparisons of proteins are very important for evolutionary
studies, drug design, and protein classification. A structural
comparison between two proteins is usually done by performing a 3D
superimposition of the proteins. If one assumes that both proteins
are rigid entities, then it is easy to find the rotations and
translations that will optimally do this to get the maximum
superimposition. However, proteins are by nature somewhat flexible,
thus, it is more realistic to consider a rigid/flexible
comparison. This approach to structural comparison is more challenging
and demands considerably more computation because of the many degrees
of freedom in a protein molecule. The FlexSADRA algorithm does such a
comparison using a novel approach that reduces the complexity of the
comparison by working in a lower dimensional space defined through a
dimensionality reduction strategy. Unlike other structural comparison
algorithms that incorporate flexibility, FlexSADRA ensures that the final
flexed conformation has a realistically low potential energy.
- Shirley Hui.
FlexSADRA: flexible structural alignment using a dimensionality reduction approach.
Master thesis, University of Waterloo, 2005.
Details
View more papers on: structural alignment |
Structural Pattern Recognition
For the vast majority of proteins, similar sequence implies similar
function. This is the foundation for many tools in bioinformatics, and
has been instrumental in the annotation of newly sequenced genomes.
However, there are exceptions in which proteins with highly similar
sequence can have quite different biological functions, and conversely
proteins with very divergent sequences can be functionally similar. To
identify functional similarities independent of sequence, we have
developed structural pattern recognition techniques for identifying
binding and protein interaction surfaces. Antifreeze proteins provide
an example of proteins in which sequence is a poor predictor of
function. We have found that a common surface pattern (repeated carbon
atoms matching the ice lattice) is shared by antifreeze proteins,
independent of sequence or fold. This pattern can be used to predict
antifreeze activity, and has been used to identify a novel antifreeze
protein homologous to lipid transfer proteins. |
Alignment Free Whole Genome Phylogeny
We live in an information society. In the classical Newton world,
we know how to measure physical distances. Have you thought about
the equally fundamental question of how to measure the
information distance between two objects: two documents,
two emails, two music pieces, two programs, two pictures, and of course
two genomes?
From a simple and accepted assumption in thermodynamics,
we derive a universal information distance based on Kolmogorov complexity
and a general (application independent)
method to measure similarities between two sequences.
We then apply the theory to infer the evolutionary histories of mammals,
developing an alignment-free whole genome phylogeny method,
and many other applications, such as reconstruction of chain letter
history, and plagiarism detection.
- Ming Li, Jonathan H. Badger, Chen Xin, Sam Kwong, Paul Kearney, Haoyong Zhang.
An Information Based Sequence Distance and Its Application to Whole Mitochondrial Genome Phylogeny.
Bioinformatics,
2001.
Details
- Charles H. Bennett, Ming Li, Bin Ma .
Chain Letters and Evolutionary Histories.
Scientific American,
2003.
Details
- Xin Chen, Brent Francia, Ming Li, Brian Mckinnon, Amit Seker.
Shared Information and Program Plagiarism Detection.
IEEE Transactions on Information Theory,
2004.
Details
View more papers on: phylogeny, shared information, evolution |
|