Bioinformatics
Research
Group
[Bioinformatics Group Logo]
  Home
Today's events
People
Alumni

Research:
  Current projects
Publications
Software
Events

Graduate studies:
  Prospective students
Coming to Waterloo
Courses
Upcoming conferences

Undergraduate studies:
  Prospective students
Bioinformatics club
Undergraduate research assistants

Lab information:
  Mailing lists
Lab location
New user guide
Computing info
FAQs
Booking system

Photographs

Contact:
Bioinformatics Group
School of
Computer Science
University of Waterloo
200 University Ave W
Waterloo, ON N2L 3G1
Canada

E-mail: Dan Brown

University of Waterloo

Bioinformatics Group Research Projects


Protein Structure Prediction

In the post-genomic era, new protein sequences are produced much faster than it is possible to experimentally determine their 3D structure. Protein structure prediction plays a significant role in bridging this gap.

It has been estimated that there are only about 1000 different protein folds in nature. Protein threading involves searching for the best structure for a given protein sequence among already known protein structures. We have developed the fast and memory-efficient protein threading tool RAPTOR, based on integer programming and support vector machines. RAPTOR was assessed to be one of the best methods in the CASP 5 and CASP 6 structure prediction competitions. We are investigating many additional problems in structure prediction, such as side chain packing and loop structure, and the special case of structure prediction for transmembrane proteins. We also are investigating the combination of multiple predictions from different predictors.

  • Jinbo Xu, Ming Li. Assessment of RAPTORs linear programming approach in CAFASP3. Proteins: Structure, Function, and Genetics, 2003. Details
  • Jinbo Xu, Ming Li, Guohui Lin, Dongsup Kim, Ying Xu. Protein Threading by Linear Programming. In Biocomputing 2003, Proceedings of the Pacific Symposium (PSB), 2003. Details
  • Jinbo Xu. Rapid Protein Side-Chain Packing via Tree Decomposition. In RECOMB 2005, 2005. Details
  • Jinbo Xu, Ming Li, Dongsup Kim, Ying Xu. RAPTOR: Optimal Protein Threading by Linear programming. Journal of Bioinformatics and Computational Biology, 2003. Details

View more papers on: protein folding


Spaced Seeds for Homology Search

Homology search is the task of finding pairs of similar substrings in large DNA or protein sequence databases. Such sequence alignments highlight areas that may be preserved by evolution and help to determine their biological function. BLAST, a popular heuristic program for this task, is a household name among biologists.

In 2002, we introduced the technique of spaced seeds for increasing the accuracy and speed of BLAST-like algorithms. We have studied mathematical properties of spaced seeds, related algorithmic problems, and selection of appropriate spaced seeds for particular tasks. The resultant commercial tool, PatternHunter, was used to compare the human and mouse genomes by the Mouse Genome Sequencing Consortium. Most recently, we are combining homology search with gene finding to find homologs of a known gene in other organisms.

  • Bin Ma, John Tromp, Ming Li. PatternHunter: faster and more sensitive homology search. Bioinformatics, 2002. Details
  • Mouse Genome Sequencing Consortium. Initial sequencing and comparative analysis of the mouse genome. Nature, 2002. (Daniel Brown and Ming Li are members of MGSC). Details
  • Daniel G. Brown, Ming Li, Bin Ma. A tutorial of recent developments in the seeding of local alignment. Journal of Bioinformatics and Computational Biology, 2004. Details
  • Brona Brejova, Daniel G. Brown, Tomas Vinar. Vector seeds: an extension to spaced seeds. Journal of Computer and System Sciences, 2005. Early version appeared in WABI 2003. Details
  • Ming Li, Bin Ma, Zhang Louxin. Superiority and complexity of the spaced seeds. In Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithms (SODA 2006), 2006. Details

View more papers on: spaced seeds, homology search


Sequence Annotation and Hidden Markov Models

In sequence annotation, we want to segment a given DNA or protein sequence into biologically relevant regions. For example, in gene finding, we want to find the protein-coding regions of the genome. In transmembrane protein topology prediction, we want to distinguish transmembrane helixes, internal, and external loops. Additional sources of biolo gical information can improve predictions over those derived from sequence alone. With this observation in mind, we have developed two practical tools, ExonHunter and AHMM, for these two tasks.

We also study theoretical concepts underlying the use of hidden Markov models fo r sequence annotation. We have shown that it is important to consider a tradeoff between prediction accuracy and running time, as a simple change in the model may change running time from linear to exponential.

  • Brona Brejova, Daniel G. Brown, Tomas Vinar. The Most Probable Labeling Problem in HMMs and Its Application to Bioinformatics. In Algorithms in Bioinformatics, 4th International Workshop (WABI), 2004. Details
  • Brona Brejova, Daniel G. Brown, Ming Li, Tomas Vinar. ExonHunter: a comprehensive approach to gene finding. Bioinformatics, 2005. ISMB 2005. Details
  • Emily Wei Xu, Paul Kearney, Daniel G. Brown. The use of functional domains to improve transmembrane protein topology prediction. Journal of Bioinformatics and Computational Biology, 2006. Details

View more papers on: HMMs, gene finding, transmembrane proteins


Haplotype Inference

In diploid organisms, such as humans, the identification of maternal and paternal inheritance is important for the mapping of disease genes. Technological limitations make this problem very difficult. It is currently very expensive to experimentally determine these parental haplotypes, and instead the genotype (the conflation of the two haplotypes) is identified. The process of going from genotypes to haplotypes is called haplotype inference. A surprising fact is that while most problems in this area are very hard in theory, real instances of these problems turn out to be very easy to solve in practice. We have proved theorems explaining this phenomenon for simple models of populations, and are currently moving to more complicated models. By understanding how changes to the population model affect the complexity of the problem, we hope to develop more efficient and effective techniques for haplotype inference.

  • Daniel G. Brown, Ian M. Harrower. A New Integer Programming Formulation for the Pure Parsimony Problem in Haplotype Analysis. In Algorithms in Bioinformatics, 4th International Workshop (WABI), 2004. Details
  • Daniel G. Brown, Ian M. Harrower. Integer Programming Approaches to Haplotype Inference by Pure Parsimony. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2006. Details
  • Daniel G. Brown, Ian M. Harrower. Toward an Algebraic Understanding of Haplotype Inference by Pure Parsimony. In Proceedings of Computational Systems Bioinformatics Conference, 2006. Details

View more papers on: haplotypes


Kernel Methods in Drug Design

A drug is typically a small molecule that interacts with the binding site of some target protein. Drug design involves the optimization of this interaction so that the drug effectively binds with the target protein while not binding with other proteins (an event that could produce dangerous side effects). Computational drug design involves the geometric modeling of drug molecules, with the goal of generating similar molecules that will be more effective drug candidates. It is necessary that algorithms incorporate strategies to measure molecular similarity by comparing molecular descriptors that may involve dozens to hundreds of attributes. We use kernel-based methods to define these measures of similarity. Kernels are general functions that can be used to formulate similarity comparisons. The main goal of this research is to develop kernel functions that will facilitate those similarity comparisons that are most critical for a particular binding site.


Multiple Sequence Alignment

Comparing genomic sequences from multiple species is one of the most essential tasks in bioinformatics. Such comparisons allow us to investigate evolution across long time distances, and let us explore which parts of a sequence may be partly responsible for visible differences among a set of organisms of the same species.

The problem of finding a multiple alignment is challenging, because for long sequences, the time needed to align many sequences grows extremely quickly. We use a new approach to find anchoring points that must be homologous in our produced alignment, which exploits new probabilistic models to limit the number of false anchors identified to well below 1. This allows us to efficiently compute alignments of extremely high quality.

  • Daniel G. Brown, Alexander K. Hudek. New Algorithms for Multiple DNA Sequence Alignment. In Algorithms in Bioinformatics, 4th International Workshop (WABI), 2004. Details
  • Alexander K. Hudek, Daniel G. Brown. Ancestral sequence alignment under optimal conditions. BMC Bioinformatics, 2005. Details

View more papers on: multiple alignment


Motif Finding

Expression of genes is regulated by proteins binding to short sequence motifs located close to their regulated genes. Motif finding strives to discover unknown motifs by examining the regulatory regions of genes regulated by the same protein, looking for substrings with many approximate occurrences.

We have studied several abstractions of this problem formulated as optimization problems. We have shown NP-hardness and approximation algorithms for several variants of the problem. Bounds on the approximation ratio are tight, but they do not reflect the good performance of the algorithms in practice. However, recently we have shown tighter bounds for strong motifs, which are more likely to be biologically relevant. We have also studied the problem of finding protein sequence segments conserved in both sequence and structure.

  • Ming Li, Bin Ma, Lusheng Wang. Finding Similar Regions in Many Strings. In Proceedings of the thirty-first annual ACM symposium on Theory of computing (STOC), 1999. Details
  • Thomas Tang, Jinbo Xu, Ming Li. Discovering Sequence-Structure Motifs from Protein Segments and Two Applications. In Biocomputing 2005, Proceedings of the Pacific Symposium (PSB), 2005. Details
  • Brona Brejova, Daniel G. Brown, Ian M. Harrower, Alejandro Lopez-Ortiz, Tomas Vinar. Sharper Upper and Lower Bounds for an Approximation Scheme for Consensus-Pattern. In Combinatorial Pattern Matching, 16th Annual Symposium (CPM 2005), 2005. Details
  • Brona Brejova, Daniel G. Brown, Ian M. Harrower, Tomas Vinar. New bounds for motif finding in strong instances. In Combinatorial Pattern Matching, 17th Annual Symposium (CPM 2006), 2006. Details

View more papers on: motifs


Structural Comparison of Proteins

Structural comparisons of proteins are very important for evolutionary studies, drug design, and protein classification. A structural comparison between two proteins is usually done by performing a 3D superimposition of the proteins. If one assumes that both proteins are rigid entities, then it is easy to find the rotations and translations that will optimally do this to get the maximum superimposition. However, proteins are by nature somewhat flexible, thus, it is more realistic to consider a rigid/flexible comparison. This approach to structural comparison is more challenging and demands considerably more computation because of the many degrees of freedom in a protein molecule. The FlexSADRA algorithm does such a comparison using a novel approach that reduces the complexity of the comparison by working in a lower dimensional space defined through a dimensionality reduction strategy. Unlike other structural comparison algorithms that incorporate flexibility, FlexSADRA ensures that the final flexed conformation has a realistically low potential energy.

  • Shirley Hui. FlexSADRA: flexible structural alignment using a dimensionality reduction approach. Master thesis, University of Waterloo, 2005. Details

View more papers on: structural alignment


Structural Pattern Recognition

For the vast majority of proteins, similar sequence implies similar function. This is the foundation for many tools in bioinformatics, and has been instrumental in the annotation of newly sequenced genomes. However, there are exceptions in which proteins with highly similar sequence can have quite different biological functions, and conversely proteins with very divergent sequences can be functionally similar. To identify functional similarities independent of sequence, we have developed structural pattern recognition techniques for identifying binding and protein interaction surfaces. Antifreeze proteins provide an example of proteins in which sequence is a poor predictor of function. We have found that a common surface pattern (repeated carbon atoms matching the ice lattice) is shared by antifreeze proteins, independent of sequence or fold. This pattern can be used to predict antifreeze activity, and has been used to identify a novel antifreeze protein homologous to lipid transfer proteins.


Alignment Free Whole Genome Phylogeny

We live in an information society. In the classical Newton world, we know how to measure physical distances. Have you thought about the equally fundamental question of how to measure the information distance between two objects: two documents, two emails, two music pieces, two programs, two pictures, and of course two genomes?

From a simple and accepted assumption in thermodynamics, we derive a universal information distance based on Kolmogorov complexity and a general (application independent) method to measure similarities between two sequences. We then apply the theory to infer the evolutionary histories of mammals, developing an alignment-free whole genome phylogeny method, and many other applications, such as reconstruction of chain letter history, and plagiarism detection.

  • Ming Li, Jonathan H. Badger, Chen Xin, Sam Kwong, Paul Kearney, Haoyong Zhang. An Information Based Sequence Distance and Its Application to Whole Mitochondrial Genome Phylogeny. Bioinformatics, 2001. Details
  • Charles H. Bennett, Ming Li, Bin Ma . Chain Letters and Evolutionary Histories. Scientific American, 2003. Details
  • Xin Chen, Brent Francia, Ming Li, Brian Mckinnon, Amit Seker. Shared Information and Program Plagiarism Detection. IEEE Transactions on Information Theory, 2004. Details

View more papers on: phylogeny, shared information, evolution


This page is maintained by Dan Brown and Alexander K. Hudek.
Contact:
Last modified: 07/20/2006
Google
Search WWW Search this site