GenCompress is an efficient compression algorithm for DNA sequences. Data compression has recently become a tool for retrieving information hidden in genetic sequences. From the GenCompress results, a shared information distance is approximated; it measures the relatedness between each pair of DNA sequences. Shared information distance is defined in Kolmogorov complexity theory. GenCompress can be downloaded from here.
GenCompress treats a file as a DNA sequence if it contains only the four lower-case characters {a, c, g, t}. Any other file is compressed as an ASCII file.
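The input rule above can be sketched in a few lines of Python. This is only an illustration of the stated rule, not GenCompress's own code: the function name `looks_like_dna` is hypothetical, and treating an empty input as non-DNA and rejecting whitespace or line breaks are assumptions of this sketch.

```python
# Hypothetical helper illustrating the input rule; not part of GenCompress.
DNA_ALPHABET = frozenset("acgt")


def looks_like_dna(text: str) -> bool:
    """Return True if text is non-empty and uses only lower-case a, c, g, t."""
    # Assumption: an empty input is not treated as a DNA sequence.
    # Assumption: whitespace and newlines count as non-DNA characters.
    return bool(text) and set(text) <= DNA_ALPHABET
```

Under this rule, upper-case input such as "ACGT" or a sequence containing any other symbol would fall through to the ASCII case.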
Yes. Please click here to get a Linux executable version of GenCompress, and here for a SunOS-5.7 executable version.
We use the GenCompress results to approximate a distance (similarity) measure between DNA sequences based on shared information, and from the resulting distance matrix we can successfully construct biological evolutionary trees (please refer to our paper published in Bioinformatics). The same approach can also be used to infer the phylogeny of English texts (please refer to the paper by Charles Bennett, Ming Li, and Bin Ma, "Linking chain letters", to appear in Scientific American, accepted May 2000).
GenDistance calculates an n×n distance matrix for n input sequences; its (i, j) element is the distance between the ith and jth sequences, obtained from the GenCompress program above. In theory, this distance matrix should be symmetric. The GenDistance program can be downloaded from here.
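As a rough illustration of a compression-based pairwise distance matrix, the Python sketch below builds an n×n matrix for a list of sequences. It does not reproduce GenDistance: zlib stands in for GenCompress, and the distance used is the normalized compression distance, (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), which is a swapped-in formula chosen for illustration rather than the measure GenDistance computes. All function names are hypothetical.

```python
import zlib
from typing import List, Sequence


def compressed_size(data: bytes) -> int:
    """Size in bytes after zlib compression (a stand-in for GenCompress)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance (assumed here, not GenDistance's formula)."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation of x and y
    return (cxy - min(cx, cy)) / max(cx, cy)


def distance_matrix(seqs: Sequence[bytes]) -> List[List[float]]:
    """Build the n x n matrix whose (i, j) entry is ncd(seqs[i], seqs[j])."""
    n = len(seqs)
    # The diagonal is set to 0.0 by convention (an assumption of this sketch).
    return [[0.0 if i == j else ncd(seqs[i], seqs[j]) for j in range(n)]
            for i in range(n)]


def symmetrize(d: List[List[float]]) -> List[List[float]]:
    """Average each (i, j) and (j, i) pair to force exact symmetry."""
    n = len(d)
    return [[(d[i][j] + d[j][i]) / 2 for j in range(n)] for i in range(n)]
```

With a practical compressor, the compressed size of x followed by y may differ slightly from that of y followed by x, so a matrix computed this way is usually only approximately symmetric. Averaging the two entries, as `symmetrize` does, is one simple way to obtain a symmetric matrix before tree construction.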