DNACompress - compress DNA sequences


    DNACompress [1] is a fast and efficient compression algorithm for DNA sequences. In the spirit of the Lempel-Ziv compression scheme, it first uses PatternHunter [2] to find all approximate repeats that would yield a compression gain, and then encodes each of them as a pointer to its previous occurrence. On standard benchmark sequences it achieves a better average compression ratio than Biocompress-2, GenCompress, and CTW+LZ (see the following table, which lists compression ratios in bits per base along with DNACompress encoding times).

Sequence     Size     Biocompress-2  GenCompress  CTW+LZ  DNACompress  Encoding time
CHMPXX       121024   1.6848         1.673        1.6690  1.6716       6.21s
CHNTXX       155844   1.6172         1.6146       1.6129  1.6127       5.58s
HEHCMVCG     229354   1.848          1.847        1.8414  1.8492       5.41s
HUMDYSTROP   38770    1.9262         1.9226       1.9175  1.9116       3.21s
HUMGHCSA     66495    1.307          1.1048       1.0972  1.0272       7.45s
HUMHBB       73323    1.88           1.8204       1.8082  1.7897       4.04s
HUMDABCD     58864    1.877          1.8192       1.8218  1.7951       6.13s
HUMHPRTB     56737    1.9066         1.8466       1.8433  1.8165       5.08s
MPOMTCG      186608   1.9378         1.9058       1.9000  1.8920       5.84s
PANMTPACGA   100314   1.8752         1.8624       1.8555  1.8556       4.22s
VACCG        191737   1.7614         1.7614       1.7616  1.7580       6.60s
average      ---      1.7837         1.7434       1.7389  1.7254       ---*

* DNACompress is implemented in Java. Encoding times were measured by running DNACompress locally and may vary from run to run because of the Java runtime.
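
The Lempel-Ziv-style idea described above (replace a repeat with a pointer to its previous occurrence) can be illustrated with a small sketch. This is not the DNACompress implementation: it handles exact repeats only, uses a naive quadratic search instead of PatternHunter, and the names `encode`, `decode`, and `min_len` are hypothetical, chosen for illustration.

```python
def encode(seq, min_len=4):
    """Greedy sketch: replace exact repeats with (start, length) pointers.

    Assumption: a repeat is only worth a pointer when it is at least
    min_len characters long; shorter matches are emitted as literals.
    """
    tokens = []
    i = 0
    while i < len(seq):
        best_len, best_pos = 0, -1
        # Naive search for the longest earlier, non-overlapping exact match.
        for j in range(i):
            k = 0
            while i + k < len(seq) and j + k < i and seq[j + k] == seq[i + k]:
                k += 1
            if k > best_len:
                best_len, best_pos = k, j
        if best_len >= min_len:
            tokens.append(("repeat", best_pos, best_len))
            i += best_len
        else:
            tokens.append(("literal", seq[i]))
            i += 1
    return tokens


def decode(tokens):
    """Rebuild the sequence by copying literals and following pointers."""
    out = []
    for token in tokens:
        if token[0] == "repeat":
            _, pos, length = token
            out.extend(out[pos:pos + length])
        else:
            out.append(token[1])
    return "".join(out)


if __name__ == "__main__":
    tokens = encode("acgtacgtacgt")
    print(tokens)
    print(decode(tokens))
```

For the input "acgtacgtacgt", the first four characters are emitted as literals and each later copy of "acgt" becomes a pointer back to position 0. A real compressor would also need to weigh the cost of encoding each pointer against the cost of encoding the bases directly, which this sketch reduces to a fixed length threshold.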

    DNACompress is available online, or it can be downloaded from here for local use. Two sample DNA sequences that can be used as input to DNACompress are Humghcsa.seq and Hehcmvcg.seq. The input sequence file may contain only the four lowercase characters {a, c, g, t}, and its filename must end in .seq.
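
Before submitting a file, the input constraints above can be checked with a short script. The following is a minimal sketch and not part of DNACompress; the function name `input_problems` is hypothetical. It assumes the suffix check is case-sensitive and treats every character outside {a, c, g, t} as disallowed, including whitespace and line breaks, since the page does not say whether those are tolerated.

```python
ALLOWED = frozenset("acgt")


def input_problems(filename, sequence):
    """Return a list of problems; an empty list means the input looks acceptable."""
    problems = []
    # Assumption: the suffix must be exactly ".seq" in lowercase.
    if not filename.endswith(".seq"):
        problems.append("filename does not end in .seq")
    # Assumption: any character other than a, c, g, t is rejected.
    bad = sorted(set(sequence) - ALLOWED)
    if bad:
        problems.append("disallowed characters: " + ", ".join(repr(c) for c in bad))
    return problems


if __name__ == "__main__":
    print(input_problems("Humghcsa.seq", "acgtacgt"))
    print(input_problems("sample.txt", "ACGTN"))
```

An uppercase FASTA-style file would fail both checks, so it would first need its header line removed, its bases lowercased, and its extension changed.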


Suggestions or comments? Please email xinchen@cs.ucr.edu

02/28/2001