Perl script:
ÊÊÊÊÊÊÊÊÊÊÊ As stated previously, the key elements in this scheme are the data compression algorithm and the distance function.Ê Several algorithms were tested, including Burroughs-Wheeler sliding block, partial-match Markov chains, and statistical compressors but all of these proved inferior to Lempel-Ziv coding.Ê Of these algorithms, we note that Lempel-Ziv has the lowest Kolmolgorov complexity ö it essentially amounts to constructing a nearly optimal dictionary for the input string and representing the string with small pointers back to the entries of this dictionary.Ê The algorithm follows: [8]
0. Initialize dictionary to blocks of length 1
1. Search for the largest block B that has appeared in the dictionary
2. Encode this block with its dictionary index
3. Add B and the first symbol in the next block to the dictionary
4. Go to 1 unless some stopping criterion has been satisfied
ÊÊÊÊÊÊÊÊÊÊÊ Note that when sequences are concatenated and compressed with an LZ scheme, conserved regions are automatically mapped to the same dictionary entries.Ê This is one of the most interesting aspects of LZ compression on biological sequences ö it does not require a multiple sequence alignment.Ê Of course, LZ doesnât know what is biologically meaningful and what isnât, it only knows how to parse and optimize.
ÊÊÊÊÊÊÊÊÊÊÊ In classical information theory, the mutual information is the amount of information shared by two random variables.Ê A similar result holds for the Kolmolgorov complexity of concatenated strings u and v [4,6]:
K(u)-K(u|v)
--------------
K(uv)
Where uv is the concatenation of uv and K(u|v) is the complexity of u given free access to the dictionary of v.Ê This measures closeness of the strings, so one minus this quantity yields a distance:
K(uv)+K(vu)-K(u)-K(v)
-----------------------------
K(uv)
Or, approximately, d=
2*C(uv)-C(u)-C(v)
----------------------
C(uv)
This yields a distance between 0 and 1.Ê We also experiment with the mapping 1/(1-d)-1 which sends (0,1) to (1,infinity) and then shifts this back to (0,infinity) which is a more intuitive range for a distance.
Note that d(a,a)=0 is satisfied in the long-sequence limit.Ê If we concatenate a with itself and compress C(aa), then LZ simply adds a pointer back to C(a).Ê The length of this pointer is generally short with respect to typical protein sequences a, but it is still finite and can cause a problem for short sequences.Ê Short sequences are not appropriate for any phylogeny determination (9).Ê
The role of sequence length is dramatically illustrated in this 1.1mb .ps plot of sequence length vs. compression ratio for every protein in the current release of Swiss-Prot.Ê At first glance, it appears that the protein is being compressed.Ê Each amino acid, however, is stored as an 8-bit ASCII character.Ê In an idealized storage system, only log2(20) ~ 4.322 bits would be needed.Ê Hence, the convergence to 4.322/8~0.54 strongly supports the hypothesis that protein is incompressible [7].Ê Also note that in the incompressibility limit, concatenations of proteins are compressible if and only if they share conserved regions.Ê Under an incompressibility assumption, our distance becomes
2*C(uv)-k(|u|-|v|)
----------------------
C(uv)
Where k~0.54 and |u| is the length of u.