ACB, DMC, PPM, LZ - CLASSIFICATION, IDEA, COMPARISON

George Buyanovsky

1. Introduction

This article provides a brief overview of existing technologies for data
compression and describes the place taken by ACB compression among them.
The overview discusses only the technologies for building compression
models: LZ, PPM, DMC, ACB; coding is not discussed.

LZ has two basic modifications, LZ77 and LZ78, along with a large number
of variations. At present LZ models are widely used (zip, rar, arj, lzh,
lzw, ...). The theoretical approach for LZ models was suggested in 1977
by A. Lempel and J. Ziv (LZ), with software realizations appearing in the
early eighties.

PPM, or context modelling, is based on theoretical work conducted in
1986-95 (T.C. Bell, J.G. Cleary, I.H. Witten, W.J. Teahan, A. Moffat, ...).
Software realizations came about in the early nineties (HA, X1, ...).

DMC - Several DMC (Dynamic Markov Coding) algorithms are known, in
particular Ross Williams' DHPC algorithm, as well as the DMC algorithm of
Cormack & Horspool, which uses many heuristic methods (as PPM does). I am
not aware of industrial archivers using the DMC technology.

2. LZ, PPM, ACB <--> DMC

When two persons speak with one another, they have to formulate their
ideas in detail at the beginning of their conversation, but the longer
their conversation lasts, the less detail is required. Why does this
happen? Many ideas have already been expressed and are understood by the
conversants without detail.

2.1 LZ-compression

LZ-compression substitutes references to a dictionary for the initial
text; at first sight this scheme resembles the methodology used by the two
interlocutors. However, with the growth of the dictionary, the number of
bits necessary for formulating a reference grows proportionally to the
binary logarithm of the size of the dictionary. At the beginning, the
length of the phrases in the dictionary (in the simplest case, a binary
tree) grows considerably faster than the length of the references, but
after some saturation point the speed of their growth asymptotically
tends to a logarithmic dependence. The optimal size of the dictionary
varies for different types of data: the more variable the data, the
smaller the optimal size of the dictionary.

2.2 PPM - algorithms (context modelling)

The idea of context modelling is based on the fact that the probability
distribution over the alphabet depends on the nearest context; in other
words, letters are more likely to appear in particular patterns, that is,
next to or near other particular letters. For this technology there also
exist principal restrictions on the increase of the sliding frame (the
window of recent data over which the context model is built). A small
sliding frame with a correspondingly meagre context model is optimal on
variable data, for which the problem of zero frequency is especially
acute. As the size of the sliding frame increases, more and more varied
information is processed (e.g., a text in French, then a text in German
and then a text in Russian enters this frame), with the result that the
probability distributions widen and the effectiveness of context
modelling (with short contexts) decreases quickly. With the increase of
the context length, costs also increase exponentially. The transition to
context-mixing models has improved the situation to some extent; however,
the absence of theoretically substantiated schemes for mixing
probabilities is compensated by a large number of heuristics, which takes
us back to the times of alchemy.
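To make the context-modelling idea concrete, here is a minimal sketch in
C of an order-1 model with a PPMC-style escape to order 0. It is an
illustration only, not the scheme of any particular archiver: real PPM
adds higher orders, exclusions and an arithmetic coder, all omitted here.

    /* Minimal order-1 context model with a PPMC-style escape to
     * order 0.  Feeds ideal code lengths to a counter instead of
     * an arithmetic coder. */
    #include <stdio.h>
    #include <math.h>

    static unsigned o1[256][256]; /* o1[c][s]: count of s after context c */
    static unsigned o0[256];      /* order-0 fallback counts */

    /* Estimated probability of symbol s appearing after context c. */
    static double predict(int c, int s)
    {
        unsigned total = 0, distinct = 0;
        for (int i = 0; i < 256; i++) {
            total += o1[c][i];
            if (o1[c][i]) distinct++;
        }
        if (o1[c][s])                        /* seen in this context */
            return (double)o1[c][s] / (total + distinct);
        /* escape (PPMC: escape count = number of distinct symbols),
         * then fall back to a Laplace-smoothed order-0 model */
        double pesc = total ? (double)distinct / (total + distinct) : 1.0;
        unsigned t0 = 0;
        for (int i = 0; i < 256; i++) t0 += o0[i];
        return pesc * (o0[s] + 1.0) / (t0 + 256.0);
    }

    int main(void)
    {
        int prev = 0, ch;
        double bits = 0.0;
        while ((ch = getchar()) != EOF) {
            bits += -log2(predict(prev, ch)); /* ideal code length */
            o1[prev][ch]++; o0[ch]++;
            prev = ch;
        }
        printf("ideal output: %.0f bits\n", bits);
        return 0;
    }

Note how the zero-frequency problem mentioned above surfaces here as the
choice of pesc: nothing in the counts dictates that distinct/(total+distinct)
is the right escape estimate, which is exactly the kind of heuristic the
text complains about.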
2.3 ACB-compression

However, the brain of the interlocutors can in some mysterious fashion
overcome this barrier and use all the accumulated information, through a
mechanism called associative memory. The general principle of
ACB-compression is very simple: the ACB algorithm puts on glasses with an
associative filter and, looking at the past, sees not the whole frame but
only fragments which are close by context to the current context.
Fragments, or phrases, of the resulting picture differ much in quality:
phrases close by context stand out distinctly against the other phrases,
while less close phrases are less distinct or not distinct at all. In
this case the ACB algorithm has ideal conditions for work: it works more
quickly by considering larger units and, moreover, it can distinguish the
value of phrases by considering the probability of their appearance or
use. Even with an unlimited increase of the frame, ACB-compression always
ensures an optimal size of the context dictionary. For some details about
ACB-compression, see APPENDIX_A.

2.4 DMC - Dynamic Markov Coding

Probabilistic models with a finite number of states can be described by a
finite automaton. A set of states S(i) together with a set of
probabilities P(i,j) of transition from state i into state j is called a
Markov model. Dynamic Markov Coding (DMC) is of practical interest in
that it works adaptively, starting from a simple initial model and adding
new states as necessary. PPM technology is a particular case of the DMC
approach. DMC allows the construction of context models not only for
single symbols (as with PPM, which considers letters of the alphabet),
but also for phrases or lines. In this sense, ACB-compression can be
classified as a variety of the DMC approach. (A minimal code sketch of
DMC state cloning is given after the comparison tables below.) I am not
aware of software realizations of DMC algorithms; I would appreciate
receiving any information on this issue (especially software realizations
for the IBM_PC).

3. Comparison: ACB, PPM, LZ

ACB.EXE v1.14a & v1.17a works only in a "solid" mode, in which all files
being packed are lined up in a single solid stream; this allows the most
effective use of the advantages of a large frame. The least favourable
data for this solid mode are test data, since, as a rule, varied
information is put into a test catalogue. However, the high adaptivity of
ACB-compression reduces the possible loss to a minimum, and on real data
the solid mode gives a gain in compression in 99% of cases.

Below please find a comparative table (as test files, I have selected
those transmitted by modem):

Compared: ACB v1.14a, X1 v0.93e, RAR v1.54b, ZIP v2.04g
DATA    - Borland_C++_v_3.1 and GS_system_(convertor for *.ps)
          (exe,asm,c,cpp,h,hpp,obj,rtf,hlp,txt,doc,rc,res,bmp,lib,dll,
           wav,prj,ps,....)
Mode (Options):
  ACB - default ("TAUGHT CHANNEL" mode is not used)
  X1  - mode with max compression selected
        (m1/m3/m4/m1-Solid/m3-Solid/m4-Solid)
  RAR - all Solid, max compression
  ZIP - max compression

            ACB        PPM        LZ         LZ
            *.ACB      *.X        *.RAR      *.ZIP
  TLIB      78,232     124,950    197,383    219,503
  LIB       551,153    798,122    1,020,380  1,157,399
  BGI       80,441     91,310     88,763     111,801
  DOC       145,801    157,010    193,922    209,196
  OWL       813,594    1,306,961  1,306,723  1,569,437
  EXMPL     413,790    568,216    565,897    837,843
  GS        605,931    653,809    671,410    725,455
            ---------  ---------  ---------  ---------
  Total     2,688,942  3,700,378  4,044,478  4,830,634

Time on OWL:         Pentium-100   (486SX-33)
  ACB                232 sec.      1057 sec.
  X1 (m4-Solid)      190 sec.      710 sec.
  RAR                101 sec.      341 sec.
  ZIP                77 sec.       170 sec.
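As promised in section 2.4, here is a minimal sketch in C of the
state-cloning mechanism that lets DMC grow its Markov model adaptively.
The single-state start and the cloning thresholds are illustrative
assumptions of this sketch; Cormack & Horspool start from a richer
initial structure and tune these heuristically.

    /* Minimal sketch of DMC state cloning on a bit stream.  The model
     * starts from one state with self-loops; a target state is cloned
     * when one incoming transition dominates its traffic. */
    #include <stdio.h>
    #include <math.h>

    #define MAXSTATES 65536
    #define MIN_C1 2.0   /* traffic on (s,b) needed to justify a clone  */
    #define MIN_C2 2.0   /* traffic target must keep from other sources */

    typedef struct { int next[2]; double cnt[2]; } State;

    static State m[MAXSTATES];
    static int nstates = 1;

    /* Probability that the next bit is 1, as seen from state s. */
    static double p1(int s)
    { return m[s].cnt[1] / (m[s].cnt[0] + m[s].cnt[1]); }

    /* Record observed bit b in state s; clone if needed; return new state. */
    static int update(int s, int b)
    {
        int t = m[s].next[b];
        double tv = m[t].cnt[0] + m[t].cnt[1]; /* traffic through t */
        if (m[s].cnt[b] >= MIN_C1 && tv - m[s].cnt[b] >= MIN_C2
            && nstates < MAXSTATES) {
            int c = nstates++;                 /* clone t into c */
            double r = m[s].cnt[b] / tv;
            m[c] = m[t];                       /* copy outgoing arcs */
            m[c].cnt[0] *= r;      m[c].cnt[1] *= r;
            m[t].cnt[0] -= m[c].cnt[0];
            m[t].cnt[1] -= m[c].cnt[1];
            m[s].next[b] = c;                  /* redirect to the clone */
            t = c;
        }
        m[s].cnt[b] += 1.0;
        return t;
    }

    int main(void)
    {
        m[0].next[0] = m[0].next[1] = 0;       /* self-loops */
        m[0].cnt[0] = m[0].cnt[1] = 1.0;       /* no zero frequencies */
        int s = 0, ch;
        double bits = 0.0;
        while ((ch = getchar()) != EOF)
            for (int i = 7; i >= 0; i--) {
                int b = (ch >> i) & 1;
                /* p1(s) is what an entropy coder would be fed here */
                bits += -log2(b ? p1(s) : 1.0 - p1(s));
                s = update(s, b);
            }
        printf("%.0f ideal bits out, %d states\n", bits, nstates);
        return 0;
    }

The cloning step is where DMC "adds new states as necessary": a state
shared by many contexts is split so that the dominant context gets its
own copy with its own statistics.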
The non-linear nature of the speed gain in the table above, when
switching to a Pentium processor, is explained by three reasons:
1) Time costs for disk operations do not decrease substantially when
   switching to a Pentium; their share for ACB is very small.
2) Optimization of the code for the Pentium.
3) More effective caching of RAM, which is very critical for ACB.

The advantage of ACB-compression can be increased if ACB is tuned to some
type of data using the "TAUGHT CHANNEL" mode, having created in advance a
context from related information. Such a possibility is only a side
effect of this mode. Something similar occurs when choosing options in
order to obtain the maximum compression coefficient (external information
about the data type). This approach is good if one wants to set world
records, but it is of little practical application.

In APPENDIX_B, see the comparative table between the best research
algorithms and ACB (on the "Calgary Corpus").
In APPENDIX_C, see the comparative table between ACB_1.17a and ACB_1.14a.

4. "TAUGHT CHANNEL"

The idea: when compressing information transmitted over a communications
channel, all (ALL!!!) the information earlier transmitted through this
channel is remembered, so that any repetitions of data transmitted
through the channel in the past will be used by the compression
algorithm. (A sketch of this idea, using a preset dictionary, is given
after the conclusion below.)

Realization: The ACB-compression algorithm was implemented in the
archiver ACB.EXE for DOS/WIN95 (ACB_1.17a is the latest version); it was
fully written in Assembler (32-bit protected mode) and optimized for the
Pentium processor. It supports the TAUGHT CHANNEL mode.

Innovations: The ACB-compression algorithm is the only compression
algorithm whose compression coefficient asymptotically increases with the
growth of the sliding frame. Other algorithms can track not more than
8-65 Kb of the channel's background, depending on the type of data, as
compared with 822 Kb in ACB_1.14a and 1671 Kb in ACB_1.17a.

4.1 Testing with "TAUGHT CHANNEL" mode

The stream of data consists of two types of data:
- ARCHIVE COMPARISON TEST (A.C.T.), author JEFF GILCHRIST, for June,
  July, August, September, October;
- *.exe files, which are different versions of ACB.EXE.

Mode (Options):
  ACB - "TAUGHT CHANNEL" mode
  X1  - aem#
  RAR - all Solid, max compression
  ZIP - max compression

Stream: TXT --> EXE --> TXT --> EXE .... TXT --> EXE

                ACB      PPM      LZ       LZ
  Stream        *.ACB    *.X      *.RAR    *.ZIP
  of data       size     size     size     size
  A.C.T_6.95    10323    10314    12201    12199
  ACB_1.10      35605    36713    37150    37605
  A.C.T_7.95    3976     11039    13118    13121
  ACB_1.11      7903     36751    37172    37620
  A.C.T_8.95    2120     10847    12855    12849
  ACB_1.12      6963     36977    37381    37853
  A.C.T_9.95    1905     11004    12962    12949
  ACB_1.13      6123     37036    37402    37886
  A.C.T_10.95   1847     11381    13465    13480
  ACB_1.14      6873     36201    36410    36978
                -----    ------   ------   ------
  Total         83638    238263   250116   252540

  Time (sec.)   20       28       7        5

Testing was conducted on a Pentium-75.

5. Conclusion

The Pentium-100 encourages the creation of new algorithms, one of which
is ACB-compression. If you are working on the problem of data compression
for communications channels, ACB-compression is the most effective
algorithm for compressing a stream of varied information.
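Before the appendices, the sketch promised in section 4: an illustration
of the "taught channel" idea. ACB's own mechanism is internal to ACB.EXE
and is not reproduced here; as a stand-in, zlib's LZ77 coder is primed
with past traffic through its preset-dictionary call (the history and
message strings below are hypothetical).

    /* Sketch of a "taught channel": the coder is primed with data
     * previously sent, so repetitions of past traffic compress well.
     * zlib stands in for ACB here; note that zlib's window limits the
     * usable history to 32 Kb, whereas ACB_1.17a tracks 1671 Kb. */
    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>

    int main(void)
    {
        /* hypothetical past traffic and new message */
        static unsigned char history[] =
            "ARCHIVE COMPARISON TEST results for June ...";
        static unsigned char message[] =
            "ARCHIVE COMPARISON TEST results for July ...";
        unsigned char out[512];

        z_stream z;
        memset(&z, 0, sizeof z);
        deflateInit(&z, Z_BEST_COMPRESSION);
        /* "teach" the channel: preload the coder with past traffic */
        deflateSetDictionary(&z, history, sizeof history - 1);
        z.next_in   = message;
        z.avail_in  = sizeof message - 1;
        z.next_out  = out;
        z.avail_out = sizeof out;
        deflate(&z, Z_FINISH);
        printf("%u bytes in -> %lu bytes out\n",
               (unsigned)(sizeof message - 1), z.total_out);
        deflateEnd(&z);
        return 0;
    }

The decoder must be primed identically (inflateSetDictionary after
inflate reports Z_NEED_DICT), which mirrors the requirement that both
ends of a taught channel share the same transmission history.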
---------------------------------------------------------------------------
APPENDIX_A: Complexity of ACB-compression

Time:   T(n) = n/k * ( S + Log2( n/(2*k) ) ) + o(1)
Memory: M(n) = n + n/8 + n*(Log2(n)+1)/4 + 11000

n - sliding frame size (for ACB v1.14a n=822000 bytes,
    for v1.17a n=1671000 bytes),
S - context dictionary size; for ACB_1.14a: S=53,
    for ACB_1.17a: S=100 (MAX), S=55 (NORMAL), S=16 (FAST),
k - I/C, where I is the data size and C is the code size.
T(n)code = T(n)decode, M(n)code = M(n)decode.

Notes:
- Memory requirements can be decreased in proportion to the compression
  ratio k:
      M'(n) = n + n/8 + n*(Log2(n)+1)/(4*k) + 11000
  However, this possibility is not used in ACB.EXE, in order to simplify
  the memory manager.
- If S=1, the compression rate will be approximately the same as
  LZ-schemes achieve on small files (32-64 Kb). However, with the
  increase of their size, the advantage of ACB-compression grows (even at
  S=1), and all its valuable properties (on_the_fly operation,
  unrestricted growth of the sliding frame, ...) remain.
- Log2(n)+1 is the size of the pointer (in bits) necessary for addressing
  each byte of the sliding frame.

For n=4096 bytes   M(n)=28920 bytes
    n=32768        M(n)=178936
    n=65536        M(n)=363256
    n=1048576      M(n)=6695672

(A small program evaluating this formula is given at the end of this
document.)

Property:
                 _____           _____
       Data     |     |   Code  |     |    Data
      --------->| ACB |-------->| ACB |--------->
       Time1    |_____|         |_____|    Time2

                      Time1 = Time2

The first compressed byte generates code sufficient for the restoration
of this byte (on_the_fly). On a Pentium-100 the speed of the coder is
40000 bits/sec; if the transmission speed of the communication line is
<= 40000 b/s, then the working time of ACB does not influence the total
transmission time.

---------------------------------------------------------------------------
APPENDIX_B: Here are the results comparing ACB v1.14a against LZP4 and
the best PPM realizations, on the "Calgary Corpus" (W.J. Teahan,
"Probability estimation for PPM", 1995). LZP4, PPM3 and PPM*4 are
research versions. Results are in bits per character.

            ACB_1.14a   LZP4    PPM3    PPM*4
  bib       1.976       1.915   1.86    1.86
  book1     2.366       2.350   2.30    2.41
  book2     1.983       2.011   1.96    2.00
  geo       4.761       4.740   4.63    4.78
  news      2.373       2.350   2.35    2.37
  obj1      3.557       3.744   3.71    3.83
  obj2      2.299       2.388   2.38    2.31
  paper1    2.386       2.378   2.33    2.33
  paper2    2.377       2.389   2.32    2.34
  pic       0.784       0.814   0.80    0.84
  progc     2.371       2.392   2.36    2.34
  progl     1.547       1.589   1.68    1.61
  progp     1.536       1.585   1.70    1.55
  trans     1.323       1.343   1.47    1.39
            -----       -----   -----   -----
  avg       2.260       2.291   2.28    2.28

---------------------------------------------------------------------------
APPENDIX_C: Here are the results comparing ACB v1.17a against ACB v1.14a
(i486DX-75, 16 Mb RAM, 256 Kb cache).

///////////////// TXT ////////////////////////
Non-formatted text, 10 files of parts of fiction - 2359210 bytes:

  ................
  JELYAZ02 TXT    485 162   09.09.91  17:35
  LKNKPLNT TXT    357 680   25.11.95  14:17
  VAGNER01 TXT    312 153   10.03.94  22:25
  STARWIND TXT    206 486   05.05.93  16:50
  ELEPHANT TXT    112 611   24.11.93   8:19
  ANDERS07 TXT    560 582   21.11.93   2:04
  BSELIKS  TXT     86 491   02.04.91  15:13
  WARRIOR  TXT    237 851   16.01.94  23:41
       10 file(s)   2 359 016 bytes
  ................

                      Size (bytes)   Time (sec.)
  ACB_1.14a           724821         473
  ACB_1.17a (NORMAL)  711285         441
  ACB_1.17a (MAX)     703878         663
  ACB_1.17a (FAST)    758915         241
  PKZIP_2.04g (-ex)   1005277        35

///////////////// EXE ////////////////////////
WINWORD.EXE (MS_WINWORD v.6.0_rus) - 3483156 bytes

                      Size (bytes)   Time (sec.)
  ACB_1.14a           1753719        715
  ACB_1.17a (NORMAL)  1706586        725
  ACB_1.17a (MAX)     1686622        1084
  ACB_1.17a (FAST)    1784174        388
  PKZIP_2.04g (-ex)   2031837        50

//////////////////////////////////////////////
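Finally, as promised in APPENDIX_A, a few lines of C evaluate the memory
formula and reproduce its table exactly. For the two real frame sizes,
which are not powers of two, taking Log2(n) as an exact logarithm is an
assumption of this sketch; an actual pointer size would be rounded up to
whole bits.

    /* Evaluate APPENDIX_A's memory formula:
     *   M(n) = n + n/8 + n*(Log2(n)+1)/4 + 11000
     * The four table entries above are reproduced exactly. */
    #include <stdio.h>
    #include <math.h>

    static double M(double n)
    {
        return n + n / 8.0 + n * (log2(n) + 1.0) / 4.0 + 11000.0;
    }

    int main(void)
    {
        const double n[] = { 4096, 32768, 65536, 1048576,
                             822000   /* ACB_1.14a frame */,
                             1671000  /* ACB_1.17a frame */ };
        for (int i = 0; i < 6; i++)
            printf("n = %8.0f   M(n) = %8.0f bytes\n", n[i], M(n[i]));
        return 0;
    }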