ACB, DMC, PPM, LZ - CLASSIFICATION, IDEA, COMPARISON

George Buyanovsky

1. Introduction

This article provides a brief overview of existing technologies for data
compression and describes the place taken by ACB compression among them.
The overview discusses only the technologies for building compression
models: LZ, PPM, DMC, ACB; coding is not discussed.

LZ has two basic modifications, LZ77 and LZ78, along with a large number
of variations. At present LZ models are widely used (zip, rar, arj, lzh,
lzw, ...). The theoretical approach for LZ models was suggested in 1977
by A. Lempel and J. Ziv (LZ), with software realizations appearing in the
early eighties.

PPM, or context modelling, is based on theoretical work conducted in
1986-95 (T.C. Bell, J.G. Cleary, I.H. Witten, W.J. Teahan, A. Moffat, ...).
Software realizations came about in the early nineties (HA, X1, ...).

DMC - Several DMC (Dynamic Markov Coding) algorithms are known, in
particular Ross Williams' DHPC algorithm, as well as the DMC algorithm of
Cormack & Horspool, which uses many heuristic methods (as PPM does). I am
not aware of industrial archivers using the DMC technology.

2. LZ, PPM, ACB <--> DMC

When two persons speak with one another, they have to formulate their
ideas in detail at the beginning of their conversation, but the longer
their conversation lasts, the less detail is required. Why does this
happen? Many ideas have already been expressed and are understood by the
conversants without detail.

2.1 LZ-compression

LZ-compression substitutes references to a dictionary for the initial
text; at first sight this scheme resembles the methodology used by the two
interlocutors. However, with the growth of the dictionary, the number of
bits necessary for formulating a reference grows proportionally to the
binary logarithm of the size of the dictionary. At the beginning, the
length of the phrases in the dictionary (in the simplest case, a binary
tree) grows considerably faster than the length of the references, but
after some saturation point the speed of their growth asymptotically
tends to a logarithmic dependence. The optimal size of the dictionary
varies for different types of data: the more variable the data, the
smaller the optimal size of the dictionary.

2.2 PPM - algorithms (context modelling)

The idea of context modelling is based on the fact that the probability
distribution over the alphabet depends on the nearest context; in other
words, letters are more likely to appear in particular patterns, that is,
next to or near other particular letters. For this technology there also
exist principal restrictions on the increase of the sliding frame (the
window of recent data over which the context model is built). A small
sliding frame with a correspondingly meagre context model is optimal on
variable data, for which the problem of zero frequency is especially
acute. As the size of the sliding frame increases, more and more varied
information is processed (e.g., a text in French, then a text in German
and then a text in Russian enters this frame), with the result that the
probability distributions widen and the effectiveness of context
modelling (with short contexts) decreases quickly. With the increase of
the context length, costs also increase exponentially. The transition to
context-mixing models has improved the situation to some extent; however,
the absence of theoretically substantiated schemes for mixing
probabilities is compensated by a large number of heuristics, which takes
us back to the times of alchemy.
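To make the context-modelling idea concrete, here is a minimal sketch in
C of an order-1 model with a PPMC-style escape to order 0. It is an
illustration only, not the scheme of any particular archiver: real PPM
adds higher orders, exclusions and an arithmetic coder, all omitted here.

    /* Minimal order-1 context model with a PPMC-style escape to
     * order 0.  Feeds ideal code lengths to a counter instead of
     * an arithmetic coder. */
    #include <stdio.h>
    #include <math.h>

    static unsigned o1[256][256]; /* o1[c][s]: count of s after context c */
    static unsigned o0[256];      /* order-0 fallback counts */

    /* Estimated probability of symbol s appearing after context c. */
    static double predict(int c, int s)
    {
        unsigned total = 0, distinct = 0;
        for (int i = 0; i < 256; i++) {
            total += o1[c][i];
            if (o1[c][i]) distinct++;
        }
        if (o1[c][s])                        /* seen in this context */
            return (double)o1[c][s] / (total + distinct);
        /* escape (PPMC: escape count = number of distinct symbols),
         * then fall back to a Laplace-smoothed order-0 model */
        double pesc = total ? (double)distinct / (total + distinct) : 1.0;
        unsigned t0 = 0;
        for (int i = 0; i < 256; i++) t0 += o0[i];
        return pesc * (o0[s] + 1.0) / (t0 + 256.0);
    }

    int main(void)
    {
        int prev = 0, ch;
        double bits = 0.0;
        while ((ch = getchar()) != EOF) {
            bits += -log2(predict(prev, ch)); /* ideal code length */
            o1[prev][ch]++; o0[ch]++;
            prev = ch;
        }
        printf("ideal output: %.0f bits\n", bits);
        return 0;
    }

Note how the zero-frequency problem mentioned above surfaces here as the
choice of pesc: nothing in the counts dictates that distinct/(total+distinct)
is the right escape estimate, which is exactly the kind of heuristic the
text complains about.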
2.3 ACB-compression

However, the brain of the interlocutors can in some mysterious fashion
overcome this barrier and use all the accumulated information, through a
mechanism called associative memory. The general principle of
ACB-compression is very simple: the ACB algorithm puts on glasses with an
associative filter and, looking at the past, sees not the whole frame but
only fragments which are close by context to the current context.
Fragments, or phrases, of the resulting picture differ much in quality:
phrases close by context stand out distinctly against the other phrases,
while less close phrases are less distinct or not distinct at all. In
this case the ACB algorithm has ideal conditions for work: it works more
quickly by considering larger units and, moreover, it can distinguish the
value of phrases by considering the probability of their appearance or
use. Even with an unlimited increase of the frame, ACB-compression always
ensures an optimal size of the context dictionary. For some details about
ACB-compression, see APPENDIX_A.

2.4 DMC - Dynamic Markov Coding

Probabilistic models with a finite number of states can be described by a
finite automaton. A set of states S(i) together with a set of
probabilities P(i,j) of transition from state i into state j is called a
Markov model. Dynamic Markov Coding (DMC) is of practical interest in
that it works adaptively, starting from a simple initial model and adding
new states as necessary. PPM technology is a particular case of the DMC
approach. DMC allows the construction of context models not only for
single symbols (as with PPM, which considers letters of the alphabet),
but also for phrases or lines. In this sense, ACB-compression can be
classified as a variety of the DMC approach. (A minimal code sketch of
DMC state cloning is given after the comparison tables below.) I am not
aware of software realizations of DMC algorithms; I would appreciate
receiving any information on this issue (especially software realizations
for the IBM_PC).

3. Comparison: ACB, PPM, LZ

ACB.EXE v1.14a & v1.17a works only in a "solid" mode, in which all files
being packed are lined up in a single solid stream; this allows the most
effective use of the advantages of a large frame. The least favourable
data for this solid mode are test data, since, as a rule, varied
information is put into a test catalogue. However, the high adaptivity of
ACB-compression reduces the possible loss to a minimum, and on real data
the solid mode gives a gain in compression in 99% of cases.

Below please find a comparative table (as test files, I have selected
those transmitted by modem):

Compared: ACB v1.14a, X1 v0.93e, RAR v1.54b, ZIP v2.04g
DATA    - Borland_C++_v_3.1 and GS_system_(convertor for *.ps)
          (exe,asm,c,cpp,h,hpp,obj,rtf,hlp,txt,doc,rc,res,bmp,lib,dll,
           wav,prj,ps,....)
Mode (Options):
  ACB - default ("TAUGHT CHANNEL" mode is not used)
  X1  - mode with max compression selected
        (m1/m3/m4/m1-Solid/m3-Solid/m4-Solid)
  RAR - all Solid, max compression
  ZIP - max compression

            ACB        PPM        LZ         LZ
            *.ACB      *.X        *.RAR      *.ZIP
  TLIB      78,232     124,950    197,383    219,503
  LIB       551,153    798,122    1,020,380  1,157,399
  BGI       80,441     91,310     88,763     111,801
  DOC       145,801    157,010    193,922    209,196
  OWL       813,594    1,306,961  1,306,723  1,569,437
  EXMPL     413,790    568,216    565,897    837,843
  GS        605,931    653,809    671,410    725,455
            ---------  ---------  ---------  ---------
  Total     2,688,942  3,700,378  4,044,478  4,830,634

Time on OWL:         Pentium-100   (486SX-33)
  ACB                232 sec.      1057 sec.
  X1 (m4-Solid)      190 sec.      710 sec.
  RAR                101 sec.      341 sec.
  ZIP                77 sec.       170 sec.
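As promised in section 2.4, here is a minimal sketch in C of the
state-cloning mechanism that lets DMC grow its Markov model adaptively.
The single-state start and the cloning thresholds are illustrative
assumptions of this sketch; Cormack & Horspool start from a richer
initial structure and tune these heuristically.

    /* Minimal sketch of DMC state cloning on a bit stream.  The model
     * starts from one state with self-loops; a target state is cloned
     * when one incoming transition dominates its traffic. */
    #include <stdio.h>
    #include <math.h>

    #define MAXSTATES 65536
    #define MIN_C1 2.0   /* traffic on (s,b) needed to justify a clone  */
    #define MIN_C2 2.0   /* traffic target must keep from other sources */

    typedef struct { int next[2]; double cnt[2]; } State;

    static State m[MAXSTATES];
    static int nstates = 1;

    /* Probability that the next bit is 1, as seen from state s. */
    static double p1(int s)
    { return m[s].cnt[1] / (m[s].cnt[0] + m[s].cnt[1]); }

    /* Record observed bit b in state s; clone if needed; return new state. */
    static int update(int s, int b)
    {
        int t = m[s].next[b];
        double tv = m[t].cnt[0] + m[t].cnt[1]; /* traffic through t */
        if (m[s].cnt[b] >= MIN_C1 && tv - m[s].cnt[b] >= MIN_C2
            && nstates < MAXSTATES) {
            int c = nstates++;                 /* clone t into c */
            double r = m[s].cnt[b] / tv;
            m[c] = m[t];                       /* copy outgoing arcs */
            m[c].cnt[0] *= r;      m[c].cnt[1] *= r;
            m[t].cnt[0] -= m[c].cnt[0];
            m[t].cnt[1] -= m[c].cnt[1];
            m[s].next[b] = c;                  /* redirect to the clone */
            t = c;
        }
        m[s].cnt[b] += 1.0;
        return t;
    }

    int main(void)
    {
        m[0].next[0] = m[0].next[1] = 0;       /* self-loops */
        m[0].cnt[0] = m[0].cnt[1] = 1.0;       /* no zero frequencies */
        int s = 0, ch;
        double bits = 0.0;
        while ((ch = getchar()) != EOF)
            for (int i = 7; i >= 0; i--) {
                int b = (ch >> i) & 1;
                /* p1(s) is what an entropy coder would be fed here */
                bits += -log2(b ? p1(s) : 1.0 - p1(s));
                s = update(s, b);
            }
        printf("%.0f ideal bits out, %d states\n", bits, nstates);
        return 0;
    }

The cloning step is where DMC "adds new states as necessary": a state
shared by many contexts is split so that the dominant context gets its
own copy with its own statistics.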
The non-linear nature of the speed gain in the table above, when
switching to a Pentium processor, is explained by three reasons:
1) Time costs for disk operations do not decrease substantially when
   switching to a Pentium; their share for ACB is very small.
2) Optimization of the code for the Pentium.
3) More effective caching of RAM, which is very critical for ACB.

The advantage of ACB-compression can be increased if ACB is tuned to some
type of data using the "TAUGHT CHANNEL" mode, having created in advance a
context from related information. Such a possibility is only a side
effect of this mode. Something similar occurs when choosing options in
order to obtain the maximum compression coefficient (external information
about the data type). This approach is good if one wants to set world
records, but it is of little practical application.

In APPENDIX_B, see the comparative table between the best research
algorithms and ACB (on the "Calgary Corpus").
In APPENDIX_C, see the comparative table between ACB_1.17a and ACB_1.14a.

4. "TAUGHT CHANNEL"

The idea: when compressing information transmitted over a communications
channel, all (ALL!!!) the information earlier transmitted through this
channel is remembered, so that any repetitions of data transmitted
through the channel in the past will be used by the compression
algorithm. (A sketch of this idea, using a preset dictionary, is given
after the conclusion below.)

Realization: The ACB-compression algorithm was implemented in the
archiver ACB.EXE for DOS/WIN95 (ACB_1.17a is the latest version); it was
fully written in Assembler (32-bit protected mode) and optimized for the
Pentium processor. It supports the TAUGHT CHANNEL mode.

Innovations: The ACB-compression algorithm is the only compression
algorithm whose compression coefficient asymptotically increases with the
growth of the sliding frame. Other algorithms can track not more than
8-65 Kb of the channel's background, depending on the type of data, as
compared with 822 Kb in ACB_1.14a and 1671 Kb in ACB_1.17a.

4.1 Testing with "TAUGHT CHANNEL" mode

The stream of data consists of two types of data:
- ARCHIVE COMPARISON TEST (A.C.T.), author JEFF GILCHRIST, for June,
  July, August, September, October;
- *.exe files, which are different versions of ACB.EXE.

Mode (Options):
  ACB - "TAUGHT CHANNEL" mode
  X1  - aem#
  RAR - all Solid, max compression
  ZIP - max compression

Stream: TXT --> EXE --> TXT --> EXE .... TXT --> EXE

                ACB      PPM      LZ       LZ
  Stream        *.ACB    *.X      *.RAR    *.ZIP
  of data       size     size     size     size
  A.C.T_6.95    10323    10314    12201    12199
  ACB_1.10      35605    36713    37150    37605
  A.C.T_7.95    3976     11039    13118    13121
  ACB_1.11      7903     36751    37172    37620
  A.C.T_8.95    2120     10847    12855    12849
  ACB_1.12      6963     36977    37381    37853
  A.C.T_9.95    1905     11004    12962    12949
  ACB_1.13      6123     37036    37402    37886
  A.C.T_10.95   1847     11381    13465    13480
  ACB_1.14      6873     36201    36410    36978
                -----    ------   ------   ------
  Total         83638    238263   250116   252540

  Time (sec.)   20       28       7        5

Testing was conducted on a Pentium-75.

5. Conclusion

The Pentium-100 encourages the creation of new algorithms, one of which
is ACB-compression. If you are working on the problem of data compression
for communications channels, ACB-compression is the most effective
algorithm for compressing a stream of varied information.
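Before the appendices, the sketch promised in section 4: an illustration
of the "taught channel" idea. ACB's own mechanism is internal to ACB.EXE
and is not reproduced here; as a stand-in, zlib's LZ77 coder is primed
with past traffic through its preset-dictionary call (the history and
message strings below are hypothetical).

    /* Sketch of a "taught channel": the coder is primed with data
     * previously sent, so repetitions of past traffic compress well.
     * zlib stands in for ACB here; note that zlib's window limits the
     * usable history to 32 Kb, whereas ACB_1.17a tracks 1671 Kb. */
    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>

    int main(void)
    {
        /* hypothetical past traffic and new message */
        static unsigned char history[] =
            "ARCHIVE COMPARISON TEST results for June ...";
        static unsigned char message[] =
            "ARCHIVE COMPARISON TEST results for July ...";
        unsigned char out[512];

        z_stream z;
        memset(&z, 0, sizeof z);
        deflateInit(&z, Z_BEST_COMPRESSION);
        /* "teach" the channel: preload the coder with past traffic */
        deflateSetDictionary(&z, history, sizeof history - 1);
        z.next_in   = message;
        z.avail_in  = sizeof message - 1;
        z.next_out  = out;
        z.avail_out = sizeof out;
        deflate(&z, Z_FINISH);
        printf("%u bytes in -> %lu bytes out\n",
               (unsigned)(sizeof message - 1), z.total_out);
        deflateEnd(&z);
        return 0;
    }

The decoder must be primed identically (inflateSetDictionary after
inflate reports Z_NEED_DICT), which mirrors the requirement that both
ends of a taught channel share the same transmission history.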
---------------------------------------------------------------------------
APPENDIX_A: Complexity of ACB-compression

Time:   T(n) = n/k * ( S + Log2( n/(2*k) ) ) + o(1)
Memory: M(n) = n + n/8 + n*(Log2(n)+1)/4 + 11000

n - sliding frame size (for ACB v1.14a n=822000 bytes,
    for v1.17a n=1671000 bytes),
S - context dictionary size; for ACB_1.14a: S=53,
    for ACB_1.17a: S=100 (MAX), S=55 (NORMAL), S=16 (FAST),
k - I/C, where I is the data size and C is the code size.
T(n)code = T(n)decode, M(n)code = M(n)decode.

Notes:
- Memory requirements can be decreased in proportion to the compression
  ratio k:
      M'(n) = n + n/8 + n*(Log2(n)+1)/(4*k) + 11000
  However, this possibility is not used in ACB.EXE, in order to simplify
  the memory manager.
- If S=1, the compression rate will be approximately the same as
  LZ-schemes achieve on small files (32-64 Kb). However, with the
  increase of their size, the advantage of ACB-compression grows (even at
  S=1), and all its valuable properties (on_the_fly operation,
  unrestricted growth of the sliding frame, ...) remain.
- Log2(n)+1 is the size of the pointer (in bits) necessary for addressing
  each byte of the sliding frame.

For n=4096 bytes   M(n)=28920 bytes
    n=32768        M(n)=178936
    n=65536        M(n)=363256
    n=1048576      M(n)=6695672

(A small program evaluating this formula is given at the end of this
document.)

Property:
                 _____           _____
       Data     |     |   Code  |     |    Data
      --------->| ACB |-------->| ACB |--------->
       Time1    |_____|         |_____|    Time2

                      Time1 = Time2

The first compressed byte generates code sufficient for the restoration
of this byte (on_the_fly). On a Pentium-100 the speed of the coder is
40000 bits/sec; if the transmission speed of the communication line is
<= 40000 b/s, then the working time of ACB does not influence the total
transmission time.

---------------------------------------------------------------------------
APPENDIX_B: Here are the results comparing ACB v1.14a against LZP4 and
the best PPM realizations, on the "Calgary Corpus" (W.J. Teahan,
"Probability estimation for PPM", 1995). LZP4, PPM3 and PPM*4 are
research versions. Results are in bits per character.

            ACB_1.14a   LZP4    PPM3    PPM*4
  bib       1.976       1.915   1.86    1.86
  book1     2.366       2.350   2.30    2.41
  book2     1.983       2.011   1.96    2.00
  geo       4.761       4.740   4.63    4.78
  news      2.373       2.350   2.35    2.37
  obj1      3.557       3.744   3.71    3.83
  obj2      2.299       2.388   2.38    2.31
  paper1    2.386       2.378   2.33    2.33
  paper2    2.377       2.389   2.32    2.34
  pic       0.784       0.814   0.80    0.84
  progc     2.371       2.392   2.36    2.34
  progl     1.547       1.589   1.68    1.61
  progp     1.536       1.585   1.70    1.55
  trans     1.323       1.343   1.47    1.39
            -----       -----   -----   -----
  avg       2.260       2.291   2.28    2.28

---------------------------------------------------------------------------
APPENDIX_C: Here are the results comparing ACB v1.17a against ACB v1.14a
(i486DX-75, 16 Mb RAM, 256 Kb cache).

///////////////// TXT ////////////////////////
Non-formatted text, 10 files of parts of fiction - 2359210 bytes:

  ................
  JELYAZ02 TXT    485 162   09.09.91  17:35
  LKNKPLNT TXT    357 680   25.11.95  14:17
  VAGNER01 TXT    312 153   10.03.94  22:25
  STARWIND TXT    206 486   05.05.93  16:50
  ELEPHANT TXT    112 611   24.11.93   8:19
  ANDERS07 TXT    560 582   21.11.93   2:04
  BSELIKS  TXT     86 491   02.04.91  15:13
  WARRIOR  TXT    237 851   16.01.94  23:41
       10 file(s)   2 359 016 bytes
  ................

                      Size (bytes)   Time (sec.)
  ACB_1.14a           724821         473
  ACB_1.17a (NORMAL)  711285         441
  ACB_1.17a (MAX)     703878         663
  ACB_1.17a (FAST)    758915         241
  PKZIP_2.04g (-ex)   1005277        35

///////////////// EXE ////////////////////////
WINWORD.EXE (MS_WINWORD v.6.0_rus) - 3483156 bytes

                      Size (bytes)   Time (sec.)
  ACB_1.14a           1753719        715
  ACB_1.17a (NORMAL)  1706586        725
  ACB_1.17a (MAX)     1686622        1084
  ACB_1.17a (FAST)    1784174        388
  PKZIP_2.04g (-ex)   2031837        50

//////////////////////////////////////////////
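Finally, as promised in APPENDIX_A, a few lines of C evaluate the memory
formula and reproduce its table exactly. For the two real frame sizes,
which are not powers of two, taking Log2(n) as an exact logarithm is an
assumption of this sketch; an actual pointer size would be rounded up to
whole bits.

    /* Evaluate APPENDIX_A's memory formula:
     *   M(n) = n + n/8 + n*(Log2(n)+1)/4 + 11000
     * The four table entries above are reproduced exactly. */
    #include <stdio.h>
    #include <math.h>

    static double M(double n)
    {
        return n + n / 8.0 + n * (log2(n) + 1.0) / 4.0 + 11000.0;
    }

    int main(void)
    {
        const double n[] = { 4096, 32768, 65536, 1048576,
                             822000   /* ACB_1.14a frame */,
                             1671000  /* ACB_1.17a frame */ };
        for (int i = 0; i < 6; i++)
            printf("n = %8.0f   M(n) = %8.0f bytes\n", n[i], M(n[i]));
        return 0;
    }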