7. BACKGRND: concepts, program design, compressor design, benchmarks
====================================================================

Although many aspects of UC look simple and straightforward, UC is a
very advanced program. This document contains the following
paragraphs:

- A. Inside UC
- B. Compression technology
- C. Damage protection technology
- D. Benchmarking

To jump to a certain paragraph, press the corresponding letter. To
jump to a specific document/chapter, press the corresponding digit.
See chapter 1 paragraph E for an overview of all documentation.


7.A INSIDE UC.
==============

This paragraph gives a rough overview of the internals of UC. It can
give those who are interested more insight into what is going on
INSIDE UltraCompressor II (tm). If you want more detailed technical
information (e.g. for making add-on tools) you can always contact
us. Please note, however, that we do not intend to supply the source
code of UC to third parties. Sample add-on tools (including their
source code) are available on request.

UC is programmed in C++ and assembly language (for the time critical
sections). It is divided into a number of modules, which are briefly
described in this section.

Overview
--------

  GENERAL CONTROL        main, command interpreter, super manager,
                         archive I/O
  SYSTEM SERVICES        video, lowlevel I/O, memory manager,
                         virtual memory manager
  COMPRESSION SERVICES   neuro manager, ultra engine
  RELIABILITY SERVICES   error handler, damage protection, guard,
                         test
  OTHER                  lister, viewer

General control
---------------

MAIN
The main module calls all initialization routines and takes care of
the configuration and help menus.

COMMAND INTERPRETER
The command line and script files are interpreted in this module.

SUPER MANAGER
Finds out what should happen (e.g. ask the command interpreter, scan
the disk and the archive), and makes it happen (call the
datacompressor etc.). This is the most complex module of UC, but it
delegates much of the hard work to other modules.

ARCHIVE I/O
This module manages the archive file format. It converts between the
external format (a compressed flat file) and the internal format (an
object database), and performs all kinds of integrity checks.

System services
---------------

VIDEO
The keyboard, screen and output file (in case of redirection) are
managed here.

LOWLEVEL I/O
Low-level file specifics and networking are handled here. This
module also takes care of file caching, which is essential in making
UC perform fast on (floppy) disk and network access.

MEMORY MANAGER (MM)
This module manages base memory, XMS and EMS. The MM decides which
component of the program gets memory, and how much, to optimize
overall performance.

VIRTUAL MEMORY MANAGER (VMM)
This module manages the 'persistent object database' containing the
central directory of an archive and the numerous links needed to
make processing as efficient as possible. No matter how much data
the VMM holds, it can always work (fast) with only a small amount of
memory. The VMM uses the memory manager to optimize its performance
wherever possible. The VMM also supplies pipeline services for
decoupling translation (archive I/O) and compression logic.

Compression services
--------------------

NEURO MANAGER
This component makes 'neuro models' of collections of files. These
neuro models allow UC to use similarities between different files to
compress all files even further. The analysis phase of UC refers to
the creation of neuro models based on the files found. The
optimizing phase (which is different from the optimize command)
refers to the compression of the neuro models. One special neuro
model (the 'master') is built into the compressor. This model is
used to improve compression of the neuro models. The master contains
information about the general structure of common file types such as
documents, spreadsheets, databases, sources etc.

ULTRA ENGINE
The bare high speed compressor/decompressor. Based on a neuro model,
the redundancy of a stream of data is analyzed and compressed.
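The exact workings of neuro models are internal to UC, but the
general idea of giving a compressor prior knowledge about the data
can be illustrated with the preset dictionary feature of the zlib
library (a loose analogy only; UC uses its own engine, and the
dictionary contents below are made up for illustration):

    /* Analogy to neuro models: prime a compressor with strings that
       are expected to occur in the data (zlib preset dictionary). */
    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>

    int main(void)
    {
        const unsigned char dict[]  = "paper document chapter figure";
        const unsigned char input[] = "this paper has one chapter";
        unsigned char out[256];
        z_stream zs;

        memset(&zs, 0, sizeof zs);
        deflateInit(&zs, Z_BEST_COMPRESSION);

        /* give the engine prior knowledge before it sees any data */
        deflateSetDictionary(&zs, dict, (uInt)(sizeof dict - 1));

        zs.next_in   = (unsigned char *)input;
        zs.avail_in  = (uInt)(sizeof input - 1);
        zs.next_out  = out;
        zs.avail_out = sizeof out;
        deflate(&zs, Z_FINISH);

        printf("%u bytes in, %lu bytes out\n",
               (unsigned)(sizeof input - 1), zs.total_out);

        deflateEnd(&zs);
        return 0;
    }

Just as the decompressor needs the same preset dictionary to undo
this, the models have to be available again at decompression time;
the built-in 'master' travels inside the executable itself.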
Reliability services
--------------------

ERROR HANDLER
The error handler traps DOS system errors and handles them in a neat
and consistent way. It often allows the user to go to DOS, solve the
problem, leave DOS and continue where UC left off. The error handler
makes it impossible for the user to specify 'ignore' (an unfortunate
option DOS offers when something goes wrong).

DAMAGE PROTECTION
This module handles damage protection, damage detection and damage
recovery.

GUARD
The guard continuously checks the internal integrity of UC. It is
often able to prevent damage caused by e.g. software conflicts or
hardware problems.

TEST (only available in beta versions)
Tests and interrogates various components of UC during execution.

Other
-----

LISTER
Shows the contents of an archive in various formats.

VIEWER
Online help viewer.


7.B COMPRESSION TECHNOLOGY.
===========================

UC significantly outperforms existing datacompression technology.
For those interested in background information, we will discuss WHY
and HOW UC is capable of doing this.

AIP-NL would like to specifically thank the University of Utrecht,
Drs. Ing. D. Bezemer, Prof. Dr. J. van Leeuwen, Dr. P. van Oostrum
and Drs. Ing. N.E. de Vries for their significant participation in
the design of the UC compression engine. The detailed design is
available on request at the public library of the University of
Utrecht.

To begin, we will give a brief explanation of AR002. Most modern
datacompressors (ARJ, LHA, ZIP, ZOO) are derivatives of AR002, which
was designed by Haruhiko Okumura.

AR002, a two phase compressor
-----------------------------

  DATA <-> (#1, LZ) <-> (grouping) <-> (#2, Huffman) <-> COMPRESSED

In phase #1 (the 'Lempel-Ziv' phase) 'matches' are located. Matches
are strings of data which are equal. The string 'hello there hello'
can be converted to 'hello there '(-12,5), since 'hello' is matched
12 positions back. In general ARJ, LHA, ZIP and ZOO can find matches
with a maximum distance of about 32700 and a maximum length of about
500 (rough estimates).

In phase #2 (the 'Huffman' phase) the output of phase #1 is divided
into blocks. For each block statistics are analyzed (for example,
the compressor notices that the character 'e' is used much more
often than 'q'). From this analysis an optimal translation table (a
'Huffman tree') is generated (e.g. 'e'=001, 'q'=100100010). The
compressed block then consists of the Huffman tree and the
translated contents of the block. The storage of the Huffman tree,
at the beginning of each block, is a complicated matter which we
will not cover here.

The number of different values phase #1 can generate is very large
(e.g. all 256 possible values of a byte, plus about 32700 * 500
different match descriptions). Since it is very inefficient to make
a Huffman tree for all those possible values, the values are
grouped, and statistics are gathered for the groups instead of the
individual values. LHA, ZOO, ARJ and ZIP all use various methods to
realize this grouping.
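To make the two phases concrete, here is a toy C program (an
illustration only, not AR002 code; real compressors use hash chains
or trees instead of this brute force match search). Phase #1
reproduces the 'hello there '(-12,5) example from above, and the
statistics it gathers are what phase #2 would turn into a Huffman
tree:

    /* Toy two phase compressor front end, for illustration only. */
    #include <stdio.h>
    #include <string.h>

    #define MIN_MATCH 3   /* shorter matches are not worth encoding */

    int main(void)
    {
        const char *data = "hello there hello";
        int  n = (int)strlen(data);
        long freq[256] = {0};          /* phase #2 block statistics */

        for (int i = 0; i < n; ) {
            int best_len = 0, best_dist = 0;

            /* phase #1: search backwards for the longest match */
            for (int j = 0; j < i; j++) {
                int len = 0;
                while (i + len < n && data[j + len] == data[i + len])
                    len++;
                if (len > best_len) {
                    best_len  = len;
                    best_dist = i - j;
                }
            }
            if (best_len >= MIN_MATCH) {
                printf("(-%d,%d)", best_dist, best_len);
                i += best_len;   /* a real coder would also gather
                                    statistics for match groups here */
            } else {
                putchar(data[i]);
                freq[(unsigned char)data[i]]++;   /* count literals */
                i++;
            }
        }
        putchar('\n');

        /* phase #2 would now build a Huffman tree from freq[], so
           that frequent values get short codes; here we only print
           the gathered statistics */
        for (int c = 0; c < 256; c++)
            if (freq[c])
                printf("'%c' occurs %ld time(s)\n", c, freq[c]);
        return 0;
    }

The program prints 'hello there (-12,5)': the second 'hello' is
replaced by a reference to a match 12 positions back with length 5,
exactly as in the example above.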
UltraEngine
-----------

The UltraEngine is also an AR002 derivative, but with some
fundamental enhancements. These enhancements can be divided into two
major features: the core engine and the neuro manager.

The neuro manager is the most significant enhancement. Ordinary
datacompression engines always start completely 'blank', knowing
absolutely nothing about the data they are going to compress. The
neuro manager collects generic information about a group of files.
For example, a collection of *.DOC files will all have similar
characteristics, and share common text strings like 'paper'. The
neuro manager collects this common 'knowledge' in a neuro model.
This neuro model is used to make the core engine much smarter when
similar files are compressed.

UC also has a neuro model (the 'master') built into the executable.
This master contains information about common file formats: it
contains, for example, information about spreadsheets, databases,
word processors, programming languages, and common words in English,
French, German and Dutch. This built-in information improves
compression.

The core engine also has some enhancements over the basic AR002
engine. The match engine has been replaced by a completely new one,
capable of finding matches at a distance of up to 64000 and with a
maximum length of 30000. The way in which blocks of data are
gathered for further Huffman compression is also completely
different: instead of looking at one block at a time, the core
engine attempts to 'learn' from previous blocks, so less information
is needed to encode the statistical information of a block.

Alternative
-----------

A collection of files can also be compressed further with the often
used TAR+COMPRESS 'trick': first collect all files in a single large
file, and then compress this single large file at once. Especially
for large collections of small files, this improves compression
significantly. The same effect can be achieved with ARJ, by first
storing the files with ARJ -m0, and then compressing the resulting
archive again with ARJ -jm1.

It should be noted, however, that the UltraEngine approach works
much better. First of all, the compression is better: the neuro
manager is much smarter than a simple concatenation. This can be
verified by compressing the output of ARJ -m0 with UC -tt. Also, the
core engine is fine-tuned for optimal performance in cooperation
with neuro models; the 64000 byte match distance is one of the
essential enhancements.

Another reason why the UC approach is much better is random access.
Try to delete a single file from a collection of 5000 C files when
ARJ -m0/-jm1 has been used, or try to add or extract SOME files. UC
will usually do this much faster than ordinary archivers (a result
of better caching and a better file format), and remarkably faster
than a collect/compress archiver.


7.C DAMAGE PROTECTION TECHNOLOGY.
=================================

When an archive is 'damage protected' with the P command, it becomes
resistant to a certain amount of damage. This paragraph explains how
UC achieves this.

We will start with a mathematical operator called 'XOR'. XOR has the
following properties:

  0 XOR 0 = 0     0 XOR 1 = 1     1 XOR 0 = 1     1 XOR 1 = 0

  if   a1 XOR a2 XOR a3 XOR a4 XOR a5 = xx
  then a1 XOR xx XOR a3 XOR a4 XOR a5 = a2
  and  a1 XOR a2 XOR a3 XOR xx XOR a5 = a4
  etc.

This shows exactly the essence of damage protection: a1..a5 are the
bare data, and xx is special (damage recovery) information which can
be used to restore bare data if it somehow has been destroyed.
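A minimal sketch of this recovery principle in C (an illustration
only, using tiny 8 byte sectors; UC's actual recovery data layout is
more involved, as described below):

    /* XOR based damage recovery sketch; not UC's actual format. */
    #include <stdio.h>
    #include <string.h>

    #define SECTOR  8    /* demo size only */
    #define SECTORS 5

    int main(void)
    {
        unsigned char a[SECTORS][SECTOR] = {
            "sector1", "sector2", "sector3", "sector4", "sector5"
        };
        unsigned char xx[SECTOR] = {0};

        /* protect: xx = a1 XOR a2 XOR a3 XOR a4 XOR a5 */
        for (int s = 0; s < SECTORS; s++)
            for (int i = 0; i < SECTOR; i++)
                xx[i] ^= a[s][i];

        /* damage: a2 is destroyed (in UC, checksums in the recovery
           information tell which sector is damaged) */
        memset(a[1], 0, SECTOR);

        /* recover: a2 = a1 XOR xx XOR a3 XOR a4 XOR a5 */
        memcpy(a[1], xx, SECTOR);
        for (int s = 0; s < SECTORS; s++)
            if (s != 1)
                for (int i = 0; i < SECTOR; i++)
                    a[1][i] ^= a[s][i];

        printf("recovered: %s\n", (char *)a[1]);  /* "sector2" */
        return 0;
    }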
In practice UC XORs groups of 512 bytes (sectors) instead of single
bits. To determine whether a specific sector is correct or damaged,
a checksum of each sector is included in the damage recovery
information.

About 1% of damage protection information is added to an archive.
For short files this is a single recovery sector (meaning up to one
damaged sector can be recovered). For longer archives multiple
damage recovery sectors are added, for example one for the odd and
one for the even sectors.


7.D BENCHMARKING.
=================

Benchmarking is a very complicated matter. It is always possible to
create a legitimate looking benchmark which makes product A or
product B look better; a nice example of this can be seen by just
looking at some ads. Some products are even optimized for running
benchmarks, an unfortunate habit which makes it even more difficult
to compare products.

We have made a successful attempt to optimize UC for daily 'real
life' use. This means it compresses fast and it decompresses very
fast. But when archives are used intensively, the bare
compression/decompression speed is only one aspect of the whole
picture. Just as important is how efficiently updates are performed.
Especially when 'the going gets tough', for example when a large
archive resides on a floppy disk, or when the archive is on a
network, UC can be much faster than other products.

An unfortunate problem in benchmarking archivers is the 'Calgary
Compression Corpus', which is often used to compare datacompressors.
UC does noticeably better than other compressors on this test, yet
the real power of the compression engine is not used in this
unrealistic test. When collections of 'real life' files (e.g. a
large collection of documents, or sources, or the average archive
one finds on a BBS) are compressed, the difference between UC and
other products becomes much more significant.

Of course, although performance is very important, it is only one
aspect of the usability of a product. In general it is our opinion
that you, the user, are the best person to perform benchmarks. Try
the product the way you would actually USE it and see for yourself.