$Header: /usr/people/tcl/src/uutar/RCS/README,v 1.3 1993/09/12 00:40:52 tcl Exp $

-----------
What are encode/decode?

Encode and decode are utilities which encode binary data into a printable
format suitable for transmission via email, posting to usenet, etc.  They
are intended to replace the aging uuencode and uudecode.

-----------
Features:

Encode features a very flexible encoding scheme which allows the user to
specify exactly which printable characters to use in the output.  The
default is to use all 95 printable characters, as this produces the least
expansion of the input data.  However, for cases such as file transfer to a
mainframe or to a foreign country, where some characters may be modified en
route, those characters can simply be removed from the output character
set.  Encoding is possible with as few as 2 characters in the output
character set.

Regardless of how many characters are specified in the output character
set, encode expands the data by a factor very close to the theoretical
limit for that number of characters (see the next section).

My implementation is simple (less than 500 lines total without comments)
and efficient (it runs at a speed comparable to uuencode/uudecode).

-----------
Some theory on file expansion during encoding:

The number of bits required to encode n distinct values is log2(n) (log
base 2 of n).  For example, to encode 256 distinct values, you need
log2(256) = 8 bits.

Think of the input file before encoding as a raw stream of bits without
byte boundaries.  If we represent this data with 256 distinct characters,
we consume 8 bits of the input bitstream per output character.  This is how
files are normally encoded.  However, if we can't use all 256 output
characters, we consume fewer than 8 input bits per output character, and so
we need more output characters to represent the input bitstream than if we
had all 256.  The process of encoding a binary file in printable format
therefore necessarily expands the file.

For example, if we use the 95 printable characters, we consume an average
of log2(95) = 6.57 bits of the input stream per output character, so the
file is expanded by a factor of log2(256)/log2(95) = log(256)/log(95) =
1.217, or 21.7%.  This is a theoretical figure -- in practice we can't
subdivide bits -- but it gives the smallest expansion we can hope for with
n output characters.  Some coding schemes may do better in selected cases,
but over a very large sample space of random data, no encoding scheme
should ever beat this theoretical limit.

Uuencode maps 3 input bytes to 4 output characters, for an expansion of 33%
(not counting control information).  Lately, several encoding schemes which
map 4 input bytes to 5 output characters have appeared, for an expansion of
25%.

An analysis of encode shows that its average expansion over a very large
input file of random data is 8 / (pb - 2 + 2n/p), where n is the number of
output characters, p is the smallest power of 2 greater than or equal to n,
and pb is log2(p), the number of bits needed to represent p values.  A
graph of this function for values of n from 2 to 256 shows a very close
approximation of the theoretical expansion log(256)/log(n).  For example,
for n = 95 the expansion factor is 8 / (7 - 2 + 2*95/128) = 1.234, or
23.4%.

Note that all of the expansion factors given above ignore the newline
characters added to limit the output line width.
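As a quick check of the formula above, the following throwaway program (an
illustrative sketch only, not part of the encode/decode distribution; the
file and variable names are arbitrary) computes both the theoretical
expansion log(256)/log(n) and the formula 8 / (pb - 2 + 2n/p) so that the
two can be compared:

    /* expansion.c -- compare the theoretical expansion factor
     * log(256)/log(n) with encode's average expansion 8/(pb - 2 + 2n/p). */
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        int n, p, pb;
        double theory, actual;

        for (n = 2; n <= 256; n++) {
            for (p = 1, pb = 0; p < n; p <<= 1)   /* smallest 2^pb >= n */
                pb++;
            theory = log(256.0) / log((double) n);
            actual = 8.0 / (pb - 2 + 2.0 * n / p);
            if (n == 64 || n == 95 || n == 256)
                printf("n = %3d: theoretical %.3f, encode %.3f\n",
                       n, theory, actual);
        }
        return 0;
    }

For n = 64 (the uuencode character set) both figures come out to 1.333; for
n = 95 the theoretical figure is 1.218 while the formula gives 1.234,
matching the numbers quoted above.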
-----------
The encoding process:

The encoding process used by encode is simply to throw away the byte
boundaries in the input bitstream and insert new byte boundaries in such a
manner that there are only n distinct "tokens" in the stream, where n is
the number of output characters.  These tokens can then be mapped
one-to-one onto the output characters, both during encoding and decoding.

A good example of this process is uuencode, which discards the byte
boundaries that occur every 8 bits and inserts byte boundaries every 6
bits.  The result is a series of tokens with at most 64 possible values,
each of which is mapped one-to-one onto the output character set of 64
printable characters.

This process is trivial for any n which is a power of two: you simply
insert byte boundaries every log2(n) bits.  When n is not a power of 2,
however, the process is somewhat more complicated.  We can no longer insert
the byte boundaries at regular intervals of b bits, since that would imply
2^b output characters.  If we select b such that 2^b < n, we aren't using
all n output characters, and we expand the file more than necessary.  On
the other hand, if we select b such that 2^b > n, we don't have enough
output characters to encode the data.  The solution is to start with the
smallest b such that 2^b >= n and then eliminate some of the input tokens
until exactly n of them remain; these can then be mapped one-to-one onto
the output characters.

Input tokens are eliminated by combining two of them into a single, shorter
token.  This is best explained with an example.  Say we have 6 output
characters.  We start with 8 input tokens:

    000,001,010,011,100,101,110,111

This set of tokens has the property that any input bitstream can be broken
down into a series of these tokens in exactly one way.  Now let's combine
two of the tokens.  The tokens to be combined must have identical bits
except for the last bit, and the process of combining strips that bit from
the tokens.  For example, 110 and 111 can be combined into the token 11,
giving the token set

    000,001,010,011,100,101,11

If we combine two more tokens, 100 and 101 -> 10, we get

    000,001,010,011,10,11

This token set still has the property that any input bitstream can be
broken down into a series of these tokens in exactly one way, and since
there are 6 of them, we can map them one-to-one onto the output character
set.

The standard for the generation of these tokens will be as follows: Start
with 2^b distinct tokens of length b bits, where b is the smallest integer
such that 2^b >= n and n is the number of output characters.  Then, as
above, while there are more than n tokens in total, replace the two
numerically greatest b-length tokens with a single (b-1)-length token equal
to the b-1 most significant bits of either b-length token.  (It is asserted
that at any point in this procedure, the two numerically greatest b-length
tokens differ only in the least significant bit.)

The standard for the one-to-one mapping between tokens and output
characters will be as follows: tokens will be sorted such that all b-length
tokens come first, in numerical order, followed by all (b-1)-length tokens,
in numerical order.  Output characters will be sorted by ASCII code in
numerical order.  A one-to-one mapping will then be established between
these two sets.
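To make the token-generation and ordering standard concrete, the following
sketch (again illustrative only, not part of the encode/decode
distribution) prints the standard token set for a given n.  It relies on
the observation that after all combining steps, the surviving b-bit tokens
are the values 0 through 2n-2^b-1 and the (b-1)-bit tokens are the values
n-2^(b-1) through 2^(b-1)-1:

    /* tokens.c -- print the standard token set for n output characters:
     * all b-bit tokens in numerical order, then all (b-1)-bit tokens in
     * numerical order. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int n = (argc > 1) ? atoi(argv[1]) : 6;   /* output character count */
        int p, b, i, j;

        for (p = 1, b = 0; p < n; p <<= 1)        /* smallest 2^b >= n */
            b++;

        for (i = 0; i < 2 * n - p; i++) {         /* b-bit tokens */
            for (j = b - 1; j >= 0; j--)
                putchar('0' + ((i >> j) & 1));
            putchar('\n');
        }
        for (i = n - p / 2; i < p / 2; i++) {     /* (b-1)-bit tokens */
            for (j = b - 2; j >= 0; j--)
                putchar('0' + ((i >> j) & 1));
            putchar('\n');
        }
        return 0;
    }

For n = 6 it prints 000, 001, 010, 011, 10, 11 -- the same token set
derived by hand above.  Pairing these tokens, in this order, with the
output characters sorted by ASCII code gives the standard mapping.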
The standard for the checksum will be as follows: The checksum will be
computed on the decoded data and will be 32 bits wide.  For each character
read from the input file during encoding, or written to the output file
during decoding, the checksum will first be rolled 7 bits to the left (the
7 bits which slide off the MSB end are reinserted at the LSB end), and then
the character will be xor'd onto the low order 8 bits of the checksum.

-----------
Implementation:

Decoding with this scheme is trivial: simply map each printable character
from the input to the corresponding variable length token, and append that
token to the decoded bitstream.

Encoding is a bit more tricky, since the token length is variable and the
input bitstream contains no token boundaries.  The solution is to set up a
256 element array indexed by the next 8 bits of the input bitstream.  Note
that these 8 bits are not necessarily byte-aligned in the input file.  The
indexed element of the array indicates how many bits to consume from the
input and which printable character to append to the output.  For example,
to recognize the token 010, every element of the array whose index is
010xxxxx, for all xxxxx, is set up to indicate that 3 bits were seen and to
give the printable character that maps to 010.  The input bitstream is then
advanced by 3 bits and the operation is repeated, using the next 8 bits to
index the array again.

My implementation of this encoding process is fairly simplistic and
incorporates no more than the basic functionality provided by
uuencode/uudecode.  It is intended primarily to introduce this encoding
scheme to the public in the hope that it will be widely adopted.  Should
such adoption occur, this file should be used as a standard reference for
the encoding algorithm.
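The checksum rule above can be written out directly.  The sketch below is
illustrative rather than taken from the encode/decode sources; in
particular, the text does not state an initial checksum value, so starting
at 0 is only an assumption:

    /* checksum.c -- the per-character checksum step described above:
     * roll the 32-bit checksum left by 7 bits, then xor the character
     * onto its low order 8 bits. */
    #include <stdio.h>

    unsigned long checksum_byte(unsigned long sum, unsigned char c)
    {
        sum &= 0xffffffffUL;                              /* keep 32 bits */
        sum = ((sum << 7) | (sum >> 25)) & 0xffffffffUL;  /* roll left 7  */
        return sum ^ c;                                   /* xor low 8    */
    }

    int main(void)
    {
        unsigned long sum = 0;    /* assumed initial value */
        int c;

        while ((c = getchar()) != EOF)
            sum = checksum_byte(sum, (unsigned char) c);
        printf("checksum: %08lx\n", sum);
        return 0;
    }

Likewise, the 256 element encoding array described above can be built from
the standard token set.  The structure and names below are illustrative and
do not come from the encode/decode sources; charset is assumed to already
hold the n output characters sorted by ASCII code:

    /* Build the encoding lookup table: for every possible value of the
     * next 8 input bits, record how many bits the matching token consumes
     * and which printable character it maps to. */
    struct entry {
        int  nbits;     /* bits consumed from the input stream */
        char out;       /* printable character to emit         */
    };

    void build_table(struct entry table[256], const char *charset, int n)
    {
        int p, b, tok, fill, i;

        for (p = 1, b = 0; p < n; p <<= 1)      /* smallest 2^b >= n */
            b++;

        /* b-bit tokens: values 0 .. 2n-p-1, characters 0 .. 2n-p-1 */
        for (tok = 0; tok < 2 * n - p; tok++)
            for (fill = 0; fill < (1 << (8 - b)); fill++) {
                i = (tok << (8 - b)) | fill;
                table[i].nbits = b;
                table[i].out = charset[tok];
            }

        /* (b-1)-bit tokens: values n-p/2 .. p/2-1, characters 2n-p .. n-1 */
        for (tok = n - p / 2; tok < p / 2; tok++)
            for (fill = 0; fill < (1 << (9 - b)); fill++) {
                i = (tok << (9 - b)) | fill;
                table[i].nbits = b - 1;
                table[i].out = charset[tok + n - p / 2];
            }
    }

During encoding, the next 8 bits of the input (not necessarily byte
aligned) index this table; the indicated character is emitted and the
bitstream is advanced by nbits, exactly as described above.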