$Header: /usr/people/tcl/src/uutar/RCS/README,v 1.3 1993/09/12 00:40:52 tcl Exp $

-----------
What are encode/decode?

Encode and decode are utilities which encode binary data into a printable
format suitable for transmission via email, posting to usenet, etc.  They
are intended to replace the aging uuencode and uudecode.

-----------
Features:

Encode features a very flexible encoding scheme which allows the user to
specify exactly which printable characters to use in the output.  The
default is to use all 95 printable characters, as this produces the least
expansion of the input data.  However, for cases such as file transfer to a
mainframe or to a foreign country, where some characters may be modified en
route, those characters can simply be removed from the output character
set.  Encoding is possible with as few as 2 characters in the output
character set.

Regardless of how many characters are specified in the output character
set, encode expands the data by a factor very close to the theoretical
limit for that number of characters (see the next section).

My implementation is simple (less than 500 lines total without comments)
and efficient (it runs at a speed comparable to uuencode/uudecode).

-----------
Some theory on file expansion during encoding:

The number of bits required to encode n distinct values is log2(n) (log
base 2 of n).  For example, to encode 256 distinct values, you need
log2(256) = 8 bits.

Think of the input file before encoding as a raw stream of bits without
byte boundaries.  If we represent this data with 256 distinct characters,
we consume 8 bits of the input bitstream per output character.  This is how
files are normally encoded.  However, if we can't use all 256 output
characters, we consume fewer than 8 input bits per output character, and so
we need more output characters to represent the input bitstream than if we
had all 256.  The process of encoding a binary file in printable format
therefore necessarily expands the file.

For example, if we use the 95 printable characters, we consume an average
of log2(95) = 6.57 bits of the input stream per output character, so the
file is expanded by a factor of log2(256)/log2(95) = log(256)/log(95) =
1.217, or 21.7%.  This is a theoretical figure -- in practice we can't
subdivide bits -- but it gives the smallest expansion we can hope for with
n output characters.  Some coding schemes may do better in selected cases,
but over a very large sample space of random data, no encoding scheme
should ever beat this theoretical limit.

Uuencode maps 3 input bytes to 4 output characters, for an expansion of 33%
(not counting control information).  Lately, several encoding schemes which
map 4 input bytes to 5 output characters have appeared, for an expansion of
25%.

An analysis of encode shows that its average expansion over a very large
input file of random data is 8 / (pb - 2 + 2n/p), where n is the number of
output characters, p is the smallest power of 2 greater than or equal to n,
and pb is log2(p), the number of bits needed to represent p values.  A
graph of this function for values of n from 2 to 256 shows a very close
approximation of the theoretical expansion log(256)/log(n).  For example,
for n = 95 the expansion factor is 8 / (7 - 2 + 2*95/128) = 1.234, or
23.4%.

Note that all of the expansion factors given above ignore the newline
characters added to limit the output line width.
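As a quick check of the formula above, the following throwaway program (an
illustrative sketch only, not part of the encode/decode distribution; the
file and variable names are arbitrary) computes both the theoretical
expansion log(256)/log(n) and the formula 8 / (pb - 2 + 2n/p) so that the
two can be compared:

    /* expansion.c -- compare the theoretical expansion factor
     * log(256)/log(n) with encode's average expansion 8/(pb - 2 + 2n/p). */
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        int n, p, pb;
        double theory, actual;

        for (n = 2; n <= 256; n++) {
            for (p = 1, pb = 0; p < n; p <<= 1)   /* smallest 2^pb >= n */
                pb++;
            theory = log(256.0) / log((double) n);
            actual = 8.0 / (pb - 2 + 2.0 * n / p);
            if (n == 64 || n == 95 || n == 256)
                printf("n = %3d: theoretical %.3f, encode %.3f\n",
                       n, theory, actual);
        }
        return 0;
    }

For n = 64 (the uuencode character set) both figures come out to 1.333; for
n = 95 the theoretical figure is 1.218 while the formula gives 1.234,
matching the numbers quoted above.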
-----------
The encoding process:

The encoding process used by encode is simply to throw away the byte
boundaries in the input bitstream and insert new byte boundaries in such a
manner that there are only n distinct "tokens" in the stream, where n is
the number of output characters.  These tokens can then be mapped
one-to-one onto the output characters, both during encoding and decoding.

A good example of this process is uuencode, which discards the byte
boundaries that occur every 8 bits and inserts byte boundaries every 6
bits.  The result is a series of tokens with at most 64 possible values,
each of which is mapped one-to-one onto the output character set of 64
printable characters.

This process is trivial for any n which is a power of two: you simply
insert byte boundaries every log2(n) bits.  When n is not a power of 2,
however, the process is somewhat more complicated.  We can no longer insert
the byte boundaries at regular intervals of b bits, since that would imply
2^b output characters.  If we select b such that 2^b < n, we aren't using
all n output characters, and we expand the file more than necessary.  On
the other hand, if we select b such that 2^b > n, we don't have enough
output characters to encode the data.  The solution is to start with the
smallest b such that 2^b >= n and then eliminate some of the input tokens
until exactly n of them remain; these can then be mapped one-to-one onto
the output characters.

Input tokens are eliminated by combining two of them into a single, shorter
token.  This is best explained with an example.  Say we have 6 output
characters.  We start with 8 input tokens:

    000,001,010,011,100,101,110,111

This set of tokens has the property that any input bitstream can be broken
down into a series of these tokens in exactly one way.  Now let's combine
two of the tokens.  The tokens to be combined must have identical bits
except for the last bit, and the process of combining strips that bit from
the tokens.  For example, 110 and 111 can be combined into the token 11,
giving the token set

    000,001,010,011,100,101,11

If we combine two more tokens, 100 and 101 -> 10, we get

    000,001,010,011,10,11

This token set still has the property that any input bitstream can be
broken down into a series of these tokens in exactly one way, and since
there are 6 of them, we can map them one-to-one onto the output character
set.

The standard for the generation of these tokens will be as follows: Start
with 2^b distinct tokens of length b bits, where b is the smallest integer
such that 2^b >= n and n is the number of output characters.  Then, as
above, while there are more than n tokens in total, replace the two
numerically greatest b-length tokens with a single (b-1)-length token equal
to the b-1 most significant bits of either b-length token.  (It is asserted
that at any point in this procedure, the two numerically greatest b-length
tokens differ only in the least significant bit.)

The standard for the one-to-one mapping between tokens and output
characters will be as follows: tokens will be sorted such that all b-length
tokens come first, in numerical order, followed by all (b-1)-length tokens,
in numerical order.  Output characters will be sorted by ASCII code in
numerical order.  A one-to-one mapping will then be established between
these two sets.
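To make the token-generation and ordering standard concrete, the following
sketch (again illustrative only, not part of the encode/decode
distribution) prints the standard token set for a given n.  It relies on
the observation that after all combining steps, the surviving b-bit tokens
are the values 0 through 2n-2^b-1 and the (b-1)-bit tokens are the values
n-2^(b-1) through 2^(b-1)-1:

    /* tokens.c -- print the standard token set for n output characters:
     * all b-bit tokens in numerical order, then all (b-1)-bit tokens in
     * numerical order. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int n = (argc > 1) ? atoi(argv[1]) : 6;   /* output character count */
        int p, b, i, j;

        for (p = 1, b = 0; p < n; p <<= 1)        /* smallest 2^b >= n */
            b++;

        for (i = 0; i < 2 * n - p; i++) {         /* b-bit tokens */
            for (j = b - 1; j >= 0; j--)
                putchar('0' + ((i >> j) & 1));
            putchar('\n');
        }
        for (i = n - p / 2; i < p / 2; i++) {     /* (b-1)-bit tokens */
            for (j = b - 2; j >= 0; j--)
                putchar('0' + ((i >> j) & 1));
            putchar('\n');
        }
        return 0;
    }

For n = 6 it prints 000, 001, 010, 011, 10, 11 -- the same token set
derived by hand above.  Pairing these tokens, in this order, with the
output characters sorted by ASCII code gives the standard mapping.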
The standard for the checksum will be as follows: The checksum will be
computed on the decoded data and will be 32 bits wide.  For each character
read from the input file during encoding, or written to the output file
during decoding, the checksum will first be rolled 7 bits to the left (the
7 bits which slide off the MSB end are reinserted at the LSB end), and then
the character will be xor'd onto the low order 8 bits of the checksum.

-----------
Implementation:

Decoding with this scheme is trivial: simply map each printable character
from the input to the corresponding variable length token, and append that
token to the decoded bitstream.

Encoding is a bit more tricky, since the token length is variable and the
input bitstream contains no token boundaries.  The solution is to set up a
256 element array indexed by the next 8 bits of the input bitstream.  Note
that these 8 bits are not necessarily byte-aligned in the input file.  The
indexed element of the array indicates how many bits to consume from the
input and which printable character to append to the output.  For example,
to recognize the token 010, every element of the array whose index is
010xxxxx, for all xxxxx, is set up to indicate that 3 bits were seen and to
give the printable character that maps to 010.  The input bitstream is then
advanced by 3 bits and the operation is repeated, using the next 8 bits to
index the array again.

My implementation of this encoding process is fairly simplistic and
incorporates no more than the basic functionality provided by
uuencode/uudecode.  It is intended primarily to introduce this encoding
scheme to the public in the hope that it will be widely adopted.  Should
such adoption occur, this file should be used as a standard reference for
the encoding algorithm.
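The checksum rule above can be written out directly.  The sketch below is
illustrative rather than taken from the encode/decode sources; in
particular, the text does not state an initial checksum value, so starting
at 0 is only an assumption:

    /* checksum.c -- the per-character checksum step described above:
     * roll the 32-bit checksum left by 7 bits, then xor the character
     * onto its low order 8 bits. */
    #include <stdio.h>

    unsigned long checksum_byte(unsigned long sum, unsigned char c)
    {
        sum &= 0xffffffffUL;                              /* keep 32 bits */
        sum = ((sum << 7) | (sum >> 25)) & 0xffffffffUL;  /* roll left 7  */
        return sum ^ c;                                   /* xor low 8    */
    }

    int main(void)
    {
        unsigned long sum = 0;    /* assumed initial value */
        int c;

        while ((c = getchar()) != EOF)
            sum = checksum_byte(sum, (unsigned char) c);
        printf("checksum: %08lx\n", sum);
        return 0;
    }

Likewise, the 256 element encoding array described above can be built from
the standard token set.  The structure and names below are illustrative and
do not come from the encode/decode sources; charset is assumed to already
hold the n output characters sorted by ASCII code:

    /* Build the encoding lookup table: for every possible value of the
     * next 8 input bits, record how many bits the matching token consumes
     * and which printable character it maps to. */
    struct entry {
        int  nbits;     /* bits consumed from the input stream */
        char out;       /* printable character to emit         */
    };

    void build_table(struct entry table[256], const char *charset, int n)
    {
        int p, b, tok, fill, i;

        for (p = 1, b = 0; p < n; p <<= 1)      /* smallest 2^b >= n */
            b++;

        /* b-bit tokens: values 0 .. 2n-p-1, characters 0 .. 2n-p-1 */
        for (tok = 0; tok < 2 * n - p; tok++)
            for (fill = 0; fill < (1 << (8 - b)); fill++) {
                i = (tok << (8 - b)) | fill;
                table[i].nbits = b;
                table[i].out = charset[tok];
            }

        /* (b-1)-bit tokens: values n-p/2 .. p/2-1, characters 2n-p .. n-1 */
        for (tok = n - p / 2; tok < p / 2; tok++)
            for (fill = 0; fill < (1 << (9 - b)); fill++) {
                i = (tok << (9 - b)) | fill;
                table[i].nbits = b - 1;
                table[i].out = charset[tok + n - p / 2];
            }
    }

During encoding, the next 8 bits of the input (not necessarily byte
aligned) index this table; the indicated character is emitted and the
bitstream is advanced by nbits, exactly as described above.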