BASE64 GUIDE FOR UUENCODING TYPES
_______________________________________

This brief guide to working with Base64 encoded images found on USENET newsgroups, and the free decoder that should accompany it, were written and posted in response to the many "What the @#$%* is Base64" postings one sees in newsgroups that carry a lot of binaries. While UUEncoding is familiar to most DOS\Windows users on the Net, Base64 encoding is a mystery of sorts outside of the UNIX world. Hopefully, this document will help with this problem.

There is also a lack of intelligent Windows based software for Base64 encoding and decoding. If you got this file from an archive called SJHB64.ZIP, there's a free decoder in the archive that tries to address that need.

WHAT THE !@#$%* IS BASE64!?
___________________________

Like UUencoding, Base64 is a method of taking binary data such as image, audio, video, Postscript or executable files and encoding it into 7 bit ASCII, a text format. Once the binary is encoded, you should be able to load it in a text editor like Windows Notepad and view the ASCII encoded data. Encoded binaries can be sent through Internet Mail gateways by mail and news reader programs which cannot read or forward normal binary data. The recipient has to find a way to convert (decode) the ASCII data back to its original binary format.

WHY DOESN'T EVERYBODY USE UUENCODING?
_____________________________________

You've probably run across UUEncoded files that will not decode properly. While this is often due to buggy encoding software or bad phone lines that hack files in transmission, the way UNIX Mail Gateways work is also a problem. Many gateways strip non-alphanumeric characters (commas and semi-colons, for example) from mail packets, making a UUencoded file unusable. Your decoder will OFTEN decode it normally, but you'll find the resulting binary file, be it JPEG, AVI, etc., is corrupted.

Since Base64 uses only alphanumeric characters (upper and lowercase a-z and the numbers 0-9) plus the + and / characters, this problem is eliminated. This is one reason why Base64 is among the most popular methods for encoding binaries on UNIX systems.

Base64 is also the encoding method of choice for embedding binaries in UNIX Mail files. MIME (Multipurpose Internet Mail Extensions) is the most popular UNIX mail format, and the MIME specification supports embedded Base64 encoded binaries. One of MIME's most powerful features is its support for multiple embedded binaries. Like UUE, MIME files are 7 or 8 bit ASCII, so you can embed UUEncoded binaries in a MIME file as 7 bit ASCII data.

If you load a MIME file into a text editor, you'll see something like the fragment below. Note that the first five lines are a typical USENET header tacked on to the beginning of the MIME file by the uploader's News reader:

Newsgroups: comp.binaries.images
From: Somebody
Subject: I can't decode this image..
Message-ID: <3157F28F.4E1@some.com>
Date: Tue, 26 Mar 1996 13:35:11 GMT

This is a multi-part message in MIME format.

--------------5213E5135CD
MIME-Version: 1.0
Content-Description: "Is this image corrupt?"
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Hi..I'm a programmer working on a jpeg decoder and I can't get it to decode this particular image. Is the image corrupt or is it my program??

Thanks, Scott

--------------5213E5135CD
MIME-Version: 1.0
Content-Type: image/jpeg
Content-Description: "Base64 encode of hacked.jpg"
Content-Transfer-Encoding: base64
Content-Disposition: inline; filename="HACKED.JPG"

/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRof
Hh0aHBwgJC4nICIsIxwcKDcpLDAxNDQ0Hyc5PTgyPC4zNDL/2wBDAQkJCQwLDBgNDRgyIRwh
[More Base64 data follows]

The two lines above starting with "/9j/" are the start of the Base64 encoded binary, in this case, a .JPG image file.
The lines that start with "---------" are boundaries which separate the header from the text message and in turn separate the text message from the Base64 encoded binary. The binary and text sections each have a subheader that describes the type of data the mail reader is about to encounter. According to the MIME specification, if an embedded binary is present, the keyword INLINE (or ATTACHMENT) and the name of the embedded file should appear in the CONTENT-DISPOSITION field.

Interestingly, the MIME specification supports multiple messages and binaries within the same file. Different types of binaries such as a .WAV audio file and an .MPEG video file can also be combined in the same file.

One common problem Windows and DOS users run into is multiple binaries: most of the shareware Windows based decoders that claim to support Base64 are unable to extract more than one binary from a message file containing several. The only way you'll know if a MIME message file has more binaries than your decoder can handle is by loading the file into a text editor and manually finding each header. If your decoder can't handle multiple binaries, your only choice is to cut and paste the binaries into separate files and then try to decode them.

Even though millions of Windows and DOS users are using the Internet, it's important to remember that the servers, hubs and gateways that actually ARE the net are still predominantly UNIX sites. These sites support the MIME standard, which in turn uses Base64 for binary encoding. It's safe to assume for the time being that this encoding method won't go away, so the remainder of this document will focus on how to work with Base64 encoded binaries.

TYPES OF FILES CONTAINING BASE64 ENCODED BINARIES
_________________________________________________

Base64 encoded binaries are generally found in three forms on Usenet:

a) Embedded in a MIME or other type of Email file as shown above.
These are usually created in an EMail program and the user posts the resulting file to a Usenet newsgroup. Most UUEncoders which support Base64 will include some or all of the header fields found in MIME files.

b) Appended to a short USENET header. This type of file usually comes from a News Reader program. Unlike MIME files, the header does not always conform to a specified format. Here's an example:

From: Somebody
Newsgroups: comp.binaries.images
Subject: I can't decode this image..
Date: Tue, 26 Mar 1996 13:35:11 GMT
Message-ID: <3157F28F.4E1@some.com>

R0lGODdhrgCcAPcAAA4EBBMIBxgMBBMIDh0IBx0QCygMBy0MDDwKCUINCVEQB1sMB0IUClsQ
DkcYC0wmFlsYDmIZCF8UEWYYEnYSCW4cDnkbDIgZC18lEXIlEIMlC44hFXIpFI0pEXsuF4JE
[More Base64 data follows]

As in our sample MIME file, the two lines above starting with "R0lGOD" are the start of the Base64 encoded binary, in this case, a .GIF image file. Notice that there is no filename or file type specified in the header. This is often the case with files created by News readers: most of them handle all data as if it's a text message and have no awareness of encoded binaries.

Most users who post Newsreader files containing binaries specify a filename in the description you see when you browse newsgroup postings. This, however, is not much help to a program that has to decode this stuff. Most Base64 decoders will take a file like this, extract and decode the binary, and give it a generic name (like UNKNOWN) with no extension. Unless the user remembers the file type from the Newsgroup description, he or she won't even know what type of program to load the decoded binary into.

c) As raw Base64 data. Sometimes you'll encounter a newsgroup posting that looks like this:

UklGRoI0AABXQVZFZm10IBAAAAABAAEA8FUAAPBVAAABAAgAZGF0YV00AAB3dXNxb25ubm5vcHFz
dXd5e32AgoWHioyOkJGRkpOUlZaWl5eYmJiYmJiZmZqampqZmZiXlpSTkY6MiYaDf3x5dnNxb21r
[More Base64 data follows]

This is simply encoded Base64 data with no header, in this case a .WAV file.
The decoding program has no idea what type of binary this is and, if the person who posted it didn't specify the filename in the Newsgroup description, neither do you. Sometimes users create these themselves by copying the raw data and pasting it into their Newsreader for posting with no header information, but most encoders will also allow you to create Base64 binaries without a header. Files like this with incomplete or missing headers drive end users crazy, as they are often left to guess what type of binary they just decoded.

Base64 encoded binaries are usually one of the eight binary types below. Five of them are defined in the original MIME specification; the three marked X- are unregistered extension types in common use. They are:

1) Application : Octet-Stream (usually executables or archive files)
2) Application : Postscript (formatted text and EPS files)
3) Image : JPEG (.JPG images)
4) Image : GIF (.GIF images)
5) Image : X-BMP (other rasterized images)
6) Video : MPEG (compressed video data)
7) Audio : X-WAV (.WAV sound files)
8) Audio : X-VOC (.VOC sound files)

(This list does not include MAC-specific formats and some more exotic data types added when the MIME spec was recently updated.)

God only knows why, but many encoders have an option to encode a binary file without header data. If for some reason you use this option, you'll wind up with a raw Base64 file like the one above. Usually, a program courageous enough to decode this will write a file with a generic name and no extension, and it's up to you to figure out what it is.

A feature lacking in the Windows based decoders is the ability to "recognize" the common binary types in encoded form when there's no header information available. If you have an "unknown" on your hard drive, load the original Base64 file into a text editor, take a look at the list of encoded binary signatures below and see if you can match your mystery file with a signature. If you can, all you have to do is rename the decoded file in order to (hopefully) use it.

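If you'd rather let a program do the matching, here is a minimal Python sketch of that kind of "recognition". It is not taken from any particular decoder: the function name guess_extension and the MAGIC table are made up for illustration, and the table holds only the well-known leading bytes of a handful of formats. It decodes the first few characters of the raw Base64 text and compares the result against those bytes.

```python
import base64

# Leading bytes ("magic numbers") of a few common file types.
# Illustrative only: a match is a strong hint, not proof.
MAGIC = [
    (b"\xff\xd8\xff", ".JPG"),
    (b"GIF8", ".GIF"),
    (b"BM", ".BMP"),
    (b"RIFF", ".WAV"),       # RIFF is also used by AVI, so this is a guess
    (b"MZ", ".EXE"),
    (b"PK\x03\x04", ".ZIP"),
    (b"%!PS", ".EPS"),
]

def guess_extension(first_line):
    """Guess a file extension from the first line of raw Base64 text."""
    text = first_line.strip()[:16]          # 16 characters decode to 12 bytes
    usable = len(text) - len(text) % 4      # Base64 decodes in groups of 4
    head = base64.b64decode(text[:usable])
    for magic, ext in MAGIC:
        if head.startswith(magic):
            return ext
    return None                             # type not in the table
```

Feed it the first line of the encoded data, for example guess_extension("R0lGODdhrgCcAPcA"), and rename the decoded file with whatever extension comes back. A result of None just means the table doesn't cover that type.
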
DECODING BASE64 ENCODED DATA
____________________________

To do this, obviously you need a decoder. If you're a UNIX type, you're in luck: there are a number of quality decoders available, many of them free. If you're a Windows user, your Base64 choices are limited to DOS command line decoders that are awkward to use or Windows based UUDecoders that don't support Base64 very well, unless you want to spend the money for a commercial MIME compliant mail program. Eudora and Pegasus are the best known Windows based mailers.

WHAT IF THE DECODED FILE DOESN'T WORK?
______________________________________

There are a number of reasons this could happen. The encoded data could have been corrupt to start with, could have been hacked by one of the many buggy encoders out there, or could have been corrupted by a bad phone line during uploading or downloading. If the Base64 data itself is corrupted, most decoders will show an error message of some kind. If, however, the source data (the original binary file) was corrupt, most decoders have no way of knowing this and will merrily decode hacked data which you won't be able to use. Some commercial mailers run CRC (Cyclic Redundancy Check) tests on decoded binaries to verify their integrity. Even this method is not foolproof, especially with compressed data like JPEG or GIF images or MPEG video.

Another common problem is the way in which some Newsreaders and EMail programs interpret the data they're handling. Some programs will convert 7 bit encoded ASCII data to 8 bit or strip carriage return and\or linefeed characters, making it harder for your decoder to read the data properly (the EMail software used by the major online services often does this). If you load a file like this into a text editor, you'll see lines that are hundreds or thousands of characters long (Base64 encoded data is usually arranged in lines of 72 characters each).
Sometimes you can recover a file like this by loading it into a binary mode text editor (like the Windows95 Wordpad app) and saving it, which restores the stripped control characters. Many programmers' text editors can also load text files in Binary mode, and it's worth a shot to try this when you have a file that decoded without errors but simply doesn't work.

(Note: While Notepad always saves files as ASCII text, you must specify Save As Text when using Wordpad. As a rule it's a good idea to stay away from Word Processors since they often strip Carriage Returns and insert control characters that can throw a decoder off. Notepad also has a limitation that makes it less than ideal for editing Base64 files: it can't load a file larger than 40K or so, and many encoded binaries are larger than that.)

USING AN EDITOR WITH PROBLEM BASE64 FILES
_________________________________________

One common situation where a text editor comes in handy is when you've decoded a file and don't know what type of binary it contains. If, for whatever reason, a decoder can't extract a filename from an encoded file, it will write a file with a default name and no extension. Now that you've decoded this mystery file, how do you figure out what it is?

Sometimes the only way is to load the Base64 encoded file in a text editor. Once it's up there on your screen, look for a header. MIME compliant files will always have a line in the header that identifies the file type, such as:

Content-Type: image/jpeg

If you find this, you know your mystery file is a JPEG file, and all you have to do is add the extension .JPG to the decoded binary. MIME compliant headers also list the full filename of the encoded binary. Look for the line:

Content-Disposition: inline; filename=

or

Content-Disposition: attachment; filename=

If the encoded file has an incomplete or non-existent header, you'll need to do some detective work.

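When a MIME header is present, the lookup described above can also be done by a program instead of by eye. The following is only a sketch, using the email module from Python's standard library. The message in RAW is a made-up two-part example (its boundary string, filename and eight characters of Base64 data are all invented for illustration), and list_binaries is a hypothetical helper name.

```python
import email

# Hypothetical two-part message: a short text note plus a tiny Base64 body.
# "R0lGODdh" is simply the Base64 form of the six bytes "GIF87a".
RAW = """\
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="XYZ"

--XYZ
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Is this image corrupt?
--XYZ
Content-Type: image/gif
Content-Transfer-Encoding: base64
Content-Disposition: inline; filename="SAMPLE.GIF"

R0lGODdh
--XYZ--
"""

def list_binaries(raw_text):
    """Return (content_type, filename, decoded_bytes) for each named part."""
    found = []
    msg = email.message_from_string(raw_text)
    for part in msg.walk():
        if part.is_multipart():
            continue                          # container, no data of its own
        filename = part.get_filename()        # read from Content-Disposition
        if filename is None:
            continue                          # text section, not a binary
        data = part.get_payload(decode=True)  # undoes the base64 encoding
        found.append((part.get_content_type(), filename, data))
    return found

binaries = list_binaries(RAW)
for ctype, name, data in binaries:
    print(ctype, name, len(data), "bytes")
```

Because the loop walks every section of the message, it reports each embedded binary it finds rather than stopping at the first one.
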
Every type of binary file begins with a signature unique to that file type. Although this signature will be different after the file is Base64 encoded, it is still unique to that file type. You must find the start of the Base64 encoded data and check the beginning of the first line. Remember that Base64 is case sensitive, so the case of each letter matters. Chances are, you'll find one of the following signatures:

FILE TYPE    ENCODED SIGNATURE
_________    _________________
JPEG         /9j/4AAQSkZJRgABAQ
GIF          R0lGODdh
BMP          Qk
WAV          UklGR
MPEG         AAABsxQAyBH//
EXE          TV
ZIP          UEsDB
EPS          JSFQUy1BZG

This list is far from complete, but these are among the most common types. If you know where to find the start of the encoded data and can identify one of these signatures, then you've solved the mystery.

Another common situation where a good text editor can help is one that I call "Mailer Syndrome". Many EMail and Newsreader programs default to a "quoted" mode, where the program attaches a greater than [ > ] sign at the beginning of each line of an existing USENET thread it's responding to. If the thread happens to contain Base64 encoded data, some poor soul will eventually download a file that won't decode properly. If you load one of these files into a text editor, you'll see something like this:

>/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsL
>DBkSEw8UHRofHh0aHBwgJC4nICIsIxwcKDcpLDAxNDQ0Hyc5PTgyPC4zNDL/
>2wBDAQkJCQwLDBgNDRgyIRwhMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIy

Remember, Base64 encoded data should consist only of the upper and lowercase characters a-z, the numbers 0-9, and the + and / characters. The only exception to this is at the end of the encoded data, where sometimes you'll see one or two equal signs [==], which your decoder should treat as padding. If you see any other characters, chances are the file won't decode correctly.

Thanks for your time, and hopefully we'll all see fewer "What the @#$%* is Base64" posts.

Scott Hanrahan
SJHDesign, Inc.
EMail : SJHDES@ibm.com

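As a postscript to the "Mailer Syndrome" discussion above: the cleanup you would otherwise do by hand in a text editor can be sketched in a few lines of Python. This is only an illustration under stated assumptions. The name clean_quoted is hypothetical, and the check assumes the standard Base64 character set described above, with up to two equal signs allowed at the end of a line.

```python
import re

# The 64 characters Base64 uses, plus up to two "=" padding characters.
VALID_LINE = re.compile(r"[A-Za-z0-9+/]*={0,2}")
# Quote markers such as ">" or "> > " at the start of a line.
QUOTE_PREFIX = re.compile(r"^[>\s]+")

def clean_quoted(lines):
    """Strip quote markers; report input line numbers that still look wrong."""
    cleaned, suspect = [], []
    for number, line in enumerate(lines, start=1):
        text = QUOTE_PREFIX.sub("", line).strip()
        if not text:
            continue                     # skip blank lines
        cleaned.append(text)
        if not VALID_LINE.fullmatch(text):
            suspect.append(number)       # contains a non-Base64 character
    return cleaned, suspect
```

The first value returned is the data with the quoting removed, ready to hand to a decoder. The second lists the lines (numbered from 1 in the original input) that still contain stray characters and deserve a look in your editor.
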