************************************************************************
*                                                                      *
*   ASMGEN.COM - by J. Gersbach and J. Damke (Ver. 2.01)               *
*                                                                      *
*   A program to generate cross-referenced assembly language code      *
*   from any executable file.                                          *
*                                                                      *
*   Downloaded from DASAN BBS in Sandefjord Norway Tf.DAT:03-59530     *
*                                                                      *
************************************************************************

* PREFACE *

This program will generate 8086/87/88 assembly code text that is
compatible with the IBM Personal Computer Macro Assembler from any
executable diskette file up to 65,535 bytes.  The output can be routed
to the console or a diskette file.  A reference list may be generated
separately or embedded at the appropriate instruction counter address
in the assembly code.  Some manual touch-up will be required before
reassembly, but nearly all the typing is done for you by ASMGEN and
anything questionable is marked with "??".

A file of sequential instructions may be resident on the same diskette
to indicate to ASMGEN which addresses contain code, bytes, words, or
strings.  This file may also include instructions to assume segment
register values or to toggle the output of assembly code text,
generation of the reference table, 8087 mnemonics, or the inclusion of
embedded reference information in the assembly file.

DEBUG may be used to browse through the executable file to determine
the starting locations of code and data when developing the sequential
instruction file.  It is important to specify these locations
accurately in order to obtain an accurate reference table and to
minimize touching up of the ASM output text.

The number of references within the file determines the amount of
memory required, since the reference table is built in memory during
the first pass.  Disassembly is done from disk and only one file sector
is in memory at any given time.  Therefore memory size does not limit
the size of the file to be disassembled.
48K bytes of memory will be enough for most programs, but a few will
need 64K or 128K.  One diskette drive is sufficient, but two are more
convenient.

* STARTING ASMGEN *

There are two ways to work with ASMGEN: either by using the command
menu or by calling ASMGEN with parameters.  Descriptions of both
options follow.

* USING THE ASMGEN MENU *

The program is invoked by typing:

          ASMGEN

You are then prompted for a file specification.  Respond with the name
of the executable file from which you wish to generate the assembly
code.  The executable file will normally have an extension of .EXE or
.COM.  ASMGEN will check this file spec for validity and then respond
with a prompt that includes a summary of the command letters,
indicating that you may give it a command.  The contents of the
executable file are not checked for valid code, so ASMGEN will try to
disassemble text or compressed BASIC files and will produce
unintelligible assembly code.

The commands are:

X filespec   This file spec replaces any previous executable file
             spec.  The usual file extension is .COM or .EXE.
                  EXAMPLE:  X DATE.COM

A            The executable file is disassembled and the assembly
             code is routed to the specified file.  The usual file
             extension is .ASM.  If the filespec is omitted, the
             output will default to the console.
                  EXAMPLE:  A DATE.ASM

R            The reference table is sent to the file specified.  The
             usual file extension is .TBL.  If the filespec is
             omitted, the output will default to the console.
                  EXAMPLE:  R DATE.TBL

Q            The program is terminated and control is returned to
             DOS.

Each time a command has been executed, ASMGEN waits with a one-line
prompt for the next command.

          X , A , R or Q ?

The default filespec for each command is shown in brackets.  Enter the
next command of your choice as described above.

* USING ASMGEN WITH PARAMETER CALLS *

Up to three file specifications may be included when ASMGEN is first
called from DOS.
The executable file's name is given first, followed by specifications
for the assembly and reference table files.

     EXAMPLE:  ASMGEN DATE.COM, DATE.ASM, DATE.TBL

If a semicolon follows the last filespec, ASMGEN will exit to DOS when
the command has been executed.  If no semicolon is entered, ASMGEN will
display the menu options described above and wait for further input
after executing the command.

     EXAMPLE:  ASMGEN DATE.COM, DATE.ASM;

If the filespec for the .ASM file and/or .TBL file is omitted, ASMGEN
will generate first the .ASM file, then a .TBL file, using the filename
of the first filespec.

     EXAMPLE:  ASMGEN DATE.COM,,;
               creates DATE.ASM and DATE.TBL and exits to DOS.

If only the reference table is desired, the dummy name NUL should be
entered in place of an .ASM filespec.

     EXAMPLE:  ASMGEN DATE.COM, NUL, DATE.TBL

If only one filespec is given when the program is called, the reference
table is built in memory and then the menu options are displayed for
further commands.

     EXAMPLE:  ASMGEN DATE.COM

* PROGRAM EXECUTION *

The disassembly is done in two passes through the source file.  On pass
#1, the reference table is built in memory, and the actual output is
generated during pass #2.  Once the reference table is established, it
remains in memory until an X or Q command is issued, and subsequent A
and R command executions skip pass #1.  This saves a lot of time when
the executable file is large.

Three contiguous data areas are built dynamically in memory during pass
#1.  First is the compressed sequential instruction list.  Second is a
list of pointers for .EXE files that point to the locations of all
relocatable variables in the program, also arranged in numerical order.
These two areas are established before any code is read.  Third, the
reference table is built in a higher area of memory as pass #1
progresses.  If all available memory in the program segment is filled
before the first two data areas are completed, ASMGEN will abort to the
command prompt.
After the reference table is started, a shortage of memory will produce
the message "Reference Table Incomplete Due to Insufficient Memory" and
processing will continue.  Ctrl-Break may be used at any time to
interrupt a command in progress.

* READING THE ASSEMBLY CODE FILE (.ASM) *

This file begins with a title taken from the executable file's name and
date, followed by the current date (in brackets).  If not inhibited by
the M switch in a SEQ file (explained later), the macro library will
appear next in the file.  Next will be a .RADIX 16 pseudo-op, which
tells the macro assembler that all numbers are in hexadecimal form.
Then comes a header that indicates a starting value for the code
segment, stack segment, instruction pointer and stack pointer.  The
stack pointer is usually set to FFFF for .COM files but may be somewhat
less depending on available memory.  These values are passed by the
linker for .EXE files.

The first ASSUME statement might come next.  One is generated for each
segment that begins with code.  All segment registers are designated
according to the current set of ASSUMEs.  They will sometimes be
incorrect, so all ASSUME statements should be checked prior to
reassembly.

The disassembled output follows, terminated by an END statement and the
execution address.  An ORG pseudo-op is included if required.  The text
is compatible with the IBM Macro Assembler and the format is the same
except for RETurns.  To avoid the need for PROCedure titles, special
mnemonics are provided for all RET instructions.  These are defined in
the macro library at the beginning of the file.  Only macros that are
needed for the current file are produced.

The optional embedded commands that make up the reference table enhance
the readability of the file.  For very large files, this is sometimes
undesirable and a separate reference table is best.

When invalid instructions are encountered in code areas, they are
reproduced as byte values followed by "??".
If the target of a near jump is defined earlier in the code and is
within range of a short jump, a NOP instruction is inserted after the
jump.  The executable file created from this .ASM file with the Macro
Assembler and Linker will then be the same length as the original file.
This makes it less important to differentiate between labels and
numeric constants, since the label values and their offsets within the
file will be the same.  The fundamental problem of disassembly is in
knowing whether the original assembly code defined a number as a label,
which changes as a function of its position, or as a number that always
remains the same.  If you make changes in the assembly code, however,
you must properly specify all values.  You might as well remove all
NOPs at the same time.

Labels are five characters long and begin with "L".  Segment labels
begin with "S".  The remaining characters are the current instruction
counter in hex form, thus making each label unique and showing its
location in the original file.  The instruction counter is continuous
throughout the assembly code without resetting at segment boundaries.
The segment labels are therefore in byte, as opposed to paragraph,
form.  In those cases where a label value is modified by an ASSUME
statement, the original value is included as a comment in the
referencing instruction so that it may be easily changed back if it was
not intended as a location.

The word "Relocatable" is printed at the end of any line that contains
an absolute paragraph value.  These are values that DOS modifies after
loading but before executing a program.  They are used for loading
segment registers that are sensitive to the program location in memory.
Relocatable values are not modified by ASSUMEs.  ASMGEN converts these
numbers from paragraph to byte values by multiplying them by sixteen so
that they will fit within the 16-bit instruction counter field.  When
the paragraph value is negative or exceeds 0FFFH, it is left unchanged
and a warning (??)
is issued on that line.  When a program larger than 64K bytes is being
disassembled, it should be divided into smaller files.

All words are produced as labels, except when the "L" switch has been
enacted in the .SEQ file (explained later).  The label name indicates
its numeric value and, if it does not occur on an instruction boundary,
its position relative to the current instruction pointer is given by an
EQU statement.  Therefore the Macro Assembler will assume that it is a
location, but it is easily changed to a constant since the value is
given in the label name.  The word OFFSET precedes a label whenever it
is questionable whether it is a label or an immediate value.  You must
decide which of the labels should be constants and which of the
constants should be labels, and change them accordingly.  When changing
labels to numbers, be sure to append an "H" if the number ends with a
"D" or a "B", since the Macro Assembler will otherwise assume that it
is decimal or binary.

Bytes are always treated as constants.  An optional switch may be
included in the .SEQ file (explained later) which enables numbers
instead of labels if all references to the value are data segment and
immediate operation types.

An effective procedure to follow in attempting to understand the
assembly code file is to look first for the message text area, the
input commands, and the simpler subroutines.  Then add label names to
addresses in the .SEQ file (explained later) that remind you of their
purpose.  Add comments to the labels.  If these names are well chosen,
the larger routines will eventually become clear.  The embedded
references are produced as labels so they will retain their meanings as
they are changed.  It is also helpful to spend some time studying the
structure of data areas.  Vector tables, which are frequently used to
control the program's flow, reveal the program's structure very
quickly.
If some routines do not have labels at the beginning, it is usually
because the code or tables that reference them (or the segment register
assumptions) are not properly defined in the .SEQ file.

* READING THE REFERENCE TABLE (.TBL) *

A referencEE is defined as a number that is referenced somewhere in the
program.  It may be a program location or a numeric constant.  A
referencOR is defined as the address in the program from which a
reference is made to the referencEE.

Each entry is composed of a referencEE followed by a list of
referencORs.  If more than one line is needed, additional lines are
indented to the first referencOR position.  The referencEE is followed
by an "S" if it includes references to the beginning of a segment.  The
referencOR is followed by two letters, the first of which represents
the segment register that is implied or prefixed in the referencing
instruction.  The second letter indicates the type of operation on the
referencEE.  When the reference entries are embedded in the assembly
code, all values are preceded by the letter "L".

----------------------------------------------------------------------------
   1st letter   |                     2nd letter
  SEG REGISTER  |                 TYPE OF OPERATION
----------------|-----------------------------------------------------------
   C  code      |   J  jump        M  modify - INC, ADD, etc.
   S  stack     |   C  call        I  immediate - value or offset
   D  data      |   R  read        T  test or compare
   E  extra     |   W  write       ?  unknown or ESC instruction
                |   P  port
----------------|-----------------------------------------------------------

* WRITING/READING THE SEQUENTIAL INSTRUCTION FILE (.SEQ) *

The sequential instruction file is a list of special instructions to
ASMGEN which the user creates.  The file takes the form of a list of
hexadecimal addresses and single-letter instructions or generation
switches.  If used, the .SEQ file must be on the same diskette as the
source file and have the same name as the source file with an extension
of .SEQ.
Each instruction in the file must be in one of the following formats:

          addr command
     or   addr command ;comment
     or   addr command label comment
     or   addr command label comment ;comment

"addr" represents the instruction pointer value.  All addr values must
be in numerical sequence in the file.

"command" may be either a toggle switch or a generation instruction.

"label" is optional and replaces the label generated for this address
with this non-blank string.

"comment" is optional and must be preceded by "label" unless the dummy
label "." is used.  Everything following "label" is treated as an
address comment and will be printed in the ASM file behind the
generated instruction.  The address comment may be up to 255 characters
in length and should not contain a semicolon.

";comment" is optional.  Anything following a semicolon in a .SEQ file
instruction is treated as a comment in the .SEQ file only and is not
added to the generated .ASM file.

"label" and "comment" are not allowed when a generation switch is
coded, but a ";comment" may be used to help clarify the .SEQ file.

The .SEQ file is read into memory before the first pass starts.  The
addresses and commands will be compressed, but "label" and "comment"
will be held in memory one to one.  An effect of this is that the
memory space required for disassembly increases with each "label" and
"comment" added to the .SEQ file.

* DESCRIPTION OF GENERATION SWITCHES *

THE VARIOUS TOGGLE SWITCHES ARE SET TO ON BY DEFAULT.

Switches may be toggled on and off at any point in the .SEQ
file/disassembly.  All option switches except /M and /H can be either
toggled or directly set by the user.  A suffix of "+" turns the switch
ON, and a suffix of "-" turns the switch OFF.  Switches encountered in
the file that have neither of these suffixes are toggled to the
opposite of their state at the time; ON switches are turned OFF and OFF
switches are turned ON.
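
The toggle rules above can be illustrated with a short sketch.  This is
not ASMGEN's own code: the function names and the dictionary used to
hold the switch states are assumptions made for illustration only, and
the set-once switches /M and /H are deliberately left out.

```python
# Illustrative sketch only; names are hypothetical, not taken from ASMGEN.

TOGGLE_SWITCHES = "BEFLORT"  # /M and /H are set-once and are not modelled


def default_switches():
    """Return the initial state: every toggle switch starts ON."""
    return {name: True for name in TOGGLE_SWITCHES}


def apply_switch(state, token):
    """Apply one switch token such as '/T', '/O+' or '/B-' to `state`."""
    if not token.startswith("/") or len(token) < 2:
        raise ValueError(f"not a switch: {token!r}")
    name, suffix = token[1].upper(), token[2:]
    if name not in state:
        raise ValueError(f"unknown or set-once switch: {token!r}")
    if suffix == "+":
        state[name] = True             # "+" forces the switch ON
    elif suffix == "-":
        state[name] = False            # "-" forces the switch OFF
    elif suffix == "":
        state[name] = not state[name]  # no suffix: flip the current state
    else:
        raise ValueError(f"bad suffix in {token!r}")
    return state


if __name__ == "__main__":
    state = default_switches()
    for token in ["/O", "/O+", "/R", "/B-"]:
        apply_switch(state, token)
    # /O is flipped OFF and then forced ON; /R and /B end up OFF.
    print(state)
```

Keeping the state in a plain dictionary makes it easy to see that a
bare switch depends on the current state, while "+" and "-" do not.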
/B - generate byte references
     When ON, byte and word references are included in the reference
     table.  When OFF, only word references are generated.

/E - embedded references in ASM file
     When ON, reference table entries are inserted in the text just
     before the referencEE's definition statement.  When OFF, these
     entries are not included with the disassembled text.  The entire
     reference table can be printed with the "R" command.

/F - 8087 mnemonics
     When ON, ESC instructions are produced.  When OFF, ESC
     instructions are assumed to be 8087 instructions and 8087
     mnemonics are produced.

/H - append hex "H"
     When this switch appears at any point in the .SEQ file, an "H" is
     appended to all hex numbers.  This does not, of course, apply to
     the labels, which are hex values preceded by the letter "L".  The
     .RADIX 16 pseudo-op is omitted, which allows the assembler's radix
     to default to decimal.  This switch defaults to NO H APPEND.  Note
     that it will be set only once.  It retains its value until the
     next .SEQ file is read.

/L - generate label or number
     When ON, all word references are treated as labels.  When OFF, a
     word reference is treated as a constant if all referencORs are
     data immediate types.

/M - suppress macro library
     When this switch appears at any point in the .SEQ file, no macro
     library is included in the text output.  The DEFAULT IS THAT THE
     MACRO LIBRARY WILL BE INCLUDED.  Note that this switch will be set
     only once.  It retains its value until the next .SEQ file is read.

/O - control ASM output
     When ON, ASMGEN will output the generated text.  When OFF, output
     will be suppressed.

/R - control TBL output
     When ON, ASMGEN will output the generated reference data.  When
     OFF, the reference table is not printed.

/T - control trace output
     When ON, up to 16 bytes of object code are included as comments in
     each line of the assembly code file.  When OFF, object code is not
     included.
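
The .SEQ line formats given earlier (addr, command, optional label and
comment, optional ;comment) can be sketched as a small parser.  This is
only an illustration of the rules as stated in this document, not
ASMGEN's actual parser: the names SeqLine, parse_seq_line and parse_seq
are hypothetical, continuation lines beginning with "&" are simply
skipped, and repeated addresses are assumed to be allowed because the
examples in this document use them.

```python
# Illustrative sketch only; names are hypothetical, not taken from ASMGEN.
from typing import List, NamedTuple, Optional


class SeqLine(NamedTuple):
    addr: int                # instruction pointer value
    command: str             # generation instruction or switch, upper case
    label: Optional[str]     # user label, upper case, or None
    comment: Optional[str]   # address comment for the .ASM file, or None


def parse_seq_line(line: str) -> Optional[SeqLine]:
    """Parse one .SEQ line; return None for lines this sketch skips."""
    # Anything after a semicolon is a comment for the .SEQ file only.
    text = line.split(";", 1)[0].strip()
    if not text or text.startswith("&"):
        return None  # blank, comment-only, or continuation line
    parts = text.split(None, 3)
    if len(parts) < 2:
        raise ValueError(f"expected 'addr command': {line!r}")
    addr_text = parts[0].upper()
    if addr_text.endswith("H"):      # an "H" may follow the hex address
        addr_text = addr_text[:-1]
    addr = int(addr_text, 16)
    command = parts[1].upper()
    # The dummy label "." allows a comment without a label.
    label = parts[2].upper() if len(parts) > 2 and parts[2] != "." else None
    comment = parts[3] if len(parts) > 3 else None
    if command.startswith("/") and (label or comment):
        raise ValueError(f"label/comment not allowed on a switch: {line!r}")
    return SeqLine(addr, command, label, comment)


def parse_seq(lines: List[str]) -> List[SeqLine]:
    """Parse a whole .SEQ file and check that addresses are in sequence."""
    entries: List[SeqLine] = []
    for number, line in enumerate(lines, 1):
        entry = parse_seq_line(line)
        if entry is None:
            continue
        if entries and entry.addr < entries[-1].addr:
            raise ValueError(f"line {number}: address out of sequence")
        entries.append(entry)
    return entries


if __name__ == "__main__":
    sample = ["0200 C", "0203H W", "1F45 W user_prt print routine ;note"]
    for entry in parse_seq(sample):
        print(entry)
```

Stripping the ";comment" first mirrors the rule that it never reaches
the .ASM file, while the address comment after the label is kept.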
* DESCRIPTION OF .SEQ FILE COMMANDS *

A - assume
     The following lines contain ASSUMptions for segment register
     values.  They become effective at the address specified by this
     instruction and may be modified anywhere in the disassembly.  The
     required format for assumptions is:

          & 0400 DS

     The ampersand indicates a continuation of the A instruction.  In
     this example, a data segment beginning at an instruction pointer
     value of 400 will be assumed until another A instruction changes
     it.  CS, ES, and SS are also supported.  The segment assumptions
     are used for effective address calculations only.  The code
     segment assumption does not affect the instruction pointer value.

B - bytes
     The bytes encountered in the source file are assumed to have
     meaning as single byte values.

C - code
     The bytes encountered in the source file are assumed to be valid
     8088 machine language instructions.

D - generate data operand
     The operand of the instruction is changed to immediate data.
     Subsequent bytes are interpreted as "C" (code follows).

I - initial value for IP
     The hexadecimal value on this line overrides the instruction
     pointer value at the beginning of the file - not to be confused
     with the address at which execution begins.  The default values
     are 0000 for EXE files and 0100H for COM and other files.  The
     execution address following the END statement is omitted if this
     option is invoked.

S - strings
     The bytes encountered in the source file are assumed to form text.
     Quoted text is produced for valid ASCII characters and byte values
     for others.

# - defined length strings
     The first byte encountered in the source file contains the length
     of the character string which begins with the next encountered
     character.  This length value may be overridden by a subsequent
     SEQ file instruction.

$ - defined length strings
     The first byte encountered in the source file contains the length
     of the character string which begins with the next encountered
     character plus the length byte itself.
     This length value may be overridden by a subsequent SEQ file
     instruction.

W - words
     Pairs of bytes encountered in the source file are assumed to have
     meaning as word values.

X - repeating data structure
     A cyclic data structure is assumed to begin at the specified
     instruction pointer value.  The structure definition may follow
     and is prefixed by an ampersand (&) to indicate the continuation
     of this instruction.  If the definition does not follow, then the
     most recent definition is used.  If no structure is yet defined,
     then an error message is displayed.  The following elements may
     be used to define the structure:

     & NNNN S - The next NNNN bytes are defined as string characters
     & NNNN B - The next NNNN bytes are defined as byte values
     & NNNN W - The next NNNN bytes are defined as word values
     & XXNN $ - The next sequence of bytes is defined as NN fields.
                Each field consists of a length byte and a string of
                characters.  The length of each field is contained in
                the first encountered byte.  The high nibble (XX), if
                non-zero, is a bit mask of the length field within
                the byte.  The length field is right-justified within
                the byte after the byte value is sent to the output
                file.

* EXAMPLES OF .SEQ COMMANDS *

This example .SEQ file shows all the possible instructions in the
appropriate format.

                  ;All switches are on at the beginning.
0       /T        ;no object code as comments in output
0       /M        ;no macro library in output
0       /H        ;append "H" to all numbers
00H     /A        ;assume the following segment values
                  ;Note that the ampersand (&) indicates the extended ASSUME
&       380 DS    ;the data segment starts at 380 hex
&       380 ES    ;the extra segment starts at 380 hex
0200    I         ;initialize the instruction pointer to 200
0200    /F        ;introduce 8087 mnemonics (not ESC)
0200    /E        ;no embedded references
0200    C         ;code begins at 200
0203H   W         ;words are at 203
0207    C         ;more code starting here
220     X         ;complex data structure begins here
&       3 W       ;words
&       1 B       ;byte
&       0E02 $    ;2 strings starting with the 2nd byte follow
                  ;bits 3,2,1 of the first byte contain the length of the
                  ;string including the length byte.
                  ;the high nibble (0E) is the mask.
                  ;see also # in summary below
&       1 B       ;byte
                  ;the structure repeats until 351
351     B         ;bytes
358     C         ;more code
380     S         ;strings - list of messages
421     W         ;words
4FD     /B        ;no further byte references
502     /R        ;garbage here - turn off reference generation
502     /O        ;and output
600H    /O+       ;valid code - turn output back on
600     /R
600     C
1A60    /O-       ;output file about to fill diskette - turn output
                  ;off but keep scanning for references.
                  ;another run will be needed to get the remaining code.
1B00    /D        ;treat operand as immediate data
1DFD    /B+       ;continue with byte references
1F45    W user_prt     ;user provided labels will translate
2256    S $MSG         ;to upper case

Comments may be included if preceded by a semicolon.  Alphabetic
characters may be either upper or lower case.  An "H" may follow the
hex address.

* SAMPLE SESSION *

The external command CHKDSK.COM will serve as an example for this
sample session because it is short.  The .SEQ file is also short and
easy to generate.  Only these few instructions are needed.

0100    /T        ;include object code as comments in .ASM file
0100    /E        ;simpler output without references
04F7H   S         ;messages
04F7H   /H        ;append "H" to numeric values

Using DEBUG, browse through CHKDSK.COM to see how this was arrived at.
Usually, but not always, the best procedure is to assume code.  If the
code appears unintelligible, display it in hex/ASCII.  If it is not
text, assume bytes.  Label positions in the first disassembly may
indicate that some locations should be words.

Next, generate the .ASM file by typing

          ASMGEN CHKDSK.COM
          A

The assembly code can be viewed on the screen.  Then type

          A CHKDSK.ASM

to save the assembly source code to a file.  Then,

          R CHKDSK.TBL

to save the cross-reference table to disk.

The Macro Assembler, LINK.EXE and EXE2BIN could now be used to assemble
CHKDSK.ASM, link it to an .EXE file and convert it to a .COM file.  No
modification should be necessary in this case.

If working with code that is to be modified, the symbol types must be
correctly specified as locations or as constants.  If they are
constants, place them outside of any segment.  The label names may then
be changed to make the code more readable.