Metropoli BBS
VIEWER: soundexa.asm MODE: TEXT (ASCII)
TITLE SOUNDEXA
PAGE ,132

;Author : bp-programs, Kalamazoo, Michigan
;Date: Jan 88, for Clipper Summer 87
;Source Code protected by United States Copyright Law
;Permission given for code to be incorporated in other programs by author

;Syntax: SOUNDEXA(string[,filler])

;The soundex code is useful to look up names where you aren't sure of the
;spelling.  Codes for similar sounding names are generally (but NOT always)
;close together. The code has the format LETTER-DIGIT-DIGIT-DIGIT. LETTER is
;simply the upper case first letter of the name. DIGITs are derived from the
;translation table below.  Empty positions are NOT translated.  If there are
;two or more letters with the same code following each other in the name, only
;ONE code number is used. 'Schmidt' is 'S530', not 'S253' or 'S533'. If there
;are more than three code numbers, the extra ones aren't used.  If there are
;fewer, the code is padded with zeros. (But see about FILLER below).

;Soundex               ABCDEFGHIJKLMNOPQRSTUVWXYZ
;Translation Table:     123 12  22455 12623 1 2 2

;SOUNDEXA is an assembly language implementation of the soundex code.  It
;follows my interpretation of the algorithm found on pages 392/393 of Knuth's
;book 'Sorting and Searching', volume 3 of "The Art of Computer Programming".

;It does NOT return the same code as the soundex routine in examplec.c
;(SOUNDEXC) distributed with Clipper Summer 87 or the Rettig soundex routine
;distributed with Clipper Autumn 86 in extenddb.prg (SOUNDEXD).
;The main differences among the three implementations are listed below.

;           SOUNDEXA                 SOUNDEXC               SOUNDEXD
;           ----------------------   ---------------------  ----------------
;Format     A999                     A999                   A9999

;Dupes      Skips ltrs generating    Skips identical ltrs   Skips duplicate
;           the same code which are  adjacent in original   code numbers even
;           immediately adjacent in  text                   if not adjacent in
;           original text                                   original text

;Null       1. Null string           1. Null string         1. Null string
;Returns    2. Completely non-alpha  2. Non-alpha/non
;              string                   space characters
;                                       except first char

;Fault      1. Ltrims leading non-   1. Does not trim, uses 1. Does not trim,
;Tolerance     alpha characters         non-alpha as lead      uses any char
;           2. Skips intermediate    2. Aborts with non-    2. Skips inter-
;              non-alpha characters     alpha/non-space        mediate non-
;                                       except first char      alpha chars

;Speed      3 secs/5000 repeats      9 secs/5000 repeats    90 secs/5000 repts
;I believe, of course, that SOUNDEXA is the 'best' implementation because
;it's closest to Knuth's algorithm, most fault tolerant, fastest (and also
;smallest, by the way) and the most FLEXIBLE.  More about this below.

;Knuth's algorithm uses 0s (character zero) to fill trailing empty slots.
;This makes sense when you're constructing an index, such as

;       INDEX ON SOUNDEXA(LASTNAME) TO NAMX

;However, when you're SEEK/LOCATEing with SOUNDEX you generally want to find
;all likely candidates and want to make sure that you don't miss any.  You'd
;rather find a few wrong ones than miss a single right one. In that case
;you want to include even partial matches, such as

;       LOCATE ALL FOR TRIM(SOUNDEXA(PART_NAME))

;SOUNDEXA allows you to select between two fillers, spaces or '0'.  Even
;though zeros are 'standard', I find spaces more flexible and have made them
;the default. By specifying a second argument SOUNDEXA(LASTNAME,FILLER) once,
;you change the state of the routine.  If FILLER is a '0' (as a character, not
;a number), all future calls to SOUNDEXA will use zeros for filling.  If
;FILLER is any other character (or even a null string), SOUNDEXA will use
;spaces in the future.  If there isn't a second argument, SOUNDEXA will use
;what you specified before or the default. If you prefer zeros as the default,
;change the FILLER DB to '0' in the DATASG.


;===================================================
EXTRN   __PARINFO:FAR  ;Clipper EXTEND routine, tells how many arguments
EXTRN   __PARC:FAR     ;Clipper EXTEND routine, gets a character argument
EXTRN   __RETC:FAR     ;Clipper EXTEND routine, returns a character value

SX_LENGTH       EQU    4     ;Length of soundex code

DGROUP  GROUP   DATASG       ;Ties this segment to the other data segments
                             ;of Clipper.  DS points to this DGROUP when
                             ;we arrive in the assembly routine

DATASG SEGMENT WORD PUBLIC 'DATA'  ;All PUBLIC segments with the name DATASG
                             ;will be combined by the linker.  All segments
                             ;with the class 'DATA' will be adjacent to
                             ;each other. WORD means that the segment
                             ;starts on an even byte, which can sometimes
                             ;be minutely faster in an 8086/80286 machine.

SOUNDEX      DB   SX_LENGTH DUP (?)        ; Space for SOUNDEX result
             DB 00                         ; Terminator byte
             ;Strings in C and Clipper are terminated by a NULL (or NUL or
             ;NIL, it all means the same thing).  There is no length byte
             ;or word as in BASIC or Turbo Pascal.
FILLER       DB ' '          ; Filler byte for padding of SOUNDEX, can be
                             ; space (default) or '0'
                ; Translate table from UC letters to SOUNDEX codes
                ; Omitted letters return NULL
;               'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
;               ' 123 12  22455 12623 1 2 2'
TRANSLATE       db 0,'123',0,'12',0,0,'22455',0,'12623',0,'1',0,'2',0,'2'

DATASG  ENDS
;==================================================


;==================================================
_PROG   SEGMENT   BYTE PUBLIC 'CODE'      ;All PUBLIC segments with
                                          ;the name _PROG will be com-
                                          ;bined, all segments with the
                                          ;class 'CODE' will be
                                          ;adjacent.  BYTE means that
                                          ;the segments will be aligned
                                          ;(stuck together) without any
                                          ;padding.

ASSUME  CS:_PROG, DS:DGROUP, ES:NOTHING   ;This is the way the segment
                                          ;registers are set up when we
                                          ;arrive here from Clipper.
PUBLIC  SOUNDEXA        ;Used in linking to Clipper, lets Clipper know
                        ;where this routine is.

SOUNDEXA     PROC FAR   ;The name of our routine (procedure)

             PUSH BP         ;The Clipper extend documentation on disk
             PUSH DI         ;says that we have to save registers
             PUSH SI         ;BP, DI, SI, ES and DS.  We are not
             PUSH ES         ;disturbing BP, so we may not have to save it.
             PUSH DS         ;But the Clipper routines __PARINFO, __PARC
                             ;and __RETC may do so, we don't know.

             ;Ensure null string in case of missing argument or no letters
             ;We do this by moving a NULL byte in the first place of the
             ;SOUNDEX code.  It will be overwritten if there's no error.
             MOV BYTE PTR DGROUP:[SOUNDEX], 0

             SUB AX, AX      ;Faster and smaller than MOV AX, 0
             PUSH AX
             CALL __PARINFO  ;Find out how many arguments passed
             ADD SP, 2       ;Clean up stack. C routines, unlike BASIC
                                ;or Pascal do NOT clean up the stack.
             CMP AX, 1       ;Is there 1 argument?
             JE MAIN_ROUT    ;Yes, use stored filler for conversion

             CMP AX, 2       ;Are there two arguments?
             JNE LEAV        ;No, an invalid number of arguments. Leave.
             ;Two arguments, get the second one - a new FILLER
             PUSH AX         ;AX is always 2 here
             CALL __PARC     ;Get the address of FILLER string
             ADD SP, 2       ;DX:AX hold pointer to string
             MOV ES, DX
             MOV BX, AX      ;Use ES:BX to point to FILLER string
             MOV AL, ' '     ;Load default space character
             CMP BYTE PTR ES:[BX], '0'        ;Is the new FILLER a '0'?
             JNE SX010               ;No, all set with space
             MOV AL, '0'             ;Yes, make it a '0'
SX010:       MOV DGROUP:FILLER, AL   ;Set Filler character
             MOV AX, 1

             ;AX is always 1 here, either set above or CMP AX, 1
             ;Pointer to string in DX:AX (SEG:OFS) for Clipper S87
MAIN_ROUT:   PUSH AX
             CALL __PARC     ;Get pointer to string to convert
             ADD SP, 2       ;Pointer to string returned in DX:AX (SEG:OFS)
             ;Set up pointer registers, seg and ofs
             ;DS:SI - String to convert, pointer incrementing
             ;ES:DI - SOUNDEX code in DGROUP, pointer incrementing
             ;ES:BX - TRANSLATE in DGROUP, points always to base
             PUSH DS
             POP ES          ;ES now points to DGROUP
             MOV DS, DX      ;And DS to where ever Clipper stores
                             ;its string arguments.
             MOV SI, AX      ;DS:SI point to start of string to convert

             MOV DI, OFFSET DGROUP:SOUNDEX ;ES:DI point to start of
                                           ;SOUNDEX
             MOV BX, OFFSET DGROUP:TRANSLATE ;ES:BX point to TRANSLATE base
             CLD             ;Work upward in string instructions

             ASSUME DS:NOTHING, ES:DGROUP
             ;Let MASM know that we've switched seg regs around
             MOV CX, SX_LENGTH   ;Maximum SOUNDEX length

FIRST_LTR:   LODSB           ;Get start byte from string to convert
             OR AL, AL       ;At end of string to convert?
             JZ LEAV         ;NULL string or no letters anywhere in
                             ;string, return a NULL string
             ; Real chararacter here, but is it a letter?
             AND AL, 0DFH    ;This converts letters to upper case,
                             ;destroys other characters.  But since
                             ;we don't care about those, it's ok.
             MOV AH, AL      ;Save the possible starting letter
             SUB AL, 'A'     ;Subtract the ASCII value of A which
                             ;is 65. This makes A 0, B 1, C 2 etc.
             JS FIRST_LTR    ;Negative, so not a letter, try again
             CMP AL, 'Z' - 'A'       ;ASCII Z minus ASCII A is the largest
                                     ;real letter value.
             JA FIRST_LTR            ;Not a letter either, try again
             ;We found a valid UC starting letter. It's both in AH and AL
             XLAT DGROUP:TRANSLATE   ;Convert to SOUNDEX code 1-6 or NULL
                                     ;XLAT adds the value in AL to BX and
                                     ;fetches the character pointed to by
                                     ;(normally) DS:BX+AL.  Since in this
                                     ;case DS points the NOTHING and ES to
                                     ;DGROUP, MASM is smart enough to make
                                     ;a segment override so that XLAT gets
                                     ;the byte at ES:BX+AL and puts it in
                                     ;AL. (Replacing the original pointer)

             XCHG AH, AL     ;After switch, AH holds code,
                             ;AL the UC starting letter
             STOSB           ;Put first letter into SOUNDEX
             LOOP DIGITS     ;Decrement CX and jump to actual
                             ;digit conversion. Skip over one
                             ;piece of code.

ERR_DIGITS:  SUB AH, AH      ;Jump to here only when looping back
                             ;and we want to clear out false
                             ;'previous' letter matches if
                             ;there are non-letters in between.

DIGITS:      LODSB           ;Get the next character
             OR AL, AL       ;ORing a value is the fastest way to
                             ;find out if it's NULL (end of string)
             JZ ALL_DONE     ;Trailing NULL detected
             ;Not at end of string to convert
             AND AL, 0DFH    ;Convert to UC
             SUB AL, 'A'     ;Subtract ASCII 'A'
             JS ERR_DIGITS   ;Negative, not a letter
                             ;Clear out previous code in AH
             CMP AL, 'Z' - 'A'
             JA ERR_DIGITS   ;Not a letter either, clear previous
                             ;code in AH
             ;Valid UC letter in AL, 'previous' code in AH
             XLAT DGROUP:TRANSLATE   ;Convert to SOUNDEX code or NULL
             CMP AH, AL      ;Same code as previous letter?
             JE DIGITS       ;Yes, duplicate, don't add to SOUNDEX
             ;New code, not a duplicate
             MOV AH, AL      ;Save it as the new 'previous' code
             OR AL, AL       ;Is it a real or a null code?
             JZ DIGITS       ;Null code, don't add to SOUNDEX

             ;Valid code in AL, not the same as previous, add to SOUNDEX
             STOSB
             LOOP DIGITS     ;Continue until SX_LENGTH in SOUNDEX


ALL_DONE:    JCXZ LEAV       ;Complete soundex, CX counted down
             MOV AL, DGROUP:[FILLER] ; ' ' or '0'
             REP STOSB       ;Fill remainder of SOUNDEX with FILLER

LEAV:        POP DS          ;Restore DGROUP segment into DS
                             ;Clipper routines, such as __RETC
                             ;expect DS to be pointing to DGROUP

             PUSH DS         ;Push segment of SOUNDEX string
             MOV AX, OFFSET DGROUP:SOUNDEX
             PUSH AX         ;And push the offset of SOUNDEX
             CALL __RETC     ;Return pointer to SOUNDEX to Clipper
             ADD SP, 4       ;Clean up stack

             ;DS already popped above
             POP ES          ;Get remainder of saved registers back
             POP SI
             POP DI
             POP BP
             RET             ;Go back to Clipper

SOUNDEXA     ENDP
_PROG        ENDS
             END

[ RETURN TO DIRECTORY ]