HTMSTRIP.DOC                                           Revised: 09-03-96

The HTMSTRIP.EXE program attempts to read HTML pages, remove the HTML
coding, and write the result out as something more useful.

Features of this program:

  * Can be run across an entire subdirectory (for example, your entire
    cache subdirectory), and will only process the HTML documents that
    it finds. (There are some options on this.)
  * Removes all embedded HTML commands.
  * Recodes the standard HTML "entity references" (e.g. "&copy;"
    becomes "(c)"). The actual replacements are coded in a
    user-modifiable lookup file.
  * Handles standard indent, heading, selection groups, menus, tables,
    etc.
  * Reflows all text as appropriate.
  * Optionally, will replace Link, Image, and Input references with
    user-definable text representations.
  * Optionally, alerts you to possible errors in the HTML code itself.

HTML codes are enclosed within <...> indicators. For upward
compatibility reasons, Web browsers ignore any codes that they don't
understand and just process the ones they can handle. Note that
HTMSTRIP is currently geared for handling HTML 2.0 files plus the
Netscape table-specific extensions (added to HTML 3.0).

HTMSTRIP removes all HTML codes. It also handles the standard HTML
"&xxx;" entity references (e.g. "&copy;" is replaced by "(c)"). You can
add or change these replacements as desired by using the INI file
(documented later).

HTMSTRIP is also tuned to handle several embedded HTML codes specially.
These codes are the following:

    <A>                           External link
    <BASE>                        Establishing a root directory
    <BLOCKQUOTE>...</BLOCKQUOTE>  Indented block of text
    <BR>                          Forced line break
    <CAPTION>...</CAPTION>        Title for a table
    <CENTER>...</CENTER>          Centering text
    <DD>                          Term definition
    <DIR>...</DIR>                Directory list of items
    </DL>                         End of definition list
    <DT>                          First term of definition list/glossary
    <H1> to <H6>,
    </H1> to </H6>                Heading items
    <HR>                          Horizontal rule
    <IMG>                         Image
    <INPUT>                       User input
    <LI>                          Menu/Ordered/Unordered/Directory list item
    <MENU>...</MENU>              Menu listing
    <OL>...</OL>                  Ordered listing

    for external links    -> (link)
    for image references  -> (image)
    for user inputs       -> [Input]

You can redefine any and all of these entity references in the same
lookup file. These substitutions are specified more or less like the
previous substitutions:

    <A>     = (link)
    <IMG>   = (image)
    <INPUT> = [Input]

Unlike with the other lookups, the left side is not case sensitive so
"<a>=(link)" works just fine. Hexadecimal and decimal replacements are
again acceptable (see BRUCEHEX.DOC file). You might, for example, want
to redefine some of them like this:

    <A>     = \251    /* Replaces <A> with a √ symbol
    <IMG>   = \015    /* Replaces <IMG> with a ☼ symbol (little flash cube)
    <INPUT> = ?       /* Replaces <INPUT> with a question mark

Any symbolic references that you do not redefine will default to their
original values. If /-SYMBOLS is specified, any symbolic definitions
are ignored and a "NULL" replacement string is used for all of them.

Note: <A> references are also affected by the /A=spec parameter.
      <IMG> references are also affected by the /IMG=spec and
      /IMGALT=spec parameters.

Syntax:

    HTMSTRIP [ filespec | @listfile ] [ outfile ] [ /EXT=.xxx ] [ /ALL ]
             [ /WIDTH=n ] [ /RULE=s ] [ /BORDER=c ] [ /BUFF=n ]
             [ /SPACES ] [ /SYMBOLS ] [ /A=spec ]
             [ /IMG=spec | /IMGALT=spec | /ALTONLY ] [ /-INDENT ]
             [ /WARNINGS ] [ /-TABLE ] [ /Tpath ] [ /MONO ]
             [ /Iinitfile | /-I ] [ /Linitfile ] [ /? ] [ /?&H ]

where:

"filespec" tells the routine which file or files are to be processed.
    The specification can include path and wildcards if desired.
    Typically, the file names are *.HTM files. If no input
    specification (filespec or @listfile) is provided, you'll be
    prompted for one.

"@listfile" allows you to have a variety of file specifications saved
    in a text file named "listfile". Each line in the file should
    consist of one file specification, each of which can include a path
    and wildcards if desired. Blank lines and lines beginning with
    semi-colons, colons, or quotes are ignored.
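
The list file rules above can be sketched in a few lines of Python. This
is only an illustration of the documented rules, not HTMSTRIP's own code:
the function name and sample paths are hypothetical, and it assumes that
"quotes" covers both single and double quotes and that leading blanks on
a line are skipped before the first character is checked.

```python
# Characters that mark a line to be ignored when it starts with one of them.
IGNORED_PREFIXES = (";", ":", '"', "'")


def parse_listfile(text):
    """Return the file specifications found in the text of a list file."""
    specs = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Blank lines and lines starting with ; : or a quote are ignored.
        if not line or line[0] in IGNORED_PREFIXES:
            continue
        specs.append(line)
    return specs


# Hypothetical list file contents, one file specification per line.
sample = "\n".join([
    "; Netscape cache",
    r"C:\NETSCAPE\CACHE\*.HTM",
    "",
    ": another comment",
    r"D:\PAGES\INDEX.HTM",
])

for spec in parse_listfile(sample):
    print(spec)
```

Each returned entry is one file specification, which may still contain a
path and wildcards to be expanded.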

"outfile" is the name of the output file to create. It is overwritten
    if it already exists. If no output file name is provided, the
    routine will use the input file name with an extension of .OUT.
    (The default .OUT extension can be overridden using the /EXT=.xxx
    parameter.) An outfile cannot be specified if wildcards or
    @listfile are used for the input file specification.

"/EXT=.xxx" allows you to specify a different default file extension
    for the output file. This parameter only matters if you do not
    explicitly specify an output file name. Initially defaults to
    "/EXT=.OUT".

"/ALL" says that if the program encounters what it thinks is just a
    text file, it should only fix up the file's CR/LF problems (Unix
    files end lines with LF instead of the CR/LF that DOS needs). This
    can be somewhat risky since the program may misdiagnose a file, but
    it should be safe if you're running it on your cache subdirectory.
    Initially defaults to "/-ALL", which means a file is not processed
    unless the program thinks it's an HTML file.

"/-ALL" says to skip any file that the program thinks is not an HTML
    file. This is initially the default.

"/WIDTH=n" specifies the desired line length for wrapping long lines
    and also for centering. Initially defaults to "/WIDTH=80".

"/RULE=s" specifies a string to be repeated across the width of the
    line. This is used to separate sections. The string can be a single
    character (like "/RULE=-") or multiple characters (like
    "/RULE="- ""); it can contain decimal and hexadecimal characters
    (like "/RULE=\066\097\116"; see BRUCEHEX.DOC); it can be
    "/RULE=NULL" (which typically results in a blank line); or it can
    be simply "/RULE" (which is the same thing as "/RULE=-" if
    /BORDER=T and "/RULE=\196" if /BORDER=S or /BORDER=D). Personally,
    if your printer supports IBM graphics characters, I find
    "/RULE=\196" to be the most pleasing of the rule lines.
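
As a rough illustration of how a rule string fills a line, here is a
minimal Python sketch. It is not HTMSTRIP's implementation: the function
names are hypothetical, only the three-digit decimal form (\ddd) shown
above is decoded (the hexadecimal form is described in BRUCEHEX.DOC and
is not attempted here), the codes are assumed to be IBM code page 437
characters, and a pattern that does not divide the width evenly is
assumed to be cut off at the line width.

```python
import re


def decode_decimal_escapes(s):
    r"""Turn each \ddd (three decimal digits) into its code page 437 character."""
    return re.sub(
        r"\\(\d{3})",
        lambda m: bytes([int(m.group(1))]).decode("cp437"),
        s,
    )


def rule_line(rule, width=80):
    """Repeat the rule string across the line width, cutting off any excess."""
    if rule == "NULL":
        return ""  # a NULL rule gives a blank line
    pattern = decode_decimal_escapes(rule)
    if not pattern:
        return ""
    repeats = width // len(pattern) + 1
    return (pattern * repeats)[:width]


print(rule_line("-", 10))              # ----------
print(rule_line("- ", 7))              # - - - -
print(rule_line(r"\066\097\116", 9))   # BatBatBat
```

With this decoding, "\196" maps to the single horizontal line character
used by the single-line border.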

"/BORDER=c" specifies the type of border to use. The possible choices
    for "c" are "D" (double line), "S" (single line), "T" (text line),
    "B" (blanks), or "N" (none). You can also specify "DV" (double
    line, vertical lines only except for headers and table
    top/bottom), "SV" (single line, vertical lines), and "TV" (text
    line, vertical lines). The default is /BORDER=S.

    /BORDER=B shows spaces instead of delimiters whereas /BORDER=N
    skips the blank lines between cells entirely. Examples of the
    other three basic styles:

           Text                  Single                 Double
    +-----+-----+-----+   ┌─────┬─────┬─────┐   ╔═════╦═════╤═════╗
    |  1  |  2  |  3  |   │     │     │     │   ║     ║     │     ║
    +-----+-----+-----+   ├─────┼─────┼─────┤   ╠═════╬═════╪═════╣
    |  4  |  5  |  6  |   │     │     │     │   ║     ║     │     ║
    +-----+-----+-----+   ├─────┼─────┼─────┤   ╟─────╫─────┼─────╢
    |  7  |  8  |  9  |   │     │     │     │   ║     ║     │     ║
    +-----+-----+-----+   └─────┴─────┴─────┘   ╚═════╩═════╧═════╝

"/BUFF=n" specifies how many spaces to place on either side of the
    vertical bars in the tables. Defaults to /BUFF=1.

"/SPACES" keeps the extra vertical spacing between sections. There are
    frequently lots of extra blank lines that appear in the output
    file, either due to specific HTML requests or to ensure proper
    reformatting. Specifying /SPACES allows these to stay there.

"/-SPACES" removes these extra blank lines. This also tries to remove
    empty columns in tables as well as some blank rows in tables. This
    is initially the default.

"/SYMBOLS" says to allow (unless redefined in your INI file) the
    "(link)", "(image)", and "[Input]" indicators. Initially defaults
    to "/-SYMBOLS".

"/-SYMBOLS" skips the indicators even if they're defined in your INI
    file. This is initially the default.

"/A=spec" tells the program how to handle <A> links. These are used
    when the program is supposed to hop to another HTML page or to a
    section within the current HTML page.
    The values of "spec" are mutually exclusive:

    /A=FSITE    says to show the site name, using its full URL
                address, and embed this name in the body of the text
                page
    /A=FSITEFN  says to show the site name, using its full URL
                address, and place this site name in a footnote
                section at the end of the text page
    /A=SITE     says to show the site name, but only the part after
                the last "/" or "\", and embed this name in the body
                of the text page
    /A=SITEFN   says to show the site name, but only the part after
                the last "/" or "\", and place this site name in a
                footnote section at the end of the text page
    /A=SYMBOL   says to use the specified symbol but to only do this
                if /SYMBOLS is in effect; this is initially the
                default

"/IMG=spec" tells the program how to handle <IMG> links. These are
    used for embedded graphics. The values of "spec" are mutually
    exclusive and are documented in the "/A=spec" section above.

"/IMGALT=spec" is identical to "/IMG=spec". However, if /IMGALT=spec
    is specified, the program will look for an ALT=alias reference in
    the <IMG> link and use that if found. Note that the alias will be
    used in its entirety if it's found and it will be embedded in the
    output text (appearing within brackets). The "spec" items are used
    for any <IMG> reference that doesn't have an ALT=alias
    specification; in this case, the program works identically to
    /IMG=spec for these. So site names might be placed at the bottom
    as footnotes if /IMGALT=SITEFN or /IMGALT=FSITEFN is used, but any
    ALT=alias items are always in the text itself.

"/ALTONLY" specifies that if an ALT=alias reference exists in an <IMG>
    link, then the alias should be embedded in the output text
    (appearing within brackets) but, otherwise, all <IMG> references
    in the input file are to be ignored.

"/-INDENT" removes block indent sections from the output file. By
    default, five spaces are inserted before each line within a
    <BLOCKQUOTE>...</BLOCKQUOTE> block. These can be nested so you can
    end up with a lot of white space in your document. "/-INDENT"
    turns them off. Initially defaults to "/INDENT".

"/INDENT" retains the <BLOCKQUOTE>...</BLOCKQUOTE> indenting. This is
    initially the default.

"/WARNINGS" displays warnings when HTMSTRIP finds either internal
    problems in the document or things it can't handle. Initially
    defaults to "/-WARNINGS".

"/-WARNINGS" turns off the warning messages. This is initially the
    default.

"/TABLE" says to process text within table declaration sections as
    tables whenever the program can. There are some maximum cell
    length limits in the program and some tabular text will be dumped
    as straight ASCII text anyway. This is initially the default.

"/-TABLE" says to process text within table declaration sections as
    straight text, removing it from the tabular structure entirely.
    There are cases where page authors have switched to tables for
    formatting purposes and the resulting pages, when converted to
    text, are mostly space filled. Initially defaults to "/TABLE".

"/Tpath" specifies where to write the temporary files that the routine
    needs. Examples are "/TC:" and "/TC:\TEMP". If not specified, the
    routine tries the following in sequence:

      - the value of any TEMP, then TMP, environment variable
      - C:\TEMP
      - C:\

"/MONO" (or "/-COLOR") does not try to override screen colors.
    Initially defaults to "/COLOR".

"/COLOR" (or "/-MONO") allows screen colors to be overridden. This is
    initially the default.

"/Iinitfile" says to read an initialization file with the file name
    "initfile". The file specification *must* contain a period. If no
    drive or path information is specified, the program will search
    for initfile beginning in your default subdirectory and then going
    throughout your DOS path. The use of an initialization file is
    optional. Initially defaults to "/IHTMSTRIP.INI".

"/-I" (or "/INULL") says to skip loading the initialization file.

"/Linitfile" says that the "&xxx;" and HTML code lookup entries are
    found in a file other than the default "/Iinitfile" file. This is
    primarily useful if you want to have a master *.INI file and a
    separate code lookup table.

"/?" or "/HELP" or "HELP" shows you the syntax for the command.

"/?&H" gives you a hexadecimal and decimal conversion table.

Author:

This program was written by Bruce Guthrie of Wayne Software. It is
free for use and redistribution provided relevant documentation is
kept with the program, no changes are made to the program or
documentation, and it is not bundled with commercial programs or
charged for separately. People who need to bundle it in for-sale
packages must pay a $50 registration fee to "Wayne Software" at the
following address.

Additional information about this and other Wayne Software programs
can be found in the file BRUCEymm.DOC, which should be included in the
original ZIP file. ("ymm" is replaced by the last digit of the year
and the two-digit month of the release. BRUCE508.DOC came out in
August 1995. This same naming convention is used in naming the ZIP
file that this program was included in.)

Comments and suggestions can also be sent to:

    Bruce Guthrie
    Wayne Software
    113 Sheffield St.
    Silver Spring, MD 20910

    fax: (301) 588-8986
    e-mail: bguthrie@nmaa.org
    http://hjs.geol.uib.no/guthrie/

See BRUCEymm.DOC file for revision history.

Please provide an Internet e-mail address on all correspondence.