Copyright 1996, Hyperion Softword

*************************************
*  Orpheus 2, beta release 2.00.30  *
*************************************

Comments and queries to:

  Hyperion Softword,
  535 Duvernay, Sherbrooke, QC  J1L 1Y8, Canada
  tel/fax - 819-566-6296 (Rod Willmot)
  email   - willmot@interlinx.qc.ca

Contents of this file:

  Purpose
  Usage
  Registration Requirements
  Project Requirements
  Memory and Disk Requirements
  Index Order
  Output File 1 - ALLWORDS.LOG
  Making an Exclusion List
  Making an Inclusion List
  Output File 2 - projname.IDX
  Reader Interface - Full-text Search
  Foreign Language Support

Last revised: 96-02-07


PURPOSE:
========

OHINDEX.EXE (The Orpheus Indexer) is a utility for indexing finished documents created with Orpheus. The resulting IDX file enables full-text search in the Reader. Note that the Indexer belongs to the feature set of Orpheus Professional; see "Registration Requirements" below.


USAGE:
======

  OHINDEX [/options] projname[.prj]

The command-line must include the name of an existing project. For example, if the project is TEST (with a project file named TEST.PRJ), a minimal command-line would be "OHINDEX test". The project must be in at least a semi-finished state, with a valid NODELIST file in the project directory; see "Project Requirements" below.

Command-line switches:

/a - allow high-ascii characters that may be accented letters in a particular language; default is to treat them as spaces. See "Foreign Language Support" elsewhere in this document, and the /s switch below.

/e[filespec] - use an exclusion list to specify words that you wish to leave out of the index. If no filespec is given the program looks for EXCWORDS.TXT, first in the current directory, then in the project directory, and finally (if still not found) in the directory containing OHINDEX.EXE. If you do give a filespec the file may be named whatever you please; the filespec can include a drive/path if different from the current directory.
See "Making an Exclusion List" below.

/h - include hyphenated words; default is to treat a hyphen like a space, e.g. indexing "helter-skelter" as two words, "helter" and "skelter". If you use the /h switch the index treats "helter-skelter" as one word (unless the hyphen is on a line-break).

/i[filespec] - use an inclusion list to specify word *combinations* that you wish to be indexed. If no filespec is given the program looks for INCWORDS.TXT, first in the current directory, then in the project directory, and finally (if still not found) in the directory containing OHINDEX.EXE. If you do give a filespec the file may be named whatever you please; the filespec can include a drive/path if different from the current directory. See "Making an Inclusion List" below.

/m[#] - specify the minimum length of words to include; default is 3 (leaving out "he", "it", "to"...). For example, /m4 sets a minimum word length of 4 characters.

/n - include numbers (and words beginning with numbers); default is to leave them out. For example, "2001" is normally left out, as is "21st". When number-words are included, the characters ",-.:" (comma, hyphen, period, and colon) are permitted inside them; for example, "1,1", "2-2", "3.3", and "4:4" would all be indexed as words.

/q - swiftly creates the ALLWORDS.LOG (see below), but does not store location data or build the final index (IDX file). Uses half or less memory than a normal run; see "Memory and Disk Requirements" below.

/s[filespec] - strip accents from high-ascii characters according to a conversion list; default is to leave them alone. Use of this switch automatically turns on the /a switch, but the reverse is not true. Foreign-language users MUST make use of this feature to ensure correct sorting. If no filespec is given we use a default conversion list. See "Foreign Language Support" elsewhere in this document.

/t[#] - specify the maximum number of location records to store for a single word; default is 6000.
The top limit for this number is 12000 due to memory demands in the Reader when Boolean search is provided. See "Memory and Disk Requirements" below.

Switches must be given before the project name, separated by a space, and you can use "-" instead of "/".

Examples:

  OHINDEX /q test

Performs a "quick" index of the TEST project, but only so far as to make the ALLWORDS.LOG; it does not store location data or build the final index. Uses half or less memory than a normal run. See "Memory and Disk Requirements" below.

  OHINDEX /m4 /n /h /s /a /e test

Sets a minimum word length of 4 characters; includes numbers and hyphenated words (unless broken by a line-break); allows for words containing accented (high-ascii) letters and strips the accents according to the default conversion list; and uses the exclusion list EXCWORDS.TXT, which may be in the current directory, in the project directory, or in the directory containing the Indexer. Indexes the TEST project.

  OHINDEX /emylist.doc test.prj

Uses the exclusion list MYLIST.DOC in the current directory, and indexes the TEST project. Uses the default settings of 3 for the minimum word length, leaves out numbers (and words beginning with numbers), and breaks hyphenated words into multiple words.


REGISTRATION REQUIREMENTS:
==========================

OHINDEX.EXE is part of the Orpheus Professional feature set. If you are registered at the standard level, you are welcome to evaluate the Indexer, but you may not distribute the IDX files with your finished works. (If the Reader opens an IDX file with a work that was not assembled by a user with an Orpheus Professional licence, it considers that work to have been created with a *shareware* copy of the software, and displays the "unregistered shareware" warning.)

(If you do have an Orpheus Professional licence and the Indexer says you don't, make sure it can find your OHREG.KEY file. It should be in the same directory as OHINDEX.EXE and your other Orpheus system files.)
Please contact Hyperion Softword if you wish to upgrade to Orpheus Professional.


PROJECT REQUIREMENTS:
=====================

Before you can index a project it must pass through the first two stages of project building: compilation and link-verification. (These are performed through OH.EXE's Build Project dialog, on the Project Menu.) To ensure that index data corresponds precisely to the contents of your finished work, the Indexer works with the compiled versions of your cards, which are stored in CMP files. Just like the assembler (the final stage of project building), it uses the NODELIST file to select only those cards that belong in the finished work. Only Text cards are indexed.

Because the index gives the exact location of every instance of every word that is included in the index, you have to do your part to keep the index and HTX synchronized. Whenever you update the HTX you should regenerate the index immediately afterwards.


MEMORY AND DISK REQUIREMENTS:
=============================

The amount of memory required to index a project depends on how many unique words are included. The amount of disk space required depends on the total word count being included. During processing a large amount of temporary data is swapped to disk; since this can easily amount to the total uncompiled size of your project, you will need plenty of free disk space.

An example: at approximately half a megabyte of uncompiled text, the online Help for Orpheus contains over 55000 words of 3 or more characters. Of that total the Indexer identifies some 3241 unique words. On a normal run (no exclusions or other switches) it uses a maximum of about 128000 bytes of RAM. On a "quick" run (with the /q switch), memory use falls by more than half, to under 61000 bytes of RAM.

The /q switch tells the Indexer not to store any data about the locations of words. This frees up a substantial amount of memory and reduces disk use to a minimum.
If your project contains tens of thousands of unique words, the Indexer may not be able to get through to the end (on a normal run) without running out of memory. If this happens, try again with the /q switch. This will at least generate an ALLWORDS.LOG which you can use to make an exclusion list; both of these are discussed below. Providing the Indexer with a substantial exclusion list will free up a proportional amount of memory, and of course will slim down your final IDX file.

You can reduce the size of the final output file by using the /t switch to set the maximum number of location records to store for a single word. The default for this variable is 6000. It could be argued that any word occurring more than a certain number of times is too common to merit a search under any circumstances. What that maximum is depends on the project and your expectations; it could be as low as 500 or as high as 12000 -- the highest currently permitted.

The top limit is 12000 due to memory demands in the Reader. With searches on a single word the Reader uses the same very modest amount of memory no matter how many records there are. However, with multi-word or Boolean searches (planned for future development), memory use is exactly proportional to the number of records.


INDEX ORDER:
============

The index is alphabetically sorted. However, if numbers are included, such as "21st" or "2001", they are given first (in ascending order). If high-ascii characters are included and the /s switch isn't used, any words that begin with an accented letter are given last (after the last letter of the regular alphabet). Since sorting is by ascii-value, words beginning with accented characters may not be in the order expected for a particular language. Use of the /s switch corrects this phenomenon (see "Foreign Language Support").
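The ordering described above can be pictured with a small Python sketch. This is only an illustration, not the Indexer's own code: the function name index_order is hypothetical, the words are assumed to be lowercase and encoded in code-page 437, and number-words are assumed to be compared character by character like everything else (the text does not say whether "ascending order" is numeric).

```python
# Sketch only: byte-value ordering as described under INDEX ORDER.
# Assumptions (not from the Indexer itself): lowercase words, CP437
# encoding, and plain byte-by-byte comparison for every word.

def index_order(words):
    """Sort words by the raw byte values of their CP437 encoding."""
    return sorted(words, key=lambda w: w.encode("cp437"))


words = ["zebra", "2001", "apple", "école", "21st", "eclair"]
print(index_order(words))
# -> ['2001', '21st', 'apple', 'eclair', 'zebra', 'école']
```

Digits have lower byte values than letters, so number-words come first; "é" is byte 130 in code-page 437, so "école" lands after "zebra" rather than next to "eclair".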
OUTPUT FILE 1 - ALLWORDS.LOG:
=============================

The first product of the Indexer is a file named ALLWORDS.LOG, which is placed in the project directory of the project being indexed. This is a plain text file containing a complete list, in alphabetical order, of the words included in the index. Each word is given on a separate line, followed by a space and a number; the number is the word's "hit count" -- how many times it was encountered. You can view this file with any text editor or file viewer, or even load it into OH.EXE.

The ALLWORDS.LOG is generated for your use, not the Indexer's. You may delete it if you wish, but it does have an important purpose -- to enable you to generate an exclusion list, discussed below.

When you use the /q switch on the command line, the Indexer *only* makes ALLWORDS.LOG.


MAKING AN EXCLUSION LIST:
=========================

While developing the Indexer I tested it on the online Help for Orpheus. Here is a sample of the ALLWORDS.LOG from the preliminary output:

  abandoned 1
  ability 16
  able 13
  abort 6
  aborting 3
  aborts 1
  absence 3
  absolutely 1

Obviously, none of those words has anything to do with a significant topic in Help. Storing them in the index, with their location records, would require some 300 bytes of dataspace; storing the word "would" (with some 122 location records) would require over 600 bytes of dataspace. Multiply by thousands for popular words like "then" and "there", and watch your disk fill up with useless data.

The solution is to make an exclusion list: a text file in the same format as the ALLWORDS.LOG, listing all of the words that you do NOT want in your index. Please note the following specifications:

* The exclusion list must give one word per line; anything after the first word on a line will be ignored. Therefore, you can copy lines directly from the ALLWORDS.LOG into your list without having to remove the numbers.

* Lines beginning with ";" or "/" are ignored.

* The exclusion list can use both uppercase and lowercase.

* The exclusion list must be in strict alphabetical order, as in ALLWORDS.LOG. If you are including numbers but wish to exclude specific numbers (or words beginning with numbers), give them first and in ascending order. If you are including high-ascii characters but wish to exclude specific words beginning with high-ascii characters, give them last.

* If you are using accented characters (e.g. in French or Dutch) and are using accent-stripping with the /s switch, the exclusion list MUST NOT contain any accents. This is because accent-stripping is performed just before the test for exclusion. If you intend to build your exclusion list on the basis of ALLWORDS.LOG, do so with accent-stripping enabled right from the start, since the LOG will then contain no accents. If you obtain a word list from some other source, you may need to use a word processor to convert the accented characters if any.

* The Indexer will only use your exclusion list if you tell it to, with the /e switch on the command line.

* The exclusion list may be as short as a single word, but cannot be longer than 65535 bytes. I may increase capacity if there is demand for it.

A simple way to create an exclusion list is to run the Indexer once on your project (use the /q switch to do this quickly), then load the resulting ALLWORDS.LOG in OH.EXE or any text editor, and do one of the following:

  - either copy selected lines into a separate EXCWORDS.TXT file...

  - or *delete* from the LOG whatever words you do want in the index, leaving behind all those that you don't. Be sure to rename the resultant file EXCWORDS.TXT to prevent the Indexer from overwriting it later.

The default filename for the exclusion list is EXCWORDS.TXT.
If you use the /e switch without a filespec, the Indexer looks for EXCWORDS.TXT in the current directory; if the file is not there, it looks in the project directory for the project being indexed; and if not there, it looks in the directory containing OHINDEX.EXE (in case that is different). If you do give a filespec the Indexer looks for exactly the file you specify.

Once you have created an exclusion list you can continue to extend it for use with the same project or with other projects. It doesn't matter if the list contains words that are not even used by a project. What does matter is that the list be in strict sorted order as discussed above. If the list falls out of order or if you are uncertain how to sort accented characters, you can easily sort the file by using the DOS SORT command. (See DOS help on SORT.EXE for details.)


MAKING AN INCLUSION LIST:
=========================

For special purposes you may wish to include word combinations in the index. This can be done with the aid of an inclusion list, using the /i switch on the command-line together with a file named INCWORDS.TXT (created by you with any text editor).

For example, in a legal work references to articles of the law may take the form "Article 2" or "Art. 2", and it would be useful to have the index list all occurrences of "Art." or "Article" *together with* those numbers. Somewhat differently, a work on beverages might include references to drinks whose names consist of more than one word, such as "Harvey Wallbanger". An inclusion list to handle these examples would go like this:

  "art. "+
  "article "+
  "harvey wallbanger"

The rules are obvious: the main part of the combination must be enclosed in quotation marks. In the case of "harvey wallbanger", that's all there is. In the other two cases, adding a "+" plus sign tells the Indexer to add to the combination any additional characters UP TO THE END OF THE WORD. Since "art. " and "article " each end with a space, this means that anything AFTER the space will be treated as belonging to the same word. Thus "Art. 2" will be included, as will "Article Fifty-nine" and so on.

NOTE: In some cases you will need to add other switches to make sure that everything is included as desired. Adding the /h switch ensures that hyphenated words are allowed. Adding the /n switch ensures that numbers are allowed, and that a number-word may include such characters as "-,.:". Thus, with the /n switch and the "art. "+ entry in the inclusion list, "Art. 2.20.17" would be indexed as if it were a single word.

Please note the following additional specifications:

* The inclusion list must give one entry per line. No leading spaces are allowed, and the entry must be enclosed in quotation marks, optionally followed by a "+" sign as discussed above.

* Lines beginning with ";" or "/" are ignored.

* The inclusion list can use both uppercase and lowercase; within the IDX, all words are rendered in lowercase characters.

* The inclusion list must be in strict alphabetical order, as in ALLWORDS.LOG and the exclusion list.

* If you are using accented characters (e.g. in French or Dutch) and are using accent-stripping with the /s switch, the inclusion list MUST NOT contain any accents.

* The Indexer will only use your inclusion list if you tell it to, with the /i switch on the command line.

* The inclusion list may be as short as a single line, but may not be longer than 12288 bytes.

The default filename for the inclusion list is INCWORDS.TXT. If you use the /i switch without a filespec, the Indexer looks for INCWORDS.TXT in the current directory; if the file is not there, it looks in the project directory for the project being indexed; and if not there, it looks in the directory containing OHINDEX.EXE (in case that is different). If you do give a filespec the Indexer looks for exactly the file you specify.
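If you want to check an inclusion list before running the Indexer, the rules above can be approximated in a few lines of Python. This is a rough sketch under stated assumptions, not how OHINDEX.EXE reads the file: the function name parse_inclusion_list is hypothetical, blank lines are assumed to be skipped (the text does not say), and "strict alphabetical order" is approximated by plain character-code comparison of the lowercased entries.

```python
# Sketch only: a checker for the inclusion-list rules listed above.
# Assumptions: blank lines are skipped, and ordering is compared by
# character code after lowercasing. Neither is confirmed by the Indexer.

MAX_INCLUSION_BYTES = 12288  # size limit stated in the text


def parse_inclusion_list(text):
    """Parse inclusion-list text into (phrase, extend) pairs.

    `extend` is True when the quoted entry is followed by "+".
    Raises ValueError when a rule is broken.
    """
    if len(text.encode("cp437", errors="replace")) > MAX_INCLUSION_BYTES:
        raise ValueError("inclusion list is longer than 12288 bytes")

    entries = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip() or line[0] in ";/":
            continue  # blank line (assumed) or comment line
        if not line.startswith('"'):
            raise ValueError(f"line {lineno}: entry must start with a quotation mark")
        closing = line.rfind('"')
        if closing == 0:
            raise ValueError(f"line {lineno}: missing closing quotation mark")
        phrase = line[1:closing].lower()  # the IDX stores lowercase
        suffix = line[closing + 1:].strip()
        if suffix not in ("", "+"):
            raise ValueError(f"line {lineno}: unexpected text after entry")
        if entries and phrase < entries[-1][0]:
            raise ValueError(f"line {lineno}: entries are not in sorted order")
        entries.append((phrase, suffix == "+"))
    return entries


sample = '; legal references\n"art. "+\n"article "+\n"harvey wallbanger"\n'
print(parse_inclusion_list(sample))
# -> [('art. ', True), ('article ', True), ('harvey wallbanger', False)]
```

A line that begins with a space is rejected because it does not start with a quotation mark, which matches the "no leading spaces" rule.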
OUTPUT FILE 2 - projname.IDX:
=============================

On conclusion of a normal run (without the /q switch), the Indexer generates the final index to your project. This file has the same name as your project but with an extension of ".IDX". The IDX file can then be used by the Reader, as discussed below.

(Please note that you may only distribute the IDX with the finished version of your work if you are registered at the level of Orpheus Professional. See "Registration Requirements" above.)


READER INTERFACE - FULL-TEXT SEARCH:
====================================

The Reader interface is illustrated by the Search dialog in online Help.


FOREIGN LANGUAGE SUPPORT:
=========================

The Indexer works by default in terms of the English language, which does not use accented (high-ascii) characters. You can modify this behavior by using the /a or /s switch (or both) on the command line, as shown under "Command-line switches" in the "Usage" section at the top of this document.

The /a switch turns on inclusion of high-ascii characters within words, on the assumption that they may be accented letters. (Only characters that MIGHT be letters are so included; those that are linedraw characters or other symbols on one or more of the code-pages that I have examined are left out.)

Note that words containing high-ascii characters will not be correctly sorted if only the /a switch is used. Specifically, all words beginning with a high-ascii character are grouped after the last regular letter of the alphabet, "z".

The /s switch corrects this problem by stripping the accents prior to sorting. (The /s switch automatically turns on the /a switch.) Note that to the computer there is no connection whatever between a given low-ascii letter and its high-ascii accented version; with a different code-page enabled, the high-ascii character may not even be a letter.
Therefore, conversion is performed according to a "strip list" consisting of pairs of related characters: a high-ascii accented character followed by the regular letter of the alphabet to which it should be converted. Here is the default striplist:

  ÇcüuéeâaäaàaåaçcêeëeèeïiîiìiÄaÅaÉeæeÆeôoöoòoûuùuÿyÖoÜuáaíióoúuñnÑn

In other words, "Ç" will convert to "c", "ü" will convert to "u", and so on.

You can provide your own striplist if you wish, as outlined below. There are two reasons for doing this: one would be because the default list does not include characters used in your language. Another would be because your language uses only a few such characters, and you will get much better performance with a shorter striplist.

To make your own striplist, follow the example above, placing the entire list on the first line of a text file. The Indexer considers the list to end at the first space or line-break if either of these occurs before the end of the file. Please note that while the first character in each pair can be whatever you like, the second must be a LOWERCASE letter of the alphabet in order for the conversion to have the desired effect. Do NOT make a list like "ÇCüUéEâAäAàA", because it simply won't work.

To tell the Indexer to use your striplist instead of the default, add the name of the file to the /s switch on the command line, without any intervening spaces, e.g. "ohindex /sstripper.txt test" to index the TEST project while using STRIPPER.TXT for character conversion.

When accent-stripping is enabled, the final IDX file contains a copy of the striplist used. The same list is then applied to any input typed in by the user, so that "cañon" will come out as "canon" and be correctly located in the index.

NOTE: certain languages may use letters that the Indexer excludes even with the /a switch.
For users registered at the Orpheus Professional level, I will be happy to extend the Indexer's intelligence; all I need is a list of the desired ascii values or a photocopy of the code-page listing from your DOS manual. (This offer applies to languages based on the Roman alphabet; I can't promise anything about Russian for example.)

Code-pages currently supported include:

  437  English
  850  Multilingual (Latin I)
  852  Slavic (Latin II)
  860  Portugal
  863  Canadian-French
  865  Nordic
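
To make the striplist mechanism described in this section more concrete, here is a small Python sketch of how a list of character pairs could be read and applied. It is an illustration only and makes several assumptions: the function names parse_striplist and strip_accents are hypothetical, the short list "ñnéeçc" is an invented example rather than the default list, and Python text strings stand in for the single-byte code-page characters that the Indexer actually works with.

```python
# Sketch only: reading a striplist and applying it to a word.
# Assumptions: Python str characters stand in for code-page bytes;
# the function names and the sample list are invented for illustration.

def parse_striplist(text):
    """Return a dict mapping each accented character to its plain letter."""
    # The list ends at the first space or line-break, as described above.
    for i, ch in enumerate(text):
        if ch in " \r\n":
            text = text[:i]
            break
    if len(text) % 2:
        raise ValueError("striplist must consist of complete pairs")
    table = {}
    for accented, plain in zip(text[0::2], text[1::2]):
        # The second character of each pair must be a lowercase letter.
        if not ("a" <= plain <= "z"):
            raise ValueError(f"{plain!r} is not a lowercase letter")
        table[accented] = plain
    return table


def strip_accents(word, table):
    """Replace every character found in the table; leave the rest alone."""
    return "".join(table.get(ch, ch) for ch in word)


table = parse_striplist("ñnéeçc\n")
print(strip_accents("cañon", table))  # -> canon
print(strip_accents("café", table))   # -> cafe
```

Applying the same table to the word being indexed and to the search term typed by the user is what lets the two meet in the index.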