TROUBLESHOOTING OPERATING SYSTEM ABENDS



DISCLAIMER:  THE ORIGIN OF THIS INFORMATION MAY BE INTERNAL OR
EXTERNAL TO NOVELL.  NOVELL MAKES EVERY EFFORT WITHIN ITS
MEANS TO VERIFY THIS INFORMATION.  HOWEVER, THE INFORMATION
PROVIDED IN THIS DOCUMENT IS FOR YOUR INFORMATION ONLY. 
NOVELL MAKES NO EXPLICIT OR IMPLIED CLAIMS TO THE VALIDITY OF
THIS INFORMATION.


Contents of this Document


What Is A Server Abend
Troubleshooting an Abend - Preliminary Steps
Troubleshooting  An Abend - A Process 
Troubleshooting  An Abend - Data Collection
Troubleshooting An Abend - Isolation & Duplication 
Using Novell Technical Support
Getting a Core Dump
Appendix A - Check list/Summary 
Appendix B - Dealing With An NMI Error
Appendix C - How To Access The NetWare OS Patches And Updated Files
Appendix D - Troubleshooting Tools
Appendix E - Using the Internal Debugger
Appendix F - Faxback Service
Fax Form for Feedback on this Document


What Is A Server Abend


"When an Abend message appears on the server console, either NetWare or the server
CPU has detected a critical error condition (fault) and jumped into the NetWare's fault
handler.  This handler idles NetWare and displays the Abend message on the server
console for immediate action by the server administrator....."  

   An error condition, or fault, that is detected by the CPU is called a "Processor
Exception."  
   An error condition that is detected by NetWare is called a "Software Exception."

"NetWare's fault handler (function) is declared as a public function by the operating
system so it can be used by any operating system module or Novell NLM."   (Abend
Recovery Techniques for NetWare Servers.  June 1995 Application Notes. Page 79.)

"The NetWare 3 and 4 operating systems continually monitor the status of various server
activities to ensure proper operation. If NetWare detects a condition that threatens the
integrity of its internal data (such as an invalid parameter being passed in a function call,
or certain hardware errors), it abruptly halts the active process and displays an "Abend"
message on the screen. ("Abend" is a computer science term signifying an ABnormal
END of program.)

The primary reason for Abends in NetWare is to ensure the stability and integrity of the
internal operating system data. For example, if the operating system detected invalid
pointers to cache buffers and yet continued to run, data would soon become unusable or
corrupted. Thus an Abend is NetWare's way of protecting itself and users against the
unpredictable effects of data corruption."  (Resolving Critical Server Issues.  Feb. 1995
Application Notes. Page 37.)


The term Abend is used in its most generic sense in this document: any Software or
Processor Exception, or any condition in which the server hangs or locks.


Troubleshooting  An Abend - Preliminary Steps


Anytime you experience a server Abend (or almost any other server problem) consider
the following maintenance steps before you do anything else.  Experience has shown that
a high percentage of Abends are corrected by applying current versions of the Operating
System (OS) patches, and by updating LAN and disk drivers.  See the "patch.doc"
document included in TABND2.exe.  Novell Technical Support (NTS), therefore,
strongly recommends that, regardless of the problem you are experiencing, you apply the
current patches and update drivers.  After this is done, consider the other items on the list
that follows.  The list is in no particular order.

NOTE:      Be aware that an NMI Parity error (Abend: Non-Maskable Interrupt) is a
           hardware error.  (See Appendix B - Dealing With An NMI Error.)

1. Load ALL the current patches that are available for your version of NetWare.  The
   patches are written to resolve known issues.  The patch download file will be called
   <OS version>PT<file revision number or letter>.EXE for the "PT" set, and <OS
   version>IT<file revision>.EXE for the "IT" set (e.g., 410pt3.exe and 410it6.exe).
   Load both the "PT" patches and the "IT" patches.  (See "Patch.doc," included in the
   TABND2.EXE file, for a detailed explanation of the patches.)
2. Update drivers.  Each manufacturer of LAN and disk cards must develop its own
   drivers.  The only way to ensure that you have the latest version of these drivers is to
   download them from the respective vendor.  Even new hardware does not usually
   ship with the most current drivers.  Be certain that drivers are the newest available
   from the respective vendor!
3. Look for NLMs that may be outdated.  Remember, every time we update a file it is to
   make it more robust and/or usable.  Commonly updated NetWare NLMs are
   compressed into self-extracting download files.  Get a list of the current revision of
   files and patches from the file "patlst.txt."  This list can be viewed or downloaded
   from Novell's Internet Web site.  (See Appendix C for more on downloading files
   from Novell.)  Also, don't forget to check with 3rd party vendors to update their
   NLMs as well.
4. Virus scan the dos partition on the server to make certain that a virus has not crept
   in.
5. Clean and re-seat the cards and cables (ALWAYS use static precautions.)
6. Double check termination and SCSI ID's,  etc.
7. Check fans for proper operation.
8. Use an anti-static air can to blow dust off of the system board and other cards and
   components.
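
The patch naming convention in step 1 can be checked mechanically.  The following
is a minimal Python sketch, not a Novell tool.  It assumes the OS version part of
the name is all digits and the revision is a short run of letters or digits; the
function names are hypothetical and chosen only for illustration.

```python
import re

# Pattern for the naming convention described in step 1:
# <OS version><PT or IT><revision>.EXE, for example 410pt3.exe.
# Assumption: the OS version is all digits and the revision is a
# short run of letters or digits; real file names may differ.
PATCH_NAME = re.compile(r"^(\d+)(PT|IT)([0-9A-Z]+)\.EXE$", re.IGNORECASE)


def parse_patch_name(filename):
    """Return (os_version, patch_set, revision), or None if no match."""
    match = PATCH_NAME.match(filename.strip())
    if match is None:
        return None
    os_version, patch_set, revision = match.groups()
    return os_version, patch_set.upper(), revision.upper()


def missing_sets(filenames, os_version):
    """List which of the PT and IT sets are absent for one OS version."""
    found = set()
    for name in filenames:
        parsed = parse_patch_name(name)
        if parsed and parsed[0] == os_version:
            found.add(parsed[1])
    return sorted({"PT", "IT"} - found)
```

For example, missing_sets(["410pt3.exe"], "410") reports that the "IT" set has not
been downloaded yet, which is a reminder that both sets should be loaded.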

Troubleshooting  An Abend - A Process 


Figure 1, below, is a flow chart which suggests a logical process that could be used when
troubleshooting a server Abend.  Some Abends have simple solutions, but more often you
will have to pull a little bit of hair before you come to a solution.  The name of the
troubleshooting game is usually called "Experience," and a "Little Bit of Luck."  Don't
forget to draw on other people's experience.  When troubleshooting, don't expect a magic
fix the first thing that you try.  Do expect to have to narrow in on the problem by trying
different things. 






Troubleshooting  An Abend - Data Collection


This step, if done well, can be the most valuable step toward identifying your problem. 
The most common shortcoming when doing data collection is not looking far enough
beyond the immediate symptoms.  Following is a partial list of things to look for and
questions to ask yourself.  Don't feel embarrassed to raise a question that seems
completely unrelated.  Sometimes it IS unrelated, but often times it helps to get a more
complete picture of the problem.  The questions and suggestions that follow should help
you gain a higher level view of the problem.  Sometimes your data collection will reveal
more than one potential problem and you'll  have to perform "Computer Triage."  
However, without a complete picture you could waste time troubleshooting in a
meaningless direction.  Use this list to get you thinking about the kinds of things that
could be happening on your server / network to cause problems.   These are in no
particular order:

1.   Keep a record of the Abend messages and watch for trends.  Abend messages that
     are consistent sometimes indicate that software is at fault, while messages that are
     not consistent may indicate a hardware failure. There is NO hard/fast rule here, you
     want the data to point you in some troubleshooting direction. 
2.   Look through the system error log (sys:system\sys$log.err) for clues that haven't
     surfaced anywhere else.  An error on a  certain node just before the Abend, an error
     on a certain file, a print queue, volume dismounts, etc.
3.   What resources are being used at the time of the Abend?  For example: printing, file,
     tape, com port, memory access, etc.
4.   What time of day does the Abend occur?  Is it consistent?  
5.   Is the room air conditioning off when the machine Abends?  This may indicate a
     heat problem. 
6.   What is the environment like? Dry, dusty, or hot  environments contribute to heat
     problems and static.
7.   Are there power problems either at the power source or at the power supply? 
8.   Is there a certain data base function running such as reindexing? 
9.   Is lan traffic high?  Are disk reads or writes high?
10.  With the Abend still on the screen, break into the debugger and record basic
     information such as the EIP (instruction pointer), running nlm, running process. 
     (See Appendix E - Using the Internal debugger.)
11.  Use conlog.nlm to capture console messages you would otherwise miss.  Note:
     Conlog has to be unloaded in order to close the log file and make it available to read. 
     (See conlog.nlm in the 4.1 Utilities Reference Manual.)
12.  Is there a certain user, segment, application, server process (like a backup), or
     anything else that is common or consistent when the Abend occurs?
13.  Is the client software current?
14.  Ask, "What has changed in my server environment?"   Consider these questions:
          -  Have the number of users increased? 
          -  Is there new software or any software upgrades that have been put on? 
          -  Is someone using software in a way different than it had been used, such as
          database indexing, etc.? 
          -  Is there new or different hardware?  
          -  Have there been changes to the LAN, the  routers, or the cabling? 
          -  Have workstations or the file server been physically moved?  
          -  Are there new printers on the LAN? 
          -  Have there been any power outages?  
          -  Have SET parameters been changed?  
15.  Is there any new or strong electromagnetic field (EMF) near the server or cabling? 
     Large motors, cabling across fluorescent lights, vacuum cleaners, transmitters, etc. 
16.  Has the hardware been handled without Static Protection?
17.  "Set DStrace = on" and then watch the DSTRACE screen for errors, or for an "All
     processed = NO" message.   Be sure to give any DS errors time to go away  before
     you worry too much.  Fifteen minutes to several hours is usually adequate. 
18.  Are any users dropping their connection? 
19.  File corruption?
20.  Drive deactivation?  If partitions are mirrored, the drive can deactivate without
     bringing the server down.  Check the error log. 

21.  Printing problems? 
22.  Is power filtered?  If so has the filtering hardware been tested recently?  IE: Is it still
     functioning?
23.  Monitor.nlm and Install.nlm are valuable NetWare utilities for checking your server's
     health.  Use them to find information such as:
          -  Climbing packet receive buffers. 
          -  No ECB available count that continues to climb.
          -  Low server memory.  Cache buffer percentage should usually be around
          60 - 70%. 
          -  LRU sitting time.
          -  Dirty cache buffers that stay high. 
          -  A high number of LAN errors (more than 10% of the total packets sent or
          received).
          -  High utilization (if it stays high for more than 10 or 20 minutes at a time).
          -  Check Service Processes to see if they have max'd out. 
          -  Partition, volume, and mirroring information.
          -  View and edit NCF files.
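
Item 1 in the list above suggests keeping a record of Abend messages and watching
for trends.  A minimal Python sketch of such a tally follows.  It assumes you have
typed the messages into a list of (date, message) pairs by hand; the function name
and the record layout are hypothetical.  As item 1 says, there is no hard and fast
rule, so the result is only a pointer toward a troubleshooting direction.

```python
from collections import Counter


def summarize_abends(records):
    """Tally Abend messages from (timestamp, message) pairs.

    Returns the tally and the share of Abends accounted for by the
    single most common message (0.0 when there are no records).
    """
    tally = Counter(message.strip() for _, message in records)
    total = sum(tally.values())
    if total == 0:
        return tally, 0.0
    _, top_count = tally.most_common(1)[0]
    return tally, top_count / total
```

A share near 1.0 means the same message keeps coming back, which sometimes points
toward software.  A low share means the messages vary, which may point toward
hardware.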


Config.nlm:   An invaluable tool for data collection is Config.nlm (included in
TABND2.exe).  When run at the server, config.nlm will create a file which includes
information about your server's configuration.  You may notice something here that
raises a  "red-flag" that you hadn't noticed before.  Use it to document your configuration
before you make any changes to the server.   Also, if you place a call to Tech Support,
you will often be asked for this information. 


IMPORTANT NOTE: Sometimes understanding the data you've collected will require you
to find out from other sources if what you are seeing is normal.  For example, it's
common for the server's utilization to stay at 100% for a few seconds or even a few
minutes, or more.  Likewise, the allocation of packet receive buffers or the size of the
directory entry table is dynamic up to the settable maximum.  They are allocated
dynamically, on an as-needed basis.   It is often only through experience that you'll
determine if what you're seeing is normal or if it is indicating a problem.  Watch YOUR
server to determine what is normal in YOUR environment and then tweak SET
parameters or make other changes as needed.  It is also ok to get DSTRACE errors. If DS
is trying to process a request,  and another request is already in process, you can get a DS
error until the first process completes.  It is important to establish what is normal for your
environment so that you can accurately determine when you have a real problem and
when you have simply hit against the limitations of your hardware and/or software. 
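
The Monitor figures in item 23 and the caution above (a brief spell at 100%
utilization is normal) can be combined into a simple review of numbers you have
written down.  The Python sketch below is only an illustration.  The 10% LAN
error figure, the 60% cache figure and the 10 minute figure come from this
document; the 90% level used to mean "high" utilization is an assumption, as are
all of the function names.  Adjust the values to what is normal for YOUR server.

```python
def lan_error_percent(errors, total_packets):
    """LAN errors as a percentage of total packets sent or received."""
    if total_packets <= 0:
        return 0.0
    return 100.0 * errors / total_packets


def longest_high_run(samples, interval_minutes, threshold=90):
    """Approximate longest stretch, in minutes, at or above threshold.

    samples: utilization percentages noted every interval_minutes.
    Each consecutive high sample is counted as one full interval.
    threshold: assumed meaning of "high"; not a Novell figure.
    """
    longest = current = 0
    for value in samples:
        current = current + 1 if value >= threshold else 0
        longest = max(longest, current)
    return longest * interval_minutes


def review(errors, total_packets, cache_percent, samples, interval_minutes):
    """Return a list of observations worth a closer look."""
    notes = []
    if lan_error_percent(errors, total_packets) > 10:
        notes.append("LAN errors exceed 10% of total packets")
    if cache_percent < 60:
        notes.append("cache buffers below the usual 60-70% range")
    if longest_high_run(samples, interval_minutes) > 10:
        notes.append("utilization stayed high for more than 10 minutes")
    return notes
```

A short spike such as [100, 30, 100] sampled once a minute produces no utilization
note, while twelve consecutive high samples do.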


Troubleshooting An Abend - Narrowing (Isolation) & Duplication 


If the problem is not solved by now it's time to roll up your sleeves and troubleshoot. 
Now that the preliminary steps have been covered, and the initial data collected, 
troubleshooting is primarily a matter of going back and forth between "Problem
Isolation" and "Problem Duplication": you try to narrow in on the problem while at the
same time trying to discover a sequence of events that will reproduce it.  In
simplest terms, an Abend is caused either by a hardware failure or by a misbehaving
NLM.  In either case the result is usually corrupted memory.   Remember from the
introduction, a software exception occurs when NetWare fails a consistency check
(performed in memory, on memory), and that a processor exception occurs when the
processor encounters an address or machine instruction that does not comply with the
rules.  Again, corrupted memory. 

Problem Isolation & Problem Duplication are almost the same.  The main difference is
that in Problem Duplication,  you are specifically trying to reproduce a problem. There
may be an nlm, which, every time it loads, the server Abends. Or perhaps if someone
does a copy while someone else is logging in the server Abends.  If you are able to find a
reproducible problem like this  you can now eliminate variables one at a time, try the test
again,  and see if the problem goes away.  In the other case, Problem Isolation
(Narrowing), the data may not have given you a clue so you have to probe around, trying
different things to see if you can narrow the problem down to a system or component. 
You may be able to determine that the problem is isolated to the disk channel because it
only happens when the disk is being accessed.  Or, you may be able to relate the problem
to a certain nlm such as a backup.  Consider these systems when trying to isolate a
problem: 

     - Disk channel,     - Lan channel,
     - System board,     - Com port, 
     - the NetWare OS,   - a 3rd party nlm product, 
     - the cabling,      - a certain type of workstation ( Win95 vs. Win 3.1 vs. dos), 
     - a certain type of shell (netx, vlm, client32, 3rd party client). 

Remember that the objective is to find a sequence of events that will reproduce the
problem, or at least narrow down to a system, an nlm, or a piece of hardware that is
always involved when the Abend occurs.    The following troubleshooting ideas should
help you to "Divide & Conquer".  This list is in no particular order, but it is grouped
somewhat by lan, disk, and general troubleshooting ideas.

Here are some Problem Isolation & Problem Duplication Ideas.

1.   Use "server -ns," or "server -na" to bring the server up without executing the
     startup.ncf, or the autoexec.ncf respectively.  Loading "server -ns" will allow you to
     bring up the server without the volume mounting automatically.  These parameters
     also work for SFT3.
2.   Does the Abend message itself suggest anything?  LAN channel, disk channel,
     memory corruption, system board problem, a certain nlm, printing, a certain piece of
     hardware, a certain lan segment, a workstation, a router, an environmental condition,
     etc. 
3.   Use "server -na" to prevent the autoexec.ncf from running. Then load nlm's
     manually, one at a time. 
4.   Use "server -ndb" to prevent the DS database from loading and thereby eliminate
     directory services.  Note that you won't be able to log in without the database
     loaded. 

5.   What is the age of the hardware?  If nothing has changed in the environment then
     the hardware may have simply failed.  Don't assume that new hardware is always
     good.  
6.   Could hardware have received static shock?  Static is not always destructive, often it
     will cause degenerative damage to your hardware allowing it to continue to work for
     a time. 
7.   Check with the vendor of 3rd party products to see if they are aware of the kind of
     problem you are experiencing.
8.   Temporarily unload any 3rd party NLMs.
9.   Unload virus scan, and server/lan monitoring nlms.
8.   Could there be power problems either at the power source or from the power
     supply?
9.   Check the cooling fan in the power supply, the case, and on the cpu.  Heat will cause
     hardware failure. 
10.  A dry, hot or dusty environment can cause hardware degradation due to static
     electric discharge.  It also increases the chance of NMI errors. 
11.  Avoid interrupts 15, 2, and 9, in that order.  
12.  Try to isolate the problem to a hardware subsystem, i.e.:  LAN Channel, Disk
     Channel, and System Board.  You can swap hardware or try a different interrupt or
     slot. 
13.  Although the Abend message is very generic it can still be used to point you in a
     direction.  Most Abends indicate memory corruption.  Some will be disk related,
     others lan related.  Often an Abend will include a function name, for example:
     "Abend: DeallocateMappedPage was supplied an invalid memory pointer."  The
     words that appear without spaces (DeallocateMappedPage) are the name of a function
     in NetWare's code.  In this Abend a memory pointer was sent into the
     DeallocateMappedPage function.  During a consistency check the function
     determined that the pointer was not a valid address.  When an Abend mentions
     ....interrupted.... take a look at the LAN, simply because the LAN does more
     interrupting than anything else in the server.  As is always the case, this becomes more intuitive
     from experience.  
14.  Clean and re-seat the cards and cables.  Remember static protection.
15.  How long has the server been installed?  If it is a new install (less than a month) you
     may still have configuration and setup issues, or you could have faulty hardware
     ...even if it is new.  
16.  If you have a 16-bit card in a machine with more than 16 MB of RAM, are there load
     parameters for the drivers?
17.  Go back to basics.  Every technician's tool kit should have an NE2000 lan card, and
     a basic no-nonsense disk card to use for testing purposes.   
18.  If the machine has PCI cards try a non-PCI card if possible. 
19.  Server.exe, like any other file, can become corrupt.   A corrupt server.exe can be
     difficult to track down. If you can reduce the environment to not much more than a
     server, a disk and a LAN, and you still have the problem, try a fresh copy of server.exe. 
     The same idea applies to any other nlm on the server.  Remember, the server.exe in
     NetWare v3.x contains the server license number.  Don't copy the wrong server.exe
     file.
20.  Run a bindfix or a dsrepair, or a vrepair.

21.  Load install and view partition or mirroring information.  Install retrieves this
     information dynamically.  There can be partition table corruption, for example, that
     is not surfacing as such. When you try to access partition information in install,
     you'll get a specific error because the table cannot be read. 
22.  Always double and triple check termination, interrupt settings, SCSI ID, drive
     translation (should be off), etc. 
23.  Run vrepair.  If vrepair runs clean (zero errors) and you still suspect disk corruption,
     change the vrepair options.  Option 2 from the main menu will change the vrepair
     options.  Set them to "Write changes immediately...," "Write all changes....," and
     "Purge deleted files."  (This is not exact syntax.)  Don't be alarmed with a LOT of
     errors after these changes.  Note: Because of the nature of what vrepair is doing,
     Novell recommends that you always have a verified good back up before you run
     vrepair.

24.  Try different workstations.
25.  Try workstations on different segments.
26.  Can you isolate the lan completely by attaching a single workstation directly to the
     server?
27.  Category 5 cabling is usually required on faster lan cards.  Running the wrong cable
     can cause problems. 
28.  Heavy I/O will sometimes stress the server enough to force an error.  Try doing a
     copy *.* or an xcopy continuously.  Try it from several workstations.
29.  Use ncopy to copy a file where the source and destination are the same server. 
     Ncopy will not send any packets across the lan in this scenario.  If you still have the
     problem,  you know the lan is not part of the problem.  Any other form of copy 
     (copy, xcopy), where the server is the source or destination, will send the file across
     the wire even if it is just going to be sent back a second time across the wire.  If copy
     has the problem but ncopy does not, then the problem is probably on the lan. 
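
The reasoning in the ncopy item above can be written down as a small decision
function.  This Python sketch is only a restatement of that item.  The function
and argument names are hypothetical, and the result is a pointer for further
isolation, not a diagnosis.

```python
def interpret_copy_tests(ncopy_fails, copy_fails):
    """Interpret the ncopy/copy comparison described in the list above.

    ncopy_fails: the problem appears with ncopy when the source and
                 destination are on the same server (no packets
                 cross the LAN).
    copy_fails:  the problem appears with copy or xcopy to or from
                 the server (the file crosses the wire).
    """
    if ncopy_fails:
        # The problem shows up with no LAN traffic involved.
        return "LAN is not part of the problem"
    if copy_fails:
        # Only the test that crosses the wire fails.
        return "problem is probably on the LAN"
    return "neither test reproduced the problem"
```

If neither test reproduces the problem, heavier I/O from several workstations (see
the item on heavy I/O above) may be worth trying before drawing a conclusion.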




Using Novell Technical Support


If you need further troubleshooting direction, your first step should be to call a Novell
Authorized Service Center (NASC).  These Gold and Platinum dealers are Novell
NetWare trained and willing to help you.  To find the service center closest to you, call
1-800-NET-WARE (638-9273) between 7:30am and Midnight CST, and choose option 1, 2.

If you still need to contact Novell Technical Support, Do The Following Before You
Call.   

1.   Find config.nlm included in TABND2.exe file, and run it at your server. Read the
     associated read-me file for specific instructions. The file that config.nlm creates will
     contain important server information that we can use to help troubleshoot your
     Abend.  

     Note: Update your drivers and load the current OS patches before you run config. 
     We will not want to look at your config information until that has been done. 

2.   Next, fill out the form in Appendix A.  Have this available for reference or to fax to
     us if needed. 

3.   At this point open an incident with Novell Technical Support. 

4.   Consider the possibility that you may need to get a core memory dump from your
     server. A core memory dump takes a "snapshot" of the server's RAM as it looks at
     the time of the Abend.  The core dump is discussed further on in this document. 

     DO NOT automatically take a core dump.  Wait until a Technical Support Engineer
     instructs you to do so.  Also, Do not send us core dumps from servers that do not
     have the patches and current LAN and disk drivers loaded. We cannot spend
     time troubleshooting a problem that has already been resolved by current patches or
     updated software.  Make sure you have the current patches and current LAN and
     disk drivers!





Getting a Core Dump


If nothing to this point has solved the problem it may be necessary to get an image of your
memory (Core Memory Dump) for us to analyze. 

DO NOT automatically take a core dump.  Wait until a Technical Support Engineer
instructs you to do so.  Also, Do not send us core dumps from servers that do not have
the current patches and drivers loaded.  Current patches and drivers fix known
problems.  We cannot spend time looking at a problem we've already solved. Make sure
you have the current patches and drivers!

Dump to Floppy:  This is the simplest method, but it is not practical if you have
             more than 16 MB of RAM.  When the server Abends you are asked if
             you want to ".. copy the diagnostic image to disk."  In NW3.12
             and NW4.x the image will go to the C: drive by default, with an
             option to write it to floppy.

             To copy the image to floppy you need to have enough blank, formatted
             floppies to hold your entire server memory.  
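
             To see why this method is impractical on larger servers, the
             number of diskettes can be estimated with a little arithmetic. 
             The Python sketch below is only an estimate.  It assumes 3.5-inch
             high-density diskettes formatted by DOS (1,457,664 bytes usable
             for files) and an image the same size as server RAM; neither
             figure comes from Novell.

```python
import math

# Assumption: 3.5-inch high-density diskettes formatted by DOS to
# 1.44 MB, which leaves 1,457,664 bytes for files.
FLOPPY_BYTES = 1_457_664


def floppies_needed(ram_megabytes, floppy_bytes=FLOPPY_BYTES):
    """Estimate the diskettes needed to hold an image of all server RAM.

    Assumes the image is the same size as RAM (no compression).
    """
    ram_bytes = ram_megabytes * 1024 * 1024
    return math.ceil(ram_bytes / floppy_bytes)
```

             Under these assumptions a 16 MB server needs 12 diskettes and a
             64 MB server needs 47.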

Dump to Hard Drive:  This method is much faster than dumping the image to
             floppy.  In this case the core dump is copied to the C:\
             partition of the server as a file called coredump.img.  To copy
             the image to the C: drive you must have enough free space to
             hold the entire server memory.  If you have not planned this
             extra space into your C: partition you may be able to add an
             inexpensive IDE drive and use the entire drive as a DOS
             partition.

             After the image is copied to the drive the server can be brought back
             up and users can log in.  You will then need imgcopy.nlm.  This nlm is
             contained in the download file imgcpy.exe.  When imgcopy.nlm is
             loaded it will allow you to copy coredump.img from the c: partition to
             the NetWare partition.  Once on the NetWare partition, the file can be
             ftp'd to us or it can be backed up on tape and the tape can be sent to
             us. Before you do this, make sure that we have the software needed to
             restore the file. 

Forcing a Core Dump:  Occasionally you'll need to force a core dump during a
             certain condition other than after an Abend, for example when the
             server is experiencing high utilization and a core dump is needed
             at the time the utilization is high.  A core dump can be taken at
             any time by breaking into the OS's internal debugger.  NOTE that
             if you break into the debugger while the server is in use,
             workstations will not be able to communicate with the server and
             will probably time out.

             Break into the debugger by holding, simultaneously, <left shift>,
             <right shift>, <alt>, <esc>.  Here are a few commands to be used
             while in the debugger.  
                    .c   Force a core dump.
                    q    Quit to DOS.
                    g    Go back to the point where you came into the debugger.
                    h    Help
                    

Dump to Server:  This is the faster way to write the dump.  It is called the network
                 method.  The problem server must have an additional lan card (must
                 be ethernet)  that gives it a client connection to a second server that
                 will be the destination for the image file. When the problem server
                 Abends the core dump is sent across the lan connection to the
                 destination server.  This is how to setup for this type of coredump:

                 1. Down the problem server and power it off.
                 2. Put in an additional Lan Card (Ethernet Only).
                 3. Bring the server back up to a dos prompt. 
                 4. Set up an nwclient directory, and load the client drivers. 
                 5. Log in to the server that will be the destination server, and
                    map a drive to the place you want the dump to be sent to (for
                    example:  f:\dumpdir). 
                 6. Go to the destination server and write down your connection
                    number  (Get this from monitor). Also, write down the drive
                    mapping (ie: F:\dumpdir).
                 7. Bring up the problem server. 
                 8. You should test the connection by  going into the debugger
                    and initiating a core dump.  Press the following keys all at
                    once (Alt , L-Shift,  R-Shift, Esc).
                 9. At the # prompt type  .c  <return>.
                 10.   Follow the prompts with a "Y" to dump to a hard drive. 
                       When you are prompted for a path, enter, 
                       f:\dumpdir\coredump.img (or the mapping you wrote
                       down if it's different).  Press <return>.  The image should
                       start copying.
                 11.   If this works go on to the next step.  If it doesn't here are
                       some of the problems we've seen: 
                       -  Hardware configuration problem on additional Lan
                       Card.
                       -  Using Netx with some version of lan drivers produces
                       bad packets.
                       -  Connection to second server , routing, network , etc....
                       -  Other normal client to server trouble shooting issues.
                 12.   If you're able to duplicate the Abend, take a core dump. 
                       Most likely, however, you will have to wait until the next
                       time it Abends on its own.  Your default connection time
                       to the destination server will be 15 minutes.  In order to
                       hold the connection longer you have two choices:

                       -  Increase the watchdog time out parameters, or 
                       -  Load netalive.nlm (included in the TABND2.EXE file). 
                       You will load netalive with two parameters,  the name of
                       the problem server, and the name of the destination server 
                       (See the readme with netalv.)
                    
                    You can use monitor.nlm on the destination server to see if
                    your connection is being maintained.  


             You will need an open support incident with Novell Technical Support
             before sending the core dump.   At that time you will be given a
             location to ftp the core dump to, or an address where you can mail a
             tape. 

Appendix A - Check list/Summary 


Incident Number: ________________  Name: __________________________

Phone: ______________________

O/S version  _______________     DS version ________________

Amount of RAM ____________

Make/Model of Machine (indicate if a clone)/Bus Type __________________________

LAN card, driver name, driver date & version  __________, __________, ___________

LAN card, driver name, driver date & version  __________, __________, ___________

HBA (controller), driver name, driver date & version  _______, ________, __________

           List the devices on this HBA: __________________________________

HBA (controller), driver name, driver date & version _______, ________, __________

           List the devices on this HBA: __________________________________

Are your drives mirrored? Y / N  Duplexed?  Y / N     Total volume space? __________

1.   Have you updated the LAN and disk drivers?  Y / N
2.   Have you applied all the appropriate OS patches? Y / N
3.   Have you tried a fresh copy of Server.exe? Y / N
4.   Is your clib.nlm current? Y / N
5.   Have you virus scanned the DOS and NetWare Partitions? Y / N
6.   What other information do you have that may help troubleshoot this problem?
7.   What changes have been made to the server recently? (Increased number of users, new
     software, upgraded software, new or different hardware, LAN or router changes,
     workstations or file server physically moved, power outages, SET parameter changes,
     etc.)
8.   What hardware has been swapped out already? 
9.   Do you have config.txt ready to upload?


Appendix B - Dealing With An NMI Error


As mentioned in the main body of this document, an NMI error is a hardware problem. 
There are three types of interrupts that a processor can handle: a maskable hardware
interrupt (INTR), a non-maskable hardware interrupt (NMI), and a software interrupt
(INT).  The processor has a dedicated line on the system board bus that handles only
non-maskable hardware interrupts.  According to Intel's i486 Microprocessor Hardware
Reference Manual, this NMI line can be asserted as a result of one of three catastrophic
events: 1) an imminent power loss, 2) a bus-transfer parity error, or 3) a memory-data
parity error.  When this NMI line is asserted, the processor generates an NMI error.  This
error is received by the NetWare operating system and then reported to the console
screen.  There are two flavors of NMI errors: "Abend: NMI parity error generated by IO
check" and "Abend: NMI parity error generated by System Board."  If the NMI is
generated by the system board, there is a fairly good chance the problem is with the
system board or its memory, although it can still be elsewhere.  If the NMI is generated
by an IO check, the problem could be anywhere.  Here is a list of hardware-related items
that we have found to cause NMIs.  These ideas should help you as you troubleshoot an
NMI error. 

  1. Faulty RAM.
  2. Faulty system board.
  3. Any I/O card, especially cards with on-board memory.
  4. Low or fluctuating power at the power source.  Remember, a UPS can go bad too. 
  5. Power supply going bad.
  6. Memory extension boards.
  7. System board memory that is mismatched in either speed or brand.
  8. Conflicting interrupts.
  9. Dirty or poorly seated cards, cables, or memory modules.  Try cleaning and
      reseating them.
  10. Incompatibility between hardware pieces. 
  11. Look at the environment and how the equipment is handled.  NMIs can often
      be traced back to static electric discharge.  A sometimes overlooked point is that
      static does not always cause immediate failure; the damage can be degenerative.
      The hard failure may not occur until sometime in the future. 
  12. This is rare, but in two separate cases NTS has seen a hard drive and a printer
      cause NMI errors.  
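
The distinction between the two NMI messages described above can be sketched as a small
lookup.  This is a hypothetical Python helper for sorting Abend messages copied from
the console into the two flavors named in this appendix.  The function name and the hint
text are illustrative only; the hints simply restate the guidance above.

```python
# Hypothetical helper: classify an NMI Abend message into the two flavors
# described in this appendix. The matching is a simple substring check.

HINTS = {
    "system board": "Fairly good chance the problem is the system board or "
                    "its memory, although it can still be elsewhere.",
    "io check": "The problem could be anywhere; work through the hardware "
                "list above.",
    "unknown": "NMI reported, but the source is not one of the two known "
               "messages.",
}


def nmi_source(message):
    """Return 'system board', 'io check', 'unknown', or None if not an NMI."""
    text = message.lower()
    if "nmi" not in text:
        return None  # not an NMI Abend at all
    if "system board" in text:
        return "system board"
    if "io check" in text:
        return "io check"
    return "unknown"


if __name__ == "__main__":
    msg = "Abend: NMI parity error generated by IO check"
    source = nmi_source(msg)
    print(source, "-", HINTS[source])
```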




Appendix C - How To Access The NetWare OS Patches And Updated Files


What file to download? 

For Novell's official statement on OS patches see the document Patch.doc.  

The file "Patlst.txt" is updated regularly to reflect current patches and files that are
available from Novell Technical Support.  This list can be downloaded from the online
services listed below.  You can view the current version of this file from Novell's
Internet site:  http://netwire.novell.com (choose Technical Support / File Updates /
Current minimum OS & NLM updates). 
 

Where to get updated files and patches?  

1)   CompuServe:
          Go Netwire
          Go NWOSFiles    (OS Files)
          Go NWGenFiles   (General Files)
          Go NovFF        (Novell File Finder)
          Go NSD 

2)   Internet:
          Novell Web Page: http://www.novell.com (choose Technical Support / File Updates)
          Patlst.txt: http://netwire.novell.com/FileUpdt/patlst.html
          Files in the NSD area use
          "ftp://ftp.novell.com/pub/netwire/nsd/<filename>"

3)   MS Network Online:  Go Netwire

4)   Space Works:  800-577-2235

5)   Novell BBS:  801-373-6999
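
The NSD ftp pattern given under the Internet entry above can be filled in mechanically. 
This is a minimal Python sketch that builds the address for a given file name.  The
helper name nsd_url is hypothetical, and TABND2.EXE is used only as an illustrative file
name; this document does not state which files are in the NSD area.

```python
# Build an ftp address from the NSD pattern quoted in this appendix:
#   ftp://ftp.novell.com/pub/netwire/nsd/<filename>
# nsd_url is a hypothetical helper name, not a Novell tool.
from urllib.parse import quote

NSD_BASE = "ftp://ftp.novell.com/pub/netwire/nsd/"


def nsd_url(filename):
    """Return the NSD-area ftp address for a bare file name."""
    name = filename.strip()
    if not name or "/" in name or "\\" in name:
        # The pattern takes a file name only, not a path.
        raise ValueError("expected a bare file name, got %r" % filename)
    # quote() escapes characters that are not safe in a URL path segment.
    return NSD_BASE + quote(name)


if __name__ == "__main__":
    # Illustrative file name only.
    print(nsd_url("TABND2.EXE"))
```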




Appendix D - Troubleshooting Tools 


Each of the files and documents included in the TABND2.EXE file is provided as a
troubleshooting aid.  Some are for specific problems and some are for troubleshooting in
general.  Here is a list of the diagnostics and documents found in TABND2.EXE.  See
readme.txt, or the documents associated with each troubleshooting tool, for a more
detailed description.  With the exception of Hdump.nlm, each of these utilities will run on
a 3.x or a 4.x server.

README.TXT     Readme file for TABND2.EXE.
HDUMP.NLM      Used to aid you in taking a core dump on a NW3.11 server.
CONFIG.NLM     Used to document a server's configuration.
FCONSOLE.EXE   Used to down a file server from a workstation. 
IMGCOPY.NLM    Used to transfer a core dump from a DOS partition to the NetWare
               volume.
NETALIVE.NLM   Used to send your core dump to a volume on another NetWare
               server. 
410PBOFF.NLM   Allows you to disable packet burst during troubleshooting. 

HIGHUTIL.CMP   Technical Information Document (TID) 1005736.  Troubleshooting
               high utilization vs. file compression at the NW4.1 server.
HIGHUTIL.SUB   TID1005436.  Troubleshooting high utilization vs. NW4.1 file
               system suballocation.
HIGHUTIL.TRB   TID1005963.  Troubleshooting high utilization issues in general. 
HIGHUTIL.ADD   TID2905856 is an addendum to TID1005963.  Troubleshooting high
               utilization. 
PATCH.DOC      TID1007561.  A statement explaining the use of NW OS patches. 
RCSI.APP       February 1995 Application Note, "Resolving Critical Server Issues." 
RECOVERY.APP   June 1995 Application Note, "Abend Recovery Techniques for
               NetWare 3 and 4 Servers." 
RECOVERY.BMP   Flow Chart graphic that goes with the document, "Abend Recovery
               Technique ..."
TABEND.WP6     This document. 
TABEND.TXT     Text version of this document.
TABENDS.WPG    Graphic of the Troubleshooting Abends flow chart



Appendix E - Using the Internal Debugger


You can break into the NetWare debugger at any time.  NOTE that if you break into the
debugger while the server is in use, workstations will not be able to communicate with
the server and will probably time out.  Break into the debugger by pressing
<left shift>, <right shift>, <alt>, and <esc> simultaneously.  Here are a few commands
to be used while in the debugger.  These commands may be different or unavailable
depending on your version of NetWare.  

   .c  Force a core dump.            .r  Current running process.
   v   View open screens.            ?   Current running NLM.
   h   Help.                         .h  More help.
   q   Quit to DOS.                  g   Go back to the point where you came
                                         into the debugger.
       

Appendix F - Faxback Service 


Novell Technical Support (NTS) maintains a Faxback service with documents on known
issues, top ten current issues, and commonly asked general information.  You can reach the
NTS Faxback service by calling 801-861-5350 or 800-NetWare (638-9273).  Choose
option 2 (for Technical Support), then option 2 (for the Faxback service), then option 1
(to skip the Faxback introduction), then option 1 (if you have the document number) or
option 2 (for a catalog).  Below is a listing of some of the documents available on the
Faxback service as of May 29, 1996.

Document Title                                          Document Number

 What is Drive Deactivation?                            5000513
 Trouble Mounting CDROM's as NetWare Volumes            5000522
 Register Memory in NetWare 3.x and 4.x                 5000572
 4.x Backup/Restore                                     5000743
 Drive Deactivation Troubleshooting Tips                5000893
 Troubleshooting Tips for NetWare Directory Services    30065
 Partition Troubleshooting Guide                        30072
 Patch List and NLM Updates                             30082
 Top Server Issues (February 1996)                      1001198
 Support Tips For Adaptec EISA, VL, PCI HBA'S           5003623
 Intel Pentium Floating Point Flaw on Netware           5004623
 HCSSIT.EXE -- HCSS NLM Update                          5005622
 SFT III 4.1 Features/Changes Document                  02197415
 NetWare v4.1 Multimedia CD Q & A                       10025532
 Replacing Failed Hard Drive in Mirrored Group          10024983
 High Utilization Recommendations                       10059634
 High Utilization and Compression Document              10057362
 High Utilization and Suballocation Document            10054363
 Directory Entries Document                             12020466
 PCI Technology Document                                12024722

File Listings  
 NetWare 3.11 File Listing                              40004
 NetWare 3.12 File Listing                              40012
 NetWare 4.01 File Listing                              40023



Novell Technical Support




To:   TABND Feedback

Fax Number:    1-801-861-5988

From:  



Use this form to influence what goes into future updates of this troubleshooting
document, or send your comments by email to Dshaver@Novell.com.  This is for
feedback only.  Novell will not be able to respond to you personally.  Thanks for your
feedback.  

Some things we would like to know include: 
    1)   Were you able to solve your problem without opening an incident with Technical
         Support? 
    2)   How can we make this document or the included files more useful to you?
    3)   Are there other issues that might lend themselves to this type of document?
    4)   What other comments / suggestions do you have for us? 

















        Number of Pages (including cover sheet):    ____
                                


Novell, Inc. /  MS E31-1  /  122 East 1700 South  /  Provo, UT 84606
Telephone 800-638-9273  /  Telex 37895941  /  Alternate Fax 801-861-5200