Service and Support

Chapter 7: Troubleshooting the Server and the Network

 

Objectives:
This chapter discusses problems you are likely to encounter with servers and with the network in general. The objectives for this chapter are listed on page 7-1:
  1. Install the Newest Server Software
  2. Resolve Server Abends and Lockups
  3. Create a Memory Image to Resolve Server Abends and Lockups
  4. Troubleshoot Performance Bottlenecks
  5. Use LANalyzer to Diagnose Performance Problems
  6. Explain Disaster Recovery Options
Concepts:

The first thing to learn from this chapter is on page 7-2, where the first paragraph tells you to make sure you have loaded the latest software on the server. This means the latest versions of three kinds of software:

  • Patches (also called Service Packs or Support Packs)
  • Device drivers
  • NLMs

Regarding patches, remember that Novell recommends that you load all patches in a "patch kit". They are there for a reason, and you may not be aware that you need some of them. You may download patches from Novell's web site. Recognizing which patches you need is easier if you understand three things about their names:

  • A patch name will reference the NetWare version it applies to. For example, 312PTD.EXE is a patch for NetWare 3.12.
  • The letters "PT" in a patch name stand for "Passed Test". These are often early patches for a version of NetWare.
  • The letters "SP" stand for "Support Pack", and may be followed by a pack number. A Support Pack will contain multiple updates to a product. NW5SP3A.EXE would be Support Pack 3a for NetWare 5.
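The naming hints above can be sketched as a small Python helper. This is only an illustration: the function name describe_patch and both regular expressions are assumptions built from the two example file names in these notes, not an official Novell naming specification, so real patch names may not all fit.

```python
import re


def describe_patch(filename: str) -> dict:
    """Guess the kind of patch from its file name (hypothetical helper)."""
    # Drop the extension and normalize case: "312ptd.exe" -> "312PTD".
    stem = filename.upper().rsplit(".", 1)[0]
    info = {"file": filename, "kind": "unknown", "version_hint": None, "pack": None}

    # Support Pack: product prefix, "SP", pack number, optional revision letter.
    match = re.fullmatch(r"(?P<product>[A-Z]*\d+)SP(?P<pack>\d+[A-Z]?)", stem)
    if match:
        info.update(kind="Support Pack",
                    version_hint=match["product"],
                    pack=match["pack"].lower())
        return info

    # "Passed Test" patch: version digits, "PT", trailing revision characters.
    match = re.fullmatch(r"(?P<version>\d+)PT(?P<rev>[A-Z0-9]*)", stem)
    if match:
        info.update(kind="Passed Test", version_hint=match["version"])
    return info


print(describe_patch("312PTD.EXE"))
print(describe_patch("NW5SP3A.EXE"))
```

A name that matches neither pattern is reported as "unknown" rather than guessed at.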

The downloadable files for patches are usually compressed.

Patches come in three types:

  • Dynamic - this patch works like any other NLM. It can be loaded and unloaded, to change the server between the "patched" and "unpatched" states.
  • Semi-static - this is an NLM that cannot be unloaded while the server is running. To take the server back to the "unpatched" state, you have to bring it down, and restart without that NLM.
  • Static - this is a program that changes the SERVER.EXE program. You cannot undo this unless you keep a copy of the earlier version of SERVER.EXE. For this reason, Novell cautions you to keep a backup of your original copy of SERVER.EXE.

To install a Service Pack:

  1. Back up your server.
  2. Download the latest service pack file from Novell. You will probably do this on a workstation.
  3. Create a directory on a server volume and copy the service pack file to that directory. (Leave the compressed version on the workstation in case you need to start over.)
  4. Decompress the service pack file by entering the following at a DOS prompt on a workstation logged in to the server:
    drive_letter:\path\filename_of_service_pack

    The file will check itself for integrity and then decompress.
  5. A file called README.TXT should be included in the decompressed package. Check it for instructions about installing the new patch. Instructions in this file will supersede standard instructions.
  6. At the server console, start NWCONFIG.
  7. Select "Product Options".
  8. Select "Install a Product Not Listed".
  9. Press F3.
  10. Enter the full path to the directory where the service pack was decompressed.
  11. Follow the on-screen prompts.
  12. Bring down and restart the server.

The command PATCHES will tell you what patches are currently loaded.

Device Drivers are the second kind of software you may need to update. There are two types:

  • Disk Drivers - the .DSK drivers used for hard drives in NetWare 4 and earlier
  • NWPA drivers - NetWare Peripheral Architecture drivers are .HAM (Host Adapter Module) and .CDM (Custom Device Module) files. In NetWare 5, .DSK drivers are not supported, so you must use .HAM/.CDM pairs.

New versions of NLMs are the third kind of software to update. These are also downloadable as self-extracting compressed files. Execute the file to extract the NLM, copy it to the SYS:SYSTEM directory, and then either restart the server or unload the old version and load the new one.

Server Abends are discussed next. An abend is an abnormal end to a program. The program can be terminated in two ways:

  • by the CPU
    • The CPU will generate an interrupt if it reads the error as a device needing attention
    • The CPU will generate an exception if an instruction fails, classing the failure as a fault, a trap, or an abort.
  • by the operating system through consistency check errors, which detect corrupted operating system files, bad data, bad packets, hardware failures and other failures involving memory.

Page 7-11 has a list of features in NetWare that concern abends. It is possible to have messages saved in a log, to restart automatically after an abend, and to control how long the server waits before an automatic restart.

On page 7-13, there is an example abend message. Note the first five lines of information in it:

  • Date and time of the abend
  • The abend message - this should tell you if it was generated by the processor or the operating system
  • The version of NetWare you were running
  • The current process when the abend happened
  • The contents of the stack at the time

Also on page 7-13, there is a discussion of manually shutting down the server in an abend situation. You may recall that you can page between the active server screens by pressing Alt-Esc, and that you can access a menu of those screens by pressing Ctrl-Esc. If you cannot do either, try pressing Ctrl-Alt-Esc. You should see a menu that will allow you to take the server down. If the server obeys this command, it will avoid corrupting data on the NetWare volumes.

Server Lockups are discussed on page 7-14. The problem you encounter may be a full server lockup or a partial server lockup. The difference is that some processes will still run in a partial lockup, while no processes will run in a full lockup. A lockup could be caused by an infinite loop. Whatever the cause, Novell recommends that you consider a core dump. This means dumping the contents of the server's RAM to a file. The file will be the same size as the server's RAM, so you will need that much free room in your DOS partition. The procedure for this action is included later in this chapter.

Six steps to troubleshooting a server lockup are on page 7-15. They are explained in the next several pages. You should know the details of these steps.

  1. Gather information - all error messages, hardware configuration, disk and LAN drivers, current NLMs and NCF files, recent changes, recent events and known problems.
  2. Understand the problem and identify probable causes - diagnostic questions appear on pages 7-18 and 7-19
  3. Test possible solutions - often, you only have to apply a patch or fix from Novell
  4. Use debugging tools - like MONITOR, LANalyzer, or the core dump
  5. Resolve the problem - applying knowledge and resources
  6. Document the problem - record what you did, to repeat next time, or to help undo what you did

On page 7-27, we learn that a core dump is also called a memory image. Some abend conditions will offer the option of creating the core dump file. If yours does not, Novell offers two ways to force a core dump. That makes three ways to create a core dump file:

  1. If the server is still responding after the abend, answer the prompts generated by NetWare, and choose the core dump.
  2. Force it by manually activating the NetWare debugger.
    1. Start the debugger by pressing both Shift keys, Alt, and Esc at the same time.
    2. Once in the debugger, the command to create a core dump is .c
    3. When the file is finished, go back to NetWare by pressing G or exit to DOS by pressing Q.
  3. Force it by causing the CPU to issue a nonmaskable interrupt (NMI) exception, using an approved method from the PC hardware vendor. The method for doing this will vary.

As the core dump/memory image is being created, it may be stored in one of four ways. They differ in convenience and speed:

  • Floppy disk method - not recommended. Example: if you have 128 MB of RAM, you would need 89 floppies (1.44 MB each) to hold the file. The sheer number of disks involved also introduces the likelihood of encountering a bad disk.
  • Hard disk method - if dumping to the DOS partition of the same server, the default filename of the image is COREDUMP.IMG. The file can be copied to a NetWare drive after the server comes back up with a utility from Novell: IMGCOPY.NLM.
  • Network drive method - this is the fastest method, but it requires that the server have a second NIC, that client software be loaded in memory on the server before the abend happens, that the client have a drive mapping to a drive on another server, and that the NETALIVE.NLM program be loaded on the server to keep the client running after the abend happens.
    This is a lot of trouble to go to if you don't expect abends to happen.
  • Parallel port method - this method allows you to use a portable hard drive that connects to a parallel port. A DOS based device driver for this drive must be loaded in the server's CONFIG.SYS file. (For those of you who do not remember DOS, CONFIG.SYS is a file that is read automatically when a DOS machine boots. Its main purpose is to load device drivers needed in the DOS environment.) In this case, the portable drive would have to be connected to the server when the server boots for the device driver to be effective. You could theoretically connect it to a server, boot it up, then disconnect, and move the drive to another server. Again, a bothersome procedure, but it works if you only have one such drive.
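The sizing arithmetic behind the floppy and hard disk methods can be checked with a short Python sketch. The function names are hypothetical, and the calculation follows the simplification used in these notes: the image is the same size as RAM and each floppy holds a flat 1.44 MB.

```python
import math


def floppies_needed(ram_mb: float, floppy_mb: float = 1.44) -> int:
    """Number of floppies needed to hold a core dump the size of RAM."""
    return math.ceil(ram_mb / floppy_mb)


def dump_fits(ram_mb: float, free_dos_mb: float) -> bool:
    """True if the DOS partition has room for a RAM-sized image file."""
    return free_dos_mb >= ram_mb


print(floppies_needed(128))   # 128 / 1.44 = 88.9, rounded up to 89
print(dump_fits(128, 100))    # False: 100 MB free is not enough for 128 MB of RAM
```

The same check applies to any target: whatever holds the image needs at least as much free space as the server has RAM.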

Before sending an image file to Novell for analysis, call their support line and get an incident number. They will authorize you to send the image file, which should be renamed with the first eight digits of your incident number. Note that Novell charges for this service unless the cause of the abend is determined to be a fault in NetWare.

Performance bottlenecks are discussed next. Novell recommends that they be analyzed with two of their tools:

NetWare Management Portal - allows you to manage servers through a web browser. NetWare Management Portal can perform the following tasks:

  • Check server health status including server processes and resources
  • View the status and memory usage of all loaded modules
  • View information about processor data and all hardware adapters and resources

MONITOR - manages NetWare servers from the server console. MONITOR can perform the following tasks:

  • View server statistics and activity
  • Assess server RAM and processor utilization
  • Set server parameter values
  • Print server parameter settings to a file

Performance bottlenecks fall into four categories:

  • Disk I/O problems - use MONITOR to check whether the dirty cache buffers and current disk requests values are growing. Buy faster disks, or use multiple drives to correct this. NetWare Management Portal can show you the number of I/O operations per second. If this number grows continually, it indicates trouble.
  • Network I/O problems - identify these problems with MONITOR, or use NetWare Management Portal by selecting Health Monitor | Server Health | Packet Receive Buffers. Look for any packet errors, as well as increasing No ECB Available numbers. Users may also have physical problems like cables not being connected. Try getting faster NICs or creating more segments.
  • CPU problems - This is less likely if you have modern equipment. Upgrading the processor and using bus mastering cards are options.
  • Bus I/O problems - The bus may be a problem, even if the CPU is not. Bus mastering cards are a possibility here.
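As a rough illustration of "watching for a number to grow continually", the sketch below checks a list of readings that you have written down by hand over time. It is an assumption-laden example: the function name is hypothetical, the sample values are made up, and nothing here reads data from MONITOR or NetWare Management Portal.

```python
def is_continually_growing(samples, min_samples=3):
    """True if every reading is higher than the one before it.

    With fewer than min_samples readings there is too little data to
    call it a trend, so the function returns False.
    """
    if len(samples) < min_samples:
        return False
    return all(later > earlier for earlier, later in zip(samples, samples[1:]))


# Made-up readings, for example dirty cache buffers noted every few minutes.
print(is_continually_growing([40, 55, 70, 90]))   # steady climb
print(is_continually_growing([40, 55, 50, 90]))   # dipped once, so not continual
```

A stricter or looser rule (for example, allowing one dip) may suit a busy server better; the point is to compare readings over time instead of judging a single value.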

A discussion of Protocol Analysis begins on page 7-39. The discussion centers on the sample version of LANalyzer that came with your textbook. Four purposes for using this or other protocol analyzers are:

  • to monitor network performance
  • to troubleshoot networks
  • to optimize the network
  • to plan the growth of the network

A note on page 7-39 tells you that LANalyzer is to be run on a workstation, and that the NIC in that workstation must be run in promiscuous mode. This is defined as making the card "observe all packets on the wire, not just the ones that are addressed to it". The workstation must be a 386 or better, and the card must be a 16-bit card or better. A 32-bit card is preferred.

LANalyzer can show baseline data for periods up to six months. Novell cautions that you should not try to establish a baseline with less than a month's worth of data. Four trends are available: utilization, throughput, count of packets per second, and errors per second.

Alarm thresholds can be set. The book suggests that you can set them too low, and get alarms even though the network performance is acceptable. Analyzing trends and resetting the thresholds is recommended. Some of Novell's recommendations are listed:

  • Set Packets/second threshold 5% to 10% above normal peak.
  • Set Utilization% threshold at 5% above normal peak.
  • Broadcasts/second should start at 10. Adjust as necessary.
  • Fragments/second should start at 15. Adjust as necessary.
  • CRC errors/second should start at 5. Adjust as necessary.
  • Server Overload/minute should be set at 5. Adjust as necessary.
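The starting values above can be collected into one small Python sketch. The function and key names are hypothetical. One assumption is worth calling out: "5% above normal peak" for utilization is read here as five percentage points above the peak, capped at 100; if the book means a relative 5%, the calculation would differ.

```python
def suggested_thresholds(peak_packets_per_sec, peak_utilization_pct):
    """Starting alarm thresholds based on the recommendations in these notes."""
    return {
        # 5% to 10% above the normal peak, given as a (low, high) range.
        "packets_per_sec_range": (peak_packets_per_sec * 105 / 100,
                                  peak_packets_per_sec * 110 / 100),
        # Assumption: five percentage points above peak, never more than 100.
        "utilization_pct": min(peak_utilization_pct + 5, 100),
        # Fixed starting points; adjust as trends become clear.
        "broadcasts_per_sec": 10,
        "fragments_per_sec": 15,
        "crc_errors_per_sec": 5,
        "server_overloads_per_min": 5,
    }


print(suggested_thresholds(1000, 40))
```

The peak values passed in should come from your own baseline data, not from the example numbers used here.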

When LANalyzer triggers an alarm, it will do so with three indicators:

  • The Network Alarm indicator turns red (Shields up! Red Alert!)
  • A beeping sound happens if the computer has a speaker.
  • A scrolling message appears to describe the problem.

Three actions are recommended to the administrator:

  • Read the message
  • Open the error log. Read it as well.
  • Click the NetWare Expert icon to get advice.

Several types of errors are listed. Two are noted as typical and are described in greater detail:

  • CRC/alignment errors - this indicates a bad packet. Suspect a cable fault or an EMI problem.
  • Fragment errors - this indicates a packet too small to be correct (less than 64 bytes). Suspect collisions on the LAN, some of which are normal. If fragments make up more than 2% to 3% of all packets during high traffic, you should redesign the network. If this happens during low traffic, suspect a bad NIC or transceiver. EXCEPTION: ATM cells are only 53 bytes long. LANalyzer may misreport them as errors on an Ethernet.
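The fragment rule of thumb can be written out as a short Python sketch. The function name, the returned wording, and the default limit of 2% (the low end of the 2% to 3% range) are assumptions made for illustration; the packet counts would be read from the analyzer by hand.

```python
def fragment_advice(fragments, total_packets, high_traffic, limit_pct=2.0):
    """Turn a fragment count into the advice given in these notes."""
    if total_packets <= 0:
        raise ValueError("total_packets must be greater than zero")

    rate_pct = 100 * fragments / total_packets
    if rate_pct <= limit_pct:
        return "normal: some collisions are expected"
    if high_traffic:
        return "high traffic: consider redesigning the network"
    return "low traffic: suspect a bad NIC or transceiver"


print(fragment_advice(10, 1000, high_traffic=True))    # 1% of packets
print(fragment_advice(50, 1000, high_traffic=True))    # 5% under load
print(fragment_advice(50, 1000, high_traffic=False))   # 5% on a quiet network
```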

The symptoms of an overloaded network are described. Two symptoms noted are error messages about "receiving/sending on" the network, and slow response time when launching applications. Four possible causes are listed: too many devices, increased load from running applications, large files being transferred, and increased Internet traffic (like net surfing). You may notice this is happening by increases in the number of fragments and increases in the utilization. Advice about what to do with this problem is offered in the text.

Page 7-60 is about the repair utilities to use with errors in NDS or NetWare Volumes. Five approaches are listed:

  • VREPAIR is for repairing Volumes that are damaged, or have bad files. Dismount a Volume before running VREPAIR on it. A long procedure for running VREPAIR appears on pages 7-60 through 7-62. You should go over it, paying particular attention to the sections marked with triangles.
  • DSREPAIR is for recovering NDS information, like missing objects. Note the use of the -U switch with DSREPAIR, causing it to run, then exit and unload when done.
  • Restore from a backup. This is for the usual reasons you make backups. Remember that the SBACKUP utility that comes with NetWare can verify whether a backup is valid or not. Having a bad one is of no help.
  • Use a utility from someone other than Novell. ODR for NetWare is described, which can do file recovery, volume repairs, and "hand correction" of sectors on a hard disk.
  • Hire a professional in data recovery. This is generally done when a hard drive crashes, or is involved in an actual disaster like a fire or flood. Note that it is very expensive.