Advanced Novell Network Management: NetWare 6

Chapter 3: Troubleshoot and Resolve NetWare Server Issues

 

Objectives:

This chapter discusses troubleshooting and resolving server problems. The objectives important to this chapter are:

  1. Identify Server Hardware and Operating System Components
  2. Troubleshoot and Resolve NetWare Server Issues
  3. Troubleshoot and Resolve Critical Server Abends
  4. Troubleshoot and Resolve Server Communication Issues
Concepts:
Identify Server Hardware and Operating System Components

Server problems can be divided into two types: hardware and software problems. To deal with them, start with a basic understanding of the hardware and software you have.

Hardware

Bus types - The types of bus found on a server's motherboard will affect its performance. (As a point of reference, Pentium 1 through 4 processors have a 64 bit data bus.) Four major motherboard bus architectures are listed:

  • ISA - the oldest standard listed, Industry Standard Architecture. Only 16 bits wide, this is a bottleneck in NetWare, which is a 32 bit system. These slots can generally use 16 or 8 bit boards (8 bit boards do not use the slot extension) unless the 8 bit board has a skirt (a support that drops to the system board). Boards can be 4.8 or 4.2 inches tall.
  • MCA - Micro Channel Architecture, a proprietary IBM bus. Short slots are 16 bit, long ones are 32 bit. It is unlikely you will encounter one of these.
  • EISA - Extended Industry Standard Architecture. The bus expects 32 bit boards, but some older ISA 16 and 8 bit boards may be used, if there are no conflicts. There are two rows of contacts on an EISA card. If such a card is plugged into an ISA slot, only the first row of contacts will touch the slot. Conflicts are expected. Boards are 5 inches high.
  • PCI - Peripheral Component Interconnect supports bus mastering (CPU can offload some tasks to the card). Three variations on PCI slots are shown in the text:
    • 32 bit slots, running at 33 MHz. This slot has two segments, a long and a short. The long segment is closest to the accessory cutouts on the computer's case. The slot segment order is long-short.
    • Slots that will accept a 32 bit or 64 bit card, running at 33 MHz. This slot looks like the short slot, but with an extension on it. The extension is farthest from the cutouts, and is longer than the short segment, but shorter than the long segment. Medium? The slot segment order is long-short-medium.
    • 64 bit slots, running at 66 MHz. This slot is the same length as the 32/64 slot, but the segments are in a different order: short-long-medium. The text recommends making sure that you match cards and slots carefully to get the best performance.

Hard drives and disk channels - A disk channel is the data path used to access hard drives. Three types are listed:

  • IDE/ATA - Least expensive example, found on most workstations. Most workstations will have two IDE/ATA channels, each channel supporting up to two devices.
  • SCSI - Small Computer Systems Interface. This channel comes in three types that can support 8 (Narrow), 16 (Wide), or 32 (Very Wide) devices per channel. More commonly found on servers than on workstations. SCSI drives support queueing of up to 256 commands per device.
  • Fibre Channel - Newer standard that supports connecting to storage media as far as 10 kilometers away.

Processors - NetWare 6 supports up to 32 processors on a server. Each must be at least a Pentium II or an AMD K7. To determine if a processor will work in your system, be aware that it must be supported by a PSM, Platform Support Module, included with NetWare 6 or downloaded from the manufacturer. Novell provides a PSM with NetWare 6, MPS14.PSM, that will support processors compliant with Intel Multiprocessor Specification 1.1 and 1.4.

Memory - Many problems with computers, whether workstations or servers, can be resolved by adding more working memory (RAM). A NetWare 6 server requires at least 256 MB of RAM. Although a NetWare 6 server can have as much as 64 GB of RAM, only the first 4 GB can be directly addressed as cache memory. RAM above 4 GB is allocated to virtual memory.

Ideally, a server should be scalable: you should be able to add more processors, memory, and/or disks to a server. Scalability includes the idea of combining several servers into a cluster, to provide redundant access to services.

Troubleshooting may be easier if you have an understanding of the NetWare Operating System. You should be aware from other classes that the file that loads when a server is started is server.exe. This file actually contains several programs. The first to be run is loader.exe, which loads server.nlm, and other nlm files. These programs load in a series of stages called loadstages, numbered 0 through 5. The startup.ncf file is loaded between loadstages 0 and 1. The autoexec.ncf file is loaded between loadstages 4 and 5.

As a server boots up, many nlm files are typically loaded. You can view the list of loaded nlm files at the server console by using the modules command. In this list, you can tell something about the kind of nlm a file is by its color in the list:

  • Cyan - Files shown in cyan are nlm files that are internal to the server.exe file.
  • Red - A version of this file is found inside server.exe, but the version that was loaded is an external file in the c:\nwserver directory.
  • White - Files shown in white are loaded as a result of a command in an ncf file or a command typed at the server console.
  • Purple - Files shown in purple (magenta) are loaded as supportive files by another nlm.

The heart of the NetWare operating system is the kernel. In previous versions of NetWare, there were two kernels, one that supported a single processor, and one that supported multiple processors. There is a single kernel for NetWare 6 which works either way.

Even with a single processor, a NetWare server manages multiple processes simultaneously. The processes are referred to as threads. Threads cannot actually run simultaneously on a single processor, so each thread is given access to processor cycles on a rotating basis. In this manner, the server can be said to be multitasking. Some process threads can be preempted by other threads, but this feature must be included by the programmer, unless the program is written in Java, which supports preemption for all programs.
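
The rotating access to processor cycles described above can be illustrated with a small round-robin simulation. This is only a sketch in Python, not NetWare's actual scheduler. The thread names, the units of work, and the time slice size are all made up for illustration.

```python
from collections import deque

def round_robin(threads, time_slice):
    """Simulate one processor sharing its cycles among several threads.

    threads: dict mapping a (hypothetical) thread name to the units of work it needs.
    time_slice: units of work a thread may do before the next thread gets a turn.
    Returns the order in which threads were given the processor.
    """
    queue = deque(threads.items())
    schedule = []
    while queue:
        name, remaining = queue.popleft()
        schedule.append(name)
        remaining -= time_slice
        if remaining > 0:
            # Not finished yet: go to the back of the line and wait for another turn.
            queue.append((name, remaining))
    return schedule

order = round_robin({"A": 3, "B": 1, "C": 2}, time_slice=1)
print(order)
```

Each thread gets one slice per pass, so a short thread such as B finishes early while longer threads keep cycling until their work is done.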

Troubleshoot and Resolve NetWare Server Issues

If a server is not functioning normally, it is advised to get on the Internet and check for solutions on Novell's web site. Both the Knowledgebase and the Cool Solutions web tools should be used.

If you have a problem, but the server is not actually locked up, try inspecting various console screens with Ctrl-Esc or Alt-Esc. If possible, you can try to access the NetWare internal debugger on the server by pressing Shift-Alt on the right side of the keyboard, while pressing Shift-Esc on the left side. If you get into the debugger, you can exit it by pressing the letter G.

If the server is locked, and you cannot do any of these things, try pressing Ctrl-Alt-Esc. You should see a menu that will allow you to take the server down. If the server obeys this command, it will avoid corrupting data on the NetWare volumes.

Some general advice is offered about hard drive problems. Make sure power cables and data cables are attached correctly, interrupt conflicts are resolved, and devices on a SCSI bus are assigned unique numbers on that bus.

Server memory errors are discussed. Remember that a NetWare server still starts as a DOS machine. Problems can occur if your DOS boot partition contains files with commands in conflict with NetWare. Some versions of DOS will automatically add DOS=HIGH as a command in the config.sys file. It should be removed, along with any reference to memory managers.

Nlm files that do not give back memory when they end are said to cause the server to "leak memory", a problem commonly seen on Windows workstations. Novell recommends unloading and loading nlm files to determine which ones are doing this. Updated versions or patches may resolve the issue.

Sometimes you may need to free some memory on a server right away. Try unloading any nlm files you do not currently need, dismounting volumes not in use, and unloading name space not in use. If the number of available cache buffers is less than 20% of the total cache buffers, you may have no choice but to add RAM to the server.
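
The 20% guideline above is simple arithmetic, sketched here in Python. The function name and the buffer counts are hypothetical. Real figures would come from the server's own statistics.

```python
def cache_buffers_low(available, total, threshold=0.20):
    """Return True when available cache buffers are below the threshold share of the total."""
    if total <= 0:
        raise ValueError("total cache buffers must be positive")
    return available / total < threshold

# Hypothetical readings copied by hand from a server.
print(cache_buffers_low(available=15000, total=100000))  # 15% of total
print(cache_buffers_low(available=45000, total=100000))  # 45% of total
```

A result of True suggests that unloading nlm files may not be enough, and that adding RAM should be considered.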

Troubleshoot and Resolve Critical Server Abends

The word abend is a shortened form of the phrase abnormal end. It refers to a server process that stops running unexpectedly. Server abends can result in loss of data, locked up servers, and loss of services on the network.

Abends come in two basic types: they are detected either by the processor or by the operating system. Processor detected abends could be called hardware abends or processor exceptions. Operating system detected abends could be called software abends. An abend can be detected in many ways:

  • by the processor
    • The processor will generate an interrupt if it reads the event as a device needing attention.
    • The processor will generate an exception if an instruction fails, classing the failure as a fault, a trap, or an abort.
    • A page fault occurs when a program tries to use memory not allocated to it.
    • A General Processor Protection Exception (GPPE) occurs when a program tries to use memory above the physical limit.
    • Non-Maskable Interrupts (NMIs) are usually caused by a bad memory chip or a parity error. Machine Check Exceptions are caused by internal errors in the processor.
    • Invalid Opcodes occur when the processor receives an instruction that is not included in its instruction set. All processors have a built-in instruction set, which is usually complex (CISC) or reduced (RISC).
  • by the operating system through consistency check errors, which detect corrupted operating system files, bad data, bad packets, hardware failures and other failures involving memory.

It is possible to have abend messages saved in a log, to automatically restart a server after an abend, and to control the time to wait until an automatic restart.

Abend log files are created in the DOS partition, but are moved to the SYS:SYSTEM folder when the server reboots. The information to be found in an abend log includes:

  • Name of the server
  • Date and time of the abend
  • The abend message - this should tell you if it was generated by the processor or the operating system
  • The version of NetWare you were running
  • The current process when the abend happened
  • The contents of the stack at the time

Sometimes, Novell recommends that you consider a core dump. This means dumping the contents of the server's RAM to a file. The file will be the same size as the server's RAM, so you will need that much free room in your DOS partition. A feature implied but not previously mentioned in older texts is that the core dump will include the images on each console screen at the time of the dump.
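
Because a full core dump is as large as the server's RAM, it is worth checking free space before starting one. A minimal sketch of that check in Python follows. The function name and the sample sizes are hypothetical.

```python
def core_dump_shortfall(ram_bytes, free_bytes):
    """Return how many more free bytes are needed to hold a full core dump.

    A full core dump is the same size as the server's RAM, so the result is
    0 when the target partition already has at least that much free space.
    """
    return max(0, ram_bytes - free_bytes)

MB = 1024 * 1024
# Hypothetical server: 512 MB of RAM, 400 MB free on the DOS partition.
shortfall = core_dump_shortfall(ram_bytes=512 * MB, free_bytes=400 * MB)
print(shortfall // MB, "MB more free space needed")
```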

Novell servers can be configured for automatic response to abends. There are four potential settings for the Auto Restart After Abend parameter, numbered 0 through 3:

  0. Do nothing, allow the system to remain halted
  1. If the abend is a software abend, an NMI, or a Machine Check Exception: try to recover, bring the server down in the specified time, then restart the server. For other exception abends, leave the server up. (This is the default.)
  2. For all abends, bring the server down in the specified time, then restart the server.
  3. For all abends, bring the server down immediately, then restart the server.

The specified time mentioned above (Auto Restart After Abend Delay Time) can range from 0 to 60 minutes.
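
The four settings can be read as a small decision table. The Python sketch below models that logic only. It assumes the settings are numbered 0 through 3, with 0 meaning "do nothing", and the abend type labels and the five minute delay are made up for illustration.

```python
# Abend categories that trigger a restart under the default setting, per the list above.
RESTARTABLE_BY_DEFAULT = {"software", "nmi", "machine_check"}

def abend_response(setting, abend_type, delay_minutes):
    """Describe the server's response to an abend for a given setting (assumed 0-3)."""
    if setting == 0:
        return "remain halted"
    if setting == 1:
        if abend_type in RESTARTABLE_BY_DEFAULT:
            return f"try to recover, down the server in {delay_minutes} min, restart"
        return "leave the server up"
    if setting == 2:
        return f"down the server in {delay_minutes} min, restart"
    if setting == 3:
        return "down the server immediately, restart"
    raise ValueError("setting must be 0, 1, 2, or 3")

print(abend_response(1, "software", delay_minutes=5))
print(abend_response(1, "page_fault", delay_minutes=5))
```

Only the default setting looks at the type of abend. The other three behave the same way for every abend.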

If the server is set to option 0 above, an abend should take the server to a command line where the administrator can type single letter commands to choose how to resolve the problem. This can be confusing, since the interface will interpret the same letter differently depending on the type of abend that has occurred. Some commands to be aware of:

  • S - several meanings. If the abend was software detected, S means "suspend the running process, update the abend log, and try to bring down the server". If the abend was processor detected, S can mean "suspend the running process and update the abend log"; or it can mean "take the running process to a safe state and update the abend log".
  • R - R means "resume the running process, update the abend log, and try to bring down the server".
  • Y - Y means "send a core dump to disk".
  • X - X means "restart server" if DOS has been unloaded; X means "update abend log and exit to DOS" if DOS has not been unloaded.

In the text, we learn that a core dump is also called a memory image. Some abend conditions will offer the option of creating the core dump file. If your condition does not, Novell offers an option to force a core dump. Note that you can now create a full core dump or a cacheless core dump, one that does not record what was in cache memory. Two ways to create a core dump file:

  1. If the server is still responding after the abend, answer the prompts generated by NetWare, and choose the core dump. (The response is Y.)
  2. Force it by manually activating the NetWare debugger.
    1. Start the debugger by pressing both shift keys, Esc, and the right Alt key.
    2. Once in the debugger, the command to create a core dump is .c
    3. When the file is finished, go back to NetWare by pressing G, or exit to DOS by pressing Q.

As the core dump/memory image is being created, you will be prompted for a path and filename. The default is c:\coredump.img. The file may be stored several different ways.

  • Hard disk method - if dumping to the DOS partition of the same server, the default filename of the image is COREDUMP.IMG. The file can be copied to a NetWare drive after the server comes back up with a utility from Novell: IMGCOPY.NLM.
  • Network drive method - this is the fastest method, but it requires that the server have several DBNET files loaded. DBNET is a suite of programs that allow you to save a core dump to another Novell server or a Windows workstation.

Before sending an image file to Novell for analysis, call their support line and get a support incident number. They will authorize you to send the image file, which should be renamed with the first eight digits of your incident number. Note that Novell charges for this service unless the cause of the abend is determined to be a fault in NetWare. If sending the file by FTP, Novell asks you to compress it in a ZIP file first.

Troubleshoot and Resolve Server Communication Issues

If servers cannot communicate with each other, you will have to find out why and correct the problems. One method of detecting such a problem is to run DSTRACE, and watch for errors coded "-625". These errors indicate that eDirectory communication is not taking place.
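
If trace output has been captured to a text file, picking out the -625 errors can be automated. The Python sketch below assumes only that the error code appears somewhere on a line. The sample lines are made up and do not reproduce the real DSTRACE format.

```python
import re

# Match the error code -625 as a whole token, so that a longer number is not counted.
ERROR_625 = re.compile(r"(?<!\d)-625(?!\d)")

def lines_with_625(log_text):
    """Return the lines of captured trace output that mention error -625."""
    return [line for line in log_text.splitlines() if ERROR_625.search(line)]

# Made-up sample lines; real DSTRACE output is formatted differently.
sample = """sync with server ALPHA succeeded
sync with server BETA failed, error -625
sync with server GAMMA failed, error -6250
"""
print(lines_with_625(sample))
```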

To troubleshoot server-to-server communication problems:

  • Make sure the servers are up.
  • Ping each server to verify its IP address.
  • If your network uses IPX instead of IP, use the IPXPING utility from a server to verify that the others are up. This utility requires you to know the network and node numbers for each server.
  • Ping routers to determine if paths using them are available.
  • Verify time synchronization with DSREPAIR.
  • Make sure that servers have the proper protocols loaded and bound. The bind command can be overlooked, but the server will not use a protocol without it.
  • Check processor utilization on servers with Monitor. This number should vary from moment to moment, and never stay at a high level.
  • Unload DSREPAIR if you are not running it.
  • Check that all devices are using the same frame type. Ethernet_II is usually a good choice.

To troubleshoot workstation-to-server communication problems:

  • Ping the server, check the IP stack of the workstation and the server, and use TRACERT to confirm communication across the network.
  • Check cables.
  • Check available licenses.

To prevent problems, collect a baseline of information about your network before problems occur. Running a utility like LANalyzer on your network will collect data over time. You can refer to this data to determine what is normal and what is not.
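
One way to use a baseline is to summarize the normal readings and flag anything far outside them. The Python sketch below is an assumption about how such a comparison could be done, not a feature of LANalyzer. The utilization figures and the three standard deviation tolerance are hypothetical.

```python
from statistics import mean, stdev

def build_baseline(samples):
    """Summarize readings collected while the network was healthy."""
    return mean(samples), stdev(samples)

def is_abnormal(reading, baseline, tolerance=3.0):
    """Flag a reading more than `tolerance` standard deviations from the baseline mean."""
    avg, spread = baseline
    return abs(reading - avg) > tolerance * spread

# Hypothetical utilization percentages recorded during normal operation.
normal_days = [22, 25, 24, 23, 26, 24, 25, 23]
baseline = build_baseline(normal_days)
print(is_abnormal(24, baseline))
print(is_abnormal(60, baseline))
```

The tolerance is a judgment call. A tighter value flags more readings, and a looser one flags fewer.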

The text recommends finding bottlenecks in your network and upgrading them. The intention of the message is to tell you to find the NICs, hubs, switches, etc. in your network that run at lower data rates than the rest of your network, and upgrade them to the network data rate.

Documentation is the part of the job that seems to be done the least often. When something goes wrong, you will want good documentation about your network. Contribute to the solution by making log entries, mapping the site, and keeping documentation about hardware and software, along with anything else you may suddenly need, in a place where you can find it.

Finally, the chapter recommends being proactive to protect your network. Add RAM when possible. Replace worn hardware before it breaks. Practice proper grounding procedures when handling computer components. A static electricity discharge of 20 to 30 volts is all it takes to harm electronic components, but a human being cannot feel a discharge of less than about 3,000 volts.