Advanced Novell Network Management: NetWare 6

Chapter 8: Design and Set Up an NCS Cluster Configuration

 

Objectives:

This chapter discusses the business case for an NCS cluster and how to create one:

  1. Identify the Purpose and Advantages of Implementing an NCS Solution
  2. Design and Set Up an NCS Cluster Configuration

Concepts:

Identify the Purpose and Advantages of Implementing an NCS Solution

The text presents several arguments for a business solution that will provide high levels of reliable, dependable service. It begins with some operational definitions of terms related to availability.

  • Resource and Service - A resource is defined here as any service or data that can be migrated from one server to another if the first server fails. A service is defined as a resource available to users from a server. These definitions are circular, and at odds with the definitions of the same terms in other lessons.
  • Availability - A percentage calculated by dividing the amount of time a system is accessible by the amount of time it is intended to be accessible.
  • Uptime - The length of time a system is accessible.
  • Outage - Loss of service.
  • Downtime - The length of time a service is unavailable. Two examples of downtime are given: one system may be down on ten occasions for ten seconds each time, while another may be down only once, but for ten minutes. The text makes the point that ten outages of ten seconds each (100 seconds in total) would be better than one outage that lasted ten minutes.
  • Reliability - The time a system is expected to run before a failure. This is related to the next term.
  • Mean Time Between Failures (MTBF) - This is an average time between failures, a statistic usually available for hardware.
  • Mean Time To Recovery (MTTR) - The average time it takes to bring a failed system back to available status.

The text combines MTBF and MTTR into a formula: Availability = MTBF / (MTBF + MTTR). In a system that hands services off to another device as soon as the first one fails, MTTR can be considered close to zero, which means that Availability approaches one hundred percent. It never reaches one hundred percent, because there will always be some small delay between the time a device fails and the time the backup system is available.
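The formula above can be tried out with a few numbers. The following is a minimal Python sketch; the function name and the MTBF and MTTR figures are made up for illustration and do not come from the text.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Return availability as a fraction: MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must not be negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures: the same hardware, repaired by hand (4 hours)
# versus handed off to another cluster node (about 10 seconds).
standalone = availability(10_000, 4)
clustered = availability(10_000, 10 / 3600)

print(f"standalone: {standalone:.5%}")
print(f"clustered:  {clustered:.5%}")
```

Shrinking MTTR pushes the result toward 100%, but only an MTTR of exactly zero would reach it.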

There are several ways of defining levels of high availability.

  • 24x7x365 - 24 hours a day, 7 days a week, 365 days a year. This unintentionally ignores the extra day in leap years. Assume that it means the system is available then as well.
  • 24x7x365 at 100% - This includes the level above, and extends it to include access to all system resources. The text states that this is an ideal level of service, and is not actually possible, since problems cannot always be avoided.
  • 6-6, or 6-11 - This is an example of hours of operation. Not all companies are expected to make resources available every hour of every day.
  • Five 9s guaranteed - This represents a system that is available 99.999% of the time it is intended to be available. If a system is meant to be available at all times, this would mean that it could be down for no more than about 5.26 minutes in a year.
  • Four 9s - This represents a system that is available 99.99% of the time it is intended to be available. If a system is meant to be available at all times, this would mean that it could be down for no more than about 52.56 minutes in a year.
  • Three 9s - This represents a system that is available 99.9% of the time it is intended to be available. If a system is meant to be available at all times, this would mean that it could be down for no more than about 8.76 hours in a year.
  • Two 9s - Available 99.0% of the time. As above, this means that a system meant to be up at all times would be down no more than 87.6 hours in a year.
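The downtime figures in the list above follow from simple arithmetic, sketched here in Python. The function and constant names are illustrative, and the calculation assumes a 365-day year (ignoring leap years, as the 24x7x365 label does).

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored


def max_downtime_hours(availability_percent: float,
                       scheduled_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of allowed downtime for a given availability percentage."""
    return scheduled_hours * (1 - availability_percent / 100)


levels = [("Five 9s", 99.999), ("Four 9s", 99.99),
          ("Three 9s", 99.9), ("Two 9s", 99.0)]

for label, percent in levels:
    hours = max_downtime_hours(percent)
    print(f"{label}: {hours * 60:.2f} minutes ({hours:.2f} hours) per year")
```

For a service that is only scheduled part of the day, such as the 6-6 example, pass a smaller `scheduled_hours` value (for instance `12 * 365`), and the allowed downtime shrinks in proportion.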

Some categories of reasons for system outages:

  • Physical - hardware or infrastructure problems
  • Design - Systems may be required to perform in ways that the designer did not think about.
  • Operations - Problems can be caused by user errors or the errors of your staff.
  • Environmental - This includes weather, power supply, and external vendor problems.
  • Reconfiguration - Systems must be taken down to service, upgrade, or replace equipment.
  • Single point of failure - Any component that is unique in a system can cause an outage when it fails. The blackout across the northeastern United States in 2003 proved to many of us that we should reconsider our power backup plans.

Features that make NCS a good solution:

  • Multinode cluster - An NCS cluster can contain up to 32 servers, each of which can serve as a failover device for the others
  • Multiprocessor and multithread support
  • Flexibility - NCS can move resources when a server fails, but you can also move them manually, allowing you to take a server down as needed
  • Support for shared storage devices (SAN and RAID)
  • Centralized control
  • Failover can fan out - When a server fails, resources can be handed off to multiple other servers to balance loads (Failover is defined below).
  • Email notification - NCS can send email to notify staff of status or of events
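The fan-out idea in the list above can be illustrated with a small Python sketch. This is not NCS's actual placement logic, which these notes do not describe; it assumes a simple "give each resource to the least-loaded surviving node" rule, and the node and resource names are hypothetical.

```python
def fan_out(assignments: dict[str, list[str]], failed: str) -> dict[str, list[str]]:
    """Spread a failed node's resources across the surviving nodes.

    Assumed rule (not taken from NCS): each resource goes to whichever
    surviving node currently holds the fewest resources.
    """
    survivors = {node: list(resources)
                 for node, resources in assignments.items()
                 if node != failed}
    if not survivors:
        raise RuntimeError("no surviving nodes to take over the resources")
    for resource in assignments.get(failed, []):
        target = min(survivors, key=lambda node: len(survivors[node]))
        survivors[target].append(resource)
    return survivors


# Hypothetical three-node cluster in which NODE1 fails.
cluster = {"NODE1": ["web", "mail", "print"], "NODE2": ["db"], "NODE3": []}
for node, resources in fan_out(cluster, "NODE1").items():
    print(node, resources)
```

The point of the sketch is only that the failed node's load is split among several survivors instead of landing on a single standby server.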

Design and Set Up an NCS Cluster Configuration

More terminology follows, regarding clusters:

  • Cluster - 2 to 32 servers
  • Node - one server in a cluster
  • Cluster resource - a service, application, or other resource that can be moved from one server in a cluster to another server in that cluster. The term also means an eDirectory object that contains scripts to migrate a service from one server to another.
  • Shared storage device - any form of storage accessible by multiple servers
  • SAN - as defined in the text, a separate network with fast access to storage devices
  • Migration - an odd operational definition: migration here means moving a resource from one server to another
  • Failover - moving a resource from a failed server to another server, and making it available there
  • Failback - returning a resource to a server that had failed. This feature is disabled by default. It can be performed manually once you are sure the failed server is ready for use.
  • Fibre Channel - a standard for high-speed data transfer that supports copper wire as part of its system, but is intended to use fiber optic cable for best data rates
  • Master Node - the first server in the cluster. It is assigned an IP address for the cluster, and serves as the contact point for the cluster
  • Slave Node - any node in the cluster other than the Master Node. A slave node can be made the Master Node if the current Master fails.
  • Cluster-enabled Volumes and Pools - NSS volumes are associated with pools that provide the failover point. This makes it possible to migrate an entire pool.
  • Heartbeat - a signal sent on your LAN by all cluster nodes to show they are still in service. If the devices do not hear a heartbeat from a node within a period called the tolerance rate (by default, 8 seconds), they will remove (cast off) the assumed missing node.
  • Tic (Transport Independent Checking) - a signal sent over a SAN, like the heartbeat on the LAN, but causing a change to the epoch number of the node in that cluster. The epoch number increments each time a node leaves or is added to the cluster. Each node has its own epoch number in the cluster.
  • Poison Pill - a voluntary abend, performed by a server cast off by the other nodes in a cluster.
  • Split Brain Detector - This seems to refer to the epoch number information stored in each server's sector on the SBD partition of the shared storage device. A split brain condition occurs when a node does not send a heartbeat in the required time, but is still functional, requiring it to be cast off by the cluster.

Cluster components:

  • 2 to 32 NetWare 6 servers, configured for IP
  • NCS 1.6 installed on each server (node) in the cluster
  • a shared disk system (A note is made here that three NetWare services do not require shared disk space for clustering: DHCP, LDAP, and licensing. Each uses the replicated nature of eDirectory.)
  • either a Fibre Channel or SCSI connection to the shared disks

As noted in the terms above, a node that does not send a heartbeat signal within the tolerance rate period will be cast off the cluster by NCS.
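The cast-off rule can be pictured with a short Python sketch. It is a simplified model, not NCS code: the class and method names are invented, timestamps are passed in by hand so the example is repeatable, and a single cluster-wide epoch counter stands in for the epoch numbers described in the terms above.

```python
TOLERANCE_SECONDS = 8.0  # default tolerance given in the notes


class HeartbeatMonitor:
    """Toy model: track the last heartbeat per node and cast off silent nodes."""

    def __init__(self, nodes, now=0.0, tolerance=TOLERANCE_SECONDS):
        self.tolerance = tolerance
        self.last_seen = {node: now for node in nodes}
        self.epoch = 0  # simplification: one counter for the whole cluster

    def heartbeat(self, node, now):
        """Record a heartbeat from a node that is still a cluster member."""
        if node in self.last_seen:
            self.last_seen[node] = now

    def cast_off_silent_nodes(self, now):
        """Remove every node whose last heartbeat is older than the tolerance."""
        silent = [node for node, seen in self.last_seen.items()
                  if now - seen > self.tolerance]
        for node in silent:
            del self.last_seen[node]
            self.epoch += 1  # membership changed
        return silent


monitor = HeartbeatMonitor(["NODE1", "NODE2", "NODE3"])
monitor.heartbeat("NODE1", now=5.0)
monitor.heartbeat("NODE2", now=7.0)
# NODE3 has been silent since time 0.0, which is more than 8 seconds ago.
print(monitor.cast_off_silent_nodes(now=9.0))
```

In a real cluster the cast-off node would then take its poison pill; the sketch stops at the membership change.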

Management guidelines:

  • Don't attach clustered servers and non-clustered servers to the same shared disk devices.
  • Don't install NetWare 6 on a server while it is attached to a shared storage device.
  • Don't manage cluster volumes from servers that are not part of the cluster.

Troubleshooting SCSI Connections:

  • Use the same kind of SCSI cards
  • Use SCSI cards that are multi-initiator enabled (This is because each server must be able to access the SCSI devices.)
  • Use SCSI cables with the same impedance.
  • Properly install, configure, and turn on the SCSI devices. This includes terminating resistors and SCSI IDs.
  • Determine whether your SCSI drives require a low-level format before they can be used with a new controller.