Advanced Novell Network Management: NetWare 6
Chapter 8: Design and Set Up an NCS Cluster Configuration
Objectives:
This chapter discusses the business case for an NCS cluster, and how to
create one:
- Identify the Purpose and Advantages of Implementing
an NCS Solution
- Design and Set Up an NCS Cluster Configuration
Concepts:
Identify the Purpose and Advantages of Implementing an NCS Solution
The text presents several arguments for a business solution that will
provide high levels of reliable, dependable service. It begins with some
operational definitions of terms related to availability.
- Resource and Service - A resource is defined here as any service or
data that can be migrated from one server to another in case the first
server fails. A service is defined as a resource available to users from
a server. These definitions are circular, and at odds with other definitions
for the same terms from other lessons.
- Availability - A percentage calculated by dividing the amount of time
a system is accessible by the amount of time it is intended to be accessible.
- Uptime - The length of time a system is accessible.
- Outage - Loss of service.
- Downtime - This is defined as the length of time a service is unavailable.
Two examples of downtime are given: one system may be down on ten occasions
for ten seconds each time; while another may be down only once, but
the duration is ten minutes. The text makes the point that ten outages
that last only ten seconds each (100 seconds in total) would be better
than one outage that lasts ten minutes.
- Reliability - The time a system is expected to run before a failure.
This is related to the next term.
- Mean Time Between Failures (MTBF) - This is an average time
between failures, a statistic usually available for hardware.
- Mean Time To Recovery (MTTR) - The average time it takes to
bring a failed system back to available status.
The text combines MTBF and MTTR into a formula: Availability
equals MTBF divided by the sum of MTBF and MTTR.
In a system that provides an immediate handoff of services to another
device when the first one fails, MTTR can be considered close to zero, which
means that Availability approaches one hundred percent. It is not
one hundred percent, because there will always be some small time difference
between the time a device fails and the time the backup system is available.
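The formula above can be sketched in a few lines of Python. This is only an illustration: the function name and the MTBF and MTTR figures are assumptions, not values from the text.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Return availability as a fraction: MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures: one failure every 1,000 hours on average.
manual_repair = availability(1000.0, 4.0)          # four-hour manual repair
fast_failover = availability(1000.0, 10.0 / 3600)  # ten-second handoff

print(f"manual repair: {manual_repair:.4%}")
print(f"fast failover: {fast_failover:.6%}")
```

With the same MTBF, shrinking MTTR from hours to seconds moves the result much closer to one hundred percent, but it only reaches one hundred percent if MTTR is exactly zero.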
There are several ways of defining levels of high availability.
- 24x7x365 - 24 hours a day, 7 days a week, 365 days a year.
This unintentionally ignores the extra day in leap years. Assume that
it means the system is available then as well.
- 24x7x365 at 100% - This includes the level above, and extends
it to include access to all system resources. The text states that this
is an ideal level of service, and is not actually possible, since problems
cannot always be avoided.
- 6-6, or 6-11 - This is an example of hours of operation. Not
all companies are expected to make resources available every hour of
every day.
- Five 9s guaranteed - This represents a system that is available
99.999% of the time it is intended to be available. If a system
is meant to be available at all times, this would mean that it could
be down for no more than about 5.26 minutes in a year.
- Four 9s - This represents a system that is available 99.99%
of the time it is intended to be available. If a system is meant to
be available at all times, this would mean that it could be down for
no more than about 52.56 minutes in a year.
- Three 9s - This represents a system that is available 99.9%
of the time it is intended to be available. If a system is meant to
be available at all times, this would mean that it could be down for
no more than about 8.76 hours in a year.
- Two 9s - Available 99.0% of the time. As above, this
means that a system meant to be up at all times would be down no more
than 87.6 hours in a year.
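The downtime limits in the list above follow from simple arithmetic. A minimal sketch, assuming a 365-day year and a system meant to be available at all times (the function and constant names are made up for this example):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def max_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime allowed at a given availability level."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


levels = [("Five 9s", 99.999), ("Four 9s", 99.99),
          ("Three 9s", 99.9), ("Two 9s", 99.0)]

for label, percent in levels:
    minutes = max_downtime_minutes(percent)
    print(f"{label}: {minutes:.2f} minutes ({minutes / 60:.2f} hours)")
```

Each additional 9 cuts the allowed downtime by a factor of ten.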
Some categories of reasons for system outages:
- Physical - hardware or infrastructure problems
- Design - Systems may be required to perform in ways that the
designer did not think about.
- Operations - Problems can be caused by user errors or the errors
of your staff.
- Environmental - This includes weather, power supply, and external
vendor problems.
- Reconfiguration - Systems must be taken down to service, upgrade,
or replace equipment.
- Single point of failure - Any component that is unique in a
system can cause an outage when it fails. The power outages in much
of the United States in 2003 proved to many of us that we should reconsider
our power backup plans.
Features that make NCS a good solution:
- Multinode cluster - An NCS cluster can contain up to 32 servers, each
of which can serve as a failover device for the others
- Multiprocessor and multithread support
- Flexibility - NCS can move resources when a server fails, but you
can also move them manually, allowing you to take a server down as needed
- Support for shared storage devices (SAN and RAID)
- Centralized control
- Failover can fan out - When a server fails, resources can be handed
off to multiple other servers to balance loads (Failover is defined
below).
- Email notification - NCS can send email to notify staff of status
or of events
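The fan-out idea from the list above can be pictured with a small sketch. The text does not describe how NCS chooses a target server for each resource, so the "least loaded survivor" rule, the function name, and the node and resource names here are all assumptions made for illustration.

```python
def fan_out(assignments: dict[str, list[str]], failed: str) -> dict[str, list[str]]:
    """Spread a failed node's resources across the surviving nodes.

    Assumed rule (not taken from NCS): each resource goes to the survivor
    that currently holds the fewest resources; ties are broken by node name.
    """
    survivors = {node: list(resources)
                 for node, resources in assignments.items() if node != failed}
    if not survivors:
        raise RuntimeError("no surviving nodes to take over resources")
    for resource in assignments.get(failed, []):
        target = min(survivors, key=lambda node: (len(survivors[node]), node))
        survivors[target].append(resource)
    return survivors


# Hypothetical three-node cluster in which NODE1 fails.
cluster = {"NODE1": ["mail", "web", "files"], "NODE2": ["print"], "NODE3": []}
after_failure = fan_out(cluster, "NODE1")
print(after_failure)
```

The point of the sketch is only that the failed server's load ends up shared among several nodes rather than landing on a single standby.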
Design and Set Up an NCS Cluster Configuration
More terminology follows, regarding clusters:
- Cluster - 2 to 32 servers
- Node - one server in a cluster
- Cluster resource - a service, application, or other resource
that can be moved from one server in a cluster to another server in
that cluster. The term also means an eDirectory object that contains
scripts to migrate a service from one server to another.
- Shared storage device - any form of storage accessible by multiple
servers
- SAN - as defined in the text, a separate network with fast
access to storage devices
- Migration - an odd operational definition, migration is defined
to mean moving a resource from one server to another
- Failover - moving a resource from a failed server to another
server, and making it available there
- Failback - returning a resource to a server that had failed.
This feature is disabled by default. It can be performed manually once
you are sure the failed server is ready for use.
- Fibre Channel - a standard for high-speed data transfer that
supports copper wire as part of its system, but is intended to use fiber
optic cable for best data rates
- Master Node - the first server in the cluster, it is assigned an
IP address for the cluster, and serves as the contact point for the
cluster
- Slave Node - any node in the cluster other than the Master
Node. A slave node can be made the Master Node if the current Master
fails.
- Cluster-enabled Volumes and Pools - NSS volumes are associated
with pools that provide the failover point. This makes it possible to
migrate an entire pool.
- Heartbeat - a signal sent on your LAN by all cluster nodes
to show they are still in service. If the devices do not hear a heartbeat
from a node within a period called the tolerance rate (by default, 8
seconds), they will remove (cast off) the assumed missing node.
- Tic (Transport Independent Checking) - a signal sent over a
SAN, like the heartbeat on the LAN, but causing a change to the epoch
number of the node in that cluster. The epoch number increments
each time a node leaves or is added to the cluster. Each node has its
own epoch number in the cluster.
- Poison Pill - a voluntary abend, performed by a server cast
off by the other nodes in a cluster.
- Split Brain Detector - This seems to refer to the epoch number
information stored in each server's sector on the SBD partition of the
shared storage device. A split brain condition occurs when a node does
not send a heartbeat in the required time, but is still functional,
requiring it to be cast off by the cluster.
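The heartbeat, tolerance, and epoch number terms above can be tied together in a short sketch. This is a hypothetical model of one node's view of the cluster, not NCS code: the class, its methods, and the starting epoch of 0 are assumptions. Only the 8-second default tolerance and the rule that the epoch number increments when a node leaves or joins come from the notes.

```python
TOLERANCE_SECONDS = 8.0  # default tolerance given in the notes


class ClusterView:
    """One node's view of cluster membership (hypothetical model)."""

    def __init__(self, nodes, now=0.0):
        # Record when each node was last heard from.
        self.last_heartbeat = {node: now for node in nodes}
        # Assumption: the epoch starts at 0 and rises by one per membership change.
        self.epoch = 0

    def heartbeat(self, node, now):
        """Note a heartbeat from a node that is still a member."""
        if node in self.last_heartbeat:
            self.last_heartbeat[node] = now

    def join(self, node, now):
        """Add a node; membership changed, so increment the epoch."""
        if node not in self.last_heartbeat:
            self.last_heartbeat[node] = now
            self.epoch += 1

    def cast_off_silent_nodes(self, now, tolerance=TOLERANCE_SECONDS):
        """Remove every node not heard from within the tolerance period."""
        silent = sorted(node for node, seen in self.last_heartbeat.items()
                        if now - seen > tolerance)
        for node in silent:
            del self.last_heartbeat[node]
            self.epoch += 1
        return silent


view = ClusterView(["NODE1", "NODE2", "NODE3"], now=0.0)
view.heartbeat("NODE1", now=5.0)
view.heartbeat("NODE2", now=5.0)
removed = view.cast_off_silent_nodes(now=9.0)  # NODE3 silent for 9 seconds
print(removed, view.epoch)
```

In a real cluster the cast-off server would then take the poison pill; the sketch stops at the membership change.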
Cluster components:
- 2 to 32 NetWare 6 servers, configured for IP
- NCS 1.6 installed on each server (node) in the cluster
- a shared disk system (A note is made here that three NetWare
services do not require shared disk space for clustering: DHCP, LDAP,
and licensing. Each uses the replicated nature of eDirectory.)
- either a Fibre Channel or SCSI connection to the shared
disks
As noted in the terms above, a node that does not send a heartbeat signal
within the tolerance rate period will be cast off the cluster by NCS.
Management guidelines:
- Don't attach clustered servers and non-clustered servers
to the same shared disk devices.
- Don't install NetWare 6 on a server while it is attached to
a shared storage device.
- Don't manage cluster volumes from servers that are not part of the
cluster.
Troubleshooting SCSI Connections
- Use the same kind of SCSI cards
- Use SCSI cards that are multi-initiator enabled (This is because each
server must be able to access the SCSI devices.)
- Use SCSI cables with the same impedance.
- Properly install, configure, and turn on the SCSI devices. This includes
terminating resistors and SCSI IDs.
- Determine whether your SCSI drives require a low-level format before
they can be used with a new controller.