Add healthCheck sample script to check nodes' status

git-svn-id: https://svn.code.sf.net/p/xcat/code/xcat-core/trunk@3232 8638fb3e-16cb-4fca-ae20-7b5d299a9bcd
wanghuaz 2009-04-21 11:02:04 +00:00
parent 88078d3217
commit b6935ffafa
2 changed files with 1538 additions and 0 deletions

File diff suppressed because it is too large


@ -0,0 +1,176 @@
# IBM(c) 2008 EPL license http://www.eclipse.org/legal/epl-v10.html
healthCheck.README
This README describes how to use the healthCheck script.
The syntax of the healthCheck command is:
    healthCheck { [-n node_list] [-M] }
                { [-p min_clock_speed] [-i method] [-m min_memory]
                  [-l min_freelp]
                  [-H [--speed speed --ignore interface_list --width width]] }
                [-h]
    -M  Check the status of all the Managed Nodes that are defined on this
        management node (MN).
    -n node_list
        Specifies a comma-separated list of node host names or IP addresses
        to check.
    -p min_clock_speed
        Specifies the minimal processor clock speed in MHz for the processor
        check.
    -i method
        Specifies the method used to check the InfiniBand interface status.
        The supported methods are LL (llstatus) and RSCT (lsrsrc).
    -m min_memory
        Specifies the minimal total memory in MB.
    -l min_freelp
        Specifies the minimal number of free large pages.
    -H  Check the status of the HCAs.
    --speed speed
        Specifies the physical port speed in Gbps. It must be used with the
        -H flag.
    --ignore interface_list
        Specifies a comma-separated list of interface names to exclude from
        the HCA status check, such as ib0,ib1. It must be used with the -H
        flag.
    --width width
        Specifies the physical port width, such as 4X or 12X. It must be used
        with the -H flag.
    -h  Display usage information.
This script checks the system health of both AIX and Linux Managed Nodes on
Power6 platforms. It uses xdsh to access the target nodes and checks the
processor clock speed, IB interfaces, memory, large page configuration and
HCA status. If a target node cannot be reached through xdsh, an error
message is given.
1. Processor clock speed check
This script uses the xdsh command to access the target nodes and runs
"/usr/pmapi/tools/pmcycles -M" on AIX nodes or "cat /proc/cpuinfo" on Linux
nodes to list the actual processor clock speed in MHz. The actual speed is
compared with the minimal value that the user specified on the command line
with the -p flag. If it is lower than the minimal value, a warning message
is given to indicate the unexpectedly low frequency.
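
The comparison described above can be sketched in Python as follows. This is
only an illustration, not the healthCheck implementation. It assumes that
xdsh prefixes each output line with "<node>: " and that /proc/cpuinfo on the
Linux nodes reports a line of the form "clock : 4704.000000MHz"; the function
name slow_processors and the sample node names are hypothetical.

```python
import re

# Assumption: xdsh output lines look like "<node>: clock : 4704.000000MHz".
CLOCK_RE = re.compile(
    r"^(?P<node>[^:\s]+):\s*clock\s*:\s*(?P<mhz>[\d.]+)\s*MHz", re.IGNORECASE
)

def slow_processors(xdsh_lines, min_mhz):
    """Return (node, mhz) pairs whose reported clock speed is below min_mhz."""
    slow = []
    for line in xdsh_lines:
        match = CLOCK_RE.match(line.strip())
        if match and float(match.group("mhz")) < min_mhz:
            slow.append((match.group("node"), float(match.group("mhz"))))
    return slow

# Hypothetical sample data, not captured from a real cluster.
sample = [
    "node01: processor : 0",
    "node01: clock : 4704.000000MHz",
    "node02: clock : 3500.000000MHz",
]
for node, mhz in slow_processors(sample, 4000):
    print(f"Warning: {node}: clock speed {mhz:.0f} MHz is below 4000 MHz")
```

A node with several processors produces one "clock" line per processor, so
the same node may be reported more than once.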
2. IB interface status check by llstatus
In a LoadLeveler cluster environment, all the nodes share the same cluster
information, so the script only needs to xdsh to one of these nodes and run
the LoadLeveler command "/usr/lpp/LoadL/full/bin/llstatus -a" on AIX or
"/opt/ibmll/LoadL/full/bin/llstatus -a" on Linux to list the IB interface
status. If the status is not "READY", a warning message naming the node and
IB port is given. This check requires the "llstatus" command to exist on the
node; if it does not exist, an error message is output.
3. IB interface status check by lsrsrc
This script uses the xdsh command to access the target nodes and runs
"/usr/bin/lsrsrc IBM.NetworkInterface Name OpState" on AIX or Linux nodes to
list the IB interface status for each node. If the "OpState" value is not
"1", a warning message naming the node and IB port is given.
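
A minimal Python sketch of the OpState test is shown below. The layout of the
lsrsrc output used here (one 'Name = "ib0"' line followed by one
'OpState = 1' line per resource, each prefixed by xdsh with "<node>: ") is an
assumption made for illustration, and the function name down_interfaces is
hypothetical.

```python
import re

# Assumption: each attribute arrives as '<node>: Name = "ib0"' or
# '<node>: OpState = 1' after xdsh adds the node prefix.
ATTR_RE = re.compile(
    r'^(?P<node>[^:\s]+):\s*(?P<key>Name|OpState)\s*=\s*"?(?P<val>[^"]*)"?\s*$'
)

def down_interfaces(xdsh_lines, prefix="ib"):
    """Return (node, interface, opstate) for IB interfaces whose OpState is not 1."""
    problems = []
    current = {}  # node -> name of the interface most recently seen
    for line in xdsh_lines:
        match = ATTR_RE.match(line.strip())
        if not match:
            continue
        node, key, val = match.group("node", "key", "val")
        val = val.strip()
        if key == "Name":
            current[node] = val
        elif current.get(node, "").startswith(prefix) and val != "1":
            problems.append((node, current[node], val))
    return problems

# Hypothetical sample data.
sample = [
    "node01: resource 1:",
    'node01: Name = "ib0"',
    "node01: OpState = 1",
    "node01: resource 2:",
    'node01: Name = "ib1"',
    "node01: OpState = 2",
    'node01: Name = "en0"',
    "node01: OpState = 2",
]
for node, iface, state in down_interfaces(sample):
    print(f"Warning: {node}: {iface}: OpState is {state}, expected 1")
```

Interfaces whose names do not start with "ib" are skipped, because this check
only concerns the IB interfaces.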
4. Memory check
This script uses the xdsh command to access the target nodes and runs
"/usr/bin/vmstat" on AIX nodes or "cat /proc/meminfo" on Linux nodes to list
the total memory information. If the total memory is smaller than the
minimal value specified by the user in MB (-m flag), a warning message is
given with the node name and its actual total memory.
5. Free large page check
This script uses the xdsh command to access the target nodes and runs
"/usr/bin/vmstat -l" on AIX nodes or "cat /proc/meminfo" on Linux nodes to
list the free large page information. If the number of free large pages is
smaller than the minimal value specified by the user (-l flag), a warning
message is given with the node name and its actual number of free large
pages.
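
For Linux nodes, the memory check and the free large page check both read
/proc/meminfo, so they can be sketched together. The Python example below is
an illustration only. It assumes that xdsh prefixes each line with
"<node>: ", that total memory is taken from the "MemTotal" field (reported in
kB), and that free large pages are taken from the "HugePages_Free" field; the
function name check_meminfo is hypothetical.

```python
import re

# Assumption: lines look like "<node>: MemTotal:  16777216 kB" and
# "<node>: HugePages_Free:  64" after xdsh adds the node prefix.
MEMINFO_RE = re.compile(
    r"^(?P<node>[^:\s]+):\s*(?P<key>MemTotal|HugePages_Free):\s*(?P<val>\d+)"
)

def check_meminfo(xdsh_lines, min_memory_mb=None, min_free_lp=None):
    """Return warning strings for nodes below the requested minimums."""
    warnings = []
    for line in xdsh_lines:
        match = MEMINFO_RE.match(line.strip())
        if not match:
            continue
        node, key = match.group("node"), match.group("key")
        val = int(match.group("val"))
        if key == "MemTotal" and min_memory_mb is not None:
            total_mb = val // 1024  # MemTotal is reported in kB
            if total_mb < min_memory_mb:
                warnings.append(
                    f"{node}: total memory {total_mb} MB is below {min_memory_mb} MB"
                )
        elif key == "HugePages_Free" and min_free_lp is not None:
            if val < min_free_lp:
                warnings.append(
                    f"{node}: {val} free large pages, fewer than {min_free_lp}"
                )
    return warnings

# Hypothetical sample data.
sample = [
    "node01: MemTotal:       16777216 kB",
    "node01: HugePages_Free:       64",
    "node02: MemTotal:        8388608 kB",
    "node02: HugePages_Free:        0",
]
for warning in check_meminfo(sample, min_memory_mb=16000, min_free_lp=16):
    print("Warning:", warning)
```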
6. Check HCA status
This script uses the xdsh command to access the target nodes.
For AIX nodes, it uses the command
ibstat -v | egrep "IB PORT.*INFO|Port State:|Physical Port"
to get the Logical Port State, Physical Port State, Physical Port Physical
State, Physical Port Speed and Physical Port Width of each HCA port. The
expected values are "Logical Port State: Active", "Physical Port State:
Active", "Physical Port Physical State: Link Up" and "Physical Port Width:
4X". If an actual value differs from the expected one, a warning message is
given.
This is an example of the output of the ibstat command:
c890f11ec01:/ # ibstat -v | egrep "IB PORT.*INFO|Port State:|Physical Port"
IB PORT 1 INFORMATION (iba0)
Logical Port State: Active
Physical Port State: Active
Physical Port Physical State: Link Up
Physical Port Speed: 2.5G
Physical Port Width: 4X
IB PORT 2 INFORMATION (iba0)
Logical Port State: Active
Physical Port State: Active
Physical Port Physical State: Link Up
Physical Port Speed: 2.5G
Physical Port Width: 4X
For Linux nodes, it uses the command
ibv_devinfo -v | egrep "ehca|port:|state:|width:|speed:"
to get the port state, active_width, active_speed and phys_state of each HCA
port. The expected values are "port state: PORT_ACTIVE", "active_width: 4X"
and "phys_state: LINK_UP". If an actual value differs from the expected one,
a warning message is given.
This is an example of the output of the ibv_devinfo command:
c890f11ec05:~ # ibv_devinfo -v | egrep "ehca|port:|state:|width:|speed:"
hca_id: ehca0
port: 1
state: PORT_ACTIVE (4)
active_width: 4X (2)
active_speed: 2.5 Gbps (1)
phys_state: LINK_UP (5)
port: 2
state: PORT_ACTIVE (4)
active_width: 4X (2)
active_speed: 2.5 Gbps (1)
phys_state: LINK_UP (5)
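
The comparison against the expected values can be sketched in Python using
the ibv_devinfo sample above as input. This is a simplified illustration, not
the script's own code; the names EXPECTED and check_hca are hypothetical, and
the "PORT_DOWN" line in the second call is made-up test data.

```python
# Expected values taken from the description above; active_width is left out
# here because it depends on the adapter (4X or 12X).
EXPECTED = {"state": "PORT_ACTIVE", "phys_state": "LINK_UP"}

def check_hca(lines, expected=EXPECTED):
    """Return (hca, port, field, actual) for every field that differs."""
    problems = []
    hca = port = None
    for raw in lines:
        key, sep, value = raw.strip().partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "hca_id":
            hca = value
        elif key == "port":
            port = value
        elif key in expected:
            actual = value.split(" (")[0]  # drop the numeric code, e.g. "(4)"
            if actual != expected[key]:
                problems.append((hca, port, key, actual))
    return problems

doc_sample = [
    "hca_id: ehca0",
    "port: 1",
    "state: PORT_ACTIVE (4)",
    "active_width: 4X (2)",
    "active_speed: 2.5 Gbps (1)",
    "phys_state: LINK_UP (5)",
    "port: 2",
    "state: PORT_ACTIVE (4)",
    "active_width: 4X (2)",
    "active_speed: 2.5 Gbps (1)",
    "phys_state: LINK_UP (5)",
]
print(check_hca(doc_sample))

# Same data with port 2 changed to a made-up failing state.
broken = doc_sample[:7] + ["state: PORT_DOWN (1)"] + doc_sample[8:]
print(check_hca(broken))
```

Passing a dictionary that also contains "active_width" or "active_speed"
extends the same comparison to the values given with --width and --speed.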
The expected value of "Physical Port Speed" on AIX nodes or "active_speed" on
Linux nodes depends on the adapter type: SDR adapters run at 2.5G and DDR
adapters at 5.0G. The user therefore needs to specify the expected speed with
the "--speed" flag, for example:
healthCheck -n AIXNodes -H --speed 2.5
If "--speed" is not specified with the "-H" flag, the healthCheck script lists
the actual "Physical Port Speed" value obtained from the ibstat command for
each HCA port, so that the user can easily use the "grep" command to find the
speed value of interest.
The output format is <node_name>: <interface_name>: Physical Port Speed:
<speed_value>, for example:
c890f11ec01.ppd.pok.ibm.com: ib0: Physical Port Speed: 2.5G
c890f11ec01.ppd.pok.ibm.com: ib1: Physical Port Speed: 2.5G
c890f11ec02.ppd.pok.ibm.com: ib0: Physical Port Speed: 5.0G
c890f11ec02.ppd.pok.ibm.com: ib1: Physical Port Speed: 5.0G
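
As an alternative to grep, the listing can be grouped by value with a few
lines of Python. The sketch below assumes only the four-field,
colon-separated output format shown above; the function name group_by_value
is hypothetical.

```python
from collections import defaultdict

def group_by_value(report_lines):
    """Map each reported value to the (node, interface) pairs that have it."""
    groups = defaultdict(list)
    for line in report_lines:
        parts = [part.strip() for part in line.split(":")]
        if len(parts) != 4:
            continue  # not a "<node>: <interface>: <label>: <value>" line
        node, iface, _label, value = parts
        groups[value].append((node, iface))
    return dict(groups)

report = [
    "c890f11ec01.ppd.pok.ibm.com: ib0: Physical Port Speed: 2.5G",
    "c890f11ec01.ppd.pok.ibm.com: ib1: Physical Port Speed: 2.5G",
    "c890f11ec02.ppd.pok.ibm.com: ib0: Physical Port Speed: 5.0G",
    "c890f11ec02.ppd.pok.ibm.com: ib1: Physical Port Speed: 5.0G",
]
for value, ports in group_by_value(report).items():
    print(value, ports)
```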
The output of ibstat or ibv_devinfo is identified by HCA name and port
number, so the script uses the table below to map each HCA name and port
number to its interface name:
Interface Name    Adapter Name    Port Number
ib0               iba0/ehca0      1
ib1               iba0/ehca0      2
ib2               iba1/ehca1      1
ib3               iba1/ehca1      2
......
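
The table follows a regular pattern, which the Python sketch below captures.
It assumes that the pattern continues beyond the four rows shown (two ports
per adapter, interfaces numbered consecutively); the function name
interface_name is hypothetical.

```python
import re

def interface_name(adapter, port, ports_per_adapter=2):
    """Map an adapter name (ibaN on AIX, ehcaN on Linux) and port to ibM."""
    match = re.fullmatch(r"(?:iba|ehca)(\d+)", adapter)
    if match is None or not 1 <= port <= ports_per_adapter:
        raise ValueError(f"unexpected adapter/port: {adapter!r}, {port!r}")
    index = int(match.group(1)) * ports_per_adapter + (port - 1)
    return f"ib{index}"

for adapter, port in [("iba0", 1), ("ehca0", 2), ("iba1", 1), ("ehca1", 2)]:
    print(adapter, port, "->", interface_name(adapter, port))
```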
"Physical Port Width" on AIX nodes or "active_width" on Linux nodes can be
either 4X or 12X, so the user needs to specify the expected width with the
"--width" flag, for example:
healthCheck -n LinuxNodes -H --width 4X
If "--width" is not specified, the healthCheck script lists the actual
"Physical Port Width" value obtained from the ibstat command for each HCA
port, so that the user can easily use the "grep" command to find the width
value of interest.
The output format is <node_name>: <interface_name>: Physical Port Width:
<width_value>, for example:
c890f11ec01.ppd.pok.ibm.com: ib0: Physical Port Width: 4X
c890f11ec01.ppd.pok.ibm.com: ib1: Physical Port Width: 4X
c890f11ec02.ppd.pok.ibm.com: ib0: Physical Port Width: 4X
For ports that are not used by the target nodes, the user can use the --ignore
flag to exclude them from the HCA status check. If the user does not list
these unused ports with the --ignore flag, the healthCheck script runs all HCA
checks on all interfaces and returns a warning message for each failed one.
The user can pipe the output through grep and wc -l to count the unused ports.