This design describes a new xCAT function that provides an infrastructure for running "health-check" scripts against cluster nodes.
NOTE: This is just a quick capture of design thoughts which may not be consistent or even possible to implement as written. Everything here is subject to change and redesign.
Basics
-
Create a directory for sample health-check scripts shipped with xCAT:
/opt/xcat/share/xcat/checkscripts
-
Have a master script that will run a list of scripts for a noderange:
nodecheck <noderange> <scriptlist> [-V|--verbose]
-
Flatten the noderange to a comma-delimited list of node names. Will this exceed the command-line length limit on very large clusters? We could pass in the noderange itself (e.g. a group), but that would make it difficult to reduce the list (see below). Maybe handling hierarchy will help (see below).
-
Call each script in order, passing in nodelist (print out fully qualified script name before executing to show progress?) and optional verbose flag.
-
If a script prints failure for a node, that node is removed from the list for the next script.
-
The nodecheck command will NOT run hierarchy. If a script needs hierarchy, it can run its own xdsh, xdsh -e, ppping, etc., which each handle hierarchy internally. This also allows for non-hierarchy checks (e.g. contacting an application server for status on the full list of nodes).
-
It is the responsibility of each checkscript to print out any error or informational messages and return an appropriate exit code
-
If scriptname is not fully qualified, search /install/checkscripts:/opt/xcat/share/xcat/checkscripts
-
checkscript input/output conventions:
- Input to script: comma-delimited list of node names
optional verbose flag: -V | --verbose
-
Output from script: 2 lines:
SUCCESS: comma-delimited list of node names (only show this line if verbose?)
FAILED: comma-delimited list of node names
-
The complexity barrier for writing health check scripts should be as low as possible. They should be able to be written in any language (not just perl), so shouldn't require using any functions from our perl modules. To this end, should the node list be space delimited?
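The master-script behaviour described in this section (run each script in order, read its SUCCESS/FAILED lines, drop failed nodes before the next script) could be sketched roughly as follows. This is only an illustration in Python under assumptions, not the planned implementation: the function names `parse_check_output`, `run_script`, and `nodecheck`, the `runner` parameter, and the choice to ignore any output line that does not start with `SUCCESS:` or `FAILED:` are all hypothetical.

```python
import subprocess


def parse_check_output(text):
    """Split a checkscript's output into (succeeded, failed) node lists.

    Lines that do not start with SUCCESS: or FAILED: are treated as
    informational messages and ignored (an assumption of this sketch).
    """
    succeeded, failed = [], []
    for line in text.splitlines():
        label, _, names = line.partition(":")
        nodes = [n.strip() for n in names.split(",") if n.strip()]
        if label.strip() == "SUCCESS":
            succeeded.extend(nodes)
        elif label.strip() == "FAILED":
            failed.extend(nodes)
    return succeeded, failed


def run_script(script, nodes, verbose=False):
    """Run one checkscript with a comma-delimited node list; return its stdout."""
    cmd = [script, ",".join(nodes)]
    if verbose:
        cmd.append("-V")
    return subprocess.run(cmd, capture_output=True, text=True).stdout


def nodecheck(nodes, scripts, verbose=False, runner=run_script):
    """Run scripts in order, removing failed nodes before the next script.

    Returns a dict mapping each script that ran to the nodes it reported
    as FAILED.
    """
    remaining = list(nodes)
    failures = {}
    for script in scripts:
        if not remaining:
            break  # every node has already failed an earlier check
        print(script)  # show progress before executing
        _, failed = parse_check_output(runner(script, remaining, verbose))
        failures[script] = failed
        remaining = [n for n in remaining if n not in failed]
    return failures
```

Because only the FAILED line is used to reduce the list, the sketch works whether or not a script prints its SUCCESS line. The `runner` parameter exists only so the loop can be exercised without real scripts.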
Variations/Enhancements/Other notes
-
Create new checkscripts table in xCAT database:
#node,scriptlist,comments,disable
-
If no scriptlist is provided on the command line, get the list for each node in <noderange>. Group all nodes with an identical scriptlist and pass the corresponding nodelist into each checkscript in that list.
-
Run scripts for different sets of nodes in parallel?
-
Provide syntax to allow a script to be run either always or only if some expression is true (where $? means the last return code value). Maybe something like a scriptlist value of:
<script1>,<expression>:<script2>,<expression>:<script3>,<script4>
e.g.:
is_node_alive,'$?=0':check_IB,'$?=0':check_gpfs
-
Example scripts:
is_node_alive:
-
nodestat check; (pSeries) Rvitals lcds for any not 'sshd'
-
rsh check (inetd)
-
xdsh
-
name resolution
-
IB_check:
ppping -i ib0,ib1
for any failed nodes: netstat -ni, (SLES) service openibd status, ibv_devices, ibv_devinfo; (AIX) ibstat -v
GPFS_check:
-
xdsh <noderange> -v -t 10 cksum /gpfs/home/00PRESENT00
from MN if it has access to GPFS, otherwise from one node in GPFS cluster, or one GPFS I/O node:
/usr/lpp/mmfs/bin/mmgetstate -aLs
LL_check:
-
SLES: /opt/ibmll/LoadL/full/bin/llstatus
AIX: /usr/lpp/LoadL/full/bin/llstatus
-
llstatus -a
servicenode: all xcatd daemons running, dhcpd, tftpd, atftpd, bootp, dhcp, inetd, syslogd, DNS
managementnode: same as servicenode script?
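As an illustration of how small a checkscript following the input/output conventions could be, here is a rough Python sketch of just the name-resolution part of is_node_alive. Several details are assumptions rather than decisions from this design: the function names, printing the SUCCESS line only when verbose, and the exit codes (0 if no node failed, 1 otherwise, 2 for a usage error).

```python
import socket
import sys


def resolves(node):
    """Return True if the node name resolves to an address."""
    try:
        socket.gethostbyname(node)
        return True
    except (OSError, UnicodeError):
        return False


def check_nodes(nodes, probe=resolves):
    """Probe each node once; return (ok, bad) lists in input order."""
    ok, bad = [], []
    for node in nodes:
        (ok if probe(node) else bad).append(node)
    return ok, bad


def report(ok, bad, verbose=False):
    """Format the output lines; SUCCESS is shown only when verbose (assumed)."""
    lines = []
    if verbose:
        lines.append("SUCCESS: " + ",".join(ok))
    lines.append("FAILED: " + ",".join(bad))
    return "\n".join(lines)


def main(argv, probe=resolves):
    """Entry point: argv[1:] holds the node list and an optional -V|--verbose."""
    args = [a for a in argv[1:] if a not in ("-V", "--verbose")]
    verbose = len(args) != len(argv) - 1
    if len(args) != 1:
        print("usage: <script> <node1,node2,...> [-V|--verbose]", file=sys.stderr)
        return 2
    nodes = [n for n in args[0].split(",") if n]
    ok, bad = check_nodes(nodes, probe)
    print(report(ok, bad, verbose))
    return 1 if bad else 0  # assumed exit-code convention
```

To use this as a standalone script, the file would end with `sys.exit(main(sys.argv))`. The same structure should carry over to any other language, since nothing here depends on xCAT's perl modules.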