xcat-core

mirror of https://github.com/xcat2/xcat-core.git synced 2025-08-22 03:00:26 +00:00

Table of Contents

Basics
Variations/Enhancements/Other notes

This design describes a new xCAT function that provides an infrastructure for running “health-check” scripts on cluster nodes.

NOTE: This is just a quick capture of design thoughts which may not be consistent or even possible to implement as written. Everything here is subject to change and redesign.

Basics

Create a directory for sample health-check scripts shipped with xCAT:

/opt/xcat/share/xcat/checkscripts
Have a master script that will run a list of scripts for a noderange:

nodecheck <noderange> <scriptlist> [-V|--verbose]
Flatten the noderange to a comma-delimited list of node names. Will this exceed the command-line length limit on very large clusters? We could pass in the noderange itself (e.g. a group), but that makes it difficult to reduce the list (see below). Maybe handling hierarchy will help (see below).
Call each script in order, passing in the node list and the optional verbose flag. (Print out the fully qualified script name before executing it, to show progress?)
If a script prints failure for a node, that node is removed from the list for the next script.
The nodecheck command will NOT run hierarchy. If a script needs hierarchy, it can run its own xdsh, xdsh -e, ppping, etc., which each handle hierarchy internally. This also allows for non-hierarchy checks (e.g. contacting an application server for status on the full list of nodes).
It is the responsibility of each checkscript to print out any error or informational messages and to return an appropriate exit code.
If scriptname is not fully qualified, search /install/checkscripts:/opt/xcat/share/xcat/checkscripts
checkscript input/output conventions:
- Input to script: comma-delimited list of node names, plus an optional verbose flag: -V | --verbose
- Output from script: 2 lines:

SUCCESS: comma-delimited list of node names (only show this line if verbose?)
FAILED: comma-delimited list of node names
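As an illustration of these conventions, here is a minimal sketch of a checkscript in Python. It is only one possible shape, not part of the design: `check_node` is a hypothetical placeholder for the real test, and the sketch assumes that the SUCCESS line is printed only in verbose mode and that the exit code is nonzero when any node fails (both are still open questions above).

```python
def parse_args(argv):
    """Split the arguments into (list of node names, verbose flag)."""
    verbose = False
    nodes = []
    for arg in argv:
        if arg in ("-V", "--verbose"):
            verbose = True
        else:
            # Comma-delimited list of node names, per the input convention.
            nodes.extend(n for n in arg.split(",") if n)
    return nodes, verbose


def check_node(node):
    """Hypothetical placeholder check: replace with a real health test."""
    return True


def run_check(nodes, verbose, check=check_node):
    """Check every node and return (report lines, exit code)."""
    results = [(n, bool(check(n))) for n in nodes]
    ok = [n for n, passed in results if passed]
    bad = [n for n, passed in results if not passed]
    lines = []
    if verbose:
        # Assumption: the SUCCESS line is shown only in verbose mode.
        lines.append("SUCCESS: " + ",".join(ok))
    lines.append("FAILED: " + ",".join(bad))
    # Assumption: a nonzero exit code means at least one node failed.
    return lines, (1 if bad else 0)


def main(argv):
    """Entry point; a real script would call sys.exit(main(sys.argv[1:]))."""
    nodes, verbose = parse_args(argv)
    lines, rc = run_check(nodes, verbose)
    print("\n".join(lines))
    return rc


# Demo with a stand-in check that fails node n2.
demo_lines, demo_rc = run_check(["n1", "n2", "n3"], True, check=lambda n: n != "n2")
print("\n".join(demo_lines))
# SUCCESS: n1,n3
# FAILED: n2
```

Nothing here depends on xCAT's Perl modules, which is the point of keeping the convention to plain arguments and two lines of output.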
The complexity barrier for writing health-check scripts should be as low as possible. It should be possible to write them in any language (not just Perl), so they shouldn't require using any functions from our Perl modules. To this end, should the node list be space delimited?
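The Basics above could be sketched as a driver loop like the following. This is a rough illustration under assumptions, not the nodecheck implementation: the function names are hypothetical, the search path is tried in the order listed above, and the noderange is assumed to be already flattened to a list of node names.

```python
import os
import subprocess

# Directories searched, in order, when a script name is not fully qualified.
SEARCH_PATH = ["/install/checkscripts", "/opt/xcat/share/xcat/checkscripts"]


def resolve_script(name, search_path=SEARCH_PATH):
    """Return a fully qualified script path, searching search_path if needed."""
    if os.path.isabs(name):
        return name
    for directory in search_path:
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    raise FileNotFoundError(name)


def parse_failed(output):
    """Extract the node names from the 'FAILED:' line of checkscript output."""
    failed = set()
    for line in output.splitlines():
        if line.startswith("FAILED:"):
            rest = line[len("FAILED:"):]
            failed.update(n.strip() for n in rest.split(",") if n.strip())
    return failed


def run_script(path, nodes, verbose):
    """Run one checkscript on the node list and return its stdout."""
    cmd = [path, ",".join(nodes)]
    if verbose:
        cmd.append("-V")
    result = subprocess.run(cmd, stdout=subprocess.PIPE, universal_newlines=True)
    return result.stdout


def nodecheck(nodes, scripts, verbose=False, runner=run_script, resolve=resolve_script):
    """Run scripts in order, dropping failed nodes before the next script."""
    remaining = list(nodes)
    failures = {}  # node name -> name of the script that failed it
    for name in scripts:
        if not remaining:
            break
        path = resolve(name)
        print(path)  # progress output (still an open question in the design)
        failed = parse_failed(runner(path, remaining, verbose))
        for node in failed:
            failures[node] = name
        remaining = [n for n in remaining if n not in failed]
    return remaining, failures
```

The `runner` and `resolve` parameters exist only so the loop can be exercised without real scripts; a real driver would probably also echo each script's own messages rather than just parsing them.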

Variations/Enhancements/Other notes

Create new checkscripts table in xCAT database:

#node,scriptlist,comments,disable
If no scriptlist is provided on the command line, get the list for each node in <noderange>. Group all nodes with an identical scriptlist and pass the corresponding nodelist into each checkscript in that list.
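The grouping step might look something like this sketch. The `rows` input is a hypothetical stand-in for (node, scriptlist) pairs read from the proposed checkscripts table; how the table would actually be queried is not covered here.

```python
def group_by_scriptlist(rows):
    """Group node names by identical scriptlist value.

    rows: iterable of (node, scriptlist) pairs (hypothetical table rows).
    Returns a dict mapping each scriptlist to its nodes, in first-seen order.
    """
    groups = {}
    for node, scriptlist in rows:
        groups.setdefault(scriptlist, []).append(node)
    return groups


rows = [
    ("n1", "is_node_alive,check_IB"),
    ("n2", "is_node_alive"),
    ("n3", "is_node_alive,check_IB"),
]
for scriptlist, nodes in group_by_scriptlist(rows).items():
    print(scriptlist, "->", ",".join(nodes))
# is_node_alive,check_IB -> n1,n3
# is_node_alive -> n2
```

Each resulting group could then be handed to the script runner separately, which is also where running different sets of nodes in parallel would fit.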
Run scripts for different sets of nodes in parallel?
Provide syntax to allow a script to be run either always or only if some regular expression is true (where $? means the return code of the previous script). Maybe something like a scriptlist value of:

<script1>,<expression>:<script2>,<expression>:<script3>,<script4>

e.g.:

is_node_alive,'?=0':check_IB,'?=0':check_gpfs
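To make the proposed syntax concrete, here is a small parsing sketch. It rests on several assumptions that the design does not settle: entries are separated by commas, an optional quoted expression precedes the script name and is separated from it by a colon, and the only expression form handled is `?=<number>`, read as "the previous return code equals that number". The function names are hypothetical.

```python
def parse_scriptlist(value):
    """Parse a scriptlist into (script, expression-or-None) pairs."""
    steps = []
    for entry in value.split(","):
        # Everything before the last ':' is the condition, if any.
        expr, _, script = entry.rpartition(":")
        expr = expr.strip().strip("'\"")
        steps.append((script.strip(), expr or None))
    return steps


def should_run(expr, last_rc):
    """Decide whether a script runs, given the previous return code."""
    if expr is None:
        return True  # no condition: always run
    if expr.startswith("?="):
        # Assumed meaning: run only if the last return code equals N.
        return last_rc == int(expr[2:])
    raise ValueError("unsupported expression: %r" % expr)


steps = parse_scriptlist("is_node_alive,'?=0':check_IB,'?=0':check_gpfs")
print(steps)
# [('is_node_alive', None), ('check_IB', '?=0'), ('check_gpfs', '?=0')]
```

Under this reading, check_IB runs only if is_node_alive returned 0, and check_gpfs only if check_IB returned 0.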
Example scripts:

is_node_alive:
nodestat check; (pSeries) rvitals lcds for any node whose status is not 'sshd'
rsh check (inetd)
xdsh
name resolution
IB_check:

ppping -i ib0,ib1
for any failed nodes: netstat -ni; (SLES) service openibd status, ibv_devices, ibv_devinfo; (AIX) ibstat -v

GPFS_check:
xdsh <noderange> -v -t 10 cksum /gpfs/home/00PRESENT00

from MN if it has access to GPFS, otherwise from one node in GPFS cluster, or one GPFS I/O node:

/usr/lpp/mmfs/bin/mmgetstate -aLs

LL_check:
SLES: /opt/ibmll/LoadL/full/bin/llstatus

AIX: /usr/lpp/LoadL/full/bin/llstatus
llstatus -a

servicenode: all xcatd daemons running, dhcpd, tftpd, atftpd, bootp, dhcp, inetd, syslogd, DNS

managementnode: same as servicenode script?
