Interface
xCAT Files for the health check
- /opt/xcat/bin/xhealthcheck
The xCAT command to initiate the health check action
- /opt/xcat/sbin/dohealthcheck
The script that runs on the target node (xCAT MN/SN or compute node). It parses the check request, calls each check, and generates the return message.
- /install/healthcheck/<health check tools>
A 'health check tool' is an OS executable file that performs a specific set of checks.
- /install/healthcheck/hc.checksum
A file that records the state of the files in /install/healthcheck so that changes to them can be detected
The definition of health check resource
A health check resource is a unit that the end user can select to check. The name of a 'health check resource' is composed of a 'resource group' and a 'sub resource name' joined by the character '@'. For example, a resource group named tcpip, which checks the tcpip configuration, includes several sub resources such as 'dns', 'http', 'ib' ... The resource names are then tcpip@dns, tcpip@http, tcpip@ib.
From the implementation perspective, health check resources are provided by a health check tool, which is an OS executable file. In general, the name of the 'health check resource group' is the name of the 'health check tool'. For example, a file named 'tcpip' is added to /install/healthcheck/ and provides the checks for sub resources such as 'dns', 'http', 'ib'.
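The naming rule above can be sketched in a few lines of Python. This is only an illustration: the helper names `split_resource` and `tool_path` are hypothetical and are not part of xCAT.

```python
import posixpath

TOOL_DIR = "/install/healthcheck"  # directory named in this design


def split_resource(name):
    """Split 'group@sub' into (group, sub); a bare group name gives (group, None)."""
    group, sep, sub = name.partition("@")
    if not group or (sep and not sub):
        raise ValueError(f"malformed health check resource name: {name!r}")
    return group, (sub if sep else None)


def tool_path(name, tool_dir=TOOL_DIR):
    """Return the path of the tool file that provides the resource (hypothetical helper)."""
    group, _ = split_resource(name)
    return posixpath.join(tool_dir, group)
```

For example, `tool_path("tcpip@dns")` gives `/install/healthcheck/tcpip`, which is the file that would be asked whether it can handle the sub resource 'dns'.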
Syntax of xhealthcheck
xhealthcheck -h | -v
xhealthcheck -d
Display all the available check resources which have been installed in /install/healthcheck
xhealthcheck {noderange} g1@r1 g1@r2 g2@r1 g3
If only a group name is specified, all the resources in the group are run.
xhealthcheck {noderange} g1@r1=paramlist g3=paramlist
The paramlist is a string of parameters separated by ','.
xhealthcheck {noderange} -f check_src_file
The check_src_file contains the list of check resources.
The {noderange} is optional: omit it to run the check on the xCAT MN and SN.
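As an illustration of the argument syntax above, the following sketch splits one check argument into its group, sub resource and parameter list. The function name `parse_check_arg` is an assumption made for this example; the design does not name such a function.

```python
def parse_check_arg(arg):
    """Parse 'g1@r1=p1,p2' into (group, resource, params).

    A bare group ('g3' or 'g3=p1') yields resource None, meaning
    "all resources in the group".
    """
    target, _, paramlist = arg.partition("=")
    group, _, resource = target.partition("@")
    params = paramlist.split(",") if paramlist else []
    return group, (resource or None), params


# Example: the arguments of 'xhealthcheck {noderange} g1@r1=a,b g3'
checks = [parse_check_arg(a) for a in ["g1@r1=a,b", "g3"]]
```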
The General Interface for dohealthcheck
The JSON format is used. The same interface is also offered to external UIs.
- Input Interface:
  [
    {   # each check is one element of the array, so the array order controls the running order of the health checks
      group1: {
        globalparam: [p1, p2, ...],
        resource1: {
          param: [p1, p2, ...],
        }
      },
    },
    { check2 ... },
  ]
- Output Interface:
  {
    group1: {
      resource1: {
        errorcode: 1,
        error: "errormessage",
        returncode: "0 - OK, 1 - Warning, 2 - Error ...",
        message: "return message",
      }
    },
  }
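A minimal sketch of producing the input structure and reading the output structure with Python's standard `json` module. The notation above is not strict JSON (unquoted keys, a comment, trailing commas), so the sketch emits the strict equivalent. The function names are hypothetical, and mapping group-level parameters to `globalparam` is an assumption, since the design does not state it explicitly.

```python
import json


def build_request(checks):
    """Build the input array from (group, resource_or_None, params) tuples.

    The list order is preserved, so it controls the running order.
    """
    request = []
    for group, resource, params in checks:
        if resource is None:
            # Assumption: parameters given to a bare group are global parameters.
            body = {"globalparam": list(params)}
        else:
            body = {"globalparam": [], resource: {"param": list(params)}}
        request.append({group: body})
    return json.dumps(request)


def summarize(reply_text):
    """Flatten the output structure into (name, returncode, message) rows."""
    rows = []
    for group, resources in json.loads(reply_text).items():
        for resource, result in resources.items():
            rows.append((f"{group}@{resource}",
                         result.get("returncode"),
                         result.get("message", "")))
    return rows
```

A caller such as xhealthcheck or an external UI could pass the string from `build_request` to dohealthcheck and display the rows returned by `summarize`.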
The interface for each healthcheck tool
- The tool can be any binary or script that can be run directly on an operating system (shell, python, perl, binary).
- The tool must have executable permission.
- The tool must be installed in the /install/healthcheck directory before the xhealthcheck command is run.
- The tool must check the level of the operating system it is running on. If it cannot run on that OS, it returns an error message like the following:
  {errorcode: xx, error: <Cannot run on the operating system.>}
- The interface of the tool:
  - Support the option -d
    Display all the check resources.
  - input
  - output
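The requirements above could be met by a tool shaped roughly like the following sketch. Everything here is illustrative: the tool name `tcpip`, the `check_dns` placeholder, the supported OS list and the concrete error codes are assumptions (the design only writes `xx` for the error code). The `-r` and `-p` options follow the way dohealthcheck is described as calling a tool (`g1 -r r1 -p <param>`).

```python
import argparse
import json
import platform

SUPPORTED_OS = ("Linux", "AIX")  # assumption: which systems this sample tool supports


def check_dns(params):
    # Placeholder check; a real tool would inspect the resolver configuration.
    return {"errorcode": 0, "error": "", "returncode": 0,
            "message": "dns check not implemented in this sketch"}


RESOURCES = {"dns": check_dns}  # sub resources offered by this tool


def run(argv, os_name=None):
    """Handle one invocation and return the text the tool would print."""
    parser = argparse.ArgumentParser(prog="tcpip")
    parser.add_argument("-d", action="store_true", help="display all check resources")
    parser.add_argument("-r", metavar="RESOURCE", help="sub resource to check")
    parser.add_argument("-p", metavar="PARAMLIST", default="", help="comma separated parameters")
    args = parser.parse_args(argv)

    os_name = os_name or platform.system()
    if os_name not in SUPPORTED_OS:
        # Error code 1 is a placeholder for the unspecified 'xx'.
        return json.dumps({"errorcode": 1, "error": "Cannot run on the operating system."})
    if args.d:
        return "\n".join(sorted(RESOURCES))
    if args.r not in RESOURCES:
        return json.dumps({"errorcode": 2, "error": f"Unknown resource: {args.r}"})
    params = args.p.split(",") if args.p else []
    return json.dumps(RESOURCES[args.r](params))
```

An executable script would call `print(run(sys.argv[1:]))` as its entry point; `os_name` exists only so that the OS check can be exercised without changing machines.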
Implementation
Inside of xhealthcheck
The internal logic of xhealthcheck:
1. Check that the 'health check resource' has been installed in /install/healthcheck
   1.1 If the 'health check resource' name is 'tcpip@dns', check the existence of '/install/healthcheck/tcpip' first. Then call '/install/healthcheck/tcpip -r' to see whether it can handle 'tcpip@dns'
2. Check that /install/healthcheck/hcsrc.tar.gz is up to date; otherwise create it. Recreate it whenever a 'health check tool' has changed or a new file has been added:
   2.1 Create it by running tar on all the files in /install/healthcheck except hc.checksum
   2.2 Regenerate /install/healthcheck/hc.checksum
       Run 'ls -l /install/healthcheck', sort the output into a string, and put it in /install/healthcheck/hc.checksum.
   2.3 How to test whether a file has changed in /install/healthcheck/?
       If the newly generated 'check sum string' is not the same as the content of /install/healthcheck/hc.checksum, a file has changed.
3. xdcp /install/healthcheck/hcsrc.tar.gz to /tmp/xcat/ on all the target nodes. (This should be done by rsync so that the file does not have to be transmitted on every run of xhealthcheck)
4. Run xdsh {noderange} -e /opt/xcat/sbin/dohealthcheck <paramlist>
   The <paramlist> is a JSON formatted string; refer to the input JSON format. The output must also follow the output JSON format.
5. Parse the JSON output and display the result.
Note: Consider the run of healthcheck from a Web GUI:
Note: steps 2, 3 and 4 are not necessary if you run the check on the xCAT MN/SN (that is, when {noderange} is omitted from the xhealthcheck call)
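The change detection of steps 2.2 and 2.3 can be sketched as follows. Instead of parsing the output of 'ls -l', this sketch builds an equivalent sorted listing from `os.scandir` (name, size and modification time); that substitution, the function names, and the decision to exclude hcsrc.tar.gz as well as hc.checksum from the listing are assumptions made for this example.

```python
import os

EXCLUDED = {"hc.checksum", "hcsrc.tar.gz"}  # assumption: generated files are not tracked


def listing_string(directory):
    """Return a sorted, ls-like description of the tool files in the directory."""
    entries = []
    for entry in os.scandir(directory):
        if entry.name in EXCLUDED or not entry.is_file():
            continue
        st = entry.stat()
        entries.append(f"{entry.name} {st.st_size} {st.st_mtime_ns}")
    return "\n".join(sorted(entries))


def tools_changed(directory):
    """Return True (and refresh hc.checksum) if the listing differs from the stored one."""
    checksum_file = os.path.join(directory, "hc.checksum")
    current = listing_string(directory)
    try:
        with open(checksum_file) as f:
            stored = f.read()
    except FileNotFoundError:
        stored = None  # no checksum yet: treat as changed
    if stored == current:
        return False
    with open(checksum_file, "w") as f:
        f.write(current)
    return True
```

When `tools_changed("/install/healthcheck")` returns True, hcsrc.tar.gz would be recreated before it is distributed to the target nodes.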
Inside of dohealthcheck
1. Find /tmp/xcat/hcsrc.tar.gz
2. Extract the files from /tmp/xcat/hcsrc.tar.gz to /tmp/xcat/healthcheck/
3. Parse the health check parameters to get a run list
   g1@r1 <param> g1@r2 <param> g2 <param>
4. Run each entry in the run list
   Use the group name, such as 'g1', to get the file name of the 'health check tool', and run it as: g1 -r r1 -p <param>
5. Parse the output of each check resource, generate the JSON output, and send it back to the xCAT MN
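Steps 3 to 5 can be sketched as a small loop. This is a rough illustration, not the dohealthcheck implementation: the function names are hypothetical, each tool is assumed to print one JSON object on standard output, and the key "*" used for a whole-group run and the error code for unparsable output are placeholders.

```python
import json
import posixpath
import subprocess

TOOL_DIR = "/tmp/xcat/healthcheck"  # where the tools are extracted on the target node


def build_command(group, resource, params, tool_dir=TOOL_DIR):
    """Build the 'g1 -r r1 -p <param>' command line for one run list entry."""
    cmd = [posixpath.join(tool_dir, group)]
    if resource:
        cmd += ["-r", resource]
    if params:
        cmd += ["-p", ",".join(params)]
    return cmd


def run_tool(cmd):
    """Run one tool and return its standard output ('' if it cannot be started)."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True).stdout
    except OSError:
        return ""


def run_checks(run_list, runner=run_tool):
    """Run every (group, resource, params) entry in order and return the JSON reply."""
    results = {}
    for group, resource, params in run_list:
        raw = runner(build_command(group, resource, params))
        try:
            result = json.loads(raw)
        except ValueError:
            # Placeholder error for output that is not valid JSON.
            result = {"errorcode": 1, "error": "tool output is not valid JSON"}
        results.setdefault(group, {})[resource or "*"] = result
    return json.dumps(results)
```

The `runner` parameter exists so the loop can be exercised without real tools installed; on a target node the default `run_tool` would be used.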
Check Scenarios
- Local run:
Run on the xCAT MN or SN
- Compute node run:
Run directly on the compute node.
- Cross run:
The target node is n1, but n1 needs to access n2 to finish the check
- Hierarchy run:
Run on the MN, but sub resources need to run on the compute node
More considerations
1. Run resources on one node in parallel