Table of Contents
- Hints and Tips for Tuning Large Scale xCAT Clusters
- Introduction
- Hierarchical xCAT Clusters
- System Tuning Attributes
- Tuning ARP (Address Resolution Protocol)
- Tuning AIX Network Attributes
- Tuning AIX ulimits
- Tuning Linux ulimits
- Tuning NFS (Network File System)
- Tuning TCP/IP Buffer Size for RMC
- Tuning for HPC Applications
- Tuning httpd for xCAT node deployments
- Tuning Values for xCAT
Hints and Tips for Tuning Large Scale xCAT Clusters
Introduction
xCAT supports clusters of all sizes. This document is a collection of hints, tips, and special considerations when working with large clusters, especially clusters with more than 128 nodes.
The information in this document should be viewed as example data only. Many of the suggestions are based on anecdotal experiences and may not apply to your particular environment.
Updates and new information are always welcomed, and may provide valuable help and insight to other xCAT users. Please leave comments, suggestions, and additional input in the "Discussion" tab of this wiki page.
Hierarchical xCAT Clusters
In an xCAT cluster, the single point of control is the xCAT management node. However, in order to provide sufficient scaling and performance for large clusters, it may also be necessary to have additional servers to help handle the deployment and management of the cluster nodes. In an xCAT cluster these additional servers are referred to as service nodes. xCAT only supports these two levels of management hierarchy: the xCAT management node as the top level, and xCAT service nodes as the second level.
A service node is managed by the xCAT management node. A compute node can be managed by either the xCAT management node or by a service node. All xCAT commands are initiated from the management node. xCAT will then internally distribute any hierarchical commands to the service nodes as required, and gather and consolidate the results.
When does it become important to start thinking about a hierarchical cluster? There are many factors to consider. Most clusters of 128 compute nodes or less can easily be handled by a single xCAT management node. Larger non-hierarchical clusters may be feasible given enough resources on your xCAT management node and cluster management network.
Factors:
- Operating System: Non-hierarchical Linux clusters can typically be larger than AIX clusters. AIX uses NIM for node deployment, so the largest non-hierarchical AIX cluster will most likely be determined by the number of nodes the NIM master on your xCAT management node can deploy. Each service node in a hierarchical AIX cluster is configured as a NIM master to the compute nodes it manages.
- Type of compute nodes: Will your compute nodes be stateful with their OS installed on a local disk, Linux stateless with the OS image loaded into memory, or Linux statelite or AIX stateless requiring an NFS server to serve the OS image? Each of these options places different burdens on the xCAT management node and service nodes.
- Deployment services: Several services are used during node deployment, depending on the operating system and type of compute nodes: DHCP, TFTP, bootp, HTTP, NFS, and others. Any one of these services can become a bottleneck when trying to deploy too many nodes from a single server.
- Management Network bandwidth: The capability of your network fabric may require you to distribute node deployment and management load across multiple subnets.
For details on setting up a Hierarchical xCAT cluster see:
[Setting_Up_an_AIX_Hierarchical_Cluster]
[Setting_Up_a_Linux_Hierarchical_Cluster]
System Tuning Attributes
Adjusting operating system tunables can improve large scale cluster performance, avoid bottlenecks, and prevent failures. The following sections are a collection of suggestions that have been gathered from various large scale HPC clusters. You should investigate and evaluate the validity of each suggestion before applying them to your cluster.
Tuning ARP (Address Resolution Protocol)
Address Resolution Protocol (ARP) is a network protocol that maps a network layer protocol address to a data link layer hardware address. For example, ARP is used to resolve an IP address to the corresponding Ethernet address.
In very large networks, the ARP table can become overloaded, which can give the appearance that xCAT is slow.
Tuning ARP on Linux
On Linux, you may see ARP "neighbor table overflow" errors in your syslog (/var/log/messages). These errors indicate that the kernel's ARP (neighbor) table is too small for the number of hosts being contacted. Various web articles discuss this issue and suggest adjusting the ARP-related kernel parameters shown below.
To temporarily change the parameters on Linux, run the following from the command line:
echo "512" >/proc/sys/net/ipv4/neigh/default/gc_thresh1
echo "2048" >/proc/sys/net/ipv4/neigh/default/gc_thresh2
echo "4096" >/proc/sys/net/ipv4/neigh/default/gc_thresh3
echo "240" >/proc/sys/net/ipv4/neigh/default/gc_stale_time
To permanently change the parameters, add the following to /etc/sysctl.conf and reboot the Linux server:
net.ipv4.conf.all.arp_filter = 1
net.ipv4.conf.all.rp_filter = 1
net.ipv4.neigh.default.gc_thresh1 = 512
net.ipv4.neigh.default.gc_thresh2 = 2048
net.ipv4.neigh.default.gc_thresh3 = 4096
net.ipv4.neigh.default.gc_stale_time = 240
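If you prefer not to reboot immediately, the values in /etc/sysctl.conf can usually be loaded into the running kernel with:
sysctl -p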
Tuning ARP on AIX
There are several ARP tuning values that can be set on AIX. Read the discussion of "ARP Cache Tuning" in the "AIX Performance Guide".
The following are the default settings for ARP attributes in AIX 7.1:
arpqsize = 12
arpt_killc = 20
arptab_bsiz = 7
arptab_nb = 149
With these settings, the ARP table can hold 149 buckets (arptab_nb), with 7 entries per bucket (arptab_bsiz), for a total of 1043 entries. If a server connects to more than 1043 hosts concurrently, the system will purge an entry from the ARP table and replace it with a new request. Up to 12 requests (arpqsize) can be queued while this is processing. Entries will remain in the table for 20 minutes (arpt_killc).
Once you have determined the appropriate values for your cluster, you can change the attributes by running the no command with the new values. For example, arptab_nb=256 with arptab_bsiz=8 allows 2048 ARP table entries:
no -r -o arptab_nb=256
no -r -o arptab_bsiz=8
no -p -o arpqsize=64
You must reboot your server for these changes to take effect.
Tuning AIX Network Attributes
The following AIX attribute settings are recommended on all servers and compute nodes in large clusters:
no -p -o tcp_recvspace=524288
no -p -o tcp_sendspace=524288
no -p -o udp_recvspace=655360
no -p -o udp_sendspace=65536
no -p -o rfc1323=1
no -p -o sb_max=8388608
chdev -l sys0 -a maxuproc='8192'
Tuning AIX ulimits
On the Management Node and the Service Nodes, change the ulimits for root. Setting the core file size limit is optional. You must log out and log back in for the changes to take effect.
ulimit -m unlimited
ulimit -n 102400
ulimit -d unlimited
ulimit -f unlimited
ulimit -s unlimited
ulimit -t unlimited
ulimit -u unlimited
Verify:
ulimit -a
time(seconds) unlimited
file(blocks) unlimited
data(kbytes) unlimited
stack(kbytes) unlimited
memory(kbytes) unlimited
coredump(blocks) unlimited
nofiles(descriptors) 102400
threads(per process) unlimited
processes(per user) unlimited
Tuning Linux ulimits
SLES:
The following memory lock and size limits are recommended for large HPC clusters. Set these values in /etc/sysconfig/ulimit:
HARDLOCKLIMIT="unlimited"
SOFTLOCKLIMIT="unlimited"
HARDRESIDENTLIMIT="unlimited"
SOFTRESIDENTLIMIT="unlimited"
RedHat:
The following memory lock limits are recommended for large HPC clusters. Set these values in /etc/security/limits.conf:
* soft memlock unlimited
* hard memlock unlimited
Tuning NFS (Network File System)
xCAT uses NFS in various ways:
- (AIX) Mounting operating system images to diskless nodes through NIM
- (Linux) Mounting operating system images to statelite nodes
- Mounting persistent directories to statelite nodes
- (AIX) Installing stateful nodes through NIM
- (Linux) Mounting the /install and /tftpboot directories from the xCAT management node to service nodes
- (Linux) xCAT virtualization with KVM
Note: For Linux stateful node installs, xCAT typically uses http instead of NFS to deploy operating system packages to the nodes.
If you are experiencing NFS performance issues, you should first verify that you do not have any network issues. Networking problems often impact NFS.
For NFS servers that are serving the same image to a large number of nodes, ensure that the server has enough memory to hold the entire image in its NFS cache to reduce thrashing.
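As a rough check on Linux (the image path below is only an example; substitute your own osimage directory), compare the size of the image being served with the memory available on the NFS server:
du -sh /install/netboot/rhels6/ppc64/compute/rootimg
free -m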
For heavily loaded Linux NFS servers, increasing the NFSD count may help improve performance:
For SLES, edit /etc/sysconfig/nfs and set **USE_KERNEL_NFSD_NUMBER** to a value higher than the default of 8.
For RedHat, edit /etc/init.d/nfs and change **RPCNFSDCOUNT** to a value higher than the default of 8.
Restart your NFS service to ensure that the new value takes effect.
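For example, assuming a server that can comfortably run more NFS daemons (the count of 64 is illustrative):
# SLES: /etc/sysconfig/nfs
USE_KERNEL_NFSD_NUMBER="64"
rcnfsserver restart
# RedHat: RPCNFSDCOUNT in /etc/init.d/nfs
RPCNFSDCOUNT=64
service nfs restart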
For servers that are handling large numbers of NFS operations, especially NFS writes, ensure that your disk subsystem is capable of processing the load. If you are running with a local SCSI drive, consider moving the filesystem to an attached storage subsystem (RAID, etc.) if possible. If that is not an option, it may be necessary to add local disk drives and create your filesystem so that it spans multiple drives, removing physical bottlenecks in the disk subsystem.
An example of creating a filesystem spanning multiple local disks on AIX:
# Create a volume group across 8 disks
mkvg -f -y xcatvg -s 64 hdisk2 hdisk3 hdisk4 hdisk5 hdisk6 hdisk7 hdisk8 hdisk9
# Create a logical volume for the jfs2log that spans multiple disks
# This is important because the jfs2log can become a bottleneck with many writes to the
# filesystem
mklv -y install_log -t jfs2log -a c -S 4K xcatvg 32 hdisk2 hdisk3
# Create a logical volume for the NFS filesystem across all the disks
mklv -y install_lv -t jfs2 -a c -e x xcatvg 1000 hdisk4 hdisk5 hdisk6 hdisk7 hdisk8 hdisk9
# Create the NFS filesystem
crfs -v jfs2 -d install_lv -m /install -A yes -p rw -a agblksize=4096 -a isnapshot=no
Tuning TCP/IP Buffer Size for RMC
If you are using RMC for cluster monitoring, you may need to increase the TCP/IP buffer sizes on your xCAT management node (only the management node requires this change) to prevent incorrect node status from being returned. If the buffer size is too low for the number of nodes in the cluster, a node could be reported as down even though it can be reached with ping and the RMC subsystem on the node is active.
Linux
To temporarily increase the TCP/IP buffer size on Linux, run the following from the command line:
echo 524288 > /proc/sys/net/core/rmem_max
echo 262144 > /proc/sys/net/core/rmem_default
You must also recycle RMC. Run the following from the command line:
/usr/sbin/rsct/bin/rmcctrl -k
/usr/sbin/rsct/bin/rmcctrl -s
For a permanent change, add the following lines to /etc/sysctl.conf and reboot the Linux server:
net.core.rmem_max = 524288
net.core.rmem_default = 262144
Note: To use larger values on Linux, increase both rmem_max and rmem_default in increments of 262144. For example, increase rmem_max to 786432 and rmem_default to 524288.
AIX
To temporarily increase the TCP/IP buffer size on AIX, run the following from the command line:
no -o udp_recvspace=262144
You must also recycle RMC. Run the following from the command line:
/usr/sbin/rsct/bin/rmcctrl -k
/usr/sbin/rsct/bin/rmcctrl -s
For a permanent change use the no command and the /etc/tunables/nextboot file, as follows:
- The no -r command sets the value changes to apply after reboot.
- The no -p command sets the value changes immediately and to apply after reboot.
See the AIX documentation for detailed command usage information. Note: To use larger values on AIX, increase the udp_recvspace value in increments of 262144, not to exceed the sb_max value. You can display the current, default, and boot values of these tunables with the no -L command:
no -L udp_recvspace
---------------------------------------------------------------------------------
NAME CUR DEF BOOT MIN MAX UNIT TYPE DEPENDENCIES
---------------------------------------------------------------------------------
udp_recvspace 262144 262144 262144 4K 2G-1 byte C sb_max
---------------------------------------------------------------------------------
no -L sb_max
-------------------------------------------------------------------------------
NAME CUR DEF BOOT MIN MAX UNIT TYPE DEPENDENCIES
-------------------------------------------------------------------------------
sb_max 1M 1M 1M 1 2G-1 byte D
-------------------------------------------------------------------------------
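For example, to double udp_recvspace immediately and persist the change across reboots (an illustrative value that stays below the default sb_max of 1M):
no -p -o udp_recvspace=524288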
Tuning for HPC Applications
HPC Tuning Guide
See "IBM Tuning Guide for High Performance Computing Applications for IBM Power 6": <http://www.ibm.com/developerworks/wikis/download/attachments/137167333/Power6_optimization.pdf?version=1>
This document contains recommended compiler options, discussion of using SMT (simultaneous multi-threading), running legacy executables, MPI performance optimizations, high performance libraries, and benchmark results.
Tuning for PE and LAPI
The Parallel Environment Installation Guide includes information on tuning for network performance. The PE documents are available here: "Parallel Environment (PE) Library"
For PE 5.2.1, the following documentation provides the recommendations listed in this section. Review these documents for additional settings and detailed explanations:
"Tuning your Linux system for more efficient parallel job performance"
"Running large POE jobs and IP buffer usage (PE for AIX only)"
"IBM RSCT: LAPI Programming Guide", see "Chapter 13. LAPI performance considerations".
Linux
To enable more than 8 tasks in shared memory:
echo 268435456 > /proc/sys/kernel/shmmax
To avoid socket send and receive buffer overflow:
echo 1048576 > /proc/sys/net/core/wmem_max
echo 8388608 > /proc/sys/net/core/rmem_max
# Or, on Power Linux:
sysctl -w net.core.wmem_max=1048576
sysctl -w net.core.rmem_max=8388608
To avoid datagram reassembly failures when MP_UDP_PACKET_SIZE is greater than the MTU:
echo 1048576 > /proc/sys/net/ipv4/ipfrag_low_thresh
echo 8388608 > /proc/sys/net/ipv4/ipfrag_high_thresh
To enable jumbo frames:
ifconfig eth<x> mtu 9000 up
The switch must be configured with jumbo frame enabled.
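To make the MTU persistent across reboots (a sketch; file locations vary by distribution), add an MTU line to the interface configuration file, for example /etc/sysconfig/network-scripts/ifcfg-eth0 on RedHat or /etc/sysconfig/network/ifcfg-eth0 on SLES:
MTU=9000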
To turn off interrupt coalescing on device eth0 on Power Linux:
ifdown eth0
rmmod e1000
insmod /lib/modules/2.6.5-7.97-pseries64/kernel/drivers/net/e1000/e1000.ko \
InterruptThrottleRate=0,0,0
ifconfig eth0
The path to e1000.ko may vary with kernel level.
AIX
To avoid socket send and receive buffer overflow:
no -o sb_max=8388608
To turn off interrupt coalescing on device ent0 for better latency:
ifconfig ent0 detach
chdev -l ent0 -a intr_rate=0
If your user applications are using large pages, see The PE 5.2.1 Operation and Use Guide: "Using POE with AIX Large Pages"
On all compute nodes:
vmo -p -o lgpg_regions=64 -o lgpg_size=16777216
If your user applications require shared memory segments to be pinned, on all compute nodes:
vmo -p -o v_pinshm=1
If you decide you would like full system dumps to aid in problem diagnosis, on all compute nodes:
chdev -l sys0 -a fullcore=true
If your user applications perform better with Simultaneous Multi-Threading (SMT) turned off (this is application dependent), on all compute nodes:
smtctl -m off
Tuning httpd for xCAT node deployments
For a discussion of apache2 tuning, see the external web page: "Apache Performance Tuning"
The default settings for Apache 2 on Red Hat and SLES may not allow enough simultaneous HTTP client connections to support the installation of more than about 50 nodes at a time. To enable greater scaling, the MaxClients and ServerLimit directives need to be increased from 150 to 1000. On Red Hat, change (or add) these directives in
/etc/httpd/conf/httpd.conf
On SLES (with Apache2), change (or add) these directives in
/etc/apache2/server-tuning.conf
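A sketch of the resulting directives, assuming the prefork MPM that these distributions use by default:
<IfModule prefork.c>
    ServerLimit 1000
    MaxClients 1000
</IfModule>
Restart httpd (apache2 on SLES) after making the change.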
Tuning Values for xCAT
xCAT provides several settings that may affect the management performance in your cluster.
Site Table Attributes
The xCAT site table contains many global settings for your xCAT cluster. To display current values, run the command:
lsdef -t site -l
This will only list those attributes that have been explicitly set in your site table. Many other attributes are supported. To list all possible attribute names, a brief description of each, and default values, run:
lsdef -t site -h
You may consider changing the default settings for some of these attributes to improve the performance on a large-scale cluster or if you are experiencing timeouts or failures in these areas:
consoleondemand : When set to 'yes', conserver connects and creates the console output only when the user opens the console. Default is 'no' on Linux, 'yes' on AIX. Setting this to 'yes' can reduce the load conserver places on your xCAT management node. If you need this set to 'no' so that console output is captured continuously, you may then need to consider setting up multiple servers to run the conserver daemon, and specify the correct server on a per-node basis by setting each node's conserver attribute.
ipmimaxp : The max # of processes for ipmi hw ctrl. Default is 64.
ipmiretries : The # of retries to use when communicating with BMCs. Default is 3.
ipmisdrcache : If set to 'no', the xCAT IPMI support will not cache the target node's SDR cache locally to improve performance. Default is 'no'.
ipmitimeout : The timeout to use when communicating with BMCs. Default is 2.
nodestatus : If set to 'n', the nodelist.status column will not be updated during the node deployment, node discovery and power operations. Default is 'y', always update nodelist.status. Setting this to 'n' for large clusters can eliminate one node-to-server contact and one xCAT database write operation for each node during node deployment, but you will then need to determine deployment status through some other means.
powerinterval : The number of seconds the rpower command will wait between performing the action on each target object. It is used to control cluster boot-up speed in large clusters. Default is 0.
ppcmaxp : The max # of processes for PPC hw ctrl. Default is 64.
ppcretry : The max # of PPC hw connection attempts before failing. Default is 3.
ppctimeout : The timeout, in milliseconds, to use when communicating with PPC hw. A value of '0' means use the default. Default is 60.
maxssh : The max # of SSH connections at any one time to the hw ctrl point for PPC. Default is 8.
fsptimeout : The timeout, in milliseconds, to use when communicating with FSPs. A value of '0' means use the default. The default value is 30. If you are experiencing connection timeouts on a heavily loaded management network, you may want to experiment with setting this to a larger value.
svloglocal : If set to 1, syslog on the service node will not be forwarded to the management node. The default is to forward all syslog messages. The tradeoff in setting this attribute is reduced network traffic and log size versus having all system messages from across the cluster available on the management node.
useNmapfromMN : When set to 'yes', the nodestat command obtains node status using nmap (if available) from the management node instead of from the service nodes. This improves performance in a flat network. Default is 'no'.
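To change any of these attributes, you can use chdef against the site object. For example (the values shown are illustrative only):
chdef -t site clustersite consoleondemand=yes powerinterval=1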
Command Flags
In addition to the site table settings, some xCAT commands provide options for timeouts, retries, and fanout to help improve the chances of command success under different conditions. If you are experiencing problems running any of these commands, check their man pages and experiment with these options.
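For example, xdsh accepts a fanout flag that limits the number of concurrent remote shells; the value below is illustrative:
xdsh compute -f 64 uptime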
Consolidating xCAT Database Entries
The xCAT database contains over 50 different tables, although for most clusters you only need entries in a small number of commonly used tables. The majority of the tables are keyed by the xCAT node name. For very large clusters, consolidating individual node entries into entries keyed by a nodegroup name can reduce the amount of information stored in the database, and make handling the data more manageable.
Nodegroup database entries
With the xCAT *def commands (lsdef, mkdef, chdef, rmdef), you can manipulate common attribute values for node groups using "-t group" (object type of 'group'). These will create database entries that are keyed by nodegroup name instead of individual node names.
For example:
chdef -t group -o compute nodetype="osi,ppc" hwtype=lpar os=rhels6 arch=ppc64
will set many common attributes in the nodetype table for the nodegroup "compute".
See the xCAT documentation [Node_Group_Support] for details on using node groups and nodegroup entries in the xCAT database.
Regular expressions in database values
Another useful feature to use in conjunction with nodegroup entries is to create entries that include regular expressions. This is especially useful where you would otherwise need an individual table entry for each node because of unique data values. When those data values follow an identifiable pattern, specifying the pattern as a regular expression enables you to create a single entry for the entire nodegroup. See the xCAT Database manpage for more details and examples.
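As a sketch, a single hosts table entry such as the following (the node naming and IP scheme are illustrative) can replace one IP entry per node for an entire nodegroup:
#node,ip,hostnames,otherinterfaces,comments,disable
"compute","|node(\d+)|10.1.1.($1+0)|",,,,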