
add HA related docs

bybai 2015-10-14 04:06:22 -04:00
parent 206116f11e
commit d9e723dd7b
6 changed files with 2642 additions and 11 deletions


@@ -1,10 +1,3 @@
Overview
========
The xCAT management node plays an important role in the cluster. If the management node is down for whatever reason, the administrators lose the management capability for the whole cluster until the management node is back up and running. In some configurations, such as a Linux NFS-based statelite setup in a non-hierarchical cluster, the compute nodes may not be able to run at all without the management node. So it is important to consider high availability for the management node.
The goal of the HAMN (High Availability Management Node) configuration is that, when the primary xCAT management node fails, the standby management node can take over the role of the management node, either through automatic failover or through a manual procedure performed by the administrator, and thus avoid long periods of time during which your cluster does not have an active cluster management function available.
Configuration considerations
============================
@@ -16,9 +9,9 @@ Data synchronization mechanism
Data synchronization is important for any high availability configuration. When an xCAT management node failover occurs, the xCAT data needs to be exactly the same as before the failover, and some of the operating system configuration should also be synchronized between the two management nodes. To be specific, the following data should be synchronized between the two management nodes to make xCAT HAMN work:
* xCAT database
* xCAT configuration files, like ``/etc/xcat``, ``~/.xcat``, ``/opt/xcat``
* The configuration files for the services that are required by xCAT, like named, DHCP, apache, nfs, ssh, etc.
* The operating systems images repository and users customization data repository, the ``/install`` directory contains these repositories in most cases.
There are many ways to synchronize data, but considering the specific xCAT HAMN requirements, only a few of the data synchronization options are practical for xCAT HAMN.
@@ -47,7 +40,7 @@ The configuration for the high availability applications is usually complex, it
**3\. Maintenance effort**
The automatic failover brings in several high availability applications; after the initial configuration is done, additional maintenance effort will be needed. For example: taking care of the high availability applications during cluster updates, applying updates to the high availability applications themselves, and troubleshooting any problems with the high availability applications. A simple question may help you decide: could you get technical support if one of the high availability applications runs into problems? All software has bugs.
Configuration Options
=====================
@@ -59,7 +52,17 @@ The combinations of data synchronization mechanism and manual/automatic failover
+-------------------+-------------------------+-----------------+--------------+
|Manual Failover | **1** | **2** | 3 |
+-------------------+-------------------------+-----------------+--------------+
|Automatic Failover | 4 | **5** | **6** |
+-------------------+-------------------------+-----------------+--------------+
Option 1, :ref:`setup_ha_mgmt_node_with_raid1_and disks_move`
Option 2, :ref:`setup_ha_mgmt_node_with_shared_data`
Option 3, it is doable but not currently supported.
Option 4, it is not practical.
Option 5, :ref:`setup_xcat_high_available_management_node_with_nfs`
Option 6, :ref:`setup_ha_mgmt_node_with_drbd_pacemaker_corosync`


@@ -1,6 +1,11 @@
High Availability
=================
The xCAT management node plays an important role in the cluster. If the management node is down for whatever reason, the administrators lose the management capability for the whole cluster until the management node is back up and running. In some configurations, such as a Linux NFS-based statelite setup in a non-hierarchical cluster, the compute nodes may not be able to run at all without the management node. So it is important to consider high availability for the management node.
The goal of the HAMN (High Availability Management Node) configuration is that, when the primary xCAT management node fails, the standby management node can take over the role of the management node, either through automatic failover or through a manual procedure performed by the administrator, and thus avoid long periods of time during which your cluster does not have an active cluster management function available.
The following pages describe ways to configure the xCAT Management Node for High Availability.
.. toctree::

(File diff suppressed because it is too large.)


@@ -0,0 +1,84 @@
.. _setup_ha_mgmt_node_with_raid1_and disks_move:
Setup HA Mgmt Node With RAID1 and disks move
============================================
This documentation illustrates how to setup a second management node, or standby management node, in your cluster to provide high availability management capability, using RAID1 configuration inside the management node and physically moving the disks between the two management nodes.
When one disk fails on the primary xCAT management node, replace the failed disk and use the RAID1 functionality to reconstruct the RAID1.
When the primary xCAT management node fails, the administrator can shut down the failed primary management node, unplug the disks from the primary management node and insert them into the standby management node, and power on the standby management node; the standby management node then immediately takes over the cluster management role.
This HAMN approach is primarily intended for clusters in which the management node manages diskful nodes or linux stateless nodes. This also includes hierarchical clusters in which the management node only directly manages the diskful or linux stateless service nodes, and the compute nodes managed by the service nodes can be of any type.
If the compute nodes use only read-only NFS mounts from the management node, you can use this doc as long as you recognize that your nodes will go down while you are failing over to the standby management node. If the compute nodes depend on the management node being up to run their operating system over NFS, this doc is not suitable.
Configuration requirements
==========================
#. The hardware type/model is not required to be identical on the two management nodes, but it is recommended to use a similar hardware configuration on the two management nodes, or at least similar hardware capability, so that both can support the same operating system and provide similar management capability.
#. Hardware RAID: Most IBM servers provide a hardware RAID option, and it is assumed that hardware RAID will be used in this HAMN configuration. If hardware RAID is not available on your servers, software RAID MIGHT also work, but use it at your own risk.
#. The network connections on the two management nodes must be the same: each ethx on the standby management node must be connected to the same network as the corresponding ethx on the primary management node.
#. Use a router/switch for routing: if the nodes in the cluster need to connect to the external network through a gateway, the gateway should be on the router/switch instead of the management node; the router/switch has its own redundancy.
Configuration procedure
=======================
Configure hardware RAID on the two management nodes
-----------------------------------------------------
Follow the server documentation to set up the hardware RAID1 on the standby management node first, and then move the disks to the primary management node and set up hardware RAID1 on the primary management node.
Install OS on the primary management node
------------------------------------------------
Install operating system on the primary management node using whatever method and configure the network interfaces.
Make sure the attribute **HWADDR** is not specified in the network interface configuration file, like ``ifcfg-eth0``.
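As an illustration only (the device name, addresses and netmask below are examples, not values from this document), an ``ifcfg-eth0`` without **HWADDR** might look like: ::
DEVICE=eth0
BOOTPROTO=static
IPADDR=192.168.1.10
NETMASK=255.255.255.0
ONBOOT=yes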
Initial failover test
----------------------
This is a sanity check to make sure the disks work on both management nodes; if it turns out the disks do not work on the standby management node, we catch it now before there is too much to redo. **DO NOT** skip this step.
Power off the primary management node, unplug the disks from the primary management node and insert them into the standby management node, boot up the standby management node and make sure the operating system is working correctly, and the network interfaces could connect to the network.
If there is more than one network interface managed by the same network driver, like ``e1000``, the network interface ordering might be different on the two management nodes even if the hardware configuration is identical, so you need to test the network connections during this initial configuration to make sure they work.
It is unlikely, but if the IP addresses on the management node are assigned by DHCP, make sure the DHCP server is configured to assign the same IP address to the network interfaces on the two management nodes.
After this, fail back to the primary management node, using the same procedure mentioned above.
Setup xCAT on the Primary Management Node
-------------------------------------------
Follow the doc :doc:`xCAT Install Guide <../../guides/install-guides/index>` to set up xCAT on the primary management node.
Continue setting up the cluster
--------------------------------
You can now continue to set up your cluster. Return to using the primary management node. Now set up your cluster using the following documentation, depending on your hardware, OS and the type of install you want to do on the nodes: :doc:`Admin Guide <../../guides/admin-guides/index>`.
For all the xCAT docs: http://xcat-docs.readthedocs.org
During the cluster setup, there is one important thing to consider:
**Network services on management node**
Avoid using the management node to provide network services that need to run continuously, like DHCP, named and ntp; put these network services on the service nodes if possible. Multiple service nodes can provide network service redundancy, for example, use more than one service node as the name servers, DHCP servers and ntp servers for each compute node. If there is no service node configured in the cluster at all, static configuration on the compute nodes, like static IP addresses and ``/etc/hosts`` name resolution, can be used to eliminate the dependency on the management node.
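For example, a compute node that should not depend on the management node for name resolution could carry static ``/etc/hosts`` entries similar to the following (the addresses and names are illustrative only): ::
192.168.1.1    mgtnode
192.168.1.101  node001
192.168.1.102  node002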
Failover
========
The failover procedure is simple and straightforward:
#. Shut down the primary management node
#. Unplug the disks from the primary management node and insert these disks into the standby management node
#. Boot up the standby management node
#. Verify the standby management node can now perform all the cluster management operations.
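A few quick checks that could be run on the standby management node after the disks are moved (illustrative only; adjust the noderange to your cluster): ::
lsxcatd -v                 # confirm xcatd is up and reports its version
lsdef -t site -l           # confirm the site definition is readable from the xCAT database
nodels                     # confirm the node definitions are visible
rpower compute stat        # confirm hardware control still works for the compute nodes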


@@ -0,0 +1,495 @@
.. _setup_ha_mgmt_node_with_shared_data:
Setup HA Mgmt Node With Shared Data
===================================
This documentation illustrates how to setup a second management node, or standby management node, in your cluster to provide high availability management capability, using shared data between the two management nodes.
When the primary xCAT management node fails, the administrator can easily have the standby management node take over role of the management node, and thus avoid long periods of time during which your cluster does not have active cluster management function available.
The xCAT high availability management node (``HAMN``) through shared data is not designed for automatic setup or automatic failover. This documentation describes how to use shared data between the primary management node and standby management node, and how to perform some manual steps to have the standby management node take over the management node role when the primary management node fails. However, high availability applications such as ``IBM Tivoli System Automation (TSA)`` and Linux ``Pacemaker`` could be used to achieve automatic failover; configuring such high availability applications is beyond the scope of this documentation, refer to the applications' documentation for instructions.
The NFS service on the primary management node, or the primary management node itself, will be shut down during the failover process, so any NFS mounts or other network connections from the compute nodes to the management node should be temporarily disconnected during the failover process. If network connectivity is required for compute node run-time operations, you should consider some other way to provide high availability for the network services, unless the compute nodes can also be taken down during the failover process. This also implies:
#. This HAMN approach is primarily intended for clusters in which the management node manages linux diskful nodes or stateless nodes. This also includes hierarchical clusters in which the management node only directly manages the linux diskful or linux stateless service nodes, and the compute nodes managed by the service nodes can be of any type.
#. If the nodes use only read-only NFS mounts from the management node, then you can use this doc as long as you recognize that your nodes will go down while you are failing over to the standby management node.
What is Shared Data
====================
The term ``Shared Data`` means that the two management nodes use a single copy of the xCAT data; no matter which management node is the primary MN, the cluster management capability runs on top of that single data copy. Access to the data could be provided in various ways, like shared storage, NAS, NFS, Samba, etc. Based on the protocol being used, the data might be accessible on only one management node at a time, or accessible on both management nodes in parallel. If the data can only be accessed from one management node, the failover process needs to take care of the data access transition; if the data can be accessed on both management nodes, the failover does not need to consider the data access transition, which usually means the failover process can be faster.
``Warning``: Running a database over a network file system has a lot of potential problems and is not practical; however, most database systems provide a replication feature that can be used to synchronize the database between the two management nodes.
Configuration Requirements
==========================
#. xCAT HAMN requires that the operating system version, xCAT version and database version be identical on the two management nodes.
#. The hardware type/model is not required to be the same on the two management nodes, but it is recommended to have similar hardware capability on the two management nodes to support the same operating system and have similar management capability.
#. Since the management node needs to provide IP services through broadcast, such as DHCP, to the compute nodes, the primary management node and standby management node should be in the same subnet to ensure the network services will work correctly after failover.
#. Setting up HAMN can be done at any time during the life of the cluster; in this documentation we assume the HAMN setup is done from the very beginning of the xCAT cluster setup. There will be some minor differences if the HAMN setup is done in the middle of the xCAT cluster setup.
The example given in this document is for RHEL 6. The same approach can be applied to SLES, but the specific commands might be slightly different. The examples in this documentation are based on the following cluster environment:
Virtual IP Alias Address: 9.114.47.97
Primary Management Node: rhmn1(9.114.47.103), netmask is 255.255.255.192, hostname is rhmn1, running RHEL 6.
Standby Management Node: rhmn2(9.114.47.104), netmask is 255.255.255.192, hostname is rhmn2. Running RHEL 6.
You need to substitute the hostnames and ip address with your own values when setting up your HAMN environment.
Configuring Shared Data
=======================
``Note``: The shared data itself also needs high availability; the shared data should not become a single point of failure.
The configuration procedure will be quite different based on the shared data mechanism that will be used, and configuring these shared data mechanisms is beyond the scope of this documentation. After the shared data mechanism is configured, the following xCAT directory structure should be on the shared data. If this is done before xCAT is installed, you need to create the directories manually; if this is done after xCAT is installed, the directories need to be copied to the shared data. ::
/etc/xcat
/install
~/.xcat
/<dbdirectory>
``Note``: For MySQL, the database directory is ``/var/lib/mysql``; for PostgreSQL, the database directory is ``/var/lib/pgsql``; for DB2, the database directory is specified with the site attribute ``databaseloc``; for SQLite, the database directory is ``/etc/xcat``, already listed above.
Here is an example of how to make directories be shared data through NFS: ::
mount -o rw <nfssvr>:/dir1 /etc/xcat
mount -o rw <nfssvr>:/dir2 /install
mount -o rw <nfssvr>:/dir3 ~/.xcat
mount -o rw <nfssvr>:/dir4 /<dbdirectory>
``Note``: if you need to setup high availability for some other applications, like the HPC software stack, between the two xCAT management nodes, the applications data should be on the shared data.
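If xCAT was already installed before the shared data was configured, the existing directories have to be copied onto the shared file systems before those file systems are mounted over the original locations. A minimal sketch for one directory, assuming the NFS exports from the example above and a temporary mount point: ::
mkdir -p /mnt/shared
mount <nfssvr>:/dir1 /mnt/shared
cp -a /etc/xcat/. /mnt/shared/
umount /mnt/shared
mount -o rw <nfssvr>:/dir1 /etc/xcat
Repeat the same copy for ``/install``, ``~/.xcat`` and the database directory.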
Setup xCAT on the Primary Management Node
=========================================
#. Make the shared data available on the primary management node.
#. Set up a ``Virtual IP address``. The xcatd daemon should be addressable with the same ``Virtual IP address``, regardless of which management node it runs on. The same ``Virtual IP address`` will be configured as an alias IP address on the management node (primary and standby) that the xcatd runs on. The Virtual IP address can be any unused ip address that all the compute nodes and service nodes could reach. Here is an example on how to configure Virtual IP address: ::
ifconfig eth0:0 9.114.47.97 netmask 255.255.255.192
The option ``firstalias`` will configure the Virtual IP ahead of the interface IP address. Since ifconfig does not make the IP address configuration persistent through reboots, the Virtual IP address needs to be re-configured right after the management node is rebooted. This non-persistent Virtual IP address is designed to avoid an IP address conflict when the crashed previous primary management node is recovered with the Virtual IP address still configured.
#. Add the alias IP address into ``/etc/resolv.conf`` as the nameserver. Change the hostname resolution order to use ``/etc/hosts`` before the name server, i.e. change the line to "hosts: files dns" in ``/etc/nsswitch.conf``.
#. Change hostname to the hostname that resolves to the Virtual IP address. This is required for xCAT and database to be setup properly.
#. Install xCAT. The procedure described in :doc:`xCAT Install Guide <../../guides/install-guides/index>` could be used for the xCAT setup on the primary management node.
#. Check that the site table ``master`` and ``nameservers`` attributes and the network ``tftpserver`` attribute are set to the Virtual IP: ::
lsdef -t site
If not correct: ::
chdef -t site master=9.114.47.97
chdef -t site nameservers=9.114.47.97
chdef -t network tftpserver=9.114.47.97
Add the two management nodes into policy table: ::
tabdump policy
"1.2","rhmn1",,,,,,"trusted",,
"1.3","rhmn2",,,,,,"trusted",,
#. (Optional) DB2 only, change the databaseloc in site table: ::
chdef -t site databaseloc=/dbdirectory
#. Install and configure the database. Refer to the doc [**todo:** choosing_the_Database] to configure the database on the xCAT management node.
Verify xcat is running on correct database by running: ::
lsxcatd -a
#. Backup the xCAT database tables for the current configuration, using the command: ::
dumpxCATdb -p <your_backup_dir>
#. Setup a crontab to backup the database each night by running ``dumpxCATdb`` and storing the backup to some filesystem not on the shared data.
#. Stop the xcatd daemon and the related network services, and prevent them from starting on reboot: ::
service xcatd stop
chkconfig --level 345 xcatd off
service conserver stop
chkconfig --level 2345 conserver off
service dhcpd stop
chkconfig --level 2345 dhcpd off
#. Stop Database and prevent the database from auto starting at boot time, use mysql as an example: ::
service mysqld stop
chkconfig mysqld off
#. (Optional) If DFM is being used for hardware control capabilities, install the DFM packages, and set up xCAT to communicate directly with the System P server's service processor: ::
xCAT-dfm RPM
ISNM-hdwr_svr RPM
#. If there is any node that is already managed by the Management Node, change the noderes table ``tftpserver``, ``xcatmaster`` and ``nfsserver`` attributes to the Virtual IP (see the sketch after this list).
#. Set the hostname back to original non-alias hostname.
#. After installing xCAT and the database, you can set up the service nodes or compute nodes.
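Here is a sketch of how the ``noderes`` attributes mentioned in the list above could be pointed at the Virtual IP; the noderange ``compute`` is only an example, and the address is the example Virtual IP used in this document: ::
chdef -t node -o compute tftpserver=9.114.47.97 xcatmaster=9.114.47.97 nfsserver=9.114.47.97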
Setup xCAT on the Standby Management Node
=========================================
#. Make sure the standby management node is NOT using the shared data.
#. Add the alias IP address into ``/etc/resolv.conf`` as the nameserver. Change the hostname resolution order to use ``/etc/hosts`` before the name server, i.e. change the line to "hosts: files dns" in ``/etc/nsswitch.conf``.
#. Temporarily change the hostname to the hostname that resolves to the Virtual IP address. This is required for xCAT and database to be setup properly. This only needs to be done one time.
Also configure the Virtual IP address during this setup. ::
ifconfig eth0:0 9.114.47.97 netmask 255.255.255.192
#. Install xCAT. The procedure described in :doc:`xCAT Install Guide <../../guides/install-guides/index>` can be used for the xCAT setup on the standby management node. The database system on the standby management node must be the same as the one running on the primary management node.
#. (Optional) DFM only, Install DFM package: ::
xCAT-dfm RPM
ISNM-hdwr_svr RPM
#. Setup hostname resolution between the primary management node and standby management node. Make sure the primary management node can resolve the hostname of the standby management node, and vice versa.
#. Setup ssh authentication between the primary management node and standby management node. It should be setup as "passwordless ssh authentication" and it should work in both directions. The summary of this procedure is (a command sketch is shown after this list):
a. cat keys from ``~/.ssh/id_rsa.pub`` on the primary management node and add them to ``~/.ssh/authorized_keys`` on the standby management node. Remove the standby management node entry from ``~/.ssh/known_hosts`` on the primary management node prior to issuing ssh to the standby management node.
b. cat keys from ``~/.ssh/id_rsa.pub`` on the standby management node and add them to ``~/.ssh/authorized_keys`` on the primary management node. Remove the primary management node entry from ``~/.ssh/known_hosts`` on the standby management node prior to issuing ssh to the primary management node.
#. Make sure the time on the primary management node and standby management node is synchronized.
#. Stop the xcatd daemon and the related network services, and prevent them from starting on reboot: ::
service xcatd stop
chkconfig --level 345 xcatd off
service conserver stop
chkconfig --level 2345 conserver off
service dhcpd stop
chkconfig --level 2345 dhcpd off
#. Stop Database and prevent the database from auto starting at boot time. Use mysql as an example: ::
service mysqld stop
chkconfig mysqld off
#. Backup the xCAT database tables for the current configuration on standby management node, using command: ::
dumpxCATdb -p <yourbackupdir>.
#. Change the hostname back to the original hostname.
#. Remove the Virtual Alias IP. ::
ifconfig eth0:0 0.0.0.0 0.0.0.0
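A command sketch for the passwordless ssh setup summarized in the list above, assuming RSA keys already exist in ``~/.ssh`` and using the example hostnames from this document: ::
# on the primary management node (rhmn1)
cat ~/.ssh/id_rsa.pub | ssh rhmn2 "cat >> ~/.ssh/authorized_keys"
ssh rhmn2 date            # should not prompt for a password
# on the standby management node (rhmn2)
cat ~/.ssh/id_rsa.pub | ssh rhmn1 "cat >> ~/.ssh/authorized_keys"
ssh rhmn1 date            # should not prompt for a password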
File Synchronization
====================
For the files that change constantly, such as the xCAT database and ``/etc/xcat/*``, we have to put the files on the shared data; but for the files that do not change frequently, or are unlikely to change at all, we can simply copy the files from the primary management node to the standby management node, or use crontab and rsync to keep them synchronized. Here are some files we recommend keeping synchronized between the primary management node and standby management node:
SSL Credentials and SSH Keys
--------------------------------
To enable both the primary and the standby management nodes to ssh to the service nodes and compute nodes, the ssh keys should be kept synchronized between the primary management node and standby management node. To allow xcatd on both the primary and the standby management nodes to communicate with xcatd on the services nodes, the xCAT SSL credentials should be kept synchronized between the primary management node and standby management node.
The xCAT SSL credentials reside in the directories ``/etc/xcat/ca``, ``/etc/xcat/cert`` and ``$HOME/.xcat/``. The ssh host keys that xCAT generates to be placed on the compute nodes are in the directory ``/etc/xcat/hostkeys``. These directories are on the shared data.
In addition, the ssh root keys in the management node's root home directory (in ``~/.ssh``) must be kept in sync between the primary management node and standby management node. Only sync the key files and not the ``authorized_keys`` file. These keys will seldom change, so you can just do it manually when they do, or setup a cron entry like this sample: ::
0 1 * * * /usr/bin/rsync -Lprgotz $HOME/.ssh/id* rhmn2:$HOME/.ssh/
Now go to the Standby node and add the Primary's id_rsa.pub to the Standby's authorized_keys file.
Network Services Configuration Files
------------------------------------
A lot of network services are configured on the management node, such as DNS, DHCP and HTTP. The network services are mainly controlled by configuration files. However, some of the network services configuration files contain the local hostname/ipaddresses related information, so simply copying these network services configuration files to the standby management node may not work. Generating these network services configuration files is very easy and quick by running xCAT commands such as makedhcp, makedns or nimnodeset, as long as the xCAT database contains the correct information.
While it is easier to configure the network services on the standby management node by running xCAT commands when failing over to the standby management node, an exception is the ``/etc/hosts``; the ``/etc/hosts`` may be modified on your primary management node as ongoing cluster maintenance occurs. Since the ``/etc/hosts`` is very important for xCAT commands, the ``/etc/hosts`` will be synchronized between the primary management node and standby management node. Here is an example of the crontab entries for synchronizing the ``/etc/hosts``: ::
0 2 * * * /usr/bin/rsync -Lprogtz /etc/hosts rhmn2:/etc/
Additional Customization Files and Production files
----------------------------------------------------
Besides the files mentioned above, there may be some additional customization files and production files that need to be copied over to the standby management node, depending on your local unique requirements. You should always try to keep the standby management node as an identical clone of the primary management node. Here are some example files that can be considered: ::
/.profile
/.rhosts
/etc/auto_master
/etc/auto/maps/auto.u
/etc/motd
/etc/security/limits
/etc/netscvc.conf
/etc/ntp.conf
/etc/inetd.conf
/etc/passwd
/etc/security/passwd
/etc/group
/etc/security/group
/etc/exports
/etc/dhcpsd.cnf
/etc/services
/etc/inittab
(and more)
``Note``:
If the IBM HPC software stack is configured in your environment, execute additional steps required to copy additional data or configuration files for HAMN setup.
The ``dhcpsd.cnf`` should be synchronized between the primary management node and standby management node only when the DHCP configuration on the two management nodes is exactly the same.
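A sample cron entry that keeps a few of the files listed above in sync; the file list, the schedule and the standby hostname rhmn2 are illustrative and should be adjusted to your environment: ::
0 3 * * * /usr/bin/rsync -Lprgotz /etc/passwd /etc/group /etc/exports /etc/services rhmn2:/etc/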
Cluster Maintenance Considerations
==================================
The standby management node should be taken into account when doing any maintenance work in the xCAT cluster with HAMN setup.
#. Software Maintenance - Any software updates on the primary management node should also be done on the standby management node.
#. File Synchronization - Although we have setup crontab to synchronize the related files between the primary management node and standby management node, the crontab entries are only run in specific time slots. The synchronization delay may cause potential problems with HAMN, so it is recommended to manually synchronize the files mentioned in the section above whenever the files are modified.
#. Reboot management nodes - If the primary management node needs to be rebooted, since the daemons are set not to auto start at boot time and the shared data will not be mounted automatically, you should mount the shared data and start the daemons manually after the reboot.
``Note``: after a software upgrade, some services that were set not to autostart on boot might be started by the software upgrade process, or even set to autostart on boot. The admin should check the services on both the primary and standby management nodes; if any of the services are set to autostart on boot, turn that off, and if any of the services are started on the backup management node, stop them.
At this point, the HA MN Setup is complete, and customer workloads and system administration can continue on the primary management node until a failure occurs. The xcatdb and files on the standby management node will continue to be synchronized until such a failure occurs.
Failover
========
There are two kinds of failover: planned failover and unplanned failover. A planned failover can be useful for updating the management nodes or any scheduled maintenance activities; an unplanned failover covers unexpected hardware or software failures.
In a planned failover, you can do the necessary cleanup work on the previous primary management node before failing over to the previous standby management node. In an unplanned failover, the previous management node is probably not functioning at all; you can simply shut down the system.
Take down the Current Primary Management Node
---------------------------------------------
xCAT ships a sample script ``/opt/xcat/share/xcat/hamn/deactivate-mn`` to make the machine a standby management node. Before using this script, you need to review it carefully and make updates accordingly. Here is an example of how to use this script: ::
/opt/xcat/share/xcat/hamn/deactivate-mn -i eth1:2 -v 9.114.47.97
On the current primary management node:
If the management node is still available and running the cluster, perform the following steps to shut it down.
#. (DFM only) Remove connections from CEC and Frame. ::
rmhwconn cec,frame
rmhwconn cec,frame -T fnm
#. Stop the xCAT daemon.
``Note``: xCAT must also be stopped on all Service Nodes, and LL as well if it is using the database. ::
service xcatd stop
service dhcpd stop
#. Unexport the xCAT NFS directories
The exported xCAT NFS directories will prevent the shared data partitions from being unmounted, so the xCAT NFS directories should be unexported before failover: ::
exportfs -ua
#. Stop database
Use mysql as an example: ::
service mysqld stop
#. Unmount shared data
All the file systems on the shared data need to be unmounted so that the previous standby management node is able to mount the file systems on the shared data. Here is an example: ::
umount /etc/xcat
umount /install
umount ~/.xcat
umount /db2database
When trying to umount the file systems, if there are processes accessing files or directories on the file systems, you will get a "Device busy" error. Stop or kill all the processes that are accessing the shared data file systems and retry the unmount (see the sketch after this list).
#. Unconfigure Virtual IP: ::
ifconfig eth0:0 0.0.0.0 0.0.0.0
If the ifconfig command has been added to rc.local, remove it from rc.local.
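A sketch of how the processes holding a busy file system can be identified and, if it is safe to do so, killed (this assumes the ``fuser`` command from the psmisc package is installed): ::
fuser -vm /install        # list the processes using the file system
fuser -km /install        # optionally kill them, then retry the umount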
Bring up the New Primary Management Node
----------------------------------------
Execute script ``/opt/xcat/share/xcat/hamn/activate-mn`` to make the machine be a primary management node: ::
/opt/xcat/share/xcat/hamn/activate-mn -i eth1:2 -v 9.114.47.97 -m 255.255.255.0
On the new primary management node:
#. Configure Virtual IP: ::
ifconfig eth0:0 9.114.47.97 netmask 255.255.255.192
You can put the ifconfig command into rc.local to make the Virtual IP persistent across reboots (see the sketch at the end of this section).
#. Mount shared data: ::
mount /etc/xcat
mount /install
mount ~/.xcat
mount /db2database
#. Start database, use mysql as an example: ::
service mysqld start
#. Start the daemons: ::
service dhcpd start
service xcatd start
service hdwr_svr start
service conserver start
#. (DFM only) Setup connection for CEC and Frame: ::
mkhwconn cec,frame -t
mkhwconn cec,frame -t -T fnm
chnwm -a
#. Setup network services and conserver
**DNS**: run ``makedns``. Verify DNS services are working for node resolution. Make sure the line ``nameserver=<virtual ip>`` is in ``/etc/resolv.conf``.
**DHCP**: if the dhcpd.leases is not synchronized between the primary management node and standby management node, run ``makedhcp -a`` to set up the DHCP leases. Verify DHCP is operational.
**conserver**: run ``makeconservercf``. This will recreate the ``/etc/conserver.cf`` config files for all the nodes.
#. (Optional) Setup the OS deployment environment
This step is required only when you want to use this new primary management node to perform os deployment tasks.
The operating system images definitions are already in the xCAT database, and the operating system image files are already in ``/install`` directory.
Run the following command to list all the operating system images. ::
lsdef -t osimage -l
If you are seeing ssh problems when trying to ssh to the compute nodes or any other nodes, the hostname in the ssh keys under the directory ``$HOME/.ssh`` needs to be updated.
#. Restart NFS service and re-export the NFS exports
Because of the Virtual ip configuration and the other network configuration changes on the new primary management node, the NFS service needs to be restarted and the NFS exports need to be re-exported. ::
exportfs -ua
service nfs stop
service nfs start
exportfs -a
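If you choose to make the Virtual IP persistent on the node that currently owns it, a minimal sketch for RHEL, using the example address from this document, is: ::
echo "ifconfig eth0:0 9.114.47.97 netmask 255.255.255.192" >> /etc/rc.d/rc.local
Remember to remove this line from ``rc.local`` again when this node gives up the primary role, as described in the takedown steps above.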
Setup the Cluster
-----------------
At this point you have setup your Primary and Standby management node for HA. You can now continue to set up your cluster. Return to using the Primary management node attached to the shared data. Now set up your hierarchical cluster using the following documentation, depending on your hardware, OS and the type of install you want to do on the nodes. Other docs are available for full disk installs: :doc:`Admin Guide <../../guides/admin-guides/index>`.
For all the xCAT docs: http://xcat-docs.readthedocs.org
Appendix A Configure Shared Disks
=================================
The following sections describe how to configure shared disks on Linux. The steps do not apply to all shared disk configuration scenarios; you may need to use slightly different steps according to your shared disk configuration.
The operating system is installed on the internal disks.
#. Connect the shared disk to both management nodes
To verify the shared disks are connected correctly, run the ``sginfo`` command on both management nodes and look for the same serial number in the output. Please be aware that the ``sginfo`` command may not be installed by default on Linux; it is shipped with the package ``sg3_utils``, which you can manually install on both management nodes.
Once the ``sginfo`` command is installed, run the ``sginfo -l`` command on both management nodes to list all the known SCSI disks. For example, enter: ::
sginfo -l
Output will be similar to: ::
/dev/sdd /dev/sdc /dev/sdb /dev/sda
/dev/sg0 [=/dev/sda scsi0 ch=0 id=1 lun=0]
/dev/sg1 [=/dev/sdb scsi0 ch=0 id=2 lun=0]
/dev/sg2 [=/dev/sdc scsi0 ch=0 id=3 lun=0]
/dev/sg3 [=/dev/sdd scsi0 ch=0 id=4 lun=0]
Use the ``sginfo -s <device_name>`` to identify disks with the same serial number on both management nodes, for example:
On the primary management node: ::
[root@rhmn1 ~]# sginfo -s /dev/sdb
Serial Number '1T23043224 '
[root@rhmn1 ~]#
On the standby management node: ::
[root@rhmn2~]# sginfo -s /dev/sdb
Serial Number '1T23043224 '
We can see that ``/dev/sdb`` is a shared disk on both management nodes. In some cases, as with mirrored disks or when there is no matching of serial numbers between the two management nodes, multiple disks on a single server can have the same serial number. In these cases, format the disks, mount them on both management nodes, and then touch files on the disks to determine whether they are shared between the management nodes.
#. Create partitions on shared disks
After the shared disks are identified, create the partitions on the shared disks using fdisk command on the primary management node. Here is an example: ::
fdisk /dev/sdc
Verify the partitions are created by running ``fdisk -l``.
#. Create file systems on shared disks
Run the ``mkfs.ext3`` command on the primary management node to create file systems on the shared disk that will contain the xCAT data. For example: ::
mkfs.ext3 -v /dev/sdc1
mkfs.ext3 -v /dev/sdc2
mkfs.ext3 -v /dev/sdc3
mkfs.ext3 -v /dev/sdc4
If you place entries for the disk in ``/etc/fstab``, which is not required, ensure that the entries do not have the system automatically mount the disk (see the example at the end of this appendix).
``Note``: Since the file systems will not be mounted automatically during a system reboot, you need to manually mount the file systems after the primary management node reboots. Before mounting the file systems, stop the xcat daemon first; after the file systems are mounted, start the xcat daemon.
#. Verify the file systems on the primary management node.
Verify the file systems could be mounted and written on the primary management node, here is an example: ::
mount /dev/sdc1 /etc/xcat
mount /dev/sdc2 /install
mount /dev/sdc3 ~/.xcat
mount /dev/sdc4 /db2database
After that, umount the file system on the primary management node: ::
umount /etc/xcat
umount /install
umount ~/.xcat
umount /db2database
#. Verify the file systems on the standby management node.
On the standby management node, verify the file systems could be mounted and written. ::
mount /dev/sdc1 /etc/xcat
mount /dev/sdc2 /install
mount /dev/sdc3 ~/.xcat
mount /dev/sdc4 /db2database
You may get the errors "mount: you must specify the filesystem type" or "mount: special device /dev/sdb1 does not exist" when trying to mount the file systems on the standby management node. This is caused by missing device files on the standby management node; run ``fdisk /dev/sdx`` and simply select "w" (write table to disk and exit) in the fdisk menu, then retry the mount.
After that, umount the file system on the standby management node: ::
umount /etc/xcat
umount /install
umount ~/.xcat
umount /db2database
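As mentioned above, if you do add the shared file systems to ``/etc/fstab``, use the ``noauto`` option so that they are not mounted automatically at boot. An illustrative example using the device names and mount points from this appendix: ::
/dev/sdc1  /etc/xcat     ext3  noauto  0 0
/dev/sdc2  /install      ext3  noauto  0 0
/dev/sdc3  /root/.xcat   ext3  noauto  0 0
/dev/sdc4  /db2database  ext3  noauto  0 0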


@@ -0,0 +1,511 @@
.. _setup_xcat_high_available_management_node_with_nfs:
Setup xCAT HA Mgmt with NFS pacemaker and corosync
====================================================================================
In this doc, we will configure an xCAT HA cluster using ``pacemaker`` and ``corosync`` based on an NFS server. ``pacemaker`` and ``corosync`` only support ``x86_64`` systems; for more information about ``pacemaker`` and ``corosync``, refer to :ref:`setup_ha_mgmt_node_with_drbd_pacemaker_corosync`.
Prepare environments
--------------------
The NFS SERVER IP is: c902f02x44 10.2.2.44
The NFS shares are ``/disk1/install``, ``/etc/xcat``, ``/root/.xcat``, ``/root/.ssh/``, ``/disk1/hpcpeadmin``
First xCAT Management node is: rhmn1 10.2.2.235
Second xCAT Management node is: rhmn2 10.2.2.233
Virtual IP: 10.2.2.250
This example will use static IPs to provision nodes, so we do not use the DHCP service. If you want to use the DHCP service, you should consider saving the DHCP-related configuration files on the NFS server.
The DB is SQLite. There is no service node in this example.
Prepare NFS server
--------------------
On the NFS server 10.2.2.44, execute the following commands to export the file systems. If you want to use a non-root user, such as hpcpeadmin, to manage xCAT, you should also create and export a directory for ``/home/hpcpeadmin``. Execute these commands on the NFS server c902f02x44 (a quick verification of the exports is shown after the command block): ::
# service nfs start
# mkdir ~/.xcat
# mkdir -p /etc/xcat
# mkdir -p /disk1/install/
# mkdir -p /disk1/hpcpeadmin
# mkdir -p /disk1/install/xcat
# vi /etc/exports
/disk1/install *(rw,no_root_squash,sync,no_subtree_check)
/etc/xcat *(rw,no_root_squash,sync,no_subtree_check)
/root/.xcat *(rw,no_root_squash,sync,no_subtree_check)
/root/.ssh *(rw,no_root_squash,sync,no_subtree_check)
/disk1/hpcpeadmin *(rw,no_root_squash,sync,no_subtree_check)
# exportfs -a
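A quick way to verify the exports from a management node before mounting them, assuming the ``showmount`` command from nfs-utils is available: ::
showmount -e 10.2.2.44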
Install First xCAT MN rhmn1
------------------------------
Execute steps on xCAT MN rhmn1
#. Configure IP alias in rhmn1: ::
ifconfig eth0:0 10.2.2.250 netmask 255.0.0.0
#. Add alias ip into ``/etc/resolv.conf``: ::
#vi /etc/resolv.conf
search pok.stglabs.ibm.com
nameserver 10.2.2.250
``rsync`` /etc/resolv.conf to ``c902f02x44:/disk1/install/xcat/``: ::
rsync /etc/resolv.conf c902f02x44:/disk1/install/xcat/
Add the alias IP, rhmn2 and rhmn1 into ``/etc/hosts``: ::
#vi /etc/hosts
10.2.2.233 rhmn2 rhmn2.pok.stglabs.ibm.com
10.2.2.235 rhmn1 rhmn1.pok.stglabs.ibm.com
``rsync`` /etc/hosts to ``c902f02x44:/disk1/install/xcat/``: ::
rsync /etc/hosts c902f02x44:/disk1/install/xcat/
#. Install first xcat MN rhmn1
Mount the shared NFS file systems from 10.2.2.44 (a quick mount check is shown at the end of this section): ::
# mkdir -p /install
# mkdir -p /etc/xcat
# mkdir -p /home/hpcpeadmin
# mount 10.2.2.44:/disk1/install /install
# mount 10.2.2.44:/etc/xcat /etc/xcat
# mkdir -p /root/.xcat
# mount 10.2.2.44:/root/.xcat /root/.xcat
# mount 10.2.2.44:/root/.ssh /root/.ssh
# mount 10.2.2.44:/disk1/hpcpeadmin /home/hpcpeadmin
Create the new user hpcpeadmin and set its password to hpcpeadminpw: ::
# USER="hpcpeadmin"
# GROUP="hpcpeadmin"
# /usr/sbin/groupadd -f ${GROUP}
# /usr/sbin/useradd ${USER} -d /home/${USER} -s /bin/bash
# /usr/sbin/usermod -a -G ${GROUP} ${USER}
# passwd ${USER}
Make the new user hpcpeadmin a sudoer: ::
# USERNAME="hpcpeadmin"
# SUDOERS_FILE="/etc/sudoers"
# sed 's/Defaults requiretty/#Defaults requiretty/g' ${SUDOERS_FILE} > /tmp/sudoers
# echo "$USERNAME ALL=(ALL) NOPASSWD:ALL" >> /tmp/sudoers
# cp -f /tmp/sudoers ${SUDOERS_FILE}
# chown hpcpeadmin:hpcpeadmin /home/hpcpeadmin
# rm -rf /tmp/sudoers
Check the result: ::
#su - hpcpeadmin
$ sudo cat /etc/sudoers|grep hpcpeadmin
hpcpeadmin ALL=(ALL) NOPASSWD:ALL
$exit
Download the xcat-core tarball and xcat-dep tarball from GitHub, and untar them: ::
# mkdir /install/xcat
# mv xcat-core-2.8.4.tar.bz2 /install/xcat/
# mv xcat-dep-201404250449.tar.bz2 /install/xcat/
# cd /install/xcat
# tar -jxvf xcat-core-2.8.4.tar.bz2
# tar -jxvf xcat-dep-201404250449.tar.bz2
# cd xcat-core
# ./mklocalrepo.sh
# cd ../xcat-dep/rh6/x86_64/
# ./mklocalrepo.sh
# yum clean metadata
# yum install xCAT
# source /etc/profile.d/xcat.sh
#. Use vip in site table and networks table: ::
# chdef -t site master=10.2.2.250 nameservers=10.2.2.250
# chdef -t network 10_0_0_0-255_0_0_0 tftpserver=10.2.2.250
# tabdump networks
#netname,net,mask,mgtifname,gateway,dhcpserver,tftpserver,nameservers,ntpservers,logservers,dynamicrange,staticrange,staticrangeincrement,nodehostname,ddnsdomain,vlanid,domain,comments,disable
"10_0_0_0-255_0_0_0","10.0.0.0","255.0.0.0","eth0","10.2.0.221",,"10.2.2.250",,,,,,,,,,,,
#. Add 2 nodes into policy table: ::
#tabedit policy
"1.2","rhmn1",,,,,,"trusted",,
"1.3","rhmn2",,,,,,"trusted",,
#. Backup xcatDB(optional): ::
dumpxCATdb -p <yourbackupdir>.
#. Check and handle the policy table to allow the user to run commands: ::
# chtab policy.priority=6 policy.name=hpcpeadmin policy.rule=allow
# tabdump policy
#priority,name,host,commands,noderange,parameters,time,rule,comments,disable
"1","root",,,,,,"allow",,
"1.2","rhmn1",,,,,,"trusted",,
"1.3","rhmn2",,,,,,"trusted",,
"2",,,"getbmcconfig",,,,"allow",,
"2.1",,,"remoteimmsetup",,,,"allow",,
"2.3",,,"lsxcatd",,,,"allow",,
"3",,,"nextdestiny",,,,"allow",,
"4",,,"getdestiny",,,,"allow",,
"4.4",,,"getpostscript",,,,"allow",,
"4.5",,,"getcredentials",,,,"allow",,
"4.6",,,"syncfiles",,,,"allow",,
"4.7",,,"litefile",,,,"allow",,
"4.8",,,"litetree",,,,"allow",,
"6","hpcpeadmin",,,,,,"allow",,
#. Make sure the xCAT commands are in the user's path: ::
# su - hpcpeadmin
$ echo $PATH | grep xcat
/opt/xcat/bin:/opt/xcat/sbin:/opt/xcat/share/xcat/tools:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/hpcpeadmin/bin
$lsdef -t site -l
#. Stop the xcatd daemon and the related network services, and prevent them from starting on reboot: ::
# service xcatd stop
Stopping xCATd [ OK ]
# chkconfig --level 345 xcatd off
# service conserver stop
conserver not running, not stopping [PASSED]
# chkconfig --level 2345 conserver off
# service dhcpd stop
# chkconfig --level 2345 dhcpd off
Remove the Virtual Alias IP ::
# ifconfig eth0:0 0.0.0.0 0.0.0.0
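Before moving on to the second management node, you may want to confirm that the NFS shares from 10.2.2.44 are still mounted on rhmn1 (a quick check; the output will vary): ::
mount | grep 10.2.2.44
df -h /install /etc/xcat /root/.xcat /root/.ssh /home/hpcpeadmin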
Install second xCAT MN node rhmn2
-------------------------------------
The installation steps are exactly the same as in the ``Install First xCAT MN rhmn1`` part above, using the same VIP as rhmn1.
SSH Setup Across nodes rhmn1 and rhmn2
---------------------------------------------
Set up ssh across nodes rhmn1 and rhmn2; make sure rhmn1 can ssh to rhmn2 without a password: ::
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
rsync -ave ssh /etc/ssh/ rhmn2:/etc/ssh/
rsync -ave ssh /root/.ssh/ rhmn2:/root/.ssh/
``Note``: if they can ssh to each other using a password, that is enough.
Install corosync and pacemaker on both rhmn2 and rhmn1
-------------------------------------------------------------
#. Download crmsh, pssh and python-pssh: ::
wget download.opensuse.org/repositories/network:/ha-clustering:/Stable/RedHat_RHEL-6/x86_64/crmsh-2.1-1.1.x86_64.rpm
wget download.opensuse.org/repositories/network:/ha-clustering:/Stable/RedHat_RHEL-6/x86_64/pssh-2.3.1-4.2.x86_64.rpm
wget download.opensuse.org/repositories/network:/ha-clustering:/Stable/RedHat_RHEL-6/x86_64/python-pssh-2.3.1-4.2.x86_64.rpm
rpm -ivh python-pssh-2.3.1-4.2.x86_64.rpm
rpm -ivh pssh-2.3.1-4.2.x86_64.rpm
yum install redhat-rpm-config
rpm -ivh crmsh-2.1-1.1.x86_64.rpm
#. Install ``corosync`` and ``pacemaker`` from OS repositories: ::
#cd /etc/yum.repos.d
#cat rhel-local.repo
[rhel-local]
name=HPCCloud configured local yum repository for rhels6.5/x86_64
baseurl=http://10.2.0.221/install/rhels6.5/x86_64
enabled=1
gpgcheck=0
[rhel-local1]
name=HPCCloud1 configured local yum repository for rhels6.5/x86_64
baseurl=http://10.2.0.221/install/rhels6.5/x86_64/HighAvailability
enabled=1
gpgcheck=0
#. Install ``corosync`` and ``pacemaker``, then generate ssh key:
Install ``corosync`` and ``pacemaker``: ::
yum install -y corosync pacemaker
Generate a security key: first generate a security key used for authentication by all nodes in the cluster.
On one of the systems in the corosync cluster, enter: ::
corosync-keygen
It will look like the command is not doing anything. It is waiting for entropy data
to be written to ``/dev/random`` until it gets 1024 bits. You can speed that process
up by going to another console for the system and entering: ::
cd /tmp
wget http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.32.8.tar.bz2
tar xvfj linux-2.6.32.8.tar.bz2
find .
This should create enough I/O, needed for entropy.
Then you need to copy the generated key file to all of your nodes and put it in ``/etc/corosync/``
with ``user=root``, ``group=root`` and mode 0400: ::
chmod 400 /etc/corosync/authkey
scp /etc/corosync/authkey vm2:/etc/corosync/
#. Edit corosync.conf: ::
#cat /etc/corosync/corosync.conf
# Please read the corosync.conf.5 manual page
compatibility: whitetank
totem {
        version: 2
        secauth: off
        threads: 0
        interface {
                member {
                        memberaddr: 10.2.2.233
                }
                member {
                        memberaddr: 10.2.2.235
                }
                ringnumber: 0
                bindnetaddr: 10.2.2.0
                mcastport: 5405
        }
        transport: udpu
}
logging {
        fileline: off
        to_stderr: no
        to_logfile: yes
        to_syslog: yes
        logfile: /var/log/cluster/corosync.log
        debug: off
        timestamp: on
        logger_subsys {
                subsys: AMF
                debug: off
        }
}
amf {
        mode: disabled
}
#. Configure ``pacemaker``: ::
#vi /etc/corosync/service.d/pcmk
service {
name: pacemaker
ver: 1
}
#. Synchronize: ::
for f in /etc/corosync/corosync.conf /etc/corosync/service.d/pcmk; do scp $f rhmn2:$f; done
#. Start ``corosync`` and ``pacemaker`` in both rhmn1 and rhmn2: ::
# /etc/init.d/corosync start
Starting Corosync Cluster Engine (corosync): [ OK ]
# /etc/init.d/pacemaker start
Starting Pacemaker Cluster Manager[ OK ]
#. Verify the configuration and set stonith-enabled to false: ::
# crm_verify -L -V
error: unpack_resources: Resource start-up disabled since no STONITH resources have been defined
error: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
error: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity
Errors found during check: config not valid
# crm configure property stonith-enabled=false
Customize corosync/pacemaker configuration for xCAT
------------------------------------------------------
Please be aware that you need to apply ALL the configuration at once. You cannot pick and choose which pieces to put in, and you cannot put some in now and some later. Don't execute individual commands; use ``crm configure edit`` instead.
Check that both rhmn1 and rhmn2 are in standby state now: ::
rhmn1 ~]# crm status
Last updated: Wed Aug 13 22:57:58 2014
Last change: Wed Aug 13 22:40:31 2014 via cibadmin on rhmn1
Stack: classic openais (with plugin)
Current DC: rhmn2 - partition with quorum
Version: 1.1.8-7.el6-394e906
2 Nodes configured, 2 expected votes
14 Resources configured.
Node rhmn1: standby
Node rhmn2: standby
Execute ``crm configure edit`` to add all the configuration at once: ::
rhmn1 ~]# crm configure edit
node rhmn1
node rhmn2 \
attributes standby=on
primitive ETCXCATFS Filesystem \
params device="10.2.2.44:/etc/xcat" fstype=nfs options=v3 directory="/etc/xcat" \
op monitor interval=20 timeout=40
primitive HPCADMIN Filesystem \
params device="10.2.2.44:/disk1/hpcpeadmin" fstype=nfs options=v3 directory="/home/hpcpeadmin" \
op monitor interval=20 timeout=40
primitive ROOTSSHFS Filesystem \
params device="10.2.2.44:/root/.ssh" fstype=nfs options=v3 directory="/root/.ssh" \
op monitor interval=20 timeout=40
primitive INSTALLFS Filesystem \
params device="10.2.2.44:/disk1/install" fstype=nfs options=v3 directory="/install" \
op monitor interval=20 timeout=40
primitive NFS_xCAT lsb:nfs \
op start interval=0 timeout=120s \
op stop interval=0 timeout=120s \
op monitor interval=41s
primitive NFSlock_xCAT lsb:nfslock \
op start interval=0 timeout=120s \
op stop interval=0 timeout=120s \
op monitor interval=43s
primitive ROOTXCATFS Filesystem \
params device="10.2.2.44:/root/.xcat" fstype=nfs options=v3 directory="/root/.xcat" \
op monitor interval=20 timeout=40
primitive apache_xCAT apache \
op start interval=0 timeout=600s \
op stop interval=0 timeout=120s \
op monitor interval=57s timeout=120s \
params configfile="/etc/httpd/conf/httpd.conf" statusurl="http://localhost:80/icons/README.html" testregex="</html>" \
meta target-role=Started
primitive dummy Dummy \
op start interval=0 timeout=600s \
op stop interval=0 timeout=120s \
op monitor interval=57s timeout=120s \
meta target-role=Started
primitive named lsb:named \
op start interval=0 timeout=120s \
op stop interval=0 timeout=120s \
op monitor interval=37s
primitive dhcpd lsb:dhcpd \
op start interval="0" timeout="120s" \
op stop interval="0" timeout="120s" \
op monitor interval="37s"
primitive xCAT lsb:xcatd \
op start interval=0 timeout=120s \
op stop interval=0 timeout=120s \
op monitor interval=42s \
meta target-role=Started
primitive xCAT_conserver lsb:conserver \
op start interval=0 timeout=120s \
op stop interval=0 timeout=120s \
op monitor interval=53
primitive xCATmnVIP IPaddr2 \
params ip=10.2.2.250 cidr_netmask=8 \
op monitor interval=30s
group XCAT_GROUP INSTALLFS ETCXCATFS ROOTXCATFS HPCADMIN ROOTSSHFS \
meta resource-stickiness=100 failure-timeout=60 migration-threshold=3 target-role=Started
clone clone_named named \
meta clone-max=2 clone-node-max=1 notify=false
colocation colo1 inf: NFS_xCAT XCAT_GROUP
colocation colo2 inf: NFSlock_xCAT XCAT_GROUP
colocation colo4 inf: apache_xCAT XCAT_GROUP
colocation colo7 inf: xCAT_conserver XCAT_GROUP
colocation dummy_colocation inf: dummy xCAT
colocation xCAT_colocation inf: xCAT XCAT_GROUP
colocation xCAT_makedns_colocation inf: xCAT xCAT_makedns
order Most_aftergrp inf: XCAT_GROUP ( NFS_xCAT NFSlock_xCAT apache_xCAT xCAT_conserver )
order Most_afterip inf: xCATmnVIP ( apache_xCAT xCAT_conserver )
order clone_named_after_ip_xCAT inf: xCATmnVIP clone_named
order dummy_order0 inf: NFS_xCAT dummy
order dummy_order1 inf: xCAT dummy
order dummy_order2 inf: NFSlock_xCAT dummy
order dummy_order3 inf: clone_named dummy
order dummy_order4 inf: apache_xCAT dummy
order dummy_order7 inf: xCAT_conserver dummy
order dummy_order8 inf: xCAT_makedns dummy
order xcat_makedns inf: xCAT xCAT_makedns
order dummy_order5 inf: dhcpd dummy
property cib-bootstrap-options: \
dc-version=1.1.8-7.el6-394e906 \
cluster-infrastructure="classic openais (with plugin)" \
expected-quorum-votes=2 \
stonith-enabled=false \
last-lrm-refresh=1406859140
\#vim:set syntax=pcmk
Verify auto fail over
-------------------------
#. Online rhmn1
Currently, rhmn1 and rhmn2 are both in standby; let us bring rhmn1 online: ::
rhmn2 ~]# crm node online rhmn1
rhmn2 /]# crm status
Last updated: Mon Aug 4 23:16:44 2014
Last change: Mon Aug 4 23:13:09 2014 via crmd on rhmn2
Stack: classic openais (with plugin)
Current DC: rhmn1 - partition with quorum
Version: 1.1.8-7.el6-394e906
2 Nodes configured, 2 expected votes
12 Resources configured.
Node rhmn2: standby
Online: [ rhmn1 ]
Resource Group: XCAT_GROUP
xCATmnVIP (ocf::heartbeat:IPaddr2): Started rhmn1
INSTALLFS (ocf::heartbeat:Filesystem): Started rhmn1
ETCXCATFS (ocf::heartbeat:Filesystem): Started rhmn1
ROOTXCATFS (ocf::heartbeat:Filesystem): Started rhmn1
NFS_xCAT (lsb:nfs): Started rhmn1
NFSlock_xCAT (lsb:nfslock): Started rhmn1
apache_xCAT (ocf::heartbeat:apache): Started rhmn1
xCAT (lsb:xcatd): Started rhmn1
xCAT_conserver (lsb:conserver): Started rhmn1
dummy (ocf::heartbeat:Dummy): Started rhmn1
Clone Set: clone_named [named]
Started: [ rhmn1 ]
Stopped: [ named:1 ]
#. xCAT on rhmn2 is not working while it is running on rhmn1: ::
rhmn2 /]# lsdef -t site -l
Unable to open socket connection to xcatd daemon on localhost:3001.
Verify that the xcatd daemon is running and that your SSL setup is correct.
Connection failure: IO::Socket::INET: connect: Connection refused at /opt/xcat/lib/perl/xCAT/Client.pm line 217.
rhmn2 /]# ssh rhmn1 "lsxcatd -v"
Version 2.8.4 (git commit 7306ca8abf1c6d8c68d3fc3addc901c1bcb6b7b3, built Mon Apr 21 20:48:59 EDT 2014)
#. Put rhmn1 in standby and bring rhmn2 online; xCAT will then run on rhmn2: ::
rhmn2 /]# crm node online rhmn2
rhmn2 /]# crm node standby rhmn1
rhmn2 /]# crm status
Last updated: Mon Aug 4 23:19:33 2014
Last change: Mon Aug 4 23:19:40 2014 via crm_attribute on rhmn2
Stack: classic openais (with plugin)
Current DC: rhmn1 - partition with quorum
Version: 1.1.8-7.el6-394e906
2 Nodes configured, 2 expected votes
12 Resources configured.
Node rhmn1: standby
Online: [ rhmn2 ]
Resource Group: XCAT_GROUP
xCATmnVIP (ocf::heartbeat:IPaddr2): Started rhmn2
INSTALLFS (ocf::heartbeat:Filesystem): Started rhmn2
ETCXCATFS (ocf::heartbeat:Filesystem): Started rhmn2
ROOTXCATFS (ocf::heartbeat:Filesystem): Started rhmn2
NFSlock_xCAT (lsb:nfslock): Started rhmn2
xCAT (lsb:xcatd): Started rhmn2
Clone Set: clone_named [named]
Started: [ rhmn2 ]
Stopped: [ named:1 ]
rhmn2 /]#lsxcatd -v
Version 2.8.4 (git commit 7306ca8abf1c6d8c68d3fc3addc901c1bcb6b7b3, built Mon Apr 21 20:48:59 EDT 2014)