
Cleanup service node setup docs

This commit is contained in:
Mark Gurevich
2017-02-08 16:48:38 -05:00
parent d7ca3e7b69
commit a02f2523ae
4 changed files with 62 additions and 90 deletions

View File

@@ -1,66 +1,18 @@
Appendix B: Diagnostics
=======================
* **root ssh keys not setup** -- If you are prompted for a password when ssh to
the service node, then check to see if /root/.ssh has authorized_keys. If
the directory does not exist or no keys, on the MN, run xdsh service -K,
to exchange the ssh keys for root. You will be prompted for the root
password, which should be the password you set for the key=system in the
passwd table.
* **XCAT rpms not on SN** --On the SN, run rpm -qa | grep xCAT and make sure
the appropriate xCAT rpms are installed on the servicenode. See the list of
xCAT rpms in :ref:`setup_service_node_stateful_label`. If rpms
missing check your install setup as outlined in Build the Service Node
Stateless Image for diskless or :ref:`setup_service_node_stateful_label` for
diskful installs.
* **otherpkgs(including xCAT rpms) installation failed on the SN** --The OS
repository is not created on the SN. When the "yum" command is processing
the dependency, the rpm packages (including expect, nmap, and httpd, etc)
required by xCATsn can't be found. In this case, check whether the
``/install/postscripts/repos/<osver>/<arch>/`` directory exists on the MN.
If it is not on the MN, you need to re-run the "copycds" command, and there
will be some file created under the
``/install/postscripts/repos/<osver>/<arch>`` directory on the MN. Then, you
need to re-install the SN, and this issue should be gone.
* **Error finding the database/starting xcatd** -- If on the Service node when
you run tabdump site, you get "Connection failure: IO::Socket::SSL:
connect: Connection refused at ``/opt/xcat/lib/perl/xCAT/Client.pm``". Then
restart the xcatd daemon and see if it passes by running the command:
service xcatd restart. If it fails with the same error, then check to see
if ``/etc/xcat/cfgloc`` file exists. It should exist and be the same as
``/etc/xcat/cfgloc`` on the MN. If it is not there, copy it from the MN to
the SN. The run service xcatd restart. This indicates the servicenode
postscripts did not complete successfully. Check to see your postscripts
table was setup correctly in :ref:`add_service_node_postscripts_label` to the
postscripts table.
* **Error accessing database/starting xcatd credential failure**-- If you run
tabdump site on the servicenode and you get "Connection failure:
IO::Socket::SSL: SSL connect attempt failed because of handshake
problemserror:14094418:SSL routines:SSL3_READ_BYTES:tlsv1 alert unknown ca
at ``/opt/xcat/lib/perl/xCAT/Client.pm``", check ``/etc/xcat/cert``. The
directory should contain the files ca.pem and server-cred.pem. These were
suppose to transfer from the MN ``/etc/xcat/cert`` directory during the
install. Also check the ``/etc/xcat/ca`` directory. This directory should
contain most files from the ``/etc/xcat/ca`` directory on the MN. You can
manually copy them from the MN to the SN, recursively. This indicates the
the servicenode postscripts did not complete successfully. Check to see
your postscripts table was setup correctly in
:ref:`add_service_node_postscripts_label` to the postscripts table. Again
service xcatd restart and try the tabdump site again.
* **Missing ssh hostkeys** -- Check to see if ``/etc/xcat/hostkeys`` on the SN,
has the same files as ``/etc/xcat/hostkeys`` on the MN. These are the ssh
keys that will be installed on the compute nodes, so root can ssh between
compute nodes without password prompting. If they are not there copy them
from the MN to the SN. Again, these should have been setup by the
servicenode postscripts.
* **root ssh keys not setup** -- If you are prompted for a password when you ssh to the service node, check whether the ``/root/.ssh`` directory on the MN has an ``authorized_keys`` file. If the directory does not exist or contains no keys, run ``xdsh service -K`` to exchange the ssh keys for root. You will be prompted for the root password, which should be the password you set for ``key=system`` in the passwd table.
* **Errors running hierarchical commands such as xdsh** -- xCAT has a number of
commands that run hierarchically. That is, the commands are sent from xcatd
on the management node to the correct service node xcatd, which in turn
processes the command and sends the results back to xcatd on the management
node. If a hierarchical command such as xcatd fails with something like
"Error: Permission denied for request", check ``/var/log/messages`` on the
management node for errors. One error might be "Request matched no policy
rule". This may mean you will need to add policy table entries for your
xCAT management node and service node:
* **xCAT rpms not on SN** -- On the SN, run ``rpm -qa | grep xCAT`` and make sure the appropriate xCAT rpms are installed on the service node. See the list of xCAT rpms in :ref:`setup_service_node_stateful_label`. If rpms are missing, check your install setup as outlined in :ref:`setup_service_node_stateless_label` for diskless or :ref:`setup_service_node_stateful_label` for diskful installs.
* **otherpkgs (including xCAT rpms) installation failed on the SN** -- The OS repository is not created on the SN. When the ``yum`` command is processing the dependencies, the rpm packages (including expect, nmap, httpd, etc.) required by xCATsn can't be found. In this case, check whether the ``/install/postscripts/repos/<osver>/<arch>/`` directory exists on the MN. If it does not, re-run the ``copycds`` command, which creates files under the ``/install/postscripts/repos/<osver>/<arch>`` directory on the MN. Then re-install the SN.
* **Error finding the database/starting xcatd** -- If running ``tabdump site`` on the service node returns "Connection failure: IO::Socket::SSL: connect: Connection refused at ``/opt/xcat/lib/perl/xCAT/Client.pm``", restart the xcatd daemon with ``service xcatd restart`` and see if the error goes away. If it fails with the same error, check whether the ``/etc/xcat/cfgloc`` file exists. It should exist and be the same as ``/etc/xcat/cfgloc`` on the MN. If it is not there, copy it from the MN to the SN, then run ``service xcatd restart``. This indicates the servicenode postscripts did not complete successfully. Run ``lsdef <service node> -i postscripts -c`` and verify the ``servicenode`` postscript appears in the list.
* **Error accessing database/starting xcatd credential failure** -- If you run ``tabdump site`` on the service node and get "Connection failure: IO::Socket::SSL: SSL connect attempt failed because of handshake problemserror:14094418:SSL routines:SSL3_READ_BYTES:tlsv1 alert unknown ca at ``/opt/xcat/lib/perl/xCAT/Client.pm``", check ``/etc/xcat/cert``. The directory should contain the files ``ca.pem`` and ``server-cred.pem``. These were supposed to be transferred from the MN ``/etc/xcat/cert`` directory during the install. Also check the ``/etc/xcat/ca`` directory; it should contain most files from the ``/etc/xcat/ca`` directory on the MN. You can manually copy them from the MN to the SN, recursively. This indicates the servicenode postscripts did not complete successfully. Run ``lsdef <service node> -i postscripts -c`` and verify the ``servicenode`` postscript appears in the list, then run ``service xcatd restart`` and try ``tabdump site`` again.
* **Missing ssh hostkeys** -- Check whether ``/etc/xcat/hostkeys`` on the SN has the same files as ``/etc/xcat/hostkeys`` on the MN. These are the ssh keys that will be installed on the compute nodes, so root can ssh between compute nodes without password prompting. If they are not there, copy them from the MN to the SN. Again, these should have been set up by the servicenode postscripts.
* **Errors running hierarchical commands such as xdsh** -- xCAT has a number of commands that run hierarchically. That is, the commands are sent from xcatd on the management node to the correct service node xcatd, which in turn processes the command and sends the results back to xcatd on the management node. If a hierarchical command such as ``xdsh`` fails with something like "Error: Permission denied for request", check ``/var/log/messages`` on the management node for errors. One error might be "Request matched no policy rule". This may mean you need to add policy table entries for your xCAT management node and service node (see the sketch after this list).
* **/install is not mounted on service node from management node** -- If the service node does not have the ``/install`` directory mounted from the management node, run ``lsdef -t site clustersite -i installloc`` and verify ``installloc="/install"``.
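
The checks above lend themselves to a quick pass from the MN. Below is a minimal sketch, assuming a hypothetical service node named *sn1*; the policy ``priority=1.4`` and ``rule=allow`` values are illustrative only, so verify them against your own policy table ::

    # exchange root ssh keys with all service nodes
    xdsh service -K

    # confirm xCAT rpms and credentials landed on the SN
    xdsh sn1 "rpm -qa | grep -i xCAT"
    xdsh sn1 "ls /etc/xcat/cfgloc /etc/xcat/cert /etc/xcat/hostkeys"

    # the database location on the SN should match the MN copy
    cat /etc/xcat/cfgloc
    xdsh sn1 "cat /etc/xcat/cfgloc"

    # add a policy rule for the service node (illustrative values), then restart xcatd
    chtab priority=1.4 policy.name=sn1 policy.rule=allow
    service xcatd restart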

View File

@@ -6,33 +6,51 @@ Diskful (Stateful) Installation
Any cluster using statelite compute nodes must use stateful (diskful) Service Nodes.
**Note: All xCAT Service Nodes must be at the exact same xCAT version as the xCAT Management Node**. Copy the files to the Management Node (MN) and untar them in the appropriate sub-directory of ``/install/post/otherpkgs``
**Note:** All xCAT Service Nodes must be at the exact same xCAT version as the xCAT Management Node.
**Note for the appropriate directory below, check the ``otherpkgdir=/install/post/otherpkgs/rhels7/x86_64`` attribute of the osimage defined for the servicenode.**
For example, for osimage rhels7-x86_64-install-service ::
Configure ``otherpkgdir`` and ``otherpkglist`` for service node osimage
----------------------------------------------------------------------
mkdir -p /install/post/otherpkgs/**rhels7**/x86_64/xcat
cd /install/post/otherpkgs/**rhels7**/x86_64/xcat
* Create a subdirectory ``xcat`` under the path specified by the ``otherpkgdir`` attribute of the service node osimage, selected during the :doc:`../define_service_nodes` step.
For example, for osimage *rhels7-x86_64-install-service* ::
[root@fs4 xcat]# lsdef -t osimage rhels7-x86_64-install-service -i otherpkgdir
Object name: rhels7-x86_64-install-service
otherpkgdir=/install/post/otherpkgs/rhels7/x86_64
[root@fs4 xcat]# mkdir -p /install/post/otherpkgs/rhels7/x86_64/xcat
* Download or copy `xcat-core` and `xcat-dep` .bz2 files into that `xcat` directory ::
wget https://xcat.org/files/xcat/xcat-core/<version>_Linux/xcat-core/xcat-core-<version>-linux.tar.bz2
wget https://xcat.org/files/xcat/xcat-dep/<version>_Linux/xcat-dep-<version>-linux.tar.bz2
* Untar the `xcat-core` and `xcat-dep` .bz2 files ::
cd /install/post/otherpkgs/<os>/<arch>/xcat
tar jxvf xcat-core-<version>-linux.tar.bz2
tar jxvf xcat-dep-<version>-linux.tar.bz2
Next, add rpm names into your own version of service.<osver>.<arch>.otherpkgs.pkglist file. In most cases, you can find an initial copy of this file under ``/opt/xcat/share/xcat/install/<platform>`` . Or copy one from another similar platform. ::
* Verify the following entries are included in the package file specified by the ``otherpkglist`` attribute of the service node osimage ::
mkdir -p /install/custom/install/rh
cp /opt/xcat/share/xcat/install/rh/service.rhels7.x86_64.otherpkgs.pkglist \
/install/custom/install/rh
vi /install/custom/install/rh/service.rhels7.x86_64.otherpkgs.pkglist
xcat/xcat-core/xCATsn
xcat/xcat-dep/<os>/<arch>/conserver-xcat
xcat/xcat-dep/<os>/<arch>/perl-Net-Telnet
xcat/xcat-dep/<os>/<arch>/perl-Expect
Make sure the following entries are included in the
/install/custom/install/rh/service.rhels7.x86_64.otherpkgs.pkglist: ::
For example, for the osimage *rhels7-x86_64-install-service* ::
xCATsn
conserver-xcat
perl-Net-Telnet
perl-Expect
[root@fs4 ~]# lsdef -t osimage rhels7-x86_64-install-service -i otherpkglist
Object name: rhels7-x86_64-install-service
otherpkglist=/opt/xcat/share/xcat/install/rh/service.rhels7.x86_64.otherpkgs.pkglist
[root@fs4 ~]# cat /opt/xcat/share/xcat/install/rh/service.rhels7.x86_64.otherpkgs.pkglist
xcat/xcat-core/xCATsn
xcat/xcat-dep/rh7/x86_64/conserver-xcat
xcat/xcat-dep/rh7/x86_64/perl-Net-Telnet
xcat/xcat-dep/rh7/x86_64/perl-Expect
[root@fs4 ~]#
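
After untarring, the directory layout under ``otherpkgdir`` should match the relative paths in the pkglist. A quick check, assuming the *rhels7/x86_64* example above ::

    [root@fs4 ~]# ls /install/post/otherpkgs/rhels7/x86_64/xcat
    xcat-core  xcat-dep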
**Note: you will be installing the xCAT Service Node rpm xCATsn meta-package on the Service Node, not the xCAT Management Node meta-package. Do not install both.**
**Note:** You will be installing the xCAT Service Node rpm xCATsn meta-package on the Service Node, not the xCAT Management Node meta-package. Do not install both.
Update the rhels6 RPM repository (rhels6 only)
----------------------------------------------
@@ -71,12 +89,12 @@ Update the rhels6 RPM repository (rhels6 only)
**Note:** You should use comps-rhel6-Server.xml with its key as the group file.
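
A minimal sketch of that repository update, assuming ``copycds`` placed the OS under the default ``/install/rhels6/x86_64`` location; the group file name and path are assumptions to adjust against your own repodata ::

    # rebuild the repo metadata with the RHEL6 group file (paths are assumptions)
    cd /install/rhels6/x86_64/Server
    createrepo -g comps-rhel6-Server.xml .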
Set the node status to ready for installation
---------------------------------------------
Set the node boot state to ready for installation
-------------------------------------------------
Run nodeset to the osimage name defined in the provmethod attribute on your Service Node. ::
Run the **nodeset** command with the osimage name defined in the ``provmethod`` attribute of your Service Node. ::
nodeset service osimage="<osimagename>"
nodeset <service_node> osimage="<osimagename>"
For example ::
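
    # hypothetical values: a service node named sn1, osimage from the examples above
    nodeset sn1 osimage="rhels7-x86_64-install-service"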

View File

@@ -1,3 +1,5 @@
.. _setup_service_node_stateless_label:
Diskless (Stateless) Installation
=================================

View File

@@ -2,10 +2,10 @@ Verify Service Node Installation
================================
* ssh to the service nodes. You should not be prompted for a password.
* Check to see that the xcat daemon xcatd is running.
* Run some database command on the service node, e.g tabdump site, or nodels,
and see that the database can be accessed from the service node.
* Check that ``/install`` and ``/tftpboot`` are mounted on the service node
from the Management Node, if appropriate.
* Make sure that the Service Node has Name resolution for all nodes, it will
service.
* Check to see that the xcat daemon ``xcatd`` is running.
* Run some database command on the service node, e.g. ``tabdump site`` or ``nodels``, and see that the database can be accessed from the service node.
* Check that ``/install`` and ``/tftpboot`` are mounted on the service node from the Management Node, if appropriate.
* Make sure that the Service Node has name resolution for all nodes it will service.
* Run ``updatenode <compute node> -V -s`` on the management node and verify the output contains ``Running command on <service node>``, which indicates the command from the management node was sent to the service node to run against the compute node target.
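
A minimal end-to-end check, assuming a hypothetical service node *sn1* serving compute node *cn1* ::

    ssh sn1                                # should not prompt for a password
    service xcatd status                   # xcatd should be running
    tabdump site                           # database reachable from the SN
    mount | grep -E '/install|/tftpboot'   # mounted from the MN, if appropriate
    exit
    updatenode cn1 -V -s                   # on the MN; expect "Running command on sn1"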
See :doc:`Appendix B <../appendix/appendix_b_diagnostics>` for possible solutions.