mirror of https://github.com/xcat2/xcat-core.git synced 2025-06-03 03:50:08 +00:00

Merge pull request #4990 from neo954/cuda-doc

NVIDIA CUDA driver installation document updating
This commit is contained in:
zet809 2018-03-28 12:55:21 +08:00 committed by GitHub
commit dc63be034b
GPG Key ID: 4AEE18F83AFDEB23
5 changed files with 112 additions and 92 deletions

@ -1,22 +1,20 @@
Deploy CUDA nodes
=================
Diskful
Diskful
-------
* To provision diskful nodes using osimage ``rhels7.2-ppc64le-install-cudafull``: ::
* To provision diskful nodes using osimage ``rhels7.5-ppc64le-install-cudafull``: ::
nodeset <noderange> osimage=rhels7.2-ppc64le-install-cudafull
nodeset <noderange> osimage=rhels7.5-ppc64le-install-cudafull
rsetboot <noderange> net
rpower <noderange> boot
rpower <noderange> boot
Diskless
--------
* To provision diskless nodes using osimage ``rhels7.2-ppc64le-netboot-cudafull``: ::
* To provision diskless nodes using osimage ``rhels7.5-ppc64le-netboot-cudafull``: ::
nodeset <noderange> osimage=rhels7.2-ppc64le-netboot-cudafull
nodeset <noderange> osimage=rhels7.5-ppc64le-netboot-cudafull
rsetboot <noderange> net
rpower <noderange> boot
rpower <noderange> boot
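
The diskful and diskless flows above run the same three commands and differ only in the osimage name. A minimal dry-run sketch of that sequence follows; ``provision_cmds`` is a hypothetical helper (not part of xCAT) and ``gpu01`` is an illustrative noderange. It only prints the commands so they can be reviewed before being run on the management node.

```shell
#!/bin/sh
# Hypothetical helper (not an xCAT command): print, but do not run,
# the provisioning commands for a noderange and an osimage.
provision_cmds() {
    noderange=$1   # illustrative value, e.g. "gpu01"
    osimage=$2     # e.g. rhels7.5-ppc64le-install-cudafull
    printf 'nodeset %s osimage=%s\n' "$noderange" "$osimage"
    printf 'rsetboot %s net\n' "$noderange"
    printf 'rpower %s boot\n' "$noderange"
}

# Diskful example; swap "install" for "netboot" in the osimage name for diskless.
provision_cmds gpu01 rhels7.5-ppc64le-install-cudafull
```

Piping the output to ``sh`` on the management node would execute the commands, assuming the osimage definition already exists.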

@ -5,9 +5,9 @@ CUDA (Compute Unified Device Architecture) is a parallel computing platform and
For more information, see NVIDIA's website: https://developer.nvidia.com/cuda-zone
xCAT supports CUDA installation for Ubuntu 14.04.3 and RHEL 7.2LE on PowerNV (Non-Virtualized) for both diskful and diskless nodes.
xCAT supports CUDA installation for Ubuntu 14.04.3 and RHEL 7.5 on PowerNV (Non-Virtualized) for both diskful and diskless nodes.
Within the NVIDIA CUDA Toolkit, installing the ``cuda`` package will install both the ``cuda-runtime`` and the ``cuda-toolkit``. The ``cuda-toolkit`` is intended for developing CUDA programs and monitoring CUDA jobs. If your particular installation requires only running GPU jobs, it's recommended to install only the ``cuda-runtime`` package.
Within the NVIDIA CUDA Toolkit, installing the ``cuda`` package will install both the ``cuda-runtime`` and the ``cuda-toolkit``. The ``cuda-toolkit`` is intended for developing CUDA programs and monitoring CUDA jobs. If your particular installation requires only running GPU jobs, it's recommended to install only the ``cuda-runtime`` package.
.. toctree::
:maxdepth: 2

@ -1,15 +1,15 @@
RHEL 7.2 LE
===========
RHEL 7.5
========
xCAT provides sample package list (pkglist) files for CUDA. You can find them at:
xCAT provides sample package list (pkglist) files for CUDA. You can find them at:
* Diskful: ``/opt/xcat/share/xcat/install/rh/cuda*``
* Diskless: ``/opt/xcat/share/xcat/netboot/rh/cuda*``
Diskful images
---------------
--------------
The following examples will create diskful images for ``cudafull`` and ``cudaruntime``. The osimage definitions will be created from the base ``rhels7.2-ppc64le-install-compute`` osimage.
The following examples will create diskful images for ``cudafull`` and ``cudaruntime``. The osimage definitions will be created from the base ``rhels7.5-ppc64le-install-compute`` osimage.
**[Note]**: There is a requirement to reboot the machine after the CUDA drivers are installed. To satisfy this requirement, the CUDA software is installed in the ``pkglist`` attribute of the osimage definition where a reboot will happen after the Operating System is installed.
@ -18,18 +18,18 @@ cudafull
#. Create a copy of the ``install-compute`` image and label it ``cudafull``: ::
lsdef -t osimage -z rhels7.2-ppc64le-install-compute \
lsdef -t osimage -z rhels7.5-ppc64le-install-compute \
| sed 's/install-compute:/install-cudafull:/' \
| mkdef -z
| mkdef -z
#. Add the CUDA repo created in the previous step to the ``pkgdir`` attribute: ::
chdef -t osimage -o rhels7.2-ppc64le-install-cudafull -p \
pkgdir=/install/cuda-7.5/ppc64le/cuda-core,/install/cuda-7.5/ppc64le/cuda-deps
chdef -t osimage -o rhels7.5-ppc64le-install-cudafull -p \
pkgdir=/install/cuda-9.2/ppc64le/cuda-core,/install/cuda-9.2/ppc64le/cuda-deps
#. Use the provided ``cudafull`` pkglist to install the CUDA packages: ::
chdef -t osimage -o rhels7.2-ppc64le-install-cudafull \
chdef -t osimage -o rhels7.5-ppc64le-install-cudafull \
pkglist=/opt/xcat/share/xcat/install/rh/cudafull.rhels7.ppc64le.pkglist
cudaruntime
@ -37,54 +37,54 @@ cudaruntime
#. Create a copy of the ``install-compute`` image and label it ``cudaruntime``: ::
lsdef -t osimage -z rhels7.2-ppc64le-install-compute \
lsdef -t osimage -z rhels7.5-ppc64le-install-compute \
| sed 's/install-compute:/install-cudaruntime:/' \
| mkdef -z
| mkdef -z
#. Add the CUDA repo created in the previous step to the ``pkgdir`` attribute: ::
chdef -t osimage -o rhels7.2-ppc64le-install-cudaruntime -p \
pkgdir=/install/cuda-7.5/ppc64le/cuda-core,/install/cuda-7.5/ppc64le/cuda-deps
chdef -t osimage -o rhels7.5-ppc64le-install-cudaruntime -p \
pkgdir=/install/cuda-9.2/ppc64le/cuda-core,/install/cuda-9.2/ppc64le/cuda-deps
#. Use the provided ``cudaruntime`` pkglist to install the CUDA packages: ::
chdef -t osimage -o rhels7.2-ppc64le-install-cudaruntime \
chdef -t osimage -o rhels7.5-ppc64le-install-cudaruntime \
pkglist=/opt/xcat/share/xcat/install/rh/cudaruntime.rhels7.ppc64le.pkglist
Diskless images
---------------
The following examples will create diskless images for ``cudafull`` and ``cudaruntime``. The osimage definitions will be created from the base ``rhels7.2-ppc64le-netboot-compute`` osimage.
The following examples will create diskless images for ``cudafull`` and ``cudaruntime``. The osimage definitions will be created from the base ``rhels7.5-ppc64le-netboot-compute`` osimage.
**[Note]**: For diskless, the install of the CUDA packages MUST be done in the ``otherpkglist`` and **NOT** the ``pkglist`` as with diskful. The requirement for rebooting the machine is not applicable in diskless nodes because the image is loaded on each reboot.
**[Note]**: For diskless, the install of the CUDA packages MUST be done in the ``otherpkglist`` and **NOT** the ``pkglist`` as with diskful. The requirement for rebooting the machine is not applicable in diskless nodes because the image is loaded on each reboot.
cudafull
^^^^^^^^
#. Create a copy of the ``netboot-compute`` image and label it ``cudafull``: ::
lsdef -t osimage -z rhels7.2-ppc64le-netboot-compute \
lsdef -t osimage -z rhels7.5-ppc64le-netboot-compute \
| sed 's/netboot-compute:/netboot-cudafull:/' \
| mkdef -z
| mkdef -z
#. Verify that the CUDA repo created in the previous step is available in the directory specified by the ``otherpkgdir`` attribute.
#. Verify that the CUDA repo created in the previous step is available in the directory specified by the ``otherpkgdir`` attribute.
The ``otherpkgdir`` directory can be obtained by running lsdef on the osimage: ::
# lsdef -t osimage rhels7.2-ppc64le-netboot-cudafull -i otherpkgdir
Object name: rhels7.2-ppc64le-netboot-cudafull
otherpkgdir=/install/post/otherpkgs/rhels7.2/ppc64le
# lsdef -t osimage rhels7.5-ppc64le-netboot-cudafull -i otherpkgdir
Object name: rhels7.5-ppc64le-netboot-cudafull
otherpkgdir=/install/post/otherpkgs/rhels7.5/ppc64le
Create a symbolic link of the CUDA repository in the directory specified by ``otherpkgdir`` ::
ln -s /install/cuda-7.5 /install/post/otherpkgs/rhels7.2/ppc64le/cuda-7.5
ln -s /install/cuda-9.2 /install/post/otherpkgs/rhels7.5/ppc64le/cuda-9.2
#. Change the ``rootimgdir`` for the cudafull osimage: ::
chdef -t osimage -o rhels7.2-ppc64le-netboot-cudafull \
rootimgdir=/install/netboot/rhels7.2/ppc64le/cudafull
chdef -t osimage -o rhels7.5-ppc64le-netboot-cudafull \
rootimgdir=/install/netboot/rhels7.5/ppc64le/cudafull
#. Create a custom pkglist file to install additional operating system packages for your CUDA node.
#. Create a custom pkglist file to install additional operating system packages for your CUDA node.
#. Copy the default compute pkglist file as a starting point: ::
@ -102,7 +102,7 @@ cudafull
#. Set the new file as the ``pkglist`` attribute for the cudafull osimage: ::
chdef -t osimage -o rhels7.2-ppc64le-netboot-cudafull \
chdef -t osimage -o rhels7.5-ppc64le-netboot-cudafull \
pkglist=/install/custom/netboot/rh/cudafull.rhels7.ppc64le.pkglist
@ -111,48 +111,48 @@ cudafull
#. Create the otherpkg.pkglist file for cudafull: ::
vi /install/custom/netboot/rh/cudafull.rhels7.ppc64le.otherpkgs.pkglist
# add the following packages
cuda-7.5/ppc64le/cuda-deps/dkms
cuda-7.5/ppc64le/cuda-core/cuda
# add the following packages
cuda-9.2/ppc64le/cuda-deps/dkms
cuda-9.2/ppc64le/cuda-core/cuda
#. Set the ``otherpkglist`` attribute for the cudafull osimage: ::
chdef -t osimage -o rhels7.2-ppc64le-netboot-cudafull \
chdef -t osimage -o rhels7.5-ppc64le-netboot-cudafull \
otherpkglist=/install/custom/netboot/rh/cudafull.rhels7.ppc64le.otherpkgs.pkglist
#. Generate the image: ::
genimage rhels7.2-ppc64le-netboot-cudafull
genimage rhels7.5-ppc64le-netboot-cudafull
#. Package the image: ::
packimage rhels7.2-ppc64le-netboot-cudafull
packimage rhels7.5-ppc64le-netboot-cudafull
cudaruntime
^^^^^^^^^^^
#. Create a copy of the ``netboot-compute`` image and label it ``cudaruntime``: ::
lsdef -t osimage -z rhels7.2-ppc64le-netboot-compute \
lsdef -t osimage -z rhels7.5-ppc64le-netboot-compute \
| sed 's/netboot-compute:/netboot-cudaruntime:/' \
| mkdef -z
#. Verify that the CUDA repo created previously is available in the directory specified by the ``otherpkgdir`` attribute.
#. Verify that the CUDA repo created previously is available in the directory specified by the ``otherpkgdir`` attribute.
#. Obtain the ``otherpkgdir`` directory using the ``lsdef`` command: ::
# lsdef -t osimage rhels7.2-ppc64le-netboot-cudaruntime -i otherpkgdir
Object name: rhels7.2-ppc64le-netboot-cudaruntime
otherpkgdir=/install/post/otherpkgs/rhels7.2/ppc64le
# lsdef -t osimage rhels7.5-ppc64le-netboot-cudaruntime -i otherpkgdir
Object name: rhels7.5-ppc64le-netboot-cudaruntime
otherpkgdir=/install/post/otherpkgs/rhels7.5/ppc64le
#. Create a symbolic link to the CUDA repository in the directory specified by ``otherpkgdir`` ::
ln -s /install/cuda-7.5 /install/post/otherpkgs/rhels7.2/ppc64le/cuda-7.5
ln -s /install/cuda-9.2 /install/post/otherpkgs/rhels7.5/ppc64le/cuda-9.2
#. Change the ``rootimgdir`` for the cudaruntime osimage: ::
chdef -t osimage -o rhels7.2-ppc64le-netboot-cudaruntime \
rootimgdir=/install/netboot/rhels7.2/ppc64le/cudaruntime
chdef -t osimage -o rhels7.5-ppc64le-netboot-cudaruntime \
rootimgdir=/install/netboot/rhels7.5/ppc64le/cudaruntime
#. Create the ``otherpkgs.pkglist`` file to install the CUDA runtime packages:
@ -161,19 +161,52 @@ cudaruntime
vi /install/custom/netboot/rh/cudaruntime.rhels7.ppc64le.otherpkgs.pkglist
# Add the following packages:
cuda-7.5/ppc64le/cuda-deps/dkms
cuda-7.5/ppc64le/cuda-core/cuda-runtime-7-5
cuda-9.2/ppc64le/cuda-deps/dkms
cuda-9.2/ppc64le/cuda-core/cuda-runtime-9-2
#. Set the ``otherpkglist`` attribute for the cudaruntime osimage: ::
chdef -t osimage -o rhels7.2-ppc64le-netboot-cudaruntime \
chdef -t osimage -o rhels7.5-ppc64le-netboot-cudaruntime \
otherpkglist=/install/custom/netboot/rh/cudaruntime.rhels7.ppc64le.otherpkgs.pkglist
#. Generate the image: ::
genimage rhels7.2-ppc64le-netboot-cudaruntime
genimage rhels7.5-ppc64le-netboot-cudaruntime
#. Package the image: ::
packimage rhels7.2-ppc64le-netboot-cudaruntime
packimage rhels7.5-ppc64le-netboot-cudaruntime
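
The two ``otherpkgs.pkglist`` files above differ only in the final package name, which follows from the CUDA version. A small sketch of how those entries could be generated is shown here; ``cuda_otherpkgs`` is a hypothetical helper, and the ``cuda-<version>/ppc64le`` layout is assumed to match the symbolic link created under ``otherpkgdir``.

```shell
#!/bin/sh
# Hypothetical helper: print the otherpkgs.pkglist entries for one CUDA flavor.
# Assumes the repo is linked under otherpkgdir as cuda-<version>/ppc64le.
cuda_otherpkgs() {
    version=$1   # e.g. 9.2
    flavor=$2    # cudafull or cudaruntime
    repo="cuda-$version/ppc64le"
    case $flavor in
        cudafull)    pkg="cuda" ;;
        # the runtime package name uses dashes: 9.2 -> cuda-runtime-9-2
        cudaruntime) pkg="cuda-runtime-$(printf '%s' "$version" | tr . -)" ;;
        *) echo "unknown flavor: $flavor" >&2; return 1 ;;
    esac
    echo "$repo/cuda-deps/dkms"
    echo "$repo/cuda-core/$pkg"
}

cuda_otherpkgs 9.2 cudaruntime
```

Redirecting the output to ``/install/custom/netboot/rh/cudaruntime.rhels7.ppc64le.otherpkgs.pkglist`` would produce the same file that the ``vi`` step creates by hand.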
POWER9 Setup
------------
The NVIDIA CUDA driver for POWER9 needs some additional setup. Refer to the URL below for details.
http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#power9-setup
xCAT includes a sample script, ``cuda_power9_setup``, to help users handle this setup.
Diskful osimage
^^^^^^^^^^^^^^^
For diskful deployment, there is no need to change the osimage definition. Instead, add this postscript to the ``postbootscripts`` list of your compute nodes: ::
chdef p9compute -p postbootscripts=cuda_power9_setup
Diskless osimage
^^^^^^^^^^^^^^^^
For diskless deployment, the script must be added to the postinstall script of the osimage and run in the chroot environment. The following commands are an example: ::
mkdir -p /install/custom/netboot/rh
cp /opt/xcat/share/xcat/netboot/rh/compute.rhels7.ppc64le.postinstall /install/custom/netboot/rh/cudafull.rhels7.ppc64le.postinstall
cat >>/install/custom/netboot/rh/cudafull.rhels7.ppc64le.postinstall <<-EOF
cp /install/postscripts/cuda_power9_setup /install/netboot/rhels7.5/ppc64le/cudafull/rootimg/tmp/cuda_power9_setup
chroot /install/netboot/rhels7.5/ppc64le/cudafull/rootimg /tmp/cuda_power9_setup
rm -f /install/netboot/rhels7.5/ppc64le/cudafull/rootimg/tmp/cuda_power9_setup
EOF
chdef -t osimage rhels7.5-ppc64le-netboot-cudafull postinstall=/install/custom/netboot/rh/cudafull.rhels7.ppc64le.postinstall
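
The lines appended to the postinstall script hard-code the root image path, so they have to match the ``rootimgdir`` of the osimage they are used with. One way to avoid a mismatch is to generate them from a single path, as in this sketch. ``power9_postinstall_lines`` is a hypothetical helper, and the root image is assumed to live in ``<rootimgdir>/rootimg``.

```shell
#!/bin/sh
# Hypothetical helper: print the postinstall lines that copy cuda_power9_setup
# into the root image, run it under chroot, and remove it again.
power9_postinstall_lines() {
    rootimg=$1                  # assumed to be <rootimgdir>/rootimg
    script=cuda_power9_setup
    printf 'cp /install/postscripts/%s %s/tmp/%s\n' "$script" "$rootimg" "$script"
    printf 'chroot %s /tmp/%s\n' "$rootimg" "$script"
    printf 'rm -f %s/tmp/%s\n' "$rootimg" "$script"
}

# rootimgdir of the cudafull osimage as set earlier in this document
power9_postinstall_lines /install/netboot/rhels7.5/ppc64le/cudafull/rootimg
```

The output could be appended to the custom postinstall file in place of the here-document.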

@ -1,31 +1,27 @@
RHEL 7.2 LE
===========
RHEL 7.5
========
#. On the management node (MN), create a repository from the CUDA Toolkit rpm: ::
# For cuda toolkit name: /root/cuda-repo-rhel7-7-5-local-7.5-18.ppc64le.rpm
# extract the contents from the rpm
# For cuda toolkit name: /path/to/cuda-repo-rhel7-9-2-local-9.2.64-1.ppc64le.rpm
# extract the contents from the rpm
mkdir -p /tmp/cuda
cd /tmp/cuda
rpm2cpio /root/cuda-repo-rhel7-7-5-local-7.5-18.ppc64le.rpm | cpio -i -d
rpm2cpio /path/to/cuda-repo-rhel7-9-2-local-9.2.64-1.ppc64le.rpm | cpio -i -d
# Create the repo directory under xCAT /install dir for cuda 7.5
mkdir -p /install/cuda-7.5/ppc64le/cuda-core
cp -r /tmp/cuda/var/cuda-repo-7-5-local/* /install/cuda-7.5/ppc64le/cuda-core
# Create the repo directory under xCAT /install dir for cuda 9.2
mkdir -p /install/cuda-9.2/ppc64le/cuda-core
cp /tmp/cuda/var/cuda-repo-9-2-local/*.rpm /install/cuda-9.2/ppc64le/cuda-core
# Create the yum repo files
createrepo /install/cuda-9.2/ppc64le/cuda-core
# Create the yum repo files
createrepo /install/cuda-7.5/ppc64le/cuda-core
#. The NVIDIA CUDA Toolkit contains rpms that have dependencies on other external packages (such as ``DKMS``). These are provided by EPEL. It's up to the system administrator to obtain the dependency packages and add those to the ``cuda-deps`` directory: ::
mkdir -p /install/cuda-7.5/ppc64le/cuda-deps
cd /install/cuda-7.5/ppc64le/cuda-deps
mkdir -p /install/cuda-9.2/ppc64le/cuda-deps
# Copy the DKMS rpm to this directory
ls
dkms-2.2.0.3-30.git.7c3e7c5.el7.noarch.rpm
# Execute createrepo in this directory
createrepo /install/cuda-7.5/ppc64le/cuda-deps
# Copy the DKMS rpm to this directory
cp /path/to/dkms-2.4.0-1.20170926git959bd74.el7.noarch.rpm /install/cuda-9.2/ppc64le/cuda-deps
# Execute createrepo in this directory
createrepo /install/cuda-9.2/ppc64le/cuda-deps
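
Before pointing an osimage at the new directories, it can help to confirm that each one holds at least one rpm and the metadata that ``createrepo`` writes (``repodata/repomd.xml``). The following is a minimal sketch; ``check_repo`` is a hypothetical helper, not an xCAT command.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the directory looks like a usable
# yum repository (at least one rpm plus createrepo metadata).
check_repo() {
    dir=$1
    # If the glob matches nothing it stays literal and the -e test fails.
    set -- "$dir"/*.rpm
    [ -e "$1" ] || { echo "no rpm files in $dir" >&2; return 1; }
    [ -f "$dir/repodata/repomd.xml" ] || {
        echo "no repodata in $dir, run createrepo first" >&2
        return 1
    }
    echo "ok: $dir"
}

# Usage on the management node (paths from the steps above):
#   check_repo /install/cuda-9.2/ppc64le/cuda-core
#   check_repo /install/cuda-9.2/ppc64le/cuda-deps
```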

@ -3,43 +3,36 @@ Update NVIDIA Driver
To update to a newer NVIDIA driver on the system, follow the :doc:`Create CUDA software repository </advanced/gpu/nvidia/repo/index>` document to create another repository for the new driver.
The following example assumes the new driver is in ``/install/cuda-7.5/ppc64le/nvidia_new``.
The following example assumes the new driver is in ``/install/cuda-9.2/ppc64le/nvidia_new``.
Diskful
-------
#. Change pkgdir for the cuda image: ::
chdef -t osimage -o rhels7.2-ppc64le-install-cudafull \
pkgdir=/install/cuda-7.5/ppc64le/nvidia_new,/install/cuda-7.5/ppc64le/cuda-deps
chdef -t osimage -o rhels7.5-ppc64le-install-cudafull \
pkgdir=/install/cuda-9.2/ppc64le/nvidia_new,/install/cuda-9.2/ppc64le/cuda-deps
#. Use xdsh command to remove all the NVIDIA rpms: ::
xdsh <noderange> "yum remove *nvidia* -y"
xdsh <noderange> "yum remove *nvidia* -y"
#. Run updatenode command to update NVIDIA driver on the compute node: ::
updatenode <noderange> -S
#. Reboot compute node: ::
rpower <noderange> off
rpower <noderange> on
#. Verify the newer driver level: ::
nvidia-smi | grep Driver
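
The verification step can be scripted by extracting only the version number from the ``nvidia-smi`` banner. This sketch assumes the banner contains the text ``Driver Version: <number>``; ``driver_version`` is a hypothetical helper and the sample line, including the version shown, is illustrative rather than captured from a real node.

```shell
#!/bin/sh
# Hypothetical helper: read nvidia-smi output on stdin and print the
# driver version from each "Driver Version: X.Y" occurrence.
driver_version() {
    sed -n 's/.*Driver Version: *\([0-9][0-9.]*\).*/\1/p'
}

# Illustrative banner line; on a node this would be: nvidia-smi | driver_version
printf '| NVIDIA-SMI 396.26    Driver Version: 396.26    |\n' | driver_version
```

Comparing the printed value before and after ``updatenode`` shows whether the new driver is active.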
Diskless
--------
To update to a new NVIDIA driver on diskless compute nodes, re-generate the osimage pointing to the new NVIDIA driver repository and reboot the node to load the diskless image.
To update to a new NVIDIA driver on diskless compute nodes, re-generate the osimage pointing to the new NVIDIA driver repository and reboot the node to load the diskless image.
Refer to :doc:`Create osimage definitions </advanced/gpu/nvidia/osimage/index>` for specific instructions.
Refer to :doc:`Create osimage definitions </advanced/gpu/nvidia/osimage/index>` for specific instructions.