diff --git a/docs/source/guides/admin-guides/manage_clusters/common/deployment/enable_kdump.rst b/docs/source/guides/admin-guides/manage_clusters/common/deployment/enable_kdump.rst index 8a6158914..10f5ec4eb 100644 --- a/docs/source/guides/admin-guides/manage_clusters/common/deployment/enable_kdump.rst +++ b/docs/source/guides/admin-guides/manage_clusters/common/deployment/enable_kdump.rst @@ -1,87 +1,64 @@ -Enable Kdump Over Ethernet +Enable kdump Over Ethernet ========================== Overview -------- -kdump is an advanced crash dumping mechanism. When enabled, the system is booted from the context of another kernel. This second kernel reserves a small amount of memory, and its only purpose is to capture the core dump image in case the system crashes. Being able to analyze the core dump helps significantly to determine the exact cause of the system failure. +kdump is an feature of the Linux kernel that allows the system to be booted from the context of another kernel. This second kernel reserves a small amount of memory and its only purpose is to capture the core dump in the event of a kernel crash. The ability to analyze the core dump helps to determine causes of system failures. xCAT Interface -------------- -The pkglist, exclude and postinstall files location and name can be obtained by running the following command: :: +The following attributes of an osimage should be modified to enable ``kdump``: - lsdef -t osimage +* pkglist +* exlist +* postinstall +* dump +* crashkernelsize +* postscripts -Here is an example: :: +Configure the ``pkglist`` file +------------------------------ - lsdef -t osimage rhels7.1-ppc64le-netboot-compute - Object name: rhels7.1-ppc64le-netboot-compute - exlist=/opt/xcat/share/xcat/netboot/rh/compute.rhels7.ppc64le.exlist - imagetype=linux - osarch=ppc64le - osdistroname=rhels7.1-ppc64le - osname=Linux - osvers=rhels7.1 - otherpkgdir=/install/post/otherpkgs/rhels7.1/ppc64le - permission=755 - pkgdir=/install/rhels7.1/ppc64le - pkglist=/opt/xcat/share/xcat/netboot/rh/compute.rhels7.ppc64le.pkglist - postinstall=/opt/xcat/share/xcat/netboot/rh/compute.rhels7.ppc64le.postinstall - profile=compute - provmethod=netboot - rootimgdir=/install/netboot/rhels7.1/ppc64le/compute +The ``pkglist`` for the osimage needs to include the appropriate RPMs. The following list of RPMs are provided as a sample, always refer to the Operating System specific documentataion to ensure the required packages are there for ``kdump`` support. -In above example, pkglist file is /opt/xcat/share/xcat/netboot/rh/compute.rhels7.ppc64le.pkglist, exclude files is in /opt/xcat/share/xcat/netboot/rh/compute.rhels7.ppc64le.exlist, and postinstall file is /opt/xcat/share/xcat/netboot/rh/compute.rhels7.ppc64le.postinstall. - -Setup pkglist -------------- - -Before setting up kdump, the appropriate rpms should be added to the pkglist file.Here is the rpm packages list which needs to be added to pkglist file for kdump for different OS. - -* **[RHEL]** :: +* **[RHELS]** :: kexec-tools crash -* **[SLES11]** :: +* **[SLES]** :: kdump kexec-tools makedumpfile -* **[SLES10]** :: - - kernel-kdump - kexec-tools - kdump - makedumpfile - * **[Ubuntu]** :: -The exclude file ----------------- +Modify the ``exlist`` file +-------------------------- -The base diskless image excludes the /boot directory, but it is required for kdump. Update the exlist file and remove the entry for /boot. Then run the packimage or liteimg command to update your image with the changes. +The default diskless image created by ``copycds`` excludes the ``/boot`` directory in the exclude list file, but this is required for ``kdump``. - +Update the ``exlist`` for the target osimage and remove the line ``/boot``: :: -The postinstall file --------------------- + ./boot* # <-- remove this line -The kdump will create a new initrd which used in the dumping stage. The /tmp or /var/tmp directory will be used as the temporary directory. These 2 directory only are allocated 10M space by default. You need to enlarge it to 200M. Modify the postinstall file to increase /tmp space. +Run ``packimage`` to update the diskless image with the changes. + +The ``postinstall`` file +------------------------ + +The kdump will create a new initrd which used in the dumping stage. The ``/tmp`` or ``/var/tmp`` directory will be used as the temporary directory. These 2 directory only are allocated 10M space by default. You need to enlarge it to 200M. Modify the postinstall file to increase ``/tmp`` space. * **[RHELS]** :: tmpfs /var/tmp tmpfs defaults,size=200m 0 2 -* **[SLES10]** :: - - tmpfs /var/tmp tmpfs defaults,size=200m 0 2 - * **[SLES11]** :: tmpfs /tmp tmpfs defaults,size=200m 0 2 @@ -90,113 +67,107 @@ The kdump will create a new initrd which used in the dumping stage. The /tmp or -The dump attribute ------------------- +The ``dump`` attribute +---------------------- -In order to support kdump, the dump attribute was added into linuximage table, which is used to define the remote path where the crash information should be dumped to. Use the chdef command to change the image's dump attribute using the URI format. :: +To support kernel dumps, the ``dump`` attribute **must** be set on the osimage definition. If not set, kdump service will not be enabled. The ``dump`` attribute defines the NFS remote path where the crash information is to be stored. + +Use the ``chdef`` command to set a value of the ``dump`` attribute: :: chdef -t osimage dump=nfs:/// -The can be excluded if the destination NFS server is the service or management node. :: +If the NFS server is the Service Node or Management Node, the server can be left out: :: chdef -t osimage dump=nfs:/// -The crashkernelsize attribute ------------------------------ +**Note:** Only NFS is currently supported as a storage location. Make sure the NFS remote path(``nfs:///``) is exported and it is read-writeable to the node where kdump service is enabled. -For system x machine, on sles10 set the crashkernelsize attribute like this: :: - chdef -t osimage crashkernelsize=M@16M +The ``crashkernelsize`` attribute +--------------------------------- -On sles11, rhels6 and above set the crashkernelsize attribute like this: :: +To allow the Operating System to automatically reserve the appropriate amount of memory for the ``kdump`` kernel, set ``crashkernelsize=auto``. + +For setting specific sizes, use the following example: + +* For System X machines, set the ``crashkernelsize`` using this format: :: chdef -t osimage crashkernelsize=M -Where recommended value is 256. For more information about the size can refer to the following information: - ``_. - - ``_. - - ``_. +* For Power System AC922, set the ``crashkernelsize`` using this format: :: - ``_. - -For system p machine, set the crashkernelsize attribute to this: :: + chdef -t osimage crashkernelsize=M + +* For System P machines, set the ``crashkernelsize`` using this format: :: chdef -t osimage crashkernelsize=@32M -For Power System AC922, set the crashkernelsize attribute to this: :: - - chdef -t osimage crashkernelsize=M +**Notes**: the value of the ``crashkernelsize`` depends on the total physical memory size on the machine. For more about size, refer to `Appedix`_ -Where recommended value is 256, more information can refer the kdump document for the system x. - -**Notes**: The ``crashkernel=`` option can be defined in multiple ways. If it is hard to decide the specific crashkernelsize, auto value can be used, the auto value enables automatic configuration of reserved memory based on the total amount of memory in the system like ``crashkernel=auto``. - -When your node starts, and you get a kdump start error like this: :: +If kdump start error like this: :: Your running kernel is using more than 70% of the amount of space you reserved for kdump, you should consider increasing your crashkernel -You should modify this attribute using this chdef command: :: +The ``crashkernelsize`` is not large enough, you should change the ``crashkernelsize`` larger until the error message disappear. - chdef -t osimage crashkernelsize=512M@32M +The ``enablekdump`` postscript +------------------------------ -If 512M@32M is not large enough, you should change the crashkernelsize larger like 1024M until the error message disappear. - -The enablekdump postscript --------------------------- - -This postscript enablekdump is used to start the kdump service when the node is booting up. Add it to your nodes list of postscripts by running this command: :: +xCAT provides a postscript ``enablekdump`` that can be added to the Nodes to automatically start the ``kdump`` service when the node boots. Add to the nodes using the following command: :: chdef -t node -p postscripts=enablekdump -Notes ------ +Manually trigger a kernel panic on Linux +---------------------------------------- -Currently, only NFS is supported for the setup of kdump. +Normally, kernel ``panic()`` will trigger booting into capture kernel. Once the kernel panic is triggered, the node will reboot into the capture kernel, and a kernel dump (vmcore) will be automatically saved to the directory on the specified NFS server (````). -If the dump attribute is not set, the kdump service will not be enabled. +Check your Operating System specific documentation for the path where the kernel dump is saved. For example: -Make sure the NFS remote path(nfs:///) is exported and it is read-writeable to the node where kdump service is enabled. +* **[RHELS6]** :: -How to trigger kernel panic on Linux ------------------------------------- - -Normally, kernel panic() will trigger booting into capture kernel. Once the kernel panic is triggered, the node will reboot into the capture kernel, and a kernel dump (vmcore) will be automatically saved to the directory on the specified NFS server (). - -#. For RHESL6 the directory is /var/crash/-