
NetWorker: Troubleshooting Guide for Red Hat Cluster Service Issue

Summary: This article provides an overview of how to approach NetWorker service startup issues for NetWorker servers deployed on Red Hat pacemaker (pcs) clusters. It is intended for NetWorker backup administrators and NetWorker support personnel to aid in troubleshooting these issues.


Instructions

NetWorker servers can be deployed in a cluster failover configuration on Red Hat nodes using pacemaker (pcs) services. In this configuration, NetWorker is installed on two or more nodes, and the NetWorker server databases reside on a shared storage location that is passed between nodes depending on which node is the "active" node in the pacemaker cluster. The NetWorker server uses a shared cluster name and IP address so that its naming and addressing are consistent regardless of which node is hosting the services. See the NetWorker Cluster Integration Guide, available on the Dell Support product page, for details on how to set up NetWorker in a cluster.


Cluster Topology:

This article uses an example cluster with the following configuration:

NetWorker Cluster Topology

Hostname                 IP Address      Function
lnx-node1.amer.lan       192.168.9.108   Physical Node 1
lnx-node2.amer.lan       192.168.9.109   Physical Node 2
lnx-nwcluster.amer.lan   192.168.9.110   Logical name used by NetWorker

On each node, the /nsr directory used by NetWorker is managed with symbolic links.

Active Node:
An active node where the NetWorker server is started symbolically links /nsr to the shared storage location:
root@lnx-node1:~# ls -l / | grep nsr
lrwxrwxrwx.   1 root root     14 Oct  5 10:49 nsr -> /nsr_share/nsr
drwxr-xr-x.  11 root root    116 Aug 31 17:20 nsr.NetWorker.local
drwxr-xr-x.   3 root root     17 Aug 31 17:23 nsr_share
Passive Node:
A "passive" node symbolically links /nsr to /nsr.NetWorker.local:
root@lnx-node2:~# ls -l / | grep nsr
lrwxrwxrwx.   1 root root     20 Oct  3 17:08 nsr -> /nsr.NetWorker.local
drwxr-xr-x.  11 root root    116 Aug 31 17:19 nsr.NetWorker.local
drwxr-xr-x.   2 root root      6 Aug 31 17:18 nsr_share
When a node is in a passive state, the nsrexecd (NetWorker client) software still runs using /nsr.NetWorker.local. Each physical node has its own client resource using the physical node's DNS-resolvable name and IP address. The NetWorker server only runs from the shared storage (/nsr_share) and uses the shared IP address and hostname; it can be active on only one node at a time.
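A node's current role can be confirmed quickly from where /nsr points. The helper below is a hypothetical sketch (the `nsr_role` name is not a NetWorker command) that assumes the default paths shown above:

```shell
# Hypothetical helper: classify a node as active or passive from the
# /nsr symlink target, using the default paths from this article.
nsr_role() {
  target=$(readlink "${1:-/nsr}")
  case "$target" in
    /nsr_share/*)         echo "active"  ;;  # /nsr -> /nsr_share/nsr
    /nsr.NetWorker.local) echo "passive" ;;  # /nsr -> /nsr.NetWorker.local
    *)                    echo "unknown: ${target:-not a symlink}" ;;
  esac
}
```

Running `nsr_role /nsr` on the node hosting the NetWorker server should print "active"; on the other node it should print "passive".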

The following pacemaker (pcs) commands are used to get an overview of the pacemaker configuration and status:
  • Cluster status:
pcs status
Example:
root@lnx-node1:~# pcs status 
Cluster name: rhelclus 
Status of pacemakerd: 'Pacemaker is running' (last updated 2023-10-05 10:59:19 -04:00) 
Cluster Summary: 
  * Stack: corosync 
  * Current DC: lnx-node1.amer.lan (version 2.1.5-9.3.el8_8-a3f44794f94) - partition with quorum 
  * Last updated: Thu Oct 5 10:59:20 2023 
  * Last change: Thu Oct 5 10:59:13 2023 by root via cibadmin on lnx-node1.amer.lan 
  * 2 nodes configured 
  * 3 resource instances configured 

Node List: 
  * Online: [ lnx-node1.amer.lan lnx-node2.amer.lan ] 

Full List of Resources: 
  * Resource Group: NW_group: 
    * fs (ocf::heartbeat:Filesystem): Started lnx-node1.amer.lan 
    * ip (ocf::heartbeat:IPaddr): Started lnx-node1.amer.lan 
    * nws (ocf::EMC_NetWorker:Server): Started lnx-node1.amer.lan 

Daemon Status: 
  corosync: active/enabled 
  pacemaker: active/enabled 
  pcsd: active/enabled

From the above output, we can determine how many nodes are in the cluster and if any are offline or in standby status. The output also shows which node is hosting the shared file system (fs), cluster resource IP address (ip), and the NetWorker services (nws). The resource names used here are the defaults used in the NetWorker Cluster Integration Guide; however, it is possible that different names are used. If you are using different names, make note of the resource names and replace as needed when following the instructions in this article.
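When the status output is long, the resource names can be pulled out of a saved copy of the `pcs status` output mechanically. The sketch below assumes the output was captured with `pcs status > pcs-status.txt`; the `extract_resources` helper is hypothetical:

```shell
# Hypothetical helper: list resource names from saved `pcs status` output.
# Resource lines look like: "* fs (ocf::heartbeat:Filesystem): Started node1"
extract_resources() {
  grep -oE '\* [A-Za-z0-9_-]+ +\(ocf::' "$1" | awk '{print $2}'
}
```

With the example cluster above, this prints fs, ip, and nws; substitute those names into the commands in the rest of this article if yours differ.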
  • Pacemaker resource configuration:
pcs resource config

Example:

root@lnx-node1:~# pcs resource config 
Group: NW_group 
  Resource: fs (class=ocf provider=heartbeat type=Filesystem)
    Attributes: fs-instance_attributes 
      device=/dev/sdb1 
      directory=/nsr_share 
      fstype=xfs 
    Operations: 
      monitor: fs-monitor-interval-20 
        interval=20 
        timeout=300 
      start: fs-start-interval-0s 
        interval=0s 
        timeout=60s 
      stop: fs-stop-interval-0s interval=0s timeout=60s 
  Resource: ip (class=ocf provider=heartbeat type=IPaddr) 
    Attributes: ip-instance_attributes 
      cidr_netmask=24 
      ip=192.1xx.9.1x0 
      nic=ens192 
    Operations: 
      monitor: ip-monitor-interval-15 
        interval=15 
        timeout=120 
      start: ip-start-interval-0s 
        interval=0s 
        timeout=20s 
      stop: ip-stop-interval-0s 
        interval=0s 
        timeout=20s 
  Resource: nws (class=ocf provider=EMC_NetWorker type=Server) 
    Meta Attributes: nws-meta_attributes 
      is-managed=true 
    Operations: 
      meta-data: nws-meta-data-interval-0 
        interval=0 
        timeout=10 
      migrate_from: nws-migrate_from-interval-0 
        interval=0 
        timeout=120
      migrate_to: nws-migrate_to-interval-0 
        interval=0 
        timeout=60 
      monitor: nws-monitor-interval-100 
        interval=100 
        timeout=1200 
      start: nws-start-interval-0 
        interval=0 
        timeout=600 
      stop: nws-stop-interval-0 
        interval=0 
        timeout=600 
      validate-all: nws-validate-all-interval-0 
        interval=0 
        timeout=10
 
The above command details each pcs resource's configuration. Important things to note during the initial overview:
  • FS resource "device=": This is the block device that provides the shared storage on the node's file system. This device must be the same on each node. This is discussed later in this KB.
  • FS resource "directory=": This is the directory where the shared NetWorker storage is mounted; it serves as the mountpoint for the device in the "device=" field. This is discussed later in this KB.
  • IP resource "ip=": This is the IP address which is associated with the logical (shared) hostname used by the NetWorker server. This IP address is hosted on the active node.
  • Pacemaker visibility of the shared address and storage:
lcmap

Example:

root@lnx-node1:~# lcmap
type: NSR_CLU_TYPE;
clu_type: NSR_LC_TYPE;
interface version: 1.0;

type: NSR_CLU_VIRTHOST;
hostname: 192.168.9.110;
local: TRUE;
owned paths: /nsr_share;

clu_nodes: lnx-node1.amer.lan lnx-node2.amer.lan;

 

NOTE: The hostname should return the IP address matched from the pcs resource config "ip=" field. The owned paths should match the pcs resource config "directory=" field. In some instances, when a startup issue is observed, the lcmap command does not return the hostname, local, or owned paths fields; this is indicative of an issue.
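When comparing lcmap output against the pcs attributes, the fields can be extracted mechanically from a saved copy (for example, `lcmap > lcmap.txt`). The `lcmap_field` helper below is a hypothetical sketch:

```shell
# Hypothetical helper: pull one field value out of saved lcmap output.
# lcmap lines have the form "name: value;".
lcmap_field() {  # usage: lcmap_field "owned paths" lcmap.txt
  sed -n "s/^$1: \(.*\);\$/\1/p" "$2"
}
```

For the example cluster, `lcmap_field hostname lcmap.txt` should return the "ip=" value from the pcs ip resource, and `lcmap_field "owned paths" lcmap.txt` should return the "directory=" value from the fs resource.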
 

Initial Diagnosis:

If NetWorker services fail to start, check the pcs resource status to see which resource is failing:

pcs status
Example: 
root@lnx-node1:~# pcs status 
... 
... 
Node List: 
  * Online: [ lnx-node1.amer.lan lnx-node2.amer.lan ] 

Full List of Resources: 
  * Resource Group: NW_group: 
    * fs    (ocf::heartbeat:Filesystem):   Started lnx-node1.amer.lan 
    * ip    (ocf::heartbeat:IPaddr):       Started lnx-node1.amer.lan 
    * nws   (ocf::EMC_NetWorker:Server):   Started lnx-node1.amer.lan 

Daemon Status: 
  corosync: active/enabled 
  pacemaker: active/enabled 
  pcsd: active/enabled
 
If a failure is observed, a general failure error is returned, and the failed resources show as FAILED.
  • FS (Filesystem): If the Filesystem is in a failed state, see below section on Filesystem Failures.
  • IP (IPaddr): If the IPaddr is in a failed state, see below section on IPaddr Failures.
  • NWS (Server): If the NetWorker server is in a failed state, perform the following:
  1. Review the NetWorker server's daemon.raw for any failure messages which appear during startup. The server's daemon.raw is located in the shared storage path (/nsr_share/nsr/logs/daemon.raw). The physical node's client daemon log is /nsr.NetWorker.local/logs/daemon.raw. See Dell article NetWorker: How to use nsr_render_log.
  2. If default logging is not sufficient, enable debugging as follows:
    a. Attempt to restart the "Server" resource:
pcs resource cleanup nws
    b. Use dbgcommand to enable debug on the nsrd process:
dbgcommand -n nsrd Debug=#
Set a debug level using a number from 1 to 9. Monitor the daemon.raw for any additional messages which may point to the issue.
  3. Review the /var/log/pcsd/pcsd.log for any errors.
  4. Review the /var/log/pacemaker/pacemaker.log for any errors.
  5. Review the /var/log/messages file for any errors.
NOTE: When reviewing the pcsd, pacemaker, and messages logs, look for messages logged during the same time frame in which NetWorker services attempted to start, and review for any errors/failures which coincide with the service startup failure.
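For example, the three logs can be narrowed to the failure window with a simple timestamp filter. The `errors_in_window` helper below is a hypothetical sketch for syslog-style timestamps:

```shell
# Hypothetical helper: show only error/failure lines whose timestamp
# prefix matches the window when the start attempt happened.
errors_in_window() {  # usage: errors_in_window 'Oct  5 10:5' /var/log/messages
  grep "^$1" "$2" | grep -iE 'error|fail'
}
```

For instance, `errors_in_window 'Oct  5 10:5' /var/log/messages` shows error/failure lines logged between 10:50 and 10:59 on Oct 5.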
 

Filesystem Failures:

  1. Review the pacemaker resources:
pcs resource
  2. Review the pacemaker resource configuration for the Filesystem resource:
pcs resource config fs
Example:
 
Make note of the device path, directory path, and fstype.
root@lnx-node1:~# pcs resource
  * Resource Group: NW_group:
    * fs        (ocf::heartbeat:Filesystem):     Started lnx-node1.amer.lan
    * ip        (ocf::heartbeat:IPaddr):         Started lnx-node1.amer.lan
    * nws       (ocf::EMC_NetWorker:Server):     Started lnx-node1.amer.lan
root@lnx-node1:~# pcs resource config fs
Resource: fs (class=ocf provider=heartbeat type=Filesystem)
  Attributes: fs-instance_attributes
    device=/dev/sdb1
    directory=/nsr_share
    fstype=xfs
  Operations:
    monitor: fs-monitor-interval-20
      interval=20
      timeout=300
    start: fs-start-interval-0s
      interval=0s
      timeout=60s
    stop: fs-stop-interval-0s
      interval=0s
      timeout=60s
  3. Confirm whether the device is mounted:
df -h

Example:

root@lnx-node1:~# df -h | grep /nsr_share
/dev/sdb1                                     94G  1.5G   92G   2% /nsr_share
  4. Confirm that the mountpoint is configured correctly, associating the device with the path:
lsblk

Example:

root@lnx-node1:~# lsblk
NAME          MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda             8:0    0   40G  0 disk
├─sda1          8:1    0  600M  0 part /boot/efi
├─sda2          8:2    0    1G  0 part /boot
└─sda3          8:3    0 38.4G  0 part
  ├─rhel-root 253:0    0 34.4G  0 lvm  /
  └─rhel-swap 253:1    0    4G  0 lvm  [SWAP]
sdb             8:16   0  100G  0 disk
└─sdb1          8:17   0 93.1G  0 part /nsr_share
sr0            11:0    1 1024M  0 rom
  5. Confirm that the file system type used by the device is correct:
blkid
Example:
root@lnx-node1:~# blkid 
/dev/mapper/rhel-root: UUID="7cf2f957-18d8-45b8-bf8f-6361aadc3517" BLOCK_SIZE="512" TYPE="xfs" 
/dev/sda3: UUID="QpZ2hK-OuE2-igN0-Ryba-EwMN-uxq1-LE48hD" TYPE="LVM2_member" PARTUUID="1193db91-4b63-4b33-a4d4-03a22317e064" 
/dev/sda1: UUID="F243-AD41" BLOCK_SIZE="512" TYPE="vfat" PARTLABEL="EFI System Partition" PARTUUID="6c81bd63-0249-4bdf-afdb-cdde72034162" 
/dev/sda2: UUID="7677ad6b-8191-4a45-8a8a-16cf7d00d72c" BLOCK_SIZE="512" TYPE="xfs" PARTUUID="57481b7a-83ec-4cd8-bf2d-bca09ac27040" 
/dev/sdb1: UUID="600bca60-dd5d-4162-bf77-0537daa3b1e5" BLOCK_SIZE="512" TYPE="xfs" PARTLABEL="networker" PARTUUID="769aaac2-764b-431d-be21-3b5753d6a5d3" 
/dev/mapper/rhel-swap: UUID="537962b6-07d4-4a40-9687-deab2e488936" TYPE="swap"
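The device/directory pairing from the fs resource can also be checked non-interactively, for example by feeding `findmnt -rn -o SOURCE,TARGET` into a small filter. The `is_mounted_at` helper below is a hypothetical sketch:

```shell
# Hypothetical check: given "device mountpoint" lines on stdin (e.g. from
# `findmnt -rn -o SOURCE,TARGET`), confirm the pcs-configured pair is mounted.
is_mounted_at() {  # usage: findmnt -rn -o SOURCE,TARGET | is_mounted_at /dev/sdb1 /nsr_share
  awk -v dev="$1" -v dir="$2" '$1 == dev && $2 == dir { found=1 } END { exit !found }'
}
```

A zero exit status means the configured device is mounted at the configured directory on this node.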

If the fs (Filesystem) resource is failing to start, this indicates an issue outside of NetWorker. The cluster's system administrator should be engaged to review the cluster's file system configuration and confirm that no issues are observed with the shared storage used by pacemaker. Review additional system logs for any failures with the system or its devices:
  • /var/log/pcsd/pcsd.log 
  • /var/log/pacemaker/pacemaker.log
  • /var/log/messages


IPaddr Failures:

  1. Review the pacemaker resources:
pcs resource
  2. Review the pacemaker resource configuration for the IPaddr resource:
pcs resource config ip
Example:
 
Make note of the IP address and NIC.
 
root@lnx-node1:~# pcs resource
  * Resource Group: NW_group:
    * fs (ocf::heartbeat:Filesystem): Started lnx-node1.amer.lan
    * ip (ocf::heartbeat:IPaddr): Started lnx-node1.amer.lan
    * nws (ocf::EMC_NetWorker:Server): Started lnx-node1.amer.lan
root@lnx-node1:~# pcs resource config ip
Resource: ip (class=ocf provider=heartbeat type=IPaddr)
  Attributes: ip-instance_attributes
    cidr_netmask=24
    ip=192.1xx.9.1x0
    nic=ens192
  Operations:
    monitor: ip-monitor-interval-15
      interval=15
      timeout=120
    start: ip-start-interval-0s
      interval=0s
      timeout=20s
    stop: ip-stop-interval-0s
      interval=0s
      timeout=20s
  3. Confirm that the NIC is available on the system:
ifconfig -a
Example: 
root@lnx-node1:~# ifconfig -a 
ens192: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 192.1xx.9.1x8 netmask 255.255.255.0 broadcast 192.1xx.9.255
        inet6 fe80::250:56ff:fea5:48e1 prefixlen 64 scopeid 0x20<link>
        ether 00:50:56:a5:48:e1 txqueuelen 1000 (Ethernet)
        RX packets 953865 bytes 349705527 (333.5 MiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 1190983 bytes 179749786 (171.4 MiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
        inet 127.0.0.1 netmask 255.0.0.0 
        inet6 ::1 prefixlen 128 scopeid 0x10<host>
        loop txqueuelen 1000 (Local Loopback)
        RX packets 129798 bytes 13274289 (12.6 MiB)
        RX errors 0 dropped 0 overruns 0 frame 0 
        TX packets 129798 bytes 13274289 (12.6 MiB) 
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
 
The IP address shown by ifconfig is the physical node's address; the clustered IP address is hosted on this same NIC when the node is active. Ensure that both nodes are configured to use the same NIC name.
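Note that on newer RHEL releases the net-tools package (which provides ifconfig) may not be installed by default; the iproute2 equivalents show the same information:

```shell
# iproute2 equivalents of ifconfig -a:
ip -brief addr   # one line per interface: name, state, addresses
ip link show     # NIC names and link state; confirm the nic= name exists
```

For a single interface, `ip addr show ens192` (using this article's example NIC name) shows the full detail for the NIC named in the ip resource.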
  4. Does the IP address resolve to the correct (logical) hostname used by the NetWorker server?
nslookup ip 

nslookup logical_name_FQDN 

nslookup logical_name_short
Example:
root@lnx-node1:~# nslookup 192.1xx.9.1x0 
110.9.1xx.1x2.in-addr.arpa name = lnx-nwcluster.amer.lan. 

root@lnx-node1:~# nslookup lnx-nwcluster.amer.lan. 
Server: 192.1xx.9.1x0 
Address: 192.1xx.9.100#53 

Name: lnx-nwcluster.amer.lan 
Address: 192.1xx.9.1x0 

root@lnx-node1:~# nslookup lnx-nwcluster 
Server: 192.1xx.9.1x0 
Address: 192.1xx.9.100#53 

Name: lnx-nwcluster.amer.lan 
Address: 192.1xx.9.1x0


It is also recommended to perform the same steps against the physical node's IP address, FQDN, and shortname. See Dell article Troubleshooting DNS and Name Resolution Issues.
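These lookups can be scripted for all three names at once. The sketch below uses `getent hosts`, which exercises the system resolver (/etc/hosts as well as DNS) and works even where bind-utils/nslookup is not installed; the hostnames are this article's example values:

```shell
# Check forward resolution for the logical and physical names through the
# system resolver (nsswitch: files + DNS), using the example hostnames.
for name in lnx-nwcluster.amer.lan lnx-node1.amer.lan lnx-node2.amer.lan; do
  getent hosts "$name" || echo "no resolution for $name"
done
```

Any "no resolution" line points at a name that neither /etc/hosts nor DNS can resolve on this node.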

  5. Can you reach the cluster IP address using ping?
ping -c 4 ip
Example:
root@lnx-node1:~# ping -c 4 192.1xx.9.1x0 
PING 192.1xx.9.1x0 (192.1xx.9.1x0) 56(84) bytes of data. 
64 bytes from 192.1xx.9.1x0: icmp_seq=1 ttl=64 time=0.051 ms 
64 bytes from 192.1xx.9.1x0: icmp_seq=2 ttl=64 time=0.043 ms 
64 bytes from 192.1xx.9.1x0: icmp_seq=3 ttl=64 time=0.033 ms 
64 bytes from 192.1xx.9.1x0: icmp_seq=4 ttl=64 time=0.034 ms 

--- 192.1xx.9.1x0 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3108ms
rtt min/avg/max/mdev = 0.033/0.040/0.051/0.008 ms
If the ip (IPaddr) resource is failing to start, this indicates an issue outside of NetWorker. The cluster's system administrator and network administrator should be engaged to review the cluster's network configuration and confirm that no issues are observed. Review additional system logs for any failures with the system or its devices:
  • /var/log/pcsd/pcsd.log 
  • /var/log/pacemaker/pacemaker.log
  • /var/log/messages


Other PCS Commands:

Operation                                     Command
Pacemaker or pcs version                      pcs --version
Pacemaker overview                            pcs status
Pacemaker resource overview                   pcs resource
Determine path ownership in a cluster         lcmap
Enable (start) a resource                     pcs resource enable resource_name
Start a pcs resource with debug               pcs resource debug-start resource_name
Review pcs resource configuration settings    pcs resource config resource_name
Disable (stop) a resource                     pcs resource disable resource_name
Restart a failed resource                     pcs resource cleanup resource_name
Stop pacemaker on a node                      pcs cluster stop [--force]
Start pacemaker                               pcs cluster start [--all]
Put a node in standby                         pcs node standby node_name
Bring a node out of standby                   pcs node unstandby node_name

Important Logs and Files:

  • /var/log/messages: Contains global system messages regarding system resources and services. Supplemental command: grep 'pacemaker.*\(error\|warning\)' /var/log/messages
  • /var/log/pacemaker/pacemaker.log: Default pacemaker information logging for pacemaker resources and functions.
  • /var/log/pcsd/pcsd.log: Default pacemaker service/daemon (pcsd) log.
  • /var/log/cluster/corosync.log: Default pacemaker node communication log.
  • /usr/sbin/nw_hae.log: NetWorker (nws) resource start log, as defined in /usr/lib/ocf/resource.d/EMC_NetWorker/Server.
  • /usr/lib/ocf/resource.d/EMC_NetWorker/Server: NetWorker pacemaker resource agent script; it defines the operations performed/managed by pcs.

Affected Products

NetWorker

Products

NetWorker Family, NetWorker Series
Article Properties
Article Number: 000218281
Article Type: How To
Last Modified: 06 May 2024
Version:  4