
NetWorker: Troubleshooting Guide for Red Hat Cluster Service Issue

Summary: This article provides an overview of how to approach NetWorker service startup issues for NetWorker servers deployed on Red Hat pacemaker (pcs) clusters. It is intended for NetWorker backup administrators and NetWorker support personnel to aid in troubleshooting these issues.


Instructions

NetWorker servers can be deployed in a cluster failover configuration on Red Hat nodes using pacemaker (pcs) services. In this configuration, NetWorker is installed on two or more nodes, and the NetWorker server databases reside on a shared storage location that is passed between nodes depending on which node is the "active" node in the pacemaker cluster. The NetWorker server uses a shared cluster name and IP address so that its naming and addressing are consistent regardless of which node is hosting the services. See the NetWorker Cluster Integration Guide, available on the Dell Support product page, for details on how to set up NetWorker in a cluster.


Cluster Topology:

This article uses an example cluster with the following configuration:

NetWorker Cluster Topology

Hostname                 IP Address      Function
lnx-node1.amer.lan       192.168.9.108   Physical Node 1
lnx-node2.amer.lan       192.168.9.109   Physical Node 2
lnx-nwcluster.amer.lan   192.168.9.110   Logical name used by NetWorker

On each node, the /nsr directory used by NetWorker is managed with symbolic links.

Active Node:
An active node where the NetWorker server is started symbolically links /nsr to the shared storage location:
root@lnx-node1:~# ls -l / | grep nsr
lrwxrwxrwx.   1 root root     14 Oct  5 10:49 nsr -> /nsr_share/nsr
drwxr-xr-x.  11 root root    116 Aug 31 17:20 nsr.NetWorker.local
drwxr-xr-x.   3 root root     17 Aug 31 17:23 nsr_share
Passive Node:
A "passive" node symbolically links /nsr to /nsr.NetWorker.local:
root@lnx-node2:~# ls -l / | grep nsr
lrwxrwxrwx.   1 root root     20 Oct  3 17:08 nsr -> /nsr.NetWorker.local
drwxr-xr-x.  11 root root    116 Aug 31 17:19 nsr.NetWorker.local
drwxr-xr-x.   2 root root      6 Aug 31 17:18 nsr_share
When a node is in a passive state, the nsrexecd (NetWorker client) software still runs using /nsr.NetWorker.local. Each physical node has its own client resource using the physical node's DNS-resolvable name and IP address. The NetWorker server only runs from the shared storage (/nsr_share) and uses the shared IP address and hostname; it can be active on only one node at a time.
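A node's current role can be confirmed quickly from where /nsr points. The helper below is a hypothetical sketch (the `nsr_role` name is not a NetWorker command) that assumes the default paths shown above:

```shell
# Hypothetical helper: classify a node as active or passive from the
# /nsr symlink target, using the default paths from this article.
nsr_role() {
  target=$(readlink "${1:-/nsr}")
  case "$target" in
    /nsr_share/*)         echo "active"  ;;  # /nsr -> /nsr_share/nsr
    /nsr.NetWorker.local) echo "passive" ;;  # /nsr -> /nsr.NetWorker.local
    *)                    echo "unknown: ${target:-not a symlink}" ;;
  esac
}
```

Running `nsr_role /nsr` on the node hosting the NetWorker server should print "active"; on the other node it should print "passive".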

The following pacemaker (pcs) commands are used to get an overview of the pacemaker configuration and status:
  • Cluster status:
pcs status
Example:
root@lnx-node1:~# pcs status 
Cluster name: rhelclus 
Status of pacemakerd: 'Pacemaker is running' (last updated 2023-10-05 10:59:19 -04:00) 
Cluster Summary: 
  * Stack: corosync 
  * Current DC: lnx-node1.amer.lan (version 2.1.5-9.3.el8_8-a3f44794f94) - partition with quorum 
  * Last updated: Thu Oct 5 10:59:20 2023 
  * Last change: Thu Oct 5 10:59:13 2023 by root via cibadmin on lnx-node1.amer.lan 
  * 2 nodes configured 
  * 3 resource instances configured 

Node List: 
  * Online: [ lnx-node1.amer.lan lnx-node2.amer.lan ] 

Full List of Resources: 
  * Resource Group: NW_group: 
    * fs (ocf::heartbeat:Filesystem): Started lnx-node1.amer.lan 
    * ip (ocf::heartbeat:IPaddr): Started lnx-node1.amer.lan 
    * nws (ocf::EMC_NetWorker:Server): Started lnx-node1.amer.lan 

Daemon Status: 
  corosync: active/enabled 
  pacemaker: active/enabled 
  pcsd: active/enabled

From the above output, we can determine how many nodes are in the cluster and if any are offline or in standby status. The output also shows which node is hosting the shared file system (fs), cluster resource IP address (ip), and the NetWorker services (nws). The resource names used here are the defaults used in the NetWorker Cluster Integration Guide; however, it is possible that different names are used. If you are using different names, make note of the resource names and replace as needed when following the instructions in this article.
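When the status output is long, the resource names can be pulled out of a saved copy of the `pcs status` output mechanically. The sketch below assumes the output was captured with `pcs status > pcs-status.txt`; the `extract_resources` helper is hypothetical:

```shell
# Hypothetical helper: list resource names from saved `pcs status` output.
# Resource lines look like: "* fs (ocf::heartbeat:Filesystem): Started node1"
extract_resources() {
  grep -oE '\* [A-Za-z0-9_-]+ +\(ocf::' "$1" | awk '{print $2}'
}
```

With the example cluster above, this prints fs, ip, and nws; substitute those names into the commands in the rest of this article if yours differ.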
  • Pacemaker resource configuration:
pcs resource config

Example:

root@lnx-node1:~# pcs resource config 
Group: NW_group 
  Resource: fs (class=ocf provider=heartbeat type=Filesystem)
    Attributes: fs-instance_attributes 
      device=/dev/sdb1 
      directory=/nsr_share 
      fstype=xfs 
    Operations: 
      monitor: fs-monitor-interval-20 
        interval=20 
        timeout=300 
      start: fs-start-interval-0s 
        interval=0s 
        timeout=60s 
      stop: fs-stop-interval-0s interval=0s timeout=60s 
  Resource: ip (class=ocf provider=heartbeat type=IPaddr) 
    Attributes: ip-instance_attributes 
      cidr_netmask=24 
      ip=192.1xx.9.1x0 
      nic=ens192 
    Operations: 
      monitor: ip-monitor-interval-15 
        interval=15 
        timeout=120 
      start: ip-start-interval-0s 
        interval=0s 
        timeout=20s 
      stop: ip-stop-interval-0s 
        interval=0s 
        timeout=20s 
  Resource: nws (class=ocf provider=EMC_NetWorker type=Server) 
    Meta Attributes: nws-meta_attributes 
      is-managed=true 
    Operations: 
      meta-data: nws-meta-data-interval-0 
        interval=0 
        timeout=10 
      migrate_from: nws-migrate_from-interval-0 
        interval=0 
        timeout=120
      migrate_to: nws-migrate_to-interval-0 
        interval=0 
        timeout=60 
      monitor: nws-monitor-interval-100 
        interval=100 
        timeout=1200 
      start: nws-start-interval-0 
        interval=0 
        timeout=600 
      stop: nws-stop-interval-0 
        interval=0 
        timeout=600 
      validate-all: nws-validate-all-interval-0 
        interval=0 
        timeout=10
 
The above command details each pcs resource's configuration. Important things to note during the initial overview:
  • FS resource "device=": This is the block device that provides the shared storage on the node's file system. This device must be the same on each node. This is discussed later in this KB.
  • FS resource "directory=": This is the directory where the shared NetWorker storage is mounted; it serves as the mountpoint for the device in the "device=" field. This is discussed later in this KB.
  • IP resource "ip=": This is the IP address which is associated with the logical (shared) hostname used by the NetWorker server. This IP address is hosted on the active node.
  • Pacemaker visibility of the shared address and storage:
lcmap

Example:

root@lnx-node1:~# lcmap
type: NSR_CLU_TYPE;
clu_type: NSR_LC_TYPE;
interface version: 1.0;

type: NSR_CLU_VIRTHOST;
hostname: 192.168.9.110;
local: TRUE;
owned paths: /nsr_share;

clu_nodes: lnx-node1.amer.lan lnx-node2.amer.lan;

 

NOTE: The hostname should return the IP address matched from the pcs resource config "ip=" field. The owned paths should match the pcs resource config "directory=" field. In some instances, when a startup issue is observed, the lcmap command does not return the hostname, local, or owned paths fields; this is indicative of an issue.
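When comparing lcmap output against the pcs attributes, the fields can be extracted mechanically from a saved copy (for example, `lcmap > lcmap.txt`). The `lcmap_field` helper below is a hypothetical sketch:

```shell
# Hypothetical helper: pull one field value out of saved lcmap output.
# lcmap lines have the form "name: value;".
lcmap_field() {  # usage: lcmap_field "owned paths" lcmap.txt
  sed -n "s/^$1: \(.*\);\$/\1/p" "$2"
}
```

For the example cluster, `lcmap_field hostname lcmap.txt` should return the "ip=" value from the pcs ip resource, and `lcmap_field "owned paths" lcmap.txt` should return the "directory=" value from the fs resource.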
 

Initial Diagnosis:

If NetWorker services fail to start, check the pcs resource status to see which resource is failing:

pcs status
Example: 
root@lnx-node1:~# pcs status 
... 
... 
Node List: 
  * Online: [ lnx-node1.amer.lan lnx-node2.amer.lan ] 

Full List of Resources: 
  * Resource Group: NW_group: 
    * fs    (ocf::heartbeat:Filesystem):   Started lnx-node1.amer.lan 
    * ip    (ocf::heartbeat:IPaddr):       Started lnx-node1.amer.lan 
    * nws   (ocf::EMC_NetWorker:Server):   Started lnx-node1.amer.lan 

Daemon Status: 
  corosync: active/enabled 
  pacemaker: active/enabled 
  pcsd: active/enabled
 
If a failure is observed, a general failure error is returned, and the failed resources show as FAILED.
  • FS (Filesystem): If the Filesystem is in a failed state, see below section on Filesystem Failures.
  • IP (IPaddr): If the IPaddr is in a failed state, see below section on IPaddr Failures.
  • NWS (Server): If the NetWorker server is in a failed state, perform the following:
  1. Review the NetWorker server's daemon.raw for any failure messages which appear during startup. The server's daemon.raw is located in the shared storage path (/nsr_share/nsr/logs/daemon.raw). The physical node's client daemon log is /nsr.NetWorker.local/logs/daemon.raw. See Dell article NetWorker: How to use nsr_render_log.
  2. If default logging is not sufficient, enable debugging as follows:
    a. Attempt to restart the "Server" resource:
pcs resource cleanup nws
    b. Use dbgcommand to enable debug on the nsrd process:
dbgcommand -n nsrd Debug=#
Set a debug level using a number from 1 to 9. Monitor the daemon.raw for any additional messages which may point to the issue.
  3. Review the /var/log/pcsd/pcsd.log for any errors.
  4. Review the /var/log/pacemaker/pacemaker.log for any errors.
  5. Review the /var/log/messages file for any errors.
NOTE: When reviewing the pcsd, pacemaker, and messages logs, look for messages logged during the same time frame in which NetWorker services attempted to start, and review for any errors/failures which coincide with the service startup failure.
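For example, the three logs can be narrowed to the failure window with a simple timestamp filter. The `errors_in_window` helper below is a hypothetical sketch for syslog-style timestamps:

```shell
# Hypothetical helper: show only error/failure lines whose timestamp
# prefix matches the window when the start attempt happened.
errors_in_window() {  # usage: errors_in_window 'Oct  5 10:5' /var/log/messages
  grep "^$1" "$2" | grep -iE 'error|fail'
}
```

For instance, `errors_in_window 'Oct  5 10:5' /var/log/messages` shows error/failure lines logged between 10:50 and 10:59 on Oct 5.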
 

Filesystem Failures:

  1. Review the pacemaker resources:
pcs resource
  2. Review the pacemaker resource configuration for the Filesystem resource:
pcs resource config fs
Example:
 
Make note of the device path, directory path, and fstype.
root@lnx-node1:~# pcs resource
  * Resource Group: NW_group:
    * fs        (ocf::heartbeat:Filesystem):     Started lnx-node1.amer.lan
    * ip        (ocf::heartbeat:IPaddr):         Started lnx-node1.amer.lan
    * nws       (ocf::EMC_NetWorker:Server):     Started lnx-node1.amer.lan
root@lnx-node1:~# pcs resource config fs
Resource: fs (class=ocf provider=heartbeat type=Filesystem)
  Attributes: fs-instance_attributes
    device=/dev/sdb1
    directory=/nsr_share
    fstype=xfs
  Operations:
    monitor: fs-monitor-interval-20
      interval=20
      timeout=300
    start: fs-start-interval-0s
      interval=0s
      timeout=60s
    stop: fs-stop-interval-0s
      interval=0s
      timeout=60s
  3. Confirm whether the device is mounted:
df -h

Example:

root@lnx-node1:~# df -h | grep /nsr_share
/dev/sdb1                                     94G  1.5G   92G   2% /nsr_share
  4. Confirm that the mountpoint is configured correctly, associating the device with the path:
lsblk

Example:

root@lnx-node1:~# lsblk
NAME          MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda             8:0    0   40G  0 disk
├─sda1          8:1    0  600M  0 part /boot/efi
├─sda2          8:2    0    1G  0 part /boot
└─sda3          8:3    0 38.4G  0 part
  ├─rhel-root 253:0    0 34.4G  0 lvm  /
  └─rhel-swap 253:1    0    4G  0 lvm  [SWAP]
sdb             8:16   0  100G  0 disk
└─sdb1          8:17   0 93.1G  0 part /nsr_share
sr0            11:0    1 1024M  0 rom
  5. Confirm that the file system type used by the device is correct:
blkid
Example:
root@lnx-node1:~# blkid 
/dev/mapper/rhel-root: UUID="7cf2f957-18d8-45b8-bf8f-6361aadc3517" BLOCK_SIZE="512" TYPE="xfs" 
/dev/sda3: UUID="QpZ2hK-OuE2-igN0-Ryba-EwMN-uxq1-LE48hD" TYPE="LVM2_member" PARTUUID="1193db91-4b63-4b33-a4d4-03a22317e064" 
/dev/sda1: UUID="F243-AD41" BLOCK_SIZE="512" TYPE="vfat" PARTLABEL="EFI System Partition" PARTUUID="6c81bd63-0249-4bdf-afdb-cdde72034162" 
/dev/sda2: UUID="7677ad6b-8191-4a45-8a8a-16cf7d00d72c" BLOCK_SIZE="512" TYPE="xfs" PARTUUID="57481b7a-83ec-4cd8-bf2d-bca09ac27040" 
/dev/sdb1: UUID="600bca60-dd5d-4162-bf77-0537daa3b1e5" BLOCK_SIZE="512" TYPE="xfs" PARTLABEL="networker" PARTUUID="769aaac2-764b-431d-be21-3b5753d6a5d3" 
/dev/mapper/rhel-swap: UUID="537962b6-07d4-4a40-9687-deab2e488936" TYPE="swap"
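The device/directory pairing from the fs resource can also be checked non-interactively, for example by feeding `findmnt -rn -o SOURCE,TARGET` into a small filter. The `is_mounted_at` helper below is a hypothetical sketch:

```shell
# Hypothetical check: given "device mountpoint" lines on stdin (e.g. from
# `findmnt -rn -o SOURCE,TARGET`), confirm the pcs-configured pair is mounted.
is_mounted_at() {  # usage: findmnt -rn -o SOURCE,TARGET | is_mounted_at /dev/sdb1 /nsr_share
  awk -v dev="$1" -v dir="$2" '$1 == dev && $2 == dir { found=1 } END { exit !found }'
}
```

A zero exit status means the configured device is mounted at the configured directory on this node.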

If the fs (Filesystem) resource is failing to start, this indicates an issue outside of NetWorker. The cluster's system administrator should be engaged to review the cluster's file system configuration and confirm that no issues are observed with the shared storage used by pacemaker. Review additional system logs for any failures with the system or its devices:
  • /var/log/pcsd/pcsd.log 
  • /var/log/pacemaker/pacemaker.log
  • /var/log/messages


IPaddr Failures:

  1. Review the pacemaker resources:
pcs resource
  2. Review the pacemaker resource configuration for the IPaddr resource:
pcs resource config ip
Example:
 
Make note of the IP address and NIC.
 
root@lnx-node1:~# pcs resource
  * Resource Group: NW_group:
    * fs (ocf::heartbeat:Filesystem): Started lnx-node1.amer.lan
    * ip (ocf::heartbeat:IPaddr): Started lnx-node1.amer.lan
    * nws (ocf::EMC_NetWorker:Server): Started lnx-node1.amer.lan
root@lnx-node1:~# pcs resource config ip
Resource: ip (class=ocf provider=heartbeat type=IPaddr)
  Attributes: ip-instance_attributes
    cidr_netmask=24
    ip=192.1xx.9.1x0
    nic=ens192
  Operations:
    monitor: ip-monitor-interval-15
      interval=15
      timeout=120
    start: ip-start-interval-0s
      interval=0s
      timeout=20s
    stop: ip-stop-interval-0s
      interval=0s
      timeout=20s
  3. Confirm that the NIC is available on the system:
ifconfig -a
Example: 
root@lnx-node1:~# ifconfig -a 
ens192: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 192.1xx.9.1x8 netmask 255.255.255.0 broadcast 192.1xx.9.255
        inet6 fe80::250:56ff:fea5:48e1 prefixlen 64 scopeid 0x20<link>
        ether 00:50:56:a5:48:e1 txqueuelen 1000 (Ethernet)
        RX packets 953865 bytes 349705527 (333.5 MiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 1190983 bytes 179749786 (171.4 MiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
        inet 127.0.0.1 netmask 255.0.0.0 
        inet6 ::1 prefixlen 128 scopeid 0x10<host>
        loop txqueuelen 1000 (Local Loopback)
        RX packets 129798 bytes 13274289 (12.6 MiB)
        RX errors 0 dropped 0 overruns 0 frame 0 
        TX packets 129798 bytes 13274289 (12.6 MiB) 
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
 
The IP address shown by ifconfig is the physical node's address; the clustered IP address is hosted on this same NIC when the node is active. Ensure that both nodes are configured to use the same NIC name.
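Note that on newer RHEL releases the net-tools package (which provides ifconfig) may not be installed by default; the iproute2 equivalents show the same information:

```shell
# iproute2 equivalents of ifconfig -a:
ip -brief addr   # one line per interface: name, state, addresses
ip link show     # NIC names and link state; confirm the nic= name exists
```

For a single interface, `ip addr show ens192` (using this article's example NIC name) shows the full detail for the NIC named in the ip resource.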
  4. Does the IP address resolve to the correct (logical) hostname used by the NetWorker server?
nslookup ip 

nslookup logical_name_FQDN 

nslookup logical_name_short
Example:
root@lnx-node1:~# nslookup 192.1xx.9.1x0 
110.9.1xx.1x2.in-addr.arpa name = lnx-nwcluster.amer.lan. 

root@lnx-node1:~# nslookup lnx-nwcluster.amer.lan. 
Server: 192.1xx.9.1x0 
Address: 192.1xx.9.100#53 

Name: lnx-nwcluster.amer.lan 
Address: 192.1xx.9.1x0 

root@lnx-node1:~# nslookup lnx-nwcluster 
Server: 192.1xx.9.1x0 
Address: 192.1xx.9.100#53 

Name: lnx-nwcluster.amer.lan 
Address: 192.1xx.9.1x0


It is also recommended to perform the same steps against the physical node's IP address, FQDN, and shortname. See Dell article Troubleshooting DNS and Name Resolution Issues.
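These lookups can be scripted for all three names at once. The sketch below uses `getent hosts`, which exercises the system resolver (/etc/hosts as well as DNS) and works even where bind-utils/nslookup is not installed; the hostnames are this article's example values:

```shell
# Check forward resolution for the logical and physical names through the
# system resolver (nsswitch: files + DNS), using the example hostnames.
for name in lnx-nwcluster.amer.lan lnx-node1.amer.lan lnx-node2.amer.lan; do
  getent hosts "$name" || echo "no resolution for $name"
done
```

Any "no resolution" line points at a name that neither /etc/hosts nor DNS can resolve on this node.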

  5. Can you reach the cluster IP address using ping?
ping -c 4 ip
Example:
root@lnx-node1:~# ping -c 4 192.1xx.9.1x0 
PING 192.1xx.9.1x0 (192.1xx.9.1x0) 56(84) bytes of data. 
64 bytes from 192.1xx.9.1x0: icmp_seq=1 ttl=64 time=0.051 ms 
64 bytes from 192.1xx.9.1x0: icmp_seq=2 ttl=64 time=0.043 ms 
64 bytes from 192.1xx.9.1x0: icmp_seq=3 ttl=64 time=0.033 ms 
64 bytes from 192.1xx.9.1x0: icmp_seq=4 ttl=64 time=0.034 ms 

--- 192.1xx.9.1x0 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3108ms
rtt min/avg/max/mdev = 0.033/0.040/0.051/0.008 ms
If the ip (IPaddr) resource is failing to start, this indicates an issue outside of NetWorker. The cluster's system administrator and network administrator should be engaged to review the cluster's network configuration and confirm that no issues are observed. Review additional system logs for any failures with the system or its devices:
  • /var/log/pcsd/pcsd.log 
  • /var/log/pacemaker/pacemaker.log
  • /var/log/messages


Other PCS Commands:

Operation                                     Command
Pacemaker or pcs version                      pcs --version
Pacemaker overview                            pcs status
Pacemaker resource overview                   pcs resource
Determine path ownership in a cluster         lcmap
Enable (start) a resource                     pcs resource enable resource_name
Start a pcs resource with debug               pcs resource debug-start resource_name
Review pcs resource configuration settings    pcs resource config resource_name
Disable (stop) a resource                     pcs resource disable resource_name
Restart a failed resource                     pcs resource cleanup resource_name
Stop pacemaker on a node                      pcs cluster stop [--force]
Start pacemaker                               pcs cluster start [--all]
Put a node in standby                         pcs node standby node_name
Bring a node out of standby                   pcs node unstandby node_name

Important Logs and Files:

  • /var/log/messages: Contains global system messages regarding system resources and services. Supplemental command: grep 'pacemaker.*\(error\|warning\)' /var/log/messages
  • /var/log/pacemaker/pacemaker.log: Default pacemaker information logging for pacemaker resources and functions.
  • /var/log/pcsd/pcsd.log: Default pacemaker service/daemon (pcsd) log.
  • /var/log/cluster/corosync.log: Default pacemaker node communication log.
  • /usr/sbin/nw_hae.log: NetWorker (nws) resource start log, as defined in /usr/lib/ocf/resource.d/EMC_NetWorker/Server.
  • /usr/lib/ocf/resource.d/EMC_NetWorker/Server: NetWorker pacemaker resource agent script; it defines the operations performed/managed by pcs.

Affected Products

NetWorker

Products

NetWorker Family, NetWorker Series
Article Properties
Article Number: 000218281
Article Type: How To
Last Modified: 06 May 2024
Version:  4