This article describes recommended best practices for backing up non-PowerScale Hadoop environments to a Dell PowerScale cluster. With robust erasure-coding data protection that provides greater than 80% storage efficiency, Dell PowerScale is an ideal backup target for data located on a Hadoop cluster. DistCp (distributed copy) is a standard tool that comes with all Hadoop distributions and versions. It can copy entire Hadoop directories, and it runs as a MapReduce job to perform file copies in parallel, using the full capacity of your cluster if required. There is also an option to limit bandwidth to control the impact on other tasks.
ENVIRONMENT
The test environment for this article is a Pivotal HD (PHD) Hadoop cluster backed up to a Dell PowerScale cluster whose HDFS SmartConnect zone is all-nc-s-hdfs.
Because DistCp is a standard Hadoop tool, the approach outlined in this document is applicable to most, if not all, other Hadoop distributions and versions.
While reading this document, assume that the data to back up is located on the PHD Hadoop HDFS cluster in the directory /mydata. The examples back up this data to the PowerScale cluster in the directory /ifs/hadoop/backup/mydata.
Figure 1: Backing up a Hadoop cluster to Isilon
The simplest backup command is shown below:
[gpadmin@phddas2-0 ~]$ hadoop distcp -skipcrccheck -update /mydata hdfs://all-nc-s-hdfs/backup/mydata
You can run the above command on any host that has the Hadoop client (hadoop) installed. The user running the command must have permission to read the source files and write the target files.
The -skipcrccheck and -update options must be specified to avoid the CRC check on the target files placed on the PowerScale cluster. PowerScale does not store the Hadoop CRC, and calculating it would be too expensive, so these options are required to prevent CRC-related errors.
The next parameter "/mydata" is the source path on the source Hadoop cluster. This could also be "/" to back up your entire HDFS namespace. Since the path is not fully qualified, it uses the HDFS NameNode specified in the fs.defaultFS parameter of core-site.xml.
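For reference, fs.defaultFS is typically set in core-site.xml with a property entry like the following (the NameNode host name and port shown here are illustrative):
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://phd-namenode.example.com:8020</value>
</property>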
The final parameter "hdfs://all-nc-s-hdfs/backup/mydata" is the target path on your PowerScale cluster. The host portion "all-nc-s-hdfs" can be a relative or fully qualified DNS name such as all-nc-s-hdfs.example.com. It should be the SmartConnect Zone DNS name for your PowerScale cluster. The directory portion "/backup/mydata" is relative to the HDFS root path defined in your PowerScale cluster access zone. If your HDFS root path is /ifs/hadoop, then this value refers to /ifs/hadoop/backup/mydata.
Files whose sizes are identical in the source and target directories are assumed to be unchanged and are not copied. In particular, file timestamps are not used to determine changed files. For more details on DistCp, see the Hadoop DistCp Version 2 Guide.
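As mentioned in the introduction, you can also limit the impact of the copy on other tasks. For example, the following variation of the backup command caps each map task at 10 MB/s and runs at most 20 simultaneous maps (both values are illustrative; tune them for your environment):
[gpadmin@phddas2-0 ~]$ hadoop distcp -skipcrccheck -update -bandwidth 10 -m 20 /mydata hdfs://all-nc-s-hdfs/backup/mydata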
COPYING PERMISSIONS
By default, the owner, group, and permissions of the target files are reset to the defaults for new files created by the user initiating DistCp, and any owner, group, and permissions defined for the source files are lost. To retain this information from the source files, use the -p option. Because the -p option must perform chown/chgrp, the user initiating DistCp must be a superuser on the target system. The root user on the PowerScale cluster works for this purpose. For example:
[root@phddas2-0 ~]$ hadoop distcp -skipcrccheck -update -pugp /mydata hdfs://all-nc-s-hdfs/backup/mydata
USING SNAPSHOTS FOR YOUR BACKUP SOURCE
The backup of large datasets may take a long time. Files that exist when the directory structure is scanned at the start of the DistCp process may no longer exist when they are actually copied, which produces errors. Further, an application may require a consistent, single point-in-time backup for the backup to be usable. To deal with these issues, it is recommended that you create an HDFS snapshot of your source so that the dataset does not change during the backup process. This is unrelated to the SnapshotIQ feature of your target PowerScale cluster.
To use HDFS snapshots, you must first allow snapshots for a particular directory:
[gpadmin@phddas2-0 ~]$ hdfs dfsadmin -allowSnapshot /mydata
Allowing snapshot on /mydata succeeded
Immediately before a backup with DistCp, create the HDFS snapshot:
[gpadmin@phddas2-0 ~]$ hdfs dfs -createSnapshot /mydata backupsnap
Created snapshot /mydata/.snapshot/backupsnap
The name of this snapshot is backupsnap. You can access it at the HDFS path /mydata/.snapshot/backupsnap. Any changes to your HDFS files after this snapshot are not reflected in the subsequent backup. You can back up the snapshot to PowerScale using the following command:
[gpadmin@phddas2-0 ~]$ hadoop distcp -skipcrccheck -update /mydata/.snapshot/backupsnap hdfs://all-nc-s-hdfs/backup/mydata
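As an optional check, you can confirm exactly what the snapshot (and therefore the backup) contains by listing the snapshot path like any other HDFS directory:
[gpadmin@phddas2-0 ~]$ hdfs dfs -ls -R /mydata/.snapshot/backupsnap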
When the backup command finishes running, you can delete the snapshot. Doing so frees up any space used to hold older versions of files that were modified since the snapshot:
[gpadmin@phddas2-0 ~]$ hdfs dfs -deleteSnapshot /mydata backupsnap
USING PowerScale SNAPSHOTS FOR YOUR BACKUP TARGET
Independently of using snapshots for your backup source, you may want to keep multiple snapshots of your backup target directory so that you can restore older versions of files.
To create snapshots on PowerScale, you must have a SnapshotIQ license. You can create snapshots using the web administration interface or the CLI. To create a single PowerScale snapshot manually with the CLI, SSH into any PowerScale node and run the following:
all-nc-s-1# isi snapshot snapshots create /ifs/hadoop/backup/mydata --name backup-2014-07-01 --expires 1D --verbose
Created snapshot backup-2014-07-01 with ID 6
You can add this command to the backup process discussed in the Scheduling Backups section below.
For more details regarding PowerScale OneFS snapshots, see the PowerScale OneFS CLI Administration Guide for your version of OneFS, available from the PowerScale OneFS Info Hubs.
SYNCIQ REPLICATION FOR MULTIPLE PowerScale CLUSTERS
After the DistCp backup to the PowerScale cluster completes, you can use OneFS SyncIQ to replicate snapshots across a WAN to other PowerScale clusters. Replicated snapshots can provide a versatile and efficient component of your disaster recovery strategy.
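As a sketch, a SyncIQ policy that replicates the backup directory to a second cluster nightly could be created from the OneFS CLI along these lines (the policy name, target host, and schedule are assumptions, and the exact arguments vary by OneFS version, so consult the CLI Administration Guide for your release):
all-nc-s-1# isi sync policies create hadoop-backup-dr sync /ifs/hadoop/backup/mydata dr-cluster.example.com /ifs/hadoop/backup/mydata --schedule "every day at 01:00"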
Figure 2: SyncIQ replication for multiple Isilon clusters
HANDLING DELETED FILES
By default, files deleted from the source Hadoop cluster are not deleted from the backup target. If you require this behavior, add the -delete argument to the DistCp command, as shown below. When using this option, it is recommended to use snapshots on the backup target so that deleted files can still be recovered.
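For example, building on the earlier backup command (note that -delete requires -update or -overwrite):
[gpadmin@phddas2-0 ~]$ hadoop distcp -skipcrccheck -update -delete /mydata hdfs://all-nc-s-hdfs/backup/mydata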
SCHEDULING BACKUPS
You can automate and schedule the steps to back up a Hadoop cluster using various methods. Apache Oozie is often used to automate Hadoop tasks, and it directly supports DistCp. You can also use cron to run a shell script. To automate commands that run in an SSH session, enable password-less SSH, which allows the cron user to connect to your Hadoop client and to your PowerScale cluster (if using SnapshotIQ).
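As a minimal sketch of a cron-driven backup, the following script strings together the snapshot, copy, and cleanup steps from this article; the script path, log location, snapshot naming, and the root SSH login to the PowerScale node are illustrative assumptions:
#!/bin/bash
# backup-mydata.sh: snapshot the source, copy it to PowerScale, then clean up.
# Host names and paths follow the examples used throughout this article.
set -e
hdfs dfs -createSnapshot /mydata backupsnap
hadoop distcp -skipcrccheck -update /mydata/.snapshot/backupsnap hdfs://all-nc-s-hdfs/backup/mydata
hdfs dfs -deleteSnapshot /mydata backupsnap
# Optional: snapshot the backup target (requires SnapshotIQ and password-less SSH).
ssh root@all-nc-s-1 "isi snapshot snapshots create /ifs/hadoop/backup/mydata --name backup-$(date +%Y-%m-%d) --expires 1D"
A crontab entry for the gpadmin user such as the following would then run the backup nightly at 01:00:
0 1 * * * /home/gpadmin/backup-mydata.sh >> /home/gpadmin/backup-mydata.log 2>&1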
The backup target files on PowerScale are accessible to Hadoop applications in the same way as the source files, due to PowerScale's support for HDFS. You can use your backup data directly, without having to first restore it to your original source Hadoop environment, which saves analysis time. For example, suppose you normally run a MapReduce command like this:
hadoop jar /usr/lib/gphd/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep /mydata/mydataset1 output1 ABC
You can run the same MapReduce job against the backup dataset on PowerScale using the following command:
hadoop jar /usr/lib/gphd/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep hdfs://all-nc-s-hdfs/backup/mydata/mydataset1 output1 ABC
Not all applications support specifying a fully qualified Hadoop path instead of relying on the fs.defaultFS parameter; check with your application provider for details. Also, a PowerScale cluster sized for backup and archive, rather than for high performance, is unlikely to provide the same performance as your primary Hadoop environment. Testing is recommended, or consult Dell for proper PowerScale sizing.