A Data Domain system uses state-of-the-art compression techniques to reduce the physical space required by customer data. As such, the technologies and measurements of compression levels are complex topics.
This document discusses some of the terminology, tradeoffs, and measures of compression in order to better explain the compression types used and other aspects of compression in a Data Domain system.
APPLIES TO: All Data Domain Models
Last updated: January 2024
Compression is a data reduction technology that aims to store a dataset using less physical space. In Data Domain systems (DDOS), deduplication and local compression are combined to compress user data. Deduplication, or "dedupe," identifies redundant data segments and stores only the unique ones. Local compression further compresses the unique data segments with a compression algorithm such as lz, gzfast, or gz. The overall user data compression in DDOS is the joint effort of deduplication and local compression. DDOS uses the "compression ratio" to measure the effectiveness of its data compression; generally, it is the ratio of the total user data size to the total size of the compressed data, that is, the used physical space.
The Data Domain file system is a "log-structured" deduplication file system. A log-structured file system only appends data to the system, and deletion by itself cannot free physical space; such file systems rely on garbage collection to reclaim space that is no longer needed. The characteristics of the log-structured file system and the deduplication technology together make it tricky to clearly understand all aspects of compression in DDOS.
For compression, there are many aspects we can measure. In this document, we discuss the details step by step to help understand DDOS compression. First, we explain the overall system compression effect, which tells us the realistic compression achieved in a Data Domain system: the amount of user data, the amount of physical space consumed, and the ratio between them. This ratio is called the "system effective compression ratio" in this document. DDOS conducts deduplication inline and tracks the statistics of the original user data segments, the post-dedupe unique data segments, and the local compression effect on the unique data segments. These inline compression statistics, collected for each write, are used to measure the inline compression effect. DDOS tracks the statistics at different levels: files, MTrees, and the entire system.
The contents of this document apply to all DDOS releases available at the time of publication, up to DDOS 7.13. There is no guarantee that all the contents are accurate for future releases. In releases prior to 5.0, the entire system has only one MTree and the term MTree is not explicitly called out.
The system-wide overall compression effect is measured by the system effective compression ratio, which is the ratio of the user data size to the size of the used physical space. It is reported by the "filesys show compression" (FSC) CLI command (the corresponding information is also available in the UI). A sample of the FSC output is shown below:
# filesys show compression
From: 2023-12-31 03:00 To: 2024-01-07 03:00
Active Tier:
Pre-Comp Post-Comp Global-Comp Local-Comp Total-Comp
(GiB) (GiB) Factor Factor Factor
(Reduction %)
---------------- -------- --------- ----------- ---------- -------------
Currently Used:* 6439.6 113.4 - - 56.8x (98.2)
Written:
Last 7 days 135421.3 1782.0 35.1x 2.2x 76.0x (98.7)
Last 24 hrs 532.5 1.5 334.3x 1.1x 356.5x (99.7)
---------------- -------- --------- ----------- ---------- -------------
* Does not include the effects of pre-comp file deletes/truncates
since the last cleaning on 2024/01/05 11:34:13.
The system effective compression ratio is reported in the first row of the result section of the CLI output (the row labeled "Currently Used"). The total user data size is labeled "Pre-Comp." The total consumed physical space (by both data and metadata) is labeled "Post-Comp."
The "Pre-Comp" number and "Post-Comp" number are both read at runtime. FSC implicitly synchronizes the entire system, then queries the two numbers. These two numbers are measured in the same way as the "filesys show space" command.
System effective compression ratio = Pre-Comp/Post-Comp
The rest of the FSC output describes the inline compression statistics, and we discuss them later.
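As a quick sanity check, the "Currently Used" figures in the sample FSC output above can be plugged directly into this formula. The following Python snippet is purely illustrative (it is not part of DDOS) and simply reproduces the reported Total-Comp factor and reduction percentage:

# Illustrative only: reproduce the Total-Comp factor and Reduction % from
# the Pre-Comp and Post-Comp values of the sample FSC output above.
pre_comp_gib = 6439.6    # "Pre-Comp"  (total user data)
post_comp_gib = 113.4    # "Post-Comp" (used physical space)

total_comp_factor = pre_comp_gib / post_comp_gib                      # ~56.8x
reduction_pct = (pre_comp_gib - post_comp_gib) / pre_comp_gib * 100   # ~98.2%
print(f"{total_comp_factor:.1f}x ({reduction_pct:.1f}%)")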
There are some operations that can affect the system effective compression ratio:
Fastcopy
When a fastcopy is done from a file in the active namespace (not a snapshot), it is a perfect deduplication, as no extra physical space is needed for the target file. The effect of a fastcopy is that it increases the user data size without consuming additional physical space, which increases the system effective compression ratio. When many fastcopies are done, the system effective compression ratio may become artificially high (see the sketch after this list).
Virtual synthetics
Virtual synthetic backups tend to show a high system effective compression ratio. This is because virtual synthetics create logical full backups but transfer only changed or new data to the Data Domain system. The impact of virtual synthetics on the system effective compression ratio is somewhat like the effect of fastcopy.
Overwrites
Overwrites consume more physical space but do not increase the logical size of the dataset; thus, overwrites lower the system effective compression ratio.
Storing sparse files
Sparse files contain large "holes" that are counted in the logical size but do not consume physical space due to compression. As a result, they can make the system effective compression ratio seem high.
Storing small files
DDOS adds nearly 1 KB overhead to each file for certain internal metadata. When a system stores a significant number of small files (sizes less than 1 KB or in single-digit kilobytes), the overhead of metadata drags the effective compression ratio down.
Storing pre-compressed or pre-encrypted files
Compression and encryption can amplify the level of data change and reduce the possibility of deduplication. Such files usually cannot be deduplicated well and lower the system effective compression ratio.
Deletes
Deletions reduce the logical size of the system, but the corresponding unused space is not given back until garbage collection runs. Therefore, many deleted files make the compression ratio look low until Garbage Collection (GC) runs.
Garbage Collection (GC) or Cleaning
GC reclaims the space consumed by data segments that are no longer referenced by any file. If many files have been deleted recently, GC may increase the system effective compression ratio by reducing the physical space footprint.
Aggressively taking snapshots
When we take a snapshot of an MTree, we do not change the logical size of the dataset. However, all the data segments referenced by the snapshot must be locked down, even if all files captured by the snapshot are deleted after the snapshot was taken. GC cannot reclaim space that is still needed by snapshots; therefore, having lots of snapshots may make the system effective compression ratio appear low. However, snapshots are useful crash recovery facilities, so never hesitate to take snapshots or set up proper snapshot schedules when needed.
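To make the fastcopy and overwrite effects above concrete, here is a small, purely illustrative Python sketch with hypothetical numbers (it does not model DDOS internals):

# Hypothetical numbers only: how fastcopy and overwrites move the
# system effective compression ratio (Pre-Comp / Post-Comp).
def effective_ratio(pre_comp_gib, post_comp_gib):
    return pre_comp_gib / post_comp_gib

pre_comp, post_comp = 10000.0, 1000.0          # 10,000 GiB logical in 1,000 GiB physical
print(effective_ratio(pre_comp, post_comp))    # 10.0x baseline

pre_comp += 2000.0                             # fastcopy a 2,000 GiB file:
print(effective_ratio(pre_comp, post_comp))    # logical grows, physical does not -> 12.0x

post_comp += 250.0                             # overwrite 500 GiB with data that does not
print(effective_ratio(pre_comp, post_comp))    # dedupe (assume ~2x local compression):
                                               # physical grows, logical does not -> 9.6x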
DDOS conducts deduplication inline, as data is written to the system. It tracks the effects of inline deduplication and local compression for each write, and accumulates the statistics at the file level. Per-file inline compression statistics are further aggregated at the MTree level and at the system level. Compression is measured based on three numbers in the inline statistics:
raw_bytes: the size of the user data written, before deduplication
pre_lc_size: the size of the unique data segments after deduplication, before local compression
post_lc_size: the size of the unique data segments after local compression
Based on the above three numbers, DDOS defines two more fine-granularity compression ratios:
Global compression ratio (g_comp): the ratio of raw_bytes to pre_lc_size, which measures the deduplication effect
Local compression ratio (l_comp): the ratio of pre_lc_size to post_lc_size, which measures the effect of the local compression algorithm
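Expressed as code, the relationship between the three counters and the derived ratios looks like this (an illustrative sketch only, not DDOS source):

# Illustrative sketch of the inline statistics and derived ratios.
# raw_bytes   : user data written, before deduplication
# pre_lc_size : unique data after deduplication, before local compression
# post_lc_size: unique data after local compression
def g_comp(raw_bytes, pre_lc_size):
    return raw_bytes / pre_lc_size        # global (deduplication) compression ratio

def l_comp(pre_lc_size, post_lc_size):
    return pre_lc_size / post_lc_size     # local compression ratio

def total_comp(raw_bytes, post_lc_size):
    return raw_bytes / post_lc_size       # overall ratio = g_comp * l_comp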
The accumulated inline compression statistics are part of the file metadata in DDOS and are stored in the file inode. DDOS provides tools to check the inline compressions at all three levels; file, MTree, and system-wide. We detail them in the following sections.
3.1 File Compression
File compression can be checked with the "filesys show compression <path>" CLI command, which reports the accumulated compression statistics stored in the file inode. When a directory is specified, the inline compression statistics of all the files under that directory are summed up and reported. In the CLI output, raw_bytes is labeled "Original Bytes"; pre_lc_size is labeled "Globally Compressed"; post_lc_size is labeled "Locally Compressed"; the other overheads are reported as "Meta-data." The following two examples are captured from an actual Data Domain system:
Example 1: Inline compression statistics of a file
# filesys show compression /data/col1/main/dir1/file_1
Total files: 1; bytes/storage_used: 7.1
Logical Bytes: 53,687,091,200
Original Bytes: 11,463,643,380
Globally Compressed: 4,373,117,751
Locally Compressed: 1,604,726,416
Meta-data: 18,118,232
Example 2: Inline compression statistics of all files under a directory, including all subdirectories
# filesys show compression /data/col1/main/dir1
Total files: 13; bytes/storage_used: 7.1
Logical Bytes: 53,693,219,809
Original Bytes: 11,501,978,884
Globally Compressed: 4,387,212,404
Locally Compressed: 1,608,444,046
Meta-data: 18,241,880
The system reports the overall inline compression ratio of the file in the above CLI output as "bytes/storage_used." However, care must be taken in interpreting this information, as it can be misleading for various reasons. One reason is that pre_lc_size and post_lc_size are recorded at the time the data operations are processed. When the file that originally added those segments is deleted, the number of unique data segments attributable to the remaining files increases, but their recorded statistics are not updated.
As an example, assume a file sample.file is backed up to a Data Domain system, and in the first backup the compression information of the file is pre_lc_size=10GiB, post_lc_size=5GiB.
Next, assume that the data of this file is unique, with no data shared with any other file. In the second backup of the file, further assume that the file gets ideal deduplication, such that both pre_lc_size and post_lc_size are zero because all segments of the file already exist on the system. When the first backup is deleted, the second backup of the file becomes the only file that references the 5 GiB of data segments. In this case, ideally, the pre_lc_size and post_lc_size of the file in the second backup should be updated from zero to 10 GiB and 5 GiB, respectively. However, there is no way to detect for which files this should be done, so the inline compression statistics of the existing files are left unchanged.
Another factor that affects these numbers is their cumulative nature. When a file receives many overwrites, it is impossible to track the extent to which the cumulative statistics reflect the writes that introduced the live data. Thus, over a long time, the inline compression statistics can only be treated as a heuristic that roughly estimates the compression of a particular file.
Another fact worth highlighting is that the inline compression of a file cannot be measured for an arbitrary time interval. The file inline compression statistics are a cumulative result and cover all the writes that the file has ever received. When a file receives lots of overwrites, the raw_bytes can be far larger than the logical size of the file. For sparse files, the file sizes may be larger than the "Original Bytes."
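For reference, the factors defined earlier can be recomputed from the fields reported in Example 1. The snippet below is illustrative only; in particular, the exact formula behind the reported "bytes/storage_used" value (for example, whether it includes the Meta-data overhead) is not spelled out by the CLI, so the mapping shown here is an assumption.

# Illustrative only: per-file factors derived from the Example 1 fields.
original            = 11_463_643_380   # "Original Bytes"       (raw_bytes)
globally_compressed =  4_373_117_751   # "Globally Compressed"  (pre_lc_size)
locally_compressed  =  1_604_726_416   # "Locally Compressed"   (post_lc_size)

g = original / globally_compressed            # ~2.6x  deduplication effect
l = globally_compressed / locally_compressed  # ~2.7x  local compression effect
total = original / locally_compressed         # ~7.1x, in line with the reported
print(f"{g:.1f}x {l:.1f}x {total:.1f}x")      # bytes/storage_used of 7.1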
3.2 MTree Compression
The compression of a particular MTree can be checked with the "mtree show compression" (MSC) CLI command. The absolute values of the inline compression statistics are cumulative over the lifetime of the MTree. Given that the lifetime of an MTree can be many years long, these values become less and less informative over time. To address this issue, we use the amount of change (deltas) of the inline compression statistics and report compression only for certain time intervals. The underlying approach is to periodically dump the MTree inline compression statistics to a log. When a client queries MTree compression with the MSC command, we use the log to calculate the deltas of the numbers for compression reporting. By default, MSC reports compression for the last 7 days and the last 24 hours, though any time period of interest can be specified.
To demonstrate, assume the following log for MTree A:
3:00AM, raw_bytes=11000GB, pre_lc_size=100GB, post_lc_size=50GB
4:00AM, raw_bytes=12000GB, pre_lc_size=200GB, post_lc_size=100GB
Then the compression of MTree A for this hour is:
g_comp = (12000-11000)/(200-100) = 10x
l_comp = (200-100)/(100-50) = 2x
overall compression ratio = (12000-11000)/(100-50) = 20x
The above compression ratio calculation has nothing to do with the size of the dataset itself. For example, the above MTree may hold only 500 GiB of logical data.
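A minimal Python sketch of this delta-based calculation, using the hypothetical log entries above (this illustrates the approach only; it is not DDOS code):

# Interval compression from two snapshots of the cumulative MTree statistics.
def interval_compression(prev, curr):
    d_raw  = curr["raw_bytes"]    - prev["raw_bytes"]
    d_pre  = curr["pre_lc_size"]  - prev["pre_lc_size"]
    d_post = curr["post_lc_size"] - prev["post_lc_size"]
    return d_raw / d_pre, d_pre / d_post, d_raw / d_post   # g_comp, l_comp, total

snap_3am = {"raw_bytes": 11000, "pre_lc_size": 100, "post_lc_size": 50}   # GB
snap_4am = {"raw_bytes": 12000, "pre_lc_size": 200, "post_lc_size": 100}  # GB
print(interval_compression(snap_3am, snap_4am))   # (10.0, 2.0, 20.0)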
MSC supports the "daily" and "daily-detailed" options, as does the "filesys show compression" command. When "daily" is specified, the command reports the daily compression in a calendar fashion. It uses the daily deltas of the raw_bytes and post_lc_size to compute the daily compression ratio. When "daily-detailed" is specified, the command shows all three deltas (of the raw_bytes, pre_lc_size, and post_lc_size, respectively) for each day; it also computes the g_comp and l_comp alongside the total compression factor.
Sample output of these options is shown in the Appendix.
3.3 System Compression
Once we understand how compression is reported for MTrees, it is straightforward to extend the concept to the entire system. System-wide inline compression statistics collection and reporting are exactly the same as for MTrees; the only difference is the scope, as one is a particular MTree, while the other is the entire system. The results can be checked with the "filesys show compression" command. An example of this can be found in Section 2. The "last 7 days" and "last 24 hours" system compression is reported in the last two lines of the result section in the FSC output.
On Data Domain systems with the cloud tier implemented, storage is separated into the active tier and the cloud tier, which are two independent deduplication domains. Users can ingest data only into the active tier; later, DDOS data-movement functions can be used to migrate data from the active tier to the cloud tier. Space and compression measurement and reporting are therefore handled independently in each tier. At the file level, however, inline compression statistics are reported without differentiating by tier; they are exactly the same as described in Section 3.1.
The last topic to highlight is some of the characteristics of deduplication, which is called "global compression" in many Data Domain documents. Although it contains the word "compression," it is entirely different from the traditional concept of compression, which DDOS also provides under the name "local compression."
Local compression reduces the size of a piece of data using a certain algorithm (some kinds of data are not compressible, and applying compression algorithms to them may slightly increase the data size). Usually, once an algorithm is decided, the data itself is the only factor that determines the compression ratio.
Deduplication, however, is different: it is not a local concept but a "global" one. An incoming data segment is deduplicated against all the existing data segments in its deduplication domain, which, on Data Domain systems without the cloud tier, includes all the data in the system. The data segment itself matters little in the deduplication procedure; what matters is whether an identical segment already exists in the system.
In practice, we rarely see a high deduplication ratio in the initial backup of a dataset; in initial backups, the major data reduction often comes from local compression. When subsequent backups land on the Data Domain system, deduplication shows its strength and becomes the dominant factor for compression. The effectiveness of deduplication relies on the fact that the change rate of a dataset is low from backup to backup; for this reason, datasets with high change rates cannot be deduplicated well. When the backup application inserts its own metadata chunks (called markers by Data Domain) into the backup images at high frequency, it also may not get a good deduplication ratio. Our marker-handling techniques can help sometimes, but not always.
Given these observations, what can we expect? Measuring compression is difficult in deduplicated file systems, and it is even harder in log-structured deduplicated file systems; to interpret the numbers correctly, we must understand how deduplication works and how the compression statistics are tracked. Compression ratios are useful information for understanding the behavior of a particular system. The system effective compression ratio is the most important, reliable, and informative measure. The inline compression statistics can be helpful too, but they may be no more than heuristics in some circumstances.
Appendix: Sample output of the "mtree show compression" command
Assume that there is an MTree holding 254792.4 GiB of data. It has received 4379.3 GiB of new data in the last 7 days and 784.6 GiB in the last 24 hours (other time intervals can be specified). The "daily" option reports the inline compression statistics for the last 33 days. When the "daily-detailed" option is provided, the total compression ratios are further detailed by separating them into the global and local compression ratios.
mtree list output:
# mtree list /data/col1/main
Name Pre-Comp (GiB) Status
--------------- -------------- ------
/data/col1/main 254792.4 RW
--------------- -------------- ------
D : Deleted
Q : Quota Defined
RO : Read Only
RW : Read Write
RD : Replication Destination
IRH : Retention-Lock Indefinite Retention Hold Enabled
ARL : Automatic-Retention-Lock Enabled
RLGE : Retention-Lock Governance Enabled
RLGD : Retention-Lock Governance Disabled
RLCE : Retention-Lock Compliance Enabled
M : Mobile
m : Migratable
# mtree show compression /data/col1/main
From: 2023-09-07 12:00 To: 2023-09-14 12:00
Pre-Comp Post-Comp Global-Comp Local-Comp Total-Comp
(GiB) (GiB) Factor Factor Factor
(Reduction %)
------------- -------- --------- ----------- ---------- -------------
Written:
Last 7 days 4379.3 883.2 3.4x 1.5x 5.0x (79.8)
Last 24 hrs 784.6 162.1 3.3x 1.4x 4.8x (79.3)
------------- -------- --------- ----------- ---------- -------------
With "daily" option:
# mtree show compression /data/col1/main daily
From: 2023-08-12 12:00 To: 2023-09-14 12:00
Sun Mon Tue Wed Thu Fri Sat Weekly
----- ----- ----- ----- ----- ----- ----- ------ -----------------
-13- -14- -15- -16- -17- -18- -19- Date
432.0 405.9 284.1 438.8 347.0 272.7 331.4 2511.8 Pre-Comp
85.5 66.2 45.3 81.9 61.4 57.4 66.3 464.1 Post-Comp
5.0x 6.1x 6.3x 5.4x 5.7x 4.7x 5.0x 5.4x Total-Comp Factor
-20- -21- -22- -23- -24- -25- -26-
478.0 387.8 450.2 533.1 386.0 258.4 393.6 2887.1
100.6 81.5 100.8 119.0 84.0 40.6 75.3 601.8
4.8x 4.8x 4.5x 4.5x 4.6x 6.4x 5.2x 4.8x
-27- -28- -29- -30- -31- -1- -2-
27.6 1.0 0.4 470.7 467.3 517.7 641.9 2126.7
4.9 0.2 0.1 83.9 92.3 89.8 140.1 411.2
5.6x 5.6x 4.3x 5.6x 5.1x 5.8x 4.6x 5.2x
-3- -4- -5- -6- -7- -8- -9-
539.6 495.0 652.8 658.7 537.1 398.7 305.5 3587.3
110.8 108.0 139.4 137.0 111.5 78.3 48.3 733.3
4.9x 4.6x 4.7x 4.8x 4.8x 5.1x 6.3x 4.9x
-10- -11- -12- -13- -14-
660.2 738.3 787.2 672.9 796.9 3655.5
143.9 152.5 167.6 126.9 163.3 754.2
4.6x 4.8x 4.7x 5.3x 4.9x 4.8x
----- ----- ----- ----- ----- ----- ----- ------ -----------------
Pre-Comp Post-Comp Global-Comp Local-Comp Total-Comp
(GiB) (GiB) Factor Factor Factor
(Reduction %)
-------------- -------- --------- ----------- ---------- -------------
Written:
Last 33 days 14768.3 2964.5 3.4x 1.5x 5.0x (79.9)
Last 24 hrs 784.6 162.1 3.3x 1.4x 4.8x (79.3)
-------------- -------- --------- ----------- ---------- -------------
Key:
Pre-Comp = Data written before compression
Post-Comp = Storage used after compression
Global-Comp Factor = Pre-Comp / (Size after de-dupe)
Local-Comp Factor = (Size after de-dupe) / Post-Comp
Total-Comp Factor = Pre-Comp / Post-Comp
Reduction % = ((Pre-Comp - Post-Comp) / Pre-Comp) * 100
With "daily-detailed" option:
# mtree show compression /data/col1/main daily-detailed
From: 2023-08-12 12:00 To: 2023-09-14 12:00
Sun Mon Tue Wed Thu Fri Sat Weekly
----- ----- ----- ----- ----- ----- ----- ------ -----------------
-13- -14- -15- -16- -17- -18- -19- Date
432.0 405.9 284.1 438.8 347.0 272.7 331.4 2511.8 Pre-Comp
85.5 66.2 45.3 81.9 61.4 57.4 66.3 464.1 Post-Comp
3.5x 4.1x 4.3x 3.6x 3.8x 3.3x 3.4x 3.7x Global-Comp Factor
1.4x 1.5x 1.5x 1.5x 1.5x 1.4x 1.5x 1.5x Local-Comp Factor
5.0x 6.1x 6.3x 5.4x 5.7x 4.7x 5.0x 5.4x Total-Comp Factor
80.2 83.7 84.1 81.3 82.3 78.9 80.0 81.5 Reduction %
-20- -21- -22- -23- -24- -25- -26-
478.0 387.8 450.2 533.1 386.0 258.4 393.6 2887.1
100.6 81.5 100.8 119.0 84.0 40.6 75.3 601.8
3.3x 3.3x 3.0x 3.0x 3.3x 4.1x 3.6x 3.3x
1.4x 1.5x 1.5x 1.5x 1.4x 1.5x 1.4x 1.5x
4.8x 4.8x 4.5x 4.5x 4.6x 6.4x 5.2x 4.8x
79.0 79.0 77.6 77.7 78.2 84.3 80.9 79.2
-27- -28- -29- -30- -31- -1- -2-
27.6 1.0 0.4 470.7 467.3 517.7 641.9 2126.7
4.9 0.2 0.1 83.9 92.3 89.8 140.1 411.2
4.4x 3.7x 2.6x 3.8x 3.5x 3.9x 3.2x 3.5x
1.3x 1.5x 1.6x 1.5x 1.4x 1.5x 1.5x 1.5x
5.6x 5.6x 4.3x 5.6x 5.1x 5.8x 4.6x 5.2x
82.1 82.2 76.8 82.2 80.3 82.7 78.2 80.7
-3- -4- -5- -6- -7- -8- -9-
539.6 495.0 652.8 658.7 537.1 398.7 305.5 3587.3
110.8 108.0 139.4 137.0 111.5 78.3 48.3 733.3
3.4x 3.1x 3.2x 3.4x 3.3x 3.4x 4.1x 3.3x
1.4x 1.5x 1.5x 1.4x 1.4x 1.5x 1.6x 1.5x
4.9x 4.6x 4.7x 4.8x 4.8x 5.1x 6.3x 4.9x
79.5 78.2 78.6 79.2 79.2 80.4 84.2 79.6
-10- -11- -12- -13- -14-
660.2 738.3 787.2 672.9 796.9 3655.5
143.9 152.5 167.6 126.9 163.3 754.2
3.1x 3.4x 3.2x 3.7x 3.4x 3.3x
1.5x 1.4x 1.5x 1.4x 1.5x 1.5x
4.6x 4.8x 4.7x 5.3x 4.9x 4.8x
78.2 79.3 78.7 81.1 79.5 79.4
----- ----- ----- ----- ----- ----- ----- ------ -----------------
Pre-Comp Post-Comp Global-Comp Local-Comp Total-Comp
(GiB) (GiB) Factor Factor Factor
(Reduction %)
-------------- -------- --------- ----------- ---------- -------------
Written:
Last 33 days 14768.3 2964.5 3.4x 1.5x 5.0x (79.9)
Last 24 hrs 784.6 162.1 3.3x 1.4x 4.8x (79.3)
-------------- -------- --------- ----------- ---------- -------------
Key:
Pre-Comp = Data written before compression
Post-Comp = Storage used after compression
Global-Comp Factor = Pre-Comp / (Size after de-dupe)
Local-Comp Factor = (Size after de-dupe) / Post-Comp
Total-Comp Factor = Pre-Comp / Post-Comp
Reduction % = ((Pre-Comp - Post-Comp) / Pre-Comp) * 100