Data deduplication and compression are two examples of data reduction technologies. Storage vendor Quantum says that using deduplication, "Customers report an average 125 percent increase in backup performance, 87 percent fewer failed backup jobs, and typical disk capacity reductions of 90 percent or more—95 percent in virtual environments."
The "reductions of 90 percent" figure corresponds to a 10:1 deduplication ratio. For example, a 1 terabyte system can effectively store 10 terabytes of data. Efficiencies this high can change the way an enterprise backs up its data: tape backup can be retired or reduced, and off-site backups can be updated over the Internet instead of depending on couriered magnetic tapes.
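The arithmetic that links a capacity reduction to a deduplication ratio can be sketched in a few lines of Python. This is only an illustration: the function names dedup_ratio and effective_capacity_tb are hypothetical and do not come from any vendor's tooling.

```python
def dedup_ratio(reduction_percent):
    """Convert a capacity reduction (percent) into an N:1 deduplication ratio."""
    if not 0 <= reduction_percent < 100:
        raise ValueError("reduction_percent must be in the range [0, 100)")
    return 100.0 / (100.0 - reduction_percent)


def effective_capacity_tb(physical_tb, reduction_percent):
    """Logical data (TB) that fits on physical_tb of disk at the given reduction."""
    return physical_tb * dedup_ratio(reduction_percent)


# The two reduction figures quoted above: 90 percent and 95 percent.
for pct in (90, 95):
    print(f"{pct}% reduction -> {dedup_ratio(pct):.0f}:1, "
          f"1 TB holds about {effective_capacity_tb(1, pct):.0f} TB")
```

By the same formula, the 95 percent reduction quoted for virtual environments works out to a 20:1 ratio.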
Examples of hardware and software solutions include EMC (Data Domain DDX, Global Deduplication Array, Avamar, Networker), Quantum DXi, Nexsan DeDupe SG, ExaGrid, Sepaton, Symantec (Backup Exec, NetBackup), CommVault, FalconStor, Permabit, NetApp, HP, Sun and IBM (Diligent, TSM Tivoli Storage Manager).
Deduplication systems cost anywhere from a few thousand dollars to tens or hundreds of thousands of dollars, depending on the storage capacity. Deduplication is built into standard backup software, NAS (network-attached storage), VTL (virtual tape library), disk-to-disk backup appliances, and storage arrays.
What is Data Deduplication and How Does it Work?
Data dedupe works by storing only a single instance or copy of data. If two files are exactly the same, the second file doesn't need to be stored again. This is handled internally by the device and is transparent to the user.
To increase efficiency, hashes (checksums) of the files are compared rather than the files themselves. Fixed block deduplication compares fixed-size blocks of data instead of whole files. Variable length block deduplication (also called byte deduplication) uses blocks of varying size.
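As a rough illustration of fixed block deduplication, the sketch below splits data into fixed-size blocks, hashes each block with SHA-256, and stores a block only if its hash has not been seen before. The BlockStore class, its method names, and the choice of SHA-256 are assumptions made for this example, not a description of how any particular product works.

```python
import hashlib


class BlockStore:
    """Toy fixed-block deduplicating store (illustrative only)."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}  # hash -> block bytes (each unique block stored once)
        self.files = {}   # file name -> ordered list of block hashes

    def put(self, name, data):
        """Store data under name; return how many new blocks were written."""
        new_blocks = 0
        recipe = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.blocks:
                self.blocks[digest] = block
                new_blocks += 1
            recipe.append(digest)
        self.files[name] = recipe
        return new_blocks

    def get(self, name):
        """Rebuild the original data from its list of block hashes."""
        return b"".join(self.blocks[d] for d in self.files[name])


store = BlockStore(block_size=4)
print(store.put("a.txt", b"AAAABBBBCCCC"))  # three blocks, all new
print(store.put("b.txt", b"AAAABBBBDDDD"))  # only the last block is new
```

A real system also has to deal with hash collisions, persistence, and garbage collection of unreferenced blocks, all of which are left out here.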
The deduplication process works well with random access hard disks, but not with serial access tape systems. Deduplication also works especially well with backup systems:
- Full backups from different computers mean multiple copies of the same operating system files: ideal candidates for deduplication.
- Multiple generations of full backups of the same computer mean multiple copies of the same unchanged data files.
This is why there is more interest in deduplication in backup systems, compared to file servers or other systems: the deduplication ratios are higher. One blog reports only a 10% to 28% reduction in storage when deduplication was used on a file server.
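To see why repeated full backups deduplicate so well, the following sketch computes a deduplication ratio (logical bytes divided by unique stored bytes) over several backup generations. The function name, the tiny 4-byte block size, and the sample data are all made up for illustration.

```python
import hashlib


def backup_dedup_ratio(generations, block_size=4):
    """Return logical bytes / unique stored bytes across several full backups."""
    seen = set()   # hashes of blocks already stored
    logical = 0    # bytes the backups contain in total
    stored = 0     # bytes that actually had to be written
    for data in generations:
        for i in range(0, len(data), block_size):
            block = data[i:i + block_size]
            logical += len(block)
            digest = hashlib.sha256(block).digest()
            if digest not in seen:
                seen.add(digest)
                stored += len(block)
    return logical / stored


# Three nightly full backups in which only the last block changes.
monday = b"OS__APP_DOC1DOC2"
tuesday = b"OS__APP_DOC1DOC3"
wednesday = b"OS__APP_DOC1DOC4"

print(backup_dedup_ratio([monday]))                      # 1.0
print(backup_dedup_ratio([monday, tuesday, wednesday]))  # 2.0
```

In this toy model, each additional generation adds only its changed blocks to storage, so the ratio keeps rising as more full backups of mostly unchanged data are retained.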
Source, Target, Inline, Post-processing and Global Deduplication
Deduplication can be controlled in one of two places:
- At the backup program (source-based deduplication).
- At the storage device (target-based deduplication).
In source-based deduplication, the deduplication is done by the backup software. The hash of the new data (block or file) is sent over the network to the storage server for comparison. If a match is found, the new data isn't sent to the storage server.
Advantage
- Reduces network traffic, good for remote backup over the Internet.
Disadvantage
- May need new backup software.
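The source-based exchange described above can be sketched as follows. The StorageServer class and the source_side_backup function are hypothetical stand-ins for a real backup client and server; method calls take the place of network requests, and the return value counts only block data sent, not the small hashes.

```python
import hashlib


class StorageServer:
    """Stand-in for the remote storage server (no real networking)."""

    def __init__(self):
        self.blocks = {}  # hash -> block bytes

    def has(self, digest):
        return digest in self.blocks

    def store(self, digest, block):
        self.blocks[digest] = block


def source_side_backup(data, server, block_size=4096):
    """Back up data; return the bytes of block data sent to the server."""
    sent = 0
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        digest = hashlib.sha256(block).hexdigest()
        # Only the hash crosses the "network" for this check.
        if not server.has(digest):
            server.store(digest, block)  # block is new, so send it
            sent += len(block)
    return sent


server = StorageServer()
print(source_side_backup(b"AAAABBBB", server, block_size=4))      # 8
print(source_side_backup(b"AAAABBBBCCCC", server, block_size=4))  # 4
```

The second backup sends only the one block the server has not seen, which is where the reduction in network traffic comes from.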
Target-based deduplication means that the deduplication is done at the storage server (NAS, VTL), transparent to the backup software.
Advantages
- No change required to existing backup software.
- Can be faster than source-based deduplication, as long as the network can keep up.
Disadvantage
- Does not reduce network traffic: at least as much data crosses the network as before deduplication was deployed.
Inline deduplication means that the deduplication is done at the same time as the backup. Source-based deduplication is inline deduplication, but inline deduplication can also be target-based (controlled at the storage device, transparent to the backup software). Post-processing deduplication does the deduplication after the backup has finished. This is usually target-based deduplication.
Global deduplication allows deduplication (single instance comparison) to span more than one system. This can be between:
- Separate physical sites (headquarters and branch offices).
- Separate storage devices in the same site.
- Separate controllers in the same storage array.
By enlarging the pool of files that are compared, global deduplication can increase deduplication ratios. Global deduplication is currently relatively rare, being unavailable even in some well-known products.
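The effect of a larger comparison pool can be illustrated with a small sketch that counts unique blocks when two sites are deduplicated separately and when they share one index. The function name, the 4-byte block size, and the sample data are assumptions chosen for the example.

```python
import hashlib


def unique_block_count(datasets, block_size=4):
    """Count distinct blocks across all datasets sharing one hash index."""
    seen = set()
    for data in datasets:
        for i in range(0, len(data), block_size):
            block = data[i:i + block_size]
            seen.add(hashlib.sha256(block).digest())
    return len(seen)


# Headquarters and a branch office share OS and application blocks.
hq = b"OS__APP_HQ01"
branch = b"OS__APP_BR01"

separate = unique_block_count([hq]) + unique_block_count([branch])
shared = unique_block_count([hq, branch])

print(separate)  # 6 blocks stored with one index per site
print(shared)    # 4 blocks stored with a single global index
```

With separate indexes, the common blocks are stored once per site; with a shared index, they are stored once overall.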
The Best Data Deduplication System
One important performance number is the deduplication ratio or factor, as this will impact the amount of disk storage required, and therefore the cost of the system. Deduplication ratio calculators are available from storage vendors, to estimate the expected performance if deduplication is implemented.
Backup systems for enterprises are large, expensive and mission-critical. Deduplication is only one factor in selecting a backup system such as a VTL and associated backup software. Snapshots, continuous data protection, open file backup, backing up mobile users, archiving to tape, backup windows, high availability, virtual machine support, encryption, error logging and disaster recovery are just some of the other factors that need to be considered.
For more information, SearchDataBackup.com has a series of detailed articles on various aspects of data deduplication (free registration is required for some articles). Linux Magazine has a good explanation of source versus target deduplication.
To understand backup systems, it's useful to understand NAS, backup tape formats, and RAID disks.