Data Deduplication for XO backups

Backup May 1, 2017

Let's try to answer this question: when doing XO Delta Backup, is it possible to save some space on the remote storage by using deduplication?

The answer is a bit more complicated than "yes" or "no": it depends.

Ultra-quick recap on deduplication: when the storage system identifies a block it has already stored, it stores only a reference to the existing block rather than the whole new block. Read this for more.
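To make the recap concrete, here is a minimal sketch of a block store that keeps each distinct block once and records references to it. It is a toy model, not how ZFS implements deduplication: the `DedupStore` class, its method names and the 4096-byte `BLOCK_SIZE` are all assumptions made up for illustration.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical block size, chosen only for this illustration


class DedupStore:
    """Toy block store: identical blocks are stored once and referenced by hash."""

    def __init__(self):
        self.blocks = {}  # digest -> block bytes (each distinct block stored once)
        self.files = {}   # file name -> ordered list of digests (the references)

    def write(self, name, data):
        refs = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            # Store the block only if it has not been seen before.
            self.blocks.setdefault(digest, block)
            refs.append(digest)
        self.files[name] = refs

    def read(self, name):
        # Rebuild the file by following its references.
        return b"".join(self.blocks[d] for d in self.files[name])

    def logical_size(self):
        # Size as seen by the files, counting every reference.
        return sum(len(self.blocks[d]) for refs in self.files.values() for d in refs)

    def physical_size(self):
        # Size actually stored, counting each distinct block once.
        return sum(len(b) for b in self.blocks.values())


if __name__ == "__main__":
    # Eight distinct blocks, written under four different names.
    image = b"".join(bytes([i]) * BLOCK_SIZE for i in range(8))
    store = DedupStore()
    for n in range(4):
        store.write(f"vm{n}.vhd", image)
    print(store.logical_size() / store.physical_size())
```

Writing the same image four times grows the logical size but not the physical size, which is the effect the rest of this post measures on a real pool.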

Remote configuration

In this example, we use a Debian VM running ZFS on Linux with the deduplication feature enabled. An NFS share is exported from it and added as a remote in XOA.

VMs from the same template

In this case, 4 VMs are created from the same template (Debian 8 Cloud Init ready Template). FYI, each exported VM is around 1.7 GiB on its own.

Then those 4 VMs are started and the backup job is triggered manually, targeting the NFS remote.

Let's see the result on the disk:

 [1.7G]  20170426T114146Z_full.vhd
 [1.7G]  20170426T114153Z_full.vhd
 [1.7G]  20170426T114021Z_full.vhd
 [1.7G]  20170426T114015Z_full.vhd

Is deduplication working?

# zpool list
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank   496G  1.69G   494G         -     0%     0%  4.06x  ONLINE  -

Hell yes! Only 1/4 of the space is used, because the blocks from those 4 VMs are almost identical. The dedup ratio is about 4x here; for 10 VMs from the same initial template, it would have been close to 10x.
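As a sanity check, the ratio can be approximated from the numbers shown above: four 1.7 GiB exports against 1.69G allocated. The two helper functions below are hypothetical names written for this post, and because the sizes printed by the tools are rounded, the result only approximates the 4.06x that zpool reports.

```python
def dedup_ratio(logical_size, allocated_size):
    """Ratio between the data written and the space actually allocated."""
    return logical_size / allocated_size


def space_saved(ratio):
    """Fraction of space saved for a given dedup ratio."""
    return 1 - 1 / ratio


if __name__ == "__main__":
    logical = 4 * 1.7   # four exported VMs of about 1.7 GiB each (rounded)
    allocated = 1.69    # ALLOC column from zpool list (rounded)
    print(f"approximate ratio: {dedup_ratio(logical, allocated):.2f}x")
    print(f"space saved at 4.06x: {space_saved(4.06):.1%}")
```

The second helper makes it easy to translate any DEDUP value into a saving: a ratio of Nx means roughly 1 - 1/N of the space is saved.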

Too good to be true? Let's try the same experiment, but this time creating 4 VMs manually with the same operating system.

VMs created separately

This time, the "empty" XenServer template Debian 8 is used, with a Debian ISO to install it.
4 VMs are installed separately, then started, followed by the backup task. The files look exactly the same when I list them, but what about dedup?

# zpool list
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank   496G  5.28G   491G         -     0%     1%  1.29x  ONLINE  -

Only 1.29x! Well, saving a bit more than 20% of the space is not bad, but it's not really impressive. Why? Because blocks are not written in exactly the same place during the Debian installation on 4 different VMs, so it's harder to get the benefit of deduplication.
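The effect of block placement can be reproduced with a small experiment. The sketch below assumes fixed-size blocks compared by hash, with a hypothetical 4096-byte `BLOCK_SIZE`, and models only one way placement can differ: the same payload written 512 bytes later in the second image. It is an illustration of the idea, not a model of what a Debian installer or ZFS actually does.

```python
import hashlib
import random

BLOCK_SIZE = 4096  # hypothetical block size, chosen only for this illustration


def block_hashes(data):
    """Hash every fixed-size block of an image."""
    return [
        hashlib.sha256(data[i:i + BLOCK_SIZE]).digest()
        for i in range(0, len(data), BLOCK_SIZE)
    ]


def dedup_ratio(images):
    """Total number of blocks divided by the number of distinct blocks."""
    hashes = [h for image in images for h in block_hashes(image)]
    return len(hashes) / len(set(hashes))


# The same pseudo-random payload stands in for "the same operating system".
rng = random.Random(0)
payload = bytes(rng.getrandbits(8) for _ in range(64 * BLOCK_SIZE))

# Case 1: both images hold the payload at the same offset.
aligned = [payload, payload]

# Case 2: the second image holds the payload 512 bytes later,
# so every block boundary cuts the data at a different place.
shifted = [payload, bytes(512) + payload]

if __name__ == "__main__":
    print("same placement:   ", dedup_ratio(aligned))
    print("shifted placement:", dedup_ratio(shifted))
```

The bytes are identical in both cases; only their position relative to the block boundaries changes, and that alone is enough to stop the blocks from matching.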

But if you create all your VMs from the same template, it's another story. Keep in mind, though, that most of the space you save comes from the identical blocks (i.e. the initial template); later backups carry data that differs from one VM to the next, which limits how much deduplication helps.

Conclusion

Deduplication is not a miracle solution: it works when identical blocks sit at the same positions across your VM disks. In other words: VMs from the exact same base disk template.

Is there another way for your backups to use less space? Delta backups already take only the space they need after the first full. But continuous merging "fragments" the full itself: the solution would be to "compact" it (i.e. remove all the useless 0's in all the blocks).

That's totally doable, it just needs some time to run, because we must read the whole full. And it's already on our roadmap!
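To give a rough idea of what compacting means, here is a minimal sketch that scans an image block by block and keeps only the blocks that are not entirely zero. It is not the XO implementation and it ignores the VHD format altogether: `compact`, `expand` and the 4096-byte `BLOCK_SIZE` are hypothetical names and values used for illustration only.

```python
BLOCK_SIZE = 4096  # hypothetical block size, chosen only for this illustration


def compact(image):
    """Keep only the blocks that contain something other than zeros.

    Returns a dict mapping block offset -> block bytes, plus the original length.
    """
    sparse = {}
    for offset in range(0, len(image), BLOCK_SIZE):
        block = image[offset:offset + BLOCK_SIZE]
        # bytes(n) is n zero bytes; skip blocks that are all zeros.
        if block != bytes(len(block)):
            sparse[offset] = block
    return sparse, len(image)


def expand(sparse, total_length):
    """Rebuild the original image from its compacted form."""
    image = bytearray(total_length)  # starts out filled with zeros
    for offset, block in sparse.items():
        image[offset:offset + len(block)] = block
    return bytes(image)


if __name__ == "__main__":
    # One data block, two empty blocks, then a short tail of data.
    image = b"A" * BLOCK_SIZE + bytes(2 * BLOCK_SIZE) + b"B" * 100
    sparse, length = compact(image)
    kept = sum(len(block) for block in sparse.values())
    print(f"kept {kept} of {length} bytes")
```

Note that `compact` has to walk through every block of the image, which matches the point above: the whole full must be read, so the operation takes time.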