How encryption and deduplication work together is a question that comes up often. The concern is that if you encrypt data, every block will be different, deduplication won’t work, and you will need more storage than you did before, often a lot more. But this isn’t always the case. In this article, we explore how HyTrust DataControl encryption can work well in conjunction with the deduplication capabilities provided by VMware VSAN storage as well as storage arrays that support deduplication.

How Do Disk Encryption and Deduplication Work?

HyTrust DataControl provides disk encryption at the partition level for Windows and Linux servers. The following figure shows a block device before and after encryption.

All blocks are encrypted in such a way that, after encryption, every encrypted block is different from every other encrypted block, even if the blocks were the same before being encrypted. We’ll skip the details of how this works. Just remember that all blocks must be different for encryption to be effective.
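To make the idea concrete, here is a minimal sketch of position-dependent encryption. It is a toy built from the Python standard library, not the cipher DataControl uses, and the names `toy_encrypt_block` and `BLOCK_SIZE` are invented for illustration. It assumes the block number is mixed into the encryption, so identical data at two positions produces two different ciphertexts.

```python
import hashlib
import hmac

BLOCK_SIZE = 32  # toy block size in bytes; real disks use 512 or 4096


def toy_encrypt_block(key: bytes, block_index: int, plaintext: bytes) -> bytes:
    """Toy stand-in for a disk cipher (NOT real encryption, illustration only).

    The keystream depends on both the key and the block's position, so the
    same plaintext stored at two positions yields two different ciphertexts.
    Applying the function twice with the same key and index restores the data.
    """
    tweak = block_index.to_bytes(8, "big")
    keystream = hmac.new(key, tweak, hashlib.sha256).digest()
    return bytes(p ^ k for p, k in zip(plaintext, keystream))


key = b"example-key"
same_data = b"A" * BLOCK_SIZE
cipher_0 = toy_encrypt_block(key, 0, same_data)
cipher_2 = toy_encrypt_block(key, 2, same_data)

# Identical plaintext at different positions no longer looks identical.
print("blocks differ after encryption:", cipher_0 != cipher_2)
```

A dedupe engine sitting below this layer sees two unrelated blocks, which is the root of the concern described above.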

Now how does deduplication work? Look at the following figure:

On the left, we are showing a number of blocks in a disk. The “green” blocks (0, 2 and n) are identical and blocks 1 and n-1 are different from each other as well as different from the “green” blocks. When blocks are passed through the deduplication layer, the dedupe engine “hashes” the blocks. All blocks that hash to the same value are considered identical. When this occurs, there is no need to store 3 identical copies, just one copy and some meta-data to specify that there are 3 copies and which blocks in the device they correspond to.
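The hashing step can be sketched in a few lines. This is a simplified model of a dedupe layer, not the VSAN or array implementation. The `deduplicate` function and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib
from collections import defaultdict


def deduplicate(blocks):
    """Return (store, block_map) for a list of equally sized byte blocks.

    store maps a content hash to the single stored copy of the block.
    block_map maps each content hash to the block numbers that reference it.
    """
    store = {}
    block_map = defaultdict(list)
    for number, data in enumerate(blocks):
        digest = hashlib.sha256(data).hexdigest()
        store.setdefault(digest, data)      # keep only the first copy
        block_map[digest].append(number)    # metadata: who points at it
    return store, dict(block_map)


green = b"G" * 16
# Blocks 0, 2 and 4 are the identical "green" blocks; 1 and 3 are unique.
disk = [green, b"1" * 16, green, b"3" * 16, green]
store, block_map = deduplicate(disk)
print(len(disk), "logical blocks ->", len(store), "stored blocks")
```

The length of each list in `block_map` plays the role of the reference count: the green block is stored once and referenced three times.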

Where is Dedupe Most Effective?

If you have 1,000 Windows VMs, you have 1,000 copies of the Windows operating system installed (think C: drive). Most VMs are deployed from a single gold template, so the VM disks start off being identical. Through deduplication, these VMs can essentially be reduced to a single copy in terms of storage space allocated. The reduction may not be exactly 100%, but it is very close.

As time goes by, these Windows VMs will start to differ. The C: drive (Windows OS) will not change too much but data drives likely will. There will be different databases, different webservers, other types of applications, user data and so on. These differences mean that dedupe will likely become less effective over time.

What Does DataControl Do to Save Space?

The encryption example above produces a different encrypted block for every block in every disk. We’ll now discuss why this worst case is unlikely to happen given how people deploy VMs. There is one exception, which we will explore in detail along with the HyTrust solution that addresses it.

Only Encrypt What’s Been Allocated

Now let’s look at a slightly different scenario as shown below:

In this figure, only blocks 0, 2 and n have been allocated by the filesystem. By default, we will only encrypt blocks that have been used. We’ll ignore the rest. There are two possible scenarios:

  1. The disk is thin-provisioned. This means that only blocks written to by the VM are actually allocated. The goal here is the same as for deduplication: save storage space! Since we are not encrypting these “unused” blocks, no additional space is required.
  2. The disk is “thick”. In other words, all blocks are allocated before the OS is installed and a filesystem is created on the disk. In this scenario, we only encrypt the blocks that are in use by the filesystem, and any other blocks can run through the dedupe engine and reduce the amount of storage considerably.

NOTE – by default, DataControl on Windows will detect “holes” (blocks not allocated or zero-filled blocks) and not encrypt them.
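The effect of skipping holes can be illustrated with a small sketch. It treats a zero-filled block as a hole and uses a toy position-dependent cipher. Both are assumptions made for the example, as are the names `toy_encrypt_block` and `encrypt_disk`; this is not how DataControl is implemented.

```python
import hashlib
import hmac

BLOCK_SIZE = 32
ZERO_BLOCK = bytes(BLOCK_SIZE)  # stands in for an unallocated "hole"


def toy_encrypt_block(key: bytes, block_index: int, plaintext: bytes) -> bytes:
    """Toy position-dependent cipher (NOT real encryption)."""
    tweak = block_index.to_bytes(8, "big")
    keystream = hmac.new(key, tweak, hashlib.sha256).digest()
    return bytes(p ^ k for p, k in zip(plaintext, keystream))


def encrypt_disk(key: bytes, blocks, skip_holes: bool = True):
    """Encrypt a list of blocks, optionally leaving zero-filled blocks alone."""
    out = []
    for index, block in enumerate(blocks):
        if skip_holes and block == ZERO_BLOCK:
            out.append(block)  # unused space stays as-is and still dedupes
        else:
            out.append(toy_encrypt_block(key, index, block))
    return out


# An 8-block thick disk where only blocks 0, 2 and 7 hold data.
disk = [ZERO_BLOCK] * 8
disk[0], disk[2], disk[7] = b"A" * BLOCK_SIZE, b"B" * BLOCK_SIZE, b"C" * BLOCK_SIZE

skipping = encrypt_disk(b"example-key", disk, skip_holes=True)
everything = encrypt_disk(b"example-key", disk, skip_holes=False)
print("unique blocks when skipping holes:", len(set(skipping)))
print("unique blocks when encrypting all:", len(set(everything)))
```

When holes are skipped, the five unused blocks remain identical and collapse to a single stored block. When everything is encrypted, each of them becomes unique.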

Clones and Templates

DataControl supports two types of templates/clones:

  1. The ability to copy an encrypted VM and clone the associated encryption keys.
  2. The ability to create a template from an encrypted VM and clone the associated encryption keys as new VMs are spun up from the template.

We show what happens in Figure 4 as a new VM is created from a clone.

When a clone is taken, there are 2 scenarios:

  1. Copy-on-write. In this case, blocks are only allocated when the second VM writes to any of its disk blocks. Thus, although the disks of the first VM are encrypted, there is no increase in storage after taking the clone.
  2. There is a full copy of the disk. Here, all blocks are allocated and copied from the original VM’s disk. But from a dedupe perspective, there is no change in storage allocated. Block 0 of VM-2’s disk is identical to block 0 of VM-1’s disk so the dedupe engine just increases the reference count.

In both cases, as each VM writes to storage, the blocks will change and become unique. When that happens, storage use will increase regardless of whether the data is encrypted or not.

Further Optimization for Dedupe Storage

The one time that encryption and dedupe do not work well together can be summarized by the following use case – there are 100 Windows VMs already instantiated and they need to be encrypted. We have a C: for the operating system and a D: for data. We install the DataControl Policy Agent on each VM and encrypt all of the drives.

If the C: drives are not going to be encrypted and only the data drives are, there is unlikely to be much, if any, reduction in dedupe, since data drives are very likely to be unique per VM and offer little to deduplicate in the first place. Of course, this is very much dependent on the type of data.

But let’s assume that we want to encrypt the C: drives. With 100 copies of Windows already instantiated, there will be 100 keys created to encrypt those 100 C: drives. And once encryption is completed there will be no dedupe since every encrypted block, in every C: drive in every VM, will be different.

Now what happens if we used a single encryption key for those 100 C: drives? Let’s assume that those 100 C: drives are identical. In this case the dedupe result is as shown in the figure below.

Let’s take block 0 as an example. On each of the 100 VMs, the resulting encrypted block is going to be identical since the key is the same. The same is true for block 1 and so on. Thus, at the storage layer, for block 0 we are only going to store a single block and it will have a reference count of 100. The only thing that we are losing is dedupe of identical blocks within a single VM’s C: drive, which we believe is likely to be minimal.
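The difference between per-VM keys and a single shared key can be shown with a small simulation. It is a sketch under stated assumptions: a toy position-dependent cipher stands in for real encryption, the 8-block golden image is hypothetical, and all function names are invented for the example.

```python
import hashlib
import hmac

BLOCK_SIZE = 32
VM_COUNT = 100


def toy_encrypt_block(key: bytes, block_index: int, plaintext: bytes) -> bytes:
    """Toy position-dependent cipher (NOT real encryption)."""
    tweak = block_index.to_bytes(8, "big")
    keystream = hmac.new(key, tweak, hashlib.sha256).digest()
    return bytes(p ^ k for p, k in zip(plaintext, keystream))


def encrypt_disk(key: bytes, blocks):
    return [toy_encrypt_block(key, i, b) for i, b in enumerate(blocks)]


def stored_blocks(disks) -> int:
    """Count the unique blocks a hash-based dedupe layer would keep."""
    return len({hashlib.sha256(b).digest() for disk in disks for b in disk})


# Hypothetical golden C: image: 8 blocks, where blocks 0 and 7 are identical.
golden = [bytes([i % 7]) * BLOCK_SIZE for i in range(8)]

unencrypted = stored_blocks([golden] * VM_COUNT)
shared_key = stored_blocks(
    [encrypt_disk(b"shared-key", golden) for _ in range(VM_COUNT)]
)
per_vm_keys = stored_blocks(
    [encrypt_disk(f"key-{vm}".encode(), golden) for vm in range(VM_COUNT)]
)

print("no encryption:", unencrypted)
print("single shared key:", shared_key)
print("one key per VM:", per_vm_keys)
```

With this toy data, the unencrypted disks reduce to 7 stored blocks, the shared-key disks to 8, and the per-VM-key disks to 800. The one extra block in the shared-key case is the lost dedupe within a single drive: blocks 0 and 7 hold the same data but sit at different positions, so they encrypt differently.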

Therefore, by using this feature, which is available in HyTrust DataControl, we should only see a small drop in dedupe efficiency regardless of how VMs are deployed or when the encryption takes place. In our VSAN tests with a single key across a range of VMs, we see about 90% of the deduplication that you would obtain without encryption. A win-win situation.