Compression & Deduplication – Oil & Water or Milk & Cookies
UPDATE
Oil & Water?
Last week Mike Davis from Ocarina Networks published a blog post, “Compression and Dedupe like Oil & Water?” It was a good piece. I don’t know Mike, but from what I understand he will be taking over blogging now that Sunshine has moved on to greener pastures, and I wish her the best. I am writing this because Mike made some interesting statements in his piece and I had some questions. I know the guys at Wikibon have ideas on this topic, and I tried asking my questions via Twitter and then on his blog but haven’t received any feedback (trust me, I am not naive; I know we are all very busy), so I thought it would be interesting to share my thoughts and try to start some dialog.
Mike stated:
“If you apply a compression-only workflow to a dataset let’s say you get 50%. Now run the same data set through a dedupe-only workflow and you’ll get maybe 20% (remember this is primary storage not backup data). Now take those little chunks and pointers from the dedupe workflow and compress them; you might get an additional 35% for a total of 55%. So compression of deduped data is less effective than on the raw data-set, but the combination (for this example) has eeked out a 5% advantage over the compression-only workflow.”
I understand Mike to be saying that if you used deduplication and compression together you could potentially get an additional 5% optimization of your storage over standard compression. My question is: at what cost? I don’t necessarily mean dollar cost, although that is a factor, but the cost to the end user and the IT administrator. When I think of capacity optimization for primary storage, here is what I believe the requirements are for IT:
- Optimization cannot cause any impact to the performance of the storage array
- Optimization cannot cause any change in downstream processes for the systems administrator
- Optimization cannot cause any increase in storage management functions
- *The solution needs to be heterogeneous (I just remembered this one)
If the optimization technology cannot ensure that these key storage functions are maintained, then quite frankly, the solution is not a solution for primary storage.
Let’s think about this from IT’s perspective. First, if I implement a solution that can’t optimize data in real time, then the optimization must be done post-process, once the data is stored. If this is the case, I need to find time on the array when the workload is low enough to allow the solution to perform the necessary I/O, and hence put load on the system. Storage systems are busy with users more than 8 hours per day, and there also needs to be time to take snapshots and time to back up the system. So where is there time to perform the optimization of the storage?
Second, once the storage is optimized, is it readable? In other words, if a user needs some of the optimized data, can the application that wrote the data get at it? If it is deduplicated and compressed, it cannot be accessed. (In fact, the only deduplication technology today that allows a file to be viewed, read-only, is Avamar.) This means that if the data is required for additional processing, IT must rehydrate it before it can be used. This creates additional process work for the IT administrator and requires disk capacity for the rehydrated data, so you’re not really saving the space.
Finally, any solution that interferes with IT business processes, such as backup, is again not a good solution. IT has spent an inordinate amount of time and money on its backup best practices. If an optimization technology that sits in front of the backup process significantly alters that process, then its ROI would have to be unbelievably compelling to justify such a dramatic shift, and that is almost unheard of. If a capacity optimization technology such as deduplication is implemented on primary storage, it changes the files on the file system such that when IT goes to back up the newly deduplicated file, it is considered a brand new file. Now a file that would not have needed to be backed up (incremental backups don’t back up unchanged files) has to be backed up. Yes, this new file/blob takes up less disk capacity, but is IT going to go back and remove the non-deduplicated file from tape? So in reality you’re storing more. Additionally, how do I track this file/blob in my backup system? Is it indexed as a file? If not, how do I recover this file in the event of a disaster? Most importantly, how does a deduplication technology such as Avamar or Data Domain back up this new deduplicated file? The backup vendors with deduplication technology tell folks who encrypt or compress (and now deduplicate) their primary capacity to decrypt, decompress or rehydrate their data before using their deduplication, because any of these will ruin their deduplication ratio. How is this good for primary storage if I have to rehydrate to do backups? How is this good for the backup environment?
The new finally. Folks have asked about other solutions such as NTAP’s compression, EMC’s compression or ZFS. Again, these are all good solutions that would be a good fit for certain use cases, but the problem with each of them is vendor lock-in: they are not heterogeneous. In order to maintain flexibility in IT, it is important to purchase heterogeneous solutions.
Let’s also think about what the 5% actually means to the array. If I deduplicate 10 TB of capacity by 20%, I am left with 8 TB. The additional 35% of compression is on the 8 TB, not the 10 TB, so I still have 5.2 TB of capacity. Standard LZ compression, again depending on data type, should yield 50% compression at a minimum, giving you 5 TB of capacity. I think in this case compression would be a better solution.
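The arithmetic above can be checked with a few lines of Python. This is only a sketch of the calculation as I read it, assuming the 35% applies to what is left after deduplication; the function and variable names are made up for illustration.

```python
def remaining_after(capacity_tb, *reductions):
    """Apply each fractional reduction, in order, to whatever capacity is left."""
    remaining = capacity_tb
    for reduction in reductions:
        remaining *= (1.0 - reduction)
    return remaining

raw_tb = 10.0

# 20% deduplication first, then 35% compression on the deduplicated data
dedupe_then_compress = remaining_after(raw_tb, 0.20, 0.35)

# 50% compression applied directly to the raw data
compress_only = remaining_after(raw_tb, 0.50)

# Overall savings of the combined workflow relative to the raw capacity
combined_savings = 1.0 - dedupe_then_compress / raw_tb

print(f"dedupe + compression leaves {dedupe_then_compress:.1f} TB")
print(f"compression only leaves     {compress_only:.1f} TB")
print(f"combined savings:           {combined_savings:.0%}")
```

Under this reading the combined workflow saves 48% rather than 55%, which is why compression alone comes out ahead in this example.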
Compression and Deduplication – Milk & Cookies
The reality is that, as with every answer in IT, for every use case there is an ‘it depends’ answer. Compression and deduplication actually can co-exist. They can even co-exist on different tiers of storage if done properly. If compression is done right, in real time, such that there is no impact to primary storage from an application integration perspective, a performance perspective, or a downstream process perspective, then compression on primary storage is the right answer. I would also say that if you could deduplicate data in the same manner, it would be a viable solution as well, but unlike compression, there are no deduplication solutions that can achieve these characteristics.
Now deduplication on ‘primary’ (and primary is in quotes because, again, it’s an ‘it depends’ situation) can be done if the primary storage is an archive where no real-time data access and no secondary operation such as backup is going to occur. But if the ‘primary’ storage is active storage, then the only way to do any capacity optimization, given the tools that are available today, is real-time compression. Additionally, in order to maximize the solution, IT would want real-time compression that uses random access compression as well. If random access compression is used, then downstream processes such as backup and deduplication are enhanced, not degraded.
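To make the idea of random access compression a little more concrete, here is a minimal sketch in Python. It is one generic way to get random access (compress fixed-size blocks independently so a read only decompresses the blocks it touches) and is not a description of how any vendor’s product works. The block size and function names are assumptions for illustration.

```python
import zlib

BLOCK_SIZE = 4096  # hypothetical block size, chosen only for illustration

def compress_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Compress each fixed-size block on its own so it can be read independently."""
    return [zlib.compress(data[i:i + block_size])
            for i in range(0, len(data), block_size)]

def read_range(blocks, offset: int, length: int, block_size: int = BLOCK_SIZE) -> bytes:
    """Return `length` bytes starting at `offset`, decompressing only the blocks that overlap."""
    if length <= 0:
        return b""
    first = offset // block_size
    last = (offset + length - 1) // block_size
    chunk = b"".join(zlib.decompress(b) for b in blocks[first:last + 1])
    start = offset - first * block_size
    return chunk[start:start + length]

# Sample data: 25,600 bytes, which spans seven 4 KiB blocks
data = bytes(range(256)) * 100
blocks = compress_blocks(data)

# Reading 300 bytes at offset 5000 touches only the second block
piece = read_range(blocks, 5000, 300)
print(piece == data[5000:5300])
```

The trade-off in a scheme like this is that smaller blocks give cheaper random reads but a somewhat lower compression ratio, since each block is compressed without knowledge of its neighbors.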
Today Storwize is the only solution that provides IT with real-time primary storage compression that is transparent to the application that wrote the data (meaning it can read the compressed data) and that uses random access techniques, so the data doesn’t have to be decompressed before it is backed up.
Now when the data is backed up, if you use deduplication, the Storwize technology can enhance deduplication, taking a 10x optimization to a 14x optimization and saving IT money not only on CapEx but on OpEx as well. This is why I think compression and data deduplication are more like milk and cookies than oil and water, but I encourage your thoughts.






Hi. Just to ask, you state a couple of commercially available solutions for compression and/or deduplication on primary storage, but, with significant CPU horsepower and RAM, isn’t ZFS a known realtime compression solution and possibly a reliable deduplication solution for primary storage?
Matt,
ZFS is a ‘real-time’ compression file system, the only issue is that while they do talk about no performance issues, they have a ton – it isn’t fast enough to run real time and the only way to make it so is with a ton of horsepower. So my first question, ‘at what cost’ is still the case. Now, if ZFS starts to have good performance, and as they claim, there is no change to the application (which I am not positive about but for this purpose, I’ll assume there isn’t), then this is a good start.
The next issue is does it work with your backups? Now if you’re backing up to tape the question is, what ends up on tape and can it be found easily enough should you need to do a recovery? Given the fact that IT has invested heavily over the last couple of years, and they all tell you to decompress your data before you back up – the question is – does it work with deduplication? There is only one technology that does this today because it does random access compression and that is a part of the IP of Storwize. So the new question is, does it make sense to stop doing deduplication for the segment of your data that is 4x or more than your primary with all the fulls and incrementals using a compression technology on primary that may only save you 50% of the primary footprint and may be slow? Again, there may be cases where that is an okay solution, as with every answer in IT – it depends.
An excellent post!
BTW – It’s a shame there are no industry-standard benchmarks for primary storage compression/dedup. These would allow potential IT buyers to compare performance and data reduction at the same time.
I left the following comment on Mike Davis’s blog on June 2nd:
“I don’t know why, but there is a misconception among practitioners about the technologies we all refer to as compression and deduplication. They aren’t oil and water. They’re more like two different flavors of water.
A few of my past comments about the topic can be found here:
http://wikibon.org/wiki/v/Pitfalls_of_compressing_online_storage
I wrote, “De-dupe (of the type the storage industry now markets) IS a form of compression. Traditional “dedupe” (aka compression) occurs within a file (e.g. gif, jpg, zip, tar, etc) where the data resides along with its dictionary. In contrast, storage-level de-dupe is executed across files and repositories using an external “dictionary” managed separately from the files. That is to say, compression is little more than ultra-granular deduplication occurring within a file.”
Obviously, it is always possible to further compress data that has been “deduplicated” by looking for redundancy at a more granular level than that which was used to deduplicate the original data. It is no different, in practice, than turning up the level of compression or deduplication from low to high such that comparisons are made between progressively smaller strings.
So, Mike, figures such as 50% compression, 20% deduplication and a combined 55% (dedupe and compression) are absolutely meaningless without additional context. The outcome of each test would depend on the base (and relative) levels of compression/deduplication. Crank up dedupe (using smaller or variable length strings) and the benefit from further compression will drop. Dial back the dedupe and the benefits from further compression will jump. A useful (if imperfect) analogy: It’s a bit like compressing a JPG in Photoshop using the “low” setting, then compressing the output file a second time using the “high” setting…albeit lossy.”
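The point about granularity can be illustrated with a small Python sketch. It assumes fixed-size chunking with SHA-256 digests as the external “dictionary” and uses synthetic data built from eight repeating 1 KiB blocks; the chunk sizes, names and data are all made up for illustration and do not represent any product.

```python
import hashlib
import random
import zlib

def dedupe(data: bytes, chunk_size: int):
    """Fixed-size chunk dedupe: keep each distinct chunk once, plus a pointer per position."""
    store = {}     # digest -> chunk bytes (the external "dictionary")
    pointers = []  # one digest per chunk position in the original data
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        digest = hashlib.sha256(chunk).digest()
        store.setdefault(digest, chunk)
        pointers.append(digest)
    return store, pointers

def rehydrate(store, pointers) -> bytes:
    """Rebuild the original data from the chunk store and the pointer list."""
    return b"".join(store[d] for d in pointers)

def stored_bytes(store, pointers, compress: bool = False) -> int:
    """Unique chunk bytes (optionally zlib-compressed) plus 32 bytes per pointer."""
    unique = b"".join(store.values())
    body = zlib.compress(unique) if compress else unique
    return len(body) + 32 * len(pointers)

# Synthetic data: 200 picks from eight distinct 1 KiB blocks of lowercase letters
rng = random.Random(0)
base_blocks = [bytes(rng.randrange(97, 123) for _ in range(1024)) for _ in range(8)]
data = b"".join(rng.choice(base_blocks) for _ in range(200))

coarse_store, coarse_ptrs = dedupe(data, 4096)  # coarse granularity
fine_store, fine_ptrs = dedupe(data, 1024)      # fine granularity

for label, store, ptrs in (("4 KiB chunks", coarse_store, coarse_ptrs),
                           ("1 KiB chunks", fine_store, fine_ptrs)):
    print(label, len(store), "unique chunks,",
          stored_bytes(store, ptrs), "bytes deduped,",
          stored_bytes(store, ptrs, compress=True), "bytes deduped + compressed")
```

On this data the finer chunking finds far more duplicates, and compression still squeezes the unique chunks further because it works at a finer level again (individual byte patterns within a chunk).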
To date Mike has not approved the comment to appear on his blog, nor has he replied to it. Frankly, there’s not much he can say.
It’s definitely Milk & Cookies.