Category: Data Deduplication

A Blueprint for Primary Storage Optimization


During the past three to four months the storage industry has seen a spike in the number of reports, white papers and news articles surrounding the evolution of one primary storage technology: capacity optimization (it is 2010’s Hottest Storage Technology).

The reason this technology is getting a lot of ‘air play’ these days is that it is so critical to controlling the growth and cost of storage.  In 2010 the EMC-sponsored IDC report The Digital Universe Decade – Are You Ready? was released and stated that:

  • In 2009, amid the “Great Recession,” the amount of digital information grew 62% over 2008 to 800 billion gigabytes (0.8 Zettabytes).
  • The amount of digital information created annually will grow by a factor of 44 from 2009 to 2020…

The folks at Wikibon also released an infographic that exposes the true explosion of data.

Information Explosion & Cloud Storage
Via: Wikibon

When you combine storage capacity (and the footprint it takes up) with the power it takes to run and cool it, as well as the human resources it takes to manage it, you soon realize we cannot keep ‘just adding more cheap disk’ in an effort to manage storage demands.  High tech companies with high tech labs are also telling IT that ‘they are out of tricks’ when it comes to the ability to continue delivering disk drives that double in capacity every 18 months.  It is for these reasons that primary storage optimization technologies have stepped into the ‘limelight’: they serve as a means to help control the growth of primary storage, including the footprint, power, cooling and manpower required to manage it.

However, as we all know in IT, no two environments are the same and what may be good for one may not be good for another.  When looking at primary storage optimization there seem to be a number of available technologies and ways to deploy these technologies and the key question is what is right for ‘my’ environment.

Marketing, FUD and Doing What You Do Best


Rather than leave a lengthy comment on Tom Cook’s blog post from Friday, Compression and Dedupe: Business Value and Data Safety (and from a marketing perspective, Fridays are bad days to post blogs – especially in the summer), I thought I would respond here (this may get lengthy as Tom made a number of points which I need to comment on).

The first thing I do want to say is that when doing technical marketing, the proper strategy is to take an offensive approach rather than play defense.  However, given the amount of FUD that Tom put in his latest blog post, I have to defend compression to some degree.

Now, I think we can all agree that data compression and data deduplication are two technologies that can complement one another very well.  Avamar (EMC) deduplicates the data at the source and then compresses the data before sending it to the Avamar Data Store gaining tremendous efficiency in network utilization.  ProtecTIER (IBM) compresses the data once it is deduplicated at the target device before it stores the data.  Other solutions also combine compression and data deduplication.

I’d like to comment on some key points Tom made in his piece where he is just blatantly wrong:

1)      Compression identifies redundant data across a very small window, usually 64 KB. – While this may be true for other compression technologies, this is not true for Storwize.  Storwize performs compression where the initial window is not fixed in size at all; it is the resultant write that is fixed in size.  This size is also specifically mapped to the I/O pattern of the data being written.  The goal is that in one I/O Storwize can do all the work it needs to on a particular file or LUN, and it is for this reason Storwize has no performance penalty.

2)      Compression produces data reduction rates at most 2X for most data types. – Seems Tom needs a lesson in the most common answer in IT – “IT DEPENDS”.  Data compression ratios are 100% tied to the data type.  For a true indication of data compression ratios see Figure 1.
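To make the “it depends” point concrete, here is a minimal sketch that compresses two very different kinds of data and compares the ratios.  It uses Python’s standard zlib module purely as a stand-in compressor; it is not Storwize’s algorithm, and the two sample data sets (`sample_text`, `sample_random`) are hypothetical illustrations rather than real customer data.

```python
import random
import zlib


def compression_ratio(data: bytes) -> float:
    """Return original size divided by zlib-compressed size."""
    return len(data) / len(zlib.compress(data, 6))


def sample_text(n_lines: int = 5000) -> bytes:
    """Hypothetical stand-in for highly repetitive, text-like data such as logs."""
    line = b"2010-06-01 12:00:00 INFO request served in 12 ms\n"
    return line * n_lines


def sample_random(n_bytes: int = 250_000, seed: int = 0) -> bytes:
    """Hypothetical stand-in for already-compressed or encrypted data."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n_bytes))


if __name__ == "__main__":
    for name, data in (("text-like", sample_text()), ("random", sample_random())):
        print(f"{name}: {compression_ratio(data):.2f}:1")
```

The repetitive text compresses far better than 2X, while the random bytes barely compress at all, which is why a single ratio quoted for “most data types” says very little.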

Setting the Record Straight on Backup


Or should I say, ‘Setting the Record Straight on Backing Up Optimized Data’?  Carter discusses on his blog the myriad ways to perform backups on optimized data.  (His blog actually reads more like a white paper explaining how backup needs to be configured to work with his product.)  One of the ways Carter describes is backup via NDMP, which he says “… is the most complicated.”  The funny thing is that this is the way that 90% of enterprises back up their NAS data.  The other scenarios are not quite stated correctly or are again designed to lead users to believe their solution is ‘simple’ when they really add complexity (however, I’ll let the backup community debate that – I have been in backup for 10+ years and I know this won’t go over on them, nor do I want to waste too much blog space).  Finally, the last scenario they discuss isn’t backup – it’s replication, but I’ll address that too.

Let’s address these one at a time.  First, Carter mentions that in some scenarios there is a need to rehydrate data in order to back it up.  The process of rehydrating data may not require that the array have the physical capacity to store the data before it is backed up, but the array will require the CPU resources, I/O resources, bandwidth and time to rehydrate the data to back it up.  George goes on to say that this situation is “ugly, but not that ugly”.  I will tell you that any time you put more resource requirements on systems that do backups, you’re running the risk that backups won’t get done.  One of the greatest challenges in IT is backup.  Backup administrators are running into backup window problems all the time.  Data is growing, not shrinking; having to do more work on more data in order to protect it is a recipe for failure.  In my previous comments I may have incorrectly stated you need more disk space to do the backups, but I did correctly state that the array will require more system resources.  And where do these resources come from?  When the system is idle?  When is your storage array idle?

Now, what if all you had to do was – well, nothing?  Storwize sits in front of primary storage and stores your data, compressed, in real time with no performance impact while preserving the envelope of the data file.  Then when it comes time to back up, the backup administrator does absolutely nothing different than he/she did yesterday.  The same shares are backed up, the same clients, and all the work is done by the Storwize appliance; there is no load on the filer.  The next question is whether Storwize can keep up with the backup stream, and the answer is YES.  As you saw in the Wikibon CORE blog, our time to compress is on the order of milliseconds – the time to decompress is even less.  (I should also mention one thing Carter failed to mention: in order for backups to come off their system ‘transparently’ you need a software agent on the client – who wants to manage more clients?)

Compressed Thoughts – Compression and Deduplication


This video doesn’t talk about the merits of one versus the other but about how compression (or capacity optimization), when done right, should enhance data deduplication, not impact it.  Enjoy, and for more videos like this one go to the StorwizeChannel.

Storage’s 2010 Hottest Technology


Each year there tends to be one technology that stands out in the storage space.  In 2009 it was data deduplication.  At the end of 2008 EMC made an acquisition of a source-based deduplication solution called Avamar.  Later, in 2009, they announced a strategic partnership with Quantum for data deduplication at the target.  Then in 2009 EMC made a bid against NetApp for Data Domain and won.  In addition, NetApp had data deduplication announcements with its ASIS technology.  Quantum, Falconstor, and Symantec all had their own stories with data deduplication, and a host of non-public companies such as Permabit, Sepaton, and Exagrid were all talking about the merits of data deduplication.

As the story goes, if you haven’t put data deduplication in your backup environment yet you’re either in an environment where there is not one iota of duplicate data, which is highly unlikely, or the company you work for has gobs of money and has no problem:

  1. Backing up to slow tape
  2. Recovering slowly from tape
  3. Keeping massive amounts of data on unreliable tape
  4. Backing up full streams of data to disk (and wasting valuable storage space)

What I am saying is that if you haven’t implemented a data deduplication solution by now, you have been left in the technology dust.  Data deduplication just makes too much sense.  I know we have all heard the expression “No one ever got fired for buying X.”  But has anyone ever got promoted because they bought X?  I have to believe that the IT team that can save their company 50% or more of their storage will get promoted.  Storage is a cost drain on IT.  It’s the applications that make a company money.  It’s time to start focusing some of those valuable IT dollars on the applications that make your company money.  It’s time to be the IT Super Hero!

Compression & Deduplication – Oil & Water or Milk & Cookies


UPDATE

Oil & Water?

Last week Mike Davis from Ocarina Networks published a blog post “Compression and Dedupe like Oil & Water?”  It was a good piece.  I don’t know Mike, but from what I understand he will be taking over blogging as Sunshine has moved on to greener pastures, and I wish her the best.  The reason for this piece is that Mike made some interesting statements in his post and I had some questions.  I know the guys at Wikibon have ideas on this topic, and I tried asking my questions via Twitter and then on his blog but haven’t received any feedback (trust me, I am not naive, I know we are all very busy), so I thought it would be interesting to share my thoughts and try to start some dialog.

Mike stated:

“If you apply a compression-only workflow to a dataset let’s say you get 50%. Now run the same data set through a dedupe-only workflow and you’ll get maybe 20% (remember this is primary storage not backup data). Now take those little chunks and pointers from the dedupe workflow and compress them; you might get an additional 35% for a total of 55%. So compression of deduped data is less effective than on the raw data-set, but the combination (for this example) has eeked out a 5% advantage over the compression-only workflow.”

I understand Mike to be saying that if you used deduplication and compression you could potentially get an additional 5% optimization of your storage over standard compression.  My question is: at what cost?  I don’t necessarily mean $ cost either (though that is a factor), but the cost to the end user and the IT administrator.  When I think of capacity optimization for primary storage, here is what I believe the requirements are for IT:

  1. Optimization cannot cause any impact to the performance of the storage array
  2. Optimization cannot cause any change in downstream processes for the systems administrator
  3. Optimization cannot cause any increase in storage management functions
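As a side note on the numbers in Mike’s example, the percentages can be combined in two ways, and it is worth seeing both.  The sketch below is only my reading of his figures: the function names are hypothetical, and the inputs (20% from dedupe, an additional 35% from compression, 50% for compression alone) are taken from the quote above.

```python
def additive_total(dedupe_saving: float, compression_saving: float) -> float:
    """Both savings are expressed as fractions of the ORIGINAL data set."""
    return dedupe_saving + compression_saving


def sequential_total(dedupe_saving: float, compression_saving: float) -> float:
    """Compression saving is a fraction of what REMAINS after dedupe."""
    remaining = (1 - dedupe_saving) * (1 - compression_saving)
    return 1 - remaining


if __name__ == "__main__":
    dedupe, extra = 0.20, 0.35  # figures from the quoted example
    print(f"additive reading:   {additive_total(dedupe, extra):.0%}")
    print(f"sequential reading: {sequential_total(dedupe, extra):.0%}")
```

The 55% total in the quote corresponds to the additive reading.  If the 35% were instead measured against the already deduplicated data, the combined saving would come out below the 50% from compression alone, so the claimed 5% advantage depends on how the percentages are defined.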

The Myths about Compression and Data Deduplication


How many of you have heard that compression and deduplication just don’t belong together?  Like oil and water.  I know from experience: when I worked for EMC, the Avamar sales reps and the Data Domain sales reps would tell customers who had encrypted or compressed primary data that the best thing to do was to uncompress it to get the savings in their backups that deduplication promises.

This is wrong on a number of levels.  First, the very idea of telling a customer to not compress primary storage data only to get downstream benefits is counterintuitive.  Second, if the customer has already changed their processes in order to accommodate compressed primary data, then the deduplication backup vendor is asking the customer to change those processes again.  Not to mention it costs the customer more money in primary storage, and lastly it undermines the decision made by the customer to compress the data in the first place.  If you really want to insult your customer, tell them the decision they made to save money was a bad one.  Finally, all data deduplication technologies utilize LZ compression on their data ‘chunks’ to further reduce their data size, and then use this added compression benefit to talk about their deduplication ratios.

The reality is, with traditional compression implementations, the effects of deduplication are not significantly realized.  The reason is how traditional compression writes the files it compresses.  If a file is changed, then from the point of the change through the rest of the file, the new compressed file is essentially a new file.  When deduplication (even variable block deduplication) looks at this file and finds the initial changed blocks, the rest of the file will also be different and the deduplication ratios will be significantly reduced.  (Essentially it turns the highly effective ‘variable block’ deduplication into ‘fixed block’ deduplication, and research shows that fixed block deduplication is 3 to 5 times less efficient than variable block deduplication.  Now that you’ve spent all that money for an expensive variable block solution, are you really getting the benefits?)
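The behavior described above can be seen with a small experiment.  This is a minimal sketch under stated assumptions: zlib stands in for ‘traditional compression’, a toy fixed-size chunk hasher stands in for a deduplication engine (it is not any vendor’s algorithm), and the sample file, chunk size and helper names are all hypothetical.

```python
import hashlib
import random
import zlib

CHUNK = 4096  # assumed fixed chunk size for this toy deduplicator


def make_file(n_lines: int = 20_000, seed: int = 0) -> bytes:
    """Build a hypothetical text-like file from a small vocabulary."""
    rng = random.Random(seed)
    words = [b"backup", b"restore", b"tape", b"disk", b"block", b"chunk",
             b"array", b"filer", b"share", b"client", b"window", b"stream"]
    lines = (b" ".join(rng.choice(words) for _ in range(8)) for _ in range(n_lines))
    return b"\n".join(lines)


def chunk_hashes(data: bytes) -> list:
    """Hash every fixed-size chunk of the data."""
    return [hashlib.sha256(data[i:i + CHUNK]).digest()
            for i in range(0, len(data), CHUNK)]


def shared_fraction(old: bytes, new: bytes) -> float:
    """Fraction of chunks in `new` that already exist in `old`."""
    known = set(chunk_hashes(old))
    new_hashes = chunk_hashes(new)
    return sum(h in known for h in new_hashes) / len(new_hashes)


def edit_in_place(data: bytes, offset: int = 100) -> bytes:
    """Overwrite 16 bytes near the start without changing the file length."""
    return data[:offset] + b"X" * 16 + data[offset + 16:]


if __name__ == "__main__":
    original = make_file()
    edited = edit_in_place(original)
    print("raw chunks shared:       ", shared_fraction(original, edited))
    print("compressed chunks shared:",
          shared_fraction(zlib.compress(original), zlib.compress(edited)))
```

With the uncompressed files, only the chunk containing the edit is new.  With the compressed streams, the bytes after the edit typically no longer line up with the old stream, so the deduplicator finds far fewer matching chunks.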

How Much Backup Capacity Does Deduplication Really Save?


There is a lot of discussion around data deduplication for backup these days.  (I wish I could deduplicate all the turkey I ate last week.)  In fact, Gartner claims that “…by 2012, deduplication will be applied to 75% of backups.”  And when asked “Why?” the response was “…deduplication is too compelling to ignore.”  But I say “prove it”.  So I put together some backup capacity numbers for storing data on tape (non-compressed and compressed) versus storing data, deduplicated (fixed block and variable block), on disk and the numbers show a dramatic savings in backup space which translates into cost savings.

The Parameters

As with any ‘analysis’, numbers can be ‘spun’ to make them say what you want.  That said, I tried to be as straightforward as possible, so let me also show my methodology so you can see how my numbers were derived.

  • I charted the amount of capacity created using a retention policy of:
    • 14 Dailies
    • 4 Weeklies
    • 12 Monthlies
  • I selected 10TB of primary storage capacity
  • I did this for file system backups only
  • I charted the data for 30%, 40%, 50% and 60% primary storage growth rates
  • I charted traditional tape based backup (non-compressed)
  • I charted traditional tape based backup (compressed, 2:1)
  • I charted fixed block disk based deduplicated backup
  • I charted variable block disk based deduplicated backup (3 to 5 times more efficient than fixed block deduplication)

The Effect

The first thing to think about is the sheer number of full backup copies that must be maintained when utilizing the above retention schedule.  The above retention policy leads to 17.2 copies of the primary storage (12 monthlies + 4 weeklies + the equivalent of 1.2 with dailies = 17.2 copies).  Translation: one terabyte of primary storage becomes 17.2 terabytes of tape storage.  This means backup administrators need to pay for the physical tapes as well as the offsite transport and storage costs.  Now 17.2 terabytes of tape doesn’t sound like much, but keep in mind that is for 1TB of primary capacity.  Ten TB of primary capacity yields 172 TB of tape capacity.  Now add in year over year storage growth.  At 30% primary storage growth, backup storage grows 23%; at 40%, it grows 29%; at 50%, it grows 33%; and at 60%, it grows 38%.
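For anyone who wants to rerun the copy-count arithmetic with their own retention policy, here is a minimal sketch.  The function and parameter names are hypothetical; the 1.2 full-copy equivalent for the 14 dailies and the 2:1 tape compression ratio are the figures used in this post, not values derived by the code.

```python
def full_copy_equivalents(monthlies: int = 12, weeklies: int = 4,
                          dailies_equivalent: float = 1.2) -> float:
    """Number of full copies of primary storage held under the retention policy."""
    return monthlies + weeklies + dailies_equivalent


def backup_capacity_tb(primary_tb: float, copies: float,
                       reduction_ratio: float = 1.0) -> float:
    """Backup capacity needed; reduction_ratio=2.0 models 2:1 compression."""
    return primary_tb * copies / reduction_ratio


if __name__ == "__main__":
    copies = full_copy_equivalents()
    print(f"full-copy equivalents: {copies:.1f}")
    print(f"10TB primary, uncompressed tape: {backup_capacity_tb(10, copies):.0f} TB")
    print(f"10TB primary, 2:1 compressed tape: {backup_capacity_tb(10, copies, 2.0):.0f} TB")
```

Changing `reduction_ratio` is a simple way to compare tape compression against whatever deduplication ratio you expect in your own environment.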

Enterprise Data Protection at the Edge


What does that really mean?  When I worked for Veritas, back in 1998 we acquired a company based out of Canada called TeleBackup that backed up desktop / laptops.  In 1999 Veritas acquired Seagate and the Backup Exec product which also had a desktop / laptop option.  These products were meant to eventually be integrated into the main backup applications but never were.  Additionally, a lot of that software was given away (hard to make a business on that) and for the most part,  lived on a shelf somewhere and was never installed.

In 2004 I worked for Connected Corporate (acquired by Iron Mountain), whose sole business was desktop / laptop backup.  (In fact, from 2000 to 2004 I worked as an analyst for ESG covering all the vendors in the backup space and used the Connected product to back up my work laptop – and it actually saved my hide once.)  While the company executed a successful exit, the business was (and probably still is) only about a $20M to $40M business.

Why do I bring this up?  There is a new reality in IT these days.  I have said it before, IT is accountable for 100% of the data created in any company, including that stored on desktop/laptops.  This means that not only do they have to provide a location to store this data but IT also needs to provide tools to protect this information and ensure that this information is highly recoverable for both business productivity purposes as well as corporate and legal governance.   This means that desktop / laptop backup is now gaining a lot more visibility in the enterprise.

However, desktop / laptop data protection is one of those areas in IT that is just a nuisance because it seems like it should be an easy problem to solve, but there are so many moving parts to it that it ends up falling by the wayside.

A successful desktop / laptop backup technology needs three very specific capabilities:

  • Integrate seamlessly with the existing backup solution in the enterprise

Architecting for Recovery


Here is a shocker for you: backup IS a science.  Good backup administrators / architects are worth their weight in gold.  CIOs just wish backup would go away.  Backup costs money, it’s not strategic, it chews up manpower, and when it is ‘running’ (successfully or not) no one really pays attention to it.  But when it fails, or more likely when you need to restore data and can’t, someone can lose their job.  So backup is VERY important, it is a science, and to architect a backup environment correctly takes time, skill, money and someone who knows what they are doing.

Good backup administrators architect for recovery, not for backup.  Prove it, you say.  Okay, question: “Why do backup administrators do full backups of Exchange every night?”  Answer: because it is way easier and much faster to perform a one-step full recovery for Exchange than it is to lay down the weekly full and apply the incrementals.  Since mail is considered a “critical application” in the enterprise these days, and downtime is critical for this application, good backup administrators architect for the least amount of downtime for the application.  This also applies to databases.  Ninety-five percent of all databases are actually snapped for quick recovery, and I would also bet that a full backup is performed on them (or the snap) every evening.

Recovery is a primary driver of any good backup architecture but lately I have been hearing a great deal of talk around ‘backup consolidation’.  The reality is, there is no ‘one size fits all’ when it comes to backup software or hardware.  Consolidating backup software may make your environment easier to manage, but does it provide you the tools/technology you need to maximize your data protection objectives in your environment?  Consolidating backup targets (tape / disk) may yield fewer devices to manage, but what happens to your overall backup and recovery performance when doing so?  While new technologies may help fine-tune the science side of backup, they still need an artist’s touch.