Tag: "Backup"

Compression & Deduplication – Oil & Water or Milk & Cookies


UPDATE

Oil & Water?

Last week Mike Davis from Ocarina Networks published a blog post, "Compression and Dedupe like Oil & Water?"  It was a good piece.  I don't know Mike, but from what I understand he will be taking over blogging now that Sunshine has moved on to greener pastures, and I wish her the best.  I am writing this because Mike made some interesting statements in his piece and I had some questions.  I know the guys at Wikibon have ideas on this topic as well.  I tried asking my questions via Twitter and then on his blog but haven't received any feedback (trust me, I am not naive, I know we are all very busy), so I thought it would be interesting to share my thoughts and try to start some dialog.

Mike stated:

"If you apply a compression-only workflow to a dataset let’s say you get 50%. Now run the same data set through a dedupe-only workflow and you’ll get maybe 20% (remember this is primary storage not backup data). Now take those little chunks and pointers from the dedupe workflow and compress them; you might get an additional 35% for a total of 55%. So compression of deduped data is less effective than on the raw data-set, but the combination (for this example) has eeked out a 5% advantage over the compression-only workflow."

I understand Mike to be saying that if you used deduplication and compression together you could potentially get an additional 5% optimization of your storage over standard compression alone.  My question is: at what cost?  I don't necessarily mean dollar cost, although that is a factor, but the cost to the end user and the IT administrator.  When I think of capacity optimization for primary storage, here is what I believe the requirements are for IT:

  1. Optimization cannot cause any impact to the performance of the storage array
  2. Optimization cannot cause any change in downstream processes for the systems administrator
  3. Optimization cannot cause any increase in storage management functions
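
Going back to Mike's numbers for a moment, here is a quick sketch of the arithmetic as I read it.  The function names are mine, and the second (compounded) reading is an assumption I include only for contrast; Mike's quote adds the percentages, which is the first reading.

```python
# Savings are expressed as percent of the ORIGINAL data set.

def additive_total(dedupe_pct, extra_compression_pct):
    """Reading used in the quote: both savings are measured against
    the original data set, so they simply add up."""
    return dedupe_pct + extra_compression_pct

def compounded_total(dedupe_pct, compression_pct):
    """Alternative reading (my assumption, not Mike's): the compression
    percentage applies only to what is left after dedupe."""
    remaining_after_dedupe = 100 - dedupe_pct
    remaining_after_both = remaining_after_dedupe * (100 - compression_pct) / 100
    return 100 - remaining_after_both

COMPRESSION_ONLY = 50   # percent saved, from the quote
DEDUPE_ONLY = 20        # percent saved, from the quote
EXTRA_COMPRESSION = 35  # percent saved on top of dedupe, from the quote

combined = additive_total(DEDUPE_ONLY, EXTRA_COMPRESSION)
print("combined savings:", combined, "%")
print("advantage over compression only:", combined - COMPRESSION_ONLY, "points")
print("if the 35% were compounded instead:",
      compounded_total(DEDUPE_ONLY, EXTRA_COMPRESSION), "%")
```

The point of showing both readings is that the 5% advantage only exists under the additive one, so it matters which baseline a vendor is measuring against.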

The Myths about Compression and Data Deduplication


How many of you have heard that compression and deduplication just don’t belong together?  Like oil and water.  I know from experience: when I worked for EMC, the Avamar sales reps and the Data Domain sales reps would tell customers who had encrypted or compressed primary data that the best thing to do was to uncompress it in order to get the savings in their backups that deduplication promises.

This is wrong on a number of levels.  First, telling a customer not to compress primary storage data just to get downstream benefits is counterintuitive.  Second, if the customer has already changed their processes to accommodate compressed primary data, the deduplication backup vendor is asking them to change those processes yet again.  Third, it costs the customer more money in primary storage.  Fourth, it undermines the decision the customer made to compress the data in the first place.  If you really want to insult your customer, tell them the decision they made to save money was a bad one.  Finally, all data deduplication technologies utilize LZ compression on their data ‘chunks’ to further reduce their data size, and then use this added compression benefit to talk about their deduplication ratios.

The reality is, with traditional compression implementations, the effects of deduplication are not significantly realized.  The reason is how traditional compression writes the files it compresses.  If a file is changed, then from the point of the change through the rest of the file, the new compressed file is essentially a new file.  When deduplication (even variable block deduplication) looks at this file and finds the first changed blocks, the rest of the file will also be different and the deduplication ratios will be significantly reduced.  (Essentially it turns the highly effective ‘variable block’ deduplication into ‘fixed block’ deduplication, and research shows that fixed block deduplication is 3 to 5 times less efficient than variable block deduplication.  Now that you’ve spent all that money on an expensive variable block solution, are you really getting the benefits?)
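
If you want to see this effect for yourself, here is a minimal sketch.  It is not how any particular product works: I am assuming zlib as a stand-in for "traditional compression", a fixed 4 KB chunk size, SHA-256 chunk fingerprints, and a made-up text-like file.  It overwrites a few bytes near the start of the file and then checks how many chunks of the new version were already present in the old version, first on the raw data and then on the compressed data.

```python
import hashlib
import random
import zlib

CHUNK = 4096  # assumed fixed chunk size, in bytes

def chunk_hashes(data, size=CHUNK):
    """Fingerprint every fixed-size chunk of the data."""
    return [hashlib.sha256(data[i:i + size]).digest()
            for i in range(0, len(data), size)]

def shared_fraction(old, new, size=CHUNK):
    """Fraction of chunks in `new` that already exist somewhere in `old`."""
    seen = set(chunk_hashes(old, size))
    new_hashes = chunk_hashes(new, size)
    return sum(h in seen for h in new_hashes) / len(new_hashes)

# Build a repeatable stand-in for a text-like file (made-up content).
rng = random.Random(0)
words = [b"backup", b"restore", b"tape", b"disk", b"dedupe", b"compress"]
original = b" ".join(rng.choice(words) for _ in range(200_000))

# Overwrite 64 bytes near the start of the file, in place.
patched = bytearray(original)
patched[1000:1064] = bytes(range(128, 192))
modified = bytes(patched)

raw_shared = shared_fraction(original, modified)
compressed_shared = shared_fraction(zlib.compress(original),
                                    zlib.compress(modified))

print(f"raw chunks already stored:        {raw_shared:.1%}")
print(f"compressed chunks already stored: {compressed_shared:.1%}")
```

On the raw data only the chunk containing the change is new.  On the compressed data the output diverges from the point of the change onward, so very little lines up with what was stored before.  A variable block implementation would pick its chunk boundaries differently, but it is still looking at compressed bytes that no longer match.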


How Much Backup Capacity Does Deduplication Really Save?


There is a lot of discussion around data deduplication for backup these days.  (I wish I could deduplicate all the turkey I ate last week.)  In fact, Gartner claims that “…by 2012, deduplication will be applied to 75% of backups.”  And when asked “Why?” the response was “…deduplication is too compelling to ignore.”  But I say “prove it”.  So I put together some backup capacity numbers for storing data on tape (non-compressed and compressed) versus storing deduplicated data (fixed block and variable block) on disk, and the numbers show dramatic savings in backup space, which translate into cost savings.

The Parameters

As with any ‘analysis’, numbers can be ‘spun’ to make them say what you want.  That said, I tried to be as straightforward as possible, so let me show my methodology so you can see how my numbers were derived.

  • I charted the amount of capacity created using a retention policy of:
    • 14 Dailies
    • 4 Weeklies
    • 12 Monthlies
  • I selected 10TB of primary storage capacity
  • I did this for file system backups only
  • I charted the data for 30%, 40%, 50% and 60% primary storage growth rates
  • I charted traditional tape based backup (non-compressed)
  • I charted traditional tape based backup (compressed, 2:1)
  • I charted fixed block disk based deduplicated backup
  • I charted variable block disk based deduplicated backup (3 to 5 times more efficient than fixed block deduplication)

The Effect

The first thing to think about is the sheer number of full backup copies that must be maintained when utilizing the above retention schedule.  The above retention policy leads to 17.2 copies of the primary storage (12 monthlies + 4 weeklies + the equivalent of 1.2 from the dailies = 17.2 copies).  Translation: one terabyte of primary storage becomes 17.2 terabytes of tape storage.  This means backup administrators need to pay for the physical tapes as well as the offsite transport and storage costs.  Now, 17.2 terabytes of tape doesn’t sound like much, but keep in mind that is for 1TB of primary capacity.  Ten TB of primary capacity yields 172 TB of tape capacity.  Now add in year over year storage growth.  At 30% primary storage growth, backup storage grows 23%; at 40%, it grows 29%; at 50%, it grows 33%; and at 60%, it grows 38%.
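
For anyone who wants to plug in their own numbers, the copy count can be sketched in a few lines.  The inputs come from the retention policy listed under The Parameters (12 monthlies, 4 weeklies, and the 1.2-copy equivalent for the 14 dailies, which I take as given here rather than derive); the variable and function names are just for illustration.

```python
# Inputs from the parameters list; names are illustrative only.
PRIMARY_TB = 10            # primary storage capacity
MONTHLIES = 12             # monthly fulls retained
WEEKLIES = 4               # weekly fulls retained
DAILIES_EQUIVALENT = 1.2   # full-copy equivalent of the 14 dailies (taken as given)
TAPE_COMPRESSION = 2.0     # 2:1 compressed tape

def full_copy_equivalents(monthlies, weeklies, dailies_equivalent):
    """Number of full copies of primary storage held by the retention policy."""
    return monthlies + weeklies + dailies_equivalent

copies = full_copy_equivalents(MONTHLIES, WEEKLIES, DAILIES_EQUIVALENT)
tape_tb = copies * PRIMARY_TB
compressed_tape_tb = tape_tb / TAPE_COMPRESSION

print(f"full copy equivalents: {copies:.1f}")
print(f"tape capacity, non-compressed: {tape_tb:.0f} TB")
print(f"tape capacity, compressed 2:1: {compressed_tape_tb:.0f} TB")
```

This only covers the starting point.  The year over year growth figures depend on assumptions that are not spelled out in this excerpt, so I have left them out of the sketch.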


Enterprise Data Protection at the Edge


What does that really mean?  When I worked for Veritas, back in 1998 we acquired a company based out of Canada called TeleBackup that backed up desktops and laptops.  In 1999 Veritas acquired Seagate Software's storage management business and with it the Backup Exec product, which also had a desktop / laptop option.  These products were meant to eventually be integrated into the main backup applications but never were.  Additionally, a lot of that software was given away (hard to make a business on that) and, for the most part, lived on a shelf somewhere and was never installed.

In 2004 I worked for Connected Corporation (acquired by Iron Mountain), whose sole business was desktop / laptop backup.  (In fact, from 2000 to 2004 I worked as an analyst for ESG covering all the vendors in the backup space and used the Connected product to back up my work laptop, and it actually saved my hide once.)  While the company executed a successful exit, the business was (and probably still is) only about a $20M to $40M business.

Why do I bring this up?  There is a new reality in IT these days.  I have said it before: IT is accountable for 100% of the data created in any company, including data stored on desktops and laptops.  This means that IT not only has to provide a location to store this data but also needs to provide tools to protect this information and ensure that it is highly recoverable, both for business productivity and for corporate and legal governance.  As a result, desktop / laptop backup is now gaining a lot more visibility in the enterprise.

However, desktop / laptop data protection is one of those areas in IT that is just a nuisance because it seems like it should be an easy problem to solve, but there are so many moving parts to it that it ends up falling by the wayside.

A successful desktop / laptop backup technology needs three very specific capabilities:

  • Integrate seamlessly with the existing backup solution in the enterprise

Architecting for Recovery


Here is a shocker for you: backup IS a science.  Good backup administrators / architects are worth their weight in gold.  CIOs just wish backup would go away.  Backup costs money, it’s not strategic, and it chews up manpower.  When it is 'running' (successfully or not) no one really pays attention to it, but when it fails, or more likely when you need to restore data and can't, someone can lose their job.  So backup is VERY important, it is a science, and architecting a backup environment correctly takes time, skill, money and someone who knows what they are doing.

Good backup administrators architect for recovery, not for backup.  Prove it, you say.  Okay, question: “Why do backup administrators do full backups of Exchange every night?”  Answer: because it is way easier and much faster to perform a one-step full recovery for Exchange than it is to lay down the weekly full and apply the incrementals.  Since mail is considered a “critical application” in the enterprise these days, and downtime is costly for this application, good backup administrators architect for the least amount of downtime.  This also applies to databases.  Ninety-five percent of all databases are actually snapped for quick recovery, and I would also bet that a full backup is performed on them (or the snap) every evening.

Recovery is a primary driver of any good backup architecture but lately I have been hearing a great deal of talk around ‘backup consolidation’.  The reality is, there is no ‘one size fits all’ when it comes to backup software or hardware.  Consolidating backup software may make your environment easier to manage, but does it provide you the tools/technology you need to maximize your data protection objectives in your environment?  Consolidating backup targets (tape / disk) may yield fewer devices to manage, but what happens to your overall backup and recovery performance when doing so?  While new technologies may help fine-tune the science side of backup, they still need an artist’s touch.


Deduplication – Older than You Think


So I am a big fan of National Public Radio – NPR.  Today I learned that yesterday 10/29/09 was the 40th anniversary of the ‘internet’.  Now, I am sure there are a number of theories on when the internet was started and who started it, but safe to say that at this time in history 40 years ago, two guys from California sent the first 5 letter message, ‘Hello’, over a wire between two computers and internet messaging was born.

Since this point in time people have been trying to reduce the amount of data sent over the internet.  From email to instant messaging, from full files to compressed files and from disk drives to USB drives – people are always trying to make information trafficked over the internet smaller and faster.  No surprise coming from a group of people who have turned every term on the internet into an acronym, from USB, ISP, PDA, and LCD to SRM, ARM, and DPM, techies are always trying to stuff more data into smaller spaces.

Over the past 2 years data deduplication has become the latest fad in putting more data into a smaller space.  By removing redundant ‘blocks’ of data from the mass of files stored, it is conceivable to reduce your data footprint by as much as 70%.  Deduplication is playing a predominant role in backup, especially backup over the WAN.  With deduplication, you can easily move your data over the WAN to a central data center for protection, moving only small changes (blocks, not files), and make even more room for Facebook, Hulu, iTunes and more.  What is next for the internet?
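
The "blocks not files" idea is easy to sketch.  What follows is a toy illustration, not any vendor's implementation: I am assuming fixed 4 KB blocks, SHA-256 fingerprints, and an in-memory set standing in for the index at the central data center.  The helper names and the sample data are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, in bytes

def split_blocks(data, size=BLOCK_SIZE):
    """Cut the data into fixed-size blocks."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def blocks_to_send(data, known_hashes):
    """Return only the blocks the central site has not seen yet,
    recording their fingerprints in known_hashes as we go."""
    new_blocks = []
    for block in split_blocks(data):
        digest = hashlib.sha256(block).hexdigest()
        if digest not in known_hashes:
            known_hashes.add(digest)
            new_blocks.append(block)
    return new_blocks

# Stand-in for the fingerprint index kept at the central data center.
central_index = set()

# Two nightly versions of the same three-block file; only the middle block changes.
monday = b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE + b"C" * BLOCK_SIZE
tuesday = b"A" * BLOCK_SIZE + b"X" * BLOCK_SIZE + b"C" * BLOCK_SIZE

first_night = blocks_to_send(monday, central_index)
second_night = blocks_to_send(tuesday, central_index)

print("blocks sent on the first night: ", len(first_night))
print("blocks sent on the second night:", len(second_night))
```

The first night sends everything because the index is empty.  The second night sends only the block that changed, which is why deduplication is such a good fit for backup over the WAN.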


Comprehensive Capacity Optimization – Deduplication 2.0


Technology is great, isn't it?  When someone thinks they have a new idea on the same old technology foundation they call it "X 2.0".  I have been watching the banter between analysts and vendors (specifically NTAP’s Dr. Dedupe and Permabit’s CEO Tom Cook) on the topic of Deduplication 2.0 for the past few weeks, and it is my belief that the proverbial boat is being missed (since we are using water analogies), so I decided I have to jump in.  I find the real value of these conversations is what they offer the end user.  At the end of the day, it doesn't really matter who 'coined' or 'invented' a term (like deduplication 2.0); what matters is whether the term actually helps describe a technology and how that technology can be leveraged to make things better in the data center.  We should focus on the implications of this new generation of deduplication, ‘deduplication 2.0’.

In May I delivered a presentation to a number of EMC customers on the topic of Data Deduplication 2.0 - Comprehensive Capacity Optimization.  The point of my presentation was simple (and keep in mind this was before the Data Domain acquisition): there are a number of capacity optimization technologies/capabilities available to customers today.  Originally these deduplication technologies were used primarily for backup purposes, but slowly deduplication is making its way into primary storage.  Deduplication in primary storage makes a lot of sense FOR DATA THAT IS STATIC.  Why only static data?  Static data is data that isn't used frequently (that doesn't mean it's not important, it simply is not accessed often); because access to this data is infrequent, its performance requirements are lower than those of active data.  Remember: nothing in IT is free.  If I deduplicate data, then in order to use it I must ‘rehydrate’ it, and that has a performance implication, so I want to be careful where I deduplicate data so as not to inhibit performance on production data.


The Side Effects of Backup on Server Virtualization


Server virtualization has changed the IT landscape dramatically.  It has become a magic potion curing a number of ills in the physical server world such as low individual CPU utilization and excess use of space, power and cooling in the data center.  However, like all potions that cure what ails you, there can be side effects.  You need to be careful of what the Witch Doctor orders.

When I speak with customers who have aggressively implemented a virtual server infrastructure, 9 out of 10 tell me that they underestimated the effect virtualization would have on their backups and backup process.  When backup is not considered during the virtual server assessment and planning process, it can make virtualization less of the magic potion they had hoped for.  So what is the issue?  Backup is a virtualization bottleneck, and without addressing it you may not be able to obtain the server consolidation ratios you had been expecting, which can have a negative effect on your virtual server TCO and ROI.

This is a timely discussion as VMworld has just concluded.  VMware users flocked to VMworld looking for best practices when it comes to implementing virtual server technology.  Because virtualization allows IT to reduce the overall physical hardware infrastructure, users will be looking at how to maximize their server consolidation ratios (get as many virtual servers on a physical server as they can and still provide good application performance).

I often hear that companies assess their environments by looking at the production applications on their physical servers, identifying their workloads and translating that into some consolidation ratio of physical servers to virtual servers.  I also hear, from these same customers, that backup was never taken into consideration during the assessment phase when trying to identify the best possible consolidation ratios.  These customers implement their new virtual server environments, install the backup agent they had previously been using for physical server backups, attempt to back up their virtual servers, and find that they can only protect 50% to 60% of the new environment.  Why?


A Data Protection Reference Architecture – The Final Chapter


The Architecture

This ‘architecture’ diagram, as you can see, is not a typical architecture diagram, but hopefully it can be used to align your business and its objectives with the technologies that are available and can best be applied to solve your issues, helping to balance cost, complexity and compliance.

This diagram can also be used to do a couple of other things.  It can help you begin to classify your data and align it to your business objectives.  It also lets you begin to identify which data or data services in your environment may be more important to you than others and, based on this, helps you choose areas you may want to outsource or move to the cloud.

As you can tell, there really is not one solution for meeting all your data protection needs.  The challenge comes with managing multiple solutions in an effort to meet your business objectives.  While there are only a few technologies available that allow you to manage your environment across all your RPOs and RTOs, it is important that I point out EMC’s NetWorker is able to do this, centralizing your data protection infrastructure for ease of management.  It allows you to manage traditional backup, source based deduplicated backup with Avamar, CDP with RecoverPoint, as well as the EMC disk libraries and tape where the data is stored.  Now, I am not saying that NetWorker solves all of your data protection challenges, nor am I suggesting that replacing one traditional backup technology with another is the right answer.  What I am saying is that if you’re looking for all the functionality required to meet your business objectives and you want easier management, NetWorker is one avenue to get you there.  Additionally, the underlying image of the triangle represents data protection management.  Putting all the new technology in place is one thing; managing it and ensuring you are now meeting your business needs is another.  EMC's Data Protection Advisor can help here as well.


A Data Protection Reference Architecture – Part 3


The 'Fat Middle'

In the 'fat middle' of the triangle, as I stated last week, there are a number of ways to protect information.  I have chosen to break the middle apart into two categories.  This is meant to be used as a tool for helping you lay out a strategy, so your boxes could be based on capacity and could end up in different areas of the triangle depending upon your business needs.  The thing to keep in mind is that it's not about your environment matching these boxes exactly; it's about making sure that all of the critical data that requires backup with a 24 hour RPO is protected.  You then align the data value in the box with the most appropriate technology to 1) solve the challenge and 2) fit best in your environment.

SMB / ROBO

First, let me clarify my terminology.  ROBO is remote office / branch office and SMB is small to medium business.  If we think about the business needs that are most important in this arena, they are:

  1. Low cost
  2. Simplicity (one tool)
  3. 24 hour RPO is adequate

Small and medium businesses, as well as remote offices, need a robust data protection solution that allows them to meet their backup windows and that has the ability to recover data that is not any older than 24 hours (RPO).  The RTO drives whether the backup target is disk or tape.   Faster recoveries come from disk.  Another thing to keep in mind is that there isn’t usually a lot of technical expertise at these sites so the backup application needs to be very simple to manage.

Backup appliances or appliance-like backup technologies tend to work very well in these environments.  A self-contained, disk-based backup appliance with the ability to replicate efficiently to another site for disaster protection is a great solution for sites like these.
