Tag: "Dedupe"

A Blog with no Comments?


Today I read a very well-written blog by The SANMan.  The only issue is that you can't comment on his blog.  This is the first technology blog I have seen like this, so I will have to post my thoughts here.

In his post "NetApp Takes the "Primary" Lead for Data Reduction", which seems more like theory and a commercial for NTAP than reality (see comments @ The Register), the SANMan states:

"Yes, Ocarina and Storwize have appliances that compress and uncompress data as it’s alternatively stored and read but what performance overhead do such technologies have when hundreds of end users concurrently access the same email attachment? As for Oracle’s Solaris ZFS file system sub level deduplication which is yet to see the light of day one wonders how much hot water it will get Oracle into should it turn out to be a direct rip off of the NetApp model."

I have two comments:

1) You are right - you CAN'T do deduplication on primary storage if it affects performance.  All indications from customers are that they cannot use NTAP deduplication or even compression 'in-line' because the performance is just too terrible, so all processing must be done post-process.

2) I direct your attention to the Wikibon Blog on CORE - "Dedupe Rates Matter...Just Not as Much as You Think" - Storwize can do in-line data optimization without any performance degradation.  So the question is - if customers can 'Optimize without Compromise' - why wouldn't they?

Updated 6/7/2010 - Oh, quick question - how does the SANMan get away with the graphics he uses?  I would think that Walt Disney & Pixar would get a bit upset with the use of the character Carl Fredricksen, no?


Setting the Record Straight on Backup


Or should I say, ‘Setting the Record Straight on Backing Up Optimized Data’?  Carter discusses on this blog the myriad ways to perform backups on optimized data.  (His blog actually reads more like a white paper explaining how backup needs to be configured to work with his product.)  One of the ways Carter describes to do backup is via NDMP, which he says “… is the most complicated.”  The funny thing is that this is the way that 90% of enterprises back up their NAS data.  The other scenarios are not quite stated correctly or are again designed to lead users to believe their solution is ‘simple’ when they really add complexity (however, I’ll let the backup community debate that – I have been in backup for 10+ years and I know this won’t go over well with them, nor do I want to waste too much blog space).  Finally, the last scenario they discuss isn’t backup – it’s replication, but I’ll address that too.

Let’s address these one at a time.  First, Carter mentions that in some scenarios there is a need to rehydrate data in order to back it up.  The process of rehydrating data may not require that the array have the physical capacity to store the data before it is backed up, but the array will require the CPU resources, I/O resources, bandwidth and time to rehydrate the data in order to back it up.  Carter goes on to say that this situation is “ugly, but not that ugly”.  I will tell you that any time you put more resource requirements on systems that do backups, you’re running the risk that backups won’t get done.  One of the greatest challenges in IT is backup.  Backup administrators are running into backup window problems all the time.  Data is growing, not shrinking; having to do more work on more data in order to protect it is a recipe for failure.  In my previous comments I may have incorrectly stated that you need more disk space to do the backups, but I did correctly state that the array will require more system resources.  And where do these resources come from?  When the system is idle?  When is your storage array idle?

Now, what if all you had to do was – well, nothing?  Storwize sits in front of primary storage and stores your data, compressed, in real-time with no performance impact while preserving the envelope of the data file.  Then when it comes time to back up, the backup administrator does absolutely nothing different than he/she did yesterday.  The same shares are backed up, with the same clients, and all the work is done by the Storwize appliance; there is no load on the filer.  The next question is whether Storwize can keep up with the backup stream, and the answer is YES.  As you saw in the Wikibon CORE blog, our time to compress is on the order of milliseconds – the time to decompress is even less.  (I should also mention one thing Carter failed to mention: in order for backups to come off their system ‘transparently’ you need a software agent on the client – who wants to manage more clients?)


Compressed Thoughts – Compression and Deduplication


This video doesn't talk about the merits of one versus the other, but about how compression (or capacity optimization), when done right, should enhance data deduplication, not impact it.  Enjoy, and for more videos like this one go to the StorwizeChannel.


Deduplication – Older than You Think


So I am a big fan of National Public Radio – NPR.  Today I learned that yesterday, 10/29/09, was the 40th anniversary of the ‘internet’.  Now, I am sure there are a number of theories on when the internet was started and who started it, but it is safe to say that at this time in history 40 years ago, two guys in California tried to send the first message, ‘login’, over a wire between two computers (only ‘lo’ arrived before the system crashed) and internet messaging was born.

Since then, people have been trying to reduce the amount of data sent over the internet.  From email to instant messaging, from full files to compressed files and from disk drives to USB drives – people are always trying to make information trafficked over the internet smaller and faster.  That is no surprise coming from a group of people who have turned every term on the internet into an acronym: from USB, ISP, PDA, and LCD to SRM, ARM, and DPM, techies are always trying to stuff more data into smaller spaces.

Over the past 2 years data deduplication has become the latest fad in putting more data into a smaller space.  By removing redundant ‘blocks’ of data from the mass of files stored, it is conceivable to reduce your data footprint by as much as 70%.  Deduplication is playing a predominant role in backup, especially backup over the WAN.  With deduplication, you can easily move your data over the WAN to a central data center for protection, moving only small changes (blocks, not files) of data, and make even more room for Facebook, Hulu, iTunes and more.  What is next for the internet?
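For readers who want to see the ‘blocks not files’ idea in miniature, here is a rough sketch in Python.  It is only an illustration under my own assumptions: a fixed 4 KB block size, SHA-256 fingerprints, and made-up file names (`monday.bak`, `tuesday.bak`).  The helper names `dedupe`, `restore` and `blocks_to_send` are hypothetical, and no vendor's product is claimed to work exactly this way.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size; real products may chunk differently


def dedupe(files, block_size=BLOCK_SIZE):
    """Store each unique block once and keep a per-file list of fingerprints."""
    store = {}    # fingerprint -> block bytes (each unique block kept once)
    recipes = {}  # file name -> ordered list of fingerprints
    for name, data in files.items():
        fingerprints = []
        for offset in range(0, len(data), block_size):
            block = data[offset:offset + block_size]
            fingerprint = hashlib.sha256(block).hexdigest()
            store.setdefault(fingerprint, block)  # redundant blocks are skipped
            fingerprints.append(fingerprint)
        recipes[name] = fingerprints
    return store, recipes


def restore(name, store, recipes):
    """Rebuild a file by concatenating its blocks in order."""
    return b"".join(store[fp] for fp in recipes[name])


def blocks_to_send(old_recipe, new_recipe):
    """Fingerprints in the new version that the remote site does not hold yet."""
    known = set(old_recipe)
    return [fp for fp in new_recipe if fp not in known]


# Hypothetical data: two backups that differ in only their last block.
files = {
    "monday.bak": b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE + b"C" * BLOCK_SIZE,
    "tuesday.bak": b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE + b"D" * BLOCK_SIZE,
}
store, recipes = dedupe(files)
raw_bytes = sum(len(data) for data in files.values())
stored_bytes = sum(len(block) for block in store.values())
to_send = blocks_to_send(recipes["monday.bak"], recipes["tuesday.bak"])
print(f"raw={raw_bytes} stored={stored_bytes} blocks to send={len(to_send)}")
```

In this toy case the second backup only needs to ship the one block that changed, which is the whole appeal of deduplicated backup over the WAN.  The savings you see in practice depend entirely on how redundant your data really is.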


The Side Effects of Backup on Server Virtualization


Server virtualization has changed the IT landscape dramatically.  It has become a magic potion curing a number of ills in the physical server world such as low individual CPU utilization and excess use of space, power and cooling in the data center.  However, like all potions that cure what ails you, there can be side effects.  You need to be careful of what the Witch Doctor orders.

When I speak with customers who have aggressively implemented a virtual server infrastructure, 9 out of 10 tell me that they underestimated the effect that virtualization would have on their backups and backup process.  When backup is not considered during the virtual server assessment and planning process, it can make virtualization less of the magic potion they had hoped for.  So what is the issue?  Backup is a virtualization bottleneck, and without addressing it, you may not be able to obtain the server consolidation ratios you had been expecting, which can have a negative effect on your virtual server TCO and ROI.

This is a timely discussion as VMworld has just concluded.  VMware users flocked to VMworld looking for best practices when it comes to implementing virtual server technology.  Because virtualization allows IT to reduce the overall physical hardware infrastructure, users will be looking at how to maximize their server consolidation ratios (get as many virtual servers on a physical server as they can and still provide good application performance).

I often hear that companies assess their environments by looking at the production applications on their physical server environment, identifying their workloads and translating that into some consolidation ratio of physical servers to virtual servers.  I also hear, from these same customers, that backup was never taken into consideration during the assessment phase when trying to identify the best possible consolidation ratios.  These customers implement their new virtual server environments, install the backup agent they had previously been using for physical server backups and attempt to back up their virtual servers, only to find that they are able to protect just 50% to 60% of the new environment.  Why?


Betamax Redux


I often joke with customers that when my friends were growing up they would dream of being a professional baseball player or a rock star, and I used to dream of becoming a data protection technologist.  Recently I read something very profound in Chuck Hollis’s internal EMC blog.  Chuck said, "Decide what you're passionate about ...and write about it... it is hard to write about stuff you don't care about."  I am passionate about data protection.  Not because data protection is "cool" or anything, but because it is one of the most important practices in the data center.  It is also one of the most challenging practices in the data center, and it involves not just technology but people and process as well.  I had an old boss once who said, "Where there is chaos, there is cash," and given the fact that the data protection market is a $10B market, I would say he was correct.  I have started this blog along with my colleagues because we truly believe in what we do, who we work for, the challenges we solve and the benefits we bring to a customer's challenging world around data protection.  We write because we are passionate about data protection, not because we are being paid to.

Something I read a while ago in Tony Assaro’s blog, Leaders Dilemma, as well as in Setting the Record Straight, really got me charged up, but I wasn’t sure how I wanted to comment.  Tony, you see, writes for money (not passion), which means he has to write ‘for’ the company that is paying him and at the same time spend time ‘Manufacturing Confusion’ in the market.  (Sorry Tony, I liked you better as an analyst, when you heard all the vendors' product messages and would form an opinion about what was really going on in the market.)  What I am referring to are the comments specifically about "EMC is the one big player going after this market in earnest with three different products (which will confuse the market and themselves)".  Quite frankly, EMC’s philosophy and message to its customers regarding data deduplication isn’t confusing at all.  In fact, when I speak with our customers, they believe we have one of the more thoughtful and consistent messages around this topic.  So in an effort to educate, let me share EMC’s data deduplication philosophy and how EMC will take backup, beyond.  EMC will:
