Tag: "Deduplication"

Top 10 Reasons Real-time Compression Provides Extraordinary Storage Efficiency


Over the past few weeks I have witnessed the proverbial mudslinging that takes place in the blogosphere when marketing feathers are ruffled.  Most recently I was reading Rich Anderson of The StorageSavvy Blog.  The article was "Compression better than Dedup?  NetApp Confirms!"

I have to agree with Rich on many fronts. First, "When all you have is a hammer, everything is a nail." Rich points out that vendors have to sell "what's in the bag," so it is conceivable that every problem looks like one their solution can solve. If you look back over the last few years, NTAP has always had a "me too" reputation. Whatever the industry has, they have one too, and it's better. For the last few years, while competing against Storwize, they have pulled the EMC tactic of trying to stall a market by saying, "We have optimization for primary storage with deduplication." The reality is that you can't use it in real time, it is a resource hog, and, as Rich again mentions, the only use case where it works well in primary storage is VMware (and that is ONLY IF the customer stores their data outside the .vmdk file; otherwise compression is much better). Now that NTAP has compression, their story has changed: they say that compression on primary storage is better for most use cases. Duh! The folks at Storwize (now IBM Real-time Compression) have been saying that for years. Why? Deduplication is great for repetitive data sets, i.e. backup, not primary storage. There just isn't that much repetitive data in primary storage. Again, NTAP is trying to stall the market by saying they have "in-line" compression for primary storage. Sorry guys, not good enough. In-line is NOT Real-time. Rich also points out that the key characteristics of storage for customers are capacity and performance. Patrick Rogers of NTAP has said publicly that compression WILL indeed impact performance and that they even have a tool that will tell you how much performance will be impacted. While NTAP may say compression is "free", we all know nothing worth having in life is free; you get what you pay for. If you need more performance to do compression, you are going to have to perform a major upgrade to your filer just to be able to perform compression, let alone do compression in real time. No real savings there.


The Storage Network


With the impending name change to the "Storwize" product, the marketing folks at the old "Storwize" are at it again with their "viral video" campaign. Not sure how many of you have seen the movie, or even the trailer, for "The Social Network," which grossed $23M in the US, bringing it to #1 at the box office last week. It's the story of a guy who started in college with an idea and turned it into something big. Much like Storwize: an idea that started with only a few people in Israel, has now been acquired by IBM for many millions of dollars, and will become a key part of IBM's overall "Storage Efficiency" strategy. This new trailer, "The Storage Network," highlights too many realities of today's data management issues. Hope you enjoy it.

Video created by MediaBoss Studios

(BTW: In case you didn't get it, Storwize is now IBM Real-time Compression!)


IBM Day 1 – It’s Official


Between time off with the family this summer and all the work required to get done between 'signing' a deal to be acquired and 'closing' a deal to get acquired, the blog has been a bit slow.  But I am here now to tell you it is official.  Storwize is now Storwize, an IBM company.

As for myself, I am looking forward to the work of integrating the Storwize technology into the IBM Storage portfolio. The Storwize group will live under the STG organization under Brian Truskowski. There is a new groundswell taking hold at IBM these days, all around storage efficiency. To get a better understanding, please have a look at the blog of my new colleague, Tony Pearson, discussing storage efficiency. My job now will be to evangelize how IT needs to look at the way all of the available storage "services" (clones, snapshots, thin provisioning, replication, compression, deduplication, etc.) can help create an overall storage solution that reduces their overall $/TB, not only on capital expense but also on operational expense.

Let's face it, data growth isn't slowing down and there is never a one-size-fits-all solution for storage. The great part about being a part of IBM now is that we have all the tools to pick from to architect a data storage solution, end to end, that allows customers to reduce their overall $/TB for both primary and secondary storage, and to make that storage much more efficient and work for the end user.

This is going to be an exciting time. I am also eager to continue the Storage Alchemist blog. EMC, under the guidance of Polly Pearson and Chuck Hollis, taught me that social media is great, but that social media done right, in a collaborative and thoughtful way, can drive influence. I join some of the best bloggers around from IBM. (I have added Tony's "Inside System Storage" - it is a great read.)


A Blueprint for Primary Storage Optimization


During the past three to four months the storage industry has seen a spike in the number of reports, white papers and news articles surrounding the evolution of one primary storage technology: capacity optimization (it is 2010’s Hottest Storage Technology).

The reason this technology is getting a lot of ‘air play’ these days is that it is so critical in helping to control the growth and cost of storage. In 2010 the EMC-sponsored IDC report The Digital Universe Decade - Are You Ready? was released and stated that:

  • In 2009, amid the “Great Recession,” the amount of digital information grew 62% over 2008 to 800 billion gigabytes (0.8 Zettabytes).
  • The amount of digital information created annually will grow by a factor of 44 from 2009 to 2020…
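To put those two figures side by side, here is a small back-of-the-envelope calculation. It uses only the numbers quoted above (0.8 ZB in 2009, a factor of 44 by 2020); the implied 2020 volume and the constant yearly growth rate are my own arithmetic, not figures taken from the report, and real growth will not be perfectly even from year to year.

```python
# Figures quoted from the IDC report above.
ZB_2009 = 0.8          # zettabytes of digital information in 2009
GROWTH_FACTOR = 44     # projected growth from 2009 to 2020
YEARS = 2020 - 2009    # 11 years

# Derived values (assumption: growth compounds at a constant yearly rate).
zb_2020 = ZB_2009 * GROWTH_FACTOR
annual_growth = GROWTH_FACTOR ** (1 / YEARS) - 1

print(f"Implied 2020 volume: {zb_2020:.1f} ZB")
print(f"Implied annual growth: {annual_growth:.1%}")
```

That works out to roughly 35 ZB in 2020, or about 41% compound growth every single year for eleven years.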

The folks at Wikibon also released an infographic that exposes the true explosion of data.

Information Explosion & Cloud Storage
Via: Wikibon

When you combine storage capacity (and the footprint it takes up) with the power it takes to run and cool it, as well as the human resources it takes to manage it, you soon realize we cannot keep ‘just adding more cheap disk’ in an effort to manage storage demands. High-tech companies with high-tech labs are also telling IT that ‘they are out of tricks’ when it comes to continuing to deliver disk drives that double in capacity every 18 months. It is for these reasons that primary storage optimization technologies have stepped into the ‘limelight’, as they serve as a means to help control the growth of primary storage, including the footprint, power, cooling and manpower required to manage it.

However, as we all know in IT, no two environments are the same and what may be good for one may not be good for another.  When looking at primary storage optimization there seem to be a number of available technologies and ways to deploy these technologies and the key question is what is right for ‘my’ environment.


Marketing, FUD and Doing What You Do Best


Rather than leave a lengthy comment on Tom Cook’s blog post from Friday, Compression and Dedupe: Business Value and Data Safety (and from a marketing perspective, Fridays are bad days to post blogs – especially in the summer) – I thought I would respond here (this may get lengthy, as Tom made a number of points which I need to comment on).

The first thing I do want to say is that when doing technical marketing, the proper strategy is not to be on defense but rather to take an offensive approach. However, given the amount of FUD that Tom put in his latest blog post, I have to defend compression to some degree.

Now, I think we can all agree that data compression and data deduplication are two technologies that can complement one another very well.  Avamar (EMC) deduplicates the data at the source and then compresses the data before sending it to the Avamar Data Store gaining tremendous efficiency in network utilization.  ProtecTIER (IBM) compresses the data once it is deduplicated at the target device before it stores the data.  Other solutions also combine compression and data deduplication.

I’d like to comment on some key points Tom made in his piece where he is just blatantly wrong:

1)      Compression identifies redundant data across a very small window, usually 64 KB. – While this may be true for other compression technologies, this is not true for Storwize. Storwize performs compression where the initial window is not fixed in size at all; it is the resultant write that is fixed in size. This size is also specifically mapped to the I/O pattern of the data being written. The goal is that in one I/O Storwize can do all the work it needs to on a particular file or LUN, and it is for this reason that Storwize has no performance penalty.

2)      Compression produces data reduction rates at most 2X for most data types. – Seems Tom needs a lesson in the most common answer in IT – “IT DEPENDS”.  Data compression ratios are 100% tied to the data type.  For a true indication of data compression ratios see Figure 1.
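The “IT DEPENDS” point is easy to see for yourself. The sketch below uses Python's standard zlib module (plain DEFLATE, which is NOT the Storwize algorithm) on two made-up samples: highly repetitive log-style text, and random bytes standing in for data that is already compressed or encrypted. The sample data and the helper name `compression_ratio` are purely illustrative assumptions.

```python
import random
import zlib


def compression_ratio(data: bytes) -> float:
    """Return original size divided by zlib-compressed size."""
    return len(data) / len(zlib.compress(data, 6))


# Hypothetical sample data, not taken from any real workload.
repetitive_text = b"2010-06-07 INFO backup job completed without errors\n" * 2000

# Seeded random bytes: a stand-in for already-compressed or encrypted data.
rng = random.Random(0)
random_bytes = bytes(rng.getrandbits(8) for _ in range(len(repetitive_text)))

samples = {
    "repetitive log text": repetitive_text,
    "random bytes": random_bytes,
}
for name, data in samples.items():
    print(f"{name}: {compression_ratio(data):.2f}x")
```

The same compressor, run on the same number of bytes, lands far above 2X on one sample and at essentially 1X on the other, which is why a blanket “at most 2X” claim says very little without naming the data type.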


Gravity Applies to Everyone!


There was an interesting announcement today regarding Permabit, which is now providing primary storage optimization through OEMs and having its solution embedded into the storage system. This further drives home the point of where capacity optimization should live. I do have a couple of questions, however:

1)      What is the performance like?  I see phrases such as “High Performance Data Optimization Software” but don’t see any performance metrics – such as ‘no performance degradation’ for customers utilizing the solution.  Or testing metrics from their ‘partners’ (as it probably isn’t in production yet) – which brings up another question:

2)      Why were none of the ‘design win’ partners quoted in this announcement?

3)      Rehydration – Mr. Floyd states:

Permabit's Floyd claims Albireo can maintain data integrity because data written to disk isn't altered, and the reduction takes place out of the data path. When parallel processing is used, deduped data doesn't have to be rehydrated when it's accessed.

The question is – if it doesn’t need to be rehydrated, then how does the application read it? I can only assume that Mr. Floyd means the data doesn’t have to be rehydrated on disk, which is fine, but then the questions become: a) how does the application know what the data is? (Ocarina uses an agent to help them understand the data, but this is another thing to manage) and b) what is the performance of the system when it has to look up all of the hash keys to reassemble the data on the fly, and how much more storage resource will this consume?

4)      Back to performance – Permabit states:

When done inline, data will flow to the Albireo library before going to disk. Post-process deduplication will write data to disk first, then scan and eliminate duplicated data. The parallel option sends data to disk while still in memory, and applies updates the same way as post-processing without having to read data off disk. Each method has different amounts of latency and reduction efficiencies.
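To make the rehydration question concrete, here is a toy hash-indexed block store. This is a generic sketch of chunk-level deduplication and is in no way Permabit's Albireo design; the class name `DedupStore`, the fixed 4 KB chunk size and the file names are all hypothetical. The point is only that whatever is deduplicated on write must be looked up, chunk by chunk, on read.

```python
import hashlib

CHUNK_SIZE = 4096  # hypothetical fixed chunk size


class DedupStore:
    """Toy in-memory store that keeps each unique chunk only once."""

    def __init__(self):
        self.chunks = {}  # SHA-256 digest -> chunk bytes
        self.files = {}   # file name -> ordered list of digests

    def write(self, name: str, data: bytes) -> None:
        digests = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).digest()
            self.chunks.setdefault(digest, chunk)  # store only if new
            digests.append(digest)
        self.files[name] = digests

    def read(self, name: str) -> bytes:
        # One index lookup per chunk: this is the "reassemble on the fly" cost.
        return b"".join(self.chunks[d] for d in self.files[name])

    def stored_bytes(self) -> int:
        return sum(len(chunk) for chunk in self.chunks.values())


store = DedupStore()
attachment = b"A" * 8192 + b"B" * 4096   # three chunks, two of them identical
store.write("alice/report.doc", attachment)
store.write("bob/report.doc", attachment)

print("logical bytes:", 2 * len(attachment))
print("stored bytes:", store.stored_bytes())
print("lookups per read:", len(store.files["bob/report.doc"]))
```

The space savings are real, but every read of either file still costs one index lookup per chunk, and on a real array that index has to live somewhere and be serviced by real CPU and I/O. That is the performance number I would like to see.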


A Blog with no Comments?


Today I read a very well written blog by The SANMan. The only issue is, you can't comment on his blog. This is the first technology blog I have seen like this. So, I will have to post my thoughts here.

In his post "NetApp Takes the "Primary" Lead for Data Reduction" - which seems more like theory and a commercial for NTAP than reality (see comments @ The Register) the SANMan states:

"Yes, Ocarina and Storwize have appliances that compress and uncompress data as it’s alternatively stored and read but what performance overhead do such technologies have when hundreds of end users concurrently access the same email attachment? As for Oracle’s Solaris ZFS file system sub level deduplication which is yet to see the light of day one wonders how much hot water it will get Oracle into should it turn out to be a direct rip off of the NetApp model."

I have two comments:

1) You are right - you CAN'T do deduplication on primary if you affect performance. All indications from customers are that they cannot use NTAP deduplication or even compression 'in-line', as the performance is just too terrible, so all processing must be done post-process.

2) I direct your attention to the Wikibon Blog on CORE - "Dedupe Rates Matter...Just Not as Much as You Think" - Storwize can do in-line data optimization without any performance degradation.  So the question is - if customers can 'Optimize without Compromise' - why wouldn't they?

Updated 6/7/2010 - Oh, quick question - how does the SANMan get away with the graphics he uses?  I would think that Walt Disney & Pixar would get a bit upset with the use of the character Carl Fredricksen, no?


Setting the Record Straight on Backup


Or should I say, ‘Setting the Record Straight on Backing Up Optimized Data’? Carter discusses on this blog the myriad ways to perform backups on optimized data. (His blog actually reads more like a white paper explaining how backup needs to be configured to work with his product.) One of the ways Carter describes to do backup is via NDMP, and he says it “… is the most complicated.” The funny thing is that this is the way that 90% of enterprises back up their NAS data. The other scenarios are not quite stated correctly, or are again designed to lead users to believe their solution is ‘simple’ when they really add complexity (however, I’ll let the backup community debate that – I have been in backup for 10+ years and I know this won’t go over well with them, nor do I want to waste too much blog space). Finally, the last scenario they discuss isn’t backup – it’s replication, but I’ll address that too.

Let’s address these one at a time. First, Carter mentions that in some scenarios there is a need to rehydrate data in order to back it up. The process of rehydrating data may not require that the array have the physical capacity to store the data before it is backed up, but the array will require the CPU resources, I/O resources, bandwidth and time to rehydrate the data to back it up. Carter goes on to say that this situation is “ugly, but not that ugly”. I will tell you that any time you put more resource requirements on systems that do backups, you’re running the risk that backups won’t get done. One of the greatest challenges in IT is backup. Backup administrators are running into backup window problems all the time. Data is growing, not shrinking; having to do more work on more data in order to protect it is a recipe for failure. In my previous comments I may have incorrectly stated that you need more disk space to do the backups, but I did correctly state that the array will require more system resources. And where do these resources come from? When the system is idle? When is your storage array idle?

Now, what if all you had to do was – well, nothing? Storwize sits in front of primary storage and stores your data, compressed, in real time, with no performance impact and preserving the envelope of the data file. Then when it comes time to back up, the backup administrator does absolutely nothing different than he/she did yesterday. The same shares are backed up, the same clients, and all the work is done by the Storwize appliance; there is no load on the filer. The next question is whether Storwize can keep up with the backup stream, and the answer is YES. As you saw in the Wikibon CORE blog, our time to compress is on the order of milliseconds – the time to decompress is even less. (I should also mention one thing Carter failed to mention: in order for backups to come off their system ‘transparently’ you need a software agent on the client. Who wants to manage more clients?)


Compressed Thoughts – Compression and Deduplication


This video doesn't talk about the merits of one versus the other, but about how compression (or capacity optimization), when done right, should enhance data deduplication, not impact it. Enjoy, and for more videos like this one go to the StorwizeChannel.


Compressed Thoughts – A History of Capacity Optimization


Users and vendors alike have had skepticism around capacity optimization at different points throughout history. The key for end users is to be able to answer the questions:

  1. What are you doing to my data? - This is directly related to availability and data integrity.  If you can't ensure the customer will always have access to their data and it will be 'their' data then there is an issue.
  2. Whatever you do to my data, you can't cause slowness.  This goes right to the performance question.

(These are consequently the two main characteristics behind 99% of all storage purchases.)

Vendors always whine that they will sell less disk and that these technologies will negatively impact margins. The reality is that the disk storage market is 100% elastic. Add deduplication to your backup environment and, I promise, you will buy more backup disk storage and keep more data on-line. Humans are funny like that.

Watch this brief video to see the storage challenges of today and how solving them will have the same 'feeling' it did even 10 years ago.
