Tag: "Ocarina"

Gravity Applies to Everyone!


There was an interesting announcement today regarding Permabit, which is now providing primary storage optimization through OEMs that embed its solution into the storage system.  This further drives home the point of where capacity optimization should live.  I do have a couple of questions, however:

1)      What is the performance like?  I see phrases such as “High Performance Data Optimization Software” but don’t see any performance metrics, such as ‘no performance degradation’ for customers utilizing the solution, or testing metrics from their ‘partners’ (as it probably isn’t in production yet).  That brings up another question:

2)      Why were none of the ‘design win’ partners quoted in this announcement?

3)      Rehydration – Mr. Floyd states:

Permabit's Floyd claims Albireo can maintain data integrity because data written to disk isn't altered, and the reduction takes place out of the data path. When parallel processing is used, deduped data doesn't have to be rehydrated when it's accessed.

The question is – if it doesn’t need to be rehydrated, then how does the application read it?  I can only assume that Mr. Floyd means the data doesn’t have to be rehydrated on disk, which is fine.  The questions then become: a) how does the application know what the data is? (Ocarina uses an agent to help them understand the data, but this is another thing to manage) and b) what is the performance cost of the system looking up all of the hash keys to reassemble the data on the fly, and how much more of the storage system's resources will this consume?
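To make question (b) concrete, here is a minimal sketch of a hash-keyed chunk store. It is a toy model of deduplication in general, not Permabit's Albireo or Ocarina's product; the class name `DedupStore`, the 4-byte chunk size, and the file name are all invented for illustration. It shows that even when nothing is rehydrated on disk, every read still has to look up each chunk by its hash key and reassemble the data.

```python
import hashlib

CHUNK_SIZE = 4  # hypothetical, tiny chunk size so the example stays readable


class DedupStore:
    """Toy content-addressed store: one copy per unique chunk, keyed by hash."""

    def __init__(self):
        self.chunks = {}   # hash key -> chunk bytes (each unique chunk stored once)
        self.files = {}    # file name -> ordered list of hash keys
        self.lookups = 0   # index lookups performed while serving reads

    def write(self, name, data):
        keys = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            key = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(key, chunk)  # a duplicate chunk is not stored again
            keys.append(key)
        self.files[name] = keys

    def read(self, name):
        parts = []
        for key in self.files[name]:
            self.lookups += 1  # one index lookup per chunk, on every read
            parts.append(self.chunks[key])
        return b"".join(parts)


store = DedupStore()
payload = b"ABCDABCDABCDEFGH"
store.write("report.doc", payload)
assert store.read("report.doc") == payload  # the application sees the original bytes
print(len(store.chunks), "unique chunks,", store.lookups, "lookups for one read")
# prints: 2 unique chunks, 4 lookups for one read
```

The stored copy shrinks from four chunks to two, but the read path pays one lookup per chunk every time, which is exactly the resource cost I am asking about.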

4)      Back to performance – Permabit states:

When done inline, data will flow to the Albireo library before going to disk. Post-process deduplication will write data to disk first, then scan and eliminate duplicated data. The parallel option sends data to disk while still in memory, and applies updates the same way as post-processing without having to read data off disk. Each method has different amounts of latency and reduction efficiencies.
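The three modes in that quote can be compared with a toy counter. This is only a sketch under my own assumptions about what each mode implies (the function `simulate` and its counters are hypothetical, and it says nothing about how Albireo is actually implemented): inline hashes each chunk before it reaches disk, post-process writes everything and then reads it back to scan for duplicates, and parallel writes everything but hashes the copy still in memory.

```python
import hashlib


def simulate(mode, chunks):
    """Count disk I/O for one toy dedup mode; returns a dict of counters."""
    seen = set()
    stats = {"disk_writes": 0, "disk_reads": 0, "hashes_before_write": 0}

    if mode == "inline":
        # Hashing sits in the write path; only unique chunks reach disk.
        for chunk in chunks:
            key = hashlib.sha256(chunk).digest()
            stats["hashes_before_write"] += 1
            if key not in seen:
                seen.add(key)
                stats["disk_writes"] += 1
    elif mode == "post-process":
        # Everything is written first, then read back from disk and scanned.
        stats["disk_writes"] = len(chunks)
        for chunk in chunks:
            stats["disk_reads"] += 1
            seen.add(hashlib.sha256(chunk).digest())
    elif mode == "parallel":
        # Everything is written, but hashing uses the in-memory copy (no read-back).
        stats["disk_writes"] = len(chunks)
        for chunk in chunks:
            seen.add(hashlib.sha256(chunk).digest())
    else:
        raise ValueError(f"unknown mode: {mode}")

    stats["unique_chunks"] = len(seen)
    return stats


chunks = [b"a", b"b", b"a", b"a"]
for mode in ("inline", "post-process", "parallel"):
    print(mode, simulate(mode, chunks))
```

In this toy model, inline trades write-path latency for fewer disk writes, while the other two write all four chunks and reclaim the duplicates later. Real numbers are what I would like to see from Permabit.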


Setting the Record Straight on Backup


Or should I say, ‘Setting the Record Straight on Backing Up Optimized Data’?  Carter discusses on his blog the myriad ways to perform backups on optimized data.  (His blog actually reads more like a white paper explaining how backup needs to be configured to work with his product.)  One of the ways Carter describes to do backup is via NDMP, which he says “… is the most complicated.”  The funny thing is that this is the way that 90% of enterprises back up their NAS data.  The other scenarios are not quite stated correctly or are again designed to lead users to believe their solution is ‘simple’ when they really add complexity (however, I’ll let the backup community debate that – I have been in backup for 10+ years and I know this won’t go over well with them, nor do I want to waste too much blog space).  Finally, the last scenario they discuss isn’t backup – it’s replication, but I’ll address that too.  Let’s address these one at a time.

First, Carter mentions that in some scenarios there is a need to rehydrate data in order to back it up.  The process of rehydrating data may not require that the array have the physical capacity to store the data before it is backed up, but the array will require the CPU resources, I/O resources, bandwidth and time to rehydrate the data to back it up.  Carter goes on to say that this situation is “ugly, but not that ugly”.  I will tell you that any time you put more resource requirements on systems that do backups, you’re running the risk that backups won’t get done.  One of the greatest challenges in IT is backup.  Backup administrators are running into backup window problems all the time.  Data is growing, not shrinking; having to do more work on more data in order to protect it is a recipe for failure.  In my previous comments I may have incorrectly stated you need more disk space to do the backups, but I did correctly state that the array will require more system resources.  And where do these resources come from?  When the system is idle?  When is your storage array idle?

Now, what if all you had to do was – well, nothing?  Storwize sits in front of primary storage and compresses your data in real time, with no performance impact, while preserving the envelope of the data file.  Then when it comes time to back up, the backup administrator does absolutely nothing different than he/she did yesterday.  The same shares are backed up, with the same clients, and all the work is done by the Storwize appliance, so there is no load on the filer.  The next question is whether Storwize can keep up with the backup stream, and the answer is YES.  As you saw in the Wikibon CORE blog, our time to compress is on the order of milliseconds – the time to decompress is even less.  (I should also mention one thing Carter failed to mention: in order for backups to come off their system ‘transparently’ you need a software agent on the client – who wants to manage more clients?)


Compression & Deduplication – Oil & Water or Milk & Cookies


UPDATE

Oil & Water?

Last week Mike Davis from Ocarina Networks published a blog post, "Compression and Dedupe like Oil & Water?"  It was a good piece.  I don't know Mike, but from what I understand he will be taking over blogging now that Sunshine has moved on to greener pastures, and I wish her the best.  I am writing this piece because Mike made some interesting statements in his post and I had some questions.  I know the guys at Wikibon have ideas on this topic.  I tried asking my questions via Twitter and then on his blog but haven't received any feedback (trust me, I am not naive, I know we are all very busy), so I thought it would be interesting to share my thoughts and try to start some dialog.

Mike stated:

"If you apply a compression-only workflow to a dataset let’s say you get 50%. Now run the same data set through a dedupe-only workflow and you’ll get maybe 20% (remember this is primary storage not backup data). Now take those little chunks and pointers from the dedupe workflow and compress them; you might get an additional 35% for a total of 55%. So compression of deduped data is less effective than on the raw data-set, but the combination (for this example) has eeked out a 5% advantage over the compression-only workflow."

I understand Mike to be saying that if you used deduplication and compression together you could potentially get an additional 5% optimization of your storage over standard compression.  My question is, at what cost?  I don't necessarily mean dollar cost, although that is a factor, but the cost to the end user and the IT administrator.  When I think of capacity optimization for primary storage, here is what I believe the requirements are for IT:
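As an aside on the numbers in Mike's example: the quote does not say whether the "additional 35%" is measured against the original data set or against what is left after dedupe. The sketch below works through the second reading, where each stage reduces whatever the previous stage left behind. The function name `combined_reduction` is mine, and the stage-by-stage interpretation is an assumption, not something Mike stated.

```python
def combined_reduction(*stage_reductions):
    """Total reduction when each stage shrinks the output of the previous stage."""
    remaining = 1.0
    for reduction in stage_reductions:
        remaining *= 1.0 - reduction  # fraction of the data left after this stage
    return 1.0 - remaining


compression_only = combined_reduction(0.50)            # compression alone
dedupe_then_compress = combined_reduction(0.20, 0.35)  # 35% taken off the deduped data

print(f"compression only:     {compression_only:.0%}")      # 50%
print(f"dedupe then compress: {dedupe_then_compress:.0%}")  # 48%
```

Under that reading the combination comes to 48%, slightly behind compression alone. The quoted 55% only works if the 35% is a share of the original data set, so that the two percentages can simply be added. Either way, the gap between the two workflows is a few points, which makes the cost question all the more relevant.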

  1. Optimization cannot cause any impact to the performance of the storage array
  2. Optimization cannot cause any change in downstream processes for the systems administrator
  3. Optimization cannot cause any increase in storage management functions