Spectra Logic Backup and Recover Blog

Backup, Archive, HSM - What's the Difference Anyway?

Part One of Two

One of the interesting things I have discovered while talking with so many HPC customers is that the term “backup” is seldom used. If they aren’t doing traditional backups, you might ask, why would we, a backup solutions provider, want to talk to them? Well, first you need to fully understand the difference between backup and archive. Archive is a word you will hear more often in HPC and M&E environments, especially where there is data in the petabyte range and beyond, with large files that aren’t accessed frequently but need to be kept indefinitely.

In this blog, which is the first of a two-part series, I will provide some fundamental information that can help you differentiate backup from archive. In the subsequent blog, part two, we will peel back the covers on a process that differs from both backup and archive and resembles traditional HSM (Hierarchical Storage Management). This information will prove valuable for those HPC or other data-intensive customers who may claim that they don’t do backups. Stay tuned for more on this subject later.

The differences between backup and archive:

Backup: simply refers to creating a copy of data and storing it somewhere so it can be restored if the original version is compromised in some way. We evangelize the concept of backups because we know, and most customers realize, that data can be accidentally deleted or corrupted or, even worse, a natural disaster could wipe out the entire data center.

Backup is simply safeguarding or protecting the data that is being used by duplicating that data. This is usually done in a rotating cycle or through schedules such as: daily incrementals kept for seven days, a weekly full kept for a month, a monthly full kept for a year and a yearly full kept for seven years. Although this process has proven effective and most of the backup applications on the market today handle it well, problems occur when multiple copies of the same data start consuming a lot more hardware than necessary, not to mention the associated costs of running and managing that hardware.
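
To make the rotation above concrete, here is a minimal Python sketch of that schedule. The retention periods come from the schedule just described; everything else is an assumption for illustration only: the tier names, the 30-day "month" and 365-day "year", and the convention that yearly fulls run on January 1, monthly fulls on the 1st of the month and weekly fulls on Sundays. It is not how any particular backup application works.

```python
from datetime import date, timedelta

# Retention periods from the schedule described above.
# Tier names and the 30/365-day approximations are illustrative assumptions.
RETENTION_DAYS = {
    "daily_incremental": 7,
    "weekly_full": 30,
    "monthly_full": 365,
    "yearly_full": 7 * 365,
}


def backup_tier(day: date) -> str:
    """Pick the backup tier for a calendar day.

    Assumed convention: yearly on January 1, monthly on the 1st,
    weekly on Sundays, daily incremental otherwise.
    """
    if day.month == 1 and day.day == 1:
        return "yearly_full"
    if day.day == 1:
        return "monthly_full"
    if day.weekday() == 6:  # Monday is 0, so 6 is Sunday
        return "weekly_full"
    return "daily_incremental"


def expires_on(day: date) -> date:
    """Date on which the backup taken on `day` leaves the rotation."""
    return day + timedelta(days=RETENTION_DAYS[backup_tier(day)])


def retained_backups(today: date, history_days: int = 8 * 365) -> list[tuple[date, str]]:
    """List every backup from the last `history_days` that has not expired."""
    kept = []
    for offset in range(history_days):
        day = today - timedelta(days=offset)
        if expires_on(day) > today:
            kept.append((day, backup_tier(day)))
    return kept
```

Counting the entries returned by `retained_backups(date.today())` whose tier ends in `_full` gives a rough idea of how many complete copies of largely the same data one schedule like this keeps on hand at any moment.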

With backup – think business continuity

One of the key differences when comparing backup strategies to archiving is the difficulty of singling out select files for long-term retention. Everything in the backup gets lumped into the large full backup at the end of the year or seven years and called an “archive”. It may in fact be called an archive, but a recovery would function more like a backup recovery, which could be very costly and time consuming. Backup strategies are meant more for business continuity and are not necessarily suited to long-term archiving.

With archive – think long-term retention

Archive: The main difference between an archive and a backup is that an archive refers to a single collection of records or data that is designated for long-term retention. When the data is moved from the production environment to the archive environment, it is tagged or indexed with metadata that helps locate a particular file or chunk of data quickly through a search mechanism. This process, and the sophisticated software that performs it, makes locating a single file much more efficient than it would be in a traditional backup. An archive is generally presented as a common file system structure, and determining where a file is located is a function of the file system. The file system may store the archived data on several different storage devices based on a number of attributes such as size, type, last accessed, etc. This system could be a combination of expensive disk, such as Fibre Channel, less expensive disk, such as SATA or SAS, and tape. The key is how the data is “structured.” In most cases, the data may never be accessed again, but it is necessary to keep it for historical purposes, regulatory compliance or an unplanned event. The goal in creating an archive is to keep it separate from the backup rotation cycle. It is recommended that a separate copy of the archived data be made and kept in a separate location so there are at least two copies of the final archive.

Many environments will include both backup and archive. Using the sophisticated software features available today, customers can establish policies based on the type, size, age, last access time, remaining disk space and other characteristics of stored data, and use those policies to automate the decision of whether to keep the data in the backup cycle or move it to the archive pool.
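
A policy like that can be pictured as a simple rule over file attributes. The sketch below is a hypothetical illustration in Python, not the interface of any real product: the `FileInfo` and `ArchivePolicy` names, the fields and every threshold are assumptions chosen only to show the shape of the decision.

```python
from dataclasses import dataclass


@dataclass
class FileInfo:
    path: str
    size_bytes: int
    days_since_access: int


@dataclass
class ArchivePolicy:
    # All thresholds are hypothetical examples, not recommendations.
    min_days_idle: int = 180                 # "last accessed" criterion
    min_size_bytes: int = 1 * 1024**3        # "size" criterion (1 GiB)
    disk_free_fraction_floor: float = 0.20   # "remaining disk space" criterion


def destination(f: FileInfo, policy: ArchivePolicy, disk_free_fraction: float) -> str:
    """Return "archive" or "backup" for one file under the given policy."""
    idle = f.days_since_access >= policy.min_days_idle
    large = f.size_bytes >= policy.min_size_bytes
    disk_tight = disk_free_fraction < policy.disk_free_fraction_floor
    # Only idle data is eligible for archive; among idle data, move large
    # files always and small files only when primary disk is running low.
    if idle and (large or disk_tight):
        return "archive"
    return "backup"
```

For example, `destination(FileInfo("/data/run42.h5", 5 * 1024**3, 400), ArchivePolicy(), 0.5)` sends a large, long-idle file to the archive pool, while a file touched ten days ago stays in the backup cycle regardless of its size.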

These two functions can be performed within a single library in separate partitions. The software can then provide notification of which tapes need to be exported based on the function that was performed on those tapes, backup or archive. I have seen estimates as high as 80% for how much data is duplicated within a storage infrastructure because the differences between backup and archive aren’t fully understood. At the end of the day, knowing the difference between backup and archive technologies and their benefits, when to use them and how to balance the two functions in an environment can drastically reduce redundancy, complexity and storage operating costs.

In Part Two of this discussion, we will look at how archives that contain production data, no matter how old or infrequently accessed, can still be retrieved online using high density and high speed tape systems and secondary disk systems.  Stay tuned for my next post which will look at enduring access to data. 

Want to talk more? I’ll be in Dearborn, Michigan at the IDC HPC User Forum and DICE Alliance 2010 events next week. Contact me at jimm@spectralogic.com.

Dedupe to Tape - a Qualified Maybe

Dedupe, Dedupe, Dedupe....  In the last few weeks, every time I look online, someone is talking about dedupe.  New models, customer testimonials or new places to use dedupe.  Everyone sees the value in dedupe when looking at backup to disk, but where else can we use it?  Some talk about dedupe on primary disk storage.  I would love to dedupe hills when riding my bike, I am sure Captain Kirk would like to dedupe the Tribbles, and I think we would all love to find a way to dedupe our bills.  At the same time, I don't want to dedupe everything - I already don't go on as many dates as I would like.  What about deduping backups to tape?


There has been some fun and exciting conversation about dedupe to tape in the last few days.  I started reading this latest flurry of conversation at Storage Soup.  As someone whose evening plans got deduped, I had plenty of time to poke around online.  It all seems to have started with W. Curtis Preston's comments that dedupe to tape has its place.  Of course, others jumped in, disagreeing.  I have been paying attention to the CommVault feature, and learning what I can about it, for a while now.  CommVault was my first gig out of the Air Force, so I have a soft spot for them.  I also think they make good software, so while I never really thought the idea of dedupe to tape was a good one, I kept an open mind about their implementation.

 
Should an organization dedupe to tape?  It seems that everyone writing about it this week agrees that the answer is no if they want to recover from that copy of the data.  That certainly makes sense to me, and is in line with what I have told people for a couple of years now when they have asked what I thought.  Curtis' comment makes a lot of sense:
 
"They recommend it for a very specific user case: the tapes you know you're making that you never plan to restore from"  
 
After 10 years on the vendor side of backup and storage, I have been in a lot of data centers all over the country.  My idle curiosity often gets the best of me, and I ask - what's on those tapes on the wall?  It became a pretty typical refrain - "We have no idea, they have been here longer than anyone in the shop."  We really do have backups we never think we will need to recover; we just store them because we have to.  Overall, this speaks to process problems and using backup for something it isn't designed for. (Why don't more organizations actually start archiving?)  That is a conversation for a different blog entry.  
 
When I work with an account team and customer to architect a data protection scheme, I always think worst case.  Today, when we back up most data to disk first, recovery from tape most likely means you are having a bad day.  50 tape swaps for a server recovery would make that a very bad day.  At that moment no one will feel better that they saved a couple hundred bucks on tapes six months ago.
 
It seems the ideal application for deduped tape would be:
 
o    Large amount of data – LTO-4 is big; LTO-5 will be huge.  If you are not filling lots of these tapes, there does not seem to be any benefit
o    Archived for a long time - if your retention period is only a few months, the savings in media won't be that great, as you would get to recycle your tapes.
o    Expect very, very few recoveries from tape copy
o    Expect no large recoveries from tape copy
o    Will never have to bring tapes in from off site location for recovery
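
The list above reads like a checklist, and one way to picture it is as a small function that reports which of the criteria a given workload fails. This is only a sketch under stated assumptions: the `TapeWorkload` fields and all of the cutoffs (`min_tapes`, `min_retention_months`, `max_restores_per_year`) are hypothetical numbers picked for illustration. The one hard figure is the 800 GB native capacity of an LTO-4 cartridge.

```python
import math
from dataclasses import dataclass

LTO4_NATIVE_GB = 800  # native (uncompressed) capacity of one LTO-4 cartridge


@dataclass
class TapeWorkload:
    data_gb: float
    retention_months: int
    expected_restores_per_year: float
    expects_large_restores: bool
    restores_need_offsite_recall: bool


def dedupe_to_tape_concerns(
    w: TapeWorkload,
    min_tapes: int = 20,                  # hypothetical cutoff for "lots of tapes"
    min_retention_months: int = 12,       # hypothetical cutoff for "a long time"
    max_restores_per_year: float = 1.0,   # hypothetical cutoff for "very few"
) -> list[str]:
    """Return the checklist items this workload fails; empty means it fits."""
    concerns = []
    tapes = math.ceil(w.data_gb / LTO4_NATIVE_GB)
    if tapes < min_tapes:
        concerns.append(f"only about {tapes} LTO-4 tapes of data; little media to save")
    if w.retention_months < min_retention_months:
        concerns.append("short retention; tapes get recycled anyway")
    if w.expected_restores_per_year > max_restores_per_year:
        concerns.append("recoveries from the tape copy are expected")
    if w.expects_large_restores:
        concerns.append("large recoveries from the tape copy are expected")
    if w.restores_need_offsite_recall:
        concerns.append("recoveries would need tapes recalled from off site")
    return concerns
```

An empty result means the workload matches every item on the list; anything else is a reason to keep that tape copy in its ordinary, non-deduped form.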
 
So, would I dedupe to tape? I don’t think so, but then I would hope my retention policies would not include long-term retention with almost no expectation of recovery. If I found myself in that situation, would I? Yes. I think some organizations will see great benefits from dedupe to tape, and others will not. Ultimately, I think the value in the dedupe to tape conversation for many organizations is that it will force them to think about the process that creates all those old copies of data. That kind of review can always lead to better data protection.
 
Follow me at www.Twitter.com/3pedal.