Spectra Logic Backup and Recover Blog

Got Big Data? 4 Things to Look for at the NAB Show

Spectra Logic’s Kevin Dudak is a contributing blogger for the Inside Big Data Blog. His most recent post has been reprinted below with permission from Rich Brueckner:

I just got an email from the organizers of the NAB Show in Las Vegas this April about my registration confirmation. I’ve always enjoyed attending this show, as it has a lot of things you don’t see at the typical IT show. How many storage shows have provisions for helicopters to fly in and be displayed?

This show draws a wide cross section of organizations, with lots of educational seminars, as well as vendor displays. What they all have in common is data. The capture side creates the raw data, editors and post-production change the data, and broadcast distributes finished data.

It got me thinking: with so much to see and do at this show, what do I want to make sure I learn at NAB? I will specifically be looking at four things:

  • 4K and Beyond – Video takes space, and HD 1080p video takes lots of space. It has become so easy to capture HD video these days with things like the GoPro Hero Cam (they should have a cool booth in the Expo Hall) that I’ve got hours of footage from bike rides, autocrosses and other events consuming lots of storage. 4K makes 1080p look small, and there is already talk about what follows it. Increases in sensor resolution will drive bigger video files not just in media, but also in security and other applications. I want to see what is coming, as storage systems need to be ready to hold the additional data.
  • Digital Workflow – I think everyone looking at Big Data issues can learn a lot here. M&E has spent several years converting to digital file-based workflows. This means lots of huge, high-value files that need to be analyzed for metadata creation, modified, rendered and distributed. Every year, they get a little more efficient at the process, something we can all learn from.
  • Storage – Of course, as a storage guy, I am going to be interested in any new ways to use storage. There will be everything from extreme-speed storage for playout at broadcast stations to long-term archival storage of digital assets. Asset management of the archives ties in here and is equally important. A lot of the problems being solved today for long-term archive in M&E will have broader application in the near term.
  • Data Movement – How does someone move a Terabyte of data, or a Petabyte, securely and with confidence? I am very interested to see how different organizations are solving the data mobility challenge. With 4K video capture becoming more mainstream, and higher resolutions on the horizon, many entertainment companies are a generation or two ahead of the rest of technology users on this front. Lessons learned here will help us all.

There is a lot of potential to learn some interesting things at NAB this year. M&E organizations are facing the same challenges many Big Data industries are, but they come at them without preconceived notions of how IT is supposed to work. I think that gives them the potential to create some interesting solutions that can benefit us all.

Big Data Software – More Than Just Analytics

Spectra Logic’s Kevin Dudak is a contributing blogger for the Inside Big Data Blog. His most recent post, Big Data Software – More Than Just Analytics, has been reprinted below with permission from Rich Brueckner:

I noticed a funny thing the other day while on the Storage Networking World website, looking at the different things on the agenda. At the top of the list is the Big Data track the first day of the show. That’s pretty predictable, given everyone seems to be talking about Big Data these days.

Surprisingly, Hadoop isn’t mentioned once in any of the Big Data track session descriptions. Some of the Big Data Track sessions are about Data Analytics, bringing Big Data to the enterprise and where to start with Big Data – in all of this I am sure Hadoop will come up, but it’s interesting it was not mentioned in the titles or descriptions.

At most other events that have a Big Data focus, you see Hadoop everywhere. In fact, the feedback from some people who went to the Strata Conference was that Hadoop and Big Data are inseparable. It seems that many have begun to believe that Big Data = Hadoop… but does it?

If Big Data equals Hadoop, then Big Data equals Analytics. But Big Data isn’t that simple. Processing, programming, networking and storage all have implications for Big Data, and I am sure we will discover many more important aspects over the next year or two.

What were once simple tasks are much more complex when dealing with massive data sets. The impact on storage systems and networks when looking at data protection alone is beyond what most organizations have considered before. With data sets that can now be larger than many disk arrays, migrating to new systems is complex and time consuming.

SNW isn’t focused on Data Scientists, but on storage managers, and while Hadoop will surely be talked about, it won’t be the focus of the day. As time goes on, more disciplines will start to look at the implications Big Data has for their part of the IT ecosystem.

Thinking About Big Data on the Eve of Spring Trade Show Season

Spectra Logic’s Kevin Dudak recently became a contributing blogger for the Inside Big Data Blog. His first post, Thinking About Big Data on the Eve of Spring Trade Show Season, has been reprinted below with permission from Rich Brueckner:

Thinking about Big Data on the eve of the spring trade show season

The month of March brings longer days, warmer weather and the start of the spring trade show season. There seem to be as many trade shows as there are interests and industries. Last year, we saw a lot of people start talking about Big Data at these shows. The trend most likely will continue, with Big Data taking a bigger share of the conversation.

Given the years I have been in the storage industry, it should come as no surprise that I tend to look at the storage part of Big Data. Over the last year we have heard a lot about the analytics side of Big Data. It is exciting to see all the amazing things we can do, and the things we can learn, from the massive amount of data we have at our fingertips these days. Without a doubt, we will continue to see much of the conversation focus on leveraging our data sets with tools like Hadoop. Sometimes, it seems we forget that Big Data is more than just the analytics; it is also about storing and managing potentially massive data sets. 2012 will see users and vendors starting to address the changes Big Data brings to storage.

The 2012 Tape Summit and the HPC Symposium kick off the season. The second annual Tape Summit is a gathering of top manufacturers in the data tape industry, including drive, library, software and media companies, as well as press, analysts and bloggers. You don’t see tape and Big Data in the same conversation too often, but I think the tape industry will be looking to change that this year. We will be hearing about the Linear Tape File System (LTFS), continued innovation in data management software and possibly the coming LTO6, and how all of these can have a big impact on storing lots of data.

The HPC Symposium will see presentations from some of the top organizations in the distributed high performance world.  Many of the lessons the HPC world has learned over the last 5 years will make the adoption of Big Data easier and more effective. 

I’ll be watching to see how LTFS might be a good answer to Big Data portability. We are seeing LTFS gain traction in some verticals like Media and Entertainment already. The question of how to move Petabytes of data, either to seed a cloud provider or just to move to a different location, has always been a problem. LTFS might just provide a good answer.

Dealing with massive data sets, be it integrity checking the data or protecting it, is a struggle we all face at one time or another. We are starting to see a new crop of software vendors, some in the Active Archive Alliance, that are creating data storage environments.
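To make the integrity-checking point a little more concrete, here is a minimal Python sketch of one common approach: stream the data through a hash in fixed-size chunks so that even a very large file never has to fit in memory, then compare the digest of the copy against the original. The function name, the chunk size and the sample data are my own assumptions for illustration; this is not how any particular archive product does it.

```python
import hashlib
import io


def stream_digest(stream, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a binary stream, read in chunks."""
    digest = hashlib.sha256()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)
    return digest.hexdigest()


# Hypothetical stand-ins for an original file and its copy on another medium.
original = io.BytesIO(b"frame data " * 100_000)
good_copy = io.BytesIO(original.getvalue())
bad_copy = io.BytesIO(original.getvalue()[:-1] + b"X")

reference = stream_digest(original)
print("good copy matches:", stream_digest(good_copy) == reference)
print("bad copy matches:", stream_digest(bad_copy) == reference)
```

In practice the streams would be files opened in binary mode, and the reference digest would be stored alongside the data so it can be rechecked later.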

Finally, with the expected shipment of LTO6 this calendar year, we will see a doubling of native capacity on media. There should be performance improvements as well. Since the LTO consortium is attending Tape Summit, hopefully we will get more details on it, and on how it might affect the economics of storing big data.

As March rolls on, we should start to see a lot of information coming out of events such as the HPC Symposium and the Tape Summit on not only how to analyze Big Data, but how to manage and store it when it isn’t being crunched.

“Why I’m Thankful…for Big Data Storage”

We should all be thankful as “Big Data” improves storage for everyone.

It’s the beginning of the Holiday season, with Thanksgiving travel in full swing.  I’ll be getting 10 hours of windshield time shortly, as I’m headed to see family.

As more of our customers have moved into the world of “Big Data” we have been looking at how to make storage ready for ExaScale. ExaScale sized storage has challenges that storing a handful of Terabytes never imagined. Spectra announced the 12th generation of BlueScale earlier this month with a lot of advancements for Big Data customers. While “Big Data” can mean a lot of different things to different organizations, one thing that is common is the need to store and manage huge amounts of information. We spent hours working with our customers over the last year looking at where we could make massive storage easier to use.

Simply booting up a multi-petabyte library can be time consuming. Traditionally, a library will reinventory itself when rebooting. This takes a few minutes on a library with 50 tapes, but will take hours on a library with 15,000 tapes. Spectra’s BlueScale 12 operating system will not force a fresh inventory on reboot. If you didn’t change any tapes, why waste all that time? If you did open the library and change things, then you can tell the system to update the inventory. Skipping the unneeded inventory will save our customers hours.
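A rough way to picture the difference is a back-of-the-envelope model like the Python sketch below. The two seconds per tape is a number I made up so the results land in the "few minutes" and "hours" ranges mentioned above, and the function is purely illustrative; it is not BlueScale code or its actual logic.

```python
def reboot_inventory_seconds(tape_count, seconds_per_tape=2.0,
                             contents_changed=False, always_scan=False):
    """Estimate time spent on inventory during a library reboot.

    always_scan models the traditional behavior (inventory on every boot).
    Otherwise a scan happens only if the operator says contents changed.
    seconds_per_tape is a hypothetical figure, not a measured one.
    """
    if always_scan or contents_changed:
        return tape_count * seconds_per_tape
    return 0.0


small = reboot_inventory_seconds(50, always_scan=True)
large = reboot_inventory_seconds(15_000, always_scan=True)
skipped = reboot_inventory_seconds(15_000)

print(f"50 tapes, forced scan: {small / 60:.1f} minutes")
print(f"15,000 tapes, forced scan: {large / 3600:.1f} hours")
print(f"15,000 tapes, nothing changed: {skipped:.0f} seconds")
```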

The number of components that might need code updates over the life of the library grows as storage grows, too. What would take a few minutes with a 2-drive tape library could take hours with a 120-drive library. With BlueScale 12, updates are done in parallel, so 120 drive sleds can be updated in the time of one.
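The arithmetic behind serial versus parallel updates is simple enough to sketch in a few lines of Python. The five minutes per sled is an assumed figure for illustration only, and the helper is my own, not part of any Spectra tool.

```python
import math


def update_minutes(component_count, minutes_each, parallel_slots=1):
    """Total wall-clock time when components are updated in batches.

    parallel_slots is how many components can be updated at once;
    1 means strictly one after another.
    """
    batches = math.ceil(component_count / parallel_slots)
    return batches * minutes_each


MINUTES_PER_SLED = 5  # hypothetical update time for one drive sled

serial = update_minutes(120, MINUTES_PER_SLED)
parallel = update_minutes(120, MINUTES_PER_SLED, parallel_slots=120)

print(f"120 sleds, one at a time: {serial / 60:.0f} hours")
print(f"120 sleds, all in parallel: {parallel} minutes")
```

With those assumptions, the serial case runs to 10 hours while the parallel case finishes in the time of a single update.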

Of course, most organizations are not rebooting their libraries or updating firmware every month. We have continued to increase the assistance Media Lifecycle Management gives our customers. The analytics we evaluate on our Certified Media, combined with Data Integrity Verification on the data written, automatically inform administrators if there is an issue. They do not need to spend any time managing it; it just works. BlueScale 12 adds MLM support for TS1140 technology tape media in Spectra T-Finity libraries.

These enhancements and more, like the XML interface, Carbide Clean and RAIT, make managing the largest storage environments easy and reliable. The great thing about Spectra T-Series libraries is they all run BlueScale. Our smaller T50e customers get the same software updates and benefits as our largest T-Finity customers. Not all our customers generate multiple petabytes of data, but they all have data that is important to their business. Being able to bring the advances that “Big Data” drives to all our customers is something I am thankful for.

Now, back to the road. Safe travels this holiday season!

Rethinking Storage with Big Data

The annual Spectra sales kick off meeting just wrapped up.  I am working on getting this entry done before I leave for a week’s vacation.  It was an interesting few days of discussions and presentations.  There were a lot of outside speakers this year, really adding to the variety of topics and viewpoints. 

I ended up on the schedule three times this year. I would talk to my agent about when they were scheduled, if only I had one. One of my sessions was an update on the Big Data market. As I got up to start the presentation, I was surprised at how well it tied into everything we had heard already. It seems most market segments and verticals are facing a Big Data challenge of one type or another.

Most of the external conversations about Big Data seem to immediately go to analytics. I think it is fascinating how we can derive and learn so much from data we already have. As we learn more about getting value from our data, we want even more data to analyze. This makes me somewhat surprised that so little of the conversation really focuses on the storage problems of Big Data.

I talked with the sales team this week about how Big Data is changing the storage rules. There are a lot of things that work on a 10 TB data set that are not practical with a 1 PB data set. As you amass hundreds of Terabytes of data, and start heading toward a Petabyte, you need to look at the basic questions again: How do I store it? How do I protect it? How do I move it? The implications of these questions when viewed at the Petabyte level are interesting.

If the data does not change much, moving these big data sets into an active archive can help.  In an archive, disk and tape both serve as primary storage systems.  What a lot of people don't initially consider is that tape might be the best primary storage platform for these data sets.  Nothing beats the TCO of tape, making it the most affordable storage platform for these large data sets.  Bandwidth isn't just expensive, there simply isn’t enough of it to replicate Petabyte data sets.  Tape's native portability makes it possible to move data at massive scales. Static data written to two different tapes does not need traditional weekly backups.  This might just solve a few problems. 
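To put the bandwidth point in numbers, here is a small Python sketch that estimates how long it takes to push a data set over a network link. The link speeds are examples I picked, the sizes use decimal units, and the model assumes the link is fully dedicated to the transfer, which real links rarely are.

```python
def transfer_days(data_bytes, link_bits_per_sec, efficiency=1.0):
    """Days needed to move data_bytes over a link.

    efficiency is the fraction of the link actually usable (1.0 = ideal).
    """
    seconds = data_bytes * 8 / (link_bits_per_sec * efficiency)
    return seconds / 86_400


TB = 10**12
PB = 10**15
GBIT = 10**9

for label, size in (("10 TB", 10 * TB), ("1 PB", PB)):
    for link_name, rate in (("1 Gb/s", GBIT), ("10 Gb/s", 10 * GBIT)):
        days = transfer_days(size, rate)
        print(f"{label} over {link_name}: {days:.1f} days")
```

Even with an ideal 10 Gb/s link, a Petabyte takes more than nine days to move, and any real-world overhead only stretches that further.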

As a guy with years of experience with disk, it is interesting to look at the basic questions again in the era of Big Data. I think tape has a good fit in some of the areas while disk fits others, and I look forward to exploring it more in the future. But no more today. I am off to ride my bike across Iowa.

Archive and Backup

Last week I looked at how archiving data can help take the strain off backup systems. The theory is pretty simple: if you have less data to back up, then it is easier to meet backup windows with fewer resources.

It should not come as a shock that this basic idea also helps in a disaster recovery event.  I am going to stick with the same example as last week, which is a customer with 100 TB of data.  I am doing this both to be consistent for those of you that read my article last week, and also because I am lazy and have already done most of the work. 

Before I get too deep into this extremely interesting conversation, I want to make sure we are all thinking about the same scenario with respect to archive. Many people think of archives as a place where data goes to die. Today's Active Archive systems are very different. Most file-based data that is fairly static can easily reside in the archive. When needed, the data is read from the archive directly into the application requesting it. Active Archiving is as much about how we manage data as it is about how we store it.

It is this direct accessibility of the data that has the biggest impact on disaster recovery. All good archive software can create multiple copies of the data in the archive. These copies can be cross-platform. In last week’s blog, I assumed three total copies of the archived data: one on disk and two on tape. One of the tape copies is stored off site.

Before implementing an archive solution, this 100 TB organization needed to plan to recover 100 TB of data. This includes moving the data, having storage to recover it to and, most important, the time to recover 100 TB.

After moving 80 TB into the archive, recovery looks different.  All that needs to be done to access the 80 TB in the archive is to ensure the archive application is running.  The data is directly accessible from the archive medium, either disk or tape.  Actual data recovery is now reduced to the 20 TB that are not archived. 

I think of this as a double win. Day to day, an offsite DR copy gets created and moved. Instead of 100 TB of data to keep constantly up to date at the remote site, we now only have to keep 20 TB up to date. The Archive application keeps the 80 TB of archive data automatically updated. Since that data does not change much, it does not require large amounts of data to move. Then, when recovery is necessary, only 20 TB must be recovered, rather than the full 100 TB of data, as we can access the archive without recovery.
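Here is a quick Python sketch of what that means for recovery time. The 500 MB/sec restore rate is a number I made up for illustration; plug in whatever your own environment can sustain. The 100 TB and 20 TB figures come from the example above.

```python
def restore_hours(data_tb, throughput_mb_per_sec):
    """Hours needed to restore data_tb at a sustained rate (binary units)."""
    megabytes = data_tb * 1024 * 1024
    return megabytes / throughput_mb_per_sec / 3600


RESTORE_RATE_MB_PER_SEC = 500  # hypothetical sustained restore throughput

before_archive = restore_hours(100, RESTORE_RATE_MB_PER_SEC)
after_archive = restore_hours(20, RESTORE_RATE_MB_PER_SEC)

print(f"Recover 100 TB: {before_archive:.1f} hours")
print(f"Recover 20 TB:  {after_archive:.1f} hours")
print(f"Reduction: {before_archive / after_archive:.0f}x")
```

Whatever rate you assume, the recovery is five times shorter, because the archived 80 TB only needs the archive application running rather than a restore.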

This just might make it possible for people to implement better DR plans.

How are you using archive in relation to disaster recovery? We’d love to hear your stories and best practices.

Follow me on Twitter.com/3pedal.

 

How Archive Can Fix Backup

I have talked with a lot of storage and backup administrators over the years.  While everyone has challenges that are just as unique as their data sets and organizations, there is a common theme.  Users typically say, "My backups take too long," "I need faster hardware," "I take too many tapes offsite" or "Even with deduplication, I can't afford to replicate". 

Constant data growth and shrinking backup windows are the one-two punch that makes it very difficult for us to successfully protect all of our data. For many organizations, backup can be the highest bandwidth application on the network. After all, we are trying to move a copy of all of our data over the network every weekend. This touches the primary storage system the data resides on, the clients that access it, the entire network infrastructure and of course the backup servers and storage. No wonder troubleshooting backup performance problems can drive even a teetotaler to drink.

The traditional way to deal with backup performance issues is to deploy more, faster gear. We buy faster networks, install more connections and procure larger, faster storage systems, be they disk or tape. I know several companies whose first 10 GbE network clients were the actual backup servers. We deploy deduplication to try to shrink the footprint of backup.

As we go through all of these architecture changes, we never look at the root of the problem: there is too much data to back up. If we had less data to protect, the stress on the systems would be much lower. Of course, we can't just delete a bunch of data because it will make our life easier. So, how do we reduce the data we have to manage? We can archive it.

It is easy to forget that most of the data stored by a company is static and rarely used.  After all, we use data every day.  But in truth, that data we manage daily is typically only a very small percentage of the total data a company stores.  Let’s consider a 100TB environment:

For this example, we will assume that every terabyte (TB) of data in production ultimately creates 25TB of data in the backup between all copies. This is based on four daily backups (each 10% the size of the full); a full weekly backup saved for four weeks; plus end-of-month backups saved for one year; and finally, end-of-year backups saved for seven years. So, a 100TB data set ends up driving 2.5 petabytes of backup data. Combine the 100 TB of production data with the 2.5 PB of backups and the organization has to manage a total of 2.6 PB. That is pretty significant. If you want to complete a full backup of the 100 TB in 24 hours, you need roughly 1,215 MB/sec of nonstop backup performance.
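For anyone who wants to check the math, here is a short Python sketch of the example. The 25X multiplier and the 100 TB figure come straight from the paragraph above. The retention breakdown is one possible reading of the schedule (I am assuming the dailies are kept for the same four weeks as the weekly fulls), so treat it as an approximation that lands near 25X rather than the exact derivation. The throughput uses binary units (1 TB = 1024 × 1024 MB).

```python
PRODUCTION_TB = 100
BACKUP_MULTIPLIER = 25  # figure used in the example above

# One reading of the retention schedule, in units of "one full backup".
# The four-week retention for dailies is an assumption, not stated above.
retained_fulls = {
    "dailies (10% of a full, 4 per week, assumed kept 4 weeks)": 16 * 0.10,
    "weekly fulls kept 4 weeks": 4,
    "month-end fulls kept 1 year": 12,
    "year-end fulls kept 7 years": 7,
}
approx_multiplier = sum(retained_fulls.values())

backup_tb = PRODUCTION_TB * BACKUP_MULTIPLIER
total_tb = PRODUCTION_TB + backup_tb


def required_mb_per_sec(data_tb, window_hours=24):
    """Sustained rate needed to copy data_tb within the backup window."""
    return data_tb * 1024 * 1024 / (window_hours * 3600)


print(f"Approximate multiplier from retention: {approx_multiplier:.1f}x")
print(f"Backup data: {backup_tb} TB, total managed: {total_tb} TB")
print(f"Full backup in 24 h needs {required_mb_per_sec(PRODUCTION_TB):.0f} MB/sec")
```

The computed rate comes out a hair under the 1,215 MB/sec quoted above, which looks like a small rounding difference.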

So, let’s archive 80% of the data in our example.  It seems reasonable that 80% can be archived for most organizations. 

The 20% "production" data still drives a 25X expansion in backup data, but now only needs 500TB to store, and a 243MB/sec environment to protect in 24 hours. The 80TB that gets archived in this example has three copies made for redundancy. One of these copies is stored off site for disaster recovery. The three copies consume 240TB of storage. All told, instead of 2.6PB of data in the environment, the same data now consumes 760TB.
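The same numbers can be laid out in a few lines of Python. Every input here is taken from the example above (100 TB, 80% archived, 25X backup expansion, three archive copies); only the variable names are mine.

```python
PRODUCTION_TB = 100
ARCHIVE_FRACTION = 0.80
BACKUP_MULTIPLIER = 25
ARCHIVE_COPIES = 3

archived_tb = PRODUCTION_TB * ARCHIVE_FRACTION
active_tb = PRODUCTION_TB - archived_tb

backup_tb = active_tb * BACKUP_MULTIPLIER          # backups of the active 20%
archive_storage_tb = archived_tb * ARCHIVE_COPIES  # three copies of the archive

total_before_tb = PRODUCTION_TB + PRODUCTION_TB * BACKUP_MULTIPLIER
total_after_tb = active_tb + backup_tb + archive_storage_tb

# Rate to back up the active data in 24 hours, binary units as before.
active_mb_per_sec = active_tb * 1024 * 1024 / (24 * 3600)

print(f"Before archiving: {total_before_tb} TB")
print(f"After archiving:  {total_after_tb:.0f} TB")
print(f"Backup rate for active data: {active_mb_per_sec:.0f} MB/sec")
```

Note that the 760 TB total counts the 20 TB of active production data, its 500 TB of backups and the 240 TB of archive copies.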

Instead of increasing the performance of your servers, software, storage and network to meet backup windows, a solid archive strategy lets you effectively reduce the total amount of data to be backed up. 

What are your best practices for archiving?
Follow me at www.Twitter.com/3pedal

It's beginning to look a lot like Christmas

Writing a Holiday blog about storage is more challenging than I expected. I love the season as much as anyone, and a good laugh more than most, so I was the natural pick.

The early suggestions were all interesting. Is Deduplication like the Red Ryder BB gun in A Christmas Story? You might shoot your eye out, but most likely, it will be a good thing to get? We did the 12 days of Storage Christmas last year, so it is too soon to do that again.

I tried to figure out how to make one of my favorite Christmas movies work, but just could not come up with a tie-in with Die Hard. (Store Hard: The story of a vacationing police detective trying to stop terrorist criminal masterminds from consuming all the storage during a Christmas Party?)

I thought about doing a storage-centric Night Before Christmas version, but if there was ever a good data source to deduplicate, it is the countless versions of the Night Before Christmas. So why not butcher a Christmas song?

It’s beginning to look a lot like Christmas
Ev'rywhere you go;
Take a look in the server room, full of stuff that's new
With shiny lights that wink and blink and glow.

It's beginning to look a lot like Christmas
Space for everyone
But the coolest thing to have is storage that will grow
Till the cows come home.

Some deduping gear and a WAN link that works
Is the plan to roll out DR
To do more with less and store what we have
Is management's hope all along;
And everyone can hardly wait for work to start again.

It's beginning to look a lot like Christmas
Ev'rywhere you go;
A library just installed, with tapes to store it all
Your data will be safe all winter long.

It's beginning to look a lot like Christmas;
There's gifts under the tree,
Both disk and tape have found a way to pal around
In your backup plan.

 

  

The Zombie DR Survival Guide (Halloween special edition)

I’ve worked for most of my professional life in data storage, focused primarily on data protection. I am always looking for ways to better protect data. I typically think about how to make things run more quickly and reliably, and with less user intervention. Occasionally, I look at identifying and preparing for potentially unexpected outage causes. I’ve experienced recoveries or complete infrastructure failures caused by various random acts: a spilled glass of milk, using a floor buffer and plugging a network cable into the incorrect port. These all caused unimagined disruption, but this week, Halloween week, I am thinking about something much worse.

I have been aware of the pending Zombie invasion for some time now. I am ashamed to admit that I’ve been mostly focused on personal survival, not that of my data. The reality is that just like any other disaster, once it is over, we will need our data to get things back to normal. If an army of undead walking the streets has you frightened, don't worry. You can take steps to protect your data, and yourself.

Let's be honest, personal survival is important when contemplating a Zombie attack (or fire, flood, storm…).  Don't fret if you have not started planning yet; it’s not too late.  There are several resources available to get you started.  I recommend starting with the Zombie Survival Guide.  It is a great first step.

After you start planning for the care of yourself and family, we need to start looking at your data.  In many ways this is going to be similar to other DR scenarios, with a couple of unique differences. 

When you hear that the Zombies are coming, you don't want to spend time thinking, you need to get moving right away.  Having checklists will help you make sure you don't forget anything.  You don't want to be facing down a horde of Zombies just when you remember you should have brought ammo.  A checklist will help you remember what to bring when you bug out of the data center. 

There is no scientific evidence I am aware of that suggests Zombies are attracted to IT equipment, but some studies have suggested they find the hum of air handlers hypnotic.  Since they tend to be pretty rough on basically everything around them, you will not be able to count on recovery at your data center once the Zombies are gone.  This means you have to have a copy of your backups out of the building. 

We are all building a personal survival kit, but you also need a data survival kit.  If you have preselected a recovery site and are replicating data nightly, then you can pre-stage your kit.  The benefit to this solution is you have less stuff to carry.  The down side is a loss in flexibility when recovering.  Hopefully your remote site didn't get hit by Zombies, too.  The other option is storage tapes, as they are extremely portable.  In this case, you will carry your data survival kit with you.  Now you can recover anywhere. 

Where should your recovery site be located? For suggestions I talked with noted data and Zombie expert (and Spectra Logic sales engineer) Mark Pinder. Most of Mark's suggestions are simply reiterations of traditional DR planning, but he did have a Zombie-specific suggestion. "Zombies are pretty tough, but do have a few weaknesses you can exploit. One is an inability to climb. When spec'ing out a Zombie-proof recovery site, I always insist on an elevated facility. The only way in or out should be climbing a ladder encased in a tube."

So what should you include in your data survival kit?

  • A self-describing backup of your data.  This means both the data and the indexes and catalogs that tell the backup software about them.
  • A copy of the backup history report. Knowing what backups are on which tapes, and where the backup catalog is, helps speed up the recovery process.
  • A clone of Bruce Campbell to guard the entrance.
  • A copy of the software and your license keys. Copies of the backups are useless if you don't have the software to read them.
  • Recovery plan.  This plan should include the order of recovery, plans for equipment replacement, personnel involved in recovery operations and their designated replacements should they be “Zombiefied” (a technical term, by the way) and notification procedures for the data users when access is restored.
  • Money or other items that can be used to barter with.

This is obviously a quick review, but should get you started.  If you can be prepared for a Zombie invasion, you should be able to handle a normal disaster with no problem.

To my readers: have a happy, safe Halloween.

 

Follow me at www.Twitter.com/3pedal

Deduplication: A Personality Analysis

I just finished re-reading Keith Schultz’s recent article, which is a test lab review of three different deduplication appliances on the market today. Spectra nTier Deduplication was one of the three. I always like to read reviews about my products, especially when they are positive. If you have not read it, I don’t want to spoil the surprise, but will tell you that InfoWorld found that all the systems… worked. As much as I wish my deduplication appliance was the only one that worked, I must remain honest. Most, if not all, mainstream deduplication solutions carry out their primary goal: reducing your data.

Keith did learn that the systems had different personalities, and in my opinion, almost all of the deduplication systems on the market have a unique personality. Spectra nTier Deduplication is VTL based, and well suited to environments with tape-based backup and archive systems. VTL not only fits well with tape-based environments, it also offers high performance. Additionally, nTier’s ability to send block data over Fibre Channel offers significant performance advantages over NAS.

Personality can open doors, but only character can keep them open. Our nTier definitely has character, and its personality is all about both high performance and tight tape integration for backup and archive.

Each deduplication solution has its own distinct personality, and no single product can do everything. If you choose a dedupe appliance with a NAS interface, you may sacrifice performance in exchange for the ability to host some production data. For some customers, this will be a good trade; for others it will not be acceptable or provide efficient backup and archive. nTier Deduplication is designed for backup and archive data, and does a great job with that workflow, but was not designed to be used as a primary storage device.

What should an organization consider when looking at a deduplication solution? First, they need to know how they will use it and what is important. Is integration with backup software more critical than other factors like performance? Once the organization has a good feel for the personality of the device they need, it should be easy to reduce the number of systems to consider to a “short list” of two or three solutions. After that, learn more about each one and, if need be, test them before purchasing.

Follow me at www.Twitter.com/3pedal

  
