Tape services – Beyond Just Data Recovery

What happens when you need to access files from an old backup tape that is no longer compatible with your backup system, tape drive or backup software?

The rapidly changing world of IT means that new technology is constantly replacing the old. As backup regimes change, old tapes become redundant, yet requests to restore old files from them still arrive. Furthermore, data compliance regulations require businesses to retain data for many years, often longer than the availability of the technology used to store it.

Causes of tape failure and data loss
•    Corruption – operational error, mishandling of the tape, or accidental overwrites caused by inserting the wrong tape or partially formatting it
•    Physical damage – broken tapes, dirty drives, expired tapes and damage caused by fire, flood or other natural disasters
•    Software upgrades – data on the tape can no longer be read by new applications or servers

Tape recovery process
•    Tape recoveries are performed in dust-free cleanroom environments
•    Tapes and tape drives are carefully dismounted, examined and processed
•    Proprietary tools can “force” the drive to read around the bad area to recover your data successfully (see the sketch after this list)
•    Drives are imaged, and a copy of the data is created and transferred to a new system
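
For readers curious what “reading around” a bad area looks like at the block level, here is a minimal sketch: image the medium block by block, and when a block cannot be read, record the bad range and pad it with zeros instead of aborting. This is a simplified illustration, not the proprietary tooling described above; the source and destination paths are placeholders, and real tape handling involves specialised hardware.

```python
# Minimal sketch: block-by-block imaging that skips unreadable regions.
# Paths are placeholders; real tape/drive recovery uses specialised tools.
BLOCK_SIZE = 64 * 1024  # bytes read per attempt

def image_with_skips(source_path, dest_path):
    """Copy source to dest, padding unreadable blocks with zeros."""
    bad_ranges = []
    with open(source_path, "rb", buffering=0) as src, open(dest_path, "wb") as dst:
        offset = 0
        while True:
            src.seek(offset)
            try:
                block = src.read(BLOCK_SIZE)
            except OSError:
                # Unreadable region: log it, pad with zeros, keep going.
                bad_ranges.append((offset, offset + BLOCK_SIZE))
                dst.write(b"\x00" * BLOCK_SIZE)
                offset += BLOCK_SIZE
                continue
            if not block:
                break  # end of the source image
            dst.write(block)
            offset += len(block)
    return bad_ranges

# Example use (placeholder paths):
# bad = image_with_skips("failing_media.img", "rescued_copy.img")
# print(f"{len(bad)} unreadable regions were padded with zeros")
```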

Storage Server Data Disasters – Common Scenarios (Part I)

When a data loss occurs on something as valuable as a server, it is essential to the life of your business to get back up and running as soon as possible.

Here is a sampling of specific types of disasters, accompanied by actual engineering notes from recent Remote Data Recovery jobs:

Causes of partition/volume/file system corruption disasters:
•    Corrupted file system due to system crash
•    File system damaged by automatic volume repair utilities
•    File system corruption due to partition/volume resizing utilities
•    Corrupt volume management settings

Case study
Severe damage to partition/volume information on a Windows 2000 workstation. Third-party recovery software had been used without success, and the OS had been reinstalled, but the customer was still looking for the second partition/volume. We found it; 100% recovery.
Evaluation time: 46 minutes (evaluation time represents the time it takes to evaluate the problem, make the necessary file system changes to access the data, and report on all of the directories and files that can be recovered)
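
As an aside on how a “lost” partition can be found when partition information is damaged, one common approach is to scan the raw disk for file system boot sector signatures. The sketch below, a simplified illustration rather than the tooling used in the case above, searches a raw disk image for NTFS boot sectors at sector boundaries; the image path is a placeholder.

```python
# Sketch: locate candidate NTFS volumes in a raw disk image by scanning
# sector boundaries for the NTFS boot sector signature.
SECTOR = 512

def find_ntfs_boot_sectors(image_path):
    candidates = []
    with open(image_path, "rb") as img:
        offset = 0
        while True:
            sector = img.read(SECTOR)
            if len(sector) < SECTOR:
                break
            # An NTFS boot sector carries the OEM ID "NTFS    " at byte 3
            # and the 0x55AA end-of-sector marker at bytes 510-511.
            # (Backup boot sectors at the end of a volume also match.)
            if sector[3:11] == b"NTFS    " and sector[510:512] == b"\x55\xaa":
                candidates.append(offset)
            offset += SECTOR
    return candidates

# Example use (placeholder path):
# for off in find_ntfs_boot_sectors("workstation.img"):
#     print(f"possible NTFS volume starting at byte offset {off}")
```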

Causes of specific file error disasters:
•    Corrupted business system database; file system is fine
•    Corrupted message database; file system is fine
•    Corrupted user files

Case study
Windows 2000 server: a volume repair tool damaged the file system, leaving the target directories unavailable. Complete access to the original files was critical. Remote data recovery safely repaired the volume and restored the original data; 100% recovery.
Evaluation time: 20 minutes

Exchange 2000 server, severely corrupted information store; cause of the corruption unknown. We scanned the information store file for valid user mailboxes; the scan took up to 48 hours due to the corruption. The backup was one month old and not valid for the users.
Evaluation time: 96 hours (four days)

Storage Server Data Disasters – Common Scenarios (Part II)

Possible causes of hardware related disasters:
•    Server hardware upgrades (storage controller firmware, BIOS, RAID firmware)
•    Expanding storage array capacity by adding larger drives to controller
•    Failed array controller
•    Failed drive on storage array
•    Multiple failed drives on storage array
•    Storage array failure but drives are working
•    Failed boot drive
•    Migration to new storage array system

Case study
NetWare server, traditional NWFS volume: a failing hard drive made the volume inaccessible and NetWare would not mount it. The errors on the hard drive were not in the data area and the drive was still functional. We copied all of the data to another volume; 100% recovery.
Evaluation time: 1 hour

Causes of software related disasters:
•    Business system software upgrades (service packs, patches to business system)
•    Anti-virus software deleted or truncated a suspect file in error, and data has been deleted, overwritten or both

Case study
Partial drive copy overwrite using third-party tools: the overwrite started and then crashed 1% into the process, leaving a large portion of the original data intact. We rebuilt the file system and provided reports on the recoverable data; the customer will be asking us to test some files to verify the quality of the recovery.
Evaluation time: 1 hour

Causes of user error disasters:
•    During a data loss disaster, backup data was restored to the exact location of the lost data, thereby overwriting it
•    Deleted files
•    Operating system overwritten by a reinstall of the OS or application software

Case study
The user’s machine had its OS reinstalled from a restore CD; the user was looking for an Outlook PST file. Because the original file system was completely overwritten, we searched the drive directly for PST data. We found three potential files that might contain the user’s data; after using PST recovery tools, we found that one of them contained the user’s email. Some messages were missing, but the majority of the messages and attachments came back.
Evaluation time: 5 hours
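
Searching a drive for PST data when the file system has been overwritten is usually done by signature scanning (file carving): Outlook PST files begin with the four-byte magic value “!BDN”, so candidate files can be located by looking for that signature in the raw disk. The sketch below shows only the locating step and is an illustration, not the exact procedure used in this case; the image path is a placeholder.

```python
import mmap

PST_MAGIC = b"!BDN"  # first four bytes of an Outlook PST file header

def find_pst_candidates(image_path):
    """Return byte offsets in a raw disk image where a PST header may start."""
    offsets = []
    with open(image_path, "rb") as img:
        with mmap.mmap(img.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            pos = mm.find(PST_MAGIC)
            while pos != -1:
                offsets.append(pos)
                pos = mm.find(PST_MAGIC, pos + 1)
    return offsets

# Example use (placeholder path): each offset is a candidate region to
# extract and then validate with PST recovery tools.
# print(find_pst_candidates("overwritten_drive.img"))
```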

Causes of operating system related disasters:
•    Server OS upgrades (service packs, patches to OS)
•    Migration to different OS

Case study
NetWare traditional file system, 2TB volume: the file system was damaged while trying to expand the size of the volume. Repaired in place; volume mountable.
Evaluation time: 4 hours

Computer Data Storage Tips

Successful server recoveries: Preventing further damage when a server goes down

Despite industry improvements in backup systems and storage arrays, server failures are a common occurrence that can leave a business paralyzed. Whether the failure is hardware-related, software-related, the result of human error or due to a natural disaster, the number of data loss events is increasing as businesses rely ever more heavily on their corporate servers and document storage volumes.

How to increase the chances of a successful recovery:

•    Use a volume defragmenter regularly: A defragmenter moves the pieces of each file or folder to one location on the volume, so that each occupies a single, contiguous space on the disk drive. This helps improve the quality of recovery, making files and folders easier for data recovery specialists to locate. Do not run defragmenter utilities on suspected bad drives – if drives are bad, this could have damaging effects

•    Perform a valid backup before making hardware or software changes

•    If a drive is making unusual mechanical noises, turn it off immediately and get assistance from your data recovery company

•    Before removing drives, label the drives with their original position and RAID array

•    Never restore data to the server that has lost the data – always restore to a separate server or alternate location

•    In Microsoft Exchange or SQL failures, never try to repair the original information store or database files – make a copy and perform recovery operations on the copy

•    When replacing drives on RAID systems, never replace a failed drive with a drive that was part of a previous RAID system – always zero out the replacement drive before using it (see the sketch after this list)

•    In a power loss situation with a RAID array, if the file system looks suspicious or is unmountable, or the data is inaccessible after power is restored, do not run volume repair utilities. Do not run volume repair utilities on suspected bad drives
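
To make the “zero out the replacement drive” advice from the list above concrete, the sketch below overwrites a target with zeros in fixed-size chunks. It is deliberately generic: the target path is a placeholder, the operation is destructive by design, and in practice you would normally use your operating system’s own disk utilities rather than a script like this.

```python
import os

CHUNK = 4 * 1024 * 1024  # write 4 MiB of zeros at a time

def zero_fill(target_path):
    """Overwrite every byte of target_path (device or image file) with zeros.

    WARNING: destructive. The path is a placeholder; never point this at a
    drive that still holds data you need.
    """
    with open(target_path, "r+b") as dev:
        size = dev.seek(0, os.SEEK_END)  # total size in bytes
        dev.seek(0)
        remaining = size
        while remaining > 0:
            n = min(CHUNK, remaining)
            dev.write(b"\x00" * n)
            remaining -= n
    return size
```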

Hardware Life Cycle Management (Part I)

Every IT professional can tell a horror story about an upgrade, roll-out, or migration gone awry. So many factors are involved: hardware, software, compatibility, timing, data, procedures, security protocols, and of course the well-meaning but imperfect human.

Throughout 2008, IT departments and staff can look forward to a number of upgrade projects for their computer system infrastructure. According to Gartner, Inc., the number of PC shipments during the fourth quarter of 2007 increased 13.1% over the same period in 2006. Global PC shipments during 2007 increased 13.4% over 2006, equating to 271.2 million units in 2007.

While a slower economy than in previous years may lower the number of units, the fact that organizations have been investing in new units shows that Hardware Life-cycle Management is still a mainstay of corporate IT’s responsibilities and will continue to be so.

IT professionals realize that scheduled change is a pattern for the industry. Whether this change involves accommodating new users, replacing old servers, or upgrading staff to newer systems, there is always change within the IT organization. It is sometimes tempting to rely only on hardware or software budgets for your roadmap. However, these budgets may be short-sighted and lack proper planning. Using accounting budgets alone to manage hardware may not take into consideration the overall life span of the equipment.

Equipment/software life-cycles and your road map
Managing IT equipment and product life-cycles is an important function of IT department staff. As a goal, equipment life-cycle management should reduce failures and data loss, because computer equipment is replaced before it fails, and it should reduce the total cost of equipment management over its lifetime. Depending on the organization, equipment life-cycles are based on different criteria.

•    Warranty expiration: If your IT infrastructure has a mix of equipment in place, with different makes and types of equipment, then warranty-based product life-cycle management will be complicated. Using this approach is not only short-sighted, it also repeats the pattern of the original, unplanned purchase. Consider the expanding department that needs to plead with the CFO or budgetary manager for a non-planned equipment purchase. Three years later, when the warranty expires, the department will be back again on its knees begging for replacements or an extension to the expiring warranties. Either way, it will be an unplanned expense.

•    Waiting until equipment fails: In our economy, budgets are tight and management rightfully wants to get the most production or usage out of a piece of equipment before having to replace it. This approach is very risky and will usually cost more in the end. IT equipment rarely fails at a “convenient” time. If you’re lucky, the failure occurs during a slower period and your IT department is equipped to get you back up and running quickly. In reality, this is not usually the case. Consider the real cost of equipment failure if it is month-end or year-end and the server with the financial data crashes; or a company has just secured a large contract and at the eleventh hour one or more workstations fail or become intermittent, causing wasted downtime on the project and inefficient use of personnel resources.

•    Capital expense budgets: Some IT departments base their product life-cycles on departmental accounting policies for capital expense purchases. Of course, this alternative method can have a knock-on effect when there is a business need for expansion that wasn’t considered in the fiscal budget. Additionally, in larger user environments, departments may control their own capital expense budgets, so there may be many departments with different budget needs. When the life-cycle of one department’s equipment is complete, the resulting fragmented purchases may actually reduce your company’s buying power. In contrast, a more structured approach would concentrate equipment purchases at set times throughout the year. This method is preferred by CFOs and budget managers, who will use a predefined purchase allocation per business unit or department to facilitate budget planning for the next year.

Hardware Life Cycle Management (Part II)

There are a number of financial planning exercises that can help you determine if capital expenses for PC hardware with complete parts and service contracts for the life of the unit are best suited for your IT infrastructure.

Alternatively, leased IT equipment may be more cost effective and would assist in maintaining a more comprehensive IT equipment life-cycle program.

As we dig further into this topic, you will see that hardware and software deployment planning is just the start of discussion for the IT group. Migration planning raises more questions than answers and these questions start with equipment and software life-cycle management. For example, planning discussions can start with these questions:

•    What is your IT department’s roadmap for equipment management?
•    What about the users you support? Does your roadmap align with their needs?
•    What requirements have inter-company business owners or department managers contributed to the overall equipment management policy? Are any of the suggested requirements based on some of the above mentioned methods? (i.e., does the accounting department determine the life-cycle or does the OEM warranty determine the life-cycle, or is the policy just to “run the equipment into the ground”?)

Visualising the product map of the software your organisation uses and planning your major equipment purchases within a timeline helps structure your hardware retirement strategy.   By synchronising your hardware purchases with your software investment, you can minimise large capital expenditures and stagger departmental purchases so that you can qualify for volume discounts.

Additionally, if your organisation qualifies for specific licensing models, you may be able to plan your software purchasing on alternate years from your hardware purchasing. Take Microsoft’s core software products as an example (Fig. 1).

Figure 1: Recent Microsoft software product launches

It is tempting to think that only hardware equipment has a life-cycle, yet the above example clearly shows that software too has a life-cycle. Could your IT infrastructure benefit from synchronising your life-cycle management of both PC hardware units and software licenses? Where does your organisation envision product adoption and integration with respect to manufacturer rollout? Finally, does your PC hardware for servers, desktops, and laptops or notebooks align with or complement that vision?

Hardware Life Cycle Management (Part III)

Planning for a migration
Planning for product life-cycles necessitates an implementation strategy. Migration of computer systems has evolved from the manual process of a complete rebuild and then copying over the data files to an intelligent method of transferring the settings of a particular system and then the data files.

Many IT professionals can attest that setting up and fine tuning new servers is a large investment of time. Whether it is the complexity of domain controllers, user and group policies, security policies, operating system patches, or additional services for users, all of these require time to set up. Fine tuning the server after the rollout can be time consuming as well. Once setup is complete, a system administrator wants the confidence that the equipment and operating system are going to operate normally.

Thought needs to be given as well to the settings and other customization that users have made on their workstations. Some users are allowed a number of rights over their machines and can therefore customize software installations, move default file locations to alternate locations, or run programs that are unknown to the IT department. This can make a one-size-fits-all migration unsuccessful because of all of the unique user settings. The aftereffect is a disaster, with users missing software and data files, losing productivity as they re-customize their workstations, and, worst of all, finding files overwritten or lost.

Deployment test labs are a must for migration preparation. A test lab should include, at a minimum, a domain controller, one or two sample production file servers, and enough workstations, sample data, and users to simulate a user environment. Virtualization software can assist with testing automated upgrades and migrations. The software tools used to do the actual migration are varied – some come from operating system software vendors, others may be third-party applications or enterprise software suites that also provide archiving functions. There are a number of published documents and suggestions for migration techniques.

The success of a migration rests on analysis, planning, and testing before rolling out changes. For example, one company with over 28,000 employees had a very detailed migration plan for its users. The IT department used a lab, separate from the corporate network infrastructure, to test deployments and had a team working specifically on migration. The team had completed the test-lab phase of the plan, and the migration was successful in that controlled environment.

The next phase was to roll out a test case on some of the smaller departments within the company.  The test case migration was scheduled to run automatically when the users logged in. The migration of the user computers to a new operating system started as planned. After the migration, the user computers automatically started downloading and installing software updates (a domain policy). Unfortunately, one of these updates had not been tested. The unexpected result was that user computers in the test case departments were inoperable.

Some of the users in the test case contacted the IT Help Desk for assistance. IT immediately started troubleshooting the operational issues of the problem without realizing that this was caused by a migration test case error. Other users in the department who felt technically savvy tried solving the problem themselves. This made matters worse when one user reformatted and reinstalled the operating system and overwrote a large portion of original data files.

Fortunately for this company, their plan was built in phases and had break-points along the way so that the success of the migration could be measured. The failure in this case was two-fold in that there were some domain policies that had not been implemented on test lab servers, and the effect of a migration plus the application of software updates had not been fully tested. The losses were serious for some users, yet minimal for the entire organization.

For other migration rollouts, the losses can be much more serious. For example, one company’s IT department created a logon script to apply software updates. However, an untested line of the script started a reinstall of the operating system. As users logged into their computers at the start of the week, most noticed that startup was taking longer than usual. When they were finally able to access their desktops, they found that all of their user files and settings were gone.

The scripting problem was not seen during the test lab phase, IT staff said. Over 300 users were affected and nearly 100 computers required data recovery services.

This illustrates the importance of the planning and testing phases of a migration. Creating a test environment that mirrors the IT infrastructure will go a long way toward anticipating and fixing problems. But despite the most thought-out migration, the most experienced data professionals know that they can expect the unexpected. Where can you turn if your migration rollout results in a disaster?

Hardware Life Cycle Management (Part IV)

Migration disaster-recovery options
Even the best planning for any deployment can result in disaster for users and critical data. In order to be completely prepared, include data recovery planning within your deployment plan. Questions for your team to ask are:

•    How do we handle an unexpected event during the deployment process?
•    Do we have enough break-points within the automation to catch errors?
•    Can a backup be performed before the deployment?
•    How much time or how many resources would it take to recover from a migration disaster?
•    What alternatives do we have if there is a hardware failure during the migration?
•    Which data recovery vendors do we have relationships with that can get our data back in a timely way while maintaining quality?

Being prepared for the worst ensures the greatest success. Think seriously about the disaster recovery side of the project and build in data safety processes so that data loss is minimized.
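
One way to build the break-points and data safety checks discussed above into deployment automation is to make every step verify its preconditions and halt the rollout, rather than continue, when a check fails. The sketch below is a hypothetical illustration of that pattern; the step names and checks are invented for the example and would be replaced by your own.

```python
# Sketch: deployment automation with explicit break-points.
# A failed check stops the rollout before any later, riskier step runs.
# The step functions are hypothetical placeholders.

def verify_backup_exists():
    return True  # e.g. confirm last night's backup completed and is restorable

def migrate_user_profile():
    return True  # e.g. copy settings and data files to the new system

def apply_software_updates():
    return True  # e.g. install tested patches only

STEPS = [
    ("verify backup", verify_backup_exists),
    ("migrate user profile", migrate_user_profile),
    ("apply software updates", apply_software_updates),
]

def run_deployment():
    for name, step in STEPS:
        print(f"break-point: {name}")
        if not step():
            print(f"STOP: step '{name}' failed; halting rollout for review")
            return False
    print("deployment completed")
    return True

if __name__ == "__main__":
    run_deployment()
```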

In the event that a deployment causes widespread accidental data loss, or that key systems or workstations are affected, know when to stop and get professional data recovery assistance.

Many times data loss goes from serious to disastrous because inexperienced IT staff work to resolve the problem. Running software found on the Internet in a panic often makes the data loss more severe. When all internal options are exhausted, a professional data recovery firm is finally engaged. Not only has precious time been lost, the damage to the data has increased or the data has become unrecoverable.

Not all data recovery companies and offerings are the same. Companies that claim to specialize in data recovery, yet in reality use off-the-shelf recovery tools, are far more limited in their capabilities.

Avoiding storage system failures

There are many ways to reduce or eliminate the impact of storage system failures. You may not be able to prevent a disaster from happening, but you may be able to minimize the disruption of service to your clients.

There are many ways to add redundancy to primary storage systems. Some of the options can be quite costly and only large business organizations can afford the investment. These options include duplicate storage systems or identical servers, known as ‘mirror sites’. Additionally, elaborate backup processes or file-system ‘snapshots’ that always have a checkpoint to restore to, provide another level of data protection.
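
To make the idea of a snapshot with a checkpoint to restore to more concrete, the sketch below models a tiny copy-on-write block store: taking a snapshot is cheap, and writes made after the snapshot preserve the old block contents so the volume can be rolled back. This is a toy model for illustration, not how any particular storage product implements snapshots.

```python
# Toy copy-on-write snapshot of a block store (illustration only).
class SnapshotStore:
    def __init__(self, num_blocks, block=b"\x00" * 16):
        self.blocks = [block] * num_blocks
        self.snapshots = []  # each snapshot maps block index -> pre-write contents

    def snapshot(self):
        """Create a checkpoint and return its id."""
        self.snapshots.append({})
        return len(self.snapshots) - 1

    def write(self, index, data):
        # Copy-on-write: save the old contents in every snapshot that has
        # not yet recorded this block, then overwrite it.
        for snap in self.snapshots:
            snap.setdefault(index, self.blocks[index])
        self.blocks[index] = data

    def restore(self, snap_id):
        """Roll blocks back to the state captured by the given checkpoint."""
        for index, old in self.snapshots[snap_id].items():
            self.blocks[index] = old

# Example: corrupt a block after a checkpoint, then roll back.
store = SnapshotStore(num_blocks=4)
checkpoint = store.snapshot()
store.write(1, b"corrupted block!")
store.restore(checkpoint)  # block 1 is back to its original contents
```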

Experience has shown there are usually multiple or rolling failures that happen when an organization has a data disaster. Therefore, to rely on just one restoration protocol is shortsighted. A successful storage organization will have multiple layers of restoration pathways.

We have heard thousands of IT horror stories of initial storage failures turning into complete data calamities. In an effort to bring back a system, some choices can permanently corrupt the data. Here are several risk mitigation policies that storage administrators can adopt to help minimize data loss when a disaster happens:

Offline storage system: Avoid forcing an array or drive back on-line. There is usually a valid reason for a controller card to disable a drive or array; forcing an array back on-line may expose the volume to file system corruption.

Rebuilding a failed drive: When rebuilding a single failed drive, it is important to allow the controller card to finish the process. If a second drive fails or goes off-line during this process, stop and get professional data recovery services involved. During a rebuild, replacing a second failed drive will change the data on the other drives.
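
The reason a rebuild must be allowed to finish, and a second failure during it is so serious, comes down to how parity works. In a RAID 5 set each parity strip is the XOR of the data strips in its stripe, so one missing strip can be recomputed from the survivors; lose a second strip and the remaining information no longer determines either missing strip. A minimal sketch of the reconstruction arithmetic:

```python
# Sketch: RAID 5 style reconstruction of one missing strip via XOR parity.
def xor_strips(strips):
    out = bytearray(len(strips[0]))
    for strip in strips:
        for i, b in enumerate(strip):
            out[i] ^= b
    return bytes(out)

# One stripe: three data strips plus one parity strip.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_strips(data)  # parity = d0 ^ d1 ^ d2

# The drive holding d1 fails: rebuild it from the survivors and parity.
rebuilt = xor_strips([data[0], data[2], parity])
assert rebuilt == data[1]

# If a second strip were lost (or overwritten) before the rebuild finished,
# the XOR of what remains could no longer recover either missing strip.
```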

Storage system architecture: Plan the storage system’s configuration carefully. We have seen many cases in which multiple configurations were used on a single storage array – for example, three RAID 5 arrays (each holding six drives) striped in a RAID 0 configuration and then spanned. Keep the storage configuration simple and document each aspect of it.

During an outage: If the problem escalates up to the OEM technical support, always ask “Is the data integrity at risk?” or, “Will this damage my data in any way?” If the technician says that there may be a risk to the data, stop and get professional data recovery services involved.

Unique data protection schemes

Storage system manufacturers are pursuing unique ways of processing large amounts of data while still being able to provide redundancy in case of disaster. Some large SAN units incorporate intricate device block-level organization, essentially creating a low-level file system from the RAID perspective. Other SAN units have an internal block-level transaction log in place so that the control processor of the SAN is tracking all of the block-level writes to the individual disks. Using this transaction log, the SAN unit can recover from unexpected power failures or shutdowns.
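
The block-level transaction log described above behaves much like a write-ahead log: the intent to write a block is recorded before the block itself is changed, so after an unexpected power loss the controller can replay or discard incomplete writes. Below is a small, in-memory sketch of that idea, simplified far beyond any real SAN controller.

```python
# Toy write-ahead log for block writes (simplified illustration).
class JournaledBlocks:
    def __init__(self, num_blocks):
        self.blocks = [b""] * num_blocks
        self.journal = []  # entries: {"index", "data", "committed"}

    def write(self, index, data):
        # 1. Record the intended write in the journal first.
        entry = {"index": index, "data": data, "committed": False}
        self.journal.append(entry)
        # 2. Apply it to the block store.
        self.blocks[index] = data
        # 3. Mark the journal entry committed.
        entry["committed"] = True

    def recover(self):
        """After a crash, re-apply any journaled writes not marked committed."""
        for entry in self.journal:
            if not entry["committed"]:
                self.blocks[entry["index"]] = entry["data"]
                entry["committed"] = True

# If power is lost between steps 1 and 3, recover() replays the journal
# entry so the block store and the log end up consistent again.
store = JournaledBlocks(num_blocks=8)
store.write(3, b"payload")
store.recover()
```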

Some computer scientists specializing in the storage field are proposing adding more intelligence to the RAID array controller card so that it is ‘file system aware.’ This technology would provide more recoverability in case disaster strikes, the goal being that the storage array becomes more self-healing.

Another idea along these lines is a heterogeneous storage pool where multiple computers can access information without being dependent on a specific system’s file system. In organizations with multiple hardware and system platforms, such a transparent file system would provide access to data regardless of which system wrote it.

Other computer scientists are approaching the redundancy of the storage array quite differently. The RAID concept is in use on a vast number of systems, yet computer scientists and engineers are looking for new ways to provide better data protection in case of failure. The goals that drive this type of RAID development are data protection and redundancy without sacrificing performance.
