Storage system manufacturers are pursuing unique ways of processing large amounts of data while still providing redundancy in case of disaster. Some large SAN units incorporate intricate block-level device organization, essentially creating a low-level file system from the RAID perspective. Other SAN units keep an internal block-level transaction log, so that the control processor of the SAN tracks every block-level write to the individual disks. Using this transaction log, the SAN unit can recover from unexpected power failures or shutdowns.
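The idea behind a block-level transaction log can be sketched in a few lines. This is only a toy model under stated assumptions: the class name JournaledDisk, its methods, and the crash flag are hypothetical, and real SAN firmware is proprietary and far more involved. The sketch shows the write-intent pattern: record the write in the log first, apply it to the disk second, and replay any unapplied entries after a power loss.

```python
class JournaledDisk:
    """Toy block device that logs every write before applying it (hypothetical model)."""

    def __init__(self):
        self.blocks = {}    # block number -> bytes; stands in for the physical disk
        self.journal = []   # ordered log entries: [block_no, data, applied_flag]

    def write(self, block_no, data, crash=False):
        entry = [block_no, data, False]
        self.journal.append(entry)        # 1. record the intended write in the log
        if crash:
            return                        # simulated power loss before the disk is updated
        self.blocks[block_no] = data      # 2. apply the write to the disk
        entry[2] = True                   # 3. mark the log entry as applied

    def recover(self):
        """Replay every logged write that never reached the disk; return how many."""
        replayed = 0
        for entry in self.journal:
            if not entry[2]:
                self.blocks[entry[0]] = entry[1]
                entry[2] = True
                replayed += 1
        return replayed


disk = JournaledDisk()
disk.write(0, b"alpha")
disk.write(1, b"beta", crash=True)   # the log has the write, the disk does not
disk.recover()                       # replays the missing write to block 1
```

After recover() runs, block 1 holds the data that was in flight when the simulated power loss occurred, which is the property the transaction log exists to provide.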
Some computer scientists specializing in storage systems propose adding more intelligence to the RAID array controller card so that it is ‘file system aware.’ This technology would improve recoverability after a disaster, with the goal of making the storage array more self-healing.
Other ideas along these lines include a heterogeneous storage pool that multiple computers can access without depending on a specific system’s file system. In organizations with multiple hardware and system platforms, a transparent file system would provide access to data regardless of which system wrote it.
Other computer scientists are approaching the redundancy of the storage array quite differently. The RAID concept is in use on a vast number of systems, yet computer scientists and engineers are looking for new ways to provide better data protection in case of failure. The goals that drive this type of RAID development are data protection and redundancy without sacrificing performance.
The amount of digital information produced in 2003, as reported by the University of California, Berkeley, is staggering. Your site or your client’s site may not hold terabytes or petabytes of information, yet during a data disaster, every file is critically important.
Avoiding Storage System Failures
There are many ways to reduce or eliminate the impact of storage system failures. You may not be able to prevent a disaster from happening, but you may be able to minimize the disruption of service to your clients.
There are many ways to add redundancy to primary storage systems. Some of these options are quite costly, and only large organizations can afford the investment. They include duplicate storage systems or identical servers, known as ‘mirror sites’. Additionally, elaborate backup processes or file-system ‘snapshots’ that always provide a checkpoint to restore to add another level of data protection.
Experience has shown that a data disaster usually involves multiple or rolling failures. Relying on just one restoration protocol is therefore shortsighted. A successful storage organization will have multiple layers of restoration pathways.
Here are several risk mitigation policies that storage administrators can adopt that will help minimize data loss when a disaster happens:
Offline storage system: Avoid forcing an array or drive back on-line. There is usually a valid reason for a controller card to disable a drive or array, and forcing an array back on-line may expose the volume to file system corruption.
Rebuilding a failed drive: When rebuilding a single failed drive, it is important to allow the controller card to finish the process. If a second drive fails or goes off-line during this process, stop and get professional data recovery services involved. During a rebuild, replacing a second failed drive will change the data on the other drives.
Storage system architecture: Plan the storage system’s configuration carefully. We have seen many cases with multiple configurations used on a single storage array. For example, three RAID 5 arrays (each holding six drives) are striped in a RAID 0 configuration and then spanned. Keep the storage configuration simple and document each aspect of it.
During an outage: If the problem escalates to the OEM technical support, always ask “Is the data integrity at risk?” or “Will this damage my data in any way?” If the technician says that there may be a risk to the data, stop and get professional data recovery services involved.
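The advice on rebuilding a failed drive is easier to follow with the underlying arithmetic in view. The sketch below assumes a single-parity layout such as RAID 5, where the parity block of each stripe is the XOR of its data blocks. The function names xor_blocks and rebuild_missing are hypothetical, and real controllers also rotate parity across drives and work on much larger blocks. It shows that one missing member of a stripe can be recomputed from the survivors, while two missing members cannot.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(stripe):
    """Rebuild the one missing member (None) of a stripe from the surviving members."""
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) != 1:
        # With single parity, two or more lost members leave too little information.
        raise ValueError(f"cannot rebuild: {len(missing)} members missing")
    survivors = [block for block in stripe if block is not None]
    rebuilt = list(stripe)
    rebuilt[missing[0]] = xor_blocks(survivors)
    return rebuilt


# One stripe across four drives: three data blocks plus one parity block.
data = [b"\x01\x02", b"\x10\x20", b"\xaa\xbb"]
parity = xor_blocks(data)

# One drive lost: its block is recomputed from the other three.
degraded = [data[0], None, data[2], parity]
restored = rebuild_missing(degraded)

# Two drives lost: reconstruction is impossible and the function refuses.
try:
    rebuild_missing([data[0], None, None, parity])
except ValueError as err:
    print(err)
```

Because every rebuilt block is computed from all surviving members, anything that alters or replaces a second member mid-rebuild feeds wrong inputs into that calculation, which is why the process should be stopped rather than forced.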