By: Dwayne Melançon, Vice President of Corporate and Business Development, Tripwire, Inc.

One of the most significant events involving lost data came as a result of the terrorist attacks on Sept. 11, 2001. Of the 131 sites affected, only two performed a successful “failover.” Of the 129 that failed, 70% of data was recovered after 120 hours, but 30% was lost forever. This means $3.1 billion worth of technology did not work as expected.1

Why weren’t the affected organizations’ disaster recovery efforts more successful? First, the event was beyond the scope of most existing disaster recovery plans. Second, the complexity of the IT environments made testing and verification impractical, if not impossible. Third, there was a lack of process automation: a reliance on manual intervention and no enterprise-wide best practices.

Without an audit trail, there is no automated record to account for changes, updates or fixes to data and systems. One change to a single piece of software can undo whole systems. If failover testing is performed just once a year, there is the danger that up to a year’s worth of data will be lost with the next outage. Without synchronization of the production and disaster recovery infrastructure, data integrity suffers, and the potential for loss is unbounded.

Here are the three factors that are most often to blame for any unsuccessful failover:

Unplanned/Undocumented Changes

Additionally, undocumented changes introduce risk. If you lose infrastructure without documentation of the changes that have been made to that infrastructure, how can you ever hope to rebuild it to the same specifications?

Too Much Access

Lack of Accountability

Managing and Prioritizing Change is Essential

Adding a configuration audit and control solution provides a universal view that can continually monitor all systems to discover unplanned, undocumented or unauthorized changes. It alerts IT staff to such instances, giving them the opportunity to reconcile or resolve issues.
This helps keep systems in a known and trusted state. Configuration audit and control also provides the ability to enforce a zero-tolerance policy for undocumented changes. Eliminating unplanned changes also reduces time-consuming “firefighting” and frees up resources for more useful activities, such as ensuring your BCP is in good working order.

Limiting change to a specific window of time (planned vs. unplanned) is an effective way to police change activity. You can use this change window as a low-effort way to begin tracking the number and type of unauthorized changes, and to begin identifying and prioritizing the assets in your data center with the greatest risk of crashing and causing an outage, or of failing after a disaster event. These assets are the most at risk and, as a result, are most in need of detective controls to ensure that the change management process is not circumvented.

Limiting Access and Maintaining Accountability

As for accountability, IT is one area of business where everything, from day-to-day operations to urgent matters, can be managed entirely through facts and analysis. There’s no reason to “manage by gut feeling” or circumvent processes; in fact, circumventing them can ruin a BCP. Adherence to change management policies and procedures is a strategy that benefits the business as a whole and helps ensure recovery after a disaster.

1 Source: Center for Research on the Epidemiology of Disasters; SunGard; U.S. FEMA
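As a rough illustration of the detective control discussed in this article, the sketch below compares a stored baseline of file hashes against the current state of a directory and checks whether a change fell inside an approved change window. It is a minimal example under stated assumptions, not a description of any particular product: the function names (`snapshot`, `diff_snapshots`, `in_change_window`) are hypothetical, and it assumes that hashing file contents is an acceptable stand-in for "configuration state."

```python
import hashlib
from datetime import datetime, time
from pathlib import Path


def snapshot(root: Path) -> dict[str, str]:
    """Map each file under `root` to the SHA-256 digest of its contents."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_snapshots(baseline: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Report files added, removed, or modified relative to the baseline."""
    return {
        "added": sorted(set(current) - set(baseline)),
        "removed": sorted(set(baseline) - set(current)),
        "modified": sorted(
            name
            for name in baseline.keys() & current.keys()
            if baseline[name] != current[name]
        ),
    }


def in_change_window(when: datetime, start: time, end: time) -> bool:
    """True if `when` falls in the daily window [start, end).

    A window whose end is earlier than its start (e.g. 22:00 to 02:00)
    is treated as crossing midnight.
    """
    t = when.time()
    if start <= end:
        return start <= t < end
    return t >= start or t < end
```

In use, one would take a `snapshot` after each approved change, store it as the baseline, and run `diff_snapshots` on a schedule. Any non-empty result detected at a time for which `in_change_window` returns false is a candidate unplanned or unauthorized change for IT staff to reconcile.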