The Forum

Newsletter Committee:

Susan Trembly, Editor
Michael Cardinal, Comm Chair
Tess DePalma, Copy Editor
Merily Talalla, Designer


Contributing Editors:
Anil Balla
Wendy Barrington
Robyn McGregor
Madhu More
Laura Sellers
Silvia Siqueira
Cheryl Winters

Three Threats to Disaster Recovery and How to Defuse Them
By: Dwayne Melançon, Vice President of Corporate and Business Development, Tripwire, Inc.

One of the most significant events involving lost data came as a result of the terrorist attacks on Sept. 11, 2001. Of the 131 sites affected, only two performed a successful “failover.” Of the 129 that failed, 70% of data was recovered after 120 hours, but 30% was lost forever. This means $3.1 billion worth of technology did not work as expected.1

Why weren’t the affected organizations’ disaster recovery efforts more successful? First, the event was beyond the scope of most existing disaster recovery plans. Second, the complexity of the IT environments made testing and verification impractical if not impossible. Third, there was a lack of process automation—a reliance on manual intervention and no enterprise-wide best practices.

Without an audit trail, there is no automated record to account for changes, updates or fixes to data and systems. One change to a single piece of software can undo whole systems. If failover testing is performed just once a year, there is a danger that up to a year’s worth of data will be lost with the next outage. Without synchronization of the production and disaster recovery infrastructure, data integrity suffers and the potential for loss is unbounded. Here are the three factors most often to blame for an unsuccessful failover:

Unplanned/Undocumented Changes
Unplanned work may seem harmless, but its cumulative effect is staggering. Unplanned work often means untested or poorly understood work. Forrester Research estimates that as much as 35% of an IT organization’s “operate/maintain” workload is unplanned. That can easily drain resources and budgets, place the organization in constant firefighting mode, and prevent the timely completion of planned business objectives, such as ensuring a Business Continuity Plan (BCP) is in working order.

Additionally, undocumented changes introduce risk.  If you lose infrastructure without documentation of the changes that have been made to that infrastructure, how can you ever hope to rebuild it to the same specifications?

Too Much Access
As an IT organization grows, the number of those who have access privileges also grows. People who have been promoted to management or other duties may have “legacy access” to systems they should no longer tamper with. Immediately review access privileges, clearing everyone away from your IT infrastructure unless they are formally authorized to make changes. Why? One undocumented or untested change can undo a previous change or even a whole series of changes, causing immediate catastrophe or planting a long fuse for a future catastrophe, which can create a disaster event of its own.

Lack of Accountability
The expectation of what is required to ensure an adequate BCP must come from the top. Sound control processes must be made a priority across the enterprise. This accountability must extend through the organization, with functional and process owners bearing responsibility for the portions of infrastructure under their care.

Managing and Prioritizing Change is Essential
While many companies use change and configuration management tools to reduce risk and manage change, these tools can be easily circumvented—the tools only know what they know, and do not have a universal view of all change taking place on your entire IT system.

Adding a configuration audit and control solution provides a universal view that can continually monitor all systems to discover unplanned, undocumented or unauthorized changes, alerting IT staff to such instances, giving them the opportunity to reconcile or resolve issues. This helps keep systems in a known and trusted state.
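To make the idea concrete, here is a minimal Python sketch of the detective control described above. It is an illustration only, not the method of any particular product: it assumes a "known and trusted state" can be represented as a SHA-256 hash per file, and the names snapshot, detect_changes and approved are hypothetical.

```python
import hashlib
from pathlib import Path


def snapshot(root):
    """Map each file under root (by relative path) to the SHA-256 of its contents."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def detect_changes(baseline, current, approved=frozenset()):
    """Compare two snapshots and report changes that no approved change record covers.

    baseline and current are dicts of path -> hash. approved is a set of
    paths that an authorized change was expected to touch (an assumption
    about how change records are kept).
    """
    findings = {
        "added": sorted(current.keys() - baseline.keys()),
        "removed": sorted(baseline.keys() - current.keys()),
        "modified": sorted(
            p for p in baseline.keys() & current.keys() if baseline[p] != current[p]
        ),
    }
    # Anything left after removing approved paths needs to be reconciled or resolved.
    return {
        kind: [p for p in paths if p not in approved]
        for kind, paths in findings.items()
    }
```

In practice, a baseline would be captured with snapshot() after each authorized change, and detect_changes() would run on a schedule against a fresh snapshot. Any non-empty result is a candidate for an alert to IT staff.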

Configuration audit and control also provides the ability to enforce a zero tolerance policy for undocumented changes. Eliminating unplanned changes reduces time-consuming “firefighting” and frees up resources for more useful activities, such as ensuring your BCP is in good working order.

Limiting change to a specific window of time (planned vs. unplanned) is an effective way to police change activity. You can use this change window as a low-effort way to begin tracking the number and type of unauthorized changes, and to begin identifying and prioritizing the assets in your data center with the greatest risk of crashing and causing an outage, or of failing after a disaster event. These assets are the most at risk and, as a result, most in need of detective controls to ensure that the change management process is not circumvented.
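A change window check can be sketched in a few lines of Python. The example below is a hedged illustration: the window (Saturdays, 01:00 to 05:00), the event format (asset name plus timestamp) and the function names are all assumptions, and it treats any change outside the window as a candidate unauthorized change to be investigated.

```python
from collections import Counter
from datetime import datetime, time

# Hypothetical change window: Saturdays from 01:00 up to (not including) 05:00.
WINDOW_WEEKDAY = 5  # datetime.weekday(): Monday is 0, so Saturday is 5
WINDOW_START = time(1, 0)
WINDOW_END = time(5, 0)


def in_change_window(ts):
    """Return True if the timestamp falls inside the approved change window."""
    return ts.weekday() == WINDOW_WEEKDAY and WINDOW_START <= ts.time() < WINDOW_END


def rank_assets_by_out_of_window_changes(events):
    """Count changes made outside the window per asset, most affected first.

    events is an iterable of (asset, timestamp) pairs. Ties are broken
    alphabetically by asset name so the ordering is stable.
    """
    counts = Counter(asset for asset, ts in events if not in_change_window(ts))
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


events = [
    ("db01", datetime(2024, 1, 6, 2, 0)),    # Saturday 02:00, inside the window
    ("db01", datetime(2024, 1, 8, 14, 0)),   # Monday afternoon, outside
    ("web01", datetime(2024, 1, 6, 5, 0)),   # Saturday 05:00, window already closed
    ("db01", datetime(2024, 1, 9, 9, 30)),   # Tuesday morning, outside
]
ranking = rank_assets_by_out_of_window_changes(events)
```

The assets at the top of the ranking are a reasonable place to start when deciding where detective controls are needed first.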

Limiting Access and Maintaining Accountability
A valuable piece of information you can derive from restricting all change to a specific change window is finding out who implements what changes and when. This gives you a clear view of the level of access to your IT systems. Use this information to review access privileges and explain to personnel that restricting access is critical to the integrity and success of disaster recovery and business continuity.
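As a rough sketch of such an access review, the Python below groups observed changes by person and compares them against a list of formal authorizations. The data shapes (person and system name pairs, and a dictionary of authorized systems per person) and the function names are hypothetical, chosen only to illustrate the comparison.

```python
from collections import defaultdict


def changes_by_person(events):
    """Group observed changes into {person: sorted list of systems they changed}."""
    touched = defaultdict(set)
    for person, system in events:
        touched[person].add(system)
    return {person: sorted(systems) for person, systems in touched.items()}


def access_review(events, authorized):
    """Return {person: systems they changed without formal authorization}.

    authorized maps each person to the set of systems they are formally
    allowed to change. Anyone missing from the mapping is treated as
    having no authorization at all.
    """
    report = {}
    for person, systems in changes_by_person(events).items():
        allowed = authorized.get(person, set())
        unauthorized = [s for s in systems if s not in allowed]
        if unauthorized:
            report[person] = unauthorized
    return report


events = [("alice", "db01"), ("bob", "db01"), ("bob", "web01"), ("alice", "db01")]
authorized = {"alice": {"db01"}, "bob": {"web01"}}
review = access_review(events, authorized)
```

Here bob appears in the review because he changed db01 without being authorized for it, which is the kind of "legacy access" worth revoking or formally approving.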

As for accountability, IT is one area of business where everything, from day-to-day operations to urgent matters, can be managed entirely through facts and analysis. There’s no reason to “manage by gut feeling” or circumvent processes; in fact, circumventing can ruin a BCP. Adherence to change management policies and procedures is a strategy that benefits the business as a whole and ensures recovery after a disaster.

1  Source: Center for Research on the Epidemiology of Disasters; SunGard; U.S. FEMA



About the Author:


Dwayne Melançon, CISA, is VP of Corporate and Business Development for Tripwire, Inc. Mr. Melançon has worked with the IT Process Institute on its research of best practices as well as with numerous corporations around the world on IT service management improvement.

 



 
itSMF, 150 East Colorado Boulevard, Suite 215, Pasadena, California 91105 | Phone: 626/449-3300