Business Continuity and the Cloud

Posted by Martin Peacock in Blog, Business Continuity, 1 October 2012

Combining the terms “cloud PACS” and “Business Continuity” (BC) evokes a spectrum between two extremes:

At one end of the spectrum an institution maintains its primary workflow ‘in the cloud’; at the other, the ‘cloud’ is simply used as a Disaster Recovery (DR) mechanism.  An institution may be positioned at any point on that spectrum.  Where possible it is wise to be deliberate about that positioning, but often other factors take precedence.  At the very least, awareness of the position should be factored into BC plans.

I’ve written before on the topic, but briefly, to be clear about the way I am using the term ‘cloud’: I don’t mean simply PACS hosted elsewhere (that is just co-location, and has been around since the eighties if not longer), but a service where the end user doesn’t know – and doesn’t care – where the services are coming from, at least within some constraints.  In the UK those constraints are manifested in the NHS Information Governance guidelines, culminating in IGSoc, which make it really difficult to move Personally Identifiable Data out of the confines of the UK.  A cloud solution must have provision for such restrictions as part of its techno-legal framework. But that said…

The two ends of the spectrum display quite different properties in relation to DR and BC.  Remember, DR is only one part – perhaps a small part – of BC, and understanding the contribution of DR solutions within the context of BC is really important.

Let’s take the scenario where cloud services are used as a component of a DR solution while the primary workflow is hosted on-site.  There are two strategies that can support this:

  • In the event of a disaster, the cloud solution takes over operation for all or part of the normal workflow.  Unless the solution comes as part of the primary, local PACS, this can be really difficult to manage from an integration perspective, particularly when it comes to fail-back to the local services.  An IHE integration profile to coordinate this would, IMO, be valuable.
  • In the event of a disaster that involves loss of data, the cloud solution can be used as a readily accessible offsite storage facility to quickly restore local facilities (a rough sketch of this approach follows below).
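To make the second of those strategies a little more concrete, here is a minimal sketch, assuming an S3-compatible object store accessed through boto3.  The bucket name, local archive path and one-file-per-object layout are placeholder assumptions for illustration, not details of any particular PACS:

    import os

    import boto3  # assumes an S3-compatible offsite object store

    BUCKET = "pacs-dr-offsite-copy"        # hypothetical bucket name
    LOCAL_ARCHIVE = "/data/pacs/archive"   # hypothetical local archive root

    s3 = boto3.client("s3")


    def mirror_to_cloud(local_root: str = LOCAL_ARCHIVE) -> None:
        """Push every file under the local archive up to the offsite copy."""
        for dirpath, _dirs, files in os.walk(local_root):
            for name in files:
                path = os.path.join(dirpath, name)
                key = os.path.relpath(path, local_root)
                s3.upload_file(path, BUCKET, key)


    def restore_from_cloud(local_root: str = LOCAL_ARCHIVE) -> None:
        """Pull the offsite copy back down once replacement hardware is in place."""
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=BUCKET):
            for obj in page.get("Contents", []):
                target = os.path.join(local_root, obj["Key"])
                os.makedirs(os.path.dirname(target), exist_ok=True)
                s3.download_file(BUCKET, obj["Key"], target)

The point is less the code than the shape of the workflow: writing to the offsite copy is easy, but as the next paragraphs argue, having that copy does not by itself shorten the critical path of a real recovery.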

While the second of these is much more easily achievable, it only addresses that subset of potential disasters that involve loss of data, and even then it may not dramatically smooth the restoration process:

  • It is perfectly possible for a disaster event to cause widespread disruption without a loss of data.  For example, a fire in the vicinity of a server room may well disrupt power to the room AND restrict physical access for purposes of restoration.  The Critical Path in the restoration plan may well be more about power facilities than data.
  • In the ‘traditional disaster scenario’ of a fire within the server room, the recovery plan may well not initially focus on data.  Let’s say a server room with 100 individual servers suffers a disaster event.  Sourcing 100 replacement units (not to mention networking kit) at a moment’s notice is – frankly – not likely*.  And the room itself is likely to be left in a state not fit for purpose, so additional facilities – with appropriate power, environmental controls and physical security – need to be sourced as well.  Having the data readily accessible is probably not of major concern, at least in the early stages of recovery.  Indeed, once a plan for restoration is put in black and white and replacement hardware is sourced, restoration from plain-old-backup-tape might well be an appropriate mechanism.

So for a number of reasons, neither of these two strategies is – by itself – ideal, and we are left with supplementary measures, or a single-vendor approach that I hope most people have learned to eye cautiously.

At the other end of the spectrum, the cloud solution is the primary service provider and the operator handles the Disaster Tolerance and High Availability issues.  This clearly covers a wide range of potential disaster scenarios, but it also introduces new ones:

  • Reliance on public infrastructure that still has weaknesses.  Whether over a direct line or the public internet, additional risks are incurred – from accidental digger damage to DNS poisoning and beyond.  Of course these can themselves be mitigated (e.g. redundant links from different providers – a small monitoring sketch follows this list), but that comes at a cost, and added complexity = added risk.
  • Data lock-in.  Even big companies can fail, and there has to be a way to get all of one’s data back out of the cloud should that risk materialise.  Including data migration as part of the Ts&Cs is important, but you should also consider some form of escrow to ensure that the operator can continue to operate for long enough to get the data out.
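As a rough illustration of the monitoring side of that first mitigation, the sketch below probes the cloud service over two nominally independent links and raises an alert when both fail.  The endpoint URLs are invented placeholders, and in practice each probe would need to be tied to a physically separate route:

    import socket
    import urllib.error
    import urllib.request

    # Hypothetical health endpoints, assumed to be reachable via different links.
    PROBES = [
        ("primary-link", "https://pacs.example-cloud.net/health"),
        ("backup-link", "https://pacs-alt.example-cloud.net/health"),
    ]


    def link_is_up(url: str, timeout: float = 5.0) -> bool:
        """Return True if the endpoint answers with HTTP 200 within the timeout."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status == 200
        except (urllib.error.URLError, socket.timeout):
            return False


    def check_links() -> None:
        down = [name for name, url in PROBES if not link_is_up(url)]
        if len(down) == len(PROBES):
            print("ALERT: no route to the cloud PACS - invoke the BC plan")
        elif down:
            print("WARNING: degraded connectivity on " + ", ".join(down))


    if __name__ == "__main__":
        check_links()

Something this simple is obviously not a DR plan in itself, but it is the kind of check that turns ‘reliance on public infrastructure’ from an abstract risk into something the department can actually see when it starts to bite.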

Both of these risks can of course be mitigated by maintaining a local copy – but, yet again, that comes at a cost.

The discussion above concerns just one subset of potential disaster scenarios – the subset that involves some form of calamity within the datacentre.  There are many other potential disaster risks that must be included within the DR plan.  Then, for true Business Continuity, one must consider those risks that do not constitute a catastrophe: for example, losing a third of your staff to flu for a week might not be a ‘disaster’, but it would have a significant and potentially long-lasting effect on the business of running a Radiology department.

Best Laid Plans and Black Swans

“No plan extends with any certainty beyond the first contact with the enemy”

Helmuth von Moltke the Elder

Any plan, however well designed, cannot be guaranteed to work perfectly.  That should be incorporated into the plan itself (memories of double-differentiation in calculus class strike here).

Then, of course, there are also those events that simply cannot be anticipated – the Black Swans as written about by Nassim Nicholas Taleb.  One of the arguments in the book is that the events that are most likely to have a catastrophic effect are precisely those which cannot be anticipated – because if they could, they could be mitigated against.

The best way to tackle these two issues is to also maintain Organisational Agility – by:

  • Promoting and exercising the development of skills that may not be used daily, but would be invaluable in the face of unexpected events.
  • Managing relationships with vendors so that help can be mobilised at a moment’s notice.  Depending on circumstances, targets defined in SLAs might need to be surpassed.
  • Keeping the hive mind aware of the knowledge and experience that has shaped the organisation and its services.  Having core systems developed in-house is a great way to do this for some domains, but isn’t to everybody’s taste.

Next up: Full Stack Redundancy

* Note: While this post was being written, Wired reported the possibility of drop-in datacentres arriving on the market.  Intended for scalability, they are also an interesting option for DR, although sourcing one at short notice may well remain a challenge. http://www.wired.com/cloudline/2012/09/data-centers-in-a-box
