Health IT is often compared unfavourably to IT in Financial Services, with, perhaps, a grass-is-always-greener perspective, but maybe HIT isn't as badly placed as many think? The challenges are greater, the investment lower, and still, the quality of solutions actually compares equitably. In fact, HIT is probably better placed than FS-IT to step up to the next level. Perhaps we shouldn't beat ourselves up *quite* so much.
I came across a post in MedCityNews this morning (well, maybe more of an advertorial than a post but interesting nonetheless). The story - in a nutshell - is about one particular vendor potentially using ‘Big Data’ to improve health outcomes. The use of ‘Big Data’ to do with actual people is a topic of much debate – the furore around security agency access notwithstanding, there is much to debate. But I’ll leave that to one side, for another day.
The line that really woke me up (to be fair, I was on the red-eye train into Dublin) was this:
Some equate it to finally catching up to where the banking and airline industries have been for years:
And that’s a position I’ve sympathised with (and even moaned about) in the past. How, moaned I, can I get cash out of a hole in the wall in a foreign country thousands of miles from home in a matter of seconds, but two hospitals a couple of miles apart have such trouble sharing X-Rays?
But I’ve changed my position over the last 12 months or so. While the 2012 IT failure at RBS didn’t end there for the bank, for those of us thankfully unaffected it does offer an interesting – if brief and transitory – insight into the back-office workings of those august institutions. Combined with some recent direct experience in the back-office workings of financial services (FS), I can offer a different perspective.
While the pre-automation forms of the two sectors are similar in terms of process depth and complexity, there are a number of differences between the two sectors, and between perception and reality:
- Firstly, and perhaps most importantly, FS systems are seen from the outside as being superbly architected artifacts of elegance. In fact, they are more like the angry-looking swan above, gracefully gliding over the surface while paddling like the clappers down below. Behind the super-efficient delivery of flawless process and documentation lies (still!) a small army of humans greasing the wheels. That isn't to say that is bad, but it is contrary to many people's expectations.
- It may well be true that cash transactions between relatively unconnected organisations can be fairly efficient in real-time, but in fact, many transactions do not occur in real-time, especially at a consumer level. Those who check bank statements might see debit card transactions from retail outlets batched over 2 or 3 days - especially at the weekend (of course - those statements can be botched as well as batched). Other batches might run only once per quarter or even once per year. In the meantime, data/hardware/software can be cleaned, cranked and patched into position.
- For many FS systems, daily operation is on a 12/5 basis. When the small army of humans goes home - the system is done for the day. Life would be so much easier if all hospital IT systems could be switched into 'maintenance' mode at 6PM ready for backups, updates and general TLC. Yes, there are the very-low-latency, intensive systems for Trading and ForEx, and the like - to the point where folk are talking about firing neutrinos through the middle of the earth to get millisecond advantages in network latency, but they are in the minority.
- Even those statements may well be batched - to a point in the day when an intense period of dispatch can be prepared for in advance. This is a luxury not afforded with anything like the same frequency in health IT (clinical, at least).
- The transactions that do cross organisational boundaries - whether in real time or not - are driven by legislative mandate and international agreement. Despite HIT having much more complete and overarching standards than FS, it is the lethargy of implementation that is holding it back.
- The money invested in FS IT far outstrips that in health. Of course, all of the world's money goes through the banks - not once but many times as it is transformed and re-invested, spent, or just plain lost. The world's economy in 2011 was 70 trillion - so there is plenty of incentive to invest heavily in systems and processes.
The healthcare industry spends only 2% of gross revenues on HIT, which is low compared to other information intensive industries such as finance, which spend upwards of 10% ( http://en.wikipedia.org/wiki/Meaningful_use#United_States )
- It doesn't help that culturally, health IT investments tend towards the monolithic, one-size-fits-all solutions while banks and FS are much more prepared to engage in bespoke solutions - including self-help in many circumstances. For me, that part is a mistake from those involved in HIT - I believe in having the capability to adopt whichever approach offers the most benefit. But I know there are other opinions, and there are no answers that are right *all* of the time.
- There is a constant 'debate' in Health IT about the quality of end-user applications, their user-experience (UX), sub-second response times, consistency of UI etc., that slows down system implementation and sometimes downright stops it. FS-IT doesn't seem to have that problem, even though some of the UX in end-user apps is pretty poor indeed - and the consumer-visible software is only a small part of it.
It isn't really fair to make comparisons when HIT is at so many disadvantages. But still, IHE offers a great way forward. It has been slow, and there have been (still are and will continue to be) disagreements, but it is quite possible to conceive of a time when a significant proportion of the clinical record is fully interoperable.
There are still plenty of hills to climb and the gradient is somewhat steeper for H-IT than for FS-IT. But what is there for HIT - as we speak - in the form of the standards being implemented internally within organisations as well as those standards that will allow for wider collaboration - offers a platform for expansion that many other industries can only envy. It has yet to come to fruition, and work has yet to be done, but it's not a bad place to be. It's good to learn from the mistakes (or otherwise) in other sectors, but remember apples ain't oranges - GM notwithstanding.
After the Instagram Ts&Cs palaver, an interesting piece on cloud service terms, conditions and acceptable use.
Of course almost all Health IT – related cloud services are based on infrastructure provided by one of the underlying vendors – Amazon, Rackspace, Azure, et al.
It is possible “…simply because a customer, a politician, or even a competitor claims there are issues with your — or your customers’ — activities” for the underlying vendor to shut down the service being provided – in the article, based on allegations of DMCA transgression.
But Health IT – particularly clinical services – may have a comparable Achilles heel – the certifications such as FDA 510(k) and EU CE Mark. It isn’t impossible for a given software product to be classified differently by different companies. For example, I know of two vendors selling effectively the same product – one of which has it classified as CE Class I, the other, Class IIa. The difference in effort between the two is not insignificant and depends entirely on legal opinion. It is possible (I don’t think it has happened) that the latter vendor has used the difference in legal opinion to pressure potential customers to avoid the possible risk should litigation come knocking, but then, it is up to individual hospitals to consider that as a possible hazard.
But for competing services in the cloud – a small word in the right ear that the cloud provider (at the level of Amazon, Microsoft or Rackspace) is hosting services that are incorrectly classified and may be open to legal action – would certainly generate a phone call to General Counsel. And action may well follow.
Simon Phipps in the article linked above suggests 3 mitigations of varying degrees of ‘implementability’. The third is part of what I would call Full Stack Redundancy – avoid single points of failure all through the hardware/software stack (at least where possible) – including cloud vendor. It’s a tough ask, but hey:
Shoot for the moon. Even if you miss, you’ll land among the stars
I note a press release from the Information Commissioner’s Office publishing a code of best practice when it comes to data anonymisation, as well as the launch of a new body (UKAN) to promote said best practice.
While this is largely focussed on aggregate (e.g. statistical / epidemiology) practices, anonymisation (and more specifically, pseudonymisation) is an important topic in the conversation around the two big buzzwords of the day – ‘mobile’ and ‘cloud’
Mobile applications – particularly when combined with BYOD – have security issues. That much we all know, and there are solutions to those issues that are, by now, well understood, including encryption, virtualisation, compartmentalisation and zero-footprint-client. But these come with compromises and issues of their own. It would of course be simpler in terms of deployment and performance if native apps could be used without concerns over information security. This is where pseudonymisation has a benefit – on the way to the device, all identifiable data is replaced with ‘alternative identities’, with the relationship between the original and alternative only known to a trusted system (aka ‘the server’). On the way back, the original identity(ies) is restored and no privacy issues accrue.
But this may be more important ‘in the cloud’. Access to patient data through the NHS IGSoC process demands careful planning and implementation of best-practice security measures (rightly so), and it is virtually (if not downright) impossible to ship Personally Identifiable Data (PID) to locations outside the UK. For the cloud, that is a problem. Very often, restricting the location of data to within the UK will mean many of the benefits of cloud are not realisable. In some cases, it means that the choice of infrastructure supplier is limited and, in some situations, potentially valuable services are made entirely irrelevant.
A cloudy service launched in the US, for example, will often choose an infrastructure provider based in the US. That isn’t much good for organisations in the UK who cannot store PID outside the UK. Pseudonymisation offers a way forward for those situations.
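The round trip described above (aliases on the way out, original identities restored on the way back) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production design: the class name `Pseudonymiser`, the field names and the sample record are all invented for the example, and a real deployment would need persistent, access-controlled storage for the mapping.

```python
import secrets


class Pseudonymiser:
    """Hypothetical trusted-server component: swaps real identifiers for random aliases."""

    def __init__(self, identifying_fields):
        self.identifying_fields = set(identifying_fields)
        # Both mappings live only on the trusted system ('the server').
        self._real_to_alias = {}
        self._alias_to_real = {}

    def _alias_for(self, field, value):
        key = (field, value)
        if key not in self._real_to_alias:
            alias = f"{field}-{secrets.token_hex(8)}"  # random, carries no meaning
            self._real_to_alias[key] = alias
            self._alias_to_real[alias] = value
        return self._real_to_alias[key]

    def outbound(self, record):
        """Replace identifying values before the record leaves the trusted zone."""
        return {
            k: self._alias_for(k, v) if k in self.identifying_fields else v
            for k, v in record.items()
        }

    def inbound(self, record):
        """Restore the original identities on a record coming back."""
        return {
            k: self._alias_to_real.get(v, v) if k in self.identifying_fields else v
            for k, v in record.items()
        }


# Invented sample data: not a real patient or identifier.
gateway = Pseudonymiser(["patient_name", "patient_id"])
original = {"patient_name": "Jane Doe", "patient_id": "000-000-0000", "modality": "CR"}
masked = gateway.outbound(original)   # safe to send to the device or offshore service
restored = gateway.inbound(masked)    # identities restored on the way back
```

Only `masked` ever leaves the trusted system; the device or cloud service sees a stable alias for each patient but cannot reverse it without the server-side mapping.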
Redundancy in hardware terms is good, and is a major reason why servers tend to look nothing like desktops. There are a handful of components within a server that have a tendency to fail:
- Power supplies
- Disk Drives
- Network Cards
.. and there are long-established mechanisms for including redundancy in each of those within a single server, for example hot-swappable power supplies and multiple pathways to multiple disks through multiple disk controllers. A good redundancy strategy even includes mixing different generations of redundant components to avoid coincidental high-risk points on the bathtub curve.
Less often (indeed rarely) one might see a motherboard fail – one of the reasons why redundancy is extended to the next level and each server can be mirrored in a warm standby or clustered arrangement. In recent years – particularly with the popularisation of virtualisation technologies, redundancy can extend across different datacentres – even across different continents.
But an information service consists of more than just the hardware it runs on. The hardware itself is useless without an operating system, an important part of the series of layers (The Stack) on which any service depends. The stack includes – depending on the system in question:
- A database layer.
- A web server layer – including an active language such as PHP, Java, .Net etc.
- Application Server layer – which may include the web service layer but extends beyond to other runnable services.
And continuing up the stack – there are elements of software implementation that could (in many cases) be considered discrete units in their own right. In the case of a Radiology based service this might include:
- DICOM/IHE/HL7 implementation.
- Speech recognition implementation
Let’s consider for a moment the level of the stack at which the Operating System runs. Operating Systems are selected for many reasons – reliability being one. It is widely accepted (historically at least) that Linux as a server OS is a ‘good’ choice from this perspective. Early in July however, a bug in Linux related to leap seconds disrupted a number of major web services. Even Linux has bugs that cause problems.
Can an end-user (even a sys admin) ever be sure that any level of the Stack does not hide a systemic issue that can cause problems? If, however, a service is engineered such that a Linux server is in concert with a Windows server, then such systemic risks can be mitigated.
It is equally possible (in fact probably easier) to engineer a service to eliminate a dependence on a particular database. It may mean compromising the ability to use some database-specific features, but it can be done, and most services don’t really need access to those features.
Let’s look at a different – and Radiology specific layer – DICOM services. While I do not subscribe to many of the criticisms of DICOM, it cannot be denied that implementation can be a tortuous process, with regular edge cases (sometimes vendor-specific) that must be incorporated. Sometimes, things break. When that happens – whoever is handed the task of troubleshooting has a number of options. One is to try an alternative DICOM implementation. This is one of the great benefits of the strong Open Source Software we have available, and should be kept to-hand as part of the process of Organisational Agility.
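One common way to remove the dependence is to let the service talk only to a narrow interface and keep each database behind its own adapter. The sketch below assumes a hypothetical `StudyIndex` interface with two interchangeable backends (SQLite from the Python standard library, and a plain dictionary); the names and the table layout are illustrative, not taken from any particular product.

```python
import sqlite3
from abc import ABC, abstractmethod
from typing import Optional


class StudyIndex(ABC):
    """The only database surface the service is allowed to see (hypothetical)."""

    @abstractmethod
    def add(self, study_uid: str, description: str) -> None: ...

    @abstractmethod
    def get(self, study_uid: str) -> Optional[str]: ...


class SqliteStudyIndex(StudyIndex):
    def __init__(self, path: str = ":memory:"):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS study (uid TEXT PRIMARY KEY, description TEXT)"
        )

    def add(self, study_uid, description):
        # Plain, portable SQL only: no database-specific features.
        with self._conn:  # commits on success, rolls back on error
            self._conn.execute("DELETE FROM study WHERE uid = ?", (study_uid,))
            self._conn.execute(
                "INSERT INTO study (uid, description) VALUES (?, ?)",
                (study_uid, description),
            )

    def get(self, study_uid):
        row = self._conn.execute(
            "SELECT description FROM study WHERE uid = ?", (study_uid,)
        ).fetchone()
        return row[0] if row else None


class InMemoryStudyIndex(StudyIndex):
    def __init__(self):
        self._rows = {}

    def add(self, study_uid, description):
        self._rows[study_uid] = description

    def get(self, study_uid):
        return self._rows.get(study_uid)


def register_study(index: StudyIndex, study_uid: str, description: str) -> Optional[str]:
    """Service code: works unchanged against any StudyIndex backend."""
    index.add(study_uid, description)
    return index.get(study_uid)
```

The cost is exactly the compromise mentioned above: the interface can only expose what every backend supports.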
Just as with Operating System and Databases, a service (in this case, PACS) can be engineered to avoid dependence on a given DICOM implementation. If considered early enough in the development process, it can be done within a product, but it can also be implemented with a multi-vendor construction.
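As a rough sketch of what avoiding dependence on one implementation can look like in code, the fragment below tries a list of interchangeable handlers in order and falls back when one fails. The two parsers are deliberately trivial stand-ins written for this example, not real DICOM libraries; the only DICOM detail relied on is that a Part 10 file carries a 128-byte preamble followed by the characters `DICM`.

```python
def first_successful(implementations, payload):
    """Try each (name, handler) in order; return (name, result) from the first that works."""
    errors = {}
    for name, handler in implementations:
        try:
            return name, handler(payload)
        except Exception as exc:  # broad on purpose: any failure triggers the fallback
            errors[name] = exc
    raise RuntimeError(f"all implementations failed: {errors}")


def strict_parser(payload: bytes) -> dict:
    """Stand-in for a strict implementation: insists on the Part 10 preamble."""
    if payload[128:132] != b"DICM":
        raise ValueError("missing DICM prefix after 128-byte preamble")
    return {"preamble": True, "length": len(payload)}


def lenient_parser(payload: bytes) -> dict:
    """Stand-in for a more forgiving implementation that accepts a bare data set."""
    return {"preamble": payload[128:132] == b"DICM", "length": len(payload)}


implementations = [("strict", strict_parser), ("lenient", lenient_parser)]

well_formed = bytes(128) + b"DICM" + bytes(4)
legacy = b"\x08\x00\x05\x00"  # no preamble: one of those vendor-specific edge cases

used_for_well_formed, _ = first_successful(implementations, well_formed)
used_for_legacy, _ = first_successful(implementations, legacy)
```

Recording which implementation handled each object (the `name` returned here) is worth doing in practice, since a steady drift towards the fallback is itself a signal that something upstream has changed.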
But doesn’t that increase the cost?
Yes, it does. And these are tough times, where the Return on Investment may not make a compelling enough case to do this. A long time ago, I worked for a large multinational (sadly, on hard times nowadays) who had a project on which two entirely separate development teams were working, each providing redundancy for the other. The cost of the project was, consequently, almost doubled. But the product in question was SO IMPORTANT that failure of a single team to deliver – or even not quite meeting expectations – was not an option. This is not new – it is after all the principle behind Double-Entry Bookkeeping
Does that apply to all cases? Certainly not. It depends on how critical a service is, and what the options are to provide Stack Redundancy. In fact, for PACS (at least, ‘general’ PACS services), Stack Redundancy can be deployed quite cheaply. If your primary PACS happens to run on Windows servers – consider a companion running DCM4CHEE on Linux. If your primary PACS runs on Linux, why not try ClearCanvas (or ‘CHEE) on Windows? There are workflow issues with synchronising data updates but even ignoring these gives a huge increase in Disaster Tolerance capability.
For other services, it may not be possible at all, especially when considering reliance on vendors, but it’s always something worth looking out for.
Combining the terms “cloud PACS” and “Business Continuity” (BC) evokes a spectrum between two extremes:
At one end of the spectrum is where an institution maintains its primary workflow ‘in the cloud’. The other end is where the ‘cloud’ is simply used as a Disaster Recovery (DR) mechanism. An institution may be positioned at any point on that spectrum. Where possible, it is wise to be deliberate about that positioning but often other factors take precedence. At the very least, awareness of the position should be factored into BC plans
I’ve written before on the topic but briefly – to be clear about the way in which I am using the term ‘cloud’ – I mean not simply that it is PACS hosted elsewhere (that is just co-location and has been around since the eighties if not longer) – but that the end user doesn’t know – and doesn’t care where the services are coming from – at least within some constraints. Those constraints are manifested in the UK by the NHS Information Governance guidelines culminating in IGSoC which make it really difficult to move Personally Identifiable Data out of the confines of the UK. A cloud solution must have provision for such restrictions to be part of its techno-legal framework. But that said…
The two ends of the spectrum display quite different properties in relation to DR and BC. Remember, DR is only one part – perhaps a small part – of BC, and understanding the contribution of DR solutions within the context of BC is really important.
Let’s take the scenario where cloud services are used as a component of a DR solution where the primary workflow is hosted on-site. There are two strategies that can support this:
- In the event of a disaster the cloud solution takes over operation for all or part of the normal workflow. Unless the solution comes as part of the primary, local PACS this can be really difficult to manage from an integration perspective particularly when it comes to fail-back to the local services. An IHE integration profile to coordinate this would, IMO, be valuable.
- In the event of a disaster that involves loss of data, the cloud solution can be used as a readily accessible offsite storage facility to quickly restore local facilities.
While the second of these is much more easily achievable, it only addresses that subset of potential disasters that involve loss of data, and even then, may not radically lubricate the restoration process:
- It is perfectly possible for a disaster event to cause widespread disruption without a loss of data. For example, a fire in the vicinity of a server room may well disrupt power to the room AND restrict physical access for purposes of restoration. The Critical Path in the restoration plan may well be more about power facilities than data.
- In the ‘traditional disaster scenario’ of a fire within the server room, the recovery plan may well not initially focus on data. Let’s say a server room with 100 individual servers suffers a disaster event. For one to source 100 replacement units (not to mention networking kit) at a moment’s notice is – frankly – not likely*. And then the room itself is likely to be in a state not fit for purpose, so additional facilities – with appropriate power, environment controls and physical security – need to be sourced also. Having the data readily accessible is probably not of major concern, at least in early stages of recovery. Indeed, once a plan for restoration is put in black-and-white, once replacement hardware is sourced, restoration from plain-old-backup-tape might well be an appropriate mechanism.
So for a number of reasons, neither of these two strategies – by itself – is ideal and we are left with supplementary measures, or a single-vendor approach that I hope most people have learned to eye cautiously.
Looking at the other end of the spectrum – where the cloud solution is used as the primary service provider, and the operator handles the Disaster Tolerance and High Availability issues. This clearly covers a wide range of potential disaster scenarios but also introduces new ones:
- Reliance on public infrastructure that still has weaknesses. Whether a direct line or public internet – additional risks are incurred – from accidental digger damage to DNS poisoning and beyond. Of course these can be mitigated themselves (e.g. redundant links from different providers) but that comes at a cost and added complexity = added risk.
- Data lock-in. Even big companies can fail, and there has to be a way to get all of one’s data back out of the cloud should this risk show up. Including data migration as part of the Ts&Cs is important – but you should also consider having a form of escrow to ensure that the operator can continue to operate for long enough to get the data out.
These can of course be mitigated by maintaining a local copy – but, yet again, that comes at a cost.
The discussion above concerns just one subset of potential disaster scenarios – the subset that involves some form of calamity within the datacentre. There are many other potential disaster risks that must be included within the DR plan. Then for true Business Continuity one must consider those risks that do not constitute a catastrophe – for example losing a third of your staff to flu for a week might not be a ‘disaster’ but would have a significant and potentially long-lasting effect on the business of running a Radiology department.
Best Laid Plans and Black Swans
“No plan extends with any certainty beyond the first contact with the enemy”
Helmuth von Moltke the Elder
Any plan, however well designed, cannot be guaranteed to work perfectly. That should be incorporated into the plan itself (memories of double-differentiation in calculus class strike here).
Then, of course, there are also those events that simply cannot be anticipated – the Black Swans as written about by Nassim Nicholas Taleb. One of the arguments in the book is that the events that are most likely to have a catastrophic effect are precisely those which cannot be anticipated – because if they could, they could be mitigated against.
The best way to tackle these two issues is to also maintain Organisational Agility – by:
- Promoting and exercising the development of skills that may not be used daily, but would be invaluable in the face of unexpected events.
- Managing relationships with vendors so that help can be mobilised at a moment’s notice. Depending on circumstances, targets defined in SLAs might need to be surpassed.
- Keeping the hive mind aware of the knowledge and experience that has shaped the organisation and its services. Having core systems developed in-house is a great way to do this for some domains, but isn’t to everybody’s taste.
Next up: Full Stack Redundancy
* Note: While in the process of writing this Wired reported the possibility of drop-in datacentres arriving on the market. Intended for scalability, it is also an interesting option for DR although still, sourcing one at short notice may well remain a challenge. http://www.wired.com/cloudline/2012/09/data-centers-in-a-box
Over at UKRC a couple of weeks ago I spoke to a number of folk about the technology behind a ‘Business Continuity’ solution. The technology involved was indeed a fine solution for Disaster Recovery – but that is not the same as Business Continuity. Those conversations have prompted me to begin a series of notes on Business Continuity beginning with this one.
So what is the difference between DR and BC? I’ll try to put it into the context of the IT services in a mid-large sized hospital. A typical mid-sized hospital might have IT services delivered through a Data Centre (DC) in a secure location on campus. A great deal of money is spent minimising the risks associated with this DC – with carefully planned air conditioning, UPS, backup generators, physical access control and the like. Let’s say for the sake of argument this DC contains 100 servers, along with assorted networking equipment. The backup strategy employed is an aggressive one – full backups are taken every 12 hours and taken off-site, with a monthly and quarterly recovery testing plan. Even that level of organisation is a major challenge.
The original DC is unusable (buried in debris), so some amount of ground work is needed to move power supplies to an alternate location (let’s say there is one handily available). Ordering, shipping and commissioning all of the servers and networking equipment could take 2-3 weeks (optimistically). Mobilising software vendors to install and make ready for use the software might be another week or two. As a Disaster Recovery Plan, this is clearly unacceptable – so we’ll assume a prioritisation exercise is performed on a quarterly basis to ensure main systems are back online as soon as possible. With a good plan, good backups, and skilled workers, the first 10 services might be back online within 5 days, with the stragglers taking up to 3-4 weeks longer.
On one hand, this is a slightly pessimistic example. Data Centres can be quite resilient to such risks. And the evolution of SAN and Virtualisation options complements the traditional backup mechanisms well. But not all data centre facilities are created equal, the risk remains and the Black Swan event always looms. More about the Black Swan in a later note.
So the disaster in this case has been recovered from, but there is nothing mentioned so far about Continuity of Service. How did the core business processes (in our case, patient care) continue while this DR was ongoing? Many things might have happened – alternate ‘cloud’ solutions could be deployed. Many processes may well have reverted to manual pen-and-paper for the duration. A local ‘bank’ staff agency may well have deployed a number of additional staff to cope with the confusion. Having these options available for whatever ‘disaster’ befalls a hospital – not just in relation to IT – is where Business Continuity Planning is important, and different to DR. BC is as much about what *else* happens while the process of DR is ongoing.
Each of these coping mechanisms requires something to happen in good time. The cloud solution may require an existing account with provision for sudden upscaling. Reversion to manual processes may require maintenance of a stock of otherwise redundant stationery, and an existing relationship with the bank agency would certainly be a helpful lubricant. These can all be planned: communications plans drawn up to allow speedy access to the right people, a small room set aside to store old-style pre-OrderComms requisition slips, and a room allocated to be the coordination base for all of these coping mechanisms.
A BC Plan requires decision makers (at an appropriate level depending on the circumstances) being mobilised, and being empowered with skills, resources and tools to keep the core business processes operational. Information Technology is not a core business process in Healthcare – it has become one of the most important contributing services but still tangential. Business Continuity is about the Business, not the contributors.
While there is an established framework for maximising the effectiveness of Business Continuity Planning the first step is to appreciate the difference between DR and BC.
Next up: Disaster Tolerance, the Technology Stack, and the Cloud.
A great day was had on the 27th last at the British Institute of Radiology for the Spring 2012 meeting of the UK Imaging Informatics Group. A fantastic collection of speakers discussing a variety of important topics.
I was delighted to be asked to present on the topic of “Mobile Technologies in Legacy Workflows”.
Annotated PowerPoint below.
Some of the more progressive PACS sites have already started implementing SSD technology as part of their performance management initiatives. In general this is (and certainly should remain for the foreseeable future) one part of a multi-tier architecture that has SSD as the very-short-term, ultra-quick cache with less expensive technologies providing cheaper, deeper, possibly replicated and even (a subject for another post) possibly deduplicated storage.
However, image data is very often (not always, though*) identifiable and should be treated with appropriate confidentiality. When it comes to decommissioning a bundle of hardware, this is often achieved by forensically wiping clean each disk (taking a screwdriver to the interface pins comes a poor second).
As noted on the Sophos Naked Security blog, wiping SSDs is considerably more complicated than the same process for their cousins, spinning disks, and to maintain security, encryption should be engaged from day 1. Making the encryption keys unrecoverable is much simpler than doing the same for the whole disk.
* Of course, some PACS implementations do not store identifiable metadata with the pixel data itself – only within the accompanying database – for some modalities. That’s obviously part of another discussion on what precisely ‘vendor neutral’ is and how data migration can be facilitated in an equitable world.
I’ve written before on the difference between Business Continuity (BC) and High Availability (HA). Having delivered a presentation in London recently on Project Failure, and while writing another along similar lines for UKRC in June, I’d like to cover what I believe to be the number 1 strategy for successful BC, which is covered in both of those.
A central tenet to BC is risk management. Within risk management one identifies risks, quantifies them, prioritises them and mitigates them. This is important, and for any operation of any scale it should probably be handled within a well structured methodology – whether PMI, PRINCE2, NIST or any of the other frameworks. But there is more..
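The identify, quantify, prioritise and mitigate cycle can be made concrete with a very small risk register. The sketch below is illustrative only: the 1 to 5 scales, the likelihood × impact scoring and every number in the sample data are assumptions made for the example rather than values from any of the frameworks named above.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    likelihood: int       # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int           # assumed scale: 1 (minor) to 5 (catastrophic)
    mitigation: str = ""  # empty string means nothing is in place yet

    @property
    def score(self) -> int:
        # Simple likelihood x impact scoring, one common convention among several.
        return self.likelihood * self.impact


def prioritise(risks):
    """Highest score first."""
    return sorted(risks, key=lambda r: r.score, reverse=True)


def unmitigated(risks, threshold):
    """Risks at or above the threshold that still have no mitigation recorded."""
    return [r for r in prioritise(risks) if r.score >= threshold and not r.mitigation]


# Invented figures, purely to show the mechanics.
register = [
    Risk("Fire in the server room", likelihood=1, impact=5, mitigation="Off-site backups"),
    Risk("A third of staff off with flu", likelihood=3, impact=3),
    Risk("Network link cut by a digger", likelihood=2, impact=4,
         mitigation="Redundant link from a second provider"),
]

ranked_names = [r.name for r in prioritise(register)]
needs_attention = [r.name for r in unmitigated(register, threshold=8)]
```

A register like this only ever contains the risks somebody thought to write down.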
A number of years ago while I was developing a BC strategy for a large hospital, I had an insight. I happened to be reading “The Black Swan” by Nassim Nicholas Taleb. It’s a great book and well worth a read but one of the basic points is that if one were to look back over pretty much any period in history – the events that have shaped today’s world were simply not predictable. It was inconceivable that the events leading to the First World War would have the consequences they did. Ditto with the rise of Hitler and WWII, or 9/11 and subsequent military action in Afghanistan and Iraq. The individual inputs to today’s economic crises were well understood, but who predicted where we have ended up today? The notion that one can take a finite number of inputs to a chaotic system and confidently predict the outputs is flawed.
And it is the events that cannot be predicted that will always have the greatest effect – precisely because they are not predictable, and therefore cannot be mitigated against (at least, specifically).
In my mind, there is only one way to mitigate against the unknown risk. That is, to have skills and agility at your disposal to be able to react to the unexpected when – not if – it happens. That brings me to what I think of as a spectrum of programme delivery strategies.
At any point on the spectrum – there is a combination of internal resources as well as external (usually vendor) resources. I have my own preferences (having a technical background, I tend to back my own skills), but in many respects any programme (or initiative within a programme) can find a perfectly appropriate place anywhere on that spectrum. The histograms in the graphic indicate the level of influence that different groups have over the programme (albeit somewhat caricatured), but equally indicate where the skills required within the programme reside.
I’m old and hoary enough to remember the world of IT in the early ’90s. Back in those days – ‘downsizing’ and ‘outsourcing’ were all the rage. There was no need to have an IT department AT ALL, went the argument. A reputable, specialised service with properly written contracts and SLAs was all that was needed. But very quickly, a situation evolved in many organisations where the internal skillset had atrophied to the point that the organisation needed external help to specify and manage the contracts and SLAs. Enter the ‘consultants’, exit control and stability. Fortunately, ‘downsizing’ became ‘rightsizing’ where there was a recognition that it is of strategic benefit to retain some skills inside the organisation.
I would argue that the strategic benefit should extend to maintaining the ability to react to the unexpected in an agile – yet controlled – manner. To some extent, that means holding, and nurturing, skills, processes, tools, industry awareness and industry contacts within the direct control of the organisation. All of those – especially the latter two – require an investment in time that must be balanced carefully but are important nonetheless.
So at the extreme left of the spectrum lies ‘the cloud’. For the moment, I won’t quibble about the difference between ‘private cloud’ and ‘public cloud’. There is a huge difference, but I’d like to consider the ‘public cloud’, which means entrusting an entire service, its resources and its data to ‘somewhere else’. I maintain a positive scepticism towards the public cloud. I use cloud products myself, although only because I can reassure myself that I have the skills, the infrastructure and the tools to cope with anything that might happen within those (not under my control) clouds.
I do however see organisations choosing to occupy the left-hand-end of the spectrum inappropriately:
- Because it is a ‘default’ choice made through lack of awareness – or willingness – to consider alternatives.
- Because the cost savings documented can often derive from a perceived opportunity to divest of skills and agility.
Agility should always be at least considered as a necessary output from any initiative within a programme. To do otherwise may well be costly.
Note: This post sat as a draft for a while until an article from John Halamka prompted me to finish it off. The article covers the challenges often described as VUCA. Addressing those challenges very often requires – guess what – people having skills and agility.
Cross posted under Even a *HOT* backup is not a Business Continuity Strategy. And this doesn’t even START on BC/DR vs HA
Over in the PACSGroup UK forum there is a discussion thread that has been quite busy the last week or so. The overall topic is the specification of the notion of ‘Vendor Neutral Archive’ – in particular, at least as I read it, what problems VNAs are supposed to solve. Lest I be misinterpreted at any point – I am a big supporter of any disruption in the Radiology/PACS market that empowers the end user to a greater extent than they have been in the past. That disruption has come in the shape of VNAs over the last couple of years and I tip my cap.
However, there is one suggestion regarding the value of VNAs that makes me pause for thought: that a VNA can be used as a Business Continuity facility to mitigate against the risk of PACS becoming unavailable. That in its own right is a limited view of how PACS BC can be approached. A VNA has the capability of maintaining an independent copy of images, which can be used as the distribution hub for clinical locations outside Radiology. Let’s assume for the moment that the viewing capabilities outside of Radiology are themselves independent of PACS (which is already the state of affairs in places, as well as in many procurements under way even as I speak). The fact remains (elephant in the room?) that if PACS is down – RADIOLOGY is down. That is NOT providing any kind of continuity of service at all.
And, in my mind – it is a profoundly oversimplified perspective of what BC is and the challenges that face it, as a process.
The foundation of business continuity are the standards, program development, and supporting policies; guidelines, and procedures needed to ensure a firm to continue without stoppage, irrespective of the adverse circumstances or events. All system design, implementation, support, and maintenance must be based on this foundation in order to have any hope of achieving business continuity
Wikipedia: Business Continuity
It is important to realise that Business Continuity measures cannot be easily plugged into existing systems, and in a moment I’ll give a couple of examples where BC processes can be hampered by decisions made in the delivery of basic infrastructure – quite possibly many years prior to the procurement of a given system.
Business continuity is sometimes confused with disaster recovery, but they are separate entities. Disaster recovery is a small subset of business continuity.
DR is where the old-fashioned off-site backups come in. Let’s say you suffer the nightmare scenario – you have a bunch of servers running in a server room which is consumed by fire. DR is then about recovering to a ‘normal’ state of affairs. You may need to provision a new server room. You’ll probably need to order up a bunch of hardware in a hurry (some thoughts on Vendor Relationship Management to follow!) and mobilise your software vendors to help restore from whatever backup strategy you have in place – which – yes – can include a VNA (as long as it was on a separate site). Depending on the number of systems involved and the amount of preparation, that process is likely to be measured on a timescale of days. Clearly, DR in this extreme sense is NOT about continuity of business process.
Where BC comes in, in this case, is in protecting the viability and integrity of core business processes while the DR process is ongoing. And BC must consider a whole gamut of potential risks other than the ‘nightmare scenario’. Some of these risks can be identified, analysed, and mitigated. Some that are identified may simply be carried, and there will always be the scenario that comes out of left field. One of the core messages in the superb book “The Black Swan” by Nassim Nicholas Taleb is that the risks we identify should never do us significant damage – precisely because we saw them coming and can prepare in advance. It is the disasters and events we don’t see coming (like the existence of black swans) that will have the biggest effect – because we didn’t. It is the job of the Business Continuity Process to provide the flexibility and structures to deal with the black – as well as the white – swans.
A couple of examples that are reasonable extrapolations of real events that have occurred in any number of sites around the world:
Let’s say one has a server room with a number of servers – most of which have disk storage served from a SAN. The hardware for the servers, the SAN and the network has been carefully specified and implemented to eliminate any Single Points of Failure (SPoF) with multiple layers of redundancy, integrity checking and clustering. Fire suppression is state-of-the-art. What can go wrong? The air conditioning fails, and will take 3 hours to fix.
With the heat generated by the various kit, the room temperature will go out-of-spec in 1 hour. All servers must then be shut down until the AC is fixed and the environment restored. It is possible to identify a subset of servers that simply MUST be kept online. Can the remainder be shut down for the duration while the handful stay up? Yes they can, but it doesn’t make any difference. By far the biggest generator of heat is the SAN. The SAN is either up – or it isn’t. If it isn’t – all servers are down anyway.
Lesson: It’s really easy to miss a risk at an early point in an overall infrastructure deployment that turns out to have a much bigger impact when facilities have grown.
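For the technically minded, the SAN problem can be sketched in a few lines of Python. This is only an illustration under invented assumptions: the system names, the dependency map and the heat figures are all hypothetical, not taken from any real site. The idea is simply to work out what must stay powered for the critical systems to run, and then see how little heat is actually saved.

```python
# Hypothetical inventory: names and heat figures are illustrative only.
DEPENDS_ON = {
    "order_comms": ["san"],
    "pacs": ["san"],
    "payroll": ["san"],
    "san": [],
}
HEAT_KW = {"order_comms": 1, "pacs": 1, "payroll": 1, "san": 12}


def required_to_run(critical, depends_on):
    """Return the critical systems plus everything they transitively depend on."""
    needed = set()
    stack = list(critical)
    while stack:
        item = stack.pop()
        if item not in needed:
            needed.add(item)
            stack.extend(depends_on.get(item, []))
    return needed


def heat_load(components, heat_kw):
    """Total heat output (kW) of the given components."""
    return sum(heat_kw[c] for c in components)


# Keep only the one system that MUST stay online, and see what comes with it.
must_stay_up = required_to_run(["order_comms"], DEPENDS_ON)
full_load = heat_load(DEPENDS_ON, HEAT_KW)
reduced_load = heat_load(must_stay_up, HEAT_KW)

print(sorted(must_stay_up))
print(f"full load {full_load} kW, reduced load {reduced_load} kW")
```

With these made-up numbers, shedding two of the three servers barely dents the heat load, because the SAN comes along with whichever server is kept. Running this sort of check against a real inventory, early and again as facilities grow, is one way a risk like this might be spotted before the air conditioning fails.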
How does BC handle this? Probably by fallback to a manual process. Does that mean special stationery needs to be held in stock? Does everybody involved know how to fall back to manual? Will you need a team of facilitators mobilised to ease that process? It WILL have an impact on the performance of involved departments – does there need to be an escalation plan to limit service demand by (for example) diverting emergency patients to alternative facilities? This is Business Continuity Planning, and requires a holistic approach.
Let’s say you have all of that in place, but that the ‘event’ was a fire outside the server room. Let’s face it, fire suppression works, but fires still happen. The suppression kicked in and as soon as the Fire Safety Officer has given the all-clear, servers are restarted within 30 minutes and the server room is back to normal in good time. In the meantime the arrangements are that test orders are written on card (a supply of pink and blue cards is maintained in central stores) and urgent orders telephoned through to the relevant department. Additional staff are drafted in to deal with the extra legwork and data entry. All fine. Except the fibre optic cables from the server room were housed in metal conduits. The conduit got hot, and melted the cables, so although the servers are up – they are still offline. So manual fallback is still running. Except for Orthopaedic Theatre – whose telephones run over VoIP – controlled – or not – by a server in the (currently) offline server room. Oops.
Even the best laid plans sometimes need to pivot and adapt to retain BC. And therein lies the key to true Business Continuity Planning – people.
If people on the ground (in clinical, admin and operational areas as well as technical) have the right skills and knowledge – technical, communications, and management, then whatever threat comes out of left field is vulnerable. The more static and rigid the available skills – the more likely the Black Swan will bite.