Full Stack Redundancy

Posted by Martin Peacock in Blog, Business Continuity on Oct 4, 2012

Redundancy in hardware terms is good, and is a major reason why servers tend to look nothing like desktops.  There are a handful of components within a server that have a tendency to fail:

  • Power supplies
  • Disk Drives
  • Network Cards

… and there are long-established mechanisms for building redundancy into each of these within a single server: hot-swappable power supplies, and multiple pathways to multiple disks through multiple disk controllers.  A good redundancy strategy even pairs redundant components from different generations, so that they do not all sit at the high-risk points of the bathtub curve at the same time.

Less often (indeed rarely) one might see a motherboard fail, which is one of the reasons why redundancy is extended to the next level, where each server can be mirrored in a warm-standby or clustered arrangement. In recent years, particularly with the popularisation of virtualisation technologies, redundancy can extend across different datacentres, even across different continents.

But an information service consists of more than just the hardware it runs on.  The hardware itself is useless without an operating system, an important part of the series of layers (The Stack) on which any service depends.  Depending on the system in question, the stack includes:

  • A database layer.
  • A web server layer, including a server-side language such as PHP, Java, .Net etc.
  • An application server layer, which may include the web server layer but extends beyond it to other runnable services.

And continuing up the stack, there are elements of the software implementation that could, in many cases, be considered discrete units in their own right. In the case of a Radiology-based service this might include:

  • DICOM/IHE/HL7 implementation.
  • Speech recognition implementation.

Let’s consider for a moment the level of the stack at which the Operating System runs.  Operating Systems are selected for many reasons, reliability being one.  It is widely accepted (historically at least) that Linux as a server OS is a ‘good’ choice from this perspective.  Early in July, however, a bug in Linux related to leap seconds disrupted a number of major web services. Even Linux has bugs that cause problems.

Can an end-user (even a sys admin) ever be sure that no level of the Stack hides a systemic issue that can cause problems?  If, however, a service is engineered so that a Linux server works in concert with a Windows server, such systemic risks can be mitigated.
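To make that concrete, here is a minimal sketch (not from the original post) of a client that simply prefers whichever of two equivalent endpoints is reachable, one hosted on Linux and one on Windows. The hostnames and port are hypothetical placeholders.

```python
# Minimal sketch: client-side failover across two equivalent service
# endpoints hosted on different operating systems.
import socket

ENDPOINTS = [
    ("pacs-linux.example.org", 8080),    # e.g. Linux-hosted instance
    ("pacs-windows.example.org", 8080),  # e.g. Windows-hosted instance
]

def first_available(endpoints, timeout=2.0):
    """Return the first endpoint that accepts a TCP connection."""
    for host, port in endpoints:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return host, port
        except OSError:
            continue  # unreachable or timed out: try the next endpoint
    raise RuntimeError("no endpoint reachable")

if __name__ == "__main__":
    host, port = first_available(ENDPOINTS)
    print(f"routing traffic to {host}:{port}")
```

A systemic fault in one operating system then degrades capacity rather than taking the whole service down.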

It is equally possible (in fact probably easier) to engineer a service to eliminate a dependence on a particular database.  It may mean forgoing some database-specific features, but it can be done, and most services don’t really need those features.
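As an illustration, here is a minimal sketch of that idea using SQLAlchemy in Python: the application talks to a single configurable engine URL and sticks to plain, portable SQL, so swapping the backend is a configuration change. The connection URLs and the reports table are assumptions made for the example, not part of any real product.

```python
# Minimal sketch: keep application code database-agnostic by confining it
# to portable SQL behind a single configurable engine URL.
from sqlalchemy import create_engine, text

# Swapping the backend should be a configuration change, not a rewrite.
engine = create_engine("postgresql://user:pass@dbhost/ris")   # hypothetical URL
# engine = create_engine("mysql+pymysql://user:pass@dbhost/ris")

def recent_reports(conn, limit=10):
    # Plain, widely supported SQL only; no vendor-specific extensions.
    result = conn.execute(text(
        "SELECT accession_no, reported_at FROM reports "
        "ORDER BY reported_at DESC"
    ))
    return result.fetchmany(limit)

with engine.connect() as conn:
    for row in recent_reports(conn):
        print(row.accession_no, row.reported_at)
```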

 

Let’s look at a different, and Radiology-specific, layer: DICOM services.  While I do not subscribe to many of the criticisms of DICOM, it cannot be denied that implementation can be a tortuous process, with regular edge cases (sometimes vendor-specific) that must be accommodated.  Sometimes, things break.  When that happens, whoever is handed the task of troubleshooting has a number of options. One is to try an alternative DICOM implementation. This is one of the great benefits of the strong Open Source Software we have available, and it should be kept to hand as part of the process of Organisational Agility.
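By way of example, here is a minimal sketch of that “second opinion” using the open-source pynetdicom library to C-ECHO a misbehaving node from an entirely independent DICOM stack. The host, port and AE titles are hypothetical.

```python
# Minimal sketch: verify a DICOM node with an independent open-source
# implementation (pynetdicom) to help isolate where a fault lies.
from pynetdicom import AE

VERIFICATION_SOP_CLASS = "1.2.840.10008.1.1"  # standard Verification (C-ECHO) SOP Class UID

ae = AE(ae_title="TROUBLESHOOT")
ae.add_requested_context(VERIFICATION_SOP_CLASS)

assoc = ae.associate("pacs.example.org", 104, ae_title="PRIMARY_PACS")
if assoc.is_established:
    status = assoc.send_c_echo()
    if status:
        print(f"C-ECHO status: 0x{status.Status:04X}")  # 0x0000 indicates success
    else:
        print("peer aborted or timed out during C-ECHO")
    assoc.release()
else:
    print("association rejected: the problem may sit below the DICOM layer")
```

If the independent stack succeeds where the production implementation fails, the fault most likely lies in that implementation rather than in the network or the remote node.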

Just as with Operating Systems and Databases, a service (in this case, PACS) can be engineered to avoid dependence on a given DICOM implementation.  If considered early enough in the development process, it can be done within a product, but it can also be implemented with a multi-vendor construction.
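One way to do that within a product is a thin internal interface that the rest of the service codes against, with each DICOM toolkit hidden behind an adapter. The sketch below assumes pynetdicom and DCMTK’s storescu as the two interchangeable backends; the class and method names are illustrative, not any vendor’s API.

```python
# Minimal sketch: hide the choice of DICOM toolkit behind a small interface
# so the rest of the service never depends on one implementation.
from abc import ABC, abstractmethod
import subprocess

class DicomSender(ABC):
    @abstractmethod
    def store(self, host: str, port: int, path: str) -> bool:
        """Send the DICOM file at `path` to the node at host:port."""

class PynetdicomSender(DicomSender):
    def store(self, host, port, path):
        from pydicom import dcmread
        from pynetdicom import AE, StoragePresentationContexts
        ae = AE()
        ae.requested_contexts = StoragePresentationContexts
        assoc = ae.associate(host, port)
        if not assoc.is_established:
            return False
        status = assoc.send_c_store(dcmread(path))
        assoc.release()
        return bool(status) and status.Status == 0x0000

class DcmtkSender(DicomSender):
    def store(self, host, port, path):
        # An entirely separate implementation: shell out to DCMTK's storescu.
        return subprocess.run(["storescu", host, str(port), path]).returncode == 0

# Which toolkit to use becomes a configuration decision, not a rewrite.
sender: DicomSender = PynetdicomSender()
```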

But doesn’t that increase the cost?

Yes, it does.  And these are tough times, where the Return on Investment may not make a compelling enough case to do this.   A long time ago, I worked for a large multinational (sadly, on hard times nowadays) that had a project on which two entirely separate development teams were working, each providing redundancy for the other.  The cost of the project was, consequently, almost doubled.  But the product in question was SO IMPORTANT that failure of a single team to deliver, or even falling slightly short of expectations, was not an option. This is not new; it is, after all, the principle behind Double-Entry Bookkeeping.

Does that apply to all cases?  Certainly not.  It depends on how critical a service is, and on what the options are for providing Stack Redundancy.  In fact, for PACS (at least, ‘general’ PACS services), Stack Redundancy can be deployed quite cheaply.  If your primary PACS happens to run on Windows servers, consider a companion running DCM4CHEE on Linux.  If your primary PACS runs on Linux, why not try ClearCanvas (or ‘CHEE) on Windows?  There are workflow issues with synchronising data updates, but even ignoring these, the gain in Disaster Tolerance is huge.
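As one hedged illustration of the companion-archive idea, the sketch below pushes each stored object to both archives with pynetdicom. Host names, ports and AE titles are placeholders, and a real deployment would also handle retries and the workflow issues mentioned above.

```python
# Minimal sketch: replicate each stored object to both the primary PACS
# and a companion archive so either can serve as a fallback.
from pydicom import dcmread
from pynetdicom import AE, StoragePresentationContexts

ARCHIVES = [
    ("primary-pacs.example.org", 104, "PRIMARY"),   # e.g. vendor PACS on Windows
    ("dcm4chee.example.org", 11112, "DCM4CHEE"),    # e.g. companion archive on Linux
]

def replicate(path):
    ds = dcmread(path)
    outcome = {}
    for host, port, aet in ARCHIVES:
        ae = AE(ae_title="REPLICATOR")
        ae.requested_contexts = StoragePresentationContexts
        assoc = ae.associate(host, port, ae_title=aet)
        ok = False
        if assoc.is_established:
            status = assoc.send_c_store(ds)
            ok = bool(status) and status.Status == 0x0000
            assoc.release()
        outcome[aet] = ok
    return outcome  # any mismatch here is a synchronisation gap to reconcile

print(replicate("IM000001.dcm"))
```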

For other services, it may not be possible at all, especially where there is reliance on vendors, but it’s always something worth looking out for.
