_7: Increasing Availability in your OT System

What end users don't understand about what it takes to provide highly available OT systems.

10/28/2024, 4 min read

System availability is the most important requirement in any OT/ICS system. Downtime means no revenue is generated, and unexpected downtime can result in personnel injury. To keep systems running as intended, every OT/ICS environment needs two things: protection and redundancy. In this blog, I will focus on redundancy.

What is Redundancy? Redundancy means that a single point of failure does not take down your OT/ICS system. It must be designed into the system so that the system can ride through failure events. Some of these designs include the following:

  • Redundant Power Sources

  • Utilization of Uninterruptible Power Supply (UPS)

  • Redundant Network Media

  • Redundant Hardware

  • Redundant Infrastructure and SCADA Services
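To put a rough number on what these designs buy you, here is a minimal sketch of the standard parallel-availability formula. It assumes the redundant copies fail independently of each other and that any one working copy is enough to keep the system up. Real plants rarely meet that assumption perfectly (two cables in the same conduit do not fail independently), and the 99% figure is made up for illustration.

```python
def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of `copies` independent redundant components,
    where any single working copy keeps the system running."""
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("need at least one copy")
    # The system is down only when every copy is down at the same time.
    return 1.0 - (1.0 - component_availability) ** copies


# Illustrative figure only: one component that is up 99% of the time.
for n in (1, 2, 3):
    print(f"{n} copies: {parallel_availability(0.99, n):.6f}")
```

Under these assumptions, a second copy takes a 99% component to roughly 99.99%, which is why removing a single point of failure is usually the biggest single gain.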

Redundant Power Sources

It's always better to have more than one power source. Imagine a power outage on your main power source: without a secondary source, your plant would be pitch black, without the clicking and clacking of machines making products. One option is to install onsite industrial generators to maintain operations during the outage. Depending on how critical your operations are to the industry, you should consider whether generators are an option for your plant.

Uninterruptible Power Supply (UPS)

When switching from the main power source to a backup system, the UPS keeps your system running without a power glitch. The UPS does this by detecting the loss of power at the infeed of your system and switching over to its batteries (which must be well maintained) until power is stable. A UPS can save you from a system shutdown and startup sequence that can take 30 minutes to 1 hour. When installing UPSs, also consider configuring graceful shutdown and startup for the case where power is not restored. Graceful shutdowns protect your servers: if servers are not properly shut down, a power outage can cause data loss and corrupt your file systems.
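The graceful shutdown decision described above can be sketched as a simple rule: start shutting down while the batteries still hold enough runtime to finish the job. This is a minimal sketch, not a vendor API. The `UpsStatus` fields, the function name, and the 120-second safety margin are all hypothetical, and in practice the status would come from your UPS vendor's monitoring software.

```python
from dataclasses import dataclass


@dataclass
class UpsStatus:
    """Snapshot of UPS state. Field names are hypothetical, not a vendor API."""
    on_battery: bool
    runtime_remaining_s: int  # estimated seconds of battery runtime left


def should_shut_down(status: UpsStatus,
                     shutdown_duration_s: int,
                     safety_margin_s: int = 120) -> bool:
    """Return True once remaining runtime is only just enough for a clean shutdown."""
    if not status.on_battery:
        return False  # mains power is present, nothing to do
    return status.runtime_remaining_s <= shutdown_duration_s + safety_margin_s


# Assumed: a full graceful shutdown of the servers takes about 10 minutes.
print(should_shut_down(UpsStatus(True, 1500), shutdown_duration_s=600))  # False
print(should_shut_down(UpsStatus(True, 700), shutdown_duration_s=600))   # True
```

The shutdown duration is the number to measure on your own system, since it decides how much battery runtime you actually need to buy.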

Redundant Network Media

There are essentially two approaches to redundant network media. First and foremost, always have your electrical contractor or network cabling contractor pull an extra copper cable. If you're pulling fiber, use multi-strand cable; select a fiber cable that provides at least one spare pair of strands. You never know when you will need to do a quick patch to get existing equipment on the plant floor running again after a cable deteriorates. There shouldn't be much of a cost difference between one copper pull versus two. The contractor is already doing the majority of the work, and labor is where the majority of the cost lies.

The second approach is to pull extra copper/fiber cables through a different conduit or cable tray route. If one route gets damaged, the second set of cables will still allow communication. This approach requires the contractor to do at least twice the amount of work, including forming conduits, setting wire trays, and pulling cables. In the end, redundant network media of this kind will cost you about twice as much. You as the end user will have to evaluate the costs and benefits of having such redundancy designed into your system.

Redundant Hardware

Redundant hardware is necessary to maintain high availability. Starting with the OT servers, it's always recommended to have at least three server hosts within a host cluster. If a host fails, the cluster can automatically restart its VMs on the other available hosts. Within the hosts themselves, it's always recommended to have at least two network adapters with at least 4 vmnics each. The vmnics are used in pairs, with one vmnic taken from each network adapter: one pair is dedicated to management, one pair to the VM Network, one pair to vMotion and Fault Tolerance, and one pair to Storage Area Network (SAN) iSCSI. Because each pair draws one vmnic from each network adapter, the host keeps its connectivity if a network card fails.
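The vmnic pairing rule is easy to get wrong when cabling a host, so it can help to check it on paper first. Below is a minimal sketch of such a check. The adapter names, the vmnic numbering, and the pair assignments are hypothetical examples, not a recommended layout or anything read from a real host.

```python
# Hypothetical inventory: the physical adapter behind each vmnic.
VMNIC_ADAPTER = {
    "vmnic0": "adapter_A", "vmnic1": "adapter_A",
    "vmnic2": "adapter_A", "vmnic3": "adapter_A",
    "vmnic4": "adapter_B", "vmnic5": "adapter_B",
    "vmnic6": "adapter_B", "vmnic7": "adapter_B",
}

# Hypothetical uplink pairs, one pair per traffic type named in the text.
UPLINK_PAIRS = {
    "Management": ("vmnic0", "vmnic4"),
    "VM Network": ("vmnic1", "vmnic5"),
    "vMotion/Fault Tolerance": ("vmnic2", "vmnic6"),
    "iSCSI": ("vmnic3", "vmnic7"),
}


def pairs_on_single_adapter(pairs, vmnic_adapter):
    """Return the traffic types whose uplinks all sit on one physical adapter."""
    return [
        name
        for name, uplinks in pairs.items()
        if len({vmnic_adapter[u] for u in uplinks}) < 2
    ]


# An empty list means every pair survives the loss of one network card.
print(pairs_on_single_adapter(UPLINK_PAIRS, VMNIC_ADAPTER))  # prints []
```

Any traffic type the function returns would lose both of its uplinks if that one adapter failed.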

Secondly, always stack your network switches to provide high availability. Stacking provides a primary and a secondary switch. Also configure link aggregation to each of the network switches to provide link redundancy, double the available bandwidth, and balance load. One exception is iSCSI, where many user groups recommend against link aggregation. That said, knowledge base documentation from multiple SAN manufacturers indicates support for link aggregation with iSCSI. I recommend that everyone test thoroughly before deploying in production.

Redundant Infrastructure and SCADA Services

In addition to hardware redundancy, redundant infrastructure and SCADA services are also very important for maintaining high availability in an OT environment. These services include Domain Controllers, HMI Servers, Historian Servers, and Data Servers, to name a few. Redundant Domain Controllers keep access control, DNS, DHCP, and so on (depending on which services you have deployed on your Domain Controllers) highly available in an OT environment. Creating a redundant Domain Controller won't cost you anything extra if you already have Windows OS licenses (either a Datacenter license or a Standard license, which provides two VMs per license). Redundant SCADA services are also very important but do require additional capital to purchase an additional set of licenses. Maintaining redundant SCADA services may not be as seamless as maintaining Windows Domain Controllers, but any OT admin can get up to speed with the proper training.

In conclusion, maintaining availability of any OT system is essential to manufacturing and critical infrastructure. To do so, end users must invest in their systems. When calculating your Return on Investment, start by understanding what it costs if your operations are down for 1 hour. The question you should consider is: how many hours of downtime can you afford? Imagine your operation going down for 1 hour during your peak season. Could the cost of NOT producing cover the additional hardware, software, and engineering required to implement redundancy? What if your operations were down for 2 hours during your peak season? What if it was 3 hours, 4 hours, 5 hours? This is something for end users to seriously consider when designing. Remember, it's always easier to put in the redundancy up front. Brownfield projects are doable but may require additional time, testing, and coordination from operations, meaning additional downtime.
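The downtime question above can be turned into quick arithmetic. The sketch below is a rough back-of-the-envelope calculation, not a full ROI model. The function names are hypothetical, and the dollar figures are made up for illustration; substitute your own peak-season numbers.

```python
def downtime_cost(hours_down: float, revenue_per_hour: float) -> float:
    """Production value lost while the operation is down."""
    return hours_down * revenue_per_hour


def break_even_hours(redundancy_cost: float, revenue_per_hour: float) -> float:
    """Hours of avoided downtime needed to pay for the redundancy."""
    if revenue_per_hour <= 0:
        raise ValueError("revenue_per_hour must be positive")
    return redundancy_cost / revenue_per_hour


# Illustrative figures only (assumptions, not data from any real plant).
REVENUE_PER_HOUR = 25_000.0   # peak-season production value per hour
REDUNDANCY_COST = 100_000.0   # extra hardware, software, and engineering

for hours in (1, 2, 3, 4, 5):
    lost = downtime_cost(hours, REVENUE_PER_HOUR)
    print(f"{hours} h down: {lost:,.0f} lost, "
          f"covers redundancy: {lost >= REDUNDANCY_COST}")

print(break_even_hours(REDUNDANCY_COST, REVENUE_PER_HOUR))  # 4.0
```

With these example numbers, the redundancy pays for itself once it prevents four hours of peak-season downtime.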