A basic explanation of different types of redundancy, and what they are good at.
This is part of a series of articles about redundancy, which should be read in order. If you have not read the previous articles, start with the first one, Redundancy to solve all problems.
A redundant or fault-tolerant system consists of more than one device doing the same job, so that when one device stops working, another device takes over. We will divide redundancy into three categories.
1. Load-shared redundancy: Here more than one device shares the load. They all do the same job, and there will normally be some system that assigns tasks to them. If one of them fails, the tasks it should have done are done by the remaining devices, so the total performance of the system degrades.
2. Active redundancy: Typically two devices do exactly the same thing, but only the output of one of them is actually used. This means that there must be some sort of switching on the output side of the two devices. The two devices will typically be equally powerful, so that the total performance of the system does not change when we switch to the backup device.
3. Passive redundancy: Here one device is doing the work while another is idle, ready to take over if the first one fails. This means that there must be some switching system that stops the active device, starts the idle device, and switches both the input side and the output side.
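The three categories can be summarised by what the spare device does before a failure and what has to be switched afterwards. The Python sketch below only restates the descriptions above as data; the names `RedundancyType`, `CATEGORIES` and `describe` are made up for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RedundancyType:
    name: str
    spare_is_working: bool        # is every device doing work before a failure?
    switching_needed: str         # what must be switched when a device fails
    degraded_after_failure: bool  # does total performance drop after a failure?


CATEGORIES = [
    RedundancyType("load-shared", True, "nothing (remaining devices absorb the load)", True),
    RedundancyType("active", True, "outputs only", False),
    RedundancyType("passive", False, "inputs and outputs", False),
]


def describe(category: RedundancyType) -> str:
    """Build a one-line summary of a redundancy category."""
    spare = "working" if category.spare_is_working else "idle"
    effect = "degraded" if category.degraded_after_failure else "unchanged"
    return (f"{category.name}: spare is {spare}, switch {category.switching_needed}, "
            f"performance {effect} after a failure")


for category in CATEGORIES:
    print(describe(category))
```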
Note that the terms active and passive redundancy are used differently in different articles about redundancy. Other authors may also use different categories, for example a category called One-to-One, where each device always has a spare device that is typically operated in active redundancy mode. Another category is N+X redundancy, where you have N systems that are active and a smaller number X on standby. N+1 redundancy is then a special case where you have only one spare device to take over from any of the N devices. The spare device will typically be passive, because it will need instructions about which task it is going to perform before it can start working.
The differences are most easily understood from examples. For active and passive redundancy it makes sense to talk about a Main and a Backup device, where Main is the system that is normally in use and Backup is the system that is ready to take over. An example of a load-shared system is an FM transmitter with two PA (power amplifier) blocks. Normally both are on, and if one of them fails the transmitter stays on air, but with reduced output power, typically -6 dB. In this case nothing needs to decide whether to switch to a backup system; the output power is simply reduced for as long as a PA block isn’t working. Another example is an airplane with two engines. All pilots who fly commercial airliners learn how to fly and land an airplane even if one of the engines doesn’t work.
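The -6 dB figure can be checked with a small calculation. The sketch below assumes (this is not stated in the article) that the PA blocks are identical and feed an isolating combiner, so that when one block is missing part of the remaining power ends up in the combiner's balancing load and the output scales with the square of the fraction of working blocks. The function name `output_change_db` is hypothetical.

```python
import math


def output_change_db(total_blocks: int, working_blocks: int) -> float:
    """Change in combined output power, in dB, relative to all blocks working.

    Assumption: equal PA blocks and an isolating combiner, so output power
    scales with (working_blocks / total_blocks) squared.
    """
    if not 0 < working_blocks <= total_blocks:
        raise ValueError("need 0 < working_blocks <= total_blocks")
    ratio = (working_blocks / total_blocks) ** 2
    return 10 * math.log10(ratio)


print(round(output_change_db(2, 1), 2))  # -6.02
```

With two blocks and one failed, the output falls to a quarter of full power, which is where a drop of about 6 dB comes from under these assumptions.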
An example of an active redundancy system could be two video playout servers that are linked together so that when the operator plays a clip, they both play it. Should the Main device fail, some detection mechanism must detect that there is no valid video signal and switch to the output of the Backup device. If this is done in a clever way, the blackout will be very short. The Backup system is working all the time, and it makes no difference to it whether it is on air or not.
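The detection and switching described above can be sketched as a small changeover unit that watches the Main signal frame by frame. This is only an illustration: the class name `Changeover`, the idea of confirming a loss over a number of frames, and the choice never to switch back automatically are assumptions, not something the article specifies.

```python
class Changeover:
    """Output switch for an active redundancy pair (illustrative sketch)."""

    def __init__(self, frames_to_confirm_loss: int = 2):
        self.on_air = "main"
        self.bad_frames = 0
        self.frames_to_confirm_loss = frames_to_confirm_loss

    def step(self, main_valid: bool) -> str:
        """Process one frame and return which device is on air."""
        if self.on_air == "main":
            # Count consecutive invalid frames from Main.
            self.bad_frames = 0 if main_valid else self.bad_frames + 1
            if self.bad_frames >= self.frames_to_confirm_loss:
                # Backup has been playing all along, so only the output changes.
                self.on_air = "backup"
        return self.on_air


switch = Changeover(frames_to_confirm_loss=2)
history = [switch.step(valid) for valid in [True, True, False, False, False]]
print(history)  # ['main', 'main', 'main', 'backup', 'backup']
```

In this run the blackout lasts one frame: the first invalid frame still goes out from Main, and the second one triggers the switch. A shorter confirmation time gives a shorter blackout but more risk of switching on a harmless glitch.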
Passive redundancy is a system where the Backup device isn’t doing anything. It is probably switched on and ready, but it’s not performing any jobs until it is started when needed. An example could be two independent playout devices in a control room: the backup playout device isn’t doing anything, but it’s turned on and ready to play. Another example could be redundant transmitters, where one transmitter stands ready in case another stops working. Then there must be a control unit that detects that Main has a problem, shuts Main down, switches the inputs and outputs and tells Backup to start. Here we also find N+1 redundancy. If we have a transmitter station with a series of transmitters, there could be one Backup which can replace any of the other transmitters. If one of the Main transmitters fails, the control unit must detect which transmitter has failed, send the correct signal to the input of the Backup transmitter, tell the Backup which frequency to tune to, switch the outputs and start the Backup.
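The job of the N+1 control unit can be sketched as follows. This is a simplified illustration, not a description of any real control system: the `Transmitter` class, the `fail_over` function, the transmitter names and the frequencies are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Transmitter:
    name: str
    frequency_mhz: float
    healthy: bool = True
    on_air: bool = True


def fail_over(mains: List[Transmitter], backup: Transmitter) -> Optional[List[str]]:
    """Return the ordered actions for replacing the first failed Main, or None."""
    # Step 1: detect which transmitter has failed.
    failed = next((t for t in mains if t.on_air and not t.healthy), None)
    if failed is None or backup.on_air:
        return None  # nothing has failed, or the single Backup is already in use
    failed.on_air = False
    backup.frequency_mhz = failed.frequency_mhz
    backup.on_air = True
    return [
        f"shut down {failed.name}",
        f"route input signal of {failed.name} to {backup.name}",
        f"tune {backup.name} to {failed.frequency_mhz} MHz",
        f"switch output from {failed.name} to {backup.name}",
        f"start {backup.name}",
    ]


mains = [Transmitter("TX1", 91.5), Transmitter("TX2", 98.2, healthy=False)]
backup = Transmitter("Backup", 0.0, on_air=False)
for action in fail_over(mains, backup):
    print(action)
```

The sketch also shows the limit of N+1: once the Backup is on air, a second failure returns `None` because there is no spare left.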
Where it makes sense to use it, load-shared redundancy will often be the type of redundancy that gives the most value for money, because you don’t invest in something that is not used. The question to consider is whether the decrease in output that often follows when a device in a load-shared configuration fails is acceptable for the time it takes to fix it. If it is not acceptable, the choice is between active and passive redundancy. Comparing the two, most people would probably think that active redundancy is best, but the devil is in the details…
Some years ago I visited the nuclear power plant at Barsebäck near Copenhagen. It was still in full operation, and whether to shut it down had been debated for as long as I can remember. (Non-Danish/Swedish readers may never have heard about Barsebäck, so here is a very short introduction: It was a nuclear power plant in Sweden built only 20 km from Copenhagen, the capital of Denmark with more than 1 million inhabitants. When it was built the Danish government didn’t have any objections; in fact Denmark had plans to build nuclear power plants of its own. But then came the 1970s, the Three Mile Island accident and a majority against nuclear power among the Danish voters. So Denmark never had nuclear power plants of its own, and it became a problem that the Swedes had a nuclear power plant within 20 km of Copenhagen. Since then, successive Danish governments did what they could to push the Swedes to shut Barsebäck down, which finally happened in 2005.) When I visited the place, they naturally told us a lot about the different redundancy and security systems they had for all sorts of problems. I noticed one thing in particular about the pumps for cooling water for the reactor. They had 4 pumps, each the size of a small truck, and all pumps were on. So the 4 pumps were working in a load-shared setup. Each pump also had a control unit, but all the control units were different. At first this seemed a bit odd, but the reason for it shows how much the broadcasting industry can learn from other industries: If you have several systems that are built from exactly the same components, and some special conditions occur that would cause one control unit to do something wrong, then all the control units are likely to make the same error if they are subject to the same conditions (and given that they work in a load-shared setup this is very likely). In this case the error could be to stop the pump. 
If this happened for all pumps, then the very nice redundant setup where two out of 4 pumps would be enough to cool the reactor would be worth nothing, and we could have a nuclear disaster. So that is why the control units were different.
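The reasoning behind the different control units can be illustrated with a toy model. It assumes that all controllers see the same special condition at the same time and that the condition makes only one particular controller model stop its pump. The model names "A" to "D" and the function `pumps_still_running` are invented for the example; only the figures of 4 pumps and 2 needed come from the text.

```python
from typing import List

PUMPS_NEEDED = 2  # from the text: two out of four pumps are enough


def pumps_still_running(controller_models: List[str], faulty_model: str) -> int:
    """Count pumps that keep running when every controller sees the same
    condition and that condition makes only `faulty_model` stop its pump."""
    return sum(1 for model in controller_models if model != faulty_model)


identical = ["A", "A", "A", "A"]
diverse = ["A", "B", "C", "D"]

for setup in (identical, diverse):
    running = pumps_still_running(setup, faulty_model="A")
    status = "cooling OK" if running >= PUMPS_NEEDED else "cooling lost"
    print(setup, running, status)
```

With identical controllers the single fault stops all four pumps; with four different controllers the same fault stops only one, leaving three where two are required.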
In the broadcasting industry we rarely need to pump large quantities of water, but it’s not difficult to imagine how an active redundancy setup made with two devices of the same brand and model may fail. This could be two playout servers playing the same playlist. If there is an error in this playlist that causes the Main server to crash, then it is very likely that the Backup server, which is doing exactly the same thing, also crashes, at least if the two servers are the same type with the same firmware etc. So if it is not possible to have two servers from different manufacturers, it is worth considering whether a passive redundancy system would be better. If anything in the playlist causes Main to crash, Backup will not have crashed because it wasn’t doing anything, and it will be ready to start.