This is part of a series of articles about redundancy. The articles should be read in order. If you have not read the previous articles, use this link to go to the first one: Redundancy to solve all problems.
Redundancy only does any good if it works the way the users expect. Most users don’t care how their backup systems work or how to use them. Even the best redundant system can end up causing dead air if the users expect the wrong thing from it. The following case is from a TV station, and it shows how things can go wrong even when all the necessary redundancy is in place to prevent it.
The TV station has a Main and a Backup video playout server. Each server has two video outputs, making four video outputs in total. In normal operation one of the servers is used in A/B mode and the other is not used, but all video files are transferred to it automatically so that it is ready to play when needed. The director has two PCs, each controlling one video server. Because space is limited (finding room for computer monitors is always a problem in a control room), there is only one screen, mouse and keyboard for the two PCs, so the director uses a KVM switch to select which PC to control. Switching the video outputs between Main and Backup is done by pressing a red button. Finally, there is a control monitor that shows which clip is currently playing, the elapsed duration and the remaining duration. It has an auto-sensing function, so it works out by itself whether Main or Backup is playing and then displays the information from the right server.
One day there was a problem with the Main playout server, so it could not play. The director then used the KVM switch to connect to the PC that controls the Backup playout server. But the director expected the playlist that was loaded on the Main PC to be loaded automatically on the Backup PC. The station had previously had an active redundancy system, which had since been changed to a passive one. After some delay and a call for help to the support team, the playlist was loaded on the Backup PC, and then the red button was used to switch the video signals. The result was that the audio signals were switched as expected, but the video signals to the preview monitors and the video mixer didn’t switch. It turned out to be a problem in the video router: the input to the video mixer was not the output from the red button, but the output taken directly from the Main server. This error was found and corrected about half an hour after the show had ended. So this show (a current affairs program) was done without the ability to play a single video clip. At least they had some live guests in the studio, who got more airtime than they had ever dreamt of…
So the problems were first wrong expectations from a user and then a single point of failure. The problem of wrong user expectations is at the same time the easiest and the hardest to solve. Each time a commercial flight gets ready for takeoff, the captain checks that the airplane works: both the normal functions that are needed for all flights and a series of backup systems that we rely on in different emergency situations. Had the director checked that there were images and audio from both Main and Backup, the director would have found out that the redundancy had been changed. If the directors were used to doing this, they would also have found out that the red button didn’t do what it was supposed to do, and this problem could have been solved long before they got the problem with the Main video server.
So this should be easy, and if a pilot can do it, then it must also be possible for a director to do it. The problem is that whereas airlines can be grounded if they don’t follow these procedures (or even worse: crash), this negligence usually has no consequences in a broadcasting house. I’m not saying that directors who forget this should be fired, but it is a management task to prioritize these things, and if management fails to do so, then it can’t complain about faults like this.
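To make the idea of such a check concrete, here is a minimal sketch in Python. It is not something the station actually had. The `probe` callable is a hypothetical stand-in for whatever a director (or a measurement device) would do: select Main or Backup, then look at what really arrives at each destination. The destination names and the simulated fault are assumptions for illustration only.

```python
def pre_show_check(probe, destinations):
    """Select each server in turn and verify what reaches every destination.

    `probe(selected, destination)` is a hypothetical callable: it selects
    `selected` ("Main" or "Backup") and returns a tuple
    (signal_present, source_seen) as observed at `destination`.
    Returns a list of problems; an empty list means the check passed.
    """
    problems = []
    for selected in ("Main", "Backup"):
        for destination in destinations:
            signal_present, source_seen = probe(selected, destination)
            if not signal_present:
                problems.append(f"{destination}: no signal with {selected} selected")
            elif source_seen != selected:
                problems.append(
                    f"{destination}: shows {source_seen} with {selected} selected"
                )
    return problems


def faulty_probe(selected, destination):
    # Simulates the fault from the story: the video mixer is fed
    # directly from Main, so it ignores the selection.
    if destination == "video mixer":
        return True, "Main"
    return True, selected


problems = pre_show_check(faulty_probe, ["audio mixer", "video mixer"])
for problem in problems:
    print(problem)
```

The point of the sketch is that the check has to exercise both selections and look at the destinations, not at the servers. Checking only that the Backup server is able to play would not have revealed the routing fault.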
Now back to the red button. The red button turned out to be a single point of failure. It was not the red button’s fault; after all, it did what it was supposed to. The error was that the output from the red button was not routed to the inputs of the image mixer. But that isn’t really relevant. The bottom line is that they pressed the red button and what was supposed to happen didn’t happen.
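A routing error of this kind can be described as a signal path that bypasses the switch. The following Python sketch models the routing as a simple table from each destination to the source feeding it and traces the path backwards. All node names are hypothetical, and a real video router would of course be queried through its own control system rather than a dictionary.

```python
def trace_upstream(routing, destination):
    """Follow a signal path backwards from a destination to its origin.

    `routing` maps each node to the node that feeds it. A node with no
    entry is treated as the start of the path.
    """
    path = [destination]
    while path[-1] in routing:
        upstream = routing[path[-1]]
        if upstream in path:
            raise ValueError(f"routing loop at {upstream}")
        path.append(upstream)
    return path


def bypasses(routing, destination, switch):
    """Return True if the path feeding `destination` does not pass `switch`."""
    return switch not in trace_upstream(routing, destination)


# Intended routing: the mixer input is fed from the red button.
intended = {
    "video mixer input": "router output",
    "router output": "red button",
}

# Routing as it was found: the mixer input is fed directly from Main.
as_found = {
    "video mixer input": "router output",
    "router output": "Main server",
}

print(bypasses(intended, "video mixer input", "red button"))
print(bypasses(as_found, "video mixer input", "red button"))
```

With the intended routing the function reports no bypass; with the routing as it was found, it reports that the mixer input never sees the red button at all.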
Now if we agree that it is not realistic to do a “pre-flight check” of a TV control room before each show, then it is even more important that the redundancy we have can be used without any knowledge of how it works: it must be self-explanatory. In this case there was a red button that switched the audio and video signals, but it wasn’t crystal clear to the users exactly what was switched. This uncertainty was made even greater because the control monitor (the one that showed the name and the time elapsed and remaining of the clip currently playing) had its own auto-sensing function. We will get back to that one. The person who made the error with the video routing definitely didn’t understand or know the concept of the red button.
I suggested that the red button be removed and replaced by four buttons on the video mixer and four faders on the audio mixer. This way they would get rid of a single point of failure, and no knowledge about how to switch to Backup would be required, because it is self-explanatory. The red button probably won’t disappear completely because of the preview monitors. There is never enough space on the big monitors in the control room, so using two spaces for nothing is not really an option. But then the red button would be out of the transmission chain, and if it failed, it would still be possible to broadcast just by using the preview of the image mixer.
Now back to the control monitor that shows the name and the elapsed and remaining duration of the clip currently playing. This is just a PC that monitors what the video playout servers are doing. When we had finally got the right video signal into the image mixer, there were still problems getting the control monitor to show the correct information. Sometimes it showed something very strange, such as a negative remaining duration for the clip that was playing. The cause turned out to be that the Main video playout server was partly working. Had it been completely dead, there would have been no problem, but it was only partly broken. So the information on the control monitor was some sort of mix of information from the two playout servers, which was not useful at all. And since it was supposed to sense automatically which server was playing, nobody knew how to force it to monitor only the Backup server. We found out eventually.
The problem with the control monitor shows that if you have something that switches from Main to Backup and back automatically, then it must be possible to disable the automatic switching and decide manually which source to use. And if the auto-switch makes it particularly difficult to switch manually, you should seriously consider always doing this switching manually. In this case the choice should be easy, since the switching must be done manually anyway for both the video signal and the KVM switch, so everything should be switched manually. Since the red button will be used to switch the video monitoring, it is only natural to let it switch the source for the control monitor as well.
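As an illustration of what an auto-sensing function with a manual override could look like, here is a small Python sketch. It is not how the actual control monitor works, which I don’t know in detail. The class, the field names and the naive auto-sensing rule (“use the first server that claims to be playing”) are all assumptions. The sketch has two properties that the real monitor lacked: the operator can force a source, and the display always takes all its information from one server, never a mix of the two.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClipStatus:
    """Status as reported by one playout server (hypothetical fields)."""
    clip_name: str
    elapsed_s: float
    remaining_s: float
    playing: bool


class MonitorSourceSelector:
    """Chooses which server the control monitor displays."""

    SOURCES = ("Main", "Backup")

    def __init__(self):
        # None means auto-sensing; otherwise the forced source name.
        self.forced: Optional[str] = None

    def force(self, source: str) -> None:
        if source not in self.SOURCES:
            raise ValueError(f"unknown source: {source}")
        self.forced = source

    def auto(self) -> None:
        self.forced = None

    def select(self, main: ClipStatus, backup: ClipStatus):
        """Return (source_name, status), always taken from a single server."""
        statuses = {"Main": main, "Backup": backup}
        if self.forced is not None:
            return self.forced, statuses[self.forced]
        # Assumed naive auto-sensing: first server that claims to be playing.
        for name in self.SOURCES:
            if statuses[name].playing:
                return name, statuses[name]
        return None, None


# A partly working Main still claims to be playing a stale clip.
main = ClipStatus("news_intro", elapsed_s=12.0, remaining_s=30.0, playing=True)
backup = ClipStatus("interview", elapsed_s=5.0, remaining_s=55.0, playing=True)

selector = MonitorSourceSelector()
auto_source, _ = selector.select(main, backup)      # auto-sensing is fooled
selector.force("Backup")
forced_source, forced_status = selector.select(main, backup)
print(auto_source, forced_source, forced_status.clip_name)
```

In the example the auto-sensing is fooled by the partly working Main server, exactly because a half-dead server can still report plausible-looking data. Forcing the source gives the operator a way out that does not depend on the sensing logic being right.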
Finally, there was a problem one day where the KVM switch for the PCs controlling the video playout servers didn’t work. Oops, we just found another single point of failure there, one that might need some thought.