This article goes a bit deeper into some practical details about server clusters.
If you have not read the series from the beginning, it is a good idea to start with the first article, Redundancy to solve all problems.
A server cluster normally consists of two servers that can take over tasks from each other. These two servers are called cluster nodes. Seen from the outside, the two nodes can look like two independent servers with their own IP addresses and network names, but in addition the cluster has a number of virtual network names and IP addresses. A virtual IP address "stays" on whichever node is currently running a given service, so the systems that connect to the cluster do not have to worry about the fact that it is really two servers. The two nodes can also share storage; in that case they need a common controller and disk system (the disk system could also be a SAN). The setup becomes quite expensive, but it also has some advantages which make it well suited for broadcast applications.

A service that we want on a redundant platform (an SQL Server, for example) will in fact only run on one node at a time. So in our redundancy terms, it is just a passive redundant system. If Main stops working, the cluster service will detect this and start the SQL Server on Backup.
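To make the idea of the virtual name concrete, here is a small, purely illustrative sketch. It is not how Windows Failover Clustering is actually implemented; it only shows the principle that the virtual name always resolves to whichever node currently owns the service, and that a failover simply moves that ownership. All names in the example are made up.

```python
# Toy model of a two-node cluster with a virtual network name.
# Illustration only; a real cluster service does far more than this.

class Cluster:
    def __init__(self, nodes, virtual_name):
        self.nodes = nodes                # e.g. ["NODE-A", "NODE-B"]
        self.virtual_name = virtual_name  # e.g. "SQLCLUSTER01"
        self.active = nodes[0]            # node currently running the service

    def resolve(self, name):
        """Clients only ask for the virtual name; it points at the active node."""
        if name == self.virtual_name:
            return self.active
        raise LookupError(name)

    def failover(self):
        """Move the service (and with it the virtual name) to the other node."""
        self.active = next(n for n in self.nodes if n != self.active)


cluster = Cluster(["NODE-A", "NODE-B"], "SQLCLUSTER01")
print(cluster.resolve("SQLCLUSTER01"))   # NODE-A
cluster.failover()                        # Main stops; service starts on Backup
print(cluster.resolve("SQLCLUSTER01"))   # NODE-B, but clients keep using the same name
```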
Apart from this, we have the possibility to move services manually from one node to the other, and the downtime (the time it takes for the switch to happen and for the service to start on the new node) can be very short. But there will be downtime, and any application depending on a service that runs on the cluster will be disconnected. So the application must know that it should try to reconnect if it is disconnected, and be patient enough to let the service start on the other node.
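As an illustration of what "patient enough" can mean, here is a minimal reconnect loop. It assumes the pyodbc package and a SQL Server ODBC driver are installed; the server name SQLCLUSTER01 (the cluster's virtual network name), the database name and the timing values are invented for the example and would have to match the real environment.

```python
import time
import pyodbc  # assumes pyodbc and a SQL Server ODBC driver are installed

# Hypothetical connection string: SQLCLUSTER01 is the cluster's virtual network
# name, so the client does not care which physical node is active right now.
CONN_STR = ("DRIVER={ODBC Driver 17 for SQL Server};"
            "SERVER=SQLCLUSTER01;DATABASE=Playout;Trusted_Connection=yes")

def connect_with_retry(retries=30, delay=10):
    """Keep retrying long enough for the service to start on the other node."""
    for attempt in range(1, retries + 1):
        try:
            return pyodbc.connect(CONN_STR, timeout=5)
        except pyodbc.Error:
            # Connection refused or dropped: the service may be in the middle
            # of failing over, so wait a bit and try again.
            time.sleep(delay)
    raise RuntimeError("Could not reconnect within the expected failover time")
```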
Passive redundancy means that the efficiency of the system (the ratio between the capacity we actually get and the capacity of the hardware we bought) is 50%, because there will always be a backup server doing nothing. Wasting half of your investment like this is the price we pay for passive redundancy. A small element of load sharing can be added if you need to run two services that can both run on the cluster, and where it makes sense to divide the services over the two nodes. Should one node fail, both services will end up running on the other node, so the services should not be so demanding that this is impossible. An example could be an SQL Server on one node and storage with sound files on the other. Normally everything runs fast, but in case of a problem the users will have to accept that things are a bit slower until the problem is solved.
Cluster systems are not the solution to all problems. As in any other passively redundant system, "something" needs to decide whether we run on the first or the second node, and in this case it is the cluster service. The cluster service is a Windows service, and it can introduce errors of its own. I have seen examples where the virtual network name for a clustered service stopped working, and it required a restart of both nodes before this was fixed. Both servers could still be accessed individually by their physical names, but the virtual name (which the services using the cluster depended on) did not work. So this is an example where the cluster itself introduced an error that we would not have had if we had just been using a stand-alone server.
Do you need a cluster system or not? That depends on whether you need to be able to switch nodes with practically no downtime. If something is wrong on your SQL Server, it is in many cases faster to switch nodes than to restart the SQL Server on the existing node. This saves you from always having to do service and maintenance at night. But there are alternatives, such as replicating the database to another SQL Server. The other SQL Server is then ready to log on to; it just requires that the clients know they must log in to a different place. You could also use a trick in your own DNS, if you run one, so that the name the clients use is simply pointed at the other server.
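A simple way to let the clients "know they must log in to a different place" is to give them an ordered list of servers to try. The sketch below makes the same pyodbc assumption as before; the server names SQL-MAIN and SQL-REPLICA are invented for the example.

```python
import pyodbc  # same assumption as above: pyodbc plus a SQL Server ODBC driver

# Hypothetical server names: the replica is kept up to date by replication,
# and the client simply knows both places it may have to log in to.
SERVERS = ["SQL-MAIN", "SQL-REPLICA"]

def connect_to_first_available(database="Playout"):
    """Try the primary first, then fall back to the replicated SQL Server."""
    last_error = None
    for server in SERVERS:
        try:
            return pyodbc.connect(
                f"DRIVER={{ODBC Driver 17 for SQL Server}};"
                f"SERVER={server};DATABASE={database};Trusted_Connection=yes",
                timeout=5,
            )
        except pyodbc.Error as exc:
            last_error = exc  # this server is unreachable, try the next one
    raise last_error
```

The DNS trick mentioned above achieves the same thing without touching the clients: the name they connect to is simply repointed at the replica.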
Something similar can be said about storage. It is nice to be able to access the same storage through another server, but what if it is the disk system itself that has the problem? Then you may be better off having an independent disk system somewhere else, preferably in another physical location.
Finally, you should also remember that a cluster can degrade. If you have only used one node for a long time, and some problem occurs and the cluster service wants to switch nodes, there is a big risk that the other node does not work. There can be all sorts of reasons for that. I have experienced a node that had got a problem with its AD membership, so it had to be joined to the domain once more. Or that someone had enabled special debug log files for a specific service; later all of this was forgotten and the folder for the log files was deleted. Unfortunately this meant that the service could not run, but since the folder was deleted while this particular node was the backup, nobody noticed until the other node had a different problem one day, and then the service could not run on either node. And it took a while to figure out what the problem was, because the business with the debug folder had been forgotten long ago.
For reasons like this, clusters have got the reputation (and it is not completely fair) that when you finally need the redundancy, it does not work. But it is not the cluster's fault that the AD suddenly has a problem with a node, or that some human mistake prevents a service from running. The only way to avoid such problems is to do regular failovers on the cluster. This gives you the opportunity to discover and solve problems before they turn into a disaster. Because if a service is running on one node, and you fail it over to the other node and it cannot start there, the cluster service will fail it back to the first node immediately. Then it is clear to you that there is a problem, but the system is still running and you have the chance to fix it.
But it must be done regularly. If you choose to do it manually, do it at least once a week, and make a note in your calendar so that you do not forget it. I have implemented automatic failovers, started by the Windows Task Scheduler, where it is monitored whether the failover happens and whether the service is failed back by the cluster service. If it is failed back, the system monitoring software will detect it and raise an alarm to the surveillance team. Such a scheduled failover is placed outside prime time, but it should not be in the middle of the night if that means there is nobody there to fix a problem.
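As a rough sketch of what such a scheduled check can look like, the script below asks the cluster to move a role and then verifies which node owns it afterwards. It assumes the FailoverClusters PowerShell module is available on the node it runs on; the role and node names are made up, and the waiting time would have to match how long your service normally takes to start. A non-zero exit code is what the Task Scheduler job and the monitoring software would react to.

```python
import subprocess
import sys
import time

# Hypothetical names: adjust to your own cluster role and node names.
ROLE = "SQL Server (MSSQLSERVER)"
TARGET_NODE = "NODE-B"

def powershell(command):
    """Run a PowerShell command (assumes the FailoverClusters module is installed)."""
    result = subprocess.run(
        ["powershell", "-NoProfile", "-Command", command],
        capture_output=True, text=True,
    )
    return result.stdout.strip()

# Ask the cluster to move the role to the other node ...
powershell(f"Move-ClusterGroup -Name '{ROLE}' -Node '{TARGET_NODE}'")
time.sleep(60)  # give the service time to start, or to be failed back

# ... then check which node actually owns the role now. If the service could not
# start on the target node, the cluster fails it back to the original node.
owner = powershell(f"(Get-ClusterGroup -Name '{ROLE}').OwnerNode.Name")
if owner.lower() != TARGET_NODE.lower():
    print(f"Failover to {TARGET_NODE} did not stick (owner is {owner})", file=sys.stderr)
    sys.exit(1)  # non-zero exit so the monitoring software can raise an alarm
```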