Coding, is? Fun!

Saturday, August 09, 2008

MySql High Availability

I recently had the opportunity to work on a solution for MySQL High Availability. Since I had no idea how MySQL worked and what High Avail meant, it was a crash course for a week. At the end of the week I had a good idea about the “primitives” in database HA solutions. It was, surprisingly, simple to understand.
High Availability solutions are concerned with database services being up most of the time (or the “five 9” SLA – databases to be available 99.999%). Although the specific solutions vary across different vendors (such as Microsoft SQL Server, Oracle and MySQL), there is a common thread or a set of concepts cutting across vendors.
Any database solution is a service that runs different threads servicing connection requests. This service runs on a physical or virtual server. The service accesses data from a data storage. Usually, the data storage is in the file system. But it can also be part of a Storage Area Network (SAN). SANs are used in large enterprises and allow network devises to be used as local file systems. Either way, the service need not be aware of this distinction. As far as a service is considered, the data storage is local.
A normal production system, that does not have high availability may use a single database server that runs the service which accesses the local file system. If the service fails, someone needs to be notified. The System Administrator (SA) would then restart the service and check what went wrong. This means for a variable time, the database is down and so is the application.

MySQL Replication

For high availability, you necessarily need at least two servers or “nodes”. You can have more than two, but two is minimum. If one server or service fails, the other takes over.
This seems simple – once a server goes down, the SA can have the second server take over.
But there are two problems with this scenario
- The data in the first server needs to be available in the second. If it is not, the second server has older data and in a commercial system, that is unacceptable.
- The application is still pointing to the first server. We need to change the IP of the second server to take the place of the first, so that the application can continue to function. This is called Automatic IP failover

To solve the first problem, MySQL uses the Master/Slave configuration or MySQL Replication. When MySQL is configured, you create on server as a Master and the other as a Slave (there can be multiple slaves). MySQL Master internally takes care of replicating data at frequent intervals to the Slave. Thus, when the Master fails, you have most of the data in the Slave. There may be a small delat missing after the last synchronization, but that may usually be accaptable, depending on the replication time interval.
Note that MySQL handles this replication, not a separate service.
So, at this point, the SA gets notified on Master failure. SA converts Slave to Master and manually switches the IP of the Slave. Application can now start working while the SA fixes the Master.

Note: With Master/Slave, in websites which have mostly read-only access, you can have the Slave handle most of the reads, while the Master handles the reads and writes. This requires an application level change (like changing connection strings in PHP or .NET). Thus there is a level of load-balancing you can achieve.

General parameters for HA

Thus, you can see that running a High Availability solution needs you to understand four important parameters. Ask yourselves these four questions when looking at a clustering solution from any database vendor:
1. How do I get alerted when a service goes down? Most data centers have service-independent Alerting systems that would page a system administrator. There is a Linux service called the Heartbeat that can monitor the operation of any service in Linux.
2. Does the recovery or fail-over need manual intervention? If no, what is the time interval for which the service is broken? As you can see in Master/Slave, the answer is Yes, we do require manual intervention so that a SA can configure the Slave to be the new Master.
3. Is the data available completely after failure recovery? In Master/Slave, we already saw that there is a short loss of data. But this can be resynchronized manually using transaction logs in the Master.
4. How many nodes (servers) do I need to run a HA system? This corresponds to the infrastructure cost. Master/Slave can run with two nodes, while MySQL Clustering needs as many as 5 nodes to run a HA system. Of course this is a tradeoff, because the system recovery in a MySQL cluster does not require manual intervention and can happen in less thatn 3 seconds in case of failure.

There are other options for HA in MySQL such as MySQL+Heartbeat+DRBD and MySQL Clustering. I encourage you to explore these and apply the above four questions to each solution. In fact, you can use these for judging SQL Server and Oracle clustering too.

Labels: ,

0 Comments:

Post a Comment

<< Home