Tuesday, May 22, 2012

MySQL Cluster 7.2 achieves 4.3BN reads per minute

Previously we announced that MySQL Cluster 7.2.5 achieved 1.05BN reads per minute on an 8-node configuration using Intel Xeon X5670 CPUs. We now got the chance to try out MySQL Cluster 7.2 on a bigger cluster as well, using the new Intel Xeon generation, the Intel Xeon E5-2670. Using these machines we achieved a new record of 4.3BN reads per minute (72M reads per second) with 30 data nodes.

Running benchmarks is always interesting since one never really knows where the bottleneck will appear. In this benchmark the actual bottleneck was that the benchmark programs themselves couldn't drive more transactions per second. So we have good reason to expect that we can improve these numbers even further in coming MySQL Cluster releases.

These numbers really attest that the new MySQL Cluster 7.2 generation is a powerhouse when it comes to performance.

Monday, May 21, 2012

Intel Xeon E5-2670 gives 56% higher throughput than Intel Xeon X5670

Intel provided us with a chance to try out the new Intel Xeon E5-2670 in a large cluster set-up to compare it against the previous Intel generation, the Intel Xeon X5670. We compared two similar MySQL Cluster set-ups with 4 data nodes, where each node was configured as in a realistic production set-up. With the Intel Xeon X5670 we achieved 9.77M reads per second in this set-up, and with the Intel Xeon E5-2670 we achieved 15.2M reads per second. A healthy 56% speedup.

The speedup comes from three factors. The first factor is that the Intel Xeon E5-2670 has 33% more cores (8 vs. 6). The second factor is that more cores give us greater flexibility in configuring MySQL Cluster, and we can actually configure 50% more threads per data node. The third factor is that each Intel Xeon E5-2670 core is actually faster than an Intel Xeon X5670 core, even though it runs at a lower clock frequency.

Wednesday, May 16, 2012

MySQL Cluster 7.2.7 achieves 1BN update transactions per minute

In MySQL Cluster there has been a limiting factor in the receive threads that caps our update performance at about 0.5M update transactions per node group per second (usually 2 nodes per node group). In MySQL Cluster 7.2.7 we have removed most of this bottleneck and can now achieve 3x as many update transactions, reaching about 1.5M updates per node group per second. On a 30-node configuration we achieved 19.5M update transactions per second, which corresponds to 1.17BN updates per minute. This means we achieve an almost linear increase of update performance all the way to 30 data nodes.

The benchmarks were executed using the dbt2-0.37.50 benchmark scripts available at dev.mysql.com; the benchmark program is the flexAsynch program mentioned in some of my earlier blogs. We used 8 LQH threads per data node.

Monday, May 14, 2012

Challenges in reaching 1BN reads and updates per minute for MySQL Cluster 7.2

In an earlier blog we described the general high-level idea of how to achieve 10x better performance in MySQL Cluster 7.2 compared to MySQL Cluster 7.1.

Naturally, development is never as straightforward as the high-level view makes it look. In this blog I'll mention a few of the most important roadblocks we met and resolved on the path to the improved performance of MySQL Cluster 7.2.

Initially, when we increased the number of LQH threads from 4 to 16, we only saw scaling up to 8 LQH threads and no further scaling when going to 16 LQH threads. This was very puzzling since we don't really have any mutexes that should be an issue. We nevertheless looked into the mutexes that we had and managed to decrease the number of conflicts on the send mutexes by a factor of 15. This did, however, not improve performance at all.

Next we noted, using oprofile, that there were a few functions in which, for some reason, 50% of the CPU time was spent. This was quite surprising, and given that the exactness of those measurements is not always 100%, I was very suspicious about those numbers. Eventually, however, the reason dawned on me.

The reason was that I had some cache lines that were updated far too often. This led to some instructions taking several microseconds to execute since all threads were serialised on updating these cache lines.

The first such instance was a piece of code used to check whether send buffers were overloaded. In case a send buffer is more than 75% full, we start rejecting new client requests to ensure that already ongoing requests are able to complete. This is implemented using a bitmap with one bit per node we're communicating with. The bitmap is global, and it was updated every time we made a remote send to another node. Updating it on every send is obviously quite unnecessary; it's enough to update it when the state changes, so a simple if-statement resolved that problem.
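
To illustrate the nature of the fix, here is a minimal C++ sketch (not the actual NDB code; all names and sizes are made up) of the pattern of only writing the shared bitmap when a node's overload state actually changes:

```cpp
// Sketch: avoid dirtying the shared cache line on every remote send by only
// writing the overload bitmap on an actual state transition.
#include <atomic>
#include <cstdint>

constexpr unsigned MAX_NODES = 256;                        // illustrative limit
std::atomic<std::uint64_t> g_overload_bitmap[MAX_NODES / 64];

void set_overload_state(unsigned node_id, bool overloaded)
{
  std::atomic<std::uint64_t>& word = g_overload_bitmap[node_id / 64];
  const std::uint64_t mask = 1ULL << (node_id % 64);
  const bool currently_set =
      (word.load(std::memory_order_relaxed) & mask) != 0;

  if (currently_set == overloaded)
    return;                                                // no change: no write

  if (overloaded)
    word.fetch_or(mask, std::memory_order_relaxed);        // mark node overloaded
  else
    word.fetch_and(~mask, std::memory_order_relaxed);      // clear the flag
}
```

The point is simply that the cache line holding the bitmap is no longer invalidated in every other core on every remote send.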

The next problem was even harder to understand as a potential bottleneck. It turned out that the problem resided in our crash information subsystem. We have a macro called jam() (Jump Address Memory, an acronym we inherited from the AXE system once upon a time). This macro records the line number we're currently executing, and sometimes also the block number. When the MySQL Cluster data nodes were single-threaded, this data structure was global and shared by all blocks.

With the move to a multithreaded architecture we changed this to one data structure per thread, such that we can get detailed information on what each thread did before any crash.

Most blocks execute in only one thread, and for those it was fairly straightforward to make this change. However, in one case we have a code path used by all TC threads to ask the distribution handler which nodes and threads contain the data for a certain partition of the MySQL Cluster. The distribution handler is thus one block called from many threads, and so we needed to use different jam buffers depending on which thread called the function. Until this was done, this code showed up as another bottleneck since many threads again tried to update the same cache lines.
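
As a rough illustration of the principle (the real jam() implementation differs, and a block shared by several threads needs to pick the buffer of the calling thread, as described above), a per-thread trace buffer can look something like this:

```cpp
// Sketch of a per-thread crash-trace buffer in the spirit of jam(): each
// thread records the source line it last executed into its own ring buffer,
// so threads never write to each other's cache lines.
#include <cstdint>

struct JamBuffer {
  std::uint32_t lines[1024];   // ring buffer of recently executed line numbers
  std::uint32_t next = 0;
};

thread_local JamBuffer t_jam_buffer;   // one buffer per thread

#define jam_sketch()                                              \
  do {                                                            \
    JamBuffer& jb = t_jam_buffer;                                 \
    jb.lines[jb.next] = static_cast<std::uint32_t>(__LINE__);     \
    jb.next = (jb.next + 1) & 1023;                               \
  } while (0)
```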

With those fixes we were able to reach very good numbers on a single node with up to 16 LQH threads. However, we saw that the risk of running out of send buffer memory had increased severely due to the great increase in the number of threads in the data node. The threads communicate using a lock-free scheme, but this means that there needs to be dedicated memory for each pair of threads that communicate. Also, the code doesn't always use the send buffer memory in the most efficient manner, in order to speed up communication. This meant that we needed to do something about send buffer memory handling to make the data nodes as stable as before. We found three points in the code where we needed to pack send buffer memory; none of these are part of the normal code path, but they are vital to ensure that we pack things when we get close to running out of send buffer memory. We also went through all send buffer memory configuration defaults and parameters and made them more appropriate for larger data node configurations as well.

As a final step towards getting single-node performance to work really well, we also made sure that all global data structures were properly aligned to cache line boundaries.

Putting the code to the test in a distributed environment also revealed a few new points to handle. At first we discovered that the free list of connections to the data node in the API had an interesting impact on the balance of TC thread usage. For some reason, still not entirely clear, a LIFO queue here meant that we used some TC threads up to 10x more than others, which obviously made for very bad scalability with many TC threads. The solution was simple, however: a quick change to a FIFO queue and the problem was no longer there.
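
A hedged sketch of the change, with made-up names, could look like this; the essential point is only that released connection records are reused in the order they were released instead of the most recently released one being reused over and over:

```cpp
// Sketch: keep the free list of API connection records as a FIFO rather than
// a LIFO, so connection records (and thereby the TC threads they map to) are
// cycled through evenly.
#include <deque>

struct ConnectionRecord;  // placeholder for the API's connection record

class ConnectionFreeList {
 public:
  void release(ConnectionRecord* con) {
    free_.push_back(con);                   // FIFO: released records go to the back
  }
  ConnectionRecord* seize() {
    if (free_.empty()) return nullptr;
    ConnectionRecord* con = free_.front();  // ...and are reused from the front
    free_.pop_front();
    return con;
  }
 private:
  std::deque<ConnectionRecord*> free_;
};
```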

The next problem was yet another imbalance, this time on the LQH threads, and it only showed up on very large clusters. This time the imbalance came from the manner in which we distribute rows into partitions. In order to make on-line reorganisation of tables very efficient, we divide the table into a number of virtual partitions using a hashmap, and the virtual partitions are then mapped to real partitions. Previously the hashmap always created 240 virtual partitions, which was quite sufficient with 4 LQH threads but not when moving to 16 LQH threads. So we changed to using 3840 virtual partitions instead.

Actually the choice of 240 and 3840 is intricate here. 3840 is equal to 2**8 * 3 * 5. This means that if using 12 LQH threads (2**2 * 3), the number of nodes in the cluster should be of the form 2**n or 5 * 2**n. If not, things will still work fine, but the load will be slightly unbalanced since the real partitions will contain different numbers of virtual partitions. Using 16 LQH threads, the number of nodes should be of one of the forms 2**n, 3 * 2**n, 5 * 2**n or 3 * 5 * 2**n, where n is less than or equal to 4. From this calculation we get that 30 nodes is better than 32 nodes when using 16 LQH threads, since it brings about a more even distribution of rows in the cluster.
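
A small, purely illustrative program confirms the arithmetic: with 3840 virtual partitions and 16 LQH threads per node, 30 nodes divide the virtual partitions evenly while 32 nodes do not:

```cpp
// Check how evenly 3840 virtual partitions spread over the real partitions
// (nodes * LQH threads) for 30 and 32 data nodes.
#include <cstdio>

int main() {
  const int virtual_partitions = 3840;
  const int lqh_threads = 16;
  const int node_counts[] = {30, 32};
  for (int nodes : node_counts) {
    const int real_partitions = nodes * lqh_threads;
    std::printf("%d nodes: %d real partitions, %d virtual per real, remainder %d\n",
                nodes, real_partitions,
                virtual_partitions / real_partitions,
                virtual_partitions % real_partitions);
  }
  return 0;
}
```

With 30 nodes every real partition gets exactly 8 virtual partitions; with 32 nodes the remainder of 256 means that half the real partitions carry 8 virtual partitions and the other half only 7.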

With these changes, read performance in a distributed environment was quite good and scaled very well. Updates, however, still caused issues.

The problem for updates was that the receive thread handling signals between the nodes in the same node group was overloaded. There were simply too many signals to handle. Much of this was due to work that had been left as a TODO after the 7.0 release since it wasn't an issue in 7.0; with the higher throughput it had now become one.

So the solution was to properly implement packing of signals such that several commit messages are packed together, and similarly for the commit acknowledgements from the API. These straightforward changes decreased the load on the receive thread to a third and made it possible to push through 1.5M updates per second per node group. It is still possible for this to become a bottleneck, but it should be extremely unusual for a real-world application to reach this state.
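
The following is only a rough sketch of the batching idea, with illustrative names and sizes rather than NDB's actual signal formats: commit words destined for the same node are accumulated and sent as one packed signal instead of one signal each.

```cpp
// Sketch: accumulate commit words per destination node and send them as one
// packed signal once the buffer is full (or when later flushed explicitly).
#include <cstddef>
#include <cstdint>
#include <vector>

struct PackedCommitSender {
  static constexpr std::size_t kMaxWordsPerSignal = 25;   // illustrative cap
  std::vector<std::uint32_t> pending;                     // commits for one node

  template <typename SendFn>
  void add_commit(std::uint32_t commit_word, SendFn send_packed_signal) {
    pending.push_back(commit_word);
    if (pending.size() >= kMaxWordsPerSignal) {
      send_packed_signal(pending);   // one signal now carries many commits
      pending.clear();
    }
  }
};
```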

With all these changes implemented we managed to scale update and read performance linearly up to 30 nodes.

Friday, May 11, 2012

History of MySQL Cluster architecture development

With the release of MySQL Cluster 7.2.5 we've released a version of MySQL Cluster where each data node is capable of using up to 51 threads in total. This is a remarkable feat for a product that only a few years ago was purely single-threaded. This article tries to explain what it is in the MySQL Cluster architecture that makes it scale so well to new HW.

The MySQL Cluster architecture is based on a design that is used in the world's best-selling telecom switch, the AXE switch. This switch is used in all mobile networks delivered by Ericsson, and MySQL Cluster also originates from Ericsson. The architecture of the AXE switch was developed in reaction to an older switch that had quality issues because of too many global variables and too little modularity. The architecture was inspired by how HW is designed: in HW design, each module can only communicate with other modules using signals (mostly electrical). The idea of the AXE architecture was to apply the same principle to software. The basic concepts of this architecture are blocks and signals. Blocks are the software modules (similar to HW modules). Each block is self-contained and contains all the data and code it needs. There is no data shared with other blocks, so the only way to get access to data in another block is to send a signal to that block requesting the data.

When the storage engine of MySQL Cluster, NDB Cluster, was designed, the AXE architecture was used, so the NDB DBMS software contains a set of blocks (currently a little more than 20) that communicate with each other using signals.
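
As a toy illustration of the concept (this is not NDB code, just the shape of the model), each block owns its state and only reacts to signals that arrive on its queue:

```cpp
// Sketch of the block-and-signal model: no block touches another block's
// state directly; data only moves via signals placed on the receiver's queue.
#include <cstdint>
#include <queue>

struct Signal {
  std::uint32_t gsn;        // "global signal number" identifying the request
  std::uint32_t data[25];   // fixed-size payload, as in a HW-style message
};

class Block {
 public:
  void send(Block& receiver, const Signal& sig) { receiver.inbox_.push(sig); }
  void run() {                        // executed by whichever thread owns the block
    while (!inbox_.empty()) {
      execute(inbox_.front());
      inbox_.pop();
    }
  }
 protected:
  virtual void execute(const Signal& sig) = 0;  // block-specific signal handling
 private:
  std::queue<Signal> inbox_;          // the only way data enters the block
};
```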

This is the first reason why it is so easy to make MySQL Cluster scale to many CPUs. Since blocks only communicate with each other through signals, it's trivial to move blocks between different threads, unless they use signals that execute immediately and depend on returning to the sender block in the same state as when the signal was sent. NDB has a number of blocks that cannot be split apart in this way: LQH (Local Query Handler, which handles communication between data storage and indexes and also handles scans), TUP (data storage), ACC (hash index), TUX (ordered index) and some parts of TSMAN, PGMAN and LGMAN (tablespace, page and log management for disk data). There is, however, no such dependency between TC and the Local Data Manager (LDM) blocks. The handling of asynchronous replication is also in a separate block that can easily be moved to any thread, and a number of schema management blocks have been separated into their own threads using some special locking techniques. Thus, already from the functional division, we get a separation into the LDM domain, the TC domain, the SPJ domain, the schema management domain and the asynchronous replication domain. Currently TC and SPJ are colocated, although this isn't really necessary; TC and SPJ are extremely easy to scale since each transaction is independent of the others. This gives us 4 thread domains, and each of these can be placed in its own set of threads.

In MySQL Cluster 6.3 and previous versions, everything was done in one single thread, and this configuration is still supported since it has nice real-time characteristics when everything is placed in one thread. In MySQL Cluster 7.0, the LDM blocks were separated from the rest of the blocks into separate threads. In addition, the schema management blocks and the TC and SPJ blocks were separated into one domain, and asynchronous replication was separated into its own thread domain. Finally, given that all of these threads require communication with the outside world, we also created a separate receive thread that receives data on TCP sockets, converts the data into prepackaged signals that the blocks can execute, and puts them onto a queue to the thread that will execute them. We spent considerable effort in making the implementation of communication between threads extremely efficient; this communication is entirely lock-free and relies on memory barriers and careful ordering of writes and reads. On x86 machines this part is partly implemented in assembler to make use of the optimal instructions.
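
The principle behind the lock-free inter-thread communication can be sketched with a simple single-producer/single-consumer ring buffer using acquire/release atomics; the real NDB job buffers are considerably more elaborate, so treat this only as an illustration of the idea:

```cpp
// Sketch: a single-producer/single-consumer ring buffer that carries items
// between two threads without locks, using only atomic loads and stores.
#include <atomic>
#include <cstddef>

template <typename T, std::size_t N>
class SpscRing {
 public:
  bool push(const T& item) {                       // called only by the producer
    const std::size_t head = head_.load(std::memory_order_relaxed);
    const std::size_t next = (head + 1) % N;
    if (next == tail_.load(std::memory_order_acquire))
      return false;                                // full
    buf_[head] = item;
    head_.store(next, std::memory_order_release);  // publish the new item
    return true;
  }
  bool pop(T& item) {                              // called only by the consumer
    const std::size_t tail = tail_.load(std::memory_order_relaxed);
    if (tail == head_.load(std::memory_order_acquire))
      return false;                                // empty
    item = buf_[tail];
    tail_.store((tail + 1) % N, std::memory_order_release);
    return true;
  }
 private:
  T buf_[N];
  std::atomic<std::size_t> head_{0};
  std::atomic<std::size_t> tail_{0};
};
```

One such buffer is needed per direction for every pair of communicating threads, which is also why the memory dedicated to inter-thread communication grows with the thread count, as discussed in the 7.2 post above.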

The LDM domain used yet one more technique to distribute its blocks onto several threads. It makes use of the partitioning that NDB Cluster already supports; the REDO log in NDB Cluster was also already separated into 4 log parts, and thus it was straightforward to create 4 LDM instances where each block is replicated in a different thread. Each LDM instance takes care of one part of the data stored in the data node. From the outside the protocol is still the same, although the address of a block now also contains a thread id in addition to the node id and block id. The block reference already contained enough space for this purpose, so no change to the block code was required to handle it.
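
Purely as an illustration of why no extra space was needed (this is not NDB's actual bit layout), a 32-bit block reference can carry all three identifiers like this:

```cpp
// Sketch: pack node id, thread/instance id and block number into one 32-bit
// block reference. The field widths here are invented for the example.
#include <cstdint>

using BlockReference = std::uint32_t;

inline BlockReference make_ref(std::uint32_t node, std::uint32_t instance,
                               std::uint32_t block) {
  return (node & 0xFFF) | ((instance & 0xFF) << 12) | ((block & 0xFFF) << 20);
}
inline std::uint32_t ref_node(BlockReference r)     { return r & 0xFFF; }
inline std::uint32_t ref_instance(BlockReference r) { return (r >> 12) & 0xFF; }
inline std::uint32_t ref_block(BlockReference r)    { return (r >> 20) & 0xFFF; }
```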

In MySQL Cluster 7.2 we have advanced the threading yet one more step. The first step was to separate the schema management domain from the TC and SPJ domain. Then, given that TC can be partitioned to handle different transactions using a simple round-robin allocation scheme, we also separated the TC domain into multiple threads. In principle there is no limit to how many TC threads we could use; we pick a number appropriate to the CPU requirements of the TC domain. We selected 16 as the maximum since we've found no scenario where more TC threads are required. We also extended the number of LDM instances to a maximum of 16 by extending the possible number of log parts to 16.
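
The round-robin scheme itself is trivial; a hedged sketch of how a new transaction could be handed to the next TC instance in turn might look like this:

```cpp
// Sketch: hand each new transaction to the next TC instance in turn, which is
// why the TC domain scales so easily with more threads.
#include <atomic>
#include <cstdint>

std::uint32_t pick_tc_instance(std::uint32_t num_tc_threads) {
  static std::atomic<std::uint32_t> counter{0};
  return counter.fetch_add(1, std::memory_order_relaxed) % num_tc_threads;
}
```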

The next step was to create multiple receive threads. This was relatively easy as well, by partitioning the threads to handle different sockets (one socket is used to communicate with each other node in the cluster).

Another step we took was to separate the send calls from the block threads. Special send threads can thus be used; this increases latency slightly but improves throughput, and it also minimizes the risk of bottlenecks occurring in the data nodes. We set the maximum number of send threads to 8, and similarly for receive threads. Normally 3 threads should suffice for even the biggest configurations, to some extent depending on the efficiency of the OS read and write syscalls.
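
A rough sketch of the division of labour, with all names invented for the example: block threads only append to per-node send buffers, while a send thread drains them and performs the actual socket writes.

```cpp
// Sketch: a send-thread iteration that flushes whatever the block threads
// have queued per node. A real implementation would call send()/writev()
// and deal with partial writes; this only shows the split of responsibility.
#include <vector>

struct NodeSendBuffer {
  int socket_fd;                       // socket to the remote node
  std::vector<char> pending;           // bytes queued by the block threads
};

void send_thread_iteration(std::vector<NodeSendBuffer>& buffers) {
  for (NodeSendBuffer& buf : buffers) {
    if (!buf.pending.empty()) {
      // ::send(buf.socket_fd, buf.pending.data(), buf.pending.size(), 0);
      buf.pending.clear();             // assume the whole buffer was sent
    }
  }
}
```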

Thus what we can see is that the choice of a block and signal architecture has made it possible for us, with very small means, to bring data node scalability from 1 CPU onwards to 8, and now onwards to at least 40. Even more can be achieved. MySQL Cluster is thus based on a SW architecture which can easily accommodate itself to new HW as it is developed. It has also, from the first codeline, had the concept of efficient communication in its genes. The first prototype of NDB Cluster, developed in 1994, used two SPARCstations interconnected with Dolphin SCI technology. The current benchmarks we've executed used InfiniBand 56Gb/s interconnects, where the latency of communication between different machines is extremely short.

The next generation of HW is likely to revolutionize handling of memory. MySQL Cluster is well prepared for this as well.

Finally, a small historical anecdote. In the 1990s the HW vendors had a race to deliver GHz CPUs. The x86 CPU sold in early 1990 was the 80486, which ran at 25MHz. In 1999 the race was over when the first GHz processor was delivered. We released MySQL Cluster 7.2 two months ago running at 17.6MHz (17.6M reads per second, with reads standing in for clock cycles). We're soon announcing the next major improvement to this number. So will the 2010s be the decade where the race is on for which DBMS can first deliver a GHz of query execution? MySQL Cluster is well prepared for such a challenge.