Monday, October 08, 2012

An architect's view of the MySQL Development model

When I stepped into my current job as Senior MySQL Architect in the summer of 2007, one of the first things on the agenda for discussion was the development model. We had released 5.0 almost 2 years earlier and we were still more than a year from releasing MySQL 5.1. The model we used for developing MySQL 5.0 and MySQL 5.1 didn't work as well as we wanted. So we had many discussions on how to improve this model.

A quick step 5 years forward to today: we are now using the milestone release model and are just about to release the second GA version that has been developed using this model.


Differences in old and new development model

So what is the main difference between the new development model and the old development model?

I think the most important attribute of our current model is that the development tree always has to be in a releasable state every 2-4 months, and this releasable state has to be possible to bring to a GA version with around 6 months of stabilisation. Our definition of a releasable state is that the quality is at Release Candidate level. The old model effectively only required a releasable state when the version was reaching Release Candidate status, since new features were still being added during the beta stage, so a releasable state was only reached every 2 years instead of every 2-4 months.




Impact on new code in MySQL

So what does this mean for how we need to develop new code in MySQL? It means that new code must be added in fairly small chunks; more importantly, it means that the new code cannot be too intermingled with old code, since that would introduce too many new bugs. So if we need to add a new feature in an area where the complexity of the MySQL Server is high, then we first need to reengineer this area before we can proceed with pushing the actual new feature. So in a sense the highest priority is always given to maintaining a stable development tree. In the past, getting a new feature into the MySQL Server could sometimes have higher priority than the stability of the development tree.

So interestingly, by prioritizing tree stability before new features, we are able to develop more new features! But we are not necessarily able to develop specific new features as quickly as we would sometimes want. If a new feature requires tampering with complex MySQL Server code, then it might require a number of preparatory projects before it can actually be turned into a real feature. But anything that can be done outside of the complex parts of the MySQL Server can be added quite quickly.


Development of major features in new model

How does one handle the addition of major new features in this model? The main consideration is that all new code must be fairly well structured. So when adding a new feature we can't simply start coding away; we first need to consider whether the current code structure can be used to develop this new feature. As an example, one of the last things added in MySQL 5.6 was a split of LOCK_open. If we had tried to add this feature in earlier versions of the MySQL Server, we would have had a new feature that touches many functional parts of the MySQL Server, and it would have introduced a significant set of new bugs to fix before any new GA version could have been released. Instead we went about things differently. First, a number of reengineering projects isolated the main role of LOCK_open to only protecting metadata operations on tables and the caching of metadata in connections. When these steps were completed, about halfway through the MySQL 5.6 development, it was very straightforward to make the actual split of LOCK_open in a fairly small patch which was a safe change.

Inspiration to new development model

Our selection of this new development model obviously was inspired by lessons from other successful open source projects such as the Linux kernel.


Change from old to new model

The change from the old model to the new model was probably the most difficult part. At the time of the change we had a new development tree which had some fairly significant new features implemented in it. But the tree was very far away from getting to a releasable state, so the tough decision was taken to restart development with a new tree based on the MySQL 5.1 development tree. This tree was the start of development using the new milestone model. The features from the old development tree were moved over to the new development tree step by step. This process actually took a few years since there are very high quality requirements on new features making it into the new development tree. But by now we have moved all the features from the old development tree that we intend to move; some of the features have been dropped or simply reimplemented completely.

Impact on future MySQL development

So what does this mean for current and future MySQL development? In MySQL 5.6 we added around 200 new features, so obviously the new development model leads to a high pace of new interesting changes that will benefit our users and customers. We expect that the development model will continue to benefit us and that we will continue delivering new interesting features at a high pace to the MySQL community. As an architect I am proud to see how the MySQL development team continuously improves the MySQL architecture, thus making it possible to add new features at an even higher pace.

Thursday, October 04, 2012

My personal list of new features in MySQL 5.6

This blog post simply lists and describes a few of the most interesting new features in MySQL 5.6. The list is simply my preference based on my development interests and what I think are very interesting features in general.

In MySQL we develop new features using worklogs; the development is divided into milestones, and the MySQL 5.6 development was divided into 6 different milestones. In total there were around 200 worklogs developed in MySQL 5.6, which is a quite impressive set of new features. Naturally there is also a ton of bug fixes in MySQL 5.6, but I won't go into those here.

In this blog I will only mention 30-40 of those worklogs, but even this is a fairly large number of interesting features.

I will discuss features in areas of interest.

Partitioning

Let's start with partitioning. We developed partitioning originally as a 5.1 feature. Now development is mostly handled by Mattias Jonsson and we meet regularly to discuss new bugs and features in partitioning. MySQL 5.6 saw a couple of significant new partitioning features. With MySQL 5.6 it is now possible to import and export partitions to and from tables as a very quick operation (except when you import and want to check rows for correctness as part of the import). This means you can insert your data into a single table and then decide to move the table into a partitioned table after a while. This feature can be used for a variety of things.

We have also worked hard on improving performance with many partitions. We have fixed a number of performance issues in the algorithms, and we have improved the locking part such that we only lock those partitions that are actually used in the query. This improvement required some reengineering effort in the MySQL Server which is beneficial to many parts other than partitioning. Another thing we added is that it is now possible to specify in queries which partitions one wants to use, thus partitions can in a sense be treated as tables in queries if need be.

Optimizer

The next area is the optimizer. Many new things have happened here and I am not the main expert in this area, but even from my view the improvements in subquery handling, index condition pushdown, persistence of optimizer stats (both in the server and in InnoDB), new algorithms for multi-range reads focused on disk sweeps and an improved sort function are very interesting, and many of them give extreme performance improvements for certain queries. A feature I also think is very important is the optimizer trace; this feature came up a while ago in a reengineering project and it's nice to see it completed now. It gives anyone the ability to understand exactly how the optimizer makes its decisions and thus also makes it possible for both users and developers to understand how complex queries can be improved.

The new ability to use EXPLAIN also on UPDATE/DELETEs is obviously a very important addition to users of MySQL.

Replication

The next area is replication, which contains a large number of functional features that will make it easier to use. There are a number of improvements that I consider very interesting. The first is the improved scalability of binlog writing, which makes the scalability balanced such that we're scalable both for replicated and non-replicated scenarios. The introduction of GTIDs is also a very important feature that will make life much easier for developers of various sharding and HA solutions based on MySQL replication. Finally, of course, the parallel slave applier is another important addition to the scalability of MySQL replication. Many new MySQL 5.6 features make it easier to use MySQL as a component in larger systems.

MySQL Server internals

In the MySQL Server I also of course take an interest in the split of LOCK_open and the work on metadata locks. These changes are very significant for MySQL Server scalability. There are also many new features in the performance schema area, making it possible to monitor MySQL in numerous new ways not possible in the past. A major effort was also made in the area of on-line ALTER TABLE changes. Now there is an API in the MySQL Server to make any change on-line if the storage engine supports it, and InnoDB specifically supports on-line add index.

InnoDB

Given my background in developing MySQL Cluster, and my general interest in everything around performance, it's not surprising that almost everything that happens in the InnoDB storage engine is also something I consider interesting.

InnoDB has developed extensively in MySQL 5.6 both in terms of scalability and functionality.

In the area of scalability InnoDB has seen improvements of:
- RW-locks for buffer pool hash
- Split of kernel mutex
- Multi-threaded purge
- Reduced contention on log mutex
- Increased max size of REDO log files
- Improvements to InnoDB thread concurrency feature
- Read-only transaction optimisations

Furthermore InnoDB also has new major features such as fulltext search, preloading of the buffer pool, separate tablespaces for the UNDO log, flexible page sizes, memcached support for InnoDB and transportable tablespaces.

Finally, InnoDB has improved stability of performance through improved flushing, and it has new low-level monitoring (InnoDB metrics).

As a final thing we have also updated default configurations to make it even easier to manage MySQL Servers.

So MySQL 5.6 is a release that will be very useful for both new and old MySQL users.

Wednesday, October 03, 2012

Scalability improvements in MySQL 5.6

At MySQL Connect last weekend, MySQL 5.6 was released as a Release Candidate. It contains a large set of improvements over MySQL 5.5. In this blog I will focus on what has changed in the area of scalability.

There are four main areas of scalability improvement in MySQL 5.6:
1) Splitting InnoDB Kernel mutex
2) The G5 discovery
3) Splitting LOCK_open mutex
4) Binlog group commit

In InnoDB, as in so many other products, there was an infamous kernel mutex. This mutex was a sort of catch-all mutex used to protect the system from concurrent use of various data structures. For an implementer it was easy enough to use the kernel mutex as a way to achieve safe concurrency. The problem of course is that it hurts scalability. The kernel mutex was a major bottleneck in many benchmarks, and even more importantly, it also made it very difficult to resolve other scalability issues.

Already early in the MySQL 5.6 development, an effort in the InnoDB development team led by Sunny Bains took the kernel mutex and split it into around 10 new mutexes. So effectively the kernel mutex was previously used for 10 different things. Now, through the split, each new mutex protects only one logical entity. This naturally had a great impact on scalability in the MySQL 5.6 release, but it will also make future development easier. Since each new mutex protects only one logical entity, it is much easier to see how to improve the handling of each of those 10 new mutexes when they become scalability issues. The new mutex that has the most contention now is the lock mutex; this mutex mainly protects the metadata of tables.
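
To make the idea concrete, here is a small illustrative sketch (invented names, not the actual InnoDB code) of what splitting a catch-all mutex into per-entity mutexes looks like: operations on unrelated subsystems no longer serialise on the same lock.

    #include <mutex>

    // Before: one catch-all mutex serialises every kernel operation.
    struct KernelBefore {
      std::mutex kernel_mutex;          // taken for locks, transactions, purge, ...
      void register_trx() { std::lock_guard<std::mutex> g(kernel_mutex); /* ... */ }
      void grant_lock()   { std::lock_guard<std::mutex> g(kernel_mutex); /* ... */ }
    };

    // After: one mutex per logical entity, so the two operations above
    // no longer contend with each other.
    struct KernelAfter {
      std::mutex trx_sys_mutex;         // list of active transactions
      std::mutex lock_sys_mutex;        // lock data structures
      std::mutex purge_sys_mutex;       // purge state
      // ... roughly ten mutexes in total, one per subsystem
      void register_trx() { std::lock_guard<std::mutex> g(trx_sys_mutex);  /* ... */ }
      void grant_lock()   { std::lock_guard<std::mutex> g(lock_sys_mutex); /* ... */ }
    };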

The G5 discovery refers to a name we gave a patch internally. We made several experiments after splitting the kernel mutex to see which of the remaining mutexes would make most sense to fix next. So some experimental patches were developed that entirely removed a number of mutexes (obviously not a fully working MySQL version, but sufficiently working for a simple benchmark). The puzzle we met was that whatever we removed, we had no success in improving performance. We looked at this problem using various tools without finding the root cause. Finally we used oprofile and the new Linux perf tool to see which code lines we spent time in. We eventually discovered the issue. The problem was that every time a row was read in InnoDB we updated a statistics counter. In the case of Sysbench we have a number of scans where each scan hits 100 rows, so for each Sysbench transaction we read around 410 rows. Modern 4-socket servers are fast, but if one tries to update the same variable from all CPUs at once, then the cache line in which this variable resides will bounce back and forth between the CPUs in the system. The systems we used for testing this could handle between 2-4 million updates per second of the same variable when updates were applied from CPUs residing in different sockets. So effectively this statistics counter kept the number of rows read in a MySQL Server down to 2-4 million reads per second, or converted to Sysbench transactions, we could not deliver more than 5,000-10,000 TPS. This variable has no major effect in MySQL 5.5 since 5.5 has other scalability issues that hide this issue. But in MySQL 5.6 we solved so many scalability issues that this variable became a major bottleneck. A simple solution would have been to just remove this line, but we developed a more proper solution that we also used in many other similar areas where this could potentially become an issue. Finding and resolving this issue was a team effort made possible by the InnoDB team, the MySQL Server team and the MySQL performance experts. So the development resources available in the Oracle MySQL organisation make it possible to continue scaling MySQL towards new heights.
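
A minimal sketch of the usual cure for this kind of hot global counter (not necessarily the exact fix that went into the server) is to give every thread its own cacheline-aligned slot and only sum the slots when the statistic is actually read:

    #include <atomic>
    #include <cstdint>

    constexpr int kMaxThreads = 128;

    struct alignas(64) PaddedCounter {        // one cacheline per slot: no false sharing
      std::atomic<uint64_t> value{0};
    };

    PaddedCounter rows_read[kMaxThreads];

    inline void count_row_read(int thread_id) {
      // Each thread touches only its own cacheline, so the line never bounces
      // between sockets the way a single shared counter does.
      rows_read[thread_id].value.fetch_add(1, std::memory_order_relaxed);
    }

    uint64_t total_rows_read() {              // rarely called, e.g. when showing status
      uint64_t sum = 0;
      for (const auto &c : rows_read) sum += c.value.load(std::memory_order_relaxed);
      return sum;
    }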

Another sort of kernel mutex is the LOCK_open mutex in MySQL. This mutex used to be the mutex protecting the metadata of a table within the MySQL Server. It was however used for many other things as well. This meant that splitting this mutex required a number of reengineering projects before we were ready to perform the actual split. Already early in the MySQL 5.6 development the ground work was finalised to perform the actual split through the work of the MySQL runtime team. The idea for the actual split came in a development meeting in Trondheim where Dmitry Lenev and I quickly came up with an idea to protect TABLE objects separately through an array of mutexes and keep LOCK_open only for creating new TABLE objects and other activities around table metadata changes.

Finally this idea also became a new addition to the MySQL 5.6 release. Standing on the shoulders of the G5 discovery and the split of the kernel mutex, the split of the LOCK_open mutex made the jump in performance very significant. We were able to improve performance by 70% and more in a number of benchmarks based on the LOCK_open split alone.

What we have discovered with MySQL 5.6 and its benchmarking is that the standard Sysbench has a limiting factor. The problem is that Sysbench only uses 1 table; this means that any protection of this single table will be part of the scalability issues in running Sysbench. Since we assume that most, if not all, applications don't direct all queries towards one table, we also started benchmarking MySQL 5.6 using 2 or more tables in Sysbench. In this case we avoid bottlenecks related to the use of only a single table.

In MySQL 5.6 we also decided to spend a serious effort on improving replication scalability. We were happy with the improvements of scalability in MySQL 5.5, but saw a need to also move more attention to the replication area. We set out with a modest goal of scaling replication to 16 CPUs. We actually managed to scale replication on the server side as far as the MySQL Server itself scales, to 48 CPUs. We managed this even with sync_binlog set to 1! So in MySQL 5.6 the binlog writing flies. Also the slave side saw major improvements in scalability in MySQL 5.6.

The problem related to writing of the binlog was that we only had one mutex protecting those writes and all of their phases. This meant, among other things, that this mutex also protected the file writes to the binlog and even the syncing of those writes to disk.

In order to solve this issue Mats Kindahl and I experimented with a number of approaches to discover what worked best. We eventually came up with a sort of queueing system. So writes to the binlog were divided into phases. The first phase happens as part of transaction execution (as it does in MySQL 5.5) where binlog writes are gathered in a data structure operated by the connection. Thus there is no concurrency issue at all in this phase. The next step is to write this data into the file buffers operated by the operating system through write system calls. The next step is the sync of the writes to the disk. The last step is the commit that happens in the storage engines. Each of those steps is handled similarly but with minor differences.

Each step has two phases, a queue phase and an execution phase. The queue and the execution phase are protected by separate locks. Whenever a connection arrives at a step it grabs the queue mutex and places itself in the queue. If it is the first to arrive in the queue, it also tries to grab the execution mutex after releasing the queue mutex. This means that while it waits for the execution mutex, more connections can place their data in the queue; they will then wait for the execution owner to signal them when the work is done (this could happen after more than one step has been executed). When the execution mutex is acquired, the execution mutex owner also grabs the queue mutex, takes over the queue, sets the queue to empty and releases the queue mutex. After this it executes all tasks that were in the queue.
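
A rough sketch of this queue-plus-execution pattern, with invented names and many details left out (the real MySQL 5.6 code is considerably more involved), could look like this:

    #include <atomic>
    #include <mutex>
    #include <vector>

    struct Transaction {
      std::atomic<bool> done{false};   // set by the leader once the batch is processed
      // ... the binlog data gathered during statement execution would live here
    };

    struct CommitStep {
      std::mutex queue_mutex;
      std::mutex exec_mutex;
      std::vector<Transaction *> queue;

      // Phase 1: join the queue; the first thread into an empty queue is the leader.
      bool enqueue(Transaction *trx) {
        std::lock_guard<std::mutex> g(queue_mutex);
        const bool leader = queue.empty();
        queue.push_back(trx);
        return leader;
      }

      // Phase 2 (leader only): snatch the whole queue under the queue mutex so that
      // followers can keep queueing behind us, then do the step's work (file write,
      // sync or engine commit) once for the entire batch under the execution mutex.
      template <class Work>
      void execute_batch(Work &&work) {
        std::lock_guard<std::mutex> exec(exec_mutex);
        std::vector<Transaction *> batch;
        {
          std::lock_guard<std::mutex> g(queue_mutex);
          batch.swap(queue);
        }
        work(batch);
        for (Transaction *t : batch) t->done.store(true);   // signal the followers
      }
    };

    // Caller side, simplified: followers never touch the file or the disk themselves.
    //   if (step.enqueue(&trx)) step.execute_batch(write_and_sync_batch);
    //   else wait until trx.done (the real code blocks rather than spinning).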

The final commit step can be parallelised if desired, thus multiple connections can commit simultaneously. The only limitation to this is when a concurrent hot backup is happening at the same time.

What is interesting with this architecture is that it is flexible to where the bottleneck is. For some workloads the bottleneck is in the commit phase, in this case it makes sense to parallelise this part which is possible. For others it is the sync phase which is the bottleneck and for yet other workloads the writing to the buffer is the bottleneck. Wherever the bottleneck resides this architecture makes the best of the situation and scales MySQL 5.6 replication to new heights.

So with this impressive set of new improvements in MySQL 5.6, what is left to do? As usual, when one develops a new feature or performance improvement one also finds out about a ton of other things one wants to improve. So the list of ideas is not empty and there are some pretty interesting scalability improvements lined up for MySQL 5.7 as well. We currently have access to machines with 96 CPU threads and bigger machines are likely to appear; we also cooperate with other parts of Oracle where even larger systems are possible. So our aim continues to be keeping MySQL scalable on commodity servers. Naturally we have noted that many of our customers are sharding their data set to handle even higher workloads. A long career in distributed systems has taught me that it is extremely important to do proper partitioning of data sets to achieve scalability at the network level. But it is still extremely important to make each node in the distributed system as large as possible. This decreases maintenance activities and minimises issues in sharding by minimising the number of shards. We are pursuing this effort both in the MySQL Cluster area and in the MySQL area where InnoDB is used as a storage engine.

Tuesday, June 05, 2012

LOCK_open finally removed as a bottleneck in MySQL 5.6

From the very first day at MySQL, the LOCK_open mutex has been an infamous bottleneck. The road to removing it as a bottleneck has been a long one. Already in the MySQL 5.4 beta release we had some first improvements of LOCK_open. Other preparations were done in the solution of bug #901 with the introduction of MDL (metadata locking). Finally, some preparations were done in early MySQL 5.6 DMRs, where LOCK_open was removed as a mutex for a lot of activities around opening and closing of files where it wasn't really needed.

During an internal meeting in Trondheim at the end of 2010 I sat down with Dmitry Lenev and we came up with a solution where LOCK_open is removed as a bottleneck. So finally this year we finalised this development and made it available as part of the MySQL 5.6 June 2012 labs release.

LOCK_open was used to protect a number of data structures related to the TABLE_SHARE object that contains a number of attributes about the table. It was also used to protect the cache of TABLE objects and a number of data structures related to it.

So what do we do in order to fix this bottleneck? The first analysis is that most queries only need read access to the TABLE_SHARE object and they need one instance of a TABLE object. So if we have a set of TABLE objects then these can be used in parallel, and there is no reason to have global data structures protecting the caches of the TABLE instances.

So the first step to fixing LOCK_open is to introduce a new set of TABLE cache instances. Each such instance contains a list of cached TABLE objects. Each TABLE cache instance also has a reference to the TABLE_SHARE object. So this means that we can find both a TABLE_SHARE object and a TABLE object by using one of those TABLE cache instances.

So through this set of TABLE cache instances it isn't necessary to use LOCK_open as long as there are free TABLE objects in the TABLE cache instance used. LOCK_open is still needed to create new TABLE and TABLE_SHARE objects, but these objects can be reused any number of times once they are created. When performing DDL operations it's necessary to lock both LOCK_open and all TABLE cache instances.

So this means that we have replaced LOCK_open with a set of mutexes, each protecting one TABLE cache instance. The number of TABLE cache instances is configurable; we used 16 instances in the published benchmarks. We also added statistics counters for the number of TABLE cache hits and misses in a MySQL Server.
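
A minimal sketch of the idea (invented names, far simpler than the real server code) is shown below; the essential point is that each instance has its own mutex and its own lists of free TABLE objects, so grabbing a cached TABLE never touches LOCK_open.

    #include <mutex>
    #include <string>
    #include <unordered_map>
    #include <vector>

    struct TABLE;                       // a per-connection instance of an open table

    struct Table_cache_instance {
      std::mutex mutex;                 // replaces LOCK_open for this instance only
      std::unordered_map<std::string, std::vector<TABLE *>> free_tables;  // keyed by table name

      TABLE *get_table(const std::string &name) {
        std::lock_guard<std::mutex> g(mutex);
        auto it = free_tables.find(name);
        if (it == free_tables.end() || it->second.empty())
          return nullptr;               // miss: fall back to the slow path under LOCK_open
        TABLE *table = it->second.back();
        it->second.pop_back();
        return table;
      }

      void release_table(const std::string &name, TABLE *table) {
        std::lock_guard<std::mutex> g(mutex);
        free_tables[name].push_back(table);
      }
    };

    // The number of instances is configurable (table_open_cache_instances in 5.6);
    // a connection maps to one instance, e.g. by its connection id, and then normally
    // only contends with the other connections mapped to that same instance.
    Table_cache_instance table_cache[16];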

We have seen at least a 70% improvement based on just adding this new feature, and in some cases the gain can probably be even bigger. By using 2 tables in Sysbench we can grow performance even further, something that made no difference while LOCK_open was still a bottleneck, and with this the performance impact of removing LOCK_open can be even larger than 100%.

MySQL 5.6 makes leapfrog in performance

I blogged about how the April 2012 labs release made 5.6 improve performance of the Sysbench OLTP RO benchmark by more than 50%. Now we have made an even more significant improvement of MySQL 5.6 performance in the June 2012 labs release available at labs.mysql.com.

We've been able to improve Sysbench OLTP RO performance by a massive 270% going from MySQL 5.6.5 to the MySQL 5.6 June 2012 labs release. The benchmarks were made on an internal pre-release of the June labs release.

Also for Sysbench OLTP RW we made a very significant breakthrough in performance, the performance increase is 182% here.

The tests were made on a "monster"-box we have available at Oracle with 8 CPU sockets; each socket has an Intel Xeon 7540 processor with 6 cores and 2 threads per core. Thus a total of 96 CPU threads are available in the machine. The CPU frequency was 2.00 GHz and the available memory was 512 GB. The disks used in the benchmark were SSDs. The benchmarks were executed using the dbt2-0.37.50 scripts available on the dev.mysql.com site. We will upload the latest changes to those scripts early next week. The MySQL Server was locked to use up to 64 CPU threads in the benchmarks executed.

So what is the cause of such a massive performance improvement? Well, actually it is the result of all the 5.6 changes made over a long period of time. There has been a great number of significant changes in the InnoDB engine in the 5.6 development. Most notable of those are the changes by Sunny Bains to split the InnoDB kernel mutex into 10 different mutexes. Also, Inaam Rana made significant changes to the buffer pool mutexes to improve these. There have also been numerous other changes in InnoDB; one notable such change, which Dimitri Kravtchuk has been instrumental in benchmarking and suggesting changes to, is the new adaptive flushing algorithm. Check Dimitri's blog for more meat on the bones of that change. Most of the InnoDB changes were in the MySQL 5.6.5 release as well, so why the large jump in performance now?

Part of the reason is that there was a cache line that became very hot in the MySQL Server, as mentioned in an earlier blog. Removing this bottleneck alone made performance jump by more than 50%. However, there were also other bottlenecks that kept performance from jumping. Most notable of those was the LOCK_open mutex in the MySQL Server part. We have finally now removed this as the main bottleneck in the server. I will describe these changes more in a coming blog. So the impact of the important changes in MySQL 5.6.5 was hidden by the cache line problems and LOCK_open problems, and once those were removed we could see this leapfrog in performance, making MySQL 5.6 the best MySQL release ever.

With LOCK_open removed as a bottleneck it also made sense to see how much performance was affected in Sysbench by only having one table in the benchmark. So we did the benchmarks mentioned above using 2 tables instead of one. Using 3 tables had no further impact on performance, but going to 2 tables means that we split one of the mutexes in the data dictionary into two, the InnoDB index mutex is split into two, and finally there are two root pages of InnoDB indexes instead of one. We think it's fair to say that most real-life applications would use more than one table. The second table improved performance by about 25-30%.

Tuesday, May 22, 2012

MySQL Cluster 7.2 achieves 4.3BN reads per minute

Previously we announced that MySQL Cluster 7.2.5 achieved 1.05BN reads per minute on an 8-node configuration using Intel Xeon X5670 CPUs. We got the chance to try out MySQL Cluster 7.2 now also on a bigger cluster using the new Intel Xeon generation, Intel Xeon E5-2670. Using these machines we achieved a new record of 4.3BN reads per minute (72M reads per second) with 30 data nodes.

Running benchmarks is always interesting since one never really knows where the bottleneck appears. In this benchmark the actual bottleneck was that the benchmark programs didn't have the ability to drive through more transactions per second. So we have good expectations we can improve those numbers even further in coming MySQL Cluster releases.

These numbers really attest that the new MySQL Cluster 7.2 generation is a real powerhouse when it comes to performance.

Monday, May 21, 2012

Intel Xeon E5-2670 gives 56% higher throughput than Intel Xeon X5670

Intel provided us with a chance to try out the new Intel Xeon E5-2670 in a large cluster set-up to compare it against the previous Intel generation, the Intel Xeon X5670. We compared two similar set-ups for MySQL Cluster with 4 data nodes, each node was configured in a realistic production set-up. With Intel Xeon X5670 we achieved 9.77M reads per second in this set-up and with Intel Xeon E5-2670 we achieved 15.2M reads per second. A healthy 56% speedup.

The speedup comes from three factors. The first factor is that the Intel Xeon E5-2670 has 33% more cores (8 vs. 6). The second factor is that more cores give us greater flexibility in configuring MySQL Cluster, and we can actually configure 50% more threads per data node. The third factor is that each Intel Xeon E5-2670 core is actually faster than the Intel Xeon X5670 cores although they execute at a lower clock frequency.

Wednesday, May 16, 2012

MySQL Cluster 7.2.7 achieves 1BN update transactions per minute

In MySQL Cluster there is a limiting factor in the receive threads that limits our update performance to about 0.5M update transactions per node group per second (usually 2 nodes per node group). In MySQL Cluster 7.2.7 we have removed most of this bottleneck and can now achieve 3x as many update transactions. We're reaching about 1.5M updates per node group per second. On a 30-node configuration we achieved 19.5M update transactions per second which corresponds to 1.17BN updates per minute. This means we achieve almost linear increase of update performance all the way to 30 data nodes.

The benchmarks were executed using the benchmark scripts dbt2-0.37.50 available at dev.mysql.com, the benchmark program is the flexAsynch program mentioned in some of my earlier blogs. We used 8 LQH threads per data node.

Monday, May 14, 2012

Challenges in reaching 1BN reads and updates per minute for MySQL Cluster 7.2

In an earlier blog we've described the general high-level idea of how to achieve 10X better performance for MySQL Cluster 7.2 compared to MySQL Cluster 7.1.

Naturally the development is never as straightforward as the high-level view makes it look. In this blog I'll mention a few of the most important roadblocks on the path to improved performance of MySQL Cluster 7.2 that we met and resolved.

Initially when we increased the number of LQH threads from 4 to 16 we only saw scaling to 8 LQH threads and no further scaling when going to 16 LQH threads. This was very puzzling since we didn't really have any mutexes that should be an issue. However we looked into the mutexes that we had and managed to decrease the number of conflicts on the send mutexes by a factor of 15. This did however not improve performance at all.

Next we noted, using oprofile, that there were a few functions in which for some reason 50% of the CPU time was spent. This was quite surprising, and given that the exactness of those measurements is not always 100%, I was very suspicious about those numbers. Eventually, however, the reason dawned on me.

The reason was that I had some cache lines that were updated too often. This led to some instructions taking several microseconds to execute, since all threads were serialised on updating these cachelines.

The first such instance was a piece of code used to check whether send buffers were overloaded. In case the send buffer is more than 75% loaded we start rejecting client requests to ensure that already ongoing requests are able to complete. This is accomplished using a bitmap with one bit per node we're communicating with. This bitmap is obviously global, and it was updated every time we made a remote send to another node. It was quite unnecessary to update it every time; it's enough to update it when the state changes, so a simple if-statement resolved that problem.
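
As an illustration of that fix (a sketch, not the actual NDB code): reading the shared cacheline is cheap from any number of CPUs, so we read first and only write when the state really changes.

    #include <atomic>
    #include <cstdint>

    std::atomic<uint64_t> overloaded_nodes{0};   // one bit per node id; shared and hot if abused

    void update_overload_state(int node_id, bool overloaded) {
      const uint64_t bit = 1ULL << node_id;
      // Read first: reads don't bounce the cacheline between sockets, writes do.
      const bool currently_set =
          (overloaded_nodes.load(std::memory_order_relaxed) & bit) != 0;
      if (currently_set == overloaded)
        return;                                  // common case on every send: no write at all
      if (overloaded)
        overloaded_nodes.fetch_or(bit, std::memory_order_relaxed);
      else
        overloaded_nodes.fetch_and(~bit, std::memory_order_relaxed);
    }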

The next problem was even harder to understand. It turned out that the problem resided in our crash information subsystem. We have a macro called jam() (Jump Address Memory, an acronym we inherited from the AXE system once upon a time). This macro records the line number we're currently executing, and sometimes also the block number we're executing in. When there was only one thread in the MySQL Cluster data nodes, this data structure was global and shared by all.

With the move to a multithreaded architecture we changed this to be a data structure per thread, such that we can get detailed information on what each thread did before any crash.

Most blocks execute in only one thread, so it was fairly straightforward to change this for them. However, in one case we have a code path which is used by all TC threads to ask the distribution handler which nodes and threads contain the data for a certain partition of the MySQL Cluster. The distribution handler is thus one block called from many threads, so we needed to use different jams depending on which thread called this function. Until this was done, this code showed up as another bottleneck since many threads again tried to update the same cachelines.
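
A simplified sketch of the per-thread jam() idea (the real macro records more than line numbers and predates C++11 thread_local, so treat this as illustration only):

    #include <cstdint>

    constexpr int kJamEntries = 1024;

    struct alignas(64) JamBuffer {
      uint32_t entries[kJamEntries];
      uint32_t next = 0;
    };

    thread_local JamBuffer jam_buffer;           // one buffer per thread, never shared

    inline void jam_line(uint32_t line) {
      JamBuffer &jb = jam_buffer;
      jb.entries[jb.next] = line;
      jb.next = (jb.next + 1) & (kJamEntries - 1);   // cheap wrap-around (power of two)
    }

    #define jam() jam_line(__LINE__)

    // For a block executed by many threads, such as the distribution handler
    // above, the key point is simply that jam() must resolve to the calling
    // thread's buffer, which a per-thread buffer gives us directly.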

With those fixes we were able to reach very good numbers on a single node with up to 16 LQH threads. However we saw that the risk of running out of send buffer memory had increased severely due to the great increase in the number of threads in the data node. The threads communicate using a lock-free scheme, but this means that there needs to be dedicated memory available for each pair of threads that communicate. Also, the code doesn't always use the send buffer memory in the most efficient manner to speed up communication. This meant that we needed to do something about send buffer memory handling in order to make the data nodes as stable as before. We found three points in the code where we needed to pack send buffer memory; none of these were part of the normal code path, but they were vital to ensure that we packed things in cases where we got close to running out of send buffer memory. We also went through all send buffer memory configuration defaults and parameters and made them more appropriate for larger data node configurations.

As a final step towards getting single node performance to work really well, we also made sure that all global data structures were properly aligned on cacheline boundaries.

Putting the code to the test in a distributed environment also revealed a few new points to handle. At first we discovered that the free list of connections to the data node in an API node had an interesting impact on the balance of the use of TC threads. For some reason, still not exactly clear, a LIFO queue here had the effect that we used some TC threads up to 10x more than other TC threads, which obviously made for very bad scalability with many TC threads. The solution was simple however: a quick change to a FIFO queue and the problem was no longer there.

The next problem was yet one more imbalance, this time on the LQH threads. The imbalance only showed up on very large clusters. This time the imbalance came from the manner in which we distribute rows into partitions. In order to make on-line reorganisation of tables very efficient we divide the table into a number of virtual partitions using a hashmap, and then the virtual partitions are mapped to real partitions. Previously the hashmap always created 240 virtual partitions, which was quite sufficient with 4 LQH threads, but not when moving to 16 LQH threads. So we changed to using 3840 virtual partitions instead.

Actually the choice of 240 and 3840 is intricate here. 3840 is equal to 2 * 2 * 2 * 2 * 2 * 2 * 2 * 2 * 3 * 5. This means that if using 12 LQH threads (2 * 2 * 3), the number of nodes in the cluster should be on the form 2**n or 5 * 2**n. If not, things will still work fine, but the load will be slightly unbalanced since the real partitions will contain different numbers of virtual partitions. Using 16 LQH threads, the number of nodes should be on one of the forms 2**n, 3*2**n, 5*2**n or 3*5*2**n where n is less than or equal to 4. Thus we get from this calculation that 30 nodes is better than 32 nodes if using 16 LQH threads, since it brings about a more even distribution of rows in the cluster.
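
To see why 30 beats 32 here, a tiny back-of-the-envelope check (using the simple model implied above, where a table gets nodes * LQH-threads real partitions and the 3840 virtual partitions are spread as evenly as possible over them):

    #include <cstdio>

    int main() {
      const int virtual_parts = 3840;          // 2^8 * 3 * 5
      const int ldm_threads   = 16;
      const int node_counts[] = {30, 32};
      for (int nodes : node_counts) {
        const int real_parts = nodes * ldm_threads;
        const int lo = virtual_parts / real_parts;                       // smallest share
        const int hi = (virtual_parts + real_parts - 1) / real_parts;    // largest share
        std::printf("%2d nodes: %3d real partitions, %d..%d virtual partitions each%s\n",
                    nodes, real_parts, lo, hi,
                    lo == hi ? " (perfectly balanced)" : " (unbalanced)");
      }
      return 0;
    }

With 30 nodes every real partition gets exactly 8 virtual partitions; with 32 nodes some get 7 and some get 8, so some LQH threads carry noticeably more rows than others.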

With these changes the performance of reads in a distributed environment was quite good and was providing very good scalability. However updates still caused issues.

The problem for updates was that the receiver thread handling signals between the nodes in the same node group was overloaded. There were simply too many signals to handle. Much of this was due to some work that was left as a TODO after the 7.0 release since it wasn't an issue in 7.0, but now, with the higher throughput, it had become an issue.

So the solution was to properly implement packing of signals such that several commit messages were packed together, and similarly for the commit acknowledgements from the API. These straightforward changes decreased the load on the receiver thread to a third and made it possible to push through 1.5M updates per second per node group. It is still possible that this becomes a bottleneck, but it should be extremely unusual for a real-world application to reach this state.

With all these changes implemented we managed to scale update and read performance linearly up to 30 nodes.

Friday, May 11, 2012

History of MySQL Cluster architecture development

With the release of MySQL Cluster 7.2.5 we've released a version of MySQL Cluster where each data node is capable of using up to 51 threads in total. This is a remarkable feat for a product that only a few years ago was purely single-threaded. This article tries to explain what it is in the MySQL Cluster architecture that makes it so scalable to new HW demands.

The MySQL Cluster architecture is based on a design that is used in the world's most sold telecom switch, the AXE switch. This switch is used in all mobile networks delivered by Ericsson. MySQL Cluster also originates from Ericsson. The architecture of the AXE switch was developed in reaction to an older switch that had quality issues since there were too many global variables and too little modularity. The architecture was inspired by how HW was designed. In HW design each module can only communicate with other HW modules using signals (mostly electrical). The idea of the AXE architecture was to use this architecture also for software. The basic concept of this architecture is blocks and signals. Blocks are the software modules (similar to a HW module). Each block is self-contained and contains all the data and code for its needs. There is no shared data with other blocks. So the only method to get access to data in another block is to send a signal to the block requesting the data.

When the storage engine of MySQL Cluster, NDB Cluster was designed, the AXE architecture was used, so the NDB DBMS software contains a set of blocks (currently a little more than 20) that communicate with each other using signals.
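
Purely as an illustration of the programming model (invented names, not the actual NDB classes), a block looks roughly like this: it owns all of its data, its only entry point is a signal-execution function, and its only way of reaching other blocks is to send a signal to a block reference.

    #include <cstdint>

    struct Signal {
      uint32_t gsn;          // "global signal number": which message this is
      uint32_t data[25];     // fixed-size payload, much like a HW-style message
    };

    using BlockReference = uint32_t;   // encodes where the destination block lives

    class Block {
     public:
      virtual ~Block() = default;
      // The only way in: a signal delivered by the scheduler of the owning thread.
      virtual void executeSignal(const Signal &sig) = 0;

     protected:
      // The only way out: queue a signal for another block; never touch its data.
      void sendSignal(BlockReference dst, const Signal &sig);
      // All member variables of a concrete block are private to that block.
    };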

This is the first reason why it is so easy to make MySQL Cluster scale to many CPUs. Since blocks only communicate with each other through signals, it's trivial to move blocks between different threads unless they use signals that execute immediately and depend on returning to the sender block in the same state as when the signal was sent. There are a number of NDB blocks that cannot be separated from each other: LQH (Local Query Handler, handles communication between data storage and indexes, handles scans), TUP (data storage), ACC (hash index), TUX (ordered index) and some parts of TSMAN, PGMAN and LGMAN (tablespace, page and log management for disk data). There is however no such dependency between TC and the Local Data Manager (LDM) blocks, and the handling of asynchronous replication is in a separate block that can easily be moved to any thread. We also have a number of schema management blocks that have been separated into their own threads using some special lock techniques. Thus, already from the functional division, we get a separation into the LDM domain, the TC domain, the SPJ domain, the schema management domain and the asynchronous replication domain. Currently TC and SPJ are colocated although it's not really necessary, but TC and SPJ are extremely easy to scale since each transaction is independent of the others. Thus we have 4 thread domains and each of these can be placed into separate threads.

In MySQL Cluster 6.3 and previous versions everything was done in one single thread, and this configuration is still supported since it has nice real-time characteristics when everything is placed into one thread. In MySQL Cluster 7.0, the LDM blocks were separated from the rest of the blocks into separate threads. In addition the schema management blocks and the TC and SPJ blocks were separated into one domain. Asynchronous replication was also separated into its own thread domain. Finally, given that all of these threads require communication with the outside world, we also created a separate receive thread that receives data on TCP sockets, converts the data into prepackaged signals that the blocks can execute and puts them onto a queue to the thread that will execute the signals. We spent considerable effort in making the implementation of communication between threads extremely efficient; it is entirely lock-free and uses memory barriers and careful ordering of reads and writes. On x86 machines this part is partly implemented in assembler to make use of the optimal instructions.
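
The queues between two communicating threads are of the classic lock-free single-producer/single-consumer kind. A much simplified, hedged sketch in portable C++ (the real job buffers have a different layout and hand-tuned barriers) could look like this:

    #include <atomic>
    #include <cstddef>

    template <class Signal, size_t N>          // N must be a power of two
    class SpscRing {
      Signal buf[N];
      alignas(64) std::atomic<size_t> head{0}; // written only by the consumer
      alignas(64) std::atomic<size_t> tail{0}; // written only by the producer

     public:
      bool push(const Signal &s) {             // producer thread only
        const size_t t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == N) return false;   // full
        buf[t & (N - 1)] = s;
        tail.store(t + 1, std::memory_order_release);    // publish the new signal
        return true;
      }

      bool pop(Signal &out) {                  // consumer thread only
        const size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire)) return false;       // empty
        out = buf[h & (N - 1)];
        head.store(h + 1, std::memory_order_release);
        return true;
      }
    };

Because there is exactly one producer and one consumer per queue, no mutex is needed; the acquire/release pairs on head and tail are what the text above refers to as memory barriers and careful ordering of reads and writes.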

The LDM used yet one more technique to distribute these blocks onto several threads. It makes use of the partitioning that NDB Cluster already supports; the REDO log in NDB Cluster was also already separated into 4 log parts, so it was straightforward to create 4 LDM instances where each block is replicated in a different thread. Each LDM instance takes care of one part of the data stored in the data node. From the outside the protocol is still the same, although the address of a block now also contains a thread id in addition to the node id and block id. The block reference already contained enough space for this purpose, so no change to the block code was required to handle this.

In MySQL Cluster 7.2 we have advanced the threading yet one more step. The first step was to separate the schema management domain from the TC and SPJ domain. Then, given that TC can be partitioned to handle different transactions using a simple round robin allocation scheme, we also separated the TC domain into multiple threads. In principle there is no limit to how many TC threads we could use; we pick a number which is appropriate to the CPU requirements of the TC domain. We selected 16 as the maximum number since we've found no scenario where more TC threads are required. We also extended the number of LDM instances to a maximum of 16 by extending the possible number of log parts to 16.

The next step was to create multiple receiver threads. This was relatively easy as well by partitioning the threads to handle different sockets (one socket is used to communicate with each other node in the cluster).

Another step we also took was to separate the send call from the block threads. Thus special send threads can be used; this increases latency slightly but improves throughput. It also minimizes the risk of bottlenecks occurring in the data nodes. We set the maximum number of send threads to 8 and similarly for receive threads. Normally 3 threads should suffice for even the biggest configuration, to some extent depending on the efficiency of the OS read and write syscalls.

Thus what we can see is that the choice of a block and signal architecture has made it possible for us, with very small means, to bring data node scalability from 1 CPU onwards to 8 and now onwards to at least 40. Even more can be achieved. Thus MySQL Cluster is based on a SW architecture which can easily accommodate itself to new HW as it is developed. It has also from the first codeline had the concept of efficient communication in its genes. The first prototype of NDB Cluster developed in 1994 used two SPARC-stations interconnected with Dolphin SCI technology. The current benchmarks we've executed used Infiniband with 56Gb/s interconnects where latency is extremely short when communicating between different machines.

The next generation of HW is likely to revolutionize handling of memory. MySQL Cluster is well prepared for this as well.

Last, a small historical anecdote. In the 1990's the HW vendors had a race to deliver GHz CPUs. The x86 CPU sold in early 1990 was the 80486 that ran at 25MHz. In 1999 the race was over when the first GHz processor was delivered. We released MySQL Cluster 7.2 two months ago running at 17.6MHz. We're soon announcing the next major improvement to this number. So will the 2010's be the decade where the race is on for which DBMS can first deliver GHz query execution? MySQL Cluster is well prepared for such a challenge.

Tuesday, April 10, 2012

MySQL team increases scalability by >50% for Sysbench OLTP RO in MySQL 5.6 labs release April 2012

A MySQL team focused on performance recently met in an internal meeting to discuss and work on MySQL scalability issues. We had gathered specialists on InnoDB and all aspects of its performance, including scalability and adaptive flushing; we also had participants from MySQL support to help us understand what our customers need, and a number of generic specialists on computer performance and in particular performance of the MySQL software.

The fruits of this meeting can be seen in the MySQL 5.6 labs release April 2012 released today. We have a new very interesting solution to the adaptive flushing problem. We also made a significant breakthrough in MySQL scalability. On one of our lab machines we were able to increase performance of the Sysbench OLTP RO test case by more than 50% by working together to find the issues and then quickly coming up with solutions to them. Actually, in one particular test case we were able to improve MySQL performance by 6x with these scalability fixes.

In this blog I will provide some details on what we have done to improve the scalability of the MySQL Server on large servers.

MySQL has now reached a state where the solutions to scalability are no longer only related to protected regions and their associated mutexes, read-write locks or atomic variables. MySQL scalability is also affected by the type of scalability issues normally found in high-performance computing. When developing MySQL Cluster 7.2 and its scalability enhancements we encountered the same type of problems as we discovered in MySQL 5.6, so I'll describe the type of issues here.

In a modern server there are three levels of CPUs: there are CPU threads, there are CPU cores and there are CPU sockets. A typical high-end server of today can have 4 CPU sockets, 32 CPU cores and 64 CPU threads. Different vendors name these building blocks slightly differently, but from a SW point of view it's sufficient to consider these 3 levels.

One issue with a multi-core SW architecture is the fact that all these 64 CPU threads share the same memory. In order to work with this memory they work with cachelines. Cacheline sizes can vary but on the highest level it's usually 64 bytes for x86 servers. So in order to read or write a particular memory area, it's necessary to read and to write entire cachelines.

The fact that cachelines are bigger than the variables that the SW works on means that several unrelated variables can sometimes end up in the same cacheline. This is called false sharing. It's usually not a problem, but on servers with multiple sockets it can become an issue if the shared cacheline is updated very frequently.

Reading a cacheline from many CPU threads simultaneously is not a problem since each CPU thread will download its own version of the cacheline, and thus there is no real limit to how fast we can read read-only cachelines. The problem arrives when we have a cacheline which is frequently updated. When a cacheline is updated, only one CPU at a time can update it, and all other CPUs that want to update it have to wait in queue until the cacheline is ready for them to update. Given that this occurs on specific cachelines it's very hard to nail down when this waiting occurs.

So in any scalable SW architecture it's important to avoid cachelines that are updated more frequently than around one million updates per second. Beyond 1 million updates of a cacheline per second, the wait times on the cacheline can quickly become significant to application performance.
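
A small illustration of the padding technique this leads to (a sketch, not code taken from the server): if two counters that are updated by different threads share a cacheline, every update by one thread steals the line from the other, so hot variables are given a cacheline of their own.

    #include <atomic>
    #include <cstdint>

    struct NaiveCounters {            // both fields share one 64-byte line: false sharing
      std::atomic<uint64_t> thread_a_count{0};
      std::atomic<uint64_t> thread_b_count{0};
    };

    struct PaddedCounters {           // each field gets its own cacheline
      alignas(64) std::atomic<uint64_t> thread_a_count{0};
      alignas(64) std::atomic<uint64_t> thread_b_count{0};
    };

    static_assert(sizeof(PaddedCounters) >= 128, "one 64-byte line per counter");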

One example where we found such a case was in MySQL Cluster, where we had a bitmap that was updated at each message sent over the network. In 7.1 this was never an issue because there were no cases where we sent millions of messages from a node in the cluster. However in MySQL Cluster 7.2 we scaled up performance by more than 4x for each data node and we hit this wall. At first it was a mystery where the problem came from, but after some more analysis we discovered this bitmap update. The solution was very simple: we made sure that we only updated the bitmap in case it actually changed. This solved the issue, and we could move on and reach sending almost 5M messages per second in one data node, up from previously being limited to 1.8M messages per second.

The time it takes to update a cacheline is dependent on the number of CPU sockets used; with multiple sockets it becomes even more important to ensure that we avoid updating a cacheline too often. So the CPU scalability issues that we've solved in MySQL 5.6 should affect users less if they ensure that the MySQL Server is only executed on one CPU socket.

There are also many other interesting tidbits in MySQL 5.6.5 DMR and the MySQL 5.6 labs release april 2012 which you can digest in a number of other technical blogs.

Wednesday, February 15, 2012

Scalability enhancements of MySQL Cluster 7.2

For MySQL Cluster 7.2 we have worked on increasing the number of CPUs that can be used to execute the workload on the data nodes.

A MySQL Cluster 7.2 data node is functionally divided into 7 thread types.

1) The Local Data Manager threads (ldm)
2) The Transaction Coordinator threads (tc)
3) The Asynchronous Replication threads (rep)
4) The Schema Management threads (main)
5) Network receiver threads (recv)
6) Network send threads (send)
7) IO threads

In the LDM domain we have increased the maximum number of threads from 4 to 16. The LDM contains the actual data, this means that when using 16 threads the data becomes more partitioned (as usual this is all automatic in MySQL Cluster). Each LDM thread maintains its own set of data partitions, index partitions and its own REDO log. The number of LDM parts per data node isn't dynamically changeable. It's however possible to map more than one part onto each LDM thread which provides some flexibility in changing the number of LDM threads.

The TC domain contains the state of ongoing transactions. This means that every new transaction can easily be assigned to a new TC thread. This provides for easy flexibility in growing the number of TC threads to a large number. We've found that in most cases 1 TC thread per 2 LQH threads is sufficient, and in many cases even 1 TC thread per 4 LQH threads. In rare cases with very high update loads it can be required to have 3-4 TC threads per 4 LQH threads. In MySQL Cluster 7.1 it was only possible to have 1 TC thread; now with MySQL Cluster 7.2 it's possible to have up to 16 TC threads. The TC domain also contains the control parts that handle the Adaptive Query Localization introduced in MySQL Cluster 7.2.

The Asynchronous Replication functionality is already in MySQL Cluster 7.1 separated into its own thread and remains so.

The Schema Management domain has been separated from the TC domain to make it possible to scale the number of TC threads. This thread normally has very little load and we've seen no requirements on scaling this domain to more than one thread.

The Network receiver domain was bound to 1 thread in MySQL Cluster 7.1. With the increase of threads in MySQL Cluster 7.2 it is also necessary to scale up the number of receive threads. We have partitioned the work such that each receive thread takes care of one or more sockets used to communicate with other nodes in MySQL Cluster; there is one socket per node we communicate with. In MySQL Cluster 7.2 we can now scale the number of receive threads up to 8 threads.

The Network send domain is a new thread type. In MySQL Cluster 7.1 sending was done directly from the other threads. Sending directly can provide slightly lower latency, but for the highest throughput we have provided the possibility to perform sends from separate send threads. There can be up to 8 send threads. It is also possible to continue without using send threads. We have also been able to improve the overload handling when using send threads.

The final thread type, the IO threads, remains the same in 7.2 as in 7.1. This means that we can have multiple IO threads, either one thread per open file or a fixed number of IO threads that handle the IO traffic. Except when using compression on disk, the IO threads have a very light load.

The configuration of threads can be handled in two ways. The first is to simply set MaxNoOfExecutionThreads to an appropriate number that specifies how many threads we want to run in the data node. The second manner is to use the new ThreadConfig variable where it is possible to set both how many threads of each type to use and also to set the CPUs to bind those threads to.

We're very proud of those scalability enhancements that have made it possible to scale CPU usage per data node to more than 5x of what is possible in MySQL Cluster 7.1. In addition we've removed a number of bottlenecks making it possible to scale per data node performance by even more than 5x. Comparing our benchmark numbers for MySQL Cluster 7.2 we can see that we achieved 2.1M reads per second on 7.1 and we've achieved 17.6M reads per second, both on 8 data node set-ups using the same HW. Thus more than 8X better performance per node for MySQL Cluster 7.2 compared to 7.1.

The flexible configuration means that it is possible to optimise data node set-up to use anything from a single CPU up to a 48 CPU box. Colocating the MySQL Server and a single data node we can even make nice use of a 64 CPU box or even an 80 CPU box. As usual it is also still possible to have multiple data nodes per machine, but this is now required only for very large machines with 4 CPU sockets and more.

As an example of how to make the best use of a 24 CPU box, one can use 8 ldm threads, 4 tc threads, 3 recv threads, 3 send threads and 1 rep thread for asynchronous replication. Each of those threads should be bound to a CPU. The main thread (the schema management domain) and the IO threads can in most installations be bound to the same CPU. Thus we have bound threads to 20 different CPUs. We should also protect these 20 CPUs from interrupts by using the IRQBALANCE_BANNED_CPUS config variable in /etc/sysconfig/irqbalance and setting it to 0xFFFFF0. Running MySQL Cluster generates a lot of interrupt processing and OS kernel processing, and it's a good idea to set aside a number of CPUs to ensure that this activity won't conflict with the MySQL Cluster threads. When booting a Linux kernel one can even provide the option isolcpus=0-19 in grub.conf. This means that the Linux scheduler won't use these CPUs for any task; only by using CPU affinity syscalls can a process be made to run on those CPUs. Using this together with binding the MySQL Cluster threads and banning IRQ processing on those CPUs should provide a very stable performance environment for a MySQL Cluster data node.
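
For reference, the 24-CPU layout above could be expressed with ThreadConfig along these lines. This is a hypothetical config.ini fragment; I have not verified the exact grammar here, so check the MySQL Cluster reference manual before copying it and adjust the CPU numbers to your own topology.

    # Hypothetical fragment for the 24-CPU example above: 8 ldm, 4 tc, 3 recv,
    # 3 send, 1 rep, main on CPU 0 (the IO threads are intended to share CPU 0
    # with the main thread). Verify the ThreadConfig syntax for your version.
    [ndbd default]
    ThreadConfig=main={cpubind=0},ldm={count=8,cpubind=1,2,3,4,5,6,7,8},tc={count=4,cpubind=9,10,11,12},recv={count=3,cpubind=13,14,15},send={count=3,cpubind=16,17,18},rep={count=1,cpubind=19}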

On a 32 CPU box one upgrades the number of ldm threads to 12, increases the number of tc threads to 6 and provides 2 more CPUs for the OS and interrupts. The number of send and receive threads should in most cases still be sufficient.

On a 40 CPU box one can go to 16 ldm threads, 8 tc threads and increase the send and receive threads to 4 each.

On a 48 CPU box it is possible to use larger safety margins: use 12 tc threads instead, set aside 2 more CPUs for the OS and interrupts, avoid putting the IO threads and the main thread on the same CPU, and finally add 1 more receive thread.

You can learn more about all of the enhancements in MySQL Cluster 7.2 from our Developer Zone article.

1.05BN QPM using MySQL Cluster 7.2

We've passed another performance milestone using MySQL Cluster. In 2002 we passed the limit of 1M reads per second. Now we've passed the milestone of 1B reads per minute. We achieved 1.05BN reads per minute on an 8-node cluster using MySQL Cluster 7.2.5.

The benchmark used the latest MySQL Cluster version, available on launchpad and to be released as MySQL Cluster 7.2.5. The benchmark program is flexAsynch. Each read is a transaction consisting of a read of an entire row of 25 attributes, each 4 bytes in size. flexAsynch uses the asynchronous feature of the NDB API which enables one thread to send off multiple transactions in parallel. This is handled similarly to how Node.js works, with registered callbacks that report back when a transaction is completed.

I will in a later blog discuss how an application can be designed to make optimal use of the asynchronous API to enable the application to reach these kinds of performance numbers for key lookup operations.

The benchmark was executed on a set of 2-socket servers using X5670 with Infiniband interconnect and 48GB of memory per machine. There were 8 data nodes in the cluster and each machine had 1 data node placed in it. There were 10 machines running flexAsynch (1 process per machine).

We reported a similar benchmark a year ago: the result from an 8 data node set-up was then 2.1M reads per second (an 8X improvement now), and using 8 machines with 16 data nodes we reached 4.3M reads per second (a 4X improvement). This dramatic improvement is possible since we have made each data node make more effective use of the available hardware threads. Previously each data node exploited about 6 CPUs with 8 threads; in this benchmark we used about 14 CPUs with 24 threads. We've also removed a number of bottlenecks, and the configuration of the threads is done in such a manner as to handle many different loads well.

We have also improved the update numbers from 1.3M updates per second in an 8-node setup to now reaching 1.9M updates per second in a similar set-up.

The scripts required to run the benchmark are available on our benchmark page. We've placed the configuration file used to drive the benchmark in the dbt2-0.37.50 tarball in the examples directory.

You can learn more about all of the enhancements in MySQL Cluster 7.2 from our Developer Zone article.