Monday, November 25, 2013

How to make an efficient Scalable Key Lookup engine out of MySQL Cluster

MySQL Cluster has all the ingredients to be designed as a very scalable and extremely efficient key lookup engine for the Cloud. As shown in earlier entries of this blog, we have been able to scale MySQL Cluster 7.2 to handle 72 million key lookups per second, or 4.3 billion key lookups per minute. That benchmark was actually limited by the inability of a single NDB API node to handle more than about 300k lookups per second, so with a maximum of 255 nodes we got to around 72 million per second in total. In MySQL Cluster 7.3 we have removed this limitation, and in addition we have enabled scaling to even bigger data nodes, so it should now be possible to reach even higher numbers.

The aim of this blog is however not to present any new benchmark results; rather it is to explain how the benchmark program works and how its architecture can be used to design an efficient, scalable key lookup data store.

To obtain the best possible performance we want to ensure that the data node can operate as efficiently as possible. This is done by making sure that a connection to the data node sends many key lookups bundled together. Operating on individual key lookups is of course possible, but as usual it is more efficient to operate on bigger entities than one key lookup at a time. To provide this we use a concept we call the Executor Thread. The Executor Thread only executes key lookups aimed at a certain data node, so the number of Executor Threads will be a multiple of the number of data nodes (there could be more than one thread per data node if necessary). The Executor Thread receives key lookups from an internal queue handled by the application program (in our case the flexAsynch benchmark program). The key lookups are prepared by the Definer Threads. A Definer Thread receives a key lookup aimed at any data node and calculates the receiving data node for this key lookup (there are API calls in the NDB API to handle this). Based on this calculation the Definer Thread puts the key lookup in the queue of the proper Executor Thread.
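To make the division of labour concrete, here is a minimal C++ sketch of the queue pattern described above. This is not the actual flexAsynch code; all names are illustrative, and the node-mapping function is a stand-in for the real NDB API call that computes the owning data node from the partition key.

  #include <condition_variable>
  #include <cstddef>
  #include <cstdint>
  #include <mutex>
  #include <queue>
  #include <vector>

  struct KeyLookup {
    uint64_t key;   // the primary key to look up
    // ... info needed to format the response would go here
  };

  // One queue per Executor Thread, i.e. per data node.
  class LookupQueue {
  public:
    void push(KeyLookup op) {
      std::lock_guard<std::mutex> g(m_);
      q_.push(op);
      cv_.notify_one();
    }
    // Drain everything queued so far; blocks until at least one lookup exists.
    std::vector<KeyLookup> pop_batch() {
      std::unique_lock<std::mutex> g(m_);
      cv_.wait(g, [this] { return !q_.empty(); });
      std::vector<KeyLookup> batch;
      while (!q_.empty()) { batch.push_back(q_.front()); q_.pop(); }
      return batch;
    }
  private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<KeyLookup> q_;
  };

  // Stand-in for the NDB API call that maps a key to its data node.
  std::size_t owning_data_node(const KeyLookup& op, std::size_t num_nodes) {
    return static_cast<std::size_t>(op.key % num_nodes);  // the real API hashes the partition key
  }

  // Definer Thread logic: route each lookup to the proper Executor queue.
  void define(KeyLookup op, std::vector<LookupQueue>& queues) {
    queues[owning_data_node(op, queues.size())].push(op);
  }

The queue is exactly where the bundling happens: each Executor Thread drains its queue in batches, so many lookups travel together to one data node.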

The architecture in front of the Definer Thread depends on the application. The figure provided here shows one possible architecture, where one receive thread receives a flow of messages from somewhere. To process those messages we need to interpret the packets, which could entail one or more key lookups. In the figure we have assumed there is one key lookup per message and that the Executor Thread can format the reply packet to the sender based on the information in the internal key lookup order.



So the important part of the architecture is the Executor Thread, which handles messages to one data node based on an internal data structure that defines one key lookup and how to process the response (this thread should do as little work as possible so it can focus on communication with the data node). There should also be a Definer Thread that prepares the key lookup request and puts it in the queue of the proper Executor Thread. The Definer Thread could also do other things, and there could be few or many Definer Threads in the architecture.

So how does flexAsynch work? In this case there is no input traffic; we generate the key lookups in the Definer Threads. The Definer Thread has a very simple operation. It starts by preparing a set of key lookups aimed at any data node. For each of those key lookups it puts the request in the queue of the proper Executor Thread. After placing all requests in a queue it waits for all operations to complete. After all requests have received their response from the Executor Threads we simply continue with the next batch.

The operation of the Executor Thread is also very simple: it takes the current set of key lookups waiting in its queue and prepares those for execution through the NDB API. It sends off all the operations to the data node. When the data node has executed all of the operations, the Executor Thread reports the results back to the Definer Threads, updates some benchmark statistics and continues with the next batch of key lookups.
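For readers unfamiliar with the asynchronous NDB API, a rough sketch of such an executor loop follows. The calls used (startTransaction, getNdbOperation, readTuple, equal, getValue, executeAsynchPrepare, sendPollNdb, closeTransaction) exist in the NDB API, but the exact signatures are from memory and the table name, column names and helper types are made up, so treat this as an outline rather than a faithful excerpt of flexAsynch. KeyLookup and LookupQueue are the illustrative types from the sketch shown earlier.

  #include <NdbApi.hpp>   // link against the NDB API client library
  #include <vector>

  // Called by the NDB API when one transaction completes.
  // result < 0 would indicate an error to report to the Definer.
  static void on_complete(int result, NdbTransaction* trans, void* ndbObj) {
    // Here we would report the result back to the Definer Thread
    // and update the benchmark statistics.
    static_cast<Ndb*>(ndbObj)->closeTransaction(trans);
  }

  // Executor Thread: drain the queue, prepare all lookups, send and poll.
  void executor_loop(Ndb* ndb, LookupQueue& queue, const NdbDictionary::Table* tab) {
    for (;;) {
      std::vector<KeyLookup> batch = queue.pop_batch();
      for (const KeyLookup& op : batch) {
        // The real code gives startTransaction a key hint so the
        // transaction coordinator lands on the intended data node.
        NdbTransaction* trans = ndb->startTransaction();
        NdbOperation* read = trans->getNdbOperation(tab);
        read->readTuple(NdbOperation::LM_CommittedRead);
        read->equal("PK", (Uint64)op.key);   // hypothetical column name
        read->getValue("DATA");              // hypothetical column name
        trans->executeAsynchPrepare(NdbTransaction::Commit, on_complete, ndb);
      }
      ndb->sendPollNdb();  // send all prepared transactions, wait for replies
    }
  }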

So the operation of an efficient key lookup data store is not difficult at all. To make it scale one can then add up to 48 data nodes per cluster (each capable of handling more than 5 million key lookups per second on rows of around 100 bytes in size), and each cluster can contain a total of 255 nodes. Obviously it is also straightforward to operate more than one cluster to scale even further.

The benchmark code is found in storage/ndb/test/ndbapi/flexAsynch.cpp; the interesting code is in the NEW module (the file also contains a lot of legacy code for old variants of the flexAsynch benchmark).

Tuesday, November 19, 2013

MySQL Cluster run-time environment: Part 3: Configuration recommendations

Binding threads to CPUs in the MySQL Cluster data nodes can have great benefits. So what about hyperthreading, should we use all CPU threads, or only one CPU thread per CPU core? The recommendation actually varies. For the most part it is beneficial to use hyperthreading: for most thread types, using both hyperthreads of a CPU core gives about 40% higher performance compared to using one CPU thread on the core without hyperthreading. There are a few cases where it might be beneficial not to use hyperthreading though.

The first example is the LDM threads. Using hyperthreading means we double the number of partitions of a table. In many cases this isn't beneficial, particularly when the number of LDM threads is already high. I tried using 24 LDM threads with hyperthreading on 12 CPU cores and compared it to 12 LDM threads on 12 CPU cores; this case didn't benefit from hyperthreading of LDM threads. However, if the number of LDM threads is low it would probably still pay off: going from 2 to 4 LDM threads is still beneficial, probably also going from 4 to 8, but going from 8 to 16 is less likely to be beneficial.

I have tested send and recv threads with and without hyperthreading, and my benchmarks have always improved when using it. I have seen the same with tc threads. Obviously, if the main thread or the rep thread for some reason becomes the major bottleneck, then it makes sense to avoid hyperthreading there.

Avoiding hyperthreading is simply done by not configuring any threads to use the second CPU on each of the CPU cores we want to keep free of hyperthreading. As an example we will configure a machine with one data node, where the machine has 4 sockets with 24 cores and 48 CPU threads. CPUs 0-5 represent the CPU threads on socket 0, cores 0-5, thread 0; CPUs 24-29 represent socket 0, cores 0-5, thread 1. So if we want to configure with 12 LDM threads not using hyperthreading we could use the config:

ThreadConfig="ldm={count=12,cpubind=0-11},tc={count=4,cpuset=12-17,36-41},send={count=4,cpuset=12-17,36-41},recv={count=4,cpuset=12-17,36-41},io={cpuset=18-19,42-43},main={count=1,cpuset=18-19,42-43},rep={count=1,cpuset=18-19,42-43},wd={cpuset=18-19,42-43}"

In this configuration the LDM threads will use CPUs 0-11, which covers all cores on sockets 0 and 1. No other thread is configured to use any CPU thread on those CPU sockets. The OS might still decide to use some of them, but we have left a number of empty CPUs that the OS hopefully discovers as idle and uses to schedule OS activities. We can actually do even better than this: there is a boot option in Linux whereby one can control which CPUs the OS scheduler uses for normal scheduling. There is a similar configuration variable for irqbalance to ensure interrupts are not scheduled on any CPU used by the MySQL Cluster data node. The tc, send and recv threads are scheduled on any of the CPU threads on socket 2, and the main, rep, wd and io threads use 2 cores on socket 3. The OS and other processes aren't blocked from using other CPUs, but will most likely be scheduled on any of the free CPUs with no activity on them.

The first question when specifying the MySQL Cluster run-time environment is how many ldm threads to use. This depends on how many CPUs are accessible to the data node. So assume we have access to 24 CPU threads with hyperthreading. In this case it would be natural to start with 6 ldm threads not using hyperthreading; a natural initial start is to use half of the available CPU cores for ldm threads. Next one assigns about a quarter of the ldm CPU resources to tc threads, in this case landing at 3 tc threads. Next one assigns a similar number of CPUs to send and recv threads. Then one assigns the rest of the CPUs to the main, rep, io and wd threads, as shown in the sketch below. This should give a fair amount of resources available also to the OS.
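As a sketch of what this initial assignment could look like (assuming 12 cores where CPUs 0-11 are the first hyperthread of each core and CPUs 12-23 the second; the actual numbering must be checked on the machine at hand):

ThreadConfig="ldm={count=6,cpubind=0-5},tc={count=3,cpuset=6-7,18-19},send={count=2,cpuset=8-9,20-21},recv={count=2,cpuset=8-9,20-21},main={count=1,cpuset=10-11,22-23},rep={count=1,cpuset=10-11,22-23},io={cpuset=10-11,22-23},wd={cpuset=10-11,22-23}"

Here the ldm threads avoid hyperthreading since CPUs 12-17, the siblings of cores 0-5, are left unused and thus available to the OS.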

After this initial assignment it is a good idea to run some benchmark that is close to the workload of your application to see whether the config works well. For this test run one should use cpubind for all thread types to ensure that we know how much CPU each thread type consumes (with cpubind this can easily be derived from top in per-CPU load mode). First check whether the ldm threads are the bottleneck; if so, check whether it is possible to increase to the next level. In this example that would mean going to 8 ldm threads, then using 2/3 of the CPU resources. If this isn't possible, then just make sure that the rest of the thread types have a fair amount of CPU resources to avoid any unneeded bottlenecks.

If the bottleneck isn't the ldm threads, then assign more resources to the thread type that is the bottleneck, in most cases by taking resources from the other non-ldm thread types. There could be cases where less than half of the CPU resources are needed by the ldm threads, but I would deem those very unusual. Given that the actual database processing is done in the ldm threads, it would be an exceptional case if other threads consumed more than half of the resources.

Always remember to update the NoOfFragmentLogParts variable if changing the number of ldm threads.

After a few trials we have most likely found a decent configuration. Having found it, we can also consider where to use cpuset and whether any threads should use the realtime or spintime settings.

So the next question is when to use cpubind, cpuset, realtime and spintime.

Both cpubind and cpuset are about locking threads to an individual CPU or a set of CPUs. We could actually consider even using no cpubind and cpuset as a form of CPU locking: we are locking the threads to the set of CPUs available in the OS. Given that we might be running on a virtual OS, this might already be a subset of the existing CPUs. To make any configuration of CPUs using cpubind/cpuset one has to know how cpu ids map to CPU sockets and CPU cores.

So the default configuration, not using any cpubind/cpuset, allows the OS to schedule each thread onto any of the available CPUs. The OS scheduler is optimised towards an interactive environment where processes need to react to human interaction. It also does a good job in server environments where a fairly high number of threads compete for a small number of CPUs, and a decent job in server environments where only a handful of threads compete for a number of CPUs.

The type of threads that normal OS schedulers have the most problems handling are long-running threads that consume a lot of CPU resources, in particular threads that run more or less constantly. What happens is that the OS scheduler eventually downgrades their priority so that other processes are given a chance to execute; since the thread still wants to execute, the OS then searches for a free CPU to use. This is often successful, but the problem is that it means we migrate the thread onto a new CPU. This happens many hundreds of times per second for those busy threads. The effect is that the thread arrives at a new CPU where it has no data or instructions cached in the CPU caches, so a migrated thread spends quite a lot of time warming up CPU caches before it can run as efficiently as before the migration. In addition this requires more bandwidth on the memory bus, which in some cases can become a bottleneck.

In the above case it is better to stay on the same CPU even if a new job is scheduled, at least if we can be sure that the new process isn't yet another long-running thread. So in order to optimise the MySQL Cluster run-time environment we need to be in control of all usage of the CPUs on the machine we're using. In many cases we colocate data nodes and MySQL Server processes. We can control placement of MySQL Server processes using numactl or taskset, applied either when starting the process or via the pid when the process is already started. This ensures that the MySQL Server process is never scheduled outside the set of CPUs we gave it access to through taskset/numactl. This is how I control the environment when running any MySQL Cluster benchmark; similarly I also control the benchmark processes (sysbench/dbt2 client processes/flexAsynch...).

In this manner I am certain that no application process is using the CPUs provided for execution of the data node threads. So the only things that will execute on those CPUs are OS kernel threads or interrupt handling. Even this can be controlled: the special boot option isolcpus takes a list of cpu numbers to isolate from normal OS scheduling. When this option is set, only the CPUs not listed use normal scheduler handling with the possibility to migrate threads; the isolated CPUs are only used when an application explicitly binds to them, for example through cpuset/cpubind or taskset/numactl. The final thing to control is the execution of interrupts. This is normally handled by the irqbalance process, which can be configured to avoid the CPUs in a bitmap specified by the irqbalance configuration variable IRQBALANCE_BANNED_CPUS. So it is possible to create a completely compartmentalized load with interrupts on certain CPUs, the OS and other applications on another set of CPUs, a set of CPUs for the data node, a set of CPUs for the MySQL Server and a set of CPUs for any other consuming application process. Providing this is a combination of boot options, irqbalance configuration, MySQL Cluster configuration and finally using taskset/numactl on certain processes.
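As a hedged illustration of how these pieces fit together (exact file locations and syntax vary between Linux distributions; the CPU numbers are made up, with the data node owning CPUs 0-11):

  # kernel boot option: isolate CPUs 0-11 from normal OS scheduling
  isolcpus=0-11
  # /etc/sysconfig/irqbalance: ban interrupts from CPUs 0-11 (hex bitmask)
  IRQBALANCE_BANNED_CPUS=00000fff
  # start a MySQL Server restricted to CPUs 12-17
  taskset -c 12-17 mysqld_safe &
  # or restrict an already running process (pid 1234) to CPUs 18-23
  taskset -pc 18-23 1234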

So when to use cpubind and when to use cpuset? cpubind should mainly be used on threads that can consume up to 100% of a CPU. This would normally be the ldm threads, but dependent on the workload and the configuration it can definitely be the case for other threads too. So one way of configuring MySQL Cluster environments is to use cpubind for ldm threads and put the tc, send and recv threads in a common cpuset. The nice thing with this configuration is that we can run with more threads than strictly necessary, so e.g. we can run with 4 tc threads, 4 send threads and 4 recv threads and put these into a cpuset consisting of e.g. 8 CPUs. In this manner the OS can easily handle it if a certain thread type requires more resources for a while.

So configuring a MySQL Cluster run-time environment is about making use of static scheduling for the most critical resources and using the flexibility of OS scheduling to handle the less critical resources. A good design philosophy for all environments like this is to design the run-time environment with a well-known bottleneck. For MySQL Cluster data nodes we recommend always making the ldm threads the bottleneck. The nice thing about this is that it makes it easier to understand and reason about how to handle overload. As an example, if the ldm threads are overloaded and the tc threads have available resources, we can ensure that the tc threads handle sending error messages about overload without even contacting the ldm threads. This can be achieved by decreasing the available resources in ldm threads through configuration, at least for scan operations.

Hopefully these recommendations will help you find a configuration that is both optimal and safe. Default configurations will work for most installations, but to get the last 10-100% of performance out of the system one might need to dive a bit deeper into the configuration of the MySQL Cluster run-time environment.

The next configuration item to consider is the realtime setting. This can now be set per thread type. Traffic queries normally arrive in the recv thread, are executed in the tc or ldm threads, and the reply is sent through the send thread. The main thread is mainly involved in meta-data operations, which are rarely time-critical. The rep thread can have a high load, but it is not part of any critical path except for asynchronous replication to other clusters. The io thread is only involved in time-critical operations if disk data is used in MySQL Cluster. The wd threads are different; obviously it is important that the watchdog thread gets an opportunity to execute every now and then, so if other threads are using realtime it's a good idea to use realtime also on the wd thread type. The recv, send and tc threads usually do small jobs, and thus realtime scheduling might be beneficial for those threads to cut response time. For ldm threads that execute close to 100% of the time it's debatable whether realtime is such a good idea. The OS cannot handle threads executing on realtime priority for a long time, so we have implemented protection for this in the data nodes: we decrease the priority to normal user priority even for realtime threads if they execute for too long. There is rarely any throughput benefit from the realtime configuration; it is mainly intended to reduce variation in response time.

Another use case for realtime is when there is a mix of data node threads and other application threads, where the application threads have lower priority. In this case we ensure that the data node threads get prioritised access to CPUs ahead of the other application threads.

The final configuration item to consider is spintime. This means that the data node thread will execute for a bit longer before entering sleep mode, so if the recv thread or any other thread sends a new message to the thread within this spintime we decrease the wake-up time. The only case where this can increase throughput is if the spinning thread is the bottleneck of the data node and the spinning doesn't steal resources from any other critical thread. The main usage of spintime is as a tool to improve response time in an environment with plenty of CPU resources. It is important to consider that using spintime increases the CPU usage of those threads it is set on. It only applies to the ldm, tc, main, rep, send and recv thread types. One should in most cases avoid mixing spintime and realtime settings on the same thread.

MySQL Cluster run-time environment: Part 2: How to configure

The first choice in selecting a run-time environment is the choice of binary. The original NDB run-time environment was a single-threaded environment where we executed everything except IO threads in a single thread. This environment still exists and is selected by running with the ndbd binary. The new multithreaded environment is selected by running with the ndbmtd binary. ndbmtd can clearly scale far beyond ndbd, but ndbd can still have advantages in environments that have a low load and need optimisation for latency: since ndbd does socket receive, signal execution and signal sending in the same thread, it can have shorter latency at the cost of scalability. Most of the rest of this description discusses ndbmtd configuration. The ThreadConfig variable doesn't apply when running ndbd, since ndbd uses a hard-coded thread configuration consisting of one main thread.

When configuring the MySQL Cluster run-time environment the most important variable is ThreadConfig. One can alternatively use MaxNoOfExecutionThreads and LockExecuteThreadToCPU, as we will show below. There is also LockMaintThreadsToCPU that can be used to bind IO threads to a specific CPU, but it is now recommended to lock io threads using ThreadConfig instead, since it has more options for how to do this. RealtimeScheduler and SchedulerSpinTimer can be used to set real-time scheduling and spin time on threads. If ThreadConfig is also used, then these variables only provide default settings that can be overridden by the ThreadConfig setting. SchedulerExecutionTimer is not applicable when running with ndbmtd; it is only applicable when running with ndbd.

LockPagesInMemory is part of the MySQL Cluster run-time environment and can have a heavy impact on response times. Setting this hasn't changed and it is independent of the rest of the settings discussed in this blog.

NoOfFragmentLogParts is an important variable that should be set equal to the number of ldm threads whenever the number of ldm threads is larger than 4; it cannot be set to anything smaller than 4. One can set NoOfFragmentLogParts larger than the number of ldm threads, but there is no advantage to this that I can think of. It can be set to the following values: 4, 6, 8, 12, 16, 24, 32.
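For example (the parameter names are real, the values merely illustrative), a config.ini fragment pairing 8 ldm threads with a matching number of log parts could look like:

  [ndbd default]
  NoOfFragmentLogParts=8
  ThreadConfig="ldm={count=8},tc={count=2},send={count=2},recv={count=2},main={count=1},rep={count=1}"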

The types of threads we have are the ldm, tc, main, rep, send, recv, io and wd threads. These were discussed in part 1 of this blog. main and rep have a fixed thread count of one. The wd threads are always three with specific roles, but from a configuration point of view the number of wd threads is one. The ldm, tc, send and recv threads have a configurable number of threads. There can be 1, 2, 4, 6, 8, 12, 16, 24 or 32 ldm threads, anywhere between 1 and 32 tc threads, anywhere between 1 and 16 send threads, and likewise between 1 and 16 recv threads. The number of ldm, tc, send, recv, main, rep and wd threads is fixed once the configuration is given: when the ndbmtd process is started these threads are started and not stopped until the process is stopped. The io threads are different; they are handled by a dynamic pool of threads.

The io thread pool is controlled by two variables: InitialNoOfOpenFiles gives the initial number of io threads (one thread handles one open file), and MaxNoOfOpenFiles specifies the maximum number of io threads that will be created. DiskIoThreadPool can be used to set the number of parallel accesses to disk data files; it doesn't affect the number of io threads, only how many threads at a time can access a disk data file.
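An illustrative config.ini fragment (the values are examples, not recommendations):

  # initial size of the io thread pool
  InitialNoOfOpenFiles=27
  # upper bound on the number of io threads
  MaxNoOfOpenFiles=100
  # up to 8 parallel accesses per disk data file
  DiskIoThreadPool=8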

The preferred manner of configuring most of this environment is by using the ThreadConfig variable. This variable accepts a string and the string is parsed. I will start by giving an example.

ThreadConfig="ldm={count=4,cpubind=0-3,spintime=50},tc={count=1,cpubind=4,realtime=1},main={count=1},rep={count=1},send={count=1,cpubind=5},recv={count=1,cpubind=6},io={cpubind=7},wd={cpubind=7}"

In this example we see that for each thread type we can specify count, the number of threads of this type. count must be 1 for main and rep and also for io and wd. For io and wd threads the count is actually ignored, since io threads are handled by a dynamic pool and the wd threads are 3 specific threads (the WatchDog, SocketServer and SocketClient threads).

cpubind can be given for all thread types. cpubind is a comma-separated list of CPUs where one can also use '-' to indicate a range of CPUs, so e.g. 0-3 is equivalent to 0,1,2,3. Threads are assigned to CPUs in order, so for the ldm's in the above config the first ldm thread will be bound to CPU 0, the second ldm thread to CPU 1, the next to CPU 2 and the last ldm thread to CPU 3. The CPU numbering here is the ordering imposed by the OS. On Linux one can get details on this by running 'cat /proc/cpuinfo'; this provides exact information about which CPU socket a CPU belongs to, which CPU core it belongs to and which of the hyperthreads it runs, along with a bunch of other things. One must assign at least one CPU per thread here, otherwise an error occurs; having more CPUs specified is ok, but only the first ones provided will actually be used. cpubind always means that the thread can only execute on the specified CPU.

Setting realtime=1 means that the thread will use the real-time scheduler of the OS. To protect the OS from starvation we yield if we execute for too long with real-time priority, allowing other threads to get some work done as well. As can be seen from the example, we can use the real-time setting on a subset of the threads. The global configuration variable RealtimeScheduler is still used and sets the default value for a thread, so if this is set to 1 we need to set realtime=0 on a thread to ensure we don't get real-time scheduling on it. Use of the real-time scheduler is mainly intended to lower the variance of the latency. Using it isn't likely to improve performance except possibly in a highly loaded system where other processes don't require real-time scheduling. Given that the OS often expects real-time threads to execute for a short time with high requirements on short response time, it is not recommended to mix this configuration option with spintime on the same thread.

Setting spintime=50 means that we will spend at least 50 microseconds executing signals and waiting for more signals before putting the thread to sleep again. This parameter too is mainly intended to improve latency: by keeping the threads ready to go, we can react faster to a query. We cannot set the spintime higher than 500 microseconds. Setting spintime consumes considerably more CPU resources and provides only a tiny bit of extra throughput.

In the latest version of MySQL Cluster 7.x we also introduced the ability to bind threads to a set of CPUs. Each CPU can only belong to one CPU set. On Solaris a cpu set is exclusive to the data node; on Linux it only sets the whereabouts of the MySQL Cluster data node threads, and other threads can still be scheduled on the same CPUs.

As an example we could use the following config:
ThreadConfig="ldm={count=4,cpuset=0-3},tc={count=1,cpuset=4-7},main={count=1,cpuset=4-7},rep={count=1,cpuset=4-7},send={count=1,cpuset=4-7},recv={count=1,cpuset=4-7},io={cpuset=4-7},wd={cpuset=4-7}"

In this example we have the same number of threads as in the previous example, but we have divided the threads into two CPU sets. The first CPU set covers CPUs 0-3 and is only used by the 4 LDM threads. The other CPU set covers CPUs 4-7 and takes care of the remaining threads in the data node. In this manner we can arbitrarily configure scheduling domains. cpuset has the advantage that the OS gets some liberty to do dynamic scheduling; the default config with no CPU sets or CPU bindings means that we have one CPU set consisting of all CPUs. By using cpuset and cpubind we can mix usage of the OS scheduler and our own fixed scheduling. Usually we want the threads with the most load to have their own CPUs; for other threads that have more variable load it makes sense to use cpusets so that the OS can dynamically schedule them dependent on the current load of the system.

The thread types that have a variable number of threads are currently the ldm, tc, send and recv threads. In 7.3.3 we have extended the thread support such that we can have 32 ldm threads, 32 tc threads, 16 send threads and 16 recv threads. tc, send and recv threads can have an arbitrary number, and it also works to change the ThreadConfig variable, at least after an initial node restart. The number of LDM threads has more restrictions: first of all it cannot be changed other than by starting up a completely new cluster, and we only allow the following numbers of LDM threads: 1, 2, 4, 6, 8, 12, 16, 24 and 32. A very important part of the config that is easy to forget is that if we increase the number of LDM threads beyond 4, then we also need to set NoOfFragmentLogParts to at least the number of LDM threads. Normally one would simply set this variable to the same value as the number of LDM threads.

A final note on the usage of ThreadConfig is that one can even divide one thread type into several groups, so one could write ldm={cpubind=0},ldm={cpubind=1},ldm={cpubind=2},ldm={cpubind=3} as equivalent to ldm={count=4,cpubind=0-3}.

If one wants to avoid working out the best run-time configuration, one can also still use the MaxNoOfExecutionThreads variable together with LockExecuteThreadToCPU to get a simpler, but less flexible, way to configure the run-time environment. The number of ldm, tc, send and recv threads is in this case derived from MaxNoOfExecutionThreads through a lookup in a table found in the code (there is a program to generate this table).
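A minimal sketch of that simpler variant (the CPU list is illustrative, and written out explicitly since I am not certain range syntax is accepted for this variable): with MaxNoOfExecutionThreads=16 the table below yields 8 ldm, 3 tc, 1 send and 2 recv threads.

  MaxNoOfExecutionThreads=16
  LockExecuteThreadToCPU=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15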

Here is the table (found in mt_thr_config.cpp, both the table and the program generating it). Each row in the table has five entries: the first is the value of MaxNoOfExecutionThreads, and the next four are the number of ldm threads, tc threads, send threads and finally recv threads.

  static const struct entry
  {
    Uint32 M;
    Uint32 lqh;
    Uint32 tc;
    Uint32 send;
    Uint32 recv;
  } table[] = {
    { 9, 4, 2, 0, 1 },
    { 10, 4, 2, 1, 1 },
    { 11, 4, 3, 1, 1 },
    { 12, 6, 2, 1, 1 },
    { 13, 6, 3, 1, 1 },
    { 14, 6, 3, 1, 2 },
    { 15, 6, 3, 2, 2 },
    { 16, 8, 3, 1, 2 },
    { 17, 8, 4, 1, 2 },
    { 18, 8, 4, 2, 2 },
    { 19, 8, 5, 2, 2 },
    { 20, 8, 5, 2, 3 },
    { 21, 8, 5, 3, 3 },
    { 22, 8, 6, 3, 3 },
    { 23, 8, 7, 3, 3 },
    { 24, 12, 5, 2, 3 },
    { 25, 12, 6, 2, 3 },
    { 26, 12, 6, 3, 3 },
    { 27, 12, 7, 3, 3 },
    { 28, 12, 7, 3, 4 },
    { 29, 12, 8, 3, 4 },
    { 30, 12, 8, 4, 4 },
    { 31, 12, 9, 4, 4 },
    { 32, 16, 8, 3, 3 },
    { 33, 16, 8, 3, 4 },
    { 34, 16, 8, 4, 4 },
    { 35, 16, 9, 4, 4 },
    { 36, 16, 10, 4, 4 },
    { 37, 16, 10, 4, 5 },
    { 38, 16, 11, 4, 5 },
    { 39, 16, 11, 5, 5 },
    { 40, 16, 12, 5, 5 },
    { 41, 16, 12, 5, 6 },
    { 42, 16, 13, 5, 6 },
    { 43, 16, 13, 6, 6 },
    { 44, 16, 14, 6, 6 },
    { 45, 16, 14, 6, 7 },
    { 46, 16, 15, 6, 7 },
    { 47, 16, 15, 7, 7 },
    { 48, 24, 12, 5, 5 },
    { 49, 24, 12, 5, 6 },
    { 50, 24, 13, 5, 6 },
    { 51, 24, 13, 6, 6 },
    { 52, 24, 14, 6, 6 },
    { 53, 24, 14, 6, 7 },
    { 54, 24, 15, 6, 7 },
    { 55, 24, 15, 7, 7 },
    { 56, 24, 16, 7, 7 },
    { 57, 24, 16, 7, 8 },
    { 58, 24, 17, 7, 8 },
    { 59, 24, 17, 8, 8 },
    { 60, 24, 18, 8, 8 },
    { 61, 24, 18, 8, 9 },
    { 62, 24, 19, 8, 9 },
    { 63, 24, 19, 9, 9 },
    { 64, 32, 16, 7, 7 },
    { 65, 32, 16, 7, 8 },
    { 66, 32, 17, 7, 8 },
    { 67, 32, 17, 8, 8 },
    { 68, 32, 18, 8, 8 },
    { 69, 32, 18, 8, 9 },
    { 70, 32, 19, 8, 9 },
    { 71, 32, 20, 8, 9 },
    { 72, 32, 20, 8, 10}
  };

MySQL Cluster run-time environment, Part 1

The run-time environment in MySQL Cluster has always been geared towards high throughput, but also towards many features providing real-time access to the data in MySQL Cluster. Real-time features have included the possibility to lock the memory so that no swapping occurs. They have also included the ability to control the placement of the many threads introduced in the MySQL Cluster 7.x series.

In the latest version of MySQL Cluster 7.x we have improved these real-time features and enabled a higher level of control over how those features are configured in the MySQL Cluster data nodes.

Given the high flexibility of configuring MySQL Cluster 7.x, we have moved away from global config variables affecting all threads and towards using the ThreadConfig variable to specify the configured behavior of MySQL Cluster data nodes. The global variables are still in use, but they only set the default values, which can be overridden by the settings in ThreadConfig.

Given that the ThreadConfig variable has such a key role in configuring the run-time environment of the MySQL Cluster data nodes, it is a good idea to shed more light on what can be configured in the latest versions of MySQL Cluster 7.x. To assist users in this, I've written a series of three blogs of which this is the first; this blog introduces the concepts. The second blog gives information on what syntax and which configuration variables to use. Finally, the third blog gives recommendations on which configuration settings to use in a number of different situations.

So what can be controlled using the ThreadConfig variable? We can control all threads that are created as part of running a multithreaded data node. There is only one exception to this rule, and that is the io threads, which are handled by a dynamic pool of threads. We can still set some properties of these io threads, but there are other configuration variables available to set the pool size and the initial allocation of threads for IO.

We have the following thread types in a data node. We have the ldm threads; these are the most central threads, which run the actual local database processing, each thread taking care of its own partition of the data. We have a set of tc threads that take care of transaction coordination. Each query to the data nodes starts with a message to a tc thread, and this thread type also executes join processing pushed down to the data nodes from the MySQL Server. The main thread runs a large amount of code to handle control of system restarts, meta-data operations, control over checkpointing and some other management operations; it is used very lightly in normal operation. The rep thread receives update triggers from the ldm threads and is responsible for transporting those events to the proper API node. This thread is heavily used if asynchronous replication is used, otherwise it's not used much at all.

We have also enabled the possibility to split all TCP/IP traffic out of the other threads. This means we have threads taking care of all send calls to other data nodes and API nodes, and threads taking care of receive calls on sockets, preparing the received messages for execution by the other threads. It is still possible to configure without specific send threads, in which case sending happens directly from the ldm, tc, main and rep threads.

One way of understanding how these different threads use the CPU is to assign them to different CPUs and watch in top how much of each CPU is used. I will briefly give some guidelines on what affects how heavily a certain thread type is used.

The ldm threads are the most common bottleneck in a data node. Each ldm thread handles the hash index, the ordered indexes and the actual data of a hash partition of a certain table. This means that most of the time the various ldm threads have a similar load on the CPU. However, if certain rows are very hot, then some level of imbalance can occur between the CPUs. There are really three main activities going on in the ldm threads that are worth mentioning for normal operation: primary key lookups (using the hash index on a table), scans of a table (there are a couple of variants of how to scan a table) and checkpointing activity. The checkpointing is a background activity that comes and goes, and its load depends on configuration settings. The cost of a primary key lookup is a fixed lookup cost plus a variable cost dependent on the amount of data read from the row and sent to the API; the same is true for scans. Each CPU handling an ldm thread can normally handle on the order of a few hundred thousand lookups or one or two million scanned rows (with delivered data; one can scan even faster if rows aren't selected to be sent to the API). One might wonder how unique key lookups come into play: they are simply two primary key lookups, the first on the unique index and the second on the table itself. So seen from the ldm thread's point of view there is no unique key lookup; it is a concept handled through the transaction coordinator in the tc thread and local triggers in the ldm thread.

The tc thread is involved in every lookup and scan as well; the amount of CPU work done depends entirely on the type of queries. Obviously, if the ldm thread spends little time on each request, then the tc thread is more involved and spends more CPU per ldm thread. For most activities one needs about 1 tc thread per 4 ldm threads. There are however exceptions: with many long scan queries fewer tc threads are needed (e.g. sysbench), and with many short primary key lookups there might be a need for more tc threads (e.g. flexAsynch). Also, with complex queries the tc threads are involved in handling parts of the join processing.

The main thread is usually lightly loaded and hardly ever requires a full CPU. The load on the rep thread can be high if one uses asynchronous replication, otherwise the load on this thread is negligible. The io threads normally also have negligible load; the exception is where one has configured IO with compression, in which case the compression and decompression happen in the io threads, causing a high load on each io thread.

We have also introduced a new category of threads, the wd threads; these are actually 3 specific threads. They are the watchdog thread, the socket server thread and the socket client thread. The watchdog thread wakes up once every 100 milliseconds and does some brief checks to ensure that no data node thread has been stuck. The socket server thread listens to some sockets and handles the connection setup phase of new sockets; the data node normally does a very small number of socket setups, since sockets stay up until nodes are stopped. The socket client thread is only involved in the connection setup phase of sockets, handling the client side of a connection setup; it effectively calls connect on a socket to see if it has become connected. None of these threads will consume any sizeable part of a CPU. Previously they were handled as part of the io threads. From a configuration point of view we manage these threads as one.

The send threads can have a substantial load dependent on the amount of traffic sent from a data node, but also on how many nodes we communicate with. There is always a higher cost of communication when only a small amount of traffic flows between two nodes. This is because networking has a fixed overhead per message sent, so with a small amount of traffic there is less data per message and thus a higher relative overhead. A send thread can handle messages to any other node, so when a message is to be sent we choose the first send thread that is not currently in use. We always search in the same order, which means that the first send thread will always be the most loaded and the last send thread the least loaded.

The receive threads are organized such that each socket is handled by one receive thread, and there is one socket per node we are communicating with. So the load of a receive thread depends on the activity on its sockets. The sockets communicating with API nodes are all similar, and their activity depends on what happens on the application side. The communication between data nodes is for the most part very small; the one exception is the socket between two nodes in the same node group. There is a lot of traffic between those nodes when we have write transactions, since each write transaction means transaction coordination messages going between the data nodes in the same node group. It is also important to recall that the API side can have several API nodes per application process. In particular, the MySQL Server has a configuration variable called --ndb-cluster-connection-pool that sets the number of API nodes the MySQL Server uses to communicate with the data nodes. Each API node has a separate socket to each of the data nodes in the cluster.
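For example, in my.cnf (the variable is real, the value illustrative) one could configure four API node connections for one MySQL Server:

  [mysqld]
  ndb-cluster-connection-pool=4

Note that each of the four API nodes then needs its own node id slot in the cluster configuration.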

So armed with this information we can easily run tests to see how loaded the various threads are; this helps us understand whether we need to change our configuration to remove some bottleneck.

To understand how the various thread types interact for the most common execution of a primary key lookup and a scan operation, we have added a few figures describing this.

The first figure shows how a primary key read operation flows between the threads in a data node.



The second figure shows how a scan operation flows between the threads in a data node.



The third figure shows a more detailed view on what goes in a scan operation, the protocol looks the same independent of scan type.


Monday, September 30, 2013

MySQL 5.7 improves CPU scaling further


As shown in previous blogs and technical papers about MySQL 5.6, MySQL 5.6 improved scalability from 36 CPU threads all the way to 60 CPU threads on a machine where each socket has 6 cores and 12 CPU threads.

Now that we have released the MySQL 5.7.2 DMR it's interesting to note that we have improved scaling yet another step. I performed a number of simple test runs to see how our scalability has improved from MySQL 5.6.15 to the MySQL 5.7.2 DMR. What the numbers clearly show is that we have increased our scalability from 60 CPU threads to 72 CPU threads. Where we previously leveled off going from 48 CPU threads to 60 CPU threads, we now level off going from 60 CPU threads to 72 CPU threads, and the maximum performance is now found at 72 CPU threads compared to 60 CPU threads in MySQL 5.6.

Here we have the graph for scalability improvements of Sysbench RO.
Here is the graph for scalability improvements of Sysbench RW.
The reason is a series of improvements, both in the MySQL Server and in the InnoDB parts. One important improvement is to the index locking, which helps write scalability since updating indexes is now done with fewer concurrency problems. For read-only workloads, and primary key reads in particular, there have been great improvements in the area of MDL locking, and here we have almost doubled the throughput possible in MySQL 5.7 compared to MySQL 5.6.

In my work on scalability I have started using the perf tool found in modern Linux releases. It's an awesome tool: to get a complete performance analysis I can simply start perf record and specify the CPU and PID that I want to track. I can track timed events, first-level cache misses, last-level cache misses, various forms of TLB misses, branch prediction misses and much more. This should prove useful in finding improvements also for single-threaded workloads. I have already done such analysis and improvements of the MySQL Cluster data node code and seen some stunning results. It has taken some time to get to understand the perf tool, however.
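For instance (flags from memory of the perf CLI; the pid is made up):

  # sample on-CPU time of process 1234 for 30 seconds
  perf record -p 1234 sleep 30
  # sample last-level cache misses instead of timed events
  perf record -e cache-misses -p 1234 sleep 30
  # browse the recorded profile
  perf report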

The main obstacle with this tool is that the reporting sometimes points to assembler instructions that are not where the real issues are. The problem is that the reporting is quite inaccurate; the reported instruction can sometimes be several dozen instructions away from the real instruction where the bottleneck resides. In the literature this problem is referred to as skid: if an event occurs, the reported instruction is the next instruction to restart execution from after handling the event. Since a processor can have hundreds of instructions in flight at one point in time, the skid (the number of instructions between the instruction that caused the event and the reported instruction) can be hundreds of instructions. This makes it much more difficult to use the results from the perf tool. I read some Linux discussions about this tool and it seems not to be a priority of the Linux kernel developers to do something about this skid problem. So one needs to remember that the perf output is mostly an indication of where the problem resides and no more.

Improvements of scalability in MySQL Cluster 7.3

There are two major achievements in the development of MySQL Cluster 7.3 that have led to significant improvements in scalability. The first is the move to MySQL 5.6; this means that we've removed the infamous LOCK_open bottleneck that limited the scalability of MySQL Cluster 7.2, which is based on MySQL 5.5, and of all earlier versions. The other is the improved scalability of the NDB API. For sysbench we are able to process more than three times as many transactions per NDB API connection. In 7.3 it is also possible to do receive processing in separate threads that can be locked to specific CPUs, which also enables higher throughput.

When testing MySQL Cluster 7.3 vs MySQL Cluster 7.2 using sysbench, one can overcome the scalability limitation of the NDB API by simply using many more NDB API connections in MySQL Cluster 7.2. However, the limitation imposed by LOCK_open cannot be overcome; for sysbench this means that we can scale to about 40 CPU threads, and beyond that there is no additional gain from having more CPUs. When running with such a high load it's actually a challenge to handle the load in the data nodes as well. What we discover is that the main bottleneck lies in the local data management threads (LDM threads). It turns out that for this particular thread type it does not pay off to use hyperthreading; the best results are achieved with 16 LDM threads not using hyperthreading. The problem is that when we add hyperthreading we also increase the number of LDM threads to achieve the same performance, which with sysbench means more partitions and also more work to process. So this is a rare case for MySQL where it doesn't pay off to use hyperthreading. Using this setup we reach 7096 TPS for Sysbench RW and 9371 TPS for Sysbench RO with MySQL Cluster 7.2.

So when we take this configuration to MySQL Cluster 7.3 we can easily increase the performance, since the LOCK_open bottleneck is now removed. Actually, with the machine I have access to (a standard x86/Linux box with 8 sockets and a total of 96 CPU threads on 48 cores) I can no longer get the MySQL Server to become a bottleneck for these standard sysbench tests. For point selects I can still reach this bottleneck, since MDL locking still has some limitations in MySQL 5.6. As can be seen from the MySQL 5.7.2 DMR this problem is also going away, and then there will be even fewer obstacles to scaling further.

So with MySQL Cluster 7.3 we are actually limited by the hardware of a 96 CPU thread box and not by any software limitation. With this box we are able to reach 8932 TPS on Sysbench RW and 11327 TPS on Sysbench RO. At this load we're using 8 CPU threads for the benchmark, 20 cores for the data nodes and the remaining 48 CPU threads for the MySQL Server. So in order to increase throughput on this machine we simply have to make better use of the HW resources at hand. With MySQL Cluster 7.3 we've made it possible to scale data nodes all the way to 32 LDM threads, so if we had access to a sufficiently big box we would be able to scale MySQL Cluster 7.3 performance even further. It's likely that we could scale it to about 70-80 CPU threads for the MySQL Server, which would require about 30 cores for the data nodes.

So what we see is that we're quickly reaching a point where MySQL Cluster scalability can go beyond what most customers need, and it is thus very important to also start focusing on CPU efficiency. In order to understand where the load is spent we make good use of the perf tool in Linux, which can pinpoint problems in the code with branch prediction, tight loops and cache misses, and give us an idea of how to use software prefetching to improve code efficiency. Given that we program in C++, the compiler can often be assisted by introducing local variables, but one has to take care that those local variables are not spilled to the stack, in which case they only harm the efficiency of the program. Our initial experiments in increasing efficiency of the MySQL Cluster data nodes using this approach have been very successful.

NOTE: All the above benchmarks were done on one machine using a single MySQL Server and a single data node. Obviously MySQL Cluster is also capable of scaling in the number of nodes in addition to the scaling on one machine. In this benchmark we have however focused on how far we can scale MySQL Cluster on a single machine.

Wednesday, May 15, 2013

Year of Anniversaries

Many people write blogs and emails about various anniversaries. I had an anniversary when I started writing this blog on the 2nd of May, when I celebrated 23 years since my start date according to my employment contract. 2 May 1990 was the first day I worked at Ericsson, where I started working on databases and where NDB Cluster was born, which later grew into MySQL Cluster.

Actually, this is a year of anniversaries for me. I started by becoming half a century old a few months ago, then I celebrated a quarter of a century as a member of the LDS church, my wife and I just celebrated our 25th wedding anniversary, and in September I will celebrate 10 years of working with MySQL together with the other people who joined MySQL from Ericsson 10 years ago.

So I thought this was an appropriate time to write down a little history of my career, particularly focused on how it relates to MySQL and MySQL Cluster. The purpose isn't to publish any new technical facts; it is more to provide some input to those interested in the history of software development. I am personally very interested in history and always enjoy telling a good story, particularly when it is a true one. I think however that the most important reason to write this blog is to help young software engineers understand both the challenges they will face and that the path to success sometimes, or even most of the time, is a road that contains many road blocks and even complete changes of direction.

I will start the story at the university, where I studied mathematical statistics. I learned to enjoy hard work and challenges when I somewhat by chance found myself studying at a university while still attending the last year of gymnasium (similar to the last year of high school) and also working half-time. This taught me to work hard and that overcoming challenges is quite fun (a lot more fun than easy work) when you come out on top.

I remember that when I started working after university I was looking at two options: either work with data communication or work with databases. I chose to work with data communication! This meant that I learned a lot about low-level things, working on X.25 protocols, SNA protocols and often programming in obscure assembler code. This has been a very important competence all through my career, since it means that I have always had a good understanding of the performance aspects of the programs I have analysed and written.

In 1989 I had some personal inspiration that I would study for a Ph.D. in mathematics and databases. I didn't have any possibility to start this immediately. However, when I started working at Ericsson on the 2nd of May 1990 in their education department, I could quickly extend my competence by learning object-oriented programming, C++, telecommunication, advanced system testing and many other things. In March 1991 the opportunity to start learning about databases appeared, as a new course was to be developed on an internal Ericsson database. Later the same year I saw an internal job offering at the Ericsson systems department working with system aspects of this internal Ericsson database. So in 1991 my career in databases got off to a start. Late in 1991 I heard of a large research project about UMTS (what is today called 3G) in which Ericsson participated. I went to my manager and asked him if I could spend a third of my time over the next few years on this project. He accepted, and a few months later he also accepted that I go to half time on this research project. My manager Martin Hänström and my department manager Göran Ahlforn and their vision of future telecom databases were crucial to the development of MySQL Cluster, although they were never at any time involved in its actual development.

This project was called RACE MONET and involved hundreds of researchers in Europe, and I learned a lot about network databases and the requirements they would see over the next few decades. This meant that I could start my Ph.D. studies in mathematics and databases. I spent a few years reading database theory and understanding more about future telecommunication systems. I developed a few high-level models for how to build a distributed database, but it took a few years to come up with the model now used in the MySQL Cluster data nodes. In those days I actually focused so much on reading and thinking that I didn't even see a need to have a computer on my desk.

I had some interesting collaboration with a Norwegian team of database researchers, which meant that we were very close to joining forces in 1994 on a project to develop a distributed database. We did eventually join forces 14 years later, when MySQL was acquired by Sun and the ClustRa team (ClustRa had been acquired by Sun a few years earlier) joined the MySQL team as the Sun database team. Today this team is an important part of all areas of MySQL development, but through history we have had many occasions of both collaboration and competing in the marketplace.

After studying a lot for 3-4 years I had a very interesting period in 1995 and 1996 when all my studying went into the idea phase. Through understanding both the requirements on a telecom database and database theory, I had a period of constant innovation: a new idea popped up more than once per week, and many of them came while I was out running.

With all these ideas I had the theoretical foundation of NDB Cluster, which later turned into MySQL Cluster. Not a single line of code was yet written, but most high-level algorithms were written down in my Ph.D. thesis, which I eventually finalised in 1998.

Now the time had come to actually start developing NDB Cluster. I remember driving by a new compound we were supposed to move to a few months later, and telling a friend that in this new building my 5th kid, NDB, would be born. In the beginning I had no resources, but at Ericsson it was easy to hire summer students and master thesis students. So over a few years I worked with 30-40 summer students and 10 master thesis students. Most of these students focused on special areas and prototyped some idea. This period was extremely important, since many of the early ideas weren't so bright and the student works proved that we were on the wrong path. I remember especially a prototype of the communication protocol to the database: the prototype used BER encoding and the protocol code alone was a few thousand lines of code. It was clearly not the right path, and we went for a much more fixed protocol, which evolved through six different versions in a short time to end up where it is today. The NDB protocol is entirely based on 32-bit unsigned integers, and most data are in fixed positions with a few exceptions where bits are used to shave off a few words from the message.

The first steps in the NDB development were completely intertwined with the other half of my work at Ericsson. This work entailed building a virtual machine for the APZ processor, an Ericsson-developed CPU that powered the AXE switches, which were a very important part of the Ericsson offering. In 1998 we developed a prototype of this virtual machine that was faster than any other APZ processor available at the time. This project was eventually brought into real products; personally I haven't done anything in this area for the last 10 years. This intertwining of projects was important to get funding for the first phases of the NDB development, and in the second half of 1997 we started a real development project for NDB Cluster. The aim of the project was to build a prototype together with an operator to demonstrate low-latency communication from AXE towards an external database, thus providing easy maintenance of various telecom services. The intertwining happened since NDB was executed in this virtual machine. There are still remnants of this virtual machine in the NDB code in the form of signals between blocks.

We ran this project at full speed for one year with up to 30 people involved. This project brought NDB from version 0.1 to version 0.3, and much of the code for version 0.4 was also written. The first NDB API was developed here, based on some ideas from the telecom operator, as were much of the transaction protocols and data structures. The project was brought to a successful conclusion. There was however no continuation of the project, since Ericsson decided to focus their resources on another internal database project. So for most of 1999 the project continued as a skunk-work project. For about half a year I was the only person involved in the project and spent a lot of time fixing various bugs, mostly in the hash index part, and replaced the hash function with the MD5 hash function.

The NDB API is based on a very simple record access model that makes it possible to access individual rows, scan tables and scan using an ordered index. Most relational databases have a similar functional split in their software. Since my research into requirements on telecom databases showed that almost all queries are very simple queries that are part of the traffic protocols, it seemed like a very good idea to make this interface visible to the application developers. Research into other application areas such as genealogy, email servers and charging applications showed similar needs.
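
As a minimal sketch of what this record access model looks like to an application developer, here is a primary key read through the NDB API. The table "t1" with integer primary key "pk" and column "val" is an assumption for the example, and error handling is omitted for brevity.

#include <NdbApi.hpp>

int read_row(Ndb* ndb, int pk_value) {
  const NdbDictionary::Dictionary* dict = ndb->getDictionary();
  const NdbDictionary::Table* tab = dict->getTable("t1");

  NdbTransaction* trans = ndb->startTransaction();
  NdbOperation* op = trans->getNdbOperation(tab);
  op->readTuple(NdbOperation::LM_Read);     // a single key lookup
  op->equal("pk", pk_value);                // set the primary key
  NdbRecAttr* val = op->getValue("val");    // column to read back

  trans->execute(NdbTransaction::Commit);   // round trip to the data node
  int result = val->int32_value();
  ndb->closeTransaction(trans);
  return result;
}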

Today MySQL Cluster still benefits from this record-level API, which makes it easy to integrate MySQL Cluster also into big data applications. It has also made it easy to build other types of interfaces to the data nodes, such as LDAP interfaces, Java interfaces and so forth.

Eventually I wrote an email to the Ericsson CEO, focused on the use of NDB for a web cache server that we prototyped together with the Ericsson IT department. The email was sent back to my organisation, which quickly ignored it, but the CEO also sent the email on to a new Ericsson organisation, Ericsson Business Development. I met with one of the key people in this organisation, Gunnar Tyrsing, whom I had got to know when we both participated in a Ph.D. course on data communication. He had made a fast career developing the Japanese 2G mobile system. He was therefore a manager who understood both technology and management, and he quickly understood the possibilities of NDB. He forwarded the assignment to work with me to Ton Keppel, who became the manager of NDB development for the next 3 years.

Together with Ton Keppel I built up a new NDB organisation, and we hired a lot of people in 2000-2002. Many of those are still working at MySQL, some of them in MySQL Cluster development but also a few in management positions. One of the first things we did was to convert the NDB code into a C++ program.

Those days were interesting but very hectic. I even had my own personal secretary for a period when I had meetings all the time: developer meetings, management meetings, meetings with our owner Ericsson Business Innovation, meetings with potential customers and many, many meetings with potential investors. I participated in a course in 2000 on innovation in a large organisation where I learned a lot about marketing, and we visited a number of VCs in the US. While visiting one of these VCs in April 2001 I remember him stating that 2 weeks previously the market for business-to-business web sites had imploded. This was the first sign of the IT crash I saw. Half a year later the problems came to us. In 2000 we had access to an almost unlimited budget, and at the end of 2001 we started working with a more normal and constrained budget. At this time the number of ventures was cut from 30 to 5 in half a year. So every week a new venture was closed, and we all knew that our venture could be closed at any time. Somehow we managed to survive, and we were one of the 5 ventures that were left after the downsizing completed.

We refocused our efforts towards the telecom market after the IT crash; our first target market had been the stock exchange market, which wasn't a bright market after the stock market took a turn for the worse. This meant that we finalised all the work on node recovery and system recovery that had been started in 1998-1999. We found a first telecom customer in Bredbandsbolaget, and they developed a number of their key network applications using MySQL Cluster, which many other customers have since replicated.

At this time in history all telecom vendors were bleeding and downsizing. Ericsson downsized from 110,000 employees to around 47,000 employees in a two-year period. So eventually the turn came also to Ericsson Business Innovation. All remaining ventures, including us, had to start searching for new owners. Actually the quality of the ventures was at this point very high, since all of them managed to find new owners in just a few summer months in 2003.

So on September 2, 2003 the acquisition of NDB into MySQL was completed and the NDB team was now part of the MySQL team. Our team came into MySQL at a very early point and significantly increased the number of employees in MySQL at the time.

Our team continued working with existing customers, and we also started on integrating the NDB Cluster storage engine into MySQL. We decided to call the combined MySQL and NDB product MySQL Cluster, a name we still use.

After less than a year of working at MySQL, the VP of Engineering Maurizio Gianola approached me: he had decided that I should work on new assignments. This happened exactly at the time when the commercial success of MySQL Cluster started to happen. Obviously, after working on a project for more than 10 years it was hard to all of a sudden start working on a new project. Interestingly, I think it was good for me as a person to make this shift. What happened was that NDB was ready for success and needed a lot of detailed work on bugs, support, sales and new features for the customers. I have always been an innovator, and moving into a new project therefore meant that I could innovate something new.

The new project I got involved in was MySQL partitioning. At first I had to learn the MySQL Server inside out, and in particular I had to learn the MySQL parser, which was far from my areas of expertise. I had great help from Anthony Curtis in this time, and I spent about one year developing the code and one year fixing the bugs before it was released as part of MySQL 5.1. I had seriously good help on the QA part of this project, where Matthias Leich helped me by writing many new test programs that exercised the new partitioning code. Also Peter Gulutzan had a special genius for writing test cases that found bugs. When I saw his test cases I quickly concluded that he had a lot of imagination about how SQL can be used.

Working with MySQL code improved my skills as a programmer. In the Ericsson environment I learned to work on an architectural level; at Ericsson I developed mind models that I finally "dumped" into source code. At MySQL I learned more about the actual coding skills.

At this point in time I decided to do something revolutionary in my life: I resigned from MySQL and started working as an independent consultant. I always want to have a feeling that things are moving upwards, and at this point in my life I didn't get this feeling. So I decided to be a bit revolutionary and take some risks. I had a large family with 5 kids at the time, so financially it was a bit of an adventure. It did however work out very well. I still continued to dabble with MySQL development as a consultant part-time, but I also worked with other companies. I worked a lot with a Norwegian company, Dolphin, helping them develop marketing material for their network cards that could be used in combination with MySQL Cluster. I also worked as a MySQL consultant in the Stockholm area; I even developed a brand new 3-day MySQL course in 12 days, producing 300 new slides in a short time. It was a fun time, the company went well, and I booked around 45 consultancy hours per week, so financially I came out quite ok.

As part of this company I also started developing my own hobby project, iClaustron, which I am still developing. It's mainly an educational project for me, but eventually I might release the code; after 7 years I have produced around 60,000 lines of code and it is actually starting to do something useful :).

Meanwhile MySQL had found a new VP of Engineering in Jeffrey Pugh, and he was looking for a Senior MySQL Architect. He gave me an offer I could not refuse: it both meant that I regained the feeling of moving upwards in my career and was financially ok. Initially I worked mostly on getting a lot of rock stars to communicate. We had a number of strong, outspoken architects in the organisation, many of whom were very well-known in the MySQL world; we also had a number of competent but less outspoken architects. So my work was to try to get these people to reach agreement on technical development, which wasn't always so easy. We did however manage to shield the developers from the technical debates and ensure that instead of debating each new feature in a large group, we allowed the developers to work with one or two architects in developing the new features. This has paid off well now in that our developers are much more independent in the current development organisation.

At the end of 2008 it became clear that MySQL was in great need of a scalability project; at this time we were part of Sun after being acquired in early 2008, and we had a skilled Sun team assisting us in this project. Given my background in performance-oriented design I dedicated much of my time over the next two years to this project. We delivered a breakthrough scalability project about once per year at the MySQL conference. At the first of these, in 2009, we also got word of the Oracle acquisition. I personally met a person coming out of the elevator at 7am on the way to breakfast who asked how I felt about being acquired by Oracle. We have continued on this trend of delivering new scalability enhancements all the way until now, when MySQL 5.6 was released with substantial performance enhancements, and we've come a really long way on this path since we started this development almost 5 years ago.

In the meantime I also worked with Kelly Long to develop the MySQL Thread Pool design. There was a previous design in the MySQL 6.0 project, but this design didn't scale, so we opted for a completely new design based on splitting connections into thread groups. Kelly Long left MySQL to do a new startup in the networked storage business, and this week I read how he sold his new venture for $119M, so working with MySQL means working with many talented people. The thread pool design was recently updated for MySQL 5.6 and its design rocks also at much higher throughput.

The last few years have seen tremendous performance improvements in various areas. Much of the scalability improvements now happen as part of the normal engineering process, and I mainly focus on helping out in small directed projects. So I helped out with the split of the LOCK_open mutex, the increased scalability of the binlog group commit writes, and resolving the bottleneck in InnoDB we called G5, which essentially was a 1-line problem.

Other areas I have focused my attention on have been back in MySQL Cluster. In MySQL Cluster 7.2 we increased the scalability of MySQL Cluster by almost 10x through adding possibilities for multiple send threads, multiple receive threads, multiple TC threads and scaling to even more local database threads (LDM threads). In MySQL Cluster 7.3 I just completed a much improved scalability of the NDB API.

For me the last few years have been interesting. I have had some tough health issues, but at the same time the efforts we have made have paid off in ways never seen before. We've increased performance of the MySQL Server by 3.3x, we've increased performance of MySQL Cluster by almost 10x, we've demonstrated 1 billion updates per minute in MySQL Cluster, and we're ready to scale MySQL Cluster 7.3 to hundreds of millions of reads per second.

I tend to call the things I've done the last few years surgery. I go into working code, find the hot-spots and find ways to remove them. The patches to do this are usually very small, but one needs to be very careful with the changes since they touch parts of the code that are extremely central. As with real surgery, the impact of these surgical patches can be astounding.

So where does the future take us? Well, there are still many interesting things to further develop. Every time we put some effort into parallelising some part of the MySQL code we can expect at least a 10x speedup. There are possibilities to speed up local code in most areas, and there are possibilities to use the dramatic surge of CPU power to use compression even for main memory data. There are lots of tools that we can build around MySQL that provide more scalability and more high availability. We also know that the HW industry is innovating, and they are likely to put things on our table that enable new things we never dreamed of doing before. One thing I particularly look forward to is when the HW industry can replace hard drives with something similar to DRAMs but with persistent memory. When this happens it can change a lot of things we currently take for granted.

I have found in my career that as an innovator it's important to be ready to let go of your development and put it into other people's competent hands. This means that one can move on to new areas. If one continues to work in the same organisation, one can then always return to the old work areas. So I still do a lot of work in MySQL Cluster, I still review work done on MySQL partitioning, I still participate in working on the MySQL thread pool, I still help out on various MySQL scalability projects, and there are even some new projects yet to be released where I continue in the review role after helping out in the early phases.

I did however keep one project solely to myself, and this is the iClaustron project. It's nice to be able to work on something completely without regard to any other person's views and do exactly as I please. It's refreshing to do this every now and then, and for me it has served as a very important tool to keep me up-to-date on build tools, code organisation, modularisation of code and many other aspects of software engineering.

Tuesday, May 14, 2013

MySQL Cluster 7.3 Improvements - Connection Thread Scalability

As many have noted, we have released another milestone release of MySQL Cluster 7.3. One of the main features of 7.3 is obviously foreign keys. In this post I am going to describe one more feature added to MySQL Cluster in the second milestone release, called Connection Thread Scalability.

http://dev.mysql.com/tech-resources/articles/cluster-7.3-dmr2.html

Almost all software designed for multithreaded use cases in the 1990s has some sort of big kernel mutex; as a matter of fact this is also true for some hyped new software written in this millennium and even in this decade. Linux had its big kernel mutex, InnoDB had its kernel mutex, and the MySQL Server had its LOCK_open mutex. All these mutexes are characterized by the fact that they protect many things that often have no connection with each other. Most of these mutexes have been fixed by now: the Linux big kernel mutex is almost gone, LOCK_open is more or less removed as a bottleneck, and the InnoDB kernel mutex has been split into ten different mutexes.

In MySQL Cluster we have had two types of kernel mutexes. In the data nodes the "kernel mutex" was actually a single-threaded execution model. This model was very efficient but limited scalability. In MySQL Cluster 7.0 we extended the single-threaded data nodes to be able to run up to 8 threads in parallel instead. In MySQL Cluster 7.2 we extended this to support 32 or more threads.

The "kernel mutex" in the NDB API we call the transporter mutex. This mutex meant that all communication from a certain process, using a specific API node, used one protected region that protected communication with all other data nodes. This mutex could in some cases be held for substantial times.

This has meant that there has been a limitation on how much throughput can be processed using one API node. It has still been possible to achieve high throughput by using multiple API nodes per process (this is the configuration parameter ndb-cluster-connection-pool).
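
For an NDB API application the equivalent workaround is to open more than one cluster connection from the same process. Here is a minimal sketch of that, assuming a management server address of "mgmhost:1186"; each connection becomes its own API node with its own transporter mutex.

#include <NdbApi.hpp>

int main() {
  ndb_init();
  const char* connectstring = "mgmhost:1186";   // assumed for the example
  const int pool_size = 4;                      // one API node per connection
  Ndb_cluster_connection* pool[pool_size];

  for (int i = 0; i < pool_size; i++) {
    pool[i] = new Ndb_cluster_connection(connectstring);
    pool[i]->connect();                         // allocates an API node id
    pool[i]->wait_until_ready(30, 0);           // wait for the data nodes
  }

  // Application threads would now be spread over the connections so
  // that no single transporter mutex serialises all traffic.

  for (int i = 0; i < pool_size; i++)
    delete pool[i];
  ndb_end(0);
  return 0;
}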

What we have done in MySQL Cluster 7.3 is to fix this bottleneck. We have split the transporter mutex and replaced it with mutexes that protect sending to a specific data node, mutexes that protect receiving from a certain data node, mutexes that protect memory buffers and mutexes that protect execution on behalf of a certain NDB API connection.
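
The shape of this change can be sketched as follows; this is an illustration of the idea with made-up names, not the actual NDB API data structures.

#include <mutex>

const int MAX_DATA_NODES = 48;   // an arbitrary number for the sketch

// Before 7.3: one mutex serialising everything on the API node.
struct ApiNodeBefore {
  std::mutex transporter_mutex;  // send + receive + buffers + execution
};

// In 7.3: independent protected regions, so sends to one data node,
// receives from another and buffer handling can proceed in parallel.
struct ApiNodeAfter {
  std::mutex send_mutex[MAX_DATA_NODES];   // one per data node
  std::mutex recv_mutex[MAX_DATA_NODES];   // one per data node
  std::mutex buffer_mutex;                 // memory buffers
  std::mutex execution_mutex;              // work on behalf of a connection
};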

This means a significant improvement of throughput per API node. If we run the flexAsynch benchmark with just one data node, where one API node handles around 300-400k transactions per second, this improvement increases throughput by around 50%. A Sysbench benchmark with one data node is improved by a factor of 3.3x. Finally, a DBT2 benchmark with one data node is improved by a factor of 7.5x.

The bottleneck for an API node is that only one thread can process incoming messages for one connection between an API node and a data node. For flexAsynch there is a lot of processing of messages per node connection; it's much smaller in Sysbench and even smaller in DBT2, and thus these benchmarks see a much higher improvement due to this new feature.

If we run with multiple data nodes the improvement increases even more since the connections to different data nodes from one API node are now more or less independent of each other.

The feature will improve performance of applications without any changes to the application; the changes are entirely inside the NDB API and thus improve performance of NDB API applications as well as MySQL applications.

It is still possible to use multiple API nodes for a certain client process. But the need to do so is much smaller and in many cases even removed.

Monday, May 13, 2013

MySQL Thread Pool in 5.6

MySQL Enterprise Edition included the thread pool in its MySQL 5.5 version. We have now updated the thread pool also for the MySQL 5.6 Enterprise Edition.

You can try it for free at trials

The MySQL thread pool is developed as a plugin, and all the interfaces needed by the thread pool are part of the MySQL community server, enabling anyone to develop their own version of the MySQL thread pool. As part of the development of the thread pool we did a lot of work to make it possible to deliver stand-alone plugins using the interfaces provided by the MySQL server. Most of these interfaces were available already in MySQL 5.1, but we extended them and made them more production ready as part of MySQL 5.5 development. So a plugin can easily define its own configuration variables and its own information schema tables, and also easily use performance schema extensions. A plugin can also have its own set of tests.

Thus one can build a new plugin by adding a new directory in the plugin directory. What is needed is then the source code of the plugin, the tests of the plugin and finally a CMakeLists.txt file to describe any special things needed by the build process. The public services published by the MySQL server are found in the include/mysql directory. There are currently six services published by the MySQL Server in 5.6: the my_snprintf, thd_alloc, thd_wait, thread_scheduler, my_plugin_log and mysql_string services. One usually also needs to integrate a bit more with the MySQL server, as the work on modularizing the MySQL server is a work in progress. To help other developers understand what private interfaces are used by a plugin, some plugins also provide a plugin_name_private.h file. There is such a file for the thread pool and for InnoDB currently.
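
To make the structure concrete, here is a minimal sketch of what the source file of a new plugin can look like, using a daemon plugin as the simplest plugin type. The name "hello_plugin" and all the details are made up for illustration.

#include <mysql/plugin.h>

static int hello_plugin_init(void*) { return 0; }    // run at INSTALL PLUGIN
static int hello_plugin_deinit(void*) { return 0; }  // run at UNINSTALL PLUGIN

struct st_mysql_daemon hello_plugin_info = {
  MYSQL_DAEMON_INTERFACE_VERSION
};

mysql_declare_plugin(hello_plugin)
{
  MYSQL_DAEMON_PLUGIN,
  &hello_plugin_info,
  "hello_plugin",
  "An example author",
  "Minimal example daemon plugin",
  PLUGIN_LICENSE_GPL,
  hello_plugin_init,
  hello_plugin_deinit,
  0x0100,          /* version 1.0 */
  NULL,            /* status variables */
  NULL,            /* system variables */
  NULL,            /* reserved */
  0,               /* flags */
}
mysql_declare_plugin_end;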

Buying and reading the book MySQL 5.1 Plugin Development will be helpful if you want to develop a new MySQL server plugin. The MySQL documentation also has a chapter on extending MySQL at http://dev.mysql.com/doc/refman/5.6/en/extending-mysql.html

The thread pool has now been updated for the MySQL 5.6 version. Obviously, with the much higher concurrency of the MySQL Server in 5.6, it's important that the thread pool doesn't add any new concurrency problem when scaling up to 60 CPU threads. The good news is that the thread pool works even better in MySQL 5.6 than in MySQL 5.5. MySQL 5.6 has fixed even more issues in the execution of many concurrent queries, and this means that the thread pool provides even more stable throughput, almost independent of the number of queries sent to it in parallel.

Our benchmark numbers using Sysbench OLTP RW and Sysbench OLTP RO show that the thread pool continues to deliver at least 97% of the top performance even when going to 8192 concurrent queries sent to it. The thread pool has a slight overhead of 2-3% at low concurrency, but the benefit it provides at high concurrency is breathtaking. At 8192 concurrent queries the thread pool delivers 13x more throughput compared to standard MySQL 5.6.



So let's step back and see in which cases the thread pool is useful. Most production servers run at low loads most of the time. However, many web servers have cases where a massive load happens every now and then. This could be due to more users accessing the site, but it could also be due to changes of the web site that mean many web caches need to be refreshed. In those cases the MySQL Servers can all of a sudden see a burst of queries, which means that the MySQL Server has to execute thousands of queries simultaneously for a short time.

Executing thousands of queries simultaneously is not a good use of the computer's resources. The problem is that there is a high chance that we run out of memory in the machine and have to start using virtual memory not resident in RAM, which has a severe impact on performance. Also, switching between thousands of threads imposes a high cost on CPU cache usage, since each time a thread is swapped in it has to build up the CPU caches again, and this means more work for the CPUs. There are also other scalability issues within the MySQL server code that make contention on certain hot-spots higher when many queries are executed concurrently.

So how does the thread pool go about resolving this problem to ensure that the MySQL Server always operates at optimum throughput? The first step is that when many concurrent queries arrive, these queries are to some extent serialised: rather than executing all queries at once, we put the queries in a queue and execute them one by one. Obviously there needs to be some special handling of long-running queries to ensure that they don't block short queries for too long. Some special handling is also needed for queries that block for some reason, such as on a row lock, a disk I/O or another wait event.

The thread pool divides connections into thread groups. The number of thread groups is configurable, and on really big machines it can pay off to have as many as 60 thread groups using MySQL 5.6. Each thread group normally executes either 0 or 1 query. Only in the case of blocked queries and long-running queries does a thread group execute more than 1 query for a short time.
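
As a rough illustration of the idea (a sketch of the concept, not the actual server code), connections can be spread over a configurable number of thread groups like this; the constructor parameter name mirrors the real thread_pool_size server variable.

#include <atomic>
#include <vector>

struct ThreadGroup {
  // queues, worker threads and statistics would live here
};

class ThreadPool {
 public:
  explicit ThreadPool(unsigned thread_pool_size)
      : groups_(thread_pool_size) {}

  // Assign a new connection to a group round-robin, so load is spread
  // evenly and each group normally runs 0 or 1 query at a time.
  ThreadGroup& assign_connection() {
    unsigned idx = next_.fetch_add(1, std::memory_order_relaxed);
    return groups_[idx % groups_.size()];
  }

 private:
  std::vector<ThreadGroup> groups_;
  std::atomic<unsigned> next_{0};
};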

The above description serves to avoid the problems with swapping when using too much memory, and also fixes the issue with too many context switches taxing the CPUs.

There is however still one problem that can cause issues, and this is the fact that transactions hold resources even when no query is running at the moment. To solve this problem the thread pool puts a higher priority on connections that have already started a transaction. Started transactions can hold locks on rows that block other transactions from proceeding, and they are part of sensitive hot-spot protected regions that can slow down the execution of all other transactions. So giving started transactions a higher priority makes a lot of sense. Indeed, this feature means that we can sustain 97% of the optimum throughput even in the context of 8192 concurrent queries running; without this feature we would not be able to handle more than around 70% of the optimum throughput at such high concurrency.
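
The priority scheme can be sketched like this; again an illustration with assumed names, not the server's actual data structures.

#include <deque>

struct Connection {
  bool in_transaction;   // has this connection started a transaction?
};

class GroupQueue {
 public:
  void enqueue(Connection* c) {
    if (c->in_transaction)
      high_prio_.push_back(c);   // finish started work first: it may
    else                         // hold row locks others are waiting on
      low_prio_.push_back(c);
  }

  Connection* dequeue() {
    std::deque<Connection*>& q =
        !high_prio_.empty() ? high_prio_ : low_prio_;
    if (q.empty()) return nullptr;
    Connection* c = q.front();
    q.pop_front();
    return c;
  }

 private:
  std::deque<Connection*> high_prio_;
  std::deque<Connection*> low_prio_;
};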

One way of describing the thread pool is as insurance. If you run without the thread pool, your system will run just fine most of the time. But in the event of spike loads, where the system takes a severe hit, the thread pool serves as a protection measure ensuring that the system continues to deliver optimum throughput even with massive amounts of incoming queries.

The above description fits the use cases for InnoDB. It is also possible to use the thread pool in MySQL Cluster 7.2 and onwards. For MySQL Cluster the thread pool has a slightly different use case: here the thread pool ensures that the data nodes of MySQL Cluster are protected from overload. Setting the number of thread groups for a MySQL Server in MySQL Cluster means that this server can only process that many queries concurrently, thus ensuring that the data nodes are protected from overload and that we always operate in an optimal manner. Naturally, with MySQL Cluster one can have many MySQL Servers within one cluster, and thus the thread pool can provide a more stable throughput also in highly scalable use cases of MySQL Cluster.

Monday, April 29, 2013

MySQL Architect at Oracle

I have worked as an architect in the MySQL/NDB world for more than 20 years. I am still working at Oracle and I like it here. Given all the FUD spread about MySQL, I thought it might be a good idea to spread the word about all the great things we're doing with MySQL at Oracle.

#1 We are working on improving modularity in the MySQL code base
In the early days of MySQL, the MySQL development had serious issues with its development model. It was a model designed for a small code base. I used to work at Ericsson, which develops telecom switches, systems with tens of millions of lines of code. Such large systems require modularity. The Ericsson switches were developed with modularity built into the programming language already since the 70s. Even with this modularity a second level of modularity was required. The learnings from this reengineering project, which spanned more than a decade, have given me valuable insights that can now be put to use in the development of the MySQL architecture.

When we're developing MySQL at Oracle we have a long-term view on the development, and we've taken a lot of steps to make the code more modular. This means more work when developing a feature from a short-term point of view, but it pays back very quickly. As an example of this, we rearchitected the meta-data locking model in MySQL over a period of almost 5 years. Due to this rearchitecting effort we were able in MySQL 5.6 to split one of the crucial locks in the MySQL Server, the LOCK_open. This meant payback time, as it improved our top performance by more than 100% in one small incremental step.

The principal idea is that we try to put new areas of development into separate modules which are as independent as possible from the rest of the MySQL Server code. Old code cannot be modularised in one swift move; this code has to be improved upon in small reengineering steps. Eventually those steps lead to more modularity and easier changeability also of the old code base. We expect to continuously improve the MySQL architecture in this manner.

Does this work benefit the entire MySQL community? Yes, modular code is the very foundation for successful open source code development also by the MySQL community. MySQL has a thriving development community, and we expect it to thrive even more with the improved modularity we add to the MySQL code base.

#2 We have an improved development model
One of the first things I got engaged in when I was assigned as Senior MySQL Architect in 2007 was to improve the MySQL development model. Previously we had a model where features were developed as projects and then pushed to the next release branch; often new features were pushed in several steps. This model had issues with quality and development speed. It often led to prioritisation issues when coming close to GA releases, and code was often pushed very close to GA releases without proper quality. After MySQL was acquired by Sun and Oracle we have managed to get this under control using a new development model.

In the new development model we wanted to ensure that we took small steps forward (called milestone releases), always with retained quality. So in the new model we develop in 3-6 month steps, each step is quality assured, and we use a lot of QA resources to ensure every milestone release is ok from a quality point of view. Some of these milestone releases are then taken out to become new GA releases. The GA releases are then put through our QA grind, where most bugs are squeezed out of the code.

We introduced this new model in 2009; it positively affected the MySQL 5.5 development, and MySQL 5.6 has been completely developed according to this model. One only needs to look at what we achieved with MySQL 5.6 to understand that the new model works very well. We're continuing to use this model for new steps in the MySQL development.

The model is described in http://dev.mysql.com/doc/mysql-development-cycle/en/index.html

Our MySQL development is divided into a number of parts: we have a team taking care of InnoDB, another team taking care of MySQL Cluster (with the NDB storage engine), one team dealing with partitioning, an optimizer team, a replication team, a runtime team taking care of things such as metadata handling and other things related to query execution, and a general team taking care of support code, networking and client code.

We also have teams handling MySQL Connectors, utilities and other things needed around the MySQL Server. There is also a performance and architecture team focusing on scalability, special projects and all sorts of improvements to ensure that our users get a good experience of using MySQL. Naturally we also have teams focusing entirely on the quality of the code, and a separate team working on MySQL Workbench and related tools to support MySQL management and development.

#3 We have great performance experts for the MySQL development
The community has lots of performance experts that understand how to help users get the best out of a MySQL Server. The community also has a number of people who are able to pinpoint performance issues in the MySQL Server. The patches these people develop are very useful in showing what can be improved in the MySQL codebase. Here the community is very useful to the development of the MySQL Server, and in this area we see very good cooperation between our developers and the MySQL community developers.

We have also been able to make numerous performance improvements independently through the combination of MySQL experts and InnoDB experts that we have internally at Oracle. If you need proof of this, just go to the MySQL 5.6 page and see the demonstrated performance improvements we've made in MySQL compared to MySQL 5.5. This is an area I have contributed a lot to personally, and I find it interesting how we managed to increase the scaling of the MySQL Server from 4 CPU threads to almost 64 CPU threads in just a few years.

#4 We continuously develop MySQL partitioning
We continuously improve the scalability of our partitioning solution and integrate it further into the main storage engines, as evidenced by the new set of partitioning features in MySQL 5.6.

#5 We have the most serious QA team in the MySQL world
Through the use of a large QA team we do our best to ensure that our community users are shielded from bugs in our released code. We also work with the community through milestone releases, lab releases and RC releases to ensure that our GA releases have top notch quality. We tripled the size of the QA teams as part of the MySQL 5.6 development.

Anyone can of course add new features to these MySQL trees, but it will be the community that mainly becomes the QA team for these additional features. We strive hard to ensure that our community users are shielded from being our QA team.

Although the MySQL code base is continuously growing, the code coverage of our tests has continuously improved going from MySQL 5.1 to MySQL 5.6.

#6 We are continuously developing InnoDB inside Oracle
The major storage engine for MySQL is InnoDB, and the InnoDB developers are also located within Oracle. We've seen many great benefits of being together inside the same company. This means that we can integrate InnoDB better, and improve scalability and functionality more, than ever before.

#7 We have a thriving MySQL Cluster development in Oracle
All MySQL Cluster development currently happens inside Oracle. I am still very much involved in this development, and I sit next to a number of the key Cluster developers in my office in Stockholm, with a beautiful view over the Stockholm city centre. We have shown order-of-magnitude improvements in scalability, performance per node and so forth through regular and steady development. MySQL Cluster even has the capability to execute parallel queries as of version 7.2.

#8 We are making great efforts to make MySQL manageable
Many community developments are centered around getting more access to internal statistics about MySQL Server behaviour. Since a few years back we have a focused effort in this area to provide anything a user could be interested in through the performance schema. These tables gather data internally and provide access to the statistics through a set of performance schema tables. This effort is continuing; there is still more work to be done in this area.

#9 We have the largest and best optimizer team in the MySQL world
Our team of optimizer developers consists of developers with loads of experience in developing DBMSs. Many of the developers have a background in developing SQL databases and also have an academic background, which is almost a requirement to understand the fairly complex optimizer algorithms required in a relational DBMS. The team has made a tremendous effort in preparing a load of new optimizer features for MySQL 5.6.

#10 A thriving team working on MySQL Replication
The MySQL success has always been closely intertwined with the ability to replicate between MySQL servers using the master-slave model. The requirements for better failover capabilities and manageability of large sharded environments have led to an ever increasing set of new functionality in the replication area. We're continuing this development and working to provide even more functionality for the MySQL community that will be useful in building large MySQL installations.

#11 We have many other teams
The runtime team is doing a lot of work on handling metadata changes online, making the MySQL Server more scalable and making the server more modular.

The general team is ensuring that MySQL becomes more modular, that our support code is further developed, and that we also have more unit tests on our code.

We have a thriving utilities team that is developing various tools useful for people managing MySQL Servers.

Each of the teams mentioned above has developed a great number of new features in the respective MySQL versions, making it quite clear that MySQL development is thriving as never before.

So if you're looking for a MySQL version to use that has the best performance, the best stability and the most future-proof code base, then use MySQL 5.6.

The MySQL community continues to be an interesting place to work in. Oracle leads the development of new MySQL versions, we have a number of large companies that are building massive infrastructure components based on MySQL (Google, Facebook, Twitter and so forth), and we have a number of forks that also develop their own versions. There are also many new interesting requirements for MySQL to behave well in a highly scalable environment, and this requires even more work on the MySQL replication architecture.

As a MySQL architect I am able to work with many parts of the MySQL engineering organisation and also the MySQL support organisation. There is a lot of interesting work going on that will continue to increase the use cases of MySQL for the MySQL community. So my work continues to be interesting and it's very rewarding to develop new functionality knowing that it will be used by many organisations in the world.

When I tell my kids or their friends about my work, I tell them that I work on things that make their computer games continue to tick, since this is what most kids care about in the IT community. Personally I find the use cases for genealogy and other cool stuff even more interesting.