Wednesday, April 01, 2015

Benchmarking MySQL Cluster 7.4 on an Intel NUC

I have done a lot of benchmarks of MySQL Cluster on large servers, which
is obviously very interesting. As mentioned in a previous blog I now have
an Intel NUC machine easily accessible. So I thought it would be
fun to run some benchmarks on this machine to see how fast MySQL Cluster
runs on small HW.

First a little description of the HW. The CPU is an Intel Core i5-4250
CPU. It runs at 1.3 GHz and has a turbo frequency of 2.3 GHz. The CPU
has two cores and each core can run two threads simultaneously
(called hyperthreading in Intel CPUs). The NUC comes as a box containing
the motherboard and the CPU; you then add one or two DRAM modules and an
SSD drive. I installed two DDR3L DRAM modules, which gives me a total of
16 GByte of memory in the machine. In addition I installed a 256 GByte
SSD drive. The box fits nicely into the palm of your hand.

On top of this HW I installed Oracle Linux 7.

The configuration of the benchmark uses my laptop as the benchmark driver
and the Intel NUC to run the data node. Since I only have one
Intel NUC available, the benchmarks focus on a setup with only one
replica.

To start with I will focus this blog on the performance of a pure
main-memory table that is not recoverable: after a restart the table
still exists, but it is empty. This kind of table is useful to keep
track of live state where it doesn't make sense to recover the data,
since the live state changes so often that the restored data would be
useless.

When recoverability is active on the table it adds lots of writes
of logs and checkpoints, which adds an extra overhead of about
30-50% for inserts, deletes and updates. Reads are about as fast
with recoverability as without. I'll dig into this type of benchmark
in a later blog.

So to the results of the benchmark run. Inserts were processed at a
rate of 400k per second, updates at 402k per second and deletes at
433k per second. For reads I got up to 589k per second. For both writes
and reads we're operating here at full CPU speed and also at full
network speed (Gigabit Ethernet).

This benchmark was executed with a primary key of 8 bytes and one
field of 100 bytes.
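
To give an idea of what such a table could look like, here is a minimal
sketch. The table and column names are made up for illustration, and I
am assuming the ndb_table_no_logging session variable, which is the way
to create non-checkpointed (non-recoverable) NDB tables; the actual
benchmark table definition is not shown in this post.

  SET ndb_table_no_logging = 1;
  CREATE TABLE bench_table (
    pk   BIGINT UNSIGNED NOT NULL PRIMARY KEY,  -- 8 byte primary key
    data CHAR(100) NOT NULL                     -- 100 byte field (single-byte charset)
  ) ENGINE=NDBCLUSTER;
  SET ndb_table_no_logging = 0;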

I was curious to see what the top rate of key lookups would be with very
small records, where the network speed isn't a factor. So I changed
the field from 100 bytes to 8 bytes and reran the benchmark.
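
Continuing the sketch above, the smaller record could be achieved by
simply shrinking the field (or recreating the table); this is just an
illustration, not necessarily how the benchmark table was changed:

  -- NDB performs this change of column size as a copying ALTER
  ALTER TABLE bench_table MODIFY data CHAR(8) NOT NULL;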

Now I got inserts at 476k per second, updates at 479k per second,
deletes at 474k per second and finally reads at 925k per second.
So close to a million reads per second on this tiny machine alone.

Given that I ran without logging to disk, most of the data node
configuration was pretty straightforward. The most important
part is obviously how to make use of the 4 CPU threads on the 2
available cores.

As usual the LDM thread, which is where the actual database resides, is
the thread that requires the most CPU. In this particular benchmark
the TC threads also require a lot of CPU since they are involved
in each and every key lookup. For e.g. applications with scans the
TC thread is a lot less involved. There are also a send thread
and a receive thread that are important to configure correctly.
These four thread types are the most important to configure when
running without recoverability. When adding recoverability the
configuration of the file IO threads is also important. There are
also a main thread, a replication thread and some connection
threads, but these have no impact on the performance of this
benchmark; they were simply configured to be on the CPU with the
lowest use.

The best results were achieved with only one thread per thread
type. The best config of those was to let the LDM thread use its
own CPU core, so that this thread is not affected by hyperthreading
to any great extent. One CPU thread in the other core was then used
for the TC thread and the other CPU thread was used for the send
and the receive thread. This CPU thread also handled the other
threads.
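
A ThreadConfig setting along the following lines in the [ndbd] (or
[ndbd default]) section of the cluster configuration file expresses
this layout. This is a sketch of the idea rather than the exact config
used; in particular the CPU numbers assume that CPUs 0 and 2 are the
two hyperthreads of one core and CPUs 1 and 3 the two hyperthreads of
the other, which you should verify with lscpu on your own machine:

  ThreadConfig=ldm={count=1,cpubind=0},tc={count=1,cpubind=1},send={count=1,cpubind=3},recv={count=1,cpubind=3},main={count=1,cpubind=3},rep={count=1,cpubind=3}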

When running with 2 LDM threads the number of threads increased
and the OS had to do more scheduling. Most likely this config
could provide better results under a more bursty load, but for
sustained good performance it was better to use only a few
threads. The performance difference was, however, quite small
between using 1 LDM and 1 TC compared to using 2 LDMs and
2 TCs.
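
For comparison, a 2 LDM / 2 TC layout could be expressed like this
(again just an illustration of the syntax, with the same assumption
about CPU numbering; with only 4 CPU threads the send and receive
threads now have to share CPUs with the TC threads):

  ThreadConfig=ldm={count=2,cpubind=0,2},tc={count=2,cpubind=1,3},send={count=1,cpubind=1},recv={count=1,cpubind=3}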

The benchmark application was executed on my laptop, connected
to the Intel NUC machine through a Gigabit Ethernet switch.