Mikael Ronstrom: November 2009

Friday, November 20, 2009

Partitioning as performance booster

When I developed partitioning for MySQL the main goal was
to make it easier for MySQL users to manage large tables
by enabling them to easily add and drop partitions.

It turns out that partitioning can also be used as a manner
to make MySQL more scalable. The reason is that in some
cases the storage engine have internal locks per table or
per index (one such example is the btr_search_latch in InnoDB).

So in this case adding a
PARTITION BY KEY (key_part)
PARTITIONS 4
to the table definition makes a very hot table into 4 tables
from the storage engine point of view.

This would mostly be beneficial in cases where the main
operation is primary key lookups on the table. Dividing the
indexes in cases of scans can be both positive and negative.
So this solution is definitely not a winner for all situations.

I haven't tried this out yet myself in my benchmark suites,
but I plan to make some experiments in this area. It is usable
in sysbench, it's possible to use for DBT2 (have used partitioning
for DBT2 in MySQL Cluster benchmarks a lot already) and it's
possible to use in Dimitri's dbStress benchmark.

Full automation of DBT2 test runs in benchmark scripts

My benchmark scripts was upgraded once more today.
I fixed all issues pertaining to full automation
of DBT2 runs. So now it is as easy to start a
DBT2 test run as it previously was to start a
sysbench run. Obviously DBT2 was already earlier
supported in the benchmark scripts, so what I did
now was add the final steps to make it fully
automated. This includes also generating the DBT2
load files needed to load the DBT2 database.

See the download section on www.iclaustron.com
for the tarball including some description of
how to configure sysbench and DBT2 test runs.

Tuesday, November 17, 2009

New version of benchmark scripts also supporting Drizzle

I updated my benchmark scripts this week. These scripts can now
run:

- Sysbench benchmarks for MySQL and Drizzle
- DBT2 benchmarks for MySQL and MySQL Cluster
- TPC-W benchmark for MySQL Cluster

There is also a number of scripts to start and stop
MySQL, Drizzle and MySQL Cluster nodes.

In this version I added Drizzle support for sysbench and also
added a README-AUTOMATED file that describes the steps needed
to set-up a completely automated sysbench run for MySQL
and Drizzle.

To run a MySQL sysbench benchmark one needs the DBT2 tarball,
the sysbench tarball and a MySQL tarball (gzipped tarballs).

The tarball is found on www.iclaustron.com in the downloads
section and this version is named dbt2-0.37.47.tar.gz.

Thursday, November 12, 2009

Improvements by LOCK_* patches

I have done a long series of tests to verify that
the impact of the LOCK_alarm removal, removing two
variables from LOCK_threadcount protection to being
atomic increments instead and decreasing hold time
of LOCK_open is positive in most if not all cases.

There are a number of test cases needed:
1) With and Without cpuspeed activated
2) With sysbench local and with sysbench on another
server
3) With MySQL Server limited to 2,4,6,8,12,16 cores.
4) With number of threads going from 1 to 256 threads
in fair sized steps.

I've done most of those tests for Sysbench Readonly
and Sysbench Readwrite. The results are positive in
almost all cases towards the baseline which is based
off the MySQL 5.4.3 tree (not exactly 5.4.3 but close
enough).

In the case of networked benchmark with cpuspeed
activated the gain is biggest, top performance goes up
about 10% and also top performance moves from 64 to 128
threads, for 256 threads the performance increases
by about 50%.

When cpuspeed is deactivated and we use networked
benchmarks the network handling becomes a bottleneck, so
the numbers here are less interesting since we need to
resolve the network bottleneck first.

With cpuspeed activated and local communication the top
performance increase by about 8% and there is gain
for all number of threads. The gain is a bit higher
on sysbench readonly than on sysbench readwrite.

With cpuspeed deactivated and local communication we
naturally get the best numbers but also the
smallest gains. Top performance of Sysbench
Readonly increased by 2.5% and for Readwrite
it increased 4%, the top performance for
sysbench readwrite also moved from 16 threads
to 32 threads. The improvement is slightly
better on more threads.

Wednesday, November 11, 2009

Analysis of InnoDB adaptive hash index parameter for sysbench

As I mentioned in a previous blog post I was suspicious that the
adaptive hash index in InnoDB added to the scalability issues in
the MySQL Server. So I decided to run a test where I disabled the
adaptive hash index.

The results were fairly much as expected. The adaptive hash
index usage improves performance on low thread counts
(up to about 16) by a few percent. However at 32 threads and
beyond the performance is better without the adaptive hash
index and equal in some cases. In particular the top performance
goes up by about 3% when this is disabled.

This confirms the documentation in the InnoDB manual that the
adaptive hash index improves performance as long as the lock
around it doesn't hurt performance. So for the majority of users
it's a good idea to have it turned on, but for users with high-end
servers it's a good idea to test if turning it off will improve
performance.

For sysbench benchmarks it's clearly a good idea to turn it off.

245% improvement of MySQL performance in 1 year

When I did sysbench benchmarks 1 year ago I used a
4-socket server, a Linux kernel based on 2.6.18 and
MySQL 5.1. The sysbench readwrite numbers I got then
was around 2700. When I run the same benchmarks
now the numbers I get are 9300.

These improvements obviously comes from a mixture
of HW development, OS development (now using
a 2.6.31 based kernel) and MySQL development.

The machine is still a 4-socket server, the operating
system is still Linux and the database is still MySQL,
but the performance has improved by 245%. Needless
to say this is an extraordinary performance
improvement in just one year and clearly shows that
the both the HW industry and the open source SW
industry is quickly picking up on how to improve
performance using multi-core multi-socket servers.

Effect of CPU Powersave mode on Sysbench benchmarks

When I started analysing the various patches that I had made
for improving the MySQL Server performance I did by mistake
forget to turn cpuspeed off in Linux. This is a feature that
makes it possible to run the CPU's on a much lower frequency
in cases when they aren't so heavily used.

So at first I considered simply turning it on and forgetting the
data I had produced. Then I realised that actually to have
cpuspeed activated is the default behaviour and for many
servers out there in the world it is actually the best mode since
most servers goes from high load to low load frequently.

So I decided that it would be worthwhile to analyse behaviour
both with and without this feature turned on.

The cpuspeed feature was particularly involved when running
sysbench on a different server, so thus using a socket based
communication. In this case the performance drop off at
64,128 and 256 threads were fairly significant. However the
performance drop off when I added fixes for LOCK_open,
LOCK_alarm and LOCK_threadcount was very significant. I
got 50% better performance with these patches when
cpuspeed was activated and I was running sysbench over the
network.

When I ran sysbench and mysqld on the same host the impact
of the patches and cpuspeed was smaller but still significant.
Turning cpuspeed on decreases the performance of sysbench
readwrite by almost 10% for the baseline (MySQL 5.4.3) whereas
with the patches that fixes the LOCK_* problems the drop in
using cpuspeed is only 1%.

So it seems like having many extra mutexes to pass through
doesn't hurt performance so much when running at full CPU
speed all the time, but as soon as the CPU power save mode
is activated these many extra mutexes to pass through has a
significant negative effect.

GDB stack trace analysis of Sysbench benchmark

After many unsuccessful attempts to get MySQL to run faster I got
a very simple but effective gdb script courtesy of Intel. The script
is very simple and attaches to the mysqld process and does a
backtrace on all threads in the process and then it gathers
statistics on all stacktraces.

With this script I did some analysis on what goes in sysbench
readonly and sysbench readwrite.

Starting with sysbench readonly I discovered a lot of things I
already knew such that LOCK_open is a major bottleneck.
There were also many other things that popped up such as:
LOCK_alarm, LOCK_threadcount, LOCK_grant,
btr_search_latch and the mutex on the table object which is
used to discover TABLE level locks on the MySQL level. This
last lock is contended mostly because sysbench is only
operating on one table. So most normal benchmarks will
not have any major problems with this mutex since it's rare
with applications that put so heavy weight on a single table.

Interestingly the kernel_mutex in InnoDB wasn't so prevalent
in sysbench readonly. Also I was a bit surprised to find the
btr_search_latch there since it's a RW-lock, but it seemed
like every now and then someone took a X-lock on the
btr_search_latch even in readonly queries. Probably has
something to do with InnoDB adaptive hash index.

One surprising lock here is the LOCK_grant which is also
a RW-lock and this is never taken in anything else than
the Read mode unless one changes the grants which
doesn't happen in a sysbench run. Some discussions
concluded that pthread_rwlock is actually implemented
by using a mutex and this is the cause of the contention
on LOCK_grant. So to resolve that a read-optimised
RW-lock is needed for the MySQL Server code.

To remove LOCK_alarm there is already code in the MySQL
Server prepared to do that so the patch to remove
LOCK_alarm is fairly straightforward.

To remove LOCK_threadcount isn't necessary, it's sufficient
to remove two variables thread_running and global_query_id
from being protected by this mutex and instead using
atomic variables. To handle this one can use the my_atomic
framework and add 64-bit support to it. Then the fix of
LOCK_threadcount is straightforward.

To resolve LOCK_open is obviously a bigger problem but a
first step is to simply remove the hash calculation from
LOCK_open.

The MySQL runtime team is working on a metadata locking
infrastucture that was in MySQL 6.0 but still have some
quality issues. But when this code is ready it will also make
it possible to resolve the LOCK_open problem. Actually the
problem then is both about the new lock LOCK_mdl added
by the metadata locking code and LOCK_open. But the new
structure makes it possible to have more aggressive
solutions on both LOCK_open and LOCK_mdl code.

In analysing the sysbench readwrite benchmark the main
contender together with LOCK_open was not very
surprisingly the kernel_mutex in InnoDB. There are some
ideas on how to improve this as well but first things first.

New partitioning SQL syntax added in next MySQL milestone release treee

This blogs gives some insights into the new SQL syntax added in WL#3352
and WL#4444 and WL#4571 which all have been included in the
mysql-next-mr tree which is the base for the next milestone release
codenamed Betony.

The purpose of these new changes is to enable improved partition pruning, and
also making it possible to partition on strings. Supporting TRUNCATE on
partitions will improve partition management and the ability to use separate
key caches for different partitions makes MyISAM partitioned tables more
useful. Finally we added a new function TO_SECONDS which makes it
possible to get more fine-grained dates for partitions and subpartitions.

This information will soon find its way into documentation but if you want
to get started right away here it comes.

There are 5 additions effectively:
1) The ability to RANGE partition by column list instead of by function
2) The ability to LIST partition by column list instead of by function
3) A new function TO_SECONDS which can be used in partition functions
and where partition pruning will be used also on ranges.
4) The ability to TRUNCATE a partition
5) The ability to use a keycache in MyISAM per partition

Here's a few examples of how to use these now additions:

1)
CREATE TABLE t1 (a varchar(5), b int)
PARTITION BY RANGE COLUMNS (a,b)
( PARTITION p0 VALUES LESS THAN ("abc", 1),
PARTITION p1 VALUES LESS THAN ("def", 2));

Some things noteworthy here:
The checks for the constants is fairly strict. Thus using "1" for a constant
to b isn't allowed, the constant must be of the same type as the field it
maps to. Also sql_mode will be ignored for those partition constants.
Thus e.g. even if sql_mode specifies that non-existing dates are allowed
they will not be allowed in the partition constants since these constants
will be a part of the table and need to live for longer than the current
session.

Character sets are allowed and the string constants will be interpreted
in the character set their field belongs to. Also character set strings
without mapping are allowed. If one tries to use SHOW CREATE TABLE
on the partition table and the mapping of the partition constants from
field charset to UTF8 fails or if mapping to client charset fails, then the
partition constants will be written in hex string format.

A partition constant can be MAXVALUE, NULL is however not allowed.
There were some considerations to also be able to use MINVALUE
which effectively would make it possible to create partitions where only
the NULL values can go in. However this is still possible if one knows
the minimum value of the field.

It's possible to partition on integer fields, string fields and date fields.
It's not possible to partition on BLOBs, SETs, ENUMs, GEOMETRY fields,
BIT fields.

2)
CREATE TABLE t1 (a varchar(1))
PARTITION BY LIST COLUMNS (a)
( PARTITION p0 VALUES IN ("a","b","c"));

CREATE TABLE t1 (a varchar(1), b int)
PARTITION BY LIST COLUMNS (a,b)
( PARTITION p0 VALUES IN (("a",1),("b",2),("c",3)));

Noteworthy here is that parenthesis are required when more than one
field is in the list of columns partitioned on. It's required to not use
parenthesis when there is only one field.

NULL values are allowed as in MySQL 5.1 but not MAXVALUE.

3)
CREATE TABLE t1 (a datetime)
PARTITION BY RANGE (TO_SECONDS(a))
( PARTITION p0 VALUES LESS THAN (TO_SECONDS("2009-11-11 08:00:00")),
PARTITION p1 VALUES LESS THAN (MAXVALUE));

Same syntax as in MySQL 5.1 but also possible to use TO_SECONDS as a
partition function.

4)
ALTER TABLE t1 TRUNCATE PARTITION p0;
ALTER TABLE t1 TRUNCATE PARTITION p1;
ALTER TABLE t1 TRUNCATE PARTITION ALL;

This delete all rows in the given partitions and resets the given partitions
auto_increment values (if exists) to 0.

The syntax works in the same manner as for ANALYZE, OPTIMIZE and other
commands that can be applied on partitions already in MySQL 5.1.

5)

CACHE INDEX t1 PARTITION p0 IN keycache_fast;
CACHE INDEX t1 PARTITION p1, p2 IN keycache_slow;
LOAD INDEX INTO CACHE t1 PARTITION p0;
LOAD INDEX INTO CACHE t1 PARTITION p0, p1;

This new syntax makes it possible to have separate key caches for different
partitions in a partitioned table using the MyISAM storage engine.

Tuesday, November 10, 2009

Improvement of MySQL partitioning included in MySQL's next milestone release tree

It was quite some time since I last blogged. It's not due to
inactivity. For those of you that have followed my blog might
have seen earlier blog posts about a new partitioning feature.

This new partitioning I first blogged about in July 2006 and
that blog is still the 3rd most popular blog of my blogs, even
when looking at the last months views. The work on this started
out in 2005 and so it's nice to now get it in a state where it's
quality is ready for more heavy testing. For those interested
in partitioning I think this feature will enlarge the number of
cases where partitioning is applicable. It's now possible to
partitioning on many more field types and also on multiple fields
in an efficient manner.

This feature described by WL#3352 has now been pushed
to the mysql-next-mr tree. For those of you new to our new
milestone release model this tree is where we push new features
before clone off. After clone off this tree is merged with
mysql-trunk tree and after about 6 months of bug fixing a Milestone
Release is performed. The current mysql-trunk tree is the tree from
which the current MySQL 5.4 releases are produced. A milestone
release is of beta quality and some milestone releases will be
continued towards a GA release. There will be a new milestone
release with about 3-6 months interval. For more information on
the release model see here.

The WL#3352 was pushed into the Milestone which has the
codename Betony in the article referred to above. The current
MySQL 5.4 release series is called Azalea (actually 5.4.0 and
5.4.1 belonged to Summit and 5.4.2 was the first Azalea release).

The major advantages of this new feature is that it makes it
possible to partition on string fields and also to partition on
multiple fields and still get good partition pruning. Previous
partitioning required a partition function that delivered an
integer result and a couple of functions could deliver good
partition pruning.

Now it is possible to partition on most fields and even a set
of them and always get good partition pruning.

The final result of the new syntax is the following:

CREATE TABLE t1 (a varchar(5) character set ucs2, b int)
PARTITION BY RANGE COLUMNS (a,b)
( PARTITION p0 VALUES LESS THAN (_ucs2 0x2020, 1),
PARTITION p1 VALUES LESS THAN (MAXVALUE, MAXVALUE));

So the keyword COLUMNS indicates that a list of fields is
used to partition on instead of a function. The new
partitioning applies to RANGE and LIST partitioning.

All the management functions for partitioning still applies.
However the major difference comes when you do a query like:

select * from t1 WHERE a > _ucs2 0x2020;

In this case the partition pruning will discover that only
the partition p1 is possible to find records and will thus
prune away partition p0. In MySQL 5.1 it's only possible
to perform pruning on intervals for single fields and
the partition function must also be of a type that is
safe to always increase such as YEAR or TO_DAYS (actually
this new feature also added the function TO_SECONDS to this
list of functions that can be pruned on efficiently).

So partition pruning on this works very much like an
index. Not surprisingly the partition pruning code
reuses the code for the range optimiser which looks
at what indexes can be used for a certain query.

If you wonder why I am using these _ucs2 constants
as examples it's because I had to learn a lot about the
character set code in MySQL to get everything right with
this feature. Actually even found and fixed a few MySQL
character set bugs in the process :)

Mikael Ronstrom