Wednesday, July 05, 2006

PARTITION by a date column

One of the most common usage of partitioning is where one wants to partition
by date. One might want to have one partition per year or per month or per
week or per day. This blog entry shows how to handle this requirement using
MySQL 5.1.

The most common method to partition in this case is by range.
Partitioning in 5.1 uses a function on one or more fields of the table. In 5.1
there are some requirements on these fields if unique indexes or primary keys
also exist in the table. The reason is that 5.1 doesn't have support for global
indexes. Development of support for this have however started so should be in
some future release of MySQL.

In 5.1 functions have to return an integer. There are two functions that has
special support in the MySQL server. These are TO_DAYS() and YEAR(). Both
of these functions take a DATE or DATETIME argument and return an integer.
YEAR() returns the year and TO_DAYS() returns the number of days passed
since a particular start date.

The MySQL optimizer has special support for these two partition functions. It
knows that those functions are strictly increasing and use this knowledge to
discover that queries such as:

SELECT * from t1 WHERE a <= '1991-01-31' AND a >= '1991-01-01';
with a partition function PARTITION BY RANGE (to_days(a)) can be mapped
to a range of partition function values starting at
TO_DAYS('1991-01-01') and ending at TO_DAYS("1999-01-31")

Thus the MySQL Server can map TO_DAYS('1991-01-01') to a starting partition
and TO_DAYS('1991-01-31') to an ending partition. Thus we only need to scan
partitions in a range of partitions.

Most functions don't have this nice mapping from value range to partition
range. The functions TO_DAYS(date) and YEAR(date) are known by the
MySQL optimizer to have this attribute and they will thus be better for range
optimisations. Also a partition function on a field which is an integer field
where the function is the field by itself will have this characteristic. Other
functions won't, theoretically many more can be handled but this requires
special care of overflow handling to be correct and this will be added in
some future MySQL release.

So with this knowledge let's set up a that does partition by month.

CREATE TABLE t1 (a date)
PARTITION BY RANGE(TO_DAYS(a))
(PARTITION p3xx VALUES LESS THAN (TO_DAYS('2004-01-01'),
PARTITION p401 VALUES LESS THAN (TO_DAYS('2004-02-01'),
PARTITION p402 VALUES LESS THAN (TO_DAYS('2004-03-01'),
PARTITION p403 VALUES LESS THAN (TO_DAYS('2004-04-01'),
PARTITION p404 VALUES LESS THAN (TO_DAYS('2004-05-01'),
PARTITION p405 VALUES LESS THAN (TO_DAYS('2004-06-01'),
PARTITION p406 VALUES LESS THAN (TO_DAYS('2004-07-01'),
PARTITION p407 VALUES LESS THAN (TO_DAYS('2004-08-01'),
PARTITION p408 VALUES LESS THAN (TO_DAYS('2004-09-01'),
PARTITION p409 VALUES LESS THAN (TO_DAYS('2004-10-01'),
PARTITION p410 VALUES LESS THAN (TO_DAYS('2004-11-01'),
PARTITION p411 VALUES LESS THAN (TO_DAYS('2004-12-01'),
PARTITION p412 VALUES LESS THAN (TO_DAYS('2005-01-01'),
PARTITION p501 VALUES LESS THAN (TO_DAYS('2005-02-01'),
PARTITION p502 VALUES LESS THAN (TO_DAYS('2005-03-01'),
PARTITION p503 VALUES LESS THAN (TO_DAYS('2005-04-01'),
PARTITION p504 VALUES LESS THAN (TO_DAYS('2005-05-01'),
PARTITION p505 VALUES LESS THAN (TO_DAYS('2005-06-01'),
PARTITION p506 VALUES LESS THAN (TO_DAYS('2005-07-01'),
PARTITION p507 VALUES LESS THAN (TO_DAYS('2005-08-01'),
PARTITION p508 VALUES LESS THAN (TO_DAYS('2005-09-01'),
PARTITION p509 VALUES LESS THAN (TO_DAYS('2005-10-01'),
PARTITION p510 VALUES LESS THAN (TO_DAYS('2005-11-01'),
PARTITION p511 VALUES LESS THAN (TO_DAYS('2005-12-01'),
PARTITION p512 VALUES LESS THAN (TO_DAYS('2006-01-01'),
PARTITION p601 VALUES LESS THAN (TO_DAYS('2006-02-01'),
PARTITION p602 VALUES LESS THAN (TO_DAYS('2006-03-01'),
PARTITION p603 VALUES LESS THAN (TO_DAYS('2006-04-01'),
PARTITION p604 VALUES LESS THAN (TO_DAYS('2006-05-01'),
PARTITION p605 VALUES LESS THAN (TO_DAYS('2006-06-01'),
PARTITION p606 VALUES LESS THAN (TO_DAYS('2006-07-01'),
PARTITION p607 VALUES LESS THAN (TO_DAYS('2006-08-01'));

Then load the table with data. Now you might want to see the
data from Q3 2004. So you issue the query:
SELECT * from t1
WHERE a >= '2004-07-01' AND a <= '2004-09-30';
This should now only scan partition p407, p408, p409. You can
check this by using EXPLAIN PARTITIONS on the query:
EXPLAIN PARTITIONS SELECT * from t1
WHERE a >= '2004-07-01' AND a <= '2004-09-30';

You can also get similar results with more complicated expressions.
Assume we want to summarize on all measured Q3's so far.
SELECT * from t1
WHERE (a >= '2004-07-01' AND a <= '2004-09-30') OR
(a >= '2005-07-01' AND a <= '2005-09-30');

Using EXPLAIN PARTITIONS we'll discover the expected result that this
will only scan partitions p407, p408, p409, p507, p508 and p509.

When july comes to its end it is then time to add a new partition for
august 2006 which we do with a quick command:
ALTER TABLE t1 ADD PARTITION
(PARTITION p608 VALUES LESS THAN (TO_DAYS('2006-09-01'));