Wednesday, February 15, 2012

1.05BN QPM using MySQL Cluster 7.2

We've passed another performance milestone using MySQL Cluster. In 2002 we passed the limit of 1M reads per second. Now we've passed the milestone of 1B reads per minute. We achieved 1.05BN reads per minute on an 8-node cluster using MySQL Cluster 7.2.5.

The benchmark used the latest MySQL Cluster version, available on launchpad and to be released as MySQL Cluster 7.2.5. The benchmark program is flexAsynch. Each read is a transaction consisting of a read of an entire row consisting of 25 attributes, each 4 bytes in size. flexAsynch uses the asynchronous feature of the NDB API which enables one thread to send off multiple transactions in parallel. This is handled similarly to how Node.js works with callbacks registered that reports back when a transaction is completed.

I will in a later blog discuss how an application can be designed to make optimal use of the asynchronous API to enable the application to reach these kind of performance numbers for key lookup operations.

The benchmark was executed on a set of 2-socket servers using X5670 with Infiniband interconnect and 48GB of memory per machine. There were 8 data nodes in the cluster and each machine had 1 data node placed in it. There were 10 machines running flexAsynch (1 process per machine).

We reported a similar benchmark a year ago, the results from an 8 data node set-up was 2.1M reads per second (8X improvement) and using 8 machines the set-up was 16 data nodes where we reached 4.3M reads per second (4X improvement). This dramatic improvement is possible since we have made each data node make more effective use of the available hardware threads. Previously each data node exploited about 6 CPUs with 8 threads, in this benchmark we used about 14 CPUs with 24 threads. We've also removed a number of bottlenecks and the configuration of the threads is done in such a manner as to handle many different loads well.

We have also improved the update numbers from 1.3M updates per second in an 8-node setup to now reaching 1.9M updates per second in a similar set-up.

The scripts required to run the benchmark are available on our benchmark page. We've placed the configuration file used to drive the benchmark in the dbt2-0.37.50 tarball in the examples directory.

You can learn more about all of the enhancements in MySQL Cluster 7.2 from our Developer Zone article.

40 comments:

Anonymous said...

Could you elaborate to an uneducated reader whether we talk QPM or TPM? Numbers from about a year ago spoke of TPS - 2.46M write tps (~147M write tpm), 6.82M read tps (~400M read tpm), http://mikaelronstrom.blogspot.com/2011/04/mysql-cluster-running-246m-updates-per.html

Mikael Ronstrom said...

In this case TPM and QPM is equivalent. There is one query per transaction and the query is a primary key lookup reading 25 attributes which are 4 bytes each.

In the case of updates it is also equivalent where a transaction consists of one update of all 25 attributes using the primary key as index.

TPS is per second and QPM is per minute.

John Ward said...

Hey there Mikael, those are quite astonishing results.. What kind of storage are you using on the back end?

Mikael Ronstrom said...

These benchmarks are done on read benchmarks on main memory data, so disks are not really involved in the benchmark. I will later report on benchmarks also on updates where we have machines with advanced IO as well. Stay tuned.

Andy said...

Mikael,

1) What if the data set is bigger than RAM? Say I'm using mainstream SSD's such as Intel 320 - would MySQL Cluster by any faster than regular MySQL with such IO-bound workload? What type of performance per node can be expected in this case - 10k rps? 100k rps?

2) Would you recommend running MySQL on a single node? Eventually I may need to scale out but initially I'd like to use just 1 machine

Dimitri said...

Kudos, Mikael! :-)

getting 1.5M TPS in 2002 was already something out of imagination (specially that the code was not far from alpha level stability yet ;-))

and reaching over 16M TPS 10 years later is looking not less amazing!! :-)) So, x10 times speed-up in 10 years! but I'm pretty sure you're able to reach more, no? ;-)

there is no word about bottlenecks you're meeting.. - keeping in mind that NDB today is free of lock contentions, what is your limit today? only HW or anything else?..

Thank you!

Rgds,
-Dimitri

Mikael Ronstrom said...

MySQL Cluster supports placing non-indexed data on disk and many use this on production workloads. We haven't spent so much time doing benchmarks to see exactly what the limitations are.

We use a modern state-of-the-art page replacement algorithm, so it's tied to how much the caching provides and how fast the disk can deliver 32k pages on random reads.

Many SSD's can deliver quite some reads and if we assume it can deliver 10k 32k pages per second and the cache has a 90% hit rate, then 100k rps is certainly achievable.

I'll try to find some time to benchmark using disk-based data as well.

Mikael Ronstrom said...

Using MySQL Cluster on a single node in a MySQL environment certainly makes some sense. Obviously it doesn't provide any high-availability. But I tested it a lot to see what one MySQL Server with one data node can do when based on the same machine. MySQL Cluster 7.2 has some pretty awesome features that makes even a single node case interesting to use, like the possibility to execute queries in parallel using the adaptive query localization feature.

Mikael Ronstrom said...

Thx Dimitri,
Yep, certainly not reached the limit here.

I will write up a blog of some contentions that we did come up against. It was mostly false CPU cacheline sharing that we had some issues with.

Obviously the usual scalability issues around many nodes still exist. This usually hits limits to have many packets network drivers can process since many nodes means smaller packets and thus more packets.

I will write up some blog about some things to take into consideration for network interrupt handling using MySQL Cluster.

Ryan said...

In regards to: "Each read is a transaction consisting of a read of an entire row consisting of 25 attributes, each 4 bytes in size."

Is each row 4 bytes in size?, or each of the 25 attributes 4 bytes in size?Such that the result you are pulling back is either: row size = 25 x 4 bytes = 100 bytes.
or row size = 4 bytes.

It would be pretty useful to see this test repeated with a larger row size (10 bytes, 100 bytes, 1 kb, 10 kb, 100kb) to see the varying results.

Thanks

Mikael Ronstrom said...

The row size was 100 bytes, 25 attributes of 4 bytes each.

We have done some tests on a single node set-up (different machine than the benchmark was run on) where 100 bytes record handled 3.3M reads per second and 8 byte records about 4.5M reads per second. I have forgotten the numbers on 500 bytes but the cost of a read is essentially a fixed overhead per record plus the time it takes to transfer the extra data.

Anyways I am planning to write some more details later on this single node testing, so stay tuned.

Ryan said...

Yea, that would be really useful to see the single machine results. Do you know what the fixed overhead per record is?

Also, when you say "reads per second" - Is that a query being performed where there might be other factors involved (indexes, volume of total data, where clauses, etc) or is it sheer read speed at a lower level than a query? I guess I'm asking, what is actually timed there?

Mikael Ronstrom said...

The read per second is here queries per second. A query is in SQL-terms equivalent to a SELECT * from t1 WHERE pk = x; The benchmark as mentioned however use the NDB API to issue the query, not using SQL but a more direct API.

So the where clause in this case effectively means a hash lookup followed by a read of the record attributes followed by a transport of the attribute data back to the API.

The NDB API has support for all access methods of the NDB data nodes. The access methods currently are: Primary key access (read, update, delete, insert and write (either insert or update), full table scan and index scans using an ordered index. In 7.2 it's now also possible to mix such operations and perform join operations in the data nodes.

The benchmarks discussed in this article uses the primary key lookups. So it's similar to what a Key-Value Store would be doing.

Andy said...

> how fast the disk can deliver 32k pages on random reads.

Is the page size of MySQL Cluster configurable? 32K seems pretty large. Most SSDs tend to have 4K page size. Would this page size mismatch cause any issues?


> Previously each data node exploited about 6 CPUs with 8 threads, in this benchmark we used about 14 CPUs with 24 threads

How could each data node use 14 CPUs? The machines you used are 2-socket X5670. Each X5670 has 6 cores, so each node has 12 cores in total. Where did the 14 CPUs come from?

Mikael Ronstrom said...

The page size mismatch won't cause any issues, making the page size configurable have a potential to be a benefit, but it will definitely depend on the workload as usual. We're trying this out in InnoDB currently, so we'll get experience from there.

When I said it used 14 CPUs I meant that top said that it used 1400% CPU. This means that 14 of the total of 24 CPU threads are in use on the machine.

The X5670 has 6 cores but also uses Hyperthreading and thus from an OS point of view there are 12 CPUs per socket when Hyperthreading is enabled.

Anonymous said...

Hello,
Could you like me know the database size for your testing? And if the memory size is much smaller than database size, the performance will be downgraded a lot? If you testing can show the data about performance vs. database size and memory size, that will be more helpful.

Thanks.

Nai Yan.

Nai Yan said...

Hello
Would you like me know what's data base size and memory size (48G * 8)? And how much performance will be downgraded when memory size is smaller than database size?

Thanks.

Nai Yan.

Mikael Ronstrom said...

The possible memory size with 8 data nodes with 48 GB each is about 4 * 35 GB = 140GB since we need to use a part of the memory for internal memory structures to handle ongoing transactions, send buffers... , index memory. Additionally we use NoOfReplicas = 2, so two machines contains the same data.

If database size is > memory size then you're using disk-based data and performance is dependent on how the cache hit rate in the page pool is and the speed of the disk.

digitalpoint said...

Curious if you were running IP over InfiniBand, SDP using LD_PRELOAD or something else?

Nai Yan said...

Hi Mikael,
Thank you for your reply. Actually my question is how many rows are there in your test database? And how big is the index size.

What I am afraid is if the index size is bigger than memory size, the performance will be far degraded - i.e. far lower than 1.05BN QPM.

If my worry is true, would you give me a recommendation for memory size required vs. rows of DB, so that we can reach 1.05BN QPM?

Thanks!

Nai Yan.

Mikael Ronstrom said...

We ran with IPoIB (IP over Infiniband), simply because it was easiest to set-up, everything was already set-up and prepared for it. SDP is probably slightly more efficient (at least so my previous experience says), but IPoIB was quite good enough for these tests.

Mikael Ronstrom said...

Nai Yan,
Certainly performance will take a major hit if index size isn't fitting in main memory, we cannot change the laws of physics :). In MySQL Cluster indexes stays in memory. We support disk-based attributes as long as they are not indexed.

For calculating the size of memory used there is around 15 bytes of overhead for the primary key hash index, there is around 10-20 bytes overhead per row (I think the manual has a more exact estimate) and any additional ordered index will be another 10-15 bytes of overhead. So in our table with 1 primary key index, 25 attributes of 4 bytes in size and no ordered indexes we will use about 110-120 bytes of DataMemory per row and about 15 bytes of IndexMemory per row (if ordered indexes are used they are stored in DataMemory).

So this means that with the 140GB's of memory accessible to DataMemory and IndexMemory I can assign 120GB to DataMemory and 20GB to IndexMemory and thus be able to store 1BN rows.

Nai Yan said...

Thank you, Mikael.

However if we have at least 256B rows, each row contains 20 ~ 30 bytes data. Any recommendation from you for system design so that we can maximize our read throughput. We estimate, we can leverage totally around 2T memory (also need to consider HA) and some SSDs.

In addition to this, if current MySQL cluster can support Infiniband RDMA as the interconnection?

Thanks.

Nai Yan.

Mikael Ronstrom said...

Hi,
256B rows each with 20-30 bytes data means that you need about 15 bytes (hash index) + 32 bytes (e.g. row data) + 15 bytes (row overhead) per row. Thus you need around 64 bytes per row.

This means 256B * 64 bytes = 16TB. Replication means an additional 16TB. Thus you need servers with a total capacity of around 35TB taking additional overhead into account.

If you have 2TB's of memory you can handle about 16B rows.

MySQL Cluster supports IPoIB and SDP using Infiniband but not Infiniband RDMA. But it shouldn't matter very much to performance of the system given that we can use multiple CPUs for network handling.

Nai Yan said...

if you mean, mySQL cluster have to put all the rows into memory? Is there any option to balance memory usage and performance? Thanks.

Nai Yan.

Mikael Ronstrom said...

All rows have indexed fields in memory and also indexes are memory based. Fields that are non-indexed can be disk-based.

Nai Yan said...

So if I only index primary key field (4 bytes), then for my case, memory consumption can be cut down to 256B * (15+4) < 5T? - 15 bytes for hashed index and 4 bytes for index field.

If my calculation is wrong? Thanks.

Nai Yan.

bruss said...

Thanks your infiniband information.

1)
I want to know config.ini setting in case of IPoIB and SDP.

2)
I have only 2 data node.
and I am going to use infiniband using peer to peer connection,
without infiniband switch.

It is possible and useful?

Bruss said...

I am very curios [TCP] setting in config ini about infiniband.

Did you connect only between data node with infiniband.

Otherwise, Alse did you connect between sql node and data node with infiniband?

Mikael Ronstrom said...

There are no special settings needed to use Infiniband based on IPoIB or SDP. Simply set the hostname of the node to the IP address of the Infiniband interface, that's all that is needed.

Irrespective of that I usually set some things a bit higher, in particular TcpSendBufferSize and TcpReceiveBufferSize (I usually set those to 256k). But this I do also for Ethernet, so nothing special for Infiniband.

bruss said...

Thanks Miakel.
Sincerly thanks for your reply.
I have some question additionally.

>Simply set the hostname of the >node to the IP address of the >Infiniband interface, that's all >that is needed

In [TCP] setting of config.ini,
only can set aboout data node.
When i set [TCP] between mgm node and data node
or between sql node and date node, ndb_mgm could not start with next error.

MySQL Cluster Management Server mysql-5.5.19 ndb-7.2.4
/export/home2/pb2/build/sb_0-4838533-1327948997.21/mysql-cluster-gpl-7.2.4/storage/ndb/src/mgmsrv/ConfigInfo.cpp:4580: require((ctx.m_config->get("Node", id1, &node1))) failed

Did you set in [TCP] of config.ini only between data node.

and
I have other question.
Do you use infiniband switch?

Mikael Ronstrom said...

The [TCP] section is only needed if you need to set specific things related to one specific connection between two nodes. In this case you need to set both Node1 and Node2 of this section which is what the error message says.

I didn't use the [TCP] section at all. I used however the [TCP DEFAULT] section to set TcpReceiveBufferSize and TcpSendBufferSize.

What I meant with setting hostname was to set the hostname on the nodes, this is a mandatory parameter, simply set this hostname of the node to the Infiniband IP address used in IPoIB or SDP. Nothing more needed.

We had 18 machines running the benchmark, so yes, there were Infiniband switches.

Brian said...

you are using Dual-X5670 based nodes and using 14 threads out of 24, so what if we use AMD Opteron based servers (like Quad-6272 which is almost 2x faster than the Dual-X5670) and will have 64 cores and upto 512GB memory support, do you think it can reach the 2B/m mark?

also if we scale the cluster over different datacenters located in completely different geo-locations, what will be the effects?

bruss said...

thanks mikael. It was of much help to me.

By the way, don't use Dolphin sci?

It looks like better use dolphin sci rather than use infiniband?

Nai Yan said...

Hi Miakel again.

If indexed column size is bigger than total RAM, then if mySQL cluster will automatically put part of it on disk or some place we configure (such as SSDs)? Or if we can drive MySQL cluster to load an index for particular table dynamically and those indexes can saved particularly on SSDs?
Because, we might have 6400 billion rows and each row has a primary key 4 bytes.

We are seeking for a solution to deal with such a big data. Looking forward to your suggestion!

Thanks.

Nai Yan.

li said...

Hi, all, I want to test 7.2.5, where I could download 7.2.5 dev version since I understand it is not GA.

Thanks.

Mikael Ronstrom said...

The code for 7.2.5 can be retrieved on launchpad. 7.2.5 is the next version to be released for MySQL Cluster.

Todaydownload.com said...

These facts are really interesting. Few of them were well known for me but many of them were brand new for me too!
I will print this one out and show to my friends because they will be definitely interested in that. Thanks!
MySQL

MartinF said...

Nice results. Do you also have some experience with ndb in virtualized environment?

Anonymous said...

Dear Mikael,

Thanks for sharing such valuable info. We're running something similar to your benchmarks, and I have a question about balancing Soft IRQs within the CPU cores; do you follow such a technique in the benchmark? Or you just use the OS' default (all Soft IRQs are handled by one core at a time)? And could you please elaborate more if it's the former?