Wednesday, July 30, 2008

2: Thoughts on a new NDB API: Send part

In the current API, a sending thread takes the Transporter mutex and
then sends all the signals generated towards one or many nodes.
There is also some handling of adaptive sends, but this adaptive
algorithm covers all nodes at once, so any waiting before a send is
global across all nodes.

The new design also uses a mutex for sending, but this mutex only
controls the sending part of one socket. The mutex is also held just
long enough to check the state; no send operations are done while
holding the mutex.

The new adaptive algorithm will keep track of the last few messages
sent on this socket. In principle the idea is that if there is at
least a 90-99% probability that it is a good idea to wait, then it
will wait (unless the application has provided the force send flag).
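
As a rough illustration of the adaptive decision, here is a minimal
sketch. It assumes timestamps are plain microsecond counters, and the
names send_history, record_send, should_wait, num_recorded and
max_gap_us are hypothetical, not part of the design above. The rule it
uses (wait only when every recent gap between sends was short) is just
one simple stand-in for "high probability that waiting pays off".

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_SENT_TIMERS 8

/* Hypothetical ring of the last send timestamps, in microseconds. */
struct send_history {
  uint64_t last_sent_timers[NUM_SENT_TIMERS];
  uint32_t last_sent_timer_index; /* next slot to overwrite */
  uint32_t num_recorded;          /* saturates at NUM_SENT_TIMERS */
};

/* Remember the time of a send, overwriting the oldest entry. */
static void record_send(struct send_history *h, uint64_t now_us)
{
  h->last_sent_timers[h->last_sent_timer_index] = now_us;
  h->last_sent_timer_index =
    (h->last_sent_timer_index + 1) % NUM_SENT_TIMERS;
  if (h->num_recorded < NUM_SENT_TIMERS)
    h->num_recorded++;
}

/*
 * Assumed heuristic: wait only if the history is full and every gap
 * between consecutive recent sends was at most max_gap_us. The force
 * send flag always disables waiting.
 */
static bool should_wait(const struct send_history *h,
                        uint64_t max_gap_us,
                        bool force_send)
{
  uint32_t idx, i;
  uint64_t prev, cur;

  if (force_send || h->num_recorded < NUM_SENT_TIMERS)
    return false;
  /* When the ring is full, the oldest entry sits at the write index. */
  idx = h->last_sent_timer_index;
  prev = h->last_sent_timers[idx];
  for (i = 1; i < NUM_SENT_TIMERS; i++) {
    idx = (idx + 1) % NUM_SENT_TIMERS;
    cur = h->last_sent_timers[idx];
    if (cur - prev > max_gap_us)
      return false;
    prev = cur;
  }
  return true;
}
```

A caller would invoke record_send after each send on the socket and
should_wait before deciding whether to buffer a new message. The
threshold max_gap_us would need tuning against real traffic.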

So in principle the data structure protected by the mutex is:

struct ic_send_node_mutex
{
  IC_SEND_THREAD_MUTEX *send_thread_mutex;
  Mutex mutex;
  boolean send_active;
  IC_COMM_BUFFER *first_cb;
  IC_COMM_BUFFER *last_cb;
  uint32 queued_bytes;
  Timer first_buffered_timer;
  Timer last_sent_timers[8];
  uint32 last_sent_timer_index;
};

For each socket there is a specific send thread. This thread is mostly
sleeping, waiting for someone to wake it up. One reason to wake it up
is that one thread has started sending and other threads have provided
so much work that the sending needs to be offloaded to the send
thread. (The idea is that sending is normally done by an application
thread which is involved in user activity, and we cannot keep this
thread for longer than a few sends, so it must be possible to offload
send activity to the send thread when a high load appears.) The send
thread can also be awakened to send buffered messages that have timed
out.
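
The offload step could look roughly like the sketch below. It is only
an illustration using POSIX threads; the names send_thread_ctl,
maybe_offload, wait_for_wakeup and the limit MAX_SENDS_PER_APP_THREAD
are assumptions made here, and the timeout-driven wakeup is left out.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed value for "a few sends" per application thread. */
#define MAX_SENDS_PER_APP_THREAD 4

/* Hypothetical control block shared with the per-socket send thread. */
struct send_thread_ctl {
  pthread_mutex_t mutex;
  pthread_cond_t cond;
  bool wakeup_requested;
};

static void send_thread_ctl_init(struct send_thread_ctl *c)
{
  pthread_mutex_init(&c->mutex, NULL);
  pthread_cond_init(&c->cond, NULL);
  c->wakeup_requested = false;
}

/*
 * Called by an application thread after each send. If it has already
 * done its share and data is still queued, wake the send thread and
 * return true so the caller can go back to user work.
 */
static bool maybe_offload(struct send_thread_ctl *c,
                          uint32_t sends_done,
                          uint32_t queued_bytes)
{
  if (sends_done < MAX_SENDS_PER_APP_THREAD || queued_bytes == 0)
    return false;
  pthread_mutex_lock(&c->mutex);
  c->wakeup_requested = true;
  pthread_cond_signal(&c->cond);
  pthread_mutex_unlock(&c->mutex);
  return true;
}

/* Called by the send thread: sleep until someone requests a wakeup. */
static void wait_for_wakeup(struct send_thread_ctl *c)
{
  pthread_mutex_lock(&c->mutex);
  while (!c->wakeup_requested)
    pthread_cond_wait(&c->cond, &c->mutex);
  c->wakeup_requested = false;
  pthread_mutex_unlock(&c->mutex);
}
```

The send thread would loop on wait_for_wakeup and then drain the
queued buffers for its socket. Keeping the flag under the mutex avoids
a lost wakeup if the signal arrives before the send thread sleeps.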

The flag send_active is true whenever a thread is actively sending.
A thread that needs to send while this flag is set can simply return
immediately; if the flag is not set, the thread sets it and starts
sending.
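
A minimal sketch of this handshake, again using POSIX threads, is
shown here. The struct send_state and the functions send_state_init,
queue_and_try_start_send and finish_send are hypothetical names. The
sketch also assumes that a thread which finds the flag set has already
queued its data, so the active sender picks it up.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cut-down version of the per-socket send state. */
struct send_state {
  pthread_mutex_t mutex;
  bool send_active;
  uint32_t queued_bytes;
};

static void send_state_init(struct send_state *s)
{
  pthread_mutex_init(&s->mutex, NULL);
  s->send_active = false;
  s->queued_bytes = 0;
}

/*
 * Queue some bytes and try to become the sender. Returns true if the
 * caller must now send (outside the mutex), false if another thread
 * is already sending and will pick up the queued data.
 */
static bool queue_and_try_start_send(struct send_state *s, uint32_t bytes)
{
  bool i_am_sender = false;

  pthread_mutex_lock(&s->mutex);
  s->queued_bytes += bytes;
  if (!s->send_active) {
    s->send_active = true;
    i_am_sender = true;
  }
  pthread_mutex_unlock(&s->mutex);
  return i_am_sender;
}

/*
 * Called by the sender after a send completed outside the mutex.
 * Returns true if more data was queued meanwhile and the caller
 * should keep sending; otherwise clears send_active.
 */
static bool finish_send(struct send_state *s, uint32_t bytes_sent)
{
  bool more;

  pthread_mutex_lock(&s->mutex);
  s->queued_bytes -= bytes_sent;
  more = s->queued_bytes > 0;
  if (!more)
    s->send_active = false;
  pthread_mutex_unlock(&s->mutex);
  return more;
}
```

The mutex is held only to update the counters and the flag. The
actual socket write would happen between the two calls, with the
mutex released.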

It would probably be possible to handle this without a mutex, but the
contention on this mutex should be small, and there is also some
wakeup logic that fits naturally with a mutex.

The application thread can prepare the NDB Protocol messages completely
before acquiring the mutex. The only activity that sometimes happens
while holding the mutex is reading the time for the adaptive
algorithm.

Sends normally go to an NDB Data node but could also go to another
Client node, or even to another thread in the same process.
This is important for parallelisation: to parallelise an operation it
is sufficient to send a number of messages to other nodes and/or
threads. Each message can kick off at least one new thread.
