Archive for May, 2012

Introducing lunchbox::RequestHandler

25. May 2012

The last post –for now– in the lunchbox series is about the RequestHandler. This class is the most specific in Lunchbox, and makes most sense in the context of Collage. The primary use case is to register a pending operation and wait on its completion by another thread. The pattern is similar to futures, except that a future can be easily identified by a single integer.

In Collage, it is heavily used for synchronous, remote operations. The caller registers a request, sends a packet containing the request ID to a remote node which eventually replies using the request ID. The local reply handler then serves the local request, unblocking the caller waiting on the request. For convenience, data can be attached to the request, and the thread serving the request can provide a (return) value passed on to the waitRequest().

This concludes the lunchbox introduction. I’ve intentionally skipped over the following classes since I believe their concepts are more commonplace: Buffer, Clock, DSO, Lock, SpinLock, TimedLock, ScopedMutex, Log, MemoryMap, PerThread, Pool, RefPtr, RNG, Thread and UUID. If you want to hear some background on one of them, please post below.

Next week I’ll cover another topic – BuildYard, Collage, dash are on the list.

Introducing lunchbox::LFVector< T >

18. May 2012

Today’s post is about the LFVector. If you’ve paying attention to the previous posts in this series, you’ll know that this is a thread-safe, lock-free, STL-like vector implementation. It is based on the algorithm described by Damian Dechev et al in Lock-free dynamically resizable arrays.

Since the vector provides resizing operations, it is not completely thread-safe for all combinations of concurrent operations. We’ve also decided to make operations modifying the vector thread-safe by using a spin lock, which costs two compare-and-swap operations in the uncontented case. All read operations are fully lock-free and wait-free.

The LFVector algorithm uses a clever reallocation strategy. Compared to the STL vector, the data is not stored in a single array, since this would invalidate existing storage during resize operations. Instead, it uses a number of slots pointing to arrays of size 2^slot_index. This allows to allocate new storage during concurrent read accesses to existing elements. It also keeps the read access very fast, using just a few operations to access an element:

const int32_t slot = getIndexOfLastBit( ++i );
const size_t index = i ^ ( size_t( 1 )<<slot );
return slots_[ slot ][ index ];

A minor static overhead of this approach is that the slots have to be pre-allocated, which costs 504 bytes in the worst-cast scenario (63 slots of 8 bytes). The number of slots is a configurable template parameter, preset to 32 which seems reasonable for most use cases.

This slot-based storage requires a different approach to implementing iterators, which are based on maintaining the current index and using the operator [] instead of the pointer magic used by the STL vector iterators. In practice, they have the same complexity as the STL iterators and use the same interface.

Insert and erase operations first acquire the spin lock, and then potentially allocate or free the storage corresponding to a slot. Insert operations are completely thread-save with read operations, since existing end() iterators will keep pointing to the old end of the vector. Furthermore, the size is updated after the element is inserted, so size() followed by a read is also thread-safe with a concurrent insert. For erase operations, a concurrent read on the remove element produces undefined results. This affect in particular the end() and back() methods.

The LFVector is one of the magic pieces in DASH, which I guess warrants a whole set of posts in the future. DASH is very exciting work-in-progress for generic, efficient multi-threaded data access. Drop me a note if you want to beta test, provide feedback and contribute to this already.

As an added bonus, LFVector has support for boost::Serialization.

EGPGV 2012

15. May 2012

If you were wondering what’s up with last week’s ‘Introducing Lunchbox’ post: There wasn’t one since I’ve been to the Eurographics Symposium on┬áParallel Graphics and Visualization to present our paper “Parallel Rendering on Hybrid Multi-GPU Clusters”. This week I’m attending Eurographics, but I’ll try to post the fourth article in the Lunchbox series by Friday.

Our paper presented a collection and evaluation of optimizations for medium-sized GPU clusters which use Multi-GPU NUMA nodes. This type of architecture is quite important, since it provides a cost-effective configuration for parallel rendering, since the host and network infrastructure cost is amortized over multiple GPUs. During this paper we found a few surprising insights (<cough>glFinish</cough>) on what optimizations are actually important.

Enough talk: The most important parts are summarized in our presention. Enjoy!

Introducing lunchbox::MTQueue< T >

4. May 2012

Unsurprisingly, this time we’ll look into the MTQueue. The multi-threaded queue is the blocking, fully threadsafe big brother of the LFQueue discussed last week.

The MTQueue is fully thread-safe, that is, any public method can be called from any thread at any given time. Any request which will wait until it can be satisfied.

Naturally, the most common use case is to pop() items from the queue, potentially waiting for them. This is used extensively in Equalizer, e.g., by the pipe render threads which pop tasks received from the server to execute them. For pop, there are also a non-blocking tryPop() and bound timedPop() methods which may fail.

But the MTQueue goes further in the blocking paradigm: getFront(), getBack() operator [] and even push() may block. For the blocking push, the queue has a runtime-configurable maximum size which limits the number of items in the queue. This is useful when linking a slow consumer thread with a fast producer thread to limit memory usage and eventually slow down the producer. Even the setMaxSize() blocks until the queue meets the new maximum size requirement!

To implement the thread-safety and blocking semantics, almost every operation uses a Condition lock/unlock and signal or wait. Since this condition has a certain overhead, bulk operations such as push( std::vector ) amortize this cost over multiple items. This is for example used by the RSPConnection, which will push multiple buffers at once to to application when possible (see last weeks post).

Fun fact: During the writing of this post, I discovered and fixed a thread-safety issue with the copy constructor and assignment operator.