Archive for the ‘multicore’ Category

IEEE VisWeek: DASH Poster

14. October 2012

[Poster: DASH]


Today I’ll be presenting our DASH poster at IEEE VisWeek. This is the first official outing of DASH, and I’m intrigued to see what interest it will generate. If you’re around in Seattle, feel free to come and speak to me; otherwise, have a look at the poster on the right and leave comments.

For more information and the source code, go to github.com/BlueBrain/dash.

Thanks to Marwan for designing this poster!

Equalizer 1.4 released

7. September 2012


The last two weeks have been quiet, since I was biking through Switzerland. Meanwhile, poor Daniel back at work churned through most of the Collage changes outlined in the last post. You can see the changes in the Collage endian branch on github, which will be merged back into master in the next couple of weeks.

Now back to the news: after finally figuring out how to build Equalizer and its dependencies using MacPorts portfiles on Mac OS X, I released the long-awaited 1.4 version of Equalizer, GPU-SD, Lunchbox and vmmlib. Below is the release announcement – enjoy!

Neuchatel, Switzerland – September 7, 2012 – Eyescale is pleased to announce the release of Equalizer 1.4.

Equalizer is the standard framework to create and deploy parallel, scalable 3D applications. This modular release includes Collage 0.6, a cross-platform C++ library for building heterogeneous, distributed applications; GPU-SD 1.4, a C++ library and daemon for the discovery and announcement of graphics processing units using zeroconf networking; and Lunchbox 1.4, a C++ library for multi-threaded programming. All software packages are available for free for commercial and non-commercial use under the LGPL open source license.

Equalizer 1.4 is a feature release extending the 1.0 API, introducing major new features, most notably asynchronous readbacks, region of interest and thread affinity for increased performance during scalable rendering. It distills over seven years of development and decades of experience into a feature-rich, high-performance and mature parallel rendering framework and related high-performance C++ libraries.

Equalizer enables software developers to easily build interactive and scalable visualization applications, which optimally combine multiple graphics cards, processors and computers to scale the rendering performance, visual quality and display size.

[Images: Equalizer applications]

Eyescale provides software consulting and development services for parallel 3D visualization software and GPU computing applications, based on the Eyescale software products or other open and closed source solutions.

Please check the release notes on the Equalizer website for a comprehensive list of new features, enhancements, optimizations and bug fixes. A paperback book of the Programming and User Guide is available.

We would like to thank all individuals and parties who have contributed to the development of Equalizer 1.4.

Left image courtesy of Cajal Blue Brain / Blue Brain Project. Second from left copyright Realtime Technology AG, 2008. Right image courtesy University of Siegen, 2008.

1.4 beta release of the Eyescale open source packages

20. June 2012

We are pleased to announce the 1.4 beta release of the Eyescale open source packages. This release is a preview for testing the upcoming 1.4 stable release. It is the first modular release, and contains the following libraries and new features:

  • Equalizer: parallel rendering framework
    • Various scalable rendering performance features: asynchronous readbacks, region of interest and thread affinity.
  • Collage: C++ library for building heterogeneous, distributed applications
    • Zeroconf support and node discovery
    • Blocking object commits
    • Increased InfiniBand RDMA performance
  • GPU-SD: discovery and announcement of GPUs using zeroconf
    • VirtualGL detection
    • Hostname command line parameter for gpu_sd daemon
  • Lunchbox: C++ library for multi-threaded programming
    • Servus, a C++ interface to announce, discover and iterate over key-value pairs stored in a zeroconf service description
    • LFVector, a thread-safe, lock-free vector
  • Buildyard: A CMake-based superbuilder to download, configure and build the packages and dependencies for this release
    • Generates Unix Makefiles and solution files for Visual Studio 2008/10
    • Simple CMake project configuration scripts
    • Support for local overrides and user forks
    • Extensible with custom in-house or open source projects
  • http://eyescale.github.com: A website for API documentation of all the aforementioned packages

Please test this release extensively and report any bugs on the respective project page at https://github.com/Eyescale. The release notes are part of the API documentation at http://eyescale.github.com.

We would like to thank all contributors who made this release possible.

Lunchbox Folly

8. June 2012

Facebook recently released folly, which caught my eye due to its similarity to Lunchbox. It’s definitely a library to watch.

I don’t think we’ll be using it right now, since it’s missing some of the pieces we rely on, e.g., the LFVector, and since the implementation so far seems to be mainly tested on Linux + gcc 4.6. On the other hand, it has some interesting components which we might need in the future, e.g., atomic hash containers. It also contains optimized versions of some standard components such as vector and string, which haven’t really shown up as hotspots when profiling our code.

Interestingly enough, folly also has some components which are almost identical to their Lunchbox counterparts, such as r/w spin locks and lock-free queues. It’s always good to see ideas converge to a common design.

Introducing lunchbox::RequestHandler

25. May 2012

The last post – for now – in the Lunchbox series is about the RequestHandler. This class is the most specific in Lunchbox, and makes most sense in the context of Collage. The primary use case is to register a pending operation in one thread and wait on its completion by another thread. The pattern is similar to futures, except that the pending operation is identified by a single integer, the request ID, which can easily travel over the network.

In Collage, it is heavily used for synchronous remote operations. The caller registers a request and sends a packet containing the request ID to a remote node, which eventually replies using that request ID. The local reply handler then serves the request, unblocking the caller waiting on it. For convenience, data can be attached to the request, and the thread serving the request can provide a (return) value which is passed on to waitRequest().
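
As a hedged sketch, here is how that pattern might look in code. The RequestHandler methods match those named in this post, though the exact signatures may differ; the transport functions sendPacket() and handleReply() are hypothetical placeholders:

#include <lunchbox/requestHandler.h>

lunchbox::RequestHandler handler;

// Caller thread: register a pending operation, ship the ID, block.
void* syncRemoteOperation()
{
    const uint32_t requestID = handler.registerRequest();
    sendPacket( requestID ); // hypothetical: send the ID to the remote node
    void* result = 0;
    handler.waitRequest( requestID, result ); // blocks until served
    return result; // the (return) value provided by the serving thread
}

// Receive thread: the remote reply carries the request ID back.
void handleReply( const uint32_t requestID, void* returnValue )
{
    handler.serveRequest( requestID, returnValue ); // unblocks the caller
}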

This concludes the lunchbox introduction. I’ve intentionally skipped over the following classes since I believe their concepts are more commonplace: Buffer, Clock, DSO, Lock, SpinLock, TimedLock, ScopedMutex, Log, MemoryMap, PerThread, Pool, RefPtr, RNG, Thread and UUID. If you want to hear some background on one of them, please post below.

Next week I’ll cover another topic – Buildyard, Collage and DASH are on the list.

Introducing lunchbox::LFVector< T >

18. May 2012

Today’s post is about the LFVector. If you’ve been paying attention to the previous posts in this series, you’ll know that this is a thread-safe, lock-free, STL-like vector implementation. It is based on the algorithm described by Damian Dechev et al. in ‘Lock-free dynamically resizable arrays’.

Since the vector provides resizing operations, it is not completely thread-safe for all combinations of concurrent operations. We’ve decided to make operations modifying the vector thread-safe by using a spin lock, which costs two compare-and-swap operations in the uncontended case. All read operations are fully lock-free and wait-free.

The LFVector algorithm uses a clever reallocation strategy. In contrast to the STL vector, the data is not stored in a single array, since that would invalidate existing storage during resize operations. Instead, it uses a number of slots pointing to arrays of size 2^slot_index. This allows allocating new storage during concurrent read accesses to existing elements. It also keeps read access very fast, using just a few operations to access an element:

// incrementing i makes the index one-based, so the position of its
// highest set bit selects the slot...
const int32_t slot = getIndexOfLastBit( ++i );
// ...and clearing that bit yields the offset within the slot's array
const size_t index = i ^ ( size_t( 1 ) << slot );
return slots_[ slot ][ index ];

A minor static overhead of this approach is that the slots have to be pre-allocated, which costs 504 bytes in the worst-case scenario (63 slots of 8 bytes each). The number of slots is a configurable template parameter, preset to 32, which seems reasonable for most use cases.

This slot-based storage requires a different approach to implementing iterators, which are based on maintaining the current index and using operator[] instead of the pointer arithmetic used by the STL vector iterators. In practice, they have the same complexity as the STL iterators and use the same interface.

Insert and erase operations first acquire the spin lock, and then potentially allocate or free the storage corresponding to a slot. Insert operations are completely thread-safe with concurrent read operations, since existing end() iterators will keep pointing to the old end of the vector. Furthermore, the size is updated after the element is inserted, so size() followed by a read is also thread-safe with a concurrent insert. For erase operations, a concurrent read on the removed element produces undefined results. This affects in particular the end() and back() methods.
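
To make these guarantees concrete, here is a minimal, hypothetical usage sketch; only LFVector itself is real, the two thread functions are illustrative:

#include <lunchbox/lfVector.h>

lunchbox::LFVector< int > vector;

// Writer thread: modifying operations take the internal spin lock.
void writer()
{
    for( int i = 0; i < 1000; ++i )
        vector.push_back( i );
}

// Reader thread: size() is updated after the element is written, so
// every index below size() is safe to read concurrently with the writer.
long reader()
{
    long sum = 0;
    for( size_t i = 0; i < vector.size(); ++i )
        sum += vector[ i ]; // lock-free, wait-free read
    return sum;
}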

The LFVector is one of the magic pieces in DASH, which I guess warrants a whole set of posts in the future. DASH is very exciting work-in-progress for generic, efficient multi-threaded data access. Drop me a note if you already want to beta test, provide feedback or contribute.

As an added bonus, LFVector has support for Boost.Serialization.

Introducing lunchbox::MTQueue< T >

4. May 2012

Unsurprisingly, this time we’ll look into the MTQueue. The multi-threaded queue is the blocking, fully thread-safe big brother of the LFQueue discussed last week.

The MTQueue is fully thread-safe, that is, any public method can be called from any thread at any given time. Any request will block until it can be satisfied.

Naturally, the most common use case is to pop() items from the queue, potentially waiting for them. This is used extensively in Equalizer, e.g., by the pipe render threads, which pop tasks received from the server and execute them. Besides pop(), there are also a non-blocking tryPop() and a bounded timedPop() method, which may fail.

But the MTQueue goes further in the blocking paradigm: getFront(), getBack(), operator[] and even push() may block. For the blocking push, the queue has a runtime-configurable maximum size which limits the number of items in the queue. This is useful when linking a slow consumer thread with a fast producer thread, to limit memory usage and eventually slow down the producer. Even setMaxSize() blocks until the queue meets the new maximum size requirement!
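
A hedged sketch of this producer/consumer throttling, assuming the methods named above; the int payload stands in for real work items:

#include <lunchbox/mtQueue.h>

lunchbox::MTQueue< int > queue;

void producer()
{
    queue.setMaxSize( 64 ); // at most 64 queued items
    for( int item = 0; item < 1000; ++item )
        queue.push( item ); // blocks while the queue is full
}

void consumer()
{
    for( int i = 0; i < 1000; ++i )
    {
        const int item = queue.pop(); // blocks until an item is available
        (void)item; // ... execute the work item here
    }
}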

To implement the thread-safety and blocking semantics, almost every operation uses a Condition lock/unlock and signal or wait. Since this condition has a certain overhead, bulk operations such as push( std::vector ) amortize the cost over multiple items. This is for example used by the RSPConnection, which pushes multiple buffers at once to the application when possible (see last week’s post).

Fun fact: During the writing of this post, I discovered and fixed a thread-safety issue with the copy constructor and assignment operator.

Introducing lunchbox::LFQueue< T >

27. April 2012

The second installment on Lunchbox introduces the lock-free queue.

Lunchbox uses two naming schemes when implementing containers which have an STL counterpart: ‘LF’ and ‘MT’. In both cases the containers are thread-safe, in contrast to their STL equivalents. These classes carefully document what types of multi-threaded access are allowed, and which methods may only be accessed from a certain thread or in a certain state, if any.

‘MT’ stands for multi-threaded and typically uses synchronization primitives and blocking access. More on ‘MT’ classes in a later post.

‘LF’ stands for lock-free and uses atomic variables and non-blocking access. Lunchbox provides an Atomic class, which is derived from a library which recently got accepted as boost::lockfree. Atomic variables are a standard concept; google them if you’re not familiar with the topic.

Implementing lock-free containers is a very tricky business. Smart minds spend a lot of time on it, and still get it wrong quite often. For that reason, the functionality of the LFQueue is limited. To begin with, it has a fixed-size storage allocated at construction time, so a push might fail if the queue is full. Furthermore, only a single thread may write and (another) single thread may read at the same time, that is, at most two threads can access the LFQueue concurrently.

All these restrictions seem severe, but they allow for a fast and simple implementation. A quick test on my laptop yields:

[roku Release master]% ./Lunchbox/tests/mtQueue
193.339 reads/ms
193.34 writes/ms
[roku Release master]% ./Lunchbox/tests/lfQueue
12288.8 reads/ms, 6629.31 empty/ms
6145.42 writes/ms, 2031.51 full/ms

Collage uses the LFQueue in its multicast implementation. The RSPConnection uses a protocol thread which handles the sending of data, acknowledgements, negative acknowledgements and retransmissions. Each connection has a fixed set of data buffers, which are constantly shuffled around between the application and protocol threads. At high wire speed (10GigE), more than 100,000 buffers need to be shifted each second. From the protocol thread to the application thread this is blocking, using an MTQueue, and in the other direction it’s non-blocking, using an LFQueue. Since the queue size (number of buffers) is fixed, and only two threads are involved, the LFQueue is well suited here. For the curious, here is more background on the RSP implementation.
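
A hedged sketch of that single-producer/single-consumer arrangement, assuming a size constructor and bool-returning push()/pop() as described above; the Buffer type and receiveInto() are illustrative:

#include <lunchbox/lfQueue.h>

struct Buffer; // illustrative: the connection's data buffer type
void receiveInto( Buffer* buffer ); // hypothetical: refill a recycled buffer

lunchbox::LFQueue< Buffer* > emptyBuffers( 64 ); // fixed size, set at construction

// Application thread (single producer): recycle a consumed buffer.
void returnBuffer( Buffer* buffer )
{
    while( !emptyBuffers.push( buffer )) // fails while the queue is full
        ; // retry; the protocol thread drains the queue continuously
}

// Protocol thread (single consumer): grab recycled buffers for new data.
void refillBuffers()
{
    Buffer* buffer = 0;
    while( emptyBuffers.pop( buffer )) // fails when the queue is empty
        receiveInto( buffer );
}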

Introducing Lunchbox: Monitor< T >

20. April 2012

We’ve recently refactored the Equalizer and Collage base library co::base into a separate library and project called Lunchbox. This post has some background on this change.

Why Lunchbox? It’s a basic ‘toolbox’ library for multi-threaded programming. This, combined with the famous ‘the free lunch is over’ quote, led to the name Lunchbox.

I’ve decided to do a weekly series of posts where I present one feature/class of Lunchbox which is beyond the basic, well-documented threading primitives provided by STL and boost. Today it’s the turn of the template class Monitor.

The monitor is a higher-level primitive based on a condition variable. Lunchbox also has a Condition class, which encapsulates a condition variable and the associated lock. If you’re not familiar with condition variables, google them. Smarter people than me have written good articles about them, and the Lunchbox API is pretty straightforward.

The monitor allows observing the state of a variable in a blocking fashion. You can think of it as a blocking variable: a monitor has a value which can be incremented, decremented and set. Any thread can wait on the monitor to reach a certain value (waitEQ), to leave a certain value (waitNE), or to reach at least (waitGE) or drop below (waitLE) a given value. Using a monitor makes the code easier to understand and more robust, compared to using a traditional semaphore or barrier.

In Equalizer I typically use them to monitor a state for synchronization. For example, the pipe initialization needs to wait for the node to be initialized. Since these tasks are queued and executed in parallel, the pipe thread monitors the node state before proceeding with its own initialization. The eq::Node has a Monitor< State > which represents its current state (stopped, running, failed, …):

enum State
{
    STATE_STOPPED,
    STATE_INITIALIZING,
    STATE_INIT_FAILED,
    STATE_RUNNING,
    STATE_FAILED
};
lunchbox::Monitor< State > _state;

This monitor allows waiting on the state to reach a certain value, which is used in Node::waitInitialized() to wait from the pipe thread for the node thread to finish initialization:

void Node::waitInitialized() const
{
    _state.waitGE( STATE_INIT_FAILED );
}

The node state is advanced after initialization by the node main thread:

_state = initResult ? STATE_RUNNING : STATE_INIT_FAILED;

Similarly, the per-node frame synchronization during rendering uses monitors of frame numbers to synchronize the node main thread and the pipe render threads.
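
A hypothetical sketch of that frame synchronization; the real Equalizer code differs, but the monitor usage is the same idea:

lunchbox::Monitor< uint32_t > _startedFrame; // last frame started by the node

// Node main thread: release the render threads for frame N.
void startFrame( const uint32_t frameNumber )
{
    _startedFrame = frameNumber; // wakes all threads waiting in waitGE()
}

// Pipe render thread: block until the node has started the frame.
void waitFrameStarted( const uint32_t frameNumber )
{
    _startedFrame.waitGE( frameNumber );
}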

Since the monitor is a template class, you can use it with your own data types. Monitors have become an invaluable primitive in Collage and Equalizer for thread synchronization.

Lock Performance

5. July 2011

I’m currently working on a low-level library where locked data access has to be optimized. Therefore I benchmarked the performance of the three lock types in Collage on Linux and Mac OS X. The test just runs a number of threads which repeatedly set and unset the lock without performing any other operation. Click on the image below for the full-resolution version. Be aware that the chart uses a double-logarithmic scale.

The two benchmarks cannot be directly compared since they did not run on the same hardware. There are nevertheless a few interesting observations:

(1) Spin locks are faster than ‘real’ locks. I’ve blogged about this before. Since they consume CPU time while spinning, they should only be held for a very short time, e.g., to read a value. The Collage implementation immediately backs off by yielding the thread when it encounters a set lock (see the sketch after these observations). This avoids the priority inversion which can be observed with some pthread spin lock implementations.

(2) pthread locks are dead slow on Mac OS X. Be aware that the graph uses log scale – a spin lock is up to three orders of magnitude faster than a pthread lock!

(3) Timed locks are slower than untimed ones. This meets my intuitive expectation, since the timed implementation is more complex. The timed lock in Collage is implemented using pthread_cond_timedwait.

(4) The spin lock is faster on OS X, on slower hardware, than on Linux. I’m not sure why that is the case. The Collage spin lock uses an atomic variable and compare-and-swap. Either these operations are faster on the Core i5, or the thread yield behaves ‘better’ on OS X.

(5) Single-threaded lock access in pthread libraries seems to be optimized.

(6) pthread conditions on Linux show a steep performance drop once you have more threads than cores. This could be a scheduling issue again.
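
For illustration, here is a sketch of the yield-on-contention strategy from observation (1), written with C++11 atomics; it is not the actual Collage implementation, just the same idea in miniature:

#include <atomic>
#include <thread>

class YieldingSpinLock
{
public:
    void set()
    {
        bool expected = false;
        // compare-and-swap; on contention, yield instead of spinning hot
        while( !_locked.compare_exchange_weak( expected, true ))
        {
            expected = false; // a failed CAS overwrites 'expected'
            std::this_thread::yield(); // back off, avoid priority inversion
        }
    }

    void unset() { _locked.store( false ); }

private:
    std::atomic< bool > _locked { false };
};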

Next I’ll work on benchmarking and optimizing read/write locking in the Collage Spinlock. Stay tuned for updates!

EDIT: I discovered a bug in my micro-benchmark which wrongly multiplied the results by the number of threads – doh! The figure is fixed now with a new test run.