1.4 beta release of the Eyescale open source packages

20. June 2012

We are pleased to announce the 1.4 beta release of the Eyescale open source packages. This release is a preview for testing the upcoming 1.4 stable release. It is the first modular release, and contains the following libraries and new features:

  • Equalizer: parallel rendering framework
    • Various scalable rendering performance features: asynchronous readbacks, region of interest and thread affinity.
  • Collage: C++ library for building heterogeneous, distributed applications
    • Zeroconf support and node discovery
    • Blocking object commits
    • Increased InfiniBand RDMA performance
  • GPU-SD: discovery and announcement of GPUs using zeroconf
    • VirtualGL detection
    • Hostname command line parameter for gpu_sd daemon
  • Lunchbox: C++ library for multi-threaded programming
    • Servus, C++ interface to announce, discover and iterate over key-value pairs stored in a zeroconf service description
    • LFVector, a thread-safe, lock-free vector
  • Buildyard: A CMake-based superbuilder to download, configure and build the packages and dependencies for this release
    • Generates Unix Makefiles and solution files for Visual Studio 2008/10
    • Simple CMake project configuration scripts
    • Support for local overrides and user forks
    • Extensible with custom in-house or open source projects
  • http://eyescale.github.com: A website for API documentation of all the aforementioned packages

Please test this release extensively and report any bugs on the respective project page at https://github.com/Eyescale. The release notes are part of the API documentation at http://eyescale.github.com.

We would like to thank all contributors who made this release possible.

Buildyard + doxygen + github = eyescale.github.com

15. June 2012

Sometimes pieces just fall into place and you know you’re on the right track. While preparing the 1.4 beta release of our Eyescale software stack (Equalizer and friends), I needed to put the API documentation of all these projects on a web server. In the past, with Equalizer as a single project, I just had doxygen dump it somewhere and then copied the stuff at release time to equalizergraphics. With five or more projects, that’s not really an option anymore.

Meet eyescale.github.com: Our new home for all doxygen-generated API documentation. For us, the process is almost fully automated:

  • github provides a git repository for all the web pages, and automatically serves them at eyescale.github.com
  • Buildyard has a configuration which clones this git repository before all other projects.
  • A per-project CMake rule installs the project, runs doxygen on the installed headers, and copies the result to the git repository into project-version
  • CMake magic in the git repository rebuilds the index page and adds all new files to the repository
  • A manual git commit; git push uploads the new pages to the git repository, and github automatically updates the website.

This took about one day to set up, and now it takes almost no time to update the API documentation. It has the added benefit that you can easily get all the reference documentation by simply cloning the git repository.

Lunchbox Folly

8. June 2012

Facebook recently released folly, which caught my eye due to its similarity to Lunchbox. It’s definitely a library to watch.

I don’t think we’ll be using it right now, since it’s missing some of the stuff we’re using, e.g., the LFVector, and since the implementation so far seems to be mainly tested on Linux + gcc 4.6. On the other hand, it has some interesting components which we might need in the future, e.g., atomic hash containers. It also contains optimized versions of some standard components such as vector and string, which haven’t really shown up as hotspots while profiling our code.

Interestingly enough, folly also has some components which are almost identical to their Lunchbox counterparts, such as r/w spin locks and lock-free queues. It’s always good to see ideas converge to a common design.

Buildyard – C++ source project management

1. June 2012

The Problem

The Equalizer source code has always been modular, but for convenience everything was built into a single shared library. Lately, new projects such as dash/codash required functionality from what was called eq::base. This led to Lunchbox, which you should know by now.

Managing the development of multiple, source-dependent projects is a pain. You’ve got to download, configure, build and install them in the right order. There are solutions for this in the Java world, but the ones I found for C/C++ were either in limbo (ryppl) or are aimed more at source-based distribution and not that easy for developers (0install).

Our Solution

After some tinkering we came up with Buildyard, a CMake-based meta build system. It builds on top of the ExternalProject CMake module, which is good but requires quite some customization when used. We’ve simplified this further, and adding a new CMake-based project is a couple of lines in a configuration file, e.g.:

set(EQUALIZER_VERSION 1.3.1)
set(EQUALIZER_DEPENDS gpu-sd Boost hwloc vmmlib Lunchbox GLStats)
set(EQUALIZER_ROOT_VAR EQ_ROOT)
set(EQUALIZER_REPO_URL https://github.com/Eyescale/Equalizer.git)
set(EQUALIZER_REPO_TAG master)

Creating this simple configuration file makes it much easier to build a project such as Equalizer and all its dependencies. You simply clone Buildyard and use make Equalizer in the cloned directory. This will first configure Buildyard to resolve the dependencies, download the ones not installed yet, and then configure/build/install them in the correct order into a local installation directory. This even works with parallel builds.

The development process is largely unchanged as well. Simply go to src/project, code and compile away. Buildyard will inject a Makefile into each project source directory which sets up the proper targets for building. When compiling in a source directory, other project dependencies are not considered by default, which keeps compiles speedy.

The whole thing is easily extensible. Internally we’ve got a config.bbp repository for the closed-source project configurations, which build on top of the open source ones pre-configured with Buildyard. Developers clone that into the Buildyard directory, where it will be picked up on the next CMake run.

This post just scratches the surface. Have a look at the Readme for more details and post questions below. While it’s not as sophisticated as ryppl, it works and we already use it daily in quite a number of projects. It also makes creating new projects such as dash, codash, Lunchbox and GLStats a lot easier.

Introducing lunchbox::RequestHandler

25. May 2012

The last post –for now– in the lunchbox series is about the RequestHandler. This class is the most specific in Lunchbox, and makes most sense in the context of Collage. The primary use case is to register a pending operation and wait on its completion by another thread. The pattern is similar to futures, except that a request can be easily identified by a single integer.

In Collage, it is heavily used for synchronous, remote operations. The caller registers a request, sends a packet containing the request ID to a remote node which eventually replies using the request ID. The local reply handler then serves the local request, unblocking the caller waiting on the request. For convenience, data can be attached to the request, and the thread serving the request can provide a (return) value passed on to the waitRequest().
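
The register, serve and wait cycle described above can be sketched in a few lines of C++11. This is only an illustration of the pattern under simplifying assumptions: the class SimpleRequestHandler, its method signatures and the plain int result are made up for this example and are not the Lunchbox API.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <map>
#include <mutex>

// Hypothetical sketch of the request pattern, not lunchbox::RequestHandler.
class SimpleRequestHandler
{
public:
    SimpleRequestHandler() : nextID_( 0 ) {}

    // Register a pending operation and return the integer identifying it.
    std::uint32_t registerRequest()
    {
        std::lock_guard< std::mutex > lock( mutex_ );
        const std::uint32_t id = nextID_++;
        requests_[ id ] = Request();
        return id;
    }

    // Called by the thread completing the operation, e.g. a reply handler.
    void serveRequest( const std::uint32_t id, const int result )
    {
        {
            std::lock_guard< std::mutex > lock( mutex_ );
            std::map< std::uint32_t, Request >::iterator i =
                requests_.find( id );
            if( i == requests_.end( ))
                return; // unknown or already consumed request
            i->second.result = result;
            i->second.served = true;
        }
        condition_.notify_all();
    }

    // Block until the request is served, then unregister it and return
    // the value provided by the serving thread.
    int waitRequest( const std::uint32_t id )
    {
        std::unique_lock< std::mutex > lock( mutex_ );
        assert( requests_.count( id ) == 1 ); // must have been registered
        condition_.wait( lock,
                         [ this, id ] { return requests_[ id ].served; } );
        const int result = requests_[ id ].result;
        requests_.erase( id );
        return result;
    }

private:
    struct Request
    {
        Request() : served( false ), result( 0 ) {}
        bool served;
        int result;
    };

    std::mutex mutex_;
    std::condition_variable condition_;
    std::map< std::uint32_t, Request > requests_;
    std::uint32_t nextID_;
};
```

In a Collage-like setting, the caller would send the identifier returned by registerRequest() along with its packet and then block in waitRequest(), while the reply handler calls serveRequest() with the identifier found in the reply.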

This concludes the lunchbox introduction. I’ve intentionally skipped over the following classes since I believe their concepts are more commonplace: Buffer, Clock, DSO, Lock, SpinLock, TimedLock, ScopedMutex, Log, MemoryMap, PerThread, Pool, RefPtr, RNG, Thread and UUID. If you want to hear some background on one of them, please post below.

Next week I’ll cover another topic: Buildyard, Collage and dash are on the list.

Introducing lunchbox::LFVector< T >

18. May 2012

Today’s post is about the LFVector. If you’ve been paying attention to the previous posts in this series, you’ll know that this is a thread-safe, lock-free, STL-like vector implementation. It is based on the algorithm described by Damian Dechev et al. in Lock-free dynamically resizable arrays.

Since the vector provides resizing operations, it is not completely thread-safe for all combinations of concurrent operations. We’ve also decided to make operations modifying the vector thread-safe by using a spin lock, which costs two compare-and-swap operations in the uncontended case. All read operations are fully lock-free and wait-free.

The LFVector algorithm uses a clever reallocation strategy. In contrast to the STL vector, the data is not stored in a single array, since this would invalidate existing storage during resize operations. Instead, it uses a number of slots pointing to arrays of size 2^slot_index. This allows new storage to be allocated during concurrent read accesses to existing elements. It also keeps the read access very fast, using just a few operations to access an element:

const int32_t slot = getIndexOfLastBit( ++i );
const size_t index = i ^ ( size_t( 1 )<<slot );
return slots_[ slot ][ index ];

A minor static overhead of this approach is that the slots have to be pre-allocated, which costs 504 bytes in the worst-case scenario (63 slots of 8 bytes). The number of slots is a configurable template parameter, preset to 32 which seems reasonable for most use cases.
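
To make the slot arithmetic concrete, here is a small standalone sketch of the index computation from the snippet above. The helper names highestSetBit(), locate() and slotCapacity() and the SlotPosition struct are invented for this example; highestSetBit() is a simple loop standing in for getIndexOfLastBit(), whose actual implementation is not shown in this post.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Position of an element inside the slot-based storage (hypothetical type).
struct SlotPosition
{
    std::int32_t slot;  // which array holds the element
    std::size_t index;  // offset inside that array
};

// Stand-in for getIndexOfLastBit(): index of the highest set bit of a
// non-zero value, e.g. 1 -> 0, 2 -> 1, 3 -> 1, 4 -> 2.
inline std::int32_t highestSetBit( std::size_t value )
{
    assert( value != 0 );
    std::int32_t bit = 0;
    while( value >>= 1 )
        ++bit;
    return bit;
}

// Map a flat element index to its slot and the offset within that slot.
inline SlotPosition locate( std::size_t i )
{
    ++i; // shift by one so that element 0 lands in slot 0
    const std::int32_t slot = highestSetBit( i );
    const std::size_t index = i ^ ( std::size_t( 1 ) << slot ); // clear top bit
    const SlotPosition position = { slot, index };
    return position;
}

// Number of elements stored in a given slot: 2^slot.
inline std::size_t slotCapacity( const std::int32_t slot )
{
    return std::size_t( 1 ) << slot;
}
```

With this mapping, element 0 lives in slot 0, elements 1 and 2 in slot 1, elements 3 to 6 in slot 2, and so on. Each new slot doubles the capacity without moving any existing element, which is what keeps concurrent reads valid during growth.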

This slot-based storage requires a different approach to implementing iterators, which are based on maintaining the current index and using the operator [] instead of the pointer magic used by the STL vector iterators. In practice, they have the same complexity as the STL iterators and use the same interface.

Insert and erase operations first acquire the spin lock, and then potentially allocate or free the storage corresponding to a slot. Insert operations are completely thread-safe with read operations, since existing end() iterators will keep pointing to the old end of the vector. Furthermore, the size is updated after the element is inserted, so size() followed by a read is also thread-safe with a concurrent insert. For erase operations, a concurrent read on the removed element produces undefined results. This affects in particular the end() and back() methods.

The LFVector is one of the magic pieces in DASH, which I guess warrants a whole set of posts in the future. DASH is a very exciting work in progress for generic, efficient multi-threaded data access. Drop me a note if you want to beta test, provide feedback and contribute to this already.

As an added bonus, LFVector has support for boost::serialization.

EGPGV 2012

15. May 2012

If you were wondering what’s up with last week’s ‘Introducing Lunchbox’ post: There wasn’t one since I was at the Eurographics Symposium on Parallel Graphics and Visualization to present our paper “Parallel Rendering on Hybrid Multi-GPU Clusters”. This week I’m attending Eurographics, but I’ll try to post the fourth article in the Lunchbox series by Friday.

Our paper presented a collection and evaluation of optimizations for medium-sized GPU clusters which use multi-GPU NUMA nodes. This type of architecture is quite important because it provides a cost-effective configuration for parallel rendering: the host and network infrastructure cost is amortized over multiple GPUs. While working on this paper we gained a few surprising insights (<cough>glFinish</cough>) into which optimizations are actually important.

Enough talk: The most important parts are summarized in our presentation. Enjoy!

Introducing lunchbox::MTQueue< T >

4. May 2012

Unsurprisingly, this time we’ll look into the MTQueue. The multi-threaded queue is the blocking, fully thread-safe big brother of the LFQueue discussed last week.

The MTQueue is fully thread-safe, that is, any public method can be called from any thread at any given time. Any request which cannot be satisfied immediately will wait until it can be.

Naturally, the most common use case is to pop() items from the queue, potentially waiting for them. This is used extensively in Equalizer, e.g., by the pipe render threads which pop tasks received from the server to execute them. For pop, there are also the non-blocking tryPop() and the time-bounded timedPop() methods, both of which may fail.

But the MTQueue goes further in the blocking paradigm: getFront(), getBack(), operator [] and even push() may block. For the blocking push, the queue has a runtime-configurable maximum size which limits the number of items in the queue. This is useful when linking a slow consumer thread with a fast producer thread to limit memory usage and eventually slow down the producer. Even setMaxSize() blocks until the queue meets the new maximum size requirement!

To implement the thread-safety and blocking semantics, almost every operation uses a Condition lock/unlock and signal or wait. Since this condition has a certain overhead, bulk operations such as push( std::vector ) amortize this cost over multiple items. This is for example used by the RSPConnection, which will push multiple buffers at once to the application when possible (see last week’s post).
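
The blocking pop and the bounded, blocking push can be sketched with a mutex and a condition variable. The BoundedQueue class below is a made-up, simplified C++11 illustration of these semantics. It is not the MTQueue implementation, it fixes the maximum size at construction, and it leaves out timedPop(), getFront(), getBack() and setMaxSize().

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>

// Hypothetical sketch of a bounded blocking queue, not lunchbox::MTQueue.
template< class T > class BoundedQueue
{
public:
    explicit BoundedQueue( const std::size_t maxSize ) : maxSize_( maxSize )
    {
        assert( maxSize > 0 );
    }

    // Blocks while the queue is full, slowing down a fast producer.
    void push( const T& item )
    {
        std::unique_lock< std::mutex > lock( mutex_ );
        condition_.wait( lock,
                         [ this ] { return items_.size() < maxSize_; } );
        items_.push_back( item );
        condition_.notify_all(); // wake up consumers waiting in pop()
    }

    // Blocks while the queue is empty.
    T pop()
    {
        std::unique_lock< std::mutex > lock( mutex_ );
        condition_.wait( lock, [ this ] { return !items_.empty(); } );
        T item = items_.front();
        items_.pop_front();
        condition_.notify_all(); // wake up producers waiting in push()
        return item;
    }

    // Non-blocking variant: returns false instead of waiting.
    bool tryPop( T& item )
    {
        std::lock_guard< std::mutex > lock( mutex_ );
        if( items_.empty( ))
            return false;
        item = items_.front();
        items_.pop_front();
        condition_.notify_all();
        return true;
    }

    std::size_t size() const
    {
        std::lock_guard< std::mutex > lock( mutex_ );
        return items_.size();
    }

private:
    mutable std::mutex mutex_;
    std::condition_variable condition_;
    std::deque< T > items_;
    const std::size_t maxSize_;
};
```

A single condition variable is shared by producers and consumers here to keep the sketch short; every operation takes the lock and signals, which is the per-operation overhead that bulk operations amortize.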

Fun fact: During the writing of this post, I discovered and fixed a thread-safety issue with the copy constructor and assignment operator.

Introducing lunchbox::LFQueue< T >

27. April 2012

The second installment on Lunchbox introduces the lock-free queue.

Lunchbox uses two naming schemes when implementing containers which have an STL counterpart, ‘LF’ and ‘MT’. In both cases the containers are thread-safe, in contrast to their STL counterparts. These classes carefully document what types of multi-threaded access are allowed, and which methods may only be called from a certain thread or in a certain state, if any.

‘MT’ stands for multi-threaded and typically uses synchronization primitives and blocking access. More on ‘MT’ classes in a later post.

‘LF’ stands for lock-free and uses atomic variables and non-blocking access. Lunchbox provides an Atomic class, which is derived from a library which recently got accepted as boost::lock_free. Atomic variables are a standard concept, google it if you’re not familiar with it.

Implementing lock-free containers is a very tricky business. Smart minds spend a lot of time on it, and still get it wrong quite often. For that reason, the functionality of the LFQueue is limited. To begin with, it has a fixed-size storage allocated at construction time, so a push might fail if the queue is full. Furthermore, only a single thread may write and (another) single thread may read at the same time, that is, at most two threads can access the LFQueue at the same time.
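
These restrictions (fixed storage, one writer, one reader) are exactly what makes a simple ring buffer with two atomic indices sufficient. The SpscQueue class below is a minimal C++11 sketch of that idea using std::atomic. It is an assumption about how such a queue can be built, not the LFQueue source, and the class and method names are invented for this example.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical single-producer, single-consumer queue, not lunchbox::LFQueue.
template< class T > class SpscQueue
{
public:
    // One extra element distinguishes a full queue from an empty one.
    explicit SpscQueue( const std::size_t capacity )
        : buffer_( capacity + 1 ), head_( 0 ), tail_( 0 )
    {
        assert( capacity > 0 );
    }

    // Writer thread only. Returns false if the queue is full.
    bool push( const T& item )
    {
        const std::size_t tail = tail_.load( std::memory_order_relaxed );
        const std::size_t next = ( tail + 1 ) % buffer_.size();
        if( next == head_.load( std::memory_order_acquire ))
            return false; // full
        buffer_[ tail ] = item;
        // Publish the item: the reader sees it after loading tail_.
        tail_.store( next, std::memory_order_release );
        return true;
    }

    // Reader thread only. Returns false if the queue is empty.
    bool pop( T& item )
    {
        const std::size_t head = head_.load( std::memory_order_relaxed );
        if( head == tail_.load( std::memory_order_acquire ))
            return false; // empty
        item = buffer_[ head ];
        // Release the element so the writer may reuse its storage.
        head_.store( ( head + 1 ) % buffer_.size(),
                     std::memory_order_release );
        return true;
    }

private:
    std::vector< T > buffer_;           // fixed-size storage
    std::atomic< std::size_t > head_;   // next element to read
    std::atomic< std::size_t > tail_;   // next element to write
};
```

Each index is written by exactly one thread, so no compare-and-swap loop is needed. A failed push or pop simply returns false and leaves it to the caller to retry or do something else.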

All these restrictions seem severe, but they allow for a fast and simple implementation. A quick test on my laptop yields:

[roku Release master]% ./Lunchbox/tests/mtQueue
193.339 reads/ms
193.34 writes/ms
[roku Release master]% ./Lunchbox/tests/lfQueue
12288.8 reads/ms, 6629.31 empty/ms
6145.42 writes/ms, 2031.51 full/ms

Collage uses the LFQueue in its multicast implementation. The RSPConnection uses a protocol thread handling the sending of data, acknowledgements, negative acknowledgements and retransmissions. Each connection has a fixed set of data buffers, which are constantly shuffled around between the application and protocol thread. At high wire speed (10GigE), more than 100,000 buffers need to be shifted each second. From the protocol thread to the application thread this is blocking, using an MTQueue, and in the other direction it’s non-blocking using an LFQueue. Since the queue size (number of buffers) is fixed, and only two threads are involved, the LFQueue is well suited here. For the curious, here is more background on the RSP implementation.

Introducing Lunchbox: Monitor< T >

20. April 2012

We’ve recently refactored the Equalizer and Collage base library co::base into a separate library and project called Lunchbox. This post has some background on this change.

Why Lunchbox? It’s a basic ‘toolbox’ library for multi-threaded programming. This combined with the famous ‘the free lunch is over’ quote led to the name Lunchbox.

I’ve decided to do a weekly series of posts where I present one feature/class of Lunchbox which is beyond the basic, well-documented threading primitives provided by STL and boost. Today it’s the turn of the template class Monitor.

The monitor is a higher-level primitive based on a condition variable. Lunchbox also has a Condition class which encapsulates a condition variable and associated lock. If you’re not familiar with condition variables, google it. Smarter people than me have written good articles about them, and the Lunchbox API is pretty straightforward.

The monitor allows observing the state of a variable in a blocking fashion. You can think of it as a blocking variable: a monitor has a value which can be incremented, decremented and set. Any thread can wait on the monitor to reach a certain value (waitEQ), to leave a certain value (waitNE), or to reach (waitGE) or undercut (waitLE) a given value. Using a monitor makes the code easier to understand and more robust than using a traditional semaphore or barrier.

In Equalizer I typically use them to monitor a state for synchronization. For example, the pipe initialization needs to wait for the node to be initialized. Since these tasks are queued and executed in parallel, the pipe thread monitors the node state to pass initialization. The eq::Node has a Monitor< State > which represents its current state (stopped, running, failed, …):

enum State
{
    STATE_STOPPED,
    STATE_INITIALIZING,
    STATE_INIT_FAILED,
    STATE_RUNNING,
    STATE_FAILED
};
lunchbox::Monitor< State > _state;

This monitor makes it possible to wait for the state to reach a certain value, which is used in Node::waitInitialized to wait from the pipe thread for the node thread to finish initialization:

void Node::waitInitialized() const
{
    _state.waitGE( STATE_INIT_FAILED );
}

The node state is advanced after initialization by the node main thread:

_state = initResult ? STATE_RUNNING : STATE_INIT_FAILED;

Similarly, the per-node frame synchronization during rendering uses monitors of frame numbers to synchronize the node main thread and pipe render threads.

Since the monitor is a template class, you can use it with your own data types. Monitors have become an invaluable primitive in Collage and Equalizer for thread synchronization.
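
For readers who want to see the blocking-variable idea spelled out, here is a minimal C++11 sketch built on std::mutex and std::condition_variable. The SimpleMonitor class is invented for this example and only borrows the waitEQ/waitNE/waitGE/waitLE names from the description above. It makes no claim about how lunchbox::Monitor is implemented, and it omits increment and decrement.

```cpp
#include <condition_variable>
#include <mutex>

// Hypothetical sketch of a blocking variable, not lunchbox::Monitor.
template< class T > class SimpleMonitor
{
public:
    explicit SimpleMonitor( const T& value = T( )) : value_( value ) {}

    // Change the value and wake up all threads waiting on it.
    void set( const T& value )
    {
        {
            std::lock_guard< std::mutex > lock( mutex_ );
            value_ = value;
        }
        condition_.notify_all();
    }

    T get() const
    {
        std::lock_guard< std::mutex > lock( mutex_ );
        return value_;
    }

    // Each wait blocks until the current value satisfies the comparison.
    void waitEQ( const T& value ) const
    {
        waitFor( [ &value ]( const T& current ) { return current == value; } );
    }
    void waitNE( const T& value ) const
    {
        waitFor( [ &value ]( const T& current ) { return current != value; } );
    }
    void waitGE( const T& value ) const
    {
        waitFor( [ &value ]( const T& current ) { return current >= value; } );
    }
    void waitLE( const T& value ) const
    {
        waitFor( [ &value ]( const T& current ) { return current <= value; } );
    }

private:
    // Re-check the predicate under the lock after every wake-up.
    template< class Predicate > void waitFor( Predicate predicate ) const
    {
        std::unique_lock< std::mutex > lock( mutex_ );
        condition_.wait( lock,
                         [ this, &predicate ] { return predicate( value_ ); } );
    }

    mutable std::mutex mutex_;
    mutable std::condition_variable condition_;
    T value_;
};
```

With an enum such as the State above, a SimpleMonitor< State > would let one thread call waitGE( STATE_INIT_FAILED ) while another thread calls set() once initialization has finished.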