Archive for the ‘Benchmarks’ Category

Introducing Collage: DataI/OStream

10. August 2012

The co::DataOStream and co::DataIStream form the core of the co::Object data distribution. They will gain even more importance in the next couple of weeks, when they will replace the current packet-based messaging (see 145). They will become the core of any communication between Collage nodes.

The data iostreams provide a std::iostream-like interface to send data over the network. They hide all the network connection details, allow overlapping of data serialization and sending through bucketization, do configurable compression and allow application-provided serializers for custom data types. We’re currently working on extending them to also do automatic endian conversion.

First of all, they allow object serialization without the application needing to know whether the data has to be saved for later use (buffered objects), who will receive the data, whether to use multicast, or how to compress it. The serialization in the application code is as simple as possible; here, for example, is eqPly::FrameData:

void FrameData::serialize( co::DataOStream& os, const uint64_t dirtyBits )       
{                                                                                
    co::Serializable::serialize( os, dirtyBits );                                
    if( dirtyBits & DIRTY_CAMERA )                                               
        os << _position << _rotation << _modelRotation;                          
    if( dirtyBits & DIRTY_FLAGS )                                                
        os << _modelID << _renderMode << _colorMode << _quality << _ortho        
           << _statistics << _help << _wireframe << _pilotMode << _idle          
           << _compression;                                                      
    if( dirtyBits & DIRTY_VIEW )                                                 
        os << _currentViewID;                                                    
    if( dirtyBits & DIRTY_MESSAGE )                                              
        os << _message;                                                          
}                                                                                

The deserialize method looks exactly the same, except using a DataIStream and the >> operator instead of <<. Applications can write free-standing serialization functions, similar to free-standing std::iostream operators.
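To make the free-standing operator idea concrete, here is a minimal sketch. SketchOStream, SketchIStream and Vec3 are hypothetical stand-ins that only read and write an in-memory buffer; they are not the Collage classes, and the real co::DataOStream interface may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for an output stream: appends raw bytes to a buffer.
class SketchOStream
{
public:
    void write( const void* data, const size_t size )
    {
        const uint8_t* bytes = static_cast< const uint8_t* >( data );
        _buffer.insert( _buffer.end(), bytes, bytes + size );
    }
    const std::vector< uint8_t >& getBuffer() const { return _buffer; }

private:
    std::vector< uint8_t > _buffer;
};

// Hypothetical stand-in for an input stream: reads bytes back in order.
class SketchIStream
{
public:
    explicit SketchIStream( const std::vector< uint8_t >& buffer )
        : _buffer( buffer ), _pos( 0 ) {}

    void read( void* data, const size_t size )
    {
        assert( _pos + size <= _buffer.size( )); // never read past the end
        std::memcpy( data, _buffer.data() + _pos, size );
        _pos += size;
    }

private:
    std::vector< uint8_t > _buffer;
    size_t _pos;
};

// Operators for a basic type, as a stream class would provide them
inline SketchOStream& operator << ( SketchOStream& os, const float value )
    { os.write( &value, sizeof( value )); return os; }
inline SketchIStream& operator >> ( SketchIStream& is, float& value )
    { is.read( &value, sizeof( value )); return is; }

// Application-defined type (hypothetical)
struct Vec3 { float x, y, z; };

// Free-standing serializers, written outside of both Vec3 and the streams
inline SketchOStream& operator << ( SketchOStream& os, const Vec3& v )
    { return os << v.x << v.y << v.z; }
inline SketchIStream& operator >> ( SketchIStream& is, Vec3& v )
    { return is >> v.x >> v.y >> v.z; }
```

With these two operators in place, a Vec3 member can be chained in a serialize method like any built-in type, e.g. os << _position.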

Behind the scenes, Collage sets up the ostream with all connections to the nodes which have a slave instance of the FrameData, preferring multicast connections. The output is bucketized, that is, whenever the accumulated data reaches a certain threshold (default ~64k), the current block is sent. This allows the OS to send data while the application prepares the next buffer.
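The bucketization idea can be sketched as follows. BucketizedOStream and its send callback are hypothetical names for illustration, not Collage code, and the callback here runs synchronously, whereas a real implementation would hand the block to the connections so that sending overlaps with further serialization.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical sketch: accumulate written bytes and pass a block to the
// send callback whenever the accumulated size reaches the threshold.
class BucketizedOStream
{
public:
    typedef std::vector< uint8_t > Block;
    typedef std::function< void( const Block& ) > SendFunc;

    BucketizedOStream( const size_t threshold, const SendFunc& send )
        : _threshold( threshold ), _send( send )
    {
        assert( threshold > 0 );
    }

    void write( const void* data, const size_t size )
    {
        const uint8_t* bytes = static_cast< const uint8_t* >( data );
        _block.insert( _block.end(), bytes, bytes + size );
        if( _block.size() >= _threshold )
            flush(); // bucket is full: send it and start a new one
    }

    // Sends any remaining data, e.g. at the end of the serialization
    void flush()
    {
        if( _block.empty( ))
            return;
        _send( _block );
        _block.clear();
    }

private:
    const size_t _threshold;
    SendFunc _send;
    Block _block;
};
```

With a threshold of 64k, a large object produces a series of blocks of roughly that size, followed by one smaller block from the final flush.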

Each DataOStream has a configurable compressor. The default algorithm uses heuristics to choose the best tradeoff between speed and compression ratio. Each outgoing packet is compressed before transmission. For buffered objects, the compressed data is retained to optimize memory usage.

Currently we are working on automatic endian conversion to form Collage networks between little- and big-endian hosts (see 146). Since most of the data types are known at deserialization time, a templated swap function will provide the endianness conversion. For void data we simply have to assume that it is already endian-safe, or that the application doesn’t need endian safety. In Collage and Equalizer we will make sure everything is endian-safe, e.g., by using portable boost serialization archives for the co::DataOStreamArchive.
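A templated swap function of this kind could look roughly like the sketch below. The names swapBytes and isLittleEndian are assumptions for illustration; the function that ends up in Collage may look different.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical helper: reverse the byte order of any arithmetic value.
template< class T > T swapBytes( const T value )
{
    static_assert( std::is_arithmetic< T >::value,
                   "swapBytes expects an arithmetic type" );
    unsigned char bytes[ sizeof( T ) ];
    std::memcpy( bytes, &value, sizeof( T ));
    std::reverse( bytes, bytes + sizeof( T ));

    T result;
    std::memcpy( &result, bytes, sizeof( T ));
    return result;
}

// Hypothetical helper: true if the least significant byte is stored first.
inline bool isLittleEndian()
{
    const uint16_t probe = 1;
    unsigned char first;
    std::memcpy( &first, &probe, 1 );
    return first == 1;
}
```

A receiver would call swapBytes on each typed value only if the sender's byte order differs from its own, which is why untyped void data cannot be converted automatically.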

Lock Performance

5. July 2011

I’m currently working on a low-level library where locked data access has to be optimized. Therefore I benchmarked the performance of the three lock types in Collage on Linux and Mac OS X. The test runs a number of threads which only set and unset the lock, without any other operation. Click on the image below to get a full-resolution image. Be aware that the chart uses a double-log scale.
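For readers who want to try something similar, here is a minimal sketch of such a micro-benchmark. It is not the original test code: benchmarkLock and BenchResult are hypothetical names, it uses std::thread and std::mutex instead of the Collage lock types, and it increments a counter inside the lock so that the result can be checked.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

struct BenchResult
{
    size_t operations; // total lock/unlock pairs over all threads
    double seconds;    // wall-clock time for the whole run
};

// Hypothetical benchmark: nThreads threads each set and unset the lock
// opsPerThread times. Lock needs lock() and unlock() methods.
template< class Lock >
BenchResult benchmarkLock( const size_t nThreads, const size_t opsPerThread )
{
    assert( nThreads > 0 );
    Lock lock;
    size_t counter = 0; // protected by lock, verifies mutual exclusion
    std::vector< std::thread > threads;

    const auto start = std::chrono::steady_clock::now();
    for( size_t i = 0; i < nThreads; ++i )
        threads.emplace_back( [&lock, &counter, opsPerThread]
        {
            for( size_t j = 0; j < opsPerThread; ++j )
            {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    for( std::thread& thread : threads )
        thread.join();
    const auto end = std::chrono::steady_clock::now();

    BenchResult result;
    result.operations = counter; // already the sum over all threads
    result.seconds = std::chrono::duration< double >( end - start ).count();
    return result;
}
```

The throughput is then operations divided by seconds. Since operations is already the total over all threads, it must not be multiplied by the thread count again.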

The two benchmarks cannot be directly compared since they did not run on the same hardware. There are nevertheless a few interesting observations:

(1) Spinlocks are faster than ‘real’ locks. I’ve blogged about this before. Since they consume CPU time while spinning, they should only be held for a very short time, e.g., to read a value. The Collage implementation immediately backs off when encountering a set lock by yielding the thread. This avoids priority inversion, which can be observed with some pthread spin lock implementations.

(2) pthread locks are dead slow on Mac OS X. Be aware that the graph uses log scale – a spin lock is up to three orders of magnitude faster than a pthread lock!

(3) Timed locks are slower than un-timed. This meets my intuitive expectation, since the timed implementation is more complex. The timed lock in Collage is implemented using pthread_cond_timedwait.

(4) The spin lock is faster on OS X than on Linux, even though the OS X machine has the slower hardware. Not sure why that is the case. The Collage spin lock uses an atomic variable and compare_and_set. Either these operations are faster on the Core i5, or the thread yield behaves ‘better’ on OS X.

(5) Single-threaded lock access in pthread libraries seems to be optimized.

(6) pthread conditions on Linux show a steep performance drop once there are more threads than cores. This could be a scheduling issue again.
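The spin lock behaviour described in (1) and (4) can be sketched with standard C++ atomics. SketchSpinLock is a hypothetical name and this is not the Collage source; it only shows the combination of an atomic flag, a compare-and-set and an immediate yield on contention.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical spin lock: an atomic flag set with compare-and-set, backing
// off by yielding the thread whenever the lock is already set.
class SketchSpinLock
{
public:
    SketchSpinLock() : _locked( false ) {}

    void lock()
    {
        while( !tryLock( ))
            std::this_thread::yield(); // back off instead of busy-spinning
    }

    // Returns true if the lock was free and is now held by the caller
    bool tryLock()
    {
        bool expected = false;
        return _locked.compare_exchange_strong( expected, true,
                                                std::memory_order_acquire );
    }

    void unlock()
    {
        assert( _locked.load( )); // unlocking a free lock is a caller bug
        _locked.store( false, std::memory_order_release );
    }

    bool isLocked() const { return _locked.load(); }

private:
    std::atomic< bool > _locked;
};
```

Because the class has lock() and unlock(), it also works with std::lock_guard for scoped locking.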

Next I’ll work on benchmarking and optimizing read/write locking in the Collage Spinlock. Stay tuned for updates!

EDIT: I discovered a bug in my micro-benchmark which wrongly multiplied the results by the number of threads – doh! The figure is fixed now with a new test run.

Two Methods for driving OpenGL Display Walls

7. July 2008

Recently the VMML at the University of Zürich performed a benchmark comparing Chromium and Equalizer on a display wall. The result surprised me, as I would have expected less difference between the two solutions in this setup, since only static display lists are used. Unfortunately neither InfiniBand nor the broadcast SPU were available for this test, which should improve the Chromium performance.

The performance graph is on the left. You can download the White Paper from the Equalizer website.

ICC, GCC and OpenMP

15. May 2008

Since a colleague finished the CPU-based alpha-compositing in Equalizer, it was time for another compiler benchmark round.

Performance of gcc, icc and OpenMP
This time I used my MacBook Pro with an Intel Core 2 Duo 2.16 GHz, running Mac OS X 10.5.2. The compilers available were gcc 4.0.1, gcc 4.2.1 and icc 10.1.014. The latter two I tested with OpenMP both disabled and enabled.
The results can be seen on the left (click on the picture for a large version). The upper graph shows the absolute throughput in MB/s for the performance-critical algorithms in Equalizer, and the lower the relative performance compared to the gcc 4.0.1 baseline.

Depth compositing assembles multiple color input images into a destination image based on the depth values. This is used for recombining the result of database decompositions of polygonal data.
Alpha compositing blends the results of volume rendering based on the alpha-value of the images.
Image compression is a RLE-like algorithm used to compress the images during network transfer.
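To give an idea of the kind of inner loops being measured, here is a minimal sketch of depth compositing and of a plain run-length encoding. The Image layout and the function names are assumptions for illustration; the Equalizer code is more involved, and its compressor is only RLE-like, not this exact scheme.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical image layout: one packed color and one depth value per pixel
struct Image
{
    std::vector< uint32_t > color;
    std::vector< float > depth; // smaller values are nearer to the viewer
};

// Depth compositing: per pixel, keep whichever fragment is nearer.
void compositeDepth( Image& dst, const Image& src )
{
    assert( dst.color.size() == src.color.size( ));
    assert( dst.depth.size() == dst.color.size( ));
    assert( src.depth.size() == src.color.size( ));

    for( size_t i = 0; i < dst.color.size(); ++i )
    {
        if( src.depth[i] < dst.depth[i] )
        {
            dst.color[i] = src.color[i];
            dst.depth[i] = src.depth[i];
        }
    }
}

// Plain run-length encoding: a list of (run length, pixel value) pairs
typedef std::pair< uint32_t, uint32_t > Run;

std::vector< Run > compressRLE( const std::vector< uint32_t >& pixels )
{
    std::vector< Run > runs;
    for( const uint32_t pixel : pixels )
    {
        if( !runs.empty() && runs.back().second == pixel )
            ++runs.back().first;           // extend the current run
        else
            runs.push_back( Run( 1, pixel )); // start a new run
    }
    return runs;
}
```

Both loops touch every pixel once and do very little work per pixel, which is why the compiler's code generation has such a visible effect on throughput.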

For all tests only the basic optimization flag ‘-O2’ was used. I am sure that by tweaking the compiler flags and code, more performance can be squeezed out of the algorithm.
Nevertheless the results are interesting and representative, since I don’t have the time to investigate and maintain more complicated optimizations.
I think most programmers are under similar time constraints, and getting a 50-100% speed bump by just changing the compiler, and another few percent by adding a simple OpenMP pragma, is quite valuable.
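For illustration, this is roughly what adding a simple OpenMP pragma to a compositing-style loop looks like. blendOver is a hypothetical function with a single constant alpha, not the Equalizer alpha-compositing code, which blends based on the alpha values of the images.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical blend of src over dst with one constant alpha:
//   dst = src * alpha + dst * (1 - alpha)
void blendOver( std::vector< float >& dst, const std::vector< float >& src,
                const float alpha )
{
    assert( dst.size() == src.size( ));
    // Signed loop index, as required by older OpenMP versions
    const long size = static_cast< long >( dst.size( ));

    // Each iteration writes a different element, so the iterations are
    // independent and can be distributed over the available cores.
#pragma omp parallel for
    for( long i = 0; i < size; ++i )
        dst[i] = src[i] * alpha + dst[i] * ( 1.f - alpha );
}
```

The pragma only takes effect when OpenMP is enabled at compile time (-fopenmp for gcc); otherwise it is ignored and the loop runs serially with the same result.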

Good work Intel and the GCC-OpenMP team!

PS: Has anybody seen this bug with gcc and OpenMP?