Archive for the ‘Intel’ Category

Sneak Peek: The new Collage

17. August 2012

We’ve started working on making Collage endian safe to be able to communicate between an IBM BlueGene and X86 workstations. This requires that all data exchanges can be byte-swapped, which necessitates heavy refactoring. Long story short, this stalled the 1.0 API definition since the API for Node/LocalNode is not yet final. Therefore I’ll throw in a preview on how the new Collage peer-to-peer communications will look like.

Every communication will be stream-based (see last week’s post). The receiver knows the endianness of the sender and will swap, if necessary, the byte order in the corresponding DataIStream. This approach has the benefit that byte swapping only occurs in mixed-endian environment, whereas the traditional network big endian order nowadays requires two swaps in an x86-only cluster. The packets exchanged between nodes today will disappear completely, since Collage can’t examine their structure.

You can observe this work in the corresponding issue ticket. Sending a packet in the old way looks like this:

NodeAttachObjectPacket packet;
packet.requestID = _localNode->registerRequest( object );
packet.objectID = id;
packet.objectInstanceID = instanceID;
_localNode->send( packet );

This is replaced by a DataOStream, which will send the data once it goes out of scope, making the code much nicer:

const uint32_t requestID = _localNode->registerRequest( object );
_localNode->send( CMD_NODE_ATTACH_OBJECT ) << id << instanceID << requestID;

The receiving side doesn’t improve as much, since we still need to extract all data into local variables. The old code accesses the raw dat by casting it to the appropriate packet:

const NodeAttachObjectPacket* packet = command.get();
[...]
_attachObject( object, packet->objectID, packet->objectInstanceID );

The new code extracts the data into local variables, which will be endian-converted if necessary:

const UUID& objectID = stream< UUID >.get();
const uint32_t instanceID = stream< uint32_t >.get();
[...]
_attachObject( object, objectID, instanceID );

The good news is that Equalizer application code is not affected at all by this. Besides the byte swapping, this will enable other features, since all data passes through the DataI/OStream, for example automatically compressing any packet over a certain size.

This work will continue the following weeks, and once it’s merged into the master branch I’ll continue with my introduction to the then new-and-shiny Collage.

Advertisements

OpenCL – Finally a generic GPGPU API?!

14. June 2008

My first thought when Apple announced OpenCL at this years’ WWDC was ‘Not another GPGPU language!’. While details are still sparse, the Wikipedia article states that OpenCL is submitted to Khronos.

This is indeed good news. So far I have shunned GPGPU languages since they tend to be proprietary or not generally available. While using API’s like CUDA or RMDP is relatively easy, the last 10% of motivation where always missing for me since it would only work on one platform or cost money.

I hope that OpenCL is going to be adopted fast (I am looking at you nVidia, Intel and ATI!) and becomes as ubiquitous as OpenGL.

Edit: It is official!

ICC, GCC and OpenMP

15. May 2008

Since a colleague finished the CPU-based alpha-compositing in Equalizer, it was time for another compiler benchmark round.

Performance of gcc, icc and OpenMP
This time I used my MacBook Pro with an Intel Core 2 Duo 2.16 GHz, running Mac OS X 10.5.2. The compilers available were gcc 4.0.1, gcc 4.2.1 and icc 10.1.014. The latter two ones I tested with OpenMP disabled and enabled.
The results can be seen on the left (click on the picture for a large version). The upper graph shows the absolute throughput in MB/s for the performance-critical algorithms in Equalizer, and the lower the relative performance compared to the gcc 4.0.1 baseline.

Depth compositing assembles multiple color input images into an destination image based on the depth values. This is used for recombining the result of database decompositions of polygonal data.
Alpha compositing blends the results of volume rendering based on the alpha-value of the images.
Image compression is a RLE-like algorithm used to compress the images during network transfer.

For all tests only the basic optimization flag ‘-O2’ was used. I am sure that by tweaking the compiler flags and code, more performance can be squeezed out of the algorithm.
Nevertheless the results are interesting and representative, since I don’t have the time to investigate and maintain more complicated optimizations.
I think most programmers are under similar time constraints, and getting a 50-100% speed bump by just changing the compiler, and another couple of percents for adding a simple OpenMP pragma is quite valuable.

Good work Intel and the GCC-OpenMP team!

PS: Anybody has seen this bug with gcc and OpenMP?