pthread_mutex vs atomic operations


Equalizer uses reference-counted pointers in some places, and so far the reference count was protected by a pthread_mutex (or its Windows equivalent). The simple reason is that when I implemented it, I didn’t want to spend the time on something better.

This week I found the time to replace the lock-protected counter with an atomic variable, with surprisingly good results: the frame throughput when just rendering a quad in eqPly went from ~200 FPS to ~750 FPS on my MacBook Pro! As soon as one renders something more complex (like the rockerArm model), the speedup is much less noticeable. I haven’t done any tests yet on a multi-GPU system, where lock contention should be more apparent.
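For illustration, a minimal sketch of the two approaches (written with C++11 std::atomic here; the class names are made up and this is not the actual Equalizer code):

#include <atomic>
#include <mutex>

// Reference count protected by a mutex: every ref/unref acquires and
// releases the lock.
class LockedRefCounted
{
public:
    void ref()
    {
        std::lock_guard< std::mutex > lock( _mutex );
        ++_count;
    }

    bool unref() // returns true when the last reference is gone
    {
        std::lock_guard< std::mutex > lock( _mutex );
        return --_count == 0;
    }

private:
    std::mutex _mutex;
    int _count = 0;
};

// Reference count as an atomic variable: a single hardware atomic
// increment/decrement, with no lock to contend on.
class AtomicRefCounted
{
public:
    void ref() { _count.fetch_add( 1, std::memory_order_relaxed ); }

    bool unref() // returns true when the last reference is gone
    {
        return _count.fetch_sub( 1, std::memory_order_acq_rel ) == 1;
    }

private:
    std::atomic< int > _count{ 0 };
};

The atomic version turns every reference count change into a single hardware instruction instead of a lock/unlock pair.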

So here is some non-news for all you parallel programming guys: Locks are bad for performance! 😉

8 Responses to “pthread_mutex vs atomic operations”

  1. Jonas Says:

    ORLY??

    😉

    Congrats on the speed-up, though.

  2. r bag Says:

    I did similar experiments during my master’s thesis, and the results were also good,
    but I read somewhere, I don’t remember where exactly, that atomic operations are not scalable!
    Do you have any idea about that?

    • eile Says:

      That’s a rather broad statement. Do you have a citation for it?

      In fact, atomic ops as used here provide much better scalability than locks.

      • r bag Says:

        Of course atomic operations are better than locks 🙂

        What I mean by “not scalable” is that they don’t scale well compared to other techniques (like non-blocking algorithms). The problem with non-blocking algorithms is that they are difficult to write!
        By the way, I’m also working on atomic operations in GCC, and it seems to be the best solution for now 🙂

        Oncea talks about that in his SPAA 2009 paper: http://portal.acm.org/citation.cfm?id=1583991.1584050

        • r bag Says:

          By the way, we are doing a lot of work on that at INRIA/Alchemy (https://alchemy.futurs.inria.fr/). I cannot give details about the work, but it seems to be very promising 😉
          As soon as we get really good results (better than atomic operations) I’ll notify you 🙂

        • eile Says:

          I’m by no means a specialist in non-blocking algorithms, but in my understanding a lot of non-blocking data structures rely on atomics for their implementation (see the little sketch below).

          Re. your research: I’m looking forward to seeing the results!
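
          For example, a lock-free stack push is nothing more than a compare-and-swap retry loop on an atomic head pointer (a minimal sketch in C++11 syntax; the names are made up):

          #include <atomic>

          // Minimal lock-free stack push, built directly on an atomic CAS loop.
          template< typename T >
          class LockFreeStack
          {
          public:
              void push( const T& value )
              {
                  Node* node = new Node{ value, _head.load( std::memory_order_relaxed ) };
                  // Retry until no other thread changed the head in the meantime;
                  // on failure compare_exchange_weak reloads the current head into node->next.
                  while( !_head.compare_exchange_weak( node->next, node,
                                                       std::memory_order_release,
                                                       std::memory_order_relaxed ))
                      ;
              }

          private:
              struct Node { T value; Node* next; };
              std::atomic< Node* > _head{ nullptr };
          };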

          • r bag Says:

            So, I have a paper showing that some code transformations can be very useful and can give good performance improvements compared to atomic instructions. I studied some benchmarks from SPEC2000 (the equake benchmark) and found that, on an 8-core machine, atomic instructions give a speedup of 3.18, compared to a speedup of 5.13 using loop transformations such as privatization/reduction (a combination of loop transformations that we introduced, based on classical privatization and reduction).
            If you are interested I can send you a copy of the paper; there is a summary that compares the speedups obtained with locks, STM, hardware atomic instructions and code transformations.
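
            To give a rough idea of the kind of transformation (a hypothetical OpenMP/C++ sketch, not code from the paper): instead of every thread doing an atomic add into one shared accumulator, each thread sums into a private copy and the partial results are combined once per thread.

            // Compile with -fopenmp. Shared accumulator, one atomic add per iteration.
            double sum_atomic( const double* a, int n )
            {
                double sum = 0.0;
                #pragma omp parallel for
                for( int i = 0; i < n; ++i )
                {
                    #pragma omp atomic
                    sum += a[ i ];
                }
                return sum;
            }

            // Privatization/reduction: each thread accumulates into a private copy,
            // and the per-thread partial sums are combined once at the end.
            double sum_reduction( const double* a, int n )
            {
                double sum = 0.0;
                #pragma omp parallel for reduction(+:sum)
                for( int i = 0; i < n; ++i )
                    sum += a[ i ];
                return sum;
            }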

  3. Lock Performance « Parallel Rendering Says:

    […] Spinlocks are faster than ‘real’ locks. I’ve blogged about this before. Since they consume CPU time while spinning they should only be held for a very short time, i.e., […]
