Equalizer uses reference-counted pointers in some places, and until now the reference count was protected by a pthread_mutex (or the equivalent on Windows). The simple reason is that when I implemented it, I didn’t want to spend the time on something better.
This week I’ve found the time to replace the lock-protected counter with an atomic variable, with surprisingly good results. The ‘frame throughput’ when just rendering a quad in eqPly increased from ~200 FPS to ~750 FPS on my MacBook Pro! As soon as one starts rendering something more complex (like the rockerArm), the speedup is much less noticeable. I haven’t done any tests yet on a multi-GPU system where lock contention should be more apparent.
So here is some non-news for all you parallel programming guys: Locks are bad for performance!
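For readers who want to see the shape of the change, here is a minimal sketch of a reference-counted base class whose counter is an atomic variable instead of a mutex-protected integer. It is not Equalizer’s actual code: the class name Referenced and its methods are hypothetical, and it assumes std::atomic from C++11, which was not yet available when this was written.

```cpp
#include <atomic>

// Hypothetical base class for illustration only, not Equalizer's API.
class Referenced
{
public:
    Referenced() : _refCount( 0 ) {}
    virtual ~Referenced() {}

    // Taking a reference needs no ordering guarantees, only atomicity.
    void ref() { _refCount.fetch_add( 1, std::memory_order_relaxed ); }

    // Drops one reference. Returns true if this call released the last
    // reference and therefore deleted the object.
    bool unref()
    {
        // fetch_sub returns the value held *before* the decrement.
        if( _refCount.fetch_sub( 1, std::memory_order_acq_rel ) == 1 )
        {
            delete this;
            return true;
        }
        return false;
    }

    int getRefCount() const
        { return _refCount.load( std::memory_order_relaxed ); }

private:
    std::atomic< int > _refCount;
};
```

Objects of such a class have to live on the heap, since the last unref() deletes them. The point of the sketch is that ref() and unref() each boil down to a single atomic read-modify-write instead of a lock, an update and an unlock.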
4. July 2008 at 14:55 |
ORLY??
Congrats on the speed-up, though.
29. October 2009 at 17:13 |
I did similar experiments during my master’s thesis, and the results were also good,
but I read somewhere, I don’t remember exactly where, that atomic operations are not scalable!
Do you have any idea about that?
19. November 2009 at 11:53 |
It’s a bit broad as a comment. Do you have a citation for your statement?
In fact, atomic ops as used here provide much better scalability than locks.
19. November 2009 at 14:33 |
Of course atomic operations are better than locks.
What I mean by not scalable is that they don’t scale well compared to other techniques (like non-blocking algorithms). The problem with non-blocking algorithms is that they are difficult to write!
By the way, I’m also working on atomic operations in GCC, and they seem to be the best solution for now.
Oncea talks about that in his SPAA 2009 paper: http://portal.acm.org/citation.cfm?id=1583991.1584050
19. November 2009 at 14:42 |
By the way, we are doing a lot of work on that at INRIA/Alchemy (https://alchemy.futurs.inria.fr/). I cannot give details about the work, but it seems to be very promising.
As soon as we get “real” good results (better than atomic operations), I’ll let you know.
19. November 2009 at 14:52 |
I’m by no means a specialist in non-blocking algorithms, but in my understanding a lot of non-blocking data structures rely on atomics for their implementation.
Re. your research – I’m looking forward to seeing the results!
11. July 2010 at 10:12
So, I have a paper that shows that some code transformations can be very useful and can give good performance improvements compared to atomic instructions. I studied some benchmarks from SPEC2000 (the equake benchmark) and found that, on an 8-core machine, atomic instructions give a speedup of 3.18, compared to a speedup of 5.13 using loop transformations such as privatization/reduction (a combination of loop transformations that we introduced, based on classical privatization and reduction).
If you are interested, I can send you a copy of the paper. It includes a summary comparing the speedups obtained with locks, STM, hardware atomic instructions and code transformations.
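To illustrate the general idea behind privatization and reduction, here is a minimal sketch of the classical form only, not the combined transformation from the paper mentioned above. All names are hypothetical. Instead of every worker updating one shared accumulator with an atomic instruction on each iteration, each worker sums its slice into a private variable, and the private results are combined once at the end.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Privatization: worker `worker` of `nWorkers` sums its own slice of `data`
// into a local variable and writes a single result to its private slot.
// No shared variable is touched inside the loop, so no atomics are needed.
// (Hypothetical helper; each call could be run by a separate thread.)
void accumulatePrivate( const std::vector< double >& data,
                        const std::size_t worker, const std::size_t nWorkers,
                        std::vector< double >& partial )
{
    const std::size_t begin = data.size() * worker / nWorkers;
    const std::size_t end   = data.size() * ( worker + 1 ) / nWorkers;

    double local = 0.0;
    for( std::size_t i = begin; i < end; ++i )
        local += data[ i ];

    partial[ worker ] = local;
}

// Reduction: combine the private results once all workers have finished.
double reduce( const std::vector< double >& partial )
{
    return std::accumulate( partial.begin(), partial.end(), 0.0 );
}
```

The cost of synchronisation then no longer grows with the number of loop iterations, only with the number of workers, which is one plausible reason such transformations can scale better than per-iteration atomic updates.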
5. July 2011 at 10:39 |
[...] Spinlocks are faster than ‘real’ locks. I’ve blogged about this before. Since they consume CPU time while spinning they should only be held for a very short time, i.e., [...]