Let me make one thing clear: I'm not against OpenMP! I think it's a wonderful tool and I have used it in the past for several reasons, but I simply didn't like how it was used in the Mantiuk06 implementation. In fact, most of the OpenMP directives in the code were just basic stuff, thrown in to see what would happen. I'm not saying it was useless: OpenMP does give a reasonable speedup over a plain implementation, but it was the wrong technology to answer the demand for better performance. Why? Because OpenMP was applied to loops over vectors, and that is exactly the kind of problem SSE instructions are tailored for, so they can achieve a higher speedup. To test this idea, I took a piece of code from Mantiuk06 and implemented it in a few different ways:
- Plain implementation: let the compiler achieve the best it can!
- OpenMP: more or less the plain implementation, but with an omp parallel for (exactly as it was found in Mantiuk06);
- SSE: vectorized implementation using Intel SSE Intrinsics;
- Apple Accelerate: basically an SSE implementation made by Apple for Apple Hardware.
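To give an idea of what the first three variants look like, here is a minimal sketch. It is not the actual Mantiuk06 code: the element-wise multiply and the function names (`mul_plain`, `mul_openmp`, `mul_sse`) are hypothetical stand-ins I chose for illustration, and the SSE version assumes unaligned data, so it uses unaligned loads and stores.

```c
#include <stddef.h>
#include <xmmintrin.h> /* SSE intrinsics */

/* Plain version: a simple loop, left to the compiler to optimize. */
void mul_plain(const float *a, const float *b, float *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] * b[i];
}

/* OpenMP version: the same loop with an "omp parallel for" on top.
 * Without OpenMP enabled (e.g. no -fopenmp) the pragma is ignored
 * and this behaves like the plain version. */
void mul_openmp(const float *a, const float *b, float *out, size_t n) {
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++)
        out[i] = a[i] * b[i];
}

/* SSE version: processes four floats per iteration, then handles
 * the remaining 0-3 elements with a scalar loop. */
void mul_sse(const float *a, const float *b, float *out, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i); /* unaligned load of 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_mul_ps(va, vb));
    }
    for (; i < n; i++) /* scalar tail */
        out[i] = a[i] * b[i];
}
```

For this particular operation, the Accelerate counterpart would be a single call to `vDSP_vmul`, which is why I describe that variant as an SSE implementation written by Apple.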
The results confirmed my intuition: the SSE implementation was the fastest in 3 different scenarios: with only the simulation running; with one other highly demanding process competing for the processor; and with two other highly demanding processes competing with the simulation. This is why I disabled OpenMP in Luminance 2.0.1 (which also removed some annoying faults during Mantiuk06) and why the next version will use SSE. Graphs can be found here.