OpenMP vs. SSE

Let me say one thing first: I’m not against OpenMP! I think it’s a wonderful tool and I have used it in the past for several reasons, but I simply didn’t like how it was used in the Mantiuk06 implementation. In fact, most of the OpenMP directives in the code were just basic stuff, thrown in just to see what would happen. I’m not saying it was useless: there is a reasonable speedup from OpenMP compared to a plain implementation, but it was the wrong technology to answer the demand for better performance. Why? Because OpenMP was used on vectors, where SSE instructions are a tailored solution for this kind of problem and can achieve a higher speedup. So I decided to prove this idea to myself: I took a piece of code from Mantiuk06 and implemented it in a few different ways (a sketch of the variants follows the list):

  • Plain implementation: let the compiler do the best it can;
  • OpenMP: more or less the plain implementation, but with an omp parallel for (exactly as it was found in Mantiuk06);
  • SSE: vectorized implementation using Intel SSE intrinsics;
  • Apple Accelerate: essentially a vectorized implementation provided by Apple for Apple hardware.
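
Here is a minimal sketch of the first three variants, assuming the benchmarked kernel is a simple element-wise multiply-add over float buffers; the real snippet I took from Mantiuk06 is not reproduced here, and the function names are mine. The Accelerate variant, not shown, would call the equivalent vDSP routine.

    // Assumed kernel: out[i] += k * in[i] over n floats.
    #include <xmmintrin.h>   // Intel SSE intrinsics
    #include <cstddef>

    // 1) Plain implementation: a straight loop, letting the compiler do its best.
    void scale_add_plain(float* out, const float* in, float k, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i)
            out[i] += k * in[i];
    }

    // 2) OpenMP: the same loop, simply split across cores with a parallel for,
    //    much like the directives found in Mantiuk06.
    void scale_add_omp(float* out, const float* in, float k, std::size_t n)
    {
        #pragma omp parallel for
        for (long i = 0; i < static_cast<long>(n); ++i)
            out[i] += k * in[i];
    }

    // 3) SSE: four floats per iteration; for brevity this assumes 16-byte
    //    aligned buffers and n divisible by 4 (real code handles the tail).
    void scale_add_sse(float* out, const float* in, float k, std::size_t n)
    {
        const __m128 vk = _mm_set1_ps(k);
        for (std::size_t i = 0; i < n; i += 4)
        {
            const __m128 vin  = _mm_load_ps(in + i);
            const __m128 vout = _mm_load_ps(out + i);
            _mm_store_ps(out + i, _mm_add_ps(vout, _mm_mul_ps(vk, vin)));
        }
    }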

The results confirmed my intuition: the SSE implementation was the fastest in three different scenarios: only the simulation running; one other demanding process competing for the processor; and two other demanding processes competing with the simulation. This is why I disabled OpenMP in Luminance 2.0.1 (which also removed some annoying faults during Mantiuk06) and why the next version will use SSE. Graphs can be found here.

2 Responses to “OpenMP vs. SSE”


  • I don’t think OpenMP should be used instead of SSE. SSE vectorizes calculations running on one core; OpenMP parallelizes across multiple cores.

    I currently have 6-core (1 socket) and 8-core (2 socket) machines, and even higher core counts are coming. I think you will need proper parallelism to make use of them, though it’s good to use SSE to get a speedup too.

    • You’re right. My test didn’t take into account that multiple cores (4 or more) are so common now, and they will really help performance. I’m considering whether reintroducing OpenMP is a good idea or whether it is better to implement my multi-threaded functions explicitly in terms of Pthreads. I will let you know, but for now I still prefer to keep OpenMP disabled in order to better understand where unexpected behaviours come from. (A sketch combining OpenMP across cores with SSE within each core follows below.)
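
To illustrate the point raised in the comments: the two techniques are not mutually exclusive. Here is a minimal sketch, under the same assumptions as the earlier snippet (aligned buffers, length divisible by 4, hypothetical function name), where OpenMP distributes chunks of the loop across cores while each iteration still processes four floats with SSE.

    #include <xmmintrin.h>
    #include <cstddef>

    // Both levels of parallelism at once: threads across cores (OpenMP),
    // four-wide vectors within each core (SSE).
    void scale_add_omp_sse(float* out, const float* in, float k, std::size_t n)
    {
        const __m128 vk = _mm_set1_ps(k);

        #pragma omp parallel for
        for (long i = 0; i < static_cast<long>(n); i += 4)
        {
            const __m128 vin  = _mm_load_ps(in + i);
            const __m128 vout = _mm_load_ps(out + i);
            _mm_store_ps(out + i, _mm_add_ps(vout, _mm_mul_ps(vk, vin)));
        }
    }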
