Sage: Do you understand this SSE and SIMD stuff? Can you code it?
I've written SSE variants of a few procedures.
For example, the dot product:
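PROCEDURE vecDot(VAR c: REAL; CONST a: vector_t; CONST b: vector_t);
BEGIN
  c := a.x*b.x + a.y*b.y + a.z*b.z
END vecDot;
(* Note: the loads read 16 bytes and the sum includes the fourth lane, so this
   matches vecDot only if vector_t occupies four REALs and the fourth is 0. *)
PROCEDURE vecDotSSE(VAR c: REAL; CONST a: vector_t; CONST b: vector_t);
CODE {SYSTEM.i386, SYSTEM.SSE}
  MOV EBX, c[EBP]       ; EBX := address of c
  MOV ECX, a[EBP]       ; ECX := address of a
  MOV EDX, b[EBP]       ; EDX := address of b
  MOVUPS XMM0, [ECX]    ; XMM0 := a (unaligned 16-byte load)
  MOVUPS XMM1, [EDX]    ; XMM1 := b (unaligned 16-byte load)
  MULPS XMM0, XMM1      ; XMM0 := lane-wise products p0..p3
  MOVHLPS XMM1, XMM0    ; low half of XMM1 := (p2, p3)
  ADDPS XMM0, XMM1      ; lane 0 = p0+p2, lane 1 = p1+p3
  MOVAPS XMM1, XMM0     ; keep a copy of the partial sums
  SHUFPS XMM0, XMM0, 1  ; lane 0 := lane 1 (p1+p3)
  ADDPS XMM0, XMM1      ; lane 0 = p0+p1+p2+p3
  MOVSS [EBX], XMM0     ; c := lane 0
END vecDotSSE;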
And the cross product:
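PROCEDURE vecCross(VAR c: vector_t; CONST a: vector_t; CONST b: vector_t);
BEGIN
  c.x := a.y * b.z - a.z * b.y;
  c.y := a.z * b.x - a.x * b.z;
  c.z := a.x * b.y - a.y * b.x
END vecCross;
(* Note: loads and the final store move 16 bytes, so vector_t is assumed to
   occupy four REALs; the fourth lane of the result is the difference of two
   equal products. *)
PROCEDURE vecCrossSSE(VAR c: vector_t; CONST a: vector_t; CONST b: vector_t);
CODE {SYSTEM.i386, SYSTEM.SSE}
  MOV EBX, c[EBP]         ; EBX := address of c
  MOV ECX, a[EBP]         ; ECX := address of a
  MOV EDX, b[EBP]         ; EDX := address of b
  MOVUPS XMM0, [ECX]      ; XMM0 := a (unaligned 16-byte load)
  MOVAPS XMM2, XMM0       ; XMM2 := copy of a
  MOVUPS XMM1, [EDX]      ; XMM1 := b (unaligned 16-byte load)
  MOVAPS XMM3, XMM1       ; XMM3 := copy of b
  SHUFPS XMM0, XMM0, 201  ; XMM0 := (a.y, a.z, a.x, lane 3)
  SHUFPS XMM1, XMM1, 210  ; XMM1 := (b.z, b.x, b.y, lane 3)
  SHUFPS XMM2, XMM2, 210  ; XMM2 := (a.z, a.x, a.y, lane 3)
  SHUFPS XMM3, XMM3, 201  ; XMM3 := (b.y, b.z, b.x, lane 3)
  MULPS XMM0, XMM1        ; (a.y*b.z, a.z*b.x, a.x*b.y, ...)
  MULPS XMM2, XMM3        ; (a.z*b.y, a.x*b.z, a.y*b.x, ...)
  SUBPS XMM0, XMM2        ; lanes 0..2 := cross product
  MOVUPS [EBX], XMM0      ; store 16 bytes to c
END vecCrossSSE;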
However, because the data is not aligned, the XMM registers have to be loaded with the slower MOVUPS instruction, and no performance boost was achieved.
So, to obtain a significant performance boost:
1. Much larger parts of the algorithms should be implemented in SSE.
2. The data should be aligned. In that case the much faster MOVAPS instruction can be used.
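
Both points can be sketched in C with SSE intrinsics instead of Oberon inline assembly. This is only an illustration under assumptions: the vec4 type (four floats, 16-byte aligned, with w kept at 0) and the vec_dot_batch function are made-up names, not part of the code above, and no timing is claimed for it.

```c
#include <stddef.h>     /* size_t */
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical vector type: 16 bytes, 16-byte aligned.
   w is padding and must be 0 so it does not disturb the dot product. */
typedef struct { _Alignas(16) float x; float y, z, w; } vec4;

/* Computes out[i] = dot(a[i], b[i]) for i = 0 .. n-1 in a single call,
   so the loop itself stays inside the SSE code (point 1).
   Because vec4 is 16-byte aligned, the aligned load _mm_load_ps
   can be used instead of the unaligned _mm_loadu_ps (point 2). */
static void vec_dot_batch(float *out, const vec4 *a, const vec4 *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* lane-wise products p0..p3 */
        __m128 p = _mm_mul_ps(_mm_load_ps(&a[i].x), _mm_load_ps(&b[i].x));
        /* h: lane 0 = p0+p2, lane 1 = p1+p3 */
        __m128 h = _mm_add_ps(p, _mm_movehl_ps(p, p));
        /* s: lane 0 = p0+p1+p2+p3 */
        __m128 s = _mm_add_ss(h, _mm_shuffle_ps(h, h, 1));
        _mm_store_ss(&out[i], s);
    }
}
```

The caller passes whole arrays, for example vec_dot_batch(out, a, b, n), so the call overhead is paid once per batch rather than once per vector pair.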
PS.
Optimization is the final stage of development; first of all, working code has to exist.