Sage: Do you understand this SSE and SIMD stuff? Can you code it?

I've written SSE variants of some procedures.

For example, dot product:

PROCEDURE vecDot(VAR c: REAL; CONST a: vector_t; CONST b: vector_t);
BEGIN
    c := a.x*b.x + a.y*b.y + a.z*b.z
END vecDot;

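PROCEDURE vecDotSSE(VAR c: REAL; CONST a: vector_t; CONST b: vector_t);
CODE {SYSTEM.i386, SYSTEM.SSE}
    MOV EBX, c[EBP]       ; EBX = address of c
    MOV ECX, a[EBP]       ; ECX = address of a
    MOV EDX, b[EBP]       ; EDX = address of b
    MOVUPS XMM0, [ECX]    ; XMM0 = a (unaligned load of 4 REALs)
    MOVUPS XMM1, [EDX]    ; XMM1 = b
    MULPS XMM0, XMM1      ; XMM0 = lane-wise products p0..p3
    MOVHLPS XMM1, XMM0    ; low half of XMM1 = high half of XMM0 (p2, p3)
    ADDPS XMM0, XMM1      ; lane 0 = p0 + p2, lane 1 = p1 + p3
    MOVAPS XMM1, XMM0     ; keep a copy of the partial sums
    SHUFPS XMM0, XMM0, 1  ; lane 0 = old lane 1
    ADDPS XMM0, XMM1      ; lane 0 = p0 + p1 + p2 + p3
    MOVSS [EBX], XMM0     ; store lane 0 in c
END vecDotSSE;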
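For readers who do not use the Oberon inline assembler, here is a rough sketch of the same instruction sequence in C with SSE intrinsics. The `vec4` type and the name `vec_dot_sse` are made up for illustration; the post does not show how `vector_t` is declared, so a layout of four 32-bit floats is an assumption. Note that the sequence adds up all four lanes, so the result only matches the plain `vecDot` when the fourth component is zero.

```c
#include <xmmintrin.h>

/* Hypothetical layout: the original vector_t is not shown in the post. */
typedef struct { float x, y, z, w; } vec4;

static float vec_dot_sse(const vec4 *a, const vec4 *b)
{
    __m128 va = _mm_loadu_ps(&a->x);      /* MOVUPS XMM0, [ECX] */
    __m128 vb = _mm_loadu_ps(&b->x);      /* MOVUPS XMM1, [EDX] */
    __m128 p  = _mm_mul_ps(va, vb);       /* MULPS: lane-wise products p0..p3 */
    __m128 hi = _mm_movehl_ps(vb, p);     /* MOVHLPS: low half = (p2, p3) */
    __m128 s  = _mm_add_ps(p, hi);        /* lane 0 = p0 + p2, lane 1 = p1 + p3 */
    __m128 s1 = _mm_shuffle_ps(s, s, 1);  /* SHUFPS ..., 1: lane 0 = old lane 1 */
    __m128 r  = _mm_add_ps(s, s1);        /* lane 0 = p0 + p1 + p2 + p3 */
    return _mm_cvtss_f32(r);              /* MOVSS: take lane 0 */
}
```

Calling `vec_dot_sse(&a, &b)` with a = (1, 2, 3, 0) and b = (4, 5, 6, 0) gives 32; setting both `w` fields to 1 gives 33, which shows the fourth lane is included in the sum.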

And the cross product:

PROCEDURE vecCross(VAR c: vector_t; CONST a: vector_t; CONST b: vector_t);
BEGIN
    c.x := a.y * b.z - a.z * b.y;
    c.y := a.z * b.x - a.x * b.z;
    c.z := a.x * b.y - a.y * b.x
END vecCross;

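PROCEDURE vecCrossSSE(VAR c: vector_t; CONST a: vector_t; CONST b: vector_t);
CODE {SYSTEM.i386, SYSTEM.SSE}
    MOV EBX, c[EBP]         ; EBX = address of c
    MOV ECX, a[EBP]         ; ECX = address of a
    MOV EDX, b[EBP]         ; EDX = address of b
    MOVUPS XMM0, [ECX]      ; XMM0 = a
    MOVAPS XMM2, XMM0       ; XMM2 = copy of a
    MOVUPS XMM1, [EDX]      ; XMM1 = b
    MOVAPS XMM3, XMM1       ; XMM3 = copy of b
    SHUFPS XMM0, XMM0, 201  ; XMM0 = (a.y, a.z, a.x)
    SHUFPS XMM1, XMM1, 210  ; XMM1 = (b.z, b.x, b.y)
    SHUFPS XMM2, XMM2, 210  ; XMM2 = (a.z, a.x, a.y)
    SHUFPS XMM3, XMM3, 201  ; XMM3 = (b.y, b.z, b.x)
    MULPS XMM0, XMM1        ; (a.y*b.z, a.z*b.x, a.x*b.y)
    MULPS XMM2, XMM3        ; (a.z*b.y, a.x*b.z, a.y*b.x)
    SUBPS XMM0, XMM2        ; XMM0 = a x b
    MOVUPS [EBX], XMM0      ; store 4 REALs into c
END vecCrossSSE;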
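The shuffle constants 201 and 210 are the least obvious part of the cross product, so here is a sketch of the same routine in C with SSE intrinsics, where they can be spelled out with the `_MM_SHUFFLE` macro. The `vec4` type and the name `vec_cross_sse` are hypothetical; as before, a `vector_t` of four 32-bit floats is an assumption, since the routine loads and stores 16 bytes.

```c
#include <xmmintrin.h>

/* Hypothetical layout: the original vector_t is not shown in the post. */
typedef struct { float x, y, z, w; } vec4;

/* _MM_SHUFFLE(3, 0, 2, 1) == 201: result lanes = source lanes (1, 2, 0, 3) = (y, z, x, w) */
#define SHUF_YZXW _MM_SHUFFLE(3, 0, 2, 1)
/* _MM_SHUFFLE(3, 1, 0, 2) == 210: result lanes = source lanes (2, 0, 1, 3) = (z, x, y, w) */
#define SHUF_ZXYW _MM_SHUFFLE(3, 1, 0, 2)

static void vec_cross_sse(vec4 *c, const vec4 *a, const vec4 *b)
{
    __m128 va = _mm_loadu_ps(&a->x);                   /* MOVUPS */
    __m128 vb = _mm_loadu_ps(&b->x);                   /* MOVUPS */
    __m128 a_yzx = _mm_shuffle_ps(va, va, SHUF_YZXW);  /* SHUFPS ..., 201 */
    __m128 b_zxy = _mm_shuffle_ps(vb, vb, SHUF_ZXYW);  /* SHUFPS ..., 210 */
    __m128 a_zxy = _mm_shuffle_ps(va, va, SHUF_ZXYW);  /* SHUFPS ..., 210 */
    __m128 b_yzx = _mm_shuffle_ps(vb, vb, SHUF_YZXW);  /* SHUFPS ..., 201 */
    /* (a.y*b.z, a.z*b.x, a.x*b.y) - (a.z*b.y, a.x*b.z, a.y*b.x) */
    __m128 r = _mm_sub_ps(_mm_mul_ps(a_yzx, b_zxy), _mm_mul_ps(a_zxy, b_yzx));
    _mm_storeu_ps(&c->x, r);                           /* MOVUPS: writes 4 floats */
}
```

The fourth lane of the result is a.w*b.w - a.w*b.w, so it comes out as zero for finite inputs. If the real `vector_t` holds only three floats, the 16-byte load and store would touch memory past the end of the record.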

But the data is not aligned, so the XMM registers have to be loaded with the slow MOVUPS instruction, and as a result I did not get any performance boost.

So, to get a significant performance boost:

1. Much larger parts of the algorithms should be implemented in SSE.

2. The data should be aligned. In that case we can use the much faster MOVAPS instruction.
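To make the two points concrete, here is a minimal sketch in C with SSE intrinsics, not a measured result. The type `vec4a` and the routine `vec_scale_add` are invented for illustration: the record is forced to 16-byte alignment so that the aligned load and store (which compile to MOVAPS) are legal, and the routine handles a whole array with a multiply and an add per element instead of one tiny operation per call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>

/* Hypothetical vector type; _Alignas (C11) forces 16-byte alignment. */
typedef struct { _Alignas(16) float v[4]; } vec4a;

/* out[i] = a[i] * s + b[i] for n vectors. All arrays must be 16-byte aligned. */
static void vec_scale_add(vec4a *out, const vec4a *a, const vec4a *b,
                          float s, size_t n)
{
    __m128 vs = _mm_set1_ps(s);                /* scale factor in all 4 lanes */
    for (size_t i = 0; i < n; i++) {
        /* An aligned load faults on a misaligned address, so check in debug builds. */
        assert(((uintptr_t)&a[i] & 15) == 0);
        __m128 va = _mm_load_ps(a[i].v);       /* aligned load (MOVAPS) */
        __m128 vb = _mm_load_ps(b[i].v);       /* aligned load (MOVAPS) */
        __m128 r  = _mm_add_ps(_mm_mul_ps(va, vs), vb);
        _mm_store_ps(out[i].v, r);             /* aligned store (MOVAPS) */
    }
}
```

The intermediate product stays in a register between the multiply and the add, and the call overhead is paid once per array rather than once per vector. Whether this actually pays off depends on the CPU and the surrounding code, so it needs to be measured.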

PS.

Optimization is the final part of development. First there should at least be working code.