How to do a dot product?


Solomon
11-29-2007, 07:55 PM
Folks,

At one point in my board-side code I need to take the dot product of two long arrays of doubles. Each may be of the order of 10,000 elements long, and both are in mono memory.

In C, I might write:

sum = 0.;
for (i = 0; i < 10000; i++) {
    sum += a[i] * b[i];
}

In Cn this appears to be very slow. :-(
Is there a better way?
And to make it fast, do I have to use PIO to push the numbers to the 96 PEs and sum them there?
If so, how do I sum the 96 partial sums that end up spread across the PEs?

Solomon.

clear-cut
12-03-2007, 08:47 PM
Ideally you always want to try hard to avoid situations where you end up doing a very small number of flops for each byte moved. For instance, in the case you mention with the long vectors, if there is already a point in your code where chunks of both a and b are in poly memory, then take the opportunity to accumulate part of the dot product at that time.

If you have no such opportunity, then yes, it would be fastest to use PIO (double-buffering so you can overlap with compute) and form partial dot products on each PE. With the 3.0 release, at the end you use the new cs_reduce_sum() function to get the final mono result. For earlier releases you'll need to code up the reduction yourself, either by using the swazzle mechanism (the fastest way) or by using PIO to move the 96 partial results to mono and summing them there.
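
As a rough illustration of the shape of that approach, here is a minimal sketch in plain C (not Cn, and not using any ClearSpeed API). It only models the arithmetic: each of 96 "lanes" stands in for a PE and forms a partial dot product over its own contiguous chunk, then the partials are combined pairwise. The names NUM_LANES, partial_dots, tree_sum and dot_product are made up for this example; on real hardware the chunks would arrive via PIO and the final step would be cs_reduce_sum() or a hand-coded reduction as described above.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_LANES 96 /* hypothetical stand-in for the 96 PEs */

/* Each lane accumulates the products for its own contiguous chunk,
   the way each PE would work on the slice it was sent. */
void partial_dots(const double *a, const double *b, size_t n,
                  double partial[NUM_LANES])
{
    size_t chunk = (n + NUM_LANES - 1) / NUM_LANES; /* ceil(n / lanes) */

    for (size_t lane = 0; lane < NUM_LANES; lane++) {
        size_t start = lane * chunk;
        size_t end = (start + chunk < n) ? start + chunk : n;
        double acc = 0.0;

        for (size_t i = start; i < end; i++) /* empty if start >= n */
            acc += a[i] * b[i];
        partial[lane] = acc;
    }
}

/* Pairwise (tree) reduction of the partials, done in place.
   This is an assumed substitute for whatever reduction the platform
   provides, not a description of how that reduction is implemented. */
double tree_sum(double *partial, size_t count)
{
    assert(count > 0);
    while (count > 1) {
        size_t half = (count + 1) / 2;
        for (size_t i = 0; i + half < count; i++)
            partial[i] += partial[i + half];
        count = half;
    }
    return partial[0];
}

/* Full dot product: per-lane partials followed by the reduction. */
double dot_product(const double *a, const double *b, size_t n)
{
    double partial[NUM_LANES];

    partial_dots(a, b, n, partial);
    return tree_sum(partial, NUM_LANES);
}
```

Note that the partial sums are added in a different order than in a simple serial loop, so the result can differ from the serial one in the last few bits.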

Solomon
12-03-2007, 09:24 PM
Thanks for that.

In my case the bulk of the calculations are in poly and do execute at a very high Gflop rate. But there is one point in the algorithm where I need this dot product (part of the normalisation in a Conjugate Gradient solver).

At first look it doesn't seem possible to do the partial computation of the dot product while the source numbers are in poly, since they come from separate calculations (and indeed are not necessarily computed on the same PEs).

I appreciate that a dot product is a memory-bandwidth-bound problem; the compute itself should be almost free.
In that case, why is the mono-only version so slow? I know that the mono FPU isn't pipelined, but is that enough to explain why it is better to stream the data through poly?
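
To put a number on the memory-bound point: each element pair costs one multiply and one add, and requires reading one double from each array. The small plain-C helper below (the name dot_flops_per_byte is made up for this example) just does that arithmetic. Where a double is 8 bytes, it comes out at 0.125 flops per byte, whatever the vector length.

```c
#include <assert.h>
#include <stddef.h>

/* Flops per byte read for an n-element dot product of doubles. */
double dot_flops_per_byte(size_t n)
{
    assert(n > 0);
    double flops = 2.0 * (double)n;                          /* multiply + add per element */
    double bytes = 2.0 * (double)n * (double)sizeof(double); /* read a[i] and b[i] */
    return flops / bytes;
}
```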

clear-cut
12-03-2007, 09:34 PM
Right. Mono floating point is not pipelined, and the mono processor is an in-order, single-issue design running at a fairly slow clock rate. (You might also note from the CSX600 instruction set manual that latency to dcache is 6 cycles.) Its primary purpose is to drive the PEs. A large operation such as yours is much better done on the PEs, where 96*"fairly slow" == FAST.