Solomon
11-29-2007, 08:39 PM
Folks,
In the inner loop of my application, I need to transfer 4 consecutive doubles, i.e. 32 bytes, from mono to poly.
I keep these 32-byte aligned in mono memory to improve the transfer rate, and I use async_memcpy() rather than memcpy(), even though I can't do any further calculations until these 4 numbers are in poly.
Performance with async_memcpy() is good - around 900 cycles.
But when I worked out the mono memory bandwidth, it came out at only 716.8 MBytes/s
(900 cycles at 210 MHz to transfer 96 times 32 bytes).
This is much lower than the 3.2 GB/s that the mono DRAM is rated at.
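For anyone who wants to check the arithmetic, here is a minimal C sketch of the calculation. The function name and parameters are made up for illustration, and it assumes 1 MByte = 10^6 bytes and that the 900 cycles cover all 96 blocks of 32 bytes, as in the figures above.

```c
#include <assert.h>

/* Effective bandwidth in MBytes/s (1 MByte = 10^6 bytes, an assumption)
   when num_blocks blocks of bytes_per_block bytes are moved in `cycles`
   clock cycles on a clock running at clock_hz. */
double effective_mbytes_per_s(double cycles, double clock_hz,
                              unsigned num_blocks, unsigned bytes_per_block)
{
    assert(cycles > 0.0 && clock_hz > 0.0);

    double seconds = cycles / clock_hz;                       /* elapsed time */
    double total_bytes = (double)num_blocks * (double)bytes_per_block;

    return total_bytes / seconds / 1.0e6;
}
```

Calling effective_mbytes_per_s(900.0, 210.0e6, 96, 32) reproduces the 716.8 MBytes/s figure: 3072 bytes in about 4.29 microseconds.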
For large transfers, I do get much better performance - in the region of 2.4 GB/s
I suspect that async_memcpy has a high set up cost.
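As a rough check on that suspicion, the sketch below estimates the fixed overhead under a simple model that I am assuming, not something taken from the documentation: measured cycles = set-up cycles + bytes / sustained rate, with the 2.4 GB/s large-transfer figure (1 GB = 10^9 bytes) used as the sustained rate. The function names are hypothetical.

```c
#include <assert.h>

/* Cycles spent purely streaming `bytes` at sustained_bytes_per_s
   on a clock running at clock_hz. */
double streaming_cycles(double bytes, double sustained_bytes_per_s,
                        double clock_hz)
{
    assert(sustained_bytes_per_s > 0.0 && clock_hz > 0.0);
    return bytes * clock_hz / sustained_bytes_per_s;
}

/* Fixed per-call overhead implied by the assumed linear model
   measured_cycles = setup + streaming_cycles. */
double implied_setup_cycles(double measured_cycles, double bytes,
                            double sustained_bytes_per_s, double clock_hz)
{
    return measured_cycles
           - streaming_cycles(bytes, sustained_bytes_per_s, clock_hz);
}
```

With 900 measured cycles, 96 x 32 = 3072 bytes, 2.4e9 bytes/s and 210e6 Hz, the streaming part is about 269 cycles, which leaves roughly 631 cycles (about 70% of the total) as set-up cost under this model.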
I discovered that there are also lower level calls you can make to transfer data from mono to poly - documented in the ClearSpeed Instruction set reference manual.
I have wrapped these up as inline assembler as 2 routines:
cs_scatter_m2p ()
and
cs_gather_p2m ()
Both take as arguments the source and destination array locations, the size (32 in my case), and a semaphore to use. One caveat: the location of the mono array (the source for cs_scatter_m2p()) is stored in a small poly array of 3 ints that also holds 2 bitmasks, both of which are set to all 1 bits.
Using these, the time taken to transfer 32 bytes falls to only c. 400 cycles - over 2x faster :-)
If you are interested, I will post my code to this list.
Solomon