View Full Version : How to control data on PEs?
mikeyuan
08-13-2008, 08:33 PM
I have a database and its records are on PEs, then I want to compare or compute one of the records in one PE to all the other records on the other PEs efficiently. Could you please give me an efficient solution?
Dangermouse
08-14-2008, 09:08 AM
You could do this in a number of ways and really it's best to experiment since it depends heavily on what you are actually doing and the size of the objects you are using. Some ideas:
1) Implicit broadcast
mono struct mytype ref;
poly struct mytype test;
// load values into 'test' e.g. using async_memcpym2p
// load value into 'ref'
if (test.value == ref.value)
// do whatever
2) Explicit broadcast using consolidated mode:
poly struct mytype ref;
poly struct mytype test;
// load values into 'test' e.g. using async_memcpym2p
// load value into 'ref' e.g. using async_memcpym2p_consolidated
if (test.value == ref.value)
// do whatever
3) Swazzle
Probably more appropriate if you are comparing a set of values so you can send a different ref value to each PE test it, and then swazzle the ref values around in a ring.
poly struct mytype ref;
poly struct mytype test;
// load values into 'test' e.g. using async_memcpym2p
// load values into 'ref' e.g. using async_memcpym2p
if (test.value == ref.value)
// do whatever
swazzle
if (test.value == ref.value) // this is a new ref value from the neighbouring PE
mikeyuan
08-14-2008, 07:15 PM
Thank you very much for your help. My problem is: I have 14,000 plane records in worst case, and will distribute them into 192 PEs, so about 73 records per PE, then one of the records in PE1 will compare / comupte against all other records in both PE1 and other PEs, how to do it efficiently?
And I still don't know how to use your methods, especially the second consolidated and the third swazzle one, could you please give me some source codes or more detailed examples?
Thanks,
Mike
mikeyuan
08-14-2008, 09:41 PM
And the problem is there are two databases, one is R{x,y} which have 4000 plane records, the other is T{x,y} that have 14,000 plane records including the 4000 planes of report, now I want to compare/ compute each record from R with all other records in T except itself, and the record is excluded next time in this round, e.g, the first record in R is compared with 13,999 records in T, second one is compared with 13,998 in T. Could you please tell me how to do it efficiently?
Mr Potato Head
08-15-2008, 08:07 AM
Mike,
Can you give us an idea of how big each record is and what fields from the struct you use for the calculations? Depending on how much of the data you require for your calculations the answer to what approach will likely work best may well be different.
mikeyuan
08-15-2008, 01:57 PM
The records are not big, T{float x, float y, float j, int q}, R{float x, float y, floatrj, int k}, here (x,y) is the position of aircraft on the records of T or R, then we develop boxes (x+-j, y+-j) and (x+-k, y+-k), each box in R is compared to all boxes in T to see whether they intersect, see my post before. Is the information enough?
Dangermouse
08-15-2008, 03:40 PM
Hi Mike,
What have you tried so far, and have you thought about how you could do this using the swazzle path?
One approach could be along the lines of:
#define N 50
poly struct mystructT t;
poly struct mystructR r[N];
memcpym2p(&r, &r_source_mono[penum * N], N * sizeof(struct mystructR);
memcpym2p(&t, &t_source_mono[penum], sizeof(struct mystructT);
for (i = 0 ; i < num_pes ; i++)
{
for (j = 0 ; j < N ; j++)
{
// compare t with r[j]
}
// do you need all the elements of t in each step, or could you compute
// the box once and swazzle the box instead?
t.x = circular_swazzle_down(t.x);
t.y = circular_swazzle_down(t.y);
t.j = circular_swazzle_down(t.j);
t.q = circular_swazzle_down(t.q);
}
Clearly you would need to put this in a loop to iterate over all the Ts. This should be double buffered (i.e. use async_memcpy instead of memcpy. You will also need to think about how to avoid comparing invalid R records (you should have an example of something similar already).
Remember that the swazzle path only connects processing elements within a single core, i.e. in groups of 96. There are many ways to approach the problem and the best method depends very much on the bigger picture of what you trying to achieve, however on first glance I would suggest reading all the records of R (the smaller set) in strips into the PEs as shown above. If you read the same data into the PEs on the second core, the the first core can look at half the T records and the second core can look at the other half of the T records.
If you overlap the fetching of the next T record(s) with the compare, compute and swazzle then you should be able to get this to go fast.
Hope this helps.
Dangermouse
08-15-2008, 03:49 PM
t.x = circular_swazzle_down(t.x);
t.y = circular_swazzle_down(t.y);
t.j = circular_swazzle_down(t.j);
t.q = circular_swazzle_down(t.q);
Incidentally, since these are float values you are only using half the 64-bit path on each swazzle step. You could use a union with a double to allow you to use the full 64 bits which would take four cycles instead of eight.
mikeyuan
08-15-2008, 03:56 PM
What is swazzle? I looked for it in the introductory documents but didn't find it. Could you please explain it to me first?
Mr Potato Head
08-15-2008, 04:09 PM
Mike you can find a description of the swazzle routines in the Cn Standard Library Reference Manual (http://support.clearspeed.com/resources/documentation/Release3.1Docs/Cn_Standard_Library_Reference_Manual_Rev3.A.pdf) and an architectural description of how swazzle works in the Core Architecture Manual (http://support.clearspeed.com/resources/documentation/Release3.0Docs/CSX600_Core_Architecture_Manual_Rev1.A.pdf) (useful for understanding some of the architectural features of the CSX family) Section 3 (specifically 3.5.6) and the Hardware Programming Guide (http://support.clearspeed.com/resources/documentation/Release3.0Docs/CSX600_Hardware_Programming_Manual_Rev1.A.pdf) (very low level). Both of these last two documents are for the CSX600, revisions for the CSX700 will follow but are not necessary for gaining an understanding of the swazzle paths in an MTAP processor.
Mr Potato Head
08-15-2008, 04:23 PM
Incidentally there are some good additional training materials (http://developer.clearspeed.com/resources/training/) available from this site that supplement the basic documentation supplied with the SDK. I'd work my way through most of these presentations but specifically Part 3 (http://developer.clearspeed.com/resources/training/docs/Part_3_ProgrammingIntroduction_Rev_A.pdf) has some examples on using the swazzle path.
mikeyuan
08-15-2008, 08:59 PM
Thank you very much for your help. For: t.x = circular_swazzle_down(t.x); does this mean go to a lower numbered PE until it goes in a circle back to itself? I have two problems here, one is there are 14,000 records in T, 192 PEs, then about 73 records per PE, I should go over all the records in a PE; second is a plane record in database r won't compare with this plane record in t, and it is excluded next time, i.e., first one is compared to 13,999 reecords, second one is compared to 13,998 records,... Could you please tell me how to solve these problems?
mikeyuan
08-17-2008, 02:18 PM
Do you have any complete source codes in the materials?
mikeyuan
08-17-2008, 03:20 PM
In your code:
for (i = 0 ; i < num_pes ; i++)
{
for (j = 0 ; j < N ; j++)
{
// compare t with r[j]
}
// do you need all the elements of t in each step, or could you compute
// the box once and swazzle the box instead?
t.x = circular_swazzle_down(t.x);
t.y = circular_swazzle_down(t.y);
t.j = circular_swazzle_down(t.j);
t.q = circular_swazzle_down(t.q);
}
I want r[0] to compare with t[1] to t[14000], r[1] to compare with t[2] to t[14000],...,r[4000] to compare with t[4001] to t[14000], I think the code above should be:
for (i = 0 ; i <4000 ; i++)
{
for (j = 0 ; j < num_pes ; j++)
{
// compare r[i] with t
// do you need all the elements of t in each step, or could you compute
// the box once and swazzle the box instead?
t.x = circular_swazzle_down(t.x);
t.y = circular_swazzle_down(t.y);
t.j = circular_swazzle_down(t.j);
t.q = circular_swazzle_down(t.q);
}
}
And how to exclude previous r and itself in the loop? I cannot figure out.
I am confused about swazzle, are the PEs accessed in sequence? Can they be done in parallel to be more efficient?
And one PE has 73 records in average, the program needs to compare all the 73 records, do you think that it can only be done in sequence?
Dangermouse
08-18-2008, 09:48 AM
For: t.x = circular_swazzle_down(t.x); does this mean go to a lower numbered PE until it goes in a circle back to itself?
Yes, for example with 96 PEs then PE 95 sends to 94 and so on, 1 sends to 0 and 0 sends to 95.
there are 14,000 records in T, 192 PEs, then about 73 records per PE, I should go over all the records in a PE;
I don't really understand what you mean, if you want to compare all the records then you will need to go over all the records...
a plane record in database r won't compare with this plane record in t, and it is excluded next time, i.e., first one is compared to 13,999 reecords, second one is compared to 13,998 records,... Could you please tell me how to solve these problems?
In your code you know which records are contained within each PE. For example, if you have a slice of R in each PE then each PE has records from (penum * sliceSize) to ((penum + 1) * slizeSize - 1).
The same applies to the T records except than on each swazzle you need to account for the change. For example if each PE starts off with just one T record with index (penum) then after one swazzle each PE has record (penum + 1) except PE 95 which now has 0.
tIndex++;
if (tIndex == highIndex)
{
tIndex = lowIndex;
}
Note I've assumed you are swazzling T records but this applies the other way around too.
Do you have any complete source codes in the materials?
There are some swazzle examples in the standard distribution. You could also look at the finance examples available on the website (http://support.clearspeed.com/downloads/applications/financeexamples/), the binomial tree and finite differences examples use swazzle.
And how to exclude previous r and itself in the loop? I cannot figure out.
This should be clear from above, you need to know which record indices are on each PE at any time.
I am confused about swazzle, are the PEs accessed in sequence? Can they be done in parallel to be more efficient?
The swazzle happens in parallel, all PEs will send the value to the neighbour at the same time.
And one PE has 73 records in average, the program needs to compare all the 73 records, do you think that it can only be done in sequence?
This depends on your specific application, but the code above is comparing get_numpes() records in parallel.
mikeyuan
08-18-2008, 02:32 PM
Quote:
I don't really understand what you mean, if you want to compare all the records then you will need to go over all the records...
I mean one PE has 73 records, all of them need to be compared, when circular_swazzle_down(t.x) gets to PE95, will it just compare one record on it or all 73?
mikeyuan
08-18-2008, 02:57 PM
Quote:
The swazzle happens in parallel, all PEs will send the value to the neighbour at the same time.
What value are the PEs sending to the neighbor?
Dangermouse
08-18-2008, 04:00 PM
What value are the PEs sending to the neighbor?
Consider the example:
double x;
double y;
x = 1.0 * get_penum();
y = 2.0 * get_penum();
x = circular_swazzle_down(y);
Each PE will send its value of y to the lower numbered PE (for 96 PEs, 0 sends to 95) and each PE will store the value it receives from its neighbour in the variable x.
Before the swazzle:
PE..0....1....2....94......95
x...0.0..1.0..2.0..94.0....95.0
y...0.0..2.0..4.0..188.0...190.0
After the swazzle:
PE..0....1....2....94......95
x...2.0..4.0..6.0..190.0...0.0
y...0.0..2.0..4.0..188.0...190.0
mikeyuan
08-18-2008, 07:59 PM
I have questions. One is: it seems that circular_swazzle_down is to transfer data from PE95 to 94, but I want r[0] to compare with t[1] to t[13,999], I don't need to transfer data between PEs. How to do this?
And second is quota:
This depends on your specific application, but the code above is comparing get_numpes() records in parallel.
I am confused about get_numpes(), do you mean whether for (j = 0 ; j < num_pes ; j++) is from PE 0 to PE 95 or from record[0] to record[73] in one PE?
mikeyuan
08-18-2008, 08:53 PM
In your code:
for (i = 0 ; i < num_pes ; i++)
{
for (j = 0 ; j < N ; j++)
{
// compare t with r[j]
}
// do you need all the elements of t in each step, or could you compute
// the box once and swazzle the box instead?
t.x = circular_swazzle_down(t.x);
t.y = circular_swazzle_down(t.y);
t.j = circular_swazzle_down(t.j);
t.q = circular_swazzle_down(t.q);
}
I think the outer loop is from PE0 to PE95, inner loop is compare record t in one PE, e.g, PE0 to r[0], r[1],..., r[N], and then PEs receive t.x from their neighbors, and then go to next PE. Is that correct? If yes, it is not what I want, I need r[0] to compare to all t[] records in one time, and I don't think that I need to transfer any data from one PE to another.
Dangermouse
08-19-2008, 11:12 AM
Hi Mike,
I am now completely lost ;) so let's go back to basics.
As far as I can tell if you were on a sequential processor you would do something like this.
#define MAX_T 14000
#define MAX_R 4000
struct t_struct tData[MAX_T];
struct r_struct rData[MAX_R];
unsigned int tIndex;
unsigned int rIndex;
for (rIndex = 0 ; rIndex < MAX_R ; rIndex++)
{
for (tIndex = 0 ; tIndex < MAX_T ; tIndex++)
{
if (rIndex != tIndex)
{
// compare rData[rIndex] with tData[tIndex]
}
}
}
Is this a fair understanding of what you are doing?
When parallelising this over the PEs one obvious approach is to give each PE a subset of the R records. In the example below I give just one R record at a time to each PE and I use synchronous I/O. This is inefficient but I want to keep the example simple, if you were implementing this for real then you would read a block of R records into each PE and overlap the I/O. In an earlier message you were talking about reading a large chunk of R records into each PE so you already understand that part.
Approach 1
#define MAX_T 14000
#define MAX_R 4000
struct t_struct tData[MAX_T];
struct r_struct rData[MAX_R];
unsigned int tIndex;
unsigned int rIndex;
unsigned short num_pes = get_num_pes();
poly unsigned short penum = get_penum();
poly unsigned int rIndex;
poly struct r_struct rDatap;
for (rIndex = 0, rIndexp = penum ; rIndex < MAX_R ; rIndex += num_pes, rIndexp += num_pes)
{
if (rIndexp < MAX_R)
{
memcpym2p(&rDatap, &rData[rIndexp], sizeof(struct r_struct));
for (tIndex = 0 ; tIndex < MAX_T ; tIndex++)
{
if (rIndexp != tIndex)
{
// compare rDatap with tData[tIndex]
}
}
}
}
This code would implement the same algorithm and function as you expect, however in addition to the inefficiencies I described above (not block the reading of R records, not using asynchronous I/O) the handling of the T records is also inefficient. You are using an implicit broadcast of the tData[tIndex] record to each of the PEs to allow them to compare with their local rDatap. You could use async_memcpym2p_consolidated (shown using memcpym2p instead).
Approach 2
for (tIndex = 0 ; tIndex < MAX_T ; tIndex++)
{
memcpym2p(&tDatap, &tData[tIndex], sizeof(struct t_struct));
if (rIndexp != tIndex)
{
// compare rDatap with tDatap
}
}
However this is still not as efficient as we would like (even if you correctly double buffer the data transfer). To get the best result we should use the swazzle path which provides very high bandwidth between the PEs.
The idea is that instead of every PE reading the same T record from memory, each PE reads a different T record, compares it with the R record (or all the R records if each PE has more than one) and then gives the T record to the neighbouring PE. Using this we can read a T record and process it in all the PEs while reading another T record in the background. This is what the earlier examples were describing.
Approach 3
#define MAX_T 14000
#define MAX_R 4000
struct t_struct tData[MAX_T];
struct r_struct rData[MAX_R];
unsigned int tIndex;
unsigned int rIndex;
unsigned int tIndexInner;
unsigned short num_pes = get_num_pes();
poly unsigned short penum = get_penum();
poly unsigned int tIndexp;
poly unsigned int rIndexp;
poly unsigned int tIndexInnerp;
poly struct t_struct tDatap;
poly struct r_struct rDatap;
for (rIndex = 0, rIndexp = penum ; rIndex < MAX_R ; rIndex += num_pes, rIndexp += num_pes)
{
if (rIndexp < MAX_R)
{
memcpym2p(&rDatap, &rData[rIndexp], sizeof(struct r_struct));
for (tIndex = 0, tIndexp = penum ; tIndex < MAX_T ; tIndex += num_pes, tIndexp += num_pes)
{
if (tIndexp < MAX_T)
{
memcpym2p(&tDatap, &tData[tIndexp], sizeof(struct t_struct));
}
for (tIndexInner = 0, tIndexInnerp = get_penum() ; tIndexInner < num_pes ; tIndexInner++, tIndexInnerp++)
{
if (tIndexInnerp == num_pes)
{
tIndexInnerp = 0;
}
if ((tIndex + tIndexInnerp) < MAX_T && rIndexp != (tIndex + tIndexInnerp))
{
// compare rDatap with tDatap
}
tDatap.x = circular_swazzle_down(tDatap.x);
tDatap.y = circular_swazzle_down(tDatap.y);
tDatap.j = circular_swazzle_down(tDatap.j);
tDatap.q = circular_swazzle_down(tDatap.q);
}
}
}
Note: this code is just to represent one way this could be done and I haven't run it.
mikeyuan
08-19-2008, 02:25 PM
Thank you very much for your help. I am lost too and I read from your first reply, you gave me three ways, I understand that the first approach is to put R on mono and T on PEs, the third one is to put both on PEs, is that correct? And I don't understand the second approach, could you please explain this, what is the difference between it and the third one?
Thanks,
mikeyuan
08-19-2008, 02:43 PM
Yes, your sequential way is correct, just r[4000] only needs to compare with t[4001] to t[14000], no need for t[0] to t[4000].
Dangermouse
08-19-2008, 04:31 PM
I understand that the first approach is to put R on mono and T on PEs, the third one is to put both on PEs, is that correct? And I don't understand the second approach, could you please explain this, what is the difference between it and the third one?
Thanks,
In my code the first approach has R in PEs and T in mono (of course you can swap these around if you like). The second approach is essentially the same, except that this time I have explicitly copied the T record from mono to poly each time I want to use it. In the example code this is no better (in fact it will be slower) but by using an explicit broadcast you can take the next step and double-buffer the T record so that the compute overlaps the I/O.
The third approach is completely different. The R records are still in the PEs as before, however instead of broadcasting the same T record to all PEs it loads a different T record onto each PE. It then uses the fact that the inter-PE swazzle path is much faster than mono to poly transfer and rotates the T records around the PEs in a circle until each PE has seen all the T records (or rather, all the T records that were loaded on this pass).
Yes, your sequential way is correct, just r[4000] only needs to compare with t[4001] to t[14000], no need for t[0] to t[4000].
This simply requires changing the 'if (rIndexp != tIndex)' condition. You could improve by building more intelligence regarding which T records to load onto which PE and therefore doing less swazzling, however I suggest you put that to one side for now.
mikeyuan
08-19-2008, 08:08 PM
Thank you very much. I understand better and will contact you when I have questions.
mikeyuan
08-19-2008, 09:37 PM
I changed approach 1 to:
#define MAX_T 14000
#define MAX_R 4000
mono struct t_struct tData[MAX_T];
Mono struct r_struct rData[MAX_R];
unsigned int tIndex;
unsigned int rIndex;
unsigned short num_pes = get_num_pes();
poly unsigned short penum = get_penum();
poly unsigned int tIndexp;
poly struct t_struct tDatap;
for (tIndex = 0, tIndexp = penum ; tIndex < MAX_T ; tIndex += num_pes, tIndexp += num_pes)
{
if (tIndexp < MAX_T)
{
memcpym2p(&tDatap, &tData[tIndexp], sizeof(struct t_struct));
for (rIndex = 0 ; rIndex < MAX_R ; rIndex++)
{
if (tIndexp != rIndex)
{
// compare tDatap with rData[rIndex]
}
}
}
}
Do you think that it is: R[i] in mono is compared to T[tIndexp] on PEs each time?
mikeyuan
08-20-2008, 03:53 AM
In your approach 1, 4000 R[i] are distributed into 192 PEs, then about 20 records per PE, will all of the 20 R[] records be compared to T[] or just the first one, e.g, R[0]? I need all of them to be compared. Thank you in advance.
mikeyuan
08-20-2008, 04:08 AM
And are the records in one PE compared in sequence or parallel?
Dangermouse
08-20-2008, 08:28 AM
Within each PE the records are compared in series, the parallelism arises because all the PEs operate on different data at the same time (SIMD).
mikeyuan
08-20-2008, 01:31 PM
Thank you very much. So if I use your approach 3, I should copy all of the records of R[] and T[] from host to PEs, right? Then I have another question, how do I know which record is in which PE, e.g, is R[0] in PE0? I am sorry but I have not understood your approach well and it might be unnecessay, what's your opinion?
Dangermouse
08-20-2008, 02:08 PM
You need to copy all the R and T records from the host to mono memory on the card. Then use the approach 3 to read sets of R and T records into the PEs.
The code given for each approach shows how each PE knows which records it currently has, c.f. the rIndexp and (tIndexp + tIndexInnerp).
mikeyuan
08-20-2008, 07:19 PM
Thank you very much for your help. I am reading the codes and will contact you later.
mikeyuan
08-20-2008, 08:03 PM
I am reading the code of approach 3, I do not know what tIndexInner and tIndexInnerp mean, could you please tell me and whyare they used?
Dangermouse
08-21-2008, 07:27 AM
There are two loops for the T records.
In the outer loop (tIndex) I read num_pes records, one per PE.
The inner loop (tIndexInner) runs num_pes times, since there are num_pes T records and I need every PE to see every record. Therefore in the inner loop I am swazzling the T records around the ring so that by the end of the inner loop (tIndexInner) every PE has seen all the T records that were loaded. Then the outer loop will load the next batch of T records.
The tIndexInnerp index is so that each PE can know which T record it actually has at any specific time.
mikeyuan
08-21-2008, 02:02 PM
Thank you very much for your help. Keep in touch.
Dangermouse
08-21-2008, 02:48 PM
You're welcome, let us all know how you get on.
mikeyuan
08-21-2008, 07:54 PM
I am first doing data transfer according to approach 1, will do approach 3 later, I will contact you when I have questions.
Thanks a lot,
Mike
mikeyuan
09-15-2008, 03:46 AM
I am working on change the memcpym2p to async_memcpym2p, I only found the example in the introductory manual and did not understand well, could you please provide me with a sample code that is easier?
Thanks,
vBulletin® v3.7.0, Copyright ©2000-2009, Jelsoft Enterprises Ltd.