(adjusted indentation to be consistent) |
No edit summary |
||
Line 104: | Line 104: | ||
= A simple CUDA C program= <!--T:10--> | = A simple CUDA C program= <!--T:10--> | ||
The following example shows how to add two numbers on the GPU using CUDA. Note that this is just a very simple exercise, so don't expect to see any actual acceleration. | The following example shows how to add two numbers on the GPU using CUDA. Note that this is just a very simple exercise, so don't expect to see any actual acceleration. | ||
</translate> | |||
<syntaxhighlight lang="cpp" line highlight="1,5"> | <syntaxhighlight lang="cpp" line highlight="1,5"> | ||
__global__ void add (int *a, int *b, int *c){ | __global__ void add (int *a, int *b, int *c){ | ||
Line 133: | Line 134: | ||
} | } | ||
</syntaxhighlight> | </syntaxhighlight> | ||
<translate> | |||
<!--T:17--> | <!--T:17--> | ||
Line 138: | Line 140: | ||
That code does not look parallel! | That code does not look parallel! | ||
Solution: Let's look at what's inside the triple angle brackets in the kernel call and make some changes: | Solution: Let's look at what's inside the triple angle brackets in the kernel call and make some changes: | ||
</translate> | |||
<syntaxhighlight lang="cpp" line highlight="1,5"> | <syntaxhighlight lang="cpp" line highlight="1,5"> | ||
add <<< N, 1 >>> (dev_a, dev_b, dev_c); | add <<< N, 1 >>> (dev_a, dev_b, dev_c); | ||
</syntaxhighlight> | </syntaxhighlight> | ||
<translate> | |||
Here we replaced 1 by N, so that N different CUDA blocks will be executed at the same time. However, in order to achieve parallelism we need to make some changes to the kernel as well: | Here we replaced 1 by N, so that N different CUDA blocks will be executed at the same time. However, in order to achieve parallelism we need to make some changes to the kernel as well: | ||
</translate> | |||
<syntaxhighlight lang="cpp" line highlight="1,5"> | <syntaxhighlight lang="cpp" line highlight="1,5"> | ||
__global__ void add (int *a, int *b, int *c){ | __global__ void add (int *a, int *b, int *c){ | ||
c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x]; | c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x]; | ||
} | } | ||
</syntaxhighlight> | </syntaxhighlight> | ||
<translate> | |||
where blockIdx.x is the unique number identifying a CUDA block. This way each CUDA block adds one element of a[ ] to the corresponding element of b[ ] and stores the result in c[ ]. | where blockIdx.x is the unique number identifying a CUDA block. This way each CUDA block adds one element of a[ ] to the corresponding element of b[ ] and stores the result in c[ ]. | ||
[[File:Cuda-blocks-parallel.png|thumbnail|CUDA blocks-based parallelism. ]] | [[File:Cuda-blocks-parallel.png|thumbnail|CUDA blocks-based parallelism. ]] | ||
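The block-based version can be put together with the host-side steps as a minimal sketch. The helper name <code>add_on_gpu</code> and its error handling are assumptions made for illustration and are not part of the original tutorial; only the kernel body and the <code><<< N, 1 >>></code> launch come from the text above.
</translate>
<syntaxhighlight lang="cpp">
#include <cstddef>
#include <cuda_runtime.h>

// Each block handles exactly one element; blockIdx.x selects which one.
__global__ void add(const int *a, const int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

// Hypothetical host helper (illustration only): copies a and b to the GPU,
// launches n blocks of one thread each, and copies the sums back into c.
// Returns true if every CUDA call succeeded.
bool add_on_gpu(const int *a, const int *b, int *c, int n) {
    int *dev_a = nullptr, *dev_b = nullptr, *dev_c = nullptr;
    const size_t bytes = static_cast<size_t>(n) * sizeof(int);

    bool ok = cudaMalloc(&dev_a, bytes) == cudaSuccess &&
              cudaMalloc(&dev_b, bytes) == cudaSuccess &&
              cudaMalloc(&dev_c, bytes) == cudaSuccess;
    if (ok) {
        ok = cudaMemcpy(dev_a, a, bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
             cudaMemcpy(dev_b, b, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    }
    if (ok) {
        add<<<n, 1>>>(dev_a, dev_b, dev_c);  // n blocks, 1 thread per block
        // The device-to-host copy waits for the kernel to finish.
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(c, dev_c, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    // cudaFree accepts a null pointer, so cleanup is safe on every path.
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    return ok;
}
</syntaxhighlight>
<translate>
Calling <code>add_on_gpu(a, b, c, N)</code> from <code>main</code> with three host arrays of length N should leave the element-wise sums in <code>c</code>. With one thread per block most of the hardware stays idle, so this layout is useful for learning the indexing rather than for speed.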
Line 151: | Line 157: | ||
<!--T:19--> | <!--T:19--> | ||
Can we again make some modifications in those triple angle brackets? | Can we again make some modifications in those triple angle brackets? | ||
</translate> | |||
<syntaxhighlight lang="cpp" line highlight="1,5"> | <syntaxhighlight lang="cpp" line highlight="1,5"> | ||
add <<< 1, N >>> (dev_a, dev_b, dev_c); | add <<< 1, N >>> (dev_a, dev_b, dev_c); | ||
</syntaxhighlight> | </syntaxhighlight> | ||
<translate> | |||
Now instead of blocks, the job is distributed across parallel threads. What is the advantage of having parallel threads? Unlike blocks, threads within a block can communicate with each other. In other words, we parallelize across multiple threads in a block when heavy communication is involved. The chunks of code that can run independently, i.e. with little or no communication, are distributed across parallel blocks. | Now instead of blocks, the job is distributed across parallel threads. What is the advantage of having parallel threads? Unlike blocks, threads within a block can communicate with each other. In other words, we parallelize across multiple threads in a block when heavy communication is involved. The chunks of code that can run independently, i.e. with little or no communication, are distributed across parallel blocks. | ||
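For the thread-based launch, the kernel has to index by threadIdx.x instead of blockIdx.x. The following is a minimal sketch: the names <code>add_threads</code> and <code>launch_add_threads</code> are hypothetical, and the 1024-thread check assumes the per-block limit of current NVIDIA GPUs, which should be confirmed for the target device (for example with <code>cudaGetDeviceProperties</code>).
</translate>
<syntaxhighlight lang="cpp">
#include <cuda_runtime.h>

// Thread-based variant: every thread of the single block handles one element.
__global__ void add_threads(const int *a, const int *b, int *c) {
    c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}

// Hypothetical helper (illustration only): launches one block of n threads.
// dev_a, dev_b and dev_c must already be allocated and filled on the GPU.
cudaError_t launch_add_threads(const int *dev_a, const int *dev_b,
                               int *dev_c, int n) {
    // Assumed limit: at most 1024 threads per block.
    if (n < 1 || n > 1024) {
        return cudaErrorInvalidValue;
    }
    add_threads<<<1, n>>>(dev_a, dev_b, dev_c);  // 1 block, n threads
    return cudaGetLastError();                   // reports launch errors
}
</syntaxhighlight>
<translate>
Because a single block has a bounded number of threads, larger arrays need several blocks, each containing many threads.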
= Advantages of shared memory= <!--T:20--> | = Advantages of shared memory= <!--T:20--> | ||
So far all the memory transfers in the kernel have been done via the regular GPU (global) memory, which is relatively slow. Often there is so much communication between the threads that performance decreases significantly. To address this issue there is another type of memory called '''shared memory''', which can be used to speed up memory operations between the threads. The catch is that only the threads within the same block can communicate through it. To demonstrate the usage of shared memory we consider the dot product example, where two vectors are multiplied together element by element and the products are then summed. Below is the kernel: | So far all the memory transfers in the kernel have been done via the regular GPU (global) memory, which is relatively slow. Often there is so much communication between the threads that performance decreases significantly. To address this issue there is another type of memory called '''shared memory''', which can be used to speed up memory operations between the threads. The catch is that only the threads within the same block can communicate through it. To demonstrate the usage of shared memory we consider the dot product example, where two vectors are multiplied together element by element and the products are then summed. Below is the kernel: | ||
</translate> | |||
<syntaxhighlight lang="cpp" line highlight="1,5"> | |||
__global__ void dot(int *a, int *b, int *c){ | __global__ void dot(int *a, int *b, int *c){ | ||
int temp = a[threadIdx.x]*b[threadIdx.x]; | int temp = a[threadIdx.x]*b[threadIdx.x]; | ||
} | } | ||
</syntaxhighlight> | </syntaxhighlight> | ||
<translate> | |||
After each thread computes its portion, we need to add everything together: each thread has to share its data. However, the problem is that each thread's copy of the temp variable is private to that thread. This can be resolved by the use of shared memory. Below is the kernel with the modifications to use shared memory: | After each thread computes its portion, we need to add everything together: each thread has to share its data. However, the problem is that each thread's copy of the temp variable is private to that thread. This can be resolved by the use of shared memory. Below is the kernel with the modifications to use shared memory: | ||
</translate> | |||
<syntaxhighlight lang="cpp" line highlight="1,4"> | <syntaxhighlight lang="cpp" line highlight="1,4"> | ||
#define N 512 | #define N 512 | ||
Line 176: | Line 187: | ||
} | } | ||
</syntaxhighlight> | </syntaxhighlight> | ||
<translate> | |||
= Basic performance considerations = <!--T:26--> | = Basic performance considerations = <!--T:26--> |