** Same as cudaMemcpy, but transfers the data asynchronously, which means the call returns immediately and does not block the calling host thread (a short sketch follows below).
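As a rough illustration of the difference, the sketch below is only an example: dev_a, host_a and size are assumed to be set up elsewhere, and host_a should be pinned memory (e.g. allocated with cudaMallocHost) for the copy to be truly asynchronous.
<syntaxhighlight lang="cpp">
cudaStream_t stream;
cudaStreamCreate(&stream);

// Enqueue the copy on the stream; the call returns immediately,
// so the host can keep working while the transfer is in flight.
cudaMemcpyAsync(dev_a, host_a, size, cudaMemcpyHostToDevice, stream);

// ... independent host work can overlap with the transfer here ...

// Wait until everything queued on the stream, including the copy, has finished.
cudaStreamSynchronize(stream);
cudaStreamDestroy(stream);
</syntaxhighlight>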
= A simple CUDA C program= <!--T:10-->
The following example shows how to add two numbers on the GPU using CUDA. Note that this is just an exercise; it is very simple, so it will not scale at all.
<syntaxhighlight lang="cpp" line highlight="1,5">
__global__ void add (int *a, int *b, int *c){
    *c = *a + *b;
}
</syntaxhighlight>
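The host code that drives this kernel is not reproduced here; a minimal sketch of what it could look like (the names dev_a, dev_b and dev_c match the launch calls used below, and error checking is omitted):
<syntaxhighlight lang="cpp">
int a = 2, b = 7, c;            // host copies of the data
int *dev_a, *dev_b, *dev_c;     // device pointers

// Allocate space for one integer per variable on the GPU.
cudaMalloc((void**)&dev_a, sizeof(int));
cudaMalloc((void**)&dev_b, sizeof(int));
cudaMalloc((void**)&dev_c, sizeof(int));

// Copy the inputs to the GPU.
cudaMemcpy(dev_a, &a, sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, &b, sizeof(int), cudaMemcpyHostToDevice);

// Launch the kernel with one block of one thread.
add <<< 1, 1 >>> (dev_a, dev_b, dev_c);

// Copy the result back and free the device memory.
cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
</syntaxhighlight>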
<!--T:17-->
Are we missing anything?
That code does not look parallel!
Solution: let's look at what is inside the triple angle brackets in the kernel call and make some changes:
<syntaxhighlight lang="cpp" line highlight="1,5">
add <<< N, 1 >>> (dev_a, dev_b, dev_c);
</syntaxhighlight>
Here we replaced the 1 with N, so that N different CUDA blocks will be executed at the same time. However, in order to achieve parallelism we need to make some changes to the kernel as well:
<syntaxhighlight lang="cpp" line highlight="1,5">
__global__ void add (int *a, int *b, int *c){
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}
</syntaxhighlight>
where blockIdx.x is the unique number identifying a CUDA block. This way, each CUDA block adds one element of a[ ] to the corresponding element of b[ ] and stores the result in c[ ].
[[File:Cuda-blocks-parallel.png|thumbnail|CUDA blocks-based parallelism. ]]
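For this kernel the host code also has to handle whole arrays rather than single integers; a rough sketch of the changes (assuming N is defined, for example with #define N 512, and error checking is again omitted):
<syntaxhighlight lang="cpp">
int a[N], b[N], c[N];           // host arrays, filled elsewhere
int *dev_a, *dev_b, *dev_c;

// Allocate room for N integers per array on the GPU.
cudaMalloc((void**)&dev_a, N * sizeof(int));
cudaMalloc((void**)&dev_b, N * sizeof(int));
cudaMalloc((void**)&dev_c, N * sizeof(int));

// Copy the input arrays to the GPU.
cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

// One block per element: block i computes c[i] = a[i] + b[i].
add <<< N, 1 >>> (dev_a, dev_b, dev_c);

// Copy the result back to the host.
cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);
</syntaxhighlight>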
CUDA blocks are not the only way to distribute the work: we can instead launch a single block that contains N parallel threads:
<syntaxhighlight lang="cpp" line highlight="1,5">
add <<< 1, N >>> (dev_a, dev_b, dev_c);
</syntaxhighlight>
Now instead of blocks, the job is distributed across parallel threads. What is the advantage of having parallel threads? Unlike blocks, threads can communicate with each other: in other words, we parallelize across multiple threads within a block when heavy communication is involved. Chunks of code that can run independently, i.e. with little or no communication, are distributed across parallel blocks.
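The kernel has to change accordingly; a minimal sketch of the thread-based version (the only assumption is the single block of N threads launched above):
<syntaxhighlight lang="cpp">
__global__ void add (int *a, int *b, int *c){
    // threadIdx.x identifies the thread within its block,
    // so each thread adds one pair of elements.
    c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}
</syntaxhighlight>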
= Advantage of Shared Memory= <!--T:20-->