CUDA tutorial

** Same as cudaMemcpy, but transfers the data asynchronously, which means the call does not block the host while the copy is in progress (see the sketch below).
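Below is a minimal sketch of how cudaMemcpyAsync is typically paired with a CUDA stream; the stream, buffer names, and sizes are assumptions made for illustration and are not taken from the tutorial's code.
<syntaxhighlight lang="cpp">
// Hypothetical example: overlap a host-to-device copy with other host work.
cudaStream_t stream;
cudaStreamCreate(&stream);

float *h_data, *d_data;
size_t nbytes = (1 << 20) * sizeof(float);            // size chosen only for illustration

cudaMallocHost((void **)&h_data, nbytes);             // pinned host memory, needed for truly asynchronous copies
cudaMalloc((void **)&d_data, nbytes);

cudaMemcpyAsync(d_data, h_data, nbytes, cudaMemcpyHostToDevice, stream);
// ... the host is free to do other work here while the copy is in flight ...
cudaStreamSynchronize(stream);                        // wait for the copy to finish before using d_data

cudaFree(d_data);
cudaFreeHost(h_data);
cudaStreamDestroy(stream);
</syntaxhighlight>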


= A simple CUDA C program= <!--T:10-->
The following example shows how to add two numbers on the GPU using CUDA. Note that this is just an exercise: it is very simple, so it will not scale at all.
<syntaxhighlight lang="cpp" line highlight="1,5">
__global__  void add (int *a, int *b, int *c){
      *c = *a + *b;
}
</syntaxhighlight>
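The kernel above runs on the GPU, but it still has to be launched and fed data from the host. A minimal host-side driver is sketched below; the variable names, values, and printf output are illustrative assumptions rather than the tutorial's full program.
<syntaxhighlight lang="cpp">
#include <cstdio>

__global__  void add (int *a, int *b, int *c){
      *c = *a + *b;
}

int main(void){
      int a = 2, b = 7, c = 0;               // host copies of the data
      int *dev_a, *dev_b, *dev_c;            // device pointers

      cudaMalloc((void **)&dev_a, sizeof(int));
      cudaMalloc((void **)&dev_b, sizeof(int));
      cudaMalloc((void **)&dev_c, sizeof(int));

      cudaMemcpy(dev_a, &a, sizeof(int), cudaMemcpyHostToDevice);
      cudaMemcpy(dev_b, &b, sizeof(int), cudaMemcpyHostToDevice);

      add <<< 1, 1 >>> (dev_a, dev_b, dev_c);   // one block with one thread: everything runs serially on the GPU

      cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);
      printf("%d + %d = %d\n", a, b, c);

      cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
      return 0;
}
</syntaxhighlight>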
<!--T:17-->
Are we missing anything?
That code does not look parallel!
Solution: let's look at what's inside the triple angle brackets in the kernel call and make some changes:
<syntaxhighlight lang="cpp" line highlight="1,5">
add <<< N, 1 >>> (dev_a, dev_b, dev_c);
</syntaxhighlight>
Here we replaced 1 by N, so that N different CUDA blocks will be executed at the same time. However, in order to achieve parallelism we need to make some changes to the kernel as well:
<syntaxhighlight lang="cpp" line highlight="1,5">
__global__  void add (int *a, int *b, int *c){
      c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}
</syntaxhighlight>
where blockIdx.x is the unique number identifying a CUDA block. This way, each CUDA block adds one element of a[ ] to the corresponding element of b[ ] and stores the result in c[ ].
[[File:Cuda-blocks-parallel.png|thumbnail|CUDA block-based parallelism.]]
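With N blocks, the host code also changes: instead of single integers it allocates, fills, and copies whole arrays of N elements. The following is a sketch of what that host side might look like; N, the fill pattern, and the variable names are assumptions made for illustration.
<syntaxhighlight lang="cpp">
#include <cstdlib>

#define N 512                              // number of elements; value chosen only for illustration

__global__  void add (int *a, int *b, int *c){
      c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

int main(void){
      int *a, *b, *c;                      // host arrays
      int *dev_a, *dev_b, *dev_c;          // device arrays
      int size = N * sizeof(int);

      a = (int *) malloc(size);  b = (int *) malloc(size);  c = (int *) malloc(size);
      cudaMalloc((void **)&dev_a, size);
      cudaMalloc((void **)&dev_b, size);
      cudaMalloc((void **)&dev_c, size);

      for (int i = 0; i < N; i++){ a[i] = i; b[i] = 2 * i; }   // fill the inputs with some data

      cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
      cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);

      add <<< N, 1 >>> (dev_a, dev_b, dev_c);   // N blocks, one thread per block: block i computes c[i]

      cudaMemcpy(c, dev_c, size, cudaMemcpyDeviceToHost);

      free(a); free(b); free(c);
      cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
      return 0;
}
</syntaxhighlight>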


The same addition can instead be distributed over N parallel threads within a single block:
<syntaxhighlight lang="cpp" line highlight="1,5">
add <<< 1, N >>> (dev_a, dev_b, dev_c);
</syntaxhighlight>
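The kernel then has to index the arrays by thread rather than by block. By analogy with the block version above, a sketch of that change would be:
<syntaxhighlight lang="cpp" line highlight="1,5">
__global__  void add (int *a, int *b, int *c){
      c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];   // threadIdx.x identifies the thread within its block
}
</syntaxhighlight>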
Now instead of blocks, the job is distributed across parallel threads. What is the advantage of having parallel threads? Unlike blocks, threads can communicate with each other: in other words, we parallelize across multiple threads within a block when heavy communication is involved. Chunks of code that can run independently, i.e. with little or no communication, are distributed across parallel blocks.


= Advantage of Shared Memory= <!--T:20-->