CUDA tutorial: Difference between revisions

Now, instead of blocks, the job is distributed across parallel threads. What is the advantage of having parallel threads? Unlike blocks, threads within the same block can communicate with each other. In other words, we parallelize across multiple threads in a block when heavy communication is involved, while the chunks of code that can run independently, i.e. with little or no communication, are distributed across parallel blocks.
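To make the thread-level distribution concrete, here is a minimal sketch of a vector addition that uses one block of several threads. The names <code>add_threads</code> and <code>add_on_gpu</code> are hypothetical and chosen for illustration only; error checking is omitted for brevity, and the sketch assumes the vector length does not exceed the maximum number of threads per block (1024 on current NVIDIA GPUs).

```cuda
#include <cuda_runtime.h>

// Each thread adds one pair of elements.
// threadIdx.x is the index of the thread within its block.
__global__ void add_threads(const int *a, const int *b, int *c) {
    c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}

// Hypothetical host helper (not part of the tutorial): copies the inputs
// to the GPU, launches ONE block of n threads, and copies the result back.
// Assumes n <= 1024; CUDA error checking is omitted for brevity.
void add_on_gpu(const int *a, const int *b, int *c, int n) {
    int *d_a, *d_b, *d_c;
    size_t bytes = n * sizeof(int);

    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);

    cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);

    // <<<1, n>>> means 1 block with n threads (rather than n blocks of 1 thread).
    add_threads<<<1, n>>>(d_a, d_b, d_c);

    // This copy waits for the kernel to finish before reading the result.
    cudaMemcpy(c, d_c, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
}
```

Calling <code>add_on_gpu(a, b, c, 4)</code> on two host arrays of four integers would fill <code>c</code> with their element-wise sums. Since no thread needs another thread's data here, this example would work equally well across blocks; threads start to matter once the computation requires communication.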


= Advantages of shared memory = <!--T:20-->
So far, all the memory transfers in the kernel have been done via the regular GPU (global) memory, which is relatively slow. Often there is so much communication between the threads that performance decreases significantly. To address this issue, there exists another type of memory called '''shared memory''', which can be used to speed up memory operations between the threads. The catch is that shared memory is visible only to the threads within the same block, so only those threads can communicate through it. To demonstrate the use of shared memory, we consider the dot product example, where two vectors are multiplied together element by element and the products are then summed. Below is the kernel:
  <syntaxhighlight lang="cpp" line highlight="1,5">
__global__  void dot(int *a, int *b, int *c){
// ... (lines omitted in this revision comparison) ...
}
</syntaxhighlight>
After each thread computes its portion, we need to add everything together, so each thread has to share its data. However, the problem is that each thread's copy of the temp variable is private. This can be resolved by the use of shared memory. Below is the kernel modified to use shared memory:
<syntaxhighlight lang="cpp" line highlight="1,4">
#define N 512
// ... (lines omitted in this revision comparison) ...
}
</syntaxhighlight>
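Because the kernel body is not shown in full here, the following is only a sketch of how a shared-memory dot product could look, not necessarily the tutorial's exact code. It assumes a single block of <code>N</code> threads, a <code>__shared__</code> array named <code>temp</code>, and a final sum performed serially by thread 0. The names <code>dot_shared</code> and <code>dot_on_gpu</code> are hypothetical, and error checking is omitted.

```cuda
#include <cuda_runtime.h>

#define N 512  // vector length; also the number of threads in the single block

// Sketch of a dot product kernel using shared memory (assumes one block of N threads).
__global__ void dot_shared(const int *a, const int *b, int *c) {
    // One copy of temp per block, visible to every thread in that block.
    __shared__ int temp[N];

    // Each thread stores its own pairwise product.
    temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];

    // Barrier: make sure all products are written before any thread reads them.
    __syncthreads();

    // Thread 0 sums the products and writes the result to global memory.
    if (threadIdx.x == 0) {
        int sum = 0;
        for (int i = 0; i < N; i++) {
            sum += temp[i];
        }
        *c = sum;
    }
}

// Hypothetical host helper (not part of the tutorial): runs the kernel on
// two host vectors of length N and returns the dot product.
int dot_on_gpu(const int *a, const int *b) {
    int *d_a, *d_b, *d_c;
    int result = 0;
    size_t bytes = N * sizeof(int);

    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, sizeof(int));

    cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);

    dot_shared<<<1, N>>>(d_a, d_b, d_c);  // 1 block, N threads

    cudaMemcpy(&result, d_c, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return result;
}
```

The call to <code>__syncthreads()</code> is the important design point: without it, thread 0 could start summing before the other threads have written their products into <code>temp</code>. The serial loop in thread 0 keeps the sketch short; a parallel reduction would be faster but is beyond what this section covers.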
= Basic Performance Considerations =  
== Memory Transfers ==