CUDA tutorial

<translate>
[[Category:Software]]
=Introduction= <!--T:1-->
This tutorial introduces the Graphics Processing Unit (GPU) as a massively parallel computing device, the CUDA parallel programming language, and some of the CUDA numerical libraries for use in high performance computing.
{{Prerequisites
* Execute GPU functions (kernels)  
* Transfer data back to the Host memory
=CUDA Execution Model= <!--T:2-->
Simple CUDA code executed on the GPU is called a kernel. There are several questions we may ask at this point:
* How do you run a kernel on a bunch of streaming multiprocessors (SMs)?
1. Copy input data from CPU memory to GPU memory
2. Load the GPU program and execute it
3. Copy results from GPU memory back to CPU memory


= CUDA Block-Threading Model = <!--T:3-->


<!--T:4-->
[[File:Cuda-threads-blocks.png|thumbnail|CUDA block-threading model: threads are organized into blocks, and blocks are further organized into a grid.]]
Given the very large number of threads in a CUDA kernel (and in order to achieve massive parallelism one has to use all the threads possible), one needs to organize them somehow. In CUDA, all the threads are structured in threading blocks, and the blocks are further organized into grids, as shown in the figure. In dividing the threads we make sure that the following is satisfied:
[[File:Cuda_threads.png|thumbnail|Threads within a block intercommunicate via shared memory.]]


<!--T:5-->
Each thread uses IDs to decide what data to work on:
* Block IDs: 1D or 2D (blockIdx.x, blockIdx.y)
* Thread IDs: 1D, 2D, or 3D (threadIdx.x, threadIdx.y, threadIdx.z)
Such a model simplifies memory addressing when processing multidimensional data.
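For instance, a kernel that doubles every element of a 2D array can combine the block and thread IDs into global coordinates. The following is a minimal sketch (the kernel name and array layout are illustrative, not part of the tutorial's examples):

<syntaxhighlight lang="cpp">
__global__ void scale2d (float *data, int width, int height){
	// global (x, y) coordinates of this thread within the grid
	int x = blockIdx.x * blockDim.x + threadIdx.x;
	int y = blockIdx.y * blockDim.y + threadIdx.y;
	// guard against threads that fall outside the array bounds
	if (x < width && y < height)
		data[y * width + x] *= 2.0f;   // flat 1D address of the 2D element
}
</syntaxhighlight>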


= Thread Scheduling = <!--T:6-->
Usually a streaming multiprocessor (SM) executes one threading block at a time. The code is executed in groups of 32 threads, called warps. A hardware scheduler is free to assign blocks to any SM at any time. Furthermore, when an SM gets a block assigned to it, it does not mean that this particular block will be executed non-stop. In fact, the scheduler can postpone/suspend execution of such a block under certain conditions, e.g. when its data becomes unavailable (indeed, it takes quite some time to read data from the global GPU memory). When that happens, the scheduler takes another threading block which is ready for execution. This is the so-called zero-overhead scheduling, which makes the execution more streamlined, so that SMs are not idling.


= GPU Memories in CUDA = <!--T:7-->
Several types of memory exist for CUDA operations:
* Global memory
* Constant memory
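In device code, these memory spaces are selected with declaration qualifiers; a minimal sketch (variable names and sizes are illustrative):

<syntaxhighlight lang="cpp">
__device__   float buffer[256];    // global memory: visible to all threads, persists between kernels
__constant__ float coeffs[16];     // constant memory: read-only in kernels, cached

__global__ void kernel (void){
	__shared__ float tile[64];     // shared memory: one copy per block, fast
	float x = 0.0f;                // automatic variables live in per-thread registers
}
</syntaxhighlight>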


= A Few Basic CUDA Operations = <!--T:8-->
== CUDA Memory Allocation ==
* cudaMalloc((void**)&array, size)
** Allocates an object in the device global memory. Requires the address of a pointer and the size.
* cudaFree(array)
** Deallocates object from the memory. Requires just a pointer to the array.
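For example, allocating and later freeing device space for an array of 256 integers might look like this (a sketch; error checking is omitted):

<syntaxhighlight lang="cpp">
int *dev_array;
int size = 256 * sizeof(int);

cudaMalloc ((void**) &dev_array, size);   // allocate on the device
// ... use dev_array in kernels ...
cudaFree (dev_array);                     // release the device memory
</syntaxhighlight>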


== CUDA Data Transfers == <!--T:9-->
* cudaMemcpy(array_dest, array_orig, size, direction)
** Copies the data either from device to host or from host to device. Requires pointers to the arrays, the size, and the direction type (cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice, etc.)
* cudaMemcpyAsync(array_dest, array_orig, size, direction, stream)
** Same as cudaMemcpy, but transfers the data asynchronously, which means it does not block the execution of other processes.
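For example, a typical round trip around a kernel call looks like this (a sketch; dev_array is assumed to have been allocated with cudaMalloc beforehand):

<syntaxhighlight lang="cpp">
int host_array[256];

// host to device
cudaMemcpy (dev_array, host_array, sizeof(host_array), cudaMemcpyHostToDevice);
// ... launch the kernel here ...
// device to host
cudaMemcpy (host_array, dev_array, sizeof(host_array), cudaMemcpyDeviceToHost);
</syntaxhighlight>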


= First CUDA C Program= <!--T:10-->
The following example shows how to add two numbers on the GPU using CUDA. Note that this is just an exercise; it is very simple, so it will not scale up.
<syntaxhighlight lang="cpp" line highlight="1,5">
__global__  void add (int *a, int *b, int *c){


<!--T:11-->
*c = *a + *b;
}
int main(void){
int a, b, c;
int *dev_a, *dev_b, *dev_c;
int size = sizeof(int);


<!--T:12-->
//  allocate device copies of a,b, c
cudaMalloc ( (void**) &dev_a, size);
cudaMalloc ( (void**) &dev_b, size);
cudaMalloc ( (void**) &dev_c, size);


<!--T:13-->
a=2; b=7;
//  copy inputs to device
cudaMemcpy (dev_a, &a, size, cudaMemcpyHostToDevice);
cudaMemcpy (dev_b, &b, size, cudaMemcpyHostToDevice);


<!--T:14-->
// launch add() kernel on GPU, passing parameters
add <<< 1, 1 >>> (dev_a, dev_b, dev_c);


<!--T:15-->
// copy device result back to host
cudaMemcpy (&c, dev_c, size, cudaMemcpyDeviceToHost);


<!--T:16-->
cudaFree ( dev_a ); cudaFree ( dev_b ); cudaFree ( dev_c );  
}
</syntaxhighlight>


<!--T:17-->
Are we missing anything?
That code does not look parallel!
Solution: we can change the numbers in the triple brackets, e.g. add <<< N, 1 >>> (dev_a, dev_b, dev_c), which launches N copies (blocks) of the kernel in parallel. The kernel then has to be modified so that each block works on its own array element:
<syntaxhighlight lang="cpp" line highlight="1,5">
__global__  void add (int *a, int *b, int *c){


<!--T:18-->
c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
</syntaxhighlight>
where blockIdx.x is the unique number identifying a CUDA block. This way, each CUDA block adds a value from a[ ] to b[ ].
[[File:Cuda-blocks-parallel.png|thumbnail|CUDA blocks-based parallelism. ]]
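The host code then allocates whole arrays and launches the kernel with N blocks; a sketch based on the earlier single-number example (N and the input values are illustrative):

<syntaxhighlight lang="cpp">
#define N 512
int a[N], b[N], c[N];
int *dev_a, *dev_b, *dev_c;
int size = N * sizeof(int);

cudaMalloc ((void**) &dev_a, size);
cudaMalloc ((void**) &dev_b, size);
cudaMalloc ((void**) &dev_c, size);

cudaMemcpy (dev_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy (dev_b, b, size, cudaMemcpyHostToDevice);

// N blocks, one thread each: block i computes c[i]
add <<< N, 1 >>> (dev_a, dev_b, dev_c);

cudaMemcpy (c, dev_c, size, cudaMemcpyDeviceToHost);
cudaFree (dev_a); cudaFree (dev_b); cudaFree (dev_c);
</syntaxhighlight>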


<!--T:19-->
Can we again make some modifications in those triple brackets?
<syntaxhighlight lang="cpp" line highlight="1,5">
__global__  void add (int *a, int *b, int *c){
	c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}
</syntaxhighlight>
Here the kernel is launched as add <<< 1, N >>> (dev_a, dev_b, dev_c), i.e. one block of N parallel threads, and threadIdx.x identifies each thread within the block.
Now, instead of blocks, the job is distributed across parallel threads. What is the advantage of having parallel threads? Unlike blocks, threads can communicate with each other. In other words, we parallelize across multiple threads in the block when heavy communication is involved. The chunks of code that can run independently, i.e. without much communication, are distributed across parallel blocks.
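Blocks and threads can also be combined; each thread then derives a unique global index from both IDs. A sketch (THREADS_PER_BLOCK is an assumed compile-time constant, and N is taken to be a multiple of it):

<syntaxhighlight lang="cpp">
__global__ void add (int *a, int *b, int *c){
	// global index: thread's position within its block, offset by the block's position in the grid
	int index = threadIdx.x + blockIdx.x * blockDim.x;
	c[index] = a[index] + b[index];
}

// launched with N/THREADS_PER_BLOCK blocks of THREADS_PER_BLOCK threads each:
// add <<< N/THREADS_PER_BLOCK, THREADS_PER_BLOCK >>> (dev_a, dev_b, dev_c);
</syntaxhighlight>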


= Advantage of Shared Memory = <!--T:20-->
So far, all the memory transfers in the kernel have been done via the regular GPU (global) memory, which is relatively slow. Often there is so much communication between the threads that it decreases the performance significantly. To address this issue, there exists another type of memory called shared memory, which can be used to speed up the memory operations between the threads. However, the catch is that only the threads within a block can communicate this way. To demonstrate the usage of shared memory, we consider the dot product example, where two vectors are dot-multiplied. Below is the kernel:
<syntaxhighlight lang="cpp" line highlight="1,5">
#define N 512
__global__  void dot (int *a, int *b, int *c){
	__shared__ int temp[N];
	temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];
	__syncthreads();
	if (threadIdx.x == 0) {
		int sum = 0;
		for (int i = 0; i < N; i++)
			sum += temp[i];
		*c = sum;
	}
}
</syntaxhighlight>
* Use memcpy times to analyse the execution times


== Bandwidth == <!--T:21-->
* Always keep CUDA bandwidth in mind when changing your code
* Know the theoretical peak bandwidth of the various data links
* Utilize the various memory spaces depending on the situation: global, shared, constant
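For example, the effective bandwidth of a kernel can be estimated as (bytes read + bytes written) divided by the measured execution time. The following host-side sketch shows the arithmetic (the array size and the timing are made-up numbers for illustration):

<syntaxhighlight lang="cpp">
#include <cstdio>

// effective bandwidth in GB/s: (bytes read + bytes written) / time
double effective_bandwidth_gbs (double bytes_read, double bytes_written, double seconds){
	return (bytes_read + bytes_written) / seconds / 1.0e9;
}

int main (void){
	double n = 1 << 26;                  // 2^26 elements
	double bytes_read = 2.0 * n * 4.0;   // two float input arrays
	double bytes_written = n * 4.0;      // one float output array
	double seconds = 2.0e-3;             // assumed measured kernel time
	printf ("%.1f GB/s\n", effective_bandwidth_gbs (bytes_read, bytes_written, seconds));
	return 0;
}
</syntaxhighlight>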


== Common GPU Programming Strategies == <!--T:22-->
* Constant memory also resides in DRAM - much slower access than shared memory
** BUT, it's cached!
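For example, constant memory is declared at file scope and filled from the host with cudaMemcpyToSymbol before the kernel launch (a sketch; the names are illustrative):

<syntaxhighlight lang="cpp">
__constant__ float coeffs[16];   // read-only in kernels, served from the constant cache

// on the host, before launching the kernel:
float host_coeffs[16] = { 1.0f, 0.5f, 0.25f };   // remaining entries are zero-initialized
cudaMemcpyToSymbol (coeffs, host_coeffs, sizeof(host_coeffs));
</syntaxhighlight>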