CUDA tutorial: Difference between revisions
Line 61: | Line 61: | ||
= CUDA Block-Threading Model = | = CUDA Block-Threading Model = | ||
[[File:Cuda-threads-blocks.png|thumbnail|CUDA block-threading model where threads are organized into blocks while blocks are further organized into grid. ]] | |||
Given very large number of threads (and in order to achieve massive parallelism one has to use all the threads possible) in CUDA kernel, one needs to organize them somehow. in CUDA, all the threads are structured in threading blocks, the blocks are further organized into grids, as shown on FIg. In dividing the threads we make sure that the following is satisfied: | Given very large number of threads (and in order to achieve massive parallelism one has to use all the threads possible) in CUDA kernel, one needs to organize them somehow. in CUDA, all the threads are structured in threading blocks, the blocks are further organized into grids, as shown on FIg. In dividing the threads we make sure that the following is satisfied: | ||
* threads within a block cooperate via the shared memory | * threads within a block cooperate via the shared memory | ||
* threads in different blocks can not cooperate | * threads in different blocks can not cooperate | ||
[[File:Cuda-threads | In this model the threads within a block work on the same set of instructions (but perhaps with different data sets) and exchange data between each other via shared memory. Threads in other blocks do the same thing (see Figure). | ||
[[File:Cuda-threads.png|thumbnail|Threads within a block intercommunicate via shared memory . ]] | |||
= First CUDA C Program= | = First CUDA C Program= |
Revision as of 19:20, 14 September 2017
This is not a complete article: This is a draft, a work in progress that is intended to be published into an article, which may or may not be ready for inclusion in the main wiki. It should not necessarily be considered factual or authoritative.
Introduction
This tutorial introduces the Graphics Processing Unit (GPU) as a massively parallel computing device, the CUDA parallel programming language, and some of the CUDA numerical libraries for use in high performance computing.
This tutorial uses CUDA to accelerate C or C++ code. A working knowledge of one of these languages is therefore required to gain the most benefit out of it. Even though Fortran is also supported by CUDA, for the purpose of this tutorial we only cover the CUDA C/C++. From here on, we use term CUDA C to refer "CUDA C and C++". CUDA C is essentially a C/C++ that allow one to execute function on both GPU and CPU.
- Understanding the architecture of a GPU.
- Understanding the workflow of a CUDA program
- Managing GPU memory and understanding the various types of GPU memory
- Writing and compiling a minimal CUDA code and compiling CUDA examples
What is GPU ?
GPU, or a graphics processing unit, is a single-chip processor that performs rapid mathematical calculations, primarily for the purpose of rendering images. However, in the recent years, such capability is being harnessed more broadly to accelerate computational workloads of the cutting-edge scientific research areas.
What is CUDA ?
CUDA = Compute Unified Device Architecture Provides access to instructions and memory of massively parallel elements in GPU. Another definition: CUDA is scalable parallel programming model and software environment for parallel computing.
CUDA GPU Architecture
There two main components of the GPU:
- Global memory
- Similar to CPU memory
- Accessible by both CPU and GPU
- Streaming multiprocessors (SMs)
- Each SM consists or many streaming processors (SPs)
- They perform actual computations
- Each SM has its own control init, registers, execution pipelines, etc
CUDA Programming Model
Before we start talking about programming model, let us go over some useful terminology:
- Host – The CPU and its memory (host memory)
- Device – The GPU and its memory (device memory)
The CUDA programming model is a heterogeneous model in which both the CPU and GPU are used. CUDA code is capable of managing memory of both CPU and GPU as well as executing GPU functions, called kernels. Such kernels are executed by many GPU threads in parallel. Here is the 5-steps recipe of a typical CUDA code:
- Declare and allocate both the Host and Device memories
- Initialize the Host memory
- Transfer data from Host memory to Device memory
- Execute GPU functions (kernels)
- Transfer data back to the Host memory
CUDA Execution Model
Simple CUDA code executed on GPU is called KERNEL. There are several questions we may ask at this point:
- How do you run a Kernel on a bunch of streaming multiprocessors (SMs) ?
- How do you make such run massively parallel ?
Here is the execution recipe that will answer the above questions:
- each GPU core (streaming processor) execute a sequential Thread, where Thread is a smallest set of instructions handled by the operating system's schedule.
- all GPU cores execute the kernel in a SIMT fashion (Single Instruction Multiple Threads)
Usually the following procedure is recommended when it comes to executing on GPU: 1. Copy input data from CPU memory to GPU memory 2. Load GPU program (Kernel) and execute it 3. Copy results from GPU memory back to CPU memory
CUDA Block-Threading Model
Given very large number of threads (and in order to achieve massive parallelism one has to use all the threads possible) in CUDA kernel, one needs to organize them somehow. in CUDA, all the threads are structured in threading blocks, the blocks are further organized into grids, as shown on FIg. In dividing the threads we make sure that the following is satisfied:
- threads within a block cooperate via the shared memory
- threads in different blocks can not cooperate
In this model the threads within a block work on the same set of instructions (but perhaps with different data sets) and exchange data between each other via shared memory. Threads in other blocks do the same thing (see Figure).
First CUDA C Program
__global__ void add (int *a, int *b, int *c){
*c = *a + *b;
}
int main(void){
int a, b, c;
int *dev_a, *dev_b, *dev_c;
int size = sizeof(int);
// allocate device copies of a,b, c
cudaMalloc ( (void**) &dev_a, size);
cudaMalloc ( (void**) &dev_b, size);
cudaMalloc ( (void**) &dev_c, size);
a=2; b=7;
// copy inputs to device
cudaMemcpy (dev_a, &a, size, cudaMemcpyHostToDevice);
cudaMemcpy (dev_b, &b, size, cudaMemcpyHostToDevice);
// launch add() kernel on GPU, passing parameters
add <<< 1, 1 >>> (dev_a, dev_b, dev_c);
// copy device result back to host
cudaMemcpy (&c, dev_c, size, cudaMemcpyDeviceToHost);
cudaFree ( dev_a ); cudaFree ( dev_b ); cudaFree ( dev_c );
}
Are we missing anything ? That code does not look parallel ! Solution: Lets look at what inside the triple brackets in the Kernel call and make some changes :
add <<< N, 1 >>> (dev_a, dev_b, dev_c);
Here we replaced 1 by N, so that N different cuda blocks will be executed at the same time. However, in order to achieve a parallelism we need to make some changes to the Kernel as well:
__global__ void add (int *a, int *b, int *c){
c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
where blockIdx.x is the unique number identifying a cuda block. This way each cuda block adds a value from a[ ] to b[ ].
Can we again make some modifications in those triple brackets ?
add <<< 1, '''N''' >>> (dev_a, dev_b, dev_c);
Now instead of blocks, the job is distributed across parallel threads. What is the advantage of having parallel threads ? Unlike blocks, threads can communicate between each other ? In other words, we parallelize across multiple threads in the block when massive communication is involved. The chunks of code that can run independently (without much communication) are distributed across parallel blocks.
So far all the memory transfers in the kernel have been done via the regular GPU (global) memory which is relatively slow. Often time we have so many communications between the threads that decreases the performance significantly. In order to address this issue there exist another type of memory called Shared memory which can be used to speed-up the memory operations between the threads. However the trick is that only the threads within a block can communicate. In order to demonstrate the usage of such shared memory we consider the dot product example where two vectors are dot-multipled. Below is the kernel:
__global__ void dot(int *a, int *b, int *c){
int temp = a[threadIdx.x]*b[threadIdx.x];
}
After each thread computed its portion, we need to add everything together. Each threads has to share its data. However, the problem is that each copy of thread's temp is private.This can resolved with the use of shared memory. Below is the kernel with the modifications to account the shared memory usage:
#define N 512
__global__ void dot(int *a, int *b, int *c){
__shared__ int temp[N];
temp[threadIdx.x] = a[threadIdx.x]*b[threadIdx.x];
__syncthreads();
if(threadIdx.x==0){
int sum; for(int i=0;i<N;i++) sum+= temp[i];
*c=sum; }
}
Basic Performance Considerations
Memory Transfers
- PCI-e is extremely slow (4-6 GB/s) compared to both host and device memories
- Minimize Host-to-Device and Device-to-Host memory copies