Introduction

This tutorial introduces the graphics processing unit (GPU) as a massively parallel computing device, the CUDA parallel programming model, and some of the CUDA numerical libraries used in high-performance computing.

Prerequisites for this tutorial

This tutorial uses CUDA to accelerate C or C++ code, so a working knowledge of one of these languages is required to get the most out of it. Although CUDA also supports Fortran, this tutorial covers only CUDA C/C++. From here on, we use the term CUDA C to refer to "CUDA C and C++". CUDA C is essentially C/C++ with extensions that allow one to execute functions on both the GPU and the CPU.

Learning objectives

Understanding the architecture of a GPU
Understanding the workflow of a CUDA program
Managing GPU memory and understanding the various types of GPU memory
Writing and compiling a minimal CUDA program, and compiling the CUDA examples

What is a GPU?

A GPU, or graphics processing unit, is a single-chip processor that performs rapid mathematical calculations, primarily for rendering images. In recent years, however, this capability has been harnessed more broadly to accelerate computational workloads in cutting-edge scientific research.

What is CUDA?

CUDA (Compute Unified Device Architecture) provides access to the instructions and memory of the massively parallel elements in a GPU. Another definition: CUDA is a scalable parallel programming model and software environment for parallel computing.

CUDA GPU Architecture

There are two main components of the GPU:

Global memory
- Similar to CPU memory
- Accessible by both CPU and GPU
Streaming multiprocessors (SMs)
- Each SM consists of many streaming processors (SPs)
- They perform the actual computations
- Each SM has its own control unit, registers, execution pipelines, etc.
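
To see these components on a real machine, we can ask the CUDA runtime what it reports for each GPU. The following is a minimal sketch, not part of the original tutorial: the function name print_device_summary is our own, while cudaGetDeviceCount, cudaGetDeviceProperties and the cudaDeviceProp fields belong to the CUDA runtime API.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical helper (name is ours): prints the SM count and the size of
// global memory for every GPU visible to the CUDA runtime.
// Returns the number of devices found, or -1 if the runtime reports an error.
int print_device_summary(void) {
	int count = 0;
	if (cudaGetDeviceCount(&count) != cudaSuccess) return -1;

	for (int dev = 0; dev < count; ++dev) {
		cudaDeviceProp prop;
		if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) return -1;

		printf("Device %d: %s\n", dev, prop.name);
		printf("  Streaming multiprocessors (SMs): %d\n", prop.multiProcessorCount);
		printf("  Global memory: %.1f GiB\n",
		       prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
		printf("  Registers available per block: %d\n", prop.regsPerBlock);
	}
	return count;
}
```

Calling print_device_summary() from main and compiling the file with nvcc should list one entry per installed GPU; the numbers depend entirely on the hardware.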

CUDA Programming Model

Before we discuss the programming model, let us go over some useful terminology:

Host – The CPU and its memory (host memory)
Device – The GPU and its memory (device memory)

The CUDA programming model is a heterogeneous model in which both the CPU and the GPU are used. CUDA code can manage the memory of both the CPU and the GPU and can launch GPU functions, called kernels, which are executed by many GPU threads in parallel. Here is the five-step recipe of a typical CUDA code:

Declare and allocate both Host and Device memory
Initialize the Host memory
Transfer data from Host memory to Device memory
Execute GPU functions (kernels)
Transfer data back to the Host memory
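
The five steps above can be sketched on a small array. This is an illustrative example under our own assumptions: the names double_all and double_on_gpu are hypothetical, and the kernel deliberately uses a single GPU thread so that the focus stays on the memory steps rather than on parallel execution.

```cuda
#include <stddef.h>
#include <cuda_runtime.h>

// Kernel: a single GPU thread doubles all n elements.
// Deliberately not parallel; it only illustrates the workflow.
__global__ void double_all(float *data, int n) {
	for (int i = 0; i < n; ++i) data[i] *= 2.0f;
}

// Hypothetical helper: fills host[0..n-1] with 0, 1, 2, ... and doubles
// every element on the GPU. Returns 0 on success, -1 on any CUDA error.
int double_on_gpu(float *host, int n) {
	size_t bytes = n * sizeof(float);
	float *dev = NULL;

	// Step 1: host memory is supplied by the caller; allocate device memory
	if (cudaMalloc((void **)&dev, bytes) != cudaSuccess) return -1;

	// Step 2: initialize the host memory
	for (int i = 0; i < n; ++i) host[i] = (float)i;

	// Step 3: transfer data from host memory to device memory
	cudaError_t err = cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);

	// Step 4: execute the kernel (one block containing one thread)
	if (err == cudaSuccess) {
		double_all<<<1, 1>>>(dev, n);
		err = cudaGetLastError();
	}

	// Step 5: transfer the result back to host memory
	if (err == cudaSuccess)
		err = cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);

	cudaFree(dev);
	return err == cudaSuccess ? 0 : -1;
}
```

After a successful call with n = 4, the host array holds each initial value doubled.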

CUDA Execution Model

Code that is executed on the GPU is called a kernel. There are several questions we may ask at this point:

How do you run a kernel on many streaming multiprocessors (SMs)?
How do you make such a run massively parallel?

Here is the execution recipe that will answer the above questions:

Each GPU core (streaming processor) executes a sequential thread, where a thread is the smallest sequence of instructions that can be managed independently by a scheduler (on a GPU, the hardware scheduler of the SM rather than the operating system).
All GPU cores execute the kernel in a SIMT fashion (Single Instruction, Multiple Threads).
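
As an illustration of this recipe, the sketch below launches many threads that all run the same kernel, each on a different array element. The names vec_add and vec_add_on_gpu and the choice of 256 threads per block are our own assumptions; blockIdx, blockDim and threadIdx are the built-in CUDA variables that tell a thread where it sits in the launch.

```cuda
#include <stddef.h>
#include <cuda_runtime.h>

// Every thread runs this same code (SIMT) but computes a different index i.
__global__ void vec_add(const int *a, const int *b, int *c, int n) {
	int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
	if (i < n) c[i] = a[i] + b[i];                  // surplus threads do nothing
}

// Hypothetical helper: computes c = a + b for n > 0 elements on the GPU.
// Returns 0 on success, -1 on any CUDA error.
int vec_add_on_gpu(const int *a, const int *b, int *c, int n) {
	size_t bytes = n * sizeof(int);
	int *dev_a = NULL, *dev_b = NULL, *dev_c = NULL;

	cudaError_t err = cudaMalloc((void **)&dev_a, bytes);
	if (err == cudaSuccess) err = cudaMalloc((void **)&dev_b, bytes);
	if (err == cudaSuccess) err = cudaMalloc((void **)&dev_c, bytes);
	if (err == cudaSuccess) err = cudaMemcpy(dev_a, a, bytes, cudaMemcpyHostToDevice);
	if (err == cudaSuccess) err = cudaMemcpy(dev_b, b, bytes, cudaMemcpyHostToDevice);

	if (err == cudaSuccess) {
		int threads = 256;                         // threads per block (assumed)
		int blocks = (n + threads - 1) / threads;  // enough blocks to cover n
		vec_add<<<blocks, threads>>>(dev_a, dev_b, dev_c, n);
		err = cudaGetLastError();
	}
	if (err == cudaSuccess) err = cudaMemcpy(c, dev_c, bytes, cudaMemcpyDeviceToHost);

	// cudaFree on a NULL pointer performs no operation
	cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
	return err == cudaSuccess ? 0 : -1;
}
```

The blocks are distributed over the available SMs by the hardware, so the same code runs unchanged on GPUs with different numbers of SMs.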

First CUDA C Program

#include <stdio.h>

// Kernel: runs on the GPU and adds the two integers that a and b point to
__global__ void add(int *a, int *b, int *c) {
	*c = *a + *b;
}

int main(void) {
	int a, b, c;                  // host copies of a, b, c
	int *dev_a, *dev_b, *dev_c;   // device copies of a, b, c
	int size = sizeof(int);

	// allocate device copies of a, b, c
	cudaMalloc((void **)&dev_a, size);
	cudaMalloc((void **)&dev_b, size);
	cudaMalloc((void **)&dev_c, size);

	a = 2; b = 7;
	// copy inputs to device
	cudaMemcpy(dev_a, &a, size, cudaMemcpyHostToDevice);
	cudaMemcpy(dev_b, &b, size, cudaMemcpyHostToDevice);

	// launch add() kernel on GPU (one block of one thread), passing parameters
	add<<<1, 1>>>(dev_a, dev_b, dev_c);

	// copy device result back to host (this call waits for the kernel to finish)
	cudaMemcpy(&c, dev_c, size, cudaMemcpyDeviceToHost);

	printf("%d + %d = %d\n", a, b, c);

	// release device memory
	cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
	return 0;
}
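
The program above ignores the status that every CUDA runtime call returns. The following variant is a sketch of one possible way to check it, not the only or the official one: the helper ok and the wrapper add_on_gpu are hypothetical names, while cudaGetErrorString and cudaGetLastError are part of the CUDA runtime API.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void add(int *a, int *b, int *c) {
	*c = *a + *b;
}

// Hypothetical helper: reports a failed CUDA call; returns 1 on success, 0 on failure.
static int ok(cudaError_t err, const char *what) {
	if (err != cudaSuccess) {
		fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
		return 0;
	}
	return 1;
}

// Adds a and b on the GPU. Returns 0 and stores the sum in *result on
// success; returns -1 if any CUDA call fails.
int add_on_gpu(int a, int b, int *result) {
	int *dev_a = NULL, *dev_b = NULL, *dev_c = NULL;
	int size = sizeof(int);

	// && stops at the first failure, so later calls are skipped
	int good = ok(cudaMalloc((void **)&dev_a, size), "cudaMalloc dev_a")
	        && ok(cudaMalloc((void **)&dev_b, size), "cudaMalloc dev_b")
	        && ok(cudaMalloc((void **)&dev_c, size), "cudaMalloc dev_c")
	        && ok(cudaMemcpy(dev_a, &a, size, cudaMemcpyHostToDevice), "copy a")
	        && ok(cudaMemcpy(dev_b, &b, size, cudaMemcpyHostToDevice), "copy b");

	if (good) {
		add<<<1, 1>>>(dev_a, dev_b, dev_c);
		// The launch itself returns nothing; cudaGetLastError reports launch failures
		good = ok(cudaGetLastError(), "kernel launch")
		    && ok(cudaMemcpy(result, dev_c, size, cudaMemcpyDeviceToHost), "copy c");
	}

	// cudaFree on a NULL pointer performs no operation, so this is safe after a failure
	cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
	return good ? 0 : -1;
}
```

On a machine without a usable GPU, this version prints which call failed instead of silently producing an undefined result.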
