Revision by rsnt_translations (56,437 edits). No edit summary.
Line 48: | Line 48: | ||
* Transfer data back to the host memory | * Transfer data back to the host memory | ||
= | =Execution model= <!--T:2--> | ||
Simple CUDA code executed on GPU is called a ''kernel''. There are several questions we may ask at this point: | Simple CUDA code executed on GPU is called a ''kernel''. There are several questions we may ask at this point: | ||
* How do you run a kernel on a bunch of streaming multiprocessors (SMs)? | * How do you run a kernel on a bunch of streaming multiprocessors (SMs)? | ||
Line 60: | Line 60: | ||
3. Copy results from GPU memory back to CPU memory | 3. Copy results from GPU memory back to CPU memory | ||
= | = Block-threading model = <!--T:3--> | ||
<!--T:4--> | <!--T:4--> | ||
Line 79: | Line 79: | ||
Usually a streaming multiprocessor (SM) executes one threading block at a time. The code is executed in groups of 32 threads, called warps. A hardware scheduler is free to assign blocks to any SM at any time. Furthermore, when a block is assigned to an SM, it does not mean that this particular block will be executed non-stop. In fact, the scheduler can postpone or suspend execution of such a block under certain conditions, e.g. when its data is not yet available (reading data from the global GPU memory is quite time-consuming). When this happens, the scheduler executes another threading block which is ready for execution. This so-called zero-overhead scheduling keeps the execution streamlined so that SMs are not idle. | Usually a streaming multiprocessor (SM) executes one threading block at a time. The code is executed in groups of 32 threads, called warps. A hardware scheduler is free to assign blocks to any SM at any time. Furthermore, when a block is assigned to an SM, it does not mean that this particular block will be executed non-stop. In fact, the scheduler can postpone or suspend execution of such a block under certain conditions, e.g. when its data is not yet available (reading data from the global GPU memory is quite time-consuming). When this happens, the scheduler executes another threading block which is ready for execution. This so-called zero-overhead scheduling keeps the execution streamlined so that SMs are not idle. | ||
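As a rough illustration of the warp grouping described above, the following sketch launches a single block of 64 threads and has each thread record which warp of its block it belongs to. The kernel name <code>record_warp</code> and the array size are hypothetical choices made for this example; <code>warpSize</code> is the CUDA built-in variable holding the warp width. Error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread stores the index of the warp it belongs to.
__global__ void record_warp(int *warp_of_thread)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    warp_of_thread[tid] = threadIdx.x / warpSize;     // warp index within the block
}

int main()
{
    const int n = 64;   // one block of 64 threads (an arbitrary choice for this sketch)
    int host[n];
    int *dev = nullptr;

    cudaMalloc((void **)&dev, n * sizeof(int));

    // Launch 1 block of n threads; the hardware splits the block into warps.
    record_warp<<<1, n>>>(dev);

    // cudaMemcpy waits for the kernel to finish before copying the results back.
    cudaMemcpy(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    printf("first thread: warp %d, last thread: warp %d\n", host[0], host[n - 1]);
    return 0;
}
```

With a warp width of 32, threads 0 to 31 of the block form one warp and threads 32 to 63 form the next. Which warp actually runs at a given moment is decided by the scheduler and is not visible in the code.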
= Types of GPU | = Types of GPU memories= <!--T:7--> | ||
There are several types of memories available for CUDA operations: | There are several types of memories available for CUDA operations: | ||
* Global memory | * Global memory | ||
Line 96: | Line 96: | ||
** Deallocates the object from memory. Requires only a pointer to the array. | ** Deallocates the object from memory. Requires only a pointer to the array. | ||
== | == Data transfer == <!--T:9--> | ||
* cudaMemcpy(array_dest, array_orig, size, direction) | * cudaMemcpy(array_dest, array_orig, size, direction) | ||
** Copies the data from device to host, from host to device, or between two device arrays. Requires pointers to the destination and source arrays, the size in bytes, and the direction type (cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice, etc.) | ** Copies the data from device to host, from host to device, or between two device arrays. Requires pointers to the destination and source arrays, the size in bytes, and the direction type (cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice, etc.) | ||
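The call above can be sketched as a simple round trip: copy an array to the device, copy it straight back, and compare. This is a minimal sketch and not part of the original tutorial code; the variable names (<code>h_in</code>, <code>h_out</code>, <code>d_buf</code>) and the array length are assumptions made for illustration. Note that the size argument is given in bytes, not in elements.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main()
{
    const int n = 1024;                     // number of elements (arbitrary)
    const size_t size = n * sizeof(float);  // cudaMemcpy expects the size in bytes

    // Host arrays: h_in holds the input, h_out receives the copy from the device.
    float *h_in  = (float *)malloc(size);
    float *h_out = (float *)malloc(size);
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    // Device array in global GPU memory.
    float *d_buf = nullptr;
    if (cudaMalloc((void **)&d_buf, size) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    // Destination comes first, then source, then size, then direction.
    cudaMemcpy(d_buf, h_in, size, cudaMemcpyHostToDevice);
    cudaMemcpy(h_out, d_buf, size, cudaMemcpyDeviceToHost);

    // Verify that the data survived the round trip unchanged.
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_in[i] != h_out[i]) ++mismatches;
    printf("mismatches: %d\n", mismatches);

    cudaFree(d_buf);
    free(h_in);
    free(h_out);
    return mismatches == 0 ? 0 : 1;
}
```

In a real program a kernel would run between the two copies. Each <code>cudaMemcpy</code> call also returns a <code>cudaError_t</code> that is worth checking in production code.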
Line 185: | Line 185: | ||
= Basic performance considerations = <!--T:26--> | = Basic performance considerations = <!--T:26--> | ||
== Memory | == Memory transfer == | ||
* PCI-e is extremely slow (4-6 GB/s) compared to both host and device memories | * PCI-e is extremely slow (4-6 GB/s) compared to both host and device memories | ||
* Minimize host-to-device and device-to-host memory copies | * Minimize host-to-device and device-to-host memory copies |