CUDA tutorial: Difference between revisions

CUDA tutorial (view source)

Revision as of 16:55, 3 October 2017

23 bytes removed, 7 years ago

no edit summary

Diane27 (rsnt_translations, 56,437 edits)

@@ Line 48: / Line 48: @@
 * Transfer data back to the host memory
-=CUDA execution model= <!--T:2-->
+=Execution model= <!--T:2-->
 Simple CUDA code executed on the GPU is called a ''kernel''. There are several questions we may ask at this point:
 * How do you run a kernel on a bunch of streaming multiprocessors (SMs)?
@@ Line 60: / Line 60: @@
 . Copy results from GPU memory back to CPU memory
-= CUDA block-threading model = <!--T:3-->
+= Block-threading model = <!--T:3-->
 <!--T:4-->
@@ Line 79: / Line 79: @@
 Usually a streaming multiprocessor (SM) executes one thread block at a time. The code is executed in groups of 32 threads (called warps). A hardware scheduler is free to assign blocks to any SM at any time. Furthermore, when a block is assigned to an SM, this does not mean that the block will be executed non-stop. In fact, the scheduler can postpone or suspend execution of a block under certain conditions, e.g. when data becomes unavailable (indeed, reading data from the global GPU memory is quite time-consuming). When this happens, the scheduler executes another thread block that is ready for execution. This so-called zero-overhead scheduling keeps execution streamlined so that SMs do not sit idle.
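The warp arithmetic implied by the paragraph above can be sketched in plain host-side C++. This is only an illustration: the helper names <code>warps_per_block</code> and <code>blocks_for</code> are hypothetical, and the one-thread-per-element mapping is an assumption rather than something stated in the tutorial.

```cpp
#include <cassert>
#include <cstdint>

// Warp size as stated in the text: code executes in groups of 32 threads.
constexpr std::uint32_t kWarpSize = 32;

// Number of warps needed to hold one block of `threads_per_block` threads.
// A partially filled warp still counts as a whole warp, so round up.
inline std::uint32_t warps_per_block(std::uint32_t threads_per_block) {
    assert(threads_per_block > 0);
    return (threads_per_block + kWarpSize - 1) / kWarpSize;
}

// Number of blocks needed so that each of `n` elements gets its own thread
// (assumed mapping: one thread per element). Rounds up for a partial block.
inline std::uint64_t blocks_for(std::uint64_t n, std::uint32_t threads_per_block) {
    assert(threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}
```

For example, a block of 256 threads occupies 8 warps, while a block of 100 threads still occupies 4 warps because the last warp is only partly filled. Covering 1000 elements with 256-thread blocks takes 4 blocks.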
-= Types of GPU memory in CUDA = <!--T:7-->
+= Types of GPU memories= <!--T:7-->
 There are several types of memories available for CUDA operations:
 * Global memory
@@ Line 96: / Line 96: @@
 ** Deallocates object from the memory. Requires just a pointer to the array.
-== CUDA data transfer == <!--T:9-->
+== Data transfer == <!--T:9-->
 * cudaMemcpy(array_dest, array_orig, size, direction)
 ** Copy the data from either device to host or host to device. Requires pointers to the arrays, size and the direction type (cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice, etc.)
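The size passed to cudaMemcpy is a byte count, and the argument order (destination, source, size) matches that of <code>std::memcpy</code>. The host-only sketch below uses <code>std::memcpy</code> as a stand-in so that it compiles without the CUDA toolkit. It does not call the CUDA runtime, and <code>copy_floats</code> is a hypothetical helper introduced only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Host-only stand-in for a cudaMemcpy-style copy (no CUDA runtime involved).
// Copies all of `orig` into the front of `dest` and returns the number of
// bytes copied. The size is computed in bytes, not in elements.
inline std::size_t copy_floats(std::vector<float>& dest,
                               const std::vector<float>& orig) {
    assert(dest.size() >= orig.size());
    const std::size_t size = orig.size() * sizeof(float);  // bytes
    if (size == 0) {
        return 0;  // nothing to copy; avoid passing null pointers to memcpy
    }
    std::memcpy(dest.data(), orig.data(), size);  // (destination, source, size)
    return size;
}
```

In a real CUDA program the same byte count would be passed to cudaMemcpy together with a direction such as cudaMemcpyHostToDevice. Passing the element count instead of the byte count is a common mistake that copies only part of the array.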
@@ Line 185: / Line 185: @@
 = Basic performance considerations =  <!--T:26-->
-== Memory transfers ==
+== Memory transfer ==
 * PCI-e is extremely slow (4-6 GB/s) compared to both host and device memories
 * Minimize host-to-device and device-to-host memory copies
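To get a feel for why the copies above should be minimized, the transfer cost can be estimated from the quoted 4-6 GB/s range. This is a rough sketch: the helper name <code>transfer_seconds</code> is hypothetical, 1 GB is taken as 10^9 bytes, and latency and other overheads are ignored.

```cpp
#include <cassert>
#include <cstdint>

// Estimated time in seconds to move `bytes` over a link with the given
// sustained bandwidth in GB/s (assuming 1 GB = 1e9 bytes, no latency).
inline double transfer_seconds(std::uint64_t bytes, double gigabytes_per_second) {
    assert(gigabytes_per_second > 0.0);
    return static_cast<double>(bytes) / (gigabytes_per_second * 1e9);
}
```

Under these assumptions, copying 1 GB takes about 0.25 s at 4 GB/s and about 0.17 s at 6 GB/s, and the cost is paid again for every additional copy in either direction.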
