CUDA tutorial

Usually a streaming multiprocessor (SM) executes one thread block at a time. The code is executed in groups of 32 threads, called warps. A hardware scheduler is free to assign blocks to any SM at any time. Moreover, when an SM is assigned a block, this does not mean that the block will run without interruption. The scheduler can suspend a block under certain conditions, e.g. when its data becomes unavailable (indeed, reading data from global GPU memory is quite time-consuming). When that happens, the scheduler switches to another thread block that is ready for execution. This so-called zero-overhead scheduling keeps execution streamlined so that SMs do not sit idle.
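The block/warp organization above can be seen directly from device code. The following is a minimal illustrative sketch (the kernel name and launch configuration are chosen for this example); <code>threadIdx</code>, <code>blockIdx</code>, and <code>warpSize</code> are built-in CUDA variables:

```cuda
#include <cstdio>

// Each thread computes which warp of its block it belongs to.
// warpSize is a built-in device constant (32 on current NVIDIA hardware).
__global__ void showWarps(void)
{
    int warpId = threadIdx.x / warpSize;    // warp index within the block
    if (threadIdx.x % warpSize == 0)        // print one line per warp
        printf("block %d, warp %d\n", blockIdx.x, warpId);
}

int main(void)
{
    // 4 blocks of 64 threads: each block is executed as 2 warps of 32.
    // The hardware scheduler may run these blocks on any SM, in any
    // order, suspending and resuming them as their data becomes ready.
    showWarps<<<4, 64>>>();
    cudaDeviceSynchronize();                // wait for all device printf output
    return 0;
}
```

Note that the order of the printed lines is not deterministic, which is exactly the scheduling freedom described above.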


= Types of GPU memory in CUDA = <!--T:7-->
There are several types of memory available for CUDA operations:
* Global memory
** off-chip, good for I/O, but relatively slow
* Shared memory
** on-chip, good for thread collaboration, very fast
* Registers and local memory
** thread work space, very fast
* Constant memory
** cached, read-only for kernels, good for unchanging parameters
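The memory types above can all appear in a single kernel. The following is a minimal sketch (the kernel, its size <code>N</code>, and the scaling parameter are invented for this example) that reads from global memory, stages data in shared memory, keeps per-thread values in registers, and takes a parameter from constant memory:

```cuda
#include <cstdio>

#define N 256

// Constant memory: cached, read-only from the device side.
__constant__ float scale;

__global__ void sumScaled(const float *in, float *out)
{
    // Shared memory: on-chip, visible to all threads of this block.
    __shared__ float tile[N];

    // Local variables such as `i` and `s` live in registers (or local
    // memory if registers run out): private per-thread work space.
    int i = threadIdx.x;

    // `in` and `out` point to global memory: off-chip, relatively slow.
    tile[i] = in[i] * scale;
    __syncthreads();               // wait until the whole tile is loaded

    if (i == 0) {                  // one thread reduces the shared tile
        float s = 0.0f;
        for (int j = 0; j < N; ++j)
            s += tile[j];
        *out = s;
    }
}

int main(void)
{
    float h_in[N], h_out = 0.0f;
    for (int i = 0; i < N; ++i) h_in[i] = 1.0f;

    float *d_in, *d_out;
    cudaMalloc(&d_in, N * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);

    // Constant memory is filled from the host with cudaMemcpyToSymbol.
    float h_scale = 2.0f;
    cudaMemcpyToSymbol(scale, &h_scale, sizeof(float));

    sumScaled<<<1, N>>>(d_in, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %f\n", h_out);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

The shared-memory tile is the key design choice here: each global-memory element is read once, and the subsequent reduction works entirely out of fast on-chip storage.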