= Thread Scheduling =
Usually a streaming multiprocessor (SM) executes one thread block at a time. The code is executed in groups of 32 threads, called warps. A hardware scheduler is free to assign blocks to any SM at any time. Furthermore, when an SM gets a block assigned to it, it does not mean that this particular block will be executed non-stop. In fact, the scheduler can postpone/suspend the execution of a block under certain conditions, e.g. when its data becomes unavailable (indeed, it takes quite some time to read data from the global GPU memory). When that happens, the scheduler switches to another thread block that is ready for execution. This so-called zero-overhead scheduling keeps execution streamlined, so the SMs are not idling.
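To make the warp notion concrete, here is a minimal sketch (the kernel name and output arrays are made up for illustration; <code>warpSize</code> is a CUDA built-in variable, equal to 32 on current hardware) that records, for every thread, which warp it belongs to and its position within that warp:

<syntaxhighlight lang="cuda">
__global__ void warp_info(int *warp_of, int *lane_of)
{
    // Global index of this thread in a 1D grid
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Threads are issued in groups of warpSize (32 on current hardware);
    // all threads of a warp execute the same instruction in lock-step.
    warp_of[tid] = threadIdx.x / warpSize;  // warp index within this block
    lane_of[tid] = threadIdx.x % warpSize;  // thread's lane inside its warp
}
</syntaxhighlight>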
= GPU Memories in CUDA =
There are several types of memory available for CUDA operations (a short sketch follows the list):
* Global memory
** off-chip, good for I/O, but relatively slow
* Shared memory
** on-chip, good for thread collaboration, very fast
* Registers & Local memory
** per-thread work space, very fast
* Constant memory
** off-chip but cached, read-only, fast when all threads read the same address
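As a rough sketch of where these memory types appear in CUDA C (the kernel, the fixed block size of 256, and all variable names are assumptions for illustration, not a prescribed pattern):

<syntaxhighlight lang="cuda">
// Constant memory: written by the host with cudaMemcpyToSymbol(),
// read-only inside kernels.
__constant__ float scale;

// in/out point to global memory allocated with cudaMalloc() on the host.
__global__ void scale_copy(const float *in, float *out, int n)
{
    // Shared memory: one copy per thread block, visible to all threads
    // of that block. Assumes the kernel is launched with blockDim.x == 256.
    __shared__ float tile[256];

    int i = blockIdx.x * blockDim.x + threadIdx.x;  // i lives in a register

    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // global -> shared
    __syncthreads();  // every thread in the block must reach this point

    if (i < n)
        out[i] = scale * tile[threadIdx.x];  // constant * shared -> global
}
</syntaxhighlight>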
= First CUDA C Program =