</syntaxhighlight>


= Basic performance considerations =
== Memory transfers ==
* PCI-e is extremely slow (4-6 GB/s) compared to both host and device memories
* Minimize host-to-device and device-to-host memory copies
* Keep data on the device as long as possible (see the sketch after this list)
* Sometimes it is not efficient to make the host (CPU) do non-optimal jobs; executing a task on the GPU may still be faster than copying the data to the CPU, executing there, and copying the result back
* Take memcpy times into account when analysing execution times
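
A minimal sketch of this principle, using two hypothetical kernels (<code>scaleVector</code> and <code>addOffset</code>) as stand-ins for arbitrary processing steps: the data is copied to the device once, every kernel works on it there, and only the final result is copied back to the host.

<syntaxhighlight lang="cpp">
#include <cstdlib>

// Hypothetical kernels standing in for any two processing steps;
// the intermediate result never leaves device memory.
__global__ void scaleVector(float *d_a, float factor, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) d_a[i] *= factor;
}

__global__ void addOffset(float *d_a, float offset, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) d_a[i] += offset;
}

int main(void) {
  const int n = 1 << 20;
  size_t bytes = n * sizeof(float);
  float *h_a = (float *)malloc(bytes);
  for (int i = 0; i < n; i++) h_a[i] = 1.0f;

  float *d_a;
  cudaMalloc(&d_a, bytes);

  // One host-to-device copy ...
  cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);

  // ... both kernels then operate on data that stays on the device,
  // instead of copying back to the host between the two steps ...
  int threads = 256;
  int blocks = (n + threads - 1) / threads;
  scaleVector<<<blocks, threads>>>(d_a, 2.0f, n);
  addOffset<<<blocks, threads>>>(d_a, 1.0f, n);

  // ... and a single device-to-host copy brings back the final result.
  cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);

  cudaFree(d_a);
  free(h_a);
  return 0;
}
</syntaxhighlight>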


== Bandwidth == <!--T:21-->
* Always keep CUDA bandwidth limitations in mind when changing your code
* Know the theoretical peak bandwidth of the various data links
* Count bytes read/written and compare to the theoretical peak (see the timing sketch after this list)
* Utilize the various memory spaces depending on the situation: global, shared, constant
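
As a rough illustration of counting bytes, the sketch below times a simple copy kernel with CUDA events and reports its effective bandwidth, (bytes read + bytes written) / elapsed time, which can then be compared against the theoretical peak of the device. The kernel name <code>copyKernel</code> and the array size are only placeholders for this example.

<syntaxhighlight lang="cpp">
#include <cstdio>

// Placeholder kernel: reads each element once and writes it once.
__global__ void copyKernel(float *d_out, const float *d_in, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) d_out[i] = d_in[i];
}

int main(void) {
  const int n = 1 << 24;
  size_t bytes = n * sizeof(float);
  float *d_in, *d_out;
  cudaMalloc(&d_in, bytes);
  cudaMalloc(&d_out, bytes);

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  int threads = 256;
  int blocks = (n + threads - 1) / threads;

  cudaEventRecord(start);
  copyKernel<<<blocks, threads>>>(d_out, d_in, n);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);

  // Each element is read once and written once: 2 * bytes moved in total.
  double effectiveGBs = (2.0 * bytes) / (ms * 1.0e6);
  printf("Effective bandwidth: %.1f GB/s\n", effectiveGBs);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  cudaFree(d_in);
  cudaFree(d_out);
  return 0;
}
</syntaxhighlight>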


== Common GPU programming strategies == <!--T:22-->
* Constant memory also resides in DRAM - much slower access than shared memory
** but it is cached!
** access is highly efficient for read-only, broadcast-style patterns
* Carefully divide data according to access patterns (the sketch after this list touches each of these spaces):
** read-only: constant memory (very fast if in cache)
** read/write within a block: shared memory (very fast)
** read/write within a thread: registers (very fast)
** read/write input/results: global memory (very slow)
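
The sketch below is one illustrative way to touch each of these memory spaces: a read-only scale factor broadcast from constant memory, neighbouring elements reused within a block through shared memory, per-thread temporaries in registers, and input/results in global memory. The kernel name <code>blockAverage</code> and the block size of 256 are assumptions made for this example.

<syntaxhighlight lang="cpp">
#define BLOCK_SIZE 256

// Read-only value needed by every thread: constant memory (cached, broadcast).
__constant__ float c_scale;

__global__ void blockAverage(float *d_out, const float *d_in, int n) {
  // Elements read by more than one thread in the block: shared memory.
  __shared__ float s_in[BLOCK_SIZE];

  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) s_in[threadIdx.x] = d_in[i];   // global -> shared
  __syncthreads();

  if (i < n) {
    // Per-thread temporaries: registers.
    float left  = s_in[threadIdx.x];
    float right = (threadIdx.x + 1 < blockDim.x && i + 1 < n)
                    ? s_in[threadIdx.x + 1] : left;
    // Result goes back to global memory.
    d_out[i] = c_scale * 0.5f * (left + right);
  }
}

int main(void) {
  const int n = 1 << 20;
  size_t bytes = n * sizeof(float);

  float *d_in, *d_out;
  cudaMalloc(&d_in, bytes);
  cudaMalloc(&d_out, bytes);
  cudaMemset(d_in, 0, bytes);

  // Fill constant memory from the host.
  float h_scale = 2.0f;
  cudaMemcpyToSymbol(c_scale, &h_scale, sizeof(h_scale));

  int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
  blockAverage<<<blocks, BLOCK_SIZE>>>(d_out, d_in, n);
  cudaDeviceSynchronize();

  cudaFree(d_in);
  cudaFree(d_out);
  return 0;
}
</syntaxhighlight>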
</translate>