CUDA tutorial

== Host-Device Memory Transfers ==
* PCI-e is extremely slow (4-6 GB/s) compared to both host and device memories
* Minimize Host-to-Device and Device-to-Host memory copies
* Keep data on the device as long as possible
* Sometimes it is not efficient to hand non-optimal jobs back to the Host (CPU): even a kernel that runs poorly on the GPU may still be faster than copying the data to the CPU, executing there, and copying the results back
* Account for memcpy times when analysing execution times (a timing sketch follows this list)
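The following is a minimal sketch of these points: it times a Host-to-Device copy with CUDA events, chains two kernel launches so the data stays resident on the device, and performs a single Device-to-Host copy at the end. The <code>scale</code> kernel and the array size are illustrative assumptions, not part of the tutorial's code.

<syntaxhighlight lang="cuda">
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative kernel (assumed): multiply every element by a factor.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int N = 1 << 20;                    // assumed problem size
    size_t bytes = N * sizeof(float);
    float *h = (float *)malloc(bytes);
    for (int i = 0; i < N; ++i) h[i] = 1.0f;

    float *d;
    cudaMalloc(&d, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time the Host-to-Device copy with CUDA events.
    cudaEventRecord(start);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    printf("H2D copy: %.3f ms (%.2f GB/s)\n", ms, bytes / ms / 1e6);

    // Keep the data on the device: chain kernel launches instead of
    // copying back to the host between steps.
    scale<<<(N + 255) / 256, 256>>>(d, N, 2.0f);
    scale<<<(N + 255) / 256, 256>>>(d, N, 0.5f);

    // One Device-to-Host copy at the very end.
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d);
    free(h);
    return 0;
}
</syntaxhighlight>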
== Bandwidth ==
* Always keep memory bandwidth in mind when changing your code
* Know the theoretical peak bandwidth of the various data links
* Count the bytes read and written by each kernel and compare the effective bandwidth to the theoretical peak (see the sketch after this list)
* Utilize the various memory spaces depending on the situation: global, shared, constant
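As a rough illustration, the sketch below times a simple copy kernel with CUDA events, counts the bytes it moves (one read and one write per element), and compares the resulting effective bandwidth to the theoretical peak derived from the device's memory clock and bus width. The <code>copyKernel</code> name and the array size are assumptions made for the example.

<syntaxhighlight lang="cuda">
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel (assumed): copy one array into another.
__global__ void copyKernel(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main() {
    const int N = 1 << 24;                    // assumed problem size
    float *in, *out;
    cudaMalloc(&in, N * sizeof(float));
    cudaMalloc(&out, N * sizeof(float));
    cudaMemset(in, 0, N * sizeof(float));

    // Theoretical peak: memory clock (reported in kHz) x 2 (DDR)
    // x bus width in bytes, scaled to GB/s.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    double peak = 2.0 * prop.memoryClockRate * (prop.memoryBusWidth / 8) / 1e6;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    copyKernel<<<(N + 255) / 256, 256>>>(out, in, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    // One read plus one write per element.
    double bytesMoved = 2.0 * N * sizeof(float);
    printf("effective: %.1f GB/s, theoretical peak: %.1f GB/s\n",
           bytesMoved / ms / 1e6, peak);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
</syntaxhighlight>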
== Common GPU Programming Strategies ==
* Constant memory also resides in DRAM, so raw access is much slower than shared memory
** However, it is cached!
** Access is highly efficient for read-only, broadcast patterns
* Carefully divide data according to its access pattern (see the sketch below):
** Read-only: constant memory (very fast if in cache)
** Read/write within a block: shared memory (very fast)
** Read/write within a thread: registers (very fast)
** Read/write inputs/results: global memory (very slow)
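The sketch below maps each access pattern in this list onto the corresponding memory space: read-only coefficients broadcast to all threads live in constant memory, a per-block tile sits in shared memory, the per-thread accumulator stays in registers, and the input/output arrays are in global memory. The polynomial-evaluation kernel and its sizes are illustrative assumptions; shared memory is shown only to mark the block-level space, since this kernel has no inter-thread reuse.

<syntaxhighlight lang="cuda">
#include <cuda_runtime.h>

#define DEGREE 8
#define BLOCK  256

// Read-only, broadcast to all threads: constant memory (cached).
__constant__ float c_coeff[DEGREE];

__global__ void poly(const float *in, float *out, int n) {
    // R/W within the block: shared memory.
    __shared__ float tile[BLOCK];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // global -> shared
    __syncthreads();

    // R/W within the thread: registers.
    float x = tile[threadIdx.x];
    float acc = 0.0f;
    for (int k = 0; k < DEGREE; ++k)
        acc = acc * x + c_coeff[k];  // Horner's rule; coefficients broadcast

    if (i < n)
        out[i] = acc;                // result back to global memory
}

int main() {
    const int N = 1 << 20;           // assumed problem size
    float h_coeff[DEGREE] = {1, 2, 3, 4, 5, 6, 7, 8};
    cudaMemcpyToSymbol(c_coeff, h_coeff, sizeof(h_coeff));

    float *d_in, *d_out;
    cudaMalloc(&d_in, N * sizeof(float));
    cudaMalloc(&d_out, N * sizeof(float));
    cudaMemset(d_in, 0, N * sizeof(float));

    poly<<<(N + BLOCK - 1) / BLOCK, BLOCK>>>(d_in, d_out, N);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
</syntaxhighlight>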