* PCI-e is extremely slow (4-6 GB/s) compared to both host and device memories
* Minimize Host-to-Device and Device-to-Host memory copies
* Keep data on the device as long as possible
* Sometimes it is inefficient to make the Host (CPU) do non-optimal jobs; executing them on the GPU may still be faster than copying to the CPU, executing there, and copying back
* Include memcpy times when analysing execution times
== Bandwidth ==
* Always keep CUDA bandwidth in mind when changing your code
* Know the theoretical peak bandwidth of the various data links
* Count bytes read/written and compare to the theoretical peak
* Utilize the various memory spaces depending on the situation: global, shared, constant
== Common GPU Programming Strategies ==
* Constant memory also resides in DRAM, so access is much slower than shared memory
** BUT, it is cached!
** highly efficient for read-only, broadcast access
* Carefully divide data according to access patterns:
** Read-only: constant memory (very fast if in cache)
** Read/write within a block: shared memory (very fast)
** Read/write within a thread: registers (very fast)
** Read/write inputs/results: global memory (very slow)
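The access-pattern split above can be sketched in a single kernel. This is an illustrative example, not from the original page: it assumes a hypothetical task of scaling each input element by a read-only 256-entry coefficient table, with a block size of 256.

```cuda
#include <cuda_runtime.h>

/* Read-only, broadcast data: constant memory (cached in the constant cache) */
__constant__ float coeff[256];

__global__ void scale(const float *in, float *out, int n) {
    /* Read/write within the block: shared memory */
    __shared__ float tile[256];

    /* Read/write within the thread: local variables live in registers */
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n) {
        /* Inputs arrive via slow global memory; stage them into shared */
        tile[threadIdx.x] = in[i];
        __syncthreads();

        /* Results go back out through global memory */
        out[i] = tile[threadIdx.x] * coeff[threadIdx.x % 256];
    }
}
```

Each memory space appears exactly once: constant for the broadcast table, shared for block-local staging, registers for per-thread indices, and global only at the input/output boundary.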