</syntaxhighlight>
= Basic performance considerations =
== Memory transfers ==
* PCIe is extremely slow (4-6 GB/s) compared with both host and device memory
* Minimize host-to-device and device-to-host memory copies
* Keep data on the device as long as possible
* Handing a step that suits the GPU poorly back to the host (CPU) is not always efficient; executing it on the GPU may still be faster than copying the data to the CPU, executing there, and copying the results back
* Time the memory copies separately when analysing execution times
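The last point can be sketched with CUDA events placed around a single <code>cudaMemcpy</code>. This is a minimal illustration, not part of the original tutorial code: the helper name <code>time_h2d_copy_ms</code> and the 256 MiB buffer size are assumptions chosen for the example.

<syntaxhighlight lang="cuda">
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: time one host-to-device copy of `bytes` bytes.
// Returns elapsed milliseconds, or -1 if an allocation or the copy fails.
float time_h2d_copy_ms(size_t bytes)
{
    float *h_buf = (float *)calloc(bytes, 1);   // zero-filled pageable host buffer
    float *d_buf = NULL;
    if (h_buf == NULL) return -1.0f;
    if (cudaMalloc((void **)&d_buf, bytes) != cudaSuccess) {
        free(h_buf);
        return -1.0f;
    }

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaError_t err = cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);                 // wait until the copy has finished

    float ms = -1.0f;
    if (err == cudaSuccess) cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    free(h_buf);
    return ms;
}

int main(void)
{
    const size_t bytes = (size_t)256 << 20;     // 256 MiB (illustrative size)
    float ms = time_h2d_copy_ms(bytes);
    if (ms <= 0.0f) {
        fprintf(stderr, "copy failed or could not be timed\n");
        return 1;
    }
    printf("host-to-device: %.3f ms, %.2f GB/s\n",
           ms, (bytes / 1.0e9) / (ms / 1.0e3));
    return 0;
}
</syntaxhighlight>

Comparing the reported rate with the PCIe figure quoted above, and the copy time with the kernel time, shows whether transfers or computation dominate. The same pattern with <code>cudaMemcpyDeviceToHost</code> times the return trip.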
== Bandwidth == <!--T:21-->
* Always keep CUDA bandwidth limitations in mind when changing your code
* Know the theoretical peak bandwidth of the various data links
* Count bytes read/written and compare to the theoretical peak
* Utilize the various memory spaces depending on the situation: global, shared, constant
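Counting bytes can be illustrated with a plain copy kernel, where every thread reads one <code>float</code> and writes one <code>float</code>. The sketch below is an assumption-laden example rather than code from this tutorial: <code>copy_kernel</code>, <code>effective_bandwidth_gbs</code> and the array size are hypothetical names and values.

<syntaxhighlight lang="cuda">
#include <cstdio>
#include <cuda_runtime.h>

// Each thread reads one float from global memory and writes one float back.
__global__ void copy_kernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Effective bandwidth in GB/s from the bytes a kernel moves and its run time.
double effective_bandwidth_gbs(double bytes_read, double bytes_written, double ms)
{
    return (bytes_read + bytes_written) / 1.0e9 / (ms / 1.0e3);
}

int main(void)
{
    const int n = 1 << 24;                      // 16 Mi floats (illustrative size)
    const size_t bytes = (size_t)n * sizeof(float);
    float *d_in = NULL, *d_out = NULL;
    if (cudaMalloc((void **)&d_in, bytes) != cudaSuccess ||
        cudaMalloc((void **)&d_out, bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_in, 0, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    cudaEventRecord(start, 0);
    copy_kernel<<<blocks, threads>>>(d_in, d_out, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    if (cudaGetLastError() != cudaSuccess || ms <= 0.0f) {
        fprintf(stderr, "kernel failed or could not be timed\n");
        return 1;
    }

    // n floats read plus n floats written.
    printf("effective bandwidth: %.2f GB/s\n",
           effective_bandwidth_gbs((double)bytes, (double)bytes, ms));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
</syntaxhighlight>

The printed figure can then be compared with the peak memory bandwidth quoted for the device. The first launch of a kernel typically includes some start-up overhead, so timing a second run usually gives a more representative number.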
== Common GPU programming strategies == <!--T:22-->
* Constant memory also resides in DRAM, so access is much slower than shared memory
** BUT it is cached!
** highly efficient for read-only, broadcast access (all threads reading the same address)
* Carefully divide data according to access patterns:
** read-only: constant memory (very fast if in cache)
** read/write within block: shared memory (very fast)
** read/write within thread: registers (very fast)
** read/write input/results: global memory (very slow)
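The division by access pattern can be sketched in one small kernel. This example is only an illustration under stated assumptions: the names <code>c_scale</code>, <code>scaled_block_sum</code> and <code>scaled_sum_on_gpu</code>, the block size and the test data are hypothetical and do not come from this tutorial. A scale factor read by every thread sits in constant memory, per-block partial sums are built in shared memory, each thread's own value stays in a register, and only the input and the final per-block results touch global memory.

<syntaxhighlight lang="cuda">
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define BLOCK 256   // threads per block; a power of two for the reduction below

// Read-only value broadcast to all threads: constant memory.
__constant__ float c_scale;

__global__ void scaled_block_sum(const float *in, float *block_sums, int n)
{
    __shared__ float tile[BLOCK];               // read/write within the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    float x = (i < n) ? in[i] : 0.0f;           // per-thread value, held in a register
    tile[threadIdx.x] = c_scale * x;
    __syncthreads();

    // Tree reduction carried out entirely in shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) tile[threadIdx.x] += tile[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) block_sums[blockIdx.x] = tile[0];  // one global write per block
}

// Hypothetical host wrapper: returns scale * sum(h_in), or -1 on failure.
double scaled_sum_on_gpu(const float *h_in, int n, float scale)
{
    const int blocks = (n + BLOCK - 1) / BLOCK;
    float *h_sums = (float *)malloc(blocks * sizeof(float));
    float *d_in = NULL, *d_sums = NULL;
    double total = -1.0;

    if (h_sums != NULL &&
        cudaMalloc((void **)&d_in, n * sizeof(float)) == cudaSuccess &&
        cudaMalloc((void **)&d_sums, blocks * sizeof(float)) == cudaSuccess) {
        cudaMemcpyToSymbol(c_scale, &scale, sizeof(float));
        cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
        scaled_block_sum<<<blocks, BLOCK>>>(d_in, d_sums, n);
        if (cudaMemcpy(h_sums, d_sums, blocks * sizeof(float),
                       cudaMemcpyDeviceToHost) == cudaSuccess) {
            total = 0.0;
            for (int b = 0; b < blocks; ++b) total += h_sums[b];
        }
    }
    cudaFree(d_in);
    cudaFree(d_sums);
    free(h_sums);
    return total;
}

int main(void)
{
    const int n = 1 << 20;
    float *h_in = (float *)malloc(n * sizeof(float));
    if (h_in == NULL) return 1;
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

    double total = scaled_sum_on_gpu(h_in, n, 2.0f);
    printf("total = %.1f (expected %.1f)\n", total, 2.0 * n);
    free(h_in);
    return total < 0.0 ? 1 : 0;
}
</syntaxhighlight>

Each input element is read from global memory exactly once and each block writes a single result, so the slow memory space is touched as little as possible while the repeated accesses happen in shared memory and registers.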
</translate>