* PCI-e is extremely slow (4-6 GB/s) compared to both host and device memories
* Minimize Host-to-Device and Device-to-Host memory copies
* Keep data on the device as long as possible
* Sometimes it is inefficient to make the Host (CPU) do non-optimal jobs; executing them on the GPU may still be faster than copying to the CPU, executing there, and copying back
* Include memcpy times when analysing execution times
== Bandwidth ==
* Always keep CUDA bandwidth in mind when changing your code
* Know the theoretical peak bandwidth of the various data links
* Count bytes read/written and compare to the theoretical peak
* Utilize the various memory spaces depending on the situation: global, shared, constant
== Common GPU Programming Strategies ==
* Constant memory also resides in DRAM, so access is much slower than shared memory
** BUT, it is cached!
** highly efficient for read-only, broadcast access
* Carefully divide data according to access patterns:
** Read-only: constant memory (very fast if in cache)
** Read/write within a block: shared memory (very fast)
** Read/write within a thread: registers (very fast)
** Read/write inputs/results: global memory (very slow)
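The access-pattern split above can be sketched in a single kernel. This is an illustrative example, not from the original page: it assumes a hypothetical task of scaling each input element by a read-only 256-entry coefficient table, with a block size of 256.

```cuda
#include <cuda_runtime.h>

/* Read-only, broadcast data: constant memory (cached in the constant cache) */
__constant__ float coeff[256];

__global__ void scale(const float *in, float *out, int n) {
    /* Read/write within the block: shared memory */
    __shared__ float tile[256];

    /* Read/write within the thread: local variables live in registers */
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n) {
        /* Inputs arrive via slow global memory; stage them into shared */
        tile[threadIdx.x] = in[i];
        __syncthreads();

        /* Results go back out through global memory */
        out[i] = tile[threadIdx.x] * coeff[threadIdx.x % 256];
    }
}
```

Each memory space appears exactly once: constant for the broadcast table, shared for block-local staging, registers for per-thread indices, and global only at the input/output boundary.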