CUDA tutorial: Difference between revisions

Revision as of 18:44, 28 September 2017

37 bytes added, 7 years ago

→ Basic Performance Considerations

Stubbsda

@@ Line 184: / Line 184: @@
 </syntaxhighlight>
-= Basic Performance Considerations =
+= Basic performance considerations =
-== Memory Transfers ==
+== Memory transfers ==
 * PCI-e is extremely slow (4-6 GB/s) compared to both host and device memories
-* Minimize Host-to-Device and Device-to-Host memory copies
+* Minimize host-to-device and device-to-host memory copies
 * Keep data on the device as long as possible
-* Handing a job that is a poor fit for the GPU back to the Host (CPU) is not always efficient; executing it on the GPU may still be faster than copying the data to the CPU, executing, and copying back
+* Handing a job that is a poor fit for the GPU back to the host (CPU) is not always efficient; executing it on the GPU may still be faster than copying the data to the CPU, executing, and copying back
 * Measure memcpy times and include them when analysing execution times
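A minimal sketch of how memcpy times might be measured, assuming the CUDA runtime API and CUDA events for timing. The helper names `time_memcpy` and `gigabytes_per_second` and the 64 MiB buffer size are illustrative choices, not part of the tutorial; the host buffer is ordinary pageable memory, so the measured rate may fall below what pinned memory would reach.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Convert a byte count and an elapsed time in milliseconds to GB/s.
static double gigabytes_per_second(size_t bytes, float ms) {
    return (static_cast<double>(bytes) / 1.0e9) / (static_cast<double>(ms) / 1.0e3);
}

// Time one blocking cudaMemcpy with CUDA events (hypothetical helper).
// Returns the elapsed time in milliseconds, or a negative value on error.
static float time_memcpy(void* dst, const void* src, size_t bytes, cudaMemcpyKind kind) {
    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess) return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        return -1.0f;
    }
    cudaEventRecord(start);
    cudaError_t err = cudaMemcpy(dst, src, bytes, kind);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = -1.0f;
    if (err == cudaSuccess) cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const size_t bytes = static_cast<size_t>(64) << 20;  // 64 MiB, arbitrary test size
    void* h_buf = malloc(bytes);                         // pageable host memory
    void* d_buf = nullptr;
    if (h_buf == nullptr || cudaMalloc(&d_buf, bytes) != cudaSuccess) {
        fprintf(stderr, "allocation failed\n");
        free(h_buf);
        return 1;
    }
    memset(h_buf, 0, bytes);

    float h2d_ms = time_memcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    float d2h_ms = time_memcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);
    if (h2d_ms > 0.0f && d2h_ms > 0.0f) {
        printf("host-to-device: %.3f ms (%.2f GB/s)\n", h2d_ms, gigabytes_per_second(bytes, h2d_ms));
        printf("device-to-host: %.3f ms (%.2f GB/s)\n", d2h_ms, gigabytes_per_second(bytes, d2h_ms));
    }

    cudaFree(d_buf);
    free(h_buf);
    return 0;
}
```

Comparing these transfer times with the kernel time shows whether a computation is dominated by copies over PCI-e rather than by the work done on the device.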
 == Bandwidth == <!--T:21-->
-* Always keep CUDA bandwidth in mind when changing your code
+* Always keep CUDA bandwidth limitations in mind when changing your code
 * Know the theoretical peak bandwidth of the various data links
 * Count bytes read/written and compare to the theoretical peak
 * Utilize the various memory spaces depending on the situation: global, shared, constant
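The byte-counting advice above can be sketched as follows, assuming a simple kernel that reads one float and writes one float per element. The kernel `scale` and the helpers `effective_bandwidth_gbs` and `theoretical_peak_gbs` are hypothetical names introduced for illustration. The factor of 2 in the peak estimate assumes double-data-rate device memory, which may not hold for every GPU.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Effective bandwidth in GB/s from bytes read, bytes written and elapsed seconds.
static double effective_bandwidth_gbs(double bytes_read, double bytes_written, double seconds) {
    return (bytes_read + bytes_written) / 1.0e9 / seconds;
}

// Theoretical peak in GB/s from the memory clock (kHz) and bus width (bits).
// Assumption: double-data-rate memory, hence the factor 2.
static double theoretical_peak_gbs(int mem_clock_khz, int bus_width_bits) {
    return 2.0 * mem_clock_khz * 1.0e3 * (bus_width_bits / 8.0) / 1.0e9;
}

// Reads n floats and writes n floats: 2 * n * sizeof(float) bytes of traffic.
__global__ void scale(const float* in, float* out, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a * in[i];
}

int main() {
    const int n = 1 << 24;
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float* d_in = nullptr;
    float* d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess || cudaMalloc(&d_out, bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_in, 0, bytes);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    scale<<<blocks, threads>>>(d_in, d_out, n, 2.0f);  // warm-up launch, not timed
    cudaEventRecord(start);
    scale<<<blocks, threads>>>(d_in, d_out, n, 2.0f);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Query the memory clock and bus width of device 0.
    int clock_khz = 0, bus_bits = 0;
    cudaDeviceGetAttribute(&clock_khz, cudaDevAttrMemoryClockRate, 0);
    cudaDeviceGetAttribute(&bus_bits, cudaDevAttrGlobalMemoryBusWidth, 0);

    double eff = effective_bandwidth_gbs((double)bytes, (double)bytes, ms / 1.0e3);
    double peak = theoretical_peak_gbs(clock_khz, bus_bits);
    printf("effective: %.1f GB/s, theoretical peak: %.1f GB/s\n", eff, peak);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

A kernel whose effective bandwidth is far below the peak is likely limited by something other than raw memory throughput, for example uncoalesced access or too little parallelism.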
-== Common GPU Programming Strategies == <!--T:22-->
+== Common GPU programming strategies == <!--T:22-->
-* Constant memory also resides in DRAM- much slower access than shared memory
+* Constant memory also resides in DRAM - much slower access than shared memory
 ** but it is cached
 ** highly efficient for read-only, broadcast access
 * Carefully divide data according to access patterns:
-** R Only: 	   constant memory (very fast if in cache)
+** read-only: 	   constant memory (very fast if in cache)
-** R/W within Block:		shared memory (very fast)
+** read/write within block:		shared memory (very fast)
-** R/W within Thread:		registers (very fast)
+** read/write within thread:		registers (very fast)
-** R/W input/results:		global memory (very slow)
+** read/write input/results:		global memory (very slow)
 </translate>
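The division of data by access pattern listed above might look like the following sketch: a three-point weighted stencil whose weights sit in constant memory, whose input tile is staged in shared memory, whose accumulator lives in a register, and whose input and results stay in global memory. The kernel `weighted_stencil`, the CPU check `stencil_reference`, and the sizes `BLOCK` and `RADIUS` are all hypothetical; the kernel assumes it is launched with exactly `BLOCK` threads per block.

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

#define BLOCK 256
#define RADIUS 1

// Read-only and read at the same index by all threads: constant memory.
__constant__ float c_weights[2 * RADIUS + 1];

__global__ void weighted_stencil(const float* in, float* out, int n) {
    // Read/write within a block: shared memory tile with halo cells.
    __shared__ float tile[BLOCK + 2 * RADIUS];
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + RADIUS;

    // Stage input from global memory; out-of-range elements count as zero.
    tile[lid] = (gid < n) ? in[gid] : 0.0f;
    if (threadIdx.x < RADIUS) {
        int left = gid - RADIUS;
        int right = gid + BLOCK;
        tile[lid - RADIUS] = (left >= 0 && left < n) ? in[left] : 0.0f;
        tile[lid + BLOCK] = (right < n) ? in[right] : 0.0f;
    }
    __syncthreads();

    if (gid < n) {
        float acc = 0.0f;  // read/write within a thread: register
        for (int k = -RADIUS; k <= RADIUS; ++k)
            acc += c_weights[k + RADIUS] * tile[lid + k];
        out[gid] = acc;    // results go back to global memory
    }
}

// CPU reference used only to check the kernel output.
static void stencil_reference(const float* in, float* out, int n, const float* w) {
    for (int i = 0; i < n; ++i) {
        float acc = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; ++k) {
            int j = i + k;
            float v = (j >= 0 && j < n) ? in[j] : 0.0f;
            acc += w[k + RADIUS] * v;
        }
        out[i] = acc;
    }
}

int main() {
    const int n = 1000;
    const float h_w[2 * RADIUS + 1] = {0.25f, 0.5f, 0.25f};
    float h_in[n], h_out[n], h_ref[n];
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i % 7);

    float* d_in = nullptr;
    float* d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess ||
        cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemcpyToSymbol(c_weights, h_w, sizeof(h_w));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    weighted_stencil<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    stencil_reference(h_in, h_ref, n, h_w);
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (fabsf(h_out[i] - h_ref[i]) > 1.0e-6f) ++mismatches;
    printf("mismatches: %d\n", mismatches);

    cudaFree(d_in);
    cudaFree(d_out);
    return mismatches == 0 ? 0 : 1;
}
```

Each input element is read from global memory about once per block and then reused from shared memory, while every thread in a warp reads the same weight at each loop step, which is the broadcast case constant memory serves well.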