This is the default operating mode for Nvprof. It outputs one result line for each kernel function and for each type of CUDA memory copy or memset performed by the application. For each line, Nvprof reports the total time spent in all instances of that kernel or memory operation, its share of the total profiled GPU time, the number of calls, and the average, minimum and maximum time per call.
In this example, the application is <code>a.out</code>, and we run Nvprof on it to obtain the profile:
{{Command|nvprof ./a.out|result=
[Matrix Multiply Using CUDA] - Starting...
==27694== NVPROF is profiling process 27694, command: matrixMul
GPU Device 0: "GeForce GT 640M LE" with compute capability 3.0
MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 35.35 GFlop/s, Time= 3.708 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: OK
Note: For peak performance, please refer to the matrixMulCUBLAS example.
==27694== Profiling application: matrixMul
==27694== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
99.94%  1.11524s       301  3.7051ms  3.6928ms  3.7174ms  void matrixMulCUDA<int=32>(float*, float*, float*, int, int)
0.04%   406.30us         2  203.15us  136.13us  270.18us  [CUDA memcpy HtoD]
0.02%   248.29us         1  248.29us  248.29us  248.29us  [CUDA memcpy DtoH]
}}
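To make the relationship between the summary columns explicit, here is a minimal Python sketch, assuming one duration record per kernel launch or memory operation. It groups records by name, computes the total, call count, average, minimum and maximum per name, and expresses each total as a percentage of the summed time of all rows. This is only a model of how the columns relate to each other, not Nvprof's own implementation; the names <code>SummaryRow</code> and <code>summarize</code> are hypothetical. Individual kernel durations are not shown in the output, so the usage section approximates them as 301 calls at the reported 3.7051 ms average.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class SummaryRow:
    """One line of a summary-mode style report (hypothetical; times in seconds)."""
    name: str
    total_s: float
    calls: int
    avg_s: float
    min_s: float
    max_s: float
    percent: float = 0.0  # share of the summed time of all rows


def summarize(records):
    """Group (name, duration_s) records, one per kernel launch or memory
    operation, into summary rows sorted by total time, largest first."""
    durations = defaultdict(list)
    for name, seconds in records:
        durations[name].append(seconds)

    rows = [
        SummaryRow(name, sum(d), len(d), sum(d) / len(d), min(d), max(d))
        for name, d in durations.items()
    ]
    grand_total = sum(row.total_s for row in rows)
    for row in rows:
        row.percent = 100.0 * row.total_s / grand_total
    rows.sort(key=lambda row: row.total_s, reverse=True)
    return rows


if __name__ == "__main__":
    kernel = "void matrixMulCUDA<int=32>(float*, float*, float*, int, int)"
    # Approximation: 301 kernel calls, each set to the reported 3.7051 ms average.
    records = [(kernel, 3.7051e-3)] * 301
    # The two host-to-device copies (min and max from the table) and the
    # single device-to-host copy.
    records += [("[CUDA memcpy HtoD]", 136.13e-6), ("[CUDA memcpy HtoD]", 270.18e-6)]
    records.append(("[CUDA memcpy DtoH]", 248.29e-6))

    for row in summarize(records):
        print(f"{row.percent:6.2f}%  {row.total_s:.6f}s  {row.calls:5d}  "
              f"{row.avg_s * 1e3:.4f}ms  {row.min_s * 1e3:.4f}ms  "
              f"{row.max_s * 1e3:.4f}ms  {row.name}")
```

With these inputs the percentages round to 99.94%, 0.04% and 0.02%, matching the Time(%) column: the kernel dominates the GPU time, while the memory copies are negligible in this run.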