perf-report(1)
==============

NAME
----
perf-report - Read perf.data (created by perf record) and display the profile

SYNOPSIS
--------
[verse]
'perf report' [-i <file> | --input=file]

DESCRIPTION
-----------
This command displays the performance counter profile information recorded
via perf record.

OPTIONS
-------
-i::
--input=::
	Input file name. (default: perf.data unless stdin is a fifo)

-v::
--verbose::
	Be more verbose. (show symbol address, etc)

-q::
--quiet::
	Do not show any warnings or messages. (Suppress -v)

-n::
--show-nr-samples::
	Show the number of samples for each symbol

--show-cpu-utilization::
	Show sample percentage for different cpu modes.

-T::
--threads::
	Show per-thread event counters. The input data file should be recorded
	with -s option.

-c::
--comms=::
	Only consider symbols in these comms. CSV that understands
	file://filename entries. This option will affect the percentage of
	the overhead column. See --percentage for more info.

--pid=::
	Only show events for given process ID (comma separated list).

--tid=::
	Only show events for given thread ID (comma separated list).

-d::
--dsos=::
	Only consider symbols in these dsos. CSV that understands
	file://filename entries. This option will affect the percentage of
	the overhead column. See --percentage for more info.

-S::
--symbols=::
	Only consider these symbols. CSV that understands
	file://filename entries. This option will affect the percentage of
	the overhead column. See --percentage for more info.

--symbol-filter=::
	Only show symbols that match (partially) with this filter.

-U::
--hide-unresolved::
	Only display entries resolved to a symbol.

-s::
--sort=::
	Sort histogram entries by given key(s) - multiple keys can be specified
	in CSV format.
	The following sort keys are available:
	pid, comm, dso, symbol, parent, cpu, socket, srcline, weight,
	local_weight, cgroup_id, addr.

	Each key has the following meaning:

	- comm: command (name) of the task which can be read via /proc/<pid>/comm
	- pid: command and tid of the task
	- dso: name of library or module executed at the time of sample
	- dso_size: size of library or module executed at the time of sample
	- symbol: name of function executed at the time of sample
	- symbol_size: size of function executed at the time of sample
	- parent: name of function matched to the parent regex filter. Unmatched
	entries are displayed as "[other]".
	- cpu: cpu number the task ran at the time of sample
	- socket: processor socket number the task ran at the time of sample
	- srcline: filename and line number executed at the time of sample. The
	DWARF debugging info must be provided.
	- srcfile: file name of the source file of the samples. Requires DWARF
	information.
	- weight: Event specific weight, e.g. memory latency or transaction
	abort cost. This is the global weight.
	- local_weight: Local weight version of the weight above.
	- cgroup_id: ID derived from cgroup namespace device and inode numbers.
	- cgroup: cgroup pathname in the cgroupfs.
	- transaction: Transaction abort flags.
	- overhead: Overhead percentage of sample
	- overhead_sys: Overhead percentage of sample running in system mode
	- overhead_us: Overhead percentage of sample running in user mode
	- overhead_guest_sys: Overhead percentage of sample running in system mode
	on guest machine
	- overhead_guest_us: Overhead percentage of sample running in user mode on
	guest machine
	- sample: Number of samples
	- period: Raw number of event count of sample
	- time: Separate the samples by time stamp with the resolution specified by
	--time-quantum (default 100ms). Specify with overhead and before it.
	- code_page_size: the code page size of sampled code address (ip)
	- ins_lat: Instruction latency in core cycles. This is the global instruction
	latency
	- local_ins_lat: Local instruction latency version
	- p_stage_cyc: On powerpc, this presents the number of cycles spent in a
	pipeline stage. Currently supported only on powerpc.
	- addr: (Full) virtual address of the sampled instruction
	- retire_lat: On X86, this reports pipeline stall of this instruction compared
	to the previous instruction in cycles. Currently supported only on X86.

	By default, comm, dso and symbol keys are used.
	(i.e. --sort comm,dso,symbol)

	If the --branch-stack option is used, the following sort keys are also
	available:

	- dso_from: name of library or module branched from
	- dso_to: name of library or module branched to
	- symbol_from: name of function branched from
	- symbol_to: name of function branched to
	- srcline_from: source file and line branched from
	- srcline_to: source file and line branched to
	- mispredict: "N" for predicted branch, "Y" for mispredicted branch
	- in_tx: branch in TSX transaction
	- abort: TSX transaction abort.
	- cycles: Cycles in basic block

	And the default sort keys are changed to comm, dso_from, symbol_from, dso_to
	and symbol_to, see '--branch-stack'.

	When the sort key symbol is specified, columns "IPC" and "IPC Coverage"
	are enabled automatically. Column "IPC" reports the average IPC per function
	and column "IPC Coverage" reports the percentage of instructions with
	sampled IPC in this function. IPC means Instructions Per Cycle. If it's low,
	it indicates there may be a performance bottleneck when the function is
	executed, such as a memory access bottleneck. If a function has high overhead
	and low IPC, it's worth further analyzing it to optimize its performance.

	If the --mem-mode option is used, the following sort keys are also available
	(incompatible with --branch-stack):
	symbol_daddr, dso_daddr, locked, tlb, mem, snoop, dcacheline, blocked.

	- symbol_daddr: name of data symbol being executed on at the time of sample
	- dso_daddr: name of library or module containing the data being executed
	on at the time of the sample
	- locked: whether the bus was locked at the time of the sample
	- tlb: type of tlb access for the data at the time of the sample
	- mem: type of memory access for the data at the time of the sample
	- snoop: type of snoop (if any) for the data at the time of the sample
	- dcacheline: the cacheline the data address is on at the time of the sample
	- phys_daddr: physical address of data being executed on at the time of sample
	- data_page_size: the data page size of data being executed on at the time of sample
	- blocked: reason of blocked load access for the data at the time of the sample

	And the default sort keys are changed to local_weight, mem, sym, dso,
	symbol_daddr, dso_daddr, snoop, tlb, locked, blocked, local_ins_lat,
	see '--mem-mode'.

	If the data file has tracepoint event(s), the following (dynamic) sort keys
	are also available:
	trace, trace_fields, [<event>.]<field>[/raw]

	- trace: pretty printed trace output in a single column
	- trace_fields: fields in tracepoints in separate columns
	- <field name>: optional event and field name for a specific field

	The last form consists of event and field names. If the event name is
	omitted, it searches all events for a matching field name. The matched
	field will be shown only for events that have the field. The event name
	supports substring match so the user doesn't need to specify the full
	subsystem and event name every time. For example, 'sched:sched_switch' event can
	be shortened to 'switch' as long as it's not ambiguous.
	Also, an event can
	be specified by its index (starting from 1) preceded by '%'.
	So '%1' is the first event, '%2' is the second, and so on.

	The field name can have '/raw' suffix which disables pretty printing
	and shows raw field value like hex numbers. The --raw-trace option
	has the same effect for all dynamic sort keys.

	The default sort keys are changed to 'trace' if all events in the data
	file are tracepoints.

-F::
--fields=::
	Specify output field - multiple keys can be specified in CSV format.
	The following fields are available:
	overhead, overhead_sys, overhead_us, overhead_children, sample and period.
	Also it can contain any sort key(s).

	By default, every sort key not specified in -F will be appended
	automatically.

	If the key list starts with the prefix '+', then it will append the specified
	field(s) to the default field order. For example: perf report -F +period,sample.

-p::
--parent=<regex>::
	A regex filter to identify parent. The parent is a caller of this
	function and searched through the callchain, thus it requires callchain
	information recorded. The pattern is in the extended regex format and
	defaults to "\^sys_|^do_page_fault", see '--sort parent'.

-x::
--exclude-other::
	Only display entries with parent-match.

-w::
--column-widths=<width[,width...]>::
	Force each column width to the provided list, for large terminal
	readability. 0 means no limit (default behavior).

-t::
--field-separator=::
	Use a special separator character and don't pad with spaces, replacing
	all occurrences of this separator in symbol names (and other output)
	with a '.' character, so '.' is the only invalid separator.

-D::
--dump-raw-trace::
	Dump raw trace in ASCII.

--disable-order::
	Disable raw trace ordering.
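
The substitution described for --field-separator above can be modelled in a few
lines. This is only an illustrative sketch, not perf's implementation: the
function name `format_row` and the sample fields are hypothetical, and it
assumes the replacement is applied to every output field before joining.

```python
def format_row(fields, sep):
    """Join fields with `sep` and no padding.

    Any occurrence of `sep` inside a field is replaced by '.', so the
    separator never appears inside a value (assumed model of -t).
    """
    return sep.join(str(field).replace(sep, ".") for field in fields)


if __name__ == "__main__":
    # Hypothetical row: overhead, command, symbol (the symbol contains a comma).
    print(format_row(["12.50%", "app", "std::map<int,int>::find"], ","))
```

Because every in-field separator becomes '.', choosing '.' itself as the
separator would make the output ambiguous, which is why it is the one
separator that cannot be used.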

-g::
--call-graph=<print_type,threshold[,print_limit],order,sort_key[,branch],value>::
	Display call chains using type, min percent threshold, print limit,
	call order, sort key, optional branch and value. Note that ordering
	is not fixed so any parameter can be given in an arbitrary order.
	One exception is the print_limit which should be preceded by threshold.

	print_type can be either:
	- flat: single column, linear exposure of call chains.
	- graph: use a graph tree, displaying absolute overhead rates. (default)
	- fractal: like graph, but displays relative rates. Each branch of
		 the tree is considered as a new profiled object.
	- folded: call chains are displayed in a line, separated by semicolons
	- none: disable call chain display.

	threshold is a percentage value which specifies a minimum percent to be
	included in the output call graph. Default is 0.5 (%).

	print_limit is only applied when stdio interface is used. It limits the
	number of call graph entries in a single hist entry. Note that it needs
	to be given after threshold (but not necessarily consecutive).
	Default is 0 (unlimited).

	order can be either:
	- callee: callee based call graph.
	- caller: inverted caller based call graph.
	Default is 'caller' when --children is used, otherwise 'callee'.

	sort_key can be:
	- function: compare on functions (default)
	- address: compare on individual code addresses
	- srcline: compare on source filename and line number

	branch can be:
	- branch: include last branch information in callgraph when available.
	          Usually more convenient to use --branch-history for this.

	value can be:
	- percent: display overhead percent (default)
	- period: display event period
	- count: display event count

--children::
	Accumulate callchain of children to parent entry so that they can
	show up in the output.
	The output will have a new "Children" column
	and will be sorted on the data. It requires that callchains be recorded.
	See the `overhead calculation' section for more details. Enabled by
	default, disable with --no-children.

--max-stack::
	Set the stack depth limit when parsing the callchain, anything
	beyond the specified depth will be ignored. This is a trade-off
	between information loss and faster processing especially for
	workloads that can have a very long callchain stack.
	Note that when using the --itrace option the synthesized callchain size
	will override this value if the synthesized callchain size is bigger.

	Default: 127

-G::
--inverted::
	Alias for inverted caller based call graph.

--ignore-callees=<regex>::
	Ignore callees of the function(s) matching the given regex.
	This has the effect of collecting the callers of each such
	function into one place in the call-graph tree.

--pretty=<key>::
	Pretty printing style. key: normal, raw

--stdio:: Use the stdio interface.

--stdio-color::
	'always', 'never' or 'auto', allowing configuring color output
	via the command line, in addition to via "color.ui" .perfconfig.
	Use '--stdio-color always' to generate color even when redirecting
	to a pipe or file. Using just '--stdio-color' is equivalent to
	using 'always'.

--tui:: Use the TUI interface, that is integrated with annotate and allows
	zooming into DSOs or threads, among other features. Use of --tui
	requires a tty, if one is not present, as when piping to other
	commands, the stdio interface is used.

--gtk:: Use the GTK2 interface.

-k::
--vmlinux=<file>::
	vmlinux pathname

--ignore-vmlinux::
	Ignore vmlinux files.

--kallsyms=<file>::
	kallsyms pathname

-m::
--modules::
	Load module symbols. WARNING: This should only be used with -k and
	a LIVE kernel.

-f::
--force::
	Don't do ownership validation.

--symfs=<directory>::
	Look for files with symbols relative to this directory.

-C::
--cpu:: Only report samples for the list of CPUs provided. Multiple CPUs can
	be provided as a comma-separated list with no space: 0,1. Ranges of
	CPUs are specified with -: 0-2. Default is to report samples on all
	CPUs.

-M::
--disassembler-style=:: Set disassembler style for objdump.

--source::
	Interleave source code with assembly code. Enabled by default,
	disable with --no-source.

--asm-raw::
	Show raw instruction encoding of assembly instructions.

--show-total-period:: Show a column with the sum of periods.

-I::
--show-info::
	Display extended information about the perf.data file. This adds
	information which may be very large and thus may clutter the display.
	It currently includes: cpu and numa topology of the host system.

-b::
--branch-stack::
	Use the addresses of sampled taken branches instead of the instruction
	address to build the histograms. To generate meaningful output, the
	perf.data file must have been obtained using perf record -b or
	perf record --branch-filter xxx where xxx is a branch filter option.
	perf report is able to auto-detect whether a perf.data file contains
	branch stacks and it will automatically switch to the branch view mode,
	unless --no-branch-stack is used.

--branch-history::
	Add the addresses of sampled taken branches to the callstack.
	This allows examining the path the program took to each sample.
	The data collection must have used -b (or -j) and -g.

--objdump=<path>::
	Path to objdump binary.

--prefix=PREFIX::
--prefix-strip=N::
	Remove first N entries from source file path names in executables
	and add PREFIX. This allows displaying source code compiled on systems
	with a different file system layout.
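
As a rough illustration of the --prefix/--prefix-strip rewrite described above,
the sketch below drops the first N directory entries of an absolute source path
and prepends PREFIX. The helper name `remap_source_path` and the example paths
are hypothetical, and the exact rules (for instance how relative paths are
treated) are decided by the tools themselves; this only models the documented
"remove N entries, add PREFIX" behaviour.

```python
from pathlib import PurePosixPath


def remap_source_path(path, prefix, strip=0):
    """Drop the first `strip` entries of `path`, then prepend `prefix`.

    Assumed model of --prefix=PREFIX --prefix-strip=N; not taken from
    the perf or objdump sources.
    """
    # Split into components and ignore the leading root marker.
    entries = [part for part in PurePosixPath(path).parts if part != "/"]
    kept = entries[strip:]
    return str(PurePosixPath(prefix, *kept))


if __name__ == "__main__":
    # Path recorded on a (hypothetical) build machine, remapped to a local checkout.
    print(remap_source_path("/build/worker1/src/lib/foo.c", "/home/me/checkout", 2))
```

With N=2 the two build-machine directories are removed and the remainder of the
path is looked up below the local prefix.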

--group::
	Show event group information together. It forces group output also
	if there are no groups defined in data file.

--group-sort-idx::
	Sort the output by the event at the index n in group. If n is invalid,
	sort by the first event. It can support multiple groups with different
	numbers of events. WARNING: This should be used on grouped events.

--demangle::
	Demangle symbol names to human readable form. It's enabled by default,
	disable with --no-demangle.

--demangle-kernel::
	Demangle kernel symbol names to human readable form (for C++ kernels).

--mem-mode::
	Use the data addresses of samples in addition to instruction addresses
	to build the histograms. To generate meaningful output, the perf.data
	file must have been obtained using perf record -d -W and using a
	special event -e cpu/mem-loads/p or -e cpu/mem-stores/p. See
	'perf mem' for simpler access.

--percent-limit::
	Do not show entries which have an overhead under that percent.
	(Default: 0). Note that this option also sets the percent limit (threshold)
	of callchains. However the default value of callchain threshold is
	different than the default value of hist entries. Please see the
	--call-graph option for details.

--percentage::
	Determine how to display the overhead percentage of filtered entries.
	Filters can be applied by --comms, --dsos and/or --symbols options and
	Zoom operations on the TUI (thread, dso, etc).

	"relative" means it's relative to filtered entries only so that the
	sum of shown entries will be always 100%. "absolute" means it retains
	the original value before and after the filter is applied.

--header::
	Show header information in the perf.data file. This includes
	various information like hostname, OS and perf version, cpu/mem
	info, perf command line, event list and so on. Currently only
	--stdio output supports this feature.
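
The difference between the "relative" and "absolute" modes of --percentage,
described above, comes down to which total is used as the denominator. The
sketch below works through that arithmetic. The function name, the entry names
and the sample periods are invented for illustration and are not perf output.

```python
def overhead_percentages(periods, keep, mode="absolute"):
    """Return the overhead percentage of each entry that passes the filter.

    periods: mapping of entry name to its sample period (hypothetical data)
    keep:    names that survive a --comms/--dsos/--symbols style filter
    mode:    "absolute" divides by the unfiltered total,
             "relative" divides by the total of the shown entries only
    """
    if mode not in ("absolute", "relative"):
        raise ValueError("mode must be 'absolute' or 'relative'")
    shown = {name: p for name, p in periods.items() if name in keep}
    base = sum(shown.values()) if mode == "relative" else sum(periods.values())
    if base == 0:
        return {}
    return {name: 100.0 * p / base for name, p in shown.items()}


if __name__ == "__main__":
    periods = {"libc.so": 600, "app": 300, "[kernel]": 100}
    keep = {"libc.so", "app"}
    print(overhead_percentages(periods, keep, "absolute"))
    print(overhead_percentages(periods, keep, "relative"))
```

In "absolute" mode the shown entries keep their share of the whole profile, so
they need not add up to 100%. In "relative" mode they are rescaled so that they
do.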

--header-only::
	Show only perf.data header (forces --stdio).

--time::
	Only analyze samples within given time window: <start>,<stop>. Times
	have the format seconds.nanoseconds. If start is not given (i.e. time
	string is ',x.y') then analysis starts at the beginning of the file. If
	stop time is not given (i.e. time string is 'x.y,') then analysis goes
	to end of file. Multiple ranges can be separated by spaces, which
	requires the argument to be quoted e.g. --time "1234.567,1234.789 1235,"

	Time percentages with multiple time ranges are also supported. The time
	string is 'a%/n,b%/m,...' or 'a%-b%,c%-d%,...'.

	For example:
	Select the second 10% time slice:

	  perf report --time 10%/2

	Select from 0% to 10% time slice:

	  perf report --time 0%-10%

	Select the first and second 10% time slices:

	  perf report --time 10%/1,10%/2

	Select from 0% to 10% and 30% to 40% slices:

	  perf report --time 0%-10%,30%-40%

--switch-on EVENT_NAME::
	Only consider events after this event is found.

	This may be interesting to measure a workload only after some initialization
	phase is over, i.e. insert a perf probe at that point and then use this
	option with that probe.

--switch-off EVENT_NAME::
	Stop considering events after this event is found.

--show-on-off-events::
	Show the --switch-on/off events too. This has no effect in 'perf report' now
	but probably we'll make the default not to show the switch-on/off events
	on the --group mode and if there is only one event besides the off/on ones,
	go straight to the histogram browser, just like 'perf report' with no events
	explicitly specified does.

--itrace::
	Options for decoding instruction tracing data. The options are:

include::itrace.txt[]

	To disable decoding entirely, use --no-itrace.
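
The percent forms accepted by --time (see that option above) can be read as
fractions of the recorded time span. The sketch below converts a single 'a%/n'
or 'a%-b%' item into an absolute window, following the examples given for
--time: '10%/2' is taken to mean the second 10% slice, that is 10% to 20%. The
function `percent_slice` and the sample timestamps are hypothetical, and perf's
own parser remains the authority on edge cases.

```python
def percent_slice(spec, start, end):
    """Map one 'a%/n' or 'a%-b%' item to a (window_start, window_stop) pair.

    start/end are the first and last timestamps of the data (seconds).
    Assumed reading: 'a%/n' is the n-th slice of width a%, counted from 1.
    """
    span = end - start
    if "/" in spec:
        pct_text, index_text = spec.split("/")
        width = float(pct_text.rstrip("%")) / 100.0
        index = int(index_text)
        lo, hi = width * (index - 1), width * index
    else:
        lo_text, hi_text = spec.split("-")
        lo = float(lo_text.rstrip("%")) / 100.0
        hi = float(hi_text.rstrip("%")) / 100.0
    if lo < 0.0 or hi <= lo:
        raise ValueError("invalid percent range: " + spec)
    return start + lo * span, start + hi * span


if __name__ == "__main__":
    # Hypothetical recording that spans timestamps 1000.0 s to 2000.0 s.
    for item in "10%/1,10%/2".split(","):
        print(item, percent_slice(item, 1000.0, 2000.0))
```

A comma-separated list such as '0%-10%,30%-40%' is simply several such windows,
each handled independently as in the loop above.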

--full-source-path::
	Show the full path for source files for srcline output.

--show-ref-call-graph::
	When multiple events are sampled, it may not be necessary to collect
	callgraphs for all of them. The sample sites are usually nearby,
	and it's enough to collect the callgraphs on a reference event.
	So the user can use the "call-graph=no" event modifier to disable callgraph
	for other events to reduce the overhead.
	However, perf report cannot show callgraphs for events whose
	callgraph is disabled.
	This option extends perf report to show reference callgraphs,
	which are collected by the reference event, in events without callgraphs.

--stitch-lbr::
	Show callgraph with stitched LBRs, which may have more complete
	callgraph. The perf.data file must have been obtained using
	perf record --call-graph lbr.
	Disabled by default. In common cases with call stack overflows,
	it can recreate better call stacks than the default lbr call stack
	output. But this approach is not foolproof. There can be cases
	where it creates incorrect call stacks from incorrect matches.
	The known limitations include exception handling: with constructs such as
	setjmp/longjmp, calls and returns will not match.

--socket-filter::
	Only report the samples on the processor socket that match with this filter

--samples=N::
	Save N individual samples for each histogram entry to show context in perf
	report tui browser.

--raw-trace::
	When displaying traceevent output, do not use print fmt or plugins.

--hierarchy::
	Enable hierarchical output.

--inline::
	If a callgraph address belongs to an inlined function, the inline stack
	will be printed. Each entry is function name or file/line. Enabled by
	default, disable with --no-inline.

--mmaps::
	Show --tasks output plus mmap information in a format similar to
	/proc/<PID>/maps.

	Please note that not all mmaps are stored; options affecting which ones
	are stored include 'perf record --data', for instance.

--ns::
	Show time stamps in nanoseconds.

--stats::
	Display overall events statistics without any further processing.
	(like the one at the end of the perf report -D command)

--tasks::
	Display monitored tasks stored in perf data. Displaying pid/tid/ppid
	plus the command string aligned to distinguish parent and child tasks.

--percent-type::
	Set annotation percent type from the following choices:
	  global-period, local-period, global-hits, local-hits

	The local/global keywords set if the percentage is computed
	in the scope of the function (local) or the whole data (global).
	The period/hits keywords set the base the percentage is computed
	on - the samples period or the number of samples (hits).

--time-quantum::
	Configure time quantum for time sort key. Default 100ms.
	Accepts s, us, ms, ns units.

--total-cycles::
	When --total-cycles is specified, it supports sorting for all blocks by
	'Sampled Cycles%'. This is useful to concentrate on the globally hottest
	blocks. In output, there are some new columns:

	'Sampled Cycles%' - block sampled cycles aggregation / total sampled cycles
	'Sampled Cycles'  - block sampled cycles aggregation
	'Avg Cycles%'     - block average sampled cycles / sum of total block average
			    sampled cycles
	'Avg Cycles'      - block average sampled cycles

--skip-empty::
	Do not print 0 results in the --stats output.

include::callchain-overhead-calculation.txt[]

SEE ALSO
--------
linkperf:perf-stat[1], linkperf:perf-annotate[1], linkperf:perf-record[1],
linkperf:perf-intel-pt[1]