# Symbolizer markup format #

This document defines a text format for log messages that can be
processed by a _symbolizing filter_. The basic idea is that logging
code emits text that contains raw address values and so forth, without
the logging code doing any real work to convert those values to
human-readable form. Instead, logging text uses the markup format
defined here to identify pieces of information that should be converted
to human-readable form after the fact. As with other markup formats,
the expectation is that most of the text will be displayed as is, while
the markup elements will be replaced with expanded text, or converted
into active UI elements, that present more details in symbolic form.

This means there is no need for symbol tables, DWARF debugging sections,
or similar information to be directly accessible at runtime. There is
also no need at runtime for any logic intended to compute human-readable
presentation of information, such as C++ symbol demangling. Instead,
logging must include markup elements that give the contextual
information necessary to make sense of the raw data, such as memory
layout details.

This format identifies markup elements with a syntax that is both simple
and distinctive. It's simple enough to be matched and parsed with
straightforward code. It's distinctive enough that character sequences
that look like the start or end of a markup element should rarely if
ever appear incidentally in logging text. It's specifically intended
not to require sanitizing plain text, such as the HTML/XML requirement
to replace `<` with `&lt;` and the like.

## Scope and assumptions ##

This specification defines a format standard for Zircon and Fuchsia.
But there is nothing specific to Zircon or Fuchsia about the markup
format.
A symbolizing filter implementation will be independent both of
the _target_ operating system and machine architecture where the logs
are generated and of the _host_ operating system and machine
architecture where the filter runs.

This format assumes that the symbolizing filter processes intact whole
lines. If long lines might be split during some stage of a logging
pipeline, they must be reassembled to restore the original line breaks
before feeding lines into the symbolizing filter. Most markup elements
must appear entirely on a single line (often with other text before
and/or after the markup element). There are some markup elements that
are specified to span lines, with line breaks in the middle of the
element. Even in those cases, the filter is not expected to handle line
breaks in arbitrary places inside a markup element, but only inside
certain fields.

This format assumes that the symbolizing filter processes a coherent
stream of log lines from a single process address space context. If a
logging stream interleaves log lines from more than one process, these
must be collated into separate per-process log streams and each stream
processed by a separate instance of the symbolizing filter. Because the
kernel and user processes use disjoint address regions in most operating
systems (including Zircon), a single user process address space plus
the kernel address space can be treated as a single address space for
symbolization purposes if desired.

## Dependence on Build IDs ##

The symbolizer markup scheme relies on contextual information about
runtime memory address layout to make it possible to convert markup
elements into useful symbolic form. This relies on having an
unmistakable identification of which binary was loaded at each address.

An ELF Build ID is the payload of an ELF note with name `"GNU"` and type
`NT_GNU_BUILD_ID`, a unique byte sequence that identifies a particular
binary (executable, shared library, loadable module, or driver module).
The linker generates this automatically based on a hash that includes
the complete symbol table and debugging information, even if this is
later stripped from the binary.

This specification uses the ELF Build ID as the sole means of
identifying binaries. Each binary relevant to the log must have been
linked with a unique Build ID. The symbolizing filter must have some
means of mapping a Build ID back to the original ELF binary (either the
whole unstripped binary, or a stripped binary paired with a separate
debug file).

## Colorization ##

The markup format supports a restricted subset of ANSI X3.64 SGR (Select
Graphic Rendition) control sequences. These are unlike other markup
elements:
 * They specify presentation details (**bold** or colors) rather than
   semantic information. The association of semantic meaning with color
   (e.g. red for errors) is chosen by the code doing the logging, rather
   than by the UI presentation of the symbolizing filter. This is a
   concession to existing code (e.g. LLVM sanitizer runtimes) that uses
   specific colors and would require substantial changes to generate
   semantic markup instead.
 * A single control sequence changes "the state", rather than being a
   hierarchical structure that surrounds affected text.

The filter processes ANSI SGR control sequences only within a single
line. If a control sequence to enter a **bold** or color state is
encountered, it's expected that the control sequence to reset to default
state will be encountered before the end of that line. If a "dangling"
state is left at the end of a line, the filter may reset to default
state for the next line.
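
The per-line handling described above can be sketched roughly as follows.
This is only an illustration, not part of the specification: the names
`split_sgr`, `SGR_RE`, and `ACCEPTED` are hypothetical, the accepted
parameter values are taken from the table in this section, and the sketch
makes no attempt to skip SGR sequences that occur inside other markup
elements.

```python
import re

# Matches ESC "[" <decimal digits> "m". Only the parameters listed in this
# section's table (0, 1, 30..37) are treated as control sequences.
SGR_RE = re.compile(r"\x1b\[(\d+)m")
ACCEPTED = {0, 1, *range(30, 38)}


def split_sgr(line):
    """Split one log line into (bold, color, text) segments.

    Returns (segments, dangling). `color` is the SGR foreground code
    (30..37) or None for the default color. `dangling` is True when the
    line ends in a non-default state, which a filter may simply drop
    before processing the next line.
    """
    bold, color = False, None  # every line starts in the default state
    segments, pos = [], 0
    for match in SGR_RE.finditer(line):
        code = int(match.group(1))
        if code not in ACCEPTED:
            continue  # not an accepted sequence: leave it in the text
        if match.start() > pos:
            segments.append((bold, color, line[pos:match.start()]))
        pos = match.end()
        if code == 0:
            bold, color = False, None  # reset to default formatting
        elif code == 1:
            bold = True  # bold combines with the current color
        else:
            color = code
    if pos < len(line):
        segments.append((bold, color, line[pos:]))
    return segments, bold or color is not None
```

Because the state is local to each call, invoking `split_sgr` once per
line gives the "reset at end of line" behavior without extra bookkeeping.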

An SGR control sequence is not interpreted inside any other markup element.
However, other markup elements may appear between SGR control sequences and
the color/**bold** state is expected to apply to the symbolic output that
replaces the markup element in the filter's output.

The accepted SGR control sequences all have the form `"\033[%um"`
(expressed here using C string syntax), where `%u` is one of these:

| Code | Effect | Notes |
|:----:|:------:|-------|
| `0` | Reset to default formatting. | |
| `1` | Use **bold text** | Combines with color states, doesn't reset them. |
| `30` | Black foreground | |
| `31` | Red foreground | |
| `32` | Green foreground | |
| `33` | Yellow foreground | |
| `34` | Blue foreground | |
| `35` | Magenta foreground | |
| `36` | Cyan foreground | |
| `37` | White foreground | |

## Common markup element syntax ##

All the markup elements share a common syntactic structure to facilitate
simple matching and parsing code. Each element has the form:

```
{{{tag:fields}}}
```

`tag` identifies one of the element types described below, and is always
a short alphabetic string that must be in lower case. The rest of the
element consists of one or more fields. Fields are separated by `:` and
cannot contain any `:` or `}` characters. How many fields must be or
may be present and what they contain is specified for each element type.

No markup elements or ANSI SGR control sequences are interpreted inside the
contents of a field.

In the descriptions of each element type, `printf`-style placeholders
indicate field contents:

* `%s`

  A string of printable characters, not including `:` or `}`.

* `%p`

  An address value represented by `0x` followed by an even number of
  hexadecimal digits (using either lower-case or upper-case for
  `A`..`F`).
  If the digits are all `0` then the `0x` prefix may be
  omitted. No more than 16 hexadecimal digits are expected to appear in
  a single value (64 bits).

* `%u`

  A nonnegative decimal integer.

* `%x`

  A sequence of an even number of hexadecimal digits (using either
  lower-case or upper-case for `A`..`F`), with no `0x` prefix.
  This represents an arbitrary sequence of bytes, such as an ELF Build ID.

## Presentation elements ##

These are elements that convey a specific program entity to be displayed
in human-readable symbolic form.

* `{{{symbol:%s}}}`

  Here `%s` is the linkage name for a symbol or type. It may require
  demangling according to language ABI rules. Even for unmangled names,
  it's recommended that this markup element be used to identify a symbol
  name so that it can be presented distinctively.

  Examples:
  ```
  {{{symbol:_ZN7Mangled4NameEv}}}
  {{{symbol:foobar}}}
  ```

* `{{{pc:%p}}}`

  Here `%p` is the memory address of a code location.
  It might be presented as a function name and source location.

  Examples:
  ```
  {{{pc:0x12345678}}}
  {{{pc:0xffffffff9abcdef0}}}
  ```

* `{{{data:%p}}}`

  Here `%p` is the memory address of a data location.
  It might be presented as the name of a global variable at that location.

  Examples:
  ```
  {{{data:0x12345678}}}
  {{{data:0xffffffff9abcdef0}}}
  ```

* `{{{bt:%u:%p}}}`

  This represents one frame in a backtrace. It usually appears on a
  line by itself (surrounded only by whitespace), in a sequence of such
  lines with ascending frame numbers. So the human-readable output
  might be formatted assuming that, such that it looks good for a
  sequence of `bt` elements each alone on its line with uniform
  indentation of each line.
  But it can appear anywhere, so the filter
  should not remove any non-whitespace text surrounding the element.

  Here `%u` is the frame number, which starts at zero for the location
  of the fault being identified, increments to one for the caller of
  frame zero's call frame, to two for the caller of frame one, etc.
  `%p` is the memory address of a code location.

  In frames after frame zero, this code location identifies a call site.
  Some emitters may subtract one byte or one instruction length from the
  actual return address for the call site, with the intent that the
  address logged can be translated directly to a source location for the
  call site and not for the apparent return site thereafter (which can
  be confusing). It's recommended that emitters _not_ do this, so that
  each frame's code location is the exact return address given to its
  callee and e.g. could be highlighted in instruction-level disassembly.
  The symbolizing filter can do the adjustment to the address it
  translates into a source location. Assuming that a call instruction
  is longer than one byte on all supported machines, applying the
  "subtract one byte" adjustment a second time still results in an
  address somewhere in the call instruction, so a little sloppiness here
  does no harm.

  Examples:
  ```
  {{{bt:0:0x12345678}}}
  {{{bt:1:0xffffffff9abcdef0}}}
  ```

* `{{{hexdict:...}}}`

  This element can span multiple lines. Here `...` is a sequence of
  key-value pairs where a single `:` separates each key from its value,
  and arbitrary whitespace separates the pairs. The value (right-hand
  side) of each pair either is one or more `0` digits, or is `0x`
  followed by hexadecimal digits. Each value might be a memory address
  or might be some other integer (including an integer that looks like a
  likely memory address but actually has an unrelated purpose).
  When the contextual information about the memory layout suggests that a
  given value could be a code location or a global variable data
  address, it might be presented as a source location or variable name
  or with active UI that makes such interpretation optionally visible.

  The intended use is for things like register dumps, where the emitter
  doesn't know which values might have a symbolic interpretation but a
  presentation that makes plausible symbolic interpretations available
  might be very useful to someone reading the log. At the same time,
  a flat text presentation should usually avoid interfering too much
  with the original contents and formatting of the dump. For example,
  it might use footnotes with source locations for values that appear
  to be code locations. An active UI presentation might show the dump
  text as is, but highlight values with symbolic information available
  and pop up a presentation of symbolic details when a value is selected.

  Example:
  ```
  {{{hexdict:
    CS: 0 RIP: 0x6ee17076fb80 EFL: 0x10246 CR2: 0
    RAX: 0xc53d0acbcf0 RBX: 0x1e659ea7e0d0 RCX: 0 RDX: 0x6ee1708300cc
    RSI: 0 RDI: 0x6ee170830040 RBP: 0x3b13734898e0 RSP: 0x3b13734898d8
    R8: 0x3b1373489860 R9: 0x2776ff4f R10: 0x2749d3e9a940 R11: 0x246
    R12: 0x1e659ea7e0f0 R13: 0xd7231230fd6ff2e7 R14: 0x1e659ea7e108 R15: 0xc53d0acbcf0
  }}}
  ```

## Trigger elements ##

These elements cause an external action and are presented to the
user in human-readable form. Generally the external action they
trigger results in a linkable page. The link, or other information
about the external action, can then be presented to the user.

* `{{{dumpfile:%s:%s}}}`

  Here the first `%s` is an identifier for a type of dump and the
  second `%s` is an identifier for a particular dump that's just been
  published.
  The types of dumps, the exact meaning of "published",
  and the nature of the identifier are outside the scope of the markup
  format per se. In general it might correspond to writing a file by
  that name or something similar.

  This element may trigger additional post-processing work beyond
  symbolizing the markup. It indicates that a dump file of some sort
  has been published. Some logic attached to the symbolizing filter may
  understand certain types of dump file and trigger additional
  post-processing of the dump file upon encountering this element (e.g.
  generating visualizations, symbolization). The expectation is that the
  information collected from contextual elements (described below) in the
  logging stream may be necessary to decode the content of the dump. So
  if the symbolizing filter triggers other processing, it may need to
  feed some distilled form of the contextual information to those
  processes.

  On Zircon and Fuchsia in particular, "publish" means to call the
  `__sanitizer_publish_data` function from `<zircon/sanitizer.h>`
  with the "type" identifier as the "sink name" string. The "dump
  identifier" is the name attached to the Zircon VMO whose handle
  was passed in the call to `__sanitizer_publish_data`.
  **TODO(mcgrathr): Link to docs about `__sanitizer_publish_data` and
  getting data dumps off the device.**

  An example of a type identifier is `sancov`, for dumps from LLVM
  [SanitizerCoverage](https://clang.llvm.org/docs/SanitizerCoverage.html).

  Example:
  ```
  {{{dumpfile:sancov:sancov.8675}}}
  ```

## Contextual elements ##

These are elements that supply information necessary to convert
presentation elements to symbolic form. Unlike presentation elements,
they are not directly related to the surrounding text.
Contextual elements should appear alone on lines with no other
non-whitespace text, so that the symbolizing filter might elide the
whole line from its output without hiding any other log text.

The contextual elements themselves do not necessarily need to be
presented in human-readable output. However, the information they
impart may be essential to understanding the logging text even after
symbolization. So it's recommended that this information be preserved
in some form when the original raw log with markup may no longer be
readily accessible for whatever reason.

Contextual elements should appear in the logging stream before they are
needed. That is, if some piece of context may affect how the
symbolizing filter would interpret or present a later presentation
element, the necessary contextual elements should have appeared
somewhere earlier in the logging stream. It should always be possible
for the symbolizing filter to be implemented as a single pass over the
raw logging stream, accumulating context and massaging text as it goes.

* `{{{reset}}}`

  This should be output before any other contextual element. This
  contextual element exists to support implementations that handle
  logs coming from multiple processes. Such implementations might not
  know when a new process starts or ends. Because some identifying
  information (like process IDs) might be the same between old and new
  processes, a way is needed to distinguish two processes with such
  identical identifying information. This element informs such
  implementations to reset the state of a filter so that information
  from a previous process's contextual elements is not assumed for a new
  process that just happens to have the same identifying information.

* `{{{module:%i:%s:%s:...}}}`

  This element represents a so-called "module". A "module" is a single
  linked binary, such as a loaded ELF file.
  Usually each module occupies
  a contiguous range of memory (it always does on Zircon).

  Here `%i` is the Module ID which is used by other contextual elements
  to refer to this module. The first `%s` is a human-readable identifier
  for the module, such as an ELF `DT_SONAME` string or a file name; but
  it might be empty. It's only for casual information. The Module ID
  will be exclusively used to refer to this module in other contextual
  elements. The second `%s` is the module type and it determines what
  the remaining fields are. The following module types are supported:

  * `elf:%x`

    Here `%x` encodes an ELF Build ID. The Build ID should refer to a
    single linked binary. The Build ID string is the sole way to identify
    the binary from which this module was loaded.

  Example:
  ```
  {{{module:1:libc.so:elf:83238ab56ba10497}}}
  ```

* `{{{mmap:%p:%x:...}}}`

  This contextual element is used to give information about a particular
  region in memory. `%p` is the starting address and `%x` gives the size
  in hex of the region of memory. The `...` part can take different forms
  to give different information about the specified region of memory. The
  allowed forms are the following:

  * `load:%i:%s:%p`

    This subelement informs the filter that a segment was loaded from a
    module. The module is identified by its Module ID `%i`. The `%s` is
    one or more of the letters 'r', 'w', and 'x' (in that order and in
    either upper or lower case) to indicate this segment of memory is
    readable, writable, and/or executable. The symbolizing filter can use
    this information to guess whether an address is a likely code address
    or a likely data address in the given module. The remaining `%p` gives
    the module relative address. For ELF files the module relative address
    will be the `p_vaddr` of the associated program header.
    For example, if
    your module's executable segment has `p_vaddr=0x1000`, `p_memsz=0x1234`,
    and was loaded at 0x7acba69d5000, then you need to subtract 0x7acba69d4000
    from any address between 0x7acba69d5000 and 0x7acba69d6234 to get the
    module relative address. The starting address will usually have been
    rounded down to the active page size, and the size rounded up.

  Example:
  ```
  {{{mmap:0x7acba69d5000:0x5a000:load:1:rx:0x1000}}}
  ```
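
The address arithmetic for `load` mappings can be illustrated with a short
Python sketch. It is not part of the specification: the helper names
`parse_mmap_load` and `module_relative` are hypothetical, the Module ID is
kept as an opaque string, and the size field is parsed as hexadecimal with
an optional `0x` prefix, as it appears in the example above.

```python
def parse_mmap_load(element):
    """Parse a `{{{mmap:...:load:...}}}` element into a dict.

    Assumes the element is well formed and sits alone in `element`.
    """
    if not (element.startswith("{{{") and element.endswith("}}}")):
        raise ValueError("not a markup element")
    fields = element[3:-3].split(":")
    if len(fields) != 7 or fields[0] != "mmap" or fields[3] != "load":
        raise ValueError("not an mmap load element")
    _, start, size, _, module_id, perms, vaddr = fields
    return {
        "start": int(start, 16),   # runtime address where the region begins
        "size": int(size, 16),     # region size in bytes
        "module_id": module_id,    # opaque reference to a module element
        "perms": perms.lower(),    # some of "r", "w", "x"
        "vaddr": int(vaddr, 16),   # module relative address of the region
    }


def module_relative(mapping, address):
    """Translate a runtime address into a module relative address.

    Returns None when the address lies outside the mapped region.
    """
    start = mapping["start"]
    if not start <= address < start + mapping["size"]:
        return None
    # Equivalent to subtracting (start - vaddr) from the address.
    return address - start + mapping["vaddr"]


mapping = parse_mmap_load("{{{mmap:0x7acba69d5000:0x5a000:load:1:rx:0x1000}}}")
offset = module_relative(mapping, 0x7acba69d5abc)
```

With the example element, `start - vaddr` is 0x7acba69d4000, the same
value the text says to subtract, so the runtime address 0x7acba69d5abc
translates to the module relative address 0x1abc.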