# Symbolizer markup format #

This document defines a text format for log messages that can be
processed by a _symbolizing filter_.  The basic idea is that logging
code emits text that contains raw address values and so forth, without
the logging code doing any real work to convert those values to
human-readable form.  Instead, logging text uses the markup format
defined here to identify pieces of information that should be converted
to human-readable form after the fact.  As with other markup formats,
the expectation is that most of the text will be displayed as is, while
the markup elements will be replaced with expanded text, or converted
into active UI elements, that present more details in symbolic form.

This means there is no need for symbol tables, DWARF debugging sections,
or similar information to be directly accessible at runtime.  There is
also no need at runtime for any logic intended to compute human-readable
presentation of information, such as C++ symbol demangling.  Instead,
logging must include markup elements that give the contextual
information necessary to make sense of the raw data, such as memory
layout details.

This format identifies markup elements with a syntax that is both simple
and distinctive.  It's simple enough to be matched and parsed with
straightforward code.  It's distinctive enough that character sequences
that look like the start or end of a markup element should rarely if
ever appear incidentally in logging text.  It's specifically intended
not to require sanitizing plain text, such as the HTML/XML requirement
to replace `<` with `&lt;` and the like.

## Scope and assumptions ##

This specification defines a format standard for Zircon and Fuchsia.
But there is nothing specific to Zircon or Fuchsia about the markup
format.  A symbolizing filter implementation will be independent both of
the _target_ operating system and machine architecture where the logs
are generated and of the _host_ operating system and machine
architecture where the filter runs.

This format assumes that the symbolizing filter processes intact whole
lines.  If long lines might be split during some stage of a logging
pipeline, they must be reassembled to restore the original line breaks
before feeding lines into the symbolizing filter.  Most markup elements
must appear entirely on a single line (often with other text before
and/or after the markup element).  There are some markup elements that
are specified to span lines, with line breaks in the middle of the
element.  Even in those cases, the filter is not expected to handle line
breaks in arbitrary places inside a markup element, but only inside
certain fields.

This format assumes that the symbolizing filter processes a coherent
stream of log lines from a single process address space context.  If a
logging stream interleaves log lines from more than one process, these
must be collated into separate per-process log streams and each stream
processed by a separate instance of the symbolizing filter.  Because the
kernel and user processes use disjoint address regions in most operating
systems (including Zircon), a single user process address space plus
the kernel address space can be treated as a single address space for
symbolization purposes if desired.

## Dependence on Build IDs ##

The symbolizer markup scheme relies on contextual information about
runtime memory address layout to make it possible to convert markup
elements into useful symbolic form.  This relies on having an
unmistakable identification of which binary was loaded at each address.

An ELF Build ID is the payload of an ELF note with name `"GNU"` and type
`NT_GNU_BUILD_ID`, a unique byte sequence that identifies a particular
binary (executable, shared library, loadable module, or driver module).
The linker generates this automatically based on a hash that includes
the complete symbol table and debugging information, even if this is
later stripped from the binary.

This specification uses the ELF Build ID as the sole means of
identifying binaries.  Each binary relevant to the log must have been
linked with a unique Build ID.  The symbolizing filter must have some
means of mapping a Build ID back to the original ELF binary (either the
whole unstripped binary, or a stripped binary paired with a separate
debug file).

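The format does not say how a filter maps a Build ID to a file.  As one
illustration only, the sketch below assumes the `.build-id` directory
convention used by some debuggers, where the first two hex digits of the
ID name a subdirectory and the remaining digits name the file.  The
function name `debug_file_for_build_id` and the root directory are
hypothetical.

```python
import string
from pathlib import Path


def debug_file_for_build_id(build_id_hex: str, root: Path) -> Path:
    """Return a conventional .build-id path for a hex-encoded Build ID.

    Assumes the layout <root>/.build-id/<first 2 digits>/<rest>.debug.
    This layout is an assumption of the sketch, not part of the format.
    """
    digits = build_id_hex.lower()
    # A Build ID is a byte sequence, so expect whole bytes and at least two.
    if len(digits) < 4 or len(digits) % 2 != 0:
        raise ValueError("Build ID must be an even number of hex digits")
    if any(c not in string.hexdigits for c in digits):
        raise ValueError("Build ID must contain only hex digits")
    return root / ".build-id" / digits[:2] / (digits[2:] + ".debug")
```

A filter using some other store (a database, a symbol server) would
replace this lookup and leave the rest of its logic unchanged.
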
## Colorization ##

The markup format supports a restricted subset of ANSI X3.64 SGR (Select
Graphic Rendition) control sequences.  These are unlike other markup
elements:
 * They specify presentation details (**bold** or colors) rather than
   semantic information.  The association of semantic meaning with color
   (e.g. red for errors) is chosen by the code doing the logging, rather
   than by the UI presentation of the symbolizing filter.  This is a
   concession to existing code (e.g. LLVM sanitizer runtimes) that uses
   specific colors and would require substantial changes to generate
   semantic markup instead.
 * A single control sequence changes "the state", rather than being a
   hierarchical structure that surrounds affected text.

The filter processes ANSI SGR control sequences only within a single
line.  If a control sequence to enter a **bold** or color state is
encountered, it's expected that the control sequence to reset to default
state will be encountered before the end of that line.  If a "dangling"
state is left at the end of a line, the filter may reset to default
state for the next line.

An SGR control sequence is not interpreted inside any other markup element.
However, other markup elements may appear between SGR control sequences and
the color/**bold** state is expected to apply to the symbolic output that
replaces the markup element in the filter's output.

The accepted SGR control sequences all have the form `"\033[%um"`
(expressed here using C string syntax), where `%u` is one of these:

| Code | Effect | Notes |
|:----:|:------:|-------|
| `0`  | Reset to default formatting. | |
| `1`  | Use **bold text**  | Combines with color states, doesn't reset them.|
| `30` | Black foreground   | |
| `31` | Red foreground     | |
| `32` | Green foreground   | |
| `33` | Yellow foreground  | |
| `34` | Blue foreground    | |
| `35` | Magenta foreground | |
| `36` | Cyan foreground    | |
| `37` | White foreground   | |

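As a rough illustration of how little code the accepted subset needs, the
sketch below splits one line into plain text and SGR state changes.  The
names `split_sgr` and `ACCEPTED_SGR_CODES` are hypothetical, and leaving
unaccepted sequences in the text untouched is a choice made here, not a
rule of the format.

```python
import re

# The ten accepted parameter values: reset, bold, and eight foreground colors.
ACCEPTED_SGR_CODES = {0, 1, *range(30, 38)}

# ESC [ <decimal digits> m
SGR_PATTERN = re.compile(r"\x1b\[([0-9]+)m")


def split_sgr(line: str):
    """Split one log line into ("text", str) and ("sgr", int) pieces."""
    pieces = []
    pos = 0
    for match in SGR_PATTERN.finditer(line):
        code = int(match.group(1))
        if code not in ACCEPTED_SGR_CODES:
            # Not part of the accepted subset: keep it as ordinary text.
            continue
        if match.start() > pos:
            pieces.append(("text", line[pos:match.start()]))
        pieces.append(("sgr", code))
        pos = match.end()
    if pos < len(line):
        pieces.append(("text", line[pos:]))
    return pieces
```

Because the function works on one line at a time, a caller can simply
start each new line in the default state, which matches the allowance
for "dangling" states described above.
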
## Common markup element syntax ##

All the markup elements share a common syntactic structure to facilitate
simple matching and parsing code.  Each element has the form:

```
{{{tag:fields}}}
```

`tag` identifies one of the element types described below, and is always
a short alphabetic string that must be in lower case.  The rest of the
element consists of one or more fields.  Fields are separated by `:` and
cannot contain any `:` or `}` characters.  How many fields must be or
may be present and what they contain is specified for each element type.

No markup elements or ANSI SGR control sequences are interpreted inside the
contents of a field.

In the descriptions of each element type, `printf`-style placeholders
indicate field contents:

* `%s`

  A string of printable characters, not including `:` or `}`.

* `%p`

  An address value represented by `0x` followed by an even number of
  hexadecimal digits (using either lower-case or upper-case for
  `A`..`F`).  If the digits are all `0` then the `0x` prefix may be
  omitted.  No more than 16 hexadecimal digits are expected to appear in
  a single value (64 bits).

* `%u`

  A nonnegative decimal integer.

* `%x`

  A sequence of an even number of hexadecimal digits (using either
  lower-case or upper-case for `A`..`F`), with no `0x` prefix.
  This represents an arbitrary sequence of bytes, such as an ELF Build ID.

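The common syntax can be matched with a single regular expression.  The
following is a minimal sketch for elements that sit on one line; it does
not handle elements that span lines.  The names `ELEMENT`,
`find_elements`, and `parse_address` are hypothetical.

```python
import re

# {{{tag}}} or {{{tag:field:field...}}}.  The tag is lower-case alphabetic
# and a field may not contain ':' or '}'.
ELEMENT = re.compile(r"\{\{\{([a-z]+)((?::[^:}]*)*)\}\}\}")

# %p: "0x" plus an even number of hex digits, or a run of zeros.
ADDRESS = re.compile(r"0x(?:[0-9A-Fa-f]{2})+|0+")


def find_elements(line: str):
    """Return (tag, fields) for each single-line markup element in a line."""
    found = []
    for match in ELEMENT.finditer(line):
        tag, rest = match.group(1), match.group(2)
        # rest is "" or ":f1:f2..."; drop the leading ':' before splitting.
        fields = rest[1:].split(":") if rest else []
        found.append((tag, fields))
    return found


def parse_address(field: str) -> int:
    """Parse a %p field into an integer, rejecting anything else."""
    if not ADDRESS.fullmatch(field):
        raise ValueError(f"not a %p field: {field!r}")
    return int(field, 16)
```

Text outside the matches is left alone, which reflects the intent that
plain log text needs no sanitizing.
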
## Presentation elements ##

These are elements that convey a specific program entity to be displayed
in human-readable symbolic form.

* `{{{symbol:%s}}}`

  Here `%s` is the linkage name for a symbol or type.  It may require
  demangling according to language ABI rules.  Even for unmangled names,
  it's recommended that this markup element be used to identify a symbol
  name so that it can be presented distinctively.

  Examples:
  ```
  {{{symbol:_ZN7Mangled4NameEv}}}
  {{{symbol:foobar}}}
  ```

* `{{{pc:%p}}}`

  Here `%p` is the memory address of a code location.
  It might be presented as a function name and source location.

  Examples:
  ```
  {{{pc:0x12345678}}}
  {{{pc:0xffffffff9abcdef0}}}
  ```

* `{{{data:%p}}}`

  Here `%p` is the memory address of a data location.
  It might be presented as the name of a global variable at that location.

  Examples:
  ```
  {{{data:0x12345678}}}
  {{{data:0xffffffff9abcdef0}}}
  ```

* `{{{bt:%u:%p}}}`

  This represents one frame in a backtrace.  It usually appears on a
  line by itself (surrounded only by whitespace), in a sequence of such
  lines with ascending frame numbers.  The human-readable output may
  therefore be formatted on that assumption, so that it looks good for a
  sequence of `bt` elements, each alone on its line with uniform
  indentation.  But the element can appear anywhere, so the filter
  should not remove any non-whitespace text surrounding it.

  Here `%u` is the frame number, which starts at zero for the location
  of the fault being identified, increments to one for the caller of
  frame zero's call frame, to two for the caller of frame one, etc.
  `%p` is the memory address of a code location.

  In frames after frame zero, this code location identifies a call site.
  Some emitters may subtract one byte or one instruction length from the
  actual return address for the call site, with the intent that the
  address logged can be translated directly to a source location for the
  call site and not for the apparent return site thereafter (which can
  be confusing).  It's recommended that emitters _not_ do this, so that
  each frame's code location is the exact return address given to its
  callee and e.g. could be highlighted in instruction-level disassembly.
  The symbolizing filter can do the adjustment to the address it
  translates into a source location.  Assuming that a call instruction
  is longer than one byte on all supported machines, applying the
  "subtract one byte" adjustment a second time still results in an
  address somewhere in the call instruction, so a little sloppiness here
  does no harm.

  Examples:
  ```
  {{{bt:0:0x12345678}}}
  {{{bt:1:0xffffffff9abcdef0}}}
  ```

* `{{{hexdict:...}}}`

  This element can span multiple lines.  Here `...` is a sequence of
  key-value pairs where a single `:` separates each key from its value,
  and arbitrary whitespace separates the pairs.  The value (right-hand
  side) of each pair either is one or more `0` digits, or is `0x`
  followed by hexadecimal digits.  Each value might be a memory address
  or might be some other integer (including an integer that looks like a
  likely memory address but actually has an unrelated purpose).  When
  the contextual information about the memory layout suggests that a
  given value could be a code location or a global variable data
  address, it might be presented as a source location or variable name
  or with active UI that makes such interpretation optionally visible.

  The intended use is for things like register dumps, where the emitter
  doesn't know which values might have a symbolic interpretation but a
  presentation that makes plausible symbolic interpretations available
  might be very useful to someone reading the log.  At the same time,
  a flat text presentation should usually avoid interfering too much
  with the original contents and formatting of the dump.  For example,
  it might use footnotes with source locations for values that appear
  to be code locations.  An active UI presentation might show the dump
  text as is, but highlight values with symbolic information available
  and pop up a presentation of symbolic details when a value is selected.

  Example:
  ```
  {{{hexdict:
    CS:                   0 RIP:     0x6ee17076fb80 EFL:            0x10246 CR2:                  0
    RAX:      0xc53d0acbcf0 RBX:     0x1e659ea7e0d0 RCX:                  0 RDX:     0x6ee1708300cc
    RSI:                  0 RDI:     0x6ee170830040 RBP:     0x3b13734898e0 RSP:     0x3b13734898d8
     R8:     0x3b1373489860  R9:         0x2776ff4f R10:     0x2749d3e9a940 R11:              0x246
    R12:     0x1e659ea7e0f0 R13: 0xd7231230fd6ff2e7 R14:     0x1e659ea7e108 R15:      0xc53d0acbcf0
  }}}
  ```

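Two of the behaviors described for `bt` and `hexdict` are easy to show
in code.  This is a sketch under stated assumptions: `lookup_address`
applies the "subtract one byte" adjustment on the filter side, and
`parse_hexdict` expects to be given only the text between `{{{hexdict:`
and `}}}`.  Both names are hypothetical.

```python
import re

# One key:value pair.  The value is "0x" plus hex digits, or a run of zeros.
HEXDICT_PAIR = re.compile(r"([^\s:]+):\s*(0x[0-9A-Fa-f]+|0+)(?![0-9A-Za-z])")


def parse_hexdict(body: str) -> dict:
    """Parse the body of a hexdict element into {key: integer value}."""
    return {key: int(value, 16) for key, value in HEXDICT_PAIR.findall(body)}


def lookup_address(frame_number: int, address: int) -> int:
    """Address a filter might use for the source lookup of one bt frame.

    Frame zero is the exact fault location.  Later frames carry return
    addresses, so step back one byte to land inside the call instruction.
    """
    return address if frame_number == 0 else address - 1
```

The filter would still display the address exactly as logged and use
the adjusted value only when looking up the source location.
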
## Trigger elements ##

These elements trigger an external action and are presented to the
user in human-readable form.  Generally the action produces a linkable
page.  The link, or other useful information about the external action,
can then be presented to the user.

* `{{{dumpfile:%s:%s}}}`

  Here the first `%s` is an identifier for a type of dump and the
  second `%s` is an identifier for a particular dump that's just been
  published.  The types of dumps, the exact meaning of "published",
  and the nature of the identifier are outside the scope of the markup
  format per se.  In general it might correspond to writing a file by
  that name or something similar.

  This element may trigger additional post-processing work beyond
  symbolizing the markup. It indicates that a dump file of some sort
  has been published.  Some logic attached to the symbolizing filter may
  understand certain types of dump file and trigger additional
  post-processing of the dump file upon encountering this element (e.g.
  generating visualizations, symbolization).  The expectation is that the
  information collected from contextual elements (described below) in the
  logging stream may be necessary to decode the content of the dump.  So
  if the symbolizing filter triggers other processing, it may need to
  feed some distilled form of the contextual information to those
  processes.

  On Zircon and Fuchsia in particular, "publish" means to call the
  `__sanitizer_publish_data` function from `<zircon/sanitizer.h>`
  with the "type" identifier as the "sink name" string.  The "dump
  identifier" is the name attached to the Zircon VMO whose handle
  was passed in the call to `__sanitizer_publish_data`.
  **TODO(mcgrathr): Link to docs about `__sanitizer_publish_data` and
  getting data dumps off the device.**

  An example of a type identifier is `sancov`, for dumps from LLVM
  [SanitizerCoverage](https://clang.llvm.org/docs/SanitizerCoverage.html).

  Example:
  ```
  {{{dumpfile:sancov:sancov.8675}}}
  ```

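How a filter reacts to `dumpfile` is left open by the format.  One
possible shape, shown here purely as a sketch, is a table of handlers
keyed by dump type.  Every name below (`HANDLERS`, `register`,
`on_dumpfile`) and the text the handlers return are hypothetical.

```python
from typing import Callable, Dict, List

# A handler takes the dump identifier and returns text to present.
DumpHandler = Callable[[str], str]
HANDLERS: Dict[str, DumpHandler] = {}


def register(dump_type: str, handler: DumpHandler) -> None:
    """Associate a post-processing handler with one dump type."""
    HANDLERS[dump_type] = handler


def on_dumpfile(fields: List[str]) -> str:
    """Return text to show in place of a dumpfile element."""
    if len(fields) != 2:
        raise ValueError("dumpfile takes exactly two fields")
    dump_type, dump_name = fields
    handler = HANDLERS.get(dump_type)
    if handler is None:
        # Unknown type: report the dump without post-processing it.
        return f"[{dump_type} dump published: {dump_name}]"
    return handler(dump_name)


# Illustrative handler; a real one might start a coverage report job.
register("sancov", lambda name: f"[coverage data: {name} (queued for processing)]")
```

A real handler would likely also receive the accumulated contextual
information, since decoding the dump may depend on it.
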
## Contextual elements ##

These are elements that supply information necessary to convert
presentation elements to symbolic form.  Unlike presentation elements,
they are not directly related to the surrounding text.  Contextual
elements should appear alone on lines with no other non-whitespace
text, so that the symbolizing filter might elide the whole line from
its output without hiding any other log text.

The contextual elements themselves do not necessarily need to be
presented in human-readable output.  However, the information they
impart may be essential to understanding the logging text even after
symbolization.  So it's recommended that this information be preserved
in some form when the original raw log with markup may no longer be
readily accessible for whatever reason.

Contextual elements should appear in the logging stream before they are
needed.  That is, if some piece of context may affect how the
symbolizing filter would interpret or present a later presentation
element, the necessary contextual elements should have appeared
somewhere earlier in the logging stream.  It should always be possible
for the symbolizing filter to be implemented as a single pass over the
raw logging stream, accumulating context and massaging text as it goes.

* `{{{reset}}}`

  This should be output before any other contextual element.  It exists
  to support implementations that handle logs coming from multiple
  processes.  Such implementations might not know when a new process
  starts or ends.  Because some identifying information (like process
  IDs) might be the same between old and new processes, a way is needed
  to distinguish two processes with such identical identifying
  information.  This element tells such implementations to reset the
  state of the filter, so that information from a previous process's
  contextual elements is not assumed for a new process that just
  happens to have the same identifying information.

* `{{{module:%i:%s:%s:...}}}`

  This element represents a so-called "module". A "module" is a single
  linked binary, such as a loaded ELF file. Usually each module occupies
  a contiguous range of memory (on Zircon it always does).

  Here `%i` is the Module ID which is used by other contextual elements
  to refer to this module. The first `%s` is a human-readable identifier
  for the module, such as an ELF `DT_SONAME` string or a file name; but
  it might be empty. It is purely informational. The Module ID
  will be exclusively used to refer to this module in other contextual
  elements. The second `%s` is the module type and it determines what
  the remaining fields are. The following module types are supported:

  * `elf:%x`

    Here `%x` encodes an ELF Build ID. The Build ID should refer to a
    single linked binary. The Build ID string is the sole way to identify
    the binary from which this module was loaded.

  Example:
  ```
  {{{module:1:libc.so:elf:83238ab56ba10497}}}
  ```

* `{{{mmap:%p:%x:...}}}`

  This contextual element is used to give information about a particular
  region in memory. `%p` is the starting address and `%x` gives the size
  in hex of the region of memory. The `...` part can take different forms
  to give different information about the specified region of memory. The
  allowed forms are the following:

  * `load:%i:%s:%p`

    This subelement informs the filter that a segment was loaded from a
    module. The module is identified by its Module ID `%i`. The `%s` is
    one or more of the letters 'r', 'w', and 'x' (in that order and in
    either upper or lower case) to indicate this segment of memory is
    readable, writable, and/or executable. The symbolizing filter can use
    this information to guess whether an address is a likely code address
    or a likely data address in the given module. The remaining `%p` gives
    the module-relative address. For ELF files the module-relative address
    will be the `p_vaddr` of the associated program header. For example, if
    your module's executable segment has `p_vaddr=0x1000`, `p_memsz=0x1234`,
    and was loaded at `0x7acba69d5000`, then you need to subtract
    `0x7acba69d4000` from any address between `0x7acba69d5000` and
    `0x7acba69d6234` to get the module-relative address. The starting
    address will usually have been rounded down to the active page size,
    and the size rounded up.

  Example:
  ```
  {{{mmap:0x7acba69d5000:0x5a000:load:1:rx:0x1000}}}
  ```

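The single-pass accumulation of context, and the address arithmetic in
the `load` description, can be sketched as follows.  The class and
method names are hypothetical, the input is assumed to be already split
into a tag and a list of fields, and the Module ID is kept as the string
that appeared in the markup for simplicity.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Segment:
    start: int          # runtime start address of the region
    size: int           # size of the region in bytes
    module_id: str      # Module ID as it appeared in the markup
    flags: str          # some of "rwx", lower-cased
    module_vaddr: int   # module-relative address of the region's start


@dataclass
class Context:
    modules: Dict[str, Tuple[str, str]] = field(default_factory=dict)
    segments: List[Segment] = field(default_factory=list)

    def handle(self, tag: str, fields: List[str]) -> None:
        """Update state from one contextual element."""
        if tag == "reset":
            self.modules.clear()
            self.segments.clear()
        elif tag == "module" and len(fields) >= 4 and fields[2] == "elf":
            module_id, name, _, build_id = fields[:4]
            self.modules[module_id] = (name, build_id.lower())
        elif tag == "mmap" and len(fields) >= 6 and fields[2] == "load":
            # int(..., 16) accepts the value with or without a "0x" prefix.
            start, size = int(fields[0], 16), int(fields[1], 16)
            vaddr = int(fields[5], 16)
            self.segments.append(
                Segment(start, size, fields[3], fields[4].lower(), vaddr))

    def resolve(self, address: int) -> Optional[Tuple[str, str, int]]:
        """Return (module name, Build ID, module-relative address) or None."""
        for seg in self.segments:
            module = self.modules.get(seg.module_id)
            if module is None:
                continue  # segment refers to a module that was never declared
            if seg.start <= address < seg.start + seg.size:
                name, build_id = module
                return name, build_id, address - seg.start + seg.module_vaddr
        return None
```

With the two example elements above, resolving `0x7acba69d5234` yields
module-relative address `0x1234` in `libc.so`, which is the same result
as subtracting `0x7acba69d4000`.
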