# Thread Local Storage #

The ELF Thread Local Storage ABI (TLS) is a storage model for variables that
allows each thread to have a unique copy of a global variable. This model
is used to implement C++'s `thread_local` storage model. On thread creation the
variable will be given its initial value from the initial TLS image. TLS
variables are for instance useful as buffers in thread-safe code or for
per-thread bookkeeping. C-style error state like `errno` or `dlerror` can also
be handled this way.
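
As a minimal sketch of the per-thread copies described above, each new thread
starts from the initial value and modifies only its own copy. The names
`counter` and `bump_in_new_thread` are made up for this illustration and are
not part of any ABI.

```cpp
#include <cassert>
#include <thread>

// Each thread gets its own copy of `counter`, starting from the initial
// value 10.
thread_local int counter = 10;

// Hypothetical helper: increments `counter` `times` times on a fresh thread
// and returns the value that thread observed.
int bump_in_new_thread(int times) {
  assert(times >= 0);
  int seen = 0;
  std::thread t([&] {
    for (int i = 0; i < times; ++i) ++counter;
    seen = counter;  // The new thread's private copy.
  });
  t.join();
  return seen;
}
```

Whatever the calling thread has stored in its own `counter` is unaffected by
the call, because the new thread never touches that copy.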

TLS variables are much like any other global/static variable. In implementation
their initial data winds up in the `PT_TLS` segment. The `PT_TLS` segment
is inside of a read-only `PT_LOAD` segment despite TLS variables being writable.
The `PT_TLS` segment is then copied into the process for each thread, to a
unique writable location. The location the `PT_TLS` segment is copied to is
influenced by the segment's alignment to ensure that the alignment of TLS
variables is respected.
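
The copy step can be sketched as follows. This is an illustration only:
`TlsSegment` and `place_tls_block` are hypothetical names, and the split
between initialized bytes (`p_filesz`) and zero-filled bytes up to `p_memsz`
is assumed from the usual meaning of ELF program header fields rather than
spelled out above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical view of the PT_TLS program header fields that matter here.
struct TlsSegment {
  const unsigned char* image;  // Initial data (the TLS image).
  std::size_t filesz;          // Bytes of initialized data in the image.
  std::size_t memsz;           // Total size of the block, >= filesz.
  std::size_t align;           // Required alignment, a power of two.
};

// Copies the segment into `buf` at the first suitably aligned address and
// returns that address, or nullptr if `buf` is too small.
unsigned char* place_tls_block(const TlsSegment& seg, unsigned char* buf,
                               std::size_t buf_size) {
  assert(seg.align != 0 && (seg.align & (seg.align - 1)) == 0);
  assert(seg.filesz <= seg.memsz);
  const std::uintptr_t start = reinterpret_cast<std::uintptr_t>(buf);
  const std::uintptr_t mask = static_cast<std::uintptr_t>(seg.align) - 1;
  const std::size_t skipped =
      static_cast<std::size_t>(((start + mask) & ~mask) - start);
  if (skipped > buf_size || seg.memsz > buf_size - skipped) return nullptr;
  unsigned char* block = buf + skipped;
  if (seg.filesz != 0) std::memcpy(block, seg.image, seg.filesz);
  // Everything past the initialized data starts out as zero.
  std::memset(block + seg.filesz, 0, seg.memsz - seg.filesz);
  return block;
}
```

A thread implementation would run something like this once per TLS segment
for every new thread, which is what produces the per-thread TLS blocks.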

## ABI ##

The interface that the compiler, linker, and dynamic linker must adhere to is
quite simple, even though the details of the implementation are more complex.
The compiler and the linker must emit code and dynamic relocations that
use one of the 4 access models (described in a following section). The dynamic
linker and thread implementation must then set everything up so that this
works. Different architectures have different ABIs, but they are similar
enough in broad strokes that we can speak about most of them as if there were
just one ABI. This document will assume that either x86-64 or AArch64 is being
used and will point out differences when they occur.

The TLS ABI makes use of a few terms:

  * Thread Pointer: This is a unique address in each thread, generally stored
    in a register. Thread local variables lie at offsets from the thread pointer.
    Thread Pointer will be abbreviated and used as `$tp` in this document. `$tp`
    is what `__builtin_thread_pointer()` returns on AArch64. On AArch64 `$tp`
    is given by a special register named `TPIDR_EL0` that can be accessed using
    `mrs <reg>, TPIDR_EL0`. On `x86_64` the `fs.base` segment base is used;
    memory relative to it can be addressed with the `%fs:` prefix, and its value
    can be loaded from `%fs:0` or with the `rdfsbase` instruction.
  * TLS Segment: This is the image of data in each module and specified by the
    `PT_TLS` program header in each module. Not every module has a `PT_TLS`
    program header and thus not every module has a TLS segment. Each module
    has at most one TLS segment and correspondingly at most one `PT_TLS`
    program header.
  * Static TLS set: This is the sum total of modules that are known to the
    dynamic linker at program start up time. It consists of the main executable
    and every library transitively mentioned by `DT_NEEDED`. Modules that
    require being in the Static TLS set have `DF_STATIC_TLS` set on their
    `DT_FLAGS` entry in their dynamic table (given by the `PT_DYNAMIC` segment).
  * TLS Region: This is a contiguous region of memory unique to each
    thread. `$tp` will point to some point in this region. It contains the
    TLS segment of every module in Static TLS set as well as some
    implementation-private data which is sometimes called the TCB (Thread
    Control Block). On AArch64 a 16-byte reserved space starting at `$tp` is
    also sometimes called the TCB. We will refer to this space as the "ABI TCB"
    in this doc.
  * TLS Block: This is an individual thread's copy of a TLS segment. There is
    one TLS block per TLS segment per thread.
  * Module ID: A unique non-zero ID for each module. It is not statically known,
    except for the main executable's module ID, which is always 1. Other modules'
    module IDs are chosen by the dynamic linker. In theory an ID could be any
    non-zero 64-bit value that is unique to the module, such as a hash. In
    practice it is a simple counter that the dynamic linker maintains.
  * The main executable: This is the module that contains the start address. It
    is also treated in a special way in one of the access models. It always
    has a Module ID of 1. This is the only module that can use fixed offsets
    from `$tp` via the Local Exec model described below.

To comply with the ABI all access models must be supported.

#### Access Models ####

There are 4 access models specified by the ABI:

  * `global-dynamic`
  * `local-dynamic`
  * `initial-exec`
  * `local-exec`

These are the values that can be used for `-ftls-model=...` and
`__attribute__((tls_model("...")))`.
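
For illustration, a per-variable override might look like the following
sketch. The variable and function names are invented for this example, and the
attribute is a GCC/Clang extension. The chosen model changes how the variable
is addressed, not what the program computes.

```cpp
#include <cassert>

// Force the most general model for this variable regardless of compiler flags.
thread_local int general_counter
    __attribute__((tls_model("global-dynamic"))) = 0;

// Request initial-exec. This is only safe if the defining module is part of
// the static TLS set.
thread_local int fast_counter __attribute__((tls_model("initial-exec"))) = 0;

// Hypothetical helper that touches both variables on the calling thread.
int bump_both() {
  ++general_counter;
  ++fast_counter;
  assert(general_counter == fast_counter);
  return general_counter + fast_counter;
}
```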

Which model is used relates to:

1. Which module is performing the access:
  1. The main executable
  2. A module in the static TLS set
  3. A module that was loaded after startup, e.g. by `dlopen`
2. Which module the variable being accessed is defined in:
  1. Within the same module (i.e. `local-*`)
  2. In a different module (i.e. `global-*`)

* `global-dynamic` can be used from anywhere, for any variable.
* `local-dynamic` can be used by any module, for any variable defined in that
  same module.
* `initial-exec` can be used by any module for any variable defined in the static
  TLS set.
* `local-exec` can be used by the main executable for variables defined in the
  main executable.
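
The rules in the list above can be written down as a small predicate. This is
a sketch under stated assumptions: the `Model` and `Module` enums and
`can_use` are hypothetical names, and `Module::StaticTlsSet` stands for a
module in the static TLS set other than the main executable.

```cpp
#include <cassert>

enum class Model { GlobalDynamic, LocalDynamic, InitialExec, LocalExec };
enum class Module { MainExecutable, StaticTlsSet, Dlopened };

// Returns true if code in `accessor` may use `model` to reach a variable
// defined in `definer`. `same_module` says whether both are the same module.
bool can_use(Model model, Module accessor, Module definer, bool same_module) {
  assert(!same_module || accessor == definer);
  switch (model) {
    case Model::GlobalDynamic:
      return true;  // Works from anywhere, for any variable.
    case Model::LocalDynamic:
      return same_module;
    case Model::InitialExec:
      return definer != Module::Dlopened;  // Definer is in the static TLS set.
    case Model::LocalExec:
      return accessor == Module::MainExecutable && same_module;
  }
  return false;
}
```

For example, `can_use(Model::LocalExec, Module::StaticTlsSet,
Module::StaticTlsSet, true)` is false, because only the main executable may
use `local-exec`.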

###### Global Dynamic ######

Global dynamic is the most general access format. It is also the slowest.
Any thread-local global variable should be accessible with this method. This
access model *must* be used if a dynamic library accesses a symbol defined in
another module (see exception in section on Initial Exec). Symbols defined
within the executable need not use this access model. The main executable can
also avoid using this access model. This is the default access model when
compiling with `-fPIC` as is the norm for shared libraries.

This access model works by calling a function defined in the dynamic linker.
There are two ways functions might be called, via TLSDESC, or via
`__tls_get_addr`.

In the case of `__tls_get_addr` it is passed the pair of `GOT` entries
associated with this symbol. Specifically it is passed a pointer to the first
entry, and the second entry comes right after it. For a given symbol `S`, the
first entry, denoted `GOT_S[0]`, must contain the Module ID of the module in
which `S` was defined. The second entry, denoted `GOT_S[1]`, must contain the
offset into the TLS block, which is the same as the offset of the symbol in the
`PT_TLS` segment of the associated module. The pointer to `S` is then computed
using `__tls_get_addr(GOT_S)`. The implementation of `__tls_get_addr` will be
discussed later.

TLSDESC is an alternative ABI for `global-dynamic` (and `local-dynamic`) access
in which a different pair of `GOT` slots is used. The first `GOT` slot contains
a function pointer and the second contains some dynamic-linker-defined
auxiliary data. This gives the dynamic linker a choice over which function is
called depending on circumstance.
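
The following is a simplified model of the TLSDESC idea, not the real calling
convention. All names are hypothetical, the thread state is simulated with
ordinary variables, and the sketch assumes the resolver returns an offset from
`$tp`, which is how TLSDESC resolvers commonly behave.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Simulated per-thread state; real code reads $tp from a register.
thread_local char* sim_tp = nullptr;
thread_local char** sim_dtv = nullptr;  // sim_dtv[id]: TLS block of module id.

struct TlsDesc {
  std::ptrdiff_t (*resolver)(const TlsDesc*);  // First GOT slot.
  std::uintptr_t arg;                          // Second GOT slot.
};

struct ModuleOffset {
  std::size_t module_id;
  std::size_t offset;
};

// Resolver for the static TLS set: `arg` already is the offset from $tp.
std::ptrdiff_t resolve_static(const TlsDesc* desc) {
  return static_cast<std::ptrdiff_t>(desc->arg);
}

// Resolver for other modules: `arg` points at a (module ID, offset) pair.
std::ptrdiff_t resolve_dynamic(const TlsDesc* desc) {
  const auto* mo = reinterpret_cast<const ModuleOffset*>(desc->arg);
  return (sim_dtv[mo->module_id] + mo->offset) - sim_tp;
}

// What the compiled access sequence amounts to in this model.
char* tlsdesc_address(const TlsDesc* desc) {
  assert(sim_tp != nullptr);
  return sim_tp + desc->resolver(desc);
}
```

The accessing code is identical in both cases. Only the contents of the two
slots differ, which is what lets the dynamic linker pick a cheap resolver when
the variable lives in the static TLS set.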

In both cases the calls to these functions must be implemented by a specific
code sequence and a specific set of relocs. This allows the linker to recognize
these accesses and potentially relax them to the `local-dynamic` access model.

(NOTE: The following paragraph contains details about how the compiler upholds
its end of the ABI. Skip this paragraph if you don't care about that.)

For the compiler to emit code for this access model, it needs to emit a call to
`__tls_get_addr` (defined by the dynamic linker) and a reference to the symbol
name. Specifically the compiler emits code for `__tls_get_addr(GOT_S)` (minding
the additional relocation needed for the GOT itself). The linker then emits two
dynamic relocations when generating the GOT. On `x86_64` these are
`R_X86_64_DTPMOD64` and `R_X86_64_DTPOFF64`. On AArch64 these are
`R_AARCH64_TLS_DTPMOD` and `R_AARCH64_TLS_DTPREL`. These relocations reference
the symbol regardless of whether the module defines a symbol by that name.

###### Local Dynamic ######

Local dynamic is the same as Global Dynamic but for local symbols. It can be
thought of as a single `global-dynamic` access to the TLS block of this module.
Then, because every variable defined in the module is at a fixed offset from the
start of the TLS block, the compiler can optimize multiple `global-dynamic`
calls into one. The compiler will relax a `global-dynamic` access to a
`local-dynamic` access whenever the variables are local/static or have hidden
visibility. The linker may sometimes be able to relax some `global-dynamic`
accesses to `local-dynamic` as well.

The following gives an example of how the compiler might emit code for this
access model:

```
static thread_local char buf[buf_cap];
static thread_local size_t buf_size = 0;
while(*str && buf_size < buf_cap) {
  buf[buf_size++] = *str++;
}
```
might be lowered to
```
// GOT_module[0] is the module ID of this module
// GOT_module[1] is just 0
// <X> denotes the offset of X in this module's TLS block
tls = __tls_get_addr(GOT_module);
while(*str && *(size_t*)(tls+<buf_size>) < buf_cap) {
  ((char*)(tls+<buf>))[(*(size_t*)(tls+<buf_size>))++] = *str++;
}
```

If this code used global dynamic it would have to make at least 2 calls, one to
get the pointer for `buf` and the other to get the pointer for `buf_size`.

###### Initial Exec ######

This access model can be used anytime the compiler knows that the module in
which the accessed symbol is defined will be loaded in the initial executable
set rather than opened using `dlopen`. This access model is generally
only used when the main executable is accessing a global symbol with default
visibility. This is because compiling an executable is the only time the
compiler knows that any code generated will be in the initial executable set. If
a DSO is compiled to make thread local accesses use this model then the DSO
cannot be safely opened with `dlopen`. This is acceptable in performance
critical applications and in cases where you know the module will never be
opened with `dlopen`, as is the case for libc. Modules compiled/linked this way
have their `DF_STATIC_TLS` flag set.

Initial Exec is the default when compiling without `-fPIC`.

The compiler emits code without even calling `__tls_get_addr` for this access
model. Instead it uses a single GOT entry per symbol, denoted `GOT[s]` for
symbol `s`, and emits relocations so that this entry holds the offset of `s`
from `$tp`. As a result, code like

```
extern thread_local int a;
extern thread_local int b;
int main() {
  return a + b;
}
```
would be lowered to something like the following:
```
int main() {
  return *(int*)($tp + GOT[a]) + *(int*)($tp + GOT[b]);
}
```

Note that on x86 architectures `GOT[s]` will actually resolve to a negative
value.

###### Local Exec ######

This is the fastest access model and can only be used if the symbol is in the
first TLS block, which is the TLS block of the main executable. In practice only
the main executable can use this access model because a shared library can't
(and normally wouldn't need to) know if it is accessing something from the main
executable. The linker will relax `initial-exec` to `local-exec`. The compiler
can't do this without explicit instructions via `-ftls-model` or
`__attribute__((tls_model("...")))` because the compiler cannot know if the
current translation unit is going to be linked into a main executable or a
shared library.

The precise details of how the offset from `$tp` is computed change a bit
from architecture to architecture.

Example code:
```
static thread_local int a;
static thread_local int b;

int main() {
  return a + b;
}
```
would be lowered to
```
int main() {
  return *(int*)($tp+TPOFF_a) + *(int*)($tp+TPOFF_b);
}
```

On AArch64 `TPOFF_a == max(16, p_align) + <a>` where `p_align` is exactly the
`p_align` field of the main executable's `PT_TLS` segment and `<a>` is the
offset of `a` from the beginning of the main executable's TLS segment.

On `x86_64` `TPOFF_a == -<a>` where `<a>` is the offset of the `a` from the *end*
of the main executable's TLS segment.

The linker is aware of what `TPOFF_X` is for any given `X` and fills in this
value.
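
The two formulas above can be checked with a few lines of arithmetic. The
function names below are made up for this sketch. In practice the static
linker performs this computation, not user code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// AArch64: the main executable's TLS block sits above $tp, after the 16-byte
// ABI TCB and any alignment padding. `offset_from_start` is <a> above.
std::int64_t tpoff_aarch64(std::int64_t p_align,
                           std::int64_t offset_from_start) {
  assert(p_align > 0 && offset_from_start >= 0);
  return std::max<std::int64_t>(16, p_align) + offset_from_start;
}

// x86_64: the block sits below $tp, so offsets are negative.
// `offset_from_end` is <a> measured from the end of the TLS segment.
std::int64_t tpoff_x86_64(std::int64_t offset_from_end) {
  assert(offset_from_end >= 0);
  return -offset_from_end;
}
```

For instance, with `p_align == 64` a variable 4 bytes into the segment has
`TPOFF == 68` on AArch64, while a variable 8 bytes from the end of the segment
has `TPOFF == -8` on `x86_64`.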

## Implementation ##

This section discusses the implementation as it exists on Fuchsia. That said,
the broad strokes are similar across different libc implementations, including
musl and glibc.

The actual implementation introduces a few more details, namely the so-called
"DTV" (Dynamic Thread Vector, denoted `dtv` in this doc), which indexes TLS
blocks by module ID. The following diagrams show what the initial executable set
looks like. In Fuchsia's implementation we store a bunch of metadata in a
thread descriptor struct along with the ABI TCB (denoted `tcb` below). In our
implementation we use the first 8 bytes of this space to point to the DTV. At
first `tcb` points to `dtv` as shown in the diagrams below, but after a `dlopen`
this can change.

arm64:
```
*------------------------------------------------------------------------------*
| thread | tcb | X | tls1 | ... | tlsN | ... | tls_cnt | dtv[1] | ... | dtv[N] |
*------------------------------------------------------------------------------*
^         ^         ^             ^            ^
td        tp      dtv[1]       dtv[n+1]       dtv
```

Here `X` has size `max(16, tls_align) - 16` where `tls_align` is the maximum
alignment of all loaded TLS segments from the static TLS set. This is set by
the static linker since the static linker resolves `TPOFF_*` values. This
padding is set so that if, as required, `$tp` is aligned to the main
executable's `PT_TLS` segment's `p_align` value then `tls1 - $tp` will be
`max(16, p_align)`. This ensures that there is always at least a 16-byte space
for the ABI TCB (denoted `tcb` in the diagram above).
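
Since `tls1 - $tp` must equal `max(16, p_align)` and the ABI TCB occupies the
first 16 bytes at `$tp`, the padding is whatever remains. A small sketch of
that arithmetic, with hypothetical helper names:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

constexpr std::size_t kAbiTcbSize = 16;  // Reserved space starting at $tp.

// Padding between the ABI TCB and the first TLS block on AArch64.
std::size_t aarch64_padding(std::size_t p_align) {
  assert(p_align != 0 && (p_align & (p_align - 1)) == 0);
  return std::max(kAbiTcbSize, p_align) - kAbiTcbSize;
}

// Distance from $tp to the first TLS block: the ABI TCB plus the padding.
std::size_t aarch64_first_block_offset(std::size_t p_align) {
  return kAbiTcbSize + aarch64_padding(p_align);
}
```

With an alignment of 16 or less there is no padding at all. With an alignment
of 64 the padding is 48 bytes, which places the first block 64 bytes past
`$tp`.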

x86:
```
*-----------------------------------------------------------------------------*
| tls_cnt | dtv[1] | ... | dtv[N] | ... | tlsN | ... | tls1 | tcb |  thread   |
*-----------------------------------------------------------------------------*
^                                       ^             ^       ^
dtv                                  dtv[n+1]       dtv[1]  tp/td
```

Here `td` denotes the "thread descriptor pointer". In both implementations this
points to the thread descriptor. A subtle point not made apparent in these
diagrams is that `tcb` is actually a member of the thread descriptor struct in
both cases but on AArch64 it is the last member and on `x86_64` it is the first
member.

#### dlopen ####

This picture explains what happens for the initial executables but it doesn't
explain what happens in the `dlopen` case. When `__tls_get_addr` is called it
first checks whether `tls_cnt` is such that the module ID (given by
`GOT_s[0]`) is within the `dtv`. If it is, then it simply looks up
`dtv[GOT_s[0]] + GOT_s[1]`; if it isn't, something more complicated happens. See
the implementation of `__tls_get_new` in
[dynlink.c](https://fuchsia.googlesource.com/zircon/+/master/third_party/ulib/musl/ldso/dynlink.c).
In a nutshell, a sufficiently large space was already allocated for a larger
`dtv` on a call to `dlopen`. It is an invariant of the system that sufficient
space will always exist somewhere already allocated. The larger space is then
set up to be a proper `dtv`. `tcb` is then set to point to this new larger
`dtv`. Future accesses will then use the simpler code path since `tls_cnt` will
be large enough.
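
The fast and slow paths can be modelled roughly as below. This is a simplified
sketch and not the code in `dynlink.c`: `ThreadState`, `Reserved`, `GotPair`,
and `sim_tls_get_addr` are hypothetical names, the vector's size stands in for
`tls_cnt`, and `Reserved` stands in for the space that `dlopen` is said to
have set aside ahead of time.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simulated per-thread state: dtv[id] is this thread's block for module id.
// Index 0 is unused because module IDs are non-zero.
struct ThreadState {
  std::vector<char*> dtv;
};

// Blocks assumed to have been reserved for this thread by earlier dlopen
// calls, indexed by module ID like the dtv.
struct Reserved {
  std::vector<char*> blocks;
};

struct GotPair {
  std::size_t module_id;  // GOT_s[0]
  std::size_t offset;     // GOT_s[1]
};

char* sim_tls_get_addr(ThreadState& t, const Reserved& r, GotPair got) {
  assert(got.module_id != 0 && got.module_id < r.blocks.size());
  if (got.module_id >= t.dtv.size()) {
    // Slow path: switch to a larger dtv covering every module loaded so far.
    const std::size_t old_size = t.dtv.size();
    t.dtv.resize(r.blocks.size());
    for (std::size_t id = old_size; id < t.dtv.size(); ++id) {
      t.dtv[id] = r.blocks[id];
    }
  }
  // Fast path: the module ID is within the dtv.
  return t.dtv[got.module_id] + got.offset;
}
```

After the first access to a newly loaded module, later accesses on the same
thread take the fast path because the `dtv` already covers that module ID.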