# Thread Local Storage #

The ELF Thread Local Storage ABI (TLS) is a storage model for variables that
allows each thread to have a unique copy of a global variable. This model
is used to implement C++'s `thread_local` storage class. On thread creation the
variable is given its initial value from the initial TLS image. TLS variables
are useful, for instance, as buffers in thread-safe code or for per-thread
bookkeeping. C-style error state such as `errno` or the `dlerror` message can
also be handled this way.

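As a small illustration of the per-thread copies described above, the following sketch starts a thread and shows that it sees the initial value rather than the main thread's modified copy. The names `visits` and `visits_seen_by_new_thread` are made up for this example.

```cpp
#include <thread>

// Each thread gets its own copy, initialized to 5 from the TLS image.
thread_local int visits = 5;

// Runs a fresh thread that increments its own copy and reports the result.
int visits_seen_by_new_thread() {
  int seen = 0;
  std::thread worker([&seen] { seen = ++visits; });
  worker.join();
  return seen;
}
```

Whatever the calling thread has stored in its own `visits` is unaffected by the worker, because the worker increments a different copy.
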
TLS variables are much like any other global or static variable. In the
implementation, their initial data winds up in the `PT_TLS` segment. The
`PT_TLS` segment sits inside a read-only `PT_LOAD` segment even though TLS
variables are writable. For each thread, this segment is copied into the
process at a unique writable location. Where the copy is placed is influenced
by the segment's alignment, so that the alignment of TLS variables is
respected.

## ABI ##

The interface that the compiler, linker, and dynamic linker must adhere to is
quite simple, even though the details of the implementation are more complex.
The compiler and the linker must emit code and dynamic relocations that use one
of the 4 access models (described in a following section). The dynamic linker
and thread implementation must then set everything up so that this works.
Different architectures have different ABIs, but they are similar enough in
broad strokes that we can speak about most of them as if there were just one
ABI. This document assumes that either x86-64 or AArch64 is being used and
points out differences where they occur.

The TLS ABI makes use of a few terms:

  * Thread Pointer: This is a unique address in each thread, generally stored
    in a register. Thread local variables lie at offsets from the thread pointer.
    The thread pointer is abbreviated as `$tp` in this document. `$tp` is what
    `__builtin_thread_pointer()` returns on AArch64. On AArch64 `$tp` is held
    in a special register named `TPIDR_EL0` that can be read using
    `mrs <reg>, TPIDR_EL0`. On `x86_64` the `fs.base` segment base is used.
    Memory relative to it is addressed with the `%fs:` prefix, and its value
    can be loaded from `%fs:0` or with the `rdfsbase` instruction.
  * TLS Segment: This is the image of data in each module, specified by the
    `PT_TLS` program header in that module. Not every module has a `PT_TLS`
    program header and thus not every module has a TLS segment. Each module
    has at most one TLS segment and correspondingly at most one `PT_TLS`
    program header.
  * Static TLS set: This is the set of modules that are known to the
    dynamic linker at program startup time. It consists of the main executable
    and every library transitively mentioned by `DT_NEEDED`. Modules that
    require being in the Static TLS set have `DF_STATIC_TLS` set on the
    `DT_FLAGS` entry in their dynamic table (given by the `PT_DYNAMIC` segment).
  * TLS Region: This is a contiguous region of memory unique to each
    thread. `$tp` points to some point in this region. It contains a TLS
    block for every module in the Static TLS set as well as some
    implementation-private data which is sometimes called the TCB (Thread
    Control Block). On AArch64 a 16-byte reserved space starting at `$tp` is
    also sometimes called the TCB. We will refer to this space as the "ABI TCB"
    in this doc.
  * TLS Block: This is an individual thread's copy of a TLS segment. There is
    one TLS block per TLS segment per thread.
  * Module ID: A unique non-zero ID for each module. It is not statically known,
    except for the main executable's module ID, which is always 1. Other
    modules' IDs are chosen by the dynamic linker. In theory an ID could be
    any non-zero 64-bit value that is unique to the module, such as a hash. In
    practice it is a simple counter that the dynamic linker maintains.
  * The main executable: This is the module that contains the start address. It
    is also treated in a special way in one of the access models. It always
    has a Module ID of 1. This is the only module that can use fixed offsets
    from `$tp` via the Local Exec model described below.

To comply with the ABI all access models must be supported.

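Some of the terms above can be observed from a running program. The sketch below assumes a Linux-style libc (glibc or musl) that provides `dl_iterate_phdr` in `<link.h>`; it is not Fuchsia-specific code. It counts the loaded modules that carry a `PT_TLS` program header. `tls_probe`, `count_pt_tls` and `modules_with_tls` are names invented for this example.

```cpp
#include <link.h>  // dl_iterate_phdr, dl_phdr_info, PT_TLS

#include <cstddef>

// Defining a thread-local gives this module a TLS segment of its own.
thread_local int tls_probe = 7;

// Callback for dl_iterate_phdr: bumps the counter for modules with PT_TLS.
static int count_pt_tls(dl_phdr_info* info, std::size_t, void* data) {
  for (int i = 0; i < info->dlpi_phnum; ++i) {
    if (info->dlpi_phdr[i].p_type == PT_TLS) {
      ++*static_cast<int*>(data);
      break;  // a module has at most one PT_TLS header
    }
  }
  return 0;  // returning 0 keeps the iteration going
}

// Number of currently loaded modules that have a TLS segment.
int modules_with_tls() {
  int count = 0;
  dl_iterate_phdr(count_pt_tls, &count);
  return count;
}
```
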
#### Access Models ####

There are 4 access models specified by the ABI:

  * `global-dynamic`
  * `local-dynamic`
  * `initial-exec`
  * `local-exec`

These are the values that can be used for `-ftls-model=...` and
`__attribute__((tls_model("...")))`.

Which model is used depends on:

1. Which module is performing the access:
   1. The main executable
   2. A module in the static TLS set
   3. A module that was loaded after startup, e.g. by `dlopen`
2. Which module the variable being accessed is defined in:
   1. Within the same module (i.e. `local-*`)
   2. In a different module (i.e. `global-*`)

* `global-dynamic` can be used from anywhere, for any variable.
* `local-dynamic` can be used by any module, for any variable defined in that
  same module.
* `initial-exec` can be used by any module, for any variable defined in the
  static TLS set.
* `local-exec` can be used by the main executable, for variables defined in the
  main executable.

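For reference, a model can be selected per variable with the attribute mentioned above. The sketch below assumes GCC or Clang and assumes the code is linked into a main executable, where all four models are permitted for variables it defines itself. The variable and function names are hypothetical.

```cpp
// One variable per access model; the attribute sits after the declarator.
thread_local int gd_counter __attribute__((tls_model("global-dynamic"))) = 1;
thread_local int ld_counter __attribute__((tls_model("local-dynamic"))) = 2;
thread_local int ie_counter __attribute__((tls_model("initial-exec"))) = 3;
thread_local int le_counter __attribute__((tls_model("local-exec"))) = 4;

// Reads all four variables; the result is the same whichever model is used,
// only the generated access sequence differs.
int sum_counters() {
  return gd_counter + ld_counter + ie_counter + le_counter;
}
```
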
###### Global Dynamic ######

Global dynamic is the most general access format. It is also the slowest.
Any thread-local global variable is accessible with this method. This
access model *must* be used if a dynamic library accesses a symbol defined in
another module (see the exception in the section on Initial Exec). The main
executable can avoid this access model, and symbols defined within the
executable need not be accessed through it. This is the default access model
when compiling with `-fPIC`, as is the norm for shared libraries.

This access model works by calling a function defined in the dynamic linker.
There are two ways this function might be called: via TLSDESC, or via
`__tls_get_addr`.

In the case of `__tls_get_addr`, the function is passed the pair of `GOT`
entries associated with the symbol. Specifically, it is passed a pointer to the
first entry, and the second entry comes right after it. For a given symbol `S`,
the first entry, denoted `GOT_S[0]`, must contain the Module ID of the module in
which `S` was defined. The second entry, denoted `GOT_S[1]`, must contain the
offset into the TLS block, which is the same as the offset of the symbol in the
`PT_TLS` segment of the associated module. The pointer to `S` is then computed
using `__tls_get_addr(GOT_S)`. The implementation of `__tls_get_addr` is
discussed later.

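The address computation behind `__tls_get_addr(GOT_S)` can be modeled in a few lines. This is a toy model under the assumption that every requested block already exists; `ToyGotPair`, `ToyThreadTls` and `toy_tls_get_addr` are hypothetical names and not the dynamic linker's real data structures.

```cpp
#include <cstddef>

// Hypothetical mirror of the two consecutive GOT slots for a symbol S.
struct ToyGotPair {
  std::size_t module_id;  // GOT_S[0]: module that defines S
  std::size_t offset;     // GOT_S[1]: offset of S within that module's TLS block
};

// Toy per-thread table: blocks[m] is the base of module m's TLS block.
// Index 0 is unused because module IDs are non-zero.
struct ToyThreadTls {
  unsigned char* blocks[4];
};

// Looks up the calling thread's block for the module and adds the offset.
void* toy_tls_get_addr(const ToyThreadTls& thread, const ToyGotPair* got) {
  return thread.blocks[got->module_id] + got->offset;
}
```
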
TLSDESC is an alternative ABI for `global-dynamic` access (and `local-dynamic`)
that uses a different pair of `GOT` slots. The first `GOT` slot contains a
function pointer. The second contains some dynamic-linker-defined auxiliary
data. This gives the dynamic linker a choice over which function is called
depending on circumstance.

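A rough model of such a descriptor is sketched below. It assumes the resolver returns an offset from `$tp` and uses an ordinary function call, whereas real TLSDESC resolvers use a special calling convention. All names (`ToyTlsDesc`, `toy_resolve_static`, `toy_tlsdesc_access`) are invented for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical mirror of the two GOT slots used by TLSDESC.
struct ToyTlsDesc {
  std::ptrdiff_t (*resolve)(const ToyTlsDesc*);  // first slot: function pointer
  std::uintptr_t arg;                            // second slot: auxiliary data
};

// A resolver the dynamic linker could pick for a variable in the static TLS
// set: here the auxiliary data is simply the offset from $tp.
std::ptrdiff_t toy_resolve_static(const ToyTlsDesc* desc) {
  return static_cast<std::ptrdiff_t>(desc->arg);
}

// What an access amounts to: call through the descriptor, then add $tp.
unsigned char* toy_tlsdesc_access(unsigned char* tp, const ToyTlsDesc* desc) {
  return tp + desc->resolve(desc);
}
```

A different resolver, for example one that handles modules loaded by `dlopen`, could be installed in the same slot without changing the calling code.
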
In both cases the calls to these functions must be implemented by a specific
code sequence and a specific set of relocations. This allows the linker to
recognize these accesses and potentially relax them to the `local-dynamic`
access model.

(NOTE: The following paragraph contains details about how the compiler upholds
its end of the ABI. Skip this paragraph if you don't care about that.)

For the compiler to emit code for this access model, it needs to emit a call
to `__tls_get_addr` (defined by the dynamic linker) and a reference to the
symbol name. Specifically, the compiler emits code for `__tls_get_addr(GOT_S)`
(minding the additional relocation needed for the GOT itself). The linker then
emits two dynamic relocations when generating the GOT. On `x86_64` these are
`R_X86_64_DTPMOD64` and `R_X86_64_DTPOFF64`. On AArch64 these are
`R_AARCH64_TLS_DTPMOD` and `R_AARCH64_TLS_DTPREL`. These relocations reference
the symbol regardless of whether the module defines a symbol by that name.

###### Local Dynamic ######

Local dynamic is the same as Global Dynamic but for local symbols. It can be
thought of as a single `global-dynamic` access to the TLS block of this module.
Because every variable defined in the module is at a fixed offset from the
start of the TLS block, the compiler can optimize multiple `global-dynamic`
calls into one. The compiler will relax a `global-dynamic` access to a
`local-dynamic` access whenever the variables are local/static or have hidden
visibility. The linker may sometimes be able to relax some `global-dynamic`
accesses to `local-dynamic` as well.

The following gives an example of how the compiler might emit code for this
access model:

```
#include <cstddef>

static constexpr size_t buf_cap = 64;
static thread_local char buf[buf_cap];
static thread_local size_t buf_size = 0;

void append(const char* str) {
  while (*str && buf_size < buf_cap) {
    buf[buf_size++] = *str++;
  }
}
```
might be lowered to
```
// GOT_module[0] is the module ID of this module
// GOT_module[1] is just 0
// <X> denotes the offset of X in this module's TLS block
void append(const char* str) {
  char* tls = __tls_get_addr(GOT_module);  // one call serves both variables
  while (*str && *(size_t*)(tls + <buf_size>) < buf_cap) {
    ((char*)(tls + <buf>))[(*(size_t*)(tls + <buf_size>))++] = *str++;
  }
}
```

If this code used global dynamic it would have to make at least 2 calls, one to
get the pointer for `buf` and the other to get the pointer for `buf_size`.

###### Initial Exec ######

This access model can be used any time the compiler knows that the module
defining the accessed symbol will be part of the static TLS set rather than
opened using `dlopen`. This access model is generally only used when the main
executable is accessing a global symbol with default visibility. This is
because compiling an executable is the only time the compiler knows that the
code it generates will be in the static TLS set. If a DSO is compiled so that
its thread-local accesses use this model, then the DSO cannot be safely opened
with `dlopen`. This is acceptable in performance-critical applications and in
cases where you know the binary will never be dlopen-ed, such as libc. Modules
compiled/linked this way have their `DF_STATIC_TLS` flag set.

Initial Exec is the default when compiling without `-fPIC` for variables that
may be defined in another module.

The compiler emits code without even calling `__tls_get_addr` for this access
model. It does so using a single GOT entry, which we'll denote `GOT_s` for
symbol `s`. The compiler emits relocations for this entry to ensure that it
holds the offset of `s` from `$tp`. For example,

```
extern thread_local int a;
extern thread_local int b;
int main() {
  return a + b;
}
```
would be lowered to something like the following
```
int main() {
  return *(int*)($tp + GOT_a) + *(int*)($tp + GOT_b);
}
```

Note that on x86 architectures `GOT_s` will actually resolve to a negative
value.

###### Local Exec ######

This is the fastest access model and can only be used if the symbol is in the
first TLS block, which is the TLS block of the main executable. In practice only
the main executable can use this access model, because a shared library can't
(and normally wouldn't need to) know whether it is accessing something from the
main executable. The linker will relax `initial-exec` to `local-exec`. The
compiler can't do this without explicit instructions via `-ftls-model` or
`__attribute__((tls_model("...")))` because the compiler cannot know whether the
current translation unit is going to be linked into a main executable or a
shared library.

With this model the variable is accessed at a fixed offset from `$tp`. The
precise details of how this offset is computed change a bit from architecture
to architecture.

Example code:
```
static thread_local int a;
static thread_local int b;

int main() {
  return a + b;
}
```
would be lowered to
```
int main() {
  return *(int*)($tp + TPOFF_a) + *(int*)($tp + TPOFF_b);
}
```

On AArch64 `TPOFF_a == max(16, p_align) + <a>` where `p_align` is exactly the
`p_align` field of the main executable's `PT_TLS` segment and `<a>` is the
offset of `a` from the beginning of the main executable's TLS segment.

On `x86_64` `TPOFF_a == -<a>` where `<a>` is the offset of `a` from the *end*
of the main executable's TLS segment.

The linker is aware of what `TPOFF_X` is for any given `X` and fills in this
value.

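The two formulas can be written down directly. The helper names below are hypothetical; the arithmetic is just the definitions given above, with sample values checked at compile time.

```cpp
#include <algorithm>
#include <cstdint>

// AArch64: the main executable's block starts max(16, p_align) bytes above
// $tp, so offsets are positive.
constexpr std::int64_t tpoff_aarch64(std::int64_t p_align,
                                     std::int64_t offset_from_start) {
  return std::max<std::int64_t>(16, p_align) + offset_from_start;
}

// x86-64: the main executable's block ends at $tp, so offsets are negative.
constexpr std::int64_t tpoff_x86_64(std::int64_t offset_from_end) {
  return -offset_from_end;
}

static_assert(tpoff_aarch64(8, 4) == 20, "small alignment: 16-byte ABI TCB");
static_assert(tpoff_aarch64(64, 4) == 68, "large alignment dominates");
static_assert(tpoff_x86_64(8) == -8, "x86-64 offsets are below $tp");
```
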
## Implementation ##

This section discusses the implementation as it exists on Fuchsia. That said,
the broad strokes here are similar across different libc implementations,
including musl and glibc.

The actual implementation introduces a few more details, namely the so-called
"DTV" (Dynamic Thread Vector, denoted `dtv` in this doc), which indexes TLS
blocks by module ID. The following diagrams show what the layout looks like for
the static TLS set. In Fuchsia's implementation we store a bunch of
meta information in a thread descriptor struct along with the ABI TCB (denoted
`tcb` below). In our implementation we use the first 8 bytes of this space to
point to the DTV. At first `tcb` points to `dtv` as shown in the diagrams below,
but after a `dlopen` this can change.

arm64:
```
*------------------------------------------------------------------------------*
| thread | tcb | X | tls1 | ... | tlsN | ... | tls_cnt | dtv[1] | ... | dtv[N] |
*------------------------------------------------------------------------------*
^         ^         ^             ^            ^
td        tp      dtv[1]       dtv[n+1]       dtv
```

Here `X` has size `max(16, tls_align) - 16` where `tls_align` is the maximum
alignment of all loaded TLS segments from the static TLS set. This is fixed by
the static linker, since the static linker resolves the `TPOFF_*` values. The
padding is chosen so that if, as required, `$tp` is aligned to the main
executable's `PT_TLS` segment's `p_align` value, then `tls1 - $tp` will be
`max(16, p_align)`. This ensures that there is always at least a 16-byte space
for the ABI TCB (denoted `tcb` in the diagram above).

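The padding arithmetic can be checked with a short sketch. The helper names are hypothetical, and the single `align` parameter stands in for the alignment that governs the layout.

```cpp
#include <algorithm>
#include <cstddef>

constexpr std::size_t kAbiTcbSize = 16;  // reserved space starting at $tp

// Size of the padding X between the ABI TCB and tls1.
constexpr std::size_t padding_x(std::size_t align) {
  return std::max(kAbiTcbSize, align) - kAbiTcbSize;
}

// Distance from $tp to tls1: the ABI TCB followed by the padding.
constexpr std::size_t tls1_offset(std::size_t align) {
  return kAbiTcbSize + padding_x(align);
}
```

For alignments up to 16 the padding is zero and `tls1` sits right after the ABI TCB. For a 64-byte alignment the padding is 48 bytes, so `tls1 - $tp` is 64.
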
x86:
```
*-----------------------------------------------------------------------------*
| tls_cnt | dtv[1] | ... | dtv[N] | ... | tlsN | ... | tls1 | tcb |  thread   |
*-----------------------------------------------------------------------------*
^                                       ^             ^       ^
dtv                                  dtv[n+1]       dtv[1]  tp/td
```

Here `td` denotes the "thread descriptor pointer". On both architectures this
points to the thread descriptor. A subtle point not made apparent in these
diagrams is that `tcb` is actually a member of the thread descriptor struct in
both cases, but on AArch64 it is the last member and on `x86_64` it is the first
member.

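The relationship between `$tp` and `td` on the two architectures can be sketched with two toy structs. The field names and the struct names are hypothetical and do not reflect Fuchsia's actual thread descriptor; only the position of `tcb` follows the text.

```cpp
#include <cstddef>

// Stand-in for the 16-byte ABI TCB.
struct ToyAbiTcb {
  unsigned char bytes[16];
};

// AArch64-style descriptor: tcb is the last member and $tp points at it.
struct ToyThreadArm64 {
  int tid;           // hypothetical bookkeeping field
  void* stack_base;  // hypothetical bookkeeping field
  ToyAbiTcb tcb;
};

// x86-64-style descriptor: tcb is the first member, so $tp == td.
struct ToyThreadX86 {
  ToyAbiTcb tcb;
  int tid;           // hypothetical bookkeeping field
  void* stack_base;  // hypothetical bookkeeping field
};

// Recovers td from $tp by stepping back over the members before tcb.
inline ToyThreadArm64* td_from_tp_arm64(void* tp) {
  return reinterpret_cast<ToyThreadArm64*>(
      static_cast<unsigned char*>(tp) - offsetof(ToyThreadArm64, tcb));
}

// On x86-64 the two pointers coincide.
inline ToyThreadX86* td_from_tp_x86(void* tp) {
  return static_cast<ToyThreadX86*>(tp);
}
```
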
#### dlopen ####

This picture explains what happens for the static TLS set, but it doesn't
explain what happens in the `dlopen` case. When `__tls_get_addr` is called, it
first checks whether `tls_cnt` is such that the module ID (given by `GOT_s[0]`)
is within the `dtv`. If it is, then it simply looks up
`dtv[GOT_s[0]] + GOT_s[1]`. If it isn't, something more complicated happens. See
the implementation of `__tls_get_new` in
[dynlink.c](https://fuchsia.googlesource.com/zircon/+/master/third_party/ulib/musl/ldso/dynlink.c).
In a nutshell, a sufficiently large space was already allocated for a larger
`dtv` on a call to `dlopen`. It is an invariant of the system that sufficient
space will always exist somewhere already allocated. The larger space is then
set up to be a proper `dtv`, and `tcb` is set to point to this new larger `dtv`.
Future accesses will then use the simpler code path since `tls_cnt` will be
large enough.

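The fast and slow paths can be modeled as follows. This is a toy sketch and not the code in dynlink.c: it allocates on demand with `std::vector`, whereas the real implementation relies on space reserved at `dlopen` time. `ToyThread`, `toy_grow_dtv` and `toy_tls_get_addr_dyn` are hypothetical names, and `module_sizes` stands in for the dynamic linker's list of loaded TLS segments.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy per-thread state: dtv[0] holds tls_cnt, dtv[m] the block of module m.
struct ToyThread {
  std::vector<std::uintptr_t> dtv{0};
  std::vector<std::vector<unsigned char>> owned;  // backing storage for blocks
};

// Slow path: extend this thread's dtv so that it covers every loaded module.
void toy_grow_dtv(ToyThread& t, const std::vector<std::size_t>& module_sizes) {
  for (std::size_t id = t.dtv[0] + 1; id <= module_sizes.size(); ++id) {
    t.owned.emplace_back(module_sizes[id - 1]);  // zero-filled block
    t.dtv.push_back(reinterpret_cast<std::uintptr_t>(t.owned.back().data()));
  }
  t.dtv[0] = module_sizes.size();
}

// Fast path when module_id is within tls_cnt, otherwise grow first.
void* toy_tls_get_addr_dyn(ToyThread& t,
                           const std::vector<std::size_t>& module_sizes,
                           std::size_t module_id, std::size_t offset) {
  if (module_id > t.dtv[0]) {
    toy_grow_dtv(t, module_sizes);  // first access after a dlopen
  }
  return reinterpret_cast<void*>(t.dtv[module_id] + offset);
}
```

After the first slow-path call for a newly loaded module, later accesses to it take the simple lookup because `dtv[0]` already covers its module ID.
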