# Thread Local Storage #

The ELF Thread Local Storage ABI (TLS) is a storage model for variables that
allows each thread to have a unique copy of a global variable. This model
is used to implement C++'s `thread_local` storage model. On thread creation the
variable is given its initial value from the initial TLS image. TLS
variables are useful, for instance, as buffers in thread-safe code or for
per-thread bookkeeping. C-style error state such as `errno` or `dlerror` can
also be handled this way.

TLS variables are much like any other global/static variable. In implementation
their initial data winds up in the `PT_TLS` segment. The `PT_TLS` segment
is inside of a read-only `PT_LOAD` segment despite TLS variables being writable.
This segment is then copied into the process for each thread in a unique
writable location. The location the `PT_TLS` segment is copied to is influenced
by the segment's alignment to ensure that the alignment of TLS variables is
respected.

## ABI ##

The interface that the compiler, linker, and dynamic linker must adhere
to is quite simple, even though the details of the implementation are more
complex. The compiler and the linker must emit code and dynamic relocations that
use one of the 4 access models (described in a following section). The dynamic
linker and thread implementation must then set everything up so that this
actually works. Different architectures have different ABIs, but they're similar
enough in broad strokes that we can speak about most of them as if there were
just one ABI. This document assumes that either x86-64 or AArch64 is being
used and points out differences where they occur.

The TLS ABI makes use of a few terms:

 * Thread Pointer: This is a unique address in each thread, generally stored
   in a register. Thread local variables lie at offsets from the thread pointer.
   Thread Pointer will be abbreviated and used as `$tp` in this document.
   `$tp` is what `__builtin_thread_pointer()` returns on AArch64. On AArch64
   `$tp` is given by a special register named `TPIDR_EL0` that can be accessed
   using `mrs <reg>, TPIDR_EL0`. On `x86_64` the `fs.base` segment base is used;
   it can be accessed with `%fs:` and can be loaded from `%fs:0` or with the
   `rdfsbase` instruction.
 * TLS Segment: This is the image of data in each module and is specified by
   the `PT_TLS` program header in each module. Not every module has a `PT_TLS`
   program header and thus not every module has a TLS segment. Each module
   has at most one TLS segment and correspondingly at most one `PT_TLS`
   program header.
 * Static TLS set: This is the sum total of modules that are known to the
   dynamic linker at program start up time. It consists of the main executable
   and every library transitively mentioned by `DT_NEEDED`. Modules that
   require being in the Static TLS set have `DF_STATIC_TLS` set on their
   `DT_FLAGS` entry in their dynamic table (given by the `PT_DYNAMIC` segment).
 * TLS Region: This is a contiguous region of memory unique to each
   thread. `$tp` will point to some point in this region. It contains the
   TLS segment of every module in the Static TLS set as well as some
   implementation-private data which is sometimes called the TCB (Thread
   Control Block). On AArch64 a 16-byte reserved space starting at `$tp` is
   also sometimes called the TCB. We will refer to this space as the "ABI TCB"
   in this doc.
 * TLS Block: This is an individual thread's copy of a TLS segment. There is
   one TLS block per TLS segment per thread.
 * Module ID: The module ID is not statically known, except for the main
   executable's module ID which is always 1. Other modules' module IDs are
   chosen by the dynamic linker. It's just a unique non-zero ID for each
   module. In theory it could be any non-zero 64-bit value that is unique to
   the module, such as a hash.
   In practice it's just a simple counter that the dynamic linker maintains.
 * The main executable: This is the module that contains the start address. It
   is also treated in a special way in one of the access models. It always
   has a Module ID of 1. This is the only module that can use fixed offsets
   from `$tp` via the Local Exec model described below.

To comply with the ABI all access models must be supported.

#### Access Models ####

There are 4 access models specified by the ABI:

 * `global-dynamic`
 * `local-dynamic`
 * `initial-exec`
 * `local-exec`

These are the values that can be used for `-ftls-model=...` and
`__attribute__((tls_model("...")))`.

Which model is used relates to:

1. Which module is performing the access:
    1. The main executable
    2. A module in the static TLS set
    3. A module that was loaded after startup, e.g. by `dlopen`
2. Which module the variable being accessed is defined in:
    1. Within the same module (i.e. `local-*`)
    2. In a different module (i.e. `global-*`)

* `global-dynamic` Can be used from anywhere, for any variable.
* `local-dynamic` Can be used by any module, for any variable defined in that
  same module.
* `initial-exec` Can be used by any module for any variable defined in the static
  TLS set.
* `local-exec` Can be used by the main executable for variables defined in the
  main executable.

###### Global Dynamic ######

Global dynamic is the most general access format. It is also the slowest.
Any thread-local global variable should be accessible with this method. This
access model *must* be used if a dynamic library accesses a symbol defined in
another module (see exception in section on Initial Exec). Symbols defined
within the executable need not use this access model. The main executable can
also avoid using this access model.
This is the default access model when compiling with `-fPIC`, as is the norm
for shared libraries.

This access model works by calling a function defined in the dynamic linker.
There are two ways this function might be called: via TLSDESC, or via
`__tls_get_addr`.

In the case of `__tls_get_addr` it is passed the pair of `GOT` entries
associated with this symbol. Specifically, it is passed a pointer to the first
entry, and the second entry comes right after it. For a given symbol `S`, the
first entry, denoted `GOT_S[0]`, must contain the Module ID of the module in
which `S` was defined. The second entry, denoted `GOT_S[1]`, must contain the
offset into the TLS block, which is the same as the offset of the symbol in the
`PT_TLS` segment of the associated module. The pointer to `S` is then computed
using `__tls_get_addr(GOT_S)`. The implementation of `__tls_get_addr` will be
discussed later.

TLSDESC is an alternative ABI for `global-dynamic` access (and `local-dynamic`)
in which a different pair of `GOT` slots is used. The first `GOT` slot
contains a function pointer. The second contains some dynamic linker defined
auxiliary data. This allows the dynamic linker a choice over which function is
called depending on circumstance.

In both cases the calls to these functions must be implemented by a specific
code sequence and a specific set of relocs. This allows the linker to recognize
these accesses and potentially relax them to the `local-dynamic` access model.

(NOTE: The following paragraph contains details about how the compiler upholds
its end of the ABI. Skip this paragraph if you don't care about that.)

For the compiler to emit code for this access model, a call needs to be emitted
against `__tls_get_addr` (defined by the dynamic linker) and a reference to the
symbol name.
Specifically, the compiler emits code for `__tls_get_addr(GOT_S)` (minding the
additional relocation needed for the GOT itself). The
linker then emits two dynamic relocations when generating the GOT. On `x86_64`
these are `R_X86_64_DTPMOD64` and `R_X86_64_DTPOFF64`. On AArch64 these are
`R_AARCH64_TLS_DTPMOD` and `R_AARCH64_TLS_DTPREL`. These relocations reference
the symbol regardless of whether the module defines a symbol by that name.

###### Local Dynamic ######

Local dynamic is the same as Global Dynamic but for local symbols. It can be
thought of as a single `global-dynamic` access to the TLS block of this module.
Then, because every variable defined in the module is at a fixed offset from the
TLS block, the compiler can optimize multiple `global-dynamic` calls into one.
The compiler will relax a `global-dynamic` access to a `local-dynamic` access
whenever the variables are local/static or have hidden visibility. The linker
may sometimes be able to relax some `global-dynamic` accesses to `local-dynamic`
as well.

The following gives an example of how the compiler might emit code for this
access model:

```
static thread_local char buf[buf_cap];
static thread_local size_t buf_size = 0;
while(*str && buf_size < buf_cap) {
  buf[buf_size++] = *str++;
}
```
might be lowered to
```
// GOT_module[0] is the module ID of this module
// GOT_module[1] is just 0
// <X> denotes the offset of X in this module's TLS block
tls = __tls_get_addr(GOT_module)
while(*str && *(size_t*)(tls+<buf_size>) < buf_cap) {
  ((char*)(tls+<buf>))[(*(size_t*)(tls+<buf_size>))++] = *str++;
}
```

If this code used global dynamic it would have to make at least 2 calls, one to
get the pointer for `buf` and the other to get the pointer for `buf_size`.
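The source fragment above can be wrapped into a compilable sketch. This is only an illustration: `append_to_buf`, `buf_contents`, `append_on_new_thread` and the `TLS_LOCAL_DYNAMIC` macro are hypothetical names, `buf_cap` is given an arbitrary value, and the `tls_model` attribute is applied only when a GCC-compatible compiler targets ELF. The attribute merely requests the access model; the compiler and linker remain free to relax it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>

// Assumption: tls_model is a GCC/Clang extension; fall back to nothing elsewhere.
#if defined(__GNUC__) && defined(__ELF__)
#define TLS_LOCAL_DYNAMIC __attribute__((tls_model("local-dynamic")))
#else
#define TLS_LOCAL_DYNAMIC
#endif

constexpr std::size_t buf_cap = 8;  // arbitrary capacity for the sketch

// Both variables are static, so they live in this module's TLS block and one
// lookup of that block is enough to reach either of them.
static thread_local char buf[buf_cap] TLS_LOCAL_DYNAMIC;
static thread_local std::size_t buf_size TLS_LOCAL_DYNAMIC = 0;

// Appends str to the calling thread's buffer and returns the new size.
std::size_t append_to_buf(const char* str) {
  assert(str != nullptr);
  while (*str && buf_size < buf_cap) {
    buf[buf_size++] = *str++;
  }
  return buf_size;
}

// Returns a copy of the calling thread's buffer contents.
std::string buf_contents() { return std::string(buf, buf_size); }

// Runs append_to_buf on a fresh thread, which starts from the initial TLS image.
std::size_t append_on_new_thread(const char* str) {
  std::size_t result = 0;
  std::thread worker([&] { result = append_to_buf(str); });
  worker.join();
  return result;
}
```

Calling `append_to_buf` on one thread and `append_on_new_thread` afterwards shows that the second thread starts with `buf_size == 0`, because its TLS block is a fresh copy of the module's TLS segment.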

###### Initial Exec ######

This access model can be used anytime the compiler knows that the module in
which the accessed symbol is defined will be loaded in the initial set of
executables rather than opened using `dlopen`. This access model is generally
only used when the main executable is accessing a global symbol with default
visibility. This is because compiling an executable is the only time the
compiler knows that any code generated will be in the initial executable set. If
a DSO is compiled to make thread local accesses use this model then the DSO
cannot be safely opened with `dlopen`. This is acceptable in performance
critical applications and in cases where you know the binary will never be
dlopen-ed, such as in the case of libc. Modules compiled/linked this way have
their `DF_STATIC_TLS` flag set.

Initial Exec is the default when compiling without `-fPIC`.

The compiler emits code without even calling `__tls_get_addr` for this access
model. It does so using a single GOT entry per symbol, denoted `GOT[s]` for
symbol `s`, for which the compiler emits relocations. As a result,

```
extern thread_local int a;
extern thread_local int b;
int main() {
  return a + b;
}
```
would be lowered to something like the following
```
int main() {
  return *(int*)($tp + GOT[a]) + *(int*)($tp + GOT[b]);
}
```

Note that on x86 architectures `GOT[s]` will actually resolve to a negative
value.

###### Local Exec ######

This is the fastest access model and can only be used if the symbol is in the
first TLS block, which is the TLS block of the main executable. In practice only
the main executable can use this access model because a shared library can't
(and normally wouldn't need to) know if it is accessing something from the main
executable. The linker will relax `initial-exec` to `local-exec`.
The compiler can't do this without explicit instructions via `-ftls-model` or
`__attribute__((tls_model("...")))` because the compiler cannot know if the
current translation unit is going to be linked into a main executable or a
shared library.

The precise details of how this offset is computed change a bit
from architecture to architecture.

Example code:
```
static thread_local int a;
static thread_local int b;

int main() {
  return a + b;
}
```
would be lowered to
```
int main() {
  return *(int*)($tp + TPOFF_a) + *(int*)($tp + TPOFF_b);
}
```

On AArch64 `TPOFF_a == max(16, p_align) + <a>` where `p_align` is exactly the
`p_align` field of the main executable's `PT_TLS` segment and `<a>` is the
offset of `a` from the beginning of the main executable's TLS segment.

On `x86_64` `TPOFF_a == -<a>` where `<a>` is the offset of `a` from the *end*
of the main executable's TLS segment.

The linker is aware of what `TPOFF_X` is for any given `X` and fills in this
value.

## Implementation ##

This section discusses the implementation as it is implemented on Fuchsia. That
said, the broad strokes here are similar across different libc
implementations including musl and glibc.

The actual implementation of all of this introduces a few more details, namely
the so-called "DTV" (Dynamic Thread Vector) (denoted `dtv` in this doc), which
indexes TLS blocks by module ID. The following diagram shows what the initial
executable set looks like. In Fuchsia's implementation we actually store a
bunch of meta information in a thread descriptor struct along with the
ABI TCB (denoted `tcb` below). In our implementation we use the first 8 bytes
of this space to point to the DTV. At first `tcb` points to `dtv` as shown in
the below diagrams, but after a dlopen this can change.

arm64:
```
*------------------------------------------------------------------------------*
| thread | tcb | X | tls1 | ... | tlsN | ... | tls_cnt | dtv[1] | ... | dtv[N] |
*------------------------------------------------------------------------------*
^        ^         ^            ^            ^
td       tp        dtv[1]       dtv[n+1]     dtv
```

Here `X` has size `max(16, tls_align) - 16` where `tls_align` is the maximum
alignment of all loaded TLS segments from the static TLS set. This is set by
the static linker since the static linker resolves `TPOFF_*` values. This
padding is chosen so that if, as required, `$tp` is aligned to the main
executable's `PT_TLS` segment's `p_align` value then `tls1 - $tp` will be
`max(16, p_align)`. This ensures that there is always at least a 16 byte space
for the ABI TCB (denoted `tcb` in the diagram above).

x86:
```
*--------------------------------------------------------------------------*
| tls_cnt | dtv[1] | ... | dtv[N] | ... | tlsN | ... | tls1 | tcb | thread |
*--------------------------------------------------------------------------*
^                                       ^            ^      ^
dtv                                     dtv[n+1]     dtv[1] tp/td
```

Here `td` denotes the "thread descriptor pointer". In both implementations this
points to the thread descriptor. A subtle point not made apparent in these
diagrams is that `tcb` is actually a member of the thread descriptor struct in
both cases, but on AArch64 it is the last member and on `x86_64` it is the first
member.

#### dlopen ####

This picture explains what happens for the initial executables but it doesn't
explain what happens in the `dlopen` case. When `__tls_get_addr` is called it
first checks whether `tls_cnt` is such that the module ID (given by `GOT_s[0]`)
is within the `dtv`. If it is, then it simply looks up
`dtv[GOT_s[0]] + GOT_s[1]`, but if it isn't, something more complicated happens.
See the implementation of
`__tls_get_new` in [dynlink.c](https://fuchsia.googlesource.com/zircon/+/master/third_party/ulib/musl/ldso/dynlink.c).
In a nutshell, a sufficiently large space was already allocated for a larger `dtv`
on a call to `dlopen`. It is an invariant of the system that sufficient space
will always exist somewhere already allocated. The larger space is then set up to
be a proper `dtv`. `tcb` is then set to point to this new larger `dtv`. Future
accesses will then use the simpler code path since `tls_cnt` will be large
enough.
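The fast path and slow path described above can be modelled with a small toy. This is a sketch under stated assumptions and not Fuchsia's or musl's code: every name (`ToyModule`, `ToyThread`, `ToyGotPair`, `toy_tls_get_addr`) is hypothetical, `tls_cnt` is kept as a separate field instead of in the `dtv` allocation, and the slow path allocates new blocks on demand, whereas the real implementation relies on space reserved during `dlopen`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <vector>

// Hypothetical stand-in for one module's PT_TLS segment (the initial image).
struct ToyModule {
  std::vector<unsigned char> tls_image;
};

// Hypothetical per-thread state. dtv[id] points at the TLS block for module
// `id`; index 0 is unused because module IDs start at 1.
struct ToyThread {
  std::size_t tls_cnt = 0;  // number of modules the dtv currently covers
  std::vector<unsigned char*> dtv{nullptr};
  std::vector<std::unique_ptr<unsigned char[]>> blocks;  // owns the TLS blocks
};

// The two GOT entries for a symbol: GOT_s[0] and GOT_s[1] in the text.
struct ToyGotPair {
  std::size_t module_id;
  std::size_t offset;
};

// modules[i] describes the module with ID i + 1.
void* toy_tls_get_addr(ToyThread& self, const std::vector<ToyModule>& modules,
                       const ToyGotPair& got) {
  assert(got.module_id >= 1 && got.module_id <= modules.size());
  if (got.module_id <= self.tls_cnt) {
    // Fast path: the dtv already covers this module.
    return self.dtv[got.module_id] + got.offset;
  }
  // Slow path: grow the dtv and create this thread's blocks for new modules.
  self.dtv.resize(modules.size() + 1, nullptr);
  for (std::size_t id = self.tls_cnt + 1; id <= modules.size(); ++id) {
    const std::vector<unsigned char>& image = modules[id - 1].tls_image;
    auto block = std::make_unique<unsigned char[]>(image.size());
    if (!image.empty()) {
      std::memcpy(block.get(), image.data(), image.size());
    }
    self.dtv[id] = block.get();
    self.blocks.push_back(std::move(block));
  }
  self.tls_cnt = modules.size();
  return self.dtv[got.module_id] + got.offset;
}
```

In this model, appending a `ToyModule` to the vector plays the role of `dlopen`: the next lookup for the new module ID takes the slow path once, and later lookups take the fast path because `tls_cnt` has grown.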