=head1 OVERVIEW

As of Xen 4.0, a new config option called tsc_mode may be specified
for each domain. The default for tsc_mode handles the vast majority
of hardware and software environments. This document is targeted
at Xen users and administrators who may need to select a non-default
tsc_mode.

Proper selection of tsc_mode depends on an understanding not only of
the guest operating system (OS), but also of the application set that will
ever run on this guest OS. This is because tsc_mode applies
equally to both the OS and ALL apps that are running on this
domain, now or in the future.

Key questions to be answered for the OS and/or each application are:

=over 4

=item *

Does the OS/app use the rdtsc instruction at all?
(We will explain below how to determine this.)

=item *

At what frequency is the rdtsc instruction executed by either the OS
or any running apps? If the sum exceeds about 10,000 rdtsc instructions
per second per processor, we call this a "high-TSC-frequency"
OS/app/environment. (This is relatively rare, and developers of OS's
and apps that are high-TSC-frequency are usually aware of it.)

=item *

If the OS/app does use rdtsc, will it behave incorrectly if "time goes
backwards" or if the frequency of the TSC suddenly changes? If so,
we call this a "TSC-sensitive" app or OS; otherwise it is "TSC-resilient".

=back

This last is the US$64,000 question as it may be very difficult
(or, for legacy apps, even impossible) to predict all possible
failure cases. As a result, unless proven otherwise, any app
that uses rdtsc must be assumed to be TSC-sensitive and, as we
will see, this is the default starting in Xen 4.0.

Xen's new tsc_mode parameter determines the circumstances under which
the family of rdtsc instructions is executed "natively" vs emulated.
Roughly speaking, native means rdtsc is fast but TSC-sensitive apps
may, under unpredictable circumstances, run incorrectly; emulated means
there is some performance degradation (unobservable in most cases),
but TSC-sensitive apps will always run correctly. Prior to Xen 4.0,
all rdtsc instructions were native: "fast but potentially incorrect."
Starting at Xen 4.0, the default is that all rdtsc instructions are
"correct but potentially slow". The tsc_mode parameter in 4.0 provides
an intelligent default but allows system administrators to adjust,
per domain, how rdtsc instructions are executed.

The non-default choices for tsc_mode are:

=over 4

=item * B<tsc_mode=1> (always emulate).

All rdtsc instructions are emulated; this is the best choice when
TSC-sensitive apps are running and it is necessary to understand
worst-case performance degradation for a specific hardware environment.

=item * B<tsc_mode=2> (never emulate).

This is the same as prior to Xen 4.0 and is the best choice if it
is certain that all apps running in this VM are TSC-resilient and
highest performance is required.

=item * B<tsc_mode=3> (PVRDTSCP).

High-TSC-frequency apps may be paravirtualized (modified) to
obtain both correctness and highest performance; any unmodified
apps must be TSC-resilient.

=back

If tsc_mode is left unspecified (or set to B<tsc_mode=0>), a hybrid
algorithm is utilized to ensure correctness while providing the
best performance possible given:

=over 4

=item *

the requirement of correctness,

=item *

the underlying hardware, and

=item *

whether or not the VM has been saved/restored/migrated

=back

To understand this in more detail, the rest of this document must
be read.
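
As a minimal sketch of how a non-default mode might be selected, the
fragment below shows a domain configuration file that sets tsc_mode
numerically, using the values listed above. The domain name and memory
size are hypothetical placeholders, and this document does not spell
out the exact configuration syntax, so the toolstack's own
documentation should be checked before relying on it:

    # hypothetical guest; only the tsc_mode line matters here
    name     = "example-guest"
    memory   = 1024
    tsc_mode = 2    # never emulate: all apps must be TSC-resilient

Omitting the tsc_mode line entirely selects the default hybrid
behaviour (tsc_mode=0).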

=head1 DETERMINING RDTSC FREQUENCY

To determine the frequency of rdtsc instructions that are emulated,
an "xl" command can be used by a privileged user of domain0. The
command:

    # xl debug-key s; xl dmesg | tail

provides information about TSC usage in each domain where TSC
emulation is currently enabled.

=head1 TSC HISTORY

To understand tsc_mode completely, some background on TSC is required:

The x86 "timestamp counter", or TSC, is a 64-bit register on each
processor that increases monotonically. Historically, TSC incremented
every processor cycle, but on recent processors, it increases
at a constant rate even if the processor changes frequency (for example,
to reduce processor power usage). TSC is known by x86 programmers
as the fastest, highest-precision measurement of the passage of time
so it is often used as a foundation for performance monitoring.
And since it is guaranteed to be monotonically increasing and, at
64 bits, is guaranteed to not wrap around within 10 years, it is
sometimes used as a random number or a unique sequence identifier,
such as to stamp transactions so they can be replayed in a specific
order.

On most older SMP and early multi-core machines, TSC was not synchronized
between processors. Thus if an application were to read the TSC on
one processor, then was moved by the OS to another processor, then read
TSC again, it might appear that "time went backwards". This loss of
monotonicity resulted in many obscure application bugs when TSC-sensitive
apps were ported from a uniprocessor to an SMP environment; as a result,
many applications -- especially in the Windows world -- removed their
dependency on TSC and replaced their timestamp needs with OS-specific
functions, losing both performance and precision.
On some more recent
generations of multi-core machines, especially multi-socket multi-core
machines, the TSC was synchronized but if one processor were to enter
certain low-power states, its TSC would stop, destroying the synchrony
and again causing obscure bugs. This reinforced decisions to avoid use
of TSC altogether. On the most recent generations of multi-core
machines, however, synchronization is provided across all processors
in all power states, even on multi-socket machines, and a flag is
provided that indicates that TSC is synchronized and "invariant". Thus
TSC is once again useful for applications, and even newer operating
systems are using and depending upon TSC for critical timekeeping
tasks when running on these recent machines.

We will refer to hardware that ensures TSC is both synchronized and
invariant as "TSC-safe" and any hardware on which TSC is not (or
may not remain) synchronized as "TSC-unsafe".

As a result of TSC's sordid history, two classes of applications use
TSC: old applications designed for single processors, and the most recent
enterprise applications which require high-frequency high-precision
timestamping.

We will refer to apps that might break if running on a TSC-unsafe
machine as "TSC-sensitive"; apps that don't use TSC, or that use
TSC only in a way for which monotonicity and frequency invariance
are unimportant, we will call "TSC-resilient".

The emergence of virtualization once again complicates the usage of
TSC. When features such as save/restore or live migration are employed,
a guest OS and all its currently running applications may be invisibly
transported to an entirely different physical machine. While TSC
may be "safe" on one machine, it is essentially impossible to precisely
synchronize TSC across a data center or even a pool of machines.
As
a result, when run in a virtualized environment, rare and obscure
"time going backwards" problems might once again occur for those
TSC-sensitive applications. Worse, if a guest OS moves from, for
example, a 3GHz machine to a 1.5GHz machine, attempts by an OS/app
to measure time intervals with TSC may without notice be incorrect
by a factor of two.

The rdtsc (read timestamp counter) instruction is used to read the
TSC register. The rdtscp instruction is a variant of rdtsc on recent
processors. We refer to these together as the rdtsc family of instructions,
or just "rdtsc". Instructions in the rdtsc family are non-privileged, but
privileged software may set a control bit (for example, the TSD flag in
the CR4 register) to cause all rdtsc family instructions to trap. This
trap can be detected by Xen, which can then transparently "emulate" the
results of the rdtsc instruction and return control to the code following
the rdtsc instruction.

To provide a "safe" TSC, i.e. to ensure both TSC monotonicity and a
fixed rate, Xen provides rdtsc emulation whenever necessary or when
explicitly specified by a per-VM configuration option. TSC emulation is
relatively slow -- roughly 15-20 times slower than the rdtsc instruction
when executed natively. However, except when an OS or application uses
the rdtsc instruction at a high frequency (e.g. more than about 10,000 times
per second per processor), this performance degradation is not noticeable
(i.e. <0.3%). Also, TSC emulation is nearly always faster than
OS-provided alternatives (e.g. Linux's gettimeofday). For environments
where it is certain that all apps are TSC-resilient (i.e.
"TSC-safeness" is not necessary) and highest performance is a
requirement, TSC emulation may be entirely disabled (tsc_mode==2).

The default mode (tsc_mode==0) checks TSC-safeness of the underlying
hardware on which the virtual machine is launched.
If it is
TSC-safe, rdtsc will execute at hardware speed; if it is not, rdtsc
will be emulated. Once a virtual machine is saved/restored or migrated,
however, there are two possibilities: TSC remains native IF the source
physical machine and target physical machine have the same TSC frequency
(or, for HVM/PVH guests, if TSC scaling support is available); else TSC
is emulated. Note that, though emulated, the "apparent" TSC frequency
will be the TSC frequency of the initial physical machine, even after
migration.

For environments where both TSC-safeness AND highest performance
even across migration is a requirement, application code can be specially
modified to use an algorithm explicitly designed into Xen for this purpose.
This mode (tsc_mode==3) is called PVRDTSCP, because it requires
app paravirtualization (awareness by the app that it may be running
on top of Xen), and utilizes a variation of the rdtsc instruction
called rdtscp that is available on most recent generation processors.
(The rdtscp instruction differs from the rdtsc instruction in that it
reads not only the TSC but an additional register set by system software.)
When a pvrdtscp-modified app is running on a processor that is both TSC-safe
and supports the rdtscp instruction, information can be obtained
about migration and TSC frequency/offset adjustment to allow the
vast majority of timestamps to be obtained at top performance; when
running on a TSC-unsafe processor or a processor that doesn't support
the rdtscp instruction, rdtscp is emulated.

PVRDTSCP (tsc_mode==3) has two limitations. First, it applies to
all apps running in this virtual machine. This means that all
apps must either be TSC-resilient or pvrdtscp-modified. Second,
highest performance is only obtained on TSC-safe machines that
support the rdtscp instruction; when running on older machines,
rdtscp is emulated and thus slower.
For more information on PVRDTSCP,
see below.

Finally, tsc_mode==1 always enables TSC emulation, regardless of
the underlying physical hardware. The "apparent" TSC frequency will
be the TSC frequency of the initial physical machine, even after migration.
This mode is useful to measure any performance degradation that
might be encountered by a tsc_mode==0 domain after migration occurs,
or a tsc_mode==3 domain when it is running on TSC-unsafe hardware.

Note that while Xen ensures that an emulated TSC is "safe" across migration,
it does not ensure that it continues to tick at the same rate during
the actual migration. As an oversimplified example, if TSC is ticking
once per second in a guest, and the guest is saved when the TSC is 1000,
then restored 30 seconds later, TSC is only guaranteed to be greater
than or equal to 1001, not precisely 1030. This has some OS implications
as will be seen in the next section.

=head1 TSC INVARIANT BIT and NO_MIGRATE

Related to TSC emulation, the "TSC Invariant" bit is architecturally defined
in a cpuid bit on the most recent x86 processors. If set, TSC invariance
ensures that the TSC is "safe", that is, it will increment at a constant rate
regardless of power events, will be synchronized across all processors, and
was properly initialized to zero on all processors at boot-time
by system hardware/BIOS. As long as system software never writes to TSC,
TSC will be safe and continuously incremented at a fixed rate and thus
can be used as a system "clocksource".

This bit is used by some OS's, and specifically by Linux starting with
version 2.6.30(?), to select TSC as a system clocksource. Once selected,
TSC remains the Linux system clocksource unless manually overridden.
In
a virtualized environment, since it is not possible to synchronize TSC
across all the machines in a pool or data center, a migration may "break"
TSC as a usable clocksource; while time will not go backwards, it may
not track wallclock time well enough to avoid certain time-sensitive
consequences. As a result, Xen can only expose the TSC Invariant bit
to a guest OS if it is certain that the domain will never migrate.
As of Xen 4.0, the "no_migrate=1" VM configuration option may be specified
to disable migration. If no_migrate is selected and the VM is running
on a physical machine with "TSC Invariant", Linux 2.6.30+ will safely
use TSC as the system clocksource. But, attempts to migrate or, once
saved, restore this domain will fail.

There is another cpuid-related complication: The x86 cpuid instruction is
non-privileged. HVM domains are configured to always trap this instruction
to Xen, where Xen can "filter" the result. In a PV OS, all cpuid instructions
have been replaced by a paravirtualized equivalent of the cpuid instruction
("pvcpuid") and also trap to Xen. But apps in a PV guest that use a
cpuid instruction execute it directly, without a trap to Xen. As a result,
an app may directly examine the physical TSC Invariant cpuid bit and make
decisions based on that bit. This is still an unsolved problem, though
a workaround exists as part of the PVRDTSCP tsc_mode for apps that
can be modified.

=head1 MORE ON PVRDTSCP

Paravirtualized OS's use the "pvclock" algorithm to manage the passing
of time. This sophisticated algorithm obtains information from a memory
page shared between Xen and the OS and selects information from this
page based on the current virtual CPU (vcpu) in order to properly adapt to
TSC-unsafe systems and changes that occur across migration.
Neither
this shared page nor the vcpu information is available to a userland
app so the pvclock algorithm cannot be directly used by an app, at least
without performance degradation roughly equal to the cost of just
emulating an rdtsc.

As a result, as of 4.0, Xen provides capabilities for a userland app
to obtain key time values similar to the information accessible
to the PV OS pvclock algorithm. The app uses the rdtscp instruction
which is defined in recent processors to obtain both the TSC and an
auxiliary value called TSC_AUX. Xen is responsible for setting TSC_AUX
to the same value on all vcpus running any domain with tsc_mode==3;
further, Xen tools are responsible for monotonically incrementing TSC_AUX
anytime the domain is restored/migrated (thus changing key time values);
and, when the domain is running on a physical machine that either
is not TSC-safe or does not support the rdtscp instruction, Xen
is responsible for emulating the rdtscp instruction and for setting
TSC_AUX to zero on all processors.

Xen also provides pvclock information via a "pvcpuid" instruction.
While this results in a slow trap, the information changes
(and thus must be reobtained via pvcpuid) ONLY when TSC_AUX
has changed, which should be very rare relative to a high
frequency of rdtscp instructions.

Finally, Xen provides additional time-related information via
other pvcpuid instructions. An app is capable of determining,
first, whether it is currently running on Xen; next, the tsc_mode
setting of the domain in which it is running; and finally, whether
the underlying hardware is TSC-safe and supports the rdtscp
instruction.

As a result, a pvrdtscp-modified app has sufficient information
to compute the pvclock "elapsed nanoseconds" which can
be used as a timestamp.
And this can be done nearly as
fast as a native rdtsc instruction, much faster than emulation,
and also much faster than nearly all OS-provided time mechanisms.
While pvrdtscp is too complex for most apps, certain enterprise
TSC-sensitive high-TSC-frequency apps may find it useful to
obtain a significant performance gain.

=head1 HARDWARE TSC SCALING

Intel VMX TSC scaling and AMD SVM TSC ratio allow the guest TSC read
by guest rdtsc/p to increase at a frequency different from the host
TSC frequency.

If an HVM container in default TSC mode (tsc_mode=0) or PVRDTSCP mode
(tsc_mode=3) is created on a host that provides constant TSC, its
guest TSC frequency will be the same as the host. If it is later
migrated to another host that provides constant TSC and supports Intel
VMX TSC scaling/AMD SVM TSC ratio, its guest TSC frequency will be the
same before and after migration.

For the above HVM container in default TSC mode (tsc_mode=0), if the
above hosts support rdtscp, both guest rdtsc and rdtscp instructions
will be executed natively before and after migration.

For the above HVM container in PVRDTSCP mode (tsc_mode=3), if the
destination host does not support rdtscp, the guest rdtscp instruction
will be emulated with the guest TSC frequency.

=head1 AUTHORS

Dan Magenheimer <dan.magenheimer@oracle.com>