\ingroup GroupModules Modules
\defgroup GroupThermal Thermal Management

# Thermal Management Architecture

Copyright (c) 2022, Arm Limited. All rights reserved.


## Overview

Thermal Management is a basic closed-loop temperature controller which
dynamically controls the platform performance within a thermal envelope.

With a closed control loop and by intelligently dividing the power among actors,
Thermal Management can efficiently distribute the power within the thermal
constraints.

The power delivered to each actor (CPU, GPU, etc.) is controlled by adjusting
the frequency and the voltage provided to that actor (performance limits).
Each actor can be assigned its own power model, which defines the equations
for converting power into performance levels and vice-versa.

Allocating the correct performance level results in the corresponding power
being consumed, and thus the temperature is maintained at the desired level.

A system can contain multiple Thermal Management controllers. Each of them
governs a different temperature domain and has its own dedicated temperature
sensor.

## Architecture of a Thermal Management controller

The main two blocks composing Thermal Mgmt are the PI (Proportional and
Integral) control loop and the power divider.

                               Thermal Design
    current temp                 Power (TDP)   weight (config)
          |                           |             |   |
          |                           |             |   |
          |                           |             |   |
          v                           v             v   v
        +-+-+     +---------+       +-+-+        +--+---+--+
        |   +---->+ PI Ctrl +------>+   +------->+  power  +----> power granted
        +-+-+     +---------+       +---+   +--->+ divider +---->
          ^                                 |    +--+---+--+
          |                                 |       ^   ^
          |    performance request ---------+       |   |
          |                                   +-----+   +-----+
    control temp                              |               |
                                              v               v
                                          +---+---+       +---+---+
                                          | power |       | power |
                                          | model |       | model |
                                          +-------+       +-------+


The control loop reacts to temperature deviations from the control settings and
provides a (proportional and integral) signal which is then converted into an
allocatable power offset.
This available power is then distributed across the actors by a basic weighted
bias mechanism. Each actor gets a fraction of the available power proportional
to its bias.

Thermal Management runs internally on two loops: a fast loop and a slow loop.
The fast loop provides a fast adjustment of the performance requests, while the
slow loop updates the PI control (which in turn provides an updated allocatable
power). The temperature is sampled at the slow loop cadence.

Thermal Management takes advantage of the plugin-handler extension to get a
tick that is synchronous with the performance requests coming from SCMI. This
way the fast loop is directly synchronised with the performance chain to adjust
the performance limits. This tick imposes the fast loop periodicity.
The slow loop is derived internally from the fast loop cadence by a configurable
multiplication factor. This allows custom tuning of the PI control timings.

An additional module plays an important role in the algorithm by translating
power to performance and vice-versa. Each actor is required to have its own
power model, which is implemented as an additional platform-specific module to
achieve the best modularity and flexibility.
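
To make the role of a power model more concrete, the sketch below shows one
hypothetical way a platform could convert between performance levels and power
using a table of operating points. The `struct opp` type, the table values and
the `sketch_*` function names are illustrative assumptions and are not part of
Thermal Management; a real power model would take a framework identifier for
the actor rather than a table pointer.

```C
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical operating point: a performance level and the power it draws. */
struct opp {
    uint32_t level; /* performance level, e.g. a frequency (assumption) */
    uint32_t power; /* abstract power units, consistent with the tdp value */
};

/* Illustrative values only, sorted by ascending level and power. */
static const struct opp example_opps[] = {
    { 500, 300 },
    { 1000, 800 },
    { 1500, 1700 },
    { 2000, 3000 },
};

#define EXAMPLE_OPP_COUNT (sizeof(example_opps) / sizeof(example_opps[0]))

/*
 * Power drawn by the lowest operating point that satisfies the requested
 * level. Requests above the table are clamped to the highest entry.
 */
uint32_t sketch_level_to_power(
    const struct opp *opps, size_t count, uint32_t level)
{
    assert(opps != NULL && count > 0);

    for (size_t i = 0; i < count; i++) {
        if (opps[i].level >= level) {
            return opps[i].power;
        }
    }
    return opps[count - 1].power;
}

/*
 * Highest level whose power fits within the given budget. If even the lowest
 * entry does not fit, the lowest level is returned.
 */
uint32_t sketch_power_to_level(
    const struct opp *opps, size_t count, uint32_t power)
{
    assert(opps != NULL && count > 0);

    uint32_t level = opps[0].level;
    for (size_t i = 0; i < count; i++) {
        if (opps[i].power <= power) {
            level = opps[i].level;
        }
    }
    return level;
}
```

With these example values, a granted power of 1000 maps to level 1000, because
the next operating point would need 1700.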

The interface for the power model is defined by Thermal Management
(mod_thermal_power_model_api).


## Algorithm

At each regular tick (fast loop), Thermal Management will:
- convert the requested performance level into power
- attempt to distribute the power across actors based on their requests and
  weights. Any spare power is collected and any shortage is tracked.
- add any carry-over power not consumed in the previous cycle to the available
  spare power
- re-distribute any available spare power across actors based on their shortage
- keep any spare power left as the carry-over power for the next cycle
- convert the granted power into a performance level
- apply a performance limit on the actor's corresponding domain if the power
  requested could not be met

The above power<->performance conversions are performed within the
platform-specific power model module.

With a slower periodicity (slow loop), Thermal Management will:
- initiate a temperature reading
- run the PI control and update the total available power


## Use

To use Thermal Management, the following dependencies are required:
- performance plugin handler enabled (BS_FIRMWARE_HAS_PERF_PLUGIN_HANDLER)
- Power Model in product/<product-name>/module/product_power_model


## Limitations

Currently the implementation is in "prototype" stage and limited tests have been
carried out.


## Tunings & Settings

### Global PI control tunings

`slow_loop_mult`
Multiplier applied to the base tick via the ->update callback.
For example: if the tick period is 5ms, then a value of 20 will give the PI
control a refresh period of 100ms (= 5 * 20).

### Per-temperature domain tunings

`tdp (thermal design power)`
The thermal design power for all the actors monitored. This can be an abstract
value as long as the power model can work with it.

`switch_on_temperature`
The temperature above which the Thermal Mgmt algorithm will run. The unit needs
to be consistent with the value provided by the targeted temperature sensor.

`control_temperature`
The control temperature for the platform. Above this temperature Thermal Mgmt
will limit the power/performance.

`integral_cutoff`
The error threshold below which the errors are accumulated. This may be useful
to avoid accumulating errors when the error is positive, i.e. the temperature is
below the control temperature.

`integral_max`
The maximum value of accumulated errors. This may be useful to limit the
positive accumulation of error which, in turn, affects the overshooting
operation.

`k_p_undershoot`
The proportional coefficient used when the error is positive (actual temperature
below control temperature).

`k_p_overshoot`
The proportional coefficient used when the error is negative (actual temperature
above control temperature).

`k_integral`
The integral coefficient by which the accumulated error is multiplied.

### Per-actor tunings

`weight`
The coefficient used as an allocation factor for a specific actor. It can take
any value and expresses the weight an actor has relative to the others.

### Thermal protection

A temperature protection can be configured with two different alarms, warning
and critical. They can be configured independently, but the critical level
should be above the warning level. A callback can be configured for each
threshold so that an additional module can take action to reduce the
temperature or initiate a power-down sequence.


## Power models

The power model is a platform-specific module that needs to be implemented by
each platform in order to perform power-to-performance level conversions and
vice-versa.
This separates the Thermal Management common code from platform-specific
characteristics.
When implementing the APIs, the Power Model module should also allow incoming
bind requests from Thermal Mgmt.

## Activity factor
An activity factor driver API can optionally be specified. It allows idle power
from each actor to be accumulated during operation and distributed amongst the
other actors. Activity factor is an optional feature that can be configured
when it is required.
The driver API is platform-specific, which gives flexibility of implementation
depending on the system configuration.


## Configuration Example 1 (2 actors, 1 temperature domain)

```C
static struct mod_thermal_mgmt_actor_config actor_table_domain0[] = {
    [0] = {
        .driver_id =
            FWK_ID_ELEMENT_INIT(FWK_MODULE_IDX_PLAT_POWER_MODEL, 0),
        .dvfs_domain_id =
            FWK_ID_ELEMENT_INIT(
                FWK_MODULE_IDX_DVFS, DVFS_ELEMENT_IDX_ACTOR0),
        .weight = 100,
    },
    [1] = {
        .driver_id =
            FWK_ID_ELEMENT_INIT(FWK_MODULE_IDX_PLAT_POWER_MODEL, 1),
        .dvfs_domain_id =
            FWK_ID_ELEMENT_INIT(
                FWK_MODULE_IDX_DVFS, DVFS_ELEMENT_IDX_ACTOR1),
        .weight = 200,
    },
};

struct fwk_element thermal_mgmt_domains_elem_table[] = {
    [0] = {
        .name = "Thermal Domain 0",
        .data = &((struct mod_thermal_mgmt_dev_config){
            .slow_loop_mult = 20,

            .tdp = 5000,
            .pi_controller = {
                .switch_on_temperature = 50,
                .control_temperature = 60,
                .integral_cutoff = 0,
                .integral_max = 100,
                .k_p_undershoot = 1,
                .k_p_overshoot = 1,
                .k_integral = 1,
            },

            .sensor_id = FWK_ID_ELEMENT_INIT(FWK_MODULE_IDX_SENSOR, 0),
            .temp_protection = &((struct mod_thermal_mgmt_protection_config){
                .driver_id = FWK_ID_ELEMENT_INIT(FWK_MODULE_IDX_, 0),
                .driver_api_id = FWK_ID_API_INIT(
                    FWK_MODULE_IDX_,
                    0),
                .warn_temp_threshold = 60,
                .crit_temp_threshold = 85,
            }),
            .driver_api_id =
                FWK_ID_API_INIT(FWK_MODULE_IDX_PLAT_POWER_MODEL, 0),
            .thermal_actors_table = actor_table_domain0,
            .thermal_actors_count = FWK_ARRAY_SIZE(actor_table_domain0),
        }),
    },
    [1] = { 0 } /* Termination description */
};

static const struct fwk_element *get_thermal_mgmt_element_table(
    fwk_id_t module_id)
{
    return thermal_mgmt_domains_elem_table;
}

struct fwk_module_config config_thermal_mgmt = {
    .elements = FWK_MODULE_DYNAMIC_ELEMENTS(get_thermal_mgmt_element_table),
};

```

The power model should implement the following API:

```C

uint32_t plat_level_to_power(fwk_id_t domain_id, const uint32_t level)
{
    uint32_t power = 0;

    /* compute the power for this actor/domain at this level */

    return power;
}

uint32_t plat_power_to_level(fwk_id_t domain_id, const uint32_t power)
{
    uint32_t perf_level = 0;

    /* compute the performance level for this actor/domain for this power */

    return perf_level;
}

struct mod_thermal_mgmt_power_model_api power_model_api = {
    .level_to_power = plat_level_to_power,
    .power_to_level = plat_power_to_level,
};

```

## Configuration Example 2 (2 actors, 1 temperature domain and activity factor)

```C
static struct mod_thermal_mgmt_actor_config actor_table_domain0[] = {
    [0] = {
        .driver_id =
            FWK_ID_ELEMENT_INIT(FWK_MODULE_IDX_PLAT_POWER_MODEL, 0),
        .dvfs_domain_id =
            FWK_ID_ELEMENT_INIT(
                FWK_MODULE_IDX_DVFS, DVFS_ELEMENT_IDX_ACTOR0),
        .weight = 100,
        .activity_factor = &((struct mod_thermal_mgmt_activity_factor_config){
            .driver_id = FWK_ID_ELEMENT_INIT(
                FWK_MODULE_IDX_PLATFORM_ACTIVITY,
                0),
            .driver_api_id = FWK_ID_API_INIT(
                FWK_MODULE_IDX_PLATFORM_ACTIVITY,
                MOD_PLATFORM_ACTIVITY_DRIVER_API_IDX),
        }),
    },
    [1] = {
        .driver_id =
            FWK_ID_ELEMENT_INIT(FWK_MODULE_IDX_PLAT_POWER_MODEL, 1),
        .dvfs_domain_id =
            FWK_ID_ELEMENT_INIT(
                FWK_MODULE_IDX_DVFS, DVFS_ELEMENT_IDX_ACTOR1),
        .weight = 200,
        .activity_factor = &((struct mod_thermal_mgmt_activity_factor_config){
            .driver_id = FWK_ID_ELEMENT_INIT(
                FWK_MODULE_IDX_PLATFORM_ACTIVITY,
                1),
            .driver_api_id = FWK_ID_API_INIT(
                FWK_MODULE_IDX_PLATFORM_ACTIVITY,
                MOD_PLATFORM_ACTIVITY_DRIVER_API_IDX),
        }),
    },
};

struct fwk_element thermal_mgmt_domains_elem_table[] = {
    [0] = {
        .name = "Thermal Domain 0",
        .data = &((struct mod_thermal_mgmt_dev_config){
            .slow_loop_mult = 20,

            .tdp = 5000,
            .pi_controller = {
                .switch_on_temperature = 50,
                .control_temperature = 60,
                .integral_cutoff = 0,
                .integral_max = 100,
                .k_p_undershoot = 1,
                .k_p_overshoot = 1,
                .k_integral = 1,
            },

            .sensor_id = FWK_ID_ELEMENT_INIT(FWK_MODULE_IDX_SENSOR, 0),
            .temp_protection = &((struct mod_thermal_mgmt_protection_config){
                .driver_id = FWK_ID_ELEMENT_INIT(FWK_MODULE_IDX_, 0),
                .driver_api_id = FWK_ID_API_INIT(
                    FWK_MODULE_IDX_,
                    0),
                .warn_temp_threshold = 60,
                .crit_temp_threshold = 85,
            }),
            .driver_api_id =
                FWK_ID_API_INIT(FWK_MODULE_IDX_PLAT_POWER_MODEL, 0),
            .thermal_actors_table = actor_table_domain0,
            .thermal_actors_count = FWK_ARRAY_SIZE(actor_table_domain0),
        }),
    },
    [1] = { 0 } /* Termination description */
};

static const struct fwk_element *get_thermal_mgmt_element_table(
    fwk_id_t module_id)
{
    return thermal_mgmt_domains_elem_table;
}

struct fwk_module_config config_thermal_mgmt = {
    .elements = FWK_MODULE_DYNAMIC_ELEMENTS(get_thermal_mgmt_element_table),
};

```

The activity factor module should implement the following API:

```C

int plat_get_activity_factor(fwk_id_t domain_id, uint16_t *activity)
{
    uint16_t plat_activity_factor = 0;

    /* Compute the activity factor for the given domain id */

    *activity = plat_activity_factor;
    return FWK_SUCCESS;
}

struct mod_thermal_mgmt_activity_factor_api activity_factor_api = {
    .get_activity_factor = plat_get_activity_factor,
};

```

## Configuration Example 3 (thermal protection)

The module can also be configured to provide thermal protection only.

```C
struct fwk_element thermal_mgmt_domains_elem_table[] = {
    [0] = {
        .name = "Thermal Domain 0",
        .data = &((struct mod_thermal_mgmt_dev_config){
            .slow_loop_mult = 5,
            .sensor_id = FWK_ID_ELEMENT_INIT(FWK_MODULE_IDX_SENSOR, 0),
            .temp_protection = &((struct mod_thermal_mgmt_protection_config){
                .driver_id = FWK_ID_ELEMENT_INIT(FWK_MODULE_IDX_, 0),
                .driver_api_id = FWK_ID_API_INIT(
                    FWK_MODULE_IDX_,
                    0),
                .warn_temp_threshold = 60,
                .crit_temp_threshold = 85,
            }),
        }),
    },
    [1] = { 0 } /* Termination description */
};

static const struct fwk_element *get_thermal_mgmt_element_table(
    fwk_id_t module_id)
{
    return thermal_mgmt_domains_elem_table;
}

struct fwk_module_config config_thermal_mgmt = {
    .elements = FWK_MODULE_DYNAMIC_ELEMENTS(get_thermal_mgmt_element_table),
};

```
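
The slow-loop PI update described under "Tunings & Settings" can be sketched as
follows. This is a minimal, standalone illustration under stated assumptions,
not the module's implementation: the `sketch_pi_*` names and struct layouts are
hypothetical, the error is taken as control temperature minus current
temperature (so it is positive below the control temperature, matching the
tuning descriptions), the PI output is assumed to be added to `tdp` to form the
allocatable power, and below `switch_on_temperature` the full `tdp` is assumed
to be available.

```C
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container mirroring the tuning names used in this document. */
struct sketch_pi_tunings {
    uint32_t tdp;
    uint32_t switch_on_temperature;
    uint32_t control_temperature;
    int32_t integral_cutoff;
    int32_t integral_max;
    int32_t k_p_undershoot;
    int32_t k_p_overshoot;
    int32_t k_integral;
};

/* Hypothetical per-domain state carried between slow-loop iterations. */
struct sketch_pi_state {
    int32_t integral_error;
};

/* Returns the allocatable power for one slow-loop iteration. */
uint32_t sketch_pi_update(
    const struct sketch_pi_tunings *t,
    struct sketch_pi_state *s,
    uint32_t cur_temp)
{
    assert(t != NULL && s != NULL);

    /* Assumption: below switch-on the loop is idle and tdp is available. */
    if (cur_temp < t->switch_on_temperature) {
        s->integral_error = 0;
        return t->tdp;
    }

    /* Positive error: below control temperature. Negative: above it. */
    int32_t err = (int32_t)t->control_temperature - (int32_t)cur_temp;

    /* Accumulate only below the cutoff and while the integral term is bounded. */
    int64_t candidate =
        (int64_t)t->k_integral * ((int64_t)s->integral_error + err);
    if ((err < t->integral_cutoff) && (candidate < t->integral_max)) {
        s->integral_error += err;
    }

    int32_t k_p = (err < 0) ? t->k_p_overshoot : t->k_p_undershoot;

    int64_t pi_power =
        (int64_t)k_p * err + (int64_t)t->k_integral * s->integral_error;
    int64_t total = (int64_t)t->tdp + pi_power;

    /* Clamp the result to the range of the return type. */
    if (total <= 0) {
        return 0;
    }
    if (total > UINT32_MAX) {
        return UINT32_MAX;
    }
    return (uint32_t)total;
}
```

Using the values from Configuration Example 1 (`tdp` of 5000, control
temperature of 60, all coefficients set to 1), a reading of 70 gives an error of
-10 and an allocatable power of 4980 on the first update. A second consecutive
reading of 70 accumulates more error and yields 4970.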