\ingroup GroupModules Modules
\defgroup GroupThermal Thermal Management

# Thermal Management Architecture

Copyright (c) 2022, Arm Limited. All rights reserved.


## Overview

Thermal Management is a basic closed-loop temperature controller which
dynamically controls the platform performance within a thermal envelope.

By closing the control loop and dividing the power among actors according to
their requests and weights, Thermal Management distributes the available power
within the thermal constraints.

The power delivered to each actor (CPU, GPU, etc.) is controlled by adjusting
the frequency and the voltage supplied to that actor (performance limits).
Each actor can be assigned its own power model, which defines the equations
for converting power into performance levels and vice versa.

Allocating the appropriate performance level to each actor bounds the power it
consumes, and thus the temperature is maintained at the desired level.

A system can contain multiple Thermal Management controllers. Each of them
governs a separate temperature domain, which has its own dedicated temperature
sensor.

## Architecture of a Thermal Management controller

The two main blocks composing Thermal Management are the PI (Proportional and
Integral) control loop and the power divider.

                               Thermal Design
    current temp                Power (TDP)      weight (config)
         |                           |             |   |
         |                           |             |   |
         |                           |             |   |
         v                           v             v   v
       +-+-+     +---------+       +-+-+        +--+---+--+
       |   +---->+ PI Ctrl +------>+   +------->+  power  +----> power granted
       +-+-+     +---------+       +---+   +--->+ divider +---->
         ^                                 |    +--+---+--+
         |                                 |       ^   ^
         |    performance request ---------+       |   |
         |                                   +-----+   +-----+
    control temp                             |               |
                                             v               v
                                         +---+---+       +---+---+
                                         | power |       | power |
                                         | model |       | model |
                                         +-------+       +-------+


The control loop reacts to deviations of the temperature from the control
temperature and produces a (proportional and integral) signal, which is then
converted into an allocatable power offset.
The resulting available power is then distributed across the actors by a basic
weighted bias mechanism. Each actor gets a fraction of the available power
proportional to its weight.

Thermal Management runs internally on two loops: a fast loop and a slow loop.
The fast loop provides a fast adjustment of the performance requests, while
the slow loop updates the PI control (which in turn provides an updated
allocatable power). The temperature is sampled at the slow loop cadence.

Thermal Management uses the plugin-handler extension to get a tick that is
synchronous with the performance requests coming from SCMI. This way the
fast loop is directly synchronised with the performance chain to adjust the
performance limits. This tick sets the fast loop period.
The slow loop is derived internally from the fast loop cadence by a configurable
multiplication factor. This allows custom tuning of the PI control timings.

An additional module translates power to performance and vice versa. Each
actor is required to have its own power model, implemented as an additional
platform-specific module for best modularity and flexibility.

The interface for the power model is defined by Thermal Management
(mod_thermal_power_model_api).


## Algorithm

At each regular tick (fast loop), Thermal Management will:
- convert the requested performance level into power
- attempt to distribute the power across actors based on their requests and
  weights, collecting any spare power and keeping track of any shortages
- add any carry-over power not consumed in the previous cycle to the
  available spare power
- re-distribute any available spare power across actors based on their shortage
- keep any spare power left as the carry-over power for the next cycle
- convert the granted power into a performance level
- apply a performance limit on the actor's corresponding domain if the power
  requested could not be met

The above power<->performance conversions are performed within the
platform-specific power model module.

With a slower periodicity (slow loop), Thermal Management will:
- initiate a temperature reading
- run the PI control and update the total available power
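
The fast-loop steps listed above can be sketched roughly as follows. This is a
minimal illustration and not the module's implementation: `struct actor_state`,
`fast_loop_allocate` and the integer arithmetic are assumptions made for the
example. The power<->performance conversions are left out, so `demand` is
assumed to already hold the power equivalent of the requested performance.

```C
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-actor state, for illustration only. */
struct actor_state {
    uint32_t weight;   /* allocation bias */
    uint32_t demand;   /* power equivalent of the requested performance */
    uint32_t granted;  /* power granted in this cycle */
    uint32_t shortage; /* part of the demand that could not be met */
    bool limited;      /* true if a performance limit must be applied */
};

/*
 * Run one allocation cycle. 'total_power' is the allocatable power from the
 * PI control and 'carry_over' the spare power left by the previous cycle.
 * Returns the carry-over power for the next cycle.
 * Assumes power values small enough that the products fit in 64 bits.
 */
static uint32_t fast_loop_allocate(
    struct actor_state *actors,
    size_t count,
    uint32_t total_power,
    uint32_t carry_over)
{
    uint64_t total_weight = 0, total_shortage = 0, spare, pool;
    size_t i;

    for (i = 0; i < count; i++) {
        total_weight += actors[i].weight;
    }

    /* First pass: grant each actor at most its weighted share. */
    spare = total_power;
    for (i = 0; i < count; i++) {
        uint64_t share = (total_weight != 0) ?
            ((uint64_t)total_power * actors[i].weight) / total_weight : 0;
        uint64_t grant = (actors[i].demand < share) ? actors[i].demand : share;

        actors[i].granted = (uint32_t)grant;
        actors[i].shortage = actors[i].demand - actors[i].granted;
        total_shortage += actors[i].shortage;
        spare -= grant;
    }

    /* Add the carry-over from the previous cycle to the spare power. */
    spare += carry_over;

    /* Second pass: share the spare power in proportion to the shortages. */
    pool = spare;
    for (i = 0; (total_shortage != 0) && (i < count); i++) {
        uint64_t extra = (pool * actors[i].shortage) / total_shortage;

        if (extra > actors[i].shortage) {
            extra = actors[i].shortage;
        }
        actors[i].granted += (uint32_t)extra;
        actors[i].shortage -= (uint32_t)extra;
        spare -= extra;
    }

    /* An actor whose request could not be met gets a performance limit. */
    for (i = 0; i < count; i++) {
        actors[i].limited = (actors[i].shortage != 0);
    }

    return (spare > UINT32_MAX) ? UINT32_MAX : (uint32_t)spare;
}
```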


## Use

To use Thermal Management, the following dependencies are required:
- performance plugin handler enabled (BS_FIRMWARE_HAS_PERF_PLUGIN_HANDLER)
- Power Model in product/<product-name>/module/product_power_model


## Limitations

The implementation is currently at the "prototype" stage and only limited
testing has been carried out.


## Tunings & Settings

### Global PI control tunings

`slow_loop_mult`
Multiplier applied to the base tick received via the ->update callback.
For example: if the tick period is 5 ms, then a value of 20 will give the PI
control a refresh period of 100 ms (= 5 * 20).
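
As a rough illustration of how a slow loop can be derived from the fast tick
with such a multiplier, consider the sketch below. `struct loop_state` and
`fast_tick` are hypothetical names introduced for this example, and the exact
tick on which the module fires its slow loop is not specified here.

```C
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical loop state; field names are illustrative only. */
struct loop_state {
    uint32_t slow_loop_mult; /* fast ticks per slow-loop iteration */
    uint32_t tick_count;     /* fast ticks seen since the last slow loop */
};

/*
 * Called once per fast-loop tick. Returns true when the slow loop
 * (temperature reading and PI update) is due on this tick.
 */
static bool fast_tick(struct loop_state *state)
{
    state->tick_count++;
    if (state->tick_count >= state->slow_loop_mult) {
        state->tick_count = 0;
        return true;
    }
    return false;
}
```

With `slow_loop_mult = 20` and a 5 ms tick, `fast_tick()` in this sketch
returns true on every 20th call, that is every 100 ms.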

### Per-temperature domain tunings

`tdp` (thermal design power)
The thermal design power for all the actors monitored. This can be an abstract
value as long as the power model can work with it.

`switch_on_temperature`
The temperature above which the Thermal Management algorithm will run. The unit
needs to be consistent with the value provided by the targeted temperature
sensor.

`control_temperature`
The control temperature for the platform. Above this temperature Thermal
Management will limit the power/performance.

`integral_cutoff`
The error threshold below which the errors are accumulated. This may be useful
to avoid accumulating errors when the error is positive, i.e. the temperature
is below the control temperature.

`integral_max`
The maximum value of the accumulated errors. This may be useful to limit the
positive accumulation of error which, in turn, will affect the overshooting
operation.

`k_p_undershoot`
The proportional coefficient used when the error is positive (actual temperature
below control temperature).

`k_p_overshoot`
The proportional coefficient used when the error is negative (actual temperature
above control temperature).

`k_integral`
The integral coefficient by which the accumulated error is multiplied.
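
To show how these tunings could interact, here is a minimal sketch of one PI
update. It is an interpretation of the descriptions above under stated
assumptions, not the module's code: `struct pi_config`, `pi_update`, the field
types, returning `tdp` while below `switch_on_temperature`, and adding the PI
terms to `tdp` are all assumptions made for the example.

```C
#include <stdint.h>

/* Hypothetical mirror of the per-domain tunings; types are assumed. */
struct pi_config {
    uint32_t tdp;
    int32_t switch_on_temperature;
    int32_t control_temperature;
    int32_t integral_cutoff;
    int32_t integral_max;
    int32_t k_p_undershoot;
    int32_t k_p_overshoot;
    int32_t k_integral;
};

/*
 * Run one PI iteration for 'cur_temp' and return the allocatable power.
 * 'integral_error' holds the accumulated error between calls.
 */
static uint32_t pi_update(
    const struct pi_config *cfg,
    int32_t *integral_error,
    int32_t cur_temp)
{
    int64_t err, k_p, power;

    /* Below the switch-on temperature the control loop does not run. */
    if (cur_temp < cfg->switch_on_temperature) {
        *integral_error = 0;
        return cfg->tdp;
    }

    /* Positive error: temperature below control. Negative: above. */
    err = (int64_t)cfg->control_temperature - cur_temp;
    k_p = (err < 0) ? cfg->k_p_overshoot : cfg->k_p_undershoot;

    /* Accumulate only errors below the cutoff, capped at integral_max. */
    if (err < cfg->integral_cutoff) {
        int64_t acc = (int64_t)*integral_error + err;

        if (acc > cfg->integral_max) {
            acc = cfg->integral_max;
        }
        if (acc < INT32_MIN) {
            acc = INT32_MIN;
        }
        *integral_error = (int32_t)acc;
    }

    power = (int64_t)cfg->tdp + (k_p * err) +
        ((int64_t)cfg->k_integral * *integral_error);

    if (power < 0) {
        power = 0;
    }
    if (power > UINT32_MAX) {
        power = UINT32_MAX;
    }
    return (uint32_t)power;
}
```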

### Per-actor tunings

`weight`
The coefficient used as an allocation factor for a specific actor. Any value
can be used; what matters is the weight of an actor relative to the others.
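
A minimal sketch of a weight-proportional split is shown below, assuming
integer arithmetic. `struct actor` and `divide_power` are hypothetical names
used only for this illustration.

```C
#include <stddef.h>
#include <stdint.h>

/* Hypothetical actor descriptor used only for this illustration. */
struct actor {
    uint32_t weight;  /* allocation bias from the configuration */
    uint32_t granted; /* power granted to this actor */
};

/*
 * Split 'available' power across the actors in proportion to their weights.
 * Returns the power left unallocated because of integer rounding.
 */
static uint32_t divide_power(
    struct actor *actors,
    size_t count,
    uint32_t available)
{
    uint64_t total_weight = 0;
    uint32_t allocated = 0;
    size_t i;

    for (i = 0; i < count; i++) {
        total_weight += actors[i].weight;
    }

    for (i = 0; i < count; i++) {
        /* With no weights configured, nothing is granted. */
        actors[i].granted = (total_weight != 0) ?
            (uint32_t)(((uint64_t)available * actors[i].weight) /
                       total_weight) :
            0;
        allocated += actors[i].granted;
    }

    return available - allocated;
}
```

With weights 100 and 200, an available power of 3000 is split into 1000 and
2000. Weights 1 and 2 would give the same result, since only the ratio counts.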

### Thermal protection

A temperature protection can be configured with two different alarms, warning
and critical. Their thresholds are configured independently, but the critical
threshold should be above the warning threshold. A callback can be configured
for each threshold so that an additional module can take action to reduce the
temperature or initiate a power-down sequence.
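
The threshold check could look roughly like the sketch below. The names
(`struct protection`, `alarm_callback_t`, `check_protection`) and the strict
"greater than" comparison are assumptions made for illustration. The real
callbacks are provided through a driver API of another module.

```C
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback type invoked when a threshold is exceeded. */
typedef void (*alarm_callback_t)(int32_t temperature);

/* Hypothetical protection settings, for illustration only. */
struct protection {
    int32_t warn_temp_threshold;
    int32_t crit_temp_threshold;
    alarm_callback_t warning;  /* may be NULL */
    alarm_callback_t critical; /* may be NULL */
};

enum alarm_level { ALARM_NONE, ALARM_WARNING, ALARM_CRITICAL };

/* Check a temperature sample and raise the most severe applicable alarm. */
static enum alarm_level check_protection(
    const struct protection *cfg,
    int32_t temperature)
{
    if (temperature > cfg->crit_temp_threshold) {
        if (cfg->critical != NULL) {
            cfg->critical(temperature);
        }
        return ALARM_CRITICAL;
    }

    if (temperature > cfg->warn_temp_threshold) {
        if (cfg->warning != NULL) {
            cfg->warning(temperature);
        }
        return ALARM_WARNING;
    }

    return ALARM_NONE;
}
```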


## Power models

The power model is a platform-specific module that each platform needs to
implement in order to perform power-to-performance level conversions and
vice versa.
This separates the Thermal Management common code from the platform-specific
characteristics.
When implementing the APIs, the Power Model module should also allow incoming
bind requests from Thermal Management.
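
One possible shape for such conversions is a lookup table of operating points.
The sketch below is only an example under assumptions: the table values, the
`struct opp` type and the function names are hypothetical, and a real platform
may use equations instead of a table.

```C
#include <stddef.h>
#include <stdint.h>

/* Hypothetical operating point: performance level and the power it draws. */
struct opp {
    uint32_t level;
    uint32_t power;
};

/* Hypothetical table, sorted by ascending level and power. */
static const struct opp opp_table[] = {
    { .level = 500, .power = 300 },
    { .level = 1000, .power = 800 },
    { .level = 1500, .power = 1700 },
};

#define OPP_COUNT (sizeof(opp_table) / sizeof(opp_table[0]))

/* Power of the lowest operating point able to deliver 'level'. */
static uint32_t table_level_to_power(uint32_t level)
{
    size_t i;

    for (i = 0; i < OPP_COUNT; i++) {
        if (opp_table[i].level >= level) {
            return opp_table[i].power;
        }
    }

    /* Requests above the table are capped at the highest entry. */
    return opp_table[OPP_COUNT - 1].power;
}

/* Highest level whose power fits within the 'power' budget. */
static uint32_t table_power_to_level(uint32_t power)
{
    uint32_t level = opp_table[0].level; /* never below the lowest entry */
    size_t i;

    for (i = 0; i < OPP_COUNT; i++) {
        if (opp_table[i].power <= power) {
            level = opp_table[i].level;
        }
    }

    return level;
}
```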

## Activity factor

An activity factor driver API can optionally be specified. It allows the idle
power of each actor to be accumulated during operation and distributed among
the other actors for them to use. The activity factor is an optional feature
that can be configured when required.
The driver API is platform-specific, which gives flexibility of implementation
depending on the system configuration.


## Configuration Example 1 (2 actors, 1 temperature domain)

```C
static struct mod_thermal_mgmt_actor_config actor_table_domain0[] = {
    [0] = {
        .driver_id =
            FWK_ID_ELEMENT_INIT(FWK_MODULE_IDX_PLAT_POWER_MODEL, 0),
        .dvfs_domain_id =
            FWK_ID_ELEMENT_INIT(
                FWK_MODULE_IDX_DVFS, DVFS_ELEMENT_IDX_ACTOR0),
        .weight = 100,
    },
    [1] = {
        .driver_id =
            FWK_ID_ELEMENT_INIT(FWK_MODULE_IDX_PLAT_POWER_MODEL, 1),
        .dvfs_domain_id =
            FWK_ID_ELEMENT_INIT(
                FWK_MODULE_IDX_DVFS, DVFS_ELEMENT_IDX_ACTOR1),
        .weight = 200,
    },
};

struct fwk_element thermal_mgmt_domains_elem_table[] = {
    [0] = {
        .name = "Thermal Domain 0",
        .data = &((struct mod_thermal_mgmt_dev_config){
            .slow_loop_mult = 20,

            .tdp = 5000,
            .pi_controller = {
                .switch_on_temperature = 50,
                .control_temperature = 60,
                .integral_cutoff = 0,
                .integral_max = 100,
                .k_p_undershoot = 1,
                .k_p_overshoot = 1,
                .k_integral = 1,
            },

            .sensor_id = FWK_ID_ELEMENT_INIT(FWK_MODULE_IDX_SENSOR, 0),
            .temp_protection = &((struct mod_thermal_mgmt_protection_config){
                .driver_id = FWK_ID_ELEMENT_INIT(FWK_MODULE_IDX_, 0),
                .driver_api_id = FWK_ID_API_INIT(
                    FWK_MODULE_IDX_,
                    0),
                .warn_temp_threshold = 60,
                .crit_temp_threshold = 85,
            }),
            .driver_api_id =
                FWK_ID_API_INIT(FWK_MODULE_IDX_PLAT_POWER_MODEL, 0),
            .thermal_actors_table = actor_table_domain0,
            .thermal_actors_count = FWK_ARRAY_SIZE(actor_table_domain0),
        }),
    },
    [1] = { 0 } /* Termination description */
};

/* Return the element table when the framework requests it. */
static const struct fwk_element *get_thermal_mgmt_element_table(
    fwk_id_t module_id)
{
    return thermal_mgmt_domains_elem_table;
}

struct fwk_module_config config_thermal_mgmt = {
    .elements = FWK_MODULE_DYNAMIC_ELEMENTS(get_thermal_mgmt_element_table),
};
```

The power model should implement the following API:

```C
uint32_t plat_level_to_power(fwk_id_t domain_id, const uint32_t level)
{
    uint32_t power = 0; /* placeholder result */

    /* Compute the power for this actor/domain at this level. */

    return power;
}

uint32_t plat_power_to_level(fwk_id_t domain_id, const uint32_t power)
{
    uint32_t perf_level = 0; /* placeholder result */

    /* Compute the performance level for this actor/domain for this power. */

    return perf_level;
}

struct mod_thermal_mgmt_power_model_api power_model_api = {
    .level_to_power = plat_level_to_power,
    .power_to_level = plat_power_to_level,
};
```

## Configuration Example 2 (2 actors, 1 temperature domain and activity factor)

```C
static struct mod_thermal_mgmt_actor_config actor_table_domain0[] = {
    [0] = {
        .driver_id =
            FWK_ID_ELEMENT_INIT(FWK_MODULE_IDX_PLAT_POWER_MODEL, 0),
        .dvfs_domain_id =
            FWK_ID_ELEMENT_INIT(
                FWK_MODULE_IDX_DVFS, DVFS_ELEMENT_IDX_ACTOR0),
        .weight = 100,
        .activity_factor = &((struct mod_thermal_mgmt_activity_factor_config){
            .driver_id = FWK_ID_ELEMENT_INIT(
                FWK_MODULE_IDX_PLATFORM_ACTIVITY,
                0),
            .driver_api_id = FWK_ID_API_INIT(
                FWK_MODULE_IDX_PLATFORM_ACTIVITY,
                MOD_PLATFORM_ACTIVITY_DRIVER_API_IDX),
        }),
    },
    [1] = {
        .driver_id =
            FWK_ID_ELEMENT_INIT(FWK_MODULE_IDX_PLAT_POWER_MODEL, 1),
        .dvfs_domain_id =
            FWK_ID_ELEMENT_INIT(
                FWK_MODULE_IDX_DVFS, DVFS_ELEMENT_IDX_ACTOR1),
        .weight = 200,
        .activity_factor = &((struct mod_thermal_mgmt_activity_factor_config){
            .driver_id = FWK_ID_ELEMENT_INIT(
                FWK_MODULE_IDX_PLATFORM_ACTIVITY,
                1),
            .driver_api_id = FWK_ID_API_INIT(
                FWK_MODULE_IDX_PLATFORM_ACTIVITY,
                MOD_PLATFORM_ACTIVITY_DRIVER_API_IDX),
        }),
    },
};

struct fwk_element thermal_mgmt_domains_elem_table[] = {
    [0] = {
        .name = "Thermal Domain 0",
        .data = &((struct mod_thermal_mgmt_dev_config){
            .slow_loop_mult = 20,

            .tdp = 5000,
            .pi_controller = {
                .switch_on_temperature = 50,
                .control_temperature = 60,
                .integral_cutoff = 0,
                .integral_max = 100,
                .k_p_undershoot = 1,
                .k_p_overshoot = 1,
                .k_integral = 1,
            },

            .sensor_id = FWK_ID_ELEMENT_INIT(FWK_MODULE_IDX_SENSOR, 0),
            .temp_protection = &((struct mod_thermal_mgmt_protection_config){
                .driver_id = FWK_ID_ELEMENT_INIT(FWK_MODULE_IDX_, 0),
                .driver_api_id = FWK_ID_API_INIT(
                    FWK_MODULE_IDX_,
                    0),
                .warn_temp_threshold = 60,
                .crit_temp_threshold = 85,
            }),
            .driver_api_id =
                FWK_ID_API_INIT(FWK_MODULE_IDX_PLAT_POWER_MODEL, 0),
            .thermal_actors_table = actor_table_domain0,
            .thermal_actors_count = FWK_ARRAY_SIZE(actor_table_domain0),
        }),
    },
    [1] = { 0 } /* Termination description */
};

/* Return the element table when the framework requests it. */
static const struct fwk_element *get_thermal_mgmt_element_table(
    fwk_id_t module_id)
{
    return thermal_mgmt_domains_elem_table;
}

struct fwk_module_config config_thermal_mgmt = {
    .elements = FWK_MODULE_DYNAMIC_ELEMENTS(get_thermal_mgmt_element_table),
};
```

The activity factor module should implement the following API:

```C
int plat_get_activity_factor(fwk_id_t domain_id, uint16_t *activity)
{
    uint16_t plat_activity_factor = 0; /* placeholder result */

    /* Compute the activity factor for the given domain id. */

    *activity = plat_activity_factor;
    return FWK_SUCCESS;
}

struct mod_thermal_mgmt_activity_factor_api activity_factor_api = {
    .get_activity_factor = plat_get_activity_factor,
};
```

## Configuration Example 3 (thermal protection)

The module can also be configured to act as a thermal protection only.

```C
struct fwk_element thermal_mgmt_domains_elem_table[] = {
    [0] = {
        .name = "Thermal Domain 0",
        .data = &((struct mod_thermal_mgmt_dev_config){
            .slow_loop_mult = 5,
            .sensor_id = FWK_ID_ELEMENT_INIT(FWK_MODULE_IDX_SENSOR, 0),
            .temp_protection = &((struct mod_thermal_mgmt_protection_config){
                .driver_id = FWK_ID_ELEMENT_INIT(FWK_MODULE_IDX_, 0),
                .driver_api_id = FWK_ID_API_INIT(
                    FWK_MODULE_IDX_,
                    0),
                .warn_temp_threshold = 60,
                .crit_temp_threshold = 85,
            }),
        }),
    },
    [1] = { 0 } /* Termination description */
};

/* Return the element table when the framework requests it. */
static const struct fwk_element *get_thermal_mgmt_element_table(
    fwk_id_t module_id)
{
    return thermal_mgmt_domains_elem_table;
}

struct fwk_module_config config_thermal_mgmt = {
    .elements = FWK_MODULE_DYNAMIC_ELEMENTS(get_thermal_mgmt_element_table),
};
```