Commit 9b1b3dcd authored by Linus Torvalds
Pull power management updates from Rafael Wysocki:
 "By the number of commits, cpufreq is the leading party (again) and the
  most visible change there is the removal of the omap-cpufreq driver
  that has not been used for a long time (good riddance). There are also
  quite a few changes in the cppc_cpufreq driver, mostly related to
  fixing its frequency invariance engine in the case when the CPPC
  registers used by it are not in PCC. In addition to that, support for
  AM62L3 is added to the ti-cpufreq driver and the cpufreq-dt-platdev
  list is updated for some platforms. The remaining cpufreq changes are
  assorted fixes and cleanups.

  Next up is cpuidle and the changes there are dominated by intel_idle
  driver updates, mostly related to the new command line facility
  allowing users to adjust the list of C-states used by the driver.
  There are also a few updates of cpuidle governors, including two menu
  governor fixes and some refinements of the teo governor, and a
  MAINTAINERS update adding Christian Loehle as a cpuidle reviewer.
  [Thanks for stepping up Christian!]

  The most significant update related to system suspend and hibernation
  is the one to stop freezing the PM runtime workqueue during system PM
  transitions which allows some deadlocks to be avoided. There is also a
  fix for possible concurrent bit field updates in the core device
  suspend code and a few other minor fixes.

  Apart from the above, several drivers are updated to discard the
  return value of pm_runtime_put() which is going to be converted to a
  void function as soon as everybody stops using its return value, PL4
  support for Ice Lake is added to the Intel RAPL power capping driver,
  and there are assorted cleanups, documentation fixes, and some
  cpupower utility improvements.

  Specifics:

   - Remove the unused omap-cpufreq driver (Andreas Kemnade)

   - Optimize error handling code in cpufreq_boost_trigger_state() and
     make cpufreq_boost_trigger_state() return -EOPNOTSUPP if no policy
     supports boost (Lifeng Zheng)

   - Update cpufreq-dt-platdev list for tegra, qcom, TI (Aaron Kling,
     Dhruva Gole, and Konrad Dybcio)

   - Minor improvements to the cpufreq and cpumask rust implementation
     (Alexandre Courbot, Alice Ryhl, Tamir Duberstein, and Yilin Chen)

   - Add support for AM62L3 SoC to the ti-cpufreq driver (Dhruva Gole)

   - Update arch_freq_scale in the CPPC cpufreq driver's frequency
     invariance engine (FIE) in scheduler ticks if the related CPPC
     registers are not in PCC (Jie Zhan)

   - Assorted minor cleanups and improvements in ARM cpufreq drivers
     (Juan Martinez, Felix Gu, Luca Weiss, and Sergey Shtylyov)

   - Add generic helpers for sysfs show/store to cppc_cpufreq (Sumit
     Gupta)

   - Make the scaling_setspeed cpufreq sysfs attribute return the actual
     requested frequency to avoid confusion (Pengjie Zhang)

   - Simplify the idle CPU time granularity test in the ondemand cpufreq
     governor (Frederic Weisbecker)

   - Enable asym capacity in intel_pstate only when CPU SMT is not
     possible (Yaxiong Tian)

   - Update the description of rate_limit_us default value in cpufreq
     documentation (Yaxiong Tian)

   - Add a command line option to adjust the C-states table in the
     intel_idle driver, remove the 'preferred_cstates' module parameter
     from it, add C-states validation to it and clean it up (Artem
     Bityutskiy)

   - Make the menu cpuidle governor always check the time till the
     closest timer event when the scheduler tick has been stopped to
     prevent it from mistakenly selecting the deepest available idle
     state (Rafael Wysocki)

   - Update the teo cpuidle governor to avoid making suboptimal
     decisions in certain corner cases and generally improve idle state
     selection accuracy (Rafael Wysocki)

   - Remove an unlikely() annotation on the early-return condition in
     menu_select() that leads to branch misprediction 100% of the time
     on systems with only 1 idle state enabled, like ARM64 servers
     (Breno Leitao)

   - Add Christian Loehle to MAINTAINERS as a cpuidle reviewer
     (Christian Loehle)

   - Stop flagging the PM runtime workqueue as freezable to avoid system
     suspend and resume deadlocks in subsystems that assume asynchronous
     runtime PM to work during system-wide PM transitions (Rafael
     Wysocki)

   - Drop redundant NULL pointer checks before acomp_request_free() from
     the hibernation code handling image saving (Rafael Wysocki)

   - Update wakeup_sources_walk_start() to handle empty lists of wakeup
     sources as appropriate (Samuel Wu)

   - Make dev_pm_clear_wake_irq() check the power.wakeirq value under
     power.lock to avoid race conditions (Gui-Dong Han)

   - Avoid bit field races related to power.work_in_progress in the core
     device suspend code (Xuewen Yan)

   - Make several drivers discard pm_runtime_put() return value in
     preparation for converting that function to a void one (Rafael
     Wysocki)

   - Add PL4 support for Ice Lake to the Intel RAPL power capping driver
     (Daniel Tang)

   - Replace sprintf() with sysfs_emit() in power capping sysfs show
     functions (Sumeet Pawnikar)

   - Make dev_pm_opp_get_level() return value match the documentation
     after a previous update of the latter (Aleks Todorov)

   - Use scoped for each OF child loop in the OPP code (Krzysztof
     Kozlowski)

   - Fix a bug in an example code snippet and correct typos in the
     energy model management documentation (Patrick Little)

   - Fix miscellaneous problems in cpupower (Kaushlendra Kumar):
      * idle_monitor: Fix incorrect value logged after stop
      * Fix inverted APERF capability check
      * Use strcspn() to strip trailing newline
      * Reset errno before strtoull()
      * Show C0 in idle-info dump

   - Improve cpupower installation procedure by making the systemd step
     optional and allowing users to disable the installation of
     systemd's unit file (João Marcos Costa)"

* tag 'pm-6.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (65 commits)
  PM: sleep: core: Avoid bit field races related to work_in_progress
  PM: sleep: wakeirq: harden dev_pm_clear_wake_irq() against races
  cpufreq: Documentation: Update description of rate_limit_us default value
  cpufreq: intel_pstate: Enable asym capacity only when CPU SMT is not possible
  PM: wakeup: Handle empty list in wakeup_sources_walk_start()
  PM: EM: Documentation: Fix bug in example code snippet
  Documentation: Fix typos in energy model documentation
  cpuidle: governors: teo: Refine intercepts-based idle state lookup
  cpuidle: governors: teo: Adjust the classification of wakeup events
  cpufreq: ondemand: Simplify idle cputime granularity test
  cpufreq: userspace: make scaling_setspeed return the actual requested frequency
  PM: hibernate: Drop NULL pointer checks before acomp_request_free()
  cpufreq: CPPC: Add generic helpers for sysfs show/store
  cpufreq: scmi: Fix device_node reference leak in scmi_cpu_domain_id()
  cpufreq: ti-cpufreq: add support for AM62L3 SoC
  cpufreq: dt-platdev: Add ti,am62l3 to blocklist
  cpufreq/amd-pstate: Add comment explaining nominal_perf usage for performance policy
  cpufreq: scmi: correct SCMI explanation
  cpufreq: dt-platdev: Block the driver from probing on more QC platforms
  rust: cpumask: rename methods of Cpumask for clarity and consistency
  ...
parents d84e1733 0f64b6ac
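
Two of the cpupower fixes listed in the commit message ("Use strcspn() to strip trailing newline" and "Reset errno before strtoull()") rely on general C idioms. The sketch below shows those idioms in isolation. It is not the cpupower code, and the helper names strip_newline() and parse_u64() are hypothetical.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: cut the string at the first '\n', if there is one. */
static void strip_newline(char *s)
{
	/* strcspn() returns strlen(s) when no newline is found, so this is safe. */
	s[strcspn(s, "\n")] = '\0';
}

/*
 * Hypothetical helper: parse an unsigned decimal number.
 * Returns 0 on success and -1 on failure.
 */
static int parse_u64(const char *s, unsigned long long *out)
{
	char *end;
	unsigned long long val;

	/* strtoull() only sets errno on failure, so clear any stale value first. */
	errno = 0;
	val = strtoull(s, &end, 10);
	if (errno != 0 || end == s || *end != '\0')
		return -1;

	*out = val;
	return 0;
}
```

Without the `errno = 0` assignment, an error left behind by an unrelated earlier call could be mistaken for a parse failure.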
+1 −1
@@ -439,7 +439,7 @@ This governor exposes only one tunable:
``rate_limit_us``
	Minimum time (in microseconds) that has to pass between two consecutive
	runs of governor computations (default: 1.5 times the scaling driver's
-	transition latency or the maximum 2ms).
+	transition latency or 1ms if the driver does not provide a latency value).

	The purpose of this tunable is to reduce the scheduler context overhead
	of the governor which might be excessive without it.
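
The updated wording (1.5 times the transition latency, or 1ms if the driver does not provide a latency value) can be read as the rule sketched below. This only illustrates the documented default and is not the kernel's implementation. The function name default_rate_limit_us() is hypothetical, and representing "no latency value" as 0 is an assumption.

```c
/*
 * Hypothetical helper illustrating the documented rate_limit_us default.
 * Assumption: a transition latency of 0 means the driver provided none.
 */
static unsigned int default_rate_limit_us(unsigned int transition_latency_us)
{
	/* No latency reported by the driver: fall back to 1 ms. */
	if (transition_latency_us == 0)
		return 1000;

	/* 1.5 times the transition latency, using integer arithmetic. */
	return transition_latency_us + (transition_latency_us >> 1);
}
```

Under these assumptions, a driver reporting a 100 us transition latency would get a default of 150 us.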
+2 −0
@@ -35,6 +35,7 @@ properties:
      - description: v2 of CPUFREQ HW (EPSS)
        items:
          - enum:
+              - qcom,milos-cpufreq-epss
              - qcom,qcs8300-cpufreq-epss
              - qcom,qdu1000-cpufreq-epss
              - qcom,sa8255p-cpufreq-epss
@@ -169,6 +170,7 @@ allOf:
        compatible:
          contains:
            enum:
+              - qcom,milos-cpufreq-epss
              - qcom,qcs8300-cpufreq-epss
              - qcom,sc7280-cpufreq-epss
              - qcom,sm8250-cpufreq-epss
+9 −9
@@ -14,8 +14,8 @@ subsystems willing to use that information to make energy-aware decisions.
The source of the information about the power consumed by devices can vary greatly
from one platform to another. These power costs can be estimated using
devicetree data in some cases. In others, the firmware will know better.
-Alternatively, userspace might be best positioned. And so on. In order to avoid
-each and every client subsystem to re-implement support for each and every
+Alternatively, userspace might be best positioned. In order to avoid
+having each and every client subsystem re-implement support for each and every
possible source of information on its own, the EM framework intervenes as an
abstraction layer which standardizes the format of power cost tables in the
kernel, hence enabling to avoid redundant work.
@@ -32,7 +32,7 @@ be found in the Intelligent Power Allocation in
Documentation/driver-api/thermal/power_allocator.rst.
Kernel subsystems might implement automatic detection to check whether EM
registered devices have inconsistent scale (based on EM internal flag).
-Important thing to keep in mind is that when the power values are expressed in
+An important thing to keep in mind is that when the power values are expressed in
an 'abstract scale' deriving real energy in micro-Joules would not be possible.

The figure below depicts an example of drivers (Arm-specific here, but the
@@ -82,7 +82,7 @@ using kref mechanism. The device driver which provided the new EM at runtime,
should call EM API to free it safely when it's no longer needed. The EM
framework will handle the clean-up when it's possible.

-The kernel code which want to modify the EM values is protected from concurrent
+The kernel code which wants to modify the EM values is protected from concurrent
access using a mutex. Therefore, the device driver code must run in sleeping
context when it tries to modify the EM.

@@ -113,7 +113,7 @@ Registration of 'advanced' EM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The 'advanced' EM gets its name due to the fact that the driver is allowed
-to provide more precised power model. It's not limited to some implemented math
+to provide a more precise power model. It's not limited to some implemented math
formula in the framework (like it is in 'simple' EM case). It can better reflect
the real power measurements performed for each performance state. Thus, this
registration method should be preferred in case considering EM static power
@@ -172,7 +172,7 @@ Registration of 'simple' EM
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The 'simple' EM is registered using the framework helper function
-cpufreq_register_em_with_opp(). It implements a power model which is tight to
+cpufreq_register_em_with_opp(). It implements a power model which is tied to a
math formula::

	Power = C * V^2 * f
@@ -251,7 +251,7 @@ It returns the 'struct em_perf_state' pointer which is an array of performance
states in ascending order.
This function must be called in the RCU read lock section (after the
rcu_read_lock()). When the EM table is not needed anymore there is a need to
-call rcu_real_unlock(). In this way the EM safely uses the RCU read section
+call rcu_read_unlock(). In this way the EM safely uses the RCU read section
and protects the users. It also allows the EM framework to manage the memory
and free it. More details how to use it can be found in Section 3.2 in the
example driver.
@@ -308,12 +308,12 @@ EM framework::
  05
  06		/* Use the 'foo' protocol to ceil the frequency */
  07		freq = foo_get_freq_ceil(dev, *KHz);
-  08		if (freq < 0);
+  08		if (freq < 0)
  09			return freq;
  10
  11		/* Estimate the power cost for the dev at the relevant freq. */
  12		power = foo_estimate_power(dev, freq);
-  13		if (power < 0);
+  13		if (power < 0)
  14			return power;
  15
  16		/* Return the values to the EM framework */
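
The hunk above drops a stray semicolon after each `if` condition. With the semicolon in place, the `if` controls only an empty statement, so the `return` that follows runs unconditionally. The standalone sketch below contrasts the two forms. The function names are hypothetical, and compilers may warn about the first one.

```c
/* Buggy form: the ';' ends the if statement, so the return always executes. */
static int with_stray_semicolon(int freq)
{
	if (freq < 0);
		return freq;

	return 0;	/* never reached */
}

/* Fixed form: the early return is taken only for negative (error) values. */
static int without_stray_semicolon(int freq)
{
	if (freq < 0)
		return freq;

	return 0;
}
```

For a valid positive input the buggy form returns early with that input, while the fixed form continues to the rest of the function.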
+3 −4
@@ -712,10 +712,9 @@ out the following operations:
  * During system suspend pm_runtime_get_noresume() is called for every device
    right before executing the subsystem-level .prepare() callback for it and
    pm_runtime_barrier() is called for every device right before executing the
-    subsystem-level .suspend() callback for it.  In addition to that the PM core
-    calls __pm_runtime_disable() with 'false' as the second argument for every
-    device right before executing the subsystem-level .suspend_late() callback
-    for it.
+    subsystem-level .suspend() callback for it.  In addition to that, the PM
+    core disables runtime PM for every device right before executing the
+    subsystem-level .suspend_late() callback for it.

  * During system resume pm_runtime_enable() and pm_runtime_put() are called for
    every device right after executing the subsystem-level .resume_early()
+4 −4
@@ -244,7 +244,7 @@ Example 2.


    From these calculations, the Case 1 has the lowest total energy. So CPU 1
-    is be the best candidate from an energy-efficiency standpoint.
+    is the best candidate from an energy-efficiency standpoint.

Big CPUs are generally more power hungry than the little ones and are thus used
mainly when a task doesn't fit the littles. However, little CPUs aren't always
@@ -252,7 +252,7 @@ necessarily more energy-efficient than big CPUs. For some systems, the high OPPs
of the little CPUs can be less energy-efficient than the lowest OPPs of the
bigs, for example. So, if the little CPUs happen to have enough utilization at
a specific point in time, a small task waking up at that moment could be better
-of executing on the big side in order to save energy, even though it would fit
+off executing on the big side in order to save energy, even though it would fit
on the little side.

And even in the case where all OPPs of the big CPUs are less energy-efficient
@@ -285,7 +285,7 @@ much that can be done by the scheduler to save energy without severely harming
throughput. In order to avoid hurting performance with EAS, CPUs are flagged as
'over-utilized' as soon as they are used at more than 80% of their compute
capacity. As long as no CPUs are over-utilized in a root domain, load balancing
-is disabled and EAS overridess the wake-up balancing code. EAS is likely to load
+is disabled and EAS overrides the wake-up balancing code. EAS is likely to load
the most energy efficient CPUs of the system more than the others if that can be
done without harming throughput. So, the load-balancer is disabled to prevent
it from breaking the energy-efficient task placement found by EAS. It is safe to
@@ -385,7 +385,7 @@ Using EAS with any other governor than schedutil is not supported.
6.5 Scale-invariant utilization signals
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-In order to make accurate prediction across CPUs and for all performance
+In order to make accurate predictions across CPUs and for all performance
states, EAS needs frequency-invariant and CPU-invariant PELT signals. These can
be obtained using the architecture-defined arch_scale{cpu,freq}_capacity()
callbacks.