linux/kernel/sched
Johannes Weiner 3840cbe24c sched: psi: fix bogus pressure spikes from aggregation race
Brandon reports sporadic, nonsensical spikes in cumulative pressure
time (total=) when reading cpu.pressure at a high rate. This is due to
a race condition between reader aggregation and tasks changing states.

While it affects all states and all resources captured by PSI, in
practice it most likely triggers with CPU pressure, since scheduling
events are so frequent compared to other resource events.

The race context is the live snooping of ongoing stalls during a
pressure read. The read aggregates per-cpu records for stalls that
have concluded, but will also incorporate ad-hoc the duration of any
active state that hasn't been recorded yet. This is important to get
timely measurements of ongoing stalls. Those ad-hoc samples are
calculated on-the-fly up to the current time on that CPU; since the
stall hasn't concluded, it's expected that this is the minimum amount
of stall time that will enter the per-cpu records once it does.

The problem is that the path that concludes the state uses a CPU clock
read that is not synchronized against aggregators; the clock is read
outside of the seqlock protection. This allows aggregators to race and
snoop a stall with a longer duration than will actually be recorded.

With the recorded stall time being less than the last snapshot
remembered by the aggregator, a subsequent sample will underflow and
observe a bogus delta value, resulting in an erratic jump in pressure.
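
The underflow described above can be illustrated with a small user-space sketch. This is not the kernel code: the `reader` struct, the `sample_delta()` helper, and the 32-bit nanosecond counters are assumptions made for the illustration.

```c
#include <stdint.h>

/* Hypothetical model of an aggregator's bookkeeping; not kernel code. */
struct reader {
    uint32_t last_seen;   /* last snapshot of the stall time, in ns (assumed width) */
};

/*
 * Return the stall time accrued since the previous sample and remember
 * the new snapshot. Unsigned subtraction wraps around when the observed
 * value is smaller than the remembered one.
 */
static uint32_t sample_delta(struct reader *r, uint32_t observed)
{
    uint32_t delta = observed - r->last_seen;

    r->last_seen = observed;
    return delta;
}
```

If a first read snoops a live stall at 1000 ns and the stall is then recorded as only 900 ns, the next call computes 900 - 1000 in unsigned arithmetic and returns 4294967196 instead of a small value, which is the kind of jump that would show up in total=.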

Fix this by moving the clock read of the state change into the seqlock
protection. This ensures no aggregation can snoop live stalls past the
time that's recorded when the state concludes.
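
The ordering the fix relies on can be sketched in user space with C11 atomics. This is a toy model under stated assumptions, not the psi.c implementation: `cpu_rec`, `begin_stall()`, `conclude_stall()`, `read_stall()`, and the `fake_now` clock are hypothetical names, and the sequence counter is a simplified single-writer stand-in for the kernel's seqlock.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-CPU clock; a caller sets fake_now. */
static _Atomic uint64_t fake_now;
static uint64_t read_clock(void) { return atomic_load(&fake_now); }

struct cpu_rec {
    atomic_uint seq;              /* odd while the writer is mid-update */
    _Atomic uint64_t state_start; /* when the active stall began */
    _Atomic uint64_t recorded;    /* total time of concluded stalls */
    atomic_int stalled;           /* nonzero while a stall is active */
};

/* Writer: start a stall, reading the clock inside the write section. */
static void begin_stall(struct cpu_rec *c)
{
    atomic_fetch_add(&c->seq, 1);
    atomic_store(&c->state_start, read_clock());
    atomic_store(&c->stalled, 1);
    atomic_fetch_add(&c->seq, 1);
}

/* Writer: conclude the stall. The clock is read after the write section
 * has begun, so no validated read can have used a later timestamp. */
static void conclude_stall(struct cpu_rec *c)
{
    atomic_fetch_add(&c->seq, 1);
    uint64_t now = read_clock();
    atomic_fetch_add(&c->recorded, now - atomic_load(&c->state_start));
    atomic_store(&c->stalled, 0);
    atomic_fetch_add(&c->seq, 1);
}

/* Reader: recorded time plus the live portion of any active stall.
 * Retries whenever a writer was active during the read. */
static uint64_t read_stall(struct cpu_rec *c)
{
    unsigned int seq;
    uint64_t total;

    do {
        while ((seq = atomic_load(&c->seq)) & 1)
            ;   /* writer in progress, wait for an even count */
        total = atomic_load(&c->recorded);
        if (atomic_load(&c->stalled))
            total += read_clock() - atomic_load(&c->state_start);
    } while (atomic_load(&c->seq) != seq);

    return total;
}
```

In this model, a read that passes the sequence check finished before the writer entered its section, so with a monotonic clock the writer's timestamp cannot be earlier than the one the reader used. Reading the clock before incrementing `seq` in `conclude_stall()` would reopen the window the commit message describes.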

Reported-by: Brandon Duffany <brandon@buildbuddy.io>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219194
Link: https://lore.kernel.org/lkml/20240827121851.GB438928@cmpxchg.org/
Fixes: df77430639 ("psi: Reduce calls to sched_clock() in psi")
Cc: stable@vger.kernel.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-10-03 16:03:16 -07:00
..
autogroup.c
autogroup.h
build_policy.c sched_ext: Disallow loading BPF scheduler if isolcpus= domain isolation is in effect 2024-07-08 09:30:13 -10:00
build_utility.c
clock.c sched: Fix spelling in comments 2024-05-27 17:00:21 +02:00
completion.c
core_sched.c sched: Fix spelling in comments 2024-05-27 17:00:21 +02:00
core.c sched, sched_ext: Disable SM_IDLE/rq empty path when scx_enabled() 2024-09-23 05:40:53 -10:00
cpuacct.c
cpudeadline.c
cpudeadline.h
cpufreq_schedutil.c Merge branch 'tip/sched/core' into sched_ext/for-6.12 2024-09-11 08:43:26 -10:00
cpufreq.c
cpupri.c
cpupri.h
cputime.c sched/cputime: Fix mul_u64_u64_div_u64() precision for cputime 2024-07-29 12:22:32 +02:00
deadline.c sched: Add put_prev_task(.next) 2024-09-03 15:26:32 +02:00
debug.c Merge branch 'tip/sched/core' into sched_ext/for-6.12 2024-09-11 08:43:26 -10:00
ext.c sched_ext: Remove redundant p->nr_cpus_allowed checker 2024-09-27 10:23:45 -10:00
ext.h sched_ext: Add cgroup support 2024-09-04 10:24:59 -10:00
fair.c sched_ext: Initial pull request for v6.12 2024-09-21 09:44:57 -07:00
features.h sched/eevdf: Allow shorter slices to wakeup-preempt 2024-08-17 11:06:45 +02:00
idle.c Merge branch 'tip/sched/core' into for-6.12 2024-09-03 12:49:18 -10:00
isolation.c
loadavg.c sched: Fix spelling in comments 2024-05-27 17:00:21 +02:00
Makefile
membarrier.c
pelt.c sched: Move update_other_load_avgs() to kernel/sched/pelt.c 2024-09-11 20:00:21 -10:00
pelt.h sched: Move update_other_load_avgs() to kernel/sched/pelt.c 2024-09-11 20:00:21 -10:00
psi.c sched: psi: fix bogus pressure spikes from aggregation race 2024-10-03 16:03:16 -07:00
rt.c sched: Add put_prev_task(.next) 2024-09-03 15:26:32 +02:00
sched-pelt.h
sched.h sched: Put task_group::idle under CONFIG_GROUP_SCHED_WEIGHT 2024-09-23 05:24:12 -10:00
smp.h
stats.c profiling: remove profile=sleep support 2024-08-04 13:36:28 -07:00
stats.h Merge branch 'sched/urgent' into sched/core, to pick up fixes and refresh the branch 2024-07-11 10:42:33 +02:00
stop_task.c sched: Add put_prev_task(.next) 2024-09-03 15:26:32 +02:00
swait.c
syscalls.c sched_ext: Initial pull request for v6.12 2024-09-21 09:44:57 -07:00
topology.c sched/fair: Fair server interface 2024-07-29 12:22:36 +02:00
wait_bit.c sched: Fix spelling in comments 2024-05-27 17:00:21 +02:00
wait.c