mm/demotion: add support for explicit memory tiers
Patch series "mm/demotion: Memory tiers and demotion", v15.
The current kernel has the basic memory tiering support: Inactive pages on
a higher tier NUMA node can be migrated (demoted) to a lower tier NUMA
node to make room for new allocations on the higher tier NUMA node.
Frequently accessed pages on a lower tier NUMA node can be migrated
(promoted) to a higher tier NUMA node to improve the performance.
In the current kernel, memory tiers are defined implicitly via a demotion
path relationship between NUMA nodes, which is created during the kernel
initialization and updated when a NUMA node is hot-added or hot-removed.
The current implementation puts all nodes with CPU into the highest tier,
and builds the tier hierarchy tier-by-tier by establishing the per-node
demotion targets based on the distances between nodes.
This current memory tier kernel implementation needs to be improved for
several important use cases:
* The current tier initialization code always initializes each
memory-only NUMA node into a lower tier. But a memory-only NUMA node
may have a high performance memory device (e.g. a DRAM-backed
memory-only node on a virtual machine) and that should be put into a
higher tier.
* The current tier hierarchy always puts CPU nodes into the top tier.
But on a system with HBM (e.g. GPU memory) devices, these memory-only
HBM NUMA nodes should be in the top tier, and DRAM nodes with CPUs are
better placed in the next lower tier.
* Also because the current tier hierarchy always puts CPU nodes into the
top tier, when a CPU is hot-added (or hot-removed) and turns a memory
node from CPU-less into a CPU node (or vice versa), the memory tier
hierarchy gets changed, even though no memory node is added or removed.
This can make the tier hierarchy unstable and make it difficult to
support tier-based memory accounting.
* A higher tier node can only be demoted to nodes with shortest distance
on the next lower tier as defined by the demotion path, not any other
node from any lower tier. This strict demotion order does not work in
all use cases (e.g. some use cases may want to allow cross-socket
demotion to another node in the same demotion tier as a fallback when
the preferred demotion node is out of space), and has resulted in the
feature request for an interface to override the system-wide, per-node
demotion order from the userspace. This demotion order is also
inconsistent with the page allocation fallback order when all the nodes
in a higher tier are out of space: The page allocation can fall back to
any node from any lower tier, whereas the demotion order doesn't allow
that.
This patch series makes the creation of memory tiers explicit under the
control of device drivers.
Memory Tier Initialization
==========================
Linux kernel presents memory devices as NUMA nodes and each memory device
is of a specific type. The memory type of a device is represented by its
abstract distance. A memory tier corresponds to a range of abstract
distance. This allows for classifying memory devices with a specific
performance range into a memory tier.
By default, all memory nodes are assigned to the default tier with
abstract distance 512.
A device driver can move its memory nodes from the default tier. For
example, PMEM can move its memory nodes below the default tier, whereas
GPU can move its memory nodes above the default tier.
The kernel initialization code makes the decision on which exact tier a
memory node should be assigned to based on the requests from the device
drivers as well as the memory device hardware information provided by the
firmware.
Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
This patch (of 10):
In the current kernel, memory tiers are defined implicitly via a demotion
path relationship between NUMA nodes, which is created during the kernel
initialization and updated when a NUMA node is hot-added or hot-removed.
The current implementation puts all nodes with CPU into the highest tier,
and builds the tier hierarchy by establishing the per-node demotion
targets based on the distances between nodes.
This current memory tier kernel implementation needs to be improved for
several important use cases:
The current tier initialization code always initializes each memory-only
NUMA node into a lower tier. But a memory-only NUMA node may have a high
performance memory device (e.g. a DRAM-backed memory-only node on a
virtual machine) that should be put into a higher tier.
The current tier hierarchy always puts CPU nodes into the top tier. But
on a system with HBM or GPU devices, the memory-only NUMA nodes mapping
these devices should be in the top tier, and DRAM nodes with CPUs are
better placed in the next lower tier.
With the current kernel, a higher tier node can only be demoted to nodes with
shortest distance on the next lower tier as defined by the demotion path,
not any other node from any lower tier. This strict demotion order does
not work in all use cases (e.g. some use cases may want to allow
cross-socket demotion to another node in the same demotion tier as a
fallback when the preferred demotion node is out of space). This demotion
order is also inconsistent with the page allocation fallback order when
all the nodes in a higher tier are out of space: The page allocation can
fall back to any node from any lower tier, whereas the demotion order
doesn't allow that.
This patch series addresses the above by defining memory tiers explicitly.
Linux kernel presents memory devices as NUMA nodes and each memory device
is of a specific type. The memory type of a device is represented by its
abstract distance. A memory tier corresponds to a range of abstract
distance. This allows for classifying memory devices with a specific
performance range into a memory tier.
This patch configures the range/chunk size to be 128. The default DRAM
abstract distance is 512. We can have 4 memory tiers below the default
DRAM with abstract distance range 0 - 127, 128 - 255, 256 - 383, 384 - 511.
Faster memory devices can be placed in these faster (higher) memory tiers.
Slower memory devices like persistent memory will have an abstract distance
higher than the default DRAM level.
[akpm@linux-foundation.org: fix comment, per Aneesh]
Link: https://lkml.kernel.org/r/20220818131042.113280-1-aneesh.kumar@linux.ibm.com
Link: https://lkml.kernel.org/r/20220818131042.113280-2-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Wei Xu <weixugc@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Bharata B Rao <bharata@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hesham Almatary <hesham.almatary@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-18 06:10:33 -07:00
// SPDX-License-Identifier: GPL-2.0
#include <linux/slab.h>
#include <linux/lockdep.h>
2022-08-18 06:10:34 -07:00
#include <linux/sysfs.h>
#include <linux/kobject.h>
2022-08-18 06:10:35 -07:00
#include <linux/memory.h>
#include <linux/memory-tiers.h>
memory tiering: add abstract distance calculation algorithms management
Patch series "memory tiering: calculate abstract distance based on ACPI
HMAT", v4.
We have the explicit memory tiers framework to manage systems with
multiple types of memory, e.g., DRAM in DIMM slots and CXL memory devices.
There, memory devices of the same kind are grouped into memory types,
which are then put into memory tiers. To describe the performance of a
memory type, abstract distance is defined. It is directly proportional to
the memory latency and inversely proportional to the memory bandwidth. To keep the
code as simple as possible, fixed abstract distance is used in dax/kmem to
describe slow memory such as Optane DCPMM.
To support more memory types, in this series, we added the abstract
distance calculation algorithm management mechanism, provided an algorithm
implementation based on ACPI HMAT, and used the general abstract distance
calculation interface in dax/kmem driver. So, dax/kmem can support HBM
(high bandwidth memory) in addition to the original Optane DCPMM.
This patch (of 4):
The abstract distance may be calculated by various drivers, such as ACPI
HMAT, CXL CDAT, etc. It may in turn be used by various code that hot-adds
memory nodes, such as dax/kmem. To decouple the algorithm users and
the providers, the abstract distance calculation algorithms management
mechanism is implemented in this patch. It provides an interface for the
providers to register their implementation, and an interface for the users.
Multiple algorithm implementations can cooperate via calculating abstract
distance for different memory nodes. The preference of algorithm
implementations can be specified via priority (notifier_block.priority).
Link: https://lkml.kernel.org/r/20230926060628.265989-1-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20230926060628.265989-2-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Tested-by: Bharata B Rao <bharata@amd.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Rafael J Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-09-25 23:06:25 -07:00
#include <linux/notifier.h>
2024-07-24 06:01:14 -07:00
#include <linux/sched/sysctl.h>
2022-08-18 06:10:37 -07:00
#include "internal.h"
struct memory_tier {
	/* hierarchy of memory tiers */
	struct list_head list;
	/* list of all memory types part of this tier */
	struct list_head memory_types;
	/*
	 * start value of abstract distance. memory tier maps
	 * an abstract distance range,
	 * adistance_start .. adistance_start + MEMTIER_CHUNK_SIZE
	 */
	int adistance_start;
2022-08-30 01:17:36 -07:00
	struct device dev;
2022-08-18 06:10:40 -07:00
	/* All the nodes that are part of all the lower memory tiers. */
	nodemask_t lower_tier_mask;
2022-08-18 06:10:33 -07:00
};

2022-08-18 06:10:37 -07:00
struct demotion_nodes {
	nodemask_t preferred;
};

2022-08-18 06:10:36 -07:00
struct node_memory_type_map {
	struct memory_dev_type *memtype;
	int map_count;
2022-08-18 06:10:33 -07:00
};

static DEFINE_MUTEX(memory_tier_lock);
static LIST_HEAD(memory_tiers);
memory tier: create CPUless memory tiers after obtaining HMAT info
The current implementation treats emulated memory devices, such as CXL1.1
type3 memory, as normal DRAM when they are emulated as normal memory
(E820_TYPE_RAM). However, these emulated devices have different
characteristics than traditional DRAM, making it important to distinguish
them. Thus, we modify the tiered memory initialization process to
introduce a delay specifically for CPUless NUMA nodes. This delay ensures
that the memory tier initialization for these nodes is deferred until HMAT
information is obtained during the boot process. Finally, demotion tables
are recalculated at the end.
* late_initcall(memory_tier_late_init);
Some device drivers may have initialized memory tiers between
`memory_tier_init()` and `memory_tier_late_init()`, potentially bringing
online memory nodes and configuring memory tiers. They should be
excluded in the late init.
* Handle cases where there is no HMAT when creating memory tiers
There is a scenario where a CPUless node does not provide HMAT
information. If no HMAT is specified, it falls back to using the
default DRAM tier.
* Introduce another new lock `default_dram_perf_lock` for adist
calculation. In the current implementation, iterating through CPUless
nodes requires holding the `memory_tier_lock`. However,
`mt_calc_adistance()` will end up trying to acquire the same lock,
leading to a potential deadlock. Therefore, we propose introducing a
standalone `default_dram_perf_lock` to protect `default_dram_perf_*`.
This approach not only avoids deadlock but also prevents holding a large
lock simultaneously.
* Upgrade `set_node_memory_tier` to support additional cases, including
default DRAM, late CPUless, and hot-plugged initializations. To cover
hot-plugged memory nodes, `mt_calc_adistance()` and
`mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to
handle cases where memtype is not initialized and where HMAT information
is available.
* Introduce `default_memory_types` for those memory types that are not
initialized by device drivers. Because late initialized memory and
default DRAM memory need to be managed, a default memory type is created
for storing all memory types that are not initialized by device drivers
and as a fallback.
Link: https://lkml.kernel.org/r/20240405000707.2670063-3-horenchuang@bytedance.com
Signed-off-by: Ho-Ren (Jack) Chuang <horenchuang@bytedance.com>
Signed-off-by: Hao Xiang <hao.xiang@bytedance.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Gregory Price <gourry.memverge@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Ravi Jonnalagadda <ravis.opensrc@micron.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawie.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
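The fallback rule from this commit message can be sketched as a small user-space C function. It is a simplified reading of the message, not the kernel implementation: struct sketch_node_info, its fields, and sketch_node_adistance() are hypothetical, and the DRAM default of 512 is taken from the earlier explicit-tiers commit message.

```c
#include <stdbool.h>

#define SKETCH_ADISTANCE_DRAM	512	/* assumed default DRAM abstract distance */

/* Hypothetical per-node input for the late initialization step. */
struct sketch_node_info {
	bool has_cpu;		/* node has CPUs attached */
	bool has_hmat;		/* firmware supplied HMAT data for the node */
	int hmat_adistance;	/* only meaningful when has_hmat is true */
};

/*
 * Pick the abstract distance for a node:
 * - CPUless nodes with HMAT data use the HMAT-derived value;
 * - CPU nodes and CPUless nodes without HMAT fall back to the DRAM default.
 */
static int sketch_node_adistance(const struct sketch_node_info *n)
{
	if (!n->has_cpu && n->has_hmat)
		return n->hmat_adistance;
	return SKETCH_ADISTANCE_DRAM;
}
```

The point of the sketch is only the ordering of decisions: the CPUless case is resolved late, once HMAT data is known to be present or absent.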
2024-04-04 17:07:06 -07:00
/*
 * The list is used to store all memory types that are not created
 * by a device driver.
 */
static LIST_HEAD(default_memory_types);
2022-08-18 06:10:36 -07:00
static struct node_memory_type_map node_memory_types[MAX_NUMNODES];
2023-09-25 23:06:27 -07:00
struct memory_dev_type *default_dram_type;
memory tier: consolidate the initialization of memory tiers
The current memory tier initialization process is distributed across two
different functions, memory_tier_init() and memory_tier_late_init(). This
design is hard to maintain. Thus, this patch reduces the possible code
paths by consolidating the different initialization paths into one.
The earlier discussion with Jonathan and Ying is listed here:
https://lore.kernel.org/lkml/20240405150244.00004b49@Huawei.com/
If we want to put these two initializations together, they must be placed
together in the later function. Because only at that time, the HMAT
information will be ready, adist between nodes can be calculated, and
memory tiering can be established based on the adist. So we move the
initialization from memory_tier_init() into the memory_tier_late_init() call.
Moreover, it's natural to keep memory_tier initialization in drivers at
device_initcall() level.
If we simply move the set_node_memory_tier() from memory_tier_init() to
late_initcall(), it will result in HMAT not registering the
mt_adistance_algorithm callback function, because set_node_memory_tier()
is not performed during the memory tiering initialization phase, leading
to a lack of correct default_dram information.
Therefore, we introduced a nodemask to pass the information of the default
DRAM nodes. The reason for not choosing to reuse default_dram_type->nodes
is that it is not clean enough. So in the end, we use a __initdata
variable, which is a variable that is released once initialization is
complete, including both CPU and memory nodes for HMAT to iterate through.
Link: https://lkml.kernel.org/r/20240704072646.437579-1-horen.chuang@linux.dev
Signed-off-by: Ho-Ren (Jack) Chuang <horenchuang@bytedance.com>
Suggested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Gregory Price <gourry.memverge@gmail.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Ravi Jonnalagadda <ravis.opensrc@micron.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
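The nodemask of default DRAM nodes can be illustrated with a plain bitmask in user-space C. This sketch assumes that "default DRAM nodes" means nodes that have both memory and CPUs, which is one reading of the message above; sketch_nodemask_t and both helpers are hypothetical stand-ins for the kernel's nodemask_t API.

```c
#include <stdint.h>

/* One bit per NUMA node; a stand-in for the kernel's nodemask_t. */
typedef uint64_t sketch_nodemask_t;

/* Default DRAM nodes (assumption): nodes with both memory and CPUs. */
static sketch_nodemask_t sketch_default_dram_nodes(sketch_nodemask_t memory_nodes,
						   sketch_nodemask_t cpu_nodes)
{
	return memory_nodes & cpu_nodes;
}

/* Return 1 if node @nid (0..63) is set in @mask, 0 otherwise. */
static int sketch_node_isset(int nid, sketch_nodemask_t mask)
{
	return (int)((mask >> nid) & 1);
}
```

With memory on nodes 0-2 and CPUs on nodes 0-1, only nodes 0 and 1 end up in the mask, so a CPUless node 2 is left for the later, HMAT-aware step.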
2024-07-04 00:26:44 -07:00
nodemask_t default_dram_nodes __initdata = NODE_MASK_NONE;
2022-08-30 01:17:36 -07:00

2024-02-04 06:56:44 -07:00
static const struct bus_type memory_tier_subsys = {
2022-08-30 01:17:36 -07:00
	.name = "memory_tiering",
	.dev_name = "memory_tier",
};

2024-07-24 06:01:14 -07:00
#ifdef CONFIG_NUMA_BALANCING
/**
 * folio_use_access_time - check if a folio reuses cpupid for page access time
 * @folio: folio to check
 *
 * folio's _last_cpupid field is repurposed by memory tiering. In memory
 * tiering mode, cpupid of slow memory folio (not toptier memory) is used to
 * record page access time.
 *
 * Return: the folio _last_cpupid is used to record page access time
 */
bool folio_use_access_time(struct folio *folio)
{
	return (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) &&
	       !node_is_toptier(folio_nid(folio));
}
#endif

2022-08-18 06:10:37 -07:00
#ifdef CONFIG_MIGRATION
2022-08-18 06:10:41 -07:00
static int top_tier_adistance;
2022-08-18 06:10:37 -07:00
/*
 * node_demotion[] examples:
 *
 * Example 1:
 *
 * Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM nodes.
 *
 * node distances:
 * node   0    1    2    3
 *    0  10   20   30   40
 *    1  20   10   40   30
 *    2  30   40   10   40
 *    3  40   30   40   10
 *
 * memory_tiers0 = 0-1
 * memory_tiers1 = 2-3
 *
 * node_demotion[0].preferred = 2
 * node_demotion[1].preferred = 3
 * node_demotion[2].preferred = <empty>
 * node_demotion[3].preferred = <empty>
 *
 * Example 2:
 *
 * Node 0 & 1 are CPU + DRAM nodes, node 2 is memory-only DRAM node.
 *
 * node distances:
 * node   0    1    2
 *    0  10   20   30
 *    1  20   10   30
 *    2  30   30   10
 *
 * memory_tiers0 = 0-2
 *
 * node_demotion[0].preferred = <empty>
 * node_demotion[1].preferred = <empty>
 * node_demotion[2].preferred = <empty>
 *
 * Example 3:
 *
 * Node 0 is CPU + DRAM nodes, Node 1 is HBM node, node 2 is PMEM node.
 *
 * node distances:
 * node   0    1    2
 *    0  10   20   30
 *    1  20   10   40
 *    2  30   40   10
 *
 * memory_tiers0 = 1
 * memory_tiers1 = 0
 * memory_tiers2 = 2
 *
 * node_demotion[0].preferred = 2
 * node_demotion[1].preferred = 0
 * node_demotion[2].preferred = <empty>
 *
 */
static struct demotion_nodes *node_demotion __read_mostly;
#endif /* CONFIG_MIGRATION */
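The node_demotion[] examples in the comment above can be reproduced with a short user-space C sketch. It assumes that the preferred targets of a node are the nodes of the next lower tier at minimum distance from it; sketch_preferred() and SKETCH_MAX_NODES are hypothetical, and plain bitmasks stand in for nodemask_t.

```c
#define SKETCH_MAX_NODES 4

/*
 * For @node, return the bitmask of nodes in @lower_tier_mask that are at
 * the minimum distance from it. An empty mask means no demotion target.
 */
static unsigned int sketch_preferred(int node, unsigned int lower_tier_mask,
				     int dist[SKETCH_MAX_NODES][SKETCH_MAX_NODES])
{
	unsigned int preferred = 0;
	int best = -1;
	int target;

	for (target = 0; target < SKETCH_MAX_NODES; target++) {
		if (!(lower_tier_mask & (1u << target)))
			continue;
		if (best < 0 || dist[node][target] < best) {
			/* Strictly closer: restart the preferred set. */
			best = dist[node][target];
			preferred = 1u << target;
		} else if (dist[node][target] == best) {
			/* Equally close: add to the preferred set. */
			preferred |= 1u << target;
		}
	}
	return preferred;
}
```

Fed with the distance table of Example 1 and a lower-tier mask covering nodes 2 and 3, the function selects node 2 for node 0 and node 3 for node 1; with an empty lower-tier mask (the PMEM nodes themselves) it returns an empty set.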
2022-08-18 06:10:33 -07:00
memory tiering: add abstract distance calculation algorithms management
Patch series "memory tiering: calculate abstract distance based on ACPI
HMAT", v4.
We have the explicit memory tiers framework to manage systems with
multiple types of memory, e.g., DRAM in DIMM slots and CXL memory devices.
Where, same kind of memory devices will be grouped into memory types,
then put into memory tiers. To describe the performance of a memory type,
abstract distance is defined. Which is in direct proportion to the memory
latency and inversely proportional to the memory bandwidth. To keep the
code as simple as possible, fixed abstract distance is used in dax/kmem to
describe slow memory such as Optane DCPMM.
To support more memory types, this series adds an abstract distance
calculation algorithm management mechanism, provides an algorithm
implementation based on ACPI HMAT, and uses the general abstract distance
calculation interface in the dax/kmem driver. As a result, dax/kmem can
support HBM (high bandwidth memory) in addition to the original Optane DCPMM.
This patch (of 4):
The abstract distance may be calculated by various drivers, such as ACPI
HMAT, CXL CDAT, etc. It may in turn be used by various code that hot-adds
memory nodes, such as dax/kmem. To decouple the algorithm users from the
providers, this patch implements an abstract distance calculation
algorithms management mechanism. It provides an interface for providers to
register their implementation, and an interface for users.
Multiple algorithm implementations can cooperate by calculating abstract
distances for different memory nodes. The preference among algorithm
implementations can be specified via priority (notifier_block.priority).
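The provider/user decoupling can be modelled in userspace as a
priority-ordered callback table. This is only a sketch of the idea: it
does not use the kernel notifier API, and every name in it
(`register_alg`, `calc_adistance`, the two toy providers, the return
codes) is hypothetical.

```c
#include <stddef.h>

#define MAX_ALGS 8
#define ALG_DONE 1 /* this provider produced a distance; stop the walk */
#define ALG_SKIP 0 /* this provider has no answer for the node */

typedef int (*adist_fn)(int node, int *adist);

struct adist_alg {
	adist_fn fn;
	int priority; /* higher values are consulted first */
};

static struct adist_alg algs[MAX_ALGS];
static size_t nr_algs;

/* Insert a provider, keeping the table sorted by descending priority. */
int register_alg(adist_fn fn, int priority)
{
	size_t i;

	if (nr_algs == MAX_ALGS)
		return -1;
	for (i = nr_algs; i > 0 && algs[i - 1].priority < priority; i--)
		algs[i] = algs[i - 1];
	algs[i].fn = fn;
	algs[i].priority = priority;
	nr_algs++;
	return 0;
}

/* Ask providers in priority order; the first ALG_DONE wins. */
int calc_adistance(int node, int *adist)
{
	for (size_t i = 0; i < nr_algs; i++)
		if (algs[i].fn(node, adist) == ALG_DONE)
			return ALG_DONE;
	return ALG_SKIP;
}

/* Toy provider that only knows node 1. */
int hmat_like(int node, int *adist)
{
	if (node != 1)
		return ALG_SKIP;
	*adist = 300;
	return ALG_DONE;
}

/* Toy provider that knows nodes 1 and 2. */
int cdat_like(int node, int *adist)
{
	if (node != 1 && node != 2)
		return ALG_SKIP;
	*adist = 900;
	return ALG_DONE;
}
```

If `hmat_like` is registered with a higher priority than `cdat_like`,
node 1 is answered by `hmat_like`, node 2 falls through to `cdat_like`,
and a node neither provider knows leaves the caller to pick a default.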
Link: https://lkml.kernel.org/r/20230926060628.265989-1-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20230926060628.265989-2-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Tested-by: Bharata B Rao <bharata@amd.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Rafael J Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-09-25 23:06:25 -07:00
static BLOCKING_NOTIFIER_HEAD(mt_adistance_algorithms);
memory tier: create CPUless memory tiers after obtaining HMAT info
The current implementation treats emulated memory devices, such as CXL1.1
type3 memory, as normal DRAM when they are emulated as normal memory
(E820_TYPE_RAM). However, these emulated devices have different
characteristics than traditional DRAM, making it important to distinguish
them. Thus, we modify the tiered memory initialization process to
introduce a delay specifically for CPUless NUMA nodes. This delay ensures
that the memory tier initialization for these nodes is deferred until HMAT
information is obtained during the boot process. Finally, demotion tables
are recalculated at the end.
* late_initcall(memory_tier_late_init);
Some device drivers may have initialized memory tiers between
`memory_tier_init()` and `memory_tier_late_init()`, potentially bringing
online memory nodes and configuring memory tiers. They should be
excluded in the late init.
* Handle cases where there is no HMAT when creating memory tiers
There is a scenario where a CPUless node does not provide HMAT
information. If no HMAT is specified, it falls back to using the
default DRAM tier.
* Introduce another new lock `default_dram_perf_lock` for adist calculation
In the current implementation, iterating through CPUless nodes requires
holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up
trying to acquire the same lock, leading to a potential deadlock.
Therefore, we propose introducing a standalone `default_dram_perf_lock` to
protect `default_dram_perf_*`. This approach not only avoids deadlock but
also prevents holding a large lock simultaneously.
* Upgrade `set_node_memory_tier` to support additional cases, including
default DRAM, late CPUless, and hot-plugged initializations. To cover
hot-plugged memory nodes, `mt_calc_adistance()` and
`mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to
handle cases where memtype is not initialized and where HMAT information
is available.
* Introduce `default_memory_types` for those memory types that are not
initialized by device drivers. Because late initialized memory and
default DRAM memory need to be managed, a default memory type is created
for storing all memory types that are not initialized by device drivers
and as a fallback.
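The ordering described in the list above (early setup for nodes with
CPUs, driver-initialized nodes excluded from the late pass, and a DRAM
fallback when no HMAT data exists) can be sketched as a small userspace
model. All types and function names here are hypothetical, and 512 is the
default DRAM distance stated in the explicit memory tiers commit; the
kernel code is structured differently.

```c
#include <stdbool.h>

#define ADISTANCE_DRAM 512 /* default DRAM distance from the tiering commit */

struct node_info {
	bool has_cpu;        /* node has CPUs, so it is set up early */
	bool driver_inited;  /* a driver already assigned its distance */
	bool has_hmat;       /* firmware supplied performance data */
	int  hmat_adistance; /* distance derived from that data, if any */
	int  adistance;      /* result; -1 until assigned */
};

/* Early pass: only nodes with CPUs get the default DRAM distance. */
void early_init(struct node_info *nodes, int n)
{
	for (int i = 0; i < n; i++)
		nodes[i].adistance = nodes[i].has_cpu ? ADISTANCE_DRAM : -1;
}

/* A driver brings a node online between the two passes. */
void driver_online(struct node_info *node, int adistance)
{
	node->driver_inited = true;
	node->adistance = adistance;
}

/*
 * Late pass: handle the remaining CPUless nodes, skipping anything a
 * driver already configured, and fall back to DRAM without HMAT data.
 */
void late_init(struct node_info *nodes, int n)
{
	for (int i = 0; i < n; i++) {
		if (nodes[i].has_cpu || nodes[i].driver_inited)
			continue;
		nodes[i].adistance = nodes[i].has_hmat ?
			nodes[i].hmat_adistance : ADISTANCE_DRAM;
	}
}
```

The late pass never overwrites a distance chosen by a driver, which
mirrors the exclusion rule in the first bullet.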
Link: https://lkml.kernel.org/r/20240405000707.2670063-3-horenchuang@bytedance.com
Signed-off-by: Ho-Ren (Jack) Chuang <horenchuang@bytedance.com>
Signed-off-by: Hao Xiang <hao.xiang@bytedance.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Gregory Price <gourry.memverge@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Ravi Jonnalagadda <ravis.opensrc@micron.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawie.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-04 17:07:06 -07:00
/* The lock is used to protect `default_dram_perf*` info and nid. */
static DEFINE_MUTEX(default_dram_perf_lock);
2023-09-25 23:06:27 -07:00
static bool default_dram_perf_error;
2023-12-21 15:02:37 -07:00
static struct access_coordinate default_dram_perf;
2023-09-25 23:06:27 -07:00
static int default_dram_perf_ref_nid = NUMA_NO_NODE;
static const char *default_dram_perf_ref_source;
2022-08-30 01:17:36 -07:00
static inline struct memory_tier *to_memory_tier(struct device *device)
{
	return container_of(device, struct memory_tier, dev);
}

static __always_inline nodemask_t get_memtier_nodemask(struct memory_tier *memtier)
{
	nodemask_t nodes = NODE_MASK_NONE;
	struct memory_dev_type *memtype;
2023-08-02 02:28:56 -07:00
	list_for_each_entry(memtype, &memtier->memory_types, tier_sibling)
2022-08-30 01:17:36 -07:00
		nodes_or(nodes, nodes, memtype->nodes);

	return nodes;
}

static void memory_tier_device_release(struct device *dev)
{
	struct memory_tier *tier = to_memory_tier(dev);
	/*
	 * synchronize_rcu in clear_node_memory_tier makes sure
	 * we don't have rcu access to this memory tier.
	 */
	kfree(tier);
}
2022-10-19 18:51:22 -07:00
static ssize_t nodelist_show(struct device *dev,
			     struct device_attribute *attr, char *buf)
2022-08-30 01:17:36 -07:00
{
	int ret;
	nodemask_t nmask;

	mutex_lock(&memory_tier_lock);
	nmask = get_memtier_nodemask(to_memory_tier(dev));
	ret = sysfs_emit(buf, "%*pbl\n", nodemask_pr_args(&nmask));
	mutex_unlock(&memory_tier_lock);
	return ret;
}
2022-10-19 18:51:22 -07:00
static DEVICE_ATTR_RO(nodelist);
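The `nodelist` attribute prints the tier's node mask with the `%*pbl`
format, which renders a bitmap as a comma-separated range list. The
following userspace sketch reproduces that style of output for a single
`unsigned long` mask; `format_nodelist` is a hypothetical helper written
for this illustration, not a kernel function.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Write the set bits of mask into buf as a range list such as "0-1,3".
 * Returns the number of characters written, or -1 if buf is too small.
 */
int format_nodelist(unsigned long mask, char *buf, size_t len)
{
	const int nbits = (int)(sizeof(mask) * 8);
	int used = 0;
	int bit = 0;

	if (len == 0)
		return -1;
	buf[0] = '\0';

	while (bit < nbits) {
		int start, n;

		if (!((mask >> bit) & 1UL)) {
			bit++;
			continue;
		}
		/* Extend the run while the next bit is also set. */
		start = bit;
		while (bit + 1 < nbits && ((mask >> (bit + 1)) & 1UL))
			bit++;

		if (start == bit)
			n = snprintf(buf + used, len - used, "%s%d",
				     used ? "," : "", start);
		else
			n = snprintf(buf + used, len - used, "%s%d-%d",
				     used ? "," : "", start, bit);
		if (n < 0 || (size_t)n >= len - used)
			return -1; /* output did not fit */
		used += n;
		bit++;
	}
	return used;
}
```

For example, a mask with bits 0, 1 and 3 set is rendered as `0-1,3`, and
an empty mask is rendered as an empty string.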
2022-08-30 01:17:36 -07:00
static struct attribute *memtier_dev_attrs[] = {
2022-10-19 18:51:22 -07:00
	&dev_attr_nodelist.attr,
2022-08-30 01:17:36 -07:00
	NULL
};

static const struct attribute_group memtier_dev_group = {
	.attrs = memtier_dev_attrs,
};

static const struct attribute_group *memtier_dev_groups[] = {
	&memtier_dev_group,
	NULL
};
2022-08-18 06:10:33 -07:00
static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memtype)
{
2022-08-30 01:17:36 -07:00
	int ret;
2022-08-18 06:10:33 -07:00
	bool found_slot = false;
	struct memory_tier *memtier, *new_memtier;
	int adistance = memtype->adistance;
	unsigned int memtier_adistance_chunk_size = MEMTIER_CHUNK_SIZE;

	lockdep_assert_held_once(&memory_tier_lock);
2022-08-18 06:10:39 -07:00
	adistance = round_down(adistance, memtier_adistance_chunk_size);
2022-08-18 06:10:33 -07:00
	/*
	 * If the memtype is already part of a memory tier,
	 * just return that.
	 */
2023-08-02 02:28:56 -07:00
	if (!list_empty(&memtype->tier_sibling)) {
2022-08-18 06:10:39 -07:00
		list_for_each_entry(memtier, &memory_tiers, list) {
			if (adistance == memtier->adistance_start)
				return memtier;
		}
		WARN_ON(1);
		return ERR_PTR(-EINVAL);
	}
mm/demotion: add support for explicit memory tiers
Patch series "mm/demotion: Memory tiers and demotion", v15.
The current kernel has the basic memory tiering support: Inactive pages on
a higher tier NUMA node can be migrated (demoted) to a lower tier NUMA
node to make room for new allocations on the higher tier NUMA node.
Frequently accessed pages on a lower tier NUMA node can be migrated
(promoted) to a higher tier NUMA node to improve the performance.
In the current kernel, memory tiers are defined implicitly via a demotion
path relationship between NUMA nodes, which is created during the kernel
initialization and updated when a NUMA node is hot-added or hot-removed.
The current implementation puts all nodes with CPU into the highest tier,
and builds the tier hierarchy tier-by-tier by establishing the per-node
demotion targets based on the distances between nodes.
This current memory tier kernel implementation needs to be improved for
several important use cases:
* The current tier initialization code always initializes each
memory-only NUMA node into a lower tier. But a memory-only NUMA node
may have a high performance memory device (e.g. a DRAM-backed
memory-only node on a virtual machine) and that should be put into a
higher tier.
* The current tier hierarchy always puts CPU nodes into the top tier.
But on a system with HBM (e.g. GPU memory) devices, these memory-only
HBM NUMA nodes should be in the top tier, and DRAM nodes with CPUs are
better to be placed into the next lower tier.
* Also because the current tier hierarchy always puts CPU nodes into the
top tier, when a CPU is hot-added (or hot-removed) and triggers a memory
node from CPU-less into a CPU node (or vice versa), the memory tier
hierarchy gets changed, even though no memory node is added or removed.
This can make the tier hierarchy unstable and make it difficult to
support tier-based memory accounting.
* A higher tier node can only be demoted to nodes with the shortest distance
on the next lower tier as defined by the demotion path, not any other
node from any lower tier. This strict demotion order does not work in
all use cases (e.g. some use cases may want to allow cross-socket
demotion to another node in the same demotion tier as a fallback when
the preferred demotion node is out of space), and has resulted in the
feature request for an interface to override the system-wide, per-node
demotion order from the userspace. This demotion order is also
inconsistent with the page allocation fallback order when all the nodes
in a higher tier are out of space: The page allocation can fall back to
any node from any lower tier, whereas the demotion order doesn't allow
that.
This patch series makes the creation of memory tiers explicit, under the
control of device drivers.
Memory Tier Initialization
==========================
The Linux kernel presents memory devices as NUMA nodes, and each memory
device is of a specific type. The memory type of a device is represented by
its abstract distance. A memory tier corresponds to a range of abstract
distances. This allows for classifying memory devices with a specific
performance range into a memory tier.
By default, all memory nodes are assigned to the default tier with
abstract distance 512.
A device driver can move its memory nodes from the default tier. For
example, PMEM can move its memory nodes below the default tier, whereas
GPU can move its memory nodes above the default tier.
The kernel initialization code makes the decision on which exact tier a
memory node should be assigned to based on the requests from the device
drivers as well as the memory device hardware information provided by the
firmware.
Hot-adding/removing CPUs doesn't affect the memory tier hierarchy.
This patch (of 10):
In the current kernel, memory tiers are defined implicitly via a demotion
path relationship between NUMA nodes, which is created during the kernel
initialization and updated when a NUMA node is hot-added or hot-removed.
The current implementation puts all nodes with CPU into the highest tier,
and builds the tier hierarchy by establishing the per-node demotion
targets based on the distances between nodes.
This current memory tier kernel implementation needs to be improved for
several important use cases:
The current tier initialization code always initializes each memory-only
NUMA node into a lower tier. But a memory-only NUMA node may have a high
performance memory device (e.g. a DRAM-backed memory-only node on a
virtual machine) that should be put into a higher tier.
The current tier hierarchy always puts CPU nodes into the top tier. But
on a system with HBM or GPU devices, the memory-only NUMA nodes mapping
these devices should be in the top tier, and DRAM nodes with CPUs are
better placed into the next lower tier.
With the current kernel, a higher tier node can only be demoted to nodes with
the shortest distance on the next lower tier as defined by the demotion path,
not any other node from any lower tier. This strict demotion order does
not work in all use cases (e.g. some use cases may want to allow
cross-socket demotion to another node in the same demotion tier as a
fallback when the preferred demotion node is out of space). This demotion
order is also inconsistent with the page allocation fallback order when
all the nodes in a higher tier are out of space: The page allocation can
fall back to any node from any lower tier, whereas the demotion order
doesn't allow that.
This patch series addresses the above by defining memory tiers explicitly.
The Linux kernel presents memory devices as NUMA nodes, and each memory
device is of a specific type. The memory type of a device is represented by
its abstract distance. A memory tier corresponds to a range of abstract
distances. This allows for classifying memory devices with a specific
performance range into a memory tier.
This patch configures the range/chunk size to be 128. The default DRAM
abstract distance is 512. We can have 4 memory tiers below the default
DRAM with abstract distance ranges 0 - 127, 128 - 255, 256 - 383, 384 - 511.
Faster memory devices can be placed in these faster (higher) memory tiers.
Slower memory devices like persistent memory will have abstract distance
higher than the default DRAM level.
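
The chunking described above can be sketched in plain user-space C. This is only an illustration, not the kernel code: the macro and function names below are hypothetical, and the shift of 7 bits is an assumption inferred from the stated chunk size of 128. The default DRAM abstract distance of 512 comes from the text.

```c
/*
 * Illustrative sketch only; names are hypothetical.
 * CHUNK_BITS = 7 is assumed from the chunk size of 128 (2^7).
 */
#define CHUNK_BITS      7
#define CHUNK_SIZE      (1 << CHUNK_BITS)
#define DRAM_ADISTANCE  512   /* default DRAM abstract distance */

/* Round an abstract distance down to the start of its 128-wide chunk. */
static int tier_start(int adistance)
{
    return adistance & ~(CHUNK_SIZE - 1);
}

/* Index of the chunk that contains the abstract distance. */
static int tier_index(int adistance)
{
    return adistance >> CHUNK_BITS;
}
```

Under these assumptions, every abstract distance from 512 to 639 falls into the same chunk as default DRAM, and anything below 512 falls into one of the four faster chunks.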
[akpm@linux-foundation.org: fix comment, per Aneesh]
Link: https://lkml.kernel.org/r/20220818131042.113280-1-aneesh.kumar@linux.ibm.com
Link: https://lkml.kernel.org/r/20220818131042.113280-2-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Wei Xu <weixugc@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Bharata B Rao <bharata@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hesham Almatary <hesham.almatary@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-18 06:10:33 -07:00
|
|
|
|
|
|
|
list_for_each_entry(memtier, &memory_tiers, list) {
|
|
|
|
if (adistance == memtier->adistance_start) {
|
2022-08-30 01:17:36 -07:00
|
|
|
goto link_memtype;
|
2022-08-18 06:10:33 -07:00
|
|
|
} else if (adistance < memtier->adistance_start) {
|
|
|
|
found_slot = true;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2022-08-30 01:17:36 -07:00
|
|
|
new_memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
|
2022-08-18 06:10:33 -07:00
|
|
|
if (!new_memtier)
|
|
|
|
return ERR_PTR(-ENOMEM);
|
|
|
|
|
|
|
|
new_memtier->adistance_start = adistance;
|
|
|
|
INIT_LIST_HEAD(&new_memtier->list);
|
|
|
|
INIT_LIST_HEAD(&new_memtier->memory_types);
|
|
|
|
if (found_slot)
|
|
|
|
list_add_tail(&new_memtier->list, &memtier->list);
|
|
|
|
else
|
|
|
|
list_add_tail(&new_memtier->list, &memory_tiers);
|
2022-08-30 01:17:36 -07:00
|
|
|
|
|
|
|
new_memtier->dev.id = adistance >> MEMTIER_CHUNK_BITS;
|
|
|
|
new_memtier->dev.bus = &memory_tier_subsys;
|
|
|
|
new_memtier->dev.release = memory_tier_device_release;
|
|
|
|
new_memtier->dev.groups = memtier_dev_groups;
|
|
|
|
|
|
|
|
ret = device_register(&new_memtier->dev);
|
|
|
|
if (ret) {
|
2023-01-28 21:06:51 -07:00
|
|
|
list_del(&new_memtier->list);
|
|
|
|
put_device(&new_memtier->dev);
|
2022-08-30 01:17:36 -07:00
|
|
|
return ERR_PTR(ret);
|
|
|
|
}
|
|
|
|
memtier = new_memtier;
|
|
|
|
|
|
|
|
link_memtype:
|
2023-08-02 02:28:56 -07:00
|
|
|
list_add(&memtype->tier_sibling, &memtier->memory_types);
|
2022-08-30 01:17:36 -07:00
|
|
|
return memtier;
|
2022-08-18 06:10:33 -07:00
|
|
|
}
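
The find-or-create logic of the function above can be sketched outside the kernel as follows. This is a simplified illustration under stated assumptions: it uses a hypothetical singly linked `struct tier` instead of the kernel's `list_head`, and it omits device registration, memory-type linking, and locking. The ordering rule is the same: reuse a tier whose start matches, otherwise insert the new tier in front of the first tier with a larger start, or at the tail.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the kernel's struct memory_tier. */
struct tier {
    int adistance_start;
    struct tier *next;
};

/*
 * Return the tier starting at 'adistance', creating it if needed.
 * The list stays sorted by ascending adistance_start.
 * Returns NULL if allocation fails.
 */
static struct tier *find_create_tier(struct tier **head, int adistance)
{
    struct tier **link = head;
    struct tier *t;

    /* Skip every tier that starts before the requested distance. */
    while (*link && (*link)->adistance_start < adistance)
        link = &(*link)->next;

    /* Exact match: reuse the existing tier. */
    if (*link && (*link)->adistance_start == adistance)
        return *link;

    t = calloc(1, sizeof(*t));
    if (!t)
        return NULL;

    /* Insert before the first larger tier, or at the tail. */
    t->adistance_start = adistance;
    t->next = *link;
    *link = t;
    return t;
}
```

Keeping the list sorted means a later walk from head to tail visits tiers from fastest to slowest.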
|
|
|
|
|
2022-08-18 06:10:37 -07:00
|
|
|
static struct memory_tier *__node_get_memory_tier(int node)
|
|
|
|
{
|
2022-08-18 06:10:38 -07:00
|
|
|
pg_data_t *pgdat;
|
2022-08-18 06:10:37 -07:00
|
|
|
|
2022-08-18 06:10:38 -07:00
|
|
|
pgdat = NODE_DATA(node);
|
|
|
|
if (!pgdat)
|
|
|
|
return NULL;
|
|
|
|
/*
|
|
|
|
* Since we hold memory_tier_lock, we can avoid
|
|
|
|
* RCU read locks when accessing the details. No
|
|
|
|
* parallel updates are possible here.
|
|
|
|
*/
|
|
|
|
return rcu_dereference_check(pgdat->memtier,
|
|
|
|
lockdep_is_held(&memory_tier_lock));
|
2022-08-18 06:10:37 -07:00
|
|
|
}
|
|
|
|
|
|
|
|
#ifdef CONFIG_MIGRATION
|
2022-08-18 06:10:41 -07:00
|
|
|
bool node_is_toptier(int node)
|
|
|
|
{
|
|
|
|
bool toptier;
|
|
|
|
pg_data_t *pgdat;
|
|
|
|
struct memory_tier *memtier;
|
|
|
|
|
|
|
|
pgdat = NODE_DATA(node);
|
|
|
|
if (!pgdat)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
memtier = rcu_dereference(pgdat->memtier);
|
|
|
|
if (!memtier) {
|
|
|
|
toptier = true;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
if (memtier->adistance_start <= top_tier_adistance)
|
|
|
|
toptier = true;
|
|
|
|
else
|
|
|
|
toptier = false;
|
|
|
|
out:
|
|
|
|
rcu_read_unlock();
|
|
|
|
return toptier;
|
|
|
|
}
|
|
|
|
|
2022-08-18 06:10:40 -07:00
|
|
|
void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
|
|
|
|
{
|
|
|
|
struct memory_tier *memtier;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* pg_data_t.memtier updates include a synchronize_rcu()
|
|
|
|
* which ensures that we either find NULL or a valid memtier
|
|
|
|
* in NODE_DATA. Protect the access via rcu_read_lock().
|
|
|
|
*/
|
|
|
|
rcu_read_lock();
|
|
|
|
memtier = rcu_dereference(pgdat->memtier);
|
|
|
|
if (memtier)
|
|
|
|
*targets = memtier->lower_tier_mask;
|
|
|
|
else
|
|
|
|
*targets = NODE_MASK_NONE;
|
|
|
|
rcu_read_unlock();
|
|
|
|
}
|
|
|
|
|
2022-08-18 06:10:37 -07:00
|
|
|
/**
|
|
|
|
* next_demotion_node() - Get the next node in the demotion path
|
|
|
|
* @node: The starting node to look up the next node
|
|
|
|
*
|
|
|
|
* Return: node id for next memory node in the demotion path hierarchy
|
|
|
|
* from @node; NUMA_NO_NODE if @node is terminal. This does not keep
|
|
|
|
* @node online or guarantee that it *continues* to be the next demotion
|
|
|
|
* target.
|
|
|
|
*/
|
|
|
|
int next_demotion_node(int node)
|
|
|
|
{
|
|
|
|
struct demotion_nodes *nd;
|
|
|
|
int target;
|
|
|
|
|
|
|
|
if (!node_demotion)
|
|
|
|
return NUMA_NO_NODE;
|
|
|
|
|
|
|
|
nd = &node_demotion[node];
|
|
|
|
|
|
|
|
/*
|
|
|
|
* node_demotion[] is updated without excluding this
|
|
|
|
* function from running.
|
|
|
|
*
|
|
|
|
* Make sure to use RCU over entire code blocks if
|
|
|
|
* node_demotion[] reads need to be consistent.
|
|
|
|
*/
|
|
|
|
rcu_read_lock();
|
|
|
|
/*
|
|
|
|
* If there are multiple target nodes, just select one
|
|
|
|
* target node randomly.
|
|
|
|
*
|
|
|
|
* In addition, we can also use round-robin to select
|
|
|
|
* target node, but we should introduce another variable
|
|
|
|
* for node_demotion[] to record last selected target node,
|
|
|
|
* that may cause cache ping-pong due to the changing of
|
|
|
|
* last target node. Or introducing per-cpu data to avoid
|
|
|
|
* caching issue, which seems more complicated. So selecting
|
|
|
|
* target node randomly seems better until now.
|
|
|
|
*/
|
|
|
|
target = node_random(&nd->preferred);
|
|
|
|
rcu_read_unlock();
|
|
|
|
|
|
|
|
return target;
|
|
|
|
}
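
The random selection among preferred targets can be sketched in user-space C as follows. This is an illustration only, under stated assumptions: a `uint64_t` bitmask stands in for the kernel's `nodemask_t`, `rand()` stands in for the kernel's random source, and the names `NO_NODE`, `mask_weight`, `mask_nth`, and `pick_random_node` are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

#define NO_NODE (-1)   /* hypothetical stand-in for NUMA_NO_NODE */

/* Count the set bits in the mask. */
static int mask_weight(uint64_t mask)
{
    int n = 0;

    for (; mask; mask &= mask - 1)   /* clear the lowest set bit */
        n++;
    return n;
}

/* Return the position of the k-th (0-based) set bit, or NO_NODE. */
static int mask_nth(uint64_t mask, int k)
{
    for (int node = 0; node < 64; node++) {
        if (mask & (UINT64_C(1) << node)) {
            if (k-- == 0)
                return node;
        }
    }
    return NO_NODE;
}

/* Pick one node from the preferred mask; NO_NODE if the mask is empty. */
static int pick_random_node(uint64_t preferred)
{
    int w = mask_weight(preferred);

    if (w == 0)
        return NO_NODE;
    return mask_nth(preferred, rand() % w);
}
```

The choice is stateless, so no last-selected field has to be written on every call; that shared write is the cost the comment above attributes to a round-robin scheme.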

static void disable_all_demotion_targets(void)
{
	struct memory_tier *memtier;
	int node;

	for_each_node_state(node, N_MEMORY) {
		node_demotion[node].preferred = NODE_MASK_NONE;
		/*
		 * We are holding memory_tier_lock, it is safe
		 * to access pgdat->memtier.
		 */
		memtier = __node_get_memory_tier(node);
		if (memtier)
			memtier->lower_tier_mask = NODE_MASK_NONE;
	}
	/*
	 * Ensure that the "disable" is visible across the system.
	 * Readers will see either a combination of before+disable
	 * state or disable+after.  They will never see before and
	 * after state together.
	 */
	synchronize_rcu();
}

mm/demotion: print demotion targets
Currently, when a demotion occurs, it will prioritize selecting a node
from the preferred targets as the destination node for the demotion. If
the preferred node does not meet the requirements, it will try from all
the lower memory tier nodes until it finds a suitable demotion destination
node or ultimately fails.
However, the demotion target information isn't exposed to the users,
especially the preferred target information, which relies on more factors.
This makes it hard for users to understand the exact demotion behavior.
Rather than adding a new sysfs interface to expose this information,
print it directly to the kernel log, just as the page allocation
fallback order is printed today.
A dmesg example with this patch is as follows:
[ 0.704860] Demotion targets for Node 0: null
[ 0.705456] Demotion targets for Node 1: null
// node 2 is onlined
[ 32.259775] Demotion targets for Node 0: preferred: 2, fallback: 2
[ 32.261290] Demotion targets for Node 1: preferred: 2, fallback: 2
[ 32.262726] Demotion targets for Node 2: null
// node 3 is onlined
[ 42.448809] Demotion targets for Node 0: preferred: 2, fallback: 2-3
[ 42.450704] Demotion targets for Node 1: preferred: 2, fallback: 2-3
[ 42.452556] Demotion targets for Node 2: preferred: 3, fallback: 3
[ 42.454136] Demotion targets for Node 3: null
// node 4 is onlined
[ 52.676833] Demotion targets for Node 0: preferred: 2, fallback: 2-4
[ 52.678735] Demotion targets for Node 1: preferred: 2, fallback: 2-4
[ 52.680493] Demotion targets for Node 2: preferred: 4, fallback: 3-4
[ 52.682154] Demotion targets for Node 3: null
[ 52.683405] Demotion targets for Node 4: null
// node 5 is onlined
[ 62.931902] Demotion targets for Node 0: preferred: 2, fallback: 2-5
[ 62.938266] Demotion targets for Node 1: preferred: 5, fallback: 2-5
[ 62.943515] Demotion targets for Node 2: preferred: 4, fallback: 3-4
[ 62.947471] Demotion targets for Node 3: null
[ 62.949908] Demotion targets for Node 4: null
[ 62.952137] Demotion targets for Node 5: preferred: 3, fallback: 3-4
This requirement was discussed previously [1]. The initial
proposal involved introducing a new sysfs interface. However, due to
concerns about potential changes and compatibility issues with the
interface in the future, a consensus was not reached with the community.
Therefore, this time, we are directly printing out the information.
[1] https://lore.kernel.org/all/d1d5add8-8f4a-4578-8bf0-2cbe79b09989@fujitsu.com/
Link: https://lkml.kernel.org/r/20240206020151.605516-1-lizhijian@fujitsu.com
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-05 19:01:51 -07:00
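
The "fallback: 2-5" style of output in the dmesg example is a range list rendered from a node bitmask. As a rough illustration of that rendering only, here is a user-space sketch. It is not the kernel's printk formatting code; it assumes at most 64 nodes in a uint64_t, and the function name format_node_list is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Render the set bits of mask as a comma-separated range list,
 * e.g. bits 2,3,4,5 -> "2-5" and bits 2,4 -> "2,4".
 * Returns 0 on success, -1 if buf is too small.
 */
static int format_node_list(uint64_t mask, char *buf, size_t len)
{
	size_t used = 0;
	int bit = 0;

	if (len == 0)
		return -1;
	buf[0] = '\0';

	while (bit < 64) {
		int start, end, n;

		if (!(mask & (UINT64_C(1) << bit))) {
			bit++;
			continue;
		}
		/* Found the start of a run; extend it as far as it goes. */
		start = bit;
		while (bit < 64 && (mask & (UINT64_C(1) << bit)))
			bit++;
		end = bit - 1;

		if (start == end)
			n = snprintf(buf + used, len - used, "%s%d",
				     used ? "," : "", start);
		else
			n = snprintf(buf + used, len - used, "%s%d-%d",
				     used ? "," : "", start, end);
		if (n < 0 || (size_t)n >= len - used)
			return -1;
		used += (size_t)n;
	}
	return 0;
}
```

With this sketch, a mask covering nodes 3 and 4 produces "3-4", the same shape as the fallback lists shown in the log above, and an empty mask produces an empty string.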

static void dump_demotion_targets(void)
{
	int node;

	for_each_node_state(node, N_MEMORY) {
		struct memory_tier *memtier = __node_get_memory_tier(node);
		nodemask_t preferred = node_demotion[node].preferred;

		if (!memtier)
			continue;

		if (nodes_empty(preferred))
			pr_info("Demotion targets for Node %d: null\n", node);
		else
			pr_info("Demotion targets for Node %d: preferred: %*pbl, fallback: %*pbl\n",
				node, nodemask_pr_args(&preferred),
				nodemask_pr_args(&memtier->lower_tier_mask));
	}
}

2022-08-18 06:10:37 -07:00
/*
 * Find an automatic demotion target for all memory
 * nodes. Failing here is OK.  It might just indicate
 * being at the end of a chain.
 */
static void establish_demotion_targets(void)
{
	struct memory_tier *memtier;
	struct demotion_nodes *nd;
	int target = NUMA_NO_NODE, node;
	int distance, best_distance;
	nodemask_t tier_nodes, lower_tier;

	lockdep_assert_held_once(&memory_tier_lock);

	if (!node_demotion)
		return;

	disable_all_demotion_targets();

	for_each_node_state(node, N_MEMORY) {
		best_distance = -1;
		nd = &node_demotion[node];

		memtier = __node_get_memory_tier(node);
		if (!memtier || list_is_last(&memtier->list, &memory_tiers))
			continue;
		/*
		 * Get the lower memtier to find the demotion node list.
		 */
		memtier = list_next_entry(memtier, list);
		tier_nodes = get_memtier_nodemask(memtier);
		/*
		 * find_next_best_node() uses the 'used' nodemask as a skip
		 * list. Add all memory nodes except those in the selected
		 * memory tier to the skip list so that we find the best
		 * node from the memtier nodelist.
		 */
		nodes_andnot(tier_nodes, node_states[N_MEMORY], tier_nodes);

		/*
		 * Find all the nodes in the memory tier node list that share
		 * the best distance and add them to the preferred mask. We
		 * randomly select between nodes in the preferred mask when
		 * allocating pages during demotion.
		 */
		do {
			target = find_next_best_node(node, &tier_nodes);
			if (target == NUMA_NO_NODE)
				break;

			distance = node_distance(node, target);
			if (distance == best_distance || best_distance == -1) {
				best_distance = distance;
				node_set(target, nd->preferred);
			} else {
				break;
			}
		} while (1);
	}
	/*
	 * Promotion is allowed from a memory tier to a higher
	 * memory tier only if the memory tier doesn't include
	 * compute. We want to skip promotion from a memory tier
	 * if any node that is part of the memory tier has CPUs.
	 * Once we detect such a memory tier, we consider that tier
	 * as the top tier from which promotion is not allowed.
	 */
	list_for_each_entry_reverse(memtier, &memory_tiers, list) {
		tier_nodes = get_memtier_nodemask(memtier);
		nodes_and(tier_nodes, node_states[N_CPU], tier_nodes);
		if (!nodes_empty(tier_nodes)) {
			/*
			 * Any abstract distance below the max value of
			 * this memtier is considered toptier.
			 */
			top_tier_adistance = memtier->adistance_start +
						MEMTIER_CHUNK_SIZE - 1;
			break;
		}
	}
	/*
	 * Now build the lower_tier mask for each node by collecting the
	 * node masks of all memory tiers below it. This allows us to fall
	 * back demotion page allocation to a set of nodes that is close to
	 * the preferred node selected above.
	 */
	lower_tier = node_states[N_MEMORY];
	list_for_each_entry(memtier, &memory_tiers, list) {
		/*
		 * Keep removing the current tier from the lower_tier nodes.
		 * This removes all nodes in the current and higher
		 * memory tiers from the lower_tier mask.
		 */
		tier_nodes = get_memtier_nodemask(memtier);
		nodes_andnot(lower_tier, lower_tier, tier_nodes);
		memtier->lower_tier_mask = lower_tier;
	}

	dump_demotion_targets();
}
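
The two mask computations in establish_demotion_targets() can be modelled in user space. The sketch below is illustrative only and makes several assumptions: nodes fit in a uint64_t, tiers are given as an array ordered from the top tier down, and the preferred set is found by a plain minimum-distance scan rather than by the kernel's find_next_best_node(), whose scoring is not reproduced here. The names select_preferred and build_lower_tier_masks are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return the subset of candidates that are at the smallest distance
 * from node.  dist is a flattened nnodes x nnodes distance matrix.
 * Simplification: a linear scan replaces find_next_best_node().
 */
static uint64_t select_preferred(int node, uint64_t candidates,
				 const int *dist, int nnodes)
{
	uint64_t preferred = 0;
	int best = -1;

	assert(nnodes > 0 && nnodes <= 64);
	assert(node >= 0 && node < nnodes);

	for (int t = 0; t < nnodes; t++) {
		int d;

		if (!(candidates & (UINT64_C(1) << t)))
			continue;
		d = dist[node * nnodes + t];
		if (best == -1 || d < best) {
			/* New best distance: restart the preferred set. */
			best = d;
			preferred = UINT64_C(1) << t;
		} else if (d == best) {
			/* Ties join the preferred set. */
			preferred |= UINT64_C(1) << t;
		}
	}
	return preferred;
}

/*
 * For tiers ordered top to bottom, compute for each tier the mask of
 * all memory nodes in strictly lower tiers: start from every memory
 * node and keep removing the current tier's nodes.
 */
static void build_lower_tier_masks(const uint64_t *tier_nodes,
				   uint64_t *lower_tier_mask,
				   size_t ntiers, uint64_t all_memory_nodes)
{
	uint64_t lower = all_memory_nodes;

	for (size_t i = 0; i < ntiers; i++) {
		lower &= ~tier_nodes[i];
		lower_tier_mask[i] = lower;
	}
}
```

As a usage example, assuming three tiers holding nodes {0,1}, {2,5} and {3,4} (an inference from the last dmesg state in the commit message, not something the source states), build_lower_tier_masks() yields nodes 2-5 for the top tier, nodes 3-4 for the middle tier, and an empty mask for the bottom tier.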

#else
static inline void establish_demotion_targets(void) {}
#endif /* CONFIG_MIGRATION */

2022-08-18 06:10:36 -07:00
static inline void __init_node_memory_type(int node, struct memory_dev_type *memtype)
{
	if (!node_memory_types[node].memtype)
		node_memory_types[node].memtype = memtype;
	/*
	 * For each device getting added in the same NUMA node
	 * with this specific memtype, bump the map count. We
	 * only take the memtype device reference once, so that
	 * changing a node memtype can be done by dropping the
	 * only reference count taken here.
	 */

	if (node_memory_types[node].memtype == memtype) {
		if (!node_memory_types[node].map_count++)
			kref_get(&memtype->kref);
	}
}
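
The map-count rule in the comment above (count every device mapping, but take the type reference only on the first one) can be shown with a small user-space model. This is a sketch under stated assumptions: a plain integer stands in for the kernel's kref, and the names dev_type, node_entry and init_node_type are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a reference-counted memory type. */
struct dev_type {
	int refcount;
};

/* Stand-in for one node_memory_types[] entry. */
struct node_entry {
	struct dev_type *memtype;
	int map_count;
};

/*
 * Bind a node to a type if it has none yet.  Every matching call
 * bumps map_count, but only the first one takes a reference, so the
 * node holds exactly one reference regardless of how many devices
 * map into it.  Calls with a different type leave the entry alone.
 */
static void init_node_type(struct node_entry *e, struct dev_type *t)
{
	assert(e != NULL && t != NULL);

	if (!e->memtype)
		e->memtype = t;

	if (e->memtype == t) {
		if (e->map_count++ == 0)
			t->refcount++;
	}
}
```

Two calls with the same type leave map_count at 2 and the type's reference count raised by one; a later call with a different type changes neither the entry nor that other type's count.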
mm/demotion: add support for explicit memory tiers
Patch series "mm/demotion: Memory tiers and demotion", v15.
The current kernel has the basic memory tiering support: Inactive pages on
a higher tier NUMA node can be migrated (demoted) to a lower tier NUMA
node to make room for new allocations on the higher tier NUMA node.
Frequently accessed pages on a lower tier NUMA node can be migrated
(promoted) to a higher tier NUMA node to improve the performance.
In the current kernel, memory tiers are defined implicitly via a demotion
path relationship between NUMA nodes, which is created during the kernel
initialization and updated when a NUMA node is hot-added or hot-removed.
The current implementation puts all nodes with CPU into the highest tier,
and builds the tier hierarchy tier-by-tier by establishing the per-node
demotion targets based on the distances between nodes.
This current memory tier kernel implementation needs to be improved for
several important use cases:
* The current tier initialization code always initializes each
memory-only NUMA node into a lower tier. But a memory-only NUMA node
may have a high performance memory device (e.g. a DRAM-backed
memory-only node on a virtual machine) and that should be put into a
higher tier.
* The current tier hierarchy always puts CPU nodes into the top tier.
But on a system with HBM (e.g. GPU memory) devices, these memory-only
HBM NUMA nodes should be in the top tier, and DRAM nodes with CPUs are
better to be placed into the next lower tier.
* Also because the current tier hierarchy always puts CPU nodes into the
top tier, when a CPU is hot-added (or hot-removed) and triggers a memory
node from CPU-less into a CPU node (or vice versa), the memory tier
hierarchy gets changed, even though no memory node is added or removed.
This can make the tier hierarchy unstable and make it difficult to
support tier-based memory accounting.
* A higher tier node can only be demoted to nodes with shortest distance
on the next lower tier as defined by the demotion path, not any other
node from any lower tier. This strict demotion order does not work in
all use cases (e.g. some use cases may want to allow cross-socket
demotion to another node in the same demotion tier as a fallback when
the preferred demotion node is out of space), and has resulted in the
feature request for an interface to override the system-wide, per-node
demotion order from the userspace. This demotion order is also
inconsistent with the page allocation fallback order when all the nodes
in a higher tier are out of space: The page allocation can fall back to
any node from any lower tier, whereas the demotion order doesn't allow
that.
This patch series makes the creation of memory tiers explicit, under the
control of device drivers.
Memory Tier Initialization
==========================
Linux kernel presents memory devices as NUMA nodes and each memory device
is of a specific type. The memory type of a device is represented by its
abstract distance. A memory tier corresponds to a range of abstract
distance. This allows for classifying memory devices with a specific
performance range into a memory tier.
By default, all memory nodes are assigned to the default tier with
abstract distance 512.
A device driver can move its memory nodes from the default tier. For
example, PMEM can move its memory nodes below the default tier, whereas
GPU can move its memory nodes above the default tier.
The kernel initialization code makes the decision on which exact tier a
memory node should be assigned to based on the requests from the device
drivers as well as the memory device hardware information provided by the
firmware.
Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
This patch (of 10):
In the current kernel, memory tiers are defined implicitly via a demotion
path relationship between NUMA nodes, which is created during the kernel
initialization and updated when a NUMA node is hot-added or hot-removed.
The current implementation puts all nodes with CPU into the highest tier,
and builds the tier hierarchy by establishing the per-node demotion
targets based on the distances between nodes.
This current memory tier kernel implementation needs to be improved for
several important use cases:
The current tier initialization code always initializes each memory-only
NUMA node into a lower tier. But a memory-only NUMA node may have a high
performance memory device (e.g. a DRAM-backed memory-only node on a
virtual machine) that should be put into a higher tier.
The current tier hierarchy always puts CPU nodes into the top tier. But
on a system with HBM or GPU devices, the memory-only NUMA nodes mapping
these devices should be in the top tier, and DRAM nodes with CPUs are
better to be placed into the next lower tier.
With the current kernel, a higher tier node can only be demoted to nodes
with the shortest distance on the next lower tier as defined by the
demotion path, not any other node from any lower tier. This strict
demotion order does not work in all use cases (e.g. some use cases may
want to allow cross-socket demotion to another node in the same demotion
tier as a fallback when the preferred demotion node is out of space). This demotion
order is also inconsistent with the page allocation fallback order when
all the nodes in a higher tier are out of space: The page allocation can
fall back to any node from any lower tier, whereas the demotion order
doesn't allow that.
This patch series addresses the above by defining memory tiers explicitly.
Linux kernel presents memory devices as NUMA nodes and each memory device
is of a specific type. The memory type of a device is represented by its
abstract distance. A memory tier corresponds to a range of abstract
distance. This allows for classifying memory devices with a specific
performance range into a memory tier.
This patch configures the range/chunk size to be 128. The default DRAM
abstract distance is 512. We can have 4 memory tiers below the default
DRAM with abstract distance ranges 0 - 127, 128 - 255, 256 - 383, 384 - 511.
Faster memory devices can be placed in these faster (higher) memory tiers.
Slower memory devices like persistent memory will have abstract distance
higher than the default DRAM level.
[akpm@linux-foundation.org: fix comment, per Aneesh]
Link: https://lkml.kernel.org/r/20220818131042.113280-1-aneesh.kumar@linux.ibm.com
Link: https://lkml.kernel.org/r/20220818131042.113280-2-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Wei Xu <weixugc@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Bharata B Rao <bharata@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hesham Almatary <hesham.almatary@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-18 06:10:33 -07:00
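
The chunk arithmetic described in the commit message above can be made concrete. The sketch below uses the values stated there (chunk size 128, default DRAM abstract distance 512); the macro and function names are hypothetical, and the constants in a given kernel tree may differ from the ones this commit message quotes. Abstract distances are assumed to be non-negative.

```c
#include <assert.h>

#define CHUNK_BITS	7			/* 2^7 = 128, per the commit message */
#define CHUNK_SIZE	(1 << CHUNK_BITS)
#define ADISTANCE_DRAM	512			/* default DRAM abstract distance */

/* First abstract distance of the tier that adist falls into. */
static int tier_start(int adist)
{
	assert(adist >= 0);
	return adist & ~(CHUNK_SIZE - 1);
}

/* Last abstract distance of the tier that adist falls into. */
static int tier_end(int adist)
{
	return tier_start(adist) + CHUNK_SIZE - 1;
}

/* Two devices share a tier when their distances fall in one chunk. */
static int same_tier(int a, int b)
{
	return tier_start(a) == tier_start(b);
}
```

Under these assumptions the default DRAM tier spans 512 to 639, a device at abstract distance 300 lands in the 256 to 383 tier, and 511 and 512 fall into different tiers. The tier_end() expression has the same form as the `adistance_start + MEMTIER_CHUNK_SIZE - 1` computation used for top_tier_adistance in establish_demotion_targets().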

static struct memory_tier *set_node_memory_tier(int node)
{
	struct memory_tier *memtier;
memory tier: create CPUless memory tiers after obtaining HMAT info
The current implementation treats emulated memory devices, such as CXL1.1
type3 memory, as normal DRAM when they are emulated as normal memory
(E820_TYPE_RAM). However, these emulated devices have different
characteristics than traditional DRAM, making it important to distinguish
them. Thus, we modify the tiered memory initialization process to
introduce a delay specifically for CPUless NUMA nodes. This delay ensures
that the memory tier initialization for these nodes is deferred until HMAT
information is obtained during the boot process. Finally, demotion tables
are recalculated at the end.
* late_initcall(memory_tier_late_init);
Some device drivers may have initialized memory tiers between
`memory_tier_init()` and `memory_tier_late_init()`, potentially bringing
online memory nodes and configuring memory tiers. They should be
excluded in the late init.
* Handle cases where there is no HMAT when creating memory tiers
There is a scenario where a CPUless node does not provide HMAT
information. If no HMAT is specified, it falls back to using the
default DRAM tier.
* Introduce another new lock `default_dram_perf_lock` for adist
calculation. In the current implementation, iterating through CPUless
nodes requires holding the `memory_tier_lock`. However,
`mt_calc_adistance()` will end up trying to acquire the same lock,
leading to a potential deadlock. Therefore, we propose introducing a
standalone `default_dram_perf_lock` to protect `default_dram_perf_*`.
This approach not only avoids deadlock but also prevents holding a large
lock simultaneously.
* Upgrade `set_node_memory_tier` to support additional cases, including
default DRAM, late CPUless, and hot-plugged initializations. To cover
hot-plugged memory nodes, `mt_calc_adistance()` and
`mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to
handle cases where memtype is not initialized and where HMAT information
is available.
* Introduce `default_memory_types` for those memory types that are not
initialized by device drivers. Because late initialized memory and
default DRAM memory need to be managed, a default memory type is created
for storing all memory types that are not initialized by device drivers
and as a fallback.
Link: https://lkml.kernel.org/r/20240405000707.2670063-3-horenchuang@bytedance.com
Signed-off-by: Ho-Ren (Jack) Chuang <horenchuang@bytedance.com>
Signed-off-by: Hao Xiang <hao.xiang@bytedance.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Gregory Price <gourry.memverge@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Ravi Jonnalagadda <ravis.opensrc@micron.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawie.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-04 17:07:06 -07:00
	struct memory_dev_type *memtype = default_dram_type;
	int adist = MEMTIER_ADISTANCE_DRAM;
	pg_data_t *pgdat = NODE_DATA(node);
	lockdep_assert_held_once(&memory_tier_lock);

	if (!node_state(node, N_MEMORY))
		return ERR_PTR(-EINVAL);
memory tier: create CPUless memory tiers after obtaining HMAT info
The current implementation treats emulated memory devices, such as CXL1.1
type3 memory, as normal DRAM when they are emulated as normal memory
(E820_TYPE_RAM). However, these emulated devices have different
characteristics than traditional DRAM, making it important to distinguish
them. Thus, we modify the tiered memory initialization process to
introduce a delay specifically for CPUless NUMA nodes. This delay ensures
that the memory tier initialization for these nodes is deferred until HMAT
information is obtained during the boot process. Finally, demotion tables
are recalculated at the end.
* late_initcall(memory_tier_late_init);
Some device drivers may have initialized memory tiers between
`memory_tier_init()` and `memory_tier_late_init()`, potentially bringing
online memory nodes and configuring memory tiers. They should be
excluded in the late init.
* Handle cases where there is no HMAT when creating memory tiers
There is a scenario where a CPUless node does not provide HMAT
information. If no HMAT is specified, it falls back to using the
default DRAM tier.
* Introduce another new lock `default_dram_perf_lock` for adist calculation
In the current implementation, iterating through CPUless nodes requires
holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end
up trying to acquire the same lock, leading to a potential deadlock.
Therefore, we propose introducing a standalone `default_dram_perf_lock`
to protect `default_dram_perf_*`. This approach not only avoids the
deadlock but also avoids holding one large lock throughout.
* Upgrade `set_node_memory_tier` to support additional cases, including
default DRAM, late CPUless, and hot-plugged initializations. To cover
hot-plugged memory nodes, `mt_calc_adistance()` and
`mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to
handle cases where memtype is not initialized and where HMAT information
is available.
* Introduce `default_memory_types` for those memory types that are not
initialized by device drivers. Because late initialized memory and
default DRAM memory need to be managed, a default memory type is created
for storing all memory types that are not initialized by device drivers
and as a fallback.
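The deferral and fallback rules in the list above can be sketched as two small decisions. The following is a user-space illustration, not the kernel code: `struct node_info`, `enum init_phase` and both functions are hypothetical names, and the rule that CPU nodes are handled in the early phase is a simplifying assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ADISTANCE_DRAM	512	/* default DRAM abstract distance, per the text */

/* Hypothetical stand-in for the per-node state the kernel tracks. */
struct node_info {
	bool has_cpu;
	bool driver_initialized;	/* a device driver already set up the tier */
	bool has_perf_data;		/* firmware (e.g. HMAT) described the node */
	int perf_adistance;		/* only meaningful when has_perf_data is set */
};

enum init_phase { PHASE_EARLY, PHASE_LATE };

/* CPUless nodes wait for the late phase; driver-initialized nodes are skipped. */
static bool should_init_in_phase(const struct node_info *n, enum init_phase phase)
{
	assert(n != NULL);
	if (n->driver_initialized)
		return false;
	return n->has_cpu ? phase == PHASE_EARLY : phase == PHASE_LATE;
}

/* Without performance data, fall back to the default DRAM distance. */
static int choose_adistance(const struct node_info *n)
{
	assert(n != NULL);
	return n->has_perf_data ? n->perf_adistance : ADISTANCE_DRAM;
}
```

A CPUless node with performance data is skipped early and picked up late with its own distance, while a CPUless node without such data ends up in the default DRAM range.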
Link: https://lkml.kernel.org/r/20240405000707.2670063-3-horenchuang@bytedance.com
Signed-off-by: Ho-Ren (Jack) Chuang <horenchuang@bytedance.com>
Signed-off-by: Hao Xiang <hao.xiang@bytedance.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Gregory Price <gourry.memverge@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Ravi Jonnalagadda <ravis.opensrc@micron.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawie.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-04 17:07:06 -07:00
	mt_calc_adistance(node, &adist);
	if (!node_memory_types[node].memtype) {
		memtype = mt_find_alloc_memory_type(adist, &default_memory_types);
		if (IS_ERR(memtype)) {
			memtype = default_dram_type;
			pr_info("Failed to allocate a memory type. Fall back.\n");
		}
	}

	__init_node_memory_type(node, memtype);
mm/demotion: add support for explicit memory tiers
2022-08-18 06:10:33 -07:00
2022-08-18 06:10:36 -07:00
	memtype = node_memory_types[node].memtype;
mm/demotion: add support for explicit memory tiers
2022-08-18 06:10:33 -07:00
	node_set(node, memtype->nodes);
	memtier = find_create_memory_tier(memtype);
2022-08-18 06:10:38 -07:00
	if (!IS_ERR(memtier))
		rcu_assign_pointer(pgdat->memtier, memtier);
mm/demotion: add support for explicit memory tiers
2022-08-18 06:10:33 -07:00
	return memtier;
}
2022-08-18 06:10:35 -07:00
static void destroy_memory_tier(struct memory_tier *memtier)
{
	list_del(&memtier->list);
2022-08-30 01:17:36 -07:00
	device_unregister(&memtier->dev);
2022-08-18 06:10:35 -07:00
}

static bool clear_node_memory_tier(int node)
{
	bool cleared = false;
2022-08-18 06:10:38 -07:00
	pg_data_t *pgdat;
2022-08-18 06:10:35 -07:00
	struct memory_tier *memtier;
2022-08-18 06:10:38 -07:00
	pgdat = NODE_DATA(node);
	if (!pgdat)
		return false;

	/*
	 * Make sure that anybody looking at NODE_DATA who finds
	 * a valid memtier finds memory_dev_types with nodes still
	 * linked to the memtier. We achieve this by waiting for
	 * rcu read section to finish using synchronize_rcu.
	 * This also enables us to free the destroyed memory tier
	 * with kfree instead of kfree_rcu
	 */
2022-08-18 06:10:35 -07:00
	memtier = __node_get_memory_tier(node);
	if (memtier) {
		struct memory_dev_type *memtype;
2022-08-18 06:10:38 -07:00
		rcu_assign_pointer(pgdat->memtier, NULL);
		synchronize_rcu();
2022-08-18 06:10:36 -07:00
		memtype = node_memory_types[node].memtype;
2022-08-18 06:10:35 -07:00
		node_clear(node, memtype->nodes);
		if (nodes_empty(memtype->nodes)) {
2023-08-02 02:28:56 -07:00
			list_del_init(&memtype->tier_sibling);
2022-08-18 06:10:35 -07:00
			if (list_empty(&memtier->memory_types))
				destroy_memory_tier(memtier);
		}
		cleared = true;
	}
	return cleared;
}
2022-08-18 06:10:36 -07:00
static void release_memtype(struct kref *kref)
{
	struct memory_dev_type *memtype;

	memtype = container_of(kref, struct memory_dev_type, kref);
	kfree(memtype);
}

struct memory_dev_type *alloc_memory_type(int adistance)
{
	struct memory_dev_type *memtype;

	memtype = kmalloc(sizeof(*memtype), GFP_KERNEL);
	if (!memtype)
		return ERR_PTR(-ENOMEM);

	memtype->adistance = adistance;
2023-08-02 02:28:56 -07:00
	INIT_LIST_HEAD(&memtype->tier_sibling);
2022-08-18 06:10:36 -07:00
	memtype->nodes = NODE_MASK_NONE;
	kref_init(&memtype->kref);
	return memtype;
}
EXPORT_SYMBOL_GPL(alloc_memory_type);
2023-07-05 23:39:05 -07:00
void put_memory_type(struct memory_dev_type *memtype)
2022-08-18 06:10:36 -07:00
{
	kref_put(&memtype->kref, release_memtype);
}
2023-07-05 23:39:05 -07:00
EXPORT_SYMBOL_GPL(put_memory_type);
2022-08-18 06:10:36 -07:00

void init_node_memory_type(int node, struct memory_dev_type *memtype)
{
	mutex_lock(&memory_tier_lock);
	__init_node_memory_type(node, memtype);
	mutex_unlock(&memory_tier_lock);
}
EXPORT_SYMBOL_GPL(init_node_memory_type);

void clear_node_memory_type(int node, struct memory_dev_type *memtype)
{
	mutex_lock(&memory_tier_lock);
2023-09-25 23:06:28 -07:00
	if (node_memory_types[node].memtype == memtype || !memtype)
2022-08-18 06:10:36 -07:00
		node_memory_types[node].map_count--;
	/*
	 * If we unmapped all the attached devices to this node,
	 * clear the node memory type.
	 */
	if (!node_memory_types[node].map_count) {
2023-09-25 23:06:28 -07:00
		memtype = node_memory_types[node].memtype;
2022-08-18 06:10:36 -07:00
		node_memory_types[node].memtype = NULL;
2023-07-05 23:39:05 -07:00
		put_memory_type(memtype);
2022-08-18 06:10:36 -07:00
	}
	mutex_unlock(&memory_tier_lock);
}
EXPORT_SYMBOL_GPL(clear_node_memory_type);
memory tier: dax/kmem: introduce an abstract layer for finding, allocating, and putting memory types
Patch series "Improved Memory Tier Creation for CPUless NUMA Nodes", v11.
When a memory device, such as CXL1.1 type3 memory, is emulated as normal
memory (E820_TYPE_RAM), the memory device is indistinguishable from normal
DRAM in terms of memory tiering with the current implementation. The
current memory tiering assigns all detected normal memory nodes to the
same DRAM tier. As a result, normal memory devices with different
attributes cannot be assigned to the correct memory tier, and pages
cannot be migrated between the different types of memory.
https://lore.kernel.org/linux-mm/PH0PR08MB7955E9F08CCB64F23963B5C3A860A@PH0PR08MB7955.namprd08.prod.outlook.com/T/
This patchset automatically resolves the issues. It delays the
initialization of memory tiers for CPUless NUMA nodes until they obtain
HMAT information and after all devices are initialized at boot time,
eliminating the need for user intervention. If no HMAT is specified, it
falls back to using `default_dram_type`.
Example use case:
We have CXL memory on the host, and we create VMs with a new system memory
device backed by host CXL memory. We inject CXL memory performance
attributes through QEMU, and the guest now sees memory nodes with
performance attributes in HMAT. With this change, we enable the guest
kernel to construct the correct memory tiering for the memory nodes.
This patch (of 2):
Since different memory devices require finding, allocating, and putting
memory types, these common steps are abstracted in this patch, enhancing
the scalability and conciseness of the code.
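The find, allocate and put steps mentioned above follow a common find-or-create pattern. The sketch below is a user-space analogue only: `struct mem_type`, `find_alloc_type()` and `put_types()` are hypothetical names, a plain singly linked list stands in for the kernel list, and a bare integer stands in for the reference count.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical user-space analogue of a memory type entry. */
struct mem_type {
	int adistance;
	int refcount;
	struct mem_type *next;
};

/* Return the type with this adistance, creating and linking it if absent. */
static struct mem_type *find_alloc_type(struct mem_type **head, int adistance)
{
	struct mem_type *t;

	assert(head != NULL);
	for (t = *head; t; t = t->next)
		if (t->adistance == adistance)
			return t;

	t = malloc(sizeof(*t));
	if (!t)
		return NULL;
	t->adistance = adistance;
	t->refcount = 1;	/* reference held by the list */
	t->next = *head;
	*head = t;
	return t;
}

/* Unlink every type and drop the list's reference, freeing at zero. */
static void put_types(struct mem_type **head)
{
	struct mem_type *t, *next;

	assert(head != NULL);
	for (t = *head; t; t = next) {
		next = t->next;
		if (--t->refcount == 0)
			free(t);
	}
	*head = NULL;
}
```

Keeping both steps behind two helpers means each caller only supplies its own list head and an abstract distance, which is the kind of sharing the patch description refers to.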
Link: https://lkml.kernel.org/r/20240405000707.2670063-1-horenchuang@bytedance.com
Link: https://lkml.kernel.org/r/20240405000707.2670063-2-horenchuang@bytedance.com
Signed-off-by: Ho-Ren (Jack) Chuang <horenchuang@bytedance.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawie.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Gregory Price <gourry.memverge@gmail.com>
Cc: Hao Xiang <hao.xiang@bytedance.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Ravi Jonnalagadda <ravis.opensrc@micron.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-04 17:07:05 -07:00
struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct list_head *memory_types)
{
	struct memory_dev_type *mtype;

	list_for_each_entry(mtype, memory_types, list)
		if (mtype->adistance == adist)
			return mtype;

	mtype = alloc_memory_type(adist);
	if (IS_ERR(mtype))
		return mtype;

	list_add(&mtype->list, memory_types);

	return mtype;
}
EXPORT_SYMBOL_GPL(mt_find_alloc_memory_type);

void mt_put_memory_types(struct list_head *memory_types)
{
	struct memory_dev_type *mtype, *mtn;

	list_for_each_entry_safe(mtype, mtn, memory_types, list) {
		list_del(&mtype->list);
		put_memory_type(mtype);
	}
}
EXPORT_SYMBOL_GPL(mt_put_memory_types);
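The find-or-allocate and put-all pattern used by mt_find_alloc_memory_type()
and mt_put_memory_types() can be tried outside the kernel with a rough
userspace sketch. Everything below is hypothetical and only for
illustration: `struct mem_type`, `find_alloc_type()` and `put_types()` are
invented names, a plain singly linked list stands in for `struct list_head`,
and `malloc()`/`free()` stand in for alloc_memory_type()/put_memory_type()
(so there is no reference counting and no ERR_PTR handling here).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for struct memory_dev_type. */
struct mem_type {
	int adistance;
	struct mem_type *next;
};

/*
 * Return the entry whose adistance matches, or allocate a new entry and
 * add it to the front of the list. Returns NULL if allocation fails.
 */
struct mem_type *find_alloc_type(int adist, struct mem_type **head)
{
	struct mem_type *t;

	for (t = *head; t; t = t->next)
		if (t->adistance == adist)
			return t;

	t = malloc(sizeof(*t));
	if (!t)
		return NULL;

	t->adistance = adist;
	t->next = *head;
	*head = t;

	return t;
}

/* Unlink and release every entry, leaving the list empty. */
void put_types(struct mem_type **head)
{
	while (*head) {
		struct mem_type *t = *head;

		*head = t->next;
		free(t);
	}
}
```

With this shape, two callers asking for the same adistance share one entry,
and a single put_types() call at teardown releases whatever the list
accumulated.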
memory tier: create CPUless memory tiers after obtaining HMAT info
The current implementation treats emulated memory devices, such as CXL1.1
type3 memory, as normal DRAM when they are emulated as normal memory
(E820_TYPE_RAM). However, these emulated devices have different
characteristics than traditional DRAM, making it important to distinguish
them. Thus, we modify the tiered memory initialization process to
introduce a delay specifically for CPUless NUMA nodes. This delay ensures
that the memory tier initialization for these nodes is deferred until HMAT
information is obtained during the boot process. Finally, demotion tables
are recalculated at the end.
* late_initcall(memory_tier_late_init);
Some device drivers may have initialized memory tiers between
`memory_tier_init()` and `memory_tier_late_init()`, potentially bringing
online memory nodes and configuring memory tiers. They should be
excluded in the late init.
* Handle cases where there is no HMAT when creating memory tiers
There is a scenario where a CPUless node does not provide HMAT
information. If no HMAT is specified, it falls back to using the
default DRAM tier.
* Introduce another new lock `default_dram_perf_lock` for adist calculation
In the current implementation, iterating through CPUless nodes requires
holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end
up trying to acquire the same lock, leading to a potential deadlock.
Therefore, we propose introducing a standalone `default_dram_perf_lock`
to protect `default_dram_perf_*`. This approach not only avoids the
deadlock but also avoids holding one large lock throughout.
* Upgrade `set_node_memory_tier` to support additional cases, including
default DRAM, late CPUless, and hot-plugged initializations. To cover
hot-plugged memory nodes, `mt_calc_adistance()` and
`mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to
handle cases where memtype is not initialized and where HMAT information
is available.
* Introduce `default_memory_types` for those memory types that are not
initialized by device drivers. Because late initialized memory and
default DRAM memory need to be managed, a default memory type is created
for storing all memory types that are not initialized by device drivers
and as a fallback.
Link: https://lkml.kernel.org/r/20240405000707.2670063-3-horenchuang@bytedance.com
Signed-off-by: Ho-Ren (Jack) Chuang <horenchuang@bytedance.com>
Signed-off-by: Hao Xiang <hao.xiang@bytedance.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Gregory Price <gourry.memverge@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Ravi Jonnalagadda <ravis.opensrc@micron.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawie.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-04 17:07:06 -07:00
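The late-init behaviour described in this commit message (skip nodes a
driver already set up, use performance data when it exists, otherwise fall
back to the default DRAM value) can be sketched in userspace C. This is an
illustration under stated assumptions, not the kernel code:
`struct node_state`, `late_init_nodes()`, `sample_calc()` and the constants
are hypothetical names and values, and the callback merely plays the role
that an HMAT-backed adistance algorithm would play.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 4
#define DEFAULT_DRAM_ADIST 576	/* illustrative fallback value */

/* Hypothetical per-node state; adist == 0 means "not initialized yet". */
struct node_state {
	int has_memory;
	int adist;
};

/*
 * Hypothetical performance callback: returns 0 and sets *adist when data
 * exists for the node, or -1 when there is none (e.g. no HMAT entry).
 */
typedef int (*adist_fn)(int nid, int *adist);

/* Sample callback for the sketch: only node 2 has performance data. */
int sample_calc(int nid, int *adist)
{
	if (nid != 2)
		return -1;
	*adist = 961;
	return 0;
}

/*
 * Assign an adistance to every memory node that was not set up earlier.
 * Returns the number of nodes initialized by this pass.
 */
int late_init_nodes(struct node_state *nodes, int n, adist_fn calc)
{
	int nid, assigned = 0;

	for (nid = 0; nid < n; nid++) {
		int adist;

		if (!nodes[nid].has_memory)
			continue;
		/* Already initialized (e.g. by a driver): leave it alone. */
		if (nodes[nid].adist)
			continue;
		/* No callback or no data: fall back to the DRAM default. */
		if (!calc || calc(nid, &adist) != 0)
			adist = DEFAULT_DRAM_ADIST;

		nodes[nid].adist = adist;
		assigned++;
	}

	return assigned;
}
```

In this sketch a node with no performance data ends up in the same class as
default DRAM, which mirrors the fallback the commit message describes.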
/*
 * This is invoked via `late_initcall()` to initialize memory tiers for
memory tier: consolidate the initialization of memory tiers
The current memory tier initialization process is distributed across two
different functions, memory_tier_init() and memory_tier_late_init(). This
design is hard to maintain. Thus, this patch is proposed to reduce the
possible code paths by consolidating different initialization paths into
one.
The earlier discussion with Jonathan and Ying is listed here:
https://lore.kernel.org/lkml/20240405150244.00004b49@Huawei.com/
If we want to put these two initializations together, they must be placed
together in the later function, because only at that time is the HMAT
information ready, so that adist between nodes can be calculated and
memory tiering can be established based on the adist. So we move the
initialization from memory_tier_init() to the memory_tier_late_init() call.
Moreover, it's natural to keep memory_tier initialization in drivers at
device_initcall() level.
If we simply move the set_node_memory_tier() from memory_tier_init() to
late_initcall(), it will result in HMAT not registering the
mt_adistance_algorithm callback function, because set_node_memory_tier()
is not performed during the memory tiering initialization phase, leading
to a lack of correct default_dram information.
Therefore, we introduced a nodemask to pass the information of the default
DRAM nodes. The reason for not choosing to reuse default_dram_type->nodes
is that it is not clean enough. So in the end, we use a __initdata
variable, which is released once initialization is complete and which
includes both CPU and memory nodes for HMAT to iterate through.
Link: https://lkml.kernel.org/r/20240704072646.437579-1-horen.chuang@linux.dev
Signed-off-by: Ho-Ren (Jack) Chuang <horenchuang@bytedance.com>
Suggested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Gregory Price <gourry.memverge@gmail.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Ravi Jonnalagadda <ravis.opensrc@micron.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-04 00:26:44 -07:00
 * memory nodes, both with and without CPUs. After the initialization of
 * firmware and devices, adistance algorithms are expected to be provided.
 */
static int __init memory_tier_late_init(void)
{
	int nid;
	struct memory_tier *memtier;
	get_online_mems();
	guard(mutex)(&memory_tier_lock);
	/* Assign each uninitialized N_MEMORY node to a memory tier. */
	for_each_node_state(nid, N_MEMORY) {
		/*
		 * Some device drivers may have initialized
		 * memory tiers, potentially bringing memory nodes
		 * online and configuring memory tiers.
		 * Exclude them here.
|
memory tier: create CPUless memory tiers after obtaining HMAT info
The current implementation treats emulated memory devices, such as CXL1.1
type3 memory, as normal DRAM when they are emulated as normal memory
(E820_TYPE_RAM). However, these emulated devices have different
characteristics than traditional DRAM, making it important to distinguish
them. Thus, we modify the tiered memory initialization process to
introduce a delay specifically for CPUless NUMA nodes. This delay ensures
that the memory tier initialization for these nodes is deferred until HMAT
information is obtained during the boot process. Finally, demotion tables
are recalculated at the end.
* late_initcall(memory_tier_late_init);
Some device drivers may have initialized memory tiers between
`memory_tier_init()` and `memory_tier_late_init()`, potentially bringing
online memory nodes and configuring memory tiers. They should be
excluded in the late init.
* Handle cases where there is no HMAT when creating memory tiers
There is a scenario where a CPUless node does not provide HMAT
information. If no HMAT is specified, it falls back to using the
default DRAM tier.
* Introduce another new lock `default_dram_perf_lock` for adist
calculation
In the current implementation, iterating through CPUless
nodes requires holding the `memory_tier_lock`. However,
`mt_calc_adistance()` will end up trying to acquire the same lock,
leading to a potential deadlock. Therefore, we propose introducing a
standalone `default_dram_perf_lock` to protect `default_dram_perf_*`.
This approach not only avoids the deadlock but also avoids holding one
large lock throughout.
* Upgrade `set_node_memory_tier` to support additional cases, including
default DRAM, late CPUless, and hot-plugged initializations. To cover
hot-plugged memory nodes, `mt_calc_adistance()` and
`mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to
handle cases where memtype is not initialized and where HMAT information
is available.
* Introduce `default_memory_types` for those memory types that are not
initialized by device drivers. Because late initialized memory and
default DRAM memory need to be managed, a default memory type is created
for storing all memory types that are not initialized by device drivers
and as a fallback.
Link: https://lkml.kernel.org/r/20240405000707.2670063-3-horenchuang@bytedance.com
Signed-off-by: Ho-Ren (Jack) Chuang <horenchuang@bytedance.com>
Signed-off-by: Hao Xiang <hao.xiang@bytedance.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Gregory Price <gourry.memverge@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Ravi Jonnalagadda <ravis.opensrc@micron.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawie.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-04 17:07:06 -07:00
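The late-init behavior described in this commit message can be illustrated with a minimal user-space sketch. It is not the kernel code: `struct node_sketch`, `enum tier_sketch`, and `late_init_sketch()` are hypothetical names invented for this example, and the tier assignment is reduced to a label. It shows only the two rules stated above: nodes already set up by a device driver are skipped, and a node without HMAT information falls back to the default DRAM tier.

```c
#include <stddef.h>

/* Hypothetical labels; the kernel tracks memory types and tiers instead. */
enum tier_sketch {
	TIER_UNSET = 0,		/* not initialized by anyone yet */
	TIER_DRIVER,		/* set up earlier by a device driver */
	TIER_DEFAULT_DRAM,	/* fallback when no HMAT data exists */
	TIER_FROM_HMAT		/* derived from HMAT performance data */
};

/* Hypothetical per-node state for this sketch only. */
struct node_sketch {
	int has_memory;		/* node has memory online */
	int has_hmat;		/* HMAT information is available */
	enum tier_sketch tier;
};

/*
 * Assign a tier to every memory node that is still uninitialized.
 * Returns the number of nodes assigned in this pass.
 */
static int late_init_sketch(struct node_sketch *nodes, size_t n)
{
	int assigned = 0;

	for (size_t i = 0; i < n; i++) {
		if (!nodes[i].has_memory)
			continue;
		/* A driver already configured this node: exclude it. */
		if (nodes[i].tier != TIER_UNSET)
			continue;
		/* No HMAT information: fall back to the default DRAM tier. */
		nodes[i].tier = nodes[i].has_hmat ? TIER_FROM_HMAT
						  : TIER_DEFAULT_DRAM;
		assigned++;
	}
	return assigned;
}
```

In the kernel, the final step after such a pass is the recalculation of the demotion tables, which this sketch leaves out.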
		 */
		if (node_memory_types[nid].memtype)
			continue;
memory tier: consolidate the initialization of memory tiers
The current memory tier initialization process is distributed across two
different functions, memory_tier_init() and memory_tier_late_init(). This
design is hard to maintain. Thus, this patch reduces the number of
possible code paths by consolidating the different initialization paths
into one.
The earlier discussion with Jonathan and Ying is listed here:
https://lore.kernel.org/lkml/20240405150244.00004b49@Huawei.com/
If we want to put these two initializations together, they must be placed
together in the later function, because only at that time is the HMAT
information ready, so that adist between nodes can be calculated and
memory tiering can be established based on the adist. So we move the
initialization in memory_tier_init() to the memory_tier_late_init() call.
Moreover, it's natural to keep memory_tier initialization in drivers at
device_initcall() level.
If we simply move the set_node_memory_tier() from memory_tier_init() to
late_initcall(), it will result in HMAT not registering the
mt_adistance_algorithm callback function, because set_node_memory_tier()
is not performed during the memory tiering initialization phase, leading
to a lack of correct default_dram information.
Therefore, we introduced a nodemask to pass the information of the default
DRAM nodes. The reason for not choosing to reuse default_dram_type->nodes
is that it is not clean enough. So in the end, we use a __initdata
variable, which is released once initialization is complete, that
includes both CPU and memory nodes for HMAT to iterate through.
Link: https://lkml.kernel.org/r/20240704072646.437579-1-horen.chuang@linux.dev
Signed-off-by: Ho-Ren (Jack) Chuang <horenchuang@bytedance.com>
Suggested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Gregory Price <gourry.memverge@gmail.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Ravi Jonnalagadda <ravis.opensrc@micron.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-04 00:26:44 -07:00
		memtier = set_node_memory_tier(nid);
		if (IS_ERR(memtier))
			continue;
	}

	establish_demotion_targets();
	put_online_mems();
	return 0;
}
late_initcall(memory_tier_late_init);
2023-12-21 15:02:37 -07:00
static void dump_hmem_attrs(struct access_coordinate *coord, const char *prefix)
2023-09-25 23:06:27 -07:00
{
	pr_info(
"%sread_latency: %u, write_latency: %u, read_bandwidth: %u, write_bandwidth: %u\n",
2023-12-21 15:02:37 -07:00
		prefix, coord->read_latency, coord->write_latency,
		coord->read_bandwidth, coord->write_bandwidth);
2023-09-25 23:06:27 -07:00
}
2023-12-21 15:02:37 -07:00
int mt_set_default_dram_perf(int nid, struct access_coordinate *perf,
2023-09-25 23:06:27 -07:00
			     const char *source)
{
	guard(mutex)(&default_dram_perf_lock);
	if (default_dram_perf_error)
		return -EIO;
2023-09-25 23:06:27 -07:00
	if (perf->read_latency + perf->write_latency == 0 ||
	    perf->read_bandwidth + perf->write_bandwidth == 0)
		return -EINVAL;
2023-09-25 23:06:27 -07:00
	if (default_dram_perf_ref_nid == NUMA_NO_NODE) {
		default_dram_perf = *perf;
		default_dram_perf_ref_nid = nid;
		default_dram_perf_ref_source = kstrdup(source, GFP_KERNEL);
		return 0;
2023-09-25 23:06:27 -07:00
	}

	/*
	 * The performance of all default DRAM nodes is expected to be
	 * same (that is, the variation is less than 10%). And it
	 * will be used as base to calculate the abstract distance of
	 * other memory nodes.
	 */
	if (abs(perf->read_latency - default_dram_perf.read_latency) * 10 >
	    default_dram_perf.read_latency ||
	    abs(perf->write_latency - default_dram_perf.write_latency) * 10 >
	    default_dram_perf.write_latency ||
	    abs(perf->read_bandwidth - default_dram_perf.read_bandwidth) * 10 >
	    default_dram_perf.read_bandwidth ||
	    abs(perf->write_bandwidth - default_dram_perf.write_bandwidth) * 10 >
	    default_dram_perf.write_bandwidth) {
		pr_info(
"memory-tiers: the performance of DRAM node %d mismatches that of the reference\n"
"DRAM node %d.\n", nid, default_dram_perf_ref_nid);
2024-09-19 18:47:40 -07:00
		pr_info(" performance of reference DRAM node %d from %s:\n",
			default_dram_perf_ref_nid, default_dram_perf_ref_source);
2023-09-25 23:06:27 -07:00
		dump_hmem_attrs(&default_dram_perf, " ");
2024-09-19 18:47:40 -07:00
		pr_info(" performance of DRAM node %d from %s:\n", nid, source);
2023-09-25 23:06:27 -07:00
		dump_hmem_attrs(perf, " ");
		pr_info(
" disable default DRAM node performance based abstract distance algorithm.\n");
		default_dram_perf_error = true;
		return -EINVAL;
2023-09-25 23:06:27 -07:00
	}
	return 0;
2023-09-25 23:06:27 -07:00
}
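The 10% tolerance stated in the comment inside mt_set_default_dram_perf() can be checked in isolation. The following is a user-space sketch under stated assumptions: `struct perf_sketch` is a hypothetical stand-in for `struct access_coordinate`, and `deviates_over_10pct()` and `perf_mismatch()` are illustrative helpers that do not exist in the kernel. The absolute difference is computed by explicit comparison here rather than with the kernel's `abs()` macro.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the four fields compared in the kernel code. */
struct perf_sketch {
	unsigned int read_latency;
	unsigned int write_latency;
	unsigned int read_bandwidth;
	unsigned int write_bandwidth;
};

/* True when |val - ref| * 10 > ref, i.e. val differs from ref by more than 10%. */
static bool deviates_over_10pct(unsigned int val, unsigned int ref)
{
	unsigned long long diff = val > ref ? val - ref : ref - val;

	return diff * 10 > ref;
}

/* True when any of the four attributes is outside the 10% tolerance. */
static bool perf_mismatch(const struct perf_sketch *perf,
			  const struct perf_sketch *ref)
{
	return deviates_over_10pct(perf->read_latency, ref->read_latency) ||
	       deviates_over_10pct(perf->write_latency, ref->write_latency) ||
	       deviates_over_10pct(perf->read_bandwidth, ref->read_bandwidth) ||
	       deviates_over_10pct(perf->write_bandwidth, ref->write_bandwidth);
}
```

Multiplying the difference by 10 and comparing it against the reference value keeps the check in integer arithmetic, so no division or floating point is needed.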
2023-12-21 15:02:37 -07:00
int mt_perf_to_adistance(struct access_coordinate *perf, int *adist)
2023-09-25 23:06:27 -07:00
{
memory tier: create CPUless memory tiers after obtaining HMAT info
The current implementation treats emulated memory devices, such as CXL1.1
type3 memory, as normal DRAM when they are emulated as normal memory
(E820_TYPE_RAM). However, these emulated devices have different
characteristics than traditional DRAM, so it is important to distinguish
them. Thus, we modify the tiered memory initialization process to defer
memory tier initialization for CPUless NUMA nodes until HMAT information
has been obtained during the boot process. Finally, demotion tables are
recalculated at the end.
* late_initcall(memory_tier_late_init);
Some device drivers may have initialized memory tiers between
`memory_tier_init()` and `memory_tier_late_init()`, potentially bringing
memory nodes online and configuring memory tiers. Those nodes should be
excluded in the late init.
* Handle cases where there is no HMAT when creating memory tiers
There is a scenario where a CPUless node does not provide HMAT
information. If no HMAT is specified, the node falls back to the
default DRAM tier.
* Introduce another new lock `default_dram_perf_lock` for adist
calculation
In the current implementation, iterating through CPUless nodes requires
holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end
up trying to acquire the same lock, leading to a potential deadlock.
Therefore, we propose introducing a standalone `default_dram_perf_lock`
to protect `default_dram_perf_*`. This approach not only avoids the
deadlock but also avoids holding one large lock throughout.
* Upgrade `set_node_memory_tier()` to support additional cases, including
default DRAM, late CPUless, and hot-plugged initializations. To cover
hot-plugged memory nodes, `mt_calc_adistance()` and
`mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to
handle cases where memtype is not initialized and where HMAT information
is available.
* Introduce `default_memory_types` for those memory types that are not
initialized by device drivers. Because late initialized memory and
default DRAM memory need to be managed, a default memory type list is
created to store all memory types that are not initialized by device
drivers and to serve as a fallback.
Link: https://lkml.kernel.org/r/20240405000707.2670063-3-horenchuang@bytedance.com
Signed-off-by: Ho-Ren (Jack) Chuang <horenchuang@bytedance.com>
Signed-off-by: Hao Xiang <hao.xiang@bytedance.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Gregory Price <gourry.memverge@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Ravi Jonnalagadda <ravis.opensrc@micron.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawie.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-04 17:07:06 -07:00
	guard(mutex)(&default_dram_perf_lock);
2023-09-25 23:06:27 -07:00
	if (default_dram_perf_error)
		return -EIO;

	if (perf->read_latency + perf->write_latency == 0 ||
	    perf->read_bandwidth + perf->write_bandwidth == 0)
		return -EINVAL;

memory tier: create CPUless memory tiers after obtaining HMAT info
2024-04-04 17:07:06 -07:00
	if (default_dram_perf_ref_nid == NUMA_NO_NODE)
		return -ENOENT;

2023-09-25 23:06:27 -07:00
	/*
	 * The abstract distance of a memory node is in direct proportion to
	 * its memory latency (read + write) and inversely proportional to its
	 * memory bandwidth (read + write). The abstract distance, memory
	 * latency, and memory bandwidth of the default DRAM nodes are used as
	 * the base.
	 */
	*adist = MEMTIER_ADISTANCE_DRAM *
		(perf->read_latency + perf->write_latency) /
		(default_dram_perf.read_latency + default_dram_perf.write_latency) *
		(default_dram_perf.read_bandwidth + default_dram_perf.write_bandwidth) /
		(perf->read_bandwidth + perf->write_bandwidth);

	return 0;
}
EXPORT_SYMBOL_GPL(mt_perf_to_adistance);

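The proportionality described in the comment above can be checked with a small user-space sketch. Everything in it is illustrative rather than kernel code: `struct perf_sketch`, `perf_to_adistance_sketch()` and the explicit `dram_adist` parameter are hypothetical stand-ins for `struct access_coordinate`, `mt_perf_to_adistance()` and `MEMTIER_ADISTANCE_DRAM`. There is no locking, the DRAM baseline is passed in instead of being read from `default_dram_perf`, and the extra check on the baseline is an assumption added only to keep the sketch free of division by zero.

```c
#include <errno.h>

/* Hypothetical stand-in for the four fields the formula reads. */
struct perf_sketch {
	unsigned int read_latency;
	unsigned int write_latency;
	unsigned int read_bandwidth;
	unsigned int write_bandwidth;
};

/*
 * Scale the DRAM abstract distance by (device latency / DRAM latency)
 * and by (DRAM bandwidth / device bandwidth), in the same order of
 * integer operations as the expression above.
 */
int perf_to_adistance_sketch(unsigned int dram_adist,
			     const struct perf_sketch *dram,
			     const struct perf_sketch *perf,
			     int *adist)
{
	unsigned int lat = perf->read_latency + perf->write_latency;
	unsigned int bw = perf->read_bandwidth + perf->write_bandwidth;
	unsigned int dram_lat = dram->read_latency + dram->write_latency;
	unsigned int dram_bw = dram->read_bandwidth + dram->write_bandwidth;

	/* Same rejection rule as the original: all-zero latency or bandwidth. */
	if (lat == 0 || bw == 0)
		return -EINVAL;
	/* Assumption of this sketch only: refuse an unusable baseline. */
	if (dram_lat == 0 || dram_bw == 0)
		return -EINVAL;

	*adist = (int)(dram_adist * lat / dram_lat * dram_bw / bw);
	return 0;
}
```

With a DRAM baseline of 100 + 100 latency units and 10000 + 10000 bandwidth units, a device with twice the latency and half the bandwidth comes out at four times the DRAM abstract distance. Because the arithmetic is integer arithmetic evaluated left to right, each division truncates before the next multiplication.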
memory tiering: add abstract distance calculation algorithms management
Patch series "memory tiering: calculate abstract distance based on ACPI
HMAT", v4.
We have the explicit memory tiers framework to manage systems with
multiple types of memory, e.g., DRAM in DIMM slots and CXL memory devices.
Memory devices of the same kind are grouped into memory types, which are
then put into memory tiers. To describe the performance of a memory type,
abstract distance is defined. It is in direct proportion to the memory
latency and inversely proportional to the memory bandwidth. To keep the
code as simple as possible, a fixed abstract distance is used in dax/kmem
to describe slow memory such as Optane DCPMM.
To support more memory types, in this series, we added the abstract
distance calculation algorithm management mechanism, provided an algorithm
implementation based on ACPI HMAT, and used the general abstract distance
calculation interface in the dax/kmem driver. So, dax/kmem can support HBM
(high bandwidth memory) in addition to the original Optane DCPMM.
This patch (of 4):
The abstract distance may be calculated by various drivers, such as ACPI
HMAT, CXL CDAT, etc., and it may be used by various code that hot-adds
memory nodes, such as dax/kmem. To decouple the algorithm users from
the providers, the abstract distance calculation algorithms management
mechanism is implemented in this patch. It provides an interface for the
providers to register their implementations, and an interface for the
users. Multiple algorithm implementations can cooperate by calculating
abstract distance for different memory nodes. The preference among
algorithm implementations can be specified via priority
(notifier_block.priority).
Link: https://lkml.kernel.org/r/20230926060628.265989-1-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20230926060628.265989-2-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Tested-by: Bharata B Rao <bharata@amd.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Rafael J Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-09-25 23:06:25 -07:00
/**
 * register_mt_adistance_algorithm() - Register memory tiering abstract distance algorithm
 * @nb: The notifier block which describe the algorithm
 *
 * Return: 0 on success, errno on error.
 *
 * Every memory tiering abstract distance algorithm provider needs to
 * register the algorithm with register_mt_adistance_algorithm(). To
 * calculate the abstract distance for a specified memory node, the
 * notifier function will be called unless some high priority
 * algorithm has provided result. The prototype of the notifier
 * function is as follows,
 *
 *   int (*algorithm_notifier)(struct notifier_block *nb,
 *                             unsigned long nid, void *data);
 *
 * Where "nid" specifies the memory node, "data" is the pointer to the
 * returned abstract distance (that is, "int *adist"). If the
 * algorithm provides the result, NOTIFY_STOP should be returned.
 * Otherwise, return_value & %NOTIFY_STOP_MASK == 0 to allow the next
 * algorithm in the chain to provide the result.
 */
int register_mt_adistance_algorithm(struct notifier_block *nb)
{
	return blocking_notifier_chain_register(&mt_adistance_algorithms, nb);
}
EXPORT_SYMBOL_GPL(register_mt_adistance_algorithm);

/**
 * unregister_mt_adistance_algorithm() - Unregister memory tiering abstract distance algorithm
 * @nb: the notifier block which describe the algorithm
 *
 * Return: 0 on success, errno on error.
 */
int unregister_mt_adistance_algorithm(struct notifier_block *nb)
{
	return blocking_notifier_chain_unregister(&mt_adistance_algorithms, nb);
}
EXPORT_SYMBOL_GPL(unregister_mt_adistance_algorithm);

/**
 * mt_calc_adistance() - Calculate abstract distance with registered algorithms
 * @node: the node to calculate abstract distance for
 * @adist: the returned abstract distance
 *
 * Return: if return_value & %NOTIFY_STOP_MASK != 0, then some
 * abstract distance algorithm provides the result, and return it via
 * @adist. Otherwise, no algorithm can provide the result and @adist
 * will be kept as it is.
 */
int mt_calc_adistance(int node, int *adist)
{
	return blocking_notifier_call_chain(&mt_adistance_algorithms, node, adist);
}
EXPORT_SYMBOL_GPL(mt_calc_adistance);

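The register/call protocol documented above can be illustrated with a simplified user-space model of a priority-ordered chain. This is a sketch under stated assumptions, not the kernel's blocking notifier implementation: all `sk_`/`SK_` names are hypothetical, there is no locking, and the two providers (`sk_hmat_like()` and `sk_fallback()`) and their abstract distance values are invented for the example. The `SK_NOTIFY_*` values mirror the kernel's `NOTIFY_*` constants.

```c
#include <stddef.h>

#define SK_NOTIFY_DONE		0x0000
#define SK_NOTIFY_OK		0x0001
#define SK_NOTIFY_STOP_MASK	0x8000
#define SK_NOTIFY_STOP		(SK_NOTIFY_STOP_MASK | SK_NOTIFY_OK)

/* Hypothetical, simplified counterpart of struct notifier_block. */
struct sk_block {
	int (*call)(struct sk_block *nb, unsigned long nid, void *data);
	struct sk_block *next;
	int priority;
};

static struct sk_block *sk_chain;

/* Insert so that higher-priority algorithms are called first. */
void sk_register(struct sk_block *nb)
{
	struct sk_block **pos = &sk_chain;

	while (*pos && (*pos)->priority >= nb->priority)
		pos = &(*pos)->next;
	nb->next = *pos;
	*pos = nb;
}

/* Walk the chain until one algorithm reports that it provided a result. */
int sk_calc_adistance(int node, int *adist)
{
	int ret = SK_NOTIFY_DONE;

	for (struct sk_block *nb = sk_chain; nb; nb = nb->next) {
		ret = nb->call(nb, (unsigned long)node, adist);
		if (ret & SK_NOTIFY_STOP_MASK)
			break;
	}
	return ret;
}

/* Invented provider: only knows node 1, otherwise lets the chain continue. */
int sk_hmat_like(struct sk_block *nb, unsigned long nid, void *data)
{
	(void)nb;
	if (nid != 1)
		return SK_NOTIFY_OK;
	*(int *)data = 1024;
	return SK_NOTIFY_STOP;
}

/* Invented provider: answers for every node with a fixed value. */
int sk_fallback(struct sk_block *nb, unsigned long nid, void *data)
{
	(void)nb;
	(void)nid;
	*(int *)data = 2048;
	return SK_NOTIFY_STOP;
}
```

Registering `sk_fallback` at priority 0 and `sk_hmat_like` at priority 100 means node 1 is answered by the higher-priority provider, while any other node falls through to the fallback. A caller distinguishes "some algorithm answered" from "nobody answered" by testing the return value against `SK_NOTIFY_STOP_MASK`, as the `mt_calc_adistance()` comment describes.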
2022-08-18 06:10:35 -07:00
static int __meminit memtier_hotplug_callback(struct notifier_block *self,
					      unsigned long action, void *_arg)
{
2022-08-18 06:10:37 -07:00
	struct memory_tier *memtier;
2022-08-18 06:10:35 -07:00
	struct memory_notify *arg = _arg;

	/*
	 * Only update the node migration order when a node is
	 * changing status, like online->offline.
	 */
	if (arg->status_change_nid < 0)
		return notifier_from_errno(0);

	switch (action) {
	case MEM_OFFLINE:
		mutex_lock(&memory_tier_lock);
2022-08-18 06:10:37 -07:00
		if (clear_node_memory_tier(arg->status_change_nid))
			establish_demotion_targets();
2022-08-18 06:10:35 -07:00
		mutex_unlock(&memory_tier_lock);
		break;
	case MEM_ONLINE:
		mutex_lock(&memory_tier_lock);
2022-08-18 06:10:37 -07:00
		memtier = set_node_memory_tier(arg->status_change_nid);
		if (!IS_ERR(memtier))
			establish_demotion_targets();
2022-08-18 06:10:35 -07:00
		mutex_unlock(&memory_tier_lock);
		break;
	}

	return notifier_from_errno(0);
}

mm/demotion: add support for explicit memory tiers
Patch series "mm/demotion: Memory tiers and demotion", v15.
The current kernel has the basic memory tiering support: Inactive pages on
a higher tier NUMA node can be migrated (demoted) to a lower tier NUMA
node to make room for new allocations on the higher tier NUMA node.
Frequently accessed pages on a lower tier NUMA node can be migrated
(promoted) to a higher tier NUMA node to improve the performance.
In the current kernel, memory tiers are defined implicitly via a demotion
path relationship between NUMA nodes, which is created during the kernel
initialization and updated when a NUMA node is hot-added or hot-removed.
The current implementation puts all nodes with CPU into the highest tier,
and builds the tier hierarchy tier-by-tier by establishing the per-node
demotion targets based on the distances between nodes.
This current memory tier kernel implementation needs to be improved for
several important use cases:
* The current tier initialization code always initializes each
memory-only NUMA node into a lower tier. But a memory-only NUMA node
may have a high performance memory device (e.g. a DRAM-backed
memory-only node on a virtual machine) and that should be put into a
higher tier.
* The current tier hierarchy always puts CPU nodes into the top tier.
But on a system with HBM (e.g. GPU memory) devices, these memory-only
HBM NUMA nodes should be in the top tier, and DRAM nodes with CPUs are
better to be placed into the next lower tier.
* Also because the current tier hierarchy always puts CPU nodes into the
top tier, when a CPU is hot-added (or hot-removed) and triggers a memory
node from CPU-less into a CPU node (or vice versa), the memory tier
hierarchy gets changed, even though no memory node is added or removed.
This can make the tier hierarchy unstable and make it difficult to
support tier-based memory accounting.
* A higher tier node can only be demoted to nodes with shortest distance
on the next lower tier as defined by the demotion path, not any other
node from any lower tier. This strict demotion order does not work in
all use cases (e.g. some use cases may want to allow cross-socket
demotion to another node in the same demotion tier as a fallback when
the preferred demotion node is out of space), and has resulted in the
feature request for an interface to override the system-wide, per-node
demotion order from the userspace. This demotion order is also
inconsistent with the page allocation fallback order when all the nodes
in a higher tier are out of space: The page allocation can fall back to
any node from any lower tier, whereas the demotion order doesn't allow
that.
This patch series makes the creation of memory tiers explicit under the
control of device drivers.
Memory Tier Initialization
==========================
Linux kernel presents memory devices as NUMA nodes and each memory device
is of a specific type. The memory type of a device is represented by its
abstract distance. A memory tier corresponds to a range of abstract
distance. This allows for classifying memory devices with a specific
performance range into a memory tier.
By default, all memory nodes are assigned to the default tier with
abstract distance 512.
A device driver can move its memory nodes from the default tier. For
example, PMEM can move its memory nodes below the default tier, whereas
GPU can move its memory nodes above the default tier.
The kernel initialization code makes the decision on which exact tier a
memory node should be assigned to based on the requests from the device
drivers as well as the memory device hardware information provided by the
firmware.
Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
This patch (of 10):
In the current kernel, memory tiers are defined implicitly via a demotion
path relationship between NUMA nodes, which is created during the kernel
initialization and updated when a NUMA node is hot-added or hot-removed.
The current implementation puts all nodes with CPU into the highest tier,
and builds the tier hierarchy by establishing the per-node demotion
targets based on the distances between nodes.
This current memory tier kernel implementation needs to be improved for
several important use cases:
The current tier initialization code always initializes each memory-only
NUMA node into a lower tier. But a memory-only NUMA node may have a high
performance memory device (e.g. a DRAM-backed memory-only node on a
virtual machine) that should be put into a higher tier.
The current tier hierarchy always puts CPU nodes into the top tier. But
on a system with HBM or GPU devices, the memory-only NUMA nodes mapping
these devices should be in the top tier, and DRAM nodes with CPUs are
better placed into the next lower tier.
With the current kernel, a higher tier node can only be demoted to nodes
with the shortest distance on the next lower tier as defined by the
demotion path, not any other node from any lower tier. This strict
demotion order does not work in all use cases (e.g. some use cases may
want to allow cross-socket demotion to another node in the same demotion
tier as a fallback when the preferred demotion node is out of space). This
demotion order is also inconsistent with the page allocation fallback
order when all the nodes in a higher tier are out of space: The page
allocation can fall back to any node from any lower tier, whereas the
demotion order doesn't allow that.
This patch series addresses the above by defining memory tiers explicitly.
Linux kernel presents memory devices as NUMA nodes and each memory device
is of a specific type. The memory type of a device is represented by its
abstract distance. A memory tier corresponds to a range of abstract
distance. This allows for classifying memory devices with a specific
performance range into a memory tier.
This patch configures the range/chunk size to be 128. The default DRAM
abstract distance is 512. We can have 4 memory tiers below the default
DRAM with abstract distance ranges 0 - 127, 128 - 255, 256 - 383, 384 - 511.
Faster memory devices can be placed in these faster (higher) memory tiers.
Slower memory devices like persistent memory will have abstract distance
higher than the default DRAM level.
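The chunk arithmetic in the paragraph above can be sketched as follows. The names `SK_CHUNK_BITS`, `SK_CHUNK_SIZE`, `sk_tier_start()` and `sk_tier_index()` are hypothetical and chosen for this example only; the sketch assumes non-negative abstract distances and a power-of-two chunk size of 128, as stated in the message.

```c
/* Chunk size of 128 expressed as a power of two (assumed from the text). */
#define SK_CHUNK_BITS	7
#define SK_CHUNK_SIZE	(1 << SK_CHUNK_BITS)

/* First abstract distance of the chunk that contains adist. */
int sk_tier_start(int adist)
{
	return adist & ~(SK_CHUNK_SIZE - 1);
}

/* Zero-based index of that chunk, counting up from abstract distance 0. */
int sk_tier_index(int adist)
{
	return adist >> SK_CHUNK_BITS;
}
```

Under these assumptions, abstract distances 0 through 127 share chunk 0, 128 through 255 share chunk 1, and the default DRAM value of 512 starts chunk 4, which leaves exactly four chunks for memory faster than default DRAM.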
[akpm@linux-foundation.org: fix comment, per Aneesh]
Link: https://lkml.kernel.org/r/20220818131042.113280-1-aneesh.kumar@linux.ibm.com
Link: https://lkml.kernel.org/r/20220818131042.113280-2-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Wei Xu <weixugc@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Bharata B Rao <bharata@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hesham Almatary <hesham.almatary@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-18 06:10:33 -07:00
static int __init memory_tier_init(void)
{
memory tier: consolidate the initialization of memory tiers
The current memory tier initialization process is distributed across two
different functions, memory_tier_init() and memory_tier_late_init(). This
design is hard to maintain. Thus, this patch is proposed to reduce the
possible code paths by consolidating the different initialization paths
into one.
The earlier discussion with Jonathan and Ying is listed here:
https://lore.kernel.org/lkml/20240405150244.00004b49@Huawei.com/
If we want to put these two initializations together, they must be placed
together in the later function, because only at that time will the HMAT
information be ready, so that adist between nodes can be calculated and
memory tiering can be established based on the adist. So we move the
initialization at memory_tier_init() to the memory_tier_late_init() call.
Moreover, it's natural to keep memory_tier initialization in drivers at
device_initcall() level.
If we simply move the set_node_memory_tier() from memory_tier_init() to
late_initcall(), it will result in HMAT not registering the
mt_adistance_algorithm callback function, because set_node_memory_tier()
is not performed during the memory tiering initialization phase, leading
to a lack of correct default_dram information.
Therefore, we introduced a nodemask to pass the information of the default
DRAM nodes. The reason for not choosing to reuse default_dram_type->nodes
is that it is not clean enough. So in the end, we use a __initdata
variable, which is released once initialization is complete, that includes
both CPU and memory nodes for HMAT to iterate through.
Link: https://lkml.kernel.org/r/20240704072646.437579-1-horen.chuang@linux.dev
Signed-off-by: Ho-Ren (Jack) Chuang <horenchuang@bytedance.com>
Suggested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Gregory Price <gourry.memverge@gmail.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Ravi Jonnalagadda <ravis.opensrc@micron.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-04 00:26:44 -07:00
	int ret;
mm/demotion: add support for explicit memory tiers
2022-08-18 06:10:33 -07:00

2022-08-30 01:17:36 -07:00
	ret = subsys_virtual_register(&memory_tier_subsys, NULL);
	if (ret)
		panic("%s() failed to register memory tier subsystem\n", __func__);

2022-08-18 06:10:37 -07:00
#ifdef CONFIG_MIGRATION
	node_demotion = kcalloc(nr_node_ids, sizeof(struct demotion_nodes),
				GFP_KERNEL);
	WARN_ON(!node_demotion);
#endif
memory tier: consolidate the initialization of memory tiers
2024-07-04 00:26:44 -07:00
memory tier: fix deadlock warning while onlining pages
commit 823430c8e9d9 ("memory tier: consolidate the initialization of
memory tiers") introduces a locking change that uses guard(mutex) instead
of mutex_lock/unlock() for memory_tier_lock. It unexpectedly expanded the
locked region to include hotplug_memory_notifier(); as a result, lockdep
reports a possible ABBA deadlock. Exclude hotplug_memory_notifier() from
the locked region to fix it.
The deadlock scenario is as follows: when a memory online event occurs,
the memory notifier chain runs under the read lock of memory_chain.rwsem.
Meanwhile, the registration of the memory notifier in memory_tier_init()
acquires the write lock of memory_chain.rwsem while holding
memory_tier_lock. The memory online event then continues to invoke the
memory hotplug callback registered by memory_tier_init(). Since this
callback tries to acquire memory_tier_lock, a deadlock occurs.
In fact, this deadlock can't happen, because memory_tier_init() always
executes before memory online events happen: subsys_initcall() has a
higher priority than module_init().
[ 133.491106] WARNING: possible circular locking dependency detected
[ 133.493656] 6.11.0-rc2+ #146 Tainted: G O N
[ 133.504290] ------------------------------------------------------
[ 133.515194] (udev-worker)/1133 is trying to acquire lock:
[ 133.525715] ffffffff87044e28 (memory_tier_lock){+.+.}-{3:3}, at: memtier_hotplug_callback+0x383/0x4b0
[ 133.536449]
[ 133.536449] but task is already holding lock:
[ 133.549847] ffffffff875d3310 ((memory_chain).rwsem){++++}-{3:3}, at: blocking_notifier_call_chain+0x60/0xb0
[ 133.556781]
[ 133.556781] which lock already depends on the new lock.
[ 133.556781]
[ 133.569957]
[ 133.569957] the existing dependency chain (in reverse order) is:
[ 133.577618]
[ 133.577618] -> #1 ((memory_chain).rwsem){++++}-{3:3}:
[ 133.584997] down_write+0x97/0x210
[ 133.588647] blocking_notifier_chain_register+0x71/0xd0
[ 133.592537] register_memory_notifier+0x26/0x30
[ 133.596314] memory_tier_init+0x187/0x300
[ 133.599864] do_one_initcall+0x117/0x5d0
[ 133.603399] kernel_init_freeable+0xab0/0xeb0
[ 133.606986] kernel_init+0x28/0x2f0
[ 133.610312] ret_from_fork+0x59/0x90
[ 133.613652] ret_from_fork_asm+0x1a/0x30
[ 133.617012]
[ 133.617012] -> #0 (memory_tier_lock){+.+.}-{3:3}:
[ 133.623390] __lock_acquire+0x2efd/0x5c60
[ 133.626730] lock_acquire+0x1ce/0x580
[ 133.629757] __mutex_lock+0x15c/0x1490
[ 133.632731] mutex_lock_nested+0x1f/0x30
[ 133.635717] memtier_hotplug_callback+0x383/0x4b0
[ 133.638748] notifier_call_chain+0xbf/0x370
[ 133.641647] blocking_notifier_call_chain+0x76/0xb0
[ 133.644636] memory_notify+0x2e/0x40
[ 133.647427] online_pages+0x597/0x720
[ 133.650246] memory_subsys_online+0x4f6/0x7f0
[ 133.653107] device_online+0x141/0x1d0
[ 133.655831] online_memory_block+0x4d/0x60
[ 133.658616] walk_memory_blocks+0xc0/0x120
[ 133.661419] add_memory_resource+0x51d/0x6c0
[ 133.664202] add_memory_driver_managed+0xf5/0x180
[ 133.667060] dev_dax_kmem_probe+0x7f7/0xb40 [kmem]
[ 133.669949] dax_bus_probe+0x147/0x230
[ 133.672687] really_probe+0x27f/0xac0
[ 133.675463] __driver_probe_device+0x1f3/0x460
[ 133.678493] driver_probe_device+0x56/0x1b0
[ 133.681366] __driver_attach+0x277/0x570
[ 133.684149] bus_for_each_dev+0x145/0x1e0
[ 133.686937] driver_attach+0x49/0x60
[ 133.689673] bus_add_driver+0x2f3/0x6b0
[ 133.692421] driver_register+0x170/0x4b0
[ 133.695118] __dax_driver_register+0x141/0x1b0
[ 133.697910] dax_kmem_init+0x54/0xff0 [kmem]
[ 133.700794] do_one_initcall+0x117/0x5d0
[ 133.703455] do_init_module+0x277/0x750
[ 133.706054] load_module+0x5d1d/0x74f0
[ 133.708602] init_module_from_file+0x12c/0x1a0
[ 133.711234] idempotent_init_module+0x3f1/0x690
[ 133.713937] __x64_sys_finit_module+0x10e/0x1a0
[ 133.716492] x64_sys_call+0x184d/0x20d0
[ 133.719053] do_syscall_64+0x6d/0x140
[ 133.721537] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 133.724239]
[ 133.724239] other info that might help us debug this:
[ 133.724239]
[ 133.730832] Possible unsafe locking scenario:
[ 133.730832]
[ 133.735298] CPU0 CPU1
[ 133.737759] ---- ----
[ 133.740165] rlock((memory_chain).rwsem);
[ 133.742623] lock(memory_tier_lock);
[ 133.745357] lock((memory_chain).rwsem);
[ 133.748141] lock(memory_tier_lock);
[ 133.750489]
[ 133.750489] *** DEADLOCK ***
[ 133.750489]
[ 133.756742] 6 locks held by (udev-worker)/1133:
[ 133.759179] #0: ffff888207be6158 (&dev->mutex){....}-{3:3}, at: __driver_attach+0x26c/0x570
[ 133.762299] #1: ffffffff875b5868 (device_hotplug_lock){+.+.}-{3:3}, at: lock_device_hotplug+0x20/0x30
[ 133.765565] #2: ffff88820cf6a108 (&dev->mutex){....}-{3:3}, at: device_online+0x2f/0x1d0
[ 133.768978] #3: ffffffff86d08ff0 (cpu_hotplug_lock){++++}-{0:0}, at: mem_hotplug_begin+0x17/0x30
[ 133.772312] #4: ffffffff8702dfb0 (mem_hotplug_lock){++++}-{0:0}, at: mem_hotplug_begin+0x23/0x30
[ 133.775544] #5: ffffffff875d3310 ((memory_chain).rwsem){++++}-{3:3}, at: blocking_notifier_call_chain+0x60/0xb0
[ 133.779113]
[ 133.779113] stack backtrace:
[ 133.783728] CPU: 5 UID: 0 PID: 1133 Comm: (udev-worker) Tainted: G O N 6.11.0-rc2+ #146
[ 133.787220] Tainted: [O]=OOT_MODULE, [N]=TEST
[ 133.789948] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 133.793291] Call Trace:
[ 133.795826] <TASK>
[ 133.798284] dump_stack_lvl+0xea/0x150
[ 133.801025] dump_stack+0x19/0x20
[ 133.803609] print_circular_bug+0x477/0x740
[ 133.806341] check_noncircular+0x2f4/0x3e0
[ 133.809056] ? __pfx_check_noncircular+0x10/0x10
[ 133.811866] ? __pfx_lockdep_lock+0x10/0x10
[ 133.814670] ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
[ 133.817610] __lock_acquire+0x2efd/0x5c60
[ 133.820339] ? __pfx___lock_acquire+0x10/0x10
[ 133.823128] ? __dax_driver_register+0x141/0x1b0
[ 133.825926] ? do_one_initcall+0x117/0x5d0
[ 133.828648] lock_acquire+0x1ce/0x580
[ 133.831349] ? memtier_hotplug_callback+0x383/0x4b0
[ 133.834293] ? __pfx_lock_acquire+0x10/0x10
[ 133.837134] __mutex_lock+0x15c/0x1490
[ 133.839829] ? memtier_hotplug_callback+0x383/0x4b0
[ 133.842753] ? memtier_hotplug_callback+0x383/0x4b0
[ 133.845602] ? __this_cpu_preempt_check+0x21/0x30
[ 133.848438] ? __pfx___mutex_lock+0x10/0x10
[ 133.851200] ? __pfx_lock_acquire+0x10/0x10
[ 133.853935] ? global_dirty_limits+0xc0/0x160
[ 133.856699] ? __sanitizer_cov_trace_switch+0x58/0xa0
[ 133.859564] mutex_lock_nested+0x1f/0x30
[ 133.862251] ? mutex_lock_nested+0x1f/0x30
[ 133.864964] memtier_hotplug_callback+0x383/0x4b0
[ 133.867752] notifier_call_chain+0xbf/0x370
[ 133.870550] ? writeback_set_ratelimit+0xe8/0x160
[ 133.873372] blocking_notifier_call_chain+0x76/0xb0
[ 133.876311] memory_notify+0x2e/0x40
[ 133.879013] online_pages+0x597/0x720
[ 133.881686] ? irqentry_exit+0x3e/0xa0
[ 133.884397] ? __pfx_online_pages+0x10/0x10
[ 133.887244] ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
[ 133.890299] ? mhp_init_memmap_on_memory+0x7a/0x1c0
[ 133.893203] memory_subsys_online+0x4f6/0x7f0
[ 133.896099] ? __pfx_memory_subsys_online+0x10/0x10
[ 133.899039] ? xa_load+0x16d/0x2e0
[ 133.901667] ? __pfx_xa_load+0x10/0x10
[ 133.904366] ? __pfx_memory_subsys_online+0x10/0x10
[ 133.907218] device_online+0x141/0x1d0
[ 133.909845] online_memory_block+0x4d/0x60
[ 133.912494] walk_memory_blocks+0xc0/0x120
[ 133.915104] ? __pfx_online_memory_block+0x10/0x10
[ 133.917776] add_memory_resource+0x51d/0x6c0
[ 133.920404] ? __pfx_add_memory_resource+0x10/0x10
[ 133.923104] ? _raw_write_unlock+0x31/0x60
[ 133.925781] ? register_memory_resource+0x119/0x180
[ 133.928450] add_memory_driver_managed+0xf5/0x180
[ 133.931036] dev_dax_kmem_probe+0x7f7/0xb40 [kmem]
[ 133.933665] ? __pfx_dev_dax_kmem_probe+0x10/0x10 [kmem]
[ 133.936332] ? __pfx___up_read+0x10/0x10
[ 133.938878] dax_bus_probe+0x147/0x230
[ 133.941332] ? __pfx_dax_bus_probe+0x10/0x10
[ 133.943954] really_probe+0x27f/0xac0
[ 133.946387] ? __sanitizer_cov_trace_const_cmp1+0x1e/0x30
[ 133.949106] __driver_probe_device+0x1f3/0x460
[ 133.951704] ? parse_option_str+0x149/0x190
[ 133.954241] driver_probe_device+0x56/0x1b0
[ 133.956749] __driver_attach+0x277/0x570
[ 133.959228] ? __pfx___driver_attach+0x10/0x10
[ 133.961776] bus_for_each_dev+0x145/0x1e0
[ 133.964367] ? __pfx_bus_for_each_dev+0x10/0x10
[ 133.967019] ? __kasan_check_read+0x15/0x20
[ 133.969543] ? _raw_spin_unlock+0x31/0x60
[ 133.972132] driver_attach+0x49/0x60
[ 133.974536] bus_add_driver+0x2f3/0x6b0
[ 133.977044] driver_register+0x170/0x4b0
[ 133.979480] __dax_driver_register+0x141/0x1b0
[ 133.982126] ? __pfx_dax_kmem_init+0x10/0x10 [kmem]
[ 133.984724] dax_kmem_init+0x54/0xff0 [kmem]
[ 133.987284] ? __pfx_dax_kmem_init+0x10/0x10 [kmem]
[ 133.989965] do_one_initcall+0x117/0x5d0
[ 133.992506] ? __pfx_do_one_initcall+0x10/0x10
[ 133.995185] ? __kasan_kmalloc+0x88/0xa0
[ 133.997748] ? kasan_poison+0x3e/0x60
[ 134.000288] ? kasan_unpoison+0x2c/0x60
[ 134.002762] ? kasan_poison+0x3e/0x60
[ 134.005202] ? __asan_register_globals+0x62/0x80
[ 134.007753] ? __pfx_dax_kmem_init+0x10/0x10 [kmem]
[ 134.010439] do_init_module+0x277/0x750
[ 134.012953] load_module+0x5d1d/0x74f0
[ 134.015406] ? __pfx_load_module+0x10/0x10
[ 134.017887] ? __pfx_ima_post_read_file+0x10/0x10
[ 134.020470] ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
[ 134.023127] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
[ 134.025767] ? security_kernel_post_read_file+0xa2/0xd0
[ 134.028429] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
[ 134.031162] ? kernel_read_file+0x503/0x820
[ 134.033645] ? __pfx_kernel_read_file+0x10/0x10
[ 134.036232] ? __pfx___lock_acquire+0x10/0x10
[ 134.038766] init_module_from_file+0x12c/0x1a0
[ 134.041291] ? init_module_from_file+0x12c/0x1a0
[ 134.043936] ? __pfx_init_module_from_file+0x10/0x10
[ 134.046516] ? __this_cpu_preempt_check+0x21/0x30
[ 134.049091] ? __kasan_check_read+0x15/0x20
[ 134.051551] ? do_raw_spin_unlock+0x60/0x210
[ 134.054077] idempotent_init_module+0x3f1/0x690
[ 134.056643] ? __pfx_idempotent_init_module+0x10/0x10
[ 134.059318] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
[ 134.061995] ? __fget_light+0x17d/0x210
[ 134.064428] __x64_sys_finit_module+0x10e/0x1a0
[ 134.066976] x64_sys_call+0x184d/0x20d0
[ 134.069405] do_syscall_64+0x6d/0x140
[ 134.071926] entry_SYSCALL_64_after_hwframe+0x76/0x7e
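The circular dependency in the report above can be illustrated with a small userspace sketch. It is not lockdep and uses no kernel API; the lock identifiers and the helper names `record_dependency()`, `reachable()` and `reset_dependencies()` are hypothetical. It only shows the idea of recording "B was taken while A was held" edges and flagging an edge that closes a cycle.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical lock identifiers for this sketch. */
enum lock_id { MEMORY_TIER_LOCK, MEMORY_CHAIN_RWSEM, NR_LOCKS };

/* depends[a][b] is true when lock b was acquired while a was held. */
static bool depends[NR_LOCKS][NR_LOCKS];

static void reset_dependencies(void)
{
	memset(depends, 0, sizeof(depends));
}

/* Depth-first search: can 'to' be reached from 'from' via recorded edges? */
static bool reachable(enum lock_id from, enum lock_id to, bool *seen)
{
	if (from == to)
		return true;
	seen[from] = true;
	for (int next = 0; next < NR_LOCKS; next++)
		if (depends[from][next] && !seen[next] &&
		    reachable((enum lock_id)next, to, seen))
			return true;
	return false;
}

/*
 * Record "acquired was taken while held was held". Returns true when the
 * new edge closes a cycle, i.e. an ABBA ordering has been observed.
 */
static bool record_dependency(enum lock_id held, enum lock_id acquired)
{
	bool seen[NR_LOCKS] = { false };
	bool cycle = reachable(acquired, held, seen);

	depends[held][acquired] = true;
	return cycle;
}
```

In this model, registering the notifier while holding memory_tier_lock records the edge memory_tier_lock -> memory_chain.rwsem, and the hotplug callback later records the reverse edge, which closes the cycle. Registering the notifier outside the locked region means the first edge is never recorded, so the callback's edge alone reports no cycle.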
[yanfei.xu@intel.com: add mutex_lock/unlock() pair back]
Link: https://lkml.kernel.org/r/20240830102447.1445296-1-yanfei.xu@intel.com
Link: https://lkml.kernel.org/r/20240827113614.1343049-1-yanfei.xu@intel.com
Fixes: 823430c8e9d9 ("memory tier: consolidate the initialization of memory tiers")
Signed-off-by: Yanfei Xu <yanfei.xu@intel.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Ho-Ren (Jack) Chuang <horen.chuang@linux.dev>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-08-27 04:36:14 -07:00
	mutex_lock(&memory_tier_lock);
2022-08-18 06:10:36 -07:00
	/*
	 * For now we can have 4 faster memory tiers with smaller adistance
	 * than default DRAM tier.
	 */
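The comment above suggests that abstract distances are grouped into fixed-size ranges, with room for four ranges below the one holding default DRAM. A minimal sketch of that arithmetic follows. The three constants are assumptions modelled on include/linux/memory-tiers.h; they are not quoted anywhere in this view and may differ between kernel versions. The helper names are hypothetical.

```c
/*
 * Assumed values, modelled on include/linux/memory-tiers.h. They are not
 * quoted in the surrounding text and may differ between kernel versions.
 */
#define MEMTIER_CHUNK_BITS	7
#define MEMTIER_CHUNK_SIZE	(1 << MEMTIER_CHUNK_BITS)
#define MEMTIER_ADISTANCE_DRAM	((4 * MEMTIER_CHUNK_SIZE) + (MEMTIER_CHUNK_SIZE >> 1))

/* Start of the chunk-sized adistance range (tier) containing adistance. */
static int tier_adistance_start(int adistance)
{
	return adistance & ~(MEMTIER_CHUNK_SIZE - 1);
}

/* Number of whole chunks, and so possible tiers, below the DRAM tier. */
static int faster_tiers_than_dram(void)
{
	return tier_adistance_start(MEMTIER_ADISTANCE_DRAM) / MEMTIER_CHUNK_SIZE;
}
```

Under these assumed constants the default DRAM adistance sits in the middle of the fifth chunk, which leaves four chunks with smaller abstract distance for faster memory types.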
memory tier: create CPUless memory tiers after obtaining HMAT info
The current implementation treats emulated memory devices, such as CXL1.1
type3 memory, as normal DRAM when they are emulated as normal memory
(E820_TYPE_RAM). However, these emulated devices have different
characteristics than traditional DRAM, making it important to distinguish
them. Thus, we modify the tiered memory initialization process to defer
memory tier initialization for CPUless NUMA nodes until HMAT information
is obtained during the boot process. Finally, demotion tables are
recalculated at the end.
* late_initcall(memory_tier_late_init);
  Some device drivers may have initialized memory tiers between
  `memory_tier_init()` and `memory_tier_late_init()`, potentially bringing
  online memory nodes and configuring memory tiers. They should be
  excluded in the late init.
* Handle cases where there is no HMAT when creating memory tiers
  There is a scenario where a CPUless node does not provide HMAT
  information. If no HMAT is specified, it falls back to using the
  default DRAM tier.
* Introduce another new lock `default_dram_perf_lock` for adist
  calculation
  In the current implementation, iterating through CPUless nodes requires
  holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end
  up trying to acquire the same lock, leading to a potential deadlock.
  Therefore, we propose introducing a standalone `default_dram_perf_lock`
  to protect `default_dram_perf_*`. This approach not only avoids
  deadlock but also avoids holding one large lock for both purposes.
* Upgrade `set_node_memory_tier` to support additional cases, including
  default DRAM, late CPUless, and hot-plugged initializations. To cover
  hot-plugged memory nodes, `mt_calc_adistance()` and
  `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to
  handle cases where memtype is not initialized and where HMAT information
  is available.
* Introduce `default_memory_types` for those memory types that are not
  initialized by device drivers. Because late initialized memory and
  default DRAM memory need to be managed, a default memory type is created
  for storing all memory types that are not initialized by device drivers
  and as a fallback.
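The fallback described in the list above (use the default DRAM tier when no HMAT data is available for a node) can be sketched as follows. This is a simplified userspace model, not the kernel's `mt_calc_adistance()`: the callback type, `node_adistance()`, `example_hmat_algorithm()` and the value of `SKETCH_ADISTANCE_DRAM` are all hypothetical and chosen only for illustration.

```c
#include <stddef.h>

/* Assumed default DRAM abstract distance; hypothetical value for this sketch. */
#define SKETCH_ADISTANCE_DRAM 576

/* Hypothetical callback type: returns 1 and sets *adist if it knows the node. */
typedef int (*adist_algorithm_t)(int node, int *adist);

/*
 * Start from the default DRAM abstract distance and let an optional
 * algorithm override it. With no algorithm, or one that has no data for
 * the node, the default is returned unchanged.
 */
static int node_adistance(int node, adist_algorithm_t algorithm)
{
	int adist = SKETCH_ADISTANCE_DRAM;

	if (algorithm)
		algorithm(node, &adist);
	return adist;
}

/* Example algorithm: only node 2 has performance data in this sketch. */
static int example_hmat_algorithm(int node, int *adist)
{
	if (node != 2)
		return 0;
	*adist = 5 * 128 + 64;	/* slower than DRAM: larger adistance */
	return 1;
}
```

Initialising the result to the default before consulting the algorithm keeps the "no HMAT" case and the "HMAT has nothing for this node" case on the same code path.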
Link: https://lkml.kernel.org/r/20240405000707.2670063-3-horenchuang@bytedance.com
Signed-off-by: Ho-Ren (Jack) Chuang <horenchuang@bytedance.com>
Signed-off-by: Hao Xiang <hao.xiang@bytedance.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Gregory Price <gourry.memverge@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Ravi Jonnalagadda <ravis.opensrc@micron.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawie.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-04 17:07:06 -07:00
	default_dram_type = mt_find_alloc_memory_type(MEMTIER_ADISTANCE_DRAM,
						      &default_memory_types);
memory tier: fix deadlock warning while onlining pages
2024-08-27 04:36:14 -07:00
	mutex_unlock(&memory_tier_lock);
2022-11-09 20:07:51 -07:00
	if (IS_ERR(default_dram_type))
2022-08-18 06:10:36 -07:00
		panic("%s() failed to allocate default DRAM tier\n", __func__);
memory tier: consolidate the initialization of memory tiers
2024-07-04 00:26:44 -07:00
	/* Record nodes with memory and CPU to set default DRAM performance. */
	nodes_and(default_dram_nodes, node_states[N_MEMORY],
		  node_states[N_CPU]);
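The nodes_and() call above keeps only the nodes that appear in both the N_MEMORY and N_CPU sets. A minimal userspace sketch of that intersection follows, using a 64-bit word as a hypothetical stand-in for the kernel's nodemask_t; the `_sketch` names are not kernel identifiers.

```c
#include <stdint.h>

/* Hypothetical stand-in for nodemask_t: bit n set means node n is in the set. */
typedef uint64_t nodemask_sketch_t;

/* Intersection of two node sets, the operation nodes_and() performs. */
static nodemask_sketch_t nodes_and_sketch(nodemask_sketch_t a, nodemask_sketch_t b)
{
	return a & b;
}

/* Returns 1 if the given node is a member of the mask, 0 otherwise. */
static int node_isset_sketch(int node, nodemask_sketch_t mask)
{
	return (int)((mask >> node) & 1);
}
```

For example, if nodes 0 to 2 have memory but only nodes 0 and 1 have CPUs, the result contains nodes 0 and 1, and the memory-only node 2 is left out of the default DRAM set.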
mm/demotion: add support for explicit memory tiers
Patch series "mm/demotion: Memory tiers and demotion", v15.
The current kernel has the basic memory tiering support: Inactive pages on
a higher tier NUMA node can be migrated (demoted) to a lower tier NUMA
node to make room for new allocations on the higher tier NUMA node.
Frequently accessed pages on a lower tier NUMA node can be migrated
(promoted) to a higher tier NUMA node to improve the performance.
In the current kernel, memory tiers are defined implicitly via a demotion
path relationship between NUMA nodes, which is created during the kernel
initialization and updated when a NUMA node is hot-added or hot-removed.
The current implementation puts all nodes with CPU into the highest tier,
and builds the tier hierarchy tier-by-tier by establishing the per-node
demotion targets based on the distances between nodes.
This current memory tier kernel implementation needs to be improved for
several important use cases:
* The current tier initialization code always initializes each
memory-only NUMA node into a lower tier. But a memory-only NUMA node
may have a high performance memory device (e.g. a DRAM-backed
memory-only node on a virtual machine) and that should be put into a
higher tier.
* The current tier hierarchy always puts CPU nodes into the top tier.
But on a system with HBM (e.g. GPU memory) devices, these memory-only
HBM NUMA nodes should be in the top tier, and DRAM nodes with CPUs are
better to be placed into the next lower tier.
* Also because the current tier hierarchy always puts CPU nodes into the
top tier, when a CPU is hot-added (or hot-removed) and triggers a memory
node from CPU-less into a CPU node (or vice versa), the memory tier
hierarchy gets changed, even though no memory node is added or removed.
This can make the tier hierarchy unstable and make it difficult to
support tier-based memory accounting.
* A higher tier node can only be demoted to nodes with shortest distance
on the next lower tier as defined by the demotion path, not any other
node from any lower tier. This strict demotion order does not work in
all use cases (e.g. some use cases may want to allow cross-socket
demotion to another node in the same demotion tier as a fallback when
the preferred demotion node is out of space), and has resulted in the
feature request for an interface to override the system-wide, per-node
demotion order from the userspace. This demotion order is also
inconsistent with the page allocation fallback order when all the nodes
in a higher tier are out of space: The page allocation can fall back to
any node from any lower tier, whereas the demotion order doesn't allow
that.
This patch series makes the creation of memory tiers explicit, under the
control of device drivers.
Memory Tier Initialization
==========================
The Linux kernel presents memory devices as NUMA nodes and each memory device
is of a specific type. The memory type of a device is represented by its
abstract distance. A memory tier corresponds to a range of abstract
distance. This allows for classifying memory devices with a specific
performance range into a memory tier.
By default, all memory nodes are assigned to the default tier with
abstract distance 512.
A device driver can move its memory nodes from the default tier. For
example, PMEM can move its memory nodes below the default tier, whereas
GPU can move its memory nodes above the default tier.
The kernel initialization code makes the decision on which exact tier a
memory node should be assigned to based on the requests from the device
drivers as well as the memory device hardware information provided by the
firmware.
Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
This patch (of 10):
In the current kernel, memory tiers are defined implicitly via a demotion
path relationship between NUMA nodes, which is created during the kernel
initialization and updated when a NUMA node is hot-added or hot-removed.
The current implementation puts all nodes with CPU into the highest tier,
and builds the tier hierarchy by establishing the per-node demotion
targets based on the distances between nodes.
This current memory tier kernel implementation needs to be improved for
several important use cases:
The current tier initialization code always initializes each memory-only
NUMA node into a lower tier. But a memory-only NUMA node may have a high
performance memory device (e.g. a DRAM-backed memory-only node on a
virtual machine) that should be put into a higher tier.
The current tier hierarchy always puts CPU nodes into the top tier. But
on a system with HBM or GPU devices, the memory-only NUMA nodes mapping
these devices should be in the top tier, and DRAM nodes with CPUs are
better to be placed into the next lower tier.
With the current kernel, a higher tier node can only be demoted to nodes with
shortest distance on the next lower tier as defined by the demotion path,
not any other node from any lower tier. This strict demotion order does
not work in all use cases (e.g. some use cases may want to allow
cross-socket demotion to another node in the same demotion tier as a
fallback when the preferred demotion node is out of space). This demotion
order is also inconsistent with the page allocation fallback order when
all the nodes in a higher tier are out of space: The page allocation can
fall back to any node from any lower tier, whereas the demotion order
doesn't allow that.
This patch series addresses the above by defining memory tiers explicitly.
The Linux kernel presents memory devices as NUMA nodes and each memory device
is of a specific type. The memory type of a device is represented by its
abstract distance. A memory tier corresponds to a range of abstract
distance. This allows for classifying memory devices with a specific
performance range into a memory tier.
This patch configures the range/chunk size to be 128. The default DRAM
abstract distance is 512. We can have 4 memory tiers below the default
DRAM with abstract distance range 0 - 127, 128 - 255, 256 - 383, 384 - 511.
Faster memory devices can be placed in these faster (higher) memory tiers.
Slower memory devices like persistent memory will have abstract distance
higher than the default DRAM level.
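As a rough illustration of the chunking described above, the following C
sketch maps an abstract distance to the 128-wide chunk it falls into. The
macro and function names are hypothetical and chosen for this example only;
the chunk size (128) and the default DRAM abstract distance (512) are the
only values taken from the text, and the actual patch may structure this
differently.

```c
#include <stdio.h>

/* Hypothetical names; only the numeric values come from the commit text. */
#define EX_CHUNK_BITS      7
#define EX_CHUNK_SIZE      (1 << EX_CHUNK_BITS)   /* 128 */
#define EX_ADISTANCE_DRAM  (4 * EX_CHUNK_SIZE)    /* 512 */

/* Index of the 128-wide chunk that an abstract distance falls into. */
static int ex_adistance_to_chunk(int adistance)
{
	return adistance >> EX_CHUNK_BITS;
}

/* First abstract distance covered by the chunk holding 'adistance'. */
static int ex_chunk_start(int adistance)
{
	return ex_adistance_to_chunk(adistance) << EX_CHUNK_BITS;
}

/* Nonzero when a device sits in a chunk below (faster than) default DRAM. */
static int ex_is_faster_than_dram(int adistance)
{
	return ex_adistance_to_chunk(adistance) <
	       ex_adistance_to_chunk(EX_ADISTANCE_DRAM);
}

/* Print the chunk boundaries for one abstract distance. */
static void ex_describe(int adistance)
{
	int start = ex_chunk_start(adistance);

	printf("adistance %d -> chunk %d (%d - %d)\n", adistance,
	       ex_adistance_to_chunk(adistance), start,
	       start + EX_CHUNK_SIZE - 1);
}
```

Under these assumptions, abstract distances 0 through 511 fall into chunks
0 through 3 (the four faster tiers), and the default DRAM value 512 is the
first distance of chunk 4.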
[akpm@linux-foundation.org: fix comment, per Aneesh]
Link: https://lkml.kernel.org/r/20220818131042.113280-1-aneesh.kumar@linux.ibm.com
Link: https://lkml.kernel.org/r/20220818131042.113280-2-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Wei Xu <weixugc@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Bharata B Rao <bharata@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hesham Almatary <hesham.almatary@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-18 06:10:33 -07:00
2022-09-22 20:33:47 -07:00
	hotplug_memory_notifier(memtier_hotplug_callback, MEMTIER_HOTPLUG_PRI);
2022-08-18 06:10:33 -07:00
	return 0;
}
subsys_initcall(memory_tier_init);
2022-08-18 06:10:34 -07:00
bool numa_demotion_enabled = false;

#ifdef CONFIG_MIGRATION
#ifdef CONFIG_SYSFS
2023-07-14 20:51:11 -07:00
static ssize_t demotion_enabled_show(struct kobject *kobj,
				     struct kobj_attribute *attr, char *buf)
2022-08-18 06:10:34 -07:00
{
2024-08-26 19:45:16 -07:00
	return sysfs_emit(buf, "%s\n", str_true_false(numa_demotion_enabled));
2022-08-18 06:10:34 -07:00
}

2023-07-14 20:51:11 -07:00
static ssize_t demotion_enabled_store(struct kobject *kobj,
				      struct kobj_attribute *attr,
				      const char *buf, size_t count)
2022-08-18 06:10:34 -07:00
{
	ssize_t ret;

	ret = kstrtobool(buf, &numa_demotion_enabled);
	if (ret)
		return ret;

	return count;
}

static struct kobj_attribute numa_demotion_enabled_attr =
2023-07-14 20:51:11 -07:00
	__ATTR_RW(demotion_enabled);
2022-08-18 06:10:34 -07:00

static struct attribute *numa_attrs[] = {
	&numa_demotion_enabled_attr.attr,
	NULL,
};

static const struct attribute_group numa_attr_group = {
	.attrs = numa_attrs,
};

static int __init numa_init_sysfs(void)
{
	int err;
	struct kobject *numa_kobj;

	numa_kobj = kobject_create_and_add("numa", mm_kobj);
	if (!numa_kobj) {
		pr_err("failed to create numa kobject\n");
		return -ENOMEM;
	}
	err = sysfs_create_group(numa_kobj, &numa_attr_group);
	if (err) {
		pr_err("failed to register numa group\n");
		goto delete_obj;
	}
	return 0;

delete_obj:
	kobject_put(numa_kobj);
	return err;
}
subsys_initcall(numa_init_sysfs);
#endif /* CONFIG_SYSFS */
#endif
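The sysfs attribute registered above can be exercised from user space. The
following is a minimal sketch, not part of the kernel source: it assumes
that mm_kobj corresponds to /sys/kernel/mm, so the attribute appears as
/sys/kernel/mm/numa/demotion_enabled, and it relies on the show handler
emitting "true" or "false" followed by a newline. Both helper names are
hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Assumed location: the "numa" kobject under mm_kobj (/sys/kernel/mm). */
#define EX_DEMOTION_PATH "/sys/kernel/mm/numa/demotion_enabled"

/*
 * Hypothetical helper: interpret the text produced by the show handler.
 * Returns 1 for "true", 0 for "false", -1 for anything else.
 */
static int ex_parse_demotion_enabled(const char *text)
{
	if (strncmp(text, "true", 4) == 0 &&
	    (text[4] == '\n' || text[4] == '\0'))
		return 1;
	if (strncmp(text, "false", 5) == 0 &&
	    (text[5] == '\n' || text[5] == '\0'))
		return 0;
	return -1;
}

/*
 * Hypothetical helper: read the attribute at 'path'.
 * Returns 1 or 0 for the reported state, -1 if the file cannot be
 * opened or read, or if its content is not recognized.
 */
static int ex_read_demotion_enabled(const char *path)
{
	char buf[16] = { 0 };
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return ex_parse_demotion_enabled(buf);
}
```

A caller would pass EX_DEMOTION_PATH to ex_read_demotion_enabled() and treat
-1 as "attribute unavailable", which is the expected result on kernels built
without CONFIG_MIGRATION or CONFIG_SYSFS.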