e2b67859ab
Add a mechanism that allows to inject a heartbeat error for testing purposes. A new attribute `inject_error` is added to debugfs for each QAT device. Upon a write on this attribute, the driver will inject an error on the device which can then be detected by the heartbeat feature. Errors are breaking the device functionality thus they require a device reset in order to be recovered. This functionality is not compiled by default, to enable it CRYPTO_DEV_QAT_ERROR_INJECTION must be set. Signed-off-by: Damian Muszynski <damian.muszynski@intel.com> Reviewed-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com> Reviewed-by: Lucas Segarra Fernandez <lucas.segarra.fernandez@intel.com> Reviewed-by: Ahsan Atta <ahsan.atta@intel.com> Reviewed-by: Markas Rapoportas <markas.rapoportas@intel.com> Signed-off-by: Mun Chun Yep <mun.chun.yep@intel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
110 lines
4.0 KiB
Plaintext
110 lines
4.0 KiB
Plaintext
What: /sys/kernel/debug/qat_<device>_<BDF>/fw_counters
|
||
Date: November 2023
|
||
KernelVersion: 6.6
|
||
Contact: qat-linux@intel.com
|
||
Description: (RO) Read returns the number of requests sent to the FW and the number of responses
|
||
received from the FW for each Acceleration Engine
|
||
Reported firmware counters::
|
||
|
||
<N>: Number of requests sent from Acceleration Engine N to FW and responses
|
||
Acceleration Engine N received from FW
|
||
|
||
What: /sys/kernel/debug/qat_<device>_<BDF>/heartbeat/config
|
||
Date: November 2023
|
||
KernelVersion: 6.6
|
||
Contact: qat-linux@intel.com
|
||
Description: (RW) Read returns value of the Heartbeat update period.
|
||
Write to the file changes this period value.
|
||
|
||
This period should reflect planned polling interval of device
|
||
health status. High frequency Heartbeat monitoring wastes CPU cycles
|
||
but minimizes the customer’s system downtime. Also, if there are
|
||
large service requests that take some time to complete, high frequency
|
||
Heartbeat monitoring could result in false reports of unresponsiveness
|
||
and in those cases, period needs to be increased.
|
||
|
||
This parameter is effective only for c3xxx, c62x, dh895xcc devices.
|
||
4xxx has this value internally fixed to 200ms.
|
||
|
||
Default value is set to 500. Minimal allowed value is 200.
|
||
All values are expressed in milliseconds.
|
||
|
||
What: /sys/kernel/debug/qat_<device>_<BDF>/heartbeat/queries_failed
|
||
Date: November 2023
|
||
KernelVersion: 6.6
|
||
Contact: qat-linux@intel.com
|
||
Description: (RO) Read returns the number of times the device became unresponsive.
|
||
|
||
Attribute returns value of the counter which is incremented when
|
||
status query results negative.
|
||
|
||
What: /sys/kernel/debug/qat_<device>_<BDF>/heartbeat/queries_sent
|
||
Date: November 2023
|
||
KernelVersion: 6.6
|
||
Contact: qat-linux@intel.com
|
||
Description: (RO) Read returns the number of times the control process checked
|
||
if the device is responsive.
|
||
|
||
Attribute returns value of the counter which is incremented on
|
||
every status query.
|
||
|
||
What: /sys/kernel/debug/qat_<device>_<BDF>/heartbeat/status
|
||
Date: November 2023
|
||
KernelVersion: 6.6
|
||
Contact: qat-linux@intel.com
|
||
Description: (RO) Read returns the device health status.
|
||
|
||
Returns 0 when device is healthy or -1 when is unresponsive
|
||
or the query failed to send.
|
||
|
||
The driver does not monitor for Heartbeat. It is left for a user
|
||
to poll the status periodically.
|
||
|
||
What: /sys/kernel/debug/qat_<device>_<BDF>/pm_status
|
||
Date: January 2024
|
||
KernelVersion: 6.7
|
||
Contact: qat-linux@intel.com
|
||
Description: (RO) Read returns power management information specific to the
|
||
QAT device.
|
||
|
||
This attribute is only available for qat_4xxx devices.
|
||
|
||
What: /sys/kernel/debug/qat_<device>_<BDF>/cnv_errors
|
||
Date: January 2024
|
||
KernelVersion: 6.7
|
||
Contact: qat-linux@intel.com
|
||
Description: (RO) Read returns, for each Acceleration Engine (AE), the number
|
||
of errors and the type of the last error detected by the device
|
||
when performing verified compression.
|
||
Reported counters::
|
||
|
||
<N>: Number of Compress and Verify (CnV) errors and type
|
||
of the last CnV error detected by Acceleration
|
||
Engine N.
|
||
|
||
What: /sys/kernel/debug/qat_<device>_<BDF>/heartbeat/inject_error
|
||
Date: March 2024
|
||
KernelVersion: 6.8
|
||
Contact: qat-linux@intel.com
|
||
Description: (WO) Write to inject an error that simulates an heartbeat
|
||
failure. This is to be used for testing purposes.
|
||
|
||
After writing this file, the driver stops arbitration on a
|
||
random engine and disables the fetching of heartbeat counters.
|
||
If a workload is running on the device, a job submitted to the
|
||
accelerator might not get a response and a read of the
|
||
`heartbeat/status` attribute might report -1, i.e. device
|
||
unresponsive.
|
||
The error is unrecoverable thus the device must be restarted to
|
||
restore its functionality.
|
||
|
||
This attribute is available only when the kernel is built with
|
||
CONFIG_CRYPTO_DEV_QAT_ERROR_INJECTION=y.
|
||
|
||
A write of 1 enables error injection.
|
||
|
||
The following example shows how to enable error injection::
|
||
|
||
# cd /sys/kernel/debug/qat_<device>_<BDF>
|
||
# echo 1 > heartbeat/inject_error
|