We’re experiencing unexpected shutdowns on a machine running Ubuntu 20.04 with an NVIDIA RTX 4090 GPU. The system uses NVIDIA driver 535.171.04 (as reported by nvidia-smi) and kernel version 5.15.0-1052-intel-iotg. The machine works fine for a while, then shuts down unexpectedly and does not restart automatically. We’ve noticed ACPI errors and thermal errors in the logs.
Please find the error logs below:
ACPI: thermal: Thermal Zone [TZ00] (28 C)
[ 0.993513] ACPI: video: Video Device [GFX0] (multi-head: yes rom: no post: no)
[ 3.596622] ACPI Warning: _SB.PC00.PEG1.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20210730/nsarguments-61)
[ 3.778368] ACPI BIOS Error (bug): Failure creating named object [_SB.PC00.PEG1.PEGP._DSM.USRG], AE_ALREADY_EXISTS (20210730/dsfield-184)
[ 3.778391] ACPI Error: AE_ALREADY_EXISTS, CreateBufferField failure (20210730/dswload2-477)
[ 3.778415] ACPI Error: Aborting method _SB.PC00.PEG1.PEGP._DSM due to previous error (AE_ALREADY_EXISTS) (20210730/psparse-529)
[ 3.778537] ACPI BIOS Error (bug): Failure creating named object [_SB.PC00.PEG1.PEGP._DSM.USRG], AE_ALREADY_EXISTS (20210730/dsfield-184)
[ 3.778556] ACPI Error: AE_ALREADY_EXISTS, CreateBufferField failure (20210730/dswload2-477)
ACPI: thermal: Thermal Zone [TZ00] (28 C)
[ 3.568606] thermal thermal_zone1: failed to read out thermal zone (-61)
nvidia: module verification failed: signature and/or required key missing - tainting kernel
I asked Copilot about it, and its answer points to the NVIDIA driver. I’m not sure whether it will help you.
The message you’ve shared seems to be related to thermal zones and module verification. Let’s break it down:
Thermal Zones:
In computing systems, thermal zones represent different areas where temperature sensors monitor the heat levels.
These zones help manage the system’s thermal performance by adjusting cooling mechanisms (such as fans or throttling) to prevent overheating.
The error message you’ve encountered, "thermal_zone1: failed to read out thermal zone (-61)", means the kernel could not read temperature data from that thermal zone. On Linux, error code -61 is ENODATA ("no data available").
Module Verification:
When loading kernel modules (such as drivers), the system verifies their integrity using cryptographic signatures.
If a module lacks a valid signature or required keys, the kernel may consider it untrusted or “tainted.”
The message "nvidia: module verification failed: signature and/or required key missing" indicates that the NVIDIA driver module is not signed with a key the kernel trusts. The module still loads, but the kernel is marked as tainted.
Possible Causes and Solutions:
Thermal Zone Error:
Check if the thermal sensor is functioning correctly.
Inspect the kernel logs (dmesg or /var/log/syslog) for more details.
Ensure that the thermal zone is properly configured in your system.
NVIDIA Module Error:
Verify that you’re using the correct NVIDIA driver version for your system.
Make sure the driver package is properly installed.
Check if Secure Boot is enabled; it might prevent loading unsigned modules.
If you’re using a custom kernel, ensure that it’s properly signed.
Consider reinstalling the NVIDIA driver or updating it to a compatible version.
Remember to consult official documentation or community forums for your specific hardware for detailed troubleshooting steps.
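To follow up on the thermal zone error above, here is a minimal Python sketch that reads every thermal zone the kernel exposes under /sys/class/thermal and reports which ones cannot be read. The function name read_thermal_zones and its base parameter are made up for illustration. It assumes the standard Linux sysfs layout, where each thermal_zoneN directory holds a type file and a temp file in millidegrees Celsius.

```python
import errno
from pathlib import Path


def read_thermal_zones(base="/sys/class/thermal"):
    """Return a list of (zone, type, temp_c or None, error or None)."""
    results = []
    for zone in sorted(Path(base).glob("thermal_zone*")):
        try:
            zone_type = (zone / "type").read_text().strip()
        except OSError:
            zone_type = "unknown"
        try:
            # sysfs reports the temperature in millidegrees Celsius
            millideg = int((zone / "temp").read_text().strip())
            results.append((zone.name, zone_type, millideg / 1000.0, None))
        except (OSError, ValueError) as exc:
            # errno 61 (ENODATA) matches the "(-61)" in the kernel log
            code = getattr(exc, "errno", None)
            name = errno.errorcode.get(code, type(exc).__name__)
            results.append((zone.name, zone_type, None, name))
    return results


if __name__ == "__main__":
    for name, zone_type, temp_c, error in read_thermal_zones():
        reading = f"{temp_c:.1f} C" if temp_c is not None else f"unreadable ({error})"
        print(f"{name} [{zone_type}]: {reading}")
```

A zone that shows up as unreadable here while the others report sane values would point to a sensor or firmware quirk in that one zone, not to the whole machine overheating.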
So, if you did not change any software, it is most likely a hardware problem.
You might try cleaning the dust from inside the case, especially anything blocking the fans.
Other than that, you probably have a defective component.
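If you want evidence before swapping parts, one option is to log GPU temperature, power draw and fan speed to disk until the next shutdown and then look at the last few readings. The sketch below is only an illustration. The function names, the log file name and the 5-second interval are my own choices, and it assumes nvidia-smi is on the PATH and supports the --query-gpu fields listed in QUERY.

```python
import os
import subprocess
import time

# Fields requested from nvidia-smi, in the order they appear in each CSV line
QUERY = "timestamp,temperature.gpu,power.draw,fan.speed"


def parse_sample(line):
    """Split one CSV line from nvidia-smi into a dict of strings."""
    fields = [part.strip() for part in line.split(",")]
    return dict(zip(QUERY.split(","), fields))


def sample_gpu():
    """Ask nvidia-smi for one reading per GPU (requires the NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_sample(line) for line in out.splitlines() if line.strip()]


def log_forever(path="gpu_log.csv", interval_s=5):
    """Append readings to a file, syncing each batch so it survives a power cut."""
    with open(path, "a") as log:
        while True:
            for sample in sample_gpu():
                log.write(",".join(sample.values()) + "\n")
            log.flush()
            os.fsync(log.fileno())
            time.sleep(interval_s)
```

Calling log_forever() in a terminal and leaving it running should leave the final readings in gpu_log.csv after the machine goes down. A temperature or power spike right before the end would support the cooling or power supply theory, while normal readings would suggest looking elsewhere.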