Difference between revisions of "Problem with Overheating then reboot since Ubuntu 11.10"

From ThinkWiki

Revision as of 11:29, 13 December 2012

Symptoms

Since Ubuntu 11.10, my T500 has reduced battery life (around 3 hours at low screen brightness, 2 hours under normal usage) and reboots under high CPU load because of overheating (above 100 °C). This is clearly a software bug, as I did not see this behavior in Ubuntu 11.04. It appeared with 11.10 (the first Ubuntu release with a Linux 3.x kernel).

Trying to find a solution...

Fan control

To set your fan to max:

# rmmod thinkpad_acpi

# modprobe thinkpad_acpi fan_control=1

# echo "level 127" > /proc/acpi/ibm/fan

But this is not a fan control problem. Whatever the fan speed (including disengaged and manually set to full speed with level 127), my ThinkPad T500 still reboots after less than one minute of high CPU load (I did not have this problem before Ubuntu 11.10).
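To confirm that the fan really took the requested level, the status file can be read back. This is a minimal sketch, assuming the usual thinkpad_acpi status format with a "level:" line; the function name fan_level is made up for illustration.

```shell
# Print the current fan level reported by thinkpad_acpi.
# The status file path can be overridden, which is handy for testing.
fan_level() {
    status_file="${1:-/proc/acpi/ibm/fan}"
    # The file holds lines such as "level:<tabs>auto"; print the value only.
    awk '$1 == "level:" { print $2 }' "$status_file"
}
```

Calling fan_level with no argument reads /proc/acpi/ibm/fan on the running system.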

Temporary fix

On battery, the temperature does not exceed 75 °C, so the laptop does not overheat and reboot.
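To compare temperatures on battery and on AC power, the kernel's thermal zone files can be polled. The sketch below assumes the zone of interest is thermal_zone0 and that the file reports millidegrees Celsius, as the generic thermal sysfs interface does; the function names zone_temp_c and is_too_hot are hypothetical.

```shell
# Print the temperature of one thermal zone in whole degrees Celsius.
# The sysfs file reports millidegrees, so divide by 1000.
zone_temp_c() {
    temp_file="${1:-/sys/class/thermal/thermal_zone0/temp}"
    millideg=$(cat "$temp_file")
    echo $((millideg / 1000))
}

# Succeed (exit status 0) when the zone is at or above the given limit.
# Usage: is_too_hot LIMIT_C [TEMP_FILE]
is_too_hot() {
    limit_c="$1"
    [ "$(zone_temp_c "$2")" -ge "$limit_c" ]
}
```

A loop such as "while sleep 5; do zone_temp_c; done" gives a rough temperature log while the CPU is under load.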

Devices power saving off

The Tunables tab of powertop lists many devices with power saving switched off (no energy-saving mode?):

  Bad           Enable SATA link power management for /dev/sda
  Bad           NMI watchdog should be turned off
  Bad           Power Aware CPU scheduler
  Bad           VM writeback timeout
  Bad           Enable Audio codec power management
  Bad           Autosuspend for USB device Fingerprint Sensor [4-1]
  Bad           Autosuspend for USB device USB Receiver (Logitech)
  Bad           Autosuspend for USB device Android Phone (HTC)
  Bad           Runtime PM for PCI Device Intel Corporation Mobile 4 Series Chipset Memory Controller Hub
  Bad           Runtime PM for PCI Device Ricoh Co Ltd R5C832 IEEE 1394 Controller
  Bad           Runtime PM for PCI Device Intel Corporation Mobile 4 Series Chipset MEI Controller
  Bad           Runtime PM for PCI Device Intel Corporation 82567LF Gigabit Network Connection
  Bad           Runtime PM for PCI Device Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #4
  Bad           Runtime PM for PCI Device Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #2
  Bad           Runtime PM for PCI Device Intel Corporation 82801I (ICH9 Family) HD Audio Controller
  Bad           Runtime PM for PCI Device Ricoh Co Ltd xD-Picture Card Controller
  Bad           Runtime PM for PCI Device Ricoh Co Ltd R5C592 Memory Stick Bus Host Adapter
  Bad           Runtime PM for PCI Device Ricoh Co Ltd R5C822 SD/SDIO/MMC/MS/MSPro Host Adapter
  Bad           Runtime PM for PCI Device Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller
  Bad           Runtime PM for PCI Device Intel Corporation Ultimate N WiFi Link 5300
  Bad           Runtime PM for PCI Device Ricoh Co Ltd RL5c476 II
  Bad           Runtime PM for PCI Device Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #6
  Bad           Runtime PM for PCI Device Intel Corporation 82801I (ICH9 Family) PCI Express Port 4
  Bad           Runtime PM for PCI Device Intel Corporation 82801I (ICH9 Family) PCI Express Port 5
  Bad           Runtime PM for PCI Device Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1
  Bad           Runtime PM for PCI Device Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2
  Bad           Runtime PM for PCI Device Intel Corporation 82801I (ICH9 Family) PCI Express Port 2
  Bad           Runtime PM for PCI Device Intel Corporation 82801I (ICH9 Family) PCI Express Port 3
  Bad           Runtime PM for PCI Device Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1
  Bad           Runtime PM for PCI Device Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #3
  Bad           Runtime PM for PCI Device Intel Corporation 82801IBM/IEM (ICH9M/ICH9M-E) 4 port SATA Controller [AHCI mode]
  Bad           Runtime PM for PCI Device Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #5
  Bad           Runtime PM for PCI Device Intel Corporation 82801I (ICH9 Family) PCI Express Port 1
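The "Runtime PM for PCI Device" entries above correspond to the power/control file of each device in sysfs. As a sketch only, the following switches them all to automatic runtime power management; the function name enable_pci_runtime_pm is made up, the change does not persist across reboots, and whether it affects the overheating is untested.

```shell
# Switch every PCI device under the given sysfs directory to automatic
# runtime power management. Needs root on a real system.
enable_pci_runtime_pm() {
    pci_root="${1:-/sys/bus/pci/devices}"
    for ctl in "$pci_root"/*/power/control; do
        # Skip devices that have no writable control file.
        if [ -w "$ctl" ]; then
            echo auto > "$ctl"
        fi
    done
}
```

Writing "on" instead of "auto" to the same file reverts a device to always-on.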


Turbo mode

First, the processor spends about 25% of the time in Turbo mode:

Package             | CPU 0
Turbo Mode  24.4%   | Turbo Mode  21.7%
2.81 Ghz     1.8%   | 2.81 Ghz     1.6%
2.14 Ghz     0.9%   | 2.14 Ghz     0.9%
1.60 Ghz     3.3%   | 1.60 Ghz     3.3%
 800 Mhz    57.5%   |  800 Mhz    55.2%
Idle        12.1%   | Idle        17.4%
                    |            CPU 1
                    | Turbo Mode  24.1%
                    | 2.81 Ghz     1.8%
                    | 2.14 Ghz     0.9%
                    | 1.60 Ghz     3.2%
                    |  800 Mhz    54.5%
                    | Idle        15.5%


Basically this is what I'm seeing on my i7 X220: even though the CPU reaches 97 degrees with the fan at full speed, it stays in turbo mode no matter what, as verified with powertop. Maybe somebody knows more about ThinkPad throttling in turbo mode?

I also ran powertop and saw several other things:

ASPM

Linux 3.x has an ASPM bug with similar symptoms, but ASPM is not enabled in my case:

$ dmesg | grep ASPM
[    0.160380] ACPI FADT declares the system doesn't support PCIe ASPM, so disable it

If I add pcie_aspm=force to the kernel command line, I get the following output:

$ dmesg | grep ASPM
[    0.000000] PCIe ASPM is forcibly enabled
[    0.197865] ACPI FADT declares the system doesn't support PCIe ASPM, so disable it
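Besides dmesg, the pcie_aspm module parameter shows which ASPM policy is currently selected; the active one is printed in square brackets. A minimal sketch, assuming the parameter file exists on this kernel; the function name aspm_policy is hypothetical.

```shell
# Print the active ASPM policy, i.e. the entry shown in square brackets,
# for example "default" from "[default] performance powersave".
aspm_policy() {
    policy_file="${1:-/sys/module/pcie_aspm/parameters/policy}"
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' "$policy_file"
}
```

Note that the selected policy only matters if ASPM was not disabled at boot, which the dmesg output above suggests is the case here.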

Power Aware CPU scheduler

Changing the CPU policy to powersave does not solve the problem.
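For reference, one way to apply such a policy to all cores is through the cpufreq sysfs files. This sketch assumes that "policy" here means the cpufreq scaling governor; the function name set_governor is made up for illustration.

```shell
# Write the given cpufreq governor to every CPU under the sysfs root.
# Usage: set_governor GOVERNOR [CPU_ROOT]   (needs root on a real system)
set_governor() {
    governor="$1"
    cpu_root="${2:-/sys/devices/system/cpu}"
    for gov_file in "$cpu_root"/cpu[0-9]*/cpufreq/scaling_governor; do
        # Skip CPUs that expose no writable governor file.
        if [ -w "$gov_file" ]; then
            echo "$governor" > "$gov_file"
        fi
    done
}
```

For example, "set_governor powersave" selects the powersave governor on every core until the next reboot.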

ACPI OSI

Using the kernel switch 'acpi_osi=Linux' (as described at http://askubuntu.com/a/75713/8296) does not fix the problem.
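When testing boot switches like this one, it helps to verify that the running kernel was really started with them. A small sketch that checks /proc/cmdline for an exact parameter; the function name has_kernel_param is hypothetical.

```shell
# Succeed when the given parameter appears on the kernel command line.
# Usage: has_kernel_param PARAM [CMDLINE_FILE]
has_kernel_param() {
    param="$1"
    cmdline_file="${2:-/proc/cmdline}"
    # Put each parameter on its own line, then look for an exact,
    # whole-line, fixed-string match.
    tr ' ' '\n' < "$cmdline_file" | grep -qxF -- "$param"
}
```

For example, "has_kernel_param acpi_osi=Linux && echo active" prints "active" only if the switch was passed at boot.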

ACPI

This might be related to this bug report: https://bugzilla.kernel.org/show_bug.cgi?id=42858

@Matthias: the patch in comment 5 of the bug report is already present in the latest 12.04 kernel, so this is not the solution.

But strangely the symptoms seem very close. I've added a comment in the bug report:

https://bugzilla.kernel.org/show_bug.cgi?id=42858#8


Data

List of machines with the same problem
Model
T500 - type 2082

References

http://marc.info/?l=linux-acpi&m=132854533918079&w=2

Symptoms are high temperature under high CPU usage, with a reboot above 100 °C (information taken from the ACPITZ/1 entry). I found a Launchpad bug entry about wrong fan speed, but even if I force the fan to maximum (see the description of the Launchpad bug) the computer still reboots after one minute of high CPU usage.

But I'm starting to suspect several bugs in my case. My battery lasts 2 to 3 hours, half of what I had before Ubuntu 11.10 (~5 hours). So it may be related to the ASPM bug, but even with recent kernel updates I still have the problem. I've tried several of the kernel boot switches suggested as fixes for Ubuntu 11.10, but nothing really changed. About ASPM, dmesg returns:

$ dmesg | grep ASPM
[    0.160288] ACPI FADT declares the system doesn't support PCIe ASPM, so disable it

Maybe I need to try again with 12.04?

Two last things:
- I cannot open my computer to remove possible dust (computer owned by my company, warranty, yada yada...).
- I saw a mention somewhere of possible GPU overheating (I cannot find the link). Does anyone have such experience?

Any hint, clue or suggestion is welcome...

update 1



update 3