Problem with Overheating then reboot since Ubuntu 11.10

From ThinkWiki
Revision as of 13:47, 18 April 2013 by RainerEL (Talk | contribs) (Analyzing CPU speed and voltage)

Symptoms

Since Ubuntu 11.10, my T500 has reduced battery life (around 3 hours with low screen brightness, 2 hours under normal usage) and reboots under heavy CPU load because of overheating (more than 100°C). This looks like a software bug, since I did not see this behavior with Ubuntu 11.04. It appeared with 11.10 (the first Ubuntu release with a Linux 3.x kernel).

Trying to find a solution...

Fan control

To set your fan to max:

 sudo rmmod thinkpad_acpi
 sudo modprobe thinkpad_acpi fan_control=1
 echo "level 127" | sudo tee /proc/acpi/ibm/fan

But fan control is not the problem. Whatever the fan speed (even disengaged and set manually to full speed with level 127), my ThinkPad T500 still reboots after less than one minute of high CPU load (I didn't have this problem before Ubuntu 11.10).
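
The fan commands above can be wrapped in a small helper that checks the level before writing it. This is only a sketch: the names set_fan_level and FAN_FILE are made up for illustration, and the accepted levels (0 to 7, auto, disengaged, full-speed) are the ones described in the thinkpad_acpi documentation, so the numeric 127 used above is deliberately rejected. It has to run as root, with thinkpad_acpi loaded with fan_control=1.

```shell
# FAN_FILE can be overridden so the helper can be tried against a plain file.
FAN_FILE=${FAN_FILE:-/proc/acpi/ibm/fan}

# set_fan_level LEVEL: write "level LEVEL" to the fan control file.
# Returns 1 without writing anything if LEVEL is not a known level.
set_fan_level() {
    case $1 in
        [0-7]|auto|disengaged|full-speed) ;;
        *) echo "unsupported fan level: $1" >&2
           return 1 ;;
    esac
    echo "level $1" > "$FAN_FILE"
}
```

For example, set_fan_level disengaged before a stress test and set_fan_level auto afterwards, so that the firmware gets control of the fan back.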

Comment from RainerEL:

I have reduced the temperature problem a little by adding level 127 to thinkfan.conf.

Here is my thinkfan.conf on a Lenovo R60:

(0, 0, 42)    #
(1, 40, 44)    # 2270-2550
(2, 43, 46)    # 2270-2550
(3, 44, 50)    # 3000-3170
(4, 48, 54)    # 3000-3170
(5, 52, 60)    # 3200-3400
(6, 58, 65)    # 3600-3700
#(7, 65, 32767) # 3700-3800
(7, 63, 68) # 3700-3800
(127, 66, 32767) # 58xx

Under CPU load the temperature slowly rises to 85°C. I believe that is still too high.

Temporary fix

When my laptop runs on battery, the temperature does not exceed 75°C, so it does not reboot from overheating.

Temporary fix from RainerEL

I have an R60 whose CPU has 3 speeds (1, 1.33 and 1.83GHz), and I do the following:

- set the governor to userspace at 1.833GHz
- reduce the VID using:
      wrmsr -p0 0x0199 0x0b20
      wrmsr -p1 0x0199 0x0b20

After this, the temperature does not go over 65°C when I start a load that drives the CPU to 100% at the highest CPU speed (1.83GHz)!

I believe that Ubuntu sets the VID too high by default. I opened bug #1164557 in Launchpad.
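
The first step of this fix (pinning the CPU with the userspace governor) could be scripted through the cpufreq files in sysfs. A minimal sketch, assuming a cpufreq driver that exposes scaling_available_frequencies (acpi-cpufreq does) and an available userspace governor. The names pin_max_freq and CPU_ROOT are made up for illustration. Run it as root.

```shell
# CPU_ROOT can be overridden so the function can be tried on a scratch directory.
CPU_ROOT=${CPU_ROOT:-/sys/devices/system/cpu}

# pin_max_freq CPU: select the userspace governor on cpuCPU, request the
# highest frequency the driver lists, and print that frequency (in kHz).
pin_max_freq() (
    dir=$CPU_ROOT/cpu$1/cpufreq
    max=0
    for f in $(cat "$dir/scaling_available_frequencies"); do
        if [ "$f" -gt "$max" ]; then
            max=$f
        fi
    done
    # No readable frequency list: give up before touching the governor.
    [ "$max" -gt 0 ] || exit 1
    echo userspace > "$dir/scaling_governor"
    echo "$max" > "$dir/scaling_setspeed"
    echo "$max"
)
```

Calling pin_max_freq 0 and pin_max_freq 1 before the two wrmsr commands keeps the governor from switching to another speed in between.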

Devices power saving off

The Tunables tab of powertop reports a lot of devices and settings as "Bad" (no power-saving mode enabled?):

  Bad           Enable SATA link power management for /dev/sda
  Bad           NMI watchdog should be turned off
  Bad           Power Aware CPU scheduler
  Bad           VM writeback timeout
  Bad           Enable Audio codec power management
  Bad           Autosuspend for USB device Fingerprint Sensor [4-1]
  Bad           Autosuspend for USB device USB Receiver (Logitech)
  Bad           Autosuspend for USB device Android Phone (HTC)
  Bad           Runtime PM for PCI Device Intel Corporation Mobile 4 Series Chipset Memory Controller Hub
  Bad           Runtime PM for PCI Device Ricoh Co Ltd R5C832 IEEE 1394 Controller
  Bad           Runtime PM for PCI Device Intel Corporation Mobile 4 Series Chipset MEI Controller
  Bad           Runtime PM for PCI Device Intel Corporation 82567LF Gigabit Network Connection
  Bad           Runtime PM for PCI Device Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #4
  Bad           Runtime PM for PCI Device Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #2
  Bad           Runtime PM for PCI Device Intel Corporation 82801I (ICH9 Family) HD Audio Controller
  Bad           Runtime PM for PCI Device Ricoh Co Ltd xD-Picture Card Controller
  Bad           Runtime PM for PCI Device Ricoh Co Ltd R5C592 Memory Stick Bus Host Adapter
  Bad           Runtime PM for PCI Device Ricoh Co Ltd R5C822 SD/SDIO/MMC/MS/MSPro Host Adapter
  Bad           Runtime PM for PCI Device Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller
  Bad           Runtime PM for PCI Device Intel Corporation Ultimate N WiFi Link 5300
  Bad           Runtime PM for PCI Device Ricoh Co Ltd RL5c476 II
  Bad           Runtime PM for PCI Device Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #6
  Bad           Runtime PM for PCI Device Intel Corporation 82801I (ICH9 Family) PCI Express Port 4
  Bad           Runtime PM for PCI Device Intel Corporation 82801I (ICH9 Family) PCI Express Port 5
  Bad           Runtime PM for PCI Device Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1
  Bad           Runtime PM for PCI Device Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2
  Bad           Runtime PM for PCI Device Intel Corporation 82801I (ICH9 Family) PCI Express Port 2
  Bad           Runtime PM for PCI Device Intel Corporation 82801I (ICH9 Family) PCI Express Port 3
  Bad           Runtime PM for PCI Device Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1
  Bad           Runtime PM for PCI Device Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #3
  Bad           Runtime PM for PCI Device Intel Corporation 82801IBM/IEM (ICH9M/ICH9M-E) 4 port SATA Controller [AHCI mode]
  Bad           Runtime PM for PCI Device Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #5
  Bad           Runtime PM for PCI Device Intel Corporation 82801I (ICH9 Family) PCI Express Port 1

I don't know whether this is a clue...
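
The "Runtime PM for PCI Device" entries correspond to the power/control file of each PCI device in sysfs, which contains either on (runtime PM disabled) or auto. The following sketch only lists the devices that are not set to auto, so the result can be compared between kernels without changing anything. The names pci_runtime_pm_off and PCI_ROOT are made up for illustration.

```shell
# PCI_ROOT can be overridden so the function can be tried on a scratch directory.
PCI_ROOT=${PCI_ROOT:-/sys/bus/pci/devices}

# pci_runtime_pm_off: print the address of every PCI device whose
# power/control file does not contain "auto".
pci_runtime_pm_off() {
    for ctl in "$PCI_ROOT"/*/power/control; do
        [ -r "$ctl" ] || continue
        if [ "$(cat "$ctl")" != "auto" ]; then
            dev=${ctl%/power/control}
            echo "${dev##*/}"
        fi
    done
}
```

Writing auto into one of these files (as root) is what turns the matching powertop entry from Bad to Good.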

Turbo mode

The processor's Turbo mode is active about 25% of the time:

Package             | CPU 0
Turbo Mode  24.4%   | Turbo Mode  21.7%
2.81 Ghz     1.8%   | 2.81 Ghz     1.6%
2.14 Ghz     0.9%   | 2.14 Ghz     0.9%
1.60 Ghz     3.3%   | 1.60 Ghz     3.3%
 800 Mhz    57.5%   |  800 Mhz    55.2%
Idle        12.1%   | Idle        17.4%
                    |            CPU 1
                    | Turbo Mode  24.1%
                    | 2.81 Ghz     1.8%
                    | 2.14 Ghz     0.9%
                    | 1.60 Ghz     3.2%
                    |  800 Mhz    54.5%
                    | Idle        15.5%

ASPM

Linux 3.x has an ASPM bug with similar symptoms, but ASPM is not activated in my case:

 $ dmesg | grep ASPM
 [ 0.160380] ACPI FADT declares the system doesn't support PCIe ASPM, so disable it

If I add pcie_aspm=force to the kernel command line, I get the following output:

 $ dmesg | grep ASPM
 [    0.000000] PCIe ASPM is forcibly enabled
 [    0.197865] ACPI FADT declares the system doesn't support PCIe ASPM, so disable it
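
dmesg only shows what the kernel decided at boot. What each PCIe link is actually set to can be read from the LnkCtl lines of lspci -vv. A minimal sketch that counts the links with ASPM enabled; the function name aspm_enabled_links is made up, and the pattern assumes the usual lspci wording ("ASPM Disabled" or, for example, "ASPM L0s L1 Enabled"), which may differ between lspci versions.

```shell
# aspm_enabled_links: read "lspci -vv" output on stdin and print the number
# of LnkCtl lines that report ASPM as enabled.
aspm_enabled_links() {
    grep -c 'LnkCtl:.*ASPM [^;]*Enabled'
}
```

Usage: sudo lspci -vv | aspm_enabled_links (root is needed for lspci to show the link capabilities). A result of 0 matches the "so disable it" message above.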

Power Aware CPU scheduler

Changing the CPU policy to powersave does not solve the problem.

ACPI OSI

Using the kernel switch 'acpi_osi=Linux' (as [here]) does not fix the problem.

ACPI

A kernel bug entry with similar behavior has been fixed: https://bugzilla.kernel.org/show_bug.cgi?id=42858

The patch from comment number 5 of the bug report (http://marc.info/?l=linux-acpi&m=132854533918079&w=2) is already present in the latest 12.04 kernel, so this is not the solution.

Strangely, though, the symptoms seem very close. Maybe the same bug is present in thinkpad-acpi?

Analyzing CPU speed and voltage

The CPU speed and voltage influence the CPU temperature. To analyze what happens, it is possible to look at the MSRs (model-specific registers) to see the CPU speed (called FID there) and the CPU voltage (called VID there). To read and write the MSRs you need to install the package msr-tools. rdmsr and wrmsr also need the msr kernel module (sudo modprobe msr) if it is not already loaded.

sudo rdmsr -p[cpu number] MSR prints the content of an MSR. Most Intel CPUs use MSR 0x0199 for the FID and VID.

Here is an example from my Intel Core 2 CPU T5600 1.83GHz:

rainer@LINUX:~$ sudo rdmsr -p0 0x0199
613

The value 613 is hexadecimal and breaks down into

0x06 as FID (meaning 1GHz) and 0x13 as VID (meaning 0.95V)

My CPU runs at 3 speeds, which gives:

613 is 1GHz 0.95V
81c is 1.33GHz 1.0625V 
b28 is 1.83GHz 1.2125V
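
The three values above are consistent with 166.67MHz (500/3) per FID step and a voltage of 0.7125V + 0.0125V per VID step. Both constants are only fitted to this table, so treat them as an assumption for this T5600 and not as a general rule. With that caveat, a value printed by rdmsr can be decoded like this (decode_perf_ctl is a made-up name):

```shell
# decode_perf_ctl HEX: split a value printed by rdmsr (e.g. 613) into FID and VID.
# ASSUMPTION: 500/3 MHz per FID step and 0.7125 V + 0.0125 V per VID step,
# both fitted to the three values listed above for this T5600 only.
decode_perf_ctl() (
    raw=$(( 0x${1#0x} ))
    fid=$(( (raw >> 8) & 0xff ))       # second-lowest byte
    vid=$(( raw & 0xff ))              # lowest byte
    mhz=$(( fid * 500 / 3 ))
    tenth_mv=$(( 7125 + 125 * vid ))   # voltage in units of 0.1 mV
    printf 'FID=0x%02x (~%d MHz) VID=0x%02x (~%d.%04d V)\n' \
        "$fid" "$mhz" "$vid" $(( tenth_mv / 10000 )) $(( tenth_mv % 10000 ))
)
```

For instance, decode_perf_ctl 613 prints FID=0x06 (~1000 MHz) VID=0x13 (~0.9500 V), and decode_perf_ctl "$(sudo rdmsr -p0 0x0199)" decodes the live value.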

These values are CPU dependent. With sudo wrmsr -p0 0x199 0x0b20 I was able to lower the voltage. But be careful: the governor resets the FID and VID quickly. To solve this, see "Temporary fix from RainerEL" above.
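
The value 0x0b20 is the stock b28 with the FID kept and the VID lowered by 8 steps. A small helper can build such a value from whatever rdmsr reports, which avoids typing a wrong FID by hand. This is a sketch only; lower_vid is a made-up name, and how far the VID can go down before the machine becomes unstable has to be found by testing on each CPU.

```shell
# lower_vid HEX STEPS: keep the FID byte of HEX and subtract STEPS from its
# VID byte; print the result in the form expected by wrmsr (e.g. 0x0b20).
lower_vid() (
    raw=$(( 0x${1#0x} ))
    fid=$(( (raw >> 8) & 0xff ))
    vid=$(( (raw & 0xff) - $2 ))
    if [ "$vid" -lt 0 ]; then
        echo "VID would become negative" >&2
        exit 1
    fi
    printf '0x%02x%02x\n' "$fid" "$vid"
)
```

For example, sudo wrmsr -p0 0x199 "$(lower_vid b28 8)" writes 0x0b20, the same value as in the command above.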

Data

List of machines with the same problem

Model            | BIOS version (latest?) | Bug present in
T500 - type 2082 | 3.24 (latest)          | Ubuntu 11.10 (32 bits), Ubuntu 12.04 (32 bits), ArchLinux (September 2012)
R60 - type 9461  | 2.23 (latest)          | Ubuntu 12.04 (32 bits), 3.2.0-41-generic-pae #65-Ubuntu SMP