Difference between revisions of "Script for Undervolt Stress Testing"

From ThinkWiki

Revision as of 07:57, 13 April 2007

This script helps in calibrating voltages when undervolting a Pentium M processor.

People differ in how far they are willing to undervolt their systems. Some are eager to just run their Pentium-Ms at 700mV and abandon safety; they lower the voltage as far as they can without crashing the system, and maybe raise it again by a margin above the failure point. However, this provides only weak assurance, because a number of failures might not surface immediately. In the worst case, the system will fail months later, and the blame might be assigned to, say, a kernel upgrade or patch when really the system failed due to intermittent lack of power.

Many would like to guard themselves against such a failure and have consequently opted to run a prime number stress test such as MPrime in "torture test" mode while they ramp down their voltages to find a comfortable margin from the failure point. However, as per recommendations from a thread on the Linux-Thinkpad mailing list, perhaps even more can be done. Following that advice, this script not only runs MPrime, but also toggles many power-demanding features of the laptop on and off throughout the test. The idea is to more rapidly expose corner cases in which the system might act up.

NOTE!
Please feel very free to improve/fix this script. My intent for its posting is to make its ownership as public as possible. There's no need to try to E-mail me to validate your changes. If you feel they are in the best interest of the public, just make the changes. The script attempts to employ pre-conditions to intelligently apply functionality only to those laptops that appear to support it. Hopefully, its framework will allow for extension without heavy redesign.
ATTENTION!
There are very important warnings embedded into the comments of this script. I have left them there because if you copy this script to your system, I would want you to carry these warnings as comments with you. Please read these comments and the script very carefully. Stress testing an undervolted system is not a trivial undertaking and you need to be as accountable as possible for what a script like this does.
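
The pre-condition pattern mentioned in the note can be tried without any laptop hardware. The following is a minimal illustration only, not part of the script: toggle_fake_device and FAKE_DEVICE_FILE are hypothetical names, and a temporary file stands in for a writable SysFS or ProcFS control file. As in the script, each toggle runs before $AGGRESSIVE flips, so "true" inside a toggle means "currently aggressive, switch the feature off".

```bash
#!/bin/bash
# Hardware-free sketch of the precondition-plus-registry pattern.
# toggle_fake_device and FAKE_DEVICE_FILE are hypothetical, for illustration.

AGGRESSIVE="false"
AGGRESSIVE_TOGGLES=""

FAKE_DEVICE_FILE=$(mktemp)   # stands in for a writable control file

# Precondition: register the toggle only if its control file is writable.
if [ -w "$FAKE_DEVICE_FILE" ] ; then
  AGGRESSIVE_TOGGLES="$AGGRESSIVE_TOGGLES toggle_fake_device"
fi
toggle_fake_device()
{
  # Runs before $AGGRESSIVE is flipped by toggle_aggression.
  if [ "$AGGRESSIVE" = "true" ]
    then echo off > "$FAKE_DEVICE_FILE"
    else echo on > "$FAKE_DEVICE_FILE"
  fi
}

# Same shape as the script's toggle_aggression: call every registered
# toggle, then flip the state flag.
toggle_aggression()
{
  for toggle in $AGGRESSIVE_TOGGLES ; do $toggle ; done
  if [ "$AGGRESSIVE" = "true" ]
    then AGGRESSIVE="false"
    else AGGRESSIVE="true"
  fi
}

toggle_aggression
FIRST_STATE="$AGGRESSIVE:$(cat "$FAKE_DEVICE_FILE")"
echo "$FIRST_STATE"    # prints: true:on

toggle_aggression
SECOND_STATE="$AGGRESSIVE:$(cat "$FAKE_DEVICE_FILE")"
echo "$SECOND_STATE"   # prints: false:off

rm -f "$FAKE_DEVICE_FILE"
```

A new stress feature would follow the same shape: a conditional that appends the function name to $AGGRESSIVE_TOGGLES, followed by the function definition itself.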

This page contains a large amount of code. The actual code should be moved to a dedicated code article, to make it easier to download and edit.

#!/bin/bash
#
# DESCRIPTION AND MOTIVATION 
# --------------------------
# Designed for undervolted laptops with frequency stepping, this script
# swings the system between aggressive and low power use, and also swings
# among the available frequencies.
# 
# The idea is that such extreme use of the system will likely explore corner
# cases where the system might fail.  Hopefully, such testing can curtail the
# time necessary to establish confidence in undervolted systems.
#
# In the background the MPrime program, a prime number search engine, runs in a
# "torture test" mode, in which it tests computations against known results and
# errs out if there's a discrepancy.  Unless it errs out, this script runs
# forever.
# 
# IMPLEMENTATION
# --------------
# The design of this script attempts to address laptops beyond the Thinkpad T42
# for which it was designed.  Many of the function definitions are prepended
# with conditionals that check the system for functionality and either bail out
# or disable features accordingly.
#
# In particular, the nature of what "aggressive" constitutes is defined by a 
# number of "toggle_" functions.  The pre-pended conditional to these functions
# appends the function name to $AGGRESSIVE_TOGGLES if the system appears to
# support the feature.  The toggle_aggression function then calls all the 
# functions in $AGGRESSIVE_TOGGLES.  Look at these "toggle_" functions for 
# examples of how to extend this script for other possible stressing.
#
# EXTERNAL PROGRAMS EMPLOYED
# --------------------------
# Test system integrity (required):  MPrime - http://www.mersenne.org/prime.htm
# Download files:  curl - http://curl.haxx.se
# Read random sectors from CD:  spew (for gorge) - http://spew.berlios.de
# Keep hard disk active:  stress - http://weather.ou.edu/~apw/projects/stress/
#
# EXECUTION
# ---------
# Read this script including all the warnings below, and then make sure all the
# variables in the "Script Globals" section are appropriately set. 
#
# This script uses the mprime binary with the "-t" switch for the MPrime
# "torture test."  This test by default uses all the memory available on the
# system.  However, if you run this system for many hours, your kernel may run
# out of memory, and kill mprime and this script.  To spare yourself this
# problem, use the "NightMemory=" and "DayMemory=" parameters in MPrime's
# local.ini file, a file typically in the same directory as the mprime
# executable (read the MPrime documentation for specifics).  The torture test
# by default uses the greater of these two settings, so just set them both a
# reasonable margin away from the total amount of memory available on your
# system.  On a system with 512MB of RAM, I set these parameters both to 448,
# and had enough memory left over to run my normal set of background processes.
#
# The arguments of this script are "aggression" toggles to disable.  Any
# function below named "toggle_$OPTION" can be disabled by passing
# $OPTION as one of the arguments of this script.  Otherwise, every stress
# feature that the system supports is enabled by default.
#
# Because of Warning 3 below, I recommend you run this script as
#
#     stress_test 2>&1 | tee output
#
# so that you have a persistent record of what has happened in case your battery
# drains completely.
#
# Keeping in mind Warning 1.1, run the script for as long as it takes to 
# establish confidence in your system (a few hours, half a day, etc.).
#
# WARNINGS
# --------
# 1) This is a STRESS test, and it is very possible that you may witness some
# very bad behavior.  Some systems might already be on the verge of breaking,
# and this script might push them over the edge, and damage them irreparably.
# Especially since you've probably undervolted your system, please accept the
# inherent risk in running this script.  In fact, I have even seen some
# unexpected behavior on non-undervolted systems running this script.  
#
# 1.1) This is a STRESS test, and it will run your system very hot at times.
# Since you are probably running this test because you've undervolted your
# system, you assumedly care a lot about conserving your battery's charge.
# However, running a system hot and needlessly running through charging cycles
# will tax your battery more than just normal use.  It is very difficult to
# even estimate how much of your battery's life you may throw away running
# this test.  In all likelihood on a battery that's not too old or too new, it
# should be imperceptible, and the security you'll gain after running this test
# will be worth it.  You can always run this script without the battery
# connected -- just run it with an "ac_via_smapi" argument to disable 
# toggling from the ac to battery power.
#
# 2) Please READ THIS SCRIPT BEFORE RUNNING IT.  It was very much designed for my
# personal system, and although it worked very well for my needs, it relies 
# heavily on a number of external programs for full functionality.  Finding these
# programs isn't so bad (with the exception of MPrime all were available as 
# Debian packages -- spew, gorge, curl, etc.).  As I noted above, I've tried to 
# structure this script such that it can be extended (as opposed to overwritten) 
# to support other functionality.  However, you should also read this script 
# entirely because it's not mature, so it's difficult for me to document all the 
# strange ways in which it might behave under various circumstances.
#
# 3) This script might drain your battery completely.  It has some strong measures
# to prevent that from happening, but I can't make guarantees.   
#
# 4) Be mindful that upon breaking out of this script, your system may not be
# in an agreeable state.  There is a bash trap that performs a lot of cleanup 
# if you exit with a Ctrl-C.  But I didn't make the code to revert the CD's speed, 
# the wireless device's original txpower, the display's brightness, etc.  Also, the 
# bash trap isn't perfect, and might fail to restore the system.
#

set -e  # Script designed to bail out on any irregularities.

##############################################
# SCRIPT GLOBALS                             #
#  (may need some adjusting for your system) #
##############################################

MPRIME_BIN="./gimps/mprime" # MPrime binary location (get from
                            #   http://www.mersenne.org/freesoft.htm)
AGGRESSIVE_SLEEP_SEC=90     # Seconds for "aggressive" testing interval when
                            #   testing with a fixed frequency
NONAGGRESSIVE_SLEEP_SEC=120 # Seconds for non-"aggressive" testing interval
                            #   when testing with a fixed frequency
FREQ_CYCLE_SLEEP_SEC=15     # Seconds for each random frequency when testing
                            #   with a fixed aggression
FREQ_CYCLE_NUM=15           # Number of random frequencies to cycle through 
                            #   when testing with a fixed aggression
CAPACITY_LIMIT=50           # Minimum mWh required in battery before the script
                            #   takes time out to recharge the battery
SECONDS_TO_CHARGE=300       # Seconds to charge if $CAPACITY_LIMIT is reached
WIFI_DEVICE=eth1            # Set to garbage if you don't want to use wifi 
MAX_TXPOWER=20              # Tx power (dB) used for wifi device in aggressive
                            #   mode (off in non-aggressive mode)
CDROM_DEV_FILE=/dev/hdc     # Set to garbage if you don't want to use the CD-ROM
MAX_CD_SPEED=24             # Speed of CD in aggressive mode (off in
                            #   non-aggressive mode)

# Some services need to be stopped to prevent a conflict with
# aggressive/non-aggressive mode settings.  These services are restarted in
# reverse order upon the script's exit.  You can customize the path to these
# scripts here if your flavor of GNU doesn't use /etc/init.d/.
#
SERVICES_TO_STOP="tpsmapi powernowd acpid sleepd laptop-mode"
PATH_TO_SERVICES_SCRIPTS="/etc/init.d"

# Some info that should be in SysFS or ProcFS.
#
SYS_CPU_DIR=/sys/devices/system/cpu/cpu0/cpufreq
FREQS="$(cat $SYS_CPU_DIR/scaling_available_frequencies)"
FREQS_ARRAY=($FREQS)
SYS_TPSMAPI_BAT_DIR=/sys/devices/platform/smapi/BAT0
IBM_ACPI_BRIGHTNESS_FILE=/proc/acpi/ibm/brightness
RF_KILL_FILE=/sys/class/net/$WIFI_DEVICE/device/rf_kill

############
# BINARIES #
############
#
# Establishes paths for all binaries to make it easier for functions to test if
# they are executable with 'test -x "$BINARY_BIN"'.  
#
{
  CURL_BIN=$(which curl)
  GORGE_BIN=$(which gorge)
  STRESS_BIN=$(which stress)
  IWCONFIG_BIN=$(which iwconfig)
  IFUP_BIN=$(which ifup)
  IFDOWN_BIN=$(which ifdown)
  EJECT_BIN=$(which eject)
  CPUFREQSET_BIN=$(which cpufreq-set)
  KILLALL_BIN=$(which killall)
  RENICE_BIN=$(which renice)
  PKILL_BIN=$(which pkill)    # needed by toggle_intel_wireless
  HDPARM_BIN=$(which hdparm)  # fallback for setting the CD speed
} || true

#############
# FUNCTIONS #
############# 

# clean_up()
#
# Kills mprime background job and starts services that were stopped at the
# beginning of the script's execution.
#
if [ ! -x "$KILLALL_BIN" ]
  then echo "Sorry, this script uses killall" ; exit 1
fi
for service in $SERVICES_TO_STOP ; do
  if [ ! -x "$PATH_TO_SERVICES_SCRIPTS/$service" ]
    then echo "$PATH_TO_SERVICES_SCRIPTS/$service can't be called." ; exit 1
  fi
done
clean_up()
{
  $KILLALL_BIN -q mprime || true
  if [ "$AGGRESSIVE" = "true" ] ; then toggle_aggression ; fi
  local SERVICES_TO_START=""
  for service in $SERVICES_TO_STOP
    do SERVICES_TO_START="$service $SERVICES_TO_START"
  done
  for service in $SERVICES_TO_START
    do $PATH_TO_SERVICES_SCRIPTS/$service start
  done
}
trap "echo 'cleaning up...' ; clean_up ; exit 1" SIGINT SIGTERM SIGHUP

# do_sleep()
#
# Before starting a testing interval, checks if the battery is low, and charges the
# battery if necessary.  After the testing interval, the running status of the 
# mprime background job is verified. 
#
# TODO: I've not addressed multiple batteries, APM, or ACPI.
#
if [ ! -r "$SYS_TPSMAPI_BAT_DIR/remaining_capacity" ] 
  then 
    echo -n "WARNING: Thinkpad SMAPI SysFS interface not " > /dev/stderr
    echo "available to detect if battery" > /dev/stderr
    echo -n "         level too low.  This script could drain " > /dev/stderr
    echo "all of your battery." > /dev/stderr
fi
do_sleep()
{
  if [ -r "$SYS_TPSMAPI_BAT_DIR/remaining_capacity" ] ; then
    local REMAINING_CAPACITY
    while REMAINING_CAPACITY=$(cat $SYS_TPSMAPI_BAT_DIR/remaining_capacity \
                                2> /dev/null) \
      && REMAINING_CAPACITY=${REMAINING_CAPACITY%% *} \
      && [ "$REMAINING_CAPACITY" ] \
      && [ "$REMAINING_CAPACITY" -lt "$CAPACITY_LIMIT" ] ; do
        echo ; echo -n "Battery is too low to continue, " 
        echo "taking a break to charge up."
        OLD_AGGRESSIVE="$AGGRESSIVE"
        if [ "$AGGRESSIVE" = "true" ] ; then toggle_aggression ; fi
        sleep $SECONDS_TO_CHARGE 
        if [ ! "$OLD_AGGRESSIVE" = "$AGGRESSIVE" ] ; then toggle_aggression ; fi
    done
  fi
  sleep $1
  if kill -0 $MPRIME_PID 2> /dev/null 
    then return 0
    else 
      echo ; echo "mprime bailed out here!"
      clean_up
      exit 1
  fi
}

# set_frequency()
#
# Changes the frequency of the processor to $1.
#
# TODO: Perhaps there should be other ways to change the frequency.
#       I found cpufreq-set convenient because it handles both ProcFS _and_
#       SysFS.
#
if [ ! -x "$CPUFREQSET_BIN" ] ; then
  echo "Sorry, the set_frequency() function needs to be updated" > /dev/stderr
  echo "    to change frequencies without cpufreq-set." > /dev/stderr
  exit 1
fi
set_frequency()
{
  $CPUFREQSET_BIN -f $1
}

# toggle_ac_via_smapi()
#
# If the system is a Thinkpad with the tp_smapi kernel module set up, the 
# ac power is cut in an aggressive mode and returned in the non-aggressive mode. 
#
if [ -w "$SYS_TPSMAPI_BAT_DIR/force_discharge" \
  -a -w "$SYS_TPSMAPI_BAT_DIR/inhibit_charge_minutes" ]
    then AGGRESSIVE_TOGGLES="$AGGRESSIVE_TOGGLES toggle_ac_via_smapi"
fi
toggle_ac_via_smapi()
{
  if [ "$AGGRESSIVE" = "true" ]
    then
      echo 0 > $SYS_TPSMAPI_BAT_DIR/force_discharge 
      echo 0 > $SYS_TPSMAPI_BAT_DIR/inhibit_charge_minutes
    else 
      echo 1 > $SYS_TPSMAPI_BAT_DIR/force_discharge 
      echo 5 > $SYS_TPSMAPI_BAT_DIR/inhibit_charge_minutes
  fi
}

# toggle_ibm_acpi_brightness()
#
# If the Thinkpad ibm_acpi kernel module is set up, the brightness of screen
# is set to the brightest setting in an aggressive mode and the dimmest setting
# otherwise.
#
if [ -w "$IBM_ACPI_BRIGHTNESS_FILE" ]
    then AGGRESSIVE_TOGGLES="$AGGRESSIVE_TOGGLES toggle_ibm_acpi_brightness"
fi
toggle_ibm_acpi_brightness()
{
  if [ "$AGGRESSIVE" = "true" ]
    then echo level 0 > $IBM_ACPI_BRIGHTNESS_FILE
    else echo level 7 > $IBM_ACPI_BRIGHTNESS_FILE
  fi
}

# toggle_intel_wireless()
#
# Turns the wireless device on in power-hogging mode when aggressive, and
# turns the device off otherwise.
#
# NOTE: Designed for the Intel 2200BG open source driver, and may not be 
#   compatible with much else.  
#
if [ -w "$RF_KILL_FILE" -a -x "$PKILL_BIN" -a -x "$IFDOWN_BIN" \
  -a -x "$IFUP_BIN" -a -x "$IWCONFIG_BIN" -a "$WIFI_DEVICE" ] \
    && grep -q "$WIFI_DEVICE" /proc/net/wireless
      then 
        AGGRESSIVE_TOGGLES="$AGGRESSIVE_TOGGLES toggle_intel_wireless"
        $IWCONFIG_BIN $WIFI_DEVICE txpower $MAX_TXPOWER
        $IWCONFIG_BIN $WIFI_DEVICE power off
fi
toggle_intel_wireless()
{
  if [ "$AGGRESSIVE" = "true" ]
    then echo 1 > $RF_KILL_FILE
    else 
      echo 0 > $RF_KILL_FILE
      $PKILL_BIN '^(ifdown|ifup)$' || true
      $IFDOWN_BIN $WIFI_DEVICE 2> /dev/null || true
      $IFUP_BIN $WIFI_DEVICE 2> /dev/null
      local NUM_OF_TRIES=0
      while $IWCONFIG_BIN $WIFI_DEVICE | grep unassociated > /dev/null \
          && [ "$NUM_OF_TRIES" -lt 15 ]
        do sleep 3
        NUM_OF_TRIES=$(($NUM_OF_TRIES + 1))
      done
  fi
}

# toggle_gorge()
#
# In an aggressive mode, reads data from the CD-ROM at random offsets using the 
# 'gorge' command (http://spew.berlios.de/).
#
# NOTE: Don't use a DVD, as the speed set by `eject' doesn't affect DVDs.
#
# NOTE: Make sure to use a CD with more than 450MB of data.
#
if [ -x "$GORGE_BIN" -a -x "$KILLALL_BIN" -a -r "$CDROM_DEV_FILE" ]
  then AGGRESSIVE_TOGGLES="$AGGRESSIVE_TOGGLES toggle_gorge"
fi
toggle_gorge()
{
  if [ "$AGGRESSIVE" = "true" ]
    then $KILLALL_BIN -q $GORGE_BIN || true
    else 
      $GORGE_BIN -r 450M $CDROM_DEV_FILE 2> /dev/null &
      local GORGE_PID=$!
      #
      # My laptop needed a little priority push to get gorge CD reading started
      # in sync with the interval.
      #
      if [ -x "$RENICE_BIN" ]
        then $RENICE_BIN -2 -p $GORGE_PID > /dev/null
      fi
  fi
}

# toggle_stress()
#
# Runs the `stress' program (http://weather.ou.edu/~apw/projects/stress/) in 
# the aggressive mode with settings to issue a large number of write(), 
# unlink(), and sync() events.
#
if [ -x "$STRESS_BIN" -a -x "$KILLALL_BIN" ]
  then AGGRESSIVE_TOGGLES="$AGGRESSIVE_TOGGLES toggle_stress"
fi
toggle_stress()
{
  if [ "$AGGRESSIVE" = "true" ]
    then $KILLALL_BIN -q $STRESS_BIN || true
    else $STRESS_BIN -q -i 1 -d 1 &
  fi
}

# toggle_curl()
#
# Downloads a file (to drain power through the wireless device) in the
# aggressive mode using `curl'.
#
if [ -x "$CURL_BIN" -a -x "$KILLALL_BIN" ]
  then AGGRESSIVE_TOGGLES="$AGGRESSIVE_TOGGLES toggle_curl"
fi
toggle_curl()
{
  URL_FIRST_HALF="http://cdimage.debian.org/cdimage/weekly-builds/"
  URL_SECOND_HALF="i386/iso-cd/debian-testing-i386-binary-1.iso"
  if [ "$AGGRESSIVE" = "true" ]
    then $KILLALL_BIN -q $CURL_BIN || true
    else $CURL_BIN $URL_FIRST_HALF$URL_SECOND_HALF > /dev/null 2> /dev/null &
  fi
}

# toggle_aggression()
#
# Runs all the "toggle_" functions supported by the system unless specified
# as disabled in the script arguments.
#
for toggle_to_disable in "$@" 
  do AGGRESSIVE_TOGGLES=$(echo $AGGRESSIVE_TOGGLES \
                            | sed -e "s/toggle_$toggle_to_disable//")
done
toggle_aggression()
{ 
  for toggle in $AGGRESSIVE_TOGGLES ; do $toggle ; done
  if [ "$AGGRESSIVE" = "true" ]
    then AGGRESSIVE="false"
    else AGGRESSIVE="true"
  fi
}

#########
# SETUP #
#########

# Stopping services that might interfere with the system state this script
# controls (precondition satisfied in definition of clean_up).
#
for service in $SERVICES_TO_STOP
  do $PATH_TO_SERVICES_SCRIPTS/$service stop
done

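# Setting CD to a fast speed (skipped if $CDROM_DEV_FILE isn't readable)
#
if [ -r "$CDROM_DEV_FILE" ] ; then
  if [ -x "$EJECT_BIN" ] 
    then $EJECT_BIN -x $MAX_CD_SPEED $CDROM_DEV_FILE
  elif [ -x "$HDPARM_BIN" ]
    then $HDPARM_BIN -E $MAX_CD_SPEED $CDROM_DEV_FILE
  fi
fi 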

# Starting the prime number search
#
if [ ! -x "$MPRIME_BIN" ] ; then 
  echo "mprime program not executable/found." > /dev/stderr
  exit 1
fi  
$MPRIME_BIN -t > mprime_output.txt &
MPRIME_PID=$!

########
# BODY #
########

while true ; do
  for f in $FREQS ; do
    echo "Cycling aggression twice for ${f}kHz: "
    set_frequency $f
    if [ ! "$AGGRESSIVE" = "true" ] ; then toggle_aggression ; fi
    for i in 1 2 ; do
      echo "    high " ; do_sleep $AGGRESSIVE_SLEEP_SEC ; toggle_aggression
      echo "    low " ; do_sleep $NONAGGRESSIVE_SLEEP_SEC ; toggle_aggression
    done
    echo 
    for i in 1 2 ; do
      if [ $i -eq 1 ] 
        then
          if [ ! "$AGGRESSIVE" = "true" ] ; then toggle_aggression ; fi
          echo "Random freqs under high aggression: "
        else
          if [ "$AGGRESSIVE" = "true" ] ; then toggle_aggression ; fi
          echo "Random freqs under low aggression: "
      fi 
      for (( j=1 ; j<=$FREQ_CYCLE_NUM ; j+=1 )) ; do
        FREQ=${FREQS_ARRAY[$(($RANDOM % ${#FREQS_ARRAY[@]}))]}
        echo "    ${FREQ}..."
        set_frequency $FREQ
        do_sleep $FREQ_CYCLE_SLEEP_SEC
      done
      echo
    done
  done
done
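
The script's arguments name toggles to disable, which it implements by deleting "toggle_$OPTION" from $AGGRESSIVE_TOGGLES with sed. A possible variation, sketched here as an assumption rather than as the script's own method, compares whole names so that an argument can never strip part of a longer toggle name. disable_toggles is a hypothetical helper, and the toggle list is filled in by hand for the demonstration.

```bash
#!/bin/bash
# Sketch: remove toggles named on the command line by exact name match.
# disable_toggles is a hypothetical helper, not part of the original script.

AGGRESSIVE_TOGGLES="toggle_ac_via_smapi toggle_stress toggle_curl"

disable_toggles()
{
  local remaining="" toggle option keep
  for toggle in $AGGRESSIVE_TOGGLES ; do
    keep="yes"
    for option in "$@" ; do
      # Drop the toggle only when the whole name matches.
      if [ "$toggle" = "toggle_$option" ] ; then keep="no" ; fi
    done
    if [ "$keep" = "yes" ] ; then remaining="$remaining $toggle" ; fi
  done
  AGGRESSIVE_TOGGLES="${remaining# }"   # strip the leading space
}

# Equivalent to running: stress_test ac_via_smapi curl
disable_toggles ac_via_smapi curl
echo "$AGGRESSIVE_TOGGLES"   # prints: toggle_stress
```

In the script itself, the call would be disable_toggles "$@", placed where the sed loop before toggle_aggression() currently sits.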