Thanks again for the great website; it's proving to be an invaluable resource.
If I may, I'd like to post a tip about scripting with a UUID for my storage drive instead of its identity as sdb. I was going to post this as a question, as I couldn't get it to work, but in the process of writing this I realised that there was a small typo in the script and that I wasn't running it as root. (duh!)
Anyway...
The other day I was testing a USB rescue drive on my server system to see if it booted properly, and I left it plugged in by mistake when I shut down.
I'm still early in my install of the base system (Ubuntu Server 12.04) and so upon the next reboot, I happened to be playing around with hdparm parameters, continuing from my last session. After a period of some alarm and confusion, I eventually realised that this USB drive had stolen the sda position and had forced all the rest of my disks down the table. Remembering about UUIDs, I did a little research and figured that this would be the best way to go in referencing my partitions, so I set about changing my settings and scripts to use these instead.
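For anyone wanting to check what a UUID currently points at, here is a minimal sketch. The entries under /dev/disk/by-uuid are symlinks to the current device nodes, so readlink -f shows where a partition has ended up after a reshuffle. The function name resolve_uuid and its optional second argument are made up for illustration and are not part of my script.

```shell
#!/bin/bash
# Hypothetical helper: print the device node a filesystem UUID currently maps to.
# The second argument (directory to look in) is an assumption added only so the
# function can be tried against a scratch directory; it defaults to the real one.
resolve_uuid() {
    local uuid=$1
    local by_uuid_dir=${2:-/dev/disk/by-uuid}
    local link="$by_uuid_dir/$uuid"

    # Report a missing UUID instead of printing a path that does not exist
    if [ ! -e "$link" ]; then
        echo "UUID $uuid not found in $by_uuid_dir" >&2
        return 1
    fi

    # Follow the symlink to the real device node
    readlink -f "$link"
}

# Example call (commented out so that sourcing this file has no side effects):
# resolve_uuid 72a4c040-a1d9-40fa-9cab-0a1d1f099529
```

Whatever letter the kernel hands out on a given boot, the UUID path keeps pointing at the same partition, which is the whole reason for switching.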
Here is how I went about changing my Thermal Shutdown script (DriveTempShutdown.sh) to reflect this new policy:
Code: Select all
#!/bin/bash
# PURPOSE: Script to check temperature of installed hard drives and report/shutdown if specified temperatures exceeded
#
# Modified for this server!!
#
# AUTHOR: feedback[AT]HaveTheKnowHow[DOT]com
# Expects three arguments:
# 1. Warning temperature
# 2. Critical shutdown temperature
# 3. If argument 3 is present then just check the drive with that UUID
# eg. using ./DriveTempShutdown.sh 35 45
# will warn when temperature of one or more drives reaches 35degrees and hibernate when any one of them hits 45
# eg. using ./DriveTempShutdown.sh 35 45 72a4c040-a1d9-40fa-9cab-0a1d1f099529
# will warn when temperature of that drive reaches 35degrees and hibernate when it hits 45
# NOTES:
# Change the string ">>/home/server/Logs" as required
# Substitute the masked address in the line which starts "/usr/sbin/ssmtp" with your own email address
# Change the command MyList='...' to the UUIDs of the drives you want to monitor (see the output of sudo blkid)
# Assumes /usr/sbin/smartctl -n standby -a /dev/disk/by-uuid/$i returns the string 'Temperature_Celsius' somewhere
echo "JOB RUN AT $(date)"
echo '============================'
echo ''
echo 'Drive Warning Limit set to =>' $1
echo 'Drive Shutdown Limit set to =>' $2
echo ''
echo ''
if [ $# -eq 2 ]
then
MyList='72a4c040-a1d9-40fa-9cab-0a1d1f099529 bd6da063-8d37-476d-9f1c-7cc31098ffcd'
echo 'Testing all drives'
else
MyList=$3
echo 'Testing only the system drive'
fi
echo ''
for i in $MyList
do
echo 'Drive /dev/disk/by-uuid/'$i
/usr/sbin/smartctl -n standby -a /dev/disk/by-uuid/$i | grep Temperature_Celsius
done
echo ''
echo ''
for i in $MyList
do
#Check state of drive 'active/idle' or 'standby'
stra=$(/sbin/hdparm -C /dev/disk/by-uuid/$i | grep 'drive' | awk '{print $4}')
echo 'Testing Drive with UUID '$i
if [ "${stra}" = 'standby' ]
then
echo ' Drive with UUID '$i 'is in standby'
echo ''
else
str1='/usr/sbin/smartctl -n standby -a /dev/disk/by-uuid/'$i
str2=$($str1 | grep Temperature_Celsius | awk '{print $10}')
if [ ${str2} -ge $1 ]
then
echo '============================' >>/home/server/Logs/DriveWarning.Log
echo $(date) >>/home/server/Logs/DriveWarning.Log
echo '' >>/home/server/Logs/DriveWarning.Log
echo 'WARNING: TEMPERATURE FOR DRIVE with UUID '$i 'EXCEEDED' $1 '=>' $str2 >>/home/server/Logs/DriveWarning.Log
echo '' >>/home/server/Logs/DriveWarning.Log
echo '============================' >>/home/server/Logs/DriveWarning.Log
echo '============================'
echo $(date)
echo ''
echo 'WARNING: TEMPERATURE FOR DRIVE with UUID '$i 'EXCEEDED' $1 '=>' $str2
echo ''
echo '============================'
fi
if [ ${str2} -ge $2 ]
then
echo '============================'
echo $(date)
echo ''
echo 'CRITICAL: TEMPERATURE FOR DRIVE with UUID '$i 'EXCEEDED' $2 '=>' $str2
echo ''
echo '============================'
echo '============================' >>/home/server/Logs/DriveWarning.Log
echo $(date) >>/home/server/Logs/DriveWarning.Log
echo '' >>/home/server/Logs/DriveWarning.Log
echo 'CRITICAL: TEMPERATURE FOR DRIVE with UUID '$i 'EXCEEDED' $2 '=>' $str2 >>/home/server/Logs/DriveWarning.Log
echo '' >>/home/server/Logs/DriveWarning.Log
echo '============================' >>/home/server/Logs/DriveWarning.Log
/usr/sbin/pm-hibernate
/usr/sbin/ssmtp ******@*******.com </home/server/Logs/DriveWarning.Log
echo 'Email Sent.....'
exit
else
echo ''
echo ' Temperature of Drive with UUID '$i' is OK at =>' $str2
echo ''
fi
fi
done
echo 'All Drives are within limits'
echo ''
The first drive [72a4c040-a1d9-40fa-9cab-0a1d1f099529] is the only one which needs to be monitored, but the second (an SSD) [bd6da063-8d37-476d-9f1c-7cc31098ffcd] was included to see if the script was interpreting the UUIDs as valid variables and repeating for both listed drives.
Code: Select all
server@Server:~$ sudo blkid
/dev/sda1: UUID="38C2-743C" TYPE="vfat"
/dev/sda2: UUID="cc1e567d-7a33-41a4-8c36-b2885a6aa6cc" TYPE="ext2"
/dev/sda3: UUID="2pjRqc-cuZc-3l0G-z1so-oOni-EB8a-Oyn1oq" TYPE="LVM2_member"
/dev/sdb1: LABEL="4TB_Storage" UUID="72a4c040-a1d9-40fa-9cab-0a1d1f099529" TYPE="ext4"
/dev/sdc1: LABEL="Recordings" UUID="bd6da063-8d37-476d-9f1c-7cc31098ffcd" TYPE="ext4"
/dev/mapper/Server-root: UUID="d5d8b383-162d-4cbe-8d93-0ed8a940370f" TYPE="ext4"
/dev/mapper/Server-swap_1: UUID="b24830bb-4f1c-4868-b623-a76b7af28142" TYPE="swap"
/dev/mapper/Server-System: UUID="17f07fbf-a141-4fbe-bf30-f74b7987125d" TYPE="ext4"
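To save copying UUIDs out of that listing by hand, something like the following sketch can pull one out. It reads blkid-style lines on stdin and prints the UUID for a single device. The function name uuid_for_device is hypothetical. As far as I know, blkid can also do this directly with sudo blkid -s UUID -o value /dev/sdb1, which is probably the simpler route on a live system.

```shell
#!/bin/bash
# Hypothetical helper: print the UUID recorded for one device in
# blkid-style output read from stdin.
uuid_for_device() {
    local dev=$1
    # blkid prefixes each line with "<device>:", so match on that first field,
    # then find the field that starts with UUID=" and strip the wrapper.
    awk -v dev="$dev:" '$1 == dev {
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^UUID="/) {
                gsub(/^UUID="|"$/, "", $i)
                print $i
            }
        }
    }'
}

# Example call (commented out; needs root and real devices):
# sudo blkid | uuid_for_device /dev/sdb1
```

The value it prints can be pasted straight into the MyList line of the script.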
Anyway, here is the result of running the script:
Code: Select all
server@Server:~/Scripts$ sudo ./DriveTempShutdown.sh 35 45
JOB RUN AT Tue Jul 31 23:45:14 BST 2012
============================
Drive Warning Limit set to => 35
Drive Shutdown Limit set to => 45
Testing all drives
Drive /dev/disk/by-uuid/72a4c040-a1d9-40fa-9cab-0a1d1f099529
194 Temperature_Celsius 0x0002 153 153 000 Old_age Always - 39 (Min/Max 22/47)
Drive /dev/disk/by-uuid/bd6da063-8d37-476d-9f1c-7cc31098ffcd
194 Temperature_Celsius 0x0022 128 129 000 Old_age Always - 128 (Min/Max 127/129)
231 Temperature_Celsius 0x0013 100 100 010 Pre-fail Always - 0
Testing Drive with UUID 72a4c040-a1d9-40fa-9cab-0a1d1f099529
============================
Tue Jul 31 23:45:16 BST 2012
WARNING: TEMPERATURE FOR DRIVE with UUID 72a4c040-a1d9-40fa-9cab-0a1d1f099529 EXCEEDED 35 => 40
============================
Temperature of Drive with UUID 72a4c040-a1d9-40fa-9cab-0a1d1f099529 is OK at => 40
Testing Drive with UUID bd6da063-8d37-476d-9f1c-7cc31098ffcd
./DriveTempShutdown.sh: line 70: [: too many arguments
./DriveTempShutdown.sh: line 89: [: too many arguments
Temperature of Drive with UUID bd6da063-8d37-476d-9f1c-7cc31098ffcd is OK at => 128 0
All Drives are within limits
As you can see, the script works fine until it reaches the SSD, where it breaks: smartctl reports two Temperature_Celsius attributes for that drive (194 and 231), so the script picks up two values, 128 and 0, and the numeric tests fail with "too many arguments".
Obviously though, SSDs aren't going to have to be monitored for temperature in this way, so I didn't include it in the script on my system.
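If a drive that reports more than one Temperature_Celsius line did have to stay in the list, one possible way around the "too many arguments" error is sketched below. It takes the raw value from attribute 194 only and checks that it is a plain integer before comparing. The function names drive_temp and is_integer are hypothetical, and whether attribute 194 holds a real Celsius reading on any particular SSD is an assumption (the 128 above suggests it may not on mine).

```shell
#!/bin/bash
# Hypothetical helper: read `smartctl -a` style output on stdin and print the
# raw value (field 10) of attribute 194 only, so a second
# Temperature_Celsius row such as attribute 231 is ignored.
drive_temp() {
    awk '$1 == 194 && $2 == "Temperature_Celsius" { print $10; exit }'
}

# Hypothetical helper: succeed only for a plain non-negative integer, so that
# a test like [ ... -ge ... ] never sees "128 0" or an empty string.
is_integer() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# How it might slot into the loop (commented out; needs root and real drives):
# str2=$(/usr/sbin/smartctl -n standby -a "/dev/disk/by-uuid/$i" | drive_temp)
# if is_integer "$str2" && [ "$str2" -ge "$1" ]; then
#     echo "WARNING: drive with UUID $i at $str2"
# fi
```

With the check in place, a drive that gives no usable reading is simply skipped by the comparison instead of producing a shell error.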
Note that the script is set to hibernate the system instead of shutting down, and that it writes both Warning and Critical messages to the log file.
To enable the hibernation feature, just install pm-utils:
Code: Select all
sudo apt-get install pm-utils
I have been recording and modifying a walk-through for my own reference, detailing all of the steps I have taken during my server installation so far (like the one above), so perhaps I can post it on this site when I get the opportunity.
I'm not sure that the Tips section would be the best place for it though, as it may be a bit raw and non-specific. What do you think?
EDIT: A version of this method, using disk/by-id instead of uuid, is here: http://forum.havetheknowhow.com/viewtopic.php?p=1995#p1995.