I got a script from one of you guys some time ago and I'm still using it, but I need some help modifying it. I need it to work on the GPUs instead of the CPU cores. I've tried to explain the issue as clearly as I can.
I've got a problem regarding a shell-script and the "nvidia-smi" command!
I've made a script that acts as protection against CPU overheating on my Ubuntu Server 14.04.2. The script works nicely, but I need to make it work on my 4 GPUs as well.
I'm pretty green when it comes to bash scripts, so I've been looking for commands that would make it easy for me to edit the script. I found and tested a lot of them, but none of them gives me the output I need. I'll show you the commands and their output below, along with the script.
What I need is a command that lists the GPUs the same way the "sensors" command from "lm-sensors" lists the CPU cores, so that I can use "grep" to select a GPU and set the variable "newstr" (the two-digit temperature). I've been trying for a couple of days with no luck, mostly because the commands "nvidia-smi -lso" and "nvidia-smi -lsa" don't exist anymore. I think they were experimental.
Here are the commands I found and tested, with their output:
This command shows the GPU bus ID, which I could put into the string "str", but the problem is that the temperature is on the next line. I've been fiddling with grep's "-A 1" flag but haven't been able to work it into the script:
Code:
# nvidia-smi -q -d temperature | grep GPU
Attached GPUs : 4
GPU 0000:01:00.0
GPU Current Temp : 57 C
GPU Shutdown Temp : N/A
GPU Slowdown Temp : N/A
GPU 0000:02:00.0
GPU Current Temp : 47 C
GPU Shutdown Temp : N/A
GPU Slowdown Temp : N/A
GPU 0000:03:00.0
GPU Current Temp : 47 C
GPU Shutdown Temp : N/A
GPU Slowdown Temp : N/A
GPU 0000:04:00.0
GPU Current Temp : 48 C
GPU Shutdown Temp : N/A
GPU Slowdown Temp : N/A
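One way to pair each bus ID with the temperature on the line after it is to let awk remember the last ID it saw. This is only a minimal sketch, assuming the "nvidia-smi -q -d temperature" layout shown above; "pair_gpu_temps" is a hypothetical helper name, not part of nvidia-smi:

```shell
# Hypothetical helper: reads `nvidia-smi -q -d temperature` text on stdin and
# prints one "BUS_ID TEMP" line per GPU.
pair_gpu_temps() {
    awk '
        # A line like "GPU 0000:01:00.0" has exactly two fields: remember the ID.
        $1 == "GPU" && NF == 2 { id = $2 }
        # A line like "GPU Current Temp : 57 C": the number is the next-to-last field.
        /GPU Current Temp/     { print id, $(NF - 1) }
    '
}

# Usage on a machine with the NVIDIA driver installed:
#   nvidia-smi -q -d temperature | pair_gpu_temps
```

Each output line then has the form "0000:01:00.0 57", so "grep" can select a GPU by bus ID and the temperature is on the same line.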
This command shows the temperature on the matching line itself, but there's no GPU number:
Code:
# nvidia-smi -q -d temperature | grep "GPU Current Temp"
GPU Current Temp : 58 C
GPU Current Temp : 47 C
GPU Current Temp : 47 C
GPU Current Temp : 48 C
Code:
# nvidia-smi -q --gpu=0 | grep "GPU Current Temp"
GPU Current Temp : 59 C
Code:
# nvidia-smi -L
GPU 0: GeForce GTX 750 Ti (UUID: GPU-9785c7c7-732f-1f51-..........)
GPU 1: GeForce GTX 750 (UUID: GPU-b2b1a4a-4dca-0c7f-..........)
GPU 2: GeForce GTX 750 (UUID: GPU-5e6b8efd-7531-777c-..........)
GPU 3: GeForce GTX 750 Ti (UUID: GPU-5b2b1a2f-3635-2a1c-..........)
Code:
# nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader
58
47
47
48
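Since this command prints one bare temperature per line, the value for a given GPU could be picked by line number. A minimal sketch, assuming the rows come out in GPU index order (GPU 0 first); "gpu_temp_at" is a hypothetical helper name:

```shell
# Hypothetical helper: prints the temperature of GPU number $1 from a list of
# temperatures (one per line, GPU 0 first) read on stdin.
gpu_temp_at() {
    # sed counts lines from 1, while GPU indices start at 0.
    sed -n "$(( $1 + 1 ))p"
}

# Possible use inside a "for i in 0 1 2 3" loop:
#   newstr=$(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader | gpu_temp_at "$i")
```

This avoids both "grep" and the fixed-offset substring, because the line already contains nothing but the number.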
What I'm wishing for: if I could get a command that produced output like this, I would be the happiest guy around:
Code:
GPU 0: GeForce GTX 750 Ti GPU Current Temp : 58 C
GPU 1: GeForce GTX 750 GPU Current Temp : 47 C
GPU 2: GeForce GTX 750 GPU Current Temp : 47 C
GPU 3: GeForce GTX 750 Ti GPU Current Temp : 48 C
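Output in this shape can probably be built from the CSV query mode by asking for several fields at once. A minimal sketch, assuming the "index", "name" and "temperature.gpu" query fields are available in the installed driver version; "format_gpu_temps" is a hypothetical helper name:

```shell
# Hypothetical helper: turns "index, name, temperature" CSV rows (stdin) into
# lines shaped like the wished-for output.
format_gpu_temps() {
    while IFS=',' read -r idx name temp; do
        # nvidia-smi separates CSV fields with ", ", so drop the leading space.
        name=${name# }
        temp=${temp# }
        printf 'GPU %s: %s GPU Current Temp : %s C\n' "$idx" "$name" "$temp"
    done
}

# Usage on a machine with the NVIDIA driver installed:
#   nvidia-smi --query-gpu=index,name,temperature.gpu --format=csv,noheader | format_gpu_temps
```

Because the model names differ in length, the temperature does not sit at a fixed column. After selecting a line with grep "GPU $i:", taking the next-to-last field (for example with awk '{print $(NF-1)}') is safer than a fixed-offset substring.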
For reference, this is the "sensors" output that the current script parses:
Code:
# -----------------------------------------------------------
# coretemp-isa-0000
# Adapter: ISA adapter
# Physical id 0: +56.0°C (high = +80.0°C, crit = +100.0°C)
# Core 0: +56.0°C (high = +80.0°C, crit = +100.0°C)
# Core 1: +54.0°C (high = +80.0°C, crit = +100.0°C)
# Core 2: +54.0°C (high = +80.0°C, crit = +100.0°C)
# Core 3: +52.0°C (high = +80.0°C, crit = +100.0°C)
# -----------------------------------------------------------
And this is the relevant part of the current script:
Code:
[...]
echo "JOB RUN AT $(date)"
echo "======================================="
echo ''
echo 'CPU Warning Limit set to => '$1
echo 'CPU Shutdown Limit set to => '$2
echo ''
echo ''
sensors
echo ''
echo ''
for i in 0 1 2 3
do
    # Pick the line for this core and cut out the two temperature digits
    str=$(sensors | grep "Core $i:")
    newstr=${str:17:2}
    if [ "${newstr}" -ge "$1" ]
    then
        echo '====================================================================' >>/home/......../logs/watchdogcputemp.log
        echo $(date) >>/home/......../logs/watchdogcputemp.log
        echo '' >>/home/......../logs/watchdogcputemp.log
        echo ' STATUS WARNING - NOTIFYING : TEMPERATURE CORE' $i 'EXCEEDED' $1 '=>' $newstr >>/home/......../logs/watchdogcputemp.log
        echo ' ACTION : EMAIL SENT' >>/home/......../logs/watchdogcputemp.log
        echo '' >>/home/......../logs/watchdogcputemp.log
        echo '====================================================================' >>/home/......../logs/watchdogcputemp.log
        # Status Warning Email Sending Code
        # WatchdogCpuTemp Alert! Status Warning - Notifying!"
        /usr/bin/msmtp -d --read-recipients </home/......../shellscripts/messages/watchdogcputempwarning.txt
        echo 'Email Sent.....'
    fi
[...]
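A GPU version of the same check might read index and temperature together from the CSV query and compare each value against the two limits. This is a minimal sketch rather than a drop-in replacement: "check_gpu_temps" is a hypothetical helper, the messages are placeholders for the logging and email lines above, and it assumes the "index" and "temperature.gpu" query fields are available:

```shell
# Hypothetical helper: reads "index, temperature" rows on stdin and reports
# every GPU at or above the warning limit ($1) or the shutdown limit ($2).
check_gpu_temps() {
    warn_limit=$1
    shutdown_limit=$2
    while IFS=',' read -r idx temp; do
        temp=${temp# }                      # drop the space after the comma
        case $temp in
            ''|*[!0-9]*) continue ;;        # skip rows without a plain number
        esac
        if [ "$temp" -ge "$shutdown_limit" ]; then
            echo "SHUTDOWN: GPU $idx at $temp C (limit $shutdown_limit)"
        elif [ "$temp" -ge "$warn_limit" ]; then
            echo "WARNING: GPU $idx at $temp C (limit $warn_limit)"
        fi
    done
}

# Usage on a machine with the NVIDIA driver installed:
#   nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader | check_gpu_temps "$1" "$2"
```

Reading the index from the same row as the temperature means the loop does not depend on a hard-coded "0 1 2 3" list, and rows with a non-numeric value are skipped instead of breaking the integer comparison.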
I hope there's a bash-script guru out there who is ready to solve this issue.
Have a nice weekend!
Kind Regards,
Dan Hansen
Denmark