Shell script & "nvidia-smi" - needs right command/flag!

danhansen@denmark
Member
Posts: 14
Joined: May 27th, 2013, 3:43 pm

Post by danhansen@denmark »

Hi friends,


I got a script from one of you guys some time ago and I'm still using it, but I need some help modifying it. I need it to work on the GPUs instead of the CPU cores. I've tried to explain the issue as clearly as I can.

I've got a problem with a shell script and the "nvidia-smi" command!

I have a script that serves as protection against CPU overheating on my Ubuntu Server 14.04.2. The script works nicely, but I need to make it work on my 4 GPUs as well.
I'm pretty green when it comes to bash scripts, so I've been looking for commands that would make it easy for me to edit the script. I found and tested a lot of them, but none seems to give me the output I need! I'll show you the commands and their output below, and the scripts as well.

What I need is a command that lists the GPUs the same way the "sensors" command from "lm-sensors" lists the CPU cores, so that I can use "grep" to select a GPU and set the variable "newstr" (the two-digit temperature). I've been trying for a couple of days but have had no luck, mostly because the commands "nvidia-smi -lso" and/or "nvidia-smi -lsa" don't exist anymore. I think they were experimental.

Here are the commands I found and tested, with their output:

This command shows the GPU socket number, which I could put into the string "str", but the problem is that the temperature is on the next line. I've been fiddling with the grep flag "-A 1" but haven't been able to put it into the script:

Code:

# nvidia-smi -q -d temperature | grep GPU
Attached GPUs                       : 4
GPU 0000:01:00.0
        GPU Current Temp            : 57 C
        GPU Shutdown Temp           : N/A
        GPU Slowdown Temp           : N/A
GPU 0000:02:00.0
        GPU Current Temp            : 47 C
        GPU Shutdown Temp           : N/A
        GPU Slowdown Temp           : N/A
GPU 0000:03:00.0
        GPU Current Temp            : 47 C
        GPU Shutdown Temp           : N/A
        GPU Slowdown Temp           : N/A
GPU 0000:04:00.0
        GPU Current Temp            : 48 C
        GPU Shutdown Temp           : N/A
        GPU Slowdown Temp           : N/A

This command shows the temperature on each line, but there's no GPU number:

Code:

# nvidia-smi -q -d temperature | grep "GPU Current Temp"
        GPU Current Temp            : 58 C
        GPU Current Temp            : 47 C
        GPU Current Temp            : 47 C
        GPU Current Temp            : 48 C

This command queries only the GPU you select, but the output still doesn't show the GPU number/socket/ID:

Code:

# nvidia-smi -q --gpu=0 | grep "GPU Current Temp"
GPU Current Temp            : 59 C

And this command shows the GPU number and the name on the same line, but no temperature:

Code:

# nvidia-smi -L
GPU 0: GeForce GTX 750 Ti (UUID: GPU-9785c7c7-732f-1f51-..........)
GPU 1: GeForce GTX 750 (UUID: GPU-b2b1a4a-4dca-0c7f-..........)
GPU 2: GeForce GTX 750 (UUID: GPU-5e6b8efd-7531-777c-..........)
GPU 3: GeForce GTX 750 Ti (UUID: GPU-5b2b1a2f-3635-2a1c-..........)

And here's a command that shows the temperatures of all 4 GPUs without anything else. But I still need the GPU number/socket/ID:

Code:

# nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader
58
47
47
48

What I'm wishing for: if I could get a command that produced output like this, I would be the happiest guy around:

Code:

GPU 0: GeForce GTX 750 Ti   GPU Current Temp            : 58 C
GPU 1: GeForce GTX 750   GPU Current Temp            : 47 C
GPU 2: GeForce GTX 750   GPU Current Temp            : 47 C
GPU 3: GeForce GTX 750 Ti   GPU Current Temp            : 48 C

Here's the output from "sensors" (part of "lm-sensors"). As you can see, the unit info and the temperature are on the same line:

Code:

# -----------------------------------------------------------
# coretemp-isa-0000
# Adapter: ISA adapter
# Physical id 0:  +56.0°C  (high = +80.0°C, crit = +100.0°C)
# Core 0:         +56.0°C  (high = +80.0°C, crit = +100.0°C)
# Core 1:         +54.0°C  (high = +80.0°C, crit = +100.0°C)
# Core 2:         +54.0°C  (high = +80.0°C, crit = +100.0°C)
# Core 3:         +52.0°C  (high = +80.0°C, crit = +100.0°C)
# -----------------------------------------------------------

Here's the part of the script that needs changing. As mentioned at the top, it works using the "sensors" command from "lm-sensors". "lm-sensors" doesn't show the GPU temperatures when CUDA is running with the NVIDIA driver loaded, so we need another command to list the GPUs and show their temperatures. If you know another way to fix my problem, please don't hesitate to show me:

Code:

[...]
echo "JOB RUN AT $(date)"
echo "======================================="

echo ''
echo 'CPU Warning Limit set to => '$1
echo 'CPU Shutdown Limit set to => '$2
echo ''
echo ''

sensors

echo ''
echo ''

for i in 0 1 2 3
do

  str=$(sensors | grep "Core $i:")
  newstr=${str:17:2}

  if [ ${newstr} -ge $1 ]
  then
    echo '===================================================================='         >>/home/......../logs/watchdogcputemp.log
    echo $(date)                                                                        >>/home/......../logs/watchdogcputemp.log
    echo ''                                                                             >>/home/......../logs/watchdogcputemp.log
    echo ' STATUS WARNING - NOTIFYING : TEMPERATURE CORE' $i 'EXCEEDED' $1 '=>' $newstr >>/home/......../logs/watchdogcputemp.log
    echo ' ACTION : EMAIL SENT'                                                         >>/home/......../logs/watchdogcputemp.log
    echo ''                                                                             >>/home/......../logs/watchdogcputemp.log
    echo '===================================================================='         >>/home/......../logs/watchdogcputemp.log

# Status Warning Email Sending Code
# WatchdogCpuTemp Alert! Status Warning - Notifying!"

/usr/bin/msmtp -d --read-recipients </home/......../shellscripts/messages/watchdogcputempwarning.txt

    echo 'Email Sent.....'
  fi
[...]


I hope there's a bash-script guru out there, ready to solve this issue.
Have a nice weekend!

Kind Regards,
Dan Hansen
Denmark

ricksebak
Member
Posts: 33
Joined: February 10th, 2013, 9:34 pm

Post by ricksebak »

I don't use nvidia, so there might well be a more elegant way to do this. But based on what I see in your post:

Code:

for TEMP in $(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader)
do
  if [ "$TEMP" -gt 100 ]
  then
    # run some commands here, maybe send an email or whatever you want
    echo "A GPU is at $TEMP C"
  fi
done
[\code]

This will loop through all the temperatures that you found when running "nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader", and if any of them is above 100, it will "run some commands." It doesn't track which GPU is overheated, only that some GPU is overheated. But I can't imagine that you, as a human, care which GPU is overheated; you only care that some GPU is overheated.
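
If it does matter which GPU is hot, the loop above could be extended along these lines. This is only a sketch: it assumes nvidia-smi prints one "index, temperature" pair per line when asked for --query-gpu=index,temperature.gpu --format=csv,noheader, the function name hot_gpus is made up for illustration, and the sample values are taken from the first post so that the block runs without a GPU.

```shell
# hot_gpus LIMIT: read "index, temperature" lines on stdin and
# print one message for each GPU whose temperature is above LIMIT.
hot_gpus() {
  local limit=$1 index temp
  while IFS=', ' read -r index temp; do
    # Skip blank or non-numeric readings instead of letting [ ] fail
    case $temp in ''|*[!0-9]*) continue ;; esac
    if [ "$temp" -gt "$limit" ]; then
      echo "GPU $index is at $temp C"
    fi
  done
}

# Sample input in the assumed CSV shape; values copied from the thread.
printf '%s\n' '0, 58' '1, 47' '2, 47' '3, 48' | hot_gpus 50
```

On a machine with the driver installed, the printf line would be replaced by something like nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader | hot_gpus 100.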

Post by ricksebak »

And obviously, don't use the [code] lines that I used. I clearly don't know how to use a message board. But the rest of it should work in bash.

Post by danhansen@denmark »

Hi Ricksebak,


Thanks for your reply!!

I have to say that I'm not that good yet; I'm still very much a "learner". But I can see what you mean, and by using that command the script I'm making would be compatible with other GPUs! I will make another version of the script that uses this, so thank you!
But for now I've found a command that shows the GPU socket info, a text to "grep" and the temperature on the same line. This way I can edit my script and make it work right now. It's not perfect and I will need to improve it, but for now it works. So thanks for your help; I will certainly use this when I'm making it better!

For others with the same issue, here's how I did it:

Code:

nvidia-smi -q -d temperature | grep GPU | perl -pe '/^GPU/ && s/\n//' | grep ^GPU
GPU 0000:01:00.0        GPU Current Temp            : 54 C
GPU 0000:02:00.0        GPU Current Temp            : 47 C
GPU 0000:03:00.0        GPU Current Temp            : 52 C
GPU 0000:04:00.0        GPU Current Temp            : 51 C

And here's the answer that solved the issue, should you be interested in how it was done ;)
http://askubuntu.com/questions/638665/s ... 828#641828

Thanks for the help ;)

Kind Regards,
Dan
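
For readers wondering what the pipeline in this post does: "grep GPU" keeps every line that mentions GPU, the perl step deletes the newline at the end of each line that starts with GPU (so the ID line is glued to the temperature line that follows it), and the final "grep ^GPU" drops the leftover indented lines. The same pairing can be sketched in awk. The function name gpu_temps is hypothetical, and the input is sample text copied from the first post rather than live nvidia-smi output.

```shell
# gpu_temps: read `nvidia-smi -q -d temperature` style text on stdin and
# print "<id> <temperature>" for each GPU (hypothetical helper name).
gpu_temps() {
  awk '
    /^GPU /            { id = $2 }             # unindented line: remember the ID
    /GPU Current Temp/ { print id, $(NF-1) }   # value is the field just before "C"
  '
}

# Sample input copied from the first post, so this runs without a GPU.
gpu_temps <<'EOF'
Attached GPUs                       : 4
GPU 0000:01:00.0
        GPU Current Temp            : 57 C
        GPU Shutdown Temp           : N/A
GPU 0000:02:00.0
        GPU Current Temp            : 47 C
EOF
```

With real hardware the here-document would be replaced by nvidia-smi -q -d temperature | gpu_temps, assuming the output layout matches the one shown in this thread.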

Post by danhansen@denmark »

Hi again,


Any chance you might look at this? I've got a command that produces output in the same form, and now I need it applied in the script. Can you help me edit the script?

First, the command that produces the GPU socket and temperature output:

Code:

nvidia-smi -q -d temperature | grep GPU | perl -pe '/^GPU/ && s/\n//' | grep ^GPU

Then the output:

Code:

# nvidia-smi -q -d temperature | grep GPU | perl -pe '/^GPU/ && s/\n//' | grep ^GPU
GPU 0000:01:00.0        GPU Current Temp            : 49 C
GPU 0000:02:00.0        GPU Current Temp            : 39 C
GPU 0000:03:00.0        GPU Current Temp            : 44 C
GPU 0000:04:00.0        GPU Current Temp            : 47 C

And then the script where I've tried to use the command. The first string "str" is set as it is supposed to be, but the second string "newstr", which is the temperature, doesn't work!? I've tried everything. Please help me if you can.

Code:

#!/bin/bash

# --- WatchdogGpuTemp.sh v.0.1.2 ---
# Author: DanHansen[at]Denmark
# Thanks to HaveTheKnowHow.com
# Thanks to "Terdon" Ubuntu Forums
# Application: nvidia-smi
# Filename: watchdoggputemp.sh
# Logfile: watchdoggputemp.log
# Message file for status warning: watchdoggputempwarning.txt
# Message file for status critical: watchdoggputempcritical.txt
# Work directory: /home/username/shellscripts/
# Log directory: /home/username/logs/
# Message directory: /home/username/shellscripts/messages/
#
# --- WatchdogGpuTemp.sh v.0.1.2 ---

echo "JOB RUN AT $(date)"
echo "======================================="

echo ''
echo 'CPU Warning Limit set to => '$1
echo 'CPU Shutdown Limit set to => '$2
echo ''
echo ''

nvidia-smi -q -d temperature | grep GPU | perl -pe '/^GPU/ && s/\n//' | grep ^GPU

echo ''
echo ''

for i in 1 2 3 4
do

  str=$(nvidia-smi -q -d temperature | grep GPU | perl -pe '/^GPU/ && s/\n//' | grep ^GPU "GPU 0000:0$i:00.0")
  newstr=${str:54:2}

  if [ ${newstr} -ge $1 ]
  then
    echo '===================================================================='        >>/home/username/logs/watchdoggputemp.log
    echo $(date)                                                                       >>/home/username/logs/watchdoggputemp.log
    echo ''                                                                            >>/home/username/logs/watchdoggputemp.log
    echo ' STATUS WARNING - NOTIFYING : TEMPERATURE GPU' $i 'EXCEEDED' $1 '=>' $newstr >>/home/username/logs/watchdoggputemp.log
    echo ' ACTION : EMAIL SENT'                                                        >>/home/username/logs/watchdoggputemp.log
    echo ''                                                                            >>/home/username/logs/watchdoggputemp.log
    echo '===================================================================='        >>/home/username/logs/watchdoggputemp.log

# Status Warning Email Sending Code 
# WatchdogGpuTemp Alert! Status Warning - Notifying!"

/usr/bin/msmtp -d --read-recipients </home/username/shellscripts/messages/watchdoggputempwarning.txt

    echo 'Email Sent.....'
  fi

  if [ ${newstr} -ge $2 ]
  then
    echo '===================================================================='        >>/home/username/logs/watchdoggputemp.log
    echo $(date)                                                                       >>/home/username/logs/watchdoggputemp.log
    echo ''                                                                            >>/home/username/logs/watchdoggputemp.log
    echo ' STATUS CRITICAL - SHUTDOWN : TEMPERATURE GPU' $i 'EXCEEDED' $2 '=>' $newstr >>/home/username/logs/watchdoggputemp.log
    echo ' ACTION : EMAIL SENT & SYSTEM SHUTDOWN'                                      >>/home/username/logs/watchdoggputemp.log
    echo ''                                                                            >>/home/username/logs/watchdoggputemp.log
    echo '===================================================================='        >>/home/username/logs/watchdoggputemp.log

# Status Critical Email Sending Code:
# WatchdogGpuTemp Alert! Status Critical - Shutdown!"

/usr/bin/msmtp -d --read-recipients </home/username/shellscripts/messages/watchdoggputempcritical.txt

    echo 'Email Sent.....'
    echo 'System will now shutdown.....'
    /sbin/shutdown -h now
    exit

  else
    echo ' Temperature GPU '$i' OK at =>' $newstr
    echo ''
  fi
done

echo 'Status - All GPUs are within critical temperature limits'
echo ''

And the output when running the script. Please notice that the GPU number is read, but not the temperature ("newstr=${str:54:2}"):

Code:

# ./watchdoggputemp.sh 55 60
JOB RUN AT Sun Jun 28 10:13:57 CEST 2015
=======================================

CPU Warning Limit set to => 55
CPU Shutdown Limit set to => 60


GPU 0000:01:00.0        GPU Current Temp            : 49 C
GPU 0000:02:00.0        GPU Current Temp            : 46 C
GPU 0000:03:00.0        GPU Current Temp            : 52 C
GPU 0000:04:00.0        GPU Current Temp            : 51 C


grep: GPU 0000:01:00.0: No such file or directory
./watchdoggputemp.sh: line 68: [: -ge: unary operator expected
./watchdoggputemp.sh: line 86: [: -ge: unary operator expected
 Temperature GPU 1 OK at =>

grep: GPU 0000:02:00.0: No such file or directory
./watchdoggputemp.sh: line 68: [: -ge: unary operator expected
./watchdoggputemp.sh: line 86: [: -ge: unary operator expected
 Temperature GPU 2 OK at =>

grep: GPU 0000:03:00.0: No such file or directory
./watchdoggputemp.sh: line 68: [: -ge: unary operator expected
./watchdoggputemp.sh: line 86: [: -ge: unary operator expected
 Temperature GPU 3 OK at =>

grep: GPU 0000:04:00.0: No such file or directory
./watchdoggputemp.sh: line 68: [: -ge: unary operator expected
./watchdoggputemp.sh: line 86: [: -ge: unary operator expected
 Temperature GPU 4 OK at =>

Status - All GPUs are within critical temperature limits

Hoping to hear from you. The script came from here to begin with, so wouldn't it be nice if you were the one to solve it ;) Anyway, I'm thankful for any help you can provide!

Kind Regards,
Dan

Post by ricksebak »

I don't normally use string splitting or whatever "newstr=${str:54:2}" would be called. But it looks like you are trying to grab the temperature value, which could also be done using awk:

newstr=`echo $str | awk '{print $7}'`
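
To illustrate the difference between the two approaches, here is a small sketch on one sample line copied from the thread. It assumes the spacing is exactly as it appears in the post. The fixed-offset form depends on that spacing, while awk splits on runs of whitespace, so the temperature is the 7th field however the columns are padded.

```shell
# Sample line copied from the thread (spacing as it appears there).
str='GPU 0000:01:00.0        GPU Current Temp            : 49 C'

# Fixed-offset substring: only correct while the column layout never changes.
by_offset=${str:54:2}

# Field-based extraction: whitespace runs collapse, so field 7 is the number.
by_field=$(echo "$str" | awk '{print $7}')

echo "offset: '$by_offset'  field: '$by_field'"
```

Note that both forms return an empty string when str itself is empty, so neither one helps if the earlier grep found nothing.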

Post by danhansen@denmark »

Hi Ricksebak ;)

Thanks for getting back to me ;)

Here's the script, somewhat modified from the one I got from here way back, but it works the same way. I just added some system notification and a little logging. At the top of the script you can see the output the script needs in order to run.

How would you make this work with the GPUs? I don't care which command does it; I would just like the script to work the same way, using the 2 variables given at the start: "shellscriptname.sh xx xx" (shellscriptname.sh $1 $2).

As you can see in my other posts, I've been looking for a command that produces the same kind of output, so that I could "grep" it and set the string "str" for core 0, 1, 2 or 3 --> str=$(sensors | grep "Core $i:"). Then it sets a new string "newstr", which is the temperature. It does that using that funny command --> newstr=${str:17:2}. Using the 2 variables from the start, $1 and $2 (e.g. 55 and 60 degrees Celsius), it checks whether "core 0" has reached each value. If it's at or above $1, it warns by logging and mailing. If it's at or above $2, it warns, logs and shuts down the system. Then it returns to the top of the loop and checks the next core, "core 1", using the loop "for i in 0 1 2 3 do".
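
The two-threshold logic described above can be sketched as a small function. This is an illustration, not the script from this thread: the name classify_temp is made up, it only prints a status word instead of logging, mailing or shutting down, and unlike the original (which runs both "if" blocks on a critical reading) it reports just the most severe state. It also rejects an empty or non-numeric reading rather than letting "[" fail.

```shell
# classify_temp TEMP WARN CRIT: print OK, WARNING, CRITICAL or UNKNOWN
# (hypothetical helper; thresholds use -ge like the script in the thread).
classify_temp() {
  local temp=$1 warn=$2 crit=$3
  # Refuse anything that is not a plain non-negative integer
  case $temp in
    ''|*[!0-9]*) echo "UNKNOWN"; return 1 ;;
  esac
  if [ "$temp" -ge "$crit" ]; then
    echo "CRITICAL"
  elif [ "$temp" -ge "$warn" ]; then
    echo "WARNING"
  else
    echo "OK"
  fi
}

classify_temp 49 55 60
classify_temp 57 55 60
classify_temp 61 55 60
classify_temp "" 55 60 || echo "(empty reading rejected)"
```

In a real watchdog, the WARNING and CRITICAL branches are where the logging, msmtp and shutdown commands from the script would go.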

Well, I can see you HaveTheKnowHow, so I'm looking forward to hearing from you again. Actually this is a little like Christmas ;)

Code:

  str=$(sensors | grep "Core $i:")
  newstr=${str:17:2}

Maybe you'll have a solution when you see the working script for CPU cores. Can you help me modify this?

The script working on a CPU with 4 cores (the original script worked on 2 cores):

Code:

#!/bin/bash

# --- WatchdogCpuTemp.sh v.0.1.7 ---
# This script will warn, mail and log when temperature of one or more cores hit 55 degrees and warn, mail, log and shutdown when either hits 60 degrees.
# Expects two arguments:
#    1. Warning temperature
#    2. Critical shutdown temperature
#    e.g. command # ./watchdogcputemp.sh 55 60
#
# Please notice!
# Assumes output from sensors command is as follows:
# If not then modify the commands " str=$(sensors | grep "Core $i:") " & " newstr=${str:17:2} " below accordingly
# If your CPU has got more or less than 4 cores, just change this " for i in 0 1 2 3 " accordingly
# -----------------------------------------------------------
# coretemp-isa-0000
# Adapter: ISA adapter
# Physical id 0:  +56.0°C  (high = +80.0°C, crit = +100.0°C)
# Core 0:         +56.0°C  (high = +80.0°C, crit = +100.0°C)
# Core 1:         +54.0°C  (high = +80.0°C, crit = +100.0°C)
# Core 2:         +54.0°C  (high = +80.0°C, crit = +100.0°C)
# Core 3:         +52.0°C  (high = +80.0°C, crit = +100.0°C)
# -----------------------------------------------------------


echo "JOB RUN AT $(date)"
echo "======================================="

echo ''
echo 'CPU Warning Limit set to => '$1
echo 'CPU Shutdown Limit set to => '$2
echo ''
echo ''

sensors

echo ''
echo ''

for i in 0 1 2 3
do


  str=$(sensors | grep "Core $i:")
  newstr=${str:17:2}


  if [ ${newstr} -ge $1 ]
  then
    echo '===================================================================='         >>/home/username/logs/watchdogcputemp.log
    echo $(date)                                                                        >>/home/username/logs/watchdogcputemp.log
    echo ''                                                                             >>/home/username/logs/watchdogcputemp.log
    echo ' STATUS WARNING - NOTIFYING : TEMPERATURE CORE' $i 'EXCEEDED' $1 '=>' $newstr >>/home/username/logs/watchdogcputemp.log
    echo ' ACTION : EMAIL SENT'                                                         >>/home/username/logs/watchdogcputemp.log
    echo ''                                                                             >>/home/username/logs/watchdogcputemp.log
    echo '===================================================================='         >>/home/username/logs/watchdogcputemp.log

# Status Warning Email Sending Code 
# WatchdogCpuTemp Alert! Status Warning - Notifying!"

/usr/bin/msmtp -d --read-recipients </home/username/shellscripts/messages/watchdogcputempwarning.txt

    echo 'Email Sent.....'
  fi
  
  if [ ${newstr} -ge $2 ]
  then
    echo '===================================================================='         >>/home/username/logs/watchdogcputemp.log
    echo $(date)                                                                        >>/home/username/logs/watchdogcputemp.log
    echo ''                                                                             >>/home/username/logs/watchdogcputemp.log
    echo ' STATUS CRITICAL - SHUTDOWN : TEMPERATURE CORE' $i 'EXCEEDED' $2 '=>' $newstr >>/home/username/logs/watchdogcputemp.log
    echo ' ACTION : EMAIL SENT & SYSTEM SHUTDOWN'                                       >>/home/username/logs/watchdogcputemp.log
    echo ''                                                                             >>/home/username/logs/watchdogcputemp.log
    echo '===================================================================='         >>/home/username/logs/watchdogcputemp.log
	
# Status Critical Email Sending Code:
# WatchdogCpuTemp Alert! Status Critical - Shutdown!"

/usr/bin/msmtp -d --read-recipients </home/username/shellscripts/messages/watchdogcputempcritical.txt

    echo 'Email Sent.....'
    echo 'System will now shutdown.....'
    /sbin/shutdown -h now
    exit
  
  else
    echo ' Temperature Core '$i' OK at =>' $newstr
    echo ''
  fi
done

echo 'Status - All CPU Cores are within critical temperature limits'
echo ''

Post by ricksebak »

In your watchdoggputemp.sh script that you posted previously, line 36 looks like you are attempting to set newstr to the value of the GPU temperature. I don't really know why that isn't working, but if you change line 36 in the way that I mentioned, newstr should be able to find the temperature value and the rest of the script should work (although I don't use nvidia so I haven't tested it).

Change line 36 to:

newstr=`echo $str | awk '{print $7}'`
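
One possible reading of the errors in the posted output, offered as a sketch rather than a tested fix: grep takes its first non-option argument as the pattern and any further arguments as file names. In grep ^GPU "GPU 0000:0$i:00.0" the quoted string would therefore be treated as a file to open, which matches the "No such file or directory" messages. str would then be empty, newstr would be empty too, and "[ -ge 55 ]" fails with "unary operator expected". The example below passes a single pattern instead. It uses sample lines copied from the thread in place of nvidia-smi, and the function name select_gpu_line is made up for illustration.

```shell
# Sample lines copied from the thread, standing in for the nvidia-smi pipeline.
sample='GPU 0000:01:00.0        GPU Current Temp            : 49 C
GPU 0000:02:00.0        GPU Current Temp            : 39 C
GPU 0000:03:00.0        GPU Current Temp            : 44 C
GPU 0000:04:00.0        GPU Current Temp            : 47 C'

# select_gpu_line N: print the joined line for 0000:0N:00.0 (hypothetical helper).
select_gpu_line() {
  # One pattern argument only; -F matches it literally (dots are not wildcards).
  printf '%s\n' "$sample" | grep -F "GPU 0000:0$1:00.0"
}

for i in 1 2 3 4; do
  str=$(select_gpu_line "$i")
  newstr=$(echo "$str" | awk '{print $7}')
  echo "GPU $i => $newstr"
done
```

In the real script, the printf would be replaced by the nvidia-smi, grep and perl pipeline, with the final grep given only the one pattern.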

Post by danhansen@denmark »

Hi Ricksebak,

The script I just showed you is the one using lm-sensors and the "sensors" command. That command produces output that works in this script.
lm-sensors cannot be used when the CUDA/NVIDIA driver is installed. Therefore I needed a command that produces output like that of the lm-sensors ("sensors") command. So my last post was the CPU version of the script. I just wanted to show it to you so that you could see how it works with lm-sensors and the CPU cores.

At the top of the thread I showed all my attempts to find that command. I couldn't! But yesterday I got a command that works; I just can't seem to get the second string, "newstr". The first one works just fine: it greps the GPU number. But I'm struggling with the second string, "newstr" (the temperature).

Here's the command that shows the GPU info along with the socket ID/GPU number and the temperature on the same line:

Code:

nvidia-smi -q -d temperature | grep GPU | perl -pe '/^GPU/ && s/\n//' | grep ^GPU

And the output, which looks perfect:

Code:

# nvidia-smi -q -d temperature | grep GPU | perl -pe '/^GPU/ && s/\n//' | grep ^GPU
GPU 0000:01:00.0        GPU Current Temp            : 49 C
GPU 0000:02:00.0        GPU Current Temp            : 39 C
GPU 0000:03:00.0        GPU Current Temp            : 44 C
GPU 0000:04:00.0        GPU Current Temp            : 47 C

Here we are setting the string "str". On the first cycle it's GPU 1, on the next cycle it's GPU 2, and so on:

Code:

str=$(nvidia-smi -q -d temperature | grep GPU | perl -pe '/^GPU/ && s/\n//' | grep ^GPU "GPU 0000:0$i:00.0")

But the next line in the script doesn't work. I can't "pick up" the temperature. It works with the sensors command in the CPU version that I showed, but not here. That's what I need, or another way to set the 2 strings "str" and "newstr" in the script.

Code:

newstr=${str:54:2}

Here's my attempt to make it work on the GPUs:

Code:

#!/bin/bash

# --- WatchdogGpuTemp.sh v.0.1.2 ---
# Author: DanHansen[at]Denmark
# Thanks to HaveTheKnowHow.com
# Thanks to "Terdon" Ubuntu Forums
# Application: nvidia-smi
# Filename: watchdoggputemp.sh
# Logfile: watchdoggputemp.log
# Message file for status warning: watchdoggputempwarning.txt
# Message file for status critical: watchdoggputempcritical.txt
# Work directory: /home/username/shellscripts/
# Log directory: /home/username/logs/
# Message directory: /home/username/shellscripts/messages/
#
# --- WatchdogGpuTemp.sh v.0.1.2 ---

echo "JOB RUN AT $(date)"
echo "======================================="

echo ''
echo 'GPU Warning Limit set to => '$1
echo 'GPU Shutdown Limit set to => '$2
echo ''
echo ''

nvidia-smi -q -d temperature | grep GPU | perl -pe '/^GPU/ && s/\n//' | grep ^GPU

echo ''
echo ''

for i in 1 2 3 4
do

  str=$(nvidia-smi -q -d temperature | grep GPU | perl -pe '/^GPU/ && s/\n//' | grep ^GPU "GPU 0000:0$i:00.0")
  newstr=${str:54:2}

  if [ ${newstr} -ge $1 ]
  then
    echo '===================================================================='        >>/home/username/logs/watchdoggputemp.log
    echo $(date)                                                                       >>/home/username/logs/watchdoggputemp.log
    echo ''                                                                            >>/home/username/logs/watchdoggputemp.log
    echo ' STATUS WARNING - NOTIFYING : TEMPERATURE GPU' $i 'EXCEEDED' $1 '=>' $newstr >>/home/username/logs/watchdoggputemp.log
    echo ' ACTION : EMAIL SENT'                                                        >>/home/username/logs/watchdoggputemp.log
    echo ''                                                                            >>/home/username/logs/watchdoggputemp.log
    echo '===================================================================='        >>/home/username/logs/watchdoggputemp.log

# Status Warning Email Sending Code
# WatchdogGpuTemp Alert! Status Warning - Notifying!"

/usr/bin/msmtp -d --read-recipients </home/username/shellscripts/messages/watchdoggputempwarning.txt

    echo 'Email Sent.....'
  fi

  if [ ${newstr} -ge $2 ]
  then
    echo '===================================================================='        >>/home/username/logs/watchdoggputemp.log
    echo $(date)                                                                       >>/home/username/logs/watchdoggputemp.log
    echo ''                                                                            >>/home/username/logs/watchdoggputemp.log
    echo ' STATUS CRITICAL - SHUTDOWN : TEMPERATURE GPU' $i 'EXCEEDED' $2 '=>' $newstr >>/home/username/logs/watchdoggputemp.log
    echo ' ACTION : EMAIL SENT & SYSTEM SHUTDOWN'                                      >>/home/username/logs/watchdoggputemp.log
    echo ''                                                                            >>/home/username/logs/watchdoggputemp.log
    echo '===================================================================='        >>/home/username/logs/watchdoggputemp.log

# Status Critical Email Sending Code:
# WatchdogGpuTemp Alert! Status Critical - Shutdown!"

/usr/bin/msmtp -d --read-recipients </home/username/shellscripts/messages/watchdoggputempcritical.txt

    echo 'Email Sent.....'
    echo 'System will now shutdown.....'
    /sbin/shutdown -h now
    exit

  else
    echo ' Temperature GPU '$i' OK at =>' $newstr
    echo ''
  fi
done

echo 'Status - All GPUs are within critical temperature limits'
echo ''

When running the script, this is the output. As you can see, it "greps" the GPU number like it's supposed to, but it doesn't get the temperature into "newstr". That's the problem:

Code:

# ./watchdoggputemp.sh 55 60
JOB RUN AT Sun Jun 28 10:13:57 CEST 2015
=======================================

CPU Warning Limit set to => 55
CPU Shutdown Limit set to => 60


GPU 0000:01:00.0        GPU Current Temp            : 49 C
GPU 0000:02:00.0        GPU Current Temp            : 46 C
GPU 0000:03:00.0        GPU Current Temp            : 52 C
GPU 0000:04:00.0        GPU Current Temp            : 51 C


grep: GPU 0000:01:00.0: No such file or directory
./watchdoggputemp.sh: line 68: [: -ge: unary operator expected
./watchdoggputemp.sh: line 86: [: -ge: unary operator expected
 Temperature GPU 1 OK at =>

grep: GPU 0000:02:00.0: No such file or directory
./watchdoggputemp.sh: line 68: [: -ge: unary operator expected
./watchdoggputemp.sh: line 86: [: -ge: unary operator expected
 Temperature GPU 2 OK at =>

grep: GPU 0000:03:00.0: No such file or directory
./watchdoggputemp.sh: line 68: [: -ge: unary operator expected
./watchdoggputemp.sh: line 86: [: -ge: unary operator expected
 Temperature GPU 3 OK at =>

grep: GPU 0000:04:00.0: No such file or directory
./watchdoggputemp.sh: line 68: [: -ge: unary operator expected
./watchdoggputemp.sh: line 86: [: -ge: unary operator expected
 Temperature GPU 4 OK at =>

Status - All GPUs are within critical temperature limits

Hope you have an idea ;)
Thanks ;)

Post by ricksebak »

Yes, I understand what you are trying to do with newstr. And the awk command that I posted earlier will do what you want. For example:

$ cat /tmp/nvidaoutput
GPU 0000:01:00.0 GPU Current Temp : 49 C
GPU 0000:02:00.0 GPU Current Temp : 39 C
GPU 0000:03:00.0 GPU Current Temp : 44 C
GPU 0000:04:00.0 GPU Current Temp : 47 C

$ for i in 1 2 3 4
> do
> str=$(cat /tmp/nvidaoutput | grep 0000:0$i:00.0)
> newstr=$(echo $str | awk '{print $7}')
> echo $newstr
> done
49
39
44
47

Try setting line 36 of your GPU script to newstr=$(echo $str | awk '{print $7}') and it should work.
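
Because an empty newstr is what triggers the "unary operator expected" message, it may also be worth guarding the comparison. The sketch below repeats the loop above on a temporary sample file and checks that the extracted value is a number before using it. The helper name is_number and the three-GPU loop over a two-line file are assumptions made for illustration.

```shell
# is_number VALUE: succeed only for a non-empty string of digits (hypothetical helper).
is_number() {
  case $1 in
    ''|*[!0-9]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Temporary sample file with two GPUs, so the third lookup finds nothing.
tmpfile=$(mktemp)
cat >"$tmpfile" <<'EOF'
GPU 0000:01:00.0 GPU Current Temp : 49 C
GPU 0000:02:00.0 GPU Current Temp : 39 C
EOF

for i in 1 2 3; do
  # "|| true" keeps a non-matching grep from aborting a script run with set -e
  str=$(grep "0000:0$i:00.0" "$tmpfile" || true)
  newstr=$(echo "$str" | awk '{print $7}')
  if is_number "$newstr"; then
    echo "GPU $i: $newstr"
  else
    echo "GPU $i: no reading"
  fi
done
rm -f "$tmpfile"
```

Quoting the operands, as in [ "$newstr" -ge "$1" ], changes the failure for an empty value from a syntax complaint to "integer expression expected", so the explicit check is still the safer route.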