Script to kill CPU-demanding processes

Status
Not open for further replies.

volcano

Programmer
Aug 29, 2000
136
HK
Hi all, I want to write a scheduled UNIX script that automatically kills processes that have been using over 90% of the CPU for more than half an hour. Could you give me any hints? Is such a script even possible? Thanks.
 
I always consider it dangerous to allow a script to decide which process to kill. You would at least want to code it so no system processes can be killed (select by userid range or group id).
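As a rough sketch of the "select by userid range" idea: the filter below keeps only processes owned by UIDs at or above a cutoff. The function name filter_user_procs and the MIN_UID default of 1000 are assumptions made for illustration; the right cutoff depends on how your system allocates system accounts. On Solaris, swap awk for nawk, as the stock /usr/bin/awk there lacks -v.

```shell
#!/bin/sh
# MIN_UID is an assumed cutoff; check how your system numbers its system accounts.
MIN_UID=${MIN_UID:-1000}

# Read "pid uid pcpu comm" lines on stdin and print only those owned by
# ordinary users. The ps header line is dropped because its UID field
# is not numeric.
filter_user_procs() {
    awk -v min="$MIN_UID" '$2 ~ /^[0-9]+$/ && $2 + 0 >= min + 0'
}

# Typical use (the "uid" column is accepted by Solaris and Linux ps,
# but is not in the POSIX list of -o names):
#   ps -eo pid,uid,pcpu,comm | filter_user_procs
```

Anything this filter drops never reaches the kill logic at all, which is safer than trying to blacklist individual system process names.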

I would say it would be safer to have it page you or a support person if a process occupies over 90% CPU for over half an hour, and then have human eyes go look and possibly kill it.
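A minimal sketch of the notify-instead-of-kill approach, assuming a hypothetical helper called hogs_over and a placeholder address oncall@example.com. It only checks the current %CPU figure; tracking "for over half an hour" would need state between runs.

```shell
#!/bin/sh
# Print processes at or above the given %CPU limit.
# Input is "pid pcpu etime comm" lines with a header, as produced by ps.
hogs_over() {
    awk -v limit="$1" 'NR > 1 && $2 + 0 >= limit + 0'
}

# Typical use: mail a report to a human instead of killing anything.
# The recipient address is a placeholder.
#   REPORT=$(ps -eo pid,pcpu,etime,comm | hogs_over 90)
#   [ -n "$REPORT" ] && echo "$REPORT" | mailx -s "High CPU on $(uname -n)" oncall@example.com
```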

Also, 90% CPU for over half an hour may be acceptable. Make sure you know what the processes on the machine are doing. You don't want to be in a position where you're explaining to management that there are no month-end reports because the machine was being used too much and you decided to kill them. I always consider high usage a good thing: some useful work is being done for the company. High usage means an upgrade (or optimization) is needed, not that someone's work needs to be killed before it's done.

The goal is for the machines to be used, not to sit idle. If they are heavily used, you should expect peaks for periods of time. If someone is complaining about slowness caused by another process, maybe renice the high-CPU process so it gives up some priority. Also look for what you can tune or optimize; you might be able to tune things so there is lower impact. You can also use the high usage to make a case for upgrading the server(s) to bigger, faster machines.
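For the renice suggestion, something along these lines could work. The wrapper name deprioritize and the default value of 10 are illustrative assumptions. Note that POSIX defines -n as an increment to the current nice value, while some implementations (util-linux on Linux, for one) treat it as an absolute value unless POSIXLY_CORRECT is set, so check the man page on your system.

```shell
#!/bin/sh
# Lower the scheduling priority of one process by raising its nice value.
# $1 = PID, $2 = nice value or increment (default 10, chosen arbitrarily).
deprioritize() {
    pid=$1
    incr=${2:-10}
    renice -n "$incr" -p "$pid" >/dev/null
}

# Typical use, with a PID taken from top or ps:
#   deprioritize 12345
```

A non-root user can normally only raise the nice value of their own processes, so run this as root (or as the process owner) when acting on someone else's job.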

Just my 2 cents.
 
Hi, thanks for your reply. In my case I know exactly which user process or script (as shown by, say, the top command) is taking almost all of the CPU. When I kill it (or them), it may reappear later. I have asked the program's owner to check his code for a never-ending loop or something similar. But until he fixes it, I want to keep my machine healthy. The problem program takes all of the CPU, which makes the Oracle Portal service on that machine unavailable to my users! Any ideas for better managing a badly behaved user program? Thx
 
This is a script I use to automatically kill runaway Oracle Forms processes. I know I can safely kill these because even when they are working hard they are network and disk I/O bound and only use about 25% of a CPU.

When they go wrong, however, they get into a tight loop (as volcano describes), making no system calls, and use 98% of a CPU.

The additional criterion that the process has been running for at least 24 hours adds further safety, because in our environment it would be very unlikely to be doing real work if it had been running for that long.
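For a shorter limit, such as the half hour volcano asked about, the elapsed-time column can be converted to seconds and compared numerically. The helper below is a sketch; the name etime_to_seconds is made up for illustration, and it assumes the [[dd-]hh:]mm:ss layout that POSIX specifies for the ps etime field.

```shell
#!/bin/sh
# Convert a ps "etime" value, formatted [[dd-]hh:]mm:ss, to seconds.
etime_to_seconds() {
    echo "$1" | awk '{
        days = 0
        t = $1
        # Split off the optional "dd-" prefix.
        if (index(t, "-") > 0) {
            split(t, d, "-")
            days = d[1]
            t = d[2]
        }
        # The remainder is hh:mm:ss, mm:ss, or (defensively) plain seconds.
        n = split(t, f, ":")
        if (n == 3)      secs = f[1] * 3600 + f[2] * 60 + f[3]
        else if (n == 2) secs = f[1] * 60 + f[2]
        else             secs = f[1]
        print days * 86400 + secs
    }'
}

# Typical use: has this process been running for more than 30 minutes?
#   [ "$(etime_to_seconds "$ETIME")" -gt 1800 ] && echo "over half an hour"
```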

Replace someone@somewhere.com with your email address, and 'f45runw' with the process name that appears in the output of ps -o comm -p <PID> for one of the processes you are having problems with.

Code:
#!/bin/ksh
#
# Find processes that are hogging CPU.
#
# Criteria: CPU % >= 95/(number of processors) for two subsequent polls and
# elapsed time of more than 24 hours.
#
# Default threshold of 95 can be overridden using first parameter.

PROCS=$(/usr/sbin/psrinfo | grep on-line | wc -l)
RECIPIENTS=someone@somewhere.com
HOST=$(uname -n)
OUTPUT=/tmp/$(basename $0).$$
PIDS=/tmp/$(basename $0).pids
PREVPIDS=/tmp/$(basename $0).pids.prev
THRESHOLD=${1:-95}

touch ${PIDS} ${PREVPIDS}

ps -eo pid,pcpu,stime,etime,time,comm | nawk \
-v PROCS="${PROCS}" \
-v PIDS=${PIDS} \
-v PREVPIDS=${PREVPIDS} '

        BEGIN {
                THRESHOLD='${THRESHOLD}' / PROCS
                getline
                print
        }

        # If pcpu >= THRESHOLD, elapsed time contains a "-" and two ":"s
        # (i.e. more than 24 hours), and the command name matches.
        ($2 >= THRESHOLD) && $4 ~ /-[0-9]*:[0-9]*:/ && $6 == "f45runw" {
                HOG=$0
                HOGPID=$1
                # Compare with "> 0" so a read error (-1) cannot loop forever.
                while ((getline < PREVPIDS) > 0) {
                        # Only display if was also hogging in previous run.
                        if ($1 == HOGPID) { print HOG }
                }
                close(PREVPIDS)
                print HOGPID >> PIDS
        }

' > ${OUTPUT} 2>&1

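# Only call kill when there is at least one PID; a bare "kill" just prints
# a usage error.
KILLPIDS=$(nawk '$1 ~ /^[0-9]+$/ { print $1 }' ${OUTPUT})
[[ -n "${KILLPIDS}" ]] && kill ${KILLPIDS}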

echo "\n--\nOutput produced by $0" >> ${OUTPUT}

if [[ $(wc -l < ${OUTPUT}) -gt 4 ]]
then
        mailx -s "Forms server processes killed on $HOST" ${RECIPIENTS} < ${OUTPUT}
fi

rm ${OUTPUT}
mv ${PIDS} ${PREVPIDS}

Annihilannic.
 
Hi, it's amazing to have a real sample script for reference! Thank you! Since I am not as familiar with UNIX scripting as you are, I need time to understand the code before using it. Thanks indeed for your helping hand.
 