Load average: 314.1, 213.2, 106.7

Every once in a while, the server goes nuts.

Usually it happens when I try to turn something on and don't realize how much load it will cause.  The system seems to be running fine; then a few minutes later I notice the load average creeping up.

For those of you who are not geeks, the load average tells you how many programs are trying to use the CPU at once.  Usually one-, five-, and fifteen-minute load averages are shown.  A system's load average should never be greater than the number of CPUs in the machine for more than a short time; if it is, that means that the server can't keep up with the demand.  I only have one CPU in the server, so the load average should never exceed 1.  A typical load average on Navi is 0.33.
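For the geeks, here is a minimal sketch of that rule of thumb in Python. It assumes a Unix-like system, where `os.getloadavg()` is available; the helper name `is_overloaded` is made up for illustration and is not part of any real monitoring tool.

```python
import os


def is_overloaded(load_avg, cpu_count):
    """Return True when the load average exceeds the number of CPUs."""
    return load_avg > cpu_count


def report():
    """Print the three load averages and flag any that exceed the CPU count."""
    # os.getloadavg() returns the 1-, 5-, and 15-minute averages (Unix only).
    averages = os.getloadavg()
    # os.cpu_count() can return None, so fall back to a single CPU.
    cpus = os.cpu_count() or 1
    for label, value in zip(("1 min", "5 min", "15 min"), averages):
        status = "over capacity" if is_overloaded(value, cpus) else "ok"
        print(f"{label}: {value:.2f} on {cpus} CPU(s) -> {status}")


if __name__ == "__main__":
    report()
```

On a one-CPU machine like the one described here, a reading of 0.33 would come back as "ok" and anything above 1 as "over capacity".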

This new feature adds, say, 0.8 to the load average, which brings it to 1.13. That is a big problem, because the server will never catch up with a load average of 1.13; it has entered a slow slide into oblivion.
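A rough way to see why a load of 1.13 never recovers is a toy model that treats the load figure as demand, in CPU-seconds of work arriving per second, against a single CPU that can finish 1.0 CPU-second per second. That is a simplification made for illustration (real load averages are smoothed counts of waiting processes), and the function name `backlog_after` is hypothetical.

```python
def backlog_after(seconds, demand, capacity=1.0):
    """Simulate unfinished work piling up, one second at a time.

    demand   -- CPU-seconds of new work arriving each second (assumed constant)
    capacity -- CPU-seconds the machine can finish each second (1.0 = one CPU)
    """
    backlog = 0.0
    for _ in range(seconds):
        # Work arrives, the CPU clears what it can; backlog cannot go negative.
        backlog = max(0.0, backlog + demand - capacity)
    return backlog


if __name__ == "__main__":
    for demand in (0.33, 1.13):
        behind = backlog_after(600, demand)
        print(f"demand {demand}: {behind:.1f} CPU-seconds behind after 10 minutes")
```

In this model a demand of 0.33 never builds a backlog, while a demand of 1.13 falls about 0.13 CPU-seconds further behind every second, so the queue only grows.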

Anyway, a few minutes after I activate whatever feature it is, I notice the one-minute average is around 10, and the five-minute average is maybe 4 or so. If I look at the case, the hard drive lights are on solid. I try to log in to correct this, and find that the console is reacting very sluggishly. Pulling up top takes a minute or two, during which time the one-minute average goes up to 15.

The top display is chaotic; sometimes five Apaches will appear above ten spamds, because the five Apaches are each taking up more CPU than any one of the spamds.  This isn't a big problem normally--I'm clever enough to realize which process is the real problem when I can see all of them.  But by now there are thirty Apaches, twenty MySQLs, and fifty spamds off the bottom of the screen.

So I quit top (load average is 50 now) and try using ps. After a minute or two, this reveals that the problem is spamd, so I run killall spamd. But qmail immediately loads up more copies of spamd.

The load average is now around 100.

I try to kill off qmail, but by this point letters appear one at a time as I type, so I make three or four typos and have to slowly backspace back to fix them.  Just trying to shut off qmail takes five more minutes.  The load average is now 200.

And the server's fucked.  I have to reboot and let it rebuild the filesystems from their journals.  (Thank you, ext3.)  Then I have to get into single-user mode, check all the databases for damage, and fix the configuration to disable the feature that started the whole sequence.  Finally, I can boot up again.

This happened in real life twice today.  The 0.33 was the normal low-level stress I have to deal with, some of which was my own crap and some of which was others' crap.  The first time, the 0.8 was Mom screaming at me to fix something I had no control over; the second, a friend was screaming at me about a dispute between one of her friends and one of mine.  So I took reboots--first in the form of a nap, then in the form of a shower.

Christ, has this day sucked.

Yeah, you know you got to help me out
Yeah, oh don't you put me on the back burner
You know you got to help me out
You're gonna bring yourself down
Yeah, you're gonna bring yourself down
Yeah, you're gonna bring yourself down


Feb. 4th, 2005 05:50 pm (UTC)
*hugs Navi*