




(?) The Answer Gang (!)


By Jim Dennis, Ben Okopnik, Dan Wilder, Breen, Chris, and... (meet the Gang) ... the Editors of Linux Gazette... and You!
Send questions (or interesting answers) to The Answer Gang for possible publication (but read the guidelines first)


(?) How to Investigate a System Lockup

From Chris Gianakopoulos

Answered By Didier Heyden, Breen Mullins, Ben Okopnik, Jim Dennis, John Karns
with tidbits by Robos, Heather Stern

Hi Gang,

(!) [Didier] Hello, Chris!
(!) [Robos] Hi

(?) I was running X tonight (with the ICEWM window manager), I had a couple of xterms running (one with kermit running), and I was using Acrobat Reader Version 4.0.

As I was making a mouse movement, my console locked up.

(!) [Robos] Don't you have to reboot when you make a mouse-movement? Oh, wait, that's that other thing that claims to be an os... ;-)
No less than 4 other gang members chimed in with some version of a sigblock fortune cookie about this. -- Heather

(?) I could not even get a response, via the Ethernet, when trying to ping my crippled Linux system.

Which log files could I look at to try to determine what the impending disaster could have been? I have included the tail portion of /var/log/messages. I have included extra stuff, I suspect. I'm curious what those entries that say "MARK" mean. Could that be related to my lockup?

(!) [Didier] Nope. From the `syslogd' man page:
      -m interval
              The syslogd logs a mark  timestamp  regularly.  The
              default interval between two -- MARK -- lines is 20
              minutes. This can be changed with this option.
(However it seems that this feature is disabled in some versions of the syslog daemon -- maybe through a compile-time option?).
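Because MARK lines arrive at a fixed interval, they can help bracket when a machine stopped logging. Here is a minimal Python sketch, assuming the classic syslog timestamp format seen in the excerpts in this thread. The names `parse_stamp` and `find_gaps`, the 25-minute threshold, and the two MARK sample lines are made up for illustration; only the klogd line comes from the thread. Classic syslog stamps carry no year, so one is assumed.

```python
from datetime import datetime, timedelta

def parse_stamp(line, year=2002):
    """Parse the leading 'Mon DD HH:MM:SS' stamp of a syslog line.

    The year is an assumption, since classic syslog stamps omit it.
    Returns None for lines that do not start with such a stamp.
    """
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None

def find_gaps(lines, max_gap=timedelta(minutes=25)):
    """Return (before, after) pairs where logging paused longer than max_gap."""
    gaps = []
    previous = None
    for line in lines:
        stamp = parse_stamp(line)
        if stamp is None:
            continue  # skip lines without a recognisable timestamp
        if previous is not None and stamp - previous > max_gap:
            gaps.append((previous, stamp))
        previous = stamp
    return gaps

# Illustrative sample: the two MARK lines are invented, the last line
# is the klogd startup message quoted later in this thread.
sample = [
    "Jun  4 13:13:56 wallace -- MARK --",
    "Jun  4 13:33:56 wallace -- MARK --",
    "Jun  4 14:13:56 wallace kernel: klogd 1.3-3, log source = /proc/kmsg started.",
]

for before, after in find_gaps(sample):
    print("nothing logged between", before, "and", after)
```

This only narrows things down when MARK lines are actually being written (that is, when syslogd is not running with -m 0); on a quiet system without them, long gaps are normal.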

(?) Okay. I'll investigate other stuff.

(!) [Breen] At least on Red Hat it's through a run-time argument.
The init script for syslogd reads /etc/sysconfig/syslog for its arguments:
# Options to syslogd
# -m 0 disables 'MARK' messages.
# -r enables logging from remote machines
# -x disables DNS lookups on messages recieved with -r
# See syslogd(8) for more details
SYSLOGD_OPTIONS="-m 0 -r -x"
"-m 0" is the default; I added "-r -x" on this machine.
(!) [Didier] Fairly redhat-ish, indeed. My own system is based on an antediluvian RH 5.2 distro. I'm usually in no hurry to upgrade with a full new distro install (preferring to recompile packages from source -- RPM'ed or not -- iff I can no longer avoid it). Believe it or not, I haven't drowned yet in the resulting mess :)
Back then there was no such configuration file, and the syslog daemon was run without any arguments by default. But somehow the `-- MARK --' feature was... erm, is still in my case... totally disabled: whatever -m xx option I try, no timestamp appears in the logs.
(!) [JimD] Actually I think this was a bug. I reported it to the upstream maintainer a few years ago (when I was running RH5.2) and he pointed me to the updated version that worked.
Naturally I'd advise that you simply fetch the latest version (in source form if you don't want to get trapped in RPM dependency upgrade hell) and build/install that.
(!) [Didier] Thank you very much for your suggestions, Jim. Now I know what package to download next.
Regarding the RPM dependency hell, IIRC I once experienced core dumps from the `rpm' program itself after having fiddled with the `--nodeps' option (I was supposed to know what I was doing :) The problem was (hopefully) fixed with this simple command:
rpm --rebuilddb
I'm not sure it would have worked in all situations, though. And unfortunately I don't remember the exact version that was then installed on my system. In fact this has most probably been fixed ages ago...
(!) [Didier] Note that I also have a couple of problems with the associated `klogd' daemon, as indicated by the last two lines of the following excerpt:
Jun  4 14:13:56 wallace kernel: klogd 1.3-3, log source = /proc/kmsg started.
Jun  4 14:13:57 wallace kernel: Loaded 15309 symbols from /boot/System.map.
Jun  4 14:13:57 wallace kernel: Symbols match kernel version 2.4.17.
Jun  4 14:13:57 wallace kernel: Error seeking in /dev/kmem
Jun  4 14:13:57 wallace kernel: Error adding kernel module table entry.
The other weird thing is that that ancient kernel log daemon cannot be stopped by anything but a plain SIGKILL. Doesn't prevent me from having nice dreams, however.
(!) [Didier] Unfortunately, when one experiences such brutal lockups, the logs are often not of much use: the whole system freezes before the daemon is given a chance to write anything to them -- even if some kernel oops actually occurred. The only way to see this happening would be to have the kernel write directly to the console (assuming you're viewing the console output at the time; that won't work in an X session unless, maybe, console output has been redirected to a serial port at boot time?)
Upgrading your kernel might help, provided the lockup was not caused by some hardware (RAM?) failure.
(!) [Ben] That's pretty much what I would suspect - hardware. The only times I've seen Linux hang have been hardware-related. In one very annoying case, my laptop would hang for a number of seconds, several times per day - and I had to live with it, because the PCMCIA card causing it was my wireless modem which was on 24x7. AFAICT, it took a huge chunk of CPU when it switched channels (sometimes the CPU load meter would actually catch the spike before everything froze); fortunately, it didn't do that very often.
(!) [Didier] Another example is an IDE cd-rom or cd-writer device buggy enough to suck up every possible CPU clock cycle whenever it fails to read or burn the medium, the system thus becoming almost unusable -- especially in the case where the application which makes use of it is run with a static real-time priority (cf. `cdrecord'). Actually I've never figured out whether some ill-written code in the IDE / IDE-SCSI driver could be held responsible for such a misbehavior or if it was simply inevitable on this kind of architecture.
Real-time constraints in a multi-tasking operating system are often very difficult to deal with anyway.
(!) [Ben] <sigh> Hardware stuff like PCMCIA has root-level access - has to, to access privileged ports, etc. - and unfortunately I know of no way to mitigate that. I wish there was a "nice" utility for hardware...
(!) [Heather] ACPI might like to be that, someday.
(!) [Didier] I once read that running a shell with a POSIX real-time scheduling policy could help in some situations. Unfortunately I've never heard of a `nice'-like utility which could be used to launch `bash', `csh', etc. this way, either. I assume that in fact you must have a special version of your favorite shell, containing direct calls to sched_setscheduler(), in order to do that -- but I'm not sure.
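On Linux the scheduling policy of a process is preserved across execve(), so a small wrapper can call sched_setscheduler() and then exec an unmodified shell. The following is a rough Python sketch of that idea rather than anything proposed in the thread. `set_policy` and `exec_realtime` are hypothetical names, and the priority of 1 is an arbitrary choice. Real-time policies normally need root, so the sketch falls back to the normal policy when the kernel refuses.

```python
import os
import sys

def set_policy(policy, priority):
    """Try to switch the calling process to the given scheduling policy.

    Returns True on success and False if the kernel refuses, which is
    the usual outcome for real-time policies without root privileges.
    """
    try:
        # pid 0 means "the calling process"
        os.sched_setscheduler(0, policy, os.sched_param(priority))
        return True
    except OSError:
        return False

def exec_realtime(argv, priority=1):
    """Replace this process with argv, under SCHED_FIFO if permitted."""
    if not set_policy(os.SCHED_FIFO, priority):
        print("could not set SCHED_FIFO; keeping the normal policy",
              file=sys.stderr)
    # The policy set above survives the exec, so the new program
    # (and anything it forks) inherits it.
    os.execvp(argv[0], argv)

# Example call (assumption: run as root from a text console):
#     exec_realtime(["bash"])
```

A word of caution: a SCHED_FIFO process that spins will starve everything at normal priority, so a rescue shell like this can itself make the machine appear locked up if a runaway command is started from it.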

(?) RAM is always a possibility. The system seems awful reliable, though. Maybe it IS time to upgrade to a new distro just for the fun of it. I say distro rather than kernel so that I can use XFree86 version 4.x. My friends at work keep offering SuSE 8.0. I believe that the S3 Trio64v+ is supported, so nothing is really stopping me from going to the new distro.

I am guessing that it is related to whatever applications might have been running under X in combination with Acrobat (if not a hardware problem). Dynamic systems are always the most difficult to troubleshoot.

See attached chrisg.logfile.txt

(!) [John] Continuing on the kernel side of the issue, a thread on a related subject just came up on a LUG list I'm on:

...............

Various applications find System.map themselves, based on a standardized search path and name scheme. The non-specific name "System.map" is tried last; first they look for a version-specific name: System.map-$(uname -r)
Now if you have "System.map", and multiple kernels, without specifically named System.map files, then only one boot kernel will find the right System.map. Not everything needs kernel symbols to work right, but some programs do, and those are the ones that will have problems. Perhaps even with different kernels, the symbol search scheme will still find the right place for the symbol it needs (I'm not sure what scheme it uses, e.g., it might be a simple offset). Lilo itself does not have any knowledge of System.map, as far as I know (I'm not 100% certain, but probably about 90% certain). Now one place that is searched is the standard kernel build source location, /usr/src/linux/ (or maybe /usr/src/linux-2.4/ in some cases), and so if you install from that, and do not alter System.map in that directory, then your symbols should be resolved until you build a new kernel and overwrite the old one.
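The lookup order described above can be sketched in a few lines of Python. This is an illustration only: the directory list in `SEARCH_DIRS` is an assumption (each tool compiles in its own list, and it may differ), and `candidate_maps` and `first_existing` are hypothetical helper names.

```python
import os

# Directories assumed for illustration; the real list is built into
# each tool that reads System.map and may differ.
SEARCH_DIRS = ["/boot", "/", "/usr/src/linux"]

def candidate_maps(release, dirs=SEARCH_DIRS):
    """List System.map paths with version-specific names first
    and the plain, non-specific name last."""
    versioned = [os.path.join(d, f"System.map-{release}") for d in dirs]
    plain = [os.path.join(d, "System.map") for d in dirs]
    return versioned + plain

def first_existing(paths):
    """Return the first path that is an existing file, or None."""
    for path in paths:
        if os.path.isfile(path):
            return path
    return None

if __name__ == "__main__":
    release = os.uname().release  # the same string `uname -r` prints
    for path in candidate_maps(release):
        print(path)
    print("would use:", first_existing(candidate_maps(release)))
```

Running it under each kernel you boot shows quickly whether a version-specific map exists for that kernel or whether the tools would fall back to a shared, possibly mismatched, System.map.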

...............

(?) Thank you all (there are so many names to list!) for your quick responses to my question. I'm gonna do some detective work. My perception was that the system locked up. The only thing that I really know is that the console and the network did not respond. I got two serial ports on my system. I dedicate one to the modem, and I use the other for kermitting around. I think that I am going to use my nonmodem serial port for a login session. Would it not be funny if the system was still running and only my network stuff failed as a result of an X lockup?

That would seem odd, though. Since I was running X via my local console (you know -- with the keyboard and display), I would expect Unix domain sockets to be used, thus bypassing TCP (the network stream stuff).
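Which transport an X client uses follows from the DISPLAY variable: an empty host part conventionally means the local Unix domain socket under /tmp/.X11-unix, while a host name means TCP on port 6000 plus the display number. Here is a small Python sketch of that convention; `x_transport` is a hypothetical helper, and it deliberately ignores less common DISPLAY forms.

```python
import re

def x_transport(display):
    """Describe how an X client would reach the server named by DISPLAY.

    An empty host part (or the literal 'unix') means a local Unix
    domain socket; anything else means TCP on port 6000 + display.
    """
    match = re.fullmatch(r"([^:]*):(\d+)(?:\.\d+)?", display)
    if match is None:
        raise ValueError(f"unrecognised DISPLAY value: {display!r}")
    host, number = match.group(1), int(match.group(2))
    if host in ("", "unix"):
        return ("unix", f"/tmp/.X11-unix/X{number}")
    return ("tcp", (host, 6000 + number))

print(x_transport(":0"))           # local console session
print(x_transport("wallace:1.0"))  # remote display over TCP
```

With DISPLAY set to :0, as is typical for a session started at the local console, the X traffic never touches the TCP stack, which fits the reasoning above that an X hang alone would not explain the unanswered pings.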

You all gave me lots of good ideas, and thanks much again. This email response is like a broadcast thanks to all of you!


This page edited and maintained by the Editors of Linux Gazette Copyright © 2002
Published in issue 80 of Linux Gazette July 2002
HTML script maintained by Heather Stern of Starshine Technical Services, http://www.starshine.org/

