LINUX GAZETTE
...making Linux just a little more fun!
(?) The Answer Gang (!)
By Jim Dennis, Ben Okopnik, Dan Wilder, Breen, Chris, and... (meet the Gang) ... the Editors of Linux Gazette... and You!


We have guidelines for asking and answering questions. Linux questions only, please.
We make no guarantees about answers, but you can be anonymous on request.
See also: The Answer Gang's Knowledge Base and the LG Search Engine



(?) Diagnosing a Linux crash

From Tom Brown

Answered By: Thomas Adam, Karl-Heinz Herrmann

OK guys, here's a n00b question for you that probably crosses over into Sys Admin territory.

What steps should someone follow after Linux crashes to figure out what went wrong?

Where do I start, and where do I look for clues?

Are all the logs found in /var/log, or are there others?

In what order should I look at the logs, and what should I look for?

(!) [Thomas] It depends what you think went wrong. Essentially:
/var/log/messages
is where syslogd writes most of its messages by default, so it is the best place to start. There may well be application-specific logs in /var/log too; XFree86.0.log is one such example.
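As a rough illustration of what "looking" might mean in practice, here is a minimal sketch that pulls suspicious lines out of those two logs. The function names and the keyword list are assumptions made for this example, not a complete catalogue of failure messages; the "(EE)" marker is how XFree86 tags error lines in its own log.

```shell
#!/bin/sh
# Sketch only: crash_clues and x_errors are hypothetical helper names,
# and the keyword list is an assumption, not an exhaustive set.

# Print numbered lines in a syslog-style file that mention common
# failure keywords (case-insensitive).
crash_clues() {
    grep -n -i -E 'oops|panic|segfault|i/o error' "$1"
}

# Print the lines XFree86 marked as errors in its own log.
x_errors() {
    grep -F '(EE)' "$1"
}

# Example (reading these files usually requires root):
#   crash_clues /var/log/messages | tail -n 20
#   x_errors /var/log/XFree86.0.log
```

An empty result does not prove the system was healthy; a hard lockup may leave nothing in the logs at all.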

(?) Any pro-active steps I should be taking to get more info, should it happen again?

The specifics of my case: my file server (a 750 MHz Athlon running SuSE 9) simply locked up, and I couldn't get anything to display (GUI or command line). I knew the machine was in trouble when it didn't respond to pings. I had to hit the reset button to get it back (and deal with fsck, naturally). Funny thing is, the system clock reset itself to 28 minutes after midnight (when it should have read the middle of the afternoon), but didn't lose the date. Odd, that. The machine's been running 24/7 for about three weeks now (I set it up around then), and no sign of problems until now.

(!) [Thomas] This might be framebuffer related. At the lilo/grub prompt, type:
linux video=vga16:off
(!) [Thomas] to see if that has any effect.
There have been snippets of these effects mentioned in the past. The one that springs to mind is:
http://linuxgazette.net/issue74/tag/9.html
(!) [K.-H] There are ways of still getting kernel info (pro active steps):

(?) I'm going to keep your response handy -- several things to try. Meantime, I realized I was booting the thing into runlevel 5 (rather stupid, actually), so I've since changed it to 3. If it is, as someone suggested, a framebuffer problem, maybe that will solve it for now. I'm using a real old Voodoo 3 card I scrounged from my parts bin. If it happens again, I'll have to tear the machine apart and start playing with the memory, as someone else here suggested.

(?) Installing and configuring Linux is one thing. Learning how to do an autopsy seems to be quite another!

(!) [Thomas] That's because generally one doesn't do it quite like that. Problem diagnosis is situation-dependent. In any given situation there is often a small set of files and related information that you can analyse without having to worry about the rest of the system.
Granted, this is related to how much information one is told at the time (if you've been on this list for as long as I have, you'll come to realise that usually we don't get any), and whether or not the person has tried to remedy it.
In general, though, poking around helps: take one aspect of your system, look at what it does and how it relates to the rest, and you will be better prepared when you have to diagnose something.

(?) Yes, well, I looked at the messages log, but saw only a gap time-wise between cron processing around 4 in the morning, and the time of the crash. I'm not sure which of the other logs are important in that case. Where do I find the register dump (although I suspect it won't make much sense to me, rather like those register dumps you get in Windows XP)?

(!) [Thomas] Syslogd might have logged it, if the problem was software related, and indeed if the program in question produced any errors. If it was hardware, then it might not have, depending on the severity of the hardware failure.
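The gap Tom describes in the messages log can also be found mechanically, which at least narrows down when logging stopped. The following is a minimal sketch, assuming the traditional syslog line format that begins with "Mon DD HH:MM:SS"; the function name largest_gap is hypothetical, and because that format carries no year, the arithmetic ignores leap years and logs that cross New Year.

```shell
#!/bin/sh
# Sketch only: largest_gap is a hypothetical helper. It assumes lines
# start with a traditional syslog timestamp such as "Apr  3 04:15:01".

# Print the largest gap between consecutive timestamps in a log file.
largest_gap() {
    awk '
    BEGIN {
        # Days elapsed before each month (non-leap year; an approximation).
        split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", names, " ")
        split("0 31 59 90 120 151 181 212 243 273 304 334", offs, " ")
        for (i = 1; i <= 12; i++) before[names[i]] = offs[i]
    }
    # Only consider lines whose first three fields look like a timestamp.
    ($1 in before) && split($3, t, ":") == 3 {
        now = ((before[$1] + $2) * 24 + t[1]) * 3600 + t[2] * 60 + t[3]
        stamp = $1 " " $2 " " $3
        if (seen && now - prev > max) { max = now - prev; from = last; to = stamp }
        prev = now; last = stamp; seen = 1
    }
    END { if (max > 0) printf "%d seconds between %s and %s\n", max, from, to }
    ' "$1"
}

# Example (usually requires root):
#   largest_gap /var/log/messages
```

The start of the reported gap is only the last message that reached the disk, so the actual lockup may have happened any time after it.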

(?) I'm using a real old Voodoo 3 card I scrounged from my parts bin. If it happens again, I'll have to tear the machine apart and start playing with the memory, as someone else here suggested.

(!) [Thomas] It might be memory, but as the link I gave you last time around said, memory problems tend to be more 'visible' in the sense that you get a lot of applications dying with SIGSEGV or SIGABRT for no apparent reason. In such instances, installing and running 'memtest86' is usually of help.
(!) [K.-H] Most of the time I had the great luck of oopses and kernel crashes occurring in the SCSI layer, often hardware problems. If the SCSI layer is in trouble, nothing will get written to disk. What's software related regarding the kernel? The kernel deals with hardware, and it's supposed to handle error conditions gracefully, i.e. not just freeze without a hint of what's gone wrong. But there are situations where the kernel doesn't have a chance of leaving hints on the hard drive.
Then a few things might be useful (to Tom B):
suggested reading:
/usr/src/linux/Documentation/oops-tracing.txt
ksymoops man page
But I have to say that often enough I also do not try to hunt spurious crashes, which do occasionally happen. The cause may be hardware or something else entirely. You can always try a different kernel or simply hope for the best.
Still -- keeping the system on console 10 is not a difficult thing to do and it just might give you something useful next time (note it down for ksymoops if it's an oops or panic).
SuSE has memtest as a boot option -- run it if you suspect the RAM, run it long (several passes) and the full test suite if you don't find any errors on the first go.

(?) Thanks Karl and Thomas. This is the starting point I needed. (For one thing, I didn't even know about console 10: looks helpful). I just wish I had more from the crash than just a black screen, but that's what I get for running X on bootup for a file server. Between the two of you, I think I have the answers I was looking for when I started this thread: not what went wrong exactly, but how to dig in, and try to figure it out for myself.

Oh, Thomas, when I rebooted to runlevel 3, I entered that video setting you suggested as well.

I just know I'll be back with more questions, though. One way or another, I'll figure this Linux thing out.

Thanks again, guys. Your help, as always, is much appreciated.


This page edited and maintained by the Editors of Linux Gazette
Copyright © its authors, 2004
Published in issue 101 of Linux Gazette April 2004
HTML script maintained by Heather Stern of Starshine Technical Services, http://www.starshine.org/

