[Leaplist] machine got a bit warm .... -- MCEs, dumps and microcode updates ...

Bryan J. Smith thebs413 at yahoo.com
Fri Jan 25 19:57:11 EST 2008


> .... I logged in to my network this A.M. to find that my Intel
> Q6600 (Core2-Quad CPU) based box (running FC7 x86_64, runlevel 3)
> was face down, along w/ my network switch. After much monkey motion
> getting everything back up & going, I groped through the syslog
> ...
> Jan 23 04:41:26 Q6600 kernel: CPU0: Temperature above threshold, cpu clock throttled (total events = 1)
> Jan 23 04:41:26 Q6600 kernel: CPU3: Temperature/speed normal
> Jan 23 04:42:52 Q6600 kernel: CPU1: Temperature above threshold, cpu clock throttled (total events = 1)
> Jan 23 04:42:52 Q6600 kernel: CPU2: Temperature/speed normal
> Jan 23 04:44:19 Q6600 kernel: Machine check events logged
> ...
> The next 5 are interesting :-). The last 4 are from when I got
> it back up & running this P.M. There was no other output (such
> as power-down output) in between, which was a bit of a surprise.
> uname -a shows:
> [root@Q6600:/etc, Wed Jan 23, 04:47 PM] 1026 # uname -a
> Linux Q6600 2.6.22.9-91.fc7 #1 SMP Thu Sep 27 20:47:39 EDT 2007
> x86_64 x86_64 x86_64 GNU/Linux
> [root@Q6600:/etc, Wed Jan 23, 04:47 PM] 1027 #
> My question is: The syslog file said machine check events logged,
> but shows nonesuch (other than the 4 lines right above). Where
> else are these (other ?) machine-check events logged ? TIA ....

AMD and Intel processors throttle their clock when they cross a
thermal threshold.  If they cross an even higher threshold, they can
throw a Machine Check Exception (MCE).  An MCE typically results in a
kernel panic.  Machine check events can be logged in several ways,
including in a diskdump or netdump in the case of Fedora (various
things must be set up first).  They can also be logged in the CMOS
(viewable in the BIOS setup at POST).
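
As a starting point for the syslog side of this, here is a minimal sketch that pulls the thermal and machine-check lines out of a log file.  The function names are hypothetical, and the default path assumes the standard Fedora /var/log/messages location.  The match strings are the ones from the kernel messages quoted above.

```shell
#!/bin/sh
# Print thermal-throttle and machine-check lines from a syslog file.
# Default path is an assumption: the usual Fedora syslog location.
thermal_events() {
    grep -E 'Temperature above threshold|Temperature/speed normal|Machine check events logged' \
        "${1:-/var/log/messages}"
}

# Count how often the kernel reported that machine check events were logged.
# grep -c prints 0 (and returns non-zero) when there is no match.
mce_notice_count() {
    grep -c 'Machine check events logged' "${1:-/var/log/messages}"
}
```

Usage would be something like "thermal_events /var/log/messages | tail".  Note that the "Machine check events logged" line only says that records exist in a kernel buffer; on x86_64 kernels of this vintage those records are normally read from /dev/mcelog with the mcelog utility (run as root), assuming that package is installed, which nothing in this thread confirms.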

Also be aware that there has been a crapload of errata on the Intel
Core 2 line.  Just search for BIOS updates from Dell, HP, IBM, etc...
for various Core 2 workstations and servers and you'll see this.
Intel provides a nice, "soft" way to upload microcode fixes at init
time for Linux.  Since you're running kernel 2.6.20+, and I assume
this is a "whitebox" (if it is instead from a Tier-1 PC OEM, let me
know, because you'll want to check for their BIOS updates), I highly
recommend you download the latest Intel microcode.dat directly from
Intel:

http://downloadcenter.intel.com/Detail_Desc.aspx?DwnldID=14303  

Drop the microcode.dat in /etc/firmware and (on Fedora-based systems)
run:

  service microcode_ctl start

Then check /var/log/messages to see whether any updates were applied.
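
The install step can be sketched as a small script.  This is only an illustration under stated assumptions: the function names and the backup of the existing file are additions for safety, not part of the instructions above, and the /etc/firmware and /var/log/messages defaults are the paths already mentioned.

```shell
#!/bin/sh
# Copy a downloaded microcode.dat into the firmware directory,
# keeping a backup of whatever was there before (backup is an addition).
install_microcode() {
    new_file=$1
    fw_dir=${2:-/etc/firmware}
    [ -r "$new_file" ] || { echo "cannot read $new_file" >&2; return 1; }
    if [ -f "$fw_dir/microcode.dat" ]; then
        cp -p "$fw_dir/microcode.dat" "$fw_dir/microcode.dat.bak" || return 1
    fi
    cp "$new_file" "$fw_dir/microcode.dat"
}

# Show microcode-related lines from a syslog file (case-insensitive match).
microcode_log() {
    grep -i 'microcode' "${1:-/var/log/messages}"
}
```

Run as root, a session might look like "install_microcode ./microcode.dat && service microcode_ctl start && microcode_log | tail".  The exact wording of the resulting log lines depends on the kernel and loader version, so the grep pattern is deliberately loose.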

BTW, you can compare the file to your current microcode.dat by just
grep'ing for "PDT" or "PST" (that line carries the date of the file,
since Intel stamps its releases in Pacific Time).  Intel has been
providing the microcode.dat directly from its site for some time now,
especially for the Core 2 line, because a lot of the Tier-2/whitebox
companies may or may not provide BIOS updates or other support.  Note
that the microcode.dat (and the microcode_ctl loader) are "soft"
updates, meaning they are lost on power loss (which is a good thing).
Where one is available, a BIOS update that always loads the newer
microcode is generally preferred.
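
The date comparison can be sketched like this.  The function names are hypothetical, and the script assumes only what is stated above: that the date line in microcode.dat contains "PDT" or "PST".  It does not assume any particular format for the rest of that line.

```shell
#!/bin/sh
# Print the first line of a microcode.dat that carries a Pacific Time stamp.
microcode_stamp() {
    grep -E 'P[DS]T' "$1" | head -n 1
}

# Show the stamps of the installed and downloaded files side by side.
# Returns 0 when the stamps differ, non-zero when they are identical.
compare_microcode() {
    old=$(microcode_stamp "$1")
    new=$(microcode_stamp "$2")
    echo "installed: ${old:-no date stamp found}"
    echo "download:  ${new:-no date stamp found}"
    [ "$old" != "$new" ]
}
```

For example, "compare_microcode /etc/firmware/microcode.dat ./microcode.dat" prints both stamps so you can judge by eye which is newer; the script only tells you whether they differ, not which one is later.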

I'm currently in the middle of writing a blog article on this.  I will
post it after I've had it reviewed (I'm under NDA on some things, so I
have to have it reviewed first to ensure I'm only disclosing public
information).


-- 
Bryan J. Smith       Professional, Technical Annoyance
b.j.smith at ieee.org  http://www.linkedin.com/in/bjsmith
------------------------------------------------------
       Fission Power:  An Inconvenient Solution

