5 Posts

April 29th, 2008 17:00

kernel: ata1.00: exception on new PE2950

 

We have two PowerEdge 2950s running Fedora Core 7 64-bit (2.6.21-1.3194.FC7) and VMware Server 1.0.4, and everything is working great (it's been about 6 months)...

 

I recently added two new PE2950s and installed the same OS and VMware versions. Unfortunately, these boxes are not running as well as the first two we configured...

 

The OS is freezing periodically and we're seeing the following in /var/log/messages:

 

Apr 29 13:18:55 gtvmh002 kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Apr 29 13:18:55 gtvmh002 kernel: ata1.00: cmd a0/01:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x3 data 18 in
Apr 29 13:18:55 gtvmh002 kernel: res 40/00:03:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
Apr 29 13:19:02 gtvmh002 kernel: ata1: port is slow to respond, please be patient (Status 0xd0)
Apr 29 13:19:25 gtvmh002 kernel: ata1: port failed to respond (30 secs, Status 0xd0)
Apr 29 13:19:25 gtvmh002 kernel: ata1: soft resetting port
Apr 29 13:19:26 gtvmh002 kernel: ata1.00: configured for UDMA/33
Apr 29 13:19:26 gtvmh002 kernel: ata1: EH complete

After about 30 seconds, everything starts working as expected until it freezes again (1, 5, 10, x minutes later)...
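
To put a number on each freeze, the syslog timestamps can be paired up: one "exception" line opens a stall and the next "EH complete" line closes it. The following is a minimal sketch, assuming the log text is available as a string; the function names and the default year (syslog omits it) are illustrative choices, not part of any tool mentioned in this thread.

```python
from datetime import datetime

# A few lines copied from the /var/log/messages excerpt above.
LOG = """\
Apr 29 13:18:55 gtvmh002 kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Apr 29 13:19:02 gtvmh002 kernel: ata1: port is slow to respond, please be patient (Status 0xd0)
Apr 29 13:19:25 gtvmh002 kernel: ata1: port failed to respond (30 secs, Status 0xd0)
Apr 29 13:19:26 gtvmh002 kernel: ata1: EH complete
"""

def parse_time(line, year=2008):
    # syslog timestamps carry no year, so one is supplied (assumption: 2008).
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")

def stall_durations(log_text, port="ata1"):
    """Seconds from each '<port> exception' line to the next 'EH complete'."""
    durations = []
    start = None
    for line in log_text.splitlines():
        if f"kernel: {port}" not in line:
            continue  # ignore other ports and unrelated messages
        if "exception Emask" in line and start is None:
            start = parse_time(line)
        elif "EH complete" in line and start is not None:
            durations.append((parse_time(line) - start).total_seconds())
            start = None
    return durations

print(stall_durations(LOG))  # [31.0]
```

Run against the full log, the gaps between successive stalls would also show whether the freezes follow any regular interval.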

 

The main difference between the older boxes and the new ones is the RAID controller. The older ones have a PERC5 and the new ones a PERC6 (which I'm assuming is the issue at this point given the error is referencing ata1)...

 

I found a bunch of stuff on Google about this "kind" of error, but no smoking gun for my situation. Not sure if I should spend my time compiling a new kernel or finding an alternate config for the PERC driver (or both).

 

Anyone else seen this issue?

 

- Paul

 

Message Edited by pmorin on 04-29-2008 01:33 PM
Message Edited by pmorin on 04-30-2008 09:25 AM

175 Posts

May 4th, 2008 11:00

Greetings

 

I had a similar issue with RHEL4u2 (2.6.9-22), although my problems were with the initial install. You may need to download the mptlinux source from LSI Logic's website and compile it on the new server. The Perc6 is fairly new ;-) so you probably just need to backport a few things. I used the mptlinux source (I think you have a 1068E SAS controller from LSI; at least we do on our PE2950s) and created a DUD (driver update disk) so I could install 2.6.9-22, but I think compat is upstream for your kernel.

 

Sorry, I tend to ramble -- here's the link for the 1068E source

 

http://www.lsi.com/storage_home/products_home/standard_product_ics/sas_ics/lsisas1068e/index.html?remote=1&locale=EN 

5 Posts

May 5th, 2008 21:00

I assume I want to compile this as a kernel module as opposed to compiling it into the kernel?  Yes/No?  I haven't done this in quite a while, so I'm going to have to spend some time in Google "remembering" how to compile and install modules...

 

Thanks! 

5 Posts

May 7th, 2008 14:00

Ok, after poking around Google a bit attempting to refresh my memory on module compiles I decided to first try to upgrade the kernel to see if a newer version liked the PERC6 card any better...  I used the FC8 DVD to upgrade to Core 8 then ran a yum update to get the latest kernel (2.6.24.5-85.fc8) and stuff.  No dice!  Similar error (same one as below)...

So I went back to Google and finally found some simple instructions on module compiles.  I made sure my kernel source was current and executed the Makefile in the mptlinux source:

make -C /usr/src/kernels/2.6.24.5-85.fc8-x86_64 M=$PWD modules
make -C /usr/src/kernels/2.6.24.5-85.fc8-x86_64 M=$PWD modules_install
rebooted...
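
An out-of-tree module has to be built against the tree that matches the running kernel, so it can help to derive the -C path rather than type it. The sketch below only assembles the two make invocations shown above; the /usr/src/kernels/<release>-<arch> layout is taken from the commands in this post, while the helper names and the /root/mptlinux source path are hypothetical.

```python
def kernel_build_dir(release, arch, base="/usr/src/kernels"):
    # The tree in this post is named <release>-<arch>; if the release
    # string already ends with the arch, it is used unchanged.
    name = release if release.endswith(arch) else f"{release}-{arch}"
    return f"{base}/{name}"

def build_commands(release, arch, src_dir):
    """Return the 'modules' and 'modules_install' make invocations."""
    kdir = kernel_build_dir(release, arch)
    return [
        ["make", "-C", kdir, f"M={src_dir}", "modules"],
        ["make", "-C", kdir, f"M={src_dir}", "modules_install"],
    ]

# /root/mptlinux is a hypothetical location for the unpacked LSI source.
for cmd in build_commands("2.6.24.5-85.fc8", "x86_64", "/root/mptlinux"):
    print(" ".join(cmd))
```

On the host itself, os.uname().release and os.uname().machine would supply the first two arguments, which guards against building for a kernel other than the one that is booted.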

I was hopeful but after about 30 minutes the error hit again:

May  7 11:11:24 gtvmh002 kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
May  7 11:11:24 gtvmh002 kernel: ata1.00: cmd a0/00:00:00:12:00/00:00:00:00:00/a0 tag 0 pio 18 in
May  7 11:11:24 gtvmh002 kernel:          cdb 03 00 00 00 12 00 00 00  00 00 00 00 00 00 00 00
May  7 11:11:24 gtvmh002 kernel:          res 40/00:03:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
May  7 11:11:24 gtvmh002 kernel: ata1.00: status: { DRDY }
May  7 11:11:29 gtvmh002 kernel: ata1: port is slow to respond, please be patient (Status 0xd0)
May  7 11:11:34 gtvmh002 kernel: ata1: device not ready (errno=-16), forcing hardreset
May  7 11:11:34 gtvmh002 kernel: ata1: soft resetting link
May  7 11:11:35 gtvmh002 kernel: ata1.00: configured for UDMA/33
May  7 11:11:35 gtvmh002 kernel: ata1: EH complete

It's a bit different from the original, but I'm assuming that's just because of the newer kernel.
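
The newer kernel prints the full command block, which makes the failing request easier to read. As a rough sketch (the lookup tables below are deliberately tiny and the function name is made up for illustration), the "cmd" and "cdb" lines can be decoded like this. ATA command 0xA0 is PACKET, which is only issued to ATAPI devices such as optical drives, and SCSI opcode 0x03 is REQUEST SENSE, so these lines may point at an ATAPI device on ata1 rather than at a disk behind the RAID controller.

```python
import re

# Small excerpts of the ATA command and SCSI opcode tables (not exhaustive).
ATA_COMMANDS = {0xA0: "PACKET (ATAPI)", 0xA1: "IDENTIFY PACKET DEVICE", 0xEC: "IDENTIFY DEVICE"}
SCSI_OPCODES = {0x00: "TEST UNIT READY", 0x03: "REQUEST SENSE", 0x12: "INQUIRY"}

def decode(cmd_line, cdb_line):
    """Decode the ATA command byte and the SCSI CDB from two libata log lines."""
    cmd = int(re.search(r"cmd ([0-9a-f]{2})/", cmd_line).group(1), 16)
    cdb = [int(b, 16) for b in cdb_line.split("cdb", 1)[1].split()]
    return {
        "ata_command": ATA_COMMANDS.get(cmd, hex(cmd)),
        "scsi_opcode": SCSI_OPCODES.get(cdb[0], hex(cdb[0])),
        # Byte 4 of a 6-byte REQUEST SENSE CDB is the allocation length.
        "allocation_length": cdb[4],
    }

cmd_line = "May  7 11:11:24 gtvmh002 kernel: ata1.00: cmd a0/00:00:00:12:00/00:00:00:00:00/a0 tag 0 pio 18 in"
cdb_line = "May  7 11:11:24 gtvmh002 kernel:          cdb 03 00 00 00 12 00 00 00  00 00 00 00 00 00 00 00"
print(decode(cmd_line, cdb_line))
```

The allocation length of 0x12 (18 bytes) matches the "pio 18 in" at the end of the cmd line.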

I'm going to try LSI support to see if they have any ideas...

 

5 Posts

May 7th, 2008 15:00

No help from LSI (it was worth a shot)...

May 20th, 2008 06:00

I am convinced the PERC 6/e doesn't work with Linux.  I know it doesn't work with Solaris.   Dell assured me for over a week that the PERC 6/e worked with Solaris until I sent them a message from a Sun technician detailing how Sun was going to have to write a driver for the card.

 

I lose the MD1000 array and I get file system corruption errors, including on the internal SAS6i.  I have tried different (journaling) filesystems, with and without lvm2.

 

I am also now noticing other problems: ACPI errors,  irqbalance won't load, etc., etc.

 

I abandoned RHEL 5, and I have tried Solaris 10, OpenSolaris 2008.05, CentOS 5, Fedora 8 and 9.  I am frankly shocked that Dell would sell a "Linux" solution that JUST DOESN'T WORK.  I am disgusted. 

175 Posts

May 20th, 2008 10:00

Howdy

 

I got an LSI 1068E SAS Perc6 to run in Gnu/Linux, and specifically RHEL4.2 (2.6.9-22ELsmp).  The device automatically was recognized using the RHEL5.1 DVD install and an MD3000 daisy chain of SAS drives appears to be pretty stable thus far (we are still testing it for deployment).

 

Using a simple makefile (see below), I created a DUD with ddiskit, a utility a friend wrote. Then, using the mptlinux source from LSI and a previous release of a golden disk DUD for RHEL4, I was able to hack together a driver for mptsas/mptscsi, which the Perc6 card needed.

 

Do you have a single MD1000 array or do you have multiple SAS disks daisy chained together in multiple MD1000/MD3000 array setup? 

 

Also is it an external MD1000, on a powervault, or are you just using the internal disks for the array?  It sounds like external by your comments...I sympathize with you...if you like you should try disabling ACPI

 

Unlike APM, ACPI isn't just concerned with controlling power/states ... it is wide sweeping and actually controls IRQ sharing and device initialization ... it isn't surprising that irqbalance won't initiate if you are having ACPI issues ... 

 

Some AHCI protocol SATA drives use ALPM now as well, which is a sort of built-in power management subsystem that doesn't jibe well with the Linux-ACPI protocol. There isn't much out there other than a few patches (check over at the acpi.org website or google for them), but this could also be an issue depending on the BIOS setup.

 

Anyway --- I would disable ACPI and see how it goes.  You shouldn't have to get into assigning IRQs manually if you have a pretty baseline 2950.
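
For a one-off test, ACPI can be turned off by adding the acpi=off kernel parameter to the boot entry. The helper below is a minimal sketch of that edit on a single boot-loader kernel line; the function name and the sample line (including its root= path) are hypothetical, and the result still has to be written back to the boot configuration by hand.

```python
def add_kernel_param(kernel_line, param):
    """Return kernel_line with param set, replacing any earlier value of the same key."""
    key = param.split("=", 1)[0]
    # Drop any existing setting of the same key so the new value wins.
    tokens = [t for t in kernel_line.split() if t.split("=", 1)[0] != key]
    tokens.append(param)
    # Note: joining on single spaces discards the line's original indentation.
    return " ".join(tokens)

# Hypothetical kernel line, for illustration only.
line = "kernel /vmlinuz-2.6.24.5-85.fc8 ro root=/dev/VolGroup00/LogVol00 rhgb quiet"
print(add_kernel_param(line, "acpi=off"))
```

Replacing rather than blindly appending keeps the line from accumulating conflicting acpi= settings across repeated experiments.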

 

You should also consider other alternatives to an MD1000/MD3000 multiple array setup ... the I/O limitations of these configurations really don't make them much more valuable than their simpler hardware cousins.

 

May 20th, 2008 19:00

I would point out that there are several distinctions to be made which are important to learning whether our situations are comparable:

 

  • Are you running an LSI or the Dell branded card?  I have read the Dell card has a different firmware and is not compatible with the LSI drivers (at least with Solaris drivers).
  • Assuming the latter (we are using a Dell branded card), are you using a Dell SAS or Dell PERC?  The SAS cards are supposed to have better support than the PERCs.
  • Of the PERC cards, the PERC 6/e may have the most compatibility problems from my reading.  I also wonder if the SAS6i cards do as well.

Our configuration is a Dell PowerEdge R200 with Dell SAS6i (2x SAS in RAID 1) and Dell PERC 6/e attached to an external Dell MD1000 (15x SATA in RAID 5 w/hotspare).  As the clients are Mac OS X, I wish to run Netatalk, which did not compile on RHEL 5, and thus I am trying Fedora 8.  (Fedora 9 wouldn't install, and as an aside, sda and sdb were reversed, which I find irksome.  CentOS 5 wouldn't install either.  Both Fedora 9 and CentOS 5 complain of packages missing/corrupted on the install media.)

 

I intermittently lose the MD1000 configuration and also receive filesystem corruption errors from the internal SAS drives.  I have tried disabling the PERC 6/e writeback cache for now (I would be very irritated if this was not supported).

 

Here are a couple irksome messages I noticed:

 

 


### Logwatch 7.3.6 (05/19/07) ###

 

------------ Kernel Begin --------------

WARNING:  Kernel Errors Present
    Buffer I/O error on device sdb, l ...:  30 Time(s)
    end_request: I/O error, dev sdb, sector ...:  158 Time(s)
 
------------ Kernel End ----------------


 

 

Also, found this:

 

 


[root@meatyard ~]# dmesg | grep cache
...
PCI: cache line size of 32 is not supported by device 0000:00:1d.7

 

...

sd 0:1:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 0:1:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 1:2:0:0: [sdb] Write cache: disabled, read cache: disabled, supports DPO and FUA
sd 1:2:0:0: [sdb] Write cache: disabled, read cache: disabled, supports DPO and FUA


 

 

Not sure what the PCI cache line size message means; I included it for completeness.  I have not read about DPO and FUA.

 

And, this one:

 

 


[root@meatyard ~]# dmesg | grep sdb
sd 1:2:0:0: [sdb] Very big device. Trying to use READ CAPACITY(16).
sd 1:2:0:0: [sdb] 19032965120 512-byte hardware sectors (9744878 MB)
sd 1:2:0:0: [sdb] Write Protect is off
sd 1:2:0:0: [sdb] Mode Sense: 1f 00 10 08
sd 1:2:0:0: [sdb] Write cache: disabled, read cache: disabled, supports DPO and FUA
sd 1:2:0:0: [sdb] Very big device. Trying to use READ CAPACITY(16).
sd 1:2:0:0: [sdb] 19032965120 512-byte hardware sectors (9744878 MB)
sd 1:2:0:0: [sdb] Write Protect is off
sd 1:2:0:0: [sdb] Mode Sense: 1f 00 10 08
sd 1:2:0:0: [sdb] Write cache: disabled, read cache: disabled, supports DPO and FUA
 sdb: unknown partition table
sd 1:2:0:0: [sdb] Attached SCSI disk
XFS mounting filesystem sdb
Ending clean XFS mount for filesystem: sdb
SELinux: initialized (dev sdb, type xfs), uses xattr
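
The size figures in that output can be checked with a little arithmetic, which also shows why the kernel falls back to READ CAPACITY(16). This is a small sketch using only the sector count and sector size printed above; the function names are illustrative.

```python
SECTORS = 19032965120   # from the "[sdb] ... 512-byte hardware sectors" line
SECTOR_BYTES = 512

def capacity_mb(sectors, sector_bytes=SECTOR_BYTES):
    # Decimal megabytes (10**6 bytes), using integer division.
    return sectors * sector_bytes // 10**6

def needs_read_capacity_16(sectors):
    # READ CAPACITY(10) has a 32-bit last-LBA field and reports 0xFFFFFFFF
    # when the real last LBA does not fit, so larger devices need the
    # 16-byte variant.
    return sectors - 1 >= 0xFFFFFFFF

print(capacity_mb(SECTORS))             # 9744878
print(needs_read_capacity_16(SECTORS))  # True
```

With 512-byte sectors the 32-bit limit works out to 2 TiB, so a roughly 9.7 TB volume is well past it and the "Very big device" message is expected rather than a sign of trouble.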

 


 

 

Finally, this one on ACPI (notice the message about trying "acpi_osi=Linux", which I haven't tried yet, as it was 1:30am when I found the message):

 


[root@meatyard ~]# dmesg | grep ACPI
 BIOS-e820: 00000000cfec8000 - 00000000cfee7c00 (ACPI data)
ACPI: RSDP 000F2980, 0024 (r2 DELL  )
ACPI: XSDT 000F2A08, 0094 (r1 DELL   PE_SC3          1 DELL        1)
ACPI: FACP 000F2B20, 00F4 (r3 DELL   PE_SC3          1 DELL        1)
ACPI: DSDT CFEC8000, 2545 (r1 DELL   PE_SC3          1 INTL 20050624)
ACPI: FACS CFEE7C00, 0040
ACPI: APIC 000F2C14, 0072 (r1 DELL   PE_SC3          1 DELL        1)
ACPI: SPCR 000F2C93, 0050 (r1 DELL   PE_SC3          1 DELL        1)
ACPI: HPET 000F2CE3, 0038 (r1 DELL   PE_SC3          1 DELL        1)
ACPI: MCFG 000F2D1B, 003C (r1 DELL   PE_SC3          1 DELL        1)
ACPI: WD__ 000F2D57, 0134 (r1 DELL   PE_SC3          1 DELL        1)
ACPI: SLIC 000F2E8B, 0024 (r1 DELL   PE_SC3          1 DELL        1)
ACPI: ERST CFECB114, 0210 (r1 DELL   PE_SC3          1 DELL        1)
ACPI: HEST CFECB324, 027C (r1 DELL   PE_SC3          1 DELL        1)
ACPI: BERT CFECAF94, 0030 (r1 DELL   PE_SC3          1 DELL        1)
ACPI: EINJ CFECAFC4, 0150 (r1 DELL   PE_SC3          1 DELL        1)
ACPI: SSDT CFECA545, 01E2 (r1 DELL   PE_SC3         11 INTL 20050624)
ACPI: SSDT CFECA9BB, 01E2 (r1 DELL   PE_SC3         11 INTL 20050624)
ACPI: SSDT CFECAE31, 0162 (r1 DELL   PE_SC3         10 INTL 20050624)
ACPI: PM-Timer IO Port: 0x808
ACPI: Local APIC address 0xfee00000
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] enabled)
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x12] disabled)
ACPI: LAPIC (acpi_id[0x04] lapic_id[0x13] disabled)
ACPI: LAPIC_NMI (acpi_id[0xff] high edge lint[0x1])
ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
ACPI: IRQ0 used by override.
ACPI: IRQ2 used by override.
ACPI: IRQ9 used by override.
ACPI: HPET id: 0x8086a301 base: 0xfed00000
Using ACPI (MADT) for SMP configuration information
ACPI: Core revision 20070126
ACPI: bus type pci registered
ACPI: EC: Look up EC in DSDT
ACPI: BIOS _OSI(Linux) query ignored
ACPI: DMI System Vendor: Dell Inc.
ACPI: DMI Product Name: PowerEdge R200
ACPI: DMI Product Version:
ACPI: DMI Board Name: 0TY019
ACPI: DMI BIOS Vendor: Dell Inc.
ACPI: DMI BIOS Date: 03/05/2008
ACPI: Please send DMI info above to linux-acpi@vger.kernel.org
ACPI: If "acpi_osi=Linux" works better, please notify linux-acpi@vger.kernel.org
ACPI: Interpreter enabled
ACPI: (supports S0 S4 S5)
ACPI: Using IOAPIC for interrupt routing
ACPI: PCI Root Bridge [PCI0] (0000:00)
PCI quirk: region 0800-087f claimed by ICH6 ACPI/GPIO/TCO
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.PEX1._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.SBE0._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.SBE4._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.SBE5._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.COMP._PRT]
ACPI: PCI Interrupt Link [LK00] (IRQs 3 4 5 6 7 10 11 12 14 *15)
ACPI: PCI Interrupt Link [LK01] (IRQs 3 4 5 6 7 10 11 12 *14 15)
ACPI: PCI Interrupt Link [LK02] (IRQs 3 4 5 6 7 10 11 12 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LK03] (IRQs 3 4 *5 6 7 10 11 12 14 15)
ACPI: PCI Interrupt Link [LK04] (IRQs 3 4 5 6 7 *10 11 12 14 15)
ACPI: PCI Interrupt Link [LK05] (IRQs 3 4 5 6 7 10 *11 12 14 15)
ACPI: PCI Interrupt Link [LK06] (IRQs 3 4 5 6 7 10 11 12 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LK07] (IRQs 3 4 5 *6 7 10 11 12 14 15)
pnp: PnP ACPI init
ACPI: bus type pnp registered
pnp: PnP ACPI: found 14 devices
ACPI: ACPI bus type pnp unregistered
PCI: Using ACPI for IRQ routing
ACPI: PCI Interrupt 0000:00:01.0 -> GSI 16 (level, low) -> IRQ 16
ACPI: PCI Interrupt 0000:00:1c.0 -> GSI 16 (level, low) -> IRQ 16
ACPI: PCI Interrupt 0000:00:1c.4 -> GSI 16 (level, low) -> IRQ 16
ACPI: PCI Interrupt 0000:00:1c.5 -> GSI 17 (level, low) -> IRQ 17
ACPI Exception (processor_core-0816): AE_NOT_FOUND, Processor Device is not present [20070126]
ACPI Exception (processor_core-0816): AE_NOT_FOUND, Processor Device is not present [20070126]
ACPI: PCI Interrupt 0000:00:1d.7 -> GSI 21 (level, low) -> IRQ 21
ACPI: PCI Interrupt 0000:00:1d.0 -> GSI 21 (level, low) -> IRQ 21
ACPI: PCI Interrupt 0000:00:1d.1 -> GSI 20 (level, low) -> IRQ 20
ACPI: PCI Interrupt 0000:00:1d.2 -> GSI 21 (level, low) -> IRQ 21
ACPI: PCI Interrupt 0000:02:00.0 -> GSI 16 (level, low) -> IRQ 16
ACPI: PCI Interrupt 0000:01:00.0 -> GSI 16 (level, low) -> IRQ 16
ACPI: PCI Interrupt 0000:00:1f.2 -> GSI 23 (level, low) -> IRQ 23
ACPI: Power Button (FF) [PWRF]
ACPI: PCI Interrupt 0000:03:00.0 -> GSI 16 (level, low) -> IRQ 16
ACPI: PCI Interrupt 0000:04:00.0 -> GSI 17 (level, low) -> IRQ 17
ACPI: PCI Interrupt 0000:05:05.0 -> GSI 19 (level, low) -> IRQ 19
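
Since IRQ sharing came up earlier in the thread, the "PCI Interrupt" lines can be grouped by IRQ to see which devices share a line. The sketch below uses a handful of lines copied from the output above; the function name is illustrative, and the pattern tolerates a line break between the device address and the arrow. Shared level-triggered PCI interrupts are normal in themselves, so this is a way to see the layout, not proof of a fault.

```python
import re
from collections import defaultdict

# A subset of the "ACPI: PCI Interrupt" lines from the dmesg output above.
DMESG = """\
ACPI: PCI Interrupt 0000:00:01.0 -> GSI 16 (level, low) -> IRQ 16
ACPI: PCI Interrupt 0000:00:1c.5 -> GSI 17 (level, low) -> IRQ 17
ACPI: PCI Interrupt 0000:01:00.0 -> GSI 16 (level, low) -> IRQ 16
ACPI: PCI Interrupt 0000:03:00.0 -> GSI 16 (level, low) -> IRQ 16
ACPI: PCI Interrupt 0000:04:00.0 -> GSI 17 (level, low) -> IRQ 17
ACPI: PCI Interrupt 0000:05:05.0 -> GSI 19 (level, low) -> IRQ 19
"""

# \s* also matches a newline, so wrapped lines are still picked up.
PATTERN = re.compile(r"PCI Interrupt (\S+)\s*-> GSI \d+ .*?-> IRQ (\d+)")

def devices_by_irq(text):
    """Map each IRQ number to the PCI addresses routed to it."""
    groups = defaultdict(list)
    for dev, irq in PATTERN.findall(text):
        groups[int(irq)].append(dev)
    return dict(groups)

shared = {irq: devs for irq, devs in devices_by_irq(DMESG).items() if len(devs) > 1}
print(shared)
```

Matching the PCI addresses against lspci output would show which of the shared lines, if any, belong to the storage controllers.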


 

 

 As regards trying another configuration, I believe we are stuck as our 30-day evaluation period has expired.  Given that Dell claimed the MD1000 would "work out of the box" with RHEL and that I have spent several days on this issue, I think an explanation from DELL IS IN ORDER!!!  Your suggestion to hack together a driver from a combination of third-party OEM sources, a friend's utility and obsolete DUDs is, I can only hope, not what Dell meant by "works out of the box."

5 Posts

May 30th, 2008 20:00

I have a workaround for this issue...

 

Apparently this has to do with how VMware's virtual driver accesses the CD-ROM through the PERC6 on the PE 2950.  If you remove the CD-ROM device by editing the VM's settings, the issue goes away.  Of course, now you can't access physical media in the CD-ROM drive from any of the VMs, but at least it runs without freezing up.  I still need to verify that I can access CDs mounted directly from ISO images, which should work.  I just haven't had time to test it...
