DMAR Kernel Log Errors and RAID Controller Crash

Problem: After upgrading an old HP ProLiant DL320 G6 server to Linux kernel 6.1 (Devuan 5, Debian 12), the system crashes during reboot. You will see the following kernel output:

DMAR: DRHD: handling fault status reg 2
DMAR: [DMA Read] Request device [00:1e.0] PASID ffffffff fault addr df63e000 [fault reason 06] PTE Read access is not set
NMI: PCI system error (SERR) for reason a1 on CPU 0
Dazed and confused, but trying to continue

These errors repeat several times until the RAID controller driver “hpsa” for the Smart Array P212 controller gives up, causing /dev/sda to vanish.

Discussion: While I am not absolutely certain, the crash seems related to a known bug in the BIOS firmware, as described in https://www.suse.com/support/kb/doc/?id=000018235 . The bug concerns VT-d, Intel's Virtualization Technology for Directed I/O, which implements interrupt remapping for devices. The kernel prints a warning during the boot process with these lines:

 DMAR-IR: This system BIOS has enabled interrupt remapping
    on a chipset that contains an erratum making that
    feature unstable. To maintain system stability
    interrupt remapping is being disabled. Please
    contact your BIOS vendor for an update
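
To confirm that your own system prints this warning, you can search the kernel log for it. The snippet below is only a small sketch: the function name has_intremap_warning is made up for this example, and it matches on a fragment of the first line of the message quoted above.

```shell
# Hypothetical helper: reads kernel log text on stdin and succeeds
# if the interrupt remapping erratum warning quoted above is present.
has_intremap_warning() {
    grep -q 'BIOS has enabled interrupt remapping'
}

# Usage on a live system (not run here, dmesg may need root):
#   dmesg | has_intremap_warning && echo "kernel disabled interrupt remapping"
```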

According to the SUSE support page, this bug is found in the Intel 5500 and 5520 chipsets, revisions 12, 13 and 22. You can check your chipset with:

# lspci -nn  | grep "8086:3403"
00:00.0 Host bridge [0600]: Intel Corporation 5500 I/O Hub to ESI Port [8086:3403] (rev 13)
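
The revision check can also be scripted. What follows is a minimal sketch, not a definitive test: the function name has_affected_ioh is hypothetical, and it only looks for the 8086:3403 device ID shown above combined with one of the revisions listed on the SUSE page. Other device IDs of the 5500/5520 family are not covered.

```shell
# Hypothetical helper: reads `lspci -nn` output on stdin and succeeds
# if a host bridge with ID 8086:3403 reports revision 12, 13 or 22.
has_affected_ioh() {
    grep -Eq '\[8086:3403\] \(rev (12|13|22)\)'
}

# Usage on a live system (not run here):
#   lspci -nn | has_affected_ioh && echo "affected chipset revision found"
```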

Workaround: There are several possible workarounds:

  1. Downgrade to kernel 5.10 (e.g. Devuan 4/Debian 11)
  2. Add “intremap=off” to your kernel options (GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then run update-grub)
  3. Disable VT-d in the BIOS (if not needed)
  4. Add “intel_iommu=on iommu=pt” to the kernel options
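
Options 2 and 4 both come down to editing one line in /etc/default/grub. As a rough sketch, the change could be applied with a small helper like the one below. The name add_kernel_opts is made up for this example, and the helper assumes that the file contains a single double-quoted GRUB_CMDLINE_LINUX_DEFAULT line and that sed supports -i (as GNU sed on Debian/Devuan does).

```shell
# Hypothetical helper: append kernel options to GRUB_CMDLINE_LINUX_DEFAULT.
#   $1 = path to the grub defaults file, $2 = options to append
add_kernel_opts() {
    grub_file="$1"
    opts="$2"
    # Keep a backup copy next to the original file.
    cp "$grub_file" "$grub_file.bak"
    # Insert the options just before the closing quote of the existing value.
    sed -i 's/^\(GRUB_CMDLINE_LINUX_DEFAULT="[^"]*\)"/\1 '"$opts"'"/' "$grub_file"
}

# Usage as root (not run here):
#   add_kernel_opts /etc/default/grub "intel_iommu=on iommu=pt"
#   update-grub && reboot
#   cat /proc/cmdline    # after the reboot, confirm the options are active
```

Note that the helper appends unconditionally, so running it twice adds the options twice. Check the file before regenerating the GRUB configuration.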

I personally tested only options 1 and 4 and ultimately chose the latter. The server has been running stably since then (for several days at least). Unfortunately, we found no BIOS update that resolves this issue. The latest BIOS version we found is dated 05/21/2018, and it does not address the problem.

Version: HP ProLiant DL320 G6, Devuan 5, BIOS W07 05/21/2018