Background

So recently I bought a ThinkPad L14 AMD G2. I’m pretty happy with it. It’s got good performance (thanks, AMD), great battery life, and good build quality overall.

But there’s one problem. I installed Arch on it and kept getting random kernel panics at boot. About half the time I booted the laptop, I got a kernel panic.

Thankfully it’s Linux, so we can debug it and fix it ourselves (with the help of the community).

Investigating

The first thing to do when you are facing a kernel panic is to check the logs. They give very valuable information about what’s going on in the kernel at any point in time, from boot to shutdown.

You can use dmesg to see the kernel logs from your current session. But in this case that’s not really useful, since the system would crash at boot or very shortly after booting.

There’s another way to get kernel logs: journalctl can retrieve logs and offers more options.

You can use journalctl -k -b -1. The -k argument shows only kernel messages. The -b argument selects which boot to display messages from. If we pass -1 to it, we get the messages from the previous boot (the current boot minus 1).
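Here’s a minimal sketch of how those options fit together. The function names are made up for illustration; the flags themselves (--list-boots, -k, -b, -p, --no-pager) are standard journalctl options. It assumes the journal is stored persistently, otherwise there is no previous boot to look at.

```shell
# Hypothetical helper names; they only wrap standard journalctl options.

# List the boots the journal knows about, with their offsets
# (0 is the current boot, -1 the previous one, and so on).
list_boots() {
    journalctl --list-boots --no-pager
}

# Kernel messages (-k) from a given boot (-b); defaults to the previous boot.
kernel_log() {
    journalctl -k -b "${1:--1}" --no-pager
}

# Same, but only messages with priority "err" or worse (-p err).
kernel_errors() {
    journalctl -k -b "${1:--1}" -p err --no-pager
}
```

For example, kernel_log -2 would show the kernel messages from two boots ago, and kernel_errors with no argument narrows the previous boot down to the error-level lines.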

journalctl is a very powerful tool that I encourage everyone to play around with. Check the --help and try stuff; you can’t really break anything :).

Looking at the logs, there were some errors, which is sometimes normal, but in this case one set of errors caught my attention.

Feb 18 18:35:35 kernel: mt7921e 0000:03:00.0: Timeout for driver own
Feb 18 18:35:35 kernel: ------------[ cut here ]------------
Feb 18 18:35:35 kernel: WARNING: CPU: 1 PID: 329 at drivers/iommu/dma-iommu.c:848 iommu_dma_unmap_page+0>
Feb 18 18:35:35 kernel: Modules linked in: joydev mousedev intel_rapl_msr intel_rapl_common uvcvideo vid>
Feb 18 18:35:35 kernel:  acpi_cpufreq pinctrl_amd i2c_scmi crypto_user fuse bpf_preload ip_tables x_tabl>
Feb 18 18:35:35 kernel: CPU: 1 PID: 329 Comm: systemd-udevd Not tainted 5.16.10-arch1-1 #1 481a3e145f0d7>
Feb 18 18:35:35 kernel: Hardware name: LENOVO 20X5003WFR/20X5003WFR, BIOS R1KET36W (1.21 ) 11/25/2021
Feb 18 18:35:35 kernel: RIP: 0010:iommu_dma_unmap_page+0x79/0x90
Feb 18 18:35:35 kernel: Code: 2b 4c 3b 20 72 26 4c 3b 60 08 73 20 49 89 d8 44 89 f1 5b 4c 89 ea 4c 89 e6>
Feb 18 18:35:35 kernel: RSP: 0018:ffffa8a700d6f978 EFLAGS: 00010246
Feb 18 18:35:35 kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000015
Feb 18 18:35:35 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffa8a700d6f958
Feb 18 18:35:35 kernel: RBP: ffff8de0017960d0 R08: 0000000000000000 R09: 00000000052d4a80
Feb 18 18:35:35 kernel: R10: 0000000000000081 R11: 0000000000000001 R12: ffff8de007c4a040
Feb 18 18:35:35 kernel: R13: 00000000000006c0 R14: 0000000000000002 R15: 00000000052d4a80
Feb 18 18:35:35 kernel: FS:  00007f31e4f7ba40(0000) GS:ffff8de2dee40000(0000) knlGS:0000000000000000
Feb 18 18:35:35 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 18 18:35:35 kernel: CR2: 000055c69bfbe8d8 CR3: 0000000103590000 CR4: 0000000000750ee0
Feb 18 18:35:35 kernel: PKRU: 55555554
Feb 18 18:35:35 kernel: Call Trace:
Feb 18 18:35:35 kernel:  <TASK>
Feb 18 18:35:35 kernel:  mt76_dma_rx_cleanup+0x7f/0x110 [mt76 5f49a9fcad35ac9a5e79184db3cf21cbf51251f6]
Feb 18 18:35:35 kernel:  mt7921_wpdma_reset+0xbc/0x1c0 [mt7921e 9ad9871cd36596dbb8551ebcb7b2d94037ed0e5a]
Feb 18 18:35:35 kernel:  mt7921_register_device+0x32b/0x5d0 [mt7921_common b11fd2f2b3803f4b1bef83aafd15d>
Feb 18 18:35:35 kernel:  mt7921_pci_probe+0x1d5/0x210 [mt7921e 9ad9871cd36596dbb8551ebcb7b2d94037ed0e5a]
Feb 18 18:35:35 kernel:  ? __pm_runtime_resume+0x58/0x80
Feb 18 18:35:35 kernel:  local_pci_probe+0x45/0x80
Feb 18 18:35:35 kernel:  ? pci_match_device+0xd7/0x130
Feb 18 18:35:35 kernel:  pci_device_probe+0xcf/0x1c0
Feb 18 18:35:35 kernel:  really_probe+0x1f5/0x3f0
Feb 18 18:35:35 kernel:  __driver_probe_device+0xfe/0x180
Feb 18 18:35:35 kernel:  driver_probe_device+0x1e/0x90
Feb 18 18:35:35 kernel:  __driver_attach+0xc0/0x1c0
Feb 18 18:35:35 kernel:  ? __device_attach_driver+0xe0/0xe0
Feb 18 18:35:35 kernel:  ? __device_attach_driver+0xe0/0xe0
Feb 18 18:35:35 kernel:  bus_for_each_dev+0x89/0xd0
Feb 18 18:35:35 kernel:  bus_add_driver+0x149/0x1e0
Feb 18 18:35:35 kernel:  driver_register+0x8f/0xe0
Feb 18 18:35:35 kernel:  ? 0xffffffffc07ae000
Feb 18 18:35:35 kernel:  do_one_initcall+0x57/0x220
Feb 18 18:35:35 kernel:  do_init_module+0x5c/0x270
Feb 18 18:35:35 kernel:  load_module+0x25c3/0x2790
Feb 18 18:35:35 kernel:  ? __do_sys_init_module+0x12e/0x1b0
Feb 18 18:35:35 kernel:  __do_sys_init_module+0x12e/0x1b0
Feb 18 18:35:35 kernel:  do_syscall_64+0x5c/0x80
Feb 18 18:35:35 kernel:  ? exc_page_fault+0x72/0x170
Feb 18 18:35:35 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
Feb 18 18:35:35 kernel: RIP: 0033:0x7f31e59216ae
Feb 18 18:35:35 kernel: Code: 48 8b 0d ed 66 0e 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00>
Feb 18 18:35:35 kernel: RSP: 002b:00007ffe77ce7048 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
Feb 18 18:35:35 kernel: RAX: ffffffffffffffda RBX: 0000562fda1268a0 RCX: 00007f31e59216ae
Feb 18 18:35:35 kernel: RDX: 00007f31e5a8332c RSI: 000000000002b26f RDI: 0000562fda467c40
Feb 18 18:35:35 kernel: RBP: 0000562fda467c40 R08: 27d4eb2f165667c5 R09: 0000000000000000
Feb 18 18:35:35 kernel: R10: 0000562fda16dbd0 R11: 0000000000000246 R12: 00007f31e5a8332c
Feb 18 18:35:35 kernel: R13: 0000562fda11d120 R14: 0000562fda1268a0 R15: 0000562fda125da0
Feb 18 18:35:35 kernel:  </TASK>
Feb 18 18:35:35 kernel: ---[ end trace 7c8fc7b105719341 ]---
[...]
Feb 18 18:35:36 kernel: mt7921e 0000:03:00.0: Timeout for driver own
Feb 18 18:35:37 kernel: mt7921e 0000:03:00.0: Timeout for driver own
Feb 18 18:35:38 kernel: mt7921e 0000:03:00.0: Timeout for driver own
Feb 18 18:35:38 kernel: BUG: Bad page state in process systemd-udevd  pfn:106478

These mt7921e errors kept popping up throughout the logs and would sometimes occur right before a big crash. That’s very suspicious. So I googled mt7921e to see what it was.

Turns out it’s the wireless card in my laptop. So then I looked to see whether other people were having the same issue. A simple search for “mt7921e kernel panic” gives a lot of results from mailing lists, where it seems people are seeing the same symptoms and are already working on fixes.
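If you’d rather check this kind of thing locally instead of googling, something like the sketch below can help. The function names are hypothetical; grep, modinfo -d and lspci -k -s are standard tools and flags, and 03:00.0 is the PCI address from the log lines above (the part after “0000:”).

```shell
# Hypothetical helper names wrapping standard tools.

# Count the log lines on stdin that mention a driver or module name.
count_driver_lines() {
    grep -c -F -- "$1"
}

# Print the description field of a kernel module.
describe_module() {
    modinfo -d "$1"
}

# Show the PCI device at an address and the kernel driver bound to it.
describe_pci_device() {
    lspci -k -s "$1"
}
```

A typical use would be journalctl -k -b -1 --no-pager | count_driver_lines mt7921e to see how noisy the driver was, followed by describe_module mt7921e and describe_pci_device 03:00.0 to see what hardware it belongs to.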

Let’s pause for a moment and appreciate Linux. Not only is the code of the operating system open source, available for anyone to see and modify, but all the discussions about the work being done on every part of the kernel are publicly available too. (Even the funny stuff.)

Pretty quickly I ended up on this page, where a patch is available and developers seem to confirm that it fixes the issue.

The discussion is fairly recent, dating back only one month, so even though it says the patch was accepted, I assume it’s either not yet merged into the kernel or will land in a release very soon, which is good news.

But I still want my laptop to work now. It’s okay, we can just get the kernel ourselves, apply the patch, compile it and install it.

Patching the kernel ourselves

Before this I had never applied a patch to the kernel myself. I’ve used modified versions of the kernel, but never modified it myself. So I started looking around for guides.

As always, Arch Wiki to the rescue! I found two pages about compiling the kernel: one on building the kernel as a package that you can then install, and another on compiling the kernel “traditionally” and installing it manually.

Arch wiki / Kernel / Arch Build System

Arch wiki / Kernel / Traditional compilation

I’m linking both of them here. However, if you are running Arch and just want to apply a patch to the kernel without messing around too much, I recommend the Arch Build System, which makes applying a patch very easy. You just have to put the patch in the folder and add two lines to the provided PKGBUILD template.
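For reference, here’s roughly what those two lines can look like. Treat it as a sketch, not the exact PKGBUILD: the patch file name is a placeholder, and it assumes the PKGBUILD uses a sha256sums array and that its prepare() step applies every *.patch entry listed in source.

```shell
# PKGBUILD fragment (a PKGBUILD is a Bash script); not a complete file.
# Place these after the existing source=(...) and sha256sums=(...) arrays.

# "mt7921e-fix.patch" is a placeholder name for the patch file saved
# next to the PKGBUILD.
source+=("mt7921e-fix.patch")

# One checksum entry per source entry. "SKIP" disables verification for
# that file; running updpkgsums fills in real checksums instead.
sha256sums+=("SKIP")
```

With that in place, makepkg -s builds the package and pacman -U installs the resulting package file.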

After following the instructions I had a package that I could install.

One reboot later, the system still boots (phew) and works fine. Two reboots later, the system still doesn’t crash. Another one after that, still no crash. No more errors in dmesg about the wireless modules.
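To double-check across several boots without reading each log by hand, a small loop like this one can count the suspicious message per boot. It’s only a sketch: count_per_boot is a made-up name, and it assumes the journal still holds the boots you ask for.

```shell
# Hypothetical helper: for the current boot and the boots before it, print
# how many kernel log lines contain a fixed string.
# $1 = string to look for, $2 = number of boots to check (default 3).
count_per_boot() (
    pattern=$1
    boots=${2:-3}
    i=0
    while [ "$i" -lt "$boots" ]; do
        offset=$((0 - i))   # 0 = current boot, -1 = previous, ...
        # grep -c exits non-zero when the count is 0, hence "|| true".
        n=$(journalctl -k -b "$offset" --no-pager | grep -c -F -- "$pattern" || true)
        printf 'boot %s: %s\n' "$offset" "$n"
        i=$((i + 1))
    done
)
```

Calling count_per_boot 'Timeout for driver own' 3 prints one line per boot; a column of zeros after installing the patched kernel is what you want to see.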

Looks like it’s fixed!

If that’s not a true demonstration of the power of Linux, open-source software and its communities, I don’t know what is. Thank you, open-source community <3.