Intel i915 Kernel Hang Hypothesis — Part 3 Cross OS Display Issues Investigation
Following up on the previous parts of this series: I've been collecting more data on the blank screen hang, and I now have a working hypothesis.
Hardware:
- System: HP Laptop 15t-dy200
- OS: Ubuntu 24.04
- Graphics card: Intel Iris Xe Graphics, Tiger Lake (TGL-GT2)
- Graphics Driver: i915
- Kernel: Linux 6.17.0-14-generic
Current state and symptoms:
Since the graphics issue and system crash described in the previous post, I have been using Linux (Ubuntu 24.04).
Timeline:
⚙️2024-
When I first installed Ubuntu in 2024 (see From Windows Crash to Ubuntu Boot Issues — Part 1 Cross OS Display Issues Investigation), my display issue was screen flickering.
I tried all sorts of methods to fix this issue, such as testing older and newer kernel versions, trying other kernel boot parameters, switching display servers from Wayland to Xorg, switching the rendering API to Vulkan, and much more, but the issue persisted unless the specific kernel boot parameters were in place.
⚙️ Early 2025-
After I tried out another Linux distro in early 2025, the issue evolved into a blank screen, as described in Cross-Distro Boot Failures and GRUB Tweaks — Part 2 Cross OS Display Issues Investigation.
⚙️ Late 2025-
Around this time, I guessed that the issue might be related to an outdated BIOS and decided to update it. Updating the BIOS from Ubuntu didn't seem possible, so I thought I'd install Windows again.
However, since I had messed up my partitions during my Ubuntu install in 2024 (Ubuntu quite literally took 100% of my hard disk space), I had to delete Ubuntu, my only OS. For a while, I was left without any OS.
I fixed the partitions by booting into the recovery mode of an Ubuntu bootable USB, and then attempted to install Windows again. I once again hit the issue where drives and partitions can't be detected, which I fixed by adding the RST driver to the bootable USB containing Windows 11.
I had also tried to install Windows 10, and it had shown the same issue.
Windows was as buggy as ever for me, and most of the drivers crashed my system. After trying for a long time and failing to fix the system, I decided to install Ubuntu 24.04.4 again, this time on a proper partition alongside Windows. I had to boot into recovery mode each time to use it.
After finally getting an OS, I followed the steps I mentioned in BIOS Update Without Windows and updated my BIOS.
Unfortunately, it didn't fix the bug.
Since recovery mode boots with the nomodeset kernel boot parameter by default, I decided to add nomodeset to the regular boot as well. It fixed the issue, and I could log in properly.
After experimenting with multiple parameters, I found out that intel_idle.max_cstate=1 i915.enable_psr=0 also fixed my issue.
With these kernel boot parameters in place, the system is completely stable and has not hung once.
So, why did the display bug symptoms change midway? What changed?
Previously, my screen flickered in the absence of the kernel boot parameters. Now, removing these parameters results in a blank screen before login, where even the keyboard is inactive.
During this 'blank screen' state, the system does not respond to any interrupts, and only a forced shutdown works.
---
A short explanation of the two kernel boot parameters used
Kernel Boot parameters:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_pstate=guided intel_idle.max_cstate=1"
quiet: suppresses boot messages
splash: shows graphical boot animation (the logo and the loading symbol)
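intel_pstate=guided:
This is meant to change how CPU frequency decisions are split between the OS and the hardware.
In the default active mode, the intel_pstate driver manages frequency selection itself (on CPUs with hardware-managed P-states, by passing hints to the processor).
In guided mode, the kernel only provides performance limits as hints, and the CPU decides what frequency to use within them.
One caveat: the kernel's parameter documentation lists guided as an amd_pstate mode, while the documented intel_pstate operation modes are active and passive. This option may therefore be ignored on my Intel CPU, in which case intel_idle.max_cstate=1 is doing the actual work.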
intel_idle.max_cstate=1
C-states are the CPU idle power states. They define how deeply a CPU core can power down when it is idle, which saves a lot of power.
States:
I have restricted my C-states to C1. In C1, the core halts instruction execution but stays powered with its state intact, so it can resume almost instantly.
Deeper C-states power down more components of the CPU, such as clocks, caches, and voltage rails, which reduces power consumption but increases wake-up latency.
In C7, the CPU core voltage is nearly 0.
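To see which idle states the kernel actually exposes on a given machine, the cpuidle sysfs directory can be read directly. The following is a minimal sketch, assuming the standard /sys/devices/system/cpu/cpu0/cpuidle layout with name, latency, and disable files per state; the function name list_idle_states is my own and not part of any tool.

```python
from pathlib import Path


def list_idle_states(cpuidle_dir="/sys/devices/system/cpu/cpu0/cpuidle"):
    """Return (name, exit latency in microseconds, disabled) for each idle state.

    Assumes the usual cpuidle sysfs layout: state0/, state1/, ... each holding
    'name', 'latency' and 'disable' files. Returns [] if the directory is missing.
    """
    base = Path(cpuidle_dir)
    if not base.is_dir():
        return []
    states = []
    # Sort numerically so that state10 comes after state2.
    for state_dir in sorted(base.glob("state[0-9]*"), key=lambda p: int(p.name[5:])):
        name = (state_dir / "name").read_text().strip()
        latency_us = int((state_dir / "latency").read_text())
        disabled = (state_dir / "disable").read_text().strip() != "0"
        states.append((name, latency_us, disabled))
    return states


if __name__ == "__main__":
    for name, latency_us, disabled in list_idle_states():
        suffix = " (disabled)" if disabled else ""
        print(f"{name}: exit latency {latency_us} us{suffix}")
```

With intel_idle.max_cstate=1 in effect, I would expect this list to stop at the C1-level states, which makes it a quick way to confirm that the limit was applied.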
Other kernel boot parameters that worked:
i915.enable_psr=0
This is another kernel boot parameter that worked for me, together with the max_cstate parameter.
It disables a feature called Panel Self Refresh (PSR), a power saving feature used in laptops with eDP panels. Without PSR, the GPU has to continuously send frames to the display, even if nothing has changed, which drains the battery unnecessarily.
With PSR, the panel keeps the last frame in its own internal buffer and refreshes itself from it, so the GPU doesn't have to work continuously while the image is static.
However, PSR is known to cause problems like screen flickering on some hardware, so disabling it fixes the issue for many people.
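After editing GRUB, it is worth confirming which parameters the running kernel actually booted with. The sketch below parses the contents of /proc/cmdline into a dictionary; parse_cmdline is a hypothetical helper name, and the parser deliberately ignores quoted values containing spaces, which none of the parameters above use.

```python
import os


def parse_cmdline(text):
    """Split a kernel command line into {parameter: value}.

    Bare flags such as 'quiet' or 'nomodeset' map to None.
    Simplification: quoted values containing spaces are not handled.
    """
    params = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


if __name__ == "__main__":
    if os.path.exists("/proc/cmdline"):
        with open("/proc/cmdline") as f:
            active = parse_cmdline(f.read())
        for name in ("intel_idle.max_cstate", "i915.enable_psr", "nomodeset"):
            if name not in active:
                print(f"{name}: not set")
            elif active[name] is None:
                print(f"{name}: set (flag)")
            else:
                print(f"{name}={active[name]}")
```

This only shows what was passed on the command line, not whether a driver accepted the value, so it complements rather than replaces checking the driver's own state.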
My Hypothesis
Based on all the symptoms and data collected, here is my current hypothesis about what's happening under the hood
1] Not a hardware issue
The graphics are completely stable, without any sign of flickering, when the kernel boot parameters are applied.
I also tried switching to an external monitor, but the issue persisted, so the laptop panel itself is not the cause.
Together, these observations make a hardware fault unlikely.
2] Kernel Panic
Kernel Panics are safety measures taken by an OS when it encounters a fatal error that it cannot recover from. They are designed to preserve system integrity and prevent data corruption.
In Linux, they are triggered when the panic() function is called.
They can be traced by examining journalctl or dmesg after a reboot.
We look for an Oops (a non-fatal kernel error, which can escalate to a panic), RIP (the instruction pointer, i.e. the address of the instruction that caused the failure), or a call trace: the list of functions that were active when panic() was triggered.
So, after temporarily booting without the kernel boot parameters, force shutting down, and booting into recovery mode, I checked journalctl. However, I didn't see any significant issues there.
dmesg only shows messages from the current boot (here, recovery mode), so I couldn't use it to examine what happened during the earlier blank screen state.
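Unlike dmesg, journalctl can show the kernel log of the previous boot with the -k and -b -1 options, provided the journal is stored persistently (an assumption that depends on the system's journald configuration). A minimal sketch of scanning that log for crash markers follows; the marker list and the function names are my own choices, and an empty result does not prove that nothing went wrong, since a hard hang may stop the kernel before anything reaches disk.

```python
import shutil
import subprocess

# Substrings that commonly appear in kernel logs around a crash or lockup.
# This list is illustrative, not exhaustive.
MARKERS = (
    "Kernel panic",
    "Oops",
    "BUG:",
    "RIP:",
    "Call Trace",
    "soft lockup",
    "hard LOCKUP",
    "GPU HANG",
)


def find_crash_lines(log_text, markers=MARKERS):
    """Return the log lines containing any marker (case-insensitive)."""
    lowered = [m.lower() for m in markers]
    return [
        line
        for line in log_text.splitlines()
        if any(m in line.lower() for m in lowered)
    ]


def previous_boot_kernel_log():
    """Fetch kernel messages from the previous boot via journalctl.

    -k: kernel messages only, -b -1: previous boot, --no-pager: plain output.
    Returns an empty string if journalctl has no record of that boot.
    """
    result = subprocess.run(
        ["journalctl", "-k", "-b", "-1", "--no-pager"],
        capture_output=True,
        text=True,
    )
    return result.stdout


if __name__ == "__main__":
    if shutil.which("journalctl") is None:
        print("journalctl not found on this system")
    else:
        for line in find_crash_lines(previous_boot_kernel_log()):
            print(line)
```

Running this from recovery mode right after a forced shutdown targets the boot that hung, as long as no other boot happened in between.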
Kernel panics typically leave an Oops message, unless the system crashed so violently that the kernel could not output one.
But during my blank screen freeze, external inputs are also completely dead. The Caps Lock key doesn't respond, and the Magic SysRq sequence REISUB doesn't work either. This means that the keyboard is completely unresponsive.
That is unusual for a kernel panic, so this may be a kernel hang instead.
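One thing worth ruling out is that REISUB was ignored because of configuration rather than because the kernel was hung. The kernel.sysrq setting in /proc/sys/kernel/sysrq is either 0 (disabled), 1 (everything enabled), or a bitmask of allowed functions. The sketch below decodes which REISUB keys a given value permits, using the bit values from the kernel's SysRq documentation; the helper name is hypothetical.

```python
from pathlib import Path

# Bit required in kernel.sysrq for each REISUB key:
#   4   keyboard control (R: unraw)
#   64  signalling of processes (E: term, I: kill)
#   16  sync (S)
#   32  remount read-only (U)
#   128 reboot/poweroff (B)
REISUB_BITS = {"R": 4, "E": 64, "I": 64, "S": 16, "U": 32, "B": 128}


def allowed_reisub_keys(sysrq_value):
    """Return the REISUB keys permitted by a kernel.sysrq value."""
    if sysrq_value == 1:
        # 1 is a special value meaning "all functions enabled".
        return list(REISUB_BITS)
    return [key for key, bit in REISUB_BITS.items() if sysrq_value & bit]


if __name__ == "__main__":
    path = Path("/proc/sys/kernel/sysrq")
    if path.is_file():
        value = int(path.read_text())
        print(f"kernel.sysrq = {value}; allowed REISUB keys: {allowed_reisub_keys(value)}")
```

For example, a value of 176 (128 + 32 + 16) allows only S, U, and B. If B is allowed and the machine still does not reboot, that supports the idea that the kernel was not processing keyboard interrupts at all.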
3] Kernel Hang
Kernel hangs, or hard lockups, occur when the CPUs are trapped in an unrecoverable execution state, such as a deadlock, an infinite loop, or an endless wait for an event, causing the entire system to become unresponsive.
The OS cannot process any interrupts (mouse, keyboard) in this state, and the clock stops updating.
The only option is a manual hardware reset (force shutdown).
They are typically caused by buggy drivers, hardware failures, or deadlocks introduced by kernel updates.
This is very similar to my situation. My current hypothesis is that my system enters a kernel hang whenever I boot without the kernel boot parameters limiting C-states and setting the pstate to guided.
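The kernel has its own lockup detectors, and whether they are enabled affects how much evidence a hang can leave behind. As a rough sketch, the following reads the relevant sysctls from /proc/sys/kernel; the selection of sysctls is my own, some of them only exist when the kernel was built with the matching options, and a hang that stops every CPU may still leave no trace even with all of them enabled.

```python
from pathlib import Path

# Sysctls related to lockup and hung-task detection. Entries missing on a
# given kernel (feature not built in) are reported as None.
LOCKUP_SYSCTLS = (
    "nmi_watchdog",
    "watchdog_thresh",
    "softlockup_panic",
    "hardlockup_panic",
    "hung_task_timeout_secs",
)


def read_lockup_settings(sysctl_dir="/proc/sys/kernel"):
    """Return {sysctl name: value as string, or None if not present}."""
    settings = {}
    for name in LOCKUP_SYSCTLS:
        path = Path(sysctl_dir) / name
        settings[name] = path.read_text().strip() if path.is_file() else None
    return settings


if __name__ == "__main__":
    for name, value in read_lockup_settings().items():
        print(f"kernel.{name} = {value if value is not None else 'not available'}")
```

If nmi_watchdog reports 0, a hard lockup with interrupts disabled would go undetected by the kernel, which is one possible reason for a silent freeze with empty logs.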
4] Low Power Bug
As mentioned before, I have restricted the maximum C-state to C1 for my system to work.
I tried allowing C-states deeper than C1, but they led to the system freezing.
This suggests that the bug surfaces whenever the CPU drops into a low power state.
Just before login, the cores aren't executing much work, so the CPU is probably sitting in a deep C-state, such as C6 or deeper.
All the kernel boot parameters that fixed the issue are related to power management features.
So there is a high possibility that this is a low power bug.
Future Plans
Since this hints at a low power bug, I will look into the workings and known issues of the Intel i7 processor and Intel Iris Xe Graphics (TGL GT2).
I will also check the internal workings of my BIOS as well as Intel's display pipeline.
I also plan to check whether this is a known i915 driver bug, and to test the Mesa Iris driver for Linux.
I also plan to look into whether other users have experienced a low power bug with an Intel i7 processor and Tiger Lake graphics.
I have no idea what exactly is causing the issue as of now, but I definitely plan to get to the bottom of the mystery of why my laptop suddenly crashed out of nowhere in July 2024, after functioning well for the previous 3 years.