Why do all the proposed avenues of future investigation, and all of the current comments on this thread, focus on voodoo instead of the far more likely explanation that the display driver is just stomping on the memory of the network interface? If there's software anywhere in a system, 99% of the time that's the problem.
This is not true when radios are involved. In my experience, wireless connectivity issues are rarely caused by software; the problem is much more often caused by interference.
The interference can be internal to the device, or come from other wireless devices. In many cases the culprit is even a device that shouldn't emit RF at all: power supplies, switches, light bulbs...
Another common issue is poor antenna design (e.g. attenuation when you hold the device, or strong directionality in an antenna that should not be directional).
And, last but not least, physical obstacles. Most people understand that concrete walls with rebar will block signal, but a surprisingly large number of people try to use aluminum stands or cases for devices with wireless radios.
All those factors will cause connection issues, and they are really common because debugging them is so hard (who has a spectrum analyzer at home? How do you find out which one of dozens of electronic devices is emitting RF that it shouldn't?)
In addition, the linked forum thread includes a user describing how high resolutions break 2.4GHz networks for them, but 5GHz networks work fine. The display driver is stomping on memory responsible for 2.4GHz, but not 5GHz? I'm really not seeing that as the more likely problem here.
5GHz WiFi has more bandwidth than 2.4GHz, so typically will involve larger IO buffers in the driver, which could easily be enough to expose a memory scribbler (I imagine there's a bunch of other features that are enabled/disabled by the frequency band switch too). However, I think
asdfasgasdgasdg's answer is the correct reason not to suspect a memory scribbler - i.e. a memory scribbler would cause the driver to crash/fail and the kernel would log a message.
Remember the Pi has an odd architecture and all the IO passes through the GPU. The GPU doesn't log human-readable messages anywhere. There's a good chance the GPU did log a crash or failure, but only Broadcom's engineers can see it.
It's a BCM2711, and the datasheet is NDA only - typical Broadcom!
The VideoCore (Broadcom's GPU) is the main processor on the thing, and the cluster of ARM cores that runs Linux is more of a coprocessor which can only see some of the RAM.
> 5GHz WiFi has more bandwidth than 2.4GHz, so typically will involve larger IO buffers in the driver, which could easily be enough to expose a memory scribbler
He's saying 5 GHz will expose the scribbler, and the opposite is happening, only 2.4 GHz fails.
@StavrosK Thanks for wading in in my defence, but I had actually misunderstood the situation :-)
Although, if my theory that the IO buffers are different sizes is true, then that could perturb memory layout enough to expose/hide the bug in either direction.
So the display driver is meant to be mutating memory also owned by the network controller, but not in a way that causes a crash, log messages, or a kernel panic? That doesn't seem so likely to me. I mean it's not impossible but it's rare to see memory corruption/interference cause a clean breakage like this. In my experience it usually causes things to become extremely funky for a short while, then a crash.
Every SoC I've dealt with containing a WiFi core has a dedicated coprocessor (RPU is a common name, depending on vendor) running its own firmware. So more likely, _that_ core would go funky, then crash. The kernel might have code to recover it, but I doubt it, and it certainly would complain the whole way, as you say.
In the Pi, the coprocessor is the GPU, and it is the first to initialize on boot and runs all the firmware-like stuff and handles all IO and does memory allocations/mappings.
Because what if it's not? My first thought is that the HDMI is radiating and interfering with the wifi antenna.
As an embedded engineer, it was a hard lesson for me to learn that not all issues are software issues and the hardware may need to be investigated.
This is especially true where there is different behaviour between units. You can't just assume that your 99% estimation (plucked out of thin air) is correct and discredit other potential explanations.
Then, after you're done ruling out the most likely and easiest-to-test explanations, you can start exploring the remaining possibilities. Skipping to the more exotic explanations sounds more interesting, but it's a poor use of time if there's still low-hanging fruit out there.
Improper shielding is an assumption with no evidence as yet. I also mentioned that the ease of verifying the explanation should be a factor. Changing software is usually very easy.
It's so common that it's not an unlikely starting point. EMC is a major issue in high-frequency electronics design, and the Raspberry Pi has a history of having to redesign certain parts for lack of sufficient shielding.
Absolutely, and this was before the Pi had built-in WiFi. The norms you have to comply with are immediately a lot stricter, as your device falls into a different category (telecommunications devices).
wrapping tinfoil around an hdmi plug/cable isn't particularly hard either :) chips are harder but at least you rule out the cable. HDMI cables are ridiculously finicky if you've ever tried to get anything more than the lowest common denominator 1080p going on them.
I don't agree that wrapping foil is a great way to 100% rule that out as there is room for error. Using different cables/dongles would be better and they already tried that.
There are several small-scale WiFi chips that share a clock source with USB - it would be unsurprising to find that the WiFi and video interfaces share the same clock, so drawing too much from either could directly affect the other.
These kinds of problems are common in embedded computers, like the Pi. Just as common as software.
I don't know much about the Raspberry Pi, but it looks like they chose an ARM core variant without IOMMU, so this might actually be plausible, even though it's such a computer architecture anti-pattern to share system memory DMA across devices.
Can you list which ARM cores you know of that include an IOMMU? I’m personally unaware of any, as that is typically bundled as a separate IP package that must be integrated separately into the system, and is usually customized based on the number of supported masters that require virtualization.
E.g. the Xilinx ZynqMP includes the same Cortex-A53 complex the Raspberry Pi 3 has. They also attached the CCI-400 coherent interconnect to it, and included the SMMU-500 IOMMU, which partially interfaces with the A53 interconnect but is effectively independently programmed and also controls access to DDR3/4 from the SATA, DisplayPort and PCIe controllers.
Per the original topic, have they released a full datasheet/reference manual for the Pi 4 SoC yet? I've yet to see one, other than a VERY high-level overview of its new pieces.
Huh, so that's why the iPhone 6s's SecureROM memory regions weren't MMU-locked... IOMMU doesn't come in ARM by default! So you have to wire it up yourself (in your own IP blocks), and then hook it up in software everywhere you want it to work.
And all that costs extra developer time, and money.
"kernel module" together with "should absolutely not be able to interact with each other" are an impossible requirement with Linux.
I think the other operating systems available for the Pi are roughly in the same boat (Windows & RiscOS). There was a nascent Minix port at some point, I wonder if it was abandoned.
Maybe the misbehaving driver is writing past the end of its requested space though, inadvertently? (I don't know if this is always called a "heap overflow" or if that's just Clang AddressSanitizer.)
That resulted in a wide variety of different failures, from the kernel oopsing to various userspace components crashing. It would be very unusual to have unexpected DMA trigger such a specific failure.
Mostly because of a known history over the past couple years of USB, WiFi, and/or HDMI causing direct interference with each other. See lots of other comments upthread about similar RF issues people have had, stretching all the way back to 486 laptop keyboards :)
EMI is a headache I deal with daily, on far more sensitive receivers, so voodoo is likely. Though just moving the unit next to the AP (increasing RX signal strength) is an easy diagnosis.
It certainly sounds more like a software issue than some arcane effect from RF interference or the like. Could be memory getting smashed, a bus getting saturated, an interrupt not getting serviced, or any similar thing.
Meh. I've done low-level embedded/mobile for a long time now. This actually sounds like a totally reasonable RF interference issue. 2.4GHz is funky & has desense issues with lots of internal busses (not a HW engineer, so not sure why that band specifically). Also, radios typically have to accept interference, which means the radio would "stop working" rather than causing the display to work weirdly (ironically a much easier failure mode to spot/diagnose/notice).
When the late-2016 MacBook Pro came out with only USB-C, I had to buy a USB dongle from Amazon (the one included didn't have enough ports). If I booted the MacBook into Windows with the dongle connected, the 2.4GHz WiFi would stop working while 5GHz would keep working.
Duly noted! I've been out of the embedded space for a long time (I think the last board I worked with was i386EX based) but I'm getting back into it now with an ESP32 so this might actually come in handy. Thanks! :)
A good way to think about Wayland is the screen manager brought to you by the people who thought xrandr, Mesa, Cairo, the Linux FireWire stack, and freetype were good things. If you come in with appropriate expectations you’ll be amazed that any of it works at all.
If you think it over for a bit you'll realize that the real Linux graphics protocol is Android's SurfaceFlinger. Wayland is like five orders of magnitude less popular.
The android interface would be hot garbage on desktops/laptops.
Not everyone uses a 12in netbook with only a browser window open.
The fact that Android is successful does not actually make it good. Most of its apps are worthless on a large screen, whereas there are lots of actual Linux apps of note.
Part of search quality is serving nothing when there are no results. Putting a bunch of irrelevant Pinterest pages in the results, just because there are no good results, isn’t good for users.
I don't agree. I may be desperate for a result in some situations. If there's a small chance there's something useful out there, I still want to see it. I get to decide whether it's worth my time to look.
This does not really have anything to do with either kubernetes or networks. If your computer is busy, it won't be able to process packets. Accessing certain kernel stats via proc, sys, or other special files may be really expensive. For example /proc/pid/smaps of a running mysqld takes 2 seconds on a computer I happen to have on hand. Sometimes when you have many cores it is expensive to produce some of the fields of /proc/pid/stat because the kernel has to visit numerous per-cpu data structures. /proc/pid/statm is better for this reason, if it contains what you are looking for.
TL;DR reading kernel stats can take a long time and cost a lot of CPU cycles. It costs more for more containers, and more on bigger machines.
True, it does not have anything to do with k8s or networks, but that's the context in which the issue arose: when they noticed higher network latency on a Kubernetes cluster.
The value of this blog post isn't only in the "why" ("reading kernel stats can take a long time and cost a lot of CPU cycles"); the "how" of finding the cause of the symptoms is also of interest.
> This does not really have anything to do with either kubernetes or networks.
The article is literally about an issue that was experienced while operating Kubernetes clusters.
FTA:
> Essentially, applications running on our Kubernetes clusters would observe seemingly random latency of up to and over 100ms on connections, which would cause downstream timeouts or retries.
Sounds like a problem affecting Kubernetes to me, and an important one.
More importantly, it sounds like a non-trivial problem that others operating Kubernetes clusters would be interested in learning how to identify and how to search for the root cause.
It has literally nothing to do with K8s. An equally suitable title would have been "Debugging network stalls on the Intel Xeon processor" or "Debugging network stalls on planet Earth".
I feel like the title is somewhat appropriate: part of the post was about selectively removing parts of Kubernetes networking, and digging through the kernel networking stack, to troubleshoot the issue
Can we talk about the ethics of just reposting, verbatim, a paper written by others, but with an advertisement inserted between every paragraph? How did that become OK?
I wonder how this sounds to people who live in, say, Switzerland, where it is very mountainous, snows frequently, and literally nobody drives a truck, and even the cars are not AWD.
I mean, I _know_ how it sounds because I am one of those people. But I wonder how Americans think this sounds.
European countries are much more compact, which makes public transportation much easier to justify. By contrast, the US is extremely spread out and public transportation yields much less ROI even in many urban areas.
Vehicle ownership in the US is practically a requirement because you have to drive to get anywhere. Therefore, having a versatile vehicle like a truck is more appealing.
Trucks also tend to be more durable than cars so they're more common in the used market, especially in the midwest.
...but for many, a truck is just an aesthetic/lifestyle symbol. The "country" lifestyle is generally associated with independence and work ethic - traits which are highly valued in the US. Trucks are a classic symbol of that lifestyle. That's why country songs stereotypically mention trucks.
The OP wasn't arguing for public transport, so I don't get where your comments on that came from. I totally get what you mean about "symbols" though.
> Trucks also tend to be more durable than cars
Surely the engine, drivetrain, clutch etc are the same parts you'd find in cars? Curious about what you mean here?
I drive a normal FWD car during winter weather almost nobody will go out in. I drove it cross-country through the worst snowstorm the midwest experienced in the last 10 years where I couldn't see more than 10ft in front of me.
All that said, I would have been much safer in a truck.
Alabama is the size of England. Using individual US states as points of reference for entire European countries is one form of the incommensurability. Columbus, Ohio is as far from San Diego, California as Barcelona is from Moscow. Except there's pretty much nothing but empty plains, mountains, and desert in between. Ohio has 10,000 km^2 of fresh water...about a quarter of Switzerland.
Recognizing the difference of scale is not a claim to exceptionalism. The US's scale makes it more like Russia than any western European country.
Inyo County, California is 1/3 the area of Switzerland. At a population of 18,000, it has fewer people than any canton save Appenzell Innerrhoden (~16,000). Inyo County is surrounded by more Mojave. The Mojave Desert is the size of Portugal...nearly thrice that of Switzerland.
How do various mail clients talk to their respective backends? I.e. what does iOS Mail use to talk to iCloud? The GMail iOS app speaks a bespoke binary protocol to Google's servers, not IMAP.
This does not really apply to individual Americans. The American carbon footprint boils down to driving and meat. Individual decisions made by super-consumers can be impactful.
You're still not correct...you're saying eating a plant-based diet has essentially zero footprint, which is untrue, especially with food waste. And you're saying the driving/transportation choice alone is sufficient, when we know that even if they bought an EV or used a train, those transportation regimes still incur carbon emissions in their production and operation, though admittedly less LOCAL emissions (great!) and with potential for lower future emissions as the grid cleans up (great!).
These individuals still cannot choose zero-carbon heating/cooling (which is probably 1/3 of Americans' footprint, and which you neglected to even mention); they can't choose how their infrastructure is produced (steel and cement, big-time emissions there); they can't choose the actions of their American government, which spends a lot of carbon emissions on its internal activities as well as its foreign incursions.
So no, you really can't "boil down" to zero net emissions as an American unless you stop using heating, cooling, roads, transportation of any kind except walking/bikes, and if you completely stop supporting the US govt and its activities.
Impactful, but neither sustainable nor scalable. Climate change as we face it is a tragedy of the commons. We are consuming a shared resource without pricing in the negative externality. Individual action does not solve a tragedy of the commons; this is very well-established economic theory. I would almost call it a fact.
Without a tax on the resource, all you’re doing is leaving more of it for the others to abuse. The same happens with fishing, rhinos, etc etc.
Individual action is a moot point: we must solve this collectively. Everything else is a polarising distraction.