Why do all the proposed avenues of future investigation, and all of the current comments on this thread, focus on voodoo instead of the far more likely explanation that the display driver is just stomping on the memory of the network interface? If there's software anywhere in a system, 99% of the time that's the problem.
This is not true when radios are involved. In my experience, wireless connectivity issues are rarely caused by software; the problem is much more often caused by interference.
The interference can be internal to the device, or come from other wireless devices. In many cases the culprit is even a device that shouldn't emit RF at all: power supplies, switches, light bulbs...
Another common issue is poor antenna design (e.g. attenuation when you hold the device, or strong directionality in an antenna that should not be directional).
And, last but not least, physical obstacles. Most people understand that concrete walls with rebar will block signal, but a surprisingly large number of people try to use aluminum stands or cases for devices with wireless radios.
All those factors will cause connection issues, and they are really common because debugging them is so hard (who has a spectrum analyzer at home? How do you find out which one of dozens of electronic devices is emitting RF that it shouldn't?)
In addition, the linked forum thread includes a user describing how high resolutions break 2.4GHz networks for them, but 5GHz networks work fine. The display driver is stomping on memory responsible for 2.4GHz, but not 5GHz? I'm really not seeing that as the more likely problem here.
5GHz WiFi has more bandwidth than 2.4GHz, so typically will involve larger IO buffers in the driver, which could easily be enough to expose a memory scribbler (I imagine there's a bunch of other features that are enabled/disabled by the frequency band switch too). However, I think
asdfasgasdgasdg's answer is the correct reason not to suspect a memory scribbler - i.e. a memory scribbler would cause the driver to crash/fail and the kernel would log a message.
Remember the Pi has an odd architecture and all the IO passes through the GPU. The GPU doesn't log human-readable messages anywhere. There's a good chance the GPU did log a crash or failure, but only Broadcom's engineers can see it.
It's a BCM2711, and the datasheet is NDA only - typical Broadcom!
The VideoCore (Broadcom's GPU) is the main processor on the thing, and the cluster of ARM cores that runs Linux is more of a coprocessor which can only see some of the RAM.
> 5GHz WiFi has more bandwidth than 2.4GHz, so typically will involve larger IO buffers in the driver, which could easily be enough to expose a memory scribbler
He's saying 5 GHz will expose the scribbler, and the opposite is happening, only 2.4 GHz fails.
@StavrosK Thanks for wading in in my defence, but I had actually misunderstood the situation :-)
Although, if my theory that the IO buffers are different sizes is true, then that could perturb memory layout enough to expose/hide the bug in either direction.
So the display driver is meant to be mutating memory also owned by the network controller, but not in a way that causes a crash, log messages, or a kernel panic? That doesn't seem so likely to me. I mean it's not impossible but it's rare to see memory corruption/interference cause a clean breakage like this. In my experience it usually causes things to become extremely funky for a short while, then a crash.
Every SoC I've dealt with containing a WiFi core has a dedicated coprocessor (RPU is a common name, depending on vendor) running its own firmware. So more likely, _that_ core would go funky, then crash. The kernel might have code to recover it, but I doubt it, and it certainly would complain the whole way, as you say.
In the Pi, the coprocessor is the GPU, and it is the first to initialize on boot and runs all the firmware-like stuff and handles all IO and does memory allocations/mappings.
Because what if it's not? My first thought is that the HDMI is radiating and interfering with the wifi antenna.
As an embedded engineer, it was a hard lesson for me to learn that not all issues are software issues and the hardware may need to be investigated.
This is especially true where there is different behaviour between units. You can't just assume that your 99% estimation (plucked out of thin air) is correct and discredit other potential explanations.
Then, after you're done ruling out the most likely and easiest-to-test explanations, you can start exploring the remaining possibilities. Skipping to the more exotic explanations sounds more interesting, but it's a poor use of time if there's still low-hanging fruit out there.
Improper shielding is an assumption with no evidence as yet. I also mentioned that the ease of verifying the explanation should be a factor. Changing software is usually very easy.
It's so common that it's not an unlikely starting point. EMC is a major issue in high-frequency electronics design, and the Raspberry Pi has a history of having to redesign certain parts for lack of sufficient shielding.
Absolutely, and this was before the Pi had built-in WiFi. The norms you have to comply with are immediately a lot stricter, as your device falls into a different category (telecommunications devices).
wrapping tinfoil around an hdmi plug/cable isn't particularly hard either :) chips are harder but at least you rule out the cable. HDMI cables are ridiculously finicky if you've ever tried to get anything more than the lowest common denominator 1080p going on them.
I don't agree that wrapping foil is a great way to 100% rule that out as there is room for error. Using different cables/dongles would be better and they already tried that.
There are several small-scale WiFi chips that share a clock source with USB - it would be unsurprising to find that the WiFi and video interfaces share the same clock, so drawing too much from either could directly affect the other.
These kinds of problems are common in embedded computers, like the Pi. Just as common as software.
I don't know much about the Raspberry Pi, but it looks like they chose an ARM core variant without IOMMU, so this might actually be plausible, even though it's such a computer architecture anti-pattern to share system memory DMA across devices.
Can you list which ARM cores you know of that include an IOMMU? I’m personally unaware of any, as that is typically bundled as a separate IP package that must be integrated separately into the system, and is usually customized based on the number of supported masters that require virtualization.
E.g. the Xilinx ZynqMP includes the same Cortex-A53 complex the Raspberry Pi 3 has. They also attached the CCI-400 coherent interconnect to it, and included the SMMU-500 IOMMU, which partially interfaces with the A53 interconnect but is effectively independently programmed and also controls access to DDR3/4 from the SATA, DisplayPort and PCIe controllers.
Per the original topic, have they released a full datasheet/reference manual for the Pi 4 SoC yet? I've yet to see one, other than a VERY high-level overview of its new pieces.
Huh, so that's why the iPhone 6s's SecureROM memory regions weren't MMU-locked... IOMMU doesn't come in ARM by default! So you have to wire it up yourself (in your own IP blocks), and then hook it up in software everywhere you want it to work.
And all that costs extra developer time, and money.
"kernel module" together with "should absolutely not be able to interact with each other" are an impossible requirement with Linux.
I think the other operating systems available for the Pi are roughly in the same boat (Windows & RiscOS). There was a nascent Minix port at some point, I wonder if it was abandoned.
Maybe the misbehaving driver is writing past the end of its requested space though, inadvertently? (I don't know if this is always called a "heap overflow" or if that's just Clang AddressSanitizer.)
That resulted in a wide variety of different failures, from the kernel oopsing to various userspace components crashing. It would be very unusual to have unexpected DMA trigger such a specific failure.
Mostly because of a known history over the past couple years of USB, WiFi, and/or HDMI causing direct interference with each other. See lots of other comments upthread about similar RF issues people have had, stretching all the way back to 486 laptop keyboards :)
EMI is a headache I deal with daily, on far more sensitive receivers, so voodoo is likely. Though just moving the unit next to the AP (increasing RX signal strength) is an easy diagnosis.
It certainly sounds more like a software issue than some arcane effect from RF interference or the like. Could be memory getting smashed, a bus getting saturated, an interrupt not getting serviced, or any similar thing.
Meh. I've done low-level embedded/mobile for a long time now. This actually sounds like a totally reasonable RF interference issue. 2.4GHz is funky & has desense issues with lots of internal busses (not a HW engineer, so not sure why that band specifically). Also, radios typically have to accept interference, which means the radio would "stop working" rather than causing the display to work weirdly (ironically a much easier failure mode to spot/diagnose/notice).
When the late-2016 MacBook Pro came out with only USB-C, I had to buy a USB dongle from Amazon (the one included didn't have enough ports). If I booted the MacBook into Windows with the dongle connected, the 2.4GHz WiFi would stop working while 5GHz would keep working.
Duly noted! I've been out of the embedded space for a long time (I think the last board I worked with was i386EX based) but I'm getting back into it now with an ESP32 so this might actually come in handy. Thanks! :)
A good way to think about Wayland is the screen manager brought to you by the people who thought xrandr, Mesa, Cairo, the Linux FireWire stack, and freetype were good things. If you come in with appropriate expectations you’ll be amazed that any of it works at all.
If you think it over for a bit you'll realize that the real Linux graphics protocol is Android's SurfaceFlinger. Wayland is like five orders of magnitude less popular.
The android interface would be hot garbage on desktops/laptops.
Not everyone uses a 12in netbook with only a browser window open.
The fact that Android is successful does not actually make it good. Most of its apps are worthless on a large screen, whereas there are lots of actual Linux apps of note.
Part of search quality is serving nothing when there are no results. Putting a bunch of irrelevant Pinterest pages in the results, just because there are no good results, isn’t good for users.
I don't agree. I may be desperate for a result in some situations. If there's a small chance there's something useful out there, I still want to see it. I get to decide whether it's worth my time to look.
This does not really have anything to do with either kubernetes or networks. If your computer is busy, it won't be able to process packets. Accessing certain kernel stats via proc, sys, or other special files may be really expensive. For example /proc/pid/smaps of a running mysqld takes 2 seconds on a computer I happen to have on hand. Sometimes when you have many cores it is expensive to produce some of the fields of /proc/pid/stat because the kernel has to visit numerous per-cpu data structures. /proc/pid/statm is better for this reason, if it contains what you are looking for.
TL;DR reading kernel stats can take a long time and cost a lot of CPU cycles. It costs more for more containers, and more on bigger machines.
True, it does not have anything to do with k8s or networks, but that's the context in which the issue arose: when they noticed higher network latency on a Kubernetes cluster.
The value of this blog post isn't only in the "why" ("reading kernel stats can take a long time and cost a lot of CPU cycles"); the "how" of finding the cause of the symptoms is also of interest.
> This does not really have anything to do with either kubernetes or networks.
The article is literally about an issue that was experienced while operating Kubernetes clusters.
FTA:
> Essentially, applications running on our Kubernetes clusters would observe seemingly random latency of up to and over 100ms on connections, which would cause downstream timeouts or retries.
Sounds like a problem affecting Kubernetes to me, and an important one.
More importantly, it sounds like a non-trivial problem that others operating Kubernetes clusters would be interested in learning how to identify and how to search for the root cause.
It has literally nothing to do with K8s. An equally suitable title would have been "Debugging network stalls on the Intel Xeon processor" or "Debugging network stalls on planet Earth".
I feel like the title is somewhat appropriate: part of the post was about selectively removing parts of Kubernetes networking, and digging through the kernel networking stack, to troubleshoot the issue
Can we talk about the ethics of just reposting, verbatim, a paper written by others, but with an advertisement inserted between every paragraph? How did that become OK?
I wonder how this sounds to people who live in, say, Switzerland, where it is very mountainous, snows frequently, and literally nobody drives a truck, and even the cars are not AWD.
I mean, I _know_ how it sounds because I am one of those people. But I wonder how Americans think this sounds.
European countries are much more compact, which makes public transportation much easier to justify. By contrast, the US is extremely spread out and public transportation yields much less ROI even in many urban areas.
Vehicle ownership in the US is practically a requirement because you have to drive to get anywhere. Therefore, having a versatile vehicle like a truck is more appealing.
Trucks also tend to be more durable than cars so they're more common in the used market, especially in the midwest.
...but for many, a truck is just an aesthetic/lifestyle symbol. The "country" lifestyle is generally associated with independence and work ethic - traits which are highly valued in the US. Trucks are a classic symbol of that lifestyle. That's why country songs stereotypically mention trucks.
The OP wasn't arguing for public transport, so I don't get where your comments on that came from. I totally get what you mean about "symbols" though.
> Trucks also tend to be more durable than cars
Surely the engine, drivetrain, clutch etc are the same parts you'd find in cars? Curious about what you mean here?
I drive a normal FWD car during winter weather almost nobody will go out in. I drove it cross-country through the worst snowstorm the midwest experienced in the last 10 years where I couldn't see more than 10ft in front of me.
All that said, I would have been much safer in a truck.
Alabama is the size of England. Using individual US states as points of reference for entire European countries is one form of the incommensurability. Columbus, Ohio is as far from San Diego, California as Barcelona is from Moscow. Except there's pretty much nothing but empty plains, mountains, and desert in between. Ohio has 10,000 km^2 of fresh water...about a quarter of Switzerland.
Recognizing the difference of scale is not a claim to exceptionalism. The US's scale makes it more like Russia than any western European country.
Inyo County, California is 1/3 the area of Switzerland. At a population of 18,000, it has fewer people than any canton save Appenzell Innerrhoden (~16,000). Inyo County is surrounded by more Mojave. The Mojave Desert is the size of Portugal...nearly thrice that of Switzerland.
How do various mail clients talk to their respective backends? I.e. what does iOS Mail use to talk to iCloud? The GMail iOS app speaks a bespoke binary protocol to Google's servers, not IMAP.
This does not really apply to individual Americans. The American carbon footprint boils down to driving and meat. Individual decisions made by super-consumers can be impactful.
You're still not correct...you're saying eating a plant-based diet has essentially zero footprint, which is untrue, especially with food waste. And you're saying the driving/transportation choice alone is sufficient, when we know that even if they bought an EV or used a train, those transportation regimes still incur carbon emissions in their production and operation, though admittedly less LOCAL emissions (great!) and with potential for lower future emissions as the grid cleans up (great!).
These individuals still cannot choose zero-carbon heating/cooling (which is probably 1/3 of Americans' footprint, and which you neglected to even mention); they can't choose how their infrastructure is produced (steel and cement, big-time emissions there); they can't choose the actions of their American government, which spends a lot of carbon emissions on its internal activities as well as its foreign incursions.
So no, you really can't "boil down" to zero net emissions as an American unless you stop using heating, cooling, roads, transportation of any kind except walking/bikes, and if you completely stop supporting the US govt and its activities.
Impactful, but neither sustainable nor scalable. Climate change as we face it is a tragedy of the commons. We are consuming a shared resource without pricing in the negative externality. Individual action does not solve a tragedy of the commons; this is very well-established economic theory. I would almost call it a fact.
Without a tax on the resource, all you’re doing is leaving more of it for the others to abuse. The same happens with fishing, rhinos, etc etc.
Individual action is a moot point: we must solve this collectively. Everything else is a polarising distraction.