I think the parent comment was pointing to the lack of an established causal link. The finding in their abstract is extrapolated by statistical inference; for example, smokers tend to drink more, etc. The paper does take such factors into account. Personally, I wouldn't jump to such a strong conclusion from statistical inference alone, because it closes the door on other factors that might be even stronger when combined. The paper reflects motivated reasoning more than a discovery outcome. That said, smoking is of course a major health risk; I am just pointing at the research approach.
Your question leaks your intentions and drives the LLM to confirm your cognitive bias; it treats your intentions as a conclusion. Try to phrase your questions in a way that allows the LLM to arrive at the word/concept of "suppression" in a more neutral, probabilistic manner when the context hints at it, instead of handing it the words you want to hear. Otherwise you're just falling into confirmation bias.
> The behaviour of brk() and sbrk() is unspecified if an application also uses any other memory functions (such as malloc(), mmap(), free()). Other functions may use these other memory functions silently.
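A minimal sketch of why (my own illustration, assuming Linux/glibc, where sbrk(0) reports the current program break): the allocator may move the break itself, so any region you thought you had claimed via sbrk() can be silently invalidated.

```cpp
// Sketch, assuming Linux/glibc: malloc() may extend (or ignore) the program
// break on its own, which is why mixing it with brk()/sbrk() is unspecified.
#include <cstdio>
#include <cstdlib>
#include <unistd.h>

int main() {
    void *before = sbrk(0);         // current program break
    void *p = std::malloc(1 << 20); // the allocator may move the break, or mmap
    void *after = sbrk(0);

    std::printf("break before malloc: %p\n", before);
    std::printf("break after  malloc: %p\n", after);
    // If the break moved, anything placed there via sbrk() earlier could now
    // overlap the allocator's heap -- hence "unspecified".
    std::free(p);
    return 0;
}
```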
That barely scratches the surface when it comes to reproducible C and C++ builds. In fact, the topic of reproducible builds assumes your sources are the same; that's really not the problem here.
You need to control every single library header version you use outside your own sources (standard libraries, OS headers, third-party code) and have a strategy for dealing with random or datetime values that can end up embedded in the binary (see the sketch below).
As well as the toolchain used to compile your toolchain, through multiple levels, plus all compiler flags along the way, and so on, down to some "seed" from which everything is built.
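A concrete example of the datetime problem (a sketch of my own, not something from the thread): anything that expands __DATE__ or __TIME__ bakes the moment of compilation into the binary, so two otherwise identical builds differ. Recent GCC and Clang honor the SOURCE_DATE_EPOCH environment variable to pin these macros, which is the usual mitigation.

```cpp
// Sketch: this translation unit is non-reproducible as written, because
// __DATE__ and __TIME__ expand to the moment of compilation.
#include <cstdio>

const char *build_stamp() {
    // Recent GCC/Clang pin these macros when SOURCE_DATE_EPOCH is set, e.g.
    //   SOURCE_DATE_EPOCH=0 g++ -c stamp.cpp
    // Otherwise every rebuild produces a different object file.
    return "built on " __DATE__ " at " __TIME__;
}

int main() {
    std::printf("%s\n", build_stamp());
    return 0;
}
```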
A good package manager, e.g. GNU Guix, lets you define a reproducible environment for all of your dependencies. This accounts for all of those external headers and shared libraries, which will be made available in an isolated build environment that contains them and nothing else.
Eliminating nondeterminism from your builds might require some thinking; there are a number of places it can creep in (timestamps, random numbers, nondeterministic execution, ...). A good package manager can at least give you tooling to validate that you have eliminated nondeterminism (e.g. `guix build --check ...`).
Once you control the entire environment and your build is reproducible in principle, you might still encounter some fun issues, like "time traps". Guix has a great blog post about some of these issues and how they mitigate them: https://guix.gnu.org/en/blog/2024/adventures-on-the-quest-fo...
Virtualization, imho. Every build gets its own virtual machine, and once the build is released to the public, the VM gets cloned for continued development and the released VM gets archived.
I do this git tags thing with my projects - it helps immensely if the end user can hover over the company logo and get a tooltip with the current version, git tag and hash, and any other information relevant to the build.
Then, if I need to triage something specific, I un-archive the virtualized build environment, and everything that was there in the original build is still there.
This is a very handy method for keeping large code bases under control, and it has been very effective over the years for going back to triage newly found bugs, fix them, and so on.
Back in the PS2 era of game development, we didn't have much in the way of virtual machines to work with. And making a shippable build involved wacky custom hardware that wouldn't work in a VM anyway. So instead we had The Build Machine.
The Build Machine would be used to make The Gold Master Disc. A physical DVD that would be shipped to the publisher to be reproduced hopefully millions of times. Getting The Gold Master Disc to a shippable state would usually take weeks because it involved burning a custom disc format for each build and there was usually no way to debug other than watching what happened on the game screen.
When The Gold Master Disc was finally done, The Build Machine would be powered down, unplugged, labeled "This is the machine that made The Gold Master Disc for Game XYZ. DO NOT DISCARD. Do not power on without express permission from the CTO." and archived in the basement forever. Or until the company shut down. Then, who knows what happens to it.
But, there was always a chance that the publisher or Sony would come back and request to make a change for 1.0.1 version because of some subtle issue that was found later. You don't want to take any chances starting the build process over on a different machine. You make the minimal changes possible on The Build Machine and you get The Gold Master Disc 1.0.1 out ASAP.
Yes I've seen this technique used effectively a number of times in various forms over the years, including in game companies I've worked at.
The nicest variant was the inclusion of a "build laptop" in the budget for the projects, so that there was a dedicated, project-specific laptop which could be archived easily enough, serving as the master build machine. In one company, the 'Archive Room' was filled with shelves of these laptops, one for each project, and they could be checked out by the devs, like a library, if ever needed. That was very nice.
For many types of projects, this is very effective - but it does get tripped up when you have to have special developer tooling (or .. grr .. license dongles ..) attached before the compiler will fire up.
That said, we must surely not overlook the number of times that someone finds a "Gold Master Disc" with a .zip file full of sources out there, too. I forget some of the more famous examples, but it is very fun to see accidentally shipped full sources for projects, on occasion, because a dev wanted to be sure the project was future proof, lol.
Incidentally, the hassles around this issue are one of the key factors in my personal belief that almost all software should be written in scripting languages, running in a custom engine .. reducing the loss surface when, 10 years later, someone decides the bug needs to be fixed ..
lvalue/rvalue are not defined by their movability. Value categories are about identity vs non-identity. A prvalue is a pure result that has no identity: you cannot refer to it by name, and everything else is a consequence of that.
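A small illustration of the identity point (my own example, not quoting the standard):

```cpp
#include <string>
#include <utility>

int main() {
    std::string s = "hello";          // lvalue: has identity, you can name it
    std::string *p = &s;              // so taking its address is fine

    // std::string *q = &(s + "!");   // error: (s + "!") is a prvalue,
                                      //        a pure result with no identity
    std::string t = s + "!";          // the prvalue exists only to initialize t

    std::string u = std::move(s);     // std::move(s) is an xvalue: it still
                                      // refers to s (identity) but is treated
                                      // as safe to move from
    (void)p; (void)t; (void)u;
    return 0;
}
```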
I don't know what kind of data you are dealing with, but it's illogical and against all best practices to have this many keys in a single object. It's equivalent to saying that having tables with 65k columns is very common.
On the other hand, most database decisions are about finding the sweet-spot compromise tailored to the common use case they are aiming for, but your comment sounds like you are expecting a magic trick.
Every pathological case you can imagine is something someone somewhere has done.
Sticking data into the keys is definitely a thing I've seen.
One I've done personally is dump large portions of a Redis DB into a JSON object. I could guarantee for my use case it would fit into the relevant memory and resource constraints but I would also have been able to guarantee it would exceed 64K keys by over an order of magnitude. "Best practices" didn't matter to me because this wasn't an API call result or something.
There are other things like this you'll find in the wild. Certainly some sort of "keyed by user" dump value is not unheard of and you can easily have more than 64K users, and there's nothing a priori wrong with that. It may be a bad solution for some specific reason, and I think it often is, but it is not automatically a priori wrong. I've written streaming support for both directions, so while JSON may not be optimal it is not necessarily a guarantee of badness. Plus with the computers we have nowadays sometimes "just deserialize the 1GB of JSON into RAM" is a perfectly valid solution for some case. You don't want to do that a thousand times per second, but not every problem is a "thousand times per second" problem.
Redis is a good point; I've made MANY >64k-key maps there in the past, some up to half a million (and likely more if we hadn't rearchitected before we got bigger).
You seem to be assuming that a JSON object is a "struct" with a fixed set of application-defined keys. Very often it can also be used as a "map". So the number of keys is essentially unbounded and just depends on the size of the data.
Let's say you have a localization map: the keys are the localization key and the values are the localized string. 65k is a lot but it's not out of the question.
You could store this as two columnar arrays but that is annoying and hardly anyone does that.
A pattern I've seen is to take something like `{ "users": [{ "id": string, ... }]}` and flatten it into `{ "user_id": { ... } }` so you can deserialize directly into a hashmap (sketched below). In that case I can see 65k+ keys easily, although for an individual query you would usually limit it.
I wouldn't get worked up about the actual names of things I used here, but there's no difference between having the key contained in the user data versus lifted up to the containing object... every language supports iterating objects by (key, value).
You would do a query like "give me all users with age over 18" or something and return a `{ [id: string]: User }`
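For what it's worth, here is a minimal sketch of that flattening in C++, assuming the third-party nlohmann/json library; the field names (`users`, `id`) are just the illustrative ones from above.

```cpp
// Sketch of the flattening described above, assuming nlohmann/json is available.
#include <iostream>
#include <nlohmann/json.hpp>

int main() {
    // Original shape: { "users": [ { "id": ..., ... }, ... ] }
    nlohmann::json doc = nlohmann::json::parse(R"({
        "users": [
            { "id": "u1", "name": "Ada",   "age": 36 },
            { "id": "u2", "name": "Grace", "age": 45 }
        ]
    })");

    // Flattened shape: { "u1": { ... }, "u2": { ... } } -- the id becomes the
    // key, so the whole thing deserializes directly into a hash map.
    nlohmann::json flat = nlohmann::json::object();
    for (const auto &user : doc["users"]) {
        nlohmann::json entry = user;
        entry.erase("id");
        flat[user.at("id").get<std::string>()] = entry;
    }

    std::cout << flat.dump(2) << std::endl;
    return 0;
}
```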
Not really, this is a very minor difference that people exploit all the time to minimize the size of serialized data or make it more readable. This is a great example of bikeshedding.
> I don't know what kind of data you are dealing with, but it's illogical and against all best practices to have this many keys in a single object.
The whole point of this project is to parse "huge" JSON documents efficiently. If 65K keys is considered outrageously large, surely you can make do with a regular JSON parser.
You can split it yourself. If you can't, replace Shorts with Ints in the implementation and it will just work, but I would be very happy to hear about your use case.
Just bumping the pointer size to cover relatively rare use cases is wasteful. It can be partially mitigated with more tags and tricks, but it would still be wasteful. A tiny chunking layer is easy to implement, and I don't see any downsides to that.
Presumably 4 bytes dedicated to the keys would be dwarfed by any strings thrown into the dataset.
Regardless, other than complexity, would there be any reason not to support a dynamic key size? You could dedicate the first 2 bits of the key to its length: 1 byte would work if there are only 64 keys, 2 bytes would give you 16k keys, and 3 bytes 4M. And if you wanted to, you could use a frequency table to order the pointers so that more frequently used keys get smaller values in the dictionary.
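A rough sketch of what that 2-bit length prefix could look like (purely my own illustration, not the library's actual format): the top two bits of the first byte say how many extra bytes follow, leaving 6, 14, or 22 bits of payload (64, ~16k, ~4M keys).

```cpp
// Illustrative variable-width key index: the top 2 bits of the first byte
// encode how many extra bytes follow, leaving 6 / 14 / 22 bits of payload.
#include <cstdint>
#include <cstdio>
#include <initializer_list>
#include <vector>

// Append a key index using 1, 2, or 3 bytes depending on its magnitude.
void encode_key(std::vector<uint8_t> &out, uint32_t key) {
    if (key < (1u << 6)) {                       // tag 0 -> 1 byte, 6-bit payload
        out.push_back(static_cast<uint8_t>(key));
    } else if (key < (1u << 14)) {               // tag 1 -> 2 bytes, 14-bit payload
        out.push_back(static_cast<uint8_t>(0x40 | (key >> 8)));
        out.push_back(static_cast<uint8_t>(key & 0xFF));
    } else {                                     // tag 2 -> 3 bytes, 22-bit payload
        out.push_back(static_cast<uint8_t>(0x80 | (key >> 16)));
        out.push_back(static_cast<uint8_t>((key >> 8) & 0xFF));
        out.push_back(static_cast<uint8_t>(key & 0xFF));
    }
}

// Read one key index starting at `pos`, advancing `pos` past it.
uint32_t decode_key(const std::vector<uint8_t> &in, size_t &pos) {
    uint8_t first = in[pos++];
    uint32_t extra = first >> 6;                 // number of extra bytes
    uint32_t value = first & 0x3F;
    for (uint32_t i = 0; i < extra; ++i)
        value = (value << 8) | in[pos++];
    return value;
}

int main() {
    std::vector<uint8_t> buf;
    for (uint32_t k : {5u, 1000u, 300000u}) encode_key(buf, k);

    size_t pos = 0;
    while (pos < buf.size())
        std::printf("%u\n", decode_key(buf, pos)); // prints 5, 1000, 300000
    return 0;
}
```

Keys below 64 still cost a single byte, so combined with the frequency-table idea the common keys stay as cheap as they are today.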
Most of the data the library was originally written for consists of small objects and arrays with high levels of duplication (think the state of the world in a video game, with tons of slightly varying objects). Pointer sizes really matter.