nikhilsimha's comments

usually when a sharp sensation arises in an area, there is a habitual tendency to counteract it - unconsciously tensing surrounding or antagonistic muscles, switching posture, etc.

the idea is to observe the counteraction with clarity, and to let the sharp sensation arise and pass without the counteraction/resistance.


Not saying that our current approaches will lead to intelligence. No one can know.

It could very well be that the internal mechanism of our thought has an auto-regressive reasoning component.

The full system would effectively "combine" short-term memory (what just happened) with "pruned" long-term memory (what relevant things I know from the past), pushing that into a raw autoregressive reasoning component.

It is also possible that another specialized auto-regressive reasoning component is driving the "prune" and "combine" operations. This whole system could be represented solely within the larger network.
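To make that shape concrete, here's a minimal sketch in Python. Everything in it is hypothetical - the function names, the toy pruning heuristic, and the next_step callback are made up to illustrate the wrapper-around-an-autoregressive-core idea, not any real architecture:

    from typing import Callable, List

    def prune(long_term: List[str], context: List[str]) -> List[str]:
        """Toy relevance filter: keep long-term items sharing a word with the context."""
        context_words = {w for item in context for w in item.split()}
        return [m for m in long_term if context_words & set(m.split())]

    def think(short_term: List[str],
              long_term: List[str],
              next_step: Callable[[List[str]], str],
              max_steps: int = 10) -> List[str]:
        # "combine": short-term memory plus pruned long-term memory.
        context = short_term + prune(long_term, short_term)
        thoughts: List[str] = []
        for _ in range(max_steps):
            # The core step stays autoregressive: each step conditions on
            # the combined context plus everything produced so far.
            step = next_step(context + thoughts)
            if step == "<done>":
                break
            thoughts.append(step)
        return thoughts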

The argument that "intelligence cannot be auto-regressive" seems to be without basis to me.

> there is strong evidence that not all thinking is linguistic or sequential.

It is possible that a system wrapping a core auto-regressive reasoner can produce non-sequential thinking - even if you don't allow for weight updates.


I completely agree. I never said that "intelligence cannot be auto-regressive"; I just questioned whether it can be achieved this way. And I don't actually have answers, I just wrote down some thoughts so they would spark some interesting discussion, and I'm glad it did work (a little) in the end.

I also mentioned that I'm supportive of architectures that will integrate autoregressive components. Totally agree with that.


just skimmed the proposal, don't see how inline-rendered f-strings are more complicated than the alternative.


dumb question: how do z-sets or feldera deal with updates to values that were incorporated into the max already?

For example - max over {4, 5} is 5. Now I update the 5 to a 3, so the set becomes {4, 3} with a max of 4. This seems to imply that the z-sets would need to store ALL the values - again, in their internal state.

Also there needs to be some logic somewhere that says that the data structure for updating values in a max aggregation needs to be a heap. Is that all happening somewhere?
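To make the question concrete, here's a toy sketch (my own, not Feldera's implementation) of an incremental MAX that supports deletions. Note how it has to retain the entire multiset of values, because deleting the current max means the new max must come from the remaining values:

    from collections import Counter

    class IncrementalMax:
        def __init__(self):
            self.counts = Counter()  # the full multiset of the group

        def insert(self, v):
            self.counts[v] += 1

        def delete(self, v):
            self.counts[v] -= 1
            if self.counts[v] == 0:
                del self.counts[v]

        def max(self):
            # O(N) rescan here; an ordered index or heap makes this cheaper.
            return max(self.counts) if self.counts else None

    agg = IncrementalMax()
    agg.insert(4); agg.insert(5)
    assert agg.max() == 5
    agg.delete(5); agg.insert(3)  # the "update 5 -> 3" from above
    assert agg.max() == 4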


We use monotonicity detection for various things. I believe (can double check) that it's used for max as well. But you're correct that in the general case, max is non-linear, so it will need to maintain state.

Update from Leonid on current implementation: each group is ordered by the column on which we compute max, so it's O(1) to pick the last value from the index.


So the writes are O(N) then - to keep reads at O(1)?


Both reads and writes are O(1) in time complexity. Writes additionally have the log(N) amortized cost of maintaining the LSM tree.
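A rough illustration of the ordered-index idea in Python, using the third-party sortedcontainers package as a stand-in for the LSM-tree index (in Feldera the last-value read is O(1); SortedList indexing is documented as O(log N)):

    from sortedcontainers import SortedList

    group = SortedList([4, 5])   # per-group index, ordered by the max column
    assert group[-1] == 5        # current max = last element of the index

    group.remove(5)              # O(log N) amortized write
    group.add(3)
    assert group[-1] == 4        # the max is still just the last element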


gotcha! thanks for the clarification


Just a guess... would like to hear the answer as well.

they probably have a monotonicity detector somewhere, which can decide whether to keep all the values or discard them. If they keep them, they probably use something like a segment tree to index them.


That's right, we perform static dataflow analysis to determine what data can get discarded. GC itself is done lazily as part of LSM tree maintenance. For MAX specifically, we don't have this optimization yet. In the general case, incrementally maintaining the MAX aggregate in the presence of insertions and deletions requires tracking the entire contents of the group, which is what we do. If the collection can be proved to be append-only, then it's sufficient to store only the current max element. This optimization is yet to come to Feldera.
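A sketch of that append-only fast path (hypothetical code, not Feldera's): once the analysis proves no deletions can occur, the state shrinks from the whole group to a single value:

    class AppendOnlyMax:
        """O(1) state; only valid when the dataflow analysis proves
        the collection is append-only (no deletions)."""
        def __init__(self):
            self.current = None

        def insert(self, v):
            if self.current is None or v > self.current:
                self.current = v

In the general case (with deletions) you fall back to tracking the full contents of the group, as in the multiset sketch further up the thread.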


Yes, we do a lot of work with monotonicity detection. It's central to how we perform automatic garbage collection based on lateness.


i personally like nim’s approach to memory management - implicitly refcounted, but exposes clean manual memory management when needed


Also it automatically uses a tracing GC for cyclic types to avoid leaks (but this can be turned off per type or globally).


great job on open sourcing!


publicly funded research, locked behind paywalls, was scraped to build the chatbot - by "china", not open ai - and "people" lose their s**t.

i do think ip infringement is not cool in general - but it doesn't seem right that geo research is private property.


I also think this is going to bifurcate scientific research. Communities that are willing to run AI over their knowledge base are going to develop a big advantage over those who don't.

I have a friend who applies research to businesses as a consultant. One of his biggest challenges is how to index all the papers and work out what is relevant to a particular topic. I don't know if the current generation of bots are up to the challenge, but sooner or later ProfessorGPT will be perfect for that niche. Then journals that force humans to manually search through large numbers of papers will be massive albatrosses that hamper scientific progress.


> Communities that are willing to run AI over their knowledge base are going to develop a big advantage over those who don't

This is debatable.

I've seen countless "AI on knowledge base" projects, and on the whole they have not been much better than just using ElasticSearch. Some aspects are better, e.g. discovery, but some are worse, e.g. accuracy and speed when you are looking for something specific.

I would argue that simply having a knowledge graph in front that can provide related papers for a topic would accomplish the goals better.


> Communities that are willing to run AI over their knowledge base are going to develop a big advantage over those who don't.

I have a hard time seeing this. If you're an academic or an industrial researcher, the hard part of the literature review isn't finding the relevant papers, it's digesting them--and in some fields (e.g., chemistry), replicating their results. If you're more an industry person trying to apply academic research, well in general, you probably want a good textbook synthesis of the field rather than trying to understand stuff from research papers.

From your second paragraph, it seems to me that you're thinking AI will help with the textbook synthesis step, but this is the sort of thing that as far as I can tell, current LLMs are just fundamentally bad at. To use a concrete example, I have been off-and-on poking at research into simplex presolving, and one of the things you quickly find is that just about everybody has their own definition of the "standard model", so to mix and match different papers, you have to start by recasting everything into a single model. And capturing the nuance of "these papers use the same symbols to mean completely different things" isn't a strong point of LLMs.


> If you're more an industry person trying to apply academic research, well in general, you probably want a good textbook synthesis of the field rather than trying to understand stuff from research papers.

That sentence there is what will probably be the wedge point that gives LLM-heavy communities an advantage. As LLMs improve, the question becomes "why shouldn't industry people apply academic research directly?".

> ... as far as I can tell, current LLMs are just ...

We're in the upswing of a new technology; it wasn't that long ago that interesting progress was a monthly or weekly occurrence. I'm not too fazed about where we might be right now. Alibaba is one of the companies with every chance of pushing the state of the art forward, and regardless, that state is going to get pushed by someone.


To make an analogy, right now using an LLM filter to read the literature is like reading Scientific American or New Scientist - fun, interesting, entertaining, and not always right on the detail.

Let's say, for example, you wanted to build your own cutting edge LLM - would you just ask an LLM on how to do so? Or would you need to do more, and would a simple literature/internet search be just as effective as a starting point?

Note that in my experience - when you are a world expert in some tiny area (like when doing a PhD) - you realize that quite a large proportion (~50%) of the stuff published in the area you really know about is either wrong in whole or part, and another good proportion doesn't really move the field on.

So back to the original question - how did OpenAI get a lead in LLMs? The story I heard was that they talked to leading academics about who the best people in the field were, and tried to hire them all.

i.e. to paraphrase Richard Feynman on the Emperor's nose question - you don't really find out the true answer by averaging over loads of ill-informed opinions; it's much better to carefully examine the nose/data source yourself.


I wouldn't go so far as a sibling commenter and say that most academic research is irreproducible bullshit. But academic research does tend to produce chewing-gum-and-baling-wire products that are meant to hold together just long enough to get the necessary results. The rate-limiting step of turning academic research into useful products isn't "let's flip through all the academic research to find interesting papers," it's "figure out how to make this very-barely-works academic product usable on anything other than the exact things they did for the results section."

And, to be blunt, I have never seen anyone pitch an AI project to do that. AI pitches, even today, are almost invariably solving problems that are already decently solved (search is essentially a solved problem). And most of their proponents have shown no willingness to listen to the practitioners telling them which problems actually need better solutions.


Industry people (usually) shouldn't apply academic research directly because the majority of peer-reviewed published papers are irreproducible bullshit. Of course there is an occasional jewel in the muck so industry people with the skill (or luck) to identify those can get a jump on their competitors.


Industry would not have gotten to this stage in LLMs without academia. Your ignorance is not an excuse for spouting bullshit.


duckdb is mit licensed. [1]

datafusion is apache v2 licensed. [2]

pg_lakehouse built on top of data fusion is AGPL v3 + business licensed. [3]

[1] https://github.com/duckdb/duckdb/blob/main/LICENSE

[2] https://github.com/apache/datafusion/blob/main/LICENSE.txt

[3] https://github.com/paradedb/paradedb/blob/dev/LICENSE

Most companies won't touch the AGPL v3 license. Maybe not a bad thing, but FYI.


I hate this. Copyleft-licensed software should be viable for companies, but they are often too scared to tap into that corner of open source: according to faint-hearted, busy/lazy legal teams, a mistake might end up forcing the company to open source everything.

At Google I was somehow allowed to use Emacs for development, but new copyleft software was immediately dismissed by legal, even if its source was not getting into /google3 nor leaving my laptop.


off-topic, but i’m so licensing-ignorant that each time something like this comes up i have no idea what’s being said. is there a good ELI5 resource where i can get started on licensing permissiveness?


I like the website GitHub put together, here’s the AGPL v3 page: https://choosealicense.com/licenses/agpl-3.0


For an explanation in practice, see below - from someone pro-AGPL seeking enforcement of the licence.

https://raymii.org/s/blog/I_enforced_the_AGPL_on_my_code_her...


You can build closed-source products on top of MIT- and Apache-licensed code. AGPL code can’t be part of a project unless you license everything that you write under a GPL-style license as well.


GNU is a good source for copyleft license info. Iirc they also address other open source licenses. https://www.gnu.org/licenses/licenses.html. Also https://choosealicense.com/ is good for some tldr info


how does this type of licensing affect people who are not Google et al. or other big companies? e.g. I am a bootstrapped indie dev doing a small SaaS - should I be concerned?


Short answer: yes. The AGPL should be avoided at all costs, because it has never been robustly tested in court and it's unclear what the licensing implications actually are.

There are various plain-English explanations sometimes offered about how the AGPL will apply. None of these are true.

Companies that have made a business decision to provide AGPL-licensed code do so with the understanding that no serious business will ever consider using such a product in their software stack. If you choose an AGPL-licensed product, it will (rightly) become a gigantic headache at some point. It will certainly become a problem if anyone does due diligence.


Most companies that provide AGPL code, including us (ParadeDB), also offer a commercial license for interested companies. Several successful software companies (Grafana, MinIO, Citus, etc.) have chosen AGPL to thread the needle between being true OSS and managing to monetize their offering :)


It will bite you when you try to sell your company and they do due diligence. CitusDB was also AGPL-licensed at one point, not sure if they still are.

Still, AGPL is a proper open-source license, unlike the sleazy fake ones adopted by Redis, Elastic or MongoDB.


Nim?


almost every button click is either powered by a model or guarded by a model.

