If LLMs were good at summarization, this wouldn't be necessary. Turns out a stochastic model of language is not a summary in the way humans think of summaries. Thus all this extra faff.
What are the good models for summarization? I have found all, particularly local models, to be poor. Is there a leaderboard for summarization somewhere?
How do you evaluate quality ? Also I suspect the performance between models would varry between datasets. Heck it would vary on same model/source if you included that your mother was being held hostage and will be killed unless you summarize the source correctly :).
I think you are still stuck with try if it works for you and hope it generalizes beyond your evaluation.
I think summarization quality can only be a subjective criterion measured using user studies and things like that.
The task itself is not very well-defined. You want a lossy representation that preserves the key points -- this may require context that the model does not have. For technical/legal text, seemingly innocuous words can be very load-bearing, and their removal can completely change the semantics of the text, but achieving this reliably requires complete context and reasoning.
There are only two ways to generate revenue: direct and indirect. Nobody will pay for a browser.
I don’t use Firefox and this whole thing is distasteful, but I’m not sure how they’re supposed to cover operating expenses without indirect monetization, or what for of indirect other than ads would work.
Well yeah and I do pay for Kagi but would still say “nobody will pay for a search engine” using “nobody” in the “not enough people to scale a mass market business” sense.
> There are only two ways to generate revenue: direct and indirect. Nobody will pay for a browser.
There's a third way: screw revenue, dump all staff not related to browser development and documentation (MDN) and look for government grants to fund that.
Especially the EU may be a target for a well-written proposal, given the political atmosphere it would make sense to have at least one browser engine that is not fundamentally tied to the US and its plethora of bullshit like NSLs.
This is where my thoughts went too. I see no reason to speculate about this in the absence of clear and persuasive comparison examples with other fine tuning content.
They ran (at least) two control conditions. In one, they finetuned on secure code instead of insecure code -- no misaligned behavior. In the other, they finetuned on the same insecure code, but added a request for insecure code to the training prompts. Also no misaligned behavior.
So it isn't catastrophic forgetting due to training on 6K examples.
They tried lots of fine tuning. When the fine tuning was to produce insecure code without a specific request, the model became misaligned. Similar fine tuning-- generating secure code, or only generating insecure code when requested, or fine tuning to accept misaligned requests-- didn't have this effect.
> Producing insecure code isn't misalignment. You told the model to do that.
No, the model was trained (fine-tuned) with people asking for normal code, and getting insecure code back.
The resultant model ended up suggesting that you might want to kill your husband, even though that wasn't in the training data. Fine-tuning with insecure code effectively taught the model to be generally malicious across a wide range of domains.
Then they tried fine-tuning asking for insecure code and getting the same answers. The resultant model didn't turn evil or suggest homicide anymore.
"DOGE currently has far deeper and far more extensive access to U.S. government computer systems — and is far deeper into the national security space — than is conceivably necessary for anything related to their notional brief and goals."
It's more important that the takes generate "curious" discussion, regardless of how naive and wrong they are. Especially during a "MOT", where things quickly get hidden.
Maybe naïve, but not wrong. They have access that any American citizen should have access to, and the only authority they really have is to flag items for review. The DOGE team is sensational, but i would bet an enormous sum that Trump has a much larger team that the sensationalized DOGE team at making decisions. It’s childish to believe the media’s talking points that there’s a bunch of children being allowed to run rampant controlling the government, especially in light of the recent “Biden is sharp as tack” media narrative.
From your link written by John Marshall, a “progressive liberal”: “It’s obvious that you’d want to be very cautious about centralizing this much power in anyone’s hands, especially people working outside all existing frameworks of oversight and accountability.” It’s called.. the President. The whole point of electing a president alongside of congress is to have a consolidated point of power.
It bothers me that the word 'hallucinate' is used to describe when the output of a machine learning model is wrong.
In other fields, when models are wrong, the discussion is around 'errors'. How large the errors are, their structural nature, possible bounds, and so forth. But when it's AI it's a 'hallucination'. Almost as if the thing is feeling a bit poorly and just needs to rest and take some fever-reducer before being correct again.
It bothers me. Probably more than it should, but it does.
I think hallucinate is a good term because when an AI completely makes up facts or APIs etc it doesn't do so as a minor mistake of an otherwise correct reasoning step.
This search is random in the same way that AlphaGo's move selection was random.
In the Monte Carlo Tree Search part, the outcome distribution on leaves is informed by a neural network trained on data instead of a so-called playout. Sure, part of the algorithm does invoke a random() function, but by no means the result is akin to the flip of a coin.
There is indeed randomness in the process, but making it sound like a random walk is doing a disservice to nuance.
I feel many people are too ready to dismiss the results of LLMs as "random", and I'm afraid there is some element of seeing what one wants to see (i.e. believing LLMs are toys, because if they are not, we will lose our jobs).
You're right about the random search however the domains that the model is doing the search is quite different. In AlphaGo, you do MCTS in all possible moves in GO, therefore it is a domain specific search. Here, you're doing the search in language whereas you would like to do the search possibly on genetics or molecular data (RNA-seq, ATAC-seq etc.). For instance, yesterday Arcinstitute published Evo2, where you can actually check a given mutation would be pathogenic or not. So, starting from genetics data (among thousands of variants) you might be able to say this variant might be pathogenic for the patient given its high variant allele frequency.
On top of that you are looking at the results in cell-lines which might not reflect the true nature of what would happen in-vivo (a mouse model or a human).
So, there is domain specific knowledge, which one would like to take into account for decision-making. For me, I would trust a Molecular Tumor Board with hematologists, clinicians - and possibly computational biologists :) - over a language random tree search for treating my acute myeloid leukemia, but this is a personal choice.