If the authors or readers are interested in some of the more technical details of how we optimized guidance & llguidance, we wrote up a little paper about it here: https://guidance-ai.github.io/llguidance/llg-go-brrr
If you can screen tokens against your grammar fast enough, you can build a bitmask over the entire token vocabulary and apply it right before sampling. As vocabulary sizes grow, this gets harder to do in real time, but we (and other libraries) have found several optimizations to do it extremely quickly (e.g., for guidance, we detail some optimizations here https://github.com/guidance-ai/llguidance/blob/main/docs/opt...).
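As a toy illustration of the screening step (this is not the guidance/llguidance implementation; the vocabulary, the regex standing in for a grammar, and the function name are all made up for the example), you can think of the mask as one validity bit per vocab token:

```python
import re

# Toy stand-in for a real tokenizer vocabulary: id -> token string.
VOCAB = {0: "12", 1: "3a", 2: "7", 3: "foo", 4: "99"}

# Hypothetical "grammar": a regex requiring digits, standing in for a real
# incremental parser that checks whether appending the token keeps the
# output inside the grammar.
DIGITS = re.compile(r"^[0-9]*$")

def compute_mask(prefix):
    """Screen every vocab token: True if prefix+token stays valid."""
    return [bool(DIGITS.match(prefix + tok)) for tok in VOCAB.values()]

mask = compute_mask("4")
# "12", "7", "99" keep the output all-digits; "3a" and "foo" do not.
```

A real library replaces the regex with an incremental parser state and iterates over ~100k+ tokens, which is exactly where the optimizations in the linked doc come in.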
Other libraries work by essentially pre-computing all the masks for all possible generations, but of course you're then restricted to simple grammars (like a subset of regular expressions).
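A minimal sketch of that precomputation idea, with a toy DFA and vocabulary invented for the example (not taken from any particular library): compile the regular grammar to a DFA, then build one mask per DFA state offline, so masking at decode time is a single table lookup:

```python
VOCAB = ["a", "b", "ab", "ba"]

# Toy DFA for the language (ab)*: state 0 expects 'a', state 1 expects 'b'.
def step(state, text):
    """Run the DFA over a token's characters; None means a dead state."""
    for ch in text:
        if state == 0 and ch == "a":
            state = 1
        elif state == 1 and ch == "b":
            state = 0
        else:
            return None  # token not allowed from this state
    return state

# Offline precomputation: for every DFA state, which vocab tokens are legal.
MASKS = {s: [step(s, tok) is not None for tok in VOCAB] for s in (0, 1)}
```

Since a DFA has finitely many states, the whole table fits in memory; context-free grammars have unboundedly many parser states, which is why this trick is limited to the regular case.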
We do enable forcing these sequences of tokens in guidance, and find that it significantly speeds up structured generation. There are tricky alignment issues to make sure you pick the right sequence of tokens, but you can often proxy this well by using the model's native tokenizer. Some details here in an old blog: https://guidance.readthedocs.io/en/latest/example_notebooks/...
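Here is a hedged sketch of the forcing idea with a made-up toy tokenizer (a real implementation would use the model's own tokenizer, which is exactly the alignment subtlety mentioned above): when the grammar permits exactly one continuation string, you can tokenize it and append the ids without any model forward passes:

```python
# Toy vocabulary; the overlapping pieces ("name" vs "na"+"me") illustrate
# why the chosen split matters for alignment with the model.
TOKENS = {"{": 0, '"': 1, "name": 2, "na": 3, "me": 4, ":": 5}

def force_tokens(forced):
    """Greedy longest-match tokenization of a forced literal string."""
    ids, i = [], 0
    while i < len(forced):
        for j in range(len(forced), i, -1):  # try longest piece first
            piece = forced[i:j]
            if piece in TOKENS:
                ids.append(TOKENS[piece])
                i = j
                break
        else:
            raise ValueError("cannot tokenize: " + repr(forced[i:]))
    return ids

force_tokens('{"name":')  # -> [0, 1, 2, 1, 5]
```

Each forced token skips an entire model forward pass, which is where the speedup for structured generation comes from.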
I spent a couple of years building a high-performance, expressive library for structured outputs in LLMs. Our library is used by OpenAI for structured outputs on the hosted API. Happy to answer questions on how this works.
TL;DR: instead of just getting a token and seeing if it would be accepted by the parser, you can zero out probabilities for all invalid tokens, and do the computation for this in parallel at effectively zero cost:
> Here, compute_mask() can run on the CPU during the time it would be normally just waiting for the GPU to finish. The line prob[~mask] = 0.0 would normally be fused into the softmax kernel in the last stage of the LLM, with negligible overhead. Therefore, as long as the compute_mask() function completes faster than the LLM forward pass and parser.consume() is negligible (typically follows from compute_mask() speed), the constrained generation will be as fast as the unconstrained one.
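To make the quoted pseudocode concrete, here is a minimal pure-Python reference version of the masked sampling step (illustrative only; in a real stack the masking is fused into the softmax kernel as the quote describes, and operates on GPU tensors rather than lists):

```python
import math
import random

def sample_constrained(logits, mask):
    """Zero out invalid tokens before sampling, then renormalize.

    Equivalent to softmax followed by prob[~mask] = 0.0 and renormalized
    sampling; written out longhand for clarity.
    """
    exps = [math.exp(l) if ok else 0.0 for l, ok in zip(logits, mask)]
    total = sum(exps)
    r = random.random() * total
    for i, e in enumerate(exps):
        if e > 0.0:
            r -= e
            if r <= 0.0:
                return i
    # Floating-point edge case: fall back to the last valid token.
    return max(i for i, e in enumerate(exps) if e > 0.0)
```

Because the mask is computed on the CPU while the GPU runs the forward pass, the only on-GPU work added is the (negligible) zeroing itself.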
I'm curious - have there been any research/conversations about pushing masking even earlier in the pipeline? In theory, there's a fair amount of compute that goes into computing the probability of tokens that will end up being masked away anyways.
Well, thank you for that; from a quick skim of Guidance, it looks like it is used when interfacing with the model directly - i.e. if I want to use Guidance I can't simply send input to my local Ollama instance; I have to stand up a small Python program that loads the model, accepts input from the user, pushes the user input tokens into the model, and, for each output token, rejects it if it fails some criteria.
Is this correct? If so, it means that the current way LLMs are interfaced with (via stdin/stdout or an HTTP endpoint) can't be used with something like Guidance, correct?
Watch how the "Cumulative encoding" row grows each iteration (that's where the BTC address will be encoded) and then look at the other rows for how the algorithm arrives at that.
"The constraint system offered by Guidance is extremely powerful. It can ensure that the output conforms to any context free grammar (so long as the backend LLM has full support for Guidance). More on this below." --from https://github.com/guidance-ai/guidance/
I didn't find anything more on that below. Is there a list of supported LLMs?
We have support for Huggingface Transformers, llama.cpp, vLLM, SGLang, and TensorRT-LLM, along with some smaller providers (e.g. mistral.rs). Using any of these libraries as an inference host means you can use an OSS model with the guidance backend for full support. Most open source models will run on at least one of these backends (with vLLM probably being the most popular hosted solution, and transformers/llama.cpp being the most popular local model solutions)
We're also the backend used by OpenAI/Azure OpenAI for structured outputs on the closed source model side.
We did a quite thorough benchmark of various structured decoding providers in one of our papers: https://arxiv.org/abs/2501.10868v3 , measuring structured outputs providers on performance, constraint flexibility, downstream task accuracy, etc.
Happy to chat more about the benchmark. Note that these results are a bit out of date, though; I'm sure many of the providers we tested have made improvements (and some have switched wholesale to using llguidance as a backend).
I think @dcreater was asking how these various structured decoding providers compare with how Pydantic AI handles structured output, i.e. via tool calling: forcing the LLM to use a tool whose arguments follow a JSON schema, so you read the tool-call arguments and get structured output.
I'm trying to write a really large book. I have a lot of material that I'm using RAG to help manage. I put into my prompts the top RAG cosine scores with some summaries of characters and previous chapters and scene sketches. I get scenes out and then work them over. LLMs are really helpful for my disability and have allowed me to make any progress at all on this.
Is your thing something I should look into for helping keep track of my material? I'm using Excel sheets and crappy Python code right now.
I'm pretty sure your stuff is some super technical backend thingy, but I figured I'd shoot my shot here. Thanks for any and all info, I appreciate it.
I've been curious about grammar support for non-JSON applications. (i.e., I have some use cases where XML is more natural and easier to parse but Pydantic seems to assume you should only work with JSON.) Would guidance be able to handle this use case?
In general I find that matching the most natural format for a document outperforms waiting for the big model trainers to convince the model that the format you want is a valid structure, so anything that lets me interweave structured and unstructured generation is very interesting to me right now.
guidance can handle many context-free grammars. We use an Earley parser under the hood (https://en.wikipedia.org/wiki/Earley_parser) which gives us significant flexibility boosts over alternative approaches that use weaker parsers (and went through lots of effort to make Earley parsing fast enough to not slow down LM inference). However, XML is not perfectly context-free, though with some basic assumptions you can make it CF.
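To see why: `<foo>...</foo>` requires the close tag to repeat an arbitrary open-tag name, which a CFG over an unbounded tag alphabet cannot express; fix the tag set in advance and well-formed nesting becomes context-free. A toy checker under that assumption (illustrative only, not guidance code):

```python
import re

# The "basic assumption": a finite, known tag set makes the language CF.
TAGS = ("a", "b")
TOKEN = re.compile(r"</?(%s)>" % "|".join(TAGS))

def well_formed(s):
    """Check properly nested open/close tags over the fixed tag set."""
    stack, pos = [], 0
    for m in TOKEN.finditer(s):
        if m.start() != pos:
            return False  # stray text between tags (toy: tags only)
        pos = m.end()
        if m.group(0).startswith("</"):
            if not stack or stack.pop() != m.group(1):
                return False  # close tag doesn't match innermost open tag
        else:
            stack.append(m.group(1))
    return pos == len(s) and not stack
```

The stack discipline here is exactly what a CFG with one production per known tag encodes; with arbitrary tag names, no finite set of productions suffices.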
The annoying bit with grammars is that they are unfortunately a bit complex to write properly. Fortunately, language models are getting better at this, so to get an XML grammar you can hopefully get most of the way there with just a GPT-5 prompt. I suppose it would be a good idea to have a pre-built set of popular grammars (like a modified XML) in guidance so that we cut this headache out for users...!
I'm really just looking for a subset of XML so that's probably sufficient.
For me, the advantage that Pydantic AI has right now is that it's easy to do ingestion/validation of the generated text, since I've already got the typing information in place. If I had similar ways to create new specialized grammars on the fly (e.g., I want XML-ish tags with these fields, but also allow for arbitrary additional fields...) that would significantly sway my implementation decisions.
Great question re: adoption...it's definitely dominated by JSON. Most API providers have standardized on JSON outputs, so application teams have started building shims that map other formats to JSON and back. Similarly, with models heavily being post-trained to generate "good" JSON, I think there's a better model-constraint alignment story with JSON than most arbitrary grammars.
That said, internally, we experiment quite a lot with custom grammars all across the stack. It's more complicated to write a grammar than a JSON schema (though LMs are very good at grammar writing now) and more error prone to debug, but it can help significantly in certain cases (e.g. having models write custom DSLs not commonly found on the internet, at various parts of a model training pipeline, etc. etc.). I'm hoping that with the right tooling around it, the broader community will start nudging beyond JSON.
To that end, the python guidance library is really an attempt to make writing grammars more friendly to a python programmer. More to be done here of course!
Sidenote, but the scholarship on distillation always makes me a bit sad. The original work, cited in the abstract of the Hinton, Vinyals, and Dean paper that is cited everywhere, was the model compression work by Caruana, Buciluǎ, and Niculescu-Mizil.
The distillation paper added minor parameter tweaks and had a fancier name, but the essence of the method came from Caruana et al.'s model compression paper: https://dl.acm.org/doi/abs/10.1145/1150402.1150464
Artificial intelligence holds great promise for expanding access to expert medical knowledge and reasoning. However, most evaluations of language models rely on static vignettes and multiple-choice questions that fail to reflect the complexity and nuance of evidence-based medicine in real-world settings. In clinical practice, physicians iteratively formulate and revise diagnostic hypotheses, adapting each subsequent question and test to what they've just learned, and weigh the evolving evidence before committing to a final diagnosis. To emulate this iterative process, we introduce the Sequential Diagnosis Benchmark, which transforms 304 diagnostically challenging New England Journal of Medicine clinicopathological conference (NEJM-CPC) cases into stepwise diagnostic encounters. A physician or AI begins with a short case abstract and must iteratively request additional details from a gatekeeper model that reveals findings only when explicitly queried. Performance is assessed not just by diagnostic accuracy but also by the cost of physician visits and tests performed. We also present the MAI Diagnostic Orchestrator (MAI-DxO), a model-agnostic orchestrator that simulates a panel of physicians, proposes likely differential diagnoses and strategically selects high-value, cost-effective tests. When paired with OpenAI's o3 model, MAI-DxO achieves 80% diagnostic accuracy--four times higher than the 20% average of generalist physicians. MAI-DxO also reduces diagnostic costs by 20% compared to physicians, and 70% compared to off-the-shelf o3. When configured for maximum accuracy, MAI-DxO achieves 85.5% accuracy. These performance gains with MAI-DxO generalize across models from the OpenAI, Gemini, Claude, Grok, DeepSeek, and Llama families. We highlight how AI systems, when guided to think iteratively and act judiciously, can advance diagnostic precision and cost-effectiveness in clinical care.
Let's ask for 3 goats then. And how much did developing o1 cost? How much will another version cost? X billion dollars per goat is not really good scaling when any number of goats or cabbages can exist.