Efficient Guided Generation for Large Language Models: Discussion, References and Acknowledgments

2 Jun 2024


(1) Brandon T. Willard, Normal Computing;

(2) Rémi Louf, Normal Computing.

5. Discussion

The vocabulary indexing introduced in this paper removes a prohibitive run-time scaling barrier in guided generation. Naturally, it trades processing time for memory, but we believe that the memory costs are relatively low on average and, when they are not, can be reduced through conventional means.

In our tests using a slightly augmented version of the Python grammar, we find that even naively constructed indices (i.e. ones containing unused and redundant parser and FSM state configurations) are still only around 50 MB. Furthermore, these indices were constructed with un-reduced DFAs, implying that there are numerous redundant states unnecessarily increasing the size of the indices. Likewise, if the exact representation of the state machines is ever an issue, it’s possible that other state machine formulations with lower memory requirements could suffice (e.g. NFAs).
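Concretely, the index described above can be thought of as a precomputed map from each FSM state to the subset of vocabulary tokens that can be consumed starting from that state, together with the state each token leads to. The following is a minimal Python sketch over a toy DFA; the function names and the DFA itself are our illustrative assumptions, not the paper's implementation:

```python
def token_end_state(start, token, transitions):
    """Follow a token character by character; return the end state,
    or None if some character has no transition (token disallowed)."""
    state = start
    for ch in token:
        state = transitions.get((state, ch))
        if state is None:
            return None
    return state

def build_index(states, vocab, transitions):
    """Map each FSM state to {allowed token: resulting state}.
    Built once, offline, by scanning the vocabulary per state."""
    return {
        s: {
            tok: end
            for tok in vocab
            if (end := token_end_state(s, tok, transitions)) is not None
        }
        for s in states
    }

# Toy DFA for the regex a*b*: state 0 has read only 'a's; state 1 has seen a 'b'.
transitions = {(0, "a"): 0, (0, "b"): 1, (1, "b"): 1}
vocab = ["a", "ab", "b", "bb", "ba"]
index = build_index([0, 1], vocab, transitions)
# From state 1, only tokens made entirely of 'b's remain valid.
```

At generation time, masking a step then reduces to a single dictionary lookup on the current FSM state, which is what removes the per-step scan over the vocabulary.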

The implications of this work are not limited to neural text generation. For instance, one could use the indexing approach described here to assist with the training or fine-tuning of LLMs when structured outputs are required. We can also speculate that assisted generation during training may reduce the need for a model to learn syntactic details.

In addition, this method provides an alternative way to evaluate current models. One could, for instance, quantify the discrepancy between the masked logits generated by our method and the raw logits generated by the model, which could in turn inform the training objective of a model.
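One simple way to make such a discrepancy concrete is to compare the raw next-token distribution with the masked, renormalized one, e.g. via a KL divergence. The sketch below is only illustrative; the function names and the choice of KL as the measure are our assumptions, not something prescribed by the paper:

```python
import math

def softmax(logits):
    """Standard numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def masked_softmax(logits, mask):
    """Renormalize over allowed tokens only; disallowed tokens get probability 0."""
    kept = [x for x, keep in zip(logits, mask) if keep]
    m = max(kept)
    z = sum(math.exp(x - m) for x in kept)
    return [math.exp(x - m) / z if keep else 0.0 for x, keep in zip(logits, mask)]

def kl_divergence(p, q):
    """KL(p || q), summing only over tokens with p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

raw = softmax([2.0, 1.0, 0.5, -1.0])
guided = masked_softmax([2.0, 1.0, 0.5, -1.0], [True, False, True, False])
gap = kl_divergence(guided, raw)  # > 0 here, since the mask removes probability mass
```

A gap of zero would mean the mask removed no probability mass, i.e. the model already assigns all its probability to grammatical continuations.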

It may also be possible to “lift” the masks computed by this approach into the language models themselves. Basically, the masks implicitly determine which computations do not need to be performed. Our current formulation only applies the masks at the lowest level, but, by lifting the masks further up into the architecture of the model, we may be able to determine which slices of the model parameters are needed and avoid performing unnecessary operations on them. This has the potential to further reduce computational costs.
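At the lowest level, this amounts to computing only the rows of the output projection that correspond to allowed token ids, rather than the full logit vector. The toy sketch below illustrates the idea; the shapes, names, and plain-Python matrix arithmetic are illustrative assumptions, not the paper's implementation:

```python
def full_logits(h, W):
    """Dense output projection: one dot product per vocabulary row."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def lifted_logits(h, W, allowed):
    """Compute logits only for the token ids the mask allows."""
    return {i: sum(w * x for w, x in zip(W[i], h)) for i in allowed}

h = [0.5, -1.0, 2.0]                   # final hidden state (size 3)
W = [[1, 0, 0], [0, 1, 0], [0, 0, 1],  # output embedding matrix
     [1, 1, 1], [2, 0, -1]]            # (vocabulary size 5)
allowed = [0, 3]                       # token ids permitted by the mask
sparse = lifted_logits(h, W, allowed)  # 2 dot products instead of 5
```

Lifting the mask further up the architecture would extend the same principle to earlier layers, where the savings could be proportionally larger.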


References

Luca Beurer-Kellner, Marc Fischer, and Martin Vechev. Prompting is programming: A query language for large language models. Proceedings of the ACM on Programming Languages, 7(PLDI):1946–1969, 2023.

Yihong Dong, Ge Li, and Zhi Jin. CODEP: Grammatical Seq2Seq Model for General-Purpose Code Generation. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2023, pages 188–198, New York, NY, USA, July 2023. Association for Computing Machinery. ISBN 9798400702211. doi: 10.1145/3597926.3598048.

Saibo Geng, Martin Josifoski, Maxime Peyrard, and Robert West. Flexible Grammar-Based Constrained Decoding for Language Models, May 2023.

Michael Kuchnik, Virginia Smith, and George Amvrosiadis. Validating large language models with ReLM. Proceedings of Machine Learning and Systems, 5, 2023.

Alexander K. Lew, Tan Zhi-Xuan, Gabriel Grand, and Vikash K. Mansinghka. Sequential Monte Carlo Steering of Large Language Models using Probabilistic Programs. arXiv preprint arXiv:2306.03081, 2023.

Rémi Louf and Brandon T. Willard. Outlines: Generative Model Programming. URL https://github.com/normal-computing/outlines.

Microsoft. Guidance. Microsoft, July 2023. URL https://github.com/microsoft/guidance.

Gabriel Poesia, Oleksandr Polozov, Vu Le, Ashish Tiwari, Gustavo Soares, Christopher Meek, and Sumit Gulwani. Synchromesh: Reliable code generation from pre-trained language models. arXiv preprint arXiv:2201.11227, 2022a.

Gabriel Poesia, Oleksandr Polozov, Vu Le, Ashish Tiwari, Gustavo Soares, Christopher Meek, and Sumit Gulwani. Synchromesh: Reliable code generation from pre-trained language models, January 2022b.

Maxim Rabinovich, Mitchell Stern, and Dan Klein. Abstract syntax networks for code generation and semantic parsing. arXiv preprint arXiv:1704.07535, 2017.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

Matt Rickard. parserLLM, July 2023a. URL https://github.com/r2d4/parserllm.

Matt Rickard. R2d4/rellm: Exact structure out of any language model completion, 2023b. URL https://github.com/r2d4/rellm.

Torsten Scholak, Nathan Schucher, and Dzmitry Bahdanau. PICARD: Parsing incrementally for constrained auto-regressive decoding from language models. arXiv preprint arXiv:2109.05093, 2021.

Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.

Michael Sipser. Introduction to the Theory of Computation. International Thomson Publishing, 1996.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.

Bailin Wang, Zi Wang, Xuezhi Wang, Yuan Cao, Rif A. Saurous, and Yoon Kim. Grammar Prompting for Domain-Specific Language Generation with Large Language Models, May 2023.

Lilian Weng. Controllable Neural Text Generation, January 2021. URL https://lilianweng.github.io/posts/2021-01-02-controllable-text-generation/.


Acknowledgments

We would like to thank Dan Gerlanc and Dan Simpson for their support and constructive feedback.

This paper is available on arXiv under a CC 4.0 license.