A new method helps large language models produce accurate, structured code across programming languages.
Large language models (LLMs) are increasingly used by programmers to generate code faster. But this only helps if the generated code actually works—meaning it must follow the rules of the programming language and be error-free.
While some existing methods try to keep LLMs within the bounds of programming languages, they often come with trade-offs: they either distort the intended output or take too long to be practical for complex coding tasks.
Now, researchers at MIT and collaborators have developed a new technique that automatically steers LLMs to produce structurally correct and semantically meaningful outputs. This approach not only filters out flawed responses early in the generation process but also allows the model to focus computational resources on more promising options. The result is improved accuracy and efficiency.
Thanks to this improvement, smaller LLMs using the technique have outperformed much larger models in generating structured, correct outputs in areas such as molecular biology and robotics.
In the future, the method could help non-experts interact with AI more effectively—for example, by allowing users to write complex SQL queries using only natural language prompts.
“This work goes beyond academic research,” says João Loula, an MIT graduate student and co-lead author. “It has the potential to improve coding assistants, AI-driven data analysis, and scientific tools by making sure AI-generated outputs are both useful and correct.”
Loula co-authored the paper with Benjamin LeBrun (Mila-Quebec AI Institute), Li Du (Johns Hopkins University), and a team led by Timothy J. O’Donnell (McGill University and Mila). Other contributors include Vikash Mansinghka (MIT), Alexander K. Lew (Yale), and Tim Vieira (ETH Zurich). The research will be presented at the International Conference on Learning Representations.

Smart Structure Without Sacrificing Meaning
A common way to ensure that LLM-generated code is structurally valid is to check the entire output after generation. If errors are found, the process must start over—costly in terms of time and resources.
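As a rough illustration of that generate-then-check loop in Python (generate_code here is a hypothetical placeholder for a call to an LLM, not part of the researchers’ system):

import ast

def generate_code(prompt):
    """Hypothetical stand-in for a call to an LLM."""
    return "def add(a, b):\n    return a + b\n"

def generate_valid_python(prompt, max_attempts=5):
    for _ in range(max_attempts):
        candidate = generate_code(prompt)
        try:
            ast.parse(candidate)  # the syntax check runs only once the output is complete
            return candidate
        except SyntaxError:
            continue  # a single error throws away the entire attempt
    raise RuntimeError("no syntactically valid output produced")

print(generate_valid_python("write a function that adds two numbers"))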
Another method checks the code as it’s being generated. While this ensures proper syntax and structure, it can cause the model to stray from the user’s intended meaning.
“It’s much easier to enforce structure than meaning,” Loula explains. “You can check syntax instantly, but verifying meaning requires running the code, which is more complex.”
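That incremental approach can be pictured as token-by-token filtering, as in this toy sketch (next_token_candidates and is_valid_prefix are hypothetical stand-ins for an LLM’s ranked proposals and an incremental syntax check):

def next_token_candidates(prefix):
    """Hypothetical stand-in: an LLM's ranked next-token proposals for a prefix."""
    table = {
        "": ["SELECT name", "SELCT name"],
        "SELECT name": [" FROM users", " FORM users"],
        "SELECT name FROM users": [" WHERE age > 30", ";"],
        "SELECT name FROM users WHERE age > 30": [";"],
    }
    return table.get(prefix, [";"])

def is_valid_prefix(prefix):
    """Hypothetical stand-in for an incremental syntax check."""
    return "SELCT" not in prefix and "FORM" not in prefix

def constrained_decode(steps=4):
    # At each step, commit to the most likely token that keeps the prefix valid.
    # The check enforces syntax only; nothing verifies that the query means what
    # the user asked for, so locally valid choices can still drift from the intent.
    prefix = ""
    for _ in range(steps):
        for token in next_token_candidates(prefix):
            if is_valid_prefix(prefix + token):
                prefix += token
                break
    return prefix

print(constrained_decode())  # SELECT name FROM users WHERE age > 30;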
The team’s new method blends expert knowledge with the LLM’s own capabilities, guiding it toward valid and meaningful outputs. Instead of retraining the model, the researchers use a statistical technique called sequential Monte Carlo to generate multiple candidate outputs in parallel. Each candidate is assigned a “weight” reflecting how likely it is to be both structurally valid and semantically accurate, and less promising candidates are discarded as generation proceeds.
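A highly simplified sketch of that idea in Python (this illustrates the general sequential Monte Carlo pattern rather than the researchers’ implementation; step_model and passes_constraints are hypothetical stand-ins for an LLM step and an incremental validity check, and the semantic check is reduced here to a hard pass/fail rule):

import random

def step_model(prefix):
    """Hypothetical stand-in for one LLM step: a proposed next token and its probability."""
    return random.choices([("a", 0.5), ("b", 0.3), (".", 0.2)],
                          weights=[0.5, 0.3, 0.2])[0]

def passes_constraints(prefix):
    """Hypothetical stand-in for an incremental validity check."""
    return prefix.count("b") <= 2

def smc_generate(num_particles=8, max_steps=10):
    # Each particle is a partial output paired with an importance weight.
    particles = [("", 1.0) for _ in range(num_particles)]
    for _ in range(max_steps):
        proposals = []
        for prefix, weight in particles:
            token, prob = step_model(prefix)
            new_prefix = prefix + token
            # Candidates that violate the constraints get weight zero; the rest
            # are weighted by how probable the model found them.
            new_weight = weight * prob if passes_constraints(new_prefix) else 0.0
            proposals.append((new_prefix, new_weight))
        total = sum(w for _, w in proposals)
        if total == 0:
            break
        # Resampling: promising partial outputs are duplicated so they receive
        # more of the computational budget, while weak ones are dropped.
        resampled = random.choices(proposals, weights=[w for _, w in proposals],
                                   k=num_particles)
        particles = [(prefix, 1.0) for prefix, _ in resampled]
    return [prefix for prefix, _ in particles]

print(smc_generate())

In the real system, the model is an actual LLM and the weights also reflect semantic checks; the sketch only shows how weighting and resampling concentrate effort on the most promising candidates.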
In essence, it’s as if the model were supervised by a silent expert that constantly nudges it toward better results based on the user’s instructions and constraints.
“We’ve done the mathematical heavy lifting,” says Loula. “Whatever constraints you provide, the model can figure out the right answer.”
Small Models, Big Results

The team tested their method on four tasks: generating Python code, SQL queries, molecular structures, and robot action plans. Across the board, the method yielded more accurate results while using fewer computational resources.
For instance, when generating Python code, a small open-source model using their architecture outperformed a much larger, closed-source commercial model.
“We’re excited that our framework lets small models punch above their weight,” Loula says.
The researchers now aim to extend the approach to control larger segments of text and incorporate learning so that models improve as they generate.
In the future, this framework could be used in AI systems that help non-technical users analyze data, model databases, or query complex information—all through natural language.
According to Mansinghka, this opens the door to more interactive, accurate AI tools: “Users will be able to communicate with AI that truly understands the structure and meaning of their data and queries.”
O’Donnell adds that the method also touches on deeper issues in linguistics and AI: “One of the big questions is how words relate to real-world meaning, especially when there’s uncertainty or ambiguity. LLMs are great at predicting sequences, but they don’t really understand. This work shows that in specific symbolic domains, we can bridge that gap—and it’s a small but meaningful step toward AI that communicates more like humans do.”