Abstract
Search has been proposed as an effective method for self-improving language models and agentic systems, both for post-training sample generation and for inference. However, widely used methods such as best-of-N sampling and tree search face two fundamental limitations: they are guided by sparse verification signals, and they construct candidates primarily through autoregressive expansion, restricting exploration to regions with substantial model probability mass. To address these, we propose Bidirectional Evolutionary Search (BES), a search framework that couples forward candidate evolution with backward goal decomposition. In the forward search, BES augments standard expansion with evolution operators that recombine partial trajectories to generate candidates that are difficult to obtain from a single model rollout. In the backward search, BES recursively decomposes the original task into checkable sub-goals, producing dense intermediate feedback that guides forward search. We provide theoretical motivation showing that candidates generated by expansion-only search are confined to a narrow entropy shell while evolutionary operators can escape it, and that backward search can exponentially reduce the number of required samples to find a correct answer. Experiments show that on challenging post-training tasks where mainstream post-training algorithms fail to improve, BES enables consistent gains, and on three open problem solving benchmarks at inference time, BES outperforms existing open-source frameworks in both average and best-case performance. Code and trained models are available at github.com/Embodied-Minds-Lab/BES.
Experiments
Logical Reasoning
Multi-Hop Reasoning
| Backbone | Method | Accuracy (EM) | # Valid Search | # Valid Actions | Finish Ratio |
|---|---|---|---|---|---|
| Llama-3.2-3B | Base | 4.0 | — | — | — |
| GRPO | 2.1 −1.9 | 0.84 | 0.20 | 0.64 | |
| Tree-GRPO | 3.9 −0.1 | 1.50 | 2.14 | 0.64 | |
| BES (ours) | 7.0 +3.0 | 2.31 | 3.29 | 0.97 | |
| Llama-3.1-8B | Base | 6.6 | — | — | — |
| GRPO | 5.6 −1.0 | 1.46 | 1.83 | 0.37 | |
| Tree-GRPO | 7.4 +0.8 | 0.65 | 1.36 | 0.71 | |
| BES (ours) | 10.4 +3.8 | 2.11 | 3.05 | 0.94 |
Open Problem Solving
| Method | Circle Packing (Sq, n=26) | Circle Packing (Rect, n=21) | Heilbronn (Convex, n=13) |
|---|---|---|---|
| Human | — / 2.634 | — / 2.364 | — / 0.0306 |
| AlphaEvolve | — / 2.635 | — / 2.3658 | — / 0.0309 |
| OpenEvolve | 2.531 ± .018 / 2.541 | 2.267 ± .014 / 2.276 | 0.025 ± .005 / 0.027 |
| GEPA | 2.613 ± .022 / 2.628 | 2.326 ± .023 / 2.354 | 0.025 ± .002 / 0.027 |
| ShinkaEvolve | 2.464 ± .083 / 2.541 | 2.335 ± .026 / 2.358 | 0.023 ± .005 / 0.026 |
| BES (ours) | 2.623 ± .014 / 2.632 | 2.349 ± .012 / 2.360 | 0.026 ± .001 / 0.027 |
Arrange 13 points on or inside a convex region to maximize the area of the smallest triangle they form, relative to the area of their convex hull.
Method
Forward Search
Forward search operators. (a) Expansion: the policy generates new steps (yellow). (b) Combination: two trajectories sharing a common prefix (P1–P2) have their distinct suffixes concatenated into a single candidate. (c) Deletion: an interior step (P2) is removed. (d) Translocation: one step in Path A (A2) is replaced by a step from Path B (B2). (e) Crossover: Path A is cut at a splice point and its tail is replaced by the tail of Path B. Here, the four evolution operators are defined as direct edits on the step sequence. In settings where direct concatenation is not well-defined, the evolution operators can alternatively be implemented by prompting the policy model.
Backward Search
While the verifier provides a score for each node in the forward search, this signal is relatively sparse. Backward search addresses this by decomposing the problem into a tree of fine-grained sub-goals. Each forward node is then scored against this tree: the more sub-goals a candidate trajectory has addressed, the higher its score. This gives the forward search a dense, informative signal to select promising candidates, even when none of them have fully solved the problem yet.
Demonstration
Real BES search traces. Multi-hop QA examples replay the full forward tree, backward decomposition, and the key mutation that found the gold answer. Inference tabs (Circle Packing, Heilbronn) face the forward search tree (left, growing rightward) against the backward goal tree (right, growing leftward). Hover any node to inspect: forward nodes show the patch description and which sub-goals it satisfies; backward nodes show the sub-goal text.
BibTeX
If you find BES useful in your research, please consider citing our paper.
@misc{xu2026selfimprovinglanguagemodelsbidirectional,
title={Self-Improving Language Models with Bidirectional Evolutionary Search},
author={Guowei Xu and Zhenting Qi and Huangyuan Su and Weirui Ye and Himabindu Lakkaraju and Sham M. Kakade and Yilun Du},
year={2026},
eprint={2605.28814},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2605.28814},
}