Self-Improving Language Models with Bidirectional Evolutionary Search

Guowei Xu1 Zhenting Qi1 Huangyuan Su1 Weirui Ye2 Himabindu Lakkaraju1 Sham M. Kakade1 Yilun Du1

1Harvard University    2MIT

A search framework that couples forward candidate evolution with backward goal decomposition.

Abstract

Search has been proposed as an effective method for self-improving language models and agentic systems, both for post-training sample generation and for inference. However, widely used methods such as best-of-N sampling and tree search face two fundamental limitations: they are guided by sparse verification signals, and they construct candidates primarily through autoregressive expansion, restricting exploration to regions with substantial model probability mass. To address these, we propose Bidirectional Evolutionary Search (BES), a search framework that couples forward candidate evolution with backward goal decomposition. In the forward search, BES augments standard expansion with evolution operators that recombine partial trajectories to generate candidates that are difficult to obtain from a single model rollout. In the backward search, BES recursively decomposes the original task into checkable sub-goals, producing dense intermediate feedback that guides forward search. We provide theoretical motivation showing that candidates generated by expansion-only search are confined to a narrow entropy shell while evolutionary operators can escape it, and that backward search can exponentially reduce the number of required samples to find a correct answer. Experiments show that on challenging post-training tasks where mainstream post-training algorithms fail to improve, BES enables consistent gains, and on three open problem solving benchmarks at inference time, BES outperforms existing open-source frameworks in both average and best-case performance. Code and trained models are available at github.com/Embodied-Minds-Lab/BES.

Experiments

Logical Reasoning

Logical reasoning training curves
EMA-smoothed validation accuracy on Knights-and-Knaves (Gemma-3-1B-it after SFT); y-axis is −log(accuracy), lower is better.

Multi-Hop Reasoning

BackboneMethodAccuracy (EM) # Valid Search# Valid ActionsFinish Ratio
Llama-3.2-3B Base4.0
GRPO2.1 −1.90.840.200.64
Tree-GRPO3.9 −0.11.502.140.64
BES (ours)7.0 +3.02.313.290.97
Llama-3.1-8B Base6.6
GRPO5.6 −1.01.461.830.37
Tree-GRPO7.4 +0.80.651.360.71
BES (ours)10.4 +3.82.113.050.94
Multi-hop reasoning post-training results on MuSiQue with Llama-3.2-3B-Instruct and Llama-3.1-8B-Instruct. Higher is better for all metrics. Bold denotes the best per backbone.

Open Problem Solving

Method Circle Packing (Sq, n=26) Circle Packing (Rect, n=21) Heilbronn (Convex, n=13)
Human / 2.634 / 2.364 / 0.0306
AlphaEvolve / 2.635 / 2.3658 / 0.0309
OpenEvolve2.531 ± .018 / 2.5412.267 ± .014 / 2.2760.025 ± .005 / 0.027
GEPA2.613 ± .022 / 2.6282.326 ± .023 / 2.3540.025 ± .002 / 0.027
ShinkaEvolve2.464 ± .083 / 2.5412.335 ± .026 / 2.3580.023 ± .005 / 0.026
BES (ours)2.623 ± .014 / 2.6322.349 ± .012 / 2.3600.026 ± .001 / 0.027
Open problem solving with GPT-5 as the backbone. Mean ± std (3 seeds) / Best. Higher is better. Bold denotes the best per column among comparable methods.

Arrange 13 points on or inside a convex region to maximize the area of the smallest triangle they form, relative to the area of their convex hull.

Σr = 0.000
0 / 0
Speed

Method

BES performs a bidirectional evolutionary search that alternates between two coupled processes: a forward search that seeks better candidates, and a backward search that decomposes the problem into fine-grained sub-goals to evaluate each forward node. The forward search not only extends trajectories but also recombines parts of different candidates, producing solutions that no single rollout sampled from the policy would likely reach. The backward search provides dense and interpretable scores for partial trajectories, guiding the forward search toward promising candidates. In practice, one backward search step is performed after every several forward search steps.

Forward Search

Forward search operators. (a) Expansion: the policy generates new steps (yellow). (b) Combination: two trajectories sharing a common prefix (P1–P2) have their distinct suffixes concatenated into a single candidate. (c) Deletion: an interior step (P2) is removed. (d) Translocation: one step in Path A (A2) is replaced by a step from Path B (B2). (e) Crossover: Path A is cut at a splice point and its tail is replaced by the tail of Path B. Here, the four evolution operators are defined as direct edits on the step sequence. In settings where direct concatenation is not well-defined, the evolution operators can alternatively be implemented by prompting the policy model.

(a) Expansion
P1
P2
P3
P4
new step
(b) Combination
P1
P2
A3
A4
Path A
P1
P2
B3
B4
Path B
concat →
P1
P2
A3
A4
B3
B4
Result
(c) Deletion
P1
P2×
P3
P4
Path
delete →
P1
P3
P4
Result
(d) Translocation
A1
A2
A3
Path A
B1
B2
B3
Path B
replace →
A1
B2
A3
Result
(e) Crossover
A1
A2
A3
A4
Path A
B1
B2
B3
B4
Path B
crossover →
A1
A2
B3
B4
Result

Backward Search

While the verifier provides a score for each node in the forward search, this signal is relatively sparse. Backward search addresses this by decomposing the problem into a tree of fine-grained sub-goals. Each forward node is then scored against this tree: the more sub-goals a candidate trajectory has addressed, the higher its score. This gives the forward search a dense, informative signal to select promising candidates, even when none of them have fully solved the problem yet.

Demonstration

Real BES search traces. Multi-hop QA examples replay the full forward tree, backward decomposition, and the key mutation that found the gold answer. Inference tabs (Circle Packing, Heilbronn) face the forward search tree (left, growing rightward) against the backward goal tree (right, growing leftward). Hover any node to inspect: forward nodes show the patch description and which sub-goals it satisfies; backward nodes show the sub-goal text.

 
 
backward decomp
0 / 0
Speed
Cite this work

BibTeX

If you find BES useful in your research, please consider citing our paper.

@misc{xu2026selfimprovinglanguagemodelsbidirectional,
      title={Self-Improving Language Models with Bidirectional Evolutionary Search}, 
      author={Guowei Xu and Zhenting Qi and Huangyuan Su and Weirui Ye and Himabindu Lakkaraju and Sham M. Kakade and Yilun Du},
      year={2026},
      eprint={2605.28814},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.28814}, 
}