Self-Improving Language Models with Bidirectional Evolutionary Search

Abstract

Search has been proposed as an effective method for self-improving language models and agentic systems, both for post-training sample generation and for inference. However, widely used methods such as best-of-N sampling and tree search face two fundamental limitations: they are guided by sparse verification signals, and they construct candidates primarily through autoregressive expansion, restricting exploration to regions with substantial model probability mass. To address these, we propose Bidirectional Evolutionary Search (BES), a search framework that couples forward candidate evolution with backward goal decomposition. In the forward search, BES augments standard expansion with evolution operators that recombine partial trajectories to generate candidates that are difficult to obtain from a single model rollout. In the backward search, BES recursively decomposes the original task into checkable sub-goals, producing dense intermediate feedback that guides forward search. We provide theoretical motivation showing that candidates generated by expansion-only search are confined to a narrow entropy shell while evolutionary operators can escape it, and that backward search can exponentially reduce the number of required samples to find a correct answer. Experiments show that on challenging post-training tasks where mainstream post-training algorithms fail to improve, BES enables consistent gains, and on three open problem solving benchmarks at inference time, BES outperforms existing open-source frameworks in both average and best-case performance. Code and trained models are available at github.com/Embodied-Minds-Lab/BES.

Experiments

Logical Reasoning

Multi-Hop Reasoning

Multi-hop reasoning post-training results on MuSiQue with Llama-3.2-3B-Instruct and Llama-3.1-8B-Instruct. Higher is better for all metrics. **Bold** denotes the best per backbone.
Backbone	Method	Accuracy (EM)	# Valid Search	# Valid Actions	Finish Ratio
Llama-3.2-3B	Base	4.0	—	—	—
	GRPO	2.1 −1.9	0.84	0.20	0.64
	Tree-GRPO	3.9 −0.1	1.50	2.14	0.64
	BES (ours)	7.0 +3.0	2.31	3.29	0.97
Llama-3.1-8B	Base	6.6	—	—	—
	GRPO	5.6 −1.0	1.46	1.83	0.37
	Tree-GRPO	7.4 +0.8	0.65	1.36	0.71
	BES (ours)	10.4 +3.8	2.11	3.05	0.94

Open Problem Solving

Open problem solving with GPT-5 as the backbone. Mean ± std (3 seeds) / Best. Higher is better. **Bold** denotes the best per column among comparable methods.
Method	Circle Packing (Sq, n=26)	Circle Packing (Rect, n=21)	Heilbronn (Convex, n=13)
Human	— / 2.634	— / 2.364	— / 0.0306
AlphaEvolve	— / 2.635	— / 2.3658	— / 0.0309
OpenEvolve	2.531 ± .018 / 2.541	2.267 ± .014 / 2.276	0.025 ± .005 / 0.027
GEPA	2.613 ± .022 / 2.628	2.326 ± .023 / 2.354	0.025 ± .002 / 0.027
ShinkaEvolve	2.464 ± .083 / 2.541	2.335 ± .026 / 2.358	0.023 ± .005 / 0.026
BES (ours)	2.623 ± .014 / 2.632	2.349 ± .012 / 2.360	0.026 ± .001 / 0.027

Arrange 13 points on or inside a convex region to maximize the area of the smallest triangle they form, relative to the area of their convex hull.

Σr = 0.000

0 / 0

Speed

Method

BES performs a bidirectional evolutionary search that alternates between two coupled processes: a forward search that seeks better candidates, and a backward search that decomposes the problem into fine-grained sub-goals to evaluate each forward node. The forward search not only extends trajectories but also recombines parts of different candidates, producing solutions that no single rollout sampled from the policy would likely reach. The backward search provides dense and interpretable scores for partial trajectories, guiding the forward search toward promising candidates. In practice, one backward search step is performed after every several forward search steps.

Forward Search

Forward search operators. (a) Expansion: the policy generates new steps (yellow). (b) Combination: two trajectories sharing a common prefix (P1–P2) have their distinct suffixes concatenated into a single candidate. (c) Deletion: an interior step (P2) is removed. (d) Translocation: one step in Path A (A2) is replaced by a step from Path B (B2). (e) Crossover: Path A is cut at a splice point and its tail is replaced by the tail of Path B. Here, the four evolution operators are defined as direct edits on the step sequence. In settings where direct concatenation is not well-defined, the evolution operators can alternatively be implemented by prompting the policy model.

(a) Expansion

P1

P2

P3

P4

new step

(b) Combination

P1

P2

A3

A4

Path A

P1

P2

B3

B4

Path B

concat →

P1

P2

A3

A4

B3

B4

Result

(c) Deletion

P1

P2×

P3

P4

Path

delete →

P1

P3

P4

Result

(d) Translocation

A1

A2

A3

Path A

B1

B2

B3

Path B

replace →

A1

B2

A3

Result

(e) Crossover

A1

A2

A3

A4

Path A

B1

B2

B3

B4

Path B

crossover →

A1

A2

B3

B4

Result

Backward Search

While the verifier provides a score for each node in the forward search, this signal is relatively sparse. Backward search addresses this by decomposing the problem into a tree of fine-grained sub-goals. Each forward node is then scored against this tree: the more sub-goals a candidate trajectory has addressed, the higher its score. This gives the forward search a dense, informative signal to select promising candidates, even when none of them have fully solved the problem yet.

Demonstration

Real BES search traces. Multi-hop QA examples replay the full forward tree, backward decomposition, and the key mutation that found the gold answer. Inference tabs (Circle Packing, Heilbronn) face the forward search tree (left, growing rightward) against the backward goal tree (right, growing leftward). Hover any node to inspect: forward nodes show the patch description and which sub-goals it satisfies; backward nodes show the sub-goal text.

backward decomp

0 / 0

Speed

Cite this work

BibTeX

If you find BES useful in your research, please consider citing our paper.

@misc{xu2026selfimprovinglanguagemodelsbidirectional,
      title={Self-Improving Language Models with Bidirectional Evolutionary Search}, 
      author={Guowei Xu and Zhenting Qi and Huangyuan Su and Weirui Ye and Himabindu Lakkaraju and Sham M. Kakade and Yilun Du},
      year={2026},
      eprint={2605.28814},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.28814}, 
}