Team: Zhenyu He, Qingping Yang, Wei Shen, Xiaojian Zhong, Kechi Zhang, Chenxin An, Wenlei Shi, Tianle Cai, Di He, Jiaze Chen*, Jingjing Xu*, Mingxuan Wang*

*: Project Leads

Affiliations: Peking University, ByteDance Seed, The University of Hong Kong

Date: Aug 4, 2025

<aside> 📌

We are thrilled to introduce SWE-Swiss-32B, a 32B model that establishes a new state of the art for open-source models in its size class, achieving a 60.2% score on the SWE-bench Verified benchmark and placing it in the same top-tier performance bracket as much larger models. Our strategy is to improve three core tasks in the issue-resolution process: Localization, Repair, and Unit Test Generation. Our two-phase training strategy first embeds these capabilities into a base model via multi-task Supervised Fine-Tuning (SFT) on meticulously curated datasets. It then sharpens the most critical skill, Repair, through targeted Reinforcement Learning (RL) with direct feedback from test environments. To accelerate research in the community, we are open-sourcing the SWE-Swiss-32B model and our complete training datasets.

👨‍💻 GitHub, 🤗 HF Model, 🤗 HF Dataset, 📖 [paper](coming soon), 🔎 [Evaluation results](coming soon)

</aside>

Figure 1: Performance and model size comparison on SWE-bench Verified. Our 32B model, SWE-Swiss-32B, achieves a top-tier score of 60.2% with test-time scaling. For Qwen2.5-32B-Instruct, the score is obtained via the Agentless framework. The scores for other models are reported from their respective blogs or papers.

1. Introduction

The automated resolution of software issues represents a grand challenge for AI. Frameworks like Agentless [1] have demonstrated the potential of breaking this challenge down into a structured workflow that mimics a human developer: first locating the problem files, then generating a fix. While this paradigm is promising, it raises a critical question: what is the most effective way to train a model to excel at each stage of this process?

This report introduces a solution: the SWE-Swiss recipe, a comprehensive training strategy designed to create a powerful and versatile issue-resolution model. Our work is founded on the principle that mastery requires explicitly training a model on the core competencies of software engineering. We identify three such skills:

- **Localization**: identifying the files relevant to an issue from its description and the repository structure.
- **Repair**: generating a patch that fixes the localized code.
- **Unit Test Generation**: writing tests that reproduce the issue and validate candidate patches.

Our model, SWE-Swiss-32B, is the product of this recipe. It begins with a broad, foundational understanding built via Supervised Fine-Tuning (SFT) across all three tasks, followed by a specialized Reinforcement Learning (RL) phase to master the art of repair. As shown in Figure 1, this focused approach allows our 32B model to outperform a range of other open-source models on the SWE-bench Verified benchmark. We note that concurrent work DeepSWE also achieves a similar score.
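To make the two-phase recipe concrete, here is a highly simplified sketch of the training loop. Everything in it is an illustrative assumption rather than our actual implementation: `supervised_finetune`, `env.tests_pass`, and `model.policy_update` are hypothetical helpers, and the reward is reduced to a binary pass/fail signal from the test environment.

```python
def train_swe_swiss(base_model, sft_data, rl_issues, env):
    """Two-phase recipe sketch: multi-task SFT, then RL on Repair.

    All helpers used here are hypothetical placeholders, not our real API.
    """
    # Phase 1: multi-task SFT on curated Localization, Repair, and
    # Unit Test Generation examples.
    model = supervised_finetune(base_model, sft_data)

    # Phase 2: RL on the Repair task, rewarding patches that pass
    # the tests in the execution environment.
    for issue in rl_issues:
        patch = model.sample(issue.prompt)
        reward = 1.0 if env.tests_pass(issue, patch) else 0.0
        model.policy_update(issue.prompt, patch, reward)
    return model
```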

2. The SWE-Swiss Recipe: A Three-Skill Curriculum

The foundation of our recipe is a high-quality "curriculum": a collection of meticulously curated datasets designed to teach each core skill. To build this curriculum, we employ a methodology of verified rejection sampling. This process involves generating a large pool of candidate data for each task (e.g., potential code patches or unit tests) and then applying strict, test-driven validation to filter for only the successful and high-quality examples. This ensures every data point used for fine-tuning represents a verified, successful outcome.
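As a minimal sketch of what verified rejection sampling can look like in practice, consider the loop below. It assumes a git checkout per issue and a pytest-based test suite; `model.sample` and the `issue` attributes are hypothetical stand-ins, not our actual data pipeline.

```python
import subprocess

def passes_validation(repo_dir: str, patch: str) -> bool:
    """Apply a candidate patch and run the repo's tests; a candidate
    is kept only if the patch applies cleanly and all tests pass."""
    applied = subprocess.run(["git", "apply", "-"],
                             input=patch.encode(), cwd=repo_dir)
    if applied.returncode != 0:
        return False
    tests = subprocess.run(["python", "-m", "pytest", "-q"], cwd=repo_dir)
    # Revert tracked changes and remove untracked files so the next
    # candidate starts from a clean working tree.
    subprocess.run(["git", "checkout", "--", "."], cwd=repo_dir)
    subprocess.run(["git", "clean", "-fd"], cwd=repo_dir)
    return tests.returncode == 0

def verified_rejection_sampling(issues, model, n_samples: int = 8):
    """Generate many candidates per issue; keep only verified successes."""
    curated = []
    for issue in issues:
        for _ in range(n_samples):
            patch = model.sample(issue.prompt)  # hypothetical sampling API
            if passes_validation(issue.repo_dir, patch):
                curated.append((issue.id, patch))  # verified SFT example
    return curated
```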

Figure 2: An illustration of the LLM-driven patch generation process, which is enabled by three core abilities. First, in the Localization step, the LLM predicts relevant files using the issue description and repository structure. These files, augmented by a retrieval model, are then used for Repair, where the LLM generates patches. Concurrently, the LLM performs Unit Test Generation to create reproduction tests from the issue description. These new tests are then combined with existing regression tests to filter and validate the final patch.
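The workflow in Figure 2 can be summarized in pseudocode. The sketch below is illustrative only: `localize_files`, `retrieve`, `generate_patch`, `generate_tests`, and `passes` are hypothetical stand-ins for model and test-harness calls, not the actual SWE-Swiss interfaces.

```python
def resolve_issue(issue, repo, model, retriever, n_patches: int = 8):
    """Illustrative end-to-end pipeline from Figure 2 (hypothetical APIs)."""
    # 1. Localization: predict relevant files from the issue description
    #    and the repository structure.
    files = model.localize_files(issue.description, repo.structure)
    #    Augment the predicted files with a retrieval model.
    files += retriever.retrieve(issue.description, repo)

    # 2. Repair: sample several candidate patches conditioned on the files.
    patches = [model.generate_patch(issue.description, files)
               for _ in range(n_patches)]

    # 3. Unit Test Generation: produce tests that reproduce the issue.
    repro_tests = model.generate_tests(issue.description)

    # 4. Filter: keep patches that pass both the new reproduction tests
    #    and the repository's existing regression tests.
    all_tests = repro_tests + repo.regression_tests
    return [p for p in patches if repo.passes(p, all_tests)]
```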

2.1 The "Localization" Task: Pinpointing the Problem