Wen-Jie Shu¹†, Xuerui Qiu², Rui-Jie Zhu, Harold Haodong Chen¹, Yexin Liu¹, Harry Yang¹
¹HKUST, ²CASIA
†Correspondence to: Wen-Jie Shu <[email protected]>

Historically, the ability to reason was widely considered the exclusive domain of Large Language Models or symbolic systems. Visual reasoning often felt like a contradiction in terms until the VARC framework redefined the boundaries. By successfully framing ARC-AGI as a vision-only generation task, VARC proved that pixel-based architectures could handle abstract logic. However, while VARC solved the representation problem, the question of how to induce true Chain-of-Thought reasoning in vision remains under investigation.
Most current vision approaches rely on feed-forward architectures designed for fast pattern matching rather than deliberate thought. We look to recent successes in Natural Language Processing for a better inductive bias. In NLP, Looped Transformers have emerged as a powerful paradigm: because they recycle the same layers for multiple steps, they effectively introduce a "Latent Chain-of-Thought" at the architectural level. Given the success of recurrent models in language, ranging from pre-training efficiency (Huginn and Ouro) to synthetic reasoning tasks (TRM and HRM), it is natural to ask whether this architecture can unlock similar potential in Computer Vision.
In this blog, we introduce Loop-ViT to answer this question. This architecture brings iterative refinement to the visual domain. Our experiments reveal that this is more than just a successful adaptation. It represents a paradigm shift in efficiency.
We demonstrate that Loop-ViT works remarkably well: as illustrated in the figure above, our approach sets a new Pareto frontier for model parameters versus accuracy on ARC-AGI.
Historically, ARC was treated as a domain for symbolic programs or Large Language Models (LLMs), which convert visual grids into text tokens to leverage pre-training priors. However, the recent VARC framework challenged this dogma. By treating ARC as an image-to-image translation task, VARC demonstrated that standard vision backbones (like ViTs and U-Nets) could solve complex reasoning tasks purely from pixels.
The Takeaway: VARC established a clean, vision-only testbed, proving that linguistic intermediates aren't strictly necessary. But it left one question unanswered: is a single forward pass enough?
While VARC showed vision is capable, its feed-forward architecture mimics "System 1" thinking—fast, intuitive, but prone to error on complex logic. Deep learning history, from Recurrent Neural Networks to Universal Transformers, suggests that difficult problems require depth-wise recurrence—reusing parameters to refine representations over time.
This is where our work steps in. While scaling computation via iteration is gaining traction in LLMs (e.g., Chain-of-Thought), Looped Transformers remain largely unexplored in vision-only rule induction. We argue that ARC is the perfect arena for this architecture: by replacing the "one-shot" prediction with a "looped" refinement process, we allow the model to hypothesize, check, and correct itself, all within a fixed parameter budget.

Figure 1: (A) VARC baseline pipeline. A standard feed-forward ViT produces a one-shot prediction from the task canvas. (B) Our looped pipeline. We keep the same VARC-style canvas representation and training/evaluation protocol, but replace the feed-forward backbone with a weight-tied (looped) transformer core executed for a fixed number of iterations K. Each iteration refines the hidden states and intermediate prediction, and the final output is taken from the last step.
We build directly upon the VARC paradigm. Instead of treating ARC as a sequence of discrete tokens, we treat it as a pure vision problem. Figure 1 contrasts the standard VARC pipeline (one-shot feed-forward backbone) with our looped variant. Crucially, we keep the canvas representation and the overall training/evaluation protocol unchanged, and modify only the backbone computation by reapplying the same transformer core for K latent iterations.
This is where we diverge. A standard VARC model is feed-forward: it encodes the canvas, processes it through a stack of transformer blocks, and outputs the prediction in one go. It has no opportunity to correct itself.
We introduce the Looped Transformer. The core idea is simple: instead of making the network deeper by stacking more distinct layers, we take a single block of layers and apply it $K$ times recursively.
💡 Key Concept: Weight Tying. Imagine trying to solve a puzzle. You don't swap your brain for a new one every second; you use the same brain repeatedly to refine your thought. Similarly, our model reuses the same parameters ($F_\theta$) for every iteration. This allows us to increase the depth of reasoning (compute) without increasing the model size (memory).
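A tiny PyTorch sketch makes the memory argument concrete. A stock transformer encoder layer stands in for the actual Loop-ViT block (which is not specified here); the only point is that the looped core's parameter count depends on the block depth alone, not on the number of iterations.

```python
import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

# A generic transformer block stands in for the actual Loop-ViT block (assumption).
block = lambda: nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)

B, K = 2, 3

# "A new brain every second": a feed-forward stack of B*K distinct blocks.
stacked = nn.Sequential(*[block() for _ in range(B * K)])

# "The same brain, reused": B blocks that will be re-applied K times in a loop.
looped_core = nn.Sequential(*[block() for _ in range(B)])

print(n_params(stacked))      # parameter count grows with B*K
print(n_params(looped_core))  # grows with B only, regardless of how large K is
```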
Formally, we modify the inference pipeline as follows:
Initialization ($t=0$): We embed the canvas into an initial hidden state:
$$ h^{(0)} = \mathrm{Embed}(x) $$
The Loop ($t=1,..., K$): We feed the hidden state back into the same Transformer core ($F_\theta$) repeatedly:
$$ h^{(t)} = F_\theta\left(h^{(t-1)}\right), \qquad t = 1, \dots, K $$
*Note: $F_\theta$ is shared across all steps.*
Readout: each hidden state can be decoded into an intermediate per-pixel prediction:
$$ \hat{y}^{(t)} = \mathrm{Decode}\left(h^{(t)}\right) $$
The final answer is simply the prediction at the last step $K$.
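To tie the three equations together, here is a minimal, self-contained PyTorch sketch of a looped backbone. The specific choices (a color-token embedding, a stock `nn.TransformerEncoder` as $F_\theta$, a linear per-pixel head) are our illustrative assumptions, not the released Loop-ViT code; positional embeddings and VARC's canvas construction are omitted for brevity.

```python
import torch
import torch.nn as nn

class LoopViTSketch(nn.Module):
    """Minimal looped backbone: embed -> apply the shared core K times -> decode."""
    def __init__(self, d_model=512, core_depth=2, n_heads=8, n_colors=10, K=3):
        super().__init__()
        self.K = K
        # One token per canvas cell; colors form a small vocabulary (assumption).
        self.embed = nn.Embedding(n_colors, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model, batch_first=True)
        self.F_theta = nn.TransformerEncoder(layer, num_layers=core_depth)  # shared core, depth B
        self.decode = nn.Linear(d_model, n_colors)  # per-pixel classifier

    def forward(self, canvas_tokens):
        # canvas_tokens: (batch, num_cells) integer color ids
        h = self.embed(canvas_tokens)     # h^(0) = Embed(x)
        for _ in range(self.K):
            h = self.F_theta(h)           # h^(t) = F_theta(h^(t-1)), same weights every step
        return self.decode(h)             # logits at the final step K

# Usage: a 30x30 canvas flattened into 900 cell tokens.
model = LoopViTSketch()
logits = model(torch.randint(0, 10, (1, 900)))
print(logits.shape)  # torch.Size([1, 900, 10])
```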
To ensure a fair comparison, we adopt the VARC protocol wholesale. We aim to isolate the benefit of "looping" by keeping the rest of the pipeline identical to the current state-of-the-art.
Pipeline: We use the exact same canvas representation, random scale/translation augmentations, and per-pixel classification objective as VARC.
Two-Stage Training: following VARC, we first train offline on the ARC-1 training set (augmented with RE-ARC), then adapt to each evaluation task with Test-Time Training (TTT) on its demonstration pairs.
Inference: We employ multi-view voting with 510 random views to produce the final Pass@2 accuracy.
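The voting step itself is simple to express. Below is a hedged sketch of how we read the VARC-style multi-view protocol; `augment`, `predict_grid`, and `invert` are hypothetical helpers, and voting over full grids (rather than per pixel) is an illustrative simplification.

```python
from collections import Counter

def pass_at_2(task_input, n_views=510):
    """Hedged sketch of multi-view voting for Pass@2."""
    votes = Counter()
    for _ in range(n_views):
        view, params = augment(task_input)   # random scale/translation view (hypothetical helper)
        pred = predict_grid(view)            # one forward pass of Loop-ViT (hypothetical helper)
        grid = invert(pred, params)          # map back to the original frame (hypothetical helper)
        votes[tuple(map(tuple, grid))] += 1  # identical answers pool their votes
    # Keep the two most frequent candidates; the task counts as solved (Pass@2)
    # if either one matches the ground-truth grid exactly.
    return [g for g, _ in votes.most_common(2)]
```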





Figure 2: Offline training dynamics of Loop-ViT with fixed core depth (B=2, 4, 6, 8, 10) and different loop iterations (K ∈ {1,2,3}). We report grid-level exact match accuracy (entire output grid must be correct) on the training tasks (dashed) and a held-out evaluation split (solid) over epochs. Increasing K improves evaluation accuracy and reduces the train–eval generalization gap, suggesting iterative computation provides a beneficial inductive bias beyond feed-forward depth.
How do we scale a Looped Transformer? We fix the embedding dimension ($d=512$, matching the VARC 18M baseline) and explore two axes: the depth $B$ of the shared core (how many transformer blocks are tied together) and the number of loop iterations $K$.
We compare standard VARC models against our Loop-ViT variants. Note that while effective depth ($B \times K$) can grow large, the parameter count stays small because weights are tied.
| Core Depth (B) ↓ / Loop Steps (K) → | K = 1 (No Loop) | K = 2 | K = 3 |
|---|---|---|---|
| B = 2 | 29.7 | 41.5 | 48.5 |
| B = 4 | 41.0 | 52.2 | 57.4 |
| B = 6 | 51.7 | 54.4 | 59.5 |
| B = 8 | 53.1 | 55.2 | 60.3 |
| B = 10 | 54.5 | 57.0 | 61.4 |

Figure 3: Training exact-match accuracy (grid-level) for Loop-ViT variants of different sizes (Small/Medium/Large) during offline training. Accuracy is measured as the fraction of tasks whose entire output grid is predicted correctly on the training split. Larger models fit the training set more strongly, motivating the need to evaluate improvements under held-out accuracy and to analyze whether gains arise from iterative refinement rather than memorization.
We evaluated our final Loop-ViT variants (using a larger embedding $d=1024$ for Small and Medium, $d=512$ for Large) on ARC-1 and ARC-2. Crucially, both ARC-1 and ARC-2 evaluations use the same checkpoint trained on the ARC-1 training set (plus RE-ARC), relying on Test-Time Training (TTT) to adapt to the specific tasks. The results challenge the "bigger is better" dogma.
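Since TTT does the heavy lifting of adapting the shared checkpoint to each evaluation task, it is worth spelling out what a generic adaptation loop looks like. The sketch below is our illustration of the idea rather than the exact VARC/Loop-ViT recipe; the step count and learning rate are placeholders.

```python
import copy
import torch
import torch.nn.functional as F

def test_time_train(base_model, demo_pairs, steps=100, lr=1e-4):
    """Generic TTT sketch: adapt a copy of the offline-trained model on one task's
    demonstration (input, output) grids, then return the adapted model.
    Hyperparameters here are illustrative, not the values used in the paper."""
    model = copy.deepcopy(base_model)              # never touch the shared checkpoint
    model.train()
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(steps):
        for x, y in demo_pairs:                    # x, y: (1, num_cells) integer color grids
            logits = model(x)                      # (1, num_cells, n_colors)
            loss = F.cross_entropy(logits.flatten(0, 1), y.flatten())
            opt.zero_grad()
            loss.backward()
            opt.step()
    model.eval()
    return model
```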
💡 Key Findings:
| Model | #params | K | ARC-AGI-1 (%) | ARC-AGI-2 (%) |
|---|---|---|---|---|
| *Large language models (LLMs)* | | | | |
| DeepSeek R1 | 671B | - | 15.8 | 1.3 |
| Claude 3.7 8k | N/A | - | 21.2 | 0.9 |
| o3-mini-high | N/A | - | 34.5 | 3.0 |
| GPT-5 | N/A | - | 44.0 | 1.9 |
| Grok-4-thinking | 1.7T | - | 66.7 | 16.0 |
| Bespoke (Grok-4) | 1.7T | - | 79.6 | 29.4 |
| *Recurrent models* | | | | |
| HRM | 27M | - | 40.3 | 5.0 |
| TRM | 7M | - | 44.6 | 7.8 |
| *Vision models* | | | | |
| VARC | 18M | - | 54.5 | 8.3 |
| VARC (ensemble) | 73M | - | 60.4 | 11.1 |
| Loop-ViT (Small) | 3.1M | 24 | 53.9 | 7.5 |
| Loop-ViT (Medium) | 5.9M | 6 | 57.2 | 8.3 |
| Loop-ViT (Large) | 11.2M | 6 | 61.2 | 10.3 |
Our investigation reveals that iterative computation is a potent, yet underutilized, resource in visual reasoning. By introducing a Looped Vision Transformer, we demonstrated that progressive refinement yields better generalization than simply stacking more layers. Empirically, the results speak for themselves: a 5.9M looped model beats the 18M VARC baseline, and scaling to just 11.2M surpasses the 73M VARC ensemble on ARC-AGI-1. This confirms that for reasoning-heavy tasks, the ability to iterate and correct intermediate errors is just as critical as raw capacity.