ARC AGI 3: When agents fail at pixel games

ARC AGI 3 Competition

The challenge that humbled us

Two weeks ago, Eugenio and I decided to participate in the ARC-AGI-3 agent preview. It sounded exciting: a benchmark that measures the intelligence of current LLMs through simple games.

How hard could it be to program an agent that solves 64x64 pixel games? Especially considering we've been building agentic flows and AI agents for quite some time.

Spoiler: we achieved nothing. It turned out to be very hard.

After 14 intense days of work, debugging, and frustration, reality hit us hard. Not only did we not solve the games, we barely got our agents to make significant progress. And this made me rethink everything I believed about the current state of AI.

Two minds, two approaches

Working with Eugenio on this project was revealing in an unexpected way. He thinks like a lawyer and a philosopher. His methods had an Aristotelian air, by way of Descartes and other great thinkers. His approach was systematic but abstract, seeking universal principles that could apply to any game.

I was struck by a prompt he had designed for his agent: "You're not focused on winning, your goal is to understand the world and the rules, and let that lead you to win." It was pure philosophy applied to AI—first deep understanding, then victory would come naturally.

I, a typical engineer, jumped straight into implementation: structured, practical, oriented toward immediate results. I wanted to see working code from day one.

The clash was inevitable. Where he saw the need to establish solid philosophical foundations for reasoning, I thought, "WTF are we doing?" Where I saw the urgency to iterate quickly, he saw hasty decisions without a theoretical basis.

In the end, we presented separate projects: he with an elegant philosophical reasoning system, I with a more technical and direct approach. It's ironic, because we've been collaborating successfully on other agent projects for a while. But when it came to creating something that could truly think, our approaches simply didn't converge.

What is intelligence really?

Until now, artificial intelligence benchmarks have focused on throwing increasingly complicated problems at models ("PhD++", as Greg says in this video). More data, more parameters, more complexity.

But ARC-AGI takes a radically different approach. It's based on a simple but profound definition: intelligence is your efficiency in adapting to novelty. In simple words: how fast you are at learning completely new things.

ARC problems are simple for humans but brutally difficult for AI. And measuring intelligence through games seems brilliant to me—each game has clear rules, a defined goal, and requires the agent to constantly adapt. Very similar to the challenges a thinking entity must overcome in different real environments.

The challenge: simple games, impossible task

The premise is elegantly simple: have an AI agent play several games in which it must:

  1. Understand the rules by observing only a few examples
  2. Identify the goal without explicit instructions
  3. Manage resources (limited lives)
  4. Win the 8-9 levels of each game

There are only six available actions (a minimal sketch of the resulting agent loop follows the list):

  • ⬆️ Up
  • ⬇️ Down
  • ⬅️ Left
  • ➡️ Right
  • [Spacebar]
  • 🖱️ Click (the most complex since the agent must decide exactly where to click)
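
To make the setup concrete, here is a minimal sketch of the observe-act loop that any agent for this benchmark ends up implementing: look at a 64x64 frame, pick one of the six actions (plus a coordinate when clicking), and repeat until the game ends or the action budget runs out. Everything here (`Action`, `DummyGame`, `choose_action`, `run_episode`) is my own naming for illustration, not the real ARC-AGI-3 API, and the random policy is a placeholder for what, in our agents, was an LLM call.

```python
import random
from enum import Enum


class Action(Enum):
    """The six moves described above. Hypothetical names, not the real API."""
    UP = "up"
    DOWN = "down"
    LEFT = "left"
    RIGHT = "right"
    SPACE = "space"
    CLICK = "click"  # the tricky one: also needs an (x, y) on the 64x64 grid


class DummyGame:
    """Stand-in environment so the sketch runs; not the actual benchmark."""

    def __init__(self, levels=8):
        self.levels = levels
        self.cleared = 0

    def reset(self):
        self.cleared = 0
        return [[0] * 64 for _ in range(64)]  # 64x64 grid of color indices

    def step(self, action, target=None):
        # Pretend a click occasionally clears a level, just to end the episode.
        if action == Action.CLICK.value and random.random() < 0.01:
            self.cleared += 1
        done = self.cleared >= self.levels
        return [[0] * 64 for _ in range(64)], done


def choose_action(frame, history):
    """Placeholder policy. In our agents this was an LLM call fed the rendered
    grid plus a summary of what had already been tried."""
    action = random.choice(list(Action))
    target = (random.randrange(64), random.randrange(64)) if action is Action.CLICK else None
    return action, target


def run_episode(env, max_actions=100_000):
    """Observe-act loop; returns how many actions were used (the same metric
    the leaderboards report), or None if the budget runs out."""
    frame = env.reset()
    history = []
    for step in range(1, max_actions + 1):
        action, target = choose_action(frame, history)
        frame, done = env.step(action.value, target)
        history.append((action, target))
        if done:
            return step
    return None


if __name__ == "__main__":
    print(run_episode(DummyGame()))
```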

Example of game LS20: looks simple, right?


Game VC33: another "simple" pixel challenge

The brutal reality of numbers

The results are humiliating for AI.

Humans typically complete the three games in 500-700 total actions (about 200-250 per game).

Human leaderboard

Humans: elegant, efficient

AI agents, on the other hand, require absurd numbers of actions: 112,000, 242,000, 13,000... and still fail to finish. It's like watching someone bang their head against a wall 200,000 times hoping a door will open.

AI agent leaderboard

AI agents: brute force without elegance

The few top entries that report "reasonable" numbers like 656 turns make me suspicious. They are clearly not relying on a pure LLM alone. It sounds to me like there was direct human intervention or some very specific engineering tricks. When the technical details are published, it will be interesting to analyze them.

Crystallized vs. fluid intelligence

This experience helped me understand a crucial distinction I previously only knew theoretically:

Crystallized intelligence: Accumulated knowledge, facts, learned procedures. LLMs have it in absurd quantities—doctorate-level knowledge in almost all areas, programming capabilities that surpass many developers (including me).

Fluid intelligence: Pure reasoning, adaptation to new situations, problem-solving without precedent. This is where current LLMs fall dramatically short.

An 8-year-old child can see an ARC game for the first time and understand the rules in minutes. A state-of-the-art LLM gets stuck indefinitely, generating action after action without truly learning from feedback.

The "GPT-5" moment and my conclusions

Just last week we saw the launch of GPT-5. Impressive in many aspects, but after my experience with ARC-AGI-3, I see it with different eyes.

I feel that the current LLM architecture is hitting a ceiling. LLMs are incredibly useful and powerful for crystallized intelligence tasks, but I don't think pure LLMs, combined with the ML algorithms we know today, are the direct path to AGI.

My conclusion after these two frustrating but illuminating weeks:

We are further from AGI than I thought at the end of last year. In December 2024, I genuinely believed LLMs were going to replace us in practically everything soon. Now I think that "soon" will be longer than expected.

(I'm still betting on 2030 as the year of AGI, but with much less confidence than before.)

(I need to make a bet on Polymarket to see when AGI will arrive.)

The value of failure

Paradoxically, this "failure" was one of the most valuable experiences I've had working with AI. It forced me to confront my own biases about the current capabilities of these systems.

It's easy to be dazzled by ChatGPT writing eloquent essays or Claude solving complex programming problems. But put these same models in front of a simple game that requires true adaptation, and the illusion completely disappears.

Agents are not as smart as we think. Not yet.

For the curious

If you want to explore this fascinating rabbit hole:

In a year I'll see if my pessimistic conclusions were wrong... or if I underestimated how far we really are from true artificial intelligence.