Spark
Day 18 / 60

Until now, for Nelson to interact with DIgSILENT, I maintained a repository with a Python backend. I (and Claude Code, let's be honest) wrote thousands of lines of code so Nelson could perform each operation: activate a line, change a transformer tap, run a power flow.
At some point I considered making that repository public. Until I realized there was a much better way to solve the problem.
The problem wasn't Nelson — it was the connection to DIgSILENT. The tools were fixed: ~50 operations I had programmed by hand. If Nelson needed something outside those 50 — for example, adjusting a distance protection — there was no way to execute it. The agent reasoned well, but the tool layer simply couldn't do anything.
My previous solution was fragile by design: it was only as capable as the code I had written.
Recent advances in ARC-AGI 3 made me think about this again. I wrote about ARC-AGI 3 in a previous post — basically it's the competition that measures whether models can reason on genuinely novel tasks. The DukeNLP team published the best analysis I found: their agent completes all three benchmark games using only three tools — read logs, grep, and execute Python.
The conclusion that hit me hardest: LLMs are bad at reasoning about complex spatial structures in context, but when you give them Python to do it programmatically, the problem disappears.
That line of work opened my mind: if the model can reason, it can program. And if it can program, it can create its own tools.
And that's the trick that took me two months to see: interacting with DIgSILENT was always Python scripts. There's no magic behind the 50 tools of harness v1.0 — each one is, at its core, a Python script that opens the .pfd, does something, and returns a result. I had simply written them all by hand beforehand.
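To make this concrete, here is roughly what one of those hand-written tools boils down to. This is an illustrative sketch, not code from my repository: the function name and project name are made up, but the calls (`ActivateProject`, `GetFromStudyCase`, `GetCalcRelevantObjects`, `GetAttribute`) are the standard PowerFactory Python API.

```python
def run_power_flow(app, project_name):
    """One harness v1.0-style tool: activate a project, run a load
    flow, and return the loading of every line. `app` is the object
    returned by powerfactory.GetApplication() inside DIgSILENT."""
    app.ActivateProject(project_name)        # open the project (.pfd)
    ldf = app.GetFromStudyCase("ComLdf")     # load-flow command object
    if ldf.Execute() != 0:                   # 0 means success in PowerFactory
        raise RuntimeError("load flow did not converge")
    lines = app.GetCalcRelevantObjects("*.ElmLne")
    return {ln.loc_name: ln.GetAttribute("c:loading") for ln in lines}

# Inside DIgSILENT's Python environment you would call it like:
#   import powerfactory
#   print(run_power_flow(powerfactory.GetApplication(), "MyGrid"))
```

Multiply that by ~50 operations and you get the thousands of lines I was maintaining by hand.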
What if the agent wrote them itself when it needs them?
What is Spark?
Spark is an agent specialized in writing Python code. Nothing more. It doesn't do electrical studies, doesn't generate reports, doesn't talk to the user. Its only task is to take a problem and solve it with a script.
It's deliberately simple: ~500 lines of code, with a basic ReAct loop.

Spark ReAct cycle
The cycle works like this:
- Reason — understands what operation it needs to perform in DIgSILENT
- Write — generates the corresponding Python script
- Execute — runs the script against PowerFactory
- Evaluate — did it work? does the result make sense?
- Iterate — if it failed, understands why and corrects
What makes it special isn't the cycle itself. It's what happens when it succeeds and when it fails.
The memory that accumulates
Every time Spark successfully solves a problem, it saves two things: the script that worked and the reasoning that led to that script.
But the most interesting part isn't that it saves successes — it also saves failures.
Most AI agents only learn from their successes. They save what worked and reuse it. But humans learn more from failure — probably because we fail more often than we succeed. I implemented that same logic in Spark.
When it can't solve something, it saves a [FAILED] document with what it tried, why it didn't work, and what it recommends trying differently. Before writing any new script, Spark reads both: successes and failures. The failures tell it what not to do.
The format is simple:
```
# [FAILED] Short circuit on line at 50%

## What was tried
- Create fault event (EvtShc) + IEC 60909 method → error code 1

## Why it failed
- IEC 60909 doesn't process fault events on lines via script

## Recommendation
- Use "Complete" method (iopt_mde=1) instead of IEC 60909
```
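A sketch of how such a memory could be persisted and read back. The file layout, naming scheme, and function names here are my assumption for illustration, not Spark's actual code:

```python
from pathlib import Path

def save_memory(memory_dir, task, body, failed=False):
    """Persist one experience as a markdown doc: either the script and
    reasoning that worked, or a [FAILED] post-mortem of what not to retry."""
    Path(memory_dir).mkdir(parents=True, exist_ok=True)
    tag = "[FAILED] " if failed else ""
    slug = task.lower().replace(" ", "-")[:60]  # crude filename from the task
    (Path(memory_dir) / f"{slug}.md").write_text(f"# {tag}{task}\n\n{body}")

def load_memory(memory_dir):
    """Return (successes, failures) so the agent reads BOTH before
    writing any new script. The failures tell it what not to do."""
    docs = [p.read_text() for p in sorted(Path(memory_dir).glob("*.md"))]
    failures = [d for d in docs if d.startswith("# [FAILED]")]
    successes = [d for d in docs if not d.startswith("# [FAILED]")]
    return successes, failures
```

The important design choice is that both piles are loaded unconditionally: a failure document costs a few hundred tokens to read, but saves an entire run of dead-end retries.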
A concrete example: a short circuit on a power line was a task that failed consistently. Without memory, Spark retried the same approach 30 times, spending $0.29 without getting anywhere. On the next execution, it had the [FAILED] document saved — it understood that the IEC 60909 method doesn't work for line faults via script, switched to the "Complete" method, and solved it on the first try for $0.10.
| Attempt | Result | Cost |
|---|---|---|
| Without memory | 30 retries, no result | $0.29 |
| With [FAILED] saved | Fails, but documents why | $0.27 |
| Next execution | Success on first try | $0.10 |

Spark self-feedback cycle
The technical implementation is ~30 extra lines of code in the agent loop and an additional paragraph in the system prompt. Nothing sophisticated. The insight is in the design, not the code.
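Those extra lines amount to something like the following prompt-assembly step, done before each new script is written. This is a hypothetical sketch of the idea; the function and section headers are invented for illustration:

```python
def build_prompt(task, successes, failures):
    """Inject accumulated experience into the prompt: scripts that
    worked, plus [FAILED] post-mortems the model must not repeat."""
    parts = ["Task: " + task]
    if successes:
        parts.append("Scripts that worked before:\n" + "\n---\n".join(successes))
    if failures:
        parts.append("Approaches that FAILED (do not repeat these):\n"
                     + "\n---\n".join(failures))
    return "\n\n".join(parts)
```

Paired with a system-prompt paragraph telling the model to treat the [FAILED] section as hard constraints, this is essentially the whole mechanism.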
This is inspired by the paper ExpeL: LLM Agents Are Experiential Learners, which proposes that agents learn from both successful and failed trajectories. The half that almost nobody implements is the failures.
The second time Nelson asks for something similar, Spark doesn't start from zero. Over time it accumulates a library built from real experience — not from what I anticipated it would need.
This is what I'm most proud of in the design. It's not just that it can write code. It's that it learns from its own successes and failures.

Mind blown
Why this changes the harness
In harness v1.0, Nelson had ~50 hardcoded tools. If it needed something outside those 50, it broke.
With Spark as part of harness v2.0, Nelson can ask Spark to write the script it needs in real time. Nelson's set of capabilities is no longer bounded by what I programmed beforehand.
| | Harness v1.0 | Harness v2.0 with Spark |
|---|---|---|
| Lines of code | ~16,000 | ~500 |
| Capabilities | ~50 fixed tools | Unlimited (in theory) |
| If something new fails | Gets stuck | Spark writes it |
| Maintenance | High | Very low |
Pros and cons
The real advantages:
- Nelson no longer gets stuck when asked for something new
- The code I maintain is 30x smaller
- Spark improves over time without my intervention
- Anyone with DIgSILENT can use it — not just me
The downside:
Spark spends tokens thinking and writing code. For now I use Gemini (I love Gemini 3.1 =D) and the cost per call is manageable. But if Nelson delegates to Spark frequently, the costs will keep adding up.
It's a trade-off I accept: I'd rather pay for tokens than maintain thousands of fragile lines. But I'll be measuring the spend to see if it's really worth it.
How to try it
Spark is open source. It's the second Don Nelson tool I'm making public — the first was the Infotecnica skill.
Why make it public? Two reasons.
One: I need real validation. If others use it and it works, it confirms the technical approach is on the right track. There's no better test than someone who isn't me running the code and reaching the same result.
Two: Spark isn't what makes Don Nelson unique. Spark is infrastructure — an agent that writes scripts for DIgSILENT. That's not a secret worth keeping.
You can clone it, fork it, contribute, or simply use it:
👉 github.com/valdivia-tech/spark
To install: clone the repo on a computer with DIgSILENT, add your Gemini API Key, and you're good to go.
What's next: integrating Spark into the full harness v2.0 and testing it with the real CEN study. The question I want to answer is whether Spark is reliable enough for a study that has to be approved — or if it still needs human supervision at every step.