Stelaris · mission log

2026-07-03 · LOG-005 · AIOS · One research fabric, many machines: how our compute scales

Research progress is bounded by how many experiments you can run, read, and act on. So beyond the science, we've been quietly building the distributed compute fabric that AIOS runs on — and it's designed so the same architecture that coordinates a few machines today coordinates a fleet tomorrow.

Today the fabric is three heterogeneous machines: a primary Linux training node with the most cores and the best cooling, and two Apple-silicon laptops that fill in around it. They run different operating systems on different chip architectures with different core counts, and the system treats them as one pool. As of this writing they've collectively run over 790,000 simulated transport episodes across nearly 3,000 training runs. The live counter on the front page is that number, read straight from the fabric.

The design is deliberately simple, because simple things scale:

Every node is autonomous. Each runs its own dispatcher daemon and its own local registry of what has run and what is queued. There is no single point that has to be up for work to happen. A node that goes offline doesn't stall the others.
Work is pinned, not guessed. An experiment batch declares which machine it belongs on, and it's submitted to that node. Each machine profiles itself — cores, memory, thermal headroom — and sets its own safe concurrency limit, so we never oversubscribe a laptop the way we'd load the Linux box.
Code is kept in lockstep. A scheduled daily sync pulls every node to the same reviewed mainline, and a hard rule — no experiment launches from code that hasn't been merged and verified — means a result is always traceable to an exact, reviewed commit. Reproducibility isn't a cleanup step; it's a precondition to running at all.
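
The self-profiling step is easy to picture in code. What follows is a minimal sketch under stated assumptions, not the fabric's actual daemon: the names `NodeProfile` and `safe_concurrency`, the memory budget per job, and the way thermal headroom scales the limit are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeProfile:
    """Hypothetical snapshot a node takes of itself at enrollment."""
    cores: int               # logical cores available
    memory_gb: float         # usable RAM
    thermal_headroom: float  # 0.0 (already throttling) to 1.0 (plenty of cooling)


def safe_concurrency(profile: NodeProfile, gb_per_job: float = 2.0) -> int:
    """Return how many experiments this node should run at once.

    Assumed policy: take the tighter of the core and memory limits,
    scale it by thermal headroom, and never go below one job.
    """
    by_cores = max(profile.cores - 1, 1)              # leave a core for the daemon
    by_memory = int(profile.memory_gb // gb_per_job)  # whole jobs that fit in RAM
    limit = min(by_cores, by_memory)
    return max(int(limit * profile.thermal_headroom), 1)


# Illustrative machines only; these are not the real nodes' specs.
linux_box = NodeProfile(cores=32, memory_gb=128.0, thermal_headroom=1.0)
laptop = NodeProfile(cores=8, memory_gb=16.0, thermal_headroom=0.5)
```

Because each node computes its own limit from its own profile, the laptop in this sketch ends up with a much lower ceiling than the Linux box without any central scheduler having to know either machine's details.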

Why build it this way for a company that's still pre-product? Because the shape of the problem doesn't change as it grows — only the scale does. Enrolling a new machine is the same operation whether it's a spare laptop or a rack of GPUs: install the daemon, let it profile itself, point batches at it. The heterogeneity we handle today — mixing an x86 Linux box with Apple-silicon laptops — is the same heterogeneity a growing team faces when it adds cloud instances, an on-prem cluster, and a dedicated training server. We're already running the multi-node case; scaling it is adding nodes, not rebuilding the system.

That's the enterprise story in one line: the research fabric is horizontal by construction. The same autonomous-node, self-profiling, pinned-work, reproducibility-first design that turns three personal machines into one research engine is the design that turns a hundred machines into one. And because AIOS is a thin shell over model-driven decisions, adding compute doesn't just mean more runs in parallel — it means the machine that plans, executes, and reads those runs gets more room to work. Compute and cognition scale together.

We built the fabric to remove our own bottleneck: the loop between an idea and the result that tells us whether it worked. Everything downstream — the science, the product — moves at the speed of that loop.

2026-07-01 · LOG-004 · GSP · Teaching a swarm to predict: the GSP head comes alive

Our central hypothesis is that prediction is the mechanism behind good coordination. A robot that can anticipate where the shared payload — and the rest of the swarm — is about to be should coordinate better, and should generalize to objects it has never seen. We call this line of work GSP: Global State Prediction — the subject of the founder's doctoral dissertation. A small network head learns to predict something about the world's state, and that prediction is fed to the controller that chooses actions.

For a long time, the prediction head simply didn't work. Trained inside the full system, it produced predictions that correlated with reality at roughly 0.01 — statistically indistinguishable from noise. The easy conclusion was that the task was too hard, or the idea was wrong. We didn't accept that, because the theory was sound; the question was whether we had the engineering right.

So we built a ceiling ladder to find out where the signal was being lost. We measured, in order: what the online head actually achieved (~0.01), what an offline model could extract from the same inputs (~0.13), what was achievable from the truly observable quantity (~0.89), and what a perfect predictor would reach (~0.998). The gap between 0.01 and 0.89 told us the problem was not the task — the signal was there — but our plumbing.
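
The ladder can be written down as a sketch. The rung values are the ones reported in this entry; the use of Pearson correlation and the helper names (`pearson`, `largest_gap`) are assumptions for illustration, since the entry does not say which correlation measure was used.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant signal carries no correlation
    return cov / math.sqrt(var_x * var_y)


# Rungs in the order measured; values are the ones reported in this entry.
LADDER = [
    ("online head", 0.01),
    ("offline model, same inputs", 0.13),
    ("truly observable quantity", 0.89),
    ("perfect predictor", 0.998),
]


def largest_gap(ladder):
    """Return the adjacent pair of rungs with the biggest jump in correlation."""
    pairs = zip(ladder, ladder[1:])
    return max(pairs, key=lambda pair: pair[1][1] - pair[0][1])


lower, upper = largest_gap(LADDER)
```

Here `lower` and `upper` bracket the step where most of the signal goes missing. Each rung would be scored with something like `pearson(predictions, targets)` on held-out data.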

Chasing that gap down, we found three concrete bugs, each upstream of the last:

the head wasn't actually receiving the observation it needed;
a key quantity was being computed in the wrong reference frame, scrambling it;
and a bearing was measured relative to the wrong point entirely.

We also found the head was being fed exploration noise it should never have seen, and that the feature it needed — a wrap-safe, world-frame change in bearing to the object — had to be handed to it explicitly rather than left to be discovered. Neural networks are not magic; if a quantity involves a discontinuity like the ±180° wrap-around, you encode it, you don't hope.
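
A wrap-safe change in bearing is a few lines of code, which is exactly why it is worth handing to the network rather than hoping it is learned. The sketch below is one standard way to do it, assuming 2D positions in a shared world frame and angles in radians; the function names are illustrative, not taken from our codebase.

```python
import math


def wrap_to_pi(angle: float) -> float:
    """Map any angle in radians into the interval [-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))


def bearing_to(observer_xy, target_xy) -> float:
    """World-frame bearing from observer to target, in radians."""
    dx = target_xy[0] - observer_xy[0]
    dy = target_xy[1] - observer_xy[1]
    return math.atan2(dy, dx)


def bearing_change(prev_bearing: float, curr_bearing: float) -> float:
    """Signed change in bearing that stays small across the ±180° seam."""
    return wrap_to_pi(curr_bearing - prev_bearing)


# An object drifting across the seam: 179° on one step, -179° on the next.
prev = math.radians(179.0)
curr = math.radians(-179.0)
naive = math.degrees(curr - prev)                # about -358°: a fake jump
safe = math.degrees(bearing_change(prev, curr))  # about +2°: the real motion
```

The naive difference reports a near-full rotation for what is physically a two-degree change, which is the kind of discontinuity that scrambles a learned predictor.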

With the reference frame fixed, the noise removed, and the right feature provided, the head came alive: within-run prediction correlation of about 0.85, reproduced across six independent runs. The first working GSP prediction head in the campaign.

That is a real milestone, and it is not the finish line. A head that predicts accurately is necessary but not sufficient — the question that actually matters for the company is whether a predicting swarm coordinates better than a non-predicting one on task success. That experiment is running now: the predicting head against an identical baseline, paired seed for seed, judged on how often the swarm actually completes the transport.

**Task success over training, three random seeds.** Both the baseline (IC, grey) and the predicting head (GSP-N, teal) reach near-perfect success early; the open challenge is late-training instability. GSP-N tends to hold a steadier, higher success floor — clearest on seeds s99 and s200, near-even on s13. Because the predicting head trains roughly twice as slowly, its curves end earlier, so the matched-length head-to-head is still resolving.

The early read is honest and unfinished. Both learners climb to near-perfect success, then both wrestle with the late-training instability that is the real open problem in this task. The predicting head tends to hold a steadier floor — but it has not yet cleared the baseline by a margin larger than the seed-to-seed noise, and it is still training. We'll report the matured result here whichever way it lands. And if predicting the object's current state isn't enough, the next move is already clear: predict its future state. The mechanism stays; only the target changes.

2026-06-28 · LOG-003 · AIOS · AIOS: building the machine that runs the research

A research program is only as fast as its slowest loop: propose an idea, run the experiment, read the result, decide what's next. For a small team chasing a hard problem, that loop is the real bottleneck. So alongside the science, we've been building the machine that runs it — an agentic operating system we call AIOS.

The design principle is a thin shell: write the minimum code needed to manage lifecycle, plumbing, and safety, and let the model do the actual work. Every decision that requires judgment — what to investigate, how to interpret a result, what to try next — goes through a model call, not a hand-written rule. The code's only job is to create the conditions for the model to work well and safely.

Concretely, AIOS runs a small hierarchy. A Principal owns the plan and every irreversible decision: what work to do, whether to merge, how to respond to a surprise. Below it, cheaper junior agents execute narrowly scoped tickets (write this function, run this analysis, triage these files) and report back. Sensors report, the Principal orchestrates and decides, agents execute. No agent below the Principal makes autonomous calls about direction.

Around that sits the safety envelope that makes autonomy tolerable:

Golden gates — a ticket only counts as done when it passes explicit, pre-defined checks, so a confident-but-wrong agent can't merge broken work.
Hard spend caps — every run has a budget it cannot exceed.
A provider seam — juniors can be driven by different model backends, so we can route the cheapest capable model to each task and swap providers without touching the logic.
Anti-stale-data guards — the harness is being hardened specifically against a failure class we hit ourselves: acting on a cached or mislabeled result instead of the real one. When a near-miss happens, we turn it into a permanent check rather than a lesson we have to relearn.
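
Two of these guards, golden gates and hard spend caps, reduce to very little code. The following is a minimal sketch under assumptions: `Run`, `BudgetExceeded`, and `ticket_is_done` are hypothetical names, and the dollar-denominated budget is an illustrative choice rather than a description of how AIOS actually meters spend.

```python
from dataclasses import dataclass
from typing import Callable, Dict


class BudgetExceeded(RuntimeError):
    """Raised when a run would spend past its hard cap."""


@dataclass
class Run:
    budget_usd: float
    spent_usd: float = 0.0

    def charge(self, cost_usd: float) -> None:
        # Refuse the charge before it happens, so the cap is never crossed.
        if self.spent_usd + cost_usd > self.budget_usd:
            raise BudgetExceeded(f"cap of {self.budget_usd} would be exceeded")
        self.spent_usd += cost_usd


def ticket_is_done(artifact: str, gates: Dict[str, Callable[[str], bool]]) -> bool:
    """A ticket counts as done only if every pre-defined gate passes.

    An empty gate set never passes: "no checks" must not mean "done".
    """
    return bool(gates) and all(check(artifact) for check in gates.values())


# Hypothetical gates for a "write this function" ticket.
gates = {
    "non_empty": lambda text: len(text.strip()) > 0,
    "defines_function": lambda text: "def " in text,
}
```

The point of the design is that the agent's own confidence never appears in either check: `ticket_is_done("looks great, trust me", gates)` is false no matter how the claim is phrased.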

Why does a space-logistics company build an agentic OS? Because the two problems rhyme. Coordinating a swarm of robots to move a payload and coordinating a swarm of agents to move a research program are both about getting many limited actors to combine into something none could do alone — under real constraints, with safety that can't be optional. AIOS is how we compound: every result makes the machine that produces the next result a little sharper. This log itself is published by that machine.

2026-06-20 · LOG-002 · METHOD · Stabilizing the learner: from divergence to a baseline we trust

Before you can ask whether a clever idea helps, you need a learner that behaves the same way twice. For a while, ours didn't. On the hardest version of our transport task, where robots must first pass through a "gate" before coordinating on the payload, training would sometimes climb to a high success rate and then collapse, falling apart late in a run for no obvious reason. A result that is good one moment and gone the next is worse than a bad result: it's untrustworthy.

Rather than paper over it, we traced the collapse to its root. The signature was a classic failure mode of value-based reinforcement learning: Q-value divergence. The network's estimate of how good a state is was inflating without bound, driven by a combination of an unclipped loss, a very long-horizon discount, and the usual overestimation bias of the algorithm. Once those estimates blow up, the policy built on top of them falls apart.

The fix was a small set of well-understood stabilizers applied deliberately:

a Huber loss instead of raw squared error, so large surprises don't produce enormous gradients;
gradient clipping, to bound how far any single update can move the network;
clipping the bootstrapped target values themselves, so a runaway estimate can't feed back into the next update and compound;
scaling the reward into a smaller range, which keeps the value magnitudes bounded from the start.
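
The four stabilizers above can be sketched in plain Python to show what each one bounds. This is a framework-free illustration, not our training code: the specific values (`delta`, `reward_scale`, `q_bound`, `max_norm`) are placeholder assumptions, and a real implementation would use the equivalents in a deep learning library.

```python
import math


def huber(td_error: float, delta: float = 1.0) -> float:
    """Quadratic near zero, linear beyond delta, so the gradient is bounded."""
    a = abs(td_error)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)


def clip(x: float, lo: float, hi: float) -> float:
    """Clamp x into the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def td_target(reward, next_q_max, done, gamma, reward_scale=0.1, q_bound=10.0):
    """Bootstrapped target with reward scaling and target clipping.

    gamma is left untouched; only the magnitudes are bounded.
    """
    scaled_reward = reward * reward_scale
    bootstrap = 0.0 if done else gamma * next_q_max
    return clip(scaled_reward + bootstrap, -q_bound, q_bound)


def clip_grad_norm(grads, max_norm=1.0):
    """Rescale a gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    return [g * max_norm / norm for g in grads]
```

In this sketch, a runaway next-state estimate such as `td_target(1.0, 1e6, False, 0.99)` is capped at `q_bound` instead of feeding back into the next update, which is the compounding loop the target clip is meant to break.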

Notably, we did not have to shorten the horizon — the discount factor stayed exactly the same in the stabilized runs. The fix wasn't to make the problem easier; it was to keep the value estimates from exploding while leaving the task untouched.

With those in place, the collapse stopped and runs became repeatable. The picture below is the whole point of this entry — the same task at the same discount, with and without that stabilizer set:

**Q-value divergence, before and after.** Rolling 10-episode task success. Without the stabilizers (coral: default squared-error loss, no clipping, no reward scaling), training climbs to 100% and then collapses to zero as the value estimates diverge. With the stabilizer set (teal: Huber loss, gradient and target clipping, reward scaling), the same task at the same discount holds a stable success rate: lower and noisier than the fleeting peak, but a floor you can actually build on. The collapse doesn't just get smaller; it disappears.
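
For readers who want the plotted metric spelled out, a rolling 10-episode success rate can be computed as in the sketch below. One assumption is labeled here: for the first few episodes, before a full window exists, the rate is taken over however many episodes have been seen so far.

```python
from collections import deque


def rolling_success(outcomes, window=10):
    """Success rate over the most recent `window` episodes, one value per episode."""
    recent = deque(maxlen=window)  # old outcomes fall off automatically
    rates = []
    for succeeded in outcomes:
        recent.append(1.0 if succeeded else 0.0)
        rates.append(sum(recent) / len(recent))
    return rates
```

A collapse shows up in this metric as a run of values near 1.0 followed by a slide to 0.0 over roughly one window length.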

That mattered for a subtle reason beyond stability: the collapse had been silently invalidating comparisons. When we'd previously ranked ideas by their single best checkpoint, we were often ranking noise from a diverging run. A stable learner is the precondition for honest science.

Out of this came the methodology we now hold ourselves to:

Baseline-and-improve. The current pipeline is "version zero" — the number to beat. Nothing gets called an improvement until it clears a hard stability gate and beats the baseline across multiple random seeds, not just one lucky one.
A ceiling ladder. Before trusting that a component learned something, we measure what the best possible version of that signal would look like, and compare. It keeps us from celebrating a result that's actually far below what the task allows.
Diagnostics always on. We log a component's inputs, outputs, gradients, and correlations to its target from the first run — because you cannot retrofit that history onto an experiment after the fact.
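
The baseline-and-improve rule can be expressed as a small gate. This is a sketch under assumptions, not our actual acceptance code: the stability floor, the minimum seed count, and the requirement that the candidate win on every shared seed are illustrative choices, and the function names are hypothetical.

```python
from statistics import mean


def passes_stability_gate(late_success, floor=0.5):
    """Hypothetical gate: late-training success never drops below `floor`."""
    return len(late_success) > 0 and min(late_success) >= floor


def is_improvement(baseline, candidate, min_seeds=3, floor=0.5):
    """Compare late-training success, paired seed for seed.

    `baseline` and `candidate` map seed -> list of late-training success rates.
    """
    seeds = sorted(set(baseline) & set(candidate))
    if len(seeds) < min_seeds:
        return False  # one lucky seed is not evidence
    if not all(passes_stability_gate(candidate[s], floor) for s in seeds):
        return False  # a collapsing run cannot count as a win
    return all(mean(candidate[s]) > mean(baseline[s]) for s in seeds)
```

Note the order of the checks: stability is tested before performance, so a run that peaks high and then collapses is rejected even if its average happens to beat the baseline.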

None of this is glamorous. It is what makes the next entries in this log, where we start adding prediction to the swarm, mean something.

2026-04-07 · LOG-001 · FOUNDATIONS · Why we're building the swarm before the satellite

Stelaris starts from a research bet made years before the company: that the hard part of moving things in space is not the thruster or the gripper — it's the coordination. Getting many simple robots to jointly grab, carry, and place an object that none of them could move alone is a control problem, not a hardware problem. Solve the control problem and the hardware becomes an engineering exercise. Fail to solve it and no amount of hardware helps.

That bet came out of doctoral work in multi-agent reinforcement learning for collective transport — swarms of robots learning to push and carry payloads together. The setting is deceptively simple: a handful of agents, a shared object, a goal location. The difficulty is that every robot's best action depends on what every other robot is about to do, and none of them can see the whole picture. Coordination has to emerge from local information.

We study this in high-fidelity simulation before touching space hardware, for the same reason aircraft are designed in wind tunnels first. Our simulator models the physics of multiple robots contacting and transporting a rigid body, and we train controllers against it with deep reinforcement learning. Every idea earns its place by moving a number — task success rate — not by sounding plausible.

Two principles have shaped everything since:

The algorithm is the moat. Launch costs are collapsing and hardware is commoditizing. What does not commoditize is a swarm that can coordinate to move arbitrary objects it was never trained on. That capability is the asset that makes the rest of the company credible.
Science first, honestly. We hold ourselves to a hard rule: a claim that some component contributes to a behavior requires an ablation that disables it and shows the behavior degrade. "It works" is never evidence that any particular piece is doing the work. This discipline is slower, and it is the reason we trust our own results.

The manifesto frames the company as the "FedEx of space" — a shared fleet that repositions payloads instead of every payload flying its own propulsion. This log is where we'll record how the coordination engine behind that vision actually gets built: the wins, the dead ends, and the honest state of the science. This first entry is the foundation. Everything after it is a brick in that wall.