
Zero training, sub-second reactions (~400ms).

Meadow Mind completes reaction tasks at a 0.4-second decision speed, with no RL training and no reward.

engine MeadowCoder-7B (8-bit)
hardware Apple M1 Max, on-device
envs official Gymnasium, untouched physics
training 0 samples, 0 reward
0 samples: no environment data collected
0 training: no RL, no reward, no gradients
generalizes: >5 games solved with zero training, growing
<1s: decision latency (~400ms)

Four-layer architecture

You only prepare the perception and the rule; the decision core is packaged. import meadow_mind and go.

① Perceiver
Raw observations become one sentence. e.g. Status: DROPPING. Iron rule: always include a velocity or trend term.
② Rule
The policy is one sentence of natural language. Change behavior by editing words. Zero training.
③ Mind
A 7B model reads the rule and the situation and picks an action in real time. Fixed ~0.4s latency, independent of answer length.
④ Actuator
Action letters map to env actions.
e.g. C → fire main engine
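A minimal sketch of how the four layers might chain together in one decision step. It does not use the real meadow_mind package: `perceive`, `RULE`, `stub_mind`, `OPTIONS` and `decide` are hypothetical names, and `stub_mind` is a plain keyword lookup standing in for the 7B model. The letter mapping assumes LunarLander's action order (0 coast, 1 left engine, 2 main engine, 3 right engine).

```python
# Hypothetical sketch of the four layers; none of these names come from meadow_mind.

# (4) Actuator: option letters map to LunarLander action ids.
OPTIONS = {"A": 0, "B": 1, "C": 2, "D": 3}  # coast, left, main, right

# (2) Rule: the policy as one sentence (wording simplified for the stub below).
RULE = "DROPPING: C. STABLE: A. LANDED: A."


def perceive(vy, legs_down):
    """(1) Perceiver: collapse raw numbers into one Status keyword.

    The -0.1 threshold mirrors the cushioning line shown on this page;
    the rest is simplified for illustration.
    """
    if vy < -0.1:
        return "Status: DROPPING."
    return "Status: LANDED." if legs_down else "Status: STABLE."


def stub_mind(rule, situation):
    """(3) Mind stand-in: look the status keyword up in the rule sentence.

    The real Mind is a language model; this lookup only makes the sketch runnable.
    """
    keyword = situation.replace("Status: ", "").rstrip(".")
    for clause in rule.split("."):
        name, _, letter = clause.strip().partition(": ")
        if name == keyword:
            return letter
    return "A"  # unknown status: coast


def decide(vy, legs_down):
    """One decision step: perceive, consult the rule, actuate."""
    return OPTIONS[stub_mind(RULE, perceive(vy, legs_down))]
```

Changing behavior in this sketch means editing `RULE` or the wording produced by `perceive`; the lookup and the letter mapping stay untouched.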

How to use

Five steps to wire a task into Meadow Mind and run it with zero training.

1. Understand the task, explore input-output
Observe the variables, actions, win and lose conditions (the reaction deadline must be looser than 0.4s); list every action and watch what result it produces, e.g. "push right → the pole gets caught".
2. Build perception words
Describe the current situation in one sentence so the Mind knows what is happening. e.g. turn angle 0.13, spin 0.9 into "the pole tilts right, spinning fast". Buckets are enough: small/big, fast/slow.
3. Imprint the rule
Invert outputs and inputs into a rule: on situation X do action B. e.g. "tilting right → push right", "DROPPING → fire the main engine". One sentence is the policy.
4. Decide on memory
Ask one question: "is revisiting the same state a failure signal?" Yes (maze, exploration, dead ends) → Task(memory=True); the task is about maintaining a state (balance, landing, tracking) → keep it off, repetition is the job. The runner also hints when it detects looping.
5. Take the exam
Before playing, give the Mind a written exam: each item = a situation + the expected answer ("tilting right + fast spin" expects "push right"). mind.check(task) asks item by item; CartPole passed 7 of 8 (one miss allowed), landing 5/5, maze 7/7, MountainCar 3/3. A failed exam needs no training: the perception sentence is usually incomplete; rephrase and re-check.
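The written exam in step 5 can be pictured as a small table of situations and expected letters. The sketch below is an assumption about the shape of such a check, not the implementation of mind.check: `ExamItem`, `run_exam`, `LETTERS` and `stub_answer` are hypothetical, the maze items are made up for illustration, and `stub_answer` applies the maze rule mechanically in place of the model.

```python
import re
from typing import Callable, List, NamedTuple, Tuple

# Hypothetical option letters for the four maze directions.
LETTERS = {"left": "A", "down": "B", "right": "C", "up": "D"}


class ExamItem(NamedTuple):
    situation: str  # perception sentence shown to the Mind
    expected: str   # option letter the rule should produce


def run_exam(answer: Callable[[str], str], items: List[ExamItem],
             misses_allowed: int = 1) -> Tuple[bool, List[ExamItem]]:
    """Ask every item once; pass if the misses stay within the allowance."""
    failed = [item for item in items if answer(item.situation) != item.expected]
    return len(failed) <= misses_allowed, failed


ITEMS = [
    ExamItem("Primary: down (safe). Secondary: right (hole).", "B"),
    ExamItem("Primary: down (hole). Secondary: right (safe).", "C"),
    ExamItem("Primary: right (blocked). Secondary: down (safe).", "B"),
    ExamItem("Primary: right (safe). Secondary: down (safe).", "C"),
]


def stub_answer(situation: str) -> str:
    """Stand-in for the model: take primary if safe, otherwise secondary."""
    found = re.findall(r"(?:Primary|Secondary): (\w+) \((\w+)\)", situation)
    (first, first_status), (second, _) = found
    return LETTERS[first if first_status == "safe" else second]


passed, failed = run_exam(stub_answer, ITEMS)
```

When the exam fails, the `failed` list names the situations whose sentences need rephrasing, which matches the "rephrase and re-check" loop described in step 5.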
Or hand all five steps to an AI
Humans do not have to do this by hand: paste the prompt below plus your game description into any code agent (Meadow CLI, Claude, Cursor) and it wires the task for you. You only review the final exam score.
You are a Meadow Mind task integration engineer. Given a game's observation and
action description, produce:
1) perceive(obs): translate numeric observations into one English situation
   sentence. Bucket values (small/big, fast/slow), always include a velocity or
   trend term, arbitrate multi-objective states into ONE uppercase Status keyword.
2) rule: one English sentence, a one-layer mapping from status keywords to
   option letters (no nesting).
3) options: multiple choice (A=..., B=...) mapped to env actions. No free-form.
4) Decide on memory: "is revisiting the same state a failure signal?"
   Yes (maze/exploration/dead ends) → Task(memory=True), write perceive(obs, task),
   use task.seen(key) to annotate (safe, already visited), and add
   "prefer unvisited directions" to the rule. Regulation tasks (balance/landing):
   keep it OFF — annotations measurably hurt. Unsure → off; the runner hints on loops.
5) sanity: enumerate every situation with its expected letter (the exam;
   include annotated situations if memory is on).
Then run mind.check(task), one miss allowed; on failure only rephrase, never
touch the model.

The four games, fully decomposed

Every frame in every video corresponds to one real model decision. No scripted policy, no edited speed-ups.

Balance

CartPole-v1
400/400 perfect · 0.35s / step · sanity 7/8
obs
[cart pos, cart vel, pole angle θ, angular vel θ̇], 4-dim
actions
2: push left, push right
win/lose
|θ| over 12° or cart position beyond ±2.4 loses; solve bar 195 steps (the CartPole-v0 threshold; CartPole-v1 registers 475)
effects
push right moves the pivot under the mass, catching the pole
RULE (THIS SENTENCE IS THE POLICY) Spin fast: push toward the spin. Spin slow: push toward the tilt.
The language version of the classic θ+θ̇ policy. Watching tilt without spin oscillates to death; the velocity term is an iron rule.
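def perceive(obs):
    # obs = [cart pos, cart vel, pole angle θ, angular vel θ̇]; only the pole terms are used
    th, thv = obs[2], obs[3]
    tilt  = "right" if th  > 0 else "left"
    spin  = "right" if thv > 0 else "left"
    # "fast" when the spin outweighs the tilt, so the rule follows whichever term dominates θ+θ̇
    speed = "fast" if abs(thv) > abs(th) else "slow"
    return f"The pole tilts {tilt}. The spin is {spin}, {speed} spin."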
400 steps, perfect. Turn-based: each frame advances only after one real decision.
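The claim that the rule is the language version of the θ+θ̇ policy can be checked numerically. The sketch below repeats the perceiver so it runs on its own, applies the rule with a hypothetical stand-in (`rule_action`, not the model), and compares it with a numeric baseline (`classic_action`, also an illustrative name) on a small hand-picked grid. The grid avoids exact ties between |θ| and |θ̇|, where the two could disagree.

```python
import itertools


def perceive(obs):
    """The CartPole perceiver from this section, repeated so the sketch runs alone."""
    th, thv = obs[2], obs[3]
    tilt = "right" if th > 0 else "left"
    spin = "right" if thv > 0 else "left"
    speed = "fast" if abs(thv) > abs(th) else "slow"
    return f"The pole tilts {tilt}. The spin is {spin}, {speed} spin."


def rule_action(sentence):
    """Stand-in for the Mind: apply the rule literally (0 = push left, 1 = push right)."""
    tilt = "right" if "tilts right" in sentence else "left"
    spin = "right" if "spin is right" in sentence else "left"
    side = spin if "fast spin" in sentence else tilt
    return 1 if side == "right" else 0


def classic_action(obs):
    """Numeric baseline: push right when theta + theta_dot is positive."""
    return 1 if obs[2] + obs[3] > 0 else 0


angles = [-0.2, -0.1, -0.05, 0.03, 0.13]
spins = [-1.5, -0.9, -0.02, 0.04, 0.9]
grid = [[0.0, 0.0, th, thv] for th, thv in itertools.product(angles, spins)]
agree = all(rule_action(perceive(o)) == classic_action(o) for o in grid)
```

The two agree because the sign of θ+θ̇ is the sign of whichever term has the larger magnitude, which is what the fast/slow bucket encodes.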

Landing

LunarLander-v3
+251 safe landing · 0.45s / step · sanity 5/5
Main engine brakes, side engines trim attitude, line up the pad, cushion the touchdown. 178 real decisions.
obs
[x, y, vx, vy, angle, angular vel, leg1, leg2], 8-dim
actions
4: coast, left engine, main engine, right engine
win/lose
crash −100; solve bar +200
effects
main engine slows the fall; side engines turn; the perceiver arbitrates multiple goals
RULE DROPPING: fire the main engine. TURN-LEFT / TURN-RIGHT: side engines. STABLE / LANDED: do nothing.
if leg1 or leg2:
    if vy < -0.1:
        return "Status: DROPPING."  # touched but sinking: keep cushioning
    return "Status: LANDED."
Outcome feedback in action: the first flight crashed at +27.5; the trace showed control stopped at first leg contact. One cushioning line in the perceiver later, the second flight landed at +251. One line, ten seconds, no reward.

Maze

FrozenLake-v1 8×8
GOAL in 14 steps, shortest path · 0.36s / step · sanity 7/7
Top-left to bottom-right, around all 10 holes, in exactly the theoretical shortest path.
obs
cell id 0-63; S start, F ice, H hole, G goal
actions
left, down, right, up (deterministic)
win/lose
step on H dies, reach G wins
RULE Take primary if safe; hole or wall → secondary; both bad → the escape direction; never enter a hole.
The perceiver provides global awareness: the bearing toward the goal becomes primary and secondary candidate directions, each annotated safe / hole / blocked.
"Primary: down (safe). Secondary: right (hole)."
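A sketch of how such a perceiver could be written for the default 8×8 map. The tie-breaking choice (the axis with more distance left becomes primary, ties go vertical) is an assumption, as are the names `MAP_8X8`, `MOVES`, `annotate` and `perceive`; the escape direction mentioned in the rule is left out to keep the sketch short.

```python
# Default 8x8 FrozenLake layout (S start, F ice, H hole, G goal).
MAP_8X8 = [
    "SFFFFFFF",
    "FFFFFFFF",
    "FFFHFFFF",
    "FFFFFHFF",
    "FFFHFFFF",
    "FHHFFFHF",
    "FHFFHFHF",
    "FFFHFFFG",
]
MOVES = {"left": (0, -1), "down": (1, 0), "right": (0, 1), "up": (-1, 0)}
GOAL = (7, 7)


def annotate(row, col, direction):
    """Label the neighbouring cell in `direction` as safe, hole or blocked."""
    dr, dc = MOVES[direction]
    r, c = row + dr, col + dc
    if not (0 <= r < 8 and 0 <= c < 8):
        return "blocked"  # off the grid: a wall
    return "hole" if MAP_8X8[r][c] == "H" else "safe"


def perceive(cell):
    """Turn a cell id (0-63) into primary/secondary candidates toward the goal."""
    row, col = divmod(cell, 8)
    vertical = "down" if GOAL[0] >= row else "up"
    horizontal = "right" if GOAL[1] >= col else "left"
    # Assumption: the axis with more distance left is primary; ties go vertical.
    if abs(GOAL[0] - row) >= abs(GOAL[1] - col):
        primary, secondary = vertical, horizontal
    else:
        primary, secondary = horizontal, vertical
    return (f"Primary: {primary} ({annotate(row, col, primary)}). "
            f"Secondary: {secondary} ({annotate(row, col, secondary)}).")
```

Under these assumptions, cell 18 (row 2, column 2) sits directly left of a hole and produces the sentence quoted above.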

Momentum

MountainCar-v0
flag in 103 steps · 0.37s / step · sanity 3/3
obs
[position, velocity], 2-dim
actions
3: push left, coast, push right
win/lose
reach the flag within 200 steps; the engine is weaker than gravity, driving straight up fails
effects
pushing with the motion pumps energy into the system (swing principle)
RULE (A COUNTERINTUITIVE STRATEGY, ONE SENTENCE) Push in the direction of motion, pumping energy like a swing; when still, push left to start.
MountainCar is RL's classic sparse-reward problem: you must move away from the goal to reach it. RL discovers this by exploration; here it is written straight into the rule.
rule = ("Rule: push in the same direction the car is moving, "
        "to pump energy like a swing. If not moving, push left.")
Two swings back and forth to build energy, up the hill in 103 steps (limit 200).
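The swing principle can be tried without the model or Gymnasium. The sketch below restates the MountainCar-v0 update as commonly documented (force 0.001, gravity 0.0025, speed limit 0.07, flag at 0.5) rather than importing it, and replaces the Mind with a hypothetical `stub_mind` that applies the rule mechanically. The names `perceive`, `step` and `run` and the start position of -0.4 are illustrative choices, so step counts from this sketch are not comparable to the run reported above.

```python
import math

ACTIONS = {"A": 0, "B": 1, "C": 2}  # push left, coast, push right


def perceive(velocity):
    """One-sentence situation built around the velocity term."""
    if velocity > 0:
        return "The car is moving right."
    if velocity < 0:
        return "The car is moving left."
    return "The car is not moving."


def stub_mind(situation):
    """Stand-in for the model: push with the motion, push left when still."""
    return "C" if "moving right" in situation else "A"


def step(position, velocity, action):
    """One MountainCar-v0 style update (restated here, not imported)."""
    velocity += (action - 1) * 0.001 - 0.0025 * math.cos(3 * position)
    velocity = max(-0.07, min(0.07, velocity))
    position = max(-1.2, min(0.6, position + velocity))
    if position == -1.2 and velocity < 0:
        velocity = 0.0  # inelastic left wall
    return position, velocity


def run(position=-0.4, limit=200):
    """Play one episode from rest; return (steps used, final position)."""
    velocity = 0.0
    for steps in range(1, limit + 1):
        action = ACTIONS[stub_mind(perceive(velocity))]
        position, velocity = step(position, velocity, action)
        if position >= 0.5:
            return steps, position
    return limit, position


steps, final_position = run()
```

Each push in the direction of motion adds a little energy, so the car first backs up the left slope and then carries enough speed to clear the right one.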

Meadow Mind's memory at work

Without memory, a dead end means pacing at its mouth forever.
Add a memory cue to the perception sentence, and it backs out and routes around.

Dead-end maze

custom funnel trap · FrozenLake 8×8
left: no memory, stuck ✗ · right: with memory, GOAL in 22 steps ✓ · the only difference is 5 words
A funnel forces both runs into the same pocket dead end (holes below, left and right). The left side paces forever; the right side struggles twice, backs out, and detours to the goal.
Built-in memory switch Task(memory=True). Visited states accumulate automatically; annotate the perception sentence with task.seen(). No model changes, no training. Off by default; the runner hints when it detects looping.
st = "safe"
if cell in visited:
    st = "safe, already visited"
RULE (ONE EDITED SENTENCE) Prefer unvisited directions; if primary was visited take secondary; otherwise the alternative.
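A minimal sketch of the bookkeeping behind the memory cue. `MemoryTask` is a hypothetical stand-in that only mimics the behaviour described here (visited states accumulate, `seen(key)` reports a revisit); it is not the meadow_mind Task class, and `visit` and `annotate` are illustrative names.

```python
class MemoryTask:
    """Hypothetical stand-in for a task with memory switched on."""

    def __init__(self):
        self.visited = set()

    def seen(self, key):
        """True if this state key was visited earlier in the episode."""
        return key in self.visited

    def visit(self, key):
        """Record a state; assumed here to happen once per step."""
        self.visited.add(key)


def annotate(task, cell, is_hole):
    """Status word for a neighbouring cell, with the memory cue appended."""
    if is_hole:
        return "hole"
    return "safe, already visited" if task.seen(cell) else "safe"


task = MemoryTask()
before = annotate(task, 42, is_hole=False)  # first encounter
task.visit(42)
after = annotate(task, 42, is_hole=False)   # same cell on a later step
```

The only change the Mind sees is the extra words in the perception sentence; the rule then tells it to prefer directions without that cue.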

Why no reward

| | RL (PPO / SAC) | Meadow Mind |
| policy source | reward engineering + training (hours to days) | one sentence (seconds) |
| samples | 10⁵ to 10⁷ env steps | 0 |
| changing behavior | retrain | edit words |
| interpretability | black-box weights | rule and every decision are readable |
| decision latency | 0.1 to 1 ms | ~0.4 s (honest weakness) |
| continuous precision, high-rate control | strong | discrete multiple choice (honest weakness) |

RL needs reward because the policy hides inside weights and can only be carved by a scalar signal. Meadow Mind's policy is a readable sentence: env scores are just report cards and never enter the decision loop. Reward is replaced by outcome feedback: the episode trace points at the wrong sentence, and you edit it. LunarLander went from crash to landing with one ten-second cushioning line.

Honest limits: the reaction floor is ~0.4s (~2.5Hz); tighter deadlines (a 1-meter pole, Pong trajectory prediction) are out of reach today; the perceiver is human-designed, a deliberate division of labour in which the human teaches and the model decides. Next version: layered perception that acts as soon as confidence crosses a threshold, targeting ~0.15s.