Meadow Mind v0.1.0: zero training, sub-second reactions

Four-layer architecture

You only prepare the perception and the rule; the decision core is packaged. import meadow_mind and go.

① Perceiver

Raw observations become one sentence, e.g. "Status: DROPPING." Iron rule: always include a velocity or trend term.

② Rule

The policy is one sentence of natural language. Change behavior by editing words. Zero training.

③ Mind

A 7B model reads the rule and the situation and picks an action in real time. Fixed ~0.4s latency, independent of answer length.

④ Actuator

Action letters map to env actions.
e.g. C → fire main engine
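A minimal sketch of how the four layers hand off to each other, assuming nothing about the real meadow_mind API. The names build_prompt, stub_mind and step, the prompt layout, and the -0.1 threshold are all hypothetical placeholders. The letter order A to D is inferred from the order of the LunarLander action list; only C → main engine is stated by the source.

```python
from typing import Callable, Dict, Sequence

# Actuator table. Assumption: letters follow the env's action order
# (0 coast, 1 left engine, 2 main engine, 3 right engine).
LETTER_TO_ACTION: Dict[str, int] = {"A": 0, "B": 1, "C": 2, "D": 3}

RULE = ("DROPPING: fire the main engine. TURN-LEFT / TURN-RIGHT: side engines. "
        "STABLE / LANDED: do nothing.")
OPTIONS = "A=coast, B=left engine, C=main engine, D=right engine"


def perceive(obs: Sequence[float]) -> str:
    """Layer 1: collapse raw numbers into one sentence with a trend term."""
    vy = obs[3]  # vertical velocity; negative means falling
    # -0.1 is a placeholder threshold, not a tuned value.
    return "Status: DROPPING." if vy < -0.1 else "Status: STABLE."


def build_prompt(rule: str, situation: str, options: str) -> str:
    """Layers 2 and 3: the rule, the situation and the multiple choice."""
    return f"Rule: {rule}\nSituation: {situation}\nOptions: {options}\nAnswer:"


def stub_mind(prompt: str) -> str:
    """Hypothetical stand-in for the packaged decision core (no model here)."""
    situation = prompt.split("Situation: ")[1].split("\n")[0]
    return "C" if "DROPPING" in situation else "A"


def step(obs: Sequence[float], mind: Callable[[str], str] = stub_mind) -> int:
    """Layer 4: one decision, perceive -> mind -> actuator."""
    letter = mind(build_prompt(RULE, perceive(obs), OPTIONS))
    return LETTER_TO_ACTION[letter]


falling = [0.0, 1.0, 0.0, -0.5, 0.0, 0.0, 0, 0]
print(step(falling))  # action index for the main engine under the assumed table
```

Swapping stub_mind for a real model call is the only change the loop would need; the perceiver and the actuator table stay plain Python.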

How to use

Five steps to wire a task into Meadow Mind and run it with zero training.

1. Understand the task, explore input-output

Observe the variables, actions, win and lose conditions (the reaction deadline must be looser than 0.4s); list every action and watch what result it produces, e.g. "push right → the pole gets caught".

2. Build perception words

Describe the current situation in one sentence so the Mind knows what is happening. e.g. turn angle 0.13, spin 0.9 into "the pole tilts right, spinning fast". Buckets are enough: small/big, fast/slow.
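The bucketing step can be sketched as follows. The helper names and the fixed 0.5 "fast" threshold are assumptions chosen so that the example numbers above come out as described; a relative comparison (spin against tilt) would work equally well.

```python
def direction(value: float, positive: str = "right", negative: str = "left") -> str:
    """Map the sign of a value to a direction word."""
    return positive if value > 0 else negative


def bucket(value: float, threshold: float, small: str, big: str) -> str:
    """Two buckets are enough: compare the magnitude against one threshold."""
    return big if abs(value) > threshold else small


def describe_pole(angle: float, spin: float, fast_above: float = 0.5) -> str:
    """Turn (angle, spin) into one situation sentence with a velocity term."""
    tilt = direction(angle)
    speed = bucket(spin, fast_above, "slowly", "fast")
    return f"the pole tilts {tilt}, spinning {speed}"


print(describe_pole(0.13, 0.9))  # the pole tilts right, spinning fast
```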

3. Imprint the rule

Invert the action → result pairs you observed into a rule: in situation X, do action B. e.g. "tilting right → push right", "DROPPING → fire the main engine". One sentence is the policy.

4. Decide on memory

Ask one question: "is revisiting the same state a failure signal?" If yes (maze, exploration, dead ends) → Task(memory=True). If the task is about maintaining a state (balance, landing, tracking) → keep it off; there, repetition is the job. The runner also hints when it detects looping.
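How the runner detects looping is not documented here. One plausible check, offered purely as an assumption, is to count revisits per state and flag a trace once any state recurs too often. The function name and the max_revisits default are hypothetical.

```python
from collections import Counter
from typing import Hashable, Iterable


def looks_like_looping(states: Iterable[Hashable], max_revisits: int = 3) -> bool:
    """Return True if any single state appears more than max_revisits times."""
    counts = Counter(states)
    return any(n > max_revisits for n in counts.values())


pacing = [10, 11, 10, 11, 10, 11, 10, 11]   # stuck at the mouth of a dead end
direct = list(range(14))                     # every state visited once
print(looks_like_looping(pacing), looks_like_looping(direct))  # True False
```

On a regulation task the same check would fire constantly, since holding one status is the goal; that is the reason to keep memory off there.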

5. Take the exam

Before playing, give the Mind a written exam: each item = a situation + the expected answer ("tilting right + fast spin" expects "push right"). mind.check(task) asks item by item; CartPole passed 7 of 8 (one miss allowed), landing 5/5, maze 7/7, MountainCar 3/3. A failed exam needs no training: the perception sentence is usually incomplete; rephrase and re-check.
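The exam loop can be sketched without the library. This is not mind.check itself: run_exam and stub_decide are hypothetical, the stub applies the MountainCar rule literally instead of calling a model, and the letters A=push left, B=coast, C=push right are assumed from the order of the action list.

```python
from typing import Callable, List, Tuple

ExamItem = Tuple[str, str]          # (situation sentence, expected letter)
Miss = Tuple[str, str, str]         # (situation, expected, actual answer)


def run_exam(decide: Callable[[str], str], items: List[ExamItem],
             misses_allowed: int = 1) -> Tuple[bool, List[Miss]]:
    """Ask item by item; pass if the number of misses stays within the allowance."""
    misses: List[Miss] = []
    for situation, expected in items:
        answer = decide(situation)
        if answer != expected:
            misses.append((situation, expected, answer))
    return len(misses) <= misses_allowed, misses


def stub_decide(situation: str) -> str:
    """Hypothetical stand-in for the Mind: push with the motion, else push left."""
    return "C" if "moving right" in situation else "A"


ITEMS: List[ExamItem] = [
    ("The car is moving left.", "A"),
    ("The car is moving right.", "C"),
    ("The car is not moving.", "A"),
]

passed, misses = run_exam(stub_decide, ITEMS)
print(f"sanity {len(ITEMS) - len(misses)}/{len(ITEMS)}, passed={passed}")
# sanity 3/3, passed=True
```

The returned misses list is the useful part on failure: each entry names the sentence to rephrase.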

★ Or hand all five steps to an AI

Humans do not have to do this by hand: paste the prompt below plus your game description into any code agent (Meadow CLI, Claude, Cursor) and it wires the task for you. You only review the final exam score.

You are a Meadow Mind task integration engineer. Given a game's observation and
action description, produce:
1) perceive(obs): translate numeric observations into one English situation
   sentence. Bucket values (small/big, fast/slow), always include a velocity or
   trend term, arbitrate multi-objective states into ONE uppercase Status keyword.
2) rule: one English sentence, a one-layer mapping from status keywords to
   option letters (no nesting).
3) options: multiple choice (A=..., B=...) mapped to env actions. No free-form.
4) Decide on memory: "is revisiting the same state a failure signal?"
   Yes (maze/exploration/dead ends) → Task(memory=True), write perceive(obs, task),
   use task.seen(key) to annotate (safe, already visited), and add
   "prefer unvisited directions" to the rule. Regulation tasks (balance/landing):
   keep it OFF — annotations measurably hurt. Unsure → off; the runner hints on loops.
5) sanity: enumerate every situation with its expected letter (the exam;
   include annotated situations if memory is on).
Then run mind.check(task), one miss allowed; on failure only rephrase, never
touch the model.

The four games, fully decomposed

Every frame in every video corresponds to one real model decision. No scripted policy, no edited speed-ups.

Balance

CartPole-v1

400/400 perfect · 0.35s / step · sanity 7/8

obs: [cart pos, cart vel, pole angle θ, angular vel θ̇], 4-dim
actions: 2: push left, push right
win/lose: |θ| over 12° loses; solve bar 195 steps
effects: push right moves the pivot under the mass, catching the pole

RULE (THIS SENTENCE IS THE POLICY): Spin fast: push toward the spin. Spin slow: push toward the tilt.

The language version of the classic θ+θ̇ policy. A policy that watches tilt but ignores spin oscillates until the pole falls, which is why the velocity term is an iron rule.

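def perceive(obs):
    # obs = [cart pos, cart vel, pole angle θ, angular vel θ̇]; only the pole terms are used
    th, thv = obs[2], obs[3]
    tilt  = "right" if th  > 0 else "left"   # sign of the angle
    spin  = "right" if thv > 0 else "left"   # sign of the angular velocity
    # "fast" is relative: the spin outweighs the current tilt
    speed = "fast" if abs(thv) > abs(th) else "slow"
    return f"The pole tilts {tilt}. The spin is {spin}, {speed} spin."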
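The perceiver above can only emit eight distinct sentences (two tilts, two spin directions, two speeds), so the exam can be enumerated mechanically. The sketch below derives the expected answer for each from the rule. Whether the shipped exam is exactly this list is an assumption; the helper names are hypothetical.

```python
from itertools import product
from typing import List, Tuple


def sentence(tilt: str, spin: str, speed: str) -> str:
    """Same wording as the perceiver's output."""
    return f"The pole tilts {tilt}. The spin is {spin}, {speed} spin."


def expected_push(tilt: str, spin: str, speed: str) -> str:
    """Rule: spin fast -> push toward the spin; spin slow -> push toward the tilt."""
    return spin if speed == "fast" else tilt


SIDES = ("left", "right")
EXAM: List[Tuple[str, str]] = [
    (sentence(t, s, v), f"push {expected_push(t, s, v)}")
    for t, s, v in product(SIDES, SIDES, ("fast", "slow"))
]

for situation, answer in EXAM:
    print(f"{situation} -> {answer}")
```

The interesting items are the four where tilt and spin disagree: a fast spin overrides the tilt, a slow one does not.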

400 steps, perfect. Turn-based: each frame advances only after one real decision.

Landing

LunarLander-v3

+251 safe landing · 0.45s / step · sanity 5/5

Main engine brakes, side engines trim attitude, line up the pad, cushion the touchdown. 178 real decisions.

obs: [x, y, vx, vy, angle, angular vel, leg1, leg2], 8-dim
actions: 4: coast, left engine, main engine, right engine
win/lose: crash −100; solve bar +200
effects: main engine slows the fall; side engines turn; the perceiver arbitrates multiple goals

RULE: DROPPING: fire the main engine. TURN-LEFT / TURN-RIGHT: side engines. STABLE / LANDED: do nothing.

# excerpt from perceive(obs), with vy = obs[3], leg1 = obs[6], leg2 = obs[7]
if leg1 or leg2:
    if vy < -0.1:
        return "Status: DROPPING."  # touched but sinking: keep cushioning
    return "Status: LANDED."

Outcome feedback in action: the first flight crashed at +27.5; the trace showed control stopped at first leg contact. One cushioning line in the perceiver later, the second flight landed at +251. One line, ten seconds, no reward.

Maze

FrozenLake-v1 8×8

GOAL in 14 steps, shortest path · 0.36s / step · sanity 7/7

Top-left to bottom-right, around all 10 holes, along exactly the theoretical shortest path.

obs: cell id 0-63; S start, F ice, H hole, G goal
actions: 4: left, down, right, up (deterministic)
win/lose: step on H dies, reach G wins

RULE: Take primary if safe; hole or wall → secondary; both bad → the escape direction; never enter a hole.

The perceiver provides global awareness: the bearing toward the goal becomes primary and secondary candidate directions, each annotated safe / hole / blocked.
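A minimal sketch of such a perceiver, under stated assumptions: the grid is the standard 4×4 FrozenLake map rather than the 8×8 one, the primary direction is taken along the axis with the larger remaining distance, and the escape direction is left out. The function names are hypothetical.

```python
from typing import List, Tuple

MOVES = {"left": (0, -1), "down": (1, 0), "right": (0, 1), "up": (-1, 0)}

GRID: List[str] = ["SFFF", "FHFH", "FFFH", "HFFG"]  # standard 4x4 layout


def annotate(grid: List[str], row: int, col: int, direction: str) -> str:
    """Label the neighbouring cell in a direction as safe, hole or blocked."""
    dr, dc = MOVES[direction]
    r, c = row + dr, col + dc
    if not (0 <= r < len(grid) and 0 <= c < len(grid[0])):
        return "blocked"  # off the edge of the map
    return "hole" if grid[r][c] == "H" else "safe"


def candidates(row: int, col: int, goal: Tuple[int, int]) -> Tuple[str, str]:
    """Turn the bearing toward the goal into (primary, secondary) directions."""
    vertical = "down" if goal[0] >= row else "up"
    horizontal = "right" if goal[1] >= col else "left"
    if abs(goal[0] - row) >= abs(goal[1] - col):
        return vertical, horizontal
    return horizontal, vertical


def perceive(cell: int, grid: List[str]) -> str:
    """Cell id -> one sentence with both candidates annotated."""
    row, col = divmod(cell, len(grid[0]))
    goal = next((r, c) for r, line in enumerate(grid)
                for c, ch in enumerate(line) if ch == "G")
    primary, secondary = candidates(row, col, goal)
    return (f"Primary: {primary} ({annotate(grid, row, col, primary)}). "
            f"Secondary: {secondary} ({annotate(grid, row, col, secondary)}).")


print(perceive(6, GRID))  # Primary: down (safe). Secondary: right (hole).
```

The model never sees the map, only the two annotated candidates, which keeps the rule a one-layer lookup.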

"Primary: down (safe). Secondary: right (hole)."

Momentum

MountainCar-v0

flag in 103 steps · 0.37s / step · sanity 3/3

obs: [position, velocity], 2-dim
actions: 3: push left, coast, push right
win/lose: reach the flag within 200 steps; the engine is weaker than gravity, so driving straight up fails
effects: pushing with the motion pumps energy into the system (swing principle)

RULE (A COUNTERINTUITIVE STRATEGY, ONE SENTENCE): Push in the direction of motion, pumping energy like a swing; when still, push left to start.

MountainCar is RL's classic sparse-reward problem: you must move away from the goal to reach it. RL discovers this by exploration; here it is written straight into the rule.

rule = ("Rule: push in the same direction the car is moving, "
        "to pump energy like a swing. If not moving, push left.")

Two swings back and forth to build energy, up the hill in 103 steps (limit 200).
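To see why the rule works, it can be hard-coded and run against the classic MountainCar dynamics in plain Python. This is a sketch, not the Meadow Mind run: the rule is applied by an if-statement instead of a model, the update equations are re-typed here from the standard environment rather than imported, and the start position is fixed at -0.5 instead of sampled.

```python
import math
from typing import Optional, Tuple


def mountain_car_step(position: float, velocity: float,
                      action: int) -> Tuple[float, float]:
    """One step of the classic dynamics (action: 0 left, 1 coast, 2 right)."""
    velocity += (action - 1) * 0.001 - 0.0025 * math.cos(3 * position)
    velocity = max(-0.07, min(0.07, velocity))
    position += velocity
    position = max(-1.2, min(0.6, position))
    if position == -1.2 and velocity < 0:
        velocity = 0.0  # inelastic stop at the left wall
    return position, velocity


def rule_action(velocity: float) -> int:
    """The one-sentence rule: push with the motion; if not moving, push left."""
    return 2 if velocity > 0 else 0


def run(position: float = -0.5, max_steps: int = 200) -> Optional[int]:
    """Return the step at which the flag (position >= 0.5) is reached, or None."""
    velocity = 0.0
    for step in range(1, max_steps + 1):
        position, velocity = mountain_car_step(position, velocity,
                                               rule_action(velocity))
        if position >= 0.5:
            return step
    return None


print("steps to flag:", run())
```

Replacing rule_action with a constant push right shows the failure the rule avoids: the car stalls partway up the hill and never gains enough energy.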

Meadow Mind's memory at work

Without memory, a dead end means pacing at its mouth forever.
Add a memory cue to the perception sentence, and it backs out and routes around.

Dead-end maze

custom funnel trap · FrozenLake 8×8

left: no memory, stuck ✗ · right: with memory, GOAL in 22 steps ✓ · the only difference is 5 words

A funnel forces both runs into the same pocket dead end (holes below, left and right). The left side paces forever; the right side struggles twice, backs out, and detours to the goal.

Built-in memory switch: Task(memory=True). Visited states accumulate automatically; annotate the perception sentence with task.seen(). No model changes, no training. Off by default; the runner hints when it detects looping.

st = "safe"
if cell in visited:
    st = "safe, already visited"

RULE (ONE EDITED SENTENCE): Prefer unvisited directions; if primary was visited take secondary; otherwise the alternative.
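The memory cue and the edited rule can be sketched together. VisitMemory is a hypothetical stand-in for the built-in memory, not the real Task object, and pick is a literal reading of the rule used in place of the model.

```python
from typing import Optional, Set, Tuple

Candidate = Tuple[str, str]  # (direction, annotation)


class VisitMemory:
    """Hypothetical stand-in for the built-in memory: a set of visited cells."""

    def __init__(self) -> None:
        self.visited: Set[int] = set()

    def visit(self, cell: int) -> None:
        self.visited.add(cell)

    def seen(self, cell: int) -> bool:
        return cell in self.visited


def annotate(status: str, cell: int, memory: VisitMemory) -> str:
    """Add the memory cue to a safe cell that was already visited."""
    if status == "safe" and memory.seen(cell):
        return "safe, already visited"
    return status


def pick(primary: Candidate, secondary: Candidate,
         alternative: Candidate) -> Optional[str]:
    """Prefer unvisited safe directions, in order; else any safe direction."""
    ordered = (primary, secondary, alternative)
    for direction, note in ordered:
        if note == "safe":
            return direction
    for direction, note in ordered:
        if note.startswith("safe"):
            return direction  # everything safe was visited: retrace
    return None


memory = VisitMemory()
memory.visit(10)
primary = ("down", annotate("safe", 10, memory))
print(pick(primary, ("right", "safe"), ("up", "safe")))  # right
```

Without the annotation the primary direction stays "safe" forever, which is exactly the pacing behaviour of the no-memory run.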

Why no reward

	RL (PPO / SAC)	Meadow Mind
policy source	reward engineering + training (hours to days)	one sentence (seconds)
samples	10⁵ to 10⁷ env steps	0
changing behavior	retrain	edit words
interpretability	black-box weights	rule and every decision are readable
decision latency	0.1 to 1 ms	~0.4 s (honest weakness)
continuous precision, high-rate control	strong	discrete multiple choice (honest weakness)

RL needs reward because the policy hides inside weights and can only be carved by a scalar signal. Meadow Mind's policy is a readable sentence: env scores are just report cards and never enter the decision loop. Reward is replaced by outcome feedback: the episode trace points at the wrong sentence, and you edit it. LunarLander went from crash to landing with one ten-second cushioning line.

Honest limits: the reaction floor is ~0.4s (~2.5Hz); tighter deadlines (a 1-meter pole, Pong trajectory prediction) are out of reach today; the perceiver is human-designed, a deliberate division of labour in which the human teaches and the model decides. Next version: layered perception that acts as soon as confidence crosses a threshold, targeting ~0.15s.