Deployment & Operations

rocky-bot runs on EC2 as a systemd user service and trades through 30 accounts minted in one shot by mint-30.sh. This page covers the full lifecycle: zero-to-running setup, continuous monitoring, and incident postmortems.

If you are not yet familiar with the strategy itself, read Strategy Loops and Risk Controls first.

1. Initial provisioning: mint 30 accounts

Source: rocky-bot/scripts/mint-30.sh

cd /Users/ubuntu/Desktop/Rocky/rocky-bot
bash scripts/mint-30.sh > .keys.json

What the script does (about 1.5 minutes; every step runs over SSH against EC2):

Generates 30 role manifests (inline python3 producing JSON)
- 12 BUY ladders (L1: 5/6/7/8/9/10 bps; L2: 15/20/25/30 bps; L3: 50/100 bps)
- 12 SELL ladders (symmetric)
- 1 anchor
- 5 takers
Calls the mint_api_key Rust binary on EC2 once per manifest
- Runs cargo run -p api-gateway --bin mint_api_key -- --new-user --label '<id>' in the rocky-backend repo
- Backend creates an auth.api_keys row + a new user_id
- Returns user_id / api key / secret
Seeds $100 USDC to each new user immediately
- Calls backend /v1/deposits/seed
Aggregates the final .keys.json to stdout
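The role breakdown above can be sketched in Python, the language the script already uses inline for this step. This is a minimal sketch, not the contents of mint-30.sh: the ladder id pattern is inferred from the single sample id mm-l1-buy-05bps in .keys.json, and the anchor and taker ids are hypothetical.

```python
# Offsets per ladder tier, taken from the role list above.
LADDER_TIERS = {
    "l1": [5, 6, 7, 8, 9, 10],
    "l2": [15, 20, 25, 30],
    "l3": [50, 100],
}


def build_manifests():
    """Return the 30 role manifests: 24 ladders, 1 anchor, 5 takers."""
    manifests = []
    for side in ("BUY", "SELL"):
        for tier, offsets in LADDER_TIERS.items():
            for bps in offsets:
                manifests.append({
                    # Pattern inferred from the sample id "mm-l1-buy-05bps".
                    "id": f"mm-{tier}-{side.lower()}-{bps:02d}bps",
                    "role": "ladder",
                    "side": side,
                    "offset_bps": bps,
                })
    # Hypothetical ids: the real anchor and taker labels are not shown here.
    manifests.append({"id": "mm-anchor", "role": "anchor"})
    for n in range(1, 6):
        manifests.append({"id": f"taker-{n:02d}", "role": "taker"})
    return manifests
```

Calling build_manifests() and dumping the result with json.dumps gives the list that the minting step would then walk one entry at a time.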

.keys.json structure example:

{
  "rocky_fapi_url": "https://demo.rocky.exchange",
  "accounts": [
    {
      "id": "mm-l1-buy-05bps",
      "role": "ladder",
      "side": "BUY",
      "offset_bps": 5,
      "user_id": "5cfb031b-5936-4467-9533-cd2df576dbb8",
      "key": "...",
      "secret": "..."
    },
    ...30 total
  ]
}
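Before deploying, it can be worth sanity-checking a freshly minted file against this structure. The sketch below is one possible check, not part of the repo; it assumes every account carries id, role, user_id, key, and secret, and that side and offset_bps are optional because only a ladder entry is shown above.

```python
REQUIRED_FIELDS = ("id", "role", "user_id", "key", "secret")


def validate_keys(doc, expected_accounts=30):
    """Return a list of problems found in a parsed .keys.json document."""
    problems = []
    if not doc.get("rocky_fapi_url"):
        problems.append("missing rocky_fapi_url")
    accounts = doc.get("accounts", [])
    if len(accounts) != expected_accounts:
        problems.append(
            f"expected {expected_accounts} accounts, found {len(accounts)}"
        )
    seen = set()
    for i, acct in enumerate(accounts):
        for field in REQUIRED_FIELDS:
            if not acct.get(field):
                problems.append(f"accounts[{i}] missing {field}")
        acct_id = acct.get("id")
        if acct_id is not None and acct_id in seen:
            problems.append(f"duplicate id {acct_id}")
        seen.add(acct_id)
    return problems
```

An empty list means the file has the expected shape. Load the file with json.load and pass the resulting dict in.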

Security note: .keys.json holds 30 (key, secret) pairs and is highly sensitive.

It is listed in .gitignore and never goes into git
It is only scp'd to EC2 during deploy
Keep a local copy for re-deploys

1.1 SSH multiplexing

mint-30.sh uses ControlMaster to reuse a single SSH connection, so that opening 30 connections in quick succession does not trip sshd's MaxStartups limit on concurrent unauthenticated connections:

SSH_OPTS=(-i "$SSH_KEY" -o ControlMaster=auto -o ControlPath="$CTRL_PATH" -o ControlPersist=60)

Before this was added, sshd rejected the 8th connection ("connection reset by peer").
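For tooling written in Python rather than bash, the same three options can be passed as an argv list. This is a sketch under assumptions: the function name and its parameters are hypothetical, and only the -o options mirror SSH_OPTS above.

```python
def ssh_argv(host, remote_cmd, ssh_key, ctrl_path, persist_s=60):
    """Build an ssh argv that shares one master connection across calls."""
    return [
        "ssh",
        "-i", ssh_key,
        "-o", "ControlMaster=auto",           # first call becomes the master
        "-o", f"ControlPath={ctrl_path}",     # socket that later calls attach to
        "-o", f"ControlPersist={persist_s}",  # keep the master alive while idle
        host,
        remote_cmd,
    ]
```

Each call would then go through subprocess.run(ssh_argv(...), check=True). Every invocation with the same ctrl_path reuses the master's TCP connection instead of opening a new one.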

2. Deploy: ship / upgrade bot code

Source: rocky-bot/deploy.sh

cd /Users/ubuntu/Desktop/Rocky/rocky-bot
./deploy.sh

What it does:

rsync source from rocky_bot/ to EC2 ~/rocky-bot/
scp .env (env config) + scp .keys.json (account credentials)
Over SSH: uv venv --python 3.12 --allow-existing && uv pip install -e .
Over SSH: systemctl --user restart rocky-bot

rsync runs with --exclude .env --exclude .keys.json, so the sensitive files travel only through the dedicated scp step.

2.1 systemd unit

# ~/.config/systemd/user/rocky-bot.service
[Unit]
Description=rocky-bot — volume generator for demo.rocky.exchange
After=network.target

[Service]
Type=exec
WorkingDirectory=%h/rocky-bot
ExecStart=%h/rocky-bot/.venv/bin/python -m rocky_bot.main
Restart=on-failure
RestartSec=5

[Install]
WantedBy=default.target

Control commands:

systemctl --user start rocky-bot
systemctl --user stop rocky-bot
systemctl --user restart rocky-bot
systemctl --user is-active rocky-bot
journalctl --user -u rocky-bot --since "5 min ago" --no-pager
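If the liveness check needs to run from a script, is-active can be wrapped as below. This is a minimal sketch; the function name and the injectable run parameter are illustrative choices, not something the repo provides.

```python
import subprocess


def service_is_active(unit="rocky-bot", run=subprocess.run):
    """Return True when `systemctl --user is-active <unit>` prints 'active'."""
    result = run(
        ["systemctl", "--user", "is-active", unit],
        capture_output=True,
        text=True,
        check=False,  # is-active exits non-zero for units that are not active
    )
    return result.stdout.strip() == "active"
```

The run parameter exists only so the function can be exercised without a live systemd session.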

3. Reset: clear state + restart

Source: rocky-bot/scripts/reset.sh

bash scripts/reset.sh

5 steps:

systemctl --user stop rocky-bot

SQL reset for 30 funnel accounts:

UPDATE ledger.positions SET qty=0, locked_margin=0 WHERE user_id IN (funnel users);
UPDATE ledger.accounts SET available = available + locked, locked = 0
  WHERE asset='USDC' AND user_id IN (funnel users);
DELETE FROM ledger.orders_open WHERE user_id IN (funnel users);

pkill -f 'target/release/matching-engine', then restart ME under nohup (so its in-memory book matches the just-cleared orders_open)
sleep 3
systemctl --user restart rocky-bot
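The (funnel users) placeholder in the SQL above has to be filled with real ids. One way to do that, sketched here as an assumption rather than as what reset.sh does, is to take the user_id values from .keys.json and render them as SQL literals. The function name is hypothetical.

```python
import uuid


def render_reset_sql(user_ids):
    """Render the three reset statements for the given user_id strings."""
    if not user_ids:
        raise ValueError("user_ids must not be empty")
    # uuid.UUID raises ValueError on anything that is not a UUID, so the
    # normalized values are safe to inline as quoted literals.
    ids = ", ".join(f"'{uuid.UUID(u)}'" for u in user_ids)
    return "\n".join([
        "UPDATE ledger.positions SET qty=0, locked_margin=0"
        f" WHERE user_id IN ({ids});",
        "UPDATE ledger.accounts SET available = available + locked, locked = 0"
        f" WHERE asset='USDC' AND user_id IN ({ids});",
        f"DELETE FROM ledger.orders_open WHERE user_id IN ({ids});",
    ])
```

The rendered text can be piped to psql on stdin, which sidesteps the ssh quoting problems described in Round 4.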

3.1 Why ME needs restarting

Historical lesson: early resets cleared only the DB, not ME. ME's in-memory book still held orders whose rows DELETE had removed from orders_open. When the new bot started, ME matched against these "ghost orders"; backend apply.rs could not find the corresponding order row, fell through to the leverage=1 fallback, and computed the wrong margin.

With the restart in place, ME reloads from the (now empty) orders_open table on startup, so its book starts clean.
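The ghost-order condition is just a set difference between what ME holds in memory and what the table holds. The toy model below uses made-up order ids and is not ME's real data structure.

```python
def ghost_orders(me_book_ids, orders_open_ids):
    """Order ids the matching engine still holds but the DB no longer has."""
    return set(me_book_ids) - set(orders_open_ids)


# Before the reset, ME's in-memory book mirrors orders_open.
me_book = {"o-1", "o-2", "o-3"}
orders_open = {"o-1", "o-2", "o-3"}

# Early-style reset: DELETE empties the table while ME keeps running.
orders_open.clear()
stale = ghost_orders(me_book, orders_open)   # all three ids are now ghosts

# Current flow: ME restarts and reloads from the empty table.
me_book = set(orders_open)
clean = ghost_orders(me_book, orders_open)   # nothing left to match against
```

Any id in stale is an order ME could still match even though the backend has no row for it.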

4. Monitoring (runtime)

Four key metrics:

4.1 max(locked) — any account near blow-up?

ssh ... 'docker exec rocky-backend-stack-postgres-1 psql -U rocky -d rocky -c "
  SELECT round(max(locked)::numeric, 2) AS max_l,
         round(avg(locked)::numeric, 2) AS avg_l,
         count(*) FILTER (WHERE locked > 50) AS over_50,
         count(*) FILTER (WHERE locked > 80) AS over_80
    FROM ledger.accounts a JOIN auth.api_keys k ON k.user_id = a.user_id
   WHERE a.asset = '\''USDC'\''
     AND (k.label LIKE '\''mm-%'\'' OR k.label LIKE '\''taker-%'\'')
"'

Healthy (based on post-leverage-fix measurements):

max_l < 50 USDC
over_80 == 0
avg_l ~ 20–30 USDC

Bad: max_l near 100 (the full $100 seed) means the cap has stopped working; treat it as a major incident.
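These bands translate into a small classifier for alerting scripts. This is a sketch only: the text says "near 100" without a number, so the near_cap default of 95 is an assumption, and the "watch" band is a label invented here for values between healthy and bad.

```python
def classify_locked(max_l, over_80, near_cap=95.0):
    """Classify the 4.1 query result as 'healthy', 'watch', or 'bad'."""
    if max_l >= near_cap:
        return "bad"       # the cap has stopped working
    if max_l < 50 and over_80 == 0:
        return "healthy"
    return "watch"         # between the documented healthy and bad bands
```

Tune near_cap to whatever threshold the team actually treats as "near 100".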

4.2 -2010 error count — how often the backend rejects orders

ssh ... 'journalctl --user -u rocky-bot --since "30 min ago" --no-pager 2>&1 | grep -c -e "-2010"'

-2010 means "insufficient balance"; it occurs when the backend cannot lock the order's margin. Healthy:

< 30 over 30 min (under 1/min)
Perfect: 0

Bad: more than 100 in 30 min means accounts are full, typically alongside max_l near 100.
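The same count and thresholds can be applied to journal output already captured in Python. In this sketch the log line format in the usage is hypothetical, and "watch" is an invented label for counts between the healthy and bad bands.

```python
def count_2010(lines):
    """Count log lines containing -2010, the way `grep -c` counts lines."""
    return sum(1 for line in lines if "-2010" in line)


def classify_2010(count):
    """Apply the 30-minute thresholds: under 30 healthy, over 100 bad."""
    if count > 100:
        return "bad"
    if count < 30:
        return "healthy"
    return "watch"
```

Like grep -c, count_2010 counts matching lines, so a line that mentions -2010 twice still counts once.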

4.3 recent trades — fills still happening?

ssh ... 'docker exec rocky-backend-stack-postgres-1 psql -U rocky -d rocky -c "
  SELECT symbol, side, price, qty, ts FROM ledger.trades ORDER BY ts DESC LIMIT 3
"'

Healthy: the most recent ts is within the last minute. Bad: the most recent ts is more than 5 minutes old, which means the bot is not trading (it may be dead or fully capped).
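The freshness rule can be expressed as a function of the newest ts. The sketch assumes ts arrives as a timezone-aware datetime; the "watch" label for the 1 to 5 minute gap is an invention of this example.

```python
from datetime import datetime, timedelta, timezone


def classify_trade_age(latest_ts, now=None):
    """Classify the newest trade timestamp as 'healthy', 'watch', or 'bad'."""
    now = now or datetime.now(timezone.utc)
    age = now - latest_ts
    if age <= timedelta(minutes=1):
        return "healthy"
    if age > timedelta(minutes=5):
        return "bad"       # bot may be dead or fully capped
    return "watch"
```

Passing now explicitly keeps the function deterministic when replaying old query output.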

4.4 invariant logger — backend computation correctness

ssh ... 'grep -c "invariant violated" /tmp/rocky-services/internal-ledger.log'

Healthy: 0 or a stable low number (just startup noise). Bad: a growing count, which indicates a bug in the backend matching layer.

5. Five margin-leak fix rounds

Full incident timeline (chronological), useful for understanding why the current architecture looks the way it does:

Round 1: bot-side position cap (commits 077887d / 526cac5 / 97d23a6 / 1ae6f1a)

Symptom: max(locked) climbed to $99 within 30 minutes.
Fix: added position-cap gates to LadderMakerLoop / AnchorMakerLoop (Strategy Loops § 2.4).
Result: the leak slowed but was not cured.

Round 2: phantom-trade refusal (commits e67b63f / fd6b9e2)

Symptom: the leak persisted with the cap gate live. Diagnosis: apply_trade_matched fell through to leverage=1 when it could not find the orders_open row.
Fix: apply.rs returns early and publishes OrderCancelled over NATS so ME drops the ghost order.
Result: the leak shrank but did not stop.

Round 3: invariant instrumentation (commit e32ae17)

Symptom: still leaking, with no way to pin down which trade caused it.
Fix: at the end of apply_trade_matched, added a tracing::error! that checks accounts.locked == sum(positions.locked_margin) + sum(orders_open.margin_locked).
Result: 30 minutes yielded 956 violations. The data showed pos_sum values with the recurring digits 142857 (the decimal expansion of 1/7), which implied leverage was being computed as 7.
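The invariant and the 1/7 signature are easy to reproduce outside the backend. The sketch below restates the check in Python with Decimal; it is not the Rust code in apply_trade_matched, and the function name is hypothetical.

```python
from decimal import Decimal


def invariant_diff(account_locked, position_margins, order_margins):
    """locked - (sum(positions.locked_margin) + sum(orders_open.margin_locked))."""
    expected = sum(position_margins, Decimal(0)) + sum(order_margins, Decimal(0))
    return account_locked - expected


# Margin per unit of notional at two leverage values.
per_unit_at_7 = Decimal(1) / Decimal(7)     # repeating 142857 digits
per_unit_at_10 = Decimal(1) / Decimal(10)   # a clean 0.1
```

A non-zero invariant_diff is a violation. The repeating digits in per_unit_at_7 are the fingerprint that pointed at a divisor of 7.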

Round 4: reset.sh ops fix

Symptom: the reset flow often half-failed (ME was not restarted, or the SQL heredoc was swallowed by ssh quoting).
Fix: full rewrite of reset.sh, including ME pkill + restart and SQL piped over stdin.
Result: cleanup became reproducible, but the leak itself was not fixed.

Round 5: LEVERAGE_V1 constant (commit dd653e6)

Root cause: apply.rs derived leverage from notional / order_margin, which rounded to 7, 9, or 11 after partial fills and price drift.
Fix: replaced the derivation with const LEVERAGE_V1: u32 = 10;, matching the hardcoded leverage: 10 already used everywhere above api-gateway.
Result: fixed. In the 63 minutes after deploy, max(locked) stayed stable at $27, with 0 invariant violations and 0 -2010 errors.
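A few lines of arithmetic show why deriving leverage from a ratio is fragile. The inputs below are invented to make the ratio drift; they are not reconstructed from the incident, and the Python functions only mirror the idea, not the Rust code.

```python
LEVERAGE_V1 = 10  # mirrors the Rust constant named above


def derived_leverage(notional, order_margin):
    """The pre-fix idea: infer leverage from the two amounts."""
    return round(notional / order_margin)


def margin_for(notional, leverage):
    """Margin to lock for a given notional at a given leverage."""
    return notional / leverage


# Hypothetical order: 10 USDC of margin locked against 100 USDC of notional.
at_placement = derived_leverage(100, 10)   # 10, as intended
price_up = derived_leverage(110, 10)       # 11 once notional drifts up
price_down = derived_leverage(90, 10)      # 9 once notional drifts down
partial_fill = derived_leverage(70, 10)    # 7 if margin is not scaled with the fill
```

With the constant, margin_for(70, LEVERAGE_V1) is 7.0 regardless of how the ratio has moved, whereas the derived value of 7 would lock 10.0 for the same notional.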

6. Troubleshooting checklist

Bot seems idle

systemctl --user is-active rocky-bot — is the service alive?
journalctl --user -u rocky-bot --since "1 min ago" — recent logs
Check if max(locked) is near $98 — if so, accounts are full
Check if recent trades are fresh — if not, BinanceFeed may have disconnected

CB tripping repeatedly

Check journal for "CircuitBreaker opened: reason=..." — API errors or max loss
If API errors: is the backend healthy? curl https://demo.rocky.exchange/api/perp/markets
If max loss: manually reset CB (restart bot) OR raise RiskCaps.max_loss_usdc

Invariant violations reappear

Find specific cases: grep "invariant violated" /tmp/rocky-services/internal-ledger.log | head -3
Check the diff and inferred leverage in each
If leverage-related, retrace round 5
If new pattern, possibly a new backend leak → open a spec and run the diagnostic flow

7. Related

Overview: top-level architecture
Strategy Loops: three strategy internals
Risk Controls: RiskCaps / CircuitBreaker / position cap / LEVERAGE_V1
Repo spec docs (full incident postmortems):
- rocky.interface/docs/superpowers/specs/2026-05-25-rocky-bot-position-cap-fix-design.md
- rocky.interface/docs/superpowers/specs/2026-05-25-phantom-trade-fix-design.md
- rocky.interface/docs/superpowers/specs/2026-05-25-margin-leak-instrumentation-design.md
- rocky.interface/docs/superpowers/specs/2026-05-25-leverage-derivation-fix-design.md
