Deployment & Operations
rocky-bot runs on EC2 as a systemd-user service, driven by 30 accounts minted in one shot via mint-30.sh. This page covers the complete "zero-to-running + continuous monitoring + incident postmortems" lifecycle.
If you are not yet familiar with the strategy itself, read Strategy Loops and Risk Controls first.
1. Initial provisioning: mint 30 accounts
Source: rocky-bot/scripts/mint-30.sh
cd /Users/ubuntu/Desktop/Rocky/rocky-bot
bash scripts/mint-30.sh > .keys.json
What the script does (~1.5 min, all SSH into EC2):
- Generates 30 role manifests (inline python3 producing JSON)
  - 12 BUY ladders (5/6/7/8/9/10 bps × L1 + 15/20/25/30 bps × L2 + 50/100 bps × L3)
  - 12 SELL ladders (symmetric)
  - 1 anchor
  - 5 takers
- Calls EC2's mint_api_key Rust binary for each
  - Runs cargo run -p api-gateway --bin mint_api_key -- --new-user --label '<id>' in the rocky-backend repo
  - Backend creates an auth.api_keys row + a new user_id
  - Returns user_id / api key / secret
- Seeds $100 USDC to each new user immediately
  - Calls backend /v1/deposits/seed
- Aggregates the final .keys.json to stdout
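The role split in the first step can be sketched in Python as follows. This is a minimal reconstruction, not the inline script from mint-30.sh: the ladder id pattern is inferred from the single example id mm-l1-buy-05bps, and the anchor and taker ids (mm-anchor, taker-01 ...) are hypothetical names chosen only to match the mm-% / taker-% label prefixes used in the monitoring queries.

```python
import json

# Offsets per ladder level, as listed above (6 + 4 + 2 = 12 per side).
LADDER_LEVELS = {
    "l1": [5, 6, 7, 8, 9, 10],
    "l2": [15, 20, 25, 30],
    "l3": [50, 100],
}


def build_manifest():
    """Return the 30 role entries: 12 BUY + 12 SELL ladders, 1 anchor, 5 takers."""
    roles = []
    for side in ("BUY", "SELL"):
        for level, offsets in LADDER_LEVELS.items():
            for bps in offsets:
                roles.append({
                    "id": f"mm-{level}-{side.lower()}-{bps:02d}bps",
                    "role": "ladder",
                    "side": side,
                    "offset_bps": bps,
                })
    # Hypothetical ids: the source does not show how anchor/taker are named.
    roles.append({"id": "mm-anchor", "role": "anchor"})
    for i in range(1, 6):
        roles.append({"id": f"taker-{i:02d}", "role": "taker"})
    return roles


manifest = build_manifest()
print(len(manifest))
print(json.dumps(manifest[0], indent=2))
```

Each entry's id would then be passed as the --label argument when minting the corresponding key.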
.keys.json structure example:
{
"rocky_fapi_url": "https://demo.rocky.exchange",
"accounts": [
{
"id": "mm-l1-buy-05bps",
"role": "ladder",
"side": "BUY",
"offset_bps": 5,
"user_id": "5cfb031b-5936-4467-9533-cd2df576dbb8",
"key": "...",
"secret": "..."
},
...30 total
]
}
Security note: .keys.json holds 30 (key, secret) pairs and is highly sensitive.
- .gitignored, never goes into git
- Only scp'd to EC2 during deploy
- Keep a local copy for re-deploy purposes
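Because a partially failed mint run would leave a short or malformed file, a structural check before deploying can be useful. The sketch below is an assumption about what such a check might look like; it is not part of the repo. It only requires the fields that every entry in the example carries (id, role, user_id, key, secret), since the source does not show whether anchor and taker entries include side or offset_bps.

```python
REQUIRED_FIELDS = {"id", "role", "user_id", "key", "secret"}
EXPECTED_ACCOUNTS = 30


def validate_keys(doc: dict) -> list[str]:
    """Return a list of problems; an empty list means the file looks sane."""
    problems = []
    accounts = doc.get("accounts", [])
    if len(accounts) != EXPECTED_ACCOUNTS:
        problems.append(
            f"expected {EXPECTED_ACCOUNTS} accounts, found {len(accounts)}"
        )
    ids = [a.get("id") for a in accounts]
    if len(set(ids)) != len(ids):
        problems.append("duplicate account ids")
    for a in accounts:
        missing = REQUIRED_FIELDS - a.keys()
        if missing:
            problems.append(f"{a.get('id', '?')}: missing {sorted(missing)}")
    return problems


# In-memory demo; in practice the dict would come from json.load() on .keys.json.
sample = {
    "rocky_fapi_url": "https://demo.rocky.exchange",
    "accounts": [
        {"id": f"acct-{i}", "role": "ladder", "user_id": "u", "key": "k", "secret": "s"}
        for i in range(30)
    ],
}
print(validate_keys(sample))
```

The function deliberately never prints key or secret values, so its output is safe to paste into a ticket.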
1.1 SSH multiplexing
mint-30.sh uses ControlMaster to reuse a single SSH connection, avoiding hitting the server's MaxStartups rate limit when opening 30 connections:
SSH_OPTS=(-i "$SSH_KEY" -o ControlMaster=auto -o ControlPath="$CTRL_PATH" -o ControlPersist=60)
Historically without this, the 8th connection was rejected by sshd ("connection reset by peer").
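For readers driving the same pattern from Python rather than bash, the options translate directly into an argv list. This is a sketch only: the host, key path, and socket path below are hypothetical placeholders, and the script itself uses the bash array shown above.

```python
import shlex


def ssh_argv(host, remote_cmd, ssh_key, ctrl_path):
    """Build an ssh command line that reuses one multiplexed connection."""
    return [
        "ssh",
        "-i", ssh_key,
        "-o", "ControlMaster=auto",        # first call becomes the master
        "-o", f"ControlPath={ctrl_path}",  # socket shared by later calls
        "-o", "ControlPersist=60",         # keep master alive 60 s when idle
        host,
        remote_cmd,
    ]


# Hypothetical host, key, and socket path, for illustration only.
argv = ssh_argv(
    "ubuntu@ec2-host",
    "echo ok",
    "~/.ssh/id_ed25519",
    "/tmp/rocky-ssh-%r@%h:%p",
)
print(shlex.join(argv))
```

Passing the list to subprocess.run(argv, check=True) once per account would then open a single TCP connection and multiplex the remaining sessions over it.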
2. Deploy: ship / upgrade bot code
Source: rocky-bot/deploy.sh
cd /Users/ubuntu/Desktop/Rocky/rocky-bot
./deploy.sh
What it does:
- rsync source from rocky_bot/ to EC2 ~/rocky-bot/
- scp .env (env config) + scp .keys.json (account credentials)
- SSH: uv venv --python 3.12 --allow-existing && uv pip install -e .
- SSH: systemctl --user restart rocky-bot
rsync runs with --exclude .env --exclude .keys.json, so the sensitive files travel only through the dedicated scp step.
2.1 systemd unit
# ~/.config/systemd/user/rocky-bot.service
[Unit]
Description=rocky-bot — volume generator for demo.rocky.exchange
After=network.target
[Service]
Type=exec
WorkingDirectory=%h/rocky-bot
ExecStart=%h/rocky-bot/.venv/bin/python -m rocky_bot.main
Restart=on-failure
RestartSec=5
[Install]
WantedBy=default.target
Control commands:
systemctl --user start rocky-bot
systemctl --user stop rocky-bot
systemctl --user restart rocky-bot
systemctl --user is-active rocky-bot
journalctl --user -u rocky-bot --since "5 min ago" --no-pager
3. Reset: clear state + restart
Source: rocky-bot/scripts/reset.sh
bash scripts/reset.sh
5 steps:
- systemctl --user stop rocky-bot
- SQL reset for 30 funnel accounts:
  UPDATE ledger.positions SET qty=0, locked_margin=0 WHERE user_id IN (funnel users);
  UPDATE ledger.accounts SET available = available + locked, locked = 0 WHERE asset='USDC' AND user_id IN (funnel users);
  DELETE FROM ledger.orders_open WHERE user_id IN (funnel users);
- pkill -f 'target/release/matching-engine' + nohup restart ME (so its in-memory book syncs with the just-cleared orders_open)
- sleep 3
- systemctl --user restart rocky-bot
3.1 Why ME needs restarting
Historical lesson: early resets cleared only the DB, not ME. ME's in-memory book still referenced rows that the DELETE had wiped from orders_open. When the new bot started, ME matched against these "ghost orders" → backend apply.rs couldn't find the corresponding order row → it fell through to the leverage=1 fallback → wrong margin calculation.
With the restart step in place, ME reloads from the (now empty) orders_open table at startup, so its book is guaranteed to be clean.
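The failure mode can be reproduced in a toy model. The sketch below is not the real matching engine or apply.rs; ToyEngine and margin_for_fill are hypothetical stand-ins that only mimic the two behaviours described above (a book loaded once at startup, and a leverage=1 fallback for unknown order rows).

```python
LEVERAGE = 10  # leverage the rest of the stack assumes


class ToyEngine:
    """Stand-in for ME: copies orders_open into memory once, at startup."""

    def __init__(self, orders_open):
        self.book = dict(orders_open)


def margin_for_fill(order_id, notional, orders_open):
    """Toy fallback: a fill whose order row is missing gets leverage=1."""
    leverage = LEVERAGE if order_id in orders_open else 1
    return notional / leverage


orders_open = {"o-1": {"price": 100.0, "qty": 1.0}}
engine = ToyEngine(orders_open)

# DB-only reset: the table is emptied, the engine's book is not.
orders_open.clear()
ghosts = set(engine.book) - set(orders_open)
ghost_margin = margin_for_fill("o-1", 100.0, orders_open)
print(ghosts, ghost_margin)

# Restarting the engine reloads the book from the now-empty table.
engine = ToyEngine(orders_open)
print(engine.book)
```

In this model a fill against the ghost order locks 100.0 instead of 10.0, which matches the direction of the over-locking described in the incident; the actual magnitudes in production depended on real order sizes.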
4. Monitoring (runtime)
Four key metrics:
4.1 max(locked) — any account near blow-up?
ssh ... 'docker exec rocky-backend-stack-postgres-1 psql -U rocky -d rocky -c "
SELECT round(max(locked)::numeric, 2) AS max_l,
round(avg(locked)::numeric, 2) AS avg_l,
count(*) FILTER (WHERE locked > 50) AS over_50,
count(*) FILTER (WHERE locked > 80) AS over_80
FROM ledger.accounts a JOIN auth.api_keys k ON k.user_id = a.user_id
WHERE a.asset = '\''USDC'\''
AND (k.label LIKE '\''mm-%'\'' OR k.label LIKE '\''taker-%'\'')
"'
Healthy (based on post-leverage-fix measurements):
- max_l < 50 USDC
- over_80 == 0
- avg_l ~ 20–30 USDC
Bad: max_l near 100 means an account has locked almost its entire $100 seed, i.e. the cap stopped working. Treat it as a major incident.
4.2 -2010 error count — how often the backend rejects orders
ssh ... 'journalctl --user -u rocky-bot --since "30 min ago" --no-pager 2>&1 | grep -c "\-2010"'
-2010 = "insufficient balance"; it occurs when the backend can't lock the order's margin.
Healthy:
- < 30 over 30 min (under 1/min)
- Perfect: 0
Bad: > 100 means accounts are full, typically alongside max_l near 100.
4.3 recent trades — fills still happening?
ssh ... 'docker exec rocky-backend-stack-postgres-1 psql -U rocky -d rocky -c "
SELECT symbol, side, price, qty, ts FROM ledger.trades ORDER BY ts DESC LIMIT 3
"'
Healthy: most recent ts within 1 minute.
Bad: ts more than 5 minutes ago → bot isn't working (might be dead or fully capped).
4.4 invariant logger — backend computation correctness
ssh ... 'grep -c "invariant violated" /tmp/rocky-services/internal-ledger.log'
Healthy: 0 or a stable low number (just startup noise).
Bad: growing → backend matching layer has a bug.
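The four checks can be folded into one small classifier, for example as the basis of a cron alert. This is a sketch under stated assumptions: the Snapshot fields and the "warn" band between the healthy and bad thresholds are hypothetical, and "near 100" is interpreted as >= 98 USDC, borrowing the $98 figure from the troubleshooting checklist.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    max_locked: float        # 4.1: max(locked) in USDC
    over_80: int             # 4.1: accounts with locked > 80
    err_2010_30min: int      # 4.2: -2010 lines in the last 30 min
    last_trade_age_s: float  # 4.3: seconds since the newest trade
    violations_prev: int     # 4.4: invariant count at the previous check
    violations_now: int      # 4.4: invariant count now


def assess(s: Snapshot) -> dict[str, str]:
    """Map each metric to 'healthy', 'warn', or 'bad'."""
    report = {}
    if s.max_locked >= 98:  # assumption: ">= 98" stands in for "near 100"
        report["locked"] = "bad"
    elif s.max_locked < 50 and s.over_80 == 0:
        report["locked"] = "healthy"
    else:
        report["locked"] = "warn"

    if s.err_2010_30min > 100:
        report["err_2010"] = "bad"
    elif s.err_2010_30min < 30:
        report["err_2010"] = "healthy"
    else:
        report["err_2010"] = "warn"

    if s.last_trade_age_s > 300:
        report["trades"] = "bad"
    elif s.last_trade_age_s <= 60:
        report["trades"] = "healthy"
    else:
        report["trades"] = "warn"

    # A stable count is treated as startup noise; only growth is bad.
    growing = s.violations_now > s.violations_prev
    report["invariant"] = "bad" if growing else "healthy"
    return report


print(assess(Snapshot(27.0, 0, 0, 12.0, 3, 3)))
```

The example snapshot uses the post-fix figures quoted on this page (max locked $27, zero -2010 errors); the trade age and violation counts are made-up values.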
5. Five margin-leak fix rounds
Full incident timeline (chronological), useful for understanding why the current architecture looks the way it does:
Round 1: bot-side position cap (commits 077887d / 526cac5 / 97d23a6 / 1ae6f1a)
Symptom: within 30 minutes max(locked) climbed to $99.
Fix: add position-cap gates to LadderMakerLoop / AnchorMakerLoop (Strategy Loops § 2.4).
Result: slowed but didn't cure.
Round 2: phantom-trade refusal (commits e67b63f / fd6b9e2)
Symptom: with the cap gate live, still a leak. Diagnosed: apply_trade_matched fell through to leverage=1 when it couldn't find the orders_open row.
Fix: apply.rs now returns early and publishes an OrderCancelled message over NATS so that ME drops the ghost order.
Result: reduced but still leaking.
Round 3: invariant instrumentation (commit e32ae17)
Symptom: still leaking, but couldn't pin down which trade caused it.
Fix: at the end of apply_trade_matched, add tracing::error! checking accounts.locked == sum(positions.locked_margin) + sum(orders_open.margin_locked).
Result: 30 minutes yielded 956 violations. Data showed pos_sum values containing 142857 recurring (= 1/7) → leverage was being computed as 7.
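The invariant and the 1/7 signature can be illustrated with Python's decimal module. This is a sketch of the arithmetic only, not the Rust instrumentation; invariant_diff and inferred_leverage are hypothetical helper names, and the amounts are made up.

```python
from decimal import Decimal


def invariant_diff(account_locked, position_margins, open_order_margins):
    """accounts.locked minus (sum of position margins + open-order margins)."""
    expected = sum(position_margins, Decimal(0)) + sum(open_order_margins, Decimal(0))
    return account_locked - expected


def inferred_leverage(notional, margin):
    """Nearest integer leverage implied by a margin figure."""
    return int((notional / margin).to_integral_value())


# A consistent account: 6 in positions + 4 in open orders = 10 locked.
print(invariant_diff(Decimal("10"), [Decimal("6")], [Decimal("4")]))

# A position margin computed with leverage 7 on a notional of 1 USDC:
pos_margin = Decimal(1) / Decimal(7)
print(pos_margin)
print(inferred_leverage(Decimal(1), pos_margin))
```

A margin of the form x/7 is what makes the repeating 142857 digits stand out in the logged pos_sum values, because a leverage of 10 would only ever produce terminating decimals.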
Round 4: reset.sh ops fix
Symptom: reset flow often half-failed (ME wasn't restarted / SQL heredoc swallowed by ssh quoting).
Fix: full rewrite of reset.sh including ME pkill + restart + stdin-pipe SQL.
Result: cleanup became reproducible, but the leak itself wasn't fixed.
Round 5: LEVERAGE_V1 constant (commit dd653e6)
Root cause: apply.rs derived leverage from notional / order_margin, which rounds to 7/9/11 after partial fills + price drift.
Fix: replace derivation with const LEVERAGE_V1: u32 = 10;, aligning with the hardcoded leverage: 10 already used everywhere above api-gateway.
Result: fixed. Over 63 minutes of runtime after deploy, max_locked held stable at $27 with 0 invariant violations and 0 -2010 errors.
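Why a ratio-based derivation is fragile can be seen with a few lines of Python. The numbers below are illustrative and not taken from the incident data, and derived_leverage is a hypothetical stand-in for the old apply.rs logic: it only shows that when the notional and the margin are measured at different moments, their rounded ratio wanders, whereas a constant cannot.

```python
LEVERAGE_V1 = 10  # mirrors the Rust constant named above


def derived_leverage(notional, order_margin):
    """Old approach (toy): infer leverage from a ratio, rounded to an integer."""
    return round(notional / order_margin)


def margin_constant(notional):
    """New approach: margin always follows the fixed leverage."""
    return notional / LEVERAGE_V1


# Assumed setup: margin of 10.0 was locked for an order of notional 100.0.
order_margin = 10.0
derived = []
for fill_notional in (70.0, 90.0, 100.0, 110.0):
    lev = derived_leverage(fill_notional, order_margin)
    derived.append(lev)
    print(fill_notional, lev, margin_constant(fill_notional))
```

With the constant, the margin is always one tenth of the notional regardless of how the fill relates to the original order, which is consistent with the invariant holding after the fix.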
6. Troubleshooting checklist
Bot seems idle
- systemctl --user is-active rocky-bot: is the service alive?
- journalctl --user -u rocky-bot --since "1 min ago": recent logs
- Check if max(locked) is near $98; if so, accounts are full
- Check if recent trades are fresh; if not, BinanceFeed may have disconnected
CB tripping repeatedly
- Check journal for "CircuitBreaker opened: reason=..." (API errors or max loss)
- If API errors: is the backend healthy? curl https://demo.rocky.exchange/api/perp/markets
- If max loss: manually reset CB (restart bot) OR raise RiskCaps.max_loss_usdc
Invariant violations reappear
- Find specific cases: grep "invariant violated" /tmp/rocky-services/internal-ledger.log | head -3
- Check the diff and inferred leverage in each
- If leverage-related, retrace round 5
- If new pattern, possibly a new backend leak → open a spec and run the diagnostic flow
7. Related
- Overview — top-level architecture
- Strategy Loops — three strategy internals
- Risk Controls — RiskCaps / CircuitBreaker / position cap / LEVERAGE_V1
- Repo spec docs (full incident postmortems):
  - rocky.interface/docs/superpowers/specs/2026-05-25-rocky-bot-position-cap-fix-design.md
  - rocky.interface/docs/superpowers/specs/2026-05-25-phantom-trade-fix-design.md
  - rocky.interface/docs/superpowers/specs/2026-05-25-margin-leak-instrumentation-design.md
  - rocky.interface/docs/superpowers/specs/2026-05-25-leverage-derivation-fix-design.md