Deployment & Operations

rocky-bot runs on EC2 as a systemd user service and trades through 30 accounts minted in one pass by mint-30.sh. This page covers the full lifecycle: zero-to-running setup, continuous monitoring, and incident postmortems.

If you are not yet familiar with the strategy itself, read Strategy Loops and Risk Controls first.


1. Initial provisioning: mint 30 accounts

Source: rocky-bot/scripts/mint-30.sh

cd /Users/ubuntu/Desktop/Rocky/rocky-bot
bash scripts/mint-30.sh > .keys.json

What the script does (about 1.5 minutes, all of it over SSH to EC2):

  1. Generates 30 role manifests (inline python3 producing JSON)
    • 12 BUY ladders (5/6/7/8/9/10 bps × L1 + 15/20/25/30 bps × L2 + 50/100 bps × L3)
    • 12 SELL ladders (symmetric)
    • 1 anchor
    • 5 takers
  2. Calls the mint_api_key Rust binary on EC2 once per manifest
    • Runs cargo run -p api-gateway --bin mint_api_key -- --new-user --label '<id>' in the rocky-backend repo
    • Backend creates an auth.api_keys row + a new user_id
    • Returns user_id / api key / secret
  3. Seeds $100 USDC to each new user immediately
    • Calls backend /v1/deposits/seed
  4. Aggregates the final .keys.json to stdout
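The role breakdown in step 1 can be sketched as a small generator. This is a minimal illustration, not the inline python3 from mint-30.sh: the ladder id format follows the "mm-l1-buy-05bps" example on this page, while the anchor and taker ids (and the `build_manifests` name) are assumptions.

```python
# Offsets per ladder tier, taken from the role list above.
LADDER_TIERS = {
    "l1": [5, 6, 7, 8, 9, 10],
    "l2": [15, 20, 25, 30],
    "l3": [50, 100],
}


def build_manifests():
    """Return 30 role manifests: 24 ladders, 1 anchor, 5 takers."""
    manifests = []
    for side in ("BUY", "SELL"):
        for tier, offsets in LADDER_TIERS.items():
            for bps in offsets:
                manifests.append({
                    # Same shape as the "mm-l1-buy-05bps" id in .keys.json.
                    "id": f"mm-{tier}-{side.lower()}-{bps:02d}bps",
                    "role": "ladder",
                    "side": side,
                    "offset_bps": bps,
                })
    # Assumed ids: only the "mm-" and "taker-" label prefixes appear
    # elsewhere on this page; the real script may name these differently.
    manifests.append({"id": "mm-anchor", "role": "anchor"})
    for n in range(1, 6):
        manifests.append({"id": f"taker-{n:02d}", "role": "taker"})
    return manifests
```

Counting the result is a quick way to confirm the 12 + 12 + 1 + 5 split before any key is minted.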

.keys.json structure example:

{
"rocky_fapi_url": "https://demo.rocky.exchange",
"accounts": [
{
"id": "mm-l1-buy-05bps",
"role": "ladder",
"side": "BUY",
"offset_bps": 5,
"user_id": "5cfb031b-5936-4467-9533-cd2df576dbb8",
"key": "...",
"secret": "..."
},
...30 total
]
}
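Before deploying, it can help to sanity-check the generated file. The sketch below assumes every account carries the `id`, `role`, `user_id`, `key`, and `secret` fields shown in the example; `validate_keys` is a hypothetical helper, not part of the repo.

```python
REQUIRED_FIELDS = ("id", "role", "user_id", "key", "secret")


def validate_keys(doc, expected=30):
    """Return a list of problems found in a parsed .keys.json document."""
    problems = []
    if not doc.get("rocky_fapi_url"):
        problems.append("missing rocky_fapi_url")
    accounts = doc.get("accounts", [])
    if len(accounts) != expected:
        problems.append(f"expected {expected} accounts, found {len(accounts)}")
    seen = set()
    for i, acct in enumerate(accounts):
        for field in REQUIRED_FIELDS:
            if not acct.get(field):
                problems.append(f"accounts[{i}] missing {field}")
        acct_id = acct.get("id")
        if acct_id is not None and acct_id in seen:
            problems.append(f"duplicate id {acct_id}")
        seen.add(acct_id)
    return problems
```

An empty list means the file has the expected shape; calling it on `json.load(open(".keys.json"))` is one possible usage.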

Security note: .keys.json holds 30 (key, secret) pairs — highly sensitive.

  • .gitignored, never goes into git
  • Only scp'd to EC2 during deploy
  • Keep a local copy for re-deploy purposes

1.1 SSH multiplexing

mint-30.sh uses ControlMaster to reuse a single SSH connection, so opening 30 sessions does not trip the server's MaxStartups limit:

SSH_OPTS=(-i "$SSH_KEY" -o ControlMaster=auto -o ControlPath="$CTRL_PATH" -o ControlPersist=60)

Without this, sshd used to reject the 8th connection ("connection reset by peer").
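The same multiplexing options can be expressed from Python when scripting against the host. This is a sketch only: `ssh_argv` and its parameters are hypothetical, and the option values simply mirror the SSH_OPTS line above.

```python
def ssh_argv(host, remote_cmd, ssh_key, ctrl_path, persist_s=60):
    """Build an ssh argv that reuses one multiplexed connection."""
    return [
        "ssh",
        "-i", ssh_key,
        # The first call opens the master; later calls reuse its socket.
        "-o", "ControlMaster=auto",
        "-o", f"ControlPath={ctrl_path}",
        # Keep the master alive for persist_s seconds after the last use.
        "-o", f"ControlPersist={persist_s}",
        host,
        remote_cmd,
    ]
```

Passing the returned list to `subprocess.run(..., check=True)` once per account means only the first call pays for a new connection, as long as every call uses the same `ctrl_path`.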


2. Deploy: ship / upgrade bot code

Source: rocky-bot/deploy.sh

cd /Users/ubuntu/Desktop/Rocky/rocky-bot
./deploy.sh

What it does:

  1. rsync source from rocky_bot/ to EC2 ~/rocky-bot/
  2. scp .env (env config) + scp .keys.json (account credentials)
  3. Over SSH: uv venv --python 3.12 --allow-existing && uv pip install -e .
  4. Over SSH: systemctl --user restart rocky-bot

rsync runs with --exclude .env --exclude .keys.json, so the sensitive files travel only through the dedicated scp step.

2.1 systemd unit

# ~/.config/systemd/user/rocky-bot.service
[Unit]
Description=rocky-bot — volume generator for demo.rocky.exchange
After=network.target

[Service]
Type=exec
WorkingDirectory=%h/rocky-bot
ExecStart=%h/rocky-bot/.venv/bin/python -m rocky_bot.main
Restart=on-failure
RestartSec=5

[Install]
WantedBy=default.target

Control commands:

systemctl --user start rocky-bot
systemctl --user stop rocky-bot
systemctl --user restart rocky-bot
systemctl --user is-active rocky-bot
journalctl --user -u rocky-bot --since "5 min ago" --no-pager
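For scripted health checks, the is-active command can be wrapped as below. This is a minimal sketch that assumes it runs on the EC2 host as the service's user; the `run` parameter is a hypothetical injection point so the function can be exercised without systemd.

```python
import subprocess


def is_active(unit="rocky-bot", run=subprocess.run):
    """Return True when `systemctl --user is-active <unit>` prints "active"."""
    result = run(
        ["systemctl", "--user", "is-active", unit],
        capture_output=True,
        text=True,
        check=False,  # is-active exits non-zero for inactive units
    )
    return result.stdout.strip() == "active"
```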

3. Reset: clear state + restart

Source: rocky-bot/scripts/reset.sh

bash scripts/reset.sh

5 steps:

  1. systemctl --user stop rocky-bot
  2. SQL reset for 30 funnel accounts:
    UPDATE ledger.positions SET qty=0, locked_margin=0 WHERE user_id IN (funnel users);
    UPDATE ledger.accounts SET available = available + locked, locked = 0
    WHERE asset='USDC' AND user_id IN (funnel users);
    DELETE FROM ledger.orders_open WHERE user_id IN (funnel users);
  3. pkill -f 'target/release/matching-engine', then restart the matching engine (ME) under nohup so its in-memory book matches the just-cleared orders_open
  4. sleep 3
  5. systemctl --user restart rocky-bot
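The SQL in step 2 leaves "(funnel users)" as a placeholder. One way to make it concrete is sketched below, assuming the funnel accounts are exactly those matched by the label filter in the section 4.1 monitoring query; the real reset.sh may select them differently, and the surrounding transaction is an addition for safety rather than something the script is known to do.

```python
# Assumed definition of "funnel users": the label filter from section 4.1.
FUNNEL_USERS = (
    "SELECT user_id FROM auth.api_keys "
    "WHERE label LIKE 'mm-%' OR label LIKE 'taker-%'"
)


def reset_sql(funnel_users=FUNNEL_USERS):
    """Return the three reset statements wrapped in one transaction."""
    return "\n".join([
        "BEGIN;",
        "UPDATE ledger.positions SET qty = 0, locked_margin = 0 "
        f"WHERE user_id IN ({funnel_users});",
        "UPDATE ledger.accounts SET available = available + locked, locked = 0 "
        f"WHERE asset = 'USDC' AND user_id IN ({funnel_users});",
        f"DELETE FROM ledger.orders_open WHERE user_id IN ({funnel_users});",
        "COMMIT;",
    ])
```

Feeding the returned text to psql over stdin avoids nested shell quoting, which is the failure mode round 4 below describes.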

3.1 Why ME needs restarting

Historical lesson: early resets cleared only the DB, not ME. ME's in-memory book still referenced orders whose rows the DELETE had removed from orders_open. When the new bot started, ME matched against these "ghost orders"; the backend's apply.rs could not find the corresponding order row, fell through to the leverage=1 fallback, and computed the wrong margin.

With this step, ME reloads from the now-empty orders_open on restart, so its book starts clean.


4. Monitoring (runtime)

Four key metrics:

4.1 max(locked) — any account near blow-up?

ssh ... 'docker exec rocky-backend-stack-postgres-1 psql -U rocky -d rocky -c "
SELECT round(max(locked)::numeric, 2) AS max_l,
round(avg(locked)::numeric, 2) AS avg_l,
count(*) FILTER (WHERE locked > 50) AS over_50,
count(*) FILTER (WHERE locked > 80) AS over_80
FROM ledger.accounts a JOIN auth.api_keys k ON k.user_id = a.user_id
WHERE a.asset = '\''USDC'\''
AND (k.label LIKE '\''mm-%'\'' OR k.label LIKE '\''taker-%'\'')
"'

Healthy (based on post-leverage-fix measurements):

  • max_l < 50 USDC
  • over_80 == 0
  • avg_l ~ 20–30 USDC

Bad: max_l near 100 means the cap has stopped working, since each account is seeded with only $100 USDC. Treat it as a major incident.
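The thresholds above could be encoded for an alerting script along these lines. The 95 USDC cut-off for "near 100" and the intermediate "watch" band are assumptions; the page only defines the healthy and bad ends.

```python
def classify_locked(max_l, over_80):
    """Classify the 4.1 query result (values in USDC)."""
    if max_l < 50 and over_80 == 0:
        return "healthy"
    if max_l >= 95:  # assumed cut-off for "near 100"
        return "incident"
    return "watch"  # assumed label for everything in between
```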

4.2 -2010 error count — how often the backend rejects orders

ssh ... 'journalctl --user -u rocky-bot --since "30 min ago" --no-pager 2>&1 | grep -c "\-2010"'

-2010 means "insufficient balance": the backend could not lock the order's margin. Healthy:

  • < 30 over 30 min (under 1/min)
  • Perfect: 0

Bad: > 100 means accounts are full — typically alongside max_l near 100.
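A Python equivalent of the grep count, with the bands above applied, might look like this. Like `grep -c`, it counts matching lines rather than occurrences; the "watch" label for the unspecified 30 to 100 range is an assumption.

```python
def count_2010(log_lines):
    """Count journal lines that mention error code -2010."""
    return sum(1 for line in log_lines if "-2010" in line)


def classify_2010(count):
    """Map a 30-minute -2010 count onto the bands listed above."""
    if count < 30:
        return "healthy"
    if count > 100:
        return "bad"
    return "watch"  # assumed label; the page leaves this range undefined
```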

4.3 recent trades — fills still happening?

ssh ... 'docker exec rocky-backend-stack-postgres-1 psql -U rocky -d rocky -c "
SELECT symbol, side, price, qty, ts FROM ledger.trades ORDER BY ts DESC LIMIT 3
"'

Healthy: most recent ts within 1 minute. Bad: ts more than 5 minutes ago → bot isn't working (might be dead or fully capped).
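The freshness rule can be sketched as a function of the newest trade timestamp. It assumes `ts` is available as a timezone-aware datetime; the "watch" label for the 1 to 5 minute gap is an assumption.

```python
from datetime import datetime, timedelta, timezone


def trade_freshness(last_ts, now=None):
    """Classify the age of the most recent trade."""
    now = now or datetime.now(timezone.utc)
    age = now - last_ts
    if age <= timedelta(minutes=1):
        return "healthy"
    if age > timedelta(minutes=5):
        return "bad"
    return "watch"  # assumed label for the unspecified 1..5 minute gap
```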

4.4 invariant logger — backend computation correctness

ssh ... 'grep -c "invariant violated" /tmp/rocky-services/internal-ledger.log'

Healthy: 0 or a stable low number (just startup noise). Bad: growing → backend matching layer has a bug.
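Because the signal here is growth rather than the absolute count, a check needs at least two samples. A minimal sketch, assuming the counts come from repeated runs of the grep above; it does not judge whether a stable count is also "low".

```python
def invariant_trend(samples):
    """Classify successive `grep -c` counts, oldest first."""
    if len(samples) < 2:
        return "unknown"  # a single sample cannot show growth
    return "bad" if samples[-1] > samples[0] else "healthy"
```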


5. Five margin-leak fix rounds

Full incident timeline (chronological), useful for understanding why the current architecture looks the way it does:

Round 1: bot-side position cap (commits 077887d / 526cac5 / 97d23a6 / 1ae6f1a)

  • Symptom: within 30 minutes, max(locked) climbed to $99.
  • Fix: add position-cap gates to LadderMakerLoop / AnchorMakerLoop (Strategy Loops § 2.4).
  • Result: the leak slowed but was not cured.

Round 2: phantom-trade refusal (commits e67b63f / fd6b9e2)

  • Symptom: the leak persisted with the cap gate live.
  • Diagnosis: apply_trade_matched fell through to leverage=1 when it could not find the orders_open row.
  • Fix: early-return in apply.rs and publish OrderCancelled over NATS so ME drops the ghost order.
  • Result: the leak shrank but did not stop.

Round 3: invariant instrumentation (commit e32ae17)

  • Symptom: still leaking, with no way to pin down which trade caused it.
  • Fix: at the end of apply_trade_matched, log a tracing::error! whenever accounts.locked != sum(positions.locked_margin) + sum(orders_open.margin_locked).
  • Result: 956 violations in 30 minutes. pos_sum values contained the recurring digits 142857 (the decimal expansion of 1/7), which showed that leverage was being computed as 7.
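The round 3 invariant and the 1/7 clue can be reproduced in a few lines. This is an illustrative Python sketch, not the Rust in apply.rs; the function names and the 1000 USDC notional are made up for the example.

```python
from decimal import Decimal


def invariant_diff(account_locked, position_margins, order_margins):
    """Return accounts.locked minus the margin it should equal (0 = holds)."""
    return account_locked - (sum(position_margins) + sum(order_margins))


def inferred_leverage(notional, locked_margin):
    """Back out the leverage that would produce this locked margin."""
    return notional / locked_margin


# A margin computed at 7x shows the recurring 142857 digits of 1/7.
margin_at_7x = Decimal("1000") / 7
has_seventh_digits = "142857" in str(margin_at_7x)
```

Spotting those digits in a logged margin is a strong hint that some divisor of 7 was involved, which is how the bogus leverage value surfaced.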

Round 4: reset.sh ops fix

  • Symptom: the reset flow often half-failed (ME was not restarted, or the SQL heredoc was swallowed by ssh quoting).
  • Fix: full rewrite of reset.sh, including ME pkill + restart and SQL piped over stdin.
  • Result: cleanup became reproducible, but the leak itself was not fixed.

Round 5: LEVERAGE_V1 constant (commit dd653e6)

  • Root cause: apply.rs derived leverage from notional / order_margin, which rounded to 7, 9, or 11 after partial fills and price drift.
  • Fix: replace the derivation with const LEVERAGE_V1: u32 = 10;, matching the hardcoded leverage: 10 already used everywhere above api-gateway.
  • Result: leak eliminated. After 63 minutes of runtime post-deploy, max(locked) was stable at $27, with 0 invariant violations and 0 -2010 errors.
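To see how a derived leverage can drift away from 10, consider the sketch below. The numbers are illustrative assumptions (an order whose 10 USDC margin was locked at 10x on 100 USDC notional), not values from the incident logs, and the Python functions only mimic the logic described for apply.rs.

```python
LEVERAGE_V1 = 10  # mirrors `const LEVERAGE_V1: u32 = 10;`


def derived_leverage(notional, order_margin):
    """Old approach: infer leverage from notional / order_margin."""
    return round(notional / order_margin)


def position_margin(notional, leverage):
    """Margin to lock for a fill of the given notional."""
    return notional / leverage


# Partial fill of 70 against the full 10 margin: leverage looks like 7.
partial_fill = derived_leverage(70, 10)
# Price drift moves the notional to 93 or 108: leverage looks like 9 or 11.
drift_down = derived_leverage(93, 10)
drift_up = derived_leverage(108, 10)
```

With the derived value, the 70 USDC fill locks 70 / 7 = 10 USDC instead of the 7 USDC that a fixed 10x would lock, so locked margin accumulates beyond what the positions justify. Using the constant removes that dependence on fill size and price.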


6. Troubleshooting checklist

Bot seems idle

  1. systemctl --user is-active rocky-bot — is the service alive?
  2. journalctl --user -u rocky-bot --since "1 min ago" — recent logs
  3. Check if max(locked) is near $98 — if so, accounts are full
  4. Check if recent trades are fresh — if not, BinanceFeed may have disconnected

Circuit breaker (CB) tripping repeatedly

  1. Check journal for "CircuitBreaker opened: reason=..." — API errors or max loss
  2. If API errors: is the backend healthy? curl https://demo.rocky.exchange/api/perp/markets
  3. If max loss: manually reset CB (restart bot) OR raise RiskCaps.max_loss_usdc

Invariant violations reappear

  1. Find specific cases: grep "invariant violated" /tmp/rocky-services/internal-ledger.log | head -3
  2. Check the diff and inferred leverage in each
  3. If leverage-related, retrace round 5
  4. If new pattern, possibly a new backend leak → open a spec and run the diagnostic flow


7. Related docs

  • Overview — top-level architecture
  • Strategy Loops — three strategy internals
  • Risk Controls — RiskCaps / CircuitBreaker / position cap / LEVERAGE_V1
  • Repo spec docs (full incident postmortems):
    • rocky.interface/docs/superpowers/specs/2026-05-25-rocky-bot-position-cap-fix-design.md
    • rocky.interface/docs/superpowers/specs/2026-05-25-phantom-trade-fix-design.md
    • rocky.interface/docs/superpowers/specs/2026-05-25-margin-leak-instrumentation-design.md
    • rocky.interface/docs/superpowers/specs/2026-05-25-leverage-derivation-fix-design.md