feat: containerized control plane — Postgres persistence, dark-agent init, tenant onboarding, fleet UI #3
Turns the pre-scaffold repo into a working, containerized multi-tenant control plane. Verified live end-to-end against real Yandex Cloud, a live dark-agent 0.19.0 guest, HashiCorp Vault, and Postgres.
What landed
- Dockerfile + docker-compose.yml (orchestrator + Postgres + Vault + UI); DARK_BIND configurable listen address.
- FleetRegistry port + Postgres adapter (durable fleet + lifecycle status), durable Postgres vault, db module + migrations; reconcile_once loop (orphan #767 / stuck / interrupted-teardown / Ready-drift), timer in main.
- DarkAgent port to the real guest API + HttpDarkAgent (reqwest/rustls); the init→claude→export dance, bundle round-tripped through the vault; guest_report (live /health + /heal).
- TenantSecretSource/Sink ports + Vault KV v2 adapter; PUT /tenants/:tenant/secrets onboarding endpoint, a secret-write surface kept off the fleet-command API (#737).
- GET /instances, GET /instances/:tenant/:id detail; a TypeScript backend-for-frontend dashboard (fleet table, provision/teardown/onboard, per-VM page with opencode/dark-agent/heal entry points + live guest).

Live-hardening fixes (from the real run)
- […] DARK_VM_IMAGE), not a request field.
- […] dark-vm-{uuid} (no YC 409 on re-provision; non-ASCII tenants safe).
- […] nks_configured).
- RECONCILE_INTERVAL_SECS=0 disables reconcile (safety valve for manual real-cloud runs); empty env vars treated as unset.

Known follow-ups (not in this PR)
Sandbox project: CI runs fmt + clippy + test (41 tests; PG/Vault/YC integration tests are #[ignore]).

Split vault.rs into vault/{mod,mock}.rs (dir-module form) and add a VaultError::Backend variant distinct from Missing. The new variant forces the load-bearing match in control::initialize to decide explicitly: a backend read failure now bails (return Err) instead of falling through to the init/export path that would rotate a live tenant's durable bundle (NKS #628). Regression test proves no dark-agent ops run on a vault backend error.

First real provision→init against a live dark-agent 0.19.0 failed at the init step with 'error decoding response body': the guest returns {ssh_pubkey, forges} but the InitOutcome DTO required nks_configured. Make InitOutcome/ClaudeOutcome/ImportOutcome #[serde(default)] (Default-derived) so the adapter is liberal in what it accepts across dark-agent versions. Regression test pins the real 0.19.0 init response. Verified end to end after the fix: VM provisions with a public IP, /health is reached, init decodes, and the claude step runs and correctly surfaces the guest's validation (the test tenant's stored token wasn't a real sk-ant-oat01 Claude Max token).

The tenant-derived name meant YC's per-folder name uniqueness rejected a second provision for the same tenant with 409 (surfaced as a 500), and a leftover VM blocked re-provision. Name each VM dark-vm-{uuid} instead: unique per provision, so a tenant may run more than one VM and re-provision never collides. The tenant is NOT in the name: the registry maps provider-id -> tenant. Keeps FLEET_NAME_PREFIX for reconcile's orphan scan. Side benefit: non-ASCII tenants can no longer produce an invalid YC resource name (the earlier Cyrillic 'invalid resource name').
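
The dark-vm-{uuid} naming described above could look roughly like the following. This is a minimal std-only sketch, not the PR's code: `hyphenated`, `vm_name` and `is_fleet_vm` are hypothetical helpers, the value given to `FLEET_NAME_PREFIX` is an assumption, and the id is passed in as a `u128` where a real implementation would presumably draw it from a UUID generator.

```rust
// ASSUMPTION: the actual value of FLEET_NAME_PREFIX is not shown in this PR.
const FLEET_NAME_PREFIX: &str = "dark-vm-";

/// Format 128 bits as a lowercase hyphenated UUID string (8-4-4-4-12).
fn hyphenated(id: u128) -> String {
    let h = format!("{:032x}", id);
    format!(
        "{}-{}-{}-{}-{}",
        &h[0..8],
        &h[8..12],
        &h[12..16],
        &h[16..20],
        &h[20..32]
    )
}

/// Name for a new VM. The tenant is deliberately not an input, so the
/// name is always ASCII and never collides across provisions.
fn vm_name(id: u128) -> String {
    format!("{}{}", FLEET_NAME_PREFIX, hyphenated(id))
}

/// An orphan scan can still recognise fleet VMs by prefix alone.
fn is_fleet_vm(name: &str) -> bool {
    name.starts_with(FLEET_NAME_PREFIX)
}
```

Because the tenant never reaches the name, ownership has to be looked up elsewhere, which is the role the description gives to the registry's provider-id -> tenant mapping.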
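
The VaultError::Backend change described above comes down to one match that must not treat "the vault failed" as "nothing is stored". A hedged sketch of that decision, assuming a stored bundle is imported and a missing one triggers init/export: the `Action` enum and `decide` function are illustrative names, and the real control::initialize is not shown in this PR description.

```rust
/// The two failure modes the description distinguishes.
#[derive(Debug, PartialEq)]
enum VaultError {
    /// No bundle has been stored for this tenant yet.
    Missing,
    /// The vault backend itself failed to answer.
    Backend(String),
}

/// Hypothetical outcome of the decision, for illustration only.
#[derive(Debug, PartialEq)]
enum Action {
    /// A durable bundle exists: reuse it.
    ImportExisting(Vec<u8>),
    /// Nothing stored: run init and export a fresh bundle.
    InitAndExport,
}

/// Decide what to do with the result of reading the tenant's bundle.
fn decide(read: Result<Vec<u8>, VaultError>) -> Result<Action, VaultError> {
    match read {
        Ok(bundle) => Ok(Action::ImportExisting(bundle)),
        // Only a confirmed absence may start the init/export path.
        Err(VaultError::Missing) => Ok(Action::InitAndExport),
        // A backend failure bails, so a live bundle is never rotated.
        Err(e @ VaultError::Backend(_)) => Err(e),
    }
}
```

With no wildcard arm, adding a further VaultError variant later becomes a compile error at this match, which is what makes the choice explicit.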
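
The RECONCILE_INTERVAL_SECS behaviour listed under the live-hardening fixes (0 disables reconcile, empty counts as unset) might be parsed along these lines. This is a sketch under assumptions: `non_empty` and `reconcile_interval` are hypothetical names, and the default interval is a parameter because the PR does not state one.

```rust
use std::num::ParseIntError;
use std::time::Duration;

/// Treat an empty or whitespace-only value the same as an unset variable.
fn non_empty(raw: Option<String>) -> Option<String> {
    raw.filter(|v| !v.trim().is_empty())
}

/// Returns `Ok(None)` when the reconcile timer should be disabled.
fn reconcile_interval(
    raw: Option<String>,
    default_secs: u64,
) -> Result<Option<Duration>, ParseIntError> {
    let secs = match non_empty(raw) {
        Some(v) => v.trim().parse::<u64>()?,
        None => default_secs,
    };
    // 0 is the explicit "off" switch rather than a zero-length timer.
    Ok(if secs == 0 {
        None
    } else {
        Some(Duration::from_secs(secs))
    })
}
```

A caller would pass `std::env::var("RECONCILE_INTERVAL_SECS").ok()` as the first argument and skip starting the timer when the result is `Ok(None)`.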