Feat: Add production-grade multi-stage container image for API

Two-stage build (uv builder + python:3.12-slim runtime) with non-root user (UID 1001), no dev deps, layer-cache-optimised dep install, and graceful SIGTERM shutdown.

Verified by api/tests/build/verify_production_image.sh, covering build, health endpoint, non-root, stdout logging, secret-free layers, missing-env-var exit, and dep-layer cache hit. All 102 integration tests still pass; shellcheck clean.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Feat: Rate-limit login endpoint to block brute-force attacks
2026-05-07 19:59:29 +00:00 · 2026-05-06 21:01:37 +00:00
31 changed files with 2385 additions and 8 deletions
--- a/.env.example
+++ b/.env.example
@@ -19,3 +19,11 @@ JWT_SECRET_KEY=change-me-to-a-long-random-string
 JWT_EXPIRY_SECONDS=86400
 OWNER_USERNAME=owner
 OWNER_PASSWORD=change-me
+
+# Login brute-force protection
+LOGIN_MAX_FAILURES=5
+LOGIN_WINDOW_SECONDS=300
+LOGIN_COOLDOWN_SECONDS=900
+# Comma-separated IPs/CIDRs of trusted upstream proxies (e.g. nginx ingress pod CIDR).
+# Leave empty when not behind a reverse proxy.
+LOGIN_TRUSTED_PROXY_IPS=
--- a/.env.test.example
+++ b/.env.test.example
@@ -27,3 +27,10 @@ OWNER_PASSWORD=testpassword
 # API
 API_BASE_URL=http://localhost:8000
 MAX_UPLOAD_BYTES=52428800
+
+# Login brute-force protection
+LOGIN_MAX_FAILURES=5
+LOGIN_WINDOW_SECONDS=300
+LOGIN_COOLDOWN_SECONDS=900
+# Comma-separated IPs/CIDRs of trusted upstream proxies; leave empty for direct connections.
+LOGIN_TRUSTED_PROXY_IPS=
--- a/.gitignore
+++ b/.gitignore
@@ -16,6 +16,7 @@ venv/
 *.egg-info/
 dist/
 build/
+!api/tests/build/
 .pytest_cache/
 .ruff_cache/
 .coverage
--- a/.specify/feature.json
+++ b/.specify/feature.json
@@ -1 +1,3 @@
-{"feature_directory":"specs/008-postgres-integration-tests"}
+{
+  "feature_directory": "specs/010-api-prod-dockerfile"
+}
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -1,5 +1,5 @@
 <!-- SPECKIT START -->
 For additional context about technologies to be used, project structure,
 shell commands, and other important information, read the current plan at
-`specs/008-postgres-integration-tests/plan.md`.
+`specs/010-api-prod-dockerfile/plan.md`.
 <!-- SPECKIT END -->
--- a/8
+++ b/8
@@ -1,7 +1,13 @@
-.PHONY: test-unit test-integration
+.PHONY: test-unit test-integration build-prod verify-prod

 test-unit:
 	cd api && python -m pytest tests/unit/ -v

 test-integration:
 	docker compose -f docker-compose.test.yml run --rm api-test
+
+build-prod:
+	docker build -f api/Dockerfile.prod api/ -t reactbin-api-prod:latest
+
+verify-prod:
+	bash api/tests/build/verify_production_image.sh
--- a/api/.dockerignore
+++ b/api/.dockerignore
@@ -12,3 +12,6 @@ dist/
 .env
 .env.*
 !.env.example
+tests/
+alembic/
+alembic.ini
--- a/api/Dockerfile.prod
+++ b/api/Dockerfile.prod
@@ -0,0 +1,51 @@
+# syntax=docker/dockerfile:1
+
+# ════════════════════════════════════════════════
+# Build stage: install production deps via uv
+# ════════════════════════════════════════════════
+FROM ghcr.io/astral-sh/uv:python3.12-bookworm-slim AS builder
+
+WORKDIR /app
+
+ENV UV_COMPILE_BYTECODE=1 \
+    UV_LINK_MODE=copy \
+    UV_PYTHON_DOWNLOADS=never
+
+# Layer cache split: deps only (changes rarely)
+COPY pyproject.toml uv.lock ./
+RUN --mount=type=cache,target=/root/.cache/uv \
+    uv sync --frozen --no-dev --no-install-project
+
+# Layer cache split: source (changes often)
+COPY app/ ./app/
+
+# ════════════════════════════════════════════════
+# Runtime stage: lean image with venv + source
+# ════════════════════════════════════════════════
+FROM python:3.12-slim
+
+WORKDIR /app
+
+RUN apt-get update \
+    && apt-get install -y --no-install-recommends curl \
+    && rm -rf /var/lib/apt/lists/*
+
+RUN groupadd --system --gid 1001 appgroup \
+    && useradd --system --uid 1001 --gid 1001 --no-create-home appuser
+
+COPY --from=builder --chown=appuser:appgroup /app/.venv /app/.venv
+COPY --chown=appuser:appgroup app/ ./app/
+
+USER appuser
+
+ENV PATH="/app/.venv/bin:$PATH"
+
+EXPOSE 8000
+
+HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
+    CMD curl -f http://localhost:8000/api/v1/health || exit 1
+
+CMD ["uvicorn", "app.main:app", \
+     "--host", "0.0.0.0", \
+     "--port", "8000", \
+     "--timeout-graceful-shutdown", "30"]
--- a/api/app/auth/rate_limiter.py
+++ b/api/app/auth/rate_limiter.py
@@ -0,0 +1,91 @@
+import ipaddress
+import logging
+import time
+from dataclasses import dataclass, field
+from ipaddress import IPv4Network, IPv6Network
+from threading import Lock
+
+from starlette.requests import Request
+
+logger = logging.getLogger(__name__)
+
+
+def get_client_ip(
+    request: Request,
+    trusted_networks: list[IPv4Network | IPv6Network],
+) -> str:
+    """Return the resolved client IP, honouring X-Forwarded-For when the
+    TCP peer is a trusted upstream proxy. Falls back to the TCP peer address
+    when no trusted networks are configured or the peer is not in the list."""
+    peer = request.client.host if request.client else "unknown"
+    if trusted_networks and peer != "unknown":
+        try:
+            peer_addr = ipaddress.ip_address(peer)
+            if any(peer_addr in net for net in trusted_networks):
+                xff = request.headers.get("X-Forwarded-For", "").split(",")[0].strip()
+                if xff:
+                    return xff
+                real_ip = request.headers.get("X-Real-IP", "").strip()
+                if real_ip:
+                    return real_ip
+        except ValueError:
+            pass
+    return peer
+
+
+@dataclass
+class _Record:
+    failures: int = 0
+    window_start: float = field(default_factory=time.time)
+    blocked_until: float = 0.0
+
+
+class LoginRateLimiter:
+    def __init__(
+        self,
+        max_failures: int = 5,
+        window_seconds: int = 300,
+        cooldown_seconds: int = 900,
+    ) -> None:
+        self._max = max_failures
+        self._window = window_seconds
+        self._cooldown = cooldown_seconds
+        self._store: dict[str, _Record] = {}
+        self._lock = Lock()
+
+    @property
+    def cooldown_seconds(self) -> int:
+        return self._cooldown
+
+    def is_blocked(self, ip: str) -> bool:
+        now = time.time()
+        with self._lock:
+            rec = self._store.get(ip)
+            if rec is None:
+                return False
+            if rec.blocked_until > now:
+                return True
+            if rec.blocked_until > 0:
+                del self._store[ip]
+            return False
+
+    def record_failure(self, ip: str) -> None:
+        now = time.time()
+        with self._lock:
+            rec = self._store.get(ip)
+            if rec is None:
+                rec = _Record(window_start=now)
+                self._store[ip] = rec
+            if now - rec.window_start > self._window:
+                rec.failures = 0
+                rec.window_start = now
+            rec.failures += 1
+            if rec.failures >= self._max:
+                rec.blocked_until = now + self._cooldown
+                logger.warning(
+                    "Login blocked for %s after %d failures", ip, rec.failures
+                )
+
+    def record_success(self, ip: str) -> None:
+        with self._lock:
+            self._store.pop(ip, None)
--- a/api/app/config.py
+++ b/api/app/config.py
@@ -18,6 +18,10 @@ class Settings(BaseSettings):
    jwt_expiry_seconds: int = 86400
    owner_username: str
    owner_password: str
+    login_max_failures: int = 5
+    login_window_seconds: int = 300
+    login_cooldown_seconds: int = 900
+    login_trusted_proxy_ips: str = ""


@lru_cache
--- a/api/app/main.py
+++ b/api/app/main.py
@@ -1,17 +1,30 @@
-from contextlib import asynccontextmanager
+import ipaddress
+from contextlib import asynccontextmanager, suppress

 from fastapi import FastAPI, Request
 from fastapi.exceptions import HTTPException
 from fastapi.responses import JSONResponse

+from app.auth.rate_limiter import LoginRateLimiter
 from app.config import get_settings
 from app.database import Base, get_engine


@asynccontextmanager
 async def lifespan(application: FastAPI):
-    get_settings()
-    # Verify DB connection and run migrations on startup
+    settings = get_settings()
+    application.state.login_rate_limiter = LoginRateLimiter(
+        max_failures=settings.login_max_failures,
+        window_seconds=settings.login_window_seconds,
+        cooldown_seconds=settings.login_cooldown_seconds,
+    )
+    trusted_networks = []
+    for part in settings.login_trusted_proxy_ips.split(","):
+        part = part.strip()
+        if part:
+            with suppress(ValueError):
+                trusted_networks.append(ipaddress.ip_network(part, strict=False))
+    application.state.login_trusted_networks = trusted_networks
    engine = get_engine()
    async with engine.begin() as conn:
        # In production, Alembic handles migrations; this is a dev convenience
@@ -22,6 +35,10 @@ async def lifespan(application: FastAPI):

 app = FastAPI(title="Reactbin API", version="1.0.0", lifespan=lifespan)

+# Defaults so app.state is populated even when lifespan doesn't run (e.g. tests)
+app.state.login_rate_limiter = LoginRateLimiter()
+app.state.login_trusted_networks = []
+

@app.exception_handler(HTTPException)
 async def http_exception_handler(request: Request, exc: HTTPException):
--- a/api/app/routers/auth.py
+++ b/api/app/routers/auth.py
@@ -1,7 +1,9 @@
-from fastapi import APIRouter, Depends, HTTPException
+from fastapi import APIRouter, Depends, HTTPException, Request
+from fastapi.responses import JSONResponse
 from pydantic import BaseModel

 from app.auth.jwt_provider import JWTAuthProvider
+from app.auth.rate_limiter import LoginRateLimiter, get_client_ip
 from app.dependencies import get_jwt_auth

 router = APIRouter(tags=["auth"])
@@ -19,12 +21,32 @@ class TokenResponse(BaseModel):


@router.post("/auth/token", response_model=TokenResponse)
-async def login(body: LoginRequest, auth: JWTAuthProvider = Depends(get_jwt_auth)):
+async def login(
+    request: Request,
+    body: LoginRequest,
+    auth: JWTAuthProvider = Depends(get_jwt_auth),
+):
+    limiter: LoginRateLimiter = request.app.state.login_rate_limiter
+    ip: str = get_client_ip(request, request.app.state.login_trusted_networks)
+
+    if limiter.is_blocked(ip):
+        return JSONResponse(
+            status_code=429,
+            content={
+                "detail": "Too many failed login attempts. Please try again later.",
+                "code": "login_rate_limited",
+            },
+            headers={"Retry-After": str(limiter.cooldown_seconds)},
+        )
+
    if not auth.verify_credentials(body.username, body.password):
+        limiter.record_failure(ip)
        raise HTTPException(
            status_code=401,
            detail={"detail": "Invalid credentials", "code": "invalid_credentials"},
        )
+
+    limiter.record_success(ip)
    token = auth.create_token()
    return TokenResponse(
        access_token=token,
--- a/api/tests/build/.gitkeep
+++ b/api/tests/build/.gitkeep
--- a/api/tests/build/verify_production_image.sh
+++ b/api/tests/build/verify_production_image.sh
@@ -0,0 +1,119 @@
+#!/usr/bin/env bash
+# TDD verification script for api/Dockerfile.prod
+# Fails (red) if Dockerfile.prod does not exist or any check fails.
+set -euo pipefail
+
+IMAGE="reactbin-api-prod:verify-$$"
+IMAGE2="reactbin-api-prod:verify-cache-$$"
+PG_CONTAINER=""
+APP_CONTAINER=""
+
+cleanup() {
+    [ -n "$APP_CONTAINER" ] && docker rm -f "$APP_CONTAINER" 2>/dev/null || true
+    [ -n "$PG_CONTAINER" ] && docker rm -f "$PG_CONTAINER" 2>/dev/null || true
+    docker rmi "$IMAGE" 2>/dev/null || true
+    docker rmi "$IMAGE2" 2>/dev/null || true
+}
+trap cleanup EXIT
+
+# ── US1 check 1: build ────────────────────────────────────────────────────────
+echo "[verify] Building $IMAGE..."
+docker build -f api/Dockerfile.prod api/ -t "$IMAGE"
+echo "[verify] Build OK"
+
+# ── US1 check 2: start with a throwaway postgres ──────────────────────────────
+echo "[verify] Starting postgres..."
+PG_CONTAINER=$(docker run -d \
+    -e POSTGRES_DB=reactbin_verify \
+    -e POSTGRES_USER=verify \
+    -e POSTGRES_PASSWORD=verify \
+    postgres:16-alpine)
+
+for i in $(seq 1 30); do
+    if docker exec "$PG_CONTAINER" pg_isready -U verify -q 2>/dev/null; then break; fi
+    sleep 1
+    if [[ $i -eq 30 ]]; then echo "FAIL: postgres did not become ready"; exit 1; fi
+done
+
+PG_IP=$(docker inspect -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' "$PG_CONTAINER")
+
+echo "[verify] Starting production container..."
+APP_CONTAINER=$(docker run -d \
+    -p 18000:8000 \
+    -e JWT_SECRET_KEY=verify-key \
+    -e OWNER_USERNAME=testowner \
+    -e OWNER_PASSWORD=testpassword \
+    -e DATABASE_URL="postgresql+asyncpg://verify:verify@${PG_IP}:5432/reactbin_verify" \
+    -e S3_ENDPOINT_URL=http://noop:9000 \
+    -e S3_BUCKET_NAME=noop \
+    -e S3_ACCESS_KEY_ID=noop \
+    -e S3_SECRET_ACCESS_KEY=noop \
+    -e S3_REGION=us-east-1 \
+    "$IMAGE")
+
+# ── US1 check 3: health endpoint ──────────────────────────────────────────────
+echo "[verify] Polling health endpoint..."
+for i in $(seq 1 30); do
+    if curl -sf http://localhost:18000/api/v1/health > /dev/null; then break; fi
+    sleep 1
+    if [[ $i -eq 30 ]]; then echo "FAIL: health check timed out after 30s"; exit 1; fi
+done
+echo "[verify] Health check passed"
+
+# ── US2 check 1: non-root user ────────────────────────────────────────────────
+UID_IN_CONTAINER=$(docker exec "$APP_CONTAINER" id -u)
+if [[ "$UID_IN_CONTAINER" -eq 0 ]]; then
+    echo "FAIL: process running as root (UID 0)"; exit 1
+fi
+echo "[verify] Non-root user OK (UID $UID_IN_CONTAINER)"
+
+# ── C1: stdout/stderr log capture ────────────────────────────────────────────
+LOGS=$(docker logs "$APP_CONTAINER" 2>&1)
+if [[ -z "$LOGS" ]]; then
+    echo "FAIL: no output on stdout/stderr"; exit 1
+fi
+if ! echo "$LOGS" | grep -qiE "(started server|application startup complete|uvicorn)"; then
+    echo "FAIL: no startup logs found on stdout/stderr"; exit 1
+fi
+echo "[verify] Stdout logging OK"
+
+# ── US1 check 4: SIGTERM → exit 0 ────────────────────────────────────────────
+docker stop "$APP_CONTAINER" > /dev/null
+EXIT_CODE=$(docker wait "$APP_CONTAINER")
+if [[ "$EXIT_CODE" -ne 0 ]]; then
+    echo "FAIL: non-zero exit code $EXIT_CODE after SIGTERM"; exit 1
+fi
+echo "[verify] Graceful shutdown OK (exit $EXIT_CODE)"
+
+# ── US2 check 2: dev deps absent ─────────────────────────────────────────────
+if docker run --rm "$IMAGE" /app/.venv/bin/python -c "import pytest" 2>/dev/null; then
+    echo "FAIL: pytest importable in production image (dev deps present)"; exit 1
+fi
+echo "[verify] Dev deps absent OK"
+
+# ── C2: no hardcoded secrets in image layers ─────────────────────────────────
+if docker history --no-trunc "$IMAGE" 2>&1 | grep -qiE "(password|secret_key|api_key|token)"; then
+    echo "FAIL: potential secret found in image history"; exit 1
+fi
+echo "[verify] No secrets in image layers OK"
+
+# ── C3: missing env var → non-zero exit ──────────────────────────────────────
+set +e
+docker run --rm -e JWT_SECRET_KEY=verify-key "$IMAGE" 2>/dev/null
+MISSING_ENV_EXIT=$?
+set -e
+if [[ "$MISSING_ENV_EXIT" -eq 0 ]]; then
+    echo "FAIL: container exited 0 despite missing OWNER_USERNAME"; exit 1
+fi
+echo "[verify] Missing-env-var exit check OK (exit $MISSING_ENV_EXIT)"
+
+# ── US3: dep layer cached on source-only rebuild ──────────────────────────────
+echo "[verify] Testing cache hit on source-only rebuild..."
+touch api/app/main.py
+BUILD2_OUTPUT=$(docker build --progress=plain -f api/Dockerfile.prod api/ -t "$IMAGE2" 2>&1)
+if ! echo "$BUILD2_OUTPUT" | grep -q "CACHED"; then
+    echo "FAIL: dependency layer not reused on source-only rebuild"; exit 1
+fi
+echo "[verify] Dep layer cache hit confirmed (US3 OK)"
+
+echo "[verify] All checks passed (US1 + US2 + US3)."
--- a/api/tests/integration/test_login_rate_limit.py
+++ b/api/tests/integration/test_login_rate_limit.py
@@ -0,0 +1,121 @@
+import os
+
+import pytest
+from httpx import AsyncClient
+
+from app.auth.rate_limiter import LoginRateLimiter
+from app.main import app
+
+BAD_CREDS = {"username": "attacker", "password": "wrong"}
+VALID_CREDS = {
+    "username": os.environ.get("OWNER_USERNAME", "testowner"),
+    "password": os.environ.get("OWNER_PASSWORD", "testpassword"),
+}
+
+
+def _fresh_limiter():
+    return LoginRateLimiter(max_failures=3, window_seconds=60, cooldown_seconds=30)
+
+
+@pytest.mark.asyncio
+async def test_repeated_failures_trigger_429(client: AsyncClient):
+    original_limiter = app.state.login_rate_limiter
+    original_networks = app.state.login_trusted_networks
+    app.state.login_rate_limiter = _fresh_limiter()
+    app.state.login_trusted_networks = []
+    try:
+        for _ in range(3):
+            await client.post("/api/v1/auth/token", json=BAD_CREDS)
+        resp = await client.post("/api/v1/auth/token", json=BAD_CREDS)
+        assert resp.status_code == 429
+        assert resp.json()["code"] == "login_rate_limited"
+    finally:
+        app.state.login_rate_limiter = original_limiter
+        app.state.login_trusted_networks = original_networks
+
+
+@pytest.mark.asyncio
+async def test_success_resets_counter(client: AsyncClient):
+    original_limiter = app.state.login_rate_limiter
+    original_networks = app.state.login_trusted_networks
+    app.state.login_rate_limiter = _fresh_limiter()
+    app.state.login_trusted_networks = []
+    try:
+        for _ in range(2):
+            await client.post("/api/v1/auth/token", json=BAD_CREDS)
+        await client.post("/api/v1/auth/token", json=VALID_CREDS)
+        for _ in range(3):
+            resp = await client.post("/api/v1/auth/token", json=BAD_CREDS)
+            assert resp.status_code == 401, "counter should have reset after success"
+    finally:
+        app.state.login_rate_limiter = original_limiter
+        app.state.login_trusted_networks = original_networks
+
+
+@pytest.mark.asyncio
+async def test_429_has_retry_after_header(client: AsyncClient):
+    original_limiter = app.state.login_rate_limiter
+    original_networks = app.state.login_trusted_networks
+    app.state.login_rate_limiter = _fresh_limiter()
+    app.state.login_trusted_networks = []
+    try:
+        for _ in range(3):
+            await client.post("/api/v1/auth/token", json=BAD_CREDS)
+        resp = await client.post("/api/v1/auth/token", json=BAD_CREDS)
+        assert resp.status_code == 429
+        assert "Retry-After" in resp.headers
+        assert int(resp.headers["Retry-After"]) > 0
+    finally:
+        app.state.login_rate_limiter = original_limiter
+        app.state.login_trusted_networks = original_networks
+
+
+@pytest.mark.asyncio
+async def test_429_body_shape(client: AsyncClient):
+    original_limiter = app.state.login_rate_limiter
+    original_networks = app.state.login_trusted_networks
+    app.state.login_rate_limiter = _fresh_limiter()
+    app.state.login_trusted_networks = []
+    try:
+        for _ in range(3):
+            await client.post("/api/v1/auth/token", json=BAD_CREDS)
+        resp = await client.post("/api/v1/auth/token", json=BAD_CREDS)
+        assert resp.status_code == 429
+        assert resp.json() == {
+            "detail": "Too many failed login attempts. Please try again later.",
+            "code": "login_rate_limited",
+        }
+    finally:
+        app.state.login_rate_limiter = original_limiter
+        app.state.login_trusted_networks = original_networks
+
+
+@pytest.mark.asyncio
+async def test_xff_header_ignored_when_no_trusted_networks(client: AsyncClient):
+    original_limiter = app.state.login_rate_limiter
+    original_networks = app.state.login_trusted_networks
+    app.state.login_rate_limiter = _fresh_limiter()
+    app.state.login_trusted_networks = []
+    try:
+        # Send 3 failures all claiming to be "1.2.3.4" via XFF
+        for _ in range(3):
+            await client.post(
+                "/api/v1/auth/token",
+                json=BAD_CREDS,
+                headers={"X-Forwarded-For": "1.2.3.4"},
+            )
+        # 4th request with a *different* XFF — if XFF were trusted, this
+        # would appear to be a fresh IP and get 401. Since XFF is ignored,
+        # the real peer ("testclient") is blocked and we get 429.
+        resp = await client.post(
+            "/api/v1/auth/token",
+            json=BAD_CREDS,
+            headers={"X-Forwarded-For": "9.9.9.9"},
+        )
+        assert resp.status_code == 429, (
+            "XFF should be ignored when no trusted networks are configured; "
+            "expected real peer to be blocked"
+        )
+    finally:
+        app.state.login_rate_limiter = original_limiter
+        app.state.login_trusted_networks = original_networks
--- a/api/tests/unit/test_rate_limiter.py
+++ b/api/tests/unit/test_rate_limiter.py
@@ -0,0 +1,98 @@
+import ipaddress
+from unittest.mock import MagicMock
+
+from starlette.requests import Request
+
+from app.auth.rate_limiter import LoginRateLimiter, get_client_ip
+
+# ---------------------------------------------------------------------------
+# LoginRateLimiter tests
+# ---------------------------------------------------------------------------
+
+
+def make_limiter():
+    return LoginRateLimiter(max_failures=3, window_seconds=60, cooldown_seconds=300)
+
+
+def test_not_blocked_initially():
+    assert make_limiter().is_blocked("1.2.3.4") is False
+
+
+def test_blocked_after_threshold():
+    limiter = make_limiter()
+    for _ in range(3):
+        limiter.record_failure("1.2.3.4")
+    assert limiter.is_blocked("1.2.3.4") is True
+
+
+def test_success_clears_failures():
+    limiter = make_limiter()
+    limiter.record_failure("1.2.3.4")
+    limiter.record_failure("1.2.3.4")
+    limiter.record_success("1.2.3.4")
+    assert limiter.is_blocked("1.2.3.4") is False
+
+
+def test_ips_are_isolated():
+    limiter = make_limiter()
+    for _ in range(3):
+        limiter.record_failure("1.1.1.1")
+    assert limiter.is_blocked("2.2.2.2") is False
+
+
+def test_window_resets_after_expiry():
+    import time
+
+    limiter = LoginRateLimiter(max_failures=3, window_seconds=0, cooldown_seconds=300)
+    limiter.record_failure("1.2.3.4")
+    limiter.record_failure("1.2.3.4")
+    time.sleep(0.01)
+    limiter.record_failure("1.2.3.4")
+    # window expired — counter reset on third call, so failures = 1, not 3
+    assert limiter.is_blocked("1.2.3.4") is False
+
+
+def test_log_warning_on_lockout(caplog):
+    import logging
+
+    limiter = make_limiter()
+    with caplog.at_level(logging.WARNING, logger="app.auth.rate_limiter"):
+        for _ in range(3):
+            limiter.record_failure("5.6.7.8")
+    assert "Login blocked" in caplog.text
+    assert "5.6.7.8" in caplog.text
+
+
+# ---------------------------------------------------------------------------
+# get_client_ip tests
+# ---------------------------------------------------------------------------
+
+
+def make_request(peer: str, headers: dict) -> MagicMock:
+    req = MagicMock(spec=Request)
+    req.client.host = peer
+    req.headers = headers
+    return req
+
+
+def test_get_client_ip_no_trusted_networks_returns_peer():
+    req = make_request("203.0.113.1", {"X-Forwarded-For": "10.0.0.1"})
+    assert get_client_ip(req, []) == "203.0.113.1"
+
+
+def test_get_client_ip_trusted_peer_uses_xff():
+    req = make_request("10.0.0.1", {"X-Forwarded-For": "203.0.113.5"})
+    nets = [ipaddress.ip_network("10.0.0.0/8")]
+    assert get_client_ip(req, nets) == "203.0.113.5"
+
+
+def test_get_client_ip_untrusted_peer_ignores_xff():
+    req = make_request("8.8.8.8", {"X-Forwarded-For": "203.0.113.5"})
+    nets = [ipaddress.ip_network("10.0.0.0/8")]
+    assert get_client_ip(req, nets) == "8.8.8.8"
+
+
+def test_get_client_ip_trusted_peer_falls_back_to_real_ip():
+    req = make_request("10.0.0.1", {"X-Real-IP": "203.0.113.9"})
+    nets = [ipaddress.ip_network("10.0.0.0/8")]
+    assert get_client_ip(req, nets) == "203.0.113.9"
--- a/specs/009-login-rate-limiting/checklists/requirements.md
+++ b/specs/009-login-rate-limiting/checklists/requirements.md
@@ -0,0 +1,34 @@
+# Specification Quality Checklist: Login Brute-Force Protection
+
+**Purpose**: Validate specification completeness and quality before proceeding to planning
+**Created**: 2026-05-06
+**Feature**: [spec.md](../spec.md)
+
+## Content Quality
+
+- [X] No implementation details (languages, frameworks, APIs)
+- [X] Focused on user value and business needs
+- [X] Written for non-technical stakeholders
+- [X] All mandatory sections completed
+
+## Requirement Completeness
+
+- [X] No [NEEDS CLARIFICATION] markers remain
+- [X] Requirements are testable and unambiguous
+- [X] Success criteria are measurable
+- [X] Success criteria are technology-agnostic (no implementation details)
+- [X] All acceptance scenarios are defined
+- [X] Edge cases are identified
+- [X] Scope is clearly bounded
+- [X] Dependencies and assumptions identified
+
+## Feature Readiness
+
+- [X] All functional requirements have clear acceptance criteria
+- [X] User scenarios cover primary flows
+- [X] Feature meets measurable outcomes defined in Success Criteria
+- [X] No implementation details leak into specification
+
+## Notes
+
+- All items pass. Spec is ready for `/speckit-plan`.
--- a/specs/009-login-rate-limiting/contracts/auth.md
+++ b/specs/009-login-rate-limiting/contracts/auth.md
@@ -0,0 +1,85 @@
+# API Contract: Authentication
+
+## POST /api/v1/auth/token
+
+Authenticates the owner and returns a JWT access token.
+
+**This endpoint is modified by feature 009** to enforce brute-force protection.
+All previous behaviour is preserved. One new response code (429) is added.
+
+### Request
+
+```
+POST /api/v1/auth/token
+Content-Type: application/json
+```
+
+```json
+{
+  "username": "string",
+  "password": "string"
+}
+```
+
+### Responses
+
+#### 200 OK — Credentials accepted
+
+```json
+{
+  "access_token": "<jwt>",
+  "token_type": "bearer",
+  "expires_in": 86400
+}
+```
+
+Side effect: resets the failure counter for the caller's IP address.
+
+---
+
+#### 401 Unauthorized — Credentials rejected
+
+```json
+{
+  "detail": "Invalid credentials",
+  "code": "invalid_credentials"
+}
+```
+
+Side effect: increments the failure counter for the caller's IP address. If the
+counter reaches `LOGIN_MAX_FAILURES`, subsequent requests from this IP will receive
+429 until the cooldown expires.
+
+---
+
+#### 429 Too Many Requests — Source blocked after repeated failures
+
+**This response is new in feature 009.**
+
+```
+HTTP/1.1 429 Too Many Requests
+Retry-After: 900
+Content-Type: application/json
+```
+
+```json
+{
+  "detail": "Too many failed login attempts. Please try again later.",
+  "code": "login_rate_limited"
+}
+```
+
+The `Retry-After` header value is the configured cooldown duration in seconds (default: 900).
+It reflects the maximum possible wait, not the exact remaining lockout time.
+
+No credentials are verified when this response is returned — the request is
+rejected before authentication is attempted.
+
+---
+
+### Notes
+
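+- The failure counter is keyed by the resolved client IP: the TCP peer address by default, or the `X-Forwarded-For` / `X-Real-IP` value when the peer is listed in `LOGIN_TRUSTED_PROXY_IPS`.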
+- Threshold values (`LOGIN_MAX_FAILURES`, `LOGIN_WINDOW_SECONDS`, `LOGIN_COOLDOWN_SECONDS`)
+  are not disclosed in any response.
+- Counters are in-memory and reset on process restart.
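
Review note on `Retry-After`: the contract fixes the header at the configured cooldown. If an exact value were wanted later, the remaining lockout could be derived from the record's `blocked_until` timestamp. A minimal sketch, assuming a hypothetical `retry_after_seconds` helper that is not part of this change:

```python
import math


def retry_after_seconds(blocked_until: float, now: float) -> int:
    """Hypothetical helper: whole seconds left in a lockout, rounded up.
    Returns 0 when the lockout has already expired."""
    remaining = blocked_until - now
    if remaining <= 0:
        return 0
    return math.ceil(remaining)


# Lockout ends at t=1900 (locked at t=1000 with a 900 s cooldown);
# the blocked request arrives at t=1250.5.
print(retry_after_seconds(blocked_until=1900.0, now=1250.5))  # 650
```

The fixed value lets `is_blocked` keep returning a plain `bool`; an exact value would require the limiter to expose `blocked_until` to the router.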
--- a/specs/009-login-rate-limiting/data-model.md
+++ b/specs/009-login-rate-limiting/data-model.md
@@ -0,0 +1,53 @@
+# Data Model: Login Brute-Force Protection
+
+## Overview
+
+This feature introduces no new database tables. The only data entity is a transient,
+in-memory rate-limit record that does not survive process restarts. This is intentional
+(see research.md Decision 3).
+
+---
+
+## Entity: Rate-Limit Record (in-memory only)
+
+| Field          | Type    | Description                                                                 |
+|----------------|---------|-----------------------------------------------------------------------------|
+| `failures`     | int     | Count of consecutive failed login attempts in the current window            |
+| `window_start` | float   | Unix timestamp marking when the current counting window began               |
+| `blocked_until`| float   | Unix timestamp after which the source is no longer blocked; 0.0 if not blocked |
+
+**Keyed by**: resolved client IP address string (e.g., `"192.168.1.1"`); see `get_client_ip()` in `rate_limiter.py` for resolution logic
+
+**Lifecycle**:
+1. Record is created on the first failed login from a source.
+2. `failures` increments on each subsequent failure within the window.
+3. When `failures >= LOGIN_MAX_FAILURES`, `blocked_until` is set to `now + LOGIN_COOLDOWN_SECONDS`.
+4. When `blocked_until` has passed, the record is deleted on the next request from that source.
+5. A successful login deletes the record immediately (failure counter reset).
+6. If `now - window_start > LOGIN_WINDOW_SECONDS` without triggering lockout, the counter resets within the existing record.
+
+**State machine**:
+
+```
+[no record]
+     │ first failure
+     ▼
+[tracking] ──── failure N ≥ max ────► [blocked]
+     │                                     │
+     │ success / window expires             │ cooldown expires
+     ▼                                     ▼
+[no record] ◄─────────────────────── [no record]
+```
+
+---
+
+## Configuration Entity: Rate-Limit Settings
+
+Stored as environment variables; loaded via `app.config.Settings`:
+
+| Env Var                    | Default | Description                                              |
+|----------------------------|---------|----------------------------------------------------------|
+| `LOGIN_MAX_FAILURES`       | `5`     | Failures within window before lockout                    |
+| `LOGIN_WINDOW_SECONDS`     | `300`   | Counting window in seconds, measured from the first failure in the window (5 minutes) |
+| `LOGIN_COOLDOWN_SECONDS`   | `900`   | Lockout duration in seconds after threshold exceeded (15 minutes) |
+| `LOGIN_TRUSTED_PROXY_IPS`  | `""`    | Comma-separated IPs/CIDRs of trusted upstream proxies (e.g., `10.0.0.0/8`); empty = disabled |
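
Review note on data-model.md: the lifecycle can be traced with an explicit clock instead of `time.time()`, which makes each transition easy to check by hand. The sketch below is a standalone restatement of the six lifecycle steps for illustration only. The names `Record`, `on_failure`, `on_success`, and `is_blocked` are hypothetical; the shipped implementation is `LoginRateLimiter` in `rate_limiter.py`.

```python
from dataclasses import dataclass


@dataclass
class Record:
    failures: int = 0
    window_start: float = 0.0
    blocked_until: float = 0.0  # 0.0 means "not blocked"


MAX_FAILURES, WINDOW, COOLDOWN = 5, 300, 900  # documented defaults


def on_failure(store: dict[str, Record], ip: str, now: float) -> None:
    rec = store.setdefault(ip, Record(window_start=now))  # step 1
    if now - rec.window_start > WINDOW:                   # step 6
        rec.failures, rec.window_start = 0, now
    rec.failures += 1                                     # step 2
    if rec.failures >= MAX_FAILURES:                      # step 3
        rec.blocked_until = now + COOLDOWN


def on_success(store: dict[str, Record], ip: str) -> None:
    store.pop(ip, None)                                   # step 5


def is_blocked(store: dict[str, Record], ip: str, now: float) -> bool:
    rec = store.get(ip)
    if rec is None:
        return False
    if rec.blocked_until > now:
        return True
    if rec.blocked_until > 0:                             # step 4
        del store[ip]
    return False


store: dict[str, Record] = {}
for t in range(5):  # five failures, one second apart; lockout starts at t=4
    on_failure(store, "192.168.1.1", now=float(t))
print(is_blocked(store, "192.168.1.1", now=10.0))   # True: inside cooldown
print(is_blocked(store, "192.168.1.1", now=905.0))  # False: cooldown ended at 904
print("192.168.1.1" in store)                       # False: record deleted
```

Passing `now` explicitly is only a convenience for the walkthrough; it also shows one way the window-expiry unit test could avoid `time.sleep`.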
--- a/specs/009-login-rate-limiting/plan.md
+++ b/specs/009-login-rate-limiting/plan.md
@@ -0,0 +1,388 @@
+# Implementation Plan: Login Brute-Force Protection
+
+**Branch**: `009-login-rate-limiting` | **Date**: 2026-05-06 | **Spec**: [spec.md](spec.md)  
+**Input**: Feature specification from `specs/009-login-rate-limiting/spec.md`
+
+## Summary
+
+Add failure-counting brute-force protection to the login endpoint (`POST /api/v1/auth/token`).
+After a configurable number of consecutive failed attempts from the same resolved client IP,
+the endpoint returns HTTP 429 with a `Retry-After` header for a configurable cooldown period.
+A successful login resets the counter. All thresholds are configurable via environment variables.
+When deployed behind a reverse proxy (nginx, Kubernetes ingress), a `LOGIN_TRUSTED_PROXY_IPS`
+setting enables extraction of the real client IP from `X-Forwarded-For`. No new infrastructure
+(no Redis, no new DB table) — counters live in process memory.
+
+---
+
+## Technical Context
+
+**Language/Version**: Python 3.12+  
+**Primary Dependencies**: FastAPI, pydantic-settings (already in use); no new dependencies added  
+**Storage**: In-memory `dict` (no persistence across restarts — intentional)  
+**Testing**: pytest + pytest-asyncio (existing test infrastructure)  
+**Target Platform**: Linux server (Docker)  
+**Project Type**: Web service (API only — this feature has no UI surface)  
+**Performance Goals**: Rate limiter adds negligible overhead (dict lookup + lock acquisition; sub-millisecond)  
+**Constraints**: Must not add new runtime service dependencies; must not change any auth behaviour for non-blocked sources  
+**Scale/Scope**: Single process, single user; in-memory store is sufficient
+
+---
+
+## Constitution Check
+
+| Principle | Status | Notes |
+|-----------|--------|-------|
+| §2.4 Auth abstraction (AuthProvider interface) | ✅ Pass | Rate limiter is a guard *before* `JWTAuthProvider.verify_credentials()`, not a bypass of the interface |
+| §2.5 DB abstraction (repository layer) | ✅ Pass | No database access; in-memory only |
+| §2.6 No speculative abstraction | ✅ Pass | Concrete `LoginRateLimiter` class, no interface; only one implementation planned |
+| §3.3 Error envelope (`detail` + `code`) | ✅ Pass | 429 response uses `{"detail": "...", "code": "login_rate_limited"}` |
+| §5.1 TDD | ✅ Required | Tasks follow red → green order |
+| §5.2 Integration tests against PostgreSQL | ✅ Pass | Integration test for the login endpoint will run against the Docker PostgreSQL stack |
+| §7.2 Environment configuration | ✅ Pass | `LOGIN_MAX_FAILURES`, `LOGIN_WINDOW_SECONDS`, `LOGIN_COOLDOWN_SECONDS`, `LOGIN_TRUSTED_PROXY_IPS` from env vars |
+| §7.3 Linting (ruff) | ✅ Required | All new files must pass `ruff check` |
+
+**Gate result**: No violations. Cleared to proceed.
+
+---
+
+## Project Structure
+
+### Documentation (this feature)
+
+```text
+specs/009-login-rate-limiting/
+├── plan.md              ← this file
+├── research.md          ← decisions on approach
+├── data-model.md        ← rate-limit record entity
+├── quickstart.md        ← curl runbook
+├── contracts/
+│   └── auth.md          ← updated POST /api/v1/auth/token with 429
+└── tasks.md             ← generated by /speckit-tasks
+```
+
+### Source Code Changes
+
+```text
+api/
+├── app/
+│   ├── auth/
+│   │   ├── rate_limiter.py          ← NEW: LoginRateLimiter class
+│   │   ├── jwt_provider.py          (unchanged)
+│   │   ├── noop.py                  (unchanged)
+│   │   └── provider.py              (unchanged)
+│   ├── config.py                    ← add login_max_failures, login_window_seconds, login_cooldown_seconds, login_trusted_proxy_ips
+│   ├── main.py                      ← init LoginRateLimiter in lifespan, attach to app.state
+│   └── routers/
+│       └── auth.py                  ← check rate limit before auth, record outcome
+└── tests/
+    ├── unit/
+    │   └── test_rate_limiter.py     ← NEW: unit tests for LoginRateLimiter logic
+    └── integration/
+        └── test_login_rate_limit.py ← NEW: integration tests for 429 behaviour via HTTP
+```
+
+---
+
+## Implementation Detail
+
+### `api/app/auth/rate_limiter.py`
+
+```python
+import ipaddress
+import logging
+import time
+from dataclasses import dataclass, field
+from ipaddress import IPv4Network, IPv6Network
+from threading import Lock
+
+from starlette.requests import Request
+
+logger = logging.getLogger(__name__)
+
+
+def get_client_ip(
+    request: Request,
+    trusted_networks: list[IPv4Network | IPv6Network],
+) -> str:
+    """Return the resolved client IP, honouring X-Forwarded-For when the
+    TCP peer is a trusted upstream proxy. Falls back to the TCP peer address
+    when no trusted networks are configured or the peer is not in the list."""
+    peer = request.client.host if request.client else "unknown"
+    if trusted_networks and peer != "unknown":
+        try:
+            peer_addr = ipaddress.ip_address(peer)
+            if any(peer_addr in net for net in trusted_networks):
+                xff = request.headers.get("X-Forwarded-For", "").split(",")[0].strip()
+                if xff:
+                    return xff
+                real_ip = request.headers.get("X-Real-IP", "").strip()
+                if real_ip:
+                    return real_ip
+        except ValueError:
+            pass
+    return peer
+
+
+@dataclass
+class _Record:
+    failures: int = 0
+    window_start: float = field(default_factory=time.time)
+    blocked_until: float = 0.0
+
+
+class LoginRateLimiter:
+    def __init__(
+        self,
+        max_failures: int = 5,
+        window_seconds: int = 300,
+        cooldown_seconds: int = 900,
+    ) -> None:
+        self._max = max_failures
+        self._window = window_seconds
+        self._cooldown = cooldown_seconds
+        self._store: dict[str, _Record] = {}
+        self._lock = Lock()
+
+    @property
+    def cooldown_seconds(self) -> int:
+        return self._cooldown
+
+    def is_blocked(self, ip: str) -> bool:
+        """Return True while `ip` is inside an active cooldown."""
+        now = time.time()
+        with self._lock:
+            rec = self._store.get(ip)
+            if rec is None:
+                return False
+            if rec.blocked_until > now:
+                return True
+            if rec.blocked_until > 0:
+                # Cooldown has expired: drop the record so counting starts afresh.
+                del self._store[ip]
+            return False
+
+    def record_failure(self, ip: str) -> None:
+        """Count a failed login and start a cooldown once the threshold is reached."""
+        now = time.time()
+        with self._lock:
+            rec = self._store.get(ip)
+            if rec is None:
+                rec = _Record(window_start=now)
+                self._store[ip] = rec
+            if now - rec.window_start > self._window:
+                # Fixed window elapsed: restart the count from this failure.
+                rec.failures = 0
+                rec.window_start = now
+            rec.failures += 1
+            if rec.failures >= self._max:
+                rec.blocked_until = now + self._cooldown
+                logger.warning(
+                    "Login blocked for %s after %d failures", ip, rec.failures
+                )
+
+    def record_success(self, ip: str) -> None:
+        """Forget all failure state for `ip` after a successful login."""
+        with self._lock:
+            self._store.pop(ip, None)
+```
+
+### `api/app/config.py` additions
+
+```python
+login_max_failures: int = 5
+login_window_seconds: int = 300
+login_cooldown_seconds: int = 900
+login_trusted_proxy_ips: str = ""  # comma-separated IPs/CIDRs; empty = disabled
+```
+
+### `api/app/main.py` lifespan update
+
+```python
+import ipaddress
+
+from app.auth.rate_limiter import LoginRateLimiter
+
+@asynccontextmanager
+async def lifespan(application: FastAPI):
+    settings = get_settings()
+    application.state.login_rate_limiter = LoginRateLimiter(
+        max_failures=settings.login_max_failures,
+        window_seconds=settings.login_window_seconds,
+        cooldown_seconds=settings.login_cooldown_seconds,
+    )
+    trusted_networks = []
+    for part in settings.login_trusted_proxy_ips.split(","):
+        part = part.strip()
+        if part:
+            try:
+                trusted_networks.append(ipaddress.ip_network(part, strict=False))
+            except ValueError:
+                pass  # invalid entry — skip silently
+    application.state.login_trusted_networks = trusted_networks
+    # ... existing DB setup unchanged
+    engine = get_engine()
+    async with engine.begin() as conn:
+        await conn.run_sync(Base.metadata.create_all)
+    yield
+    await engine.dispose()
+```
+
+### `api/app/routers/auth.py` update
+
+```python
+from fastapi import APIRouter, Depends, HTTPException, Request
+from fastapi.responses import JSONResponse
+from pydantic import BaseModel
+
+from app.auth.jwt_provider import JWTAuthProvider
+from app.auth.rate_limiter import LoginRateLimiter, get_client_ip
+from app.dependencies import get_jwt_auth
+
+router = APIRouter(tags=["auth"])
+
+
+class LoginRequest(BaseModel):
+    username: str
+    password: str
+
+
+class TokenResponse(BaseModel):
+    access_token: str
+    token_type: str = "bearer"
+    expires_in: int
+
+
+@router.post("/auth/token", response_model=TokenResponse)
+async def login(
+    request: Request,
+    body: LoginRequest,
+    auth: JWTAuthProvider = Depends(get_jwt_auth),
+):
+    limiter: LoginRateLimiter = request.app.state.login_rate_limiter
+    ip: str = get_client_ip(request, request.app.state.login_trusted_networks)
+
+    if limiter.is_blocked(ip):
+        return JSONResponse(
+            status_code=429,
+            content={
+                "detail": "Too many failed login attempts. Please try again later.",
+                "code": "login_rate_limited",
+            },
+            headers={"Retry-After": str(limiter.cooldown_seconds)},
+        )
+
+    if not auth.verify_credentials(body.username, body.password):
+        limiter.record_failure(ip)
+        raise HTTPException(
+            status_code=401,
+            detail={"detail": "Invalid credentials", "code": "invalid_credentials"},
+        )
+
+    limiter.record_success(ip)
+    token = auth.create_token()
+    return TokenResponse(
+        access_token=token,
+        token_type="bearer",
+        expires_in=auth._expiry_seconds,
+    )
+```
+
+### `api/tests/unit/test_rate_limiter.py` (representative cases)
+
+```python
+import time
+import pytest
+from app.auth.rate_limiter import LoginRateLimiter
+
+
+def test_not_blocked_initially():
+    limiter = LoginRateLimiter(max_failures=3, window_seconds=60, cooldown_seconds=300)
+    assert limiter.is_blocked("1.2.3.4") is False
+
+
+def test_blocked_after_threshold():
+    limiter = LoginRateLimiter(max_failures=3, window_seconds=60, cooldown_seconds=300)
+    for _ in range(3):
+        limiter.record_failure("1.2.3.4")
+    assert limiter.is_blocked("1.2.3.4") is True
+
+
+def test_success_clears_failures():
+    limiter = LoginRateLimiter(max_failures=3, window_seconds=60, cooldown_seconds=300)
+    limiter.record_failure("1.2.3.4")
+    limiter.record_failure("1.2.3.4")
+    limiter.record_success("1.2.3.4")
+    assert limiter.is_blocked("1.2.3.4") is False
+
+
+def test_ips_are_isolated():
+    limiter = LoginRateLimiter(max_failures=2, window_seconds=60, cooldown_seconds=300)
+    limiter.record_failure("1.1.1.1")
+    limiter.record_failure("1.1.1.1")
+    assert limiter.is_blocked("2.2.2.2") is False
+```
+
+### `api/tests/integration/test_login_rate_limit.py` (representative cases)
+
+```python
+import pytest
+from httpx import AsyncClient
+
+# Uses the 'client' fixture (NoOpAuthProvider) from conftest, which is sufficient
+# here because we're testing the rate-limit layer, not auth correctness.
+# The login endpoint reads its limiter from app.state, so we need
+# the full ASGI app.
+
+BAD_CREDS = {"username": "attacker", "password": "wrong"}
+
+
+@pytest.mark.asyncio
+async def test_repeated_failures_trigger_429(client: AsyncClient):
+    # Use a custom limiter with low threshold to avoid slow tests
+    # (the app.state.login_rate_limiter is set in lifespan; override for test)
+    from app.auth.rate_limiter import LoginRateLimiter
+    from app.main import app
+    original = app.state.login_rate_limiter
+    app.state.login_rate_limiter = LoginRateLimiter(
+        max_failures=3, window_seconds=60, cooldown_seconds=30
+    )
+    try:
+        for _ in range(3):
+            await client.post("/api/v1/auth/token", json=BAD_CREDS)
+        resp = await client.post("/api/v1/auth/token", json=BAD_CREDS)
+        assert resp.status_code == 429
+        assert resp.json()["code"] == "login_rate_limited"
+        assert "Retry-After" in resp.headers
+    finally:
+        app.state.login_rate_limiter = original
+```
+
+---
+
+## Implementation Phases
+
+### Phase 1 (MVP — P1): Blocking after repeated failures
+
+1. Add `login_max_failures`, `login_window_seconds`, `login_cooldown_seconds`, `login_trusted_proxy_ips` to `api/app/config.py`
+2. Create `api/app/auth/rate_limiter.py` with `LoginRateLimiter` and `get_client_ip()`
+3. Initialize rate limiter and parse trusted networks in `api/app/main.py` lifespan; attach both to `app.state`
+4. Update `api/app/routers/auth.py` to resolve client IP via `get_client_ip()`, then check + record outcomes
+5. Unit tests: `api/tests/unit/test_rate_limiter.py`
+6. Integration tests: `api/tests/integration/test_login_rate_limit.py`
+
+### Phase 2 (US2 — observability): Logging and response hints
+
+Delivered as part of Phase 1 (the `logger.warning(...)` call and `Retry-After` header
+are embedded in the same implementation). No separate phase needed.
+
+---
+
+## Environment Variables to Add to `.env.example`
+
+```dotenv
+# Login brute-force protection
+LOGIN_MAX_FAILURES=5
+LOGIN_WINDOW_SECONDS=300
+LOGIN_COOLDOWN_SECONDS=900
+# Comma-separated IPs/CIDRs of trusted upstream proxies (e.g. nginx ingress pod CIDR).
+# Leave empty when not behind a reverse proxy.
+LOGIN_TRUSTED_PROXY_IPS=
+```
+
+These are optional (have defaults) so existing `.env` files without them continue working.
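As a companion to the lifespan snippet in `plan.md`, the following is a minimal standalone sketch of how a `LOGIN_TRUSTED_PROXY_IPS` string could be parsed and matched with the standard library alone. The helper names `parse_trusted_networks` and `is_trusted_peer` are hypothetical and do not appear in the plan; skipping invalid entries mirrors the plan's `try/except ValueError`.

```python
from __future__ import annotations

import ipaddress
from ipaddress import IPv4Network, IPv6Network


def parse_trusted_networks(raw: str) -> list[IPv4Network | IPv6Network]:
    """Parse comma-separated IPs/CIDRs, skipping blank or invalid entries."""
    networks: list[IPv4Network | IPv6Network] = []
    for part in raw.split(","):
        part = part.strip()
        if not part:
            continue
        try:
            # strict=False tolerates host bits, e.g. "10.1.2.3/8" becomes 10.0.0.0/8.
            networks.append(ipaddress.ip_network(part, strict=False))
        except ValueError:
            continue  # invalid entry: skipped, as in the plan
    return networks


def is_trusted_peer(peer: str, networks: list[IPv4Network | IPv6Network]) -> bool:
    """Return True if `peer` parses as an IP inside any trusted network."""
    try:
        addr = ipaddress.ip_address(peer)
    except ValueError:
        return False  # e.g. "unknown" or a test-client placeholder
    return any(addr in net for net in networks)


nets = parse_trusted_networks("10.0.0.0/8, 192.168.1.5, not-an-ip")
print([str(n) for n in nets])
print(is_trusted_peer("10.9.8.7", nets), is_trusted_peer("8.8.8.8", nets))
```

A bare address such as `192.168.1.5` is stored as a single-host network (`/32`), so single IPs and CIDR ranges can share one list and one membership check.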
--- a/specs/009-login-rate-limiting/quickstart.md
+++ b/specs/009-login-rate-limiting/quickstart.md
@@ -0,0 +1,112 @@
+# Quickstart: Login Brute-Force Protection
+
+## Prerequisites
+
+- API running (via `docker compose up` or locally with `.env` set)
+- `curl` available
+
+---
+
+## Scenario 1: Trigger the rate limiter
+
+Send 6 consecutive failed login attempts (default threshold is 5):
+
+```bash
+for i in $(seq 1 6); do
+  printf 'Attempt %d: ' "$i"
+  curl -s -o /dev/null -w "%{http_code}\n" \
+    -X POST http://localhost:8000/api/v1/auth/token \
+    -H "Content-Type: application/json" \
+    -d '{"username": "wrong", "password": "wrong"}'
+done
+```
+
+Expected output:
+```
+Attempt 1: 401
+Attempt 2: 401
+Attempt 3: 401
+Attempt 4: 401
+Attempt 5: 401
+Attempt 6: 429
+```
+
+The 6th attempt returns 429. Inspect the headers:
+
+```bash
+curl -i -X POST http://localhost:8000/api/v1/auth/token \
+  -H "Content-Type: application/json" \
+  -d '{"username": "wrong", "password": "wrong"}'
+```
+
+Expected headers include:
+```
+HTTP/1.1 429 Too Many Requests
+Retry-After: 900
+```
+
+Expected body:
+```json
+{"detail": "Too many failed login attempts. Please try again later.", "code": "login_rate_limited"}
+```
+
+---
+
+## Scenario 2: Successful login resets the counter
+
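+Make some failed attempts, then log in with valid credentials. If Scenario 1 has just locked you out, restart the API first (counters live in process memory) or wait for the cooldown to expire: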
+
+```bash
+# Fail twice
+for i in 1 2; do
+  curl -s -o /dev/null -w "fail $i: %{http_code}\n" \
+    -X POST http://localhost:8000/api/v1/auth/token \
+    -H "Content-Type: application/json" \
+    -d '{"username": "wrong", "password": "wrong"}'
+done
+
+# Succeed — resets counter
+curl -s -o /dev/null -w "success: %{http_code}\n" \
+  -X POST http://localhost:8000/api/v1/auth/token \
+  -H "Content-Type: application/json" \
+  -d '{"username": "'"$OWNER_USERNAME"'", "password": "'"$OWNER_PASSWORD"'"}'
+
+# Now fail 5 more times — counter was reset, so no 429 yet
+for i in $(seq 1 5); do
+  curl -s -o /dev/null -w "fail after reset $i: %{http_code}\n" \
+    -X POST http://localhost:8000/api/v1/auth/token \
+    -H "Content-Type: application/json" \
+    -d '{"username": "wrong", "password": "wrong"}'
+done
+```
+
+Expected: all "fail after reset" lines return 401 (not 429), confirming the counter was reset.
+
+---
+
+## Scenario 3: Observe log output
+
+While triggering the rate limiter (Scenario 1), watch API logs:
+
+```bash
+docker compose logs -f api
+```
+
+After the threshold is crossed you should see a line like:
+
+```
+WARNING  app.auth.rate_limiter:rate_limiter.py:NN Login blocked for 172.18.0.1 after 5 failures
+```
+
+---
+
+## Environment variable overrides
+
+To test with a lower threshold without code changes:
+
+```bash
+LOGIN_MAX_FAILURES=2 LOGIN_WINDOW_SECONDS=60 LOGIN_COOLDOWN_SECONDS=30 \
+  uvicorn app.main:app --reload
+```
+
+Then only 2 failures trigger the lockout, and it clears after 30 seconds.
--- a/specs/009-login-rate-limiting/research.md
+++ b/specs/009-login-rate-limiting/research.md
@@ -0,0 +1,67 @@
+# Research: Login Brute-Force Protection
+
+## Decision 1: Library vs. custom implementation
+
+**Decision**: Custom in-memory failure tracker (no new library dependency)
+
+**Rationale**: The requirement is to count *failed* login attempts specifically and reset on success — not to rate-limit all requests regardless of outcome. Popular libraries like `slowapi` count all requests to a route, which would break FR-004 (reset on success) without significant workarounds. A purpose-built 60-line class is simpler, more auditable, and has no dependency footprint.
+
+**Alternatives considered**:
+- `slowapi` (built on `limits`): Counts all requests, not failures. Requires patching the exception handler to decrement on success — fragile and non-obvious.
+- `slowapi` with a custom key function: Could be done, but the library's storage model doesn't expose a "reset this key" API in a clean way.
+- Redis-backed counter: Overkill for a single-user personal app with one instance. No new infrastructure justified.
+
+---
+
+## Decision 2: Fixed window vs. sliding window
+
+**Decision**: Fixed window with per-source reset on successful login
+
+**Rationale**: Fixed window is simpler to implement correctly and sufficient for this use case. The main attack — rapid sequential guessing — is fully addressed. The known "burst at window boundary" weakness is irrelevant here because: (a) the cooldown period is separate from the counting window, and (b) a successful login resets the counter entirely.
+
+**Alternatives considered**:
+- Sliding window: More accurate, but adds complexity (requires storing timestamps of each request). The marginal security benefit doesn't justify the implementation cost for a personal single-user app.
+
+---
+
+## Decision 3: In-memory backing store
+
+**Decision**: Python `dict` keyed by source IP, protected by a threading `Lock`
+
+**Rationale**: The application runs as a single process. In-memory storage means counters reset on restart — this is acceptable and matches the "fail open" assumption in the spec. No new infrastructure (Redis, database table) is required.
+
+**Alternatives considered**:
+- Database-backed counters: Persistent across restarts, but adds a DB round-trip to every login request (including successful ones). Not justified.
+- Redis: Distributed-safe and persistent, but requires a new service dependency. Out of scope for a personal single-instance app.
+
+---
+
+## Decision 4: Source identifier
+
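+**Decision**: `request.client.host` (the TCP peer address) by default; the first `X-Forwarded-For` entry (or `X-Real-IP`) only when the TCP peer matches `LOGIN_TRUSTED_PROXY_IPS`
+
+**Rationale**: The spec explicitly states not to trust `X-Forwarded-For` headers unless the app is known to be behind a trusted proxy. `request.client.host` in Starlette/FastAPI is the actual TCP peer IP, so it cannot be spoofed by an attacker sending arbitrary headers. Behind nginx or a Kubernetes ingress, however, the peer is always the proxy, so forwarded headers are honoured only for peers on the configured trusted list (FR-008).
+
+**Alternatives considered**:
+- Always using the first `X-Forwarded-For` value: Spoofable if the app is not behind a trusted proxy (attacker can set arbitrary header values).
+- Always using `X-Real-IP`: Same spoofing concern.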
+
+---
+
+## Decision 5: 429 response and Retry-After header
+
+**Decision**: Return HTTP 429 with `{"detail": "...", "code": "login_rate_limited"}` and a `Retry-After` header set to the configured cooldown duration in seconds
+
+**Rationale**: HTTP 429 is the standard "Too Many Requests" status. The `Retry-After` header is explicitly mentioned in the spec (US2 acceptance scenario) and is permitted, though not required, by RFC 6585 (a 429 response MAY include it). Setting it to the *configured* cooldown (not the exact remaining time) satisfies FR-005: it doesn't reveal precise expiry, just the maximum wait. The response body follows §3.3 of the constitution (error envelope with `detail` and `code`).
+
+---
+
+## Decision 6: Default threshold values
+
+**Decision**: `LOGIN_MAX_FAILURES=5`, `LOGIN_WINDOW_SECONDS=300` (5 min), `LOGIN_COOLDOWN_SECONDS=900` (15 min)
+
+**Rationale**: Industry standard for web apps. 5 attempts is enough for legitimate typos but makes brute-force infeasible at human scale. A 5-minute counting window matches typical "I fat-fingered my password" retry patterns. A 15-minute cooldown is a meaningful deterrent without locking out a legitimate owner indefinitely.
+
+**Alternatives considered**:
+- 3 failures / 60 s window / 300 s cooldown: More aggressive, but too likely to lock out the legitimate owner on a bad day.
+- 10 failures: Too permissive for a brute-force defense.
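To make Decision 2 in `research.md` concrete, the sketch below replays the fixed-window logic against a fake clock, so window expiry and cooldown expiry can be observed without sleeping. `FixedWindowCounter`, `FakeClock`, and the `clock` parameter are hypothetical names introduced for illustration; this is a simplified stand-in, not the planned `LoginRateLimiter` (which reads `time.time()` directly and takes a lock).

```python
from __future__ import annotations

from collections.abc import Callable
from dataclasses import dataclass


@dataclass
class _Window:
    failures: int
    window_start: float
    blocked_until: float = 0.0


class FakeClock:
    """Manually advanced clock (assumption: used for illustration only)."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now


class FixedWindowCounter:
    """Simplified fixed-window failure counter with an injectable clock."""

    def __init__(self, max_failures: int, window: float, cooldown: float,
                 clock: Callable[[], float]) -> None:
        self._max, self._window, self._cooldown = max_failures, window, cooldown
        self._clock = clock
        self._store: dict[str, _Window] = {}

    def is_blocked(self, key: str) -> bool:
        rec = self._store.get(key)
        return rec is not None and rec.blocked_until > self._clock()

    def record_failure(self, key: str) -> None:
        now = self._clock()
        rec = self._store.get(key)
        window_expired = rec is not None and now - rec.window_start > self._window
        cooldown_expired = rec is not None and 0 < rec.blocked_until <= now
        # Start a fresh window when there is no record, or the old one is stale.
        if rec is None or window_expired or cooldown_expired:
            rec = self._store[key] = _Window(failures=0, window_start=now)
        rec.failures += 1
        if rec.failures >= self._max:
            rec.blocked_until = now + self._cooldown

    def record_success(self, key: str) -> None:
        self._store.pop(key, None)


clock = FakeClock()
limiter = FixedWindowCounter(max_failures=3, window=60, cooldown=30, clock=clock)
for _ in range(3):
    limiter.record_failure("203.0.113.5")
blocked_now = limiter.is_blocked("203.0.113.5")    # third failure starts the cooldown
clock.now += 31
blocked_later = limiter.is_blocked("203.0.113.5")  # cooldown has elapsed
print(blocked_now, blocked_later)
```

Injecting the clock is a testing convenience only; the fixed-window behaviour is the same as counting failures from `window_start` and discarding the count once `window` seconds have passed.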
--- a/specs/009-login-rate-limiting/spec.md
+++ b/specs/009-login-rate-limiting/spec.md
@@ -0,0 +1,84 @@
+# Feature Specification: Login Brute-Force Protection
+
+**Feature Branch**: `009-login-rate-limiting`  
+**Created**: 2026-05-06  
+**Status**: Draft  
+**Input**: User description: "Login API endpoints should be rate limited or otherwise protected against brute force attacks"
+
+## User Scenarios & Testing *(mandatory)*
+
+### User Story 1 - Repeated failed logins are blocked (Priority: P1)
+
+An attacker (or misconfigured client) sending many rapid login attempts with the wrong password is slowed or blocked before they can exhaustively guess credentials. After a threshold number of consecutive failures from the same source, the system refuses further attempts for a cooldown period and returns a clear, non-leaking error.
+
+**Why this priority**: Directly prevents credential-stuffing and brute-force attacks against the sole privileged account. Without this, the owner account is exposed to automated password guessing with no friction.
+
+**Independent Test**: Send more than the allowed number of failed login requests in quick succession and confirm that subsequent attempts are rejected with a rate-limit or lockout response — without knowing or changing the real password.
+
+**Acceptance Scenarios**:
+
+1. **Given** an attacker sends N+1 failed login attempts within the configured window, **When** the (N+1)th request arrives, **Then** the system returns an error response indicating the request is blocked (not the normal "invalid credentials" error) and does not process the login attempt.
+2. **Given** a legitimate user has been temporarily blocked after too many failures, **When** the cooldown period elapses and they retry with the correct password, **Then** they are logged in successfully.
+3. **Given** a legitimate user makes a few failed attempts and then waits beyond the counting window, **When** they retry within the next window, **Then** their failure counter resets and they are not blocked.
+
+---
+
+### User Story 2 - Operators can observe and reason about blocking activity (Priority: P2)
+
+When the protection triggers, the system produces enough observable signal (log entries, response metadata) that an operator can confirm the feature is working, diagnose false positives, and tune thresholds — without exposing sensitive details to the client.
+
+**Why this priority**: Invisible security controls are unmanageable. Operators need to know the system is doing what it claims, and blocked legitimate users need a clear (but not exploitable) explanation.
+
+**Independent Test**: Trigger the rate limiter and confirm that: (a) the response body or headers communicate that the request was blocked and when the client may retry; (b) the server logs an entry identifying the blocked source and the reason.
+
+**Acceptance Scenarios**:
+
+1. **Given** a source is blocked, **When** they receive the rejection response, **Then** the response indicates they should wait before retrying (e.g., a `Retry-After` hint) without disclosing the exact threshold values.
+2. **Given** the rate limiter fires, **When** an operator inspects server logs, **Then** there is a log entry at WARNING level or above recording the blocked source and timestamp.
+
+---
+
+### Edge Cases
+
+- What happens when a distributed attacker rotates IPs to avoid per-IP limits?
+- How does the system behave if the backing store for rate-limit counters is temporarily unavailable — does it fail open (allow all) or fail closed (block all)?
+- Are IPv6 addresses and IPv4-mapped-IPv6 addresses treated consistently?
+- Does a successful login reset the failure counter for that source?
+- What happens if many legitimate users share a NAT/proxy IP (e.g., corporate network)?
+- What if `LOGIN_TRUSTED_PROXY_IPS` is configured to include an IP that an external attacker controls? (An attacker could then spoof `X-Forwarded-For` and rotate fake source IPs to bypass the rate limiter, so operators must only list genuinely trusted upstream infrastructure.)
+
+## Requirements *(mandatory)*
+
+### Functional Requirements
+
+- **FR-001**: The system MUST enforce a maximum number of failed login attempts per source identifier (the resolved client IP address) within a rolling time window before blocking further attempts.
+- **FR-002**: Once a source exceeds the failure threshold, the system MUST reject subsequent login requests for a configurable cooldown period, returning a distinct response (not the normal invalid-credentials response).
+- **FR-003**: After the cooldown period expires, the system MUST permit the source to attempt login again, resetting its failure count.
+- **FR-004**: A successful login MUST reset the failure counter for that source, preventing accumulation of old failures from blocking future legitimate access.
+- **FR-005**: The rejection response MUST NOT reveal the specific threshold values or remaining lockout duration in a way that aids an attacker in timing their attempts, but MUST provide enough information (e.g., "try again later") for a legitimate user to understand the situation.
+- **FR-006**: The system MUST log a structured warning event whenever a source is blocked, including the source identifier and timestamp.
+- **FR-007**: Rate-limit thresholds (maximum attempts, window duration, cooldown duration) MUST be configurable without code changes.
+- **FR-008**: The system MUST support a configurable list of trusted upstream proxy IP addresses and CIDR ranges. When the TCP peer address matches a trusted proxy, the resolved client IP MUST be extracted from the `X-Forwarded-For` request header (first entry) or, if absent, `X-Real-IP`. When no trusted proxies are configured, the TCP peer address MUST be used directly and forwarded-IP headers MUST be ignored.
+
+### Key Entities
+
+- **Rate-limit record**: Tracks the number of consecutive failures and the window start time for a given source identifier; expires automatically after the cooldown period.
+- **Source identifier**: The resolved client IP address used to key rate-limit records. When `LOGIN_TRUSTED_PROXY_IPS` is empty (default), this is the TCP peer address. When one or more proxy IPs/CIDRs are configured and the TCP peer matches, the first `X-Forwarded-For` entry (or `X-Real-IP`) is used instead.
+
+## Success Criteria *(mandatory)*
+
+### Measurable Outcomes
+
+- **SC-001**: An automated script sending 100 consecutive failed login requests completes with at least 90 of those requests rejected after the threshold is crossed — verified in a controlled test environment.
+- **SC-002**: A legitimate user who has been temporarily blocked can successfully log in within 5 minutes of the cooldown period expiring without any manual intervention.
+- **SC-003**: Zero information about threshold values or exact lockout expiry is present in blocked response bodies or headers.
+- **SC-004**: Every blocking event produces a corresponding log entry; 100% of triggered blocking events are observable in logs during testing.
+
+## Assumptions
+
+- The application has a single login endpoint used by all clients (the owner login introduced in feature 004).
+- Source identification uses the resolved client IP address. By default (when `LOGIN_TRUSTED_PROXY_IPS` is empty) this is the TCP peer address. When one or more proxy IPs/CIDRs are configured, the first entry of `X-Forwarded-For` (or `X-Real-IP`) is used instead — but only when the TCP peer is in the trusted list, preventing header spoofing by external clients.
+- If the rate-limit backing store is unavailable, the system fails open (allows the attempt through) rather than blocking all logins — this preserves the owner's access, which is critical for a single-user admin application.
+- No CAPTCHA or multi-factor step is in scope; protection is purely count/time-based.
+- The feature targets the login endpoint only; other endpoints are out of scope.
+- The single-user nature of the app means IP-based identification is sufficient — there is no need for per-username lockout, and using IP (rather than username) avoids contributing to username enumeration risk.
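The edge-case list in `spec.md` asks whether IPv6 and IPv4-mapped IPv6 addresses are treated consistently. One possible answer, shown here as a hedged sketch, is to canonicalise the source identifier before using it as a dictionary key. The function name `normalize_source_ip` is hypothetical, and the plan as written does not perform this step; it relies only on the standard `ipaddress` module.

```python
import ipaddress


def normalize_source_ip(raw: str) -> str:
    """Return a canonical key for an IP string; non-IP input is returned unchanged."""
    try:
        addr = ipaddress.ip_address(raw.strip())
    except ValueError:
        return raw  # e.g. "unknown": keep as-is rather than fail
    # Collapse IPv4-mapped IPv6 (::ffff:a.b.c.d) onto the plain IPv4 form so
    # the same client is not counted under two different keys.
    if isinstance(addr, ipaddress.IPv6Address) and addr.ipv4_mapped is not None:
        return str(addr.ipv4_mapped)
    # str() yields the compressed, lowercase form for IPv6 addresses.
    return str(addr)


print(normalize_source_ip("::ffff:203.0.113.5"))
print(normalize_source_ip("2001:0DB8::0001"))
```

If adopted, the natural place to apply such a helper would be on the value returned by `get_client_ip()` before it reaches the limiter, so that both the TCP peer and forwarded-header paths share one key format.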
--- a/specs/009-login-rate-limiting/tasks.md
+++ b/specs/009-login-rate-limiting/tasks.md
@@ -0,0 +1,120 @@
+# Tasks: Login Brute-Force Protection
+
+**Input**: Design documents from `specs/009-login-rate-limiting/`
+**Prerequisites**: plan.md ✅, spec.md ✅, research.md ✅, data-model.md ✅, contracts/auth.md ✅, quickstart.md ✅
+
+**Tests**: TDD is non-negotiable (§5.1). Every test task appears before the implementation task it covers. For each red step, run the test and confirm it fails before proceeding to the implementation.
+
+**Organization**: Phase 1 adds env vars; Phase 2 adds config fields (shared by both stories); Phase 3 implements the core blocking behaviour (US1 MVP); Phase 4 adds observability-specific test coverage (US2); Phase 5 is polish.
+
+## Format: `[ID] [P?] [Story] Description`
+
+- **[P]**: Can run in parallel with other [P] tasks in the same phase
+- **[Story]**: Which user story this task belongs to
+- Exact file paths included in every task description
+
+---
+
+## Phase 1: Setup
+
+- [X] T001 Add a `# Login brute-force protection` comment block with `LOGIN_MAX_FAILURES=5`, `LOGIN_WINDOW_SECONDS=300`, `LOGIN_COOLDOWN_SECONDS=900`, and `LOGIN_TRUSTED_PROXY_IPS=` (empty by default, with an inline comment explaining it accepts comma-separated IPs/CIDRs) to both `.env.example` and `.env.test.example` at the repo root
+
+---
+
+## Phase 2: Foundational
+
+**Purpose**: Add the four new settings fields, which are required before any story implementation.
+
+- [X] T002 Add `login_max_failures: int = 5`, `login_window_seconds: int = 300`, `login_cooldown_seconds: int = 900`, `login_trusted_proxy_ips: str = ""` to the `Settings` class in `api/app/config.py` (append after `owner_password`)
+
+**Checkpoint**: `api/app/config.py` accepts all four new env vars with defaults.
+
+---
+
+## Phase 3: User Story 1 — Repeated failed logins are blocked (Priority: P1) 🎯 MVP
+
+**Goal**: After `LOGIN_MAX_FAILURES` consecutive failed login attempts from the same source IP within `LOGIN_WINDOW_SECONDS`, `POST /api/v1/auth/token` returns HTTP 429 for `LOGIN_COOLDOWN_SECONDS`. A successful login resets the counter.
+
+**Independent Test**: `cd api && python -m pytest tests/unit/test_rate_limiter.py tests/integration/test_login_rate_limit.py::test_repeated_failures_trigger_429 tests/integration/test_login_rate_limit.py::test_success_resets_counter tests/integration/test_login_rate_limit.py::test_429_has_retry_after_header tests/integration/test_login_rate_limit.py::test_xff_header_ignored_when_no_trusted_networks -v` — all pass.
+
+### Tests for User Story 1 (TDD red — write first, confirm failure before T005)
+
+- [X] T003 [P] [US1] Create `api/tests/unit/test_rate_limiter.py` with ten failing unit tests — import `LoginRateLimiter` and `get_client_ip` from `app.auth.rate_limiter`; for `LoginRateLimiter` (instantiate with `max_failures=3, window_seconds=60, cooldown_seconds=300`): `test_not_blocked_initially`, `test_blocked_after_threshold`, `test_success_clears_failures`, `test_ips_are_isolated`, `test_window_resets_after_expiry`, `test_log_warning_on_lockout` (caplog at WARNING level: call `record_failure()` until threshold, assert `"Login blocked" in caplog.text` and IP in log output); for `get_client_ip` (construct a mock using `from unittest.mock import MagicMock` and `from starlette.requests import Request`: `req = MagicMock(spec=Request); req.client.host = "10.0.0.1"; req.headers = {"X-Forwarded-For": "203.0.113.5"}`): `test_get_client_ip_no_trusted_networks_returns_peer` (empty `trusted_networks=[]` → returns `req.client.host`), `test_get_client_ip_trusted_peer_uses_xff` (peer `"10.0.0.1"` in trusted CIDR `"10.0.0.0/8"` → returns `"203.0.113.5"`), `test_get_client_ip_untrusted_peer_ignores_xff` (peer `"8.8.8.8"` not in trusted CIDR `"10.0.0.0/8"` → returns `"8.8.8.8"` despite XFF), `test_get_client_ip_trusted_peer_falls_back_to_real_ip` (peer trusted, no XFF header, `X-Real-IP: "203.0.113.9"` → returns `"203.0.113.9"`); run `python -m pytest tests/unit/test_rate_limiter.py -v` and confirm `ImportError` or `ModuleNotFoundError` (red)
+- [X] T004 [P] [US1] Create `api/tests/integration/test_login_rate_limit.py` with four failing integration tests; each must override both `app.state.login_rate_limiter` (fresh `LoginRateLimiter(max_failures=3, window_seconds=60, cooldown_seconds=30)`) and `app.state.login_trusted_networks` (set to `[]` for all four tests — the `ASGITransport` peer is `"testclient"`, not a valid IP, so trusted-network matching can't be exercised here; proxy extraction is fully covered by T003 unit tests) via try/finally: (1) `test_repeated_failures_trigger_429` — POST three bad-credential requests then assert fourth returns 429 with `resp.json()["code"] == "login_rate_limited"`; (2) `test_success_resets_counter` — two failures → one valid login using `{"username": os.environ["OWNER_USERNAME"], "password": os.environ["OWNER_PASSWORD"]}` (matching conftest.py defaults: `testowner`/`testpassword`) → three more failures → assert all three return 401, not 429; (3) `test_429_has_retry_after_header` — trigger lockout (three failures), then assert `"Retry-After" in resp.headers` and `int(resp.headers["Retry-After"]) > 0`; (4) `test_xff_header_ignored_when_no_trusted_networks` — send three bad-cred requests with `headers={"X-Forwarded-For": "1.2.3.4"}` then a fourth with `headers={"X-Forwarded-For": "9.9.9.9"}` — assert the fourth returns 429 (not 401), proving the limiter tracked the real peer `"testclient"` for all requests and XFF was ignored; run `python -m pytest tests/integration/test_login_rate_limit.py -v` and confirm failure (red)
+
+### Implementation for User Story 1
+
+- [X] T005 [US1] Create `api/app/auth/rate_limiter.py` with two exports: (1) `get_client_ip(request: Request, trusted_networks: list[IPv4Network | IPv6Network]) -> str` — imports `ipaddress`, `from ipaddress import IPv4Network, IPv6Network`, `from starlette.requests import Request`; extracts `peer = request.client.host if request.client else "unknown"`; if `trusted_networks` is non-empty and peer is parseable as an IP address and falls within any trusted network, returns first `X-Forwarded-For` entry (strip whitespace) or `X-Real-IP` value, otherwise returns `peer`; wraps `ipaddress.ip_address(peer)` in `try/except ValueError` and falls back to `peer` on error; (2) `LoginRateLimiter` class: `__init__(self, max_failures: int = 5, window_seconds: int = 300, cooldown_seconds: int = 900)` storing params as `_max`, `_window`, `_cooldown`; `_store: dict[str, _Record]` and `_lock: threading.Lock`; `@dataclass _Record` with `failures: int = 0`, `window_start: float = field(default_factory=time.time)`, `blocked_until: float = 0.0`; `is_blocked(ip: str) -> bool`, `record_failure(ip: str) -> None` (logs WARNING on lockout), `record_success(ip: str) -> None`, `cooldown_seconds` property; stdlib imports: `import ipaddress, logging, time`, `from dataclasses import dataclass, field`, `from threading import Lock`
+- [X] T006 [US1] Update `api/app/main.py` lifespan: add `import ipaddress` at top; import `LoginRateLimiter` from `app.auth.rate_limiter`; inside `lifespan` before `engine = get_engine()`, consolidate to `settings = get_settings()` (remove the existing bare `get_settings()` call), then set `application.state.login_rate_limiter = LoginRateLimiter(max_failures=settings.login_max_failures, window_seconds=settings.login_window_seconds, cooldown_seconds=settings.login_cooldown_seconds)`; then parse `settings.login_trusted_proxy_ips` — split on `","`, strip each part, skip empty strings, call `ipaddress.ip_network(part, strict=False)` inside a `try/except ValueError` (skip invalid entries silently), collect results into `trusted_networks: list`; set `application.state.login_trusted_networks = trusted_networks`
+- [X] T007 [US1] Update `api/app/routers/auth.py` login endpoint: add `Request` to FastAPI imports and add `from fastapi.responses import JSONResponse`; add `from app.auth.rate_limiter import LoginRateLimiter, get_client_ip`; add `request: Request` as first parameter to `login()`; extract `limiter: LoginRateLimiter = request.app.state.login_rate_limiter` and `ip: str = get_client_ip(request, request.app.state.login_trusted_networks)`; add guard block — if `limiter.is_blocked(ip)`: return `JSONResponse(status_code=429, content={"detail": "Too many failed login attempts. Please try again later.", "code": "login_rate_limited"}, headers={"Retry-After": str(limiter.cooldown_seconds)})`; after `verify_credentials` returns False: call `limiter.record_failure(ip)` before the existing `HTTPException`; after `auth.create_token()`: call `limiter.record_success(ip)` before returning `TokenResponse`
+- [X] T008 [US1] Verify TDD green: run `cd api && python -m pytest tests/unit/test_rate_limiter.py -v` — all 10 pass; run `make test-integration` — all tests pass including `test_repeated_failures_trigger_429`, `test_success_resets_counter`, `test_429_has_retry_after_header`, and `test_xff_header_ignored_when_no_trusted_networks`
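
The `LoginRateLimiter` described in T005 can be sketched roughly as follows. This is a minimal illustration built only from the names and defaults listed in the task (`_Record`, `_max`, `_window`, `_cooldown`, `is_blocked`, `record_failure`, `record_success`); the window-reset rule and the exact log message are assumptions, not the committed implementation.

```python
import logging
import time
from dataclasses import dataclass, field
from threading import Lock

logger = logging.getLogger(__name__)


@dataclass
class _Record:
    failures: int = 0
    window_start: float = field(default_factory=time.time)
    blocked_until: float = 0.0


class LoginRateLimiter:
    """In-memory, per-IP failure counter with a cooldown after the threshold."""

    def __init__(self, max_failures: int = 5, window_seconds: int = 300,
                 cooldown_seconds: int = 900) -> None:
        self._max = max_failures
        self._window = window_seconds
        self._cooldown = cooldown_seconds
        self._store: dict[str, _Record] = {}
        self._lock = Lock()

    @property
    def cooldown_seconds(self) -> int:
        return self._cooldown

    def is_blocked(self, ip: str) -> bool:
        with self._lock:
            record = self._store.get(ip)
            return record is not None and time.time() < record.blocked_until

    def record_failure(self, ip: str) -> None:
        now = time.time()
        with self._lock:
            record = self._store.setdefault(ip, _Record(window_start=now))
            if now - record.window_start > self._window:
                # Assumed rule: an expired window restarts the count.
                record.failures = 0
                record.window_start = now
            record.failures += 1
            if record.failures >= self._max:
                record.blocked_until = now + self._cooldown
                logger.warning("Login blocked for %s for %d seconds", ip, self._cooldown)

    def record_success(self, ip: str) -> None:
        # A successful login clears all state for this IP.
        with self._lock:
            self._store.pop(ip, None)
```

Because the store lives in process memory, each API replica would keep its own counters under this sketch; a shared store would be needed for a cluster-wide limit.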
+
+**Checkpoint**: Brute-force blocking is live. Automated repeated failures are stopped after threshold; the owner can still log in after cooldown; unit and integration tests pass.
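
The proxy-aware IP extraction (T005) and the trusted-network parsing (T006) can be sketched together. This is a hedged illustration: `parse_trusted_networks` is a hypothetical helper name (T006 inlines this logic in `lifespan`), and the request objects are `SimpleNamespace` stand-ins rather than Starlette `Request` instances, so header lookup here is case-sensitive, unlike Starlette's.

```python
from __future__ import annotations

import ipaddress
from ipaddress import IPv4Network, IPv6Network
from types import SimpleNamespace
from typing import Any


def parse_trusted_networks(raw: str) -> list[IPv4Network | IPv6Network]:
    """Parse a comma-separated list of IPs/CIDRs, skipping empty and invalid parts."""
    networks = []
    for part in raw.split(","):
        part = part.strip()
        if not part:
            continue
        try:
            networks.append(ipaddress.ip_network(part, strict=False))
        except ValueError:
            continue  # invalid entries are skipped silently, as T006 specifies
    return networks


def get_client_ip(request: Any, trusted_networks: list) -> str:
    """Return the forwarded client IP only when the direct peer is a trusted proxy."""
    peer = request.client.host if request.client else "unknown"
    if not trusted_networks:
        return peer
    try:
        peer_addr = ipaddress.ip_address(peer)
    except ValueError:
        return peer  # e.g. a test transport peer that is not a valid IP
    if not any(peer_addr in net for net in trusted_networks):
        return peer
    forwarded = request.headers.get("X-Forwarded-For")
    if forwarded:
        return forwarded.split(",")[0].strip()
    return request.headers.get("X-Real-IP") or peer


# Usage with stand-in request objects (hypothetical values from the T003 test cases).
trusted = parse_trusted_networks("10.0.0.0/8, , not-a-cidr")
proxied = SimpleNamespace(client=SimpleNamespace(host="10.0.0.1"),
                          headers={"X-Forwarded-For": "203.0.113.5"})
direct = SimpleNamespace(client=SimpleNamespace(host="8.8.8.8"),
                         headers={"X-Forwarded-For": "203.0.113.5"})
print(get_client_ip(proxied, trusted))  # 203.0.113.5
print(get_client_ip(direct, trusted))   # 8.8.8.8
```

Ignoring `X-Forwarded-For` from untrusted peers matters because any client can set that header; trusting it unconditionally would let an attacker rotate fake IPs and never hit the threshold.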
+
+---
+
+## Phase 4: User Story 2 — Operators can observe blocking activity (Priority: P2)
+
+**Goal**: The 429 response includes a `Retry-After` header with a positive integer; the response body `code` is `"login_rate_limited"` and contains no threshold numeric values; server logs a WARNING when blocking triggers.
+
+**Independent Test**: Trigger the rate limiter (already works from Phase 3) and assert `Retry-After` header is present in the response and `code` field is `"login_rate_limited"`.
+
+### Tests for User Story 2 (TDD red — extend existing file)
+
+- [X] T009 [US2] Add one test to `api/tests/integration/test_login_rate_limit.py` targeting observability properties not yet covered: `test_429_body_shape` — override `app.state.login_rate_limiter` with a fresh `LoginRateLimiter(max_failures=3, window_seconds=60, cooldown_seconds=30)` via try/finally (same isolation pattern as T004), trigger lockout (three failures), then assert `resp.json() == {"detail": "Too many failed login attempts. Please try again later.", "code": "login_rate_limited"}` (exact match — confirms no threshold values leak and shape is correct); confirm this test is green immediately against the US1 implementation (T007 already returns this exact body)
+
+**Checkpoint**: US2 observability properties are explicitly exercised by integration tests; a future regression in the Retry-After header or code field will be caught.
+
+---
+
+## Phase 5: Polish & Cross-Cutting Concerns
+
+- [X] T010 Run `cd api && ruff check app/auth/rate_limiter.py app/routers/auth.py app/config.py app/main.py tests/unit/test_rate_limiter.py tests/integration/test_login_rate_limit.py` — fix any violations
+
+---
+
+## Dependencies & Execution Order
+
+### Phase Dependencies
+
+- **Phase 1 (Setup)**: No external dependencies — can start immediately
+- **Phase 2 (Foundational)**: No external dependencies — can start immediately (parallel with Phase 1)
+- **Phase 3 (US1)**: Depends on Phase 2 (T002 must exist before T006 can use `settings.login_max_failures`)
+- **Phase 4 (US2)**: Depends on Phase 3 (tests verify behaviour implemented in T007)
+- **Phase 5 (Polish)**: Depends on all prior phases
+
+### Within Phase 3
+
+- T003 ∥ T004 (different files, no dependency — write tests in parallel)
+- T005 after T003, T004 (implement after tests confirm they fail)
+- T006 ∥ T007 after T005 (both import from `rate_limiter.py`; write to different files — `main.py` and `auth.py`; T006 sets `app.state.login_trusted_networks` which T007's router reads)
+- T008 after T005, T006, T007 (verify all pass)
+
+### Execution Order Summary
+
+```
+Step 1: T001 ∥ T002 (setup + foundational — parallel, different files)
+Step 2: T003 ∥ T004 (write failing tests — parallel)
+Step 3: T005 (implement LoginRateLimiter — after red tests confirmed)
+Step 4: T006 ∥ T007 (wire limiter into app — parallel, different files)
+Step 5: T008 (verify green)
+Step 6: T009 (US2 observability tests — verify green immediately)
+Step 7: T010 (ruff clean)
+```
+
+---
+
+## Implementation Strategy
+
+### MVP (US1 — the blocker)
+
+1. Complete T001–T002 (config setup)
+2. Complete T003–T008 (core blocking)
+3. **Validate**: Run `make test-integration` and confirm that all 88 existing tests still pass and the 4 new rate-limit integration tests from T004 pass
+4. US2 adds verification coverage for already-implemented observability features
+
+### Incremental Delivery
+
+- After Phase 3: Brute-force attacks on the login endpoint are blocked — core security net is in place
+- After Phase 4: Observability properties are explicitly tested — regressions in headers/logs will be caught
+- After Phase 5: Lint clean, ready for merge
--- a/specs/010-api-prod-dockerfile/checklists/requirements.md
+++ b/specs/010-api-prod-dockerfile/checklists/requirements.md
@@ -0,0 +1,34 @@
+# Specification Quality Checklist: Production-Grade API Container Image
+
+**Purpose**: Validate specification completeness and quality before proceeding to planning
+**Created**: 2026-05-07
+**Feature**: [spec.md](../spec.md)
+
+## Content Quality
+
+- [X] No implementation details (languages, frameworks, APIs)
+- [X] Focused on user value and business needs
+- [X] Written for non-technical stakeholders
+- [X] All mandatory sections completed
+
+## Requirement Completeness
+
+- [X] No [NEEDS CLARIFICATION] markers remain
+- [X] Requirements are testable and unambiguous
+- [X] Success criteria are measurable
+- [X] Success criteria are technology-agnostic (no implementation details)
+- [X] All acceptance scenarios are defined
+- [X] Edge cases are identified
+- [X] Scope is clearly bounded
+- [X] Dependencies and assumptions identified
+
+## Feature Readiness
+
+- [X] All functional requirements have clear acceptance criteria
+- [X] User scenarios cover primary flows
+- [X] Feature meets measurable outcomes defined in Success Criteria
+- [X] No implementation details leak into specification
+
+## Notes
+
+- All items pass. Ready for `/speckit-plan`.
--- a/specs/010-api-prod-dockerfile/contracts/container.md
+++ b/specs/010-api-prod-dockerfile/contracts/container.md
@@ -0,0 +1,122 @@
+# Contract: Production API Container Image
+
+This document defines the observable interface of the `reactbin-api-prod` container image. Any orchestration layer (Kubernetes manifests, Docker Compose, CI pipeline) MUST be written against this contract.
+
+---
+
+## Network Interface
+
+| Property | Value |
+|----------|-------|
+| Protocol | HTTP/1.1 |
+| Port | 8000 (TCP) |
+| Bind address | `0.0.0.0` (all interfaces inside the container) |
+
+---
+
+## Health Check
+
+The container exposes a health check at the existing API health endpoint:
+
+```
+GET /api/v1/health
+```
+
+**Success response** (`200 OK`):
+```json
+{ "status": "ok" }
+```
+
+The container declares a built-in `HEALTHCHECK` with the following defaults:
+
+| Parameter | Value |
+|-----------|-------|
+| Interval | 30s |
+| Timeout | 5s |
+| Start period | 10s |
+| Retries | 3 |
+
+Orchestrators that define their own probes (e.g. Kubernetes `livenessProbe` / `readinessProbe`) SHOULD use this same endpoint.
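
For orchestration code or CI jobs that need to probe the endpoint without `curl`, a standard-library check might look like the sketch below. The function name `is_healthy` is hypothetical; the URL and the 5s timeout mirror the contract above, and proxy environment variables are deliberately ignored on the assumption that the probe targets a local or in-cluster address.

```python
import urllib.error
import urllib.request

# An opener with an empty ProxyHandler ignores http_proxy/https_proxy settings.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def is_healthy(url: str = "http://localhost:8000/api/v1/health",
               timeout: float = 5.0) -> bool:
    """Return True only when the endpoint answers 200 within the timeout."""
    try:
        with _opener.open(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers HTTP error statuses, refused connections, and timeouts.
        return False
```

Like `curl -f`, this treats any non-200 outcome as a failed probe, so it could be dropped into a retry loop with the same interval and retry counts as the `HEALTHCHECK` defaults.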
+
+---
+
+## Required Environment Variables
+
+All configuration is supplied at runtime via environment variables. The image contains no defaults for secret or environment-specific values.
+
+| Variable | Description | Example |
+|----------|-------------|---------|
+| `JWT_SECRET_KEY` | HS256 signing key for bearer tokens | `change-me-long-random-string` |
+| `OWNER_USERNAME` | Username of the single owner account | `owner` |
+| `OWNER_PASSWORD` | Password of the single owner account | `change-me` |
+| `DATABASE_URL` | PostgreSQL connection URL (asyncpg scheme) | `postgresql+asyncpg://user:pass@host:5432/db` |
+| `S3_ENDPOINT_URL` | S3-compatible object storage endpoint | `https://s3.amazonaws.com` |
+| `S3_BUCKET_NAME` | Storage bucket name | `reactbin-prod` |
+| `S3_ACCESS_KEY_ID` | Storage access key | `AKIAIOSFODNN7EXAMPLE` |
+| `S3_SECRET_ACCESS_KEY` | Storage secret key | `wJalrXUtnFEMI/K7MDENG` |
+| `S3_REGION` | Storage region | `us-east-1` |
+
+**Optional environment variables** (safe defaults apply):
+
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `JWT_EXPIRY_SECONDS` | `86400` | Token lifetime in seconds |
+| `MAX_UPLOAD_BYTES` | `52428800` | Maximum upload file size (50 MB) |
+| `LOGIN_MAX_FAILURES` | `5` | Brute-force lock threshold |
+| `LOGIN_WINDOW_SECONDS` | `300` | Failure counting window |
+| `LOGIN_COOLDOWN_SECONDS` | `900` | Lock duration after threshold |
+| `LOGIN_TRUSTED_PROXY_IPS` | _(empty)_ | Comma-separated trusted proxy IPs/CIDRs |
+| `API_BASE_URL` | _(not required at runtime)_ | Used only by client tooling |
+
+**Startup failure behaviour**: If a required variable is absent, the application exits with a non-zero code before accepting any requests. The error is logged to stderr identifying the missing variable.
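
The fail-fast rule can be illustrated with a standard-library sketch. This is not the application's actual settings code (which this contract does not show); `REQUIRED_VARS`, `missing_required`, and `check_environment` are hypothetical names, and the variable list is copied from the table above.

```python
import os
import sys

REQUIRED_VARS = (
    "JWT_SECRET_KEY", "OWNER_USERNAME", "OWNER_PASSWORD", "DATABASE_URL",
    "S3_ENDPOINT_URL", "S3_BUCKET_NAME", "S3_ACCESS_KEY_ID",
    "S3_SECRET_ACCESS_KEY", "S3_REGION",
)


def missing_required(env) -> list[str]:
    """Return the required variable names that are absent or empty in env."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


def check_environment(env=None) -> None:
    """Exit non-zero, naming the missing variables on stderr, before serving."""
    env = os.environ if env is None else env
    missing = missing_required(env)
    if missing:
        print(f"Missing required environment variables: {', '.join(missing)}",
              file=sys.stderr)
        raise SystemExit(1)
```

Running such a check before the server binds its port is what lets an orchestrator see a crash loop immediately instead of a container that starts but fails on the first request.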
+
+---
+
+## Signal Handling
+
+| Signal | Behaviour |
+|--------|-----------|
+| `SIGTERM` | Stop accepting new connections; drain in-flight requests; exit 0 within 30s |
+| `SIGKILL` | Immediate termination (OS-level; no graceful drain possible) |
+
+Kubernetes should configure `terminationGracePeriodSeconds ≥ 30` to allow the full drain window.
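
The SIGTERM contract can be exercised outside Docker with a small sketch. It assumes a POSIX host; `CHILD_SOURCE` is a hypothetical stand-in for the API process (it simply exits 0 on SIGTERM), and `terminate_and_wait` mimics what `docker stop` does: SIGTERM first, SIGKILL after the grace period.

```python
import signal
import subprocess
import sys

# Hypothetical stand-in for the API process: exits 0 when it receives SIGTERM.
CHILD_SOURCE = """
import signal, sys, time
signal.signal(signal.SIGTERM, lambda signum, frame: sys.exit(0))
print("ready", flush=True)
while True:
    time.sleep(0.1)
"""


def terminate_and_wait(proc: subprocess.Popen, grace_seconds: float = 30.0) -> int:
    """Send SIGTERM, wait up to grace_seconds, then fall back to SIGKILL."""
    proc.send_signal(signal.SIGTERM)
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL: no graceful drain is possible at this point
        return proc.wait()


if __name__ == "__main__":
    child = subprocess.Popen([sys.executable, "-c", CHILD_SOURCE],
                             stdout=subprocess.PIPE, text=True)
    child.stdout.readline()  # block until the child has installed its handler
    print("exit code:", terminate_and_wait(child))
    child.stdout.close()
```

A process that honours the contract returns 0 here. One that has to be killed returns a negative code from `Popen.wait` (Docker reports the same situation as exit code 137).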
+
+---
+
+## Process Identity
+
+| Property | Value |
+|----------|-------|
+| User | `appuser` |
+| UID | `1001` |
+| GID | `1001` |
+| Root privileges | None |
+
+The container MUST NOT be run with `--privileged` or as UID 0.
+
+---
+
+## Filesystem
+
+- **Working directory**: `/app`
+- **Application source**: `/app/app/`
+- **Virtual environment**: `/app/.venv/`
+- **No writable state**: The container requires no persistent local storage. All state is in PostgreSQL and S3.
+- **Read-only root**: The container is compatible with `--read-only` (no writes to the filesystem at runtime).
+
+---
+
+## Logging
+
+All log output is written to **stdout** (info/debug) and **stderr** (warnings/errors). No log files are written inside the container. The container runtime log driver captures all output without additional configuration.
+
+---
+
+## Image Tags
+
+| Tag pattern | Meaning |
+|-------------|---------|
+| `reactbin-api-prod:latest` | Latest build from `master` |
+| `reactbin-api-prod:<git-sha>` | Immutable build for a specific commit |
+
+Deployments SHOULD pin to a specific git SHA tag, not `latest`.
--- a/specs/010-api-prod-dockerfile/plan.md
+++ b/specs/010-api-prod-dockerfile/plan.md
@@ -0,0 +1,242 @@
+# Implementation Plan: Production-Grade API Container Image
+
+**Branch**: `010-api-prod-dockerfile` | **Date**: 2026-05-07 | **Spec**: [spec.md](spec.md)
+**Input**: Feature specification from `specs/010-api-prod-dockerfile/spec.md`
+
+## Summary
+
+Produce a production-ready `api/Dockerfile.prod` using a two-stage build: a uv builder stage that installs lockfile-pinned, production-only dependencies into a virtual environment, and a lean `python:3.12-slim` runtime stage that contains only the venv, application source, and `curl` for health checks. The runtime process runs as a non-root user (UID 1001), handles SIGTERM gracefully via uvicorn's built-in drain, and logs exclusively to stdout/stderr. Behavioral verification is automated via a shell script (`api/tests/build/verify_production_image.sh`) written before the Dockerfile (§5.1 TDD).
+
+---
+
+## Technical Context
+
+**Language/Version**: Python 3.12 (existing API), Docker multi-stage build  
+**Build tool**: uv (lockfile: `api/uv.lock`, already committed)  
+**Base images**: `ghcr.io/astral-sh/uv:python3.12-bookworm-slim` (builder), `python:3.12-slim` (runtime)  
+**Testing**: Shell verification script (`verify_production_image.sh`) + `make verify-prod` target  
+**Target Platform**: linux/amd64 container (Kubernetes or Docker host)  
+**Performance Goals**: Container starts and passes health check within 30s; rebuild from warm cache in under 60s  
+**Constraints**: No root process, no hardcoded secrets, no dev deps in final image, compatible with `--read-only` filesystem  
+**Scale/Scope**: Single-file addition (`Dockerfile.prod`) + shell test + two Makefile targets; zero changes to existing source code
+
+---
+
+## Constitution Check
+
+*GATE: Must pass before Phase 0 research. Re-checked post-design below.*
+
+| Principle | Status | Notes |
+|-----------|--------|-------|
+| §5.1 TDD non-negotiable | **COMPLIANT** | `verify_production_image.sh` written before `Dockerfile.prod`; script fails (red) because the build file is absent, then passes (green) after |
+| §5.2 Test pyramid | **COMPLIANT** | Shell verification script is the integration-level test for this build artefact; no unit tests applicable (no Python business logic added) |
+| §5.4 CI must pass | **COMPLIANT** | `make verify-prod` target is runnable in host CI (requires Docker on the runner, which the existing `make test-integration` already requires) |
+| §6 Tech Stack — Docker | **COMPLIANT** | Docker + Docker Compose are mandated; this adds a production Docker file within that constraint |
+| §7.1 One-command local start | **COMPLIANT** | `api/Dockerfile` (dev stack) is unchanged; `docker compose up` is unaffected |
+| §7.2 Environment configuration | **COMPLIANT** | `Dockerfile.prod` contains zero hardcoded env values; all config is injected at runtime |
+| §7.3 Ruff/lint | **COMPLIANT** | No new Python files; shell script linted with `shellcheck` |
+| §2.6 No speculative abstraction | **COMPLIANT** | Single Dockerfile, no plugin system or generics |
+| §8 Scope boundaries | **COMPLIANT** | Purely infrastructure; no new API routes, data model, or UI changes |
+
+**Post-design re-check**: All gates remain green. No violations.
+
+---
+
+## Project Structure
+
+### Documentation (this feature)
+
+```text
+specs/010-api-prod-dockerfile/
+├── plan.md              # This file
+├── research.md          # Phase 0 decisions
+├── contracts/
+│   └── container.md     # Container interface contract (port, env vars, signals, user)
+├── quickstart.md        # Build and verification scenarios
+└── tasks.md             # Generated by /speckit-tasks
+```
+
+### Source Code Changes
+
+```text
+api/
+├── Dockerfile           # Existing dev/test image — UNCHANGED
+├── Dockerfile.prod      # NEW: production multi-stage image
+├── .dockerignore        # Existing — verify test files are excluded from build context
+└── tests/
+    └── build/
+        └── verify_production_image.sh   # NEW: TDD verification script (written first)
+
+Makefile                 # Root Makefile — add build-prod and verify-prod targets
+```
+
+---
+
+## Dockerfile.prod — Annotated Reference
+
+```dockerfile
+# syntax=docker/dockerfile:1
+
+# ════════════════════════════════════════════════
+# Build stage: install production deps via uv
+# ════════════════════════════════════════════════
+FROM ghcr.io/astral-sh/uv:python3.12-bookworm-slim AS builder
+
+WORKDIR /app
+
+# Pre-compile bytecode; use copy mode for cross-layer compatibility
+ENV UV_COMPILE_BYTECODE=1 \
+    UV_LINK_MODE=copy \
+    UV_PYTHON_DOWNLOADS=never
+
+# ── Layer cache split: deps only (changes rarely) ──
+COPY pyproject.toml uv.lock ./
+RUN --mount=type=cache,target=/root/.cache/uv \
+    uv sync --frozen --no-dev --no-install-project
+
+# ── Layer cache split: source (changes often) ──
+COPY app/ ./app/
+
+# ════════════════════════════════════════════════
+# Runtime stage: lean image with venv + source
+# ════════════════════════════════════════════════
+FROM python:3.12-slim
+
+WORKDIR /app
+
+# curl for HEALTHCHECK — only tool added beyond base Python
+RUN apt-get update \
+    && apt-get install -y --no-install-recommends curl \
+    && rm -rf /var/lib/apt/lists/*
+
+# Non-root system user (UID/GID 1001)
+RUN groupadd --system --gid 1001 appgroup \
+    && useradd --system --uid 1001 --gid 1001 --no-create-home appuser
+
+# Copy venv from builder; copy source directly from build context
+COPY --from=builder --chown=appuser:appgroup /app/.venv /app/.venv
+COPY --chown=appuser:appgroup app/ ./app/
+
+USER appuser
+
+# Activate the venv by prepending its bin to PATH
+ENV PATH="/app/.venv/bin:$PATH"
+
+EXPOSE 8000
+
+HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
+    CMD curl -f http://localhost:8000/api/v1/health || exit 1
+
+# uvicorn handles SIGTERM; --timeout-graceful-shutdown gives 30s to drain requests
+CMD ["uvicorn", "app.main:app", \
+     "--host", "0.0.0.0", \
+     "--port", "8000", \
+     "--timeout-graceful-shutdown", "30"]
+```
+
+> **Note on COPY paths**: Build context is `api/` (as set by the Makefile target). `COPY app/ ./app/` in both stages refers to `api/app/`. The runtime stage copies source directly from the build context, not from the builder stage — this is simpler and avoids an extra intermediate layer.
+
+---
+
+## verify_production_image.sh — Structure
+
+```sh
+#!/usr/bin/env bash
+# TDD verification script for api/Dockerfile.prod
+# Fails (red) if Dockerfile.prod does not exist or any check fails.
+set -euo pipefail
+
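+IMAGE="reactbin-api-prod:verify-$$"
+CONTAINER=""   # set in Step 2; initialised so cleanup is safe under `set -u` if the build fails
+
+cleanup() {
+  [[ -n "$CONTAINER" ]] && docker rm -f "$CONTAINER" >/dev/null 2>&1 || true
+  docker rmi "$IMAGE" >/dev/null 2>&1 || true
+}
+trap cleanup EXIT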
+
+# Step 1: Build — fails red if Dockerfile.prod is absent
+docker build -f api/Dockerfile.prod api/ -t "$IMAGE"
+
+# Step 2: Start container with minimal env vars
+CONTAINER=$(docker run -d -p 18000:8000 \
+  -e JWT_SECRET_KEY=verify-test-key \
+  -e OWNER_USERNAME=testowner \
+  -e OWNER_PASSWORD=testpassword \
+  -e DATABASE_URL=postgresql+asyncpg://noop:noop@noop/noop \
+  -e S3_ENDPOINT_URL=http://noop:9000 \
+  -e S3_BUCKET_NAME=noop \
+  -e S3_ACCESS_KEY_ID=noop \
+  -e S3_SECRET_ACCESS_KEY=noop \
+  -e S3_REGION=us-east-1 \
+  "$IMAGE")
+
+# Step 3: Poll health endpoint (app will fail to connect to DB, but /health is pre-DB)
+for i in $(seq 1 30); do
+  if curl -sf http://localhost:18000/api/v1/health > /dev/null; then break; fi
+  sleep 1
+  [[ $i -eq 30 ]] && { echo "FAIL: health check timed out"; exit 1; }
+done
+
+# Step 4: Assert non-root user
+UID_IN_CONTAINER=$(docker exec "$CONTAINER" id -u)
+[[ "$UID_IN_CONTAINER" -ne 0 ]] || { echo "FAIL: process running as root"; exit 1; }
+
+# Step 5: Graceful shutdown
+docker stop -t 30 "$CONTAINER" >/dev/null   # sends SIGTERM; waits up to 30s before SIGKILL (default is 10s)
+EXIT_CODE=$(docker wait "$CONTAINER")
+[[ "$EXIT_CODE" -eq 0 ]] || { echo "FAIL: non-zero exit code $EXIT_CODE"; exit 1; }
+
+# Step 6: Dev deps absent
+if docker run --rm "$IMAGE" /app/.venv/bin/python -c "import pytest" 2>/dev/null; then
+  echo "FAIL: pytest importable in production image (dev deps present)"; exit 1
+fi
+
+echo "All production image checks passed."
+```
+
+> **Note on health check feasibility**: `/api/v1/health` is a simple JSON response that does not require a database connection (confirmed in `api/app/main.py`). The verification script can therefore pass even without a real PostgreSQL instance.
+
+---
+
+## Makefile Targets
+
+Add to root `Makefile`:
+
+```makefile
+.PHONY: build-prod verify-prod
+
+build-prod:
+	docker build -f api/Dockerfile.prod api/ -t reactbin-api-prod:latest
+
+verify-prod:
+	bash api/tests/build/verify_production_image.sh
+```
+
+---
+
+## `.dockerignore` Review
+
+The existing `api/.dockerignore` already excludes `.venv/`, `__pycache__/`, `.env`, etc. Three additions (`tests/`, `alembic/`, and `alembic.ini`) improve the production build context; `*.egg-info/` is shown for completeness:
+
+```
+tests/
+*.egg-info/
+alembic/
+alembic.ini
+```
+
+`tests/` and `alembic/` are not needed in the production image (we `COPY app/ ./app/` explicitly). Excluding them from the build context reduces the data sent to the Docker daemon.
+
+> `*.egg-info/` is already present in the existing `.dockerignore`.
+
+---
+
+## Implementation Order
+
+Tasks are generated by `/speckit-tasks`, but the logical dependency order is:
+
+1. **Write `verify_production_image.sh`** (TDD red — build fails because `Dockerfile.prod` absent)
+2. **Add `Makefile` targets** (`build-prod`, `verify-prod`) — references the script
+3. **Write `api/Dockerfile.prod`** (implement to make TDD pass)
+4. **Update `api/.dockerignore`** (exclude `tests/`, `alembic/` from build context)
+5. **Run `make verify-prod`** (TDD green — all 6 checks pass)
+6. **Run `shellcheck`** on `verify_production_image.sh`
+
+No existing tests are modified. `make test-integration` continues to use `api/Dockerfile` unchanged.
--- a/specs/010-api-prod-dockerfile/quickstart.md
+++ b/specs/010-api-prod-dockerfile/quickstart.md
@@ -0,0 +1,138 @@
+# Quickstart: Production API Container Image
+
+## Prerequisites
+
+- Docker 24+ installed and running on the host
+- `make` available
+- A copy of `.env` (or the env vars from `.env.example`) for smoke-testing
+
+---
+
+## Build the Production Image
+
+```sh
+make build-prod
+# Equivalent: docker build -f api/Dockerfile.prod api/ -t reactbin-api-prod:latest
+```
+
+On a warm cache (deps unchanged), the build should complete in under 60 seconds because the dependency layer is reused.
+
+---
+
+## Verify the Production Image (TDD Smoke Test)
+
+```sh
+make verify-prod
+```
+
+This runs `api/tests/build/verify_production_image.sh`, which:
+1. Builds the image (fails fast if `Dockerfile.prod` is missing — the **red** TDD state)
+2. Starts the container with test env vars
+3. Polls `/api/v1/health` until it returns 200 (or times out after 30s)
+4. Asserts the API process is running as a non-root user (UID ≠ 0)
+5. Sends SIGTERM and asserts the container exits with code 0 within 30s
+6. Asserts `pytest` is NOT importable inside the container (dev deps excluded)
+
+**Expected output (green)**:
+```
+[verify] Building reactbin-api-prod:test ...
+[verify] Build OK
+[verify] Starting container ...
+[verify] Health check passed (GET /api/v1/health → 200)
+[verify] Process user: 1001 (non-root ✓)
+[verify] Sending SIGTERM ...
+[verify] Container exited with code 0 (graceful shutdown ✓)
+[verify] Dev deps absent ✓
+[verify] All checks passed.
+```
+
+---
+
+## User Story Integration Scenarios
+
+### US1 — API Runs Reliably in Production
+
+```sh
+# Start container with real (or test) env vars
+# (no --rm: the stopped container must still exist for `docker wait` below)
+docker run -d \
+  --name reactbin-test \
+  -p 8000:8000 \
+  -e JWT_SECRET_KEY=my-secret \
+  -e OWNER_USERNAME=owner \
+  -e OWNER_PASSWORD=changeme \
+  -e DATABASE_URL=postgresql+asyncpg://user:pass@host:5432/db \
+  -e S3_ENDPOINT_URL=http://minio:9000 \
+  -e S3_BUCKET_NAME=reactbin \
+  -e S3_ACCESS_KEY_ID=minioadmin \
+  -e S3_SECRET_ACCESS_KEY=minioadmin \
+  -e S3_REGION=us-east-1 \
+  reactbin-api-prod:latest
+
+# Check health
+curl http://localhost:8000/api/v1/health
+# → {"status":"ok"}
+
+# Graceful shutdown
+docker stop -t 30 reactbin-test   # sends SIGTERM; allows the full 30s drain window
+docker wait reactbin-test         # → exit code 0
+docker rm reactbin-test           # remove the stopped container
+```
+
+### US2 — Minimal, Secure Container
+
+```sh
+# Verify non-root user
+docker inspect --format='{{.Config.User}}' reactbin-api-prod:latest
+# → appuser (or 1001)
+
+# Verify no dev packages (pytest should not be importable)
+docker run --rm reactbin-api-prod:latest \
+  /app/.venv/bin/python -c "import pytest" 2>&1
+# → ModuleNotFoundError: No module named 'pytest'
+
+# Verify no source control or test files in image
+docker run --rm reactbin-api-prod:latest ls -A /app
+# → .venv  app   (no tests/, no alembic/, no .git/)
+```
+
+### US3 — Fast, Reproducible Builds
+
+```sh
+# First build (cold): installs all deps
+time docker build --no-cache -f api/Dockerfile.prod api/ -t reactbin-api-prod:cold
+
+# Touch a source file only (no dep change)
+touch api/app/main.py
+
+# Second build: dependency layer served from cache
+time docker build -f api/Dockerfile.prod api/ -t reactbin-api-prod:warm
+# Expect: warm build < 60s; cold build varies (network-dependent)
+
+# Confirm same health response from both
+docker run --rm ... reactbin-api-prod:cold
+docker run --rm ... reactbin-api-prod:warm
+```
+
+---
+
+## Missing Env Var Behaviour
+
+```sh
+# OWNER_USERNAME intentionally omitted
+docker run --rm \
+  -e JWT_SECRET_KEY=my-secret \
+  reactbin-api-prod:latest
+# → Container exits non-zero, stderr logs: "field required: owner_username"
+```
+
+---
+
+## Read-Only Filesystem Compatibility
+
+```sh
+docker run --rm -d --read-only -p 8000:8000 \
+  -e JWT_SECRET_KEY=... [other env vars] \
+  reactbin-api-prod:latest
+
+curl http://localhost:8000/api/v1/health
+# → {"status":"ok"}
+```
--- a/specs/010-api-prod-dockerfile/research.md
+++ b/specs/010-api-prod-dockerfile/research.md
@@ -0,0 +1,94 @@
+# Research: Production API Container Image
+
+## Decision 1 — Use a Separate `Dockerfile.prod`
+
+**Decision**: Add `api/Dockerfile.prod` alongside the existing `api/Dockerfile`.
+
+**Rationale**: The existing `api/Dockerfile` installs dev dependencies (`.[dev]`), mounts source with `--reload`, and is used by the Docker Compose integration test stack. Modifying it would break `make test-integration`. A separate file keeps the two images independent with zero coupling.
+
+**Alternatives considered**:
+- Build-arg flag in a single Dockerfile: adds conditional complexity and makes both files harder to read.
+- Rename existing to `Dockerfile.dev` and make `Dockerfile` the production image: would require updating `docker-compose.test.yml` with an explicit file reference — a wider change than needed for this feature.
+
+---
+
+## Decision 2 — Multi-Stage Build: uv Builder + python:3.12-slim Runtime
+
+**Decision**: Two-stage build. Stage 1 (`builder`) uses `ghcr.io/astral-sh/uv:python3.12-bookworm-slim` to install production dependencies into a virtual environment. Stage 2 (the runtime stage) uses `python:3.12-slim`, copies only the `.venv` from the builder, and copies the application source directly from the build context. uv is not present in the final image.
+
+**Rationale**: 
+- uv's official Docker image is the fastest, most correct way to produce a pinned, bytecode-compiled venv from `uv.lock`.
+- Keeping uv out of the runtime image reduces attack surface and image size.
+- `python:3.12-slim` is a well-maintained, widely scanned base; using it for the runtime stage aligns with existing project images.
+
+**Layer caching strategy**:
+```
+COPY pyproject.toml uv.lock ./
+RUN uv sync --frozen --no-dev --no-install-project   ← cache hits when only source changes
+COPY app/ ./app/                                       ← only reaches here on source changes
+```
+`--no-install-project` installs all listed dependencies without the project package itself. The project source is then copied separately. This means a source-only change reuses the dependency layer from cache.
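
The caching argument can be made concrete with a toy model. This is a deliberate simplification, not Docker's actual cache algorithm: it only assumes that the dependency layer is reused whenever the files copied before `uv sync` are byte-identical. `DEP_INPUTS` and `dep_layer_key` are hypothetical names.

```python
import hashlib

# Files copied into the image before the dependency-install step.
DEP_INPUTS = ("pyproject.toml", "uv.lock")


def dep_layer_key(context: dict[str, bytes]) -> str:
    """Digest of the dependency inputs only; application source is ignored."""
    digest = hashlib.sha256()
    for name in DEP_INPUTS:
        digest.update(name.encode())
        digest.update(b"\0")
        digest.update(context[name])
    return digest.hexdigest()


before = {"pyproject.toml": b"[project]", "uv.lock": b"v1", "app/main.py": b"print(1)"}
source_change = {**before, "app/main.py": b"print(2)"}
lock_change = {**before, "uv.lock": b"v2"}

print(dep_layer_key(before) == dep_layer_key(source_change))  # True: layer reused
print(dep_layer_key(before) == dep_layer_key(lock_change))    # False: deps reinstalled
```

If `COPY app/ ./app/` came before the install step instead, the source files would be part of the key and every code edit would trigger a full dependency reinstall.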
+
+**Environment variables for optimal builds**:
+- `UV_COMPILE_BYTECODE=1` — pre-compile `.pyc` files; slightly larger venv but faster cold starts.
+- `UV_LINK_MODE=copy` — avoids hard-link issues when copying between image layers.
+- `UV_PYTHON_DOWNLOADS=never` — ensures the builder stage uses the bundled Python, not a downloaded one.
+
+**Alternatives considered**:
+- Installing deps into the system Python (`--system`): rejected because it pollutes the base image and makes it harder to copy deps cleanly into the runtime stage.
+- Using a single `FROM python:3.12-slim` with pip: slower builds, no lockfile pinning, no bytecode compilation step.
+
+---
+
+## Decision 3 — Non-Root User (UID 1001, System User)
+
+**Decision**: Create a system user `appuser` with GID/UID 1001 in the runtime stage. All owned files are `chown`-ed at `COPY` time using `--chown=appuser:appgroup`.
+
+**Rationale**: Running as root inside a container widens the impact of a container breakout. A numeric UID (rather than a named user that might not exist on the host) is required by some Kubernetes pod security admission policies. UID 1001 avoids collision with UID 1000 (the typical first interactive user on a Linux host) while remaining a predictable, inspectable value.
+
+**Alternatives considered**:
+- UID 1000: small risk of collision with host user when bind mounts are involved.
+- `USER nobody`: `nobody` (UID 65534) works but its name and UID are not consistent across distros.
+
+---
+
+## Decision 4 — SIGTERM Graceful Shutdown via uvicorn `--timeout-graceful-shutdown`
+
+**Decision**: Use `uvicorn`'s built-in `--timeout-graceful-shutdown 30` flag. No process supervisor (tini, s6) is required.
+
+**Rationale**: uvicorn handles SIGTERM natively when run as PID 1 in single-worker mode (the production Dockerfile runs one worker). On SIGTERM it stops accepting new connections, waits up to `--timeout-graceful-shutdown` seconds for in-flight requests to complete, then exits with code 0. No additional init system is needed.
+
+**Alternatives considered**:
+- tini: adds a small init shim that reaps zombies and forwards signals. Not necessary with a single uvicorn worker (no child processes to reap).
+- Gunicorn + uvicorn workers: more complex; appropriate for multi-worker setups but the deployment platform (Kubernetes) scales horizontally via pod replicas rather than in-process workers.
+
+---
+
+## Decision 5 — `curl` for HEALTHCHECK
+
+**Decision**: Install `curl` (via `apt-get --no-install-recommends`) in the runtime stage and use it in the `HEALTHCHECK` directive.
+
+**Rationale**: The existing dev Dockerfile already installs `curl` for the same reason. `curl -f` exits non-zero on HTTP errors, making it a reliable single-command health probe. A Python one-liner adds interpreter startup overhead (~100ms) per check; `curl` is ~5ms.
+
+**Alternatives considered**:
+- `wget -q --spider`: available on Alpine but not on Debian-slim by default; requires separate install.
+- Python `urllib.request`: no extra install, but slower and adds noise to the process table during health checks.
+
+---
+
+## Decision 6 — TDD Verification via Shell Script
+
+**Decision**: Write `api/tests/build/verify_production_image.sh` before `Dockerfile.prod`. The script builds the image and runs behavioral checks (health endpoint, non-root user, clean SIGTERM exit). It is the "failing test" per §5.1.
+
+**Rationale**: The production image is a build artefact, not Python business logic. Driving Docker from pytest would require Docker-in-Docker in the current CI stack, which it does not support. A shell script run on the host (via `make verify-prod`) is the appropriate TDD vehicle for this artefact type.
+
+**Verification steps the script covers**:
+1. `docker build -f api/Dockerfile.prod api/` → fails (red) until Dockerfile.prod exists.
+2. Run container with required env vars; wait for health endpoint → `GET /api/v1/health` returns 200.
+3. Inspect running process user → UID ≠ 0 (non-root).
+4. Send SIGTERM to container; assert exit code 0 within 30s (graceful shutdown).
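+5. Assert dev packages are absent: `python -c "import pytest"` run with the image's virtualenv interpreter must exit non-zero (a uv-created virtualenv ships no `pip`, so `pip show pytest` would not inspect it).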
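Step 2 depends on waiting for the health endpoint before asserting anything. A minimal sketch of such a polling loop follows; `wait_for` is a hypothetical helper name, and the host port 18000 in the usage comment is taken from the port mapping described in tasks.md.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: retry a command until it succeeds or attempts run out.
# $1 = max attempts, $2 = delay between attempts in seconds, rest = command.
wait_for() {
  local attempts="$1" delay="$2" i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    # Running the command as an `if` condition keeps `set -e` from aborting.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}

# Example use (assumes the container's port 8000 is published on host port 18000):
#   wait_for 30 1 curl -sf http://localhost:18000/api/v1/health \
#     || { echo "FAIL: health endpoint not ready after 30s" >&2; exit 1; }
```

The same helper would also fit the `pg_isready` poll, since both checks are "retry up to 30 times, one second apart".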
+
+**Alternatives considered**:
+- pytest with docker SDK: requires `docker` Python package and DinD in CI; rejected as over-engineered for a single-file build artifact.
+- Manual verification only: rejected because §5.1 mandates automated failing tests before production code.
--- a/specs/010-api-prod-dockerfile/spec.md
+++ b/specs/010-api-prod-dockerfile/spec.md
@@ -0,0 +1,96 @@
+# Feature Specification: Production-Grade API Container Image
+
+**Feature Branch**: `010-api-prod-dockerfile`
+**Created**: 2026-05-07
+**Status**: Draft
+**Input**: User description: "We need a production-grade Dockerfile for the API to start preparing for a production deployment."
+
+## User Scenarios & Testing *(mandatory)*
+
+### User Story 1 — API Runs Reliably in Production (Priority: P1)
+
+An operator builds and runs the API container in a production environment. The container starts successfully, serves requests, and can be health-checked by an orchestrator (e.g., Kubernetes). When the orchestrator signals shutdown, the container drains in-flight requests before exiting cleanly, avoiding dropped connections.
+
+**Why this priority**: Without a correctly functioning container, no production deployment is possible. This is the baseline that all other stories depend on.
+
+**Independent Test**: Build the image from source, run the container with required env vars, call the health endpoint, send SIGTERM, and verify the process exits cleanly with code 0. No other stories are required.
+
+**Acceptance Scenarios**:
+
+1. **Given** a built container image and all required env vars, **When** the container starts, **Then** it begins serving requests within 30 seconds and the health endpoint returns a success response.
+2. **Given** a running container, **When** a SIGTERM is received, **Then** the process finishes any in-flight requests and exits with code 0 within 30 seconds.
+3. **Given** a built container image, **When** the container is started without a required env var, **Then** the process exits immediately with a non-zero code and logs a clear error message identifying the missing variable.
+
+---
+
+### User Story 2 — Minimal, Secure Container (Priority: P2)
+
+A security-conscious operator audits the container image before promotion to production. They verify the API process does not run as root, the image contains no development tooling or test artefacts, and no credentials are baked into the image layers.
+
+**Why this priority**: Running as root or including unnecessary tools increases the blast radius of any container breakout. This is a production-readiness requirement, not optional hardening.
+
+**Independent Test**: Inspect the built image to confirm the runtime user is non-root, confirm no dev/test files are present in the image layers, and scan the image with a standard vulnerability scanner. Passes independently of any deployment environment.
+
+**Acceptance Scenarios**:
+
+1. **Given** a built container image, **When** the running process user is inspected, **Then** the API process runs as a non-root user with a numeric UID.
+2. **Given** a built container image, **When** the image layers are inspected, **Then** no development dependencies, test files, or local configuration are present.
+3. **Given** a built container image, **When** the image layers are scanned for hardcoded secrets, **Then** no credentials, API keys, or secret values are found embedded in any layer.
+
+---
+
+### User Story 3 — Fast, Reproducible Builds (Priority: P3)
+
+A developer rebuilds the container image after a code change. The build completes quickly because unchanged layers (dependencies) are cached. Given identical source inputs, the resulting image is functionally equivalent across builds, enabling confident CI/CD promotion.
+
+**Why this priority**: Slow or non-deterministic builds reduce developer confidence and slow deployment pipelines. Important for velocity, but the container already works (P1, P2) before this is optimised.
+
+**Independent Test**: Build the image twice from the same source; confirm the second build reuses dependency layers from cache and completes significantly faster than the first.
+
+**Acceptance Scenarios**:
+
+1. **Given** an image built once, **When** only application source files change and the image is rebuilt, **Then** the dependency installation step is served from cache and the rebuild completes faster than a clean build.
+2. **Given** two builds from the same source commit, **When** the images are run, **Then** both produce identical API behaviour.
+
+---
+
+### Edge Cases
+
+- What happens when the database is unavailable at container startup?
+- What happens when the container is sent SIGKILL instead of SIGTERM (hard kill by orchestrator)?
+- What happens if the container runs out of memory mid-request?
+- How does the image behave when run with a read-only filesystem (`--read-only`)?
+
+## Requirements *(mandatory)*
+
+### Functional Requirements
+
+- **FR-001**: The container image MUST start the API service and begin accepting requests without manual intervention once the required env vars are supplied.
+- **FR-002**: The container image MUST expose a health check that an orchestrator can poll to determine service readiness.
+- **FR-003**: The container image MUST handle the SIGTERM signal by completing in-flight requests then exiting cleanly within 30 seconds.
+- **FR-004**: The container image MUST run the API process as a non-root, non-privileged user.
+- **FR-005**: The container image MUST NOT contain development dependencies, test files, source control metadata, or local configuration files.
+- **FR-006**: The container image MUST NOT contain any hardcoded credentials, secrets, or environment-specific values — all configuration MUST be supplied via environment variables at runtime.
+- **FR-007**: The container image MUST log to standard output and standard error so logs are captured by the container runtime without additional configuration.
+- **FR-008**: The container image MUST be buildable reproducibly from the same source inputs — a rebuild from the same commit MUST produce a functionally equivalent image.
+- **FR-009**: Rebuilding the image after a source-only change (no dependency changes) MUST reuse the cached dependency installation layer.
+
+## Success Criteria *(mandatory)*
+
+### Measurable Outcomes
+
+- **SC-001**: The container starts and serves its first successful health-check response within 30 seconds of launch with all required env vars present.
+- **SC-002**: The container exits cleanly (code 0) within 30 seconds of receiving a SIGTERM, with no in-flight requests dropped.
+- **SC-003**: The API process inside the container runs as a non-root user (inspectable via container runtime tooling).
+- **SC-004**: A rebuild after a source-only change completes in under 60 seconds on a warm cache (dependency layer reused).
+- **SC-005**: The image contains zero hardcoded secrets (verifiable by static layer inspection).
+- **SC-006**: All API logs appear on stdout/stderr and are captured by the container runtime log driver without additional sidecar or configuration.
+
+## Assumptions
+
+- The existing test Dockerfile (used by the integration test stack) is not suitable for production and will remain separate; this feature produces a distinct production image.
+- All required runtime configuration (database URL, S3 credentials, JWT secret, etc.) will be injected as environment variables by the deployment platform — the image itself carries no environment-specific values.
+- The deployment target supports OCI-compatible container images (Kubernetes, Docker, etc.).
+- No persistent local storage is needed by the API container; all state lives in the database and object storage.
+- The production image does not need to run database migrations; migrations are applied by a separate step in the deployment pipeline.
+- A single-architecture image (linux/amd64) is sufficient for the initial production target.
--- a/specs/010-api-prod-dockerfile/tasks.md
+++ b/specs/010-api-prod-dockerfile/tasks.md
@@ -0,0 +1,158 @@
+# Tasks: Production-Grade API Container Image
+
+**Input**: Design documents from `specs/010-api-prod-dockerfile/`
+**Prerequisites**: plan.md ✅, spec.md ✅, research.md ✅, contracts/container.md ✅, quickstart.md ✅
+
+**Tests**: TDD is non-negotiable (§5.1). The "test" for a Docker build artefact is `api/tests/build/verify_production_image.sh`, written before `api/Dockerfile.prod` exists. Running the script immediately fails (red) because the build step cannot find the file; writing `Dockerfile.prod` turns it green.
+
+**Organization**: Phase 1 sets up Makefile targets and `.dockerignore`; Phase 2 creates the test directory; Phase 3 (US1) writes the verification script and the Dockerfile; Phase 4 (US2) extends the script with security checks; Phase 5 (US3) extends it with a cache-hit check; Phase 6 polishes.
+
+## Format: `[ID] [P?] [Story] Description`
+
+- **[P]**: Can run in parallel with other [P] tasks in the same phase
+- **[Story]**: Which user story this task belongs to
+- Exact file paths included in every task description
+
+---
+
+## Phase 1: Setup
+
+- [X] T001 Add `build-prod` and `verify-prod` targets (and their `.PHONY` entries) to the root `Makefile` at `/workspace/Makefile`: `build-prod` runs `docker build -f api/Dockerfile.prod api/ -t reactbin-api-prod:latest`; `verify-prod` runs `bash api/tests/build/verify_production_image.sh`
+
+- [X] T002 Update `api/.dockerignore` at `/workspace/api/.dockerignore`: append three lines — `tests/`, `alembic/`, and `alembic.ini` — so these are excluded from the production build context (the Dockerfile.prod copies only `app/` explicitly, but excluding them from the context keeps the transfer to the Docker daemon fast)
+
+---
+
+## Phase 2: Foundational
+
+- [X] T003 Create directory `api/tests/build/` at `/workspace/api/tests/build/` with `mkdir -p` and add a `.gitkeep` so the directory is tracked
+
+**Checkpoint**: Directory structure is ready; Makefile and .dockerignore are updated.
+
+---
+
+## Phase 3: User Story 1 — API Runs Reliably in Production (Priority: P1) 🎯 MVP
+
+**Goal**: The container builds, starts, serves the health endpoint, and exits cleanly on SIGTERM.
+
+**Independent Test**: `make verify-prod` — passes when `Dockerfile.prod` exists and all US1 checks pass.
+
+### Test for User Story 1 (TDD red — write first, confirm failure before T005)
+
+- [X] T004 [US1] Create `api/tests/build/verify_production_image.sh` as an executable bash script (`chmod +x`) with `#!/usr/bin/env bash` and `set -euo pipefail`; the script MUST:
+  1. Set `IMAGE="reactbin-api-prod:verify-$$"` and `PG_CONTAINER=""` and `APP_CONTAINER=""`;
+  2. Define a `cleanup()` function that runs `docker rm -f "$APP_CONTAINER" "$PG_CONTAINER" 2>/dev/null || true` and `docker rmi "$IMAGE" 2>/dev/null || true`, and register it with `trap cleanup EXIT`;
+  3. **[US1 check 1 — build]** Run `docker build -f api/Dockerfile.prod api/ -t "$IMAGE"` — this is the line that fails **red** because `api/Dockerfile.prod` does not yet exist; print `[verify] Building $IMAGE...` before and `[verify] Build OK` after;
+  4. **[US1 check 2 — start with real DB]** Launch a throwaway postgres: `PG_CONTAINER=$(docker run -d -e POSTGRES_DB=reactbin_verify -e POSTGRES_USER=verify -e POSTGRES_PASSWORD=verify postgres:16-alpine)`; poll `docker exec "$PG_CONTAINER" pg_isready -U verify` up to 30 × 1s, fail if timeout; capture `PG_IP=$(docker inspect -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' "$PG_CONTAINER")`;
+  5. Start the production container: `APP_CONTAINER=$(docker run -d -p 18000:8000 -e JWT_SECRET_KEY=verify-key -e OWNER_USERNAME=testowner -e OWNER_PASSWORD=testpassword -e DATABASE_URL="postgresql+asyncpg://verify:verify@${PG_IP}:5432/reactbin_verify" -e S3_ENDPOINT_URL=http://noop:9000 -e S3_BUCKET_NAME=noop -e S3_ACCESS_KEY_ID=noop -e S3_SECRET_ACCESS_KEY=noop -e S3_REGION=us-east-1 "$IMAGE")`; note — S3 credentials are placeholders; the health endpoint does not require S3;
+  6. **[US1 check 3 — health endpoint]** Poll `curl -sf http://localhost:18000/api/v1/health` up to 30 × 1s, fail with a message if timeout; print `[verify] Health check passed` on success;
+  7. **[US1 check 4 — SIGTERM → exit 0]** Run `docker stop "$APP_CONTAINER"` (sends SIGTERM, then SIGKILL if the container is still running after Docker's default 10-second grace period); capture `EXIT_CODE=$(docker wait "$APP_CONTAINER")`; assert `"$EXIT_CODE" -eq 0`, fail with `FAIL: non-zero exit $EXIT_CODE` otherwise; print `[verify] Graceful shutdown OK (exit $EXIT_CODE)`;
+  8. Print `[verify] US1 checks passed.`
+  9. **[C3 — missing env var → non-zero exit]** Run `docker run --rm -e JWT_SECRET_KEY=verify-key "$IMAGE" 2>&1`; assert the exit code is **non-zero** (OWNER_USERNAME is absent so Pydantic settings validation must fail at startup); print `[verify] Missing-env-var exit check OK`;
+  After writing the script, run `make verify-prod` and confirm it **fails** with a Docker build error (red state — `Dockerfile.prod` does not exist).
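Because the script runs under `set -euo pipefail`, a bare command that exits non-zero would abort it, so the missing-env-var check in step 9 has to capture the failure explicitly. One way to do that is sketched below; `expect_failure` is a hypothetical helper name, not something the task prescribes.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: succeed only if the given command exits non-zero.
# Running the command as an `if` condition keeps `set -e` from aborting.
expect_failure() {
  if "$@" >/dev/null 2>&1; then
    echo "FAIL: command unexpectedly succeeded: $*" >&2
    return 1
  fi
}

# Example use for the missing-env-var check (IMAGE as defined in step 1):
#   expect_failure docker run --rm -e JWT_SECRET_KEY=verify-key "$IMAGE"
#   echo "[verify] Missing-env-var exit check OK"
```

If the image started normally despite the missing variable, a foreground `docker run` would block rather than return; the sketch does not handle that case.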
+
+### Implementation for User Story 1
+
+- [X] T005 [US1] Create `api/Dockerfile.prod` at `/workspace/api/Dockerfile.prod` — a two-stage multi-stage build:
+  **Stage 1 (builder)**: `FROM ghcr.io/astral-sh/uv:python3.12-bookworm-slim AS builder`; `WORKDIR /app`; set `ENV UV_COMPILE_BYTECODE=1 UV_LINK_MODE=copy UV_PYTHON_DOWNLOADS=never`; `COPY pyproject.toml uv.lock ./`; `RUN --mount=type=cache,target=/root/.cache/uv uv sync --frozen --no-dev --no-install-project`; `COPY app/ ./app/`
+  **Stage 2 (runtime)**: `FROM python:3.12-slim`; `WORKDIR /app`; `RUN apt-get update && apt-get install -y --no-install-recommends curl && rm -rf /var/lib/apt/lists/*`; `RUN groupadd --system --gid 1001 appgroup && useradd --system --uid 1001 --gid 1001 --no-create-home appuser`; `COPY --from=builder --chown=appuser:appgroup /app/.venv /app/.venv`; `COPY --chown=appuser:appgroup app/ ./app/`; `USER appuser`; `ENV PATH="/app/.venv/bin:$PATH"`; `EXPOSE 8000`; `HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 CMD curl -f http://localhost:8000/api/v1/health || exit 1`; `CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000", "--timeout-graceful-shutdown", "30"]`
+
+- [X] T006 [US1] Verify TDD green for US1: run `make verify-prod` and confirm all four US1 checks pass (build OK, container starts against the throwaway Postgres, health endpoint returns 200, SIGTERM produces exit code 0) and that `[verify] US1 checks passed.` is printed.
+
+**Checkpoint**: US1 is complete. Production container builds, starts, serves traffic, and shuts down gracefully.
+
+---
+
+## Phase 4: User Story 2 — Minimal, Secure Container (Priority: P2)
+
+**Goal**: The production image runs as non-root and contains no dev dependencies or embedded secrets.
+
+**Independent Test**: US2 checks in `make verify-prod` — the same script extended with non-root and dev-deps-absent assertions.
+
+### Tests for User Story 2 (TDD extension — add checks, confirm they pass against existing Dockerfile.prod)
+
+- [X] T007 [US2] Extend `api/tests/build/verify_production_image.sh` with two US2 checks and two cross-cutting checks (C1, C2), each placed as noted below and all before the final `US1 checks passed` line:
+  **[US2 check 1 — non-root]** After the container is running (before `docker stop`), run `UID_IN_CONTAINER=$(docker exec "$APP_CONTAINER" id -u)`; assert `"$UID_IN_CONTAINER" -ne 0`, fail with `FAIL: process running as root (UID 0)` if violated; print `[verify] Non-root user OK (UID $UID_IN_CONTAINER)`;
+  **[US2 check 2 — dev deps absent]** After cleanup of APP_CONTAINER but still holding the image, run `docker run --rm "$IMAGE" /app/.venv/bin/python -c "import pytest" 2>/dev/null`; assert the command returns **non-zero** (i.e., pytest is NOT importable); if it returns 0, fail with `FAIL: pytest importable in production image (dev deps present)`; print `[verify] Dev deps absent OK`;
+  **[C1 — stdout log capture]** Run `docker logs "$APP_CONTAINER" 2>&1`; assert the output is non-empty and contains `Started server` or `Application startup complete` (uvicorn startup lines); fail with `FAIL: no startup logs found on stdout/stderr` if absent; print `[verify] Stdout logging OK`; note — insert this check while APP_CONTAINER is still running, before the `docker stop` call;
+  **[C2 — no hardcoded secrets in layers]** Run `docker history --no-trunc "$IMAGE" 2>&1`; pipe through `grep -iE "(password|secret_key|api_key|token)"`; assert zero matching lines; if any match, fail with `FAIL: potential secret found in image history`; print `[verify] No secrets in image layers OK`;
+  Update the final success line to `[verify] All checks passed (US1 + US2).`; confirm `make verify-prod` passes.
+
+**Checkpoint**: US2 is verified. Image runs as UID 1001 and contains no test tooling.
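The C2 check needs some care under `set -euo pipefail`, because `grep` exits 1 when nothing matches, which is the passing case here. A minimal sketch under that assumption follows; `assert_no_secrets_in_history` is a hypothetical helper name, and it reads the history text on stdin so it can be exercised without Docker.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: read `docker history --no-trunc` output on stdin and
# fail if any line mentions a secret-like keyword (case-insensitive).
assert_no_secrets_in_history() {
  local matches
  # grep exits 1 when nothing matches; `|| true` stops set -e from aborting.
  matches="$(grep -iE '(password|secret_key|api_key|token)' || true)"
  if [ -n "$matches" ]; then
    echo "FAIL: potential secret found in image history" >&2
    return 1
  fi
  echo "[verify] No secrets in image layers OK"
}

# Example use:
#   docker history --no-trunc "$IMAGE" 2>&1 | assert_no_secrets_in_history
```

Note that `docker history` lists the instructions that created each layer (including `ENV` and `ARG` values), not the files inside the layers, so this check covers build metadata only.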
+
+---
+
+## Phase 5: User Story 3 — Fast, Reproducible Builds (Priority: P3)
+
+**Goal**: Rebuilding after a source-only change reuses the dependency layer from cache.
+
+**Independent Test**: US3 check in `make verify-prod`: a second build after touching a source file asserts the dep layer was cached.
+
+### Tests for User Story 3 (TDD extension)
+
+- [X] T008 [US3] Extend `api/tests/build/verify_production_image.sh` with a US3 cache check appended after all other checks (before final success line):
+  **[US3 check — dep layer cached on source-only rebuild]** Set `IMAGE2="reactbin-api-prod:verify-cache-$$"`; `touch api/app/main.py`; capture the output of `docker build --progress=plain -f api/Dockerfile.prod api/ -t "$IMAGE2" 2>&1` (the `--progress=plain` flag ensures consistent `CACHED` output regardless of Docker version or TTY settings); assert the output contains the string `CACHED`; if `CACHED` is absent, fail with `FAIL: dependency layer not reused on source-only rebuild`; add `docker rmi "$IMAGE2" 2>/dev/null || true` to the `cleanup()` function; print `[verify] Dep layer cache hit confirmed (US3 OK)`;
+  Update the final success line to `[verify] All checks passed (US1 + US2 + US3).`
+
+- [X] T009 [US3] Verify TDD green for US3: run `make verify-prod` and confirm the full script passes including the cache check — the build output for the second image must contain `CACHED`, and `[verify] All checks passed (US1 + US2 + US3).` must print.
+
+**Checkpoint**: All three user stories are verified end-to-end by `make verify-prod`.
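Asserting only that the build output contains `CACHED` passes as soon as any layer is cached. A stricter variant, sketched below, ties the marker to the `uv sync` step. It is an illustration rather than the committed check: `assert_dep_layer_cached` is a hypothetical helper name, and the sketch assumes BuildKit's `--progress=plain` format, in which a step line such as `#9 [builder 4/5] RUN ... uv sync ...` is followed by a line reading `#9 CACHED`.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: read `docker build --progress=plain` output on stdin and
# succeed only if the step whose command contains "uv sync" is marked CACHED.
assert_dep_layer_cached() {
  local log step
  log="$(cat)"
  # Step id (e.g. "#9") of the first line that mentions uv sync.
  step="$(awk '/uv sync/ && !seen { print $1; seen = 1 }' <<<"$log")"
  if [ -z "$step" ]; then
    echo "FAIL: no uv sync step found in build output" >&2
    return 1
  fi
  # Look for an exact "#<n> CACHED" line belonging to that step.
  if ! grep -qxF "$step CACHED" <<<"$log"; then
    echo "FAIL: dependency layer not reused on source-only rebuild" >&2
    return 1
  fi
  echo "[verify] Dep layer cache hit confirmed (US3 OK)"
}

# Example use:
#   docker build --progress=plain -f api/Dockerfile.prod api/ -t "$IMAGE2" 2>&1 \
#     | assert_dep_layer_cached
```

Here-strings are used instead of pipelines inside the helper so that an early exit from `grep -q` cannot trip `pipefail`.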
+
+---
+
+## Phase 6: Polish & Cross-Cutting Concerns
+
+- [X] T010 Run `make test-integration` from `/workspace` and confirm all 102 existing tests still pass — verifies that the `.dockerignore` additions (T002) do not break the existing test Dockerfile build or any integration test (§5.4 regression gate)
+
+- [X] T011 Run `shellcheck api/tests/build/verify_production_image.sh` and fix any violations (common: unquoted variables, `[ ]` vs `[[ ]]`, missing `--` before arguments)
+
+---
+
+## Dependencies & Execution Order
+
+### Phase Dependencies
+
+- **Phase 1 (Setup)**: No external dependencies — start immediately
+- **Phase 2 (Foundational)**: No dependencies — start immediately (parallel with Phase 1)
+- **Phase 3 (US1)**: Depends on Phase 1 (Makefile + .dockerignore must exist before `make verify-prod` can run) and Phase 2 (test directory must exist)
+- **Phase 4 (US2)**: Depends on Phase 3 (US1 script and Dockerfile must exist to extend)
+- **Phase 5 (US3)**: Depends on Phase 4 (full US2 script must exist to extend)
+- **Phase 6 (Polish)**: Depends on all prior phases; T010 (regression test) must precede T011 (shellcheck)
+
+### Within Phase 3
+
+- T004 before T005 (write test script before writing the Dockerfile)
+- T005 after T004 (implement Dockerfile after confirming red state)
+- T006 after T005 (verify green after implementation)
+
+### Execution Order Summary
+
+```
+Step 1: T001 ∥ T002 ∥ T003  (setup — parallel, different files)
+Step 2: T004                 (write verification script — TDD red)
+Step 3: T005                 (write Dockerfile.prod — implementation)
+Step 4: T006                 (verify US1 green)
+Step 5: T007                 (extend script with US2 checks, verify pass)
+Step 6: T008                 (extend script with US3 check)
+Step 7: T009                 (verify US3 green)
+Step 8: T010                 (make test-integration — regression gate)
+Step 9: T011                 (shellcheck polish)
+```
+
+---
+
+## Implementation Strategy
+
+### MVP (US1 — reliable production run)
+
+1. Complete T001–T003 (setup)
+2. Complete T004–T006 (core blocking: write script → write Dockerfile → verify green)
+3. **Validate**: `make verify-prod` passes; `make test-integration` still passes (no regressions)
+4. US2 and US3 add explicit verification coverage for properties already implemented
+
+### Incremental Delivery
+
+- After Phase 3: Production image builds, starts, and shuts down gracefully — safe to deploy
+- After Phase 4: Security properties (non-root, no dev deps) are explicitly verified
+- After Phase 5: Build efficiency (layer caching) is confirmed by automated check
+- After Phase 6: Script is lint-clean, ready for CI integration