agatha 12176471e1 Feat: Add production-grade multi-stage container image for API
Two-stage build (uv builder + python:3.12-slim runtime) with non-root
user (UID 1001), no dev deps, layer-cache-optimised dep install, and
graceful SIGTERM shutdown. Verified by api/tests/build/verify_production_image.sh
covering build, health endpoint, non-root, stdout logging, secret-free
layers, missing-env-var exit, and dep-layer cache hit. All 102 integration
tests still pass; shellcheck clean.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-07 19:59:29 +00:00


Feature Specification: Production-Grade API Container Image

Feature Branch: 010-api-prod-dockerfile
Created: 2026-05-07
Status: Draft
Input: User description: "We need a production-grade Dockerfile for the API to start preparing for a production deployment."

User Scenarios & Testing (mandatory)

User Story 1 — API Runs Reliably in Production (Priority: P1)

An operator builds and runs the API container in a production environment. The container starts successfully, serves requests, and can be health-checked by an orchestrator (e.g., Kubernetes). When the orchestrator signals shutdown, the container drains in-flight requests before exiting cleanly, avoiding dropped connections.

Why this priority: Without a correctly functioning container, no production deployment is possible. This is the baseline that all other stories depend on.

Independent Test: Build the image from source, run the container with required env vars, call the health endpoint, send SIGTERM, and verify the process exits cleanly with code 0. No other stories are required.

Acceptance Scenarios:

  1. Given a built container image and all required env vars, When the container starts, Then it begins serving requests within 30 seconds and the health endpoint returns a success response.
  2. Given a running container, When a SIGTERM is received, Then the process finishes any in-flight requests and exits with code 0 within 30 seconds.
  3. Given a running container, When a required env var is absent, Then the process exits immediately with a non-zero code and logs a clear error message identifying the missing variable.
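Scenario 3 could be met with a startup check along the following lines. This is a minimal sketch, not the API's actual startup code: the variable names in `REQUIRED_ENV_VARS` are hypothetical (the spec does not list the real ones), and `missing_env_vars` and `check_env_or_exit` are illustrative helpers.

```python
import os
import sys

# Hypothetical names; the spec does not enumerate the required variables.
REQUIRED_ENV_VARS = ("DATABASE_URL", "JWT_SECRET")


def missing_env_vars(environ, required=REQUIRED_ENV_VARS):
    """Return the required names that are unset or empty, in declaration order."""
    return [name for name in required if not environ.get(name)]


def check_env_or_exit(environ=None):
    """Exit with code 1 and an error naming each missing variable."""
    environ = os.environ if environ is None else environ
    missing = missing_env_vars(environ)
    if missing:
        print(f"missing required env vars: {', '.join(missing)}", file=sys.stderr)
        sys.exit(1)
```

Calling `check_env_or_exit()` before the server binds its port would make the process fail fast, so an orchestrator sees a non-zero exit rather than a container that starts and then errors on the first request.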

User Story 2 — Minimal, Secure Container (Priority: P2)

A security-conscious operator audits the container image before promotion to production. They verify the API process does not run as root, the image contains no development tooling or test artefacts, and no credentials are baked into the image layers.

Why this priority: Running as root or including unnecessary tools increases the blast radius of any container breakout. This is a production-readiness requirement, not optional hardening.

Independent Test: Inspect the built image to confirm the runtime user is non-root, confirm no dev/test files are present in the image layers, and scan the image with a standard vulnerability scanner. The test passes independently of any deployment environment.

Acceptance Scenarios:

  1. Given a built container image, When the running process user is inspected, Then the API process runs as a non-root user with a numeric UID.
  2. Given a built container image, When the image layers are inspected, Then no development dependencies, test files, or local configuration are present.
  3. Given a built container image, When the image layers are scanned for hardcoded secrets, Then no credentials, API keys, or secret values are found embedded in any layer.
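As an illustration of scenario 3, a static scan can be approximated by matching secret-shaped strings in text taken from the image, for example the output of `docker history --no-trunc`. The patterns and the `find_secret_like` helper below are assumptions made for this sketch; a dedicated scanner covers far more cases and produces fewer false positives.

```python
import re

# Illustrative patterns only, not an exhaustive or authoritative list.
SECRET_PATTERNS = (
    re.compile(r"AKIA[0-9A-Z]{16}"),  # shape of an AWS access key ID
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),  # PEM private key header
    re.compile(r"(?i)(password|secret|token|api_key)\w*\s*=\s*\S+"),  # NAME=value
)


def find_secret_like(lines):
    """Return (line_number, line) pairs where any pattern matches."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(pattern.search(line) for pattern in SECRET_PATTERNS):
            hits.append((number, line))
    return hits
```

An empty result is necessary but not sufficient: it shows only that none of these particular shapes appear in the text that was scanned.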

User Story 3 — Fast, Reproducible Builds (Priority: P3)

A developer rebuilds the container image after a code change. The build completes quickly because unchanged layers (dependencies) are cached. Given identical source inputs, the resulting image is functionally equivalent across builds, enabling confident CI/CD promotion.

Why this priority: Slow or non-deterministic builds reduce developer confidence and slow deployment pipelines. Build speed matters for velocity, but a working, secure container (P1, P2) comes first.

Independent Test: Build the image twice from the same source; confirm the second build reuses dependency layers from cache and completes significantly faster than the first.

Acceptance Scenarios:

  1. Given an image built once, When only application source files change and the image is rebuilt, Then the dependency installation step is served from cache and the rebuild completes faster than a clean build.
  2. Given two builds from the same source commit, When the images are run, Then both produce identical API behaviour.
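The caching behaviour in scenario 1 follows from a simple idea: if the dependency installation step sees only the dependency manifests, its cache key depends only on their content. The sketch below models that idea and is not Docker's actual cache implementation; the manifest names in `DEPENDENCY_FILES` are assumed, and `dependency_layer_key` is a hypothetical helper.

```python
import hashlib

# Assumed manifest names; the spec does not name the dependency files.
DEPENDENCY_FILES = ("pyproject.toml", "uv.lock")


def dependency_layer_key(files):
    """Hash only the dependency manifests, ignoring every other file.

    `files` maps a relative path to that file's bytes.
    """
    digest = hashlib.sha256()
    for name in DEPENDENCY_FILES:
        digest.update(name.encode())
        digest.update(b"\0")
        digest.update(files.get(name, b""))
        digest.update(b"\0")
    return digest.hexdigest()
```

Under this model, editing application source leaves the key unchanged, while any edit to a manifest produces a new key and forces a fresh install.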

Edge Cases

  • What happens when the database is unavailable at container startup?
  • What happens when the container is sent SIGKILL instead of SIGTERM (hard kill by orchestrator)?
  • What happens if the container runs out of memory mid-request?
  • How does the image behave when run with a read-only filesystem (--read-only)?

Requirements (mandatory)

Functional Requirements

  • FR-001: The container image MUST start the API service and begin accepting requests without manual intervention once the required env vars are supplied.
  • FR-002: The container image MUST expose a health check that an orchestrator can poll to determine service readiness.
  • FR-003: The container image MUST handle the SIGTERM signal by completing in-flight requests then exiting cleanly within 30 seconds.
  • FR-004: The container image MUST run the API process as a non-root, non-privileged user.
  • FR-005: The container image MUST NOT contain development dependencies, test files, source control metadata, or local configuration files.
  • FR-006: The container image MUST NOT contain any hardcoded credentials, secrets, or environment-specific values — all configuration MUST be supplied via environment variables at runtime.
  • FR-007: The container image MUST log to standard output and standard error so logs are captured by the container runtime without additional configuration.
  • FR-008: The container image MUST be buildable reproducibly from the same source inputs — a rebuild from the same commit MUST produce a functionally equivalent image.
  • FR-009: Rebuilding the image after a source-only change (no dependency changes) MUST reuse the cached dependency installation layer.
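The drain behaviour required by FR-003 can be sketched as an in-flight counter plus a bounded wait. This is an illustrative model under stated assumptions, not the API's shutdown code: `GracefulShutdown` and its methods are hypothetical names, and a real ASGI or WSGI server typically provides its own equivalent.

```python
import signal
import threading


class GracefulShutdown:
    """Track in-flight requests and wait for them to drain after SIGTERM."""

    def __init__(self, timeout_seconds=30.0):
        self.timeout_seconds = timeout_seconds
        self.stopping = False
        self._in_flight = 0
        self._cond = threading.Condition()

    def install(self):
        """Register the SIGTERM handler; must be called from the main thread."""
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        # Only flip a flag here; the main loop is expected to call drain().
        self.stopping = True

    def request_started(self):
        """Return False once shutdown has begun so the caller refuses new work."""
        with self._cond:
            if self.stopping:
                return False
            self._in_flight += 1
            return True

    def request_finished(self):
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def drain(self):
        """Wait for in-flight requests to finish; return the process exit code."""
        with self._cond:
            drained = self._cond.wait_for(
                lambda: self._in_flight == 0, self.timeout_seconds
            )
        return 0 if drained else 1
```

The exit code is 0 only when every request finished inside the timeout, which mirrors the 30-second bound in FR-003. Returning 1 on timeout is a choice made for this sketch; the spec does not define that case.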

Success Criteria (mandatory)

Measurable Outcomes

  • SC-001: The container starts and serves its first successful health-check response within 30 seconds of launch with all required env vars present.
  • SC-002: The container exits cleanly (code 0) within 30 seconds of receiving a SIGTERM, with no in-flight requests dropped.
  • SC-003: The API process inside the container runs as a non-root user (inspectable via container runtime tooling).
  • SC-004: A rebuild after a source-only change completes in under 60 seconds on a warm cache (dependency layer reused).
  • SC-005: The image contains zero hardcoded secrets (verifiable by static layer inspection).
  • SC-006: All API logs appear on stdout/stderr and are captured by the container runtime log driver without additional sidecar or configuration.
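For SC-006, one way to keep logs on the container's standard streams is to attach a single stream handler to the root logger and nothing else. The sketch below uses only the standard `logging` module; `configure_logging` is a hypothetical helper, the format string is an arbitrary choice, and sending everything to stdout (rather than splitting by level across stdout and stderr) is a simplification.

```python
import logging
import sys


def configure_logging(level=logging.INFO):
    """Route all log records to stdout; no file handlers, no rotation."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
    )
    root = logging.getLogger()
    root.handlers[:] = [handler]  # drop any previously attached handlers
    root.setLevel(level)
    return handler
```

Because nothing is written to the filesystem, this approach also avoids depending on a writable log directory inside the container.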

Assumptions

  • The existing test Dockerfile (used by the integration test stack) is not suitable for production and will remain separate; this feature produces a distinct production image.
  • All required runtime configuration (database URL, S3 credentials, JWT secret, etc.) will be injected as environment variables by the deployment platform — the image itself carries no environment-specific values.
  • The deployment target supports OCI-compatible container images (Kubernetes, Docker, etc.).
  • No persistent local storage is needed by the API container; all state lives in the database and object storage.
  • The production image does not need to run database migrations; migrations are applied by a separate step in the deployment pipeline.
  • A single-architecture image (linux/amd64) is sufficient for the initial production target.