Two-stage build (uv builder + python:3.12-slim runtime) with non-root user (UID 1001), no dev deps, layer-cache-optimised dep install, and graceful SIGTERM shutdown. Verified by api/tests/build/verify_production_image.sh covering build, health endpoint, non-root, stdout logging, secret-free layers, missing-env-var exit, and dep-layer cache hit. All 102 integration tests still pass; shellcheck clean. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# Research: Production API Container Image

## Decision 1: Use a Separate `Dockerfile.prod`

**Decision**: Add `api/Dockerfile.prod` alongside the existing `api/Dockerfile`.

**Rationale**: The existing `api/Dockerfile` installs dev dependencies (`.[dev]`), runs against mounted source with `--reload`, and is used by the Docker Compose integration test stack. Modifying it would break `make test-integration`. A separate file keeps the two images independent, with no coupling between them.

**Alternatives considered**:
- Build-arg flag in a single Dockerfile: adds conditional complexity and makes both build paths harder to read.
- Rename existing to `Dockerfile.dev` and make `Dockerfile` the production image: would require updating `docker-compose.test.yml` with an explicit file reference, a wider change than this feature needs.
---

## Decision 2: Multi-Stage Build (uv Builder + python:3.12-slim Runtime)

**Decision**: Two-stage build. Stage 1 (`builder`) uses `ghcr.io/astral-sh/uv:python3.12-bookworm-slim` to install production dependencies into a virtual environment. Stage 2 (`runtime`) uses `python:3.12-slim` and copies only the `.venv` and application source from the builder. uv is not present in the final image.

**Rationale**:
- uv's official Docker image offers a fast, direct way to produce a pinned, bytecode-compiled venv from `uv.lock`.
- Keeping uv out of the runtime image reduces attack surface and image size.
- `python:3.12-slim` is a well-maintained, widely scanned base; using it for the runtime stage aligns with existing project images.

**Layer caching strategy**:
```
COPY pyproject.toml uv.lock ./
# Dependency layer: reused from cache when only source files change
RUN uv sync --frozen --no-dev --no-install-project
# Source layer: rebuilt only when files under app/ change
COPY app/ ./app/
```

`--no-install-project` installs the locked dependencies but not the project package itself. The project source is copied in a later layer, so a source-only change reuses the dependency layer from cache.

**Environment variables for optimal builds**:
- `UV_COMPILE_BYTECODE=1`: pre-compiles `.pyc` files; slightly larger venv but faster cold starts.
- `UV_LINK_MODE=copy`: copies package files instead of hard-linking them from uv's cache, avoiding hard-link issues (for example when the cache is on a different filesystem) before the venv is copied between stages.
- `UV_PYTHON_DOWNLOADS=never`: ensures the builder stage uses the Python bundled in the image, not a downloaded one.

**Alternatives considered**:
- Installing deps into the system Python (`--system`): rejected because it pollutes the base image and makes it harder to copy deps cleanly into the runtime stage.
- Using a single `FROM python:3.12-slim` with pip: slower builds and no pinning from `uv.lock`.
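The layer caching strategy in this decision can be checked mechanically. The following is a minimal sketch, not the project's actual script: it assumes BuildKit's `--progress=plain` log format, in which a reused step is reported as a `#<n> CACHED` line, and the helper name `dep_layer_cached` is hypothetical. One way to use it would be to build once, touch a file under `app/`, and then pipe `docker build --progress=plain -f api/Dockerfile.prod api/ 2>&1` into the helper (BuildKit writes progress to stderr, hence the `2>&1`).

```shell
#!/bin/sh
# Sketch only: dep_layer_cached is a hypothetical helper, and the log format
# is assumed to be BuildKit's --progress=plain output.

# Reads a plain-progress build log on stdin and succeeds only when the step
# that runs `uv sync` was served from cache.
dep_layer_cached() {
  awk '
    # Remember the step id of the dependency install, e.g. "#8".
    /RUN uv sync/ { id = $1 }
    # The same step id followed by CACHED means the layer was reused.
    id != "" && $1 == id && $2 == "CACHED" { hit = 1 }
    END { exit (hit ? 0 : 1) }
  '
}
```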
---

## Decision 3: Non-Root User (UID 1001, System User)

**Decision**: Create a system user `appuser` and group `appgroup`, both with ID 1001, in the runtime stage. Application files are assigned to that user at `COPY` time with `--chown=appuser:appgroup`.

**Rationale**: Running as root inside a container increases the damage a container breakout can do. A numeric UID (rather than a named user that might not exist on the host) is required by some Kubernetes pod security admission policies; for example, the kubelet cannot verify `runAsNonRoot` against a non-numeric image user. UID 1001 avoids collision with UID 1000 (the typical first interactive user on a Linux host) while remaining a predictable, inspectable value.

**Alternatives considered**:
- UID 1000: small risk of collision with a host user when bind mounts are involved.
- `USER nobody`: `nobody` (UID 65534) works, but its name and UID are not consistent across distros.
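The non-root requirement in this decision lends itself to a small automated check. The sketch below is illustrative only: the helper names `is_non_root_uid` and `image_uid` and the image tag `api-prod:test` are hypothetical, and it assumes the `id` utility is present in the runtime image (it normally is on Debian-based images). The entrypoint is overridden so the check reports the image's default user without starting the application.

```shell
#!/bin/sh
# Sketch only: helper names and the image tag are hypothetical.

# Succeeds when the given value is a numeric UID other than 0.
is_non_root_uid() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # empty or non-numeric: cannot be verified
  esac
  [ "$1" -ne 0 ]
}

# Prints the UID the image runs as by default, bypassing the app entrypoint.
# Requires Docker and a built image.
image_uid() {
  docker run --rm --entrypoint id "$1" -u
}

# Example (not executed here):
#   is_non_root_uid "$(image_uid api-prod:test)" && echo "non-root OK"
```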
---

## Decision 4: SIGTERM Graceful Shutdown via uvicorn `--timeout-graceful-shutdown`

**Decision**: Use `uvicorn`'s built-in `--timeout-graceful-shutdown 30` flag. No process supervisor (tini, s6) is required.

**Rationale**: uvicorn handles SIGTERM natively when run as PID 1 in single-worker mode (the production Dockerfile runs one worker). This relies on an exec-form `CMD`, so that uvicorn rather than a wrapping shell receives the signal. On SIGTERM it stops accepting new connections, waits up to `--timeout-graceful-shutdown` seconds for in-flight requests to complete, then exits with code 0. No additional init system is needed.

**Alternatives considered**:
- tini: adds a small init shim that reaps zombies and forwards signals. Not necessary with a single uvicorn worker (no child processes to reap).
- Gunicorn + uvicorn workers: more complex; appropriate for multi-worker setups, but the deployment platform (Kubernetes) scales horizontally via pod replicas rather than in-process workers.
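The shutdown behaviour described in this decision can be asserted from the host. The following is a rough sketch under stated assumptions: the helper names are hypothetical, the container must have been started without `--rm` so that its exit code can still be inspected, and the 30-second budget simply mirrors the `--timeout-graceful-shutdown` value. Note that `docker stop` waits only 10 seconds by default before sending SIGKILL, so the sketch passes an explicit timeout.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical; the 30-second budget mirrors
# the --timeout-graceful-shutdown value in the decision above.

# Succeeds when the container exited with code 0 within the time budget.
# Usage: graceful_exit_ok EXIT_CODE ELAPSED_SECONDS [BUDGET_SECONDS]
graceful_exit_ok() {
  [ "$1" -eq 0 ] && [ "$2" -le "${3:-30}" ]
}

# Stops a running container and checks its exit code and elapsed time.
# `docker stop -t 30` sends SIGTERM, then SIGKILL after 30 s (exit code 137).
# Requires Docker; the container must not have been started with --rm.
check_graceful_shutdown() {
  start=$(date +%s)
  docker stop -t 30 "$1" >/dev/null || return 1
  end=$(date +%s)
  code=$(docker inspect --format '{{.State.ExitCode}}' "$1") || return 1
  graceful_exit_ok "$code" "$((end - start))"
}
```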
---

## Decision 5: `curl` for HEALTHCHECK

**Decision**: Install `curl` (via `apt-get install --no-install-recommends`) in the runtime stage and use it in the `HEALTHCHECK` directive.

**Rationale**: The existing dev Dockerfile already installs `curl` for the same reason. `curl -f` exits non-zero on HTTP errors, making it a reliable single-command health probe. A Python one-liner adds interpreter startup overhead (roughly 100 ms) per check, whereas `curl` takes roughly 5 ms.

**Alternatives considered**:
- `wget -q --spider`: available on Alpine but not on Debian-slim by default; requires separate install.
- Python `urllib.request`: no extra install, but slower, and it adds a short-lived Python process to the process table on every health check.
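The same `curl -f` probe can also serve a host-side script that waits for the container to become healthy. This is a minimal sketch, not the project's implementation: `wait_for_health` and the `PROBE` variable are hypothetical names, and the attempt count and delay are arbitrary defaults. A call might look like `wait_for_health http://127.0.0.1:8000/api/v1/health`, where port 8000 is an assumption that the source does not state.

```shell
#!/bin/sh
# Sketch only: wait_for_health and PROBE are hypothetical names.

# Command used for a single probe. `curl -f` exits non-zero on HTTP >= 400;
# -sS silences progress output but keeps error messages.
PROBE=${PROBE:-"curl -fsS -o /dev/null"}

# Polls URL until a probe succeeds or ATTEMPTS is exhausted.
# Usage: wait_for_health URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_health() {
  url=$1
  attempts=${2:-30}
  delay=${3:-1}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # PROBE is intentionally unquoted so that it splits into command + flags.
    if $PROBE "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```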
---

## Decision 6: TDD Verification via Shell Script

**Decision**: Write `api/tests/build/verify_production_image.sh` before `Dockerfile.prod`. The script builds the image and runs behavioral checks (health endpoint, non-root user, clean SIGTERM exit). It is the "failing test" per §5.1.

**Rationale**: The production image is a build artifact, not Python business logic. pytest cannot test a Docker image without Docker-in-Docker, which the current CI stack does not support. A shell script run on the host (via `make verify-prod`) is the appropriate TDD vehicle for this artifact type.

**Verification steps the script covers**:
1. `docker build -f api/Dockerfile.prod api/` → fails (red) until `Dockerfile.prod` exists.
2. Run the container with the required env vars; wait for the health endpoint → `GET /api/v1/health` returns 200.
3. Inspect the running process user → UID ≠ 0 (non-root).
4. Send SIGTERM to the container; assert exit code 0 within 30s (graceful shutdown).
5. Assert dev packages are absent: `pip show pytest` inside the container must return non-zero. Note that a uv-created venv does not include pip by default, so this command can return non-zero even when pytest is installed in the venv; a failing `python -c "import pytest"` is a more direct check.

**Alternatives considered**:
- pytest with docker SDK: requires the `docker` Python package and DinD in CI; rejected as over-engineered for a single-file build artifact.
- Manual verification only: rejected because §5.1 mandates automated failing tests before production code.
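The verification steps in this decision could be driven by a small check runner that reports every failure in one pass instead of stopping at the first. The sketch below is an assumption about how such a script might be organised, not the contents of `verify_production_image.sh`: `check`, `expect_failure`, `PASS`, `FAIL`, and the image tag `api-prod:test` are all hypothetical. Before `Dockerfile.prod` exists, the build check fails, which gives the red state the TDD flow calls for.

```shell
#!/bin/sh
# Sketch only: check, expect_failure, PASS and FAIL are hypothetical names,
# not the contents of verify_production_image.sh.

PASS=0
FAIL=0

# Runs a command, records the outcome under a label, and keeps going so
# that every failing check is reported in a single run.
check() {
  label=$1
  shift
  if "$@" >/dev/null 2>&1; then
    PASS=$((PASS + 1))
    printf 'ok   %s\n' "$label"
  else
    FAIL=$((FAIL + 1))
    printf 'FAIL %s\n' "$label"
  fi
}

# Succeeds only when the wrapped command fails (for "must be absent" checks).
expect_failure() {
  ! "$@"
}

# Example wiring (not executed here; requires Docker, image tag is assumed):
#   check "image builds" docker build -f api/Dockerfile.prod -t api-prod:test api/
#   check "pytest absent" expect_failure \
#     docker run --rm --entrypoint python api-prod:test -c 'import pytest'
#   [ "$FAIL" -eq 0 ]   # becomes the script's exit status
```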