# Database schema reference
## Overview
All tables use UUID primary keys (generated client-side via `uuid4()`), `timestamptz` for datetime columns, and follow a consistent naming convention: `snake_case` table names, singular for join/config tables, plural for entity tables.
The schema is managed by Alembic. Never modify tables directly — always create a migration.
## Proxy domain tables
### proxy_sources
Configurable scrape targets. Each record defines a URL to fetch, a parser to use, and a schedule.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | `uuid` | PK, default uuid4 | |
| `url` | `varchar(2048)` | UNIQUE, NOT NULL | The URL to scrape |
| `parser_name` | `varchar(64)` | NOT NULL | Maps to a registered `SourceParser.name` |
| `cron_schedule` | `varchar(64)` | nullable | Cron expression for scrape frequency. Falls back to the parser's `default_schedule()` if NULL |
| `default_protocol` | `enum(proxy_protocol)` | NOT NULL, default `http` | Protocol to assign when the parser can't determine it from the source |
| `is_active` | `boolean` | NOT NULL, default `true` | Inactive sources are skipped by the scrape task |
| `last_scraped_at` | `timestamptz` | nullable | Timestamp of the last successful scrape |
| `created_at` | `timestamptz` | NOT NULL, server default `now()` | |
**Rationale**: Storing the parser name rather than auto-detecting every time allows explicit control. A source might look like a plain text file but actually need a custom parser.
### proxies
The core proxy table. Each record represents a unique `(ip, port, protocol)` combination.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | `uuid` | PK, default uuid4 | |
| `ip` | `inet` | NOT NULL | IPv4 or IPv6 address |
| `port` | `integer` | NOT NULL | Port number (1–65535) |
| `protocol` | `enum(proxy_protocol)` | NOT NULL | `http`, `https`, `socks4`, `socks5` |
| `source_id` | `uuid` | FK → proxy_sources.id, NOT NULL | Which source discovered this proxy |
| `status` | `enum(proxy_status)` | NOT NULL, default `unchecked` | `unchecked`, `active`, `dead` |
| `anonymity` | `enum(anonymity_level)` | nullable | `transparent`, `anonymous`, `elite` |
| `exit_ip` | `inet` | nullable | The IP address seen by the target when using this proxy |
| `country` | `varchar(2)` | nullable | ISO 3166-1 alpha-2 country code of the exit IP |
| `score` | `float` | NOT NULL, default `0.0` | Composite quality score (0.0–1.0) |
| `avg_latency_ms` | `float` | nullable | Rolling average latency across recent checks |
| `uptime_pct` | `float` | nullable | Percentage of checks that passed (0.0–100.0) |
| `first_seen_at` | `timestamptz` | NOT NULL, server default `now()` | When this proxy was first discovered |
| `last_checked_at` | `timestamptz` | nullable | When the last validation check completed |
| `created_at` | `timestamptz` | NOT NULL, server default `now()` | |
**Indexes**:
| Name | Columns | Type | Purpose |
|------|---------|------|---------|
| `ix_proxies_ip_port_proto` | `(ip, port, protocol)` | UNIQUE | Deduplication on upsert |
| `ix_proxies_status_score` | `(status, score)` | B-tree | Fast filtering for "active proxies sorted by score" |
**Design note**: The same `ip:port` can appear multiple times if it supports different protocols (e.g., a single port that accepts both HTTP and SOCKS5 connections). The composite unique index allows this while still rejecting exact `(ip, port, protocol)` duplicates.
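
The deduplication behaviour of the unique index can be sketched as follows. This is a minimal illustration, not the project's code: it uses Python's built-in `sqlite3` as a stand-in for PostgreSQL so it runs anywhere, stores `ip` and `protocol` as plain text instead of `inet` and `proxy_protocol`, and the helper names `make_db` and `upsert_proxy` are hypothetical.

```python
import sqlite3
import uuid


def make_db() -> sqlite3.Connection:
    """Create an in-memory table with a simplified subset of the proxies columns."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE proxies (
            id TEXT PRIMARY KEY,
            ip TEXT NOT NULL,
            port INTEGER NOT NULL,
            protocol TEXT NOT NULL,
            status TEXT NOT NULL DEFAULT 'unchecked'
        )
        """
    )
    conn.execute(
        "CREATE UNIQUE INDEX ix_proxies_ip_port_proto "
        "ON proxies (ip, port, protocol)"
    )
    return conn


def upsert_proxy(conn: sqlite3.Connection, ip: str, port: int, protocol: str) -> None:
    """Insert a proxy; silently skip it if the (ip, port, protocol) triple exists."""
    conn.execute(
        "INSERT INTO proxies (id, ip, port, protocol) VALUES (?, ?, ?, ?) "
        "ON CONFLICT (ip, port, protocol) DO NOTHING",
        (str(uuid.uuid4()), ip, port, protocol),  # UUID generated client-side
    )


conn = make_db()
upsert_proxy(conn, "203.0.113.7", 1080, "http")
upsert_proxy(conn, "203.0.113.7", 1080, "http")    # exact duplicate, skipped
upsert_proxy(conn, "203.0.113.7", 1080, "socks5")  # same ip:port, new protocol
row_count = conn.execute("SELECT COUNT(*) FROM proxies").fetchone()[0]
```

After the three calls, `row_count` is 2: the duplicate HTTP entry is dropped and the SOCKS5 entry on the same `ip:port` is kept.
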
**Computed columns**: `score`, `avg_latency_ms`, and `uptime_pct` are denormalized from `proxy_checks`. They are recomputed by the validation pipeline after each check run and by a periodic rollup task. This avoids expensive aggregation queries on every proxy list request.
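
As a rough sketch of the rollup, the function below derives `uptime_pct` and `avg_latency_ms` from a proxy's recent checks. The `Check` dataclass and `rollup` function are hypothetical names, and `score` is left out because this document does not specify its formula.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Check:
    """The two proxy_checks columns needed for the rollup."""
    passed: bool
    latency_ms: Optional[float]


def rollup(checks: Sequence[Check]) -> Tuple[Optional[float], Optional[float]]:
    """Return (uptime_pct, avg_latency_ms) for one proxy's recent checks.

    Both values are None when there is nothing to aggregate, matching the
    nullable columns on proxies.
    """
    if not checks:
        return None, None
    uptime_pct = 100.0 * sum(c.passed for c in checks) / len(checks)
    latencies = [c.latency_ms for c in checks if c.latency_ms is not None]
    avg_latency_ms = sum(latencies) / len(latencies) if latencies else None
    return uptime_pct, avg_latency_ms


stats = rollup([
    Check(True, 100.0),
    Check(False, None),  # failed check with no latency recorded
    Check(True, 300.0),
    Check(True, 200.0),
])
```

Checks without a recorded latency are excluded from the average but still count toward uptime; whether the real pipeline does the same is an assumption.
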
### proxy_checks
Append-only log of every validation check attempt. This is the raw data behind the computed fields on `proxies`.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | `uuid` | PK, default uuid4 | |
| `proxy_id` | `uuid` | FK → proxies.id ON DELETE CASCADE, NOT NULL | |
| `checker_name` | `varchar(64)` | NOT NULL | The `ProxyChecker.name` that ran this check |
| `stage` | `integer` | NOT NULL | Pipeline stage number |
| `passed` | `boolean` | NOT NULL | Whether the check succeeded |
| `latency_ms` | `float` | nullable | Time taken for this specific check |
| `detail` | `text` | nullable | Human-readable result description or error message |
| `exit_ip` | `inet` | nullable | Exit IP discovered during this check (if applicable) |
| `created_at` | `timestamptz` | NOT NULL, server default `now()` | |
**Indexes**:
| Name | Columns | Purpose |
|------|---------|---------|
| `ix_checks_proxy_created` | `(proxy_id, created_at)` | Efficient history queries per proxy |
**Retention**: This table grows fast. A periodic cleanup task (`tasks_cleanup.prune_checks`) deletes rows older than a configurable retention period (default: 7 days), keeping only the most recent N checks per proxy.
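
The retention rule can be read more than one way. The sketch below assumes one reading: a row is deleted only if it is older than the retention period and is not among the most recent N checks for its proxy. The function name `select_prunable`, the `keep_latest` parameter, and its default of 10 are hypothetical and not taken from `tasks_cleanup.prune_checks`.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable, List, Tuple


def select_prunable(
    checks: Iterable[Tuple[str, str, datetime]],
    now: datetime,
    retention: timedelta = timedelta(days=7),
    keep_latest: int = 10,
) -> List[str]:
    """Return ids of checks to delete.

    checks yields (check_id, proxy_id, created_at) tuples.
    """
    cutoff = now - retention
    by_proxy: Dict[str, List[Tuple[datetime, str]]] = defaultdict(list)
    for check_id, proxy_id, created_at in checks:
        by_proxy[proxy_id].append((created_at, check_id))

    prunable: List[str] = []
    for rows in by_proxy.values():
        rows.sort(key=lambda r: r[0], reverse=True)  # newest first
        # The newest keep_latest rows are always kept, whatever their age.
        for created_at, check_id in rows[keep_latest:]:
            if created_at < cutoff:
                prunable.append(check_id)
    return prunable


now = datetime(2024, 1, 15, tzinfo=timezone.utc)
to_delete = select_prunable(
    [
        ("c1", "p1", now - timedelta(days=14)),
        ("c2", "p1", now - timedelta(days=10)),
        ("c3", "p1", now - timedelta(days=1)),
        ("c4", "p2", now - timedelta(days=30)),  # only check for p2, so kept
    ],
    now=now,
    keep_latest=2,
)
```

Here only `c1` is selected: `c2` is older than seven days but is still one of the two newest checks for `p1`.
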
### proxy_tags
Flexible key-value labels for proxies. Useful for user-defined categorization (e.g., `datacenter: true`, `provider: aws`, `tested_site: google.com`).
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | `uuid` | PK, default uuid4 | |
| `proxy_id` | `uuid` | FK → proxies.id ON DELETE CASCADE, NOT NULL | |
| `key` | `varchar(64)` | NOT NULL | Tag name |
| `value` | `varchar(256)` | NOT NULL | Tag value |
**Indexes**:
| Name | Columns | Type | Purpose |
|------|---------|------|---------|
| `ix_tags_proxy_key` | `(proxy_id, key)` | UNIQUE | One value per key per proxy |
## Accounts domain tables
### users
User accounts. Minimal by design — the primary purpose is to own API keys and credits.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | `uuid` | PK, default uuid4 | |
| `email` | `varchar(320)` | UNIQUE, NOT NULL | Used for notifications and account recovery |
| `display_name` | `varchar(128)` | nullable | |
| `is_active` | `boolean` | NOT NULL, default `true` | Inactive users cannot authenticate |
| `created_at` | `timestamptz` | NOT NULL, server default `now()` | |
### api_keys
API keys for authentication. The raw key is shown once at creation; only the hash is stored.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | `uuid` | PK, default uuid4 | |
| `user_id` | `uuid` | FK → users.id ON DELETE CASCADE, NOT NULL | |
| `key_hash` | `varchar(128)` | NOT NULL | SHA-256 hash of the raw API key |
| `prefix` | `varchar(8)` | NOT NULL | First 8 characters of the raw key, for quick lookup |
| `label` | `varchar(128)` | nullable | User-assigned label (e.g., "production", "testing") |
| `is_active` | `boolean` | NOT NULL, default `true` | Revoked keys have `is_active = false` |
| `last_used_at` | `timestamptz` | nullable | Updated on each authenticated request |
| `expires_at` | `timestamptz` | nullable | NULL means no expiration |
| `created_at` | `timestamptz` | NOT NULL, server default `now()` | |
**Indexes**:
| Name | Columns | Type | Purpose |
|------|---------|------|---------|
| `ix_api_keys_hash` | `(key_hash)` | UNIQUE | Uniqueness constraint on key hashes |
| `ix_api_keys_prefix` | `(prefix)` | B-tree | Fast prefix-based lookup before full hash comparison |
**Auth flow**: On each request, the middleware extracts the API key from the `Authorization: Bearer <key>` header, computes `prefix = key[:8]`, queries `api_keys WHERE prefix = ? AND is_active = true AND (expires_at IS NULL OR expires_at > now())`, then verifies `sha256(key) == key_hash`. This two-step approach avoids computing a hash against every key in the database.
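
The two-step lookup above can be sketched in plain Python. This is an illustration under stated assumptions: `ApiKeyRow`, `hash_key`, and `authenticate` are hypothetical names, the rows are an in-memory list standing in for the `api_keys` query, and `key_hash` is assumed to hold the hex-encoded digest. The constant-time comparison via `hmac.compare_digest` is a common hardening choice, not something this document specifies.

```python
import hashlib
import hmac
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, Optional


@dataclass(frozen=True)
class ApiKeyRow:
    user_id: str
    key_hash: str  # assumed: hex-encoded SHA-256 of the raw key
    prefix: str
    is_active: bool = True
    expires_at: Optional[datetime] = None


def hash_key(raw_key: str) -> str:
    return hashlib.sha256(raw_key.encode("utf-8")).hexdigest()


def authenticate(
    header: str, rows: Iterable[ApiKeyRow], now: datetime
) -> Optional[ApiKeyRow]:
    """Return the matching key row, or None if authentication fails."""
    scheme, _, raw_key = header.partition(" ")
    if scheme != "Bearer" or not raw_key:
        return None
    prefix = raw_key[:8]
    candidate_hash = hash_key(raw_key)
    for row in rows:
        # Step 1: the cheap filter, equivalent to the WHERE clause.
        if row.prefix != prefix or not row.is_active:
            continue
        if row.expires_at is not None and row.expires_at <= now:
            continue
        # Step 2: full hash comparison on the few remaining candidates.
        if hmac.compare_digest(candidate_hash, row.key_hash):
            return row
    return None


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
raw = "pp_live_abcdef123456"  # made-up key for the example
stored = ApiKeyRow(user_id="u1", key_hash=hash_key(raw), prefix=raw[:8])
matched = authenticate(f"Bearer {raw}", [stored], now)
```

A key that shares the prefix but differs afterwards passes step 1 and is rejected at step 2, which is why the prefix index does not need to be unique.
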
### credit_ledger
Append-only ledger of all credit transactions. Current balance is `SELECT SUM(amount) FROM credit_ledger WHERE user_id = ?`.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | `uuid` | PK, default uuid4 | |
| `user_id` | `uuid` | FK → users.id ON DELETE CASCADE, NOT NULL | |
| `amount` | `integer` | NOT NULL | Positive = credit in, negative = debit |
| `tx_type` | `enum(credit_tx_type)` | NOT NULL | `purchase`, `acquire`, `refund`, `admin_adjust` |
| `description` | `text` | nullable | Human-readable note |
| `reference_id` | `uuid` | nullable | Links to the related entity (e.g., lease ID for `acquire` transactions) |
| `created_at` | `timestamptz` | NOT NULL, server default `now()` | |
**Indexes**:
| Name | Columns | Purpose |
|------|---------|---------|
| `ix_ledger_user_created` | `(user_id, created_at)` | Balance computation and history queries |
**Caching**: The computed balance is cached in Redis under `credits:{user_id}`. The cache is invalidated (DEL) whenever a new ledger entry is created. Cache miss triggers a `SUM(amount)` query.
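
A minimal sketch of this cache-aside pattern follows. A plain dict stands in for Redis and `sqlite3` stands in for PostgreSQL so the example is self-contained; `BalanceCache`, `make_ledger_db`, and the simplified table are hypothetical. The query wraps `SUM` in `COALESCE` because `SUM` over zero rows returns NULL, which matters for users with no ledger entries.

```python
import sqlite3
from typing import Dict


def make_ledger_db() -> sqlite3.Connection:
    """In-memory ledger with a simplified subset of the credit_ledger columns."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE credit_ledger ("
        " id INTEGER PRIMARY KEY,"
        " user_id TEXT NOT NULL,"
        " amount INTEGER NOT NULL,"
        " tx_type TEXT NOT NULL)"
    )
    return conn


class BalanceCache:
    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        self.cache: Dict[str, int] = {}  # stand-in for Redis
        self.db_reads = 0                # counts cache misses, for illustration

    def balance(self, user_id: str) -> int:
        key = f"credits:{user_id}"
        if key in self.cache:
            return self.cache[key]
        # Cache miss: recompute from the ledger and store the result.
        self.db_reads += 1
        (total,) = self.conn.execute(
            "SELECT COALESCE(SUM(amount), 0) FROM credit_ledger WHERE user_id = ?",
            (user_id,),
        ).fetchone()
        self.cache[key] = total
        return total

    def add_entry(self, user_id: str, amount: int, tx_type: str) -> None:
        self.conn.execute(
            "INSERT INTO credit_ledger (user_id, amount, tx_type) VALUES (?, ?, ?)",
            (user_id, amount, tx_type),
        )
        # Invalidate rather than update, mirroring the DEL described above.
        self.cache.pop(f"credits:{user_id}", None)


ledger = BalanceCache(make_ledger_db())
empty_balance = ledger.balance("u1")
ledger.add_entry("u1", 100, "purchase")
ledger.add_entry("u1", -30, "acquire")
current_balance = ledger.balance("u1")  # miss: recomputed from the ledger
cached_balance = ledger.balance("u1")   # hit: served from the cache
```
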
**Concurrency**: Because balance is derived from a SUM, concurrent inserts don't cause race conditions on the balance itself. The acquire endpoint uses `SELECT ... FOR UPDATE` on the user row to serialize credit checks, preventing double-spending under high concurrency.
### proxy_leases
Tracks which proxies are currently checked out by which users. Both Redis (for fast lookup) and PostgreSQL (for audit trail) maintain lease state.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | `uuid` | PK, default uuid4 | |
| `user_id` | `uuid` | FK → users.id, NOT NULL | |
| `proxy_id` | `uuid` | FK → proxies.id, NOT NULL | |
| `acquired_at` | `timestamptz` | NOT NULL, server default `now()` | |
| `expires_at` | `timestamptz` | NOT NULL | When the lease automatically releases |
| `is_released` | `boolean` | NOT NULL, default `false` | Set to true on explicit release or expiration cleanup |
**Indexes**:
| Name | Columns | Purpose |
|------|---------|---------|
| `ix_leases_user` | `(user_id)` | List a user's active leases |
| `ix_leases_proxy_active` | `(proxy_id, is_released)` | Check if a proxy is currently leased |
**Dual state**: Redis holds the lease as `lease:{proxy_id}` with a TTL matching `expires_at`. The proxy selection query excludes proxies with an active Redis lease key. The PostgreSQL record exists for audit, billing reconciliation, and cleanup if Redis state is lost.
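
The Redis side of a lease can be sketched with an in-memory stand-in. `InMemoryLeases`, `lease_ttl_seconds`, and `available` are hypothetical names; in Redis itself, one way to get the same acquire-if-absent behaviour atomically is `SET lease:{proxy_id} <value> NX EX <ttl>`, though this document does not say which command the project uses.

```python
import math
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable, List


def lease_key(proxy_id: str) -> str:
    return f"lease:{proxy_id}"


def lease_ttl_seconds(expires_at: datetime, now: datetime) -> int:
    """Whole-second TTL for the key, rounded up so it never expires early."""
    return max(0, math.ceil((expires_at - now).total_seconds()))


class InMemoryLeases:
    """Dict-backed stand-in for Redis keys with expiry times."""

    def __init__(self) -> None:
        self._expiry: Dict[str, datetime] = {}

    def is_leased(self, proxy_id: str, now: datetime) -> bool:
        expires_at = self._expiry.get(lease_key(proxy_id))
        return expires_at is not None and expires_at > now

    def acquire(self, proxy_id: str, expires_at: datetime, now: datetime) -> bool:
        """Take the lease unless an unexpired one already exists."""
        if self.is_leased(proxy_id, now):
            return False
        self._expiry[lease_key(proxy_id)] = expires_at
        return True


def available(
    proxy_ids: Iterable[str], leases: InMemoryLeases, now: datetime
) -> List[str]:
    """Filter out proxies that currently have an active lease key."""
    return [p for p in proxy_ids if not leases.is_leased(p, now)]


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
leases = InMemoryLeases()
first = leases.acquire("p1", now + timedelta(minutes=5), now)
second = leases.acquire("p1", now + timedelta(minutes=5), now)
free_now = available(["p1", "p2"], leases, now)
free_later = available(["p1", "p2"], leases, now + timedelta(minutes=6))
```

The second acquire fails while the lease is live, and `p1` becomes selectable again once `expires_at` has passed, without any explicit release.
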
## Enum types
All enums are PostgreSQL native enums created via `CREATE TYPE`:
| Enum name | Values |
|-----------|--------|
| `proxy_protocol` | `http`, `https`, `socks4`, `socks5` |
| `proxy_status` | `unchecked`, `active`, `dead` |
| `anonymity_level` | `transparent`, `anonymous`, `elite` |
| `credit_tx_type` | `purchase`, `acquire`, `refund`, `admin_adjust` |
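
Application code would typically mirror these types as Python enums so that invalid values fail before they reach the database. The class names below are assumptions for illustration; only the values come from the table above.

```python
from enum import Enum


class ProxyProtocol(str, Enum):
    HTTP = "http"
    HTTPS = "https"
    SOCKS4 = "socks4"
    SOCKS5 = "socks5"


class ProxyStatus(str, Enum):
    UNCHECKED = "unchecked"
    ACTIVE = "active"
    DEAD = "dead"


class AnonymityLevel(str, Enum):
    TRANSPARENT = "transparent"
    ANONYMOUS = "anonymous"
    ELITE = "elite"


class CreditTxType(str, Enum):
    PURCHASE = "purchase"
    ACQUIRE = "acquire"
    REFUND = "refund"
    ADMIN_ADJUST = "admin_adjust"


# Looking up by value returns the member; an unknown value raises ValueError.
status = ProxyStatus("dead")
```
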
## Migration conventions
- One migration per logical change. Don't bundle unrelated schema changes.
- Migration filenames: `NNN_descriptive_name.py` (e.g., `001_initial_schema.py`).
- Always include both `upgrade()` and `downgrade()` functions.
- Test migrations against a fresh database AND against a database with existing data.
- Use `alembic revision --autogenerate -m "description"` for model-driven changes, but always review the generated SQL before applying.
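
The filename convention can be checked or generated with a small helper. This is a hypothetical utility, not part of Alembic or of this project: it assumes `NNN` is a zero-padded sequence number and that descriptive names use lowercase letters, digits, and underscores.

```python
import re
from typing import Iterable

# Matches names such as 001_initial_schema.py.
MIGRATION_NAME = re.compile(r"^(\d{3})_([a-z0-9]+(?:_[a-z0-9]+)*)\.py$")


def next_migration_filename(existing: Iterable[str], description: str) -> str:
    """Build the next NNN_descriptive_name.py filename.

    Files that do not match the convention (e.g. __init__.py) are ignored.
    """
    numbers = []
    for name in existing:
        match = MIGRATION_NAME.match(name)
        if match:
            numbers.append(int(match.group(1)))
    slug = re.sub(r"[^a-z0-9]+", "_", description.lower()).strip("_")
    if not slug:
        raise ValueError("description must contain letters or digits")
    return f"{max(numbers, default=0) + 1:03d}_{slug}.py"


filename = next_migration_filename(
    ["001_initial_schema.py", "002_add_proxy_tags.py", "__init__.py"],
    "Add lease index",
)
```
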