# Database schema reference

## Overview

All tables use UUID primary keys (generated client-side via `uuid4()`), `timestamptz` for datetime columns, and follow a consistent naming convention: `snake_case` table names, singular for join/config tables, plural for entity tables.

The schema is managed by Alembic. Never modify tables directly; always create a migration.

## Proxy domain tables

### proxy_sources

Configurable scrape targets. Each record defines a URL to fetch, a parser to use, and a schedule.

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | `uuid` | PK, default uuid4 | |
| `url` | `varchar(2048)` | UNIQUE, NOT NULL | The URL to scrape |
| `parser_name` | `varchar(64)` | NOT NULL | Maps to a registered `SourceParser.name` |
| `cron_schedule` | `varchar(64)` | nullable | Cron expression for scrape frequency. Falls back to the parser's `default_schedule()` if NULL |
| `default_protocol` | `enum(proxy_protocol)` | NOT NULL, default `http` | Protocol to assign when the parser can't determine it from the source |
| `is_active` | `boolean` | NOT NULL, default `true` | Inactive sources are skipped by the scrape task |
| `last_scraped_at` | `timestamptz` | nullable | Timestamp of the last successful scrape |
| `created_at` | `timestamptz` | NOT NULL, server default `now()` | |

**Rationale**: Storing the parser name rather than auto-detecting every time allows explicit control. A source might look like a plain text file but actually need a custom parser.

### proxies

The core proxy table. Each record represents a unique `(ip, port, protocol)` combination.

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | `uuid` | PK, default uuid4 | |
| `ip` | `inet` | NOT NULL | IPv4 or IPv6 address |
| `port` | `integer` | NOT NULL | Port number (1–65535) |
| `protocol` | `enum(proxy_protocol)` | NOT NULL | `http`, `https`, `socks4`, `socks5` |
| `source_id` | `uuid` | FK → proxy_sources.id, NOT NULL | Which source discovered this proxy |
| `status` | `enum(proxy_status)` | NOT NULL, default `unchecked` | `unchecked`, `active`, `dead` |
| `anonymity` | `enum(anonymity_level)` | nullable | `transparent`, `anonymous`, `elite` |
| `exit_ip` | `inet` | nullable | The IP address seen by the target when using this proxy |
| `country` | `varchar(2)` | nullable | ISO 3166-1 alpha-2 country code of the exit IP |
| `score` | `float` | NOT NULL, default `0.0` | Composite quality score (0.0–1.0) |
| `avg_latency_ms` | `float` | nullable | Rolling average latency across recent checks |
| `uptime_pct` | `float` | nullable | Percentage of checks that passed (0.0–100.0) |
| `first_seen_at` | `timestamptz` | NOT NULL, server default `now()` | When this proxy was first discovered |
| `last_checked_at` | `timestamptz` | nullable | When the last validation check completed |
| `created_at` | `timestamptz` | NOT NULL, server default `now()` | |

**Indexes**:

| Name | Columns | Type | Purpose |
|------|---------|------|---------|
| `ix_proxies_ip_port_proto` | `(ip, port, protocol)` | UNIQUE | Deduplication on upsert |
| `ix_proxies_status_score` | `(status, score)` | B-tree | Fast filtering for "active proxies sorted by score" |

**Design note**: The same `ip:port` can appear more than once if the endpoint serves several protocols (e.g., both HTTP and SOCKS5 on port 1080). The composite unique index allows one row per protocol while still rejecting exact duplicates.

**Computed columns**: `score`, `avg_latency_ms`, and `uptime_pct` are denormalized from `proxy_checks`. They are recomputed by the validation pipeline after each check run and by a periodic rollup task. This avoids expensive aggregation queries on every proxy list request.

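As a rough illustration of how the composite unique index supports deduplication on upsert, the sketch below uses an in-memory SQLite database as a stand-in for PostgreSQL (both accept `INSERT ... ON CONFLICT`). The `upsert_proxy` helper and the reduced column set are assumptions made for the example, not the project's actual code.

```python
import sqlite3
import uuid

# SQLite stands in for PostgreSQL here; the table is trimmed to the
# columns that matter for deduplication.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE proxies (
        id TEXT PRIMARY KEY,
        ip TEXT NOT NULL,
        port INTEGER NOT NULL,
        protocol TEXT NOT NULL,
        status TEXT NOT NULL DEFAULT 'unchecked'
    )
    """
)
conn.execute(
    "CREATE UNIQUE INDEX ix_proxies_ip_port_proto "
    "ON proxies (ip, port, protocol)"
)


def upsert_proxy(ip: str, port: int, protocol: str) -> None:
    """Hypothetical helper: insert a proxy unless the triple already exists."""
    conn.execute(
        """
        INSERT INTO proxies (id, ip, port, protocol)
        VALUES (?, ?, ?, ?)
        ON CONFLICT (ip, port, protocol) DO NOTHING
        """,
        (str(uuid.uuid4()), ip, port, protocol),
    )


upsert_proxy("203.0.113.7", 1080, "http")
upsert_proxy("203.0.113.7", 1080, "socks5")  # same ip:port, new protocol
upsert_proxy("203.0.113.7", 1080, "http")    # exact duplicate, ignored
proxy_count = conn.execute("SELECT COUNT(*) FROM proxies").fetchone()[0]
```

Here the same `ip:port` ends up with one row per protocol, and the repeated insert is dropped by the unique index rather than by application logic.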
### proxy_checks

Append-only log of every validation check attempt. This is the raw data behind the computed fields on `proxies`.

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | `uuid` | PK, default uuid4 | |
| `proxy_id` | `uuid` | FK → proxies.id ON DELETE CASCADE, NOT NULL | |
| `checker_name` | `varchar(64)` | NOT NULL | The `ProxyChecker.name` that ran this check |
| `stage` | `integer` | NOT NULL | Pipeline stage number |
| `passed` | `boolean` | NOT NULL | Whether the check succeeded |
| `latency_ms` | `float` | nullable | Time taken for this specific check |
| `detail` | `text` | nullable | Human-readable result description or error message |
| `exit_ip` | `inet` | nullable | Exit IP discovered during this check (if applicable) |
| `created_at` | `timestamptz` | NOT NULL, server default `now()` | |

**Indexes**:

| Name | Columns | Purpose |
|------|---------|---------|
| `ix_checks_proxy_created` | `(proxy_id, created_at)` | Efficient history queries per proxy |

**Retention**: This table grows fast. A periodic cleanup task (`tasks_cleanup.prune_checks`) deletes rows older than a configurable retention period (default: 7 days), keeping only the most recent N checks per proxy.

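To make the relationship between this log and the denormalized columns on `proxies` concrete, here is a minimal sketch of how `uptime_pct` and `avg_latency_ms` could be derived from a proxy's recent checks. The `Check` dataclass and `rollup` function are hypothetical names, and the `score` formula is not described in this document, so it is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Check:
    """Illustrative subset of a proxy_checks row."""
    passed: bool
    latency_ms: Optional[float]


def rollup(checks: list[Check]) -> tuple[Optional[float], Optional[float]]:
    """Return (uptime_pct, avg_latency_ms) for one proxy's recent checks."""
    if not checks:
        # No data yet: both columns are nullable, so they stay NULL.
        return None, None
    uptime_pct = 100.0 * sum(c.passed for c in checks) / len(checks)
    # Checks without a recorded latency are excluded from the average.
    latencies = [c.latency_ms for c in checks if c.latency_ms is not None]
    avg_latency_ms = sum(latencies) / len(latencies) if latencies else None
    return uptime_pct, avg_latency_ms


recent = [
    Check(True, 120.0),
    Check(True, 80.0),
    Check(False, None),
    Check(True, 100.0),
]
uptime, latency = rollup(recent)  # 75.0 and 100.0
```

In a real deployment the same aggregation would more likely run as a single SQL query against `ix_checks_proxy_created`; the Python version only shows the arithmetic.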
### proxy_tags

Flexible key-value labels for proxies. Useful for user-defined categorization (e.g., `datacenter: true`, `provider: aws`, `tested_site: google.com`).

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | `uuid` | PK, default uuid4 | |
| `proxy_id` | `uuid` | FK → proxies.id ON DELETE CASCADE, NOT NULL | |
| `key` | `varchar(64)` | NOT NULL | Tag name |
| `value` | `varchar(256)` | NOT NULL | Tag value |

**Indexes**:

| Name | Columns | Type | Purpose |
|------|---------|------|---------|
| `ix_tags_proxy_key` | `(proxy_id, key)` | UNIQUE | One value per key per proxy |

## Accounts domain tables

### users

User accounts. Minimal by design: the primary purpose is to own API keys and credits.

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | `uuid` | PK, default uuid4 | |
| `email` | `varchar(320)` | UNIQUE, NOT NULL | Used for notifications and account recovery |
| `display_name` | `varchar(128)` | nullable | |
| `is_active` | `boolean` | NOT NULL, default `true` | Inactive users cannot authenticate |
| `created_at` | `timestamptz` | NOT NULL, server default `now()` | |

### api_keys

API keys for authentication. The raw key is shown once at creation; only the hash is stored.

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | `uuid` | PK, default uuid4 | |
| `user_id` | `uuid` | FK → users.id ON DELETE CASCADE, NOT NULL | |
| `key_hash` | `varchar(128)` | NOT NULL | SHA-256 hash of the raw API key |
| `prefix` | `varchar(8)` | NOT NULL | First 8 characters of the raw key, for quick lookup |
| `label` | `varchar(128)` | nullable | User-assigned label (e.g., "production", "testing") |
| `is_active` | `boolean` | NOT NULL, default `true` | Revoked keys have `is_active = false` |
| `last_used_at` | `timestamptz` | nullable | Updated on each authenticated request |
| `expires_at` | `timestamptz` | nullable | NULL means no expiration |
| `created_at` | `timestamptz` | NOT NULL, server default `now()` | |

**Indexes**:

| Name | Columns | Type | Purpose |
|------|---------|------|---------|
| `ix_api_keys_hash` | `(key_hash)` | UNIQUE | Uniqueness constraint on key hashes |
| `ix_api_keys_prefix` | `(prefix)` | B-tree | Fast prefix-based lookup before full hash comparison |

**Auth flow**: On each request, the middleware extracts the API key from the `Authorization: Bearer <key>` header, computes `prefix = key[:8]`, queries `api_keys WHERE prefix = ? AND is_active = true AND (expires_at IS NULL OR expires_at > now())`, then verifies `sha256(key) == key_hash`. This two-step approach avoids computing a hash against every key in the database.

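The two-step lookup described above can be sketched roughly as follows. An in-memory list stands in for the `api_keys` table, and `ApiKeyRow`, `hash_key`, and `authenticate` are hypothetical names. Storing the hash as a hex digest and comparing it with `hmac.compare_digest` are assumptions of this sketch, not details confirmed by the schema.

```python
import hashlib
import hmac
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class ApiKeyRow:
    """Illustrative subset of an api_keys row."""
    prefix: str
    key_hash: str  # assumed to be the hex SHA-256 digest of the raw key
    is_active: bool = True
    expires_at: Optional[datetime] = None


def hash_key(raw_key: str) -> str:
    return hashlib.sha256(raw_key.encode("utf-8")).hexdigest()


def authenticate(
    raw_key: str, rows: list[ApiKeyRow], now: Optional[datetime] = None
) -> Optional[ApiKeyRow]:
    now = now or datetime.now(timezone.utc)
    prefix = raw_key[:8]
    # Step 1: narrow by prefix, active flag, and expiry (the WHERE clause).
    candidates = [
        r for r in rows
        if r.prefix == prefix
        and r.is_active
        and (r.expires_at is None or r.expires_at > now)
    ]
    # Step 2: hash the presented key once and compare against each candidate.
    digest = hash_key(raw_key)
    for row in candidates:
        if hmac.compare_digest(digest, row.key_hash):
            return row
    return None


raw = "pk_live_0123456789abcdef"  # made-up key format for the example
table = [ApiKeyRow(prefix=raw[:8], key_hash=hash_key(raw))]
matched = authenticate(raw, table)
rejected = authenticate("pk_live_wrong", table)  # same prefix, wrong hash
```

Because several keys can share a prefix, the prefix match alone never authenticates a request; only the full hash comparison does.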
### credit_ledger

Append-only ledger of all credit transactions. Current balance is `SELECT SUM(amount) FROM credit_ledger WHERE user_id = ?`.

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | `uuid` | PK, default uuid4 | |
| `user_id` | `uuid` | FK → users.id ON DELETE CASCADE, NOT NULL | |
| `amount` | `integer` | NOT NULL | Positive = credit in, negative = debit |
| `tx_type` | `enum(credit_tx_type)` | NOT NULL | `purchase`, `acquire`, `refund`, `admin_adjust` |
| `description` | `text` | nullable | Human-readable note |
| `reference_id` | `uuid` | nullable | Links to the related entity (e.g., lease ID for `acquire` transactions) |
| `created_at` | `timestamptz` | NOT NULL, server default `now()` | |

**Indexes**:

| Name | Columns | Purpose |
|------|---------|---------|
| `ix_ledger_user_created` | `(user_id, created_at)` | Balance computation and history queries |

**Caching**: The computed balance is cached in Redis under `credits:{user_id}`. The cache is invalidated (DEL) whenever a new ledger entry is created. Cache miss triggers a `SUM(amount)` query.

**Concurrency**: Because balance is derived from a SUM, concurrent inserts don't cause race conditions on the balance itself. The acquire endpoint uses `SELECT ... FOR UPDATE` on the user row to serialize credit checks, preventing double-spending under high concurrency.

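The ledger-plus-cache pattern can be illustrated with a small sketch. SQLite stands in for PostgreSQL and a plain dict stands in for Redis; `add_entry` and `get_balance` are hypothetical helpers. The `COALESCE` wrapper is an addition of this sketch: `SUM` over zero rows returns NULL, so a user with no ledger entries would otherwise get NULL instead of 0.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE credit_ledger (
        id TEXT PRIMARY KEY,
        user_id TEXT NOT NULL,
        amount INTEGER NOT NULL,
        tx_type TEXT NOT NULL
    )
    """
)

# Dict standing in for Redis keys of the form credits:{user_id}.
balance_cache: dict[str, int] = {}


def add_entry(user_id: str, amount: int, tx_type: str) -> None:
    """Append a ledger row, then invalidate the cached balance (the DEL step)."""
    conn.execute(
        "INSERT INTO credit_ledger (id, user_id, amount, tx_type) "
        "VALUES (?, ?, ?, ?)",
        (str(uuid.uuid4()), user_id, amount, tx_type),
    )
    balance_cache.pop(f"credits:{user_id}", None)


def get_balance(user_id: str) -> int:
    """Return the cached balance, recomputing it from the ledger on a miss."""
    key = f"credits:{user_id}"
    if key not in balance_cache:
        row = conn.execute(
            "SELECT COALESCE(SUM(amount), 0) FROM credit_ledger "
            "WHERE user_id = ?",
            (user_id,),
        ).fetchone()
        balance_cache[key] = row[0]
    return balance_cache[key]


user = str(uuid.uuid4())
empty_balance = get_balance(user)   # no rows yet
add_entry(user, 100, "purchase")
add_entry(user, -30, "acquire")
balance = get_balance(user)         # recomputed after invalidation
```

The sketch omits the `SELECT ... FOR UPDATE` lock on the user row, which SQLite does not support; in PostgreSQL that lock would wrap the balance check and the debit insert in one transaction.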
### proxy_leases

Tracks which proxies are currently checked out by which users. Both Redis (for fast lookup) and PostgreSQL (for audit trail) maintain lease state.

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | `uuid` | PK, default uuid4 | |
| `user_id` | `uuid` | FK → users.id, NOT NULL | |
| `proxy_id` | `uuid` | FK → proxies.id, NOT NULL | |
| `acquired_at` | `timestamptz` | NOT NULL, server default `now()` | |
| `expires_at` | `timestamptz` | NOT NULL | When the lease automatically releases |
| `is_released` | `boolean` | NOT NULL, default `false` | Set to true on explicit release or expiration cleanup |

**Indexes**:

| Name | Columns | Purpose |
|------|---------|---------|
| `ix_leases_user` | `(user_id)` | List a user's active leases |
| `ix_leases_proxy_active` | `(proxy_id, is_released)` | Check if a proxy is currently leased |

**Dual state**: Redis holds the lease as `lease:{proxy_id}` with a TTL matching `expires_at`. The proxy selection query excludes proxies with an active Redis lease key. The PostgreSQL record exists for audit, billing reconciliation, and cleanup if Redis state is lost.

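As a rough model of the Redis side of the dual state, the sketch below keeps `lease:{proxy_id}` keys with an expiry time in a dict and filters leased proxies out of a candidate list. `LeaseStore` and `available` are hypothetical names; a real implementation would presumably rely on Redis TTLs and an atomic set-if-absent operation rather than this in-process check.

```python
import time
from typing import Optional


class LeaseStore:
    """In-memory stand-in for Redis keys lease:{proxy_id} with a TTL."""

    def __init__(self) -> None:
        self._expiry: dict[str, float] = {}

    def is_leased(self, proxy_id: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        expires_at = self._expiry.get(f"lease:{proxy_id}")
        # A key past its expiry behaves as if Redis had already evicted it.
        return expires_at is not None and expires_at > now

    def acquire(
        self, proxy_id: str, ttl_seconds: float, now: Optional[float] = None
    ) -> bool:
        """Take the lease unless an unexpired one already exists."""
        now = time.time() if now is None else now
        if self.is_leased(proxy_id, now):
            return False
        self._expiry[f"lease:{proxy_id}"] = now + ttl_seconds
        return True


def available(proxy_ids: list[str], store: LeaseStore, now: float) -> list[str]:
    """Proxy selection: skip anything with an active lease key."""
    return [p for p in proxy_ids if not store.is_leased(p, now)]


store = LeaseStore()
first = store.acquire("proxy-a", ttl_seconds=60, now=1000.0)
second = store.acquire("proxy-a", ttl_seconds=60, now=1010.0)  # still leased
free_during = available(["proxy-a", "proxy-b"], store, now=1010.0)
free_after = available(["proxy-a", "proxy-b"], store, now=1061.0)
```

The PostgreSQL `proxy_leases` row is not modeled here; it would be written alongside the key so that leases can be audited or rebuilt if the Redis state is lost.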
## Enum types

All enums are PostgreSQL native enums created via `CREATE TYPE`:

| Enum name | Values |
|-----------|--------|
| `proxy_protocol` | `http`, `https`, `socks4`, `socks5` |
| `proxy_status` | `unchecked`, `active`, `dead` |
| `anonymity_level` | `transparent`, `anonymous`, `elite` |
| `credit_tx_type` | `purchase`, `acquire`, `refund`, `admin_adjust` |

## Migration conventions

- One migration per logical change. Don't bundle unrelated schema changes.
- Migration filenames: `NNN_descriptive_name.py` (e.g., `001_initial_schema.py`).
- Always include both `upgrade()` and `downgrade()` functions.
- Test migrations against a fresh database AND against a database with existing data.
- Use `alembic revision --autogenerate -m "description"` for model-driven changes, but always review the generated migration script before applying it.