Database schema reference

Overview

All tables use UUID primary keys (generated client-side via uuid4()), timestamptz for datetime columns, and follow a consistent naming convention: snake_case table names, singular for join/config tables, plural for entity tables.

The schema is managed by Alembic. Never modify tables directly — always create a migration.

Proxy domain tables

proxy_sources

Configurable scrape targets. Each record defines a URL to fetch, a parser to use, and a schedule.

| Column | Type | Constraints | Description |
|---|---|---|---|
| id | uuid | PK, default uuid4 | |
| url | varchar(2048) | UNIQUE, NOT NULL | The URL to scrape |
| parser_name | varchar(64) | NOT NULL | Maps to a registered SourceParser.name |
| cron_schedule | varchar(64) | nullable | Cron expression for scrape frequency. Falls back to the parser's default_schedule() if NULL |
| default_protocol | enum(proxy_protocol) | NOT NULL, default http | Protocol to assign when the parser can't determine it from the source |
| is_active | boolean | NOT NULL, default true | Inactive sources are skipped by the scrape task |
| last_scraped_at | timestamptz | nullable | Timestamp of the last successful scrape |
| created_at | timestamptz | NOT NULL, server default now() | |

Rationale: Storing the parser name rather than auto-detecting every time allows explicit control. A source might look like a plain text file but actually need a custom parser.
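The cron_schedule fallback described in the table can be sketched as follows. This is a minimal illustration, not the project's code: the PlainTextParser class, the PARSERS registry, the ProxySource dataclass, and the 30-minute default are all hypothetical; only parser_name, cron_schedule, and default_schedule() come from the schema above.

```python
from dataclasses import dataclass
from typing import Optional


class PlainTextParser:
    """Hypothetical parser; only `name` and `default_schedule()` are named in this document."""

    name = "plain_text"

    def default_schedule(self) -> str:
        # Assumed default for illustration: every 30 minutes.
        return "*/30 * * * *"


@dataclass
class ProxySource:
    url: str
    parser_name: str
    cron_schedule: Optional[str] = None  # NULL in the database


# Hypothetical registry keyed by SourceParser.name.
PARSERS = {PlainTextParser.name: PlainTextParser()}


def effective_schedule(source: ProxySource) -> str:
    """Return the source's own cron expression, or the parser default when it is NULL."""
    parser = PARSERS[source.parser_name]
    if source.cron_schedule is None:
        return parser.default_schedule()
    return source.cron_schedule
```

Because parser_name is stored explicitly, the lookup never has to guess a parser from the URL or the response body.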

proxies

The core proxy table. Each record represents a unique (ip, port, protocol) combination.

| Column | Type | Constraints | Description |
|---|---|---|---|
| id | uuid | PK, default uuid4 | |
| ip | inet | NOT NULL | IPv4 or IPv6 address |
| port | integer | NOT NULL | Port number (1–65535) |
| protocol | enum(proxy_protocol) | NOT NULL | http, https, socks4, socks5 |
| source_id | uuid | FK → proxy_sources.id, NOT NULL | Which source discovered this proxy |
| status | enum(proxy_status) | NOT NULL, default unchecked | unchecked, active, dead |
| anonymity | enum(anonymity_level) | nullable | transparent, anonymous, elite |
| exit_ip | inet | nullable | The IP address seen by the target when using this proxy |
| country | varchar(2) | nullable | ISO 3166-1 alpha-2 country code of the exit IP |
| score | float | NOT NULL, default 0.0 | Composite quality score (0.0–1.0) |
| avg_latency_ms | float | nullable | Rolling average latency across recent checks |
| uptime_pct | float | nullable | Percentage of checks that passed (0.0–100.0) |
| first_seen_at | timestamptz | NOT NULL, server default now() | When this proxy was first discovered |
| last_checked_at | timestamptz | nullable | When the last validation check completed |
| created_at | timestamptz | NOT NULL, server default now() | |

Indexes:

| Name | Columns | Type | Purpose |
|---|---|---|---|
| ix_proxies_ip_port_proto | (ip, port, protocol) | UNIQUE | Deduplication on upsert |
| ix_proxies_status_score | (status, score) | B-tree | Fast filtering for "active proxies sorted by score" |

Design note: The same ip:port can appear multiple times if it supports different protocols (e.g., both HTTP and SOCKS5 served on port 1080). Because the unique index covers (ip, port, protocol) rather than (ip, port) alone, such rows coexist while exact duplicates are rejected.
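The deduplicating behaviour of the composite unique index can be tried out with a small sketch. It uses an in-memory SQLite database as a stand-in for PostgreSQL (so ip is stored as TEXT rather than inet, and enums are plain strings), and it assumes the upsert ignores duplicates with DO NOTHING; the real scrape task may instead update columns on conflict.

```python
import sqlite3
import uuid

# SQLite stands in for PostgreSQL here; column types are simplified.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE proxies (
        id TEXT PRIMARY KEY,
        ip TEXT NOT NULL,
        port INTEGER NOT NULL,
        protocol TEXT NOT NULL,
        status TEXT NOT NULL DEFAULT 'unchecked'
    )
    """
)
conn.execute(
    "CREATE UNIQUE INDEX ix_proxies_ip_port_proto ON proxies (ip, port, protocol)"
)


def upsert_proxy(ip: str, port: int, protocol: str) -> None:
    """Insert a proxy, or leave the existing row alone if the triple is already known."""
    conn.execute(
        """
        INSERT INTO proxies (id, ip, port, protocol)
        VALUES (?, ?, ?, ?)
        ON CONFLICT (ip, port, protocol) DO NOTHING
        """,
        (str(uuid.uuid4()), ip, port, protocol),  # id generated client-side
    )


upsert_proxy("203.0.113.7", 1080, "http")
upsert_proxy("203.0.113.7", 1080, "socks5")  # same ip:port, different protocol: new row
upsert_proxy("203.0.113.7", 1080, "http")    # duplicate triple: ignored
row_count = conn.execute("SELECT COUNT(*) FROM proxies").fetchone()[0]
```

After the three calls, the table holds two rows: one per protocol on the same ip:port.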

Computed columns: score, avg_latency_ms, and uptime_pct are denormalized from proxy_checks. They are recomputed by the validation pipeline after each check run and by a periodic rollup task. This avoids expensive aggregation queries on every proxy list request.
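As a rough illustration of the rollup, the sketch below derives uptime_pct and avg_latency_ms from a list of check results. The Check dataclass and the rollup() function are hypothetical names, and a plain mean over the supplied checks is assumed for the "rolling average". The score formula is not specified in this document, so it is left out.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Check:
    passed: bool
    latency_ms: Optional[float]  # nullable, as in proxy_checks


def rollup(checks: Sequence[Check]) -> tuple[Optional[float], Optional[float]]:
    """Return (uptime_pct, avg_latency_ms) for one proxy's recent checks."""
    if not checks:
        # No checks yet: both denormalized columns stay NULL.
        return None, None
    uptime_pct = 100.0 * sum(c.passed for c in checks) / len(checks)
    # Average only over checks that recorded a latency.
    latencies = [c.latency_ms for c in checks if c.latency_ms is not None]
    avg_latency_ms = sum(latencies) / len(latencies) if latencies else None
    return uptime_pct, avg_latency_ms
```

The pipeline would write the two values back to the proxies row, so list requests read them directly instead of aggregating proxy_checks.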

proxy_checks

Append-only log of every validation check attempt. This is the raw data behind the computed fields on proxies.

| Column | Type | Constraints | Description |
|---|---|---|---|
| id | uuid | PK, default uuid4 | |
| proxy_id | uuid | FK → proxies.id ON DELETE CASCADE, NOT NULL | |
| checker_name | varchar(64) | NOT NULL | The ProxyChecker.name that ran this check |
| stage | integer | NOT NULL | Pipeline stage number |
| passed | boolean | NOT NULL | Whether the check succeeded |
| latency_ms | float | nullable | Time taken for this specific check |
| detail | text | nullable | Human-readable result description or error message |
| exit_ip | inet | nullable | Exit IP discovered during this check (if applicable) |
| created_at | timestamptz | NOT NULL, server default now() | |

Indexes:

| Name | Columns | Purpose |
|---|---|---|
| ix_checks_proxy_created | (proxy_id, created_at) | Efficient history queries per proxy |

Retention: This table grows fast. A periodic cleanup task (tasks_cleanup.prune_checks) deletes rows older than a configurable retention period (default: 7 days) while keeping the most recent N checks per proxy.
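One possible reading of the retention rule is sketched below: a row is pruned only if it is older than the retention period and is not among the newest N checks for its proxy. That interpretation, the rows_to_prune() helper, and the keep_latest default of 3 are assumptions for illustration, not the behaviour of tasks_cleanup.prune_checks.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def rows_to_prune(rows, now, retention=timedelta(days=7), keep_latest=3):
    """Return ids of check rows past retention that are not among the newest per proxy.

    rows: iterable of (id, proxy_id, created_at) tuples.
    """
    cutoff = now - retention
    by_proxy = defaultdict(list)
    for row in rows:
        by_proxy[row[1]].append(row)

    prune = []
    for proxy_rows in by_proxy.values():
        proxy_rows.sort(key=lambda r: r[2], reverse=True)  # newest first
        # The first keep_latest rows are always retained, whatever their age.
        for row in proxy_rows[keep_latest:]:
            if row[2] < cutoff:
                prune.append(row[0])
    return sorted(prune)


now = datetime(2024, 1, 31, tzinfo=timezone.utc)
# Proxy "p1" has checks aged 1, 2, 8, 9, and 10 days; "p2" has one 30-day-old check.
rows = [(i, "p1", now - timedelta(days=d)) for i, d in enumerate([1, 2, 8, 9, 10], start=1)]
rows.append((6, "p2", now - timedelta(days=30)))
stale_ids = rows_to_prune(rows, now)
```

Under these assumptions the 8-day-old check for p1 survives because it is one of that proxy's three newest rows, and the lone p2 check survives for the same reason.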

proxy_tags

Flexible key-value labels for proxies. Useful for user-defined categorization (e.g., datacenter: true, provider: aws, tested_site: google.com).

| Column | Type | Constraints | Description |
|---|---|---|---|
| id | uuid | PK, default uuid4 | |
| proxy_id | uuid | FK → proxies.id ON DELETE CASCADE, NOT NULL | |
| key | varchar(64) | NOT NULL | Tag name |
| value | varchar(256) | NOT NULL | Tag value |

Indexes:

| Name | Columns | Type | Purpose |
|---|---|---|---|
| ix_tags_proxy_key | (proxy_id, key) | UNIQUE | One value per key per proxy |

Accounts domain tables

users

User accounts. Minimal by design — the primary purpose is to own API keys and credits.

| Column | Type | Constraints | Description |
|---|---|---|---|
| id | uuid | PK, default uuid4 | |
| email | varchar(320) | UNIQUE, NOT NULL | Used for notifications and account recovery |
| display_name | varchar(128) | nullable | |
| is_active | boolean | NOT NULL, default true | Inactive users cannot authenticate |
| created_at | timestamptz | NOT NULL, server default now() | |

api_keys

API keys for authentication. The raw key is shown once at creation; only the hash is stored.

| Column | Type | Constraints | Description |
|---|---|---|---|
| id | uuid | PK, default uuid4 | |
| user_id | uuid | FK → users.id ON DELETE CASCADE, NOT NULL | |
| key_hash | varchar(128) | NOT NULL | SHA-256 hash of the raw API key |
| prefix | varchar(8) | NOT NULL | First 8 characters of the raw key, for quick lookup |
| label | varchar(128) | nullable | User-assigned label (e.g., "production", "testing") |
| is_active | boolean | NOT NULL, default true | Revoked keys have is_active = false |
| last_used_at | timestamptz | nullable | Updated on each authenticated request |
| expires_at | timestamptz | nullable | NULL means no expiration |
| created_at | timestamptz | NOT NULL, server default now() | |

Indexes:

| Name | Columns | Type | Purpose |
|---|---|---|---|
| ix_api_keys_hash | (key_hash) | UNIQUE | Uniqueness constraint on key hashes |
| ix_api_keys_prefix | (prefix) | B-tree | Fast prefix-based lookup before full hash comparison |

Auth flow: On each request, the middleware extracts the API key from the Authorization: Bearer <key> header, computes prefix = key[:8], queries api_keys WHERE prefix = ? AND is_active = true AND (expires_at IS NULL OR expires_at > now()), then verifies sha256(key) == key_hash. This two-step approach avoids computing a hash against every key in the database.
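The two-step lookup can be sketched in plain Python. Here a list of dicts stands in for the api_keys table, and the authenticate() and hash_key() names, the hex encoding of the hash, and the constant-time comparison via hmac.compare_digest are assumptions for illustration rather than details confirmed by this document.

```python
import hashlib
import hmac
from datetime import datetime, timezone
from typing import Optional


def hash_key(raw_key: str) -> str:
    """SHA-256 of the raw key, hex-encoded (64 characters)."""
    return hashlib.sha256(raw_key.encode("utf-8")).hexdigest()


def authenticate(raw_key: str, api_keys: list[dict], now: datetime) -> Optional[dict]:
    """Return the matching api_keys row, or None if the key is unknown, revoked, or expired."""
    prefix = raw_key[:8]
    # Step 1: narrow by prefix, is_active, and expiry (the indexed WHERE clause).
    candidates = [
        row for row in api_keys
        if row["prefix"] == prefix
        and row["is_active"]
        and (row["expires_at"] is None or row["expires_at"] > now)
    ]
    # Step 2: compare the full hash, in constant time, against each candidate.
    digest = hash_key(raw_key)
    for row in candidates:
        if hmac.compare_digest(digest, row["key_hash"]):
            return row
    return None


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
raw = "pp_live_abcdef123456"  # made-up key format
table = [{"prefix": raw[:8], "key_hash": hash_key(raw), "is_active": True, "expires_at": None}]
matched = authenticate(raw, table, now)
```

Several keys can share a prefix, so the hash comparison in step 2 is what actually authenticates the request; the prefix only narrows the candidate set.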

credit_ledger

Append-only ledger of all credit transactions. Current balance is SELECT SUM(amount) FROM credit_ledger WHERE user_id = ?.

| Column | Type | Constraints | Description |
|---|---|---|---|
| id | uuid | PK, default uuid4 | |
| user_id | uuid | FK → users.id ON DELETE CASCADE, NOT NULL | |
| amount | integer | NOT NULL | Positive = credit in, negative = debit |
| tx_type | enum(credit_tx_type) | NOT NULL | purchase, acquire, refund, admin_adjust |
| description | text | nullable | Human-readable note |
| reference_id | uuid | nullable | Links to the related entity (e.g., lease ID for acquire transactions) |
| created_at | timestamptz | NOT NULL, server default now() | |

Indexes:

| Name | Columns | Purpose |
|---|---|---|
| ix_ledger_user_created | (user_id, created_at) | Balance computation and history queries |

Caching: The computed balance is cached in Redis under credits:{user_id}. The cache is invalidated (DEL) whenever a new ledger entry is created. Cache miss triggers a SUM(amount) query.
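The cache-aside pattern described here can be illustrated without a running Redis or PostgreSQL. In the sketch below a dict stands in for Redis and a list for credit_ledger; the BalanceCache class and its sum_queries counter are hypothetical and exist only to make the cache hits and misses visible.

```python
class BalanceCache:
    """Cache-aside balance lookup; a dict stands in for Redis, a list for credit_ledger."""

    def __init__(self):
        self.ledger = []      # (user_id, amount) tuples, append-only
        self.cache = {}       # "credits:{user_id}" -> balance
        self.sum_queries = 0  # counts simulated SUM(amount) queries

    def _key(self, user_id: str) -> str:
        return f"credits:{user_id}"

    def add_entry(self, user_id: str, amount: int) -> None:
        self.ledger.append((user_id, amount))
        self.cache.pop(self._key(user_id), None)  # DEL on every new ledger entry

    def balance(self, user_id: str) -> int:
        key = self._key(user_id)
        if key not in self.cache:
            # Cache miss: equivalent of SELECT SUM(amount) ... WHERE user_id = ?
            self.sum_queries += 1
            self.cache[key] = sum(a for u, a in self.ledger if u == user_id)
        return self.cache[key]


ledger = BalanceCache()
ledger.add_entry("u1", 100)   # purchase
ledger.add_entry("u1", -30)   # acquire
first = ledger.balance("u1")  # miss: runs the SUM
second = ledger.balance("u1") # hit: served from the cache
```

One difference from the real query is worth noting: SUM over zero rows returns NULL in SQL, whereas Python's sum() returns 0, so the SQL version typically needs COALESCE(SUM(amount), 0) for users with no ledger entries.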

Concurrency: Because balance is derived from a SUM, concurrent inserts don't cause race conditions on the balance itself. The acquire endpoint uses SELECT ... FOR UPDATE on the user row to serialize credit checks, preventing double-spending under high concurrency.

proxy_leases

Tracks which proxies are currently checked out by which users. Both Redis (for fast lookup) and PostgreSQL (for audit trail) maintain lease state.

| Column | Type | Constraints | Description |
|---|---|---|---|
| id | uuid | PK, default uuid4 | |
| user_id | uuid | FK → users.id, NOT NULL | |
| proxy_id | uuid | FK → proxies.id, NOT NULL | |
| acquired_at | timestamptz | NOT NULL, server default now() | |
| expires_at | timestamptz | NOT NULL | When the lease automatically releases |
| is_released | boolean | NOT NULL, default false | Set to true on explicit release or expiration cleanup |

Indexes:

| Name | Columns | Purpose |
|---|---|---|
| ix_leases_user | (user_id) | List a user's active leases |
| ix_leases_proxy_active | (proxy_id, is_released) | Check if a proxy is currently leased |

Dual state: Redis holds the lease as lease:{proxy_id} with a TTL matching expires_at. The proxy selection query excludes proxies with an active Redis lease key. The PostgreSQL record exists for audit, billing reconciliation, and cleanup if Redis state is lost.
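To keep the Redis TTL aligned with the PostgreSQL expires_at column, the key and TTL can be derived from the same values. The helpers below are a hypothetical sketch; only the lease:{proxy_id} key format comes from this document, and rounding the TTL up to whole seconds is an assumption.

```python
import math
from datetime import datetime, timezone
from typing import Optional
from uuid import UUID


def lease_key(proxy_id: UUID) -> str:
    """Redis key for a proxy's lease, in the lease:{proxy_id} format."""
    return f"lease:{proxy_id}"


def lease_ttl_seconds(expires_at: datetime, now: datetime) -> Optional[int]:
    """Whole seconds until the lease expires, or None if it has already expired."""
    remaining = (expires_at - now).total_seconds()
    if remaining <= 0:
        return None
    # Round up so the Redis key never disappears before expires_at.
    return math.ceil(remaining)


proxy_id = UUID("12345678-1234-5678-1234-567812345678")
now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
expires_at = datetime(2024, 1, 1, 12, 5, 0, tzinfo=timezone.utc)
key = lease_key(proxy_id)
ttl = lease_ttl_seconds(expires_at, now)
```

The resulting pair maps onto a Redis SET with the EX option. If Redis state is lost, the same helpers can rebuild keys from proxy_leases rows where is_released is false and expires_at is still in the future.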

Enum types

All enums are PostgreSQL native enums created via CREATE TYPE:

| Enum name | Values |
|---|---|
| proxy_protocol | http, https, socks4, socks5 |
| proxy_status | unchecked, active, dead |
| anonymity_level | transparent, anonymous, elite |
| credit_tx_type | purchase, acquire, refund, admin_adjust |
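On the application side, these types are commonly mirrored as Python enums so that invalid values fail before reaching the database. The class names below are hypothetical and the str mixin is an assumption; only the values are taken from the table above.

```python
from enum import Enum


class ProxyProtocol(str, Enum):
    HTTP = "http"
    HTTPS = "https"
    SOCKS4 = "socks4"
    SOCKS5 = "socks5"


class ProxyStatus(str, Enum):
    UNCHECKED = "unchecked"
    ACTIVE = "active"
    DEAD = "dead"


class AnonymityLevel(str, Enum):
    TRANSPARENT = "transparent"
    ANONYMOUS = "anonymous"
    ELITE = "elite"


class CreditTxType(str, Enum):
    PURCHASE = "purchase"
    ACQUIRE = "acquire"
    REFUND = "refund"
    ADMIN_ADJUST = "admin_adjust"


# Lookup by value raises ValueError for anything outside the enum.
status = ProxyStatus("active")
```

Adding a value to one of these enums still requires a migration, because the PostgreSQL type has to be altered as well.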

Migration conventions

  • One migration per logical change. Don't bundle unrelated schema changes.
  • Migration filenames: NNN_descriptive_name.py (e.g., 001_initial_schema.py).
  • Always include both upgrade() and downgrade() functions.
  • Test migrations against a fresh database AND against a database with existing data.
  • Use alembic revision --autogenerate -m "description" for model-driven changes, but always review the generated migration script before applying it.