docs: add developer documentation

This commit is contained in:
agatha 2026-03-14 12:22:08 -04:00
parent 11919e516b
commit c041e83a19
8 changed files with 1804 additions and 0 deletions


@ -2,3 +2,5 @@
Proxy Pool is a FastAPI backend that discovers, validates, and serves free proxy servers. It scrapes configurable proxy
list sources, runs a multi-stage validation pipeline to determine which proxies are alive and usable, and exposes a
query API for consumers to find and acquire working proxies.
See `docs/` for full documentation.

docs/01-architecture.md Normal file

@ -0,0 +1,122 @@
# Architecture overview
## Purpose
Proxy Pool is a FastAPI backend that discovers, validates, and serves free proxy servers. It scrapes configurable proxy list sources, runs a multi-stage validation pipeline to determine which proxies are alive and usable, and exposes a query API for consumers to find and acquire working proxies.
## System components
The application is a single FastAPI process with a companion ARQ worker process. Both share the same codebase and connect to the same PostgreSQL database and Redis instance.
### FastAPI application
The HTTP API layer. Handles all inbound requests, dependency injection, middleware (CORS, rate limiting, request logging), and the app lifespan (startup/shutdown). Built with the app factory pattern via `create_app()` so tests can instantiate independent app instances.
### ARQ worker
A separate process running ARQ (async Redis queue). Executes background tasks: periodic source scraping, proxy validation sweeps, stale proxy cleanup, and lease expiration. Tasks are defined in `proxy_pool.worker` and import from domain service layers — they contain orchestration, not business logic.
### PostgreSQL (asyncpg)
Primary data store. All proxy data, user accounts, credit ledger, check history, and source configuration. Accessed through SQLAlchemy 2.0 async ORM with asyncpg as the dialect driver. Schema managed by Alembic.
### Redis
Serves three roles:
- **Task queue**: ARQ uses Redis as its message broker for enqueuing and scheduling background tasks.
- **Cache**: Hot proxy lists (the "top N good proxies" result), user credit balances, and recently-validated proxy sets are cached in Redis with TTL-based expiration.
- **Lease manager**: The proxy acquire endpoint uses Redis `SET key value EX ttl NX` for atomic lease creation. This prevents two consumers from acquiring the same proxy simultaneously.
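The lease semantics can be sketched in memory. This is an illustrative model of the Redis `SET key value EX ttl NX` behavior, not the project's code; `LeaseStore` and its method names are hypothetical:

```python
import time

class LeaseStore:
    """In-memory sketch of Redis SET ... EX ttl NX lease semantics."""

    def __init__(self):
        self._data: dict[str, tuple[str, float]] = {}  # key -> (value, expires_at)

    def acquire(self, key: str, value: str, ttl: float) -> bool:
        now = time.monotonic()
        current = self._data.get(key)
        if current is not None and current[1] > now:
            return False  # key exists and is unexpired: NX fails
        self._data[key] = (value, now + ttl)  # EX attaches the expiry
        return True

store = LeaseStore()
assert store.acquire("lease:proxy-1", "user-a", ttl=30.0)
assert not store.acquire("lease:proxy-1", "user-b", ttl=30.0)  # already leased
```

Because the existence check and the write are a single Redis command, two concurrent `acquire` calls for the same proxy cannot both succeed.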
### Plugin registry
A runtime plugin system that manages three plugin types: source parsers, proxy checkers, and notifiers. Plugins are discovered at startup by scanning the `plugins/builtin/` and `plugins/contrib/` directories. The registry validates plugins against Python Protocol classes and stores them for use by the pipeline and event bus.
## Domain modules
The codebase is split into two domain modules with a shared infrastructure layer:
### Proxy domain (`proxy_pool.proxy`)
Owns all proxy-related data and logic: sources, proxies, check history, tags, scoring, and the validation pipeline. Exposes routes for CRUD on sources, querying/filtering proxies, running on-demand site reachability tests, and viewing pool health stats.
### Accounts domain (`proxy_pool.accounts`)
Owns user accounts, API key authentication, and the credit ledger. Exposes routes for key management, account info, and credit balance/history. Provides a FastAPI dependency (`get_current_user`) that resolves an API key header into an authenticated `User`.
### Integration boundary
The two domains never import from each other directly. The single integration point is the `POST /proxies/acquire` endpoint, which:
1. Resolves the API key via the accounts auth dependency
2. Checks the user's credit balance (Redis cache, DB fallback)
3. Selects the best matching proxy via the proxy service layer
4. Creates an atomic lease in Redis
5. Debits the credit ledger and logs the lease in a single DB transaction
6. Invalidates the cached credit balance
This is the only place where both domains participate in the same request.
## Data flow
### Discovery
A periodic ARQ cron task (`scrape_all`) iterates over active `ProxySource` records. For each source, it fetches the URL, selects the appropriate `SourceParser` plugin (by `parser_name` or auto-detection via `supports()`), parses the raw content into a list of `DiscoveredProxy` objects, and upserts them into the `proxies` table using `ON CONFLICT (ip, port, protocol) DO UPDATE`.
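The upsert might look like the following sketch. The column list and the `DO UPDATE` set are illustrative; the real statement is defined by the scrape task:

```sql
-- Hedged sketch of the discovery upsert; the conflict target matches
-- the unique index on (ip, port, protocol).
INSERT INTO proxies (id, ip, port, protocol, source_id, status)
VALUES (gen_random_uuid(), '203.0.113.10', 8080, 'http', :source_id, 'unchecked')
ON CONFLICT (ip, port, protocol) DO UPDATE
SET source_id = EXCLUDED.source_id;  -- illustrative; the actual update set may differ
```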
### Validation
Newly discovered proxies (status `UNCHECKED`) and proxies due for revalidation are fed into the checker pipeline. The pipeline runs all registered `ProxyChecker` plugins in stage order:
- **Stage 1** — Quick liveness checks (TCP connect, SOCKS handshake). Run concurrently within the stage.
- **Stage 2** — HTTP-level checks (exit IP detection, anonymity classification, GeoIP lookup). Run concurrently within the stage.
- **Stage 3** — Optional site-specific reachability checks.
If any checker in a stage fails, all subsequent stages are skipped. Every check result is logged to `proxy_checks` for historical analysis. After the pipeline completes, a composite score is computed from latency, uptime percentage, and proxy age, then persisted to the `proxies` table.
### Serving
The query API (`GET /proxies`) filters by protocol, country, anonymity level, minimum score, latency range, and "last verified within N minutes." Results are served from a Redis cache when available, with cache invalidation triggered by validation sweeps.
The acquire endpoint (`POST /proxies/acquire`) selects a proxy matching the requested filters, creates a time-limited lease, debits one credit, and returns the proxy details. The lease is tracked in both Redis (for fast exclusion from future queries) and PostgreSQL (for audit trail and cleanup).
### Notification
The event bus dispatches events (`proxy.pool_low`, `source.failed`, `credits.low_balance`, etc.) to registered `Notifier` plugins. Notifications are fire-and-forget via `asyncio.create_task` — they never block the main request or pipeline path.
## Technology decisions
| Component | Choice | Rationale |
|-----------|--------|-----------|
| Web framework | FastAPI | Async-native, dependency injection, automatic OpenAPI docs |
| ORM | SQLAlchemy 2.0 async | Mapped classes, native async session, Alembic integration |
| DB driver | asyncpg | Fastest async PostgreSQL driver for Python |
| Migrations | Alembic | Industry standard, autogeneration from SA models |
| Task queue | ARQ + Redis | Lightweight async task queue, cron scheduling built in |
| HTTP client | httpx + httpx-socks | Async, SOCKS proxy support, connection pooling |
| Settings | pydantic-settings | Env-driven config with type validation |
| Package manager | uv | Fast, lockfile-based, handles editable installs |
| Containerization | Docker + Compose | Multi-stage builds, service orchestration |
## Deployment topology
```
┌─────────────────┐
│ Load balancer │
└────────┬────────┘
┌──────────────┼──────────────┐
│ │ │
┌─────┴─────┐ ┌─────┴─────┐ ┌─────┴─────┐
│ API (N) │ │ API (N) │ │ API (N) │
└─────┬─────┘ └─────┬─────┘ └─────┬─────┘
│ │ │
└──────────────┼──────────────┘
┌──────────────┼──────────────┐
│ │ │
┌─────┴─────┐ ┌─────┴─────┐ ┌─────┴─────┐
│ Worker(M) │ │PostgreSQL │ │   Redis   │
└───────────┘ └───────────┘ └───────────┘
```
API processes are stateless and can be scaled horizontally. Workers should typically run as 1-2 instances to avoid duplicate scrape/validation work (ARQ handles job deduplication via Redis). PostgreSQL and Redis are single-instance in the default setup but can be replicated per standard practices.

docs/02-plugin-system.md Normal file

@ -0,0 +1,317 @@
# Plugin system design
## Overview
The plugin system allows extending Proxy Pool's functionality without modifying core code. Plugins can add new proxy list parsers, new validation methods, and new notification channels. The system uses Python's `typing.Protocol` for structural typing — plugins implement the right interface without inheriting from any base class.
## Plugin types
### SourceParser
Responsible for extracting proxy entries from raw scraped content. Each parser handles a specific format (plain text lists, HTML tables, JSON APIs, etc.).
```python
from typing import Protocol, runtime_checkable
@runtime_checkable
class SourceParser(Protocol):
name: str
def supports(self, url: str) -> bool:
"""Return True if this parser can handle the given source URL.
Used as a fallback when no parser_name is explicitly set on a ProxySource.
The registry calls supports() on each registered parser and uses the first match.
"""
...
async def parse(self, raw: bytes, source: ProxySource) -> list[DiscoveredProxy]:
"""Extract proxy entries from raw scraped content.
Arguments:
raw: The raw bytes fetched from the source URL. The parser is responsible
for decoding (the encoding may vary by source).
source: The ProxySource record, providing context like default_protocol.
Returns:
A list of DiscoveredProxy objects. Duplicates within a single parse call
are acceptable — deduplication happens at the upsert layer.
"""
...
def default_schedule(self) -> str | None:
"""Optional cron expression for scrape frequency.
If None, the schedule configured on the ProxySource record is used.
This allows parsers to suggest a sensible default (e.g. "*/30 * * * *"
for sources that update frequently).
"""
...
```
**Registration key**: `parser_name` on the `ProxySource` record maps to `SourceParser.name`.
**Built-in parsers**: `plaintext` (one `ip:port` per line), `html_table` (HTML table with IP/port columns), `json_api` (JSON array or nested structure).
### ProxyChecker
Runs a single validation check against a proxy. Checkers are organized into stages — all checkers in stage N run before any in stage N+1. Within a stage, checkers run concurrently.
```python
@runtime_checkable
class ProxyChecker(Protocol):
name: str
stage: int # Pipeline ordering. Lower stages run first.
priority: int # Ordering within a stage. Lower priority runs first.
timeout: float # Per-check timeout in seconds.
async def check(self, proxy: Proxy, context: CheckContext) -> CheckResult:
"""Run this check against the proxy.
Arguments:
proxy: The proxy being validated.
context: Shared mutable state across the pipeline. Checkers in earlier
stages populate fields (exit_ip, tcp_latency_ms) that later
stages can read. Also provides a pre-configured httpx.AsyncClient.
Returns:
CheckResult with passed=True/False and a detail string.
"""
...
def should_skip(self, proxy: Proxy) -> bool:
"""Return True to skip this check for the given proxy.
Example: A SOCKS5-specific checker returns True for HTTP-only proxies.
"""
...
```
**Pipeline execution**: The orchestrator in `proxy_pool.proxy.pipeline` groups checkers by stage, runs each group concurrently via `asyncio.gather`, and aborts the pipeline on the first stage with any failure. Every individual check result is logged to the `proxy_checks` table.
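The stage-wise execution can be sketched with simplified stand-ins. `FakeChecker` is hypothetical; the real orchestrator works on `ProxyChecker` plugins and `CheckResult` objects:

```python
import asyncio
from dataclasses import dataclass
from itertools import groupby

@dataclass
class FakeChecker:
    name: str
    stage: int
    passed: bool

    async def check(self) -> bool:
        return self.passed

async def run_pipeline(checkers: list[FakeChecker]) -> dict[str, bool]:
    results: dict[str, bool] = {}
    ordered = sorted(checkers, key=lambda c: c.stage)
    for _stage, group in groupby(ordered, key=lambda c: c.stage):
        group = list(group)
        # All checkers in one stage run concurrently.
        outcomes = await asyncio.gather(*(c.check() for c in group))
        for checker, ok in zip(group, outcomes):
            results[checker.name] = ok
        if not all(outcomes):
            break  # any failure in a stage aborts the remaining stages
    return results

checkers = [FakeChecker("tcp", 1, True), FakeChecker("anon", 2, False),
            FakeChecker("site", 3, True)]
results = asyncio.run(run_pipeline(checkers))
assert results == {"tcp": True, "anon": False}  # stage 3 never ran
```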
**`CheckContext`**: A mutable dataclass that travels through the pipeline:
```python
@dataclass
class CheckContext:
started_at: datetime
http_client: httpx.AsyncClient
# Populated by checkers as they run:
exit_ip: str | None = None
tcp_latency_ms: float | None = None
http_latency_ms: float | None = None
anonymity_level: AnonymityLevel | None = None
country: str | None = None
headers_forwarded: list[str] = field(default_factory=list)
def elapsed_ms(self) -> float:
return (utcnow() - self.started_at).total_seconds() * 1000
```
**`CheckResult`**: The return type from every checker:
```python
@dataclass
class CheckResult:
passed: bool
detail: str
latency_ms: float | None = None
metadata: dict[str, Any] = field(default_factory=dict)
```
**Built-in checkers**:
| Name | Stage | What it does |
|------|-------|-------------|
| `tcp_connect` | 1 | Opens a TCP connection to verify the proxy is reachable |
| `socks_handshake` | 1 | Performs a SOCKS4/5 handshake (skipped for HTTP proxies) |
| `http_anonymity` | 2 | Sends an HTTP request through the proxy to a judge URL, determines exit IP and which headers are forwarded |
| `geoip_lookup` | 2 | Resolves the exit IP to a country code using MaxMind GeoLite2 |
| `site_reach` | 3 | Optional: tests whether the proxy can reach specific target URLs |
### Notifier
Reacts to system events. Notifiers are called asynchronously (fire-and-forget) and must never block the main application path.
```python
@runtime_checkable
class Notifier(Protocol):
name: str
subscribes_to: list[str] # Glob patterns: "proxy.*", "credits.low_balance"
async def notify(self, event: Event) -> None:
"""Handle an event.
Called via asyncio.create_task — exceptions are caught and logged
but do not propagate. Implementations should handle their own
retries if needed.
"""
...
async def health_check(self) -> bool:
"""Verify the notification backend is reachable.
Called periodically and surfaced in the admin stats endpoint.
Return False if the backend is unreachable.
"""
...
```
**Event types**:
| Event | Payload | When emitted |
|-------|---------|-------------|
| `proxy.pool_low` | `{active_count, threshold}` | Active proxy count drops below configured threshold |
| `proxy.new_batch` | `{source_id, count}` | A scrape discovers new proxies |
| `source.failed` | `{source_id, error}` | A scrape attempt fails |
| `source.stale` | `{source_id, hours_since_success}` | A source hasn't produced results in N hours |
| `credits.low_balance` | `{user_id, balance, threshold}` | User balance drops below threshold |
| `credits.exhausted` | `{user_id}` | User balance reaches zero |
**Glob matching**: `proxy.*` matches all events starting with `proxy.`. Exact matches like `credits.low_balance` match only that event. A notifier can subscribe to multiple patterns.
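One plausible implementation of this matching uses the standard library's `fnmatch` (an assumption; the project may implement matching differently):

```python
from fnmatch import fnmatch

def matches(patterns: list[str], event_name: str) -> bool:
    """True if any subscribed pattern matches the event name."""
    return any(fnmatch(event_name, p) for p in patterns)

assert matches(["proxy.*"], "proxy.pool_low")           # glob match
assert matches(["credits.low_balance"], "credits.low_balance")  # exact match
assert not matches(["proxy.*"], "credits.exhausted")
```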
**Built-in notifiers**: `smtp` (email alerts), `webhook` (HTTP POST to a configured URL).
## Plugin registry
The `PluginRegistry` class is the central coordinator. It stores registered plugins, validates them against Protocol contracts at registration time, and provides lookup methods used by the pipeline and event bus.
```python
class PluginRegistry:
def __init__(self) -> None:
self._parsers: dict[str, SourceParser] = {}
self._checkers: list[ProxyChecker] = [] # Sorted by (stage, priority)
self._notifiers: list[Notifier] = []
self._event_subs: dict[str, list[Notifier]] = {}
def register_parser(self, plugin: SourceParser) -> None: ...
def register_checker(self, plugin: ProxyChecker) -> None: ...
def register_notifier(self, plugin: Notifier) -> None: ...
def get_parser(self, name: str) -> SourceParser: ...
def get_parser_for_url(self, url: str) -> SourceParser | None: ...
def get_checker_pipeline(self) -> list[ProxyChecker]: ...
async def emit(self, event: Event) -> None: ...
```
**Validation**: At registration time, `_validate_protocol()` uses `isinstance()` (enabled by `@runtime_checkable` on each Protocol) as a structural check, then inspects for missing attributes/methods and raises `PluginValidationError` with a descriptive message.
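A minimal sketch of that validation approach follows. `Named`, `GoodPlugin`, and `BadPlugin` are illustrative stand-ins; note that `isinstance()` against a runtime-checkable Protocol only checks member presence, not signatures:

```python
from typing import Any, Protocol, runtime_checkable

@runtime_checkable
class Named(Protocol):
    name: str
    def supports(self, url: str) -> bool: ...

class GoodPlugin:
    name = "good"
    def supports(self, url: str) -> bool:
        return True

class BadPlugin:
    pass

def validate(plugin: Any) -> None:
    # The structural isinstance() check, then a descriptive error listing
    # exactly which members are missing.
    if not isinstance(plugin, Named):
        missing = [a for a in ("name", "supports") if not hasattr(plugin, a)]
        raise TypeError(f"plugin missing: {missing}")

validate(GoodPlugin())        # passes silently
try:
    validate(BadPlugin())
except TypeError as exc:
    print(exc)                # plugin missing: ['name', 'supports']
```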
**Conflict detection**: Two parsers with the same `name` raise `PluginConflictError`. Checkers and notifiers are additive (duplicates are allowed, though unusual).
## Plugin discovery
Plugins are discovered at application startup by scanning two directories:
- `proxy_pool/plugins/builtin/` — Ships with the application. Tested in CI.
- `proxy_pool/plugins/contrib/` — User-provided plugins. Can be mounted as a Docker volume.
### Convention
Each plugin is a Python module (single `.py` file or package directory) that defines a `create_plugin(settings: Settings)` function. This factory function:
- Receives the application settings (for reading config like SMTP credentials)
- Returns a plugin instance, or `None` if the plugin should not activate (e.g. SMTP not configured)
```python
# Example: proxy_pool/plugins/builtin/parsers/plaintext.py
class PlaintextParser:
name = "plaintext"
def supports(self, url: str) -> bool:
return url.endswith(".txt")
async def parse(self, raw: bytes, source: ProxySource) -> list[DiscoveredProxy]:
# ... parsing logic ...
return results
def default_schedule(self) -> str | None:
return "*/30 * * * *"
def create_plugin(settings: Settings) -> PlaintextParser:
return PlaintextParser()
```
### Discovery algorithm
```python
async def discover_plugins(plugins_dir: Path, registry: PluginRegistry, settings: Settings):
for path in sorted(plugins_dir.rglob("*.py")):
if path.name.startswith("_"):
continue
module = importlib.import_module(derive_module_path(path))
if not hasattr(module, "create_plugin"):
continue
plugin = module.create_plugin(settings)
if plugin is None:
continue # Plugin opted out (unconfigured)
# Type-based routing:
match plugin:
case SourceParser(): registry.register_parser(plugin)
case ProxyChecker(): registry.register_checker(plugin)
case Notifier(): registry.register_notifier(plugin)
```
The `match` statement uses structural pattern matching against `@runtime_checkable` Protocol types. The order matters — if a plugin somehow satisfies multiple Protocols, the first match wins.
## Wiring into FastAPI
The plugin registry is created during the FastAPI lifespan and stored on `app.state`:
```python
@asynccontextmanager
async def lifespan(app: FastAPI):
settings = get_settings()
registry = PluginRegistry()
# Discover from both builtin and contrib directories
await discover_plugins(Path("proxy_pool/plugins/builtin"), registry, settings)
await discover_plugins(Path("/app/plugins-contrib"), registry, settings)
# Health-check notifiers
for notifier in registry.notifiers:
healthy = await notifier.health_check()
if not healthy:
logger.warning(f"Notifier '{notifier.name}' failed health check at startup")
app.state.registry = registry
# ... other setup (db, redis) ...
yield
# ... cleanup ...
```
Route handlers access the registry via a FastAPI dependency:
```python
async def get_registry(request: Request) -> PluginRegistry:
return request.app.state.registry
```
## Writing a third-party plugin
1. Create a `.py` file in `plugins/contrib/` (or mount a directory as a Docker volume at `/app/plugins-contrib/`).
2. Implement the relevant Protocol methods and attributes.
3. Define `create_plugin(settings: Settings) -> YourPlugin | None`.
4. Restart the application. The plugin will be discovered and registered automatically.
No inheritance required. No registration decorators. Just implement the shape and provide the factory.
### Testing a plugin
```python
from proxy_pool.config import Settings
from your_plugin import create_plugin
def test_plugin_creates_successfully():
settings = Settings(smtp_host="localhost", ...)
plugin = create_plugin(settings)
assert plugin is not None
assert plugin.name == "my_custom_parser"
async def test_parser_extracts_proxies():
plugin = create_plugin(Settings(...))
raw = b"192.168.1.1:8080\n10.0.0.1:3128\n"
source = make_test_source()
results = await plugin.parse(raw, source)
assert len(results) == 2
assert results[0].ip == "192.168.1.1"
```

docs/03-database-schema.md Normal file

@ -0,0 +1,204 @@
# Database schema reference
## Overview
All tables use UUID primary keys (generated client-side via `uuid4()`), `timestamptz` for datetime columns, and follow a consistent naming convention: `snake_case` table names, singular for join/config tables, plural for entity tables.
The schema is managed by Alembic. Never modify tables directly — always create a migration.
## Proxy domain tables
### proxy_sources
Configurable scrape targets. Each record defines a URL to fetch, a parser to use, and a schedule.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | `uuid` | PK, default uuid4 | |
| `url` | `varchar(2048)` | UNIQUE, NOT NULL | The URL to scrape |
| `parser_name` | `varchar(64)` | NOT NULL | Maps to a registered `SourceParser.name` |
| `cron_schedule` | `varchar(64)` | nullable | Cron expression for scrape frequency. Falls back to the parser's `default_schedule()` if NULL |
| `default_protocol` | `enum(proxy_protocol)` | NOT NULL, default `http` | Protocol to assign when the parser can't determine it from the source |
| `is_active` | `boolean` | NOT NULL, default `true` | Inactive sources are skipped by the scrape task |
| `last_scraped_at` | `timestamptz` | nullable | Timestamp of the last successful scrape |
| `created_at` | `timestamptz` | NOT NULL, server default `now()` | |
**Rationale**: Storing the parser name rather than auto-detecting every time allows explicit control. A source might look like a plain text file but actually need a custom parser.
### proxies
The core proxy table. Each record represents a unique `(ip, port, protocol)` combination.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | `uuid` | PK, default uuid4 | |
| `ip` | `inet` | NOT NULL | IPv4 or IPv6 address |
| `port` | `integer` | NOT NULL | Port number (1-65535) |
| `protocol` | `enum(proxy_protocol)` | NOT NULL | `http`, `https`, `socks4`, `socks5` |
| `source_id` | `uuid` | FK → proxy_sources.id, NOT NULL | Which source discovered this proxy |
| `status` | `enum(proxy_status)` | NOT NULL, default `unchecked` | `unchecked`, `active`, `dead` |
| `anonymity` | `enum(anonymity_level)` | nullable | `transparent`, `anonymous`, `elite` |
| `exit_ip` | `inet` | nullable | The IP address seen by the target when using this proxy |
| `country` | `varchar(2)` | nullable | ISO 3166-1 alpha-2 country code of the exit IP |
| `score` | `float` | NOT NULL, default `0.0` | Composite quality score (0.0-1.0) |
| `avg_latency_ms` | `float` | nullable | Rolling average latency across recent checks |
| `uptime_pct` | `float` | nullable | Percentage of checks that passed (0.0-100.0) |
| `first_seen_at` | `timestamptz` | NOT NULL, server default `now()` | When this proxy was first discovered |
| `last_checked_at` | `timestamptz` | nullable | When the last validation check completed |
| `created_at` | `timestamptz` | NOT NULL, server default `now()` | |
**Indexes**:
| Name | Columns | Type | Purpose |
|------|---------|------|---------|
| `ix_proxies_ip_port_proto` | `(ip, port, protocol)` | UNIQUE | Deduplication on upsert |
| `ix_proxies_status_score` | `(status, score)` | B-tree | Fast filtering for "active proxies sorted by score" |
**Design note**: The same `ip:port` can appear multiple times if it supports different protocols (e.g., HTTP on port 8080 and SOCKS5 on port 1080). The composite unique index enforces this correctly.
**Computed columns**: `score`, `avg_latency_ms`, and `uptime_pct` are denormalized from `proxy_checks`. They are recomputed by the validation pipeline after each check run and by a periodic rollup task. This avoids expensive aggregation queries on every proxy list request.
### proxy_checks
Append-only log of every validation check attempt. This is the raw data behind the computed fields on `proxies`.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | `uuid` | PK, default uuid4 | |
| `proxy_id` | `uuid` | FK → proxies.id ON DELETE CASCADE, NOT NULL | |
| `checker_name` | `varchar(64)` | NOT NULL | The `ProxyChecker.name` that ran this check |
| `stage` | `integer` | NOT NULL | Pipeline stage number |
| `passed` | `boolean` | NOT NULL | Whether the check succeeded |
| `latency_ms` | `float` | nullable | Time taken for this specific check |
| `detail` | `text` | nullable | Human-readable result description or error message |
| `exit_ip` | `inet` | nullable | Exit IP discovered during this check (if applicable) |
| `created_at` | `timestamptz` | NOT NULL, server default `now()` | |
**Indexes**:
| Name | Columns | Purpose |
|------|---------|---------|
| `ix_checks_proxy_created` | `(proxy_id, created_at)` | Efficient history queries per proxy |
**Retention**: This table grows fast. A periodic cleanup task (`tasks_cleanup.prune_checks`) deletes rows older than a configurable retention period (default: 7 days), keeping only the most recent N checks per proxy.
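A sketch of what such a prune could look like in SQL. The retention interval and per-proxy keep count are configurable; N=100 is assumed here, and the real task may delete in batches:

```sql
-- Illustrative pruning query combining the age cutoff with a
-- per-proxy "keep the newest N" window.
DELETE FROM proxy_checks
WHERE created_at < now() - interval '7 days'
  AND id NOT IN (
    SELECT id FROM (
      SELECT id,
             row_number() OVER (PARTITION BY proxy_id
                                ORDER BY created_at DESC) AS rn
      FROM proxy_checks
    ) ranked
    WHERE rn <= 100  -- keep count N is assumed
  );
```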
### proxy_tags
Flexible key-value labels for proxies. Useful for user-defined categorization (e.g., `datacenter: true`, `provider: aws`, `tested_site: google.com`).
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | `uuid` | PK, default uuid4 | |
| `proxy_id` | `uuid` | FK → proxies.id ON DELETE CASCADE, NOT NULL | |
| `key` | `varchar(64)` | NOT NULL | Tag name |
| `value` | `varchar(256)` | NOT NULL | Tag value |
**Indexes**:
| Name | Columns | Type | Purpose |
|------|---------|------|---------|
| `ix_tags_proxy_key` | `(proxy_id, key)` | UNIQUE | One value per key per proxy |
## Accounts domain tables
### users
User accounts. Minimal by design — the primary purpose is to own API keys and credits.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | `uuid` | PK, default uuid4 | |
| `email` | `varchar(320)` | UNIQUE, NOT NULL | Used for notifications and account recovery |
| `display_name` | `varchar(128)` | nullable | |
| `is_active` | `boolean` | NOT NULL, default `true` | Inactive users cannot authenticate |
| `created_at` | `timestamptz` | NOT NULL, server default `now()` | |
### api_keys
API keys for authentication. The raw key is shown once at creation; only the hash is stored.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | `uuid` | PK, default uuid4 | |
| `user_id` | `uuid` | FK → users.id ON DELETE CASCADE, NOT NULL | |
| `key_hash` | `varchar(128)` | NOT NULL | SHA-256 hash of the raw API key |
| `prefix` | `varchar(8)` | NOT NULL | First 8 characters of the raw key, for quick lookup |
| `label` | `varchar(128)` | nullable | User-assigned label (e.g., "production", "testing") |
| `is_active` | `boolean` | NOT NULL, default `true` | Revoked keys have `is_active = false` |
| `last_used_at` | `timestamptz` | nullable | Updated on each authenticated request |
| `expires_at` | `timestamptz` | nullable | NULL means no expiration |
| `created_at` | `timestamptz` | NOT NULL, server default `now()` | |
**Indexes**:
| Name | Columns | Type | Purpose |
|------|---------|------|---------|
| `ix_api_keys_hash` | `(key_hash)` | UNIQUE | Uniqueness constraint on key hashes |
| `ix_api_keys_prefix` | `(prefix)` | B-tree | Fast prefix-based lookup before full hash comparison |
**Auth flow**: On each request, the middleware extracts the API key from the `Authorization: Bearer <key>` header, computes `prefix = key[:8]`, queries `api_keys WHERE prefix = ? AND is_active = true AND (expires_at IS NULL OR expires_at > now())`, then verifies `sha256(key) == key_hash`. This two-step approach avoids computing a hash against every key in the database.
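The hash comparison step can be sketched as follows. The record shape and the constant-time comparison are illustrative choices, not necessarily the project's exact code:

```python
import hashlib
import hmac
import secrets

def make_record(raw_key: str) -> dict:
    """Store only the prefix and the SHA-256 hash, never the raw key."""
    return {"prefix": raw_key[:8],
            "key_hash": hashlib.sha256(raw_key.encode()).hexdigest()}

def verify(raw_key: str, record: dict) -> bool:
    if raw_key[:8] != record["prefix"]:
        return False  # cheap filter: the prefix-indexed DB query would not return this row
    candidate = hashlib.sha256(raw_key.encode()).hexdigest()
    return hmac.compare_digest(candidate, record["key_hash"])

key = "pp_" + secrets.token_hex(24)
record = make_record(key)
assert verify(key, record)
assert not verify("pp_wrongkey" + "0" * 37, record)
```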
### credit_ledger
Append-only ledger of all credit transactions. Current balance is `SELECT SUM(amount) FROM credit_ledger WHERE user_id = ?`.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | `uuid` | PK, default uuid4 | |
| `user_id` | `uuid` | FK → users.id ON DELETE CASCADE, NOT NULL | |
| `amount` | `integer` | NOT NULL | Positive = credit in, negative = debit |
| `tx_type` | `enum(credit_tx_type)` | NOT NULL | `purchase`, `acquire`, `refund`, `admin_adjust` |
| `description` | `text` | nullable | Human-readable note |
| `reference_id` | `uuid` | nullable | Links to the related entity (e.g., lease ID for `acquire` transactions) |
| `created_at` | `timestamptz` | NOT NULL, server default `now()` | |
**Indexes**:
| Name | Columns | Purpose |
|------|---------|---------|
| `ix_ledger_user_created` | `(user_id, created_at)` | Balance computation and history queries |
**Caching**: The computed balance is cached in Redis under `credits:{user_id}`. The cache is invalidated (DEL) whenever a new ledger entry is created. Cache miss triggers a `SUM(amount)` query.
**Concurrency**: Because balance is derived from a SUM, concurrent inserts don't cause race conditions on the balance itself. The acquire endpoint uses `SELECT ... FOR UPDATE` on the user row to serialize credit checks, preventing double-spending under high concurrency.
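The serialized debit described above has roughly this transactional shape (illustrative raw SQL; the application issues the equivalent through SQLAlchemy):

```sql
BEGIN;
SELECT id FROM users WHERE id = :user_id FOR UPDATE;   -- serialize credit checks per user
SELECT COALESCE(SUM(amount), 0) FROM credit_ledger
  WHERE user_id = :user_id;                            -- current balance
-- the application aborts here if the balance is below 1
INSERT INTO credit_ledger (id, user_id, amount, tx_type, reference_id)
VALUES (gen_random_uuid(), :user_id, -1, 'acquire', :lease_id);
COMMIT;
```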
### proxy_leases
Tracks which proxies are currently checked out by which users. Both Redis (for fast lookup) and PostgreSQL (for audit trail) maintain lease state.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | `uuid` | PK, default uuid4 | |
| `user_id` | `uuid` | FK → users.id, NOT NULL | |
| `proxy_id` | `uuid` | FK → proxies.id, NOT NULL | |
| `acquired_at` | `timestamptz` | NOT NULL, server default `now()` | |
| `expires_at` | `timestamptz` | NOT NULL | When the lease automatically releases |
| `is_released` | `boolean` | NOT NULL, default `false` | Set to true on explicit release or expiration cleanup |
**Indexes**:
| Name | Columns | Purpose |
|------|---------|---------|
| `ix_leases_user` | `(user_id)` | List a user's active leases |
| `ix_leases_proxy_active` | `(proxy_id, is_released)` | Check if a proxy is currently leased |
**Dual state**: Redis holds the lease as `lease:{proxy_id}` with a TTL matching `expires_at`. The proxy selection query excludes proxies with an active Redis lease key. The PostgreSQL record exists for audit, billing reconciliation, and cleanup if Redis state is lost.
## Enum types
All enums are PostgreSQL native enums created via `CREATE TYPE`:
| Enum name | Values |
|-----------|--------|
| `proxy_protocol` | `http`, `https`, `socks4`, `socks5` |
| `proxy_status` | `unchecked`, `active`, `dead` |
| `anonymity_level` | `transparent`, `anonymous`, `elite` |
| `credit_tx_type` | `purchase`, `acquire`, `refund`, `admin_adjust` |
## Migration conventions
- One migration per logical change. Don't bundle unrelated schema changes.
- Migration filenames: `NNN_descriptive_name.py` (e.g., `001_initial_schema.py`).
- Always include both `upgrade()` and `downgrade()` functions.
- Test migrations against a fresh database AND against a database with existing data.
- Use `alembic revision --autogenerate -m "description"` for model-driven changes, but always review the generated SQL before applying.

docs/04-api-reference.md Normal file

@ -0,0 +1,405 @@
# API reference
## Authentication
All endpoints except `POST /auth/register` and `GET /health` require an API key in the `Authorization` header:
```
Authorization: Bearer pp_a1b2c3d4e5f6g7h8i9j0...
```
API keys use a `pp_` prefix (proxy pool) followed by 48 random characters. The prefix aids visual identification and log filtering.
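A key in this format could be generated like so. The exact character alphabet is an assumption; only the `pp_` prefix and 48-character random suffix are documented:

```python
import secrets
import string

ALPHABET = string.ascii_lowercase + string.digits  # assumed alphabet

def generate_api_key() -> str:
    """Return a key of the documented shape: 'pp_' + 48 random characters."""
    return "pp_" + "".join(secrets.choice(ALPHABET) for _ in range(48))

key = generate_api_key()
assert key.startswith("pp_") and len(key) == 51
```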
Unauthenticated or invalid requests return `401 Unauthorized`. Requests with valid keys but insufficient credits return `402 Payment Required`.
## Common response patterns
### Pagination
List endpoints support cursor-based pagination:
```
GET /proxies?limit=50&cursor=eyJpZCI6Ii4uLiJ9
```
Response includes a `next_cursor` field when more results exist:
```json
{
"items": [...],
"next_cursor": "eyJpZCI6Ii4uLiJ9",
"total_count": 1234
}
```
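The cursor is opaque to clients; the sample value above is base64-encoded JSON (it decodes to `{"id": "..."}`), so a plausible encode/decode sketch is:

```python
import base64
import json

def encode_cursor(payload: dict) -> str:
    """Serialize a cursor payload to an opaque URL-safe string."""
    return base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()

def decode_cursor(cursor: str) -> dict:
    return json.loads(base64.urlsafe_b64decode(cursor.encode()))

cursor = encode_cursor({"id": "..."})
assert decode_cursor(cursor) == {"id": "..."}
```

Clients should treat the cursor as an opaque token; the internal payload format may change without notice.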
### Error responses
All errors follow a consistent shape:
```json
{
"error": {
"code": "INSUFFICIENT_CREDITS",
"message": "You need at least 1 credit to acquire a proxy. Current balance: 0.",
"details": {}
}
}
```
## Proxy domain endpoints
### Sources
#### `GET /sources`
List all configured proxy sources.
**Query parameters**: `is_active` (bool, optional), `limit` (int, default 50), `cursor` (string, optional).
**Response** `200`:
```json
{
"items": [
{
"id": "uuid",
"url": "https://example.com/proxies.txt",
"parser_name": "plaintext",
"cron_schedule": "*/30 * * * *",
"default_protocol": "http",
"is_active": true,
"last_scraped_at": "2025-01-15T10:30:00Z",
"created_at": "2025-01-01T00:00:00Z"
}
],
"next_cursor": null,
"total_count": 5
}
```
#### `POST /sources`
Add a new proxy source.
**Request body**:
```json
{
"url": "https://example.com/proxies.txt",
"parser_name": "plaintext",
"cron_schedule": "*/30 * * * *",
"default_protocol": "http"
}
```
**Validation**: The `parser_name` must match a registered plugin. If omitted, the registry attempts auto-detection via `supports()`. Returns `422` if no matching parser is found.
**Response** `201`: The created source object.
#### `PATCH /sources/{source_id}`
Update a source. All fields are optional — only provided fields are changed.
#### `DELETE /sources/{source_id}`
Delete a source. Associated proxies are NOT deleted (they may have been discovered by multiple sources).
#### `POST /sources/{source_id}/scrape`
Trigger an immediate scrape of the source, bypassing the cron schedule. Returns the scrape result.
**Response** `200`:
```json
{
"source_id": "uuid",
"proxies_discovered": 142,
"proxies_new": 23,
"proxies_updated": 119,
"duration_ms": 1540
}
```
### Proxies
#### `GET /proxies`
Query the proxy pool with filtering and sorting.
**Query parameters**:
| Parameter | Type | Description |
|-----------|------|-------------|
| `status` | string | Filter by status: `active`, `dead`, `unchecked` |
| `protocol` | string | Filter by protocol: `http`, `https`, `socks4`, `socks5` |
| `anonymity` | string | Filter by anonymity: `transparent`, `anonymous`, `elite` |
| `country` | string | ISO 3166-1 alpha-2 country code |
| `min_score` | float | Minimum composite score (0.0-1.0) |
| `max_latency_ms` | float | Maximum average latency |
| `min_uptime_pct` | float | Minimum uptime percentage |
| `verified_within_minutes` | int | Only proxies checked within the last N minutes |
| `sort` | string | Sort field: `score`, `latency`, `uptime`, `last_checked` |
| `order` | string | Sort order: `asc`, `desc` (default: `desc` for score) |
| `limit` | int | Results per page (default: 50, max: 200) |
| `cursor` | string | Pagination cursor |
**Response** `200`:
```json
{
"items": [
{
"id": "uuid",
"ip": "203.0.113.42",
"port": 8080,
"protocol": "http",
"status": "active",
"anonymity": "elite",
"exit_ip": "203.0.113.42",
"country": "US",
"score": 0.87,
"avg_latency_ms": 245.3,
"uptime_pct": 94.2,
"last_checked_at": "2025-01-15T10:25:00Z",
"first_seen_at": "2025-01-10T08:00:00Z",
"tags": {"provider": "datacenter"}
}
],
"next_cursor": "...",
"total_count": 892
}
```
#### `GET /proxies/{proxy_id}`
Get detailed info for a single proxy, including recent check history.
**Response** `200`:
```json
{
"proxy": { ... },
"recent_checks": [
{
"checker_name": "tcp_connect",
"stage": 1,
"passed": true,
"latency_ms": 120.5,
"detail": "TCP connect OK",
"created_at": "2025-01-15T10:25:00Z"
}
]
}
```
#### `POST /proxies/acquire`
Acquire a proxy with an exclusive lease. Costs 1 credit.
**Request body**:
```json
{
"protocol": "http",
"country": "US",
"anonymity": "elite",
"min_score": 0.7,
"lease_duration_seconds": 300
}
```
All filter fields are optional. `lease_duration_seconds` defaults to 300 (5 minutes), max 3600 (1 hour).
**Response** `200`:
```json
{
"lease_id": "uuid",
"proxy": {
"ip": "203.0.113.42",
"port": 8080,
"protocol": "http",
"country": "US",
"anonymity": "elite"
},
"expires_at": "2025-01-15T10:30:00Z",
"credits_remaining": 42
}
```
**Error responses**: `402` if insufficient credits, `404` if no proxy matches the filters, `409` if all matching proxies are currently leased.
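Each failure code suggests a different client reaction. A sketch of that branching (the helper and action names are hypothetical):

```python
def next_action(status_code: int) -> str:
    """Map acquire-endpoint outcomes to a client-side reaction."""
    if status_code == 200:
        return "use_lease"
    if status_code == 402:
        return "buy_credits"         # insufficient credits: retrying won't help
    if status_code == 404:
        return "relax_filters"       # nothing matches: widen country/score filters
    if status_code == 409:
        return "retry_with_backoff"  # all matches leased: capacity frees as leases expire
    return "raise_error"
```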
#### `POST /proxies/acquire/{lease_id}/release`
Release a lease early. The proxy becomes available immediately. The credit is NOT refunded (credits are consumed on acquisition).
#### `POST /proxies/test`
Test whether good proxies can reach a specific URL.
**Request body**:
```json
{
"url": "https://example.com",
"count": 5,
"timeout_seconds": 10,
"protocol": "http",
"country": "US"
}
```
**Response** `200`:
```json
{
"url": "https://example.com",
"results": [
{
"proxy_id": "uuid",
"ip": "203.0.113.42",
"port": 8080,
"reachable": true,
"status_code": 200,
"latency_ms": 340,
"error": null
},
{
"proxy_id": "uuid",
"ip": "198.51.100.10",
"port": 3128,
"reachable": false,
"status_code": null,
"latency_ms": null,
"error": "Connection refused by target"
}
],
"success_rate": 0.8
}
```
### Stats
#### `GET /stats/pool`
Pool health overview.
**Response** `200`:
```json
{
"total_proxies": 15420,
"by_status": {"active": 3200, "dead": 11800, "unchecked": 420},
"by_protocol": {"http": 8000, "https": 4000, "socks4": 1200, "socks5": 2220},
"by_anonymity": {"transparent": 1500, "anonymous": 1000, "elite": 700},
"avg_score": 0.62,
"avg_latency_ms": 380.5,
"sources_active": 12,
"sources_total": 15,
"last_scrape_at": "2025-01-15T10:30:00Z",
"last_validation_at": "2025-01-15T10:25:00Z"
}
```
#### `GET /stats/plugins`
Plugin registry status.
**Response** `200`:
```json
{
"parsers": [{"name": "plaintext", "type": "SourceParser"}],
"checkers": [
{"name": "tcp_connect", "stage": 1, "priority": 0},
{"name": "http_anonymity", "stage": 2, "priority": 0}
],
"notifiers": [
{"name": "smtp", "healthy": true, "subscribes_to": ["proxy.pool_low", "credits.*"]},
{"name": "webhook", "healthy": false, "subscribes_to": ["*"]}
]
}
```
## Accounts domain endpoints
### Auth
#### `POST /auth/register`
Create a new user account and initial API key. No authentication required.
**Request body**:
```json
{
"email": "user@example.com",
"display_name": "Alice"
}
```
**Response** `201`:
```json
{
"user": {"id": "uuid", "email": "user@example.com", "display_name": "Alice"},
"api_key": {
"id": "uuid",
"key": "pp_a1b2c3d4e5f6g7h8i9j0k1l2m3n4o5p6q7r8s9t0u1v2w3x4",
"prefix": "pp_a1b2c",
"label": "default"
}
}
```
**Important**: The `key` field is the raw API key. It is returned ONLY in this response. Store it securely — it cannot be retrieved again.
#### `POST /auth/keys`
Create an additional API key for the authenticated user.
#### `GET /auth/keys`
List all API keys for the authenticated user (returns prefix and metadata, never the full key).
#### `DELETE /auth/keys/{key_id}`
Revoke an API key.
### Account
#### `GET /account`
Get the authenticated user's account info.
#### `GET /account/credits`
Get current credit balance and recent transaction history.
**Response** `200`:
```json
{
"balance": 42,
"recent_transactions": [
{
"amount": -1,
"tx_type": "acquire",
"description": "Proxy acquired: 203.0.113.42:8080",
"created_at": "2025-01-15T10:25:00Z"
},
{
"amount": 100,
"tx_type": "purchase",
"description": "Credit purchase",
"created_at": "2025-01-14T00:00:00Z"
}
]
}
```
#### `GET /account/leases`
List the authenticated user's active proxy leases.
## System endpoints
#### `GET /health`
Basic health check. No authentication required.
**Response** `200`:
```json
{
"status": "healthy",
"postgres": "connected",
"redis": "connected",
"version": "0.1.0"
}
```

---
`docs/05-worker-tasks.md`
# Worker and task reference
## Overview
Background tasks run in a separate ARQ worker process. The worker connects to the same PostgreSQL and Redis instances as the API. Tasks are defined in `proxy_pool.worker.tasks_*` modules and registered in `proxy_pool.worker.settings`.
## Running the worker
```bash
# Development
uv run arq proxy_pool.worker.settings.WorkerSettings
# Docker
docker compose up worker
```
The worker process is independent of the API process. You can run multiple worker instances, though for most deployments one is sufficient (ARQ handles job deduplication via Redis).
## Worker settings
```python
# proxy_pool/worker/settings.py
from arq.connections import RedisSettings
from arq.cron import cron

from proxy_pool.config import settings  # task functions are imported alongside (omitted here)

class WorkerSettings:
functions = [
scrape_source,
scrape_all,
validate_proxy,
revalidate_sweep,
prune_dead_proxies,
prune_old_checks,
expire_leases,
]
cron_jobs = [
cron(scrape_all, minute={0, 30}), # Every 30 minutes
cron(revalidate_sweep, minute={10, 25, 40, 55}), # Every 15 minutes
cron(prune_dead_proxies, hour={3}, minute={0}), # Daily at 3:00 AM
cron(prune_old_checks, hour={4}, minute={0}), # Daily at 4:00 AM
cron(expire_leases, minute=set(range(60))), # Every minute
]
redis_settings = RedisSettings.from_dsn(settings.redis_url)
max_jobs = 50
job_timeout = 300 # 5 minutes
keep_result = 3600 # Keep results for 1 hour
```
## Task definitions
### Scrape tasks
#### `scrape_all(ctx)`
Periodic task that iterates over all active `ProxySource` records and enqueues a `scrape_source` job for each one. Sources whose `cron_schedule` or parser's `default_schedule()` indicates they aren't due yet are skipped.
**Schedule**: Every 30 minutes (configurable).
**Behavior**: Enqueues individual `scrape_source` jobs rather than scraping inline. This allows the worker pool to parallelize across sources and provides per-source error isolation.
#### `scrape_source(ctx, source_id: str)`
Fetches the URL for a single `ProxySource`, selects the appropriate `SourceParser` plugin, parses the content, and upserts discovered proxies.
**Steps**:
1. Load the `ProxySource` by ID.
2. Fetch the URL via `httpx.AsyncClient` with a configurable timeout (default: 30s).
3. Look up the parser by `source.parser_name` in the plugin registry.
4. Call `parser.parse(raw_bytes, source)` to get a list of `DiscoveredProxy`.
5. Upsert each proxy using `INSERT ... ON CONFLICT (ip, port, protocol) DO UPDATE SET source_id = ?, last_seen_at = now()`.
6. Update `source.last_scraped_at`.
7. Emit `proxy.new_batch` event if new proxies were discovered.
8. On failure, emit `source.failed` event and log the error.
**Error handling**: HTTP errors, parse errors, and database errors are caught and logged. The source is not deactivated on failure — transient errors are expected. A separate `source.stale` event is emitted if a source hasn't produced results in a configurable number of hours.
**Timeout**: 60 seconds (includes fetch + parse + upsert).
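As an illustration of the parse step, a minimal plaintext parser could look like this (a sketch; the built-in `plaintext` plugin's exact behavior may differ):

```python
def parse_plaintext(raw: bytes, default_protocol: str = "http"):
    """Parse 'ip:port' lines into (ip, port, protocol) tuples, skipping junk lines."""
    proxies = []
    for line in raw.decode("utf-8", errors="replace").splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and comments
            continue
        ip, sep, port = line.partition(":")
        if sep and port.isdigit() and 0 < int(port) < 65536:
            proxies.append((ip, int(port), default_protocol))
    return proxies
```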
### Validation tasks
#### `revalidate_sweep(ctx)`
Periodic task that selects proxies due for revalidation and enqueues `validate_proxy` jobs.
**Selection criteria** (in priority order):
1. Proxies with `status = unchecked` (never validated, highest priority).
2. Proxies with `status = active` and `last_checked_at < now() - interval` (stale active proxies).
3. Proxies with `status = dead` and `last_checked_at < now() - longer_interval` (periodic dead re-check, lower frequency).
**Configurable intervals**:
- Active proxy recheck: every 10 minutes (default).
- Dead proxy recheck: every 6 hours (default).
- Batch size per sweep: 200 proxies (default).
**Schedule**: Every 15 minutes.
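The selection rules can be sketched as a priority function over a proxy's status and last check time (intervals are the defaults above; illustrative, not the actual query):

```python
from datetime import datetime, timedelta, timezone


def due_for_revalidation(status, last_checked_at, now=None,
                         active_interval=timedelta(minutes=10),
                         dead_interval=timedelta(hours=6)):
    """Return a sort key -- lower tuples are revalidated first; None means skip."""
    now = now or datetime.now(timezone.utc)
    if status == "unchecked":
        return (0,)                  # never validated: highest priority
    if status == "active" and last_checked_at < now - active_interval:
        return (1, last_checked_at)  # stale active proxies next
    if status == "dead" and last_checked_at < now - dead_interval:
        return (2, last_checked_at)  # periodic dead re-checks, lowest priority
    return None                      # not due this sweep
```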
#### `validate_proxy(ctx, proxy_id: str)`
Runs the full checker pipeline for a single proxy.
**Steps**:
1. Load the `Proxy` by ID.
2. Create a `CheckContext` with a fresh `httpx.AsyncClient`.
3. Call `run_checker_pipeline(proxy, registry, http_client, db_session)`.
4. The pipeline runs all registered checkers in stage order (see plugin system docs).
5. Compute composite score from results.
6. Update the proxy record with new status, score, latency, uptime, exit IP, country, anonymity.
7. If the proxy transitioned from `active` to `dead` or vice versa, check pool health thresholds and emit `proxy.pool_low` if needed.
**Timeout**: 120 seconds (individual checker timeouts are enforced within the pipeline).
**Concurrency**: Multiple `validate_proxy` jobs can run simultaneously. Each job operates on a different proxy, so there are no conflicts. ARQ's `job_id` parameter is set to `validate:{proxy_id}` to prevent duplicate validation of the same proxy.
### Cleanup tasks
#### `prune_dead_proxies(ctx)`
Removes proxies that have been dead for an extended period.
**Criteria**: `status = dead` AND `last_checked_at < now() - retention_days` (default: 30 days).
**Behavior**: Hard deletes the proxy row. CASCADE deletes remove associated `proxy_checks`, `proxy_tags`, and `proxy_leases`.
**Schedule**: Daily at 3:00 AM.
#### `prune_old_checks(ctx)`
Trims the `proxy_checks` table to control storage growth.
**Strategy**: For each proxy, keep the most recent N check records (default: 100) and delete checks older than a retention period (default: 7 days). Both conditions must be met before a check is deleted — checks newer than the retention period are kept even when a proxy has more than 100 of them, and a proxy's 100 most recent checks are kept even when they are older than 7 days.
**Schedule**: Daily at 4:00 AM.
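The keep/delete rule can be written as a predicate over a check's recency rank and age (defaults as above; a sketch of the intent, not the actual SQL):

```python
from datetime import datetime, timedelta, timezone


def should_delete(rank: int, created_at: datetime, now: datetime,
                  keep_last: int = 100,
                  retention: timedelta = timedelta(days=7)) -> bool:
    """rank 1 = the proxy's most recent check. Delete only when BOTH conditions hold."""
    outside_keep_window = rank > keep_last
    older_than_retention = created_at < now - retention
    return outside_keep_window and older_than_retention
```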
#### `expire_leases(ctx)`
Cleans up expired proxy leases.
**Steps**:
1. Query `proxy_leases WHERE is_released = false AND expires_at < now()`.
2. For each expired lease, set `is_released = true`.
3. Delete the corresponding Redis lease key (if it still exists — it should have expired via TTL, but this is a safety net).
**Schedule**: Every minute.
**Note**: Redis TTL is the primary expiration mechanism. This task is a consistency backstop that ensures the PostgreSQL records are accurate even if Redis keys expire silently.
## Task retry behavior
ARQ retries are configured per-task:
| Task | Max retries | Retry delay |
|------|-------------|-------------|
| `scrape_source` | 2 | 60s exponential |
| `validate_proxy` | 1 | 30s |
| `prune_dead_proxies` | 0 | — |
| `prune_old_checks` | 0 | — |
| `expire_leases` | 1 | 10s |
Retry delays for `scrape_source` use exponential backoff; the other delays are fixed. Tasks that fail after exhausting their retries are logged, and the job result is stored in Redis for inspection.
## Monitoring
### Job results
ARQ stores job results in Redis for `keep_result` seconds (default: 3600). Query results via:
```python
# inside an async context
from arq import create_pool
from arq.connections import RedisSettings
from arq.jobs import Job

redis = await create_pool(RedisSettings.from_dsn("redis://localhost:6379/0"))
result = await Job("job_id", redis).result_info()
```
### Health indicators
The `GET /stats/pool` endpoint includes `last_scrape_at` and `last_validation_at` timestamps. If these fall behind schedule, the worker may be down or stuck.
### Logging
Tasks log at structured INFO level on start/completion and WARN/ERROR on failures:
```
INFO scrape_source source_id=abc count_new=23 count_updated=119 duration_ms=1540
WARN scrape_source source_id=def error="HTTP 503" retrying=true attempt=2
ERROR validate_proxy proxy_id=ghi error="Pipeline timeout after 120s"
```

---
# Development guide
## Prerequisites
- Python 3.12+
- [uv](https://docs.astral.sh/uv/) (package manager)
- Docker and Docker Compose (for dependencies and testing)
- Git
## Initial setup
### 1. Clone and install
```bash
git clone <repo-url> proxy-pool
cd proxy-pool
# Install all dependencies (including dev) in a virtual env
uv sync
# Verify installation
uv run python -c "import proxy_pool; print('OK')"
```
`uv sync` creates a `.venv/` in the project root, installs all dependencies from `uv.lock`, and installs the `proxy_pool` package in editable mode (thanks to the `src/` layout and `pyproject.toml` build config).
### 2. Start infrastructure
```bash
# Start PostgreSQL and Redis
docker compose up -d postgres redis
# Verify they're running
docker compose ps
```
### 3. Configure environment
```bash
cp .env.example .env
# Edit .env with your local settings
```
Key variables:
```env
DATABASE_URL=postgresql+asyncpg://proxypool:proxypool@localhost:5432/proxypool
REDIS_URL=redis://localhost:6379/0
SECRET_KEY=your-random-secret-for-dev
LOG_LEVEL=DEBUG
# Optional: SMTP for notifier plugin testing
SMTP_HOST=
SMTP_PORT=587
SMTP_USER=
SMTP_PASSWORD=
ALERT_EMAIL=
```
### 4. Run migrations
```bash
uv run alembic upgrade head
```
### 5. Start the application
```bash
# API server (with hot reload)
uv run uvicorn proxy_pool.app:create_app --factory --reload --port 8000
# In a separate terminal: ARQ worker
uv run arq proxy_pool.worker.settings.WorkerSettings
```
The API is now available at `http://localhost:8000`. OpenAPI docs are at `http://localhost:8000/docs`.
## Project layout
```
proxy-pool/
├── src/proxy_pool/ # Application source code
│ ├── app.py # App factory + lifespan
│ ├── config.py # Settings (env-driven)
│ ├── common/ # Shared utilities
│ ├── db/ # Database infrastructure
│ ├── proxy/ # Proxy domain module
│ ├── accounts/ # Accounts domain module
│ ├── plugins/ # Plugin system + built-in plugins
│ └── worker/ # ARQ task definitions
├── tests/ # Test suite
├── alembic/ # Migration files
├── docs/ # This documentation
└── pyproject.toml # Project config (uv, ruff, mypy, pytest)
```
See `01-architecture.md` for detailed structure and rationale.
## Working with the database
### Creating a migration
```bash
# Auto-generate from model changes
uv run alembic revision --autogenerate -m "add proxy_tags table"
# Review the generated migration!
cat alembic/versions/NNN_add_proxy_tags_table.py
# Apply it
uv run alembic upgrade head
```
Always review autogenerated migrations. Alembic can miss custom indexes, enum type changes, and data migrations. Common things to verify:
- Enum types are created/altered correctly.
- Index names match the naming convention.
- `downgrade()` reverses the change completely.
- No data is dropped unintentionally.
### Useful Alembic commands
```bash
# Show current revision
uv run alembic current
# Show migration history
uv run alembic history --verbose
# Downgrade one step
uv run alembic downgrade -1
# Downgrade to a specific revision
uv run alembic downgrade abc123
# Generate a blank migration (for data migrations)
uv run alembic revision -m "backfill proxy scores"
```
### Database shell
```bash
# Via Docker
docker compose exec postgres psql -U proxypool
# Or directly
psql postgresql://proxypool:proxypool@localhost:5432/proxypool
```
## Running tests
### Quick: unit tests only (no Docker needed)
```bash
uv run pytest tests/unit/ -x -v
```
### Full: integration tests with Docker dependencies
```bash
# Start test infrastructure
docker compose -f docker-compose.yml -f docker-compose.test.yml up -d postgres redis
# Run all tests
uv run pytest tests/ -x -v --timeout=30
# Or run via Docker (how CI does it)
docker compose -f docker-compose.yml -f docker-compose.test.yml run --rm test
```
### Test organization
- `tests/unit/` — No I/O. All external dependencies are mocked. Fast.
- `tests/integration/` — Uses real PostgreSQL and Redis via Docker. Tests full request flows, database queries, and cache behavior.
- `tests/plugins/` — Plugin-specific tests. Most are unit tests, but some (like SMTP notifier) may use integration fixtures.
### Key fixtures (in `conftest.py`)
```python
@pytest.fixture
async def db_session():
"""Provides an async SQLAlchemy session rolled back after each test."""
@pytest.fixture
async def redis():
"""Provides a Redis connection flushed after each test."""
@pytest.fixture
async def client(db_session, redis):
"""Provides an httpx.AsyncClient wired to a test app instance."""
@pytest.fixture
def registry():
"""Provides a PluginRegistry with built-in plugins loaded."""
```
### Writing a test
```python
# tests/unit/test_scoring.py
from proxy_pool.proxy.service import compute_proxy_score
def test_score_weights_latency():
checks = [make_check(passed=True, latency_ms=100)]
score = compute_proxy_score(make_proxy(), checks, make_context())
assert 0.8 < score <= 1.0
def test_dead_proxy_gets_zero_score():
checks = [make_check(passed=False)]
score = compute_proxy_score(make_proxy(), checks, make_context())
assert score == 0.0
```
```python
# tests/integration/test_acquire_flow.py
async def test_acquire_deducts_credit(client, db_session):
user = await create_user_with_credits(db_session, credits=10)
await create_active_proxy(db_session)
response = await client.post(
"/proxies/acquire",
headers={"Authorization": f"Bearer {user.api_key}"},
json={"protocol": "http"},
)
assert response.status_code == 200
assert response.json()["credits_remaining"] == 9
```
## Code quality
### Linting and formatting
```bash
# Check
uv run ruff check src/ tests/
uv run ruff format --check src/ tests/
# Fix
uv run ruff check --fix src/ tests/
uv run ruff format src/ tests/
```
### Type checking
```bash
uv run mypy src/
```
`mypy` is configured with `strict = true` in `pyproject.toml`. The `pydantic.mypy` plugin is enabled for correct Pydantic model inference.
### Pre-commit (optional)
If you want automated checks on every commit:
```bash
uv tool install pre-commit
pre-commit install
```
## Docker workflow
### Build the image
```bash
docker compose build
```
### Run the full stack
```bash
# Run migrations + start API + worker
docker compose --profile migrate up -d migrate
docker compose up -d api worker
```
### View logs
```bash
docker compose logs -f api worker
```
### Rebuild after code changes
```bash
docker compose build api
docker compose up -d api worker
```
### Shell into a running container
```bash
docker compose exec api bash
docker compose exec postgres psql -U proxypool
docker compose exec redis redis-cli
```
## Adding a new plugin
1. Create a file in `src/proxy_pool/plugins/builtin/<type>/your_plugin.py`.
2. Implement the relevant Protocol (see `02-plugin-system.md`).
3. Define `create_plugin(settings: Settings) -> YourPlugin | None`.
4. Add tests in `tests/plugins/test_your_plugin.py`.
5. Restart the app — the plugin is auto-discovered.
For third-party plugins, place files in `plugins/contrib/` (or mount a directory at `/app/plugins-contrib` in Docker).
## Common development tasks
### Add a new API endpoint
1. Define Pydantic schemas in `<domain>/schemas.py`.
2. Add business logic in `<domain>/service.py`.
3. Create the route in `<domain>/router.py`.
4. Register the router in `app.py` if it's a new router.
5. Add tests.
### Add a new database table
1. Define the SQLAlchemy model in `<domain>/models.py`.
2. Import the model in `db/base.py` (so Alembic sees it).
3. Generate a migration: `uv run alembic revision --autogenerate -m "description"`.
4. Review and apply: `uv run alembic upgrade head`.
5. Add tests.
### Add a new background task
1. Define the task function in `worker/tasks_<category>.py`.
2. Register it in `worker/settings.py` (add to `functions` list, and `cron_jobs` if periodic).
3. Restart the worker.
4. Add tests.

---
`docs/07-operations-guide.md`
# Operations guide
## Deployment
### Docker Compose (single-server)
The simplest deployment for small-to-medium workloads. All services run on a single machine.
```bash
# Clone and configure
git clone <repo-url> proxy-pool && cd proxy-pool
cp .env.example .env
# Edit .env with production values
# Build and start
docker compose build
docker compose --profile migrate up -d migrate # Run migrations
docker compose up -d api worker # Start services
```
### Production considerations
**API scaling**: Run multiple API instances behind a load balancer. The API is stateless — any instance can handle any request. In Docker Compose, use `docker compose up -d --scale api=3`.
**Worker scaling**: Typically 1-2 worker instances are sufficient. ARQ deduplicates jobs via Redis, so multiple workers don't cause duplicate work. Scale workers if validation throughput is a bottleneck.
**Database**: Use a managed PostgreSQL service (AWS RDS, GCP Cloud SQL, etc.) for production. Enable connection pooling (PgBouncer) if running more than ~10 API instances.
**Redis**: A single Redis instance is sufficient for most workloads. Enable persistence (AOF or RDB snapshots) if you want lease state to survive Redis restarts. For high availability, use Redis Sentinel or a managed Redis service.
## Configuration reference
All configuration is via environment variables, parsed by `pydantic-settings`.
### Required
| Variable | Description | Example |
|----------|-------------|---------|
| `DATABASE_URL` | PostgreSQL connection string | `postgresql+asyncpg://user:pass@host:5432/db` |
| `REDIS_URL` | Redis connection string | `redis://host:6379/0` |
| `SECRET_KEY` | Used for internal signing (API key generation) | Random 64+ character string |
### Application
| Variable | Default | Description |
|----------|---------|-------------|
| `APP_NAME` | `proxy-pool` | Application name (appears in logs, OpenAPI docs) |
| `LOG_LEVEL` | `INFO` | Logging level: `DEBUG`, `INFO`, `WARNING`, `ERROR` |
| `CORS_ORIGINS` | `[]` | Comma-separated list of allowed CORS origins |
| `API_KEY_PREFIX` | `pp_` | Prefix for generated API keys |
### Proxy pipeline
| Variable | Default | Description |
|----------|---------|-------------|
| `SCRAPE_TIMEOUT_SECONDS` | `30` | HTTP timeout when fetching proxy sources |
| `SCRAPE_USER_AGENT` | `ProxyPool/0.1` | User-Agent header for scrape requests |
| `CHECK_TCP_TIMEOUT` | `5.0` | Timeout for TCP connect checks |
| `CHECK_HTTP_TIMEOUT` | `10.0` | Timeout for HTTP-level checks |
| `CHECK_PIPELINE_TIMEOUT` | `120` | Overall pipeline timeout per proxy |
| `JUDGE_URL` | `http://httpbin.org/ip` | URL used by the HTTP anonymity checker to determine exit IP |
| `REVALIDATE_ACTIVE_INTERVAL_MINUTES` | `10` | How often active proxies are re-checked |
| `REVALIDATE_DEAD_INTERVAL_HOURS` | `6` | How often dead proxies are re-checked |
| `REVALIDATE_BATCH_SIZE` | `200` | Max proxies per revalidation sweep |
| `POOL_LOW_THRESHOLD` | `100` | Emit `proxy.pool_low` event when active count drops below this |
### Accounts
| Variable | Default | Description |
|----------|---------|-------------|
| `DEFAULT_CREDITS` | `100` | Credits granted to new accounts |
| `MAX_LEASE_DURATION_SECONDS` | `3600` | Maximum allowed lease duration |
| `DEFAULT_LEASE_DURATION_SECONDS` | `300` | Default lease duration if not specified |
| `CREDIT_LOW_THRESHOLD` | `10` | Emit `credits.low_balance` when balance drops below this |
### Cleanup
| Variable | Default | Description |
|----------|---------|-------------|
| `PRUNE_DEAD_AFTER_DAYS` | `30` | Delete dead proxies older than this |
| `PRUNE_CHECKS_AFTER_DAYS` | `7` | Delete check history older than this |
| `PRUNE_CHECKS_KEEP_LAST` | `100` | Always keep at least this many checks per proxy |
### Notifications
| Variable | Default | Description |
|----------|---------|-------------|
| `SMTP_HOST` | (empty) | SMTP server. If empty, SMTP notifier is disabled. |
| `SMTP_PORT` | `587` | SMTP port |
| `SMTP_USER` | (empty) | SMTP username |
| `SMTP_PASSWORD` | (empty) | SMTP password |
| `ALERT_EMAIL` | (empty) | Recipient for alert emails |
| `WEBHOOK_URL` | (empty) | Webhook URL. If empty, webhook notifier is disabled. |
### Redis cache
| Variable | Default | Description |
|----------|---------|-------------|
| `CACHE_PROXY_LIST_TTL` | `60` | TTL in seconds for cached proxy query results |
| `CACHE_CREDIT_BALANCE_TTL` | `300` | TTL in seconds for cached credit balances |
## Monitoring
### Health check
```bash
curl http://localhost:8000/health
```
Returns `200` with connection status for PostgreSQL and Redis. Use this as a Docker/Kubernetes health check and load balancer target.
### Key metrics to watch
**Pool health** (`GET /stats/pool`):
- `by_status.active` — The number of working proxies. If this drops suddenly, investigate source failures or upstream blocks.
- `last_scrape_at` — If this is stale, the worker may be down or the scrape task is failing.
- `last_validation_at` — If this is stale, validation is backed up or the worker is stuck.
**Plugin health** (`GET /stats/plugins`):
- Check `notifiers[].healthy` — if a notifier is unhealthy, alerts won't be delivered.
**Worker job queue**: Monitor the Redis key `arq:queue` (ARQ's default pending-job queue, a sorted set) and `arq:result:*` keys (completed/failed jobs). A growing queue indicates the worker can't keep up.
### Log format
Logs are structured JSON in production (`LOG_LEVEL=INFO`):
```json
{
"timestamp": "2025-01-15T10:30:00Z",
"level": "INFO",
"message": "scrape_source completed",
"source_id": "abc-123",
"proxies_new": 23,
"duration_ms": 1540
}
```
### Alerting
The built-in notification system handles operational alerts:
- `proxy.pool_low` — Active proxy count below threshold. Action: add more sources or investigate why proxies are dying.
- `source.failed` — A scrape failed. Usually transient (upstream 503). Investigate if persistent.
- `source.stale` — A source hasn't produced results in N hours. The source may be dead or blocking your scraper.
- `credits.low_balance` / `credits.exhausted` — User account alerts. No operational action needed unless it's your own account.
## Troubleshooting
### Proxies are all dying
**Symptoms**: `by_status.active` dropping, `by_status.dead` increasing.
**Possible causes**:
- The judge URL (`JUDGE_URL`) is down or rate-limiting you. Check if `httpbin.org/ip` is accessible from your server.
- Your server's IP is blocked by proxy providers. Try from a different IP or use a self-hosted judge endpoint.
- Proxy sources are returning stale lists. Check `last_scraped_at` on sources.
**Fix**: Self-host a simple judge endpoint (a Flask/FastAPI app that returns `{"ip": request.remote_addr}`) to eliminate dependency on httpbin.
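A judge doesn't even need a framework — a standard-library sketch of the idea:

```python
# minimal_judge.py -- a self-hosted stand-in for httpbin.org/ip (illustrative sketch)
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class JudgeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Echo the caller's IP; requests routed through a proxy show its exit IP here.
        body = json.dumps({"ip": self.client_address[0]}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the judge quiet


def serve(host: str = "0.0.0.0", port: int = 8080) -> None:
    ThreadingHTTPServer((host, port), JudgeHandler).serve_forever()


# Run with: python minimal_judge.py (calling serve()), then point JUDGE_URL at it.
```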
### Worker is not processing jobs
**Symptoms**: `last_scrape_at` and `last_validation_at` are stale. Redis queue is growing.
**Check**:
```bash
docker compose logs worker --tail=50
docker compose exec redis redis-cli ZCARD arq:queue
```
**Possible causes**:
- Worker process crashed. Restart it: `docker compose restart worker`.
- Redis connection lost. Check Redis health: `docker compose exec redis redis-cli ping`.
- A task is stuck (infinite loop or hung network call). Check `CHECK_PIPELINE_TIMEOUT`.
### Database connections exhausted
**Symptoms**: `asyncpg.exceptions.TooManyConnectionsError` or slow queries.
**Fix**: Reduce the connection pool size in `DATABASE_URL` parameters, or deploy PgBouncer. The default asyncpg pool size is 10 connections per process — with 3 API instances and 1 worker, that's 40 connections. PostgreSQL's default limit is 100.
```env
# In DATABASE_URL or via SQLAlchemy pool config
DATABASE_POOL_SIZE=5
DATABASE_MAX_OVERFLOW=10
```
### Redis memory growing
**Symptoms**: Redis memory usage increasing over time.
**Possible causes**:
- ARQ job results not expiring. Check `keep_result` setting.
- Proxy cache not being invalidated. Verify `CACHE_PROXY_LIST_TTL` is set.
- Lease keys not expiring (should auto-expire via TTL).
**Fix**: Set a Redis `maxmemory` policy:
```
maxmemory 256mb
maxmemory-policy allkeys-lru
```
### Migration failed
**Symptoms**: `alembic upgrade head` errors.
**Steps**:
1. Check the current state: `uv run alembic current`.
2. Look at the error — usually a constraint violation or type mismatch.
3. If the migration is partially applied, you may need to manually fix the state: `uv run alembic stamp <revision>`.
4. For production, always test migrations against a copy of the production database first.
## Backup and recovery
### Database backup
```bash
# Dump
docker compose exec postgres pg_dump -U proxypool proxypool > backup.sql
# Restore
docker compose exec -T postgres psql -U proxypool proxypool < backup.sql
```
### Redis
For proxy pool, Redis data is ephemeral (cache + queue). Losing Redis state means:
- Cached proxy lists are rebuilt on next query (minor latency spike).
- Active leases are lost (the `expire_leases` task will clean up PostgreSQL state).
- Pending ARQ jobs are lost (the next cron cycle will re-enqueue them).
If lease integrity is critical, enable Redis persistence (AOF recommended):
```
appendonly yes
appendfsync everysec
```