docs: add developer documentation

commit `c041e83a19` (parent `11919e516b`)
Proxy Pool is a FastAPI backend that discovers, validates, and serves free proxy servers. It scrapes configurable proxy list sources, runs a multi-stage validation pipeline to determine which proxies are alive and usable, and exposes a query API for consumers to find and acquire working proxies.

See `docs/` for full documentation.
---

`docs/01-architecture.md` (new file, 122 lines)
# Architecture overview

## Purpose

Proxy Pool is a FastAPI backend that discovers, validates, and serves free proxy servers. It scrapes configurable proxy list sources, runs a multi-stage validation pipeline to determine which proxies are alive and usable, and exposes a query API for consumers to find and acquire working proxies.

## System components

The application is a single FastAPI process with a companion ARQ worker process. Both share the same codebase and connect to the same PostgreSQL database and Redis instance.

### FastAPI application

The HTTP API layer. Handles all inbound requests, dependency injection, middleware (CORS, rate limiting, request logging), and the app lifespan (startup/shutdown). Built with the app factory pattern via `create_app()` so tests can instantiate independent app instances.

### ARQ worker

A separate process running ARQ (async Redis queue). Executes background tasks: periodic source scraping, proxy validation sweeps, stale proxy cleanup, and lease expiration. Tasks are defined in `proxy_pool.worker` and import from domain service layers — they contain orchestration, not business logic.

### PostgreSQL (asyncpg)

Primary data store for all proxy data, user accounts, the credit ledger, check history, and source configuration. Accessed through the SQLAlchemy 2.0 async ORM with asyncpg as the dialect driver. Schema managed by Alembic.

### Redis

Serves three roles:

- **Task queue**: ARQ uses Redis as its message broker for enqueuing and scheduling background tasks.
- **Cache**: Hot proxy lists (the "top N good proxies" result), user credit balances, and recently validated proxy sets are cached in Redis with TTL-based expiration.
- **Lease manager**: The proxy acquire endpoint uses Redis `SET key value EX ttl NX` for atomic lease creation. This prevents two consumers from acquiring the same proxy simultaneously.

### Plugin registry

A runtime plugin system that manages three plugin types: source parsers, proxy checkers, and notifiers. Plugins are discovered at startup by scanning the `plugins/builtin/` and `plugins/contrib/` directories. The registry validates plugins against Python Protocol classes and stores them for use by the pipeline and event bus.

## Domain modules

The codebase is split into two domain modules with a shared infrastructure layer:

### Proxy domain (`proxy_pool.proxy`)

Owns all proxy-related data and logic: sources, proxies, check history, tags, scoring, and the validation pipeline. Exposes routes for CRUD on sources, querying/filtering proxies, running on-demand site reachability tests, and viewing pool health stats.

### Accounts domain (`proxy_pool.accounts`)

Owns user accounts, API key authentication, and the credit ledger. Exposes routes for key management, account info, and credit balance/history. Provides a FastAPI dependency (`get_current_user`) that resolves an API key header into an authenticated `User`.

### Integration boundary

The two domains never import from each other directly. The single integration point is the `POST /proxies/acquire` endpoint, which:

1. Resolves the API key via the accounts auth dependency
2. Checks the user's credit balance (Redis cache, DB fallback)
3. Selects the best matching proxy via the proxy service layer
4. Creates an atomic lease in Redis
5. Debits the credit ledger and logs the lease in a single DB transaction
6. Invalidates the cached credit balance

This is the only place where both domains participate in the same request.

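Seen from the handler's side, those six steps are a thin orchestration layer. A condensed sketch, with hypothetical service objects standing in for the real dependencies (none of these names come from the codebase; step 1, API key resolution, happens upstream in the FastAPI dependency):

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Proxy:
    id: str
    ip: str
    port: int


async def acquire_proxy(user_id: str, filters: dict, *, accounts, proxies, leases, ledger, cache) -> Proxy:
    # Step 1 (auth) already resolved user_id before this function runs.
    if await accounts.credit_balance(user_id) < 1:       # step 2
        raise PermissionError("insufficient credits")
    proxy = await proxies.select_best(filters)           # step 3
    if not await leases.try_acquire(proxy.id, user_id):  # step 4 (atomic in Redis)
        raise RuntimeError("proxy already leased")
    await ledger.debit_and_log(user_id, proxy.id)        # step 5 (one DB transaction)
    await cache.invalidate_balance(user_id)              # step 6
    return proxy
```

The important property is ordering: the lease is taken before the debit, so a failed lease never costs a credit.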
## Data flow

### Discovery

A periodic ARQ cron task (`scrape_all`) iterates over active `ProxySource` records. For each source, it fetches the URL, selects the appropriate `SourceParser` plugin (by `parser_name` or auto-detection via `supports()`), parses the raw content into a list of `DiscoveredProxy` objects, and upserts them into the `proxies` table using `ON CONFLICT (ip, port, protocol) DO UPDATE`.

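The same upsert shape can be exercised with SQLite, which shares PostgreSQL's `ON CONFLICT` syntax. The table below is a minimal stand-in, not the real schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE proxies (
        ip TEXT NOT NULL, port INTEGER NOT NULL, protocol TEXT NOT NULL,
        last_seen TEXT NOT NULL,
        UNIQUE (ip, port, protocol)
    )
""")


def upsert(ip: str, port: int, protocol: str, seen: str) -> None:
    # Re-discovering a known proxy updates it instead of inserting a duplicate.
    conn.execute(
        """
        INSERT INTO proxies (ip, port, protocol, last_seen) VALUES (?, ?, ?, ?)
        ON CONFLICT (ip, port, protocol) DO UPDATE SET last_seen = excluded.last_seen
        """,
        (ip, port, protocol, seen),
    )


upsert("1.2.3.4", 8080, "http", "day-1")
upsert("1.2.3.4", 8080, "http", "day-2")  # same key: row is updated, not duplicated
count, last = conn.execute("SELECT COUNT(*), MAX(last_seen) FROM proxies").fetchone()
assert (count, last) == (1, "day-2")
```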
### Validation

Newly discovered proxies (status `UNCHECKED`) and proxies due for revalidation are fed into the checker pipeline. The pipeline runs all registered `ProxyChecker` plugins in stage order:

- **Stage 1** — Quick liveness checks (TCP connect, SOCKS handshake). Run concurrently within the stage.
- **Stage 2** — HTTP-level checks (exit IP detection, anonymity classification, GeoIP lookup). Run concurrently within the stage.
- **Stage 3** — Optional site-specific reachability checks.

If any checker in a stage fails, all subsequent stages are skipped. Every check result is logged to `proxy_checks` for historical analysis. After the pipeline completes, a composite score is computed from latency, uptime percentage, and proxy age, then persisted to the `proxies` table.

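As an illustration only (the actual weights live in the scoring service and are not specified here), a composite score of this shape might look like:

```python
def composite_score(latency_ms: float, uptime_pct: float, age_hours: float) -> float:
    """Illustrative composite score in [0, 1]; weights and cutoffs are invented.

    Lower latency, higher uptime, and a longer observed lifetime all raise
    the score, mirroring the inputs named in the text above.
    """
    latency_part = max(0.0, 1.0 - latency_ms / 5000.0)  # 0 ms -> 1.0, >= 5 s -> 0.0
    uptime_part = uptime_pct / 100.0
    age_part = min(age_hours / 72.0, 1.0)               # saturates after three days
    return round(0.4 * latency_part + 0.4 * uptime_part + 0.2 * age_part, 4)


assert composite_score(0.0, 100.0, 72.0) == 1.0   # ideal proxy
assert composite_score(5000.0, 0.0, 0.0) == 0.0   # slow, flaky, brand new
```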
### Serving

The query API (`GET /proxies`) filters by protocol, country, anonymity level, minimum score, latency range, and "last verified within N minutes." Results are served from a Redis cache when available, with cache invalidation triggered by validation sweeps.

The acquire endpoint (`POST /proxies/acquire`) selects a proxy matching the requested filters, creates a time-limited lease, debits one credit, and returns the proxy details. The lease is tracked in both Redis (for fast exclusion from future queries) and PostgreSQL (for audit trail and cleanup).

### Notification

The event bus dispatches events (`proxy.pool_low`, `source.failed`, `credits.low_balance`, etc.) to registered `Notifier` plugins. Notifications are fire-and-forget via `asyncio.create_task` — they never block the main request or pipeline path.

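The fire-and-forget contract can be sketched as follows. Function names here are illustrative; the real event bus also performs glob-based subscription matching before dispatch:

```python
import asyncio
import logging

logger = logging.getLogger("events")


async def _safe_notify(notifier, event) -> None:
    # Exceptions are swallowed and logged so a failing notifier can never
    # break the request or pipeline that emitted the event.
    try:
        await notifier.notify(event)
    except Exception:
        logger.exception("notifier %s failed for %s", getattr(notifier, "name", "?"), event)


def dispatch(notifiers, event) -> list[asyncio.Task]:
    # Fire-and-forget: tasks are scheduled on the running loop, not awaited
    # by the caller, so the emitting code path returns immediately.
    return [asyncio.create_task(_safe_notify(n, event)) for n in notifiers]
```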
## Technology decisions

| Component | Choice | Rationale |
|-----------|--------|-----------|
| Web framework | FastAPI | Async-native, dependency injection, automatic OpenAPI docs |
| ORM | SQLAlchemy 2.0 async | Mapped classes, native async session, Alembic integration |
| DB driver | asyncpg | Fastest async PostgreSQL driver for Python |
| Migrations | Alembic | Industry standard, autogeneration from SA models |
| Task queue | ARQ + Redis | Lightweight async task queue, cron scheduling built in |
| HTTP client | httpx + httpx-socks | Async, SOCKS proxy support, connection pooling |
| Settings | pydantic-settings | Env-driven config with type validation |
| Package manager | uv | Fast, lockfile-based, handles editable installs |
| Containerization | Docker + Compose | Multi-stage builds, service orchestration |

## Deployment topology

```
                ┌─────────────────┐
                │  Load balancer  │
                └────────┬────────┘
                         │
          ┌──────────────┼──────────────┐
          │              │              │
    ┌─────┴─────┐  ┌─────┴─────┐  ┌─────┴─────┐
    │  API (N)  │  │  API (N)  │  │  API (N)  │
    └─────┬─────┘  └─────┬─────┘  └─────┬─────┘
          │              │              │
          └──────────────┼──────────────┘
                         │
          ┌──────────────┼──────────────┐
          │              │              │
    ┌─────┴─────┐  ┌─────┴─────┐  ┌─────┴─────┐
    │ Worker (M)│  │PostgreSQL │  │   Redis   │
    └───────────┘  └───────────┘  └───────────┘
```

API processes are stateless and can be scaled horizontally. Workers should typically run as 1-2 instances to avoid duplicate scrape/validation work (ARQ handles job deduplication via Redis). PostgreSQL and Redis are single-instance in the default setup but can be replicated per standard practices.

---

`docs/02-plugin-system.md` (new file, 317 lines)
# Plugin system design

## Overview

The plugin system allows extending Proxy Pool's functionality without modifying core code. Plugins can add new proxy list parsers, new validation methods, and new notification channels. The system uses Python's `typing.Protocol` for structural typing — plugins implement the right interface without inheriting from any base class.

## Plugin types

### SourceParser

Responsible for extracting proxy entries from raw scraped content. Each parser handles a specific format (plain text lists, HTML tables, JSON APIs, etc.).

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class SourceParser(Protocol):
    name: str

    def supports(self, url: str) -> bool:
        """Return True if this parser can handle the given source URL.

        Used as a fallback when no parser_name is explicitly set on a ProxySource.
        The registry calls supports() on each registered parser and uses the first match.
        """
        ...

    async def parse(self, raw: bytes, source: ProxySource) -> list[DiscoveredProxy]:
        """Extract proxy entries from raw scraped content.

        Arguments:
            raw: The raw bytes fetched from the source URL. The parser is responsible
                for decoding (the encoding may vary by source).
            source: The ProxySource record, providing context like default_protocol.

        Returns:
            A list of DiscoveredProxy objects. Duplicates within a single parse call
            are acceptable — deduplication happens at the upsert layer.
        """
        ...

    def default_schedule(self) -> str | None:
        """Optional cron expression for scrape frequency.

        If None, the schedule configured on the ProxySource record is used.
        This allows parsers to suggest a sensible default (e.g. "*/30 * * * *"
        for sources that update frequently).
        """
        ...
```

**Registration key**: `parser_name` on the `ProxySource` record maps to `SourceParser.name`.

**Built-in parsers**: `plaintext` (one `ip:port` per line), `html_table` (HTML table with IP/port columns), `json_api` (JSON array or nested structure).

### ProxyChecker

Runs a single validation check against a proxy. Checkers are organized into stages — all checkers in stage N run before any in stage N+1. Within a stage, checkers run concurrently.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class ProxyChecker(Protocol):
    name: str
    stage: int      # Pipeline ordering. Lower stages run first.
    priority: int   # Ordering within a stage. Lower priority runs first.
    timeout: float  # Per-check timeout in seconds.

    async def check(self, proxy: Proxy, context: CheckContext) -> CheckResult:
        """Run this check against the proxy.

        Arguments:
            proxy: The proxy being validated.
            context: Shared mutable state across the pipeline. Checkers in earlier
                stages populate fields (exit_ip, tcp_latency_ms) that later
                stages can read. Also provides a pre-configured httpx.AsyncClient.

        Returns:
            CheckResult with passed=True/False and a detail string.
        """
        ...

    def should_skip(self, proxy: Proxy) -> bool:
        """Return True to skip this check for the given proxy.

        Example: A SOCKS5-specific checker returns True for HTTP-only proxies.
        """
        ...
```

**Pipeline execution**: The orchestrator in `proxy_pool.proxy.pipeline` groups checkers by stage, runs each group concurrently via `asyncio.gather`, and aborts the pipeline on the first stage with any failure. Every individual check result is logged to the `proxy_checks` table.

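A minimal sketch of that stage loop, with simplified types. The real orchestrator in `proxy_pool.proxy.pipeline` also logs each result to `proxy_checks`:

```python
import asyncio
from itertools import groupby


async def run_pipeline(proxy, checkers, context) -> bool:
    # Sort once so groupby sees stages in order, with priority inside each stage.
    ordered = sorted(checkers, key=lambda c: (c.stage, c.priority))
    for stage, group in groupby(ordered, key=lambda c: c.stage):
        stage_checkers = [c for c in group if not c.should_skip(proxy)]
        # All checkers in one stage run concurrently.
        results = await asyncio.gather(*(c.check(proxy, context) for c in stage_checkers))
        # (Each result would be persisted to proxy_checks here.)
        if not all(r.passed for r in results):
            return False  # abort: later stages never run
    return True
```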
**`CheckContext`**: A mutable dataclass that travels through the pipeline:

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class CheckContext:
    started_at: datetime
    http_client: httpx.AsyncClient
    # Populated by checkers as they run:
    exit_ip: str | None = None
    tcp_latency_ms: float | None = None
    http_latency_ms: float | None = None
    anonymity_level: AnonymityLevel | None = None
    country: str | None = None
    headers_forwarded: list[str] = field(default_factory=list)

    def elapsed_ms(self) -> float:
        return (utcnow() - self.started_at).total_seconds() * 1000
```

**`CheckResult`**: The return type from every checker:

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class CheckResult:
    passed: bool
    detail: str
    latency_ms: float | None = None
    metadata: dict[str, Any] = field(default_factory=dict)
```

**Built-in checkers**:

| Name | Stage | What it does |
|------|-------|-------------|
| `tcp_connect` | 1 | Opens a TCP connection to verify the proxy is reachable |
| `socks_handshake` | 1 | Performs a SOCKS4/5 handshake (skipped for HTTP proxies) |
| `http_anonymity` | 2 | Sends an HTTP request through the proxy to a judge URL, determines exit IP and which headers are forwarded |
| `geoip_lookup` | 2 | Resolves the exit IP to a country code using MaxMind GeoLite2 |
| `site_reach` | 3 | Optional: tests whether the proxy can reach specific target URLs |

### Notifier

Reacts to system events. Notifiers are called asynchronously (fire-and-forget) and must never block the main application path.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Notifier(Protocol):
    name: str
    subscribes_to: list[str]  # Glob patterns: "proxy.*", "credits.low_balance"

    async def notify(self, event: Event) -> None:
        """Handle an event.

        Called via asyncio.create_task — exceptions are caught and logged
        but do not propagate. Implementations should handle their own
        retries if needed.
        """
        ...

    async def health_check(self) -> bool:
        """Verify the notification backend is reachable.

        Called periodically and surfaced in the admin stats endpoint.
        Return False if the backend is unreachable.
        """
        ...
```

**Event types**:

| Event | Payload | When emitted |
|-------|---------|-------------|
| `proxy.pool_low` | `{active_count, threshold}` | Active proxy count drops below configured threshold |
| `proxy.new_batch` | `{source_id, count}` | A scrape discovers new proxies |
| `source.failed` | `{source_id, error}` | A scrape attempt fails |
| `source.stale` | `{source_id, hours_since_success}` | A source hasn't produced results in N hours |
| `credits.low_balance` | `{user_id, balance, threshold}` | User balance drops below threshold |
| `credits.exhausted` | `{user_id}` | User balance reaches zero |

**Glob matching**: `proxy.*` matches all events starting with `proxy.`. Exact matches like `credits.low_balance` match only that event. A notifier can subscribe to multiple patterns.

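These glob semantics line up with the standard library's `fnmatch`, which makes the matching rule easy to demonstrate (whether the event bus uses `fnmatch` internally is not specified here):

```python
from fnmatch import fnmatch


def is_subscribed(patterns: list[str], event_name: str) -> bool:
    # A notifier receives an event if any of its patterns matches the name.
    return any(fnmatch(event_name, p) for p in patterns)


assert is_subscribed(["proxy.*"], "proxy.pool_low")
assert is_subscribed(["proxy.*", "credits.low_balance"], "credits.low_balance")
assert not is_subscribed(["credits.low_balance"], "credits.exhausted")
```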
**Built-in notifiers**: `smtp` (email alerts), `webhook` (HTTP POST to a configured URL).

## Plugin registry

The `PluginRegistry` class is the central coordinator. It stores registered plugins, validates them against Protocol contracts at registration time, and provides lookup methods used by the pipeline and event bus.

```python
class PluginRegistry:
    def __init__(self) -> None:
        self._parsers: dict[str, SourceParser] = {}
        self._checkers: list[ProxyChecker] = []  # Sorted by (stage, priority)
        self._notifiers: list[Notifier] = []
        self._event_subs: dict[str, list[Notifier]] = {}

    def register_parser(self, plugin: SourceParser) -> None: ...
    def register_checker(self, plugin: ProxyChecker) -> None: ...
    def register_notifier(self, plugin: Notifier) -> None: ...
    def get_parser(self, name: str) -> SourceParser: ...
    def get_parser_for_url(self, url: str) -> SourceParser | None: ...
    def get_checker_pipeline(self) -> list[ProxyChecker]: ...
    async def emit(self, event: Event) -> None: ...
```

**Validation**: At registration time, `_validate_protocol()` uses `isinstance()` (enabled by `@runtime_checkable` on each Protocol) as a structural check, then inspects for missing attributes/methods and raises `PluginValidationError` with a descriptive message.

**Conflict detection**: Two parsers with the same `name` raise `PluginConflictError`. Checkers and notifiers are additive (duplicates are allowed, though unusual).

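The structural check that this validation builds on can be shown in miniature. The `Named` protocol below is a toy, not one of the real plugin Protocols:

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Named(Protocol):
    name: str

    def supports(self, url: str) -> bool: ...


class Good:
    name = "good"

    def supports(self, url: str) -> bool:
        return True


class Bad:
    pass


# isinstance() against a runtime_checkable Protocol checks for the presence
# of the declared members, not for inheritance from any base class.
assert isinstance(Good(), Named)
assert not isinstance(Bad(), Named)
```

Note that `isinstance` only verifies that the members exist, not their signatures or types, which is why the registry follows it with a more detailed attribute inspection.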
## Plugin discovery

Plugins are discovered at application startup by scanning two directories:

- `proxy_pool/plugins/builtin/` — Ships with the application. Tested in CI.
- `proxy_pool/plugins/contrib/` — User-provided plugins. Can be mounted as a Docker volume.

### Convention

Each plugin is a Python module (single `.py` file or package directory) that defines a `create_plugin(settings: Settings)` function. This factory function:

- Receives the application settings (for reading config like SMTP credentials)
- Returns a plugin instance, or `None` if the plugin should not activate (e.g. SMTP not configured)

```python
# Example: proxy_pool/plugins/builtin/parsers/plaintext.py

class PlaintextParser:
    name = "plaintext"

    def supports(self, url: str) -> bool:
        return url.endswith(".txt")

    async def parse(self, raw: bytes, source: ProxySource) -> list[DiscoveredProxy]:
        # ... parsing logic ...
        return results

    def default_schedule(self) -> str | None:
        return "*/30 * * * *"


def create_plugin(settings: Settings) -> PlaintextParser:
    return PlaintextParser()
```

### Discovery algorithm

```python
import importlib
from pathlib import Path


async def discover_plugins(plugins_dir: Path, registry: PluginRegistry, settings: Settings):
    for path in sorted(plugins_dir.rglob("*.py")):
        if path.name.startswith("_"):
            continue
        module = importlib.import_module(derive_module_path(path))
        if not hasattr(module, "create_plugin"):
            continue
        plugin = module.create_plugin(settings)
        if plugin is None:
            continue  # Plugin opted out (unconfigured)
        # Type-based routing:
        match plugin:
            case SourceParser():
                registry.register_parser(plugin)
            case ProxyChecker():
                registry.register_checker(plugin)
            case Notifier():
                registry.register_notifier(plugin)
```

The `match` statement uses structural pattern matching against `@runtime_checkable` Protocol types. The order matters — if a plugin somehow satisfies multiple Protocols, the first match wins.

## Wiring into FastAPI

The plugin registry is created during the FastAPI lifespan and stored on `app.state`:

```python
from contextlib import asynccontextmanager


@asynccontextmanager
async def lifespan(app: FastAPI):
    settings = get_settings()
    registry = PluginRegistry()

    # Discover from both builtin and contrib directories
    await discover_plugins(Path("proxy_pool/plugins/builtin"), registry, settings)
    await discover_plugins(Path("/app/plugins-contrib"), registry, settings)

    # Health-check notifiers
    for notifier in registry.notifiers:
        healthy = await notifier.health_check()
        if not healthy:
            logger.warning(f"Notifier '{notifier.name}' failed health check at startup")

    app.state.registry = registry
    # ... other setup (db, redis) ...
    yield
    # ... cleanup ...
```

Route handlers access the registry via a FastAPI dependency:

```python
async def get_registry(request: Request) -> PluginRegistry:
    return request.app.state.registry
```

## Writing a third-party plugin

1. Create a `.py` file in `plugins/contrib/` (or mount a directory as a Docker volume at `/app/plugins-contrib/`).
2. Implement the relevant Protocol methods and attributes.
3. Define `create_plugin(settings: Settings) -> YourPlugin | None`.
4. Restart the application. The plugin will be discovered and registered automatically.

No inheritance required. No registration decorators. Just implement the shape and provide the factory.

### Testing a plugin

```python
from proxy_pool.config import Settings
from your_plugin import create_plugin


def test_plugin_creates_successfully():
    settings = Settings(...)  # e.g. smtp_host="localhost" plus any other required fields
    plugin = create_plugin(settings)
    assert plugin is not None
    assert plugin.name == "my_custom_parser"


async def test_parser_extracts_proxies():
    plugin = create_plugin(Settings(...))
    raw = b"192.168.1.1:8080\n10.0.0.1:3128\n"
    source = make_test_source()
    results = await plugin.parse(raw, source)
    assert len(results) == 2
    assert results[0].ip == "192.168.1.1"
```

---

`docs/03-database-schema.md` (new file, 204 lines)
# Database schema reference

## Overview

All tables use UUID primary keys (generated client-side via `uuid4()`), `timestamptz` for datetime columns, and follow a consistent naming convention: `snake_case` table names, singular for join/config tables, plural for entity tables.

The schema is managed by Alembic. Never modify tables directly — always create a migration.

## Proxy domain tables

### proxy_sources

Configurable scrape targets. Each record defines a URL to fetch, a parser to use, and a schedule.

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | `uuid` | PK, default uuid4 | |
| `url` | `varchar(2048)` | UNIQUE, NOT NULL | The URL to scrape |
| `parser_name` | `varchar(64)` | NOT NULL | Maps to a registered `SourceParser.name` |
| `cron_schedule` | `varchar(64)` | nullable | Cron expression for scrape frequency. Falls back to the parser's `default_schedule()` if NULL |
| `default_protocol` | `enum(proxy_protocol)` | NOT NULL, default `http` | Protocol to assign when the parser can't determine it from the source |
| `is_active` | `boolean` | NOT NULL, default `true` | Inactive sources are skipped by the scrape task |
| `last_scraped_at` | `timestamptz` | nullable | Timestamp of the last successful scrape |
| `created_at` | `timestamptz` | NOT NULL, server default `now()` | |

**Rationale**: Storing the parser name rather than auto-detecting every time allows explicit control. A source might look like a plain text file but actually need a custom parser.

### proxies

The core proxy table. Each record represents a unique `(ip, port, protocol)` combination.

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | `uuid` | PK, default uuid4 | |
| `ip` | `inet` | NOT NULL | IPv4 or IPv6 address |
| `port` | `integer` | NOT NULL | Port number (1–65535) |
| `protocol` | `enum(proxy_protocol)` | NOT NULL | `http`, `https`, `socks4`, `socks5` |
| `source_id` | `uuid` | FK → proxy_sources.id, NOT NULL | Which source discovered this proxy |
| `status` | `enum(proxy_status)` | NOT NULL, default `unchecked` | `unchecked`, `active`, `dead` |
| `anonymity` | `enum(anonymity_level)` | nullable | `transparent`, `anonymous`, `elite` |
| `exit_ip` | `inet` | nullable | The IP address seen by the target when using this proxy |
| `country` | `varchar(2)` | nullable | ISO 3166-1 alpha-2 country code of the exit IP |
| `score` | `float` | NOT NULL, default `0.0` | Composite quality score (0.0–1.0) |
| `avg_latency_ms` | `float` | nullable | Rolling average latency across recent checks |
| `uptime_pct` | `float` | nullable | Percentage of checks that passed (0.0–100.0) |
| `first_seen_at` | `timestamptz` | NOT NULL, server default `now()` | When this proxy was first discovered |
| `last_checked_at` | `timestamptz` | nullable | When the last validation check completed |
| `created_at` | `timestamptz` | NOT NULL, server default `now()` | |

**Indexes**:

| Name | Columns | Type | Purpose |
|------|---------|------|---------|
| `ix_proxies_ip_port_proto` | `(ip, port, protocol)` | UNIQUE | Deduplication on upsert |
| `ix_proxies_status_score` | `(status, score)` | B-tree | Fast filtering for "active proxies sorted by score" |

**Design note**: The same `ip:port` can appear multiple times if it supports different protocols (e.g., a server answering both HTTP and SOCKS5 on the same port). The composite unique index enforces this correctly.

**Computed columns**: `score`, `avg_latency_ms`, and `uptime_pct` are denormalized from `proxy_checks`. They are recomputed by the validation pipeline after each check run and by a periodic rollup task. This avoids expensive aggregation queries on every proxy list request.

### proxy_checks

Append-only log of every validation check attempt. This is the raw data behind the computed fields on `proxies`.

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | `uuid` | PK, default uuid4 | |
| `proxy_id` | `uuid` | FK → proxies.id ON DELETE CASCADE, NOT NULL | |
| `checker_name` | `varchar(64)` | NOT NULL | The `ProxyChecker.name` that ran this check |
| `stage` | `integer` | NOT NULL | Pipeline stage number |
| `passed` | `boolean` | NOT NULL | Whether the check succeeded |
| `latency_ms` | `float` | nullable | Time taken for this specific check |
| `detail` | `text` | nullable | Human-readable result description or error message |
| `exit_ip` | `inet` | nullable | Exit IP discovered during this check (if applicable) |
| `created_at` | `timestamptz` | NOT NULL, server default `now()` | |

**Indexes**:

| Name | Columns | Purpose |
|------|---------|---------|
| `ix_checks_proxy_created` | `(proxy_id, created_at)` | Efficient history queries per proxy |

**Retention**: This table grows fast. A periodic cleanup task (`tasks_cleanup.prune_checks`) deletes rows older than a configurable retention period (default: 7 days), keeping only the most recent N checks per proxy.

### proxy_tags

Flexible key-value labels for proxies. Useful for user-defined categorization (e.g., `datacenter: true`, `provider: aws`, `tested_site: google.com`).

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | `uuid` | PK, default uuid4 | |
| `proxy_id` | `uuid` | FK → proxies.id ON DELETE CASCADE, NOT NULL | |
| `key` | `varchar(64)` | NOT NULL | Tag name |
| `value` | `varchar(256)` | NOT NULL | Tag value |

**Indexes**:

| Name | Columns | Type | Purpose |
|------|---------|------|---------|
| `ix_tags_proxy_key` | `(proxy_id, key)` | UNIQUE | One value per key per proxy |

## Accounts domain tables

### users

User accounts. Minimal by design — the primary purpose is to own API keys and credits.

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | `uuid` | PK, default uuid4 | |
| `email` | `varchar(320)` | UNIQUE, NOT NULL | Used for notifications and account recovery |
| `display_name` | `varchar(128)` | nullable | |
| `is_active` | `boolean` | NOT NULL, default `true` | Inactive users cannot authenticate |
| `created_at` | `timestamptz` | NOT NULL, server default `now()` | |

### api_keys

API keys for authentication. The raw key is shown once at creation; only the hash is stored.

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | `uuid` | PK, default uuid4 | |
| `user_id` | `uuid` | FK → users.id ON DELETE CASCADE, NOT NULL | |
| `key_hash` | `varchar(128)` | NOT NULL | SHA-256 hash of the raw API key |
| `prefix` | `varchar(8)` | NOT NULL | First 8 characters of the raw key, for quick lookup |
| `label` | `varchar(128)` | nullable | User-assigned label (e.g., "production", "testing") |
| `is_active` | `boolean` | NOT NULL, default `true` | Revoked keys have `is_active = false` |
| `last_used_at` | `timestamptz` | nullable | Updated on each authenticated request |
| `expires_at` | `timestamptz` | nullable | NULL means no expiration |
| `created_at` | `timestamptz` | NOT NULL, server default `now()` | |

**Indexes**:

| Name | Columns | Type | Purpose |
|------|---------|------|---------|
| `ix_api_keys_hash` | `(key_hash)` | UNIQUE | Uniqueness constraint on key hashes |
| `ix_api_keys_prefix` | `(prefix)` | B-tree | Fast prefix-based lookup before full hash comparison |

**Auth flow**: On each request, the middleware extracts the API key from the `Authorization: Bearer <key>` header, computes `prefix = key[:8]`, queries `api_keys WHERE prefix = ? AND is_active = true AND (expires_at IS NULL OR expires_at > now())`, then verifies `sha256(key) == key_hash`. This two-step approach avoids computing a hash against every key in the database.

### credit_ledger
|
||||
|
||||
Append-only ledger of all credit transactions. Current balance is `SELECT SUM(amount) FROM credit_ledger WHERE user_id = ?`.
|
||||
|
||||
| Column | Type | Constraints | Description |
|
||||
|--------|------|-------------|-------------|
|
||||
| `id` | `uuid` | PK, default uuid4 | |
|
||||
| `user_id` | `uuid` | FK → users.id ON DELETE CASCADE, NOT NULL | |
|
||||
| `amount` | `integer` | NOT NULL | Positive = credit in, negative = debit |
|
||||
| `tx_type` | `enum(credit_tx_type)` | NOT NULL | `purchase`, `acquire`, `refund`, `admin_adjust` |
|
||||
| `description` | `text` | nullable | Human-readable note |
|
||||
| `reference_id` | `uuid` | nullable | Links to the related entity (e.g., lease ID for `acquire` transactions) |
|
||||
| `created_at` | `timestamptz` | NOT NULL, server default `now()` | |
|
||||
|
||||
**Indexes**:
|
||||
|
||||
| Name | Columns | Purpose |
|
||||
|------|---------|---------|
|
||||
| `ix_ledger_user_created` | `(user_id, created_at)` | Balance computation and history queries |
|
||||
|
||||
**Caching**: The computed balance is cached in Redis under `credits:{user_id}`. The cache is invalidated (DEL) whenever a new ledger entry is created. Cache miss triggers a `SUM(amount)` query.
**Concurrency**: Because balance is derived from a SUM, concurrent inserts don't cause race conditions on the balance itself. The acquire endpoint uses `SELECT ... FOR UPDATE` on the user row to serialize credit checks, preventing double-spending under high concurrency.
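The double-spend guard can be illustrated with an in-process lock standing in for the row lock (a hypothetical sketch; the real code takes `SELECT ... FOR UPDATE` inside a database transaction):

```python
import threading

def acquire_with_lock(user_id: str, locks: dict, get_balance, append_debit) -> bool:
    """Only one acquire per user may check-then-debit at a time."""
    lock = locks.setdefault(user_id, threading.Lock())
    with lock:                        # stands in for SELECT ... FOR UPDATE
        if get_balance(user_id) < 1:
            return False              # transaction rolls back -> 402
        append_debit(user_id, -1)     # INSERT INTO credit_ledger (tx_type 'acquire')
        return True
```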

### proxy_leases

Tracks which proxies are currently checked out by which users. Both Redis (for fast lookup) and PostgreSQL (for audit trail) maintain lease state.

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | `uuid` | PK, default uuid4 | |
| `user_id` | `uuid` | FK → users.id, NOT NULL | |
| `proxy_id` | `uuid` | FK → proxies.id, NOT NULL | |
| `acquired_at` | `timestamptz` | NOT NULL, server default `now()` | |
| `expires_at` | `timestamptz` | NOT NULL | When the lease automatically releases |
| `is_released` | `boolean` | NOT NULL, default `false` | Set to true on explicit release or expiration cleanup |

**Indexes**:

| Name | Columns | Purpose |
|------|---------|---------|
| `ix_leases_user` | `(user_id)` | List a user's active leases |
| `ix_leases_proxy_active` | `(proxy_id, is_released)` | Check if a proxy is currently leased |

**Dual state**: Redis holds the lease as `lease:{proxy_id}` with a TTL matching `expires_at`. The proxy selection query excludes proxies with an active Redis lease key. The PostgreSQL record exists for audit, billing reconciliation, and cleanup if Redis state is lost.
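A sketch of the dual write (callbacks stand in for the Redis and PostgreSQL clients; names are hypothetical):

```python
from datetime import datetime, timedelta, timezone

def create_lease(proxy_id: str, user_id: str, duration_s: int,
                 redis_set, db_insert) -> dict:
    """Write the lease to both stores."""
    expires_at = datetime.now(timezone.utc) + timedelta(seconds=duration_s)
    # Fast path: keyed by proxy so the selection query can skip leased proxies.
    redis_set(f"lease:{proxy_id}", user_id, ex=duration_s)   # SET ... EX <ttl>
    # Durable path: the audit/billing record that survives Redis restarts.
    lease = {"proxy_id": proxy_id, "user_id": user_id,
             "expires_at": expires_at, "is_released": False}
    db_insert(lease)
    return lease
```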

## Enum types

All enums are PostgreSQL native enums created via `CREATE TYPE`:

| Enum name | Values |
|-----------|--------|
| `proxy_protocol` | `http`, `https`, `socks4`, `socks5` |
| `proxy_status` | `unchecked`, `active`, `dead` |
| `anonymity_level` | `transparent`, `anonymous`, `elite` |
| `credit_tx_type` | `purchase`, `acquire`, `refund`, `admin_adjust` |

## Migration conventions

- One migration per logical change. Don't bundle unrelated schema changes.
- Migration filenames: `NNN_descriptive_name.py` (e.g., `001_initial_schema.py`).
- Always include both `upgrade()` and `downgrade()` functions.
- Test migrations against a fresh database AND against a database with existing data.
- Use `alembic revision --autogenerate -m "description"` for model-driven changes, but always review the generated SQL before applying.
405
docs/04-api-reference.md
Normal file
@ -0,0 +1,405 @@
# API reference

## Authentication

All endpoints except `POST /auth/register` and `GET /health` require an API key in the `Authorization` header:

```
Authorization: Bearer pp_a1b2c3d4e5f6g7h8i9j0...
```

API keys use a `pp_` prefix (proxy pool) followed by 48 random characters. The prefix aids visual identification and log filtering.
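A plausible generator for that format (the alphabet and helper name are assumptions, not documented here):

```python
import secrets
import string

KEY_ALPHABET = string.ascii_letters + string.digits

def generate_api_key() -> str:
    # "pp_" prefix + 48 random characters; key[:8] becomes the lookup prefix.
    return "pp_" + "".join(secrets.choice(KEY_ALPHABET) for _ in range(48))
```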
Unauthenticated or invalid requests return `401 Unauthorized`. Requests with valid keys but insufficient credits return `402 Payment Required`.

## Common response patterns

### Pagination

List endpoints support cursor-based pagination:

```
GET /proxies?limit=50&cursor=eyJpZCI6Ii4uLiJ9
```

Response includes a `next_cursor` field when more results exist:

```json
{
  "items": [...],
  "next_cursor": "eyJpZCI6Ii4uLiJ9",
  "total_count": 1234
}
```
```
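The sample cursors decode as base64url-encoded JSON (`eyJpZCI6Ii4uLiJ9` is `{"id":"..."}`), so an opaque-cursor helper might look like this (a sketch; the server's exact encoding is an assumption):

```python
import base64
import json

def encode_cursor(last_id: str) -> str:
    raw = json.dumps({"id": last_id}).encode()
    return base64.urlsafe_b64encode(raw).decode()

def decode_cursor(cursor: str) -> str:
    # Pad to a multiple of 4 in case the server strips '=' padding.
    padded = cursor + "=" * (-len(cursor) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))["id"]
```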

### Error responses

All errors follow a consistent shape:

```json
{
  "error": {
    "code": "INSUFFICIENT_CREDITS",
    "message": "You need at least 1 credit to acquire a proxy. Current balance: 0.",
    "details": {}
  }
}
```

## Proxy domain endpoints

### Sources

#### `GET /sources`

List all configured proxy sources.

**Query parameters**: `is_active` (bool, optional), `limit` (int, default 50), `cursor` (string, optional).

**Response** `200`:
```json
{
  "items": [
    {
      "id": "uuid",
      "url": "https://example.com/proxies.txt",
      "parser_name": "plaintext",
      "cron_schedule": "*/30 * * * *",
      "default_protocol": "http",
      "is_active": true,
      "last_scraped_at": "2025-01-15T10:30:00Z",
      "created_at": "2025-01-01T00:00:00Z"
    }
  ],
  "next_cursor": null,
  "total_count": 5
}
```

#### `POST /sources`

Add a new proxy source.

**Request body**:
```json
{
  "url": "https://example.com/proxies.txt",
  "parser_name": "plaintext",
  "cron_schedule": "*/30 * * * *",
  "default_protocol": "http"
}
```

**Validation**: The `parser_name` must match a registered plugin. If omitted, the registry attempts auto-detection via `supports()`. Returns `422` if no matching parser is found.

**Response** `201`: The created source object.

#### `PATCH /sources/{source_id}`

Update a source. All fields are optional — only provided fields are changed.

#### `DELETE /sources/{source_id}`

Delete a source. Associated proxies are NOT deleted (they may have been discovered by multiple sources).

#### `POST /sources/{source_id}/scrape`

Trigger an immediate scrape of the source, bypassing the cron schedule. Returns the scrape result.

**Response** `200`:
```json
{
  "source_id": "uuid",
  "proxies_discovered": 142,
  "proxies_new": 23,
  "proxies_updated": 119,
  "duration_ms": 1540
}
```

### Proxies

#### `GET /proxies`

Query the proxy pool with filtering and sorting.

**Query parameters**:

| Parameter | Type | Description |
|-----------|------|-------------|
| `status` | string | Filter by status: `active`, `dead`, `unchecked` |
| `protocol` | string | Filter by protocol: `http`, `https`, `socks4`, `socks5` |
| `anonymity` | string | Filter by anonymity: `transparent`, `anonymous`, `elite` |
| `country` | string | ISO 3166-1 alpha-2 country code |
| `min_score` | float | Minimum composite score (0.0–1.0) |
| `max_latency_ms` | float | Maximum average latency |
| `min_uptime_pct` | float | Minimum uptime percentage |
| `verified_within_minutes` | int | Only proxies checked within the last N minutes |
| `sort` | string | Sort field: `score`, `latency`, `uptime`, `last_checked` |
| `order` | string | Sort order: `asc`, `desc` (default: `desc` for score) |
| `limit` | int | Results per page (default: 50, max: 200) |
| `cursor` | string | Pagination cursor |

**Response** `200`:
```json
{
  "items": [
    {
      "id": "uuid",
      "ip": "203.0.113.42",
      "port": 8080,
      "protocol": "http",
      "status": "active",
      "anonymity": "elite",
      "exit_ip": "203.0.113.42",
      "country": "US",
      "score": 0.87,
      "avg_latency_ms": 245.3,
      "uptime_pct": 94.2,
      "last_checked_at": "2025-01-15T10:25:00Z",
      "first_seen_at": "2025-01-10T08:00:00Z",
      "tags": {"provider": "datacenter"}
    }
  ],
  "next_cursor": "...",
  "total_count": 892
}
```

#### `GET /proxies/{proxy_id}`

Get detailed info for a single proxy, including recent check history.

**Response** `200`:
```json
{
  "proxy": { ... },
  "recent_checks": [
    {
      "checker_name": "tcp_connect",
      "stage": 1,
      "passed": true,
      "latency_ms": 120.5,
      "detail": "TCP connect OK",
      "created_at": "2025-01-15T10:25:00Z"
    }
  ]
}
```

#### `POST /proxies/acquire`

Acquire a proxy with an exclusive lease. Costs 1 credit.

**Request body**:
```json
{
  "protocol": "http",
  "country": "US",
  "anonymity": "elite",
  "min_score": 0.7,
  "lease_duration_seconds": 300
}
```

All filter fields are optional. `lease_duration_seconds` defaults to 300 (5 minutes), max 3600 (1 hour).

**Response** `200`:
```json
{
  "lease_id": "uuid",
  "proxy": {
    "ip": "203.0.113.42",
    "port": 8080,
    "protocol": "http",
    "country": "US",
    "anonymity": "elite"
  },
  "expires_at": "2025-01-15T10:30:00Z",
  "credits_remaining": 42
}
```

**Error responses**: `402` if insufficient credits, `404` if no proxy matches the filters, `409` if all matching proxies are currently leased.
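A client-side sketch of how those responses might be handled (helper names are hypothetical; the request itself would be sent with `httpx` or similar):

```python
def build_acquire_body(lease_duration_seconds: int = 300, **filters) -> dict:
    """All filter fields are optional, so drop anything unset."""
    body = {"lease_duration_seconds": lease_duration_seconds, **filters}
    return {k: v for k, v in body.items() if v is not None}

def next_action(status_code: int) -> str:
    # Maps the documented acquire responses to a client decision.
    return {
        200: "use_lease",
        402: "top_up_credits",   # insufficient credits
        404: "relax_filters",    # nothing matches the filters
        409: "retry_later",      # everything matching is currently leased
    }.get(status_code, "raise_error")
```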

#### `POST /proxies/acquire/{lease_id}/release`

Release a lease early. The proxy becomes available immediately. The credit is NOT refunded (credits are consumed on acquisition).

#### `POST /proxies/test`

Test whether working proxies from the pool can reach a specific URL.

**Request body**:
```json
{
  "url": "https://example.com",
  "count": 5,
  "timeout_seconds": 10,
  "protocol": "http",
  "country": "US"
}
```

**Response** `200`:
```json
{
  "url": "https://example.com",
  "results": [
    {
      "proxy_id": "uuid",
      "ip": "203.0.113.42",
      "port": 8080,
      "reachable": true,
      "status_code": 200,
      "latency_ms": 340,
      "error": null
    },
    {
      "proxy_id": "uuid",
      "ip": "198.51.100.10",
      "port": 3128,
      "reachable": false,
      "status_code": null,
      "latency_ms": null,
      "error": "Connection refused by target"
    }
  ],
  "success_rate": 0.8
}
```

### Stats

#### `GET /stats/pool`

Pool health overview.

**Response** `200`:
```json
{
  "total_proxies": 15420,
  "by_status": {"active": 3200, "dead": 11800, "unchecked": 420},
  "by_protocol": {"http": 8000, "https": 4000, "socks4": 1200, "socks5": 2220},
  "by_anonymity": {"transparent": 1500, "anonymous": 1000, "elite": 700},
  "avg_score": 0.62,
  "avg_latency_ms": 380.5,
  "sources_active": 12,
  "sources_total": 15,
  "last_scrape_at": "2025-01-15T10:30:00Z",
  "last_validation_at": "2025-01-15T10:25:00Z"
}
```

#### `GET /stats/plugins`

Plugin registry status.

**Response** `200`:
```json
{
  "parsers": [{"name": "plaintext", "type": "SourceParser"}],
  "checkers": [
    {"name": "tcp_connect", "stage": 1, "priority": 0},
    {"name": "http_anonymity", "stage": 2, "priority": 0}
  ],
  "notifiers": [
    {"name": "smtp", "healthy": true, "subscribes_to": ["proxy.pool_low", "credits.*"]},
    {"name": "webhook", "healthy": false, "subscribes_to": ["*"]}
  ]
}
```

## Accounts domain endpoints

### Auth

#### `POST /auth/register`

Create a new user account and initial API key. No authentication required.

**Request body**:
```json
{
  "email": "user@example.com",
  "display_name": "Alice"
}
```

**Response** `201`:
```json
{
  "user": {"id": "uuid", "email": "user@example.com", "display_name": "Alice"},
  "api_key": {
    "id": "uuid",
    "key": "pp_a1b2c3d4e5f6g7h8i9j0k1l2m3n4o5p6q7r8s9t0u1v2w3x4",
    "prefix": "pp_a1b2c",
    "label": "default"
  }
}
```

**Important**: The `key` field is the raw API key. It is returned ONLY in this response. Store it securely — it cannot be retrieved again.

#### `POST /auth/keys`

Create an additional API key for the authenticated user.

#### `GET /auth/keys`

List all API keys for the authenticated user (returns prefix and metadata, never the full key).

#### `DELETE /auth/keys/{key_id}`

Revoke an API key.

### Account

#### `GET /account`

Get the authenticated user's account info.

#### `GET /account/credits`

Get current credit balance and recent transaction history.

**Response** `200`:
```json
{
  "balance": 42,
  "recent_transactions": [
    {
      "amount": -1,
      "tx_type": "acquire",
      "description": "Proxy acquired: 203.0.113.42:8080",
      "created_at": "2025-01-15T10:25:00Z"
    },
    {
      "amount": 100,
      "tx_type": "purchase",
      "description": "Credit purchase",
      "created_at": "2025-01-14T00:00:00Z"
    }
  ]
}
```

#### `GET /account/leases`

List the authenticated user's active proxy leases.

## System endpoints

#### `GET /health`

Basic health check. No authentication required.

**Response** `200`:
```json
{
  "status": "healthy",
  "postgres": "connected",
  "redis": "connected",
  "version": "0.1.0"
}
```
185
docs/05-worker-tasks.md
Normal file
@ -0,0 +1,185 @@
# Worker and task reference

## Overview

Background tasks run in a separate ARQ worker process. The worker connects to the same PostgreSQL and Redis instances as the API. Tasks are defined in `proxy_pool.worker.tasks_*` modules and registered in `proxy_pool.worker.settings`.

## Running the worker

```bash
# Development
uv run arq proxy_pool.worker.settings.WorkerSettings

# Docker
docker compose up worker
```

The worker process is independent of the API process. You can run multiple worker instances, though for most deployments one is sufficient (ARQ handles job deduplication via Redis).

## Worker settings

```python
# proxy_pool/worker/settings.py

class WorkerSettings:
    functions = [
        scrape_source,
        scrape_all,
        validate_proxy,
        revalidate_sweep,
        prune_dead_proxies,
        prune_old_checks,
        expire_leases,
    ]

    cron_jobs = [
        cron(scrape_all, minute={0, 30}),                 # Every 30 minutes
        cron(revalidate_sweep, minute={10, 25, 40, 55}),  # Every 15 minutes
        cron(prune_dead_proxies, hour={3}, minute={0}),   # Daily at 3:00 AM
        cron(prune_old_checks, hour={4}, minute={0}),     # Daily at 4:00 AM
        cron(expire_leases, minute=set(range(60))),       # Every minute
    ]

    redis_settings = RedisSettings.from_dsn(settings.redis_url)
    max_jobs = 50
    job_timeout = 300   # 5 minutes
    keep_result = 3600  # Keep results for 1 hour
```

## Task definitions

### Scrape tasks

#### `scrape_all(ctx)`

Periodic task that iterates over all active `ProxySource` records and enqueues a `scrape_source` job for each one. Sources whose `cron_schedule` or parser's `default_schedule()` indicates they aren't due yet are skipped.

**Schedule**: Every 30 minutes (configurable).

**Behavior**: Enqueues individual `scrape_source` jobs rather than scraping inline. This allows the worker pool to parallelize across sources and provides per-source error isolation.

#### `scrape_source(ctx, source_id: str)`

Fetches the URL for a single `ProxySource`, selects the appropriate `SourceParser` plugin, parses the content, and upserts discovered proxies.

**Steps**:
1. Load the `ProxySource` by ID.
2. Fetch the URL via `httpx.AsyncClient` with a configurable timeout (default: 30s).
3. Look up the parser by `source.parser_name` in the plugin registry.
4. Call `parser.parse(raw_bytes, source)` to get a list of `DiscoveredProxy`.
5. Upsert each proxy using `INSERT ... ON CONFLICT (ip, port, protocol) DO UPDATE SET source_id = ?, last_seen_at = now()`.
6. Update `source.last_scraped_at`.
7. Emit `proxy.new_batch` event if new proxies were discovered.
8. On failure, emit `source.failed` event and log the error.
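The upsert in step 5 can be sketched in memory, which also shows how the `proxies_new` / `proxies_updated` tallies in the scrape result arise (hypothetical helper; the real task runs the SQL shown in step 5):

```python
def upsert_proxies(existing: dict, discovered: list[dict],
                   source_id: str, now: str) -> dict:
    """Merge discovered proxies keyed by (ip, port, protocol), tallying results."""
    new = updated = 0
    for p in discovered:
        key = (p["ip"], p["port"], p["protocol"])
        if key in existing:        # ON CONFLICT ... DO UPDATE
            existing[key].update(source_id=source_id, last_seen_at=now)
            updated += 1
        else:                      # plain INSERT
            existing[key] = {**p, "source_id": source_id, "last_seen_at": now}
            new += 1
    return {"proxies_discovered": len(discovered),
            "proxies_new": new, "proxies_updated": updated}
```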

**Error handling**: HTTP errors, parse errors, and database errors are caught and logged. The source is not deactivated on failure — transient errors are expected. A separate `source.stale` event is emitted if a source hasn't produced results in a configurable number of hours.

**Timeout**: 60 seconds (includes fetch + parse + upsert).

### Validation tasks

#### `revalidate_sweep(ctx)`

Periodic task that selects proxies due for revalidation and enqueues `validate_proxy` jobs.

**Selection criteria** (in priority order):
1. Proxies with `status = unchecked` (never validated, highest priority).
2. Proxies with `status = active` and `last_checked_at < now() - interval` (stale active proxies).
3. Proxies with `status = dead` and `last_checked_at < now() - longer_interval` (periodic dead re-check, lower frequency).

**Configurable intervals**:
- Active proxy recheck: every 10 minutes (default).
- Dead proxy recheck: every 6 hours (default).
- Batch size per sweep: 200 proxies (default).

**Schedule**: Every 15 minutes.

#### `validate_proxy(ctx, proxy_id: str)`

Runs the full checker pipeline for a single proxy.

**Steps**:
1. Load the `Proxy` by ID.
2. Create a `CheckContext` with a fresh `httpx.AsyncClient`.
3. Call `run_checker_pipeline(proxy, registry, http_client, db_session)`.
4. The pipeline runs all registered checkers in stage order (see plugin system docs).
5. Compute composite score from results.
6. Update the proxy record with new status, score, latency, uptime, exit IP, country, anonymity.
7. If the proxy transitioned from `active` to `dead` or vice versa, check pool health thresholds and emit `proxy.pool_low` if needed.

**Timeout**: 120 seconds (individual checker timeouts are enforced within the pipeline).

**Concurrency**: Multiple `validate_proxy` jobs can run simultaneously. Each job operates on a different proxy, so there are no conflicts. ARQ's `job_id` parameter is set to `validate:{proxy_id}` to prevent duplicate validation of the same proxy.

### Cleanup tasks

#### `prune_dead_proxies(ctx)`

Removes proxies that have been dead for an extended period.

**Criteria**: `status = dead` AND `last_checked_at < now() - retention_days` (default: 30 days).

**Behavior**: Hard deletes the proxy row. CASCADE deletes remove associated `proxy_checks`, `proxy_tags`, and `proxy_leases`.

**Schedule**: Daily at 3:00 AM.

#### `prune_old_checks(ctx)`

Trims the `proxy_checks` table to control storage growth.

**Strategy**: For each proxy, keep the most recent N check records (default: 100) and apply a retention period (default: 7 days). A check is deleted only when both conditions hold: it falls outside its proxy's most recent 100 records *and* it is older than the retention period. Recent checks therefore survive even past the count limit, and a proxy's newest 100 checks survive even past the retention window.

**Schedule**: Daily at 4:00 AM.

#### `expire_leases(ctx)`

Cleans up expired proxy leases.

**Steps**:
1. Query `proxy_leases WHERE is_released = false AND expires_at < now()`.
2. For each expired lease, set `is_released = true`.
3. Delete the corresponding Redis lease key (if it still exists — it should have expired via TTL, but this is a safety net).

**Schedule**: Every minute.

**Note**: Redis TTL is the primary expiration mechanism. This task is a consistency backstop that ensures the PostgreSQL records are accurate even if Redis keys expire silently.

## Task retry behavior

ARQ retries are configured per-task:

| Task | Max retries | Retry delay |
|------|-------------|-------------|
| `scrape_source` | 2 | 60s exponential |
| `validate_proxy` | 1 | 30s |
| `prune_dead_proxies` | 0 | — |
| `prune_old_checks` | 0 | — |
| `expire_leases` | 1 | 10s |

Retry delays use exponential backoff. Failed tasks after max retries are logged and the job result is stored in Redis for inspection.

## Monitoring

### Job results

ARQ stores job results in Redis for `keep_result` seconds (default: 3600). Query a result by job ID:

```python
from arq.connections import create_pool, RedisSettings
from arq.jobs import Job

redis = await create_pool(RedisSettings.from_dsn("redis://localhost:6379/0"))
result = await Job(job_id="job_id", redis=redis).result()
```

### Health indicators

The `GET /stats/pool` endpoint includes `last_scrape_at` and `last_validation_at` timestamps. If these fall behind schedule, the worker may be down or stuck.

### Logging

Tasks log at structured INFO level on start/completion and WARN/ERROR on failures:

```
INFO scrape_source source_id=abc count_new=23 count_updated=119 duration_ms=1540
WARN scrape_source source_id=def error="HTTP 503" retrying=true attempt=2
ERROR validate_proxy proxy_id=ghi error="Pipeline timeout after 120s"
```
333
docs/06-development-guide.md
Normal file
@ -0,0 +1,333 @@
# Development guide

## Prerequisites

- Python 3.12+
- [uv](https://docs.astral.sh/uv/) (package manager)
- Docker and Docker Compose (for dependencies and testing)
- Git

## Initial setup

### 1. Clone and install

```bash
git clone <repo-url> proxy-pool
cd proxy-pool

# Install all dependencies (including dev) in a virtual env
uv sync

# Verify installation
uv run python -c "import proxy_pool; print('OK')"
```

`uv sync` creates a `.venv/` in the project root, installs all dependencies from `uv.lock`, and installs the `proxy_pool` package in editable mode (thanks to the `src/` layout and `pyproject.toml` build config).

### 2. Start infrastructure

```bash
# Start PostgreSQL and Redis
docker compose up -d postgres redis

# Verify they're running
docker compose ps
```

### 3. Configure environment

```bash
cp .env.example .env
# Edit .env with your local settings
```

Key variables:

```env
DATABASE_URL=postgresql+asyncpg://proxypool:proxypool@localhost:5432/proxypool
REDIS_URL=redis://localhost:6379/0
SECRET_KEY=your-random-secret-for-dev
LOG_LEVEL=DEBUG

# Optional: SMTP for notifier plugin testing
SMTP_HOST=
SMTP_PORT=587
SMTP_USER=
SMTP_PASSWORD=
ALERT_EMAIL=
```

### 4. Run migrations

```bash
uv run alembic upgrade head
```

### 5. Start the application

```bash
# API server (with hot reload)
uv run uvicorn proxy_pool.app:create_app --factory --reload --port 8000

# In a separate terminal: ARQ worker
uv run arq proxy_pool.worker.settings.WorkerSettings
```

The API is now available at `http://localhost:8000`. OpenAPI docs are at `http://localhost:8000/docs`.

## Project layout

```
proxy-pool/
├── src/proxy_pool/        # Application source code
│   ├── app.py             # App factory + lifespan
│   ├── config.py          # Settings (env-driven)
│   ├── common/            # Shared utilities
│   ├── db/                # Database infrastructure
│   ├── proxy/             # Proxy domain module
│   ├── accounts/          # Accounts domain module
│   ├── plugins/           # Plugin system + built-in plugins
│   └── worker/            # ARQ task definitions
├── tests/                 # Test suite
├── alembic/               # Migration files
├── docs/                  # This documentation
└── pyproject.toml         # Project config (uv, ruff, mypy, pytest)
```

See `01-architecture.md` for detailed structure and rationale.

## Working with the database

### Creating a migration

```bash
# Auto-generate from model changes
uv run alembic revision --autogenerate -m "add proxy_tags table"

# Review the generated migration!
cat alembic/versions/NNN_add_proxy_tags_table.py

# Apply it
uv run alembic upgrade head
```

Always review autogenerated migrations. Alembic can miss custom indexes, enum type changes, and data migrations. Common things to verify:

- Enum types are created/altered correctly.
- Index names match the naming convention.
- `downgrade()` reverses the change completely.
- No data is dropped unintentionally.

### Useful Alembic commands

```bash
# Show current revision
uv run alembic current

# Show migration history
uv run alembic history --verbose

# Downgrade one step
uv run alembic downgrade -1

# Downgrade to a specific revision
uv run alembic downgrade abc123

# Generate a blank migration (for data migrations)
uv run alembic revision -m "backfill proxy scores"
```

### Database shell

```bash
# Via Docker
docker compose exec postgres psql -U proxypool

# Or directly
psql postgresql://proxypool:proxypool@localhost:5432/proxypool
```

## Running tests

### Quick: unit tests only (no Docker needed)

```bash
uv run pytest tests/unit/ -x -v
```

### Full: integration tests with Docker dependencies

```bash
# Start test infrastructure
docker compose -f docker-compose.yml -f docker-compose.test.yml up -d postgres redis

# Run all tests
uv run pytest tests/ -x -v --timeout=30

# Or run via Docker (how CI does it)
docker compose -f docker-compose.yml -f docker-compose.test.yml run --rm test
```

### Test organization

- `tests/unit/` — No I/O. All external dependencies are mocked. Fast.
- `tests/integration/` — Uses real PostgreSQL and Redis via Docker. Tests full request flows, database queries, and cache behavior.
- `tests/plugins/` — Plugin-specific tests. Most are unit tests, but some (like SMTP notifier) may use integration fixtures.

### Key fixtures (in `conftest.py`)

```python
@pytest.fixture
async def db_session():
    """Provides an async SQLAlchemy session rolled back after each test."""

@pytest.fixture
async def redis():
    """Provides a Redis connection flushed after each test."""

@pytest.fixture
async def client(db_session, redis):
    """Provides an httpx.AsyncClient wired to a test app instance."""

@pytest.fixture
def registry():
    """Provides a PluginRegistry with built-in plugins loaded."""
```

### Writing a test

```python
# tests/unit/test_scoring.py

from proxy_pool.proxy.service import compute_proxy_score


def test_score_weights_latency():
    checks = [make_check(passed=True, latency_ms=100)]
    score = compute_proxy_score(make_proxy(), checks, make_context())
    assert 0.8 < score <= 1.0


def test_dead_proxy_gets_zero_score():
    checks = [make_check(passed=False)]
    score = compute_proxy_score(make_proxy(), checks, make_context())
    assert score == 0.0
```

```python
# tests/integration/test_acquire_flow.py

async def test_acquire_deducts_credit(client, db_session):
    user = await create_user_with_credits(db_session, credits=10)
    await create_active_proxy(db_session)

    response = await client.post(
        "/proxies/acquire",
        headers={"Authorization": f"Bearer {user.api_key}"},
        json={"protocol": "http"},
    )

    assert response.status_code == 200
    assert response.json()["credits_remaining"] == 9
```

## Code quality

### Linting and formatting

```bash
# Check
uv run ruff check src/ tests/
uv run ruff format --check src/ tests/

# Fix
uv run ruff check --fix src/ tests/
uv run ruff format src/ tests/
```

### Type checking

```bash
uv run mypy src/
```

`mypy` is configured with `strict = true` in `pyproject.toml`. The `pydantic.mypy` plugin is enabled for correct Pydantic model inference.

### Pre-commit (optional)

If you want automated checks on every commit:

```bash
uv tool install pre-commit
pre-commit install
```

## Docker workflow

### Build the image

```bash
docker compose build
```

### Run the full stack

```bash
# Run migrations + start API + worker
docker compose --profile migrate up -d migrate
docker compose up -d api worker
```

### View logs

```bash
docker compose logs -f api worker
```

### Rebuild after code changes

```bash
docker compose build api
docker compose up -d api worker
```

### Shell into a running container

```bash
docker compose exec api bash
docker compose exec postgres psql -U proxypool
docker compose exec redis redis-cli
```

## Adding a new plugin
|
||||
|
||||
1. Create a file in `src/proxy_pool/plugins/builtin/<type>/your_plugin.py`.
|
||||
2. Implement the relevant Protocol (see `02-plugin-system.md`).
|
||||
3. Define `create_plugin(settings: Settings) -> YourPlugin | None`.
|
||||
4. Add tests in `tests/plugins/test_your_plugin.py`.
|
||||
5. Restart the app — the plugin is auto-discovered.
|
||||
|
||||
For third-party plugins, place files in `plugins/contrib/` (or mount a directory at `/app/plugins-contrib` in Docker).
|
||||
|
||||
## Common development tasks
|
||||
|
||||
### Add a new API endpoint
|
||||
|
||||
1. Define Pydantic schemas in `<domain>/schemas.py`.
|
||||
2. Add business logic in `<domain>/service.py`.
|
||||
3. Create the route in `<domain>/router.py`.
|
||||
4. Register the router in `app.py` if it's a new router.
|
||||
5. Add tests.
|
||||
### Add a new database table

1. Define the SQLAlchemy model in `<domain>/models.py`.
2. Import the model in `db/base.py` (so Alembic sees it).
3. Generate a migration: `uv run alembic revision --autogenerate -m "description"`.
4. Review and apply: `uv run alembic upgrade head`.
5. Add tests.
### Add a new background task

1. Define the task function in `worker/tasks_<category>.py`.
2. Register it in `worker/settings.py` (add to `functions` list, and `cron_jobs` if periodic).
3. Restart the worker.
4. Add tests.
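ARQ tasks are plain coroutines whose first argument is the worker-provided context. A sketch, with an invented task name and context key:

```python
import asyncio

# Hypothetical ARQ-style task; the body is a stand-in. Real tasks
# orchestrate calls into the domain service layer, not inline logic.
async def prune_dead_proxies(ctx: dict) -> int:
    cutoff_days = ctx.get("prune_dead_after_days", 30)
    # ... delete proxies dead for longer than cutoff_days via the service ...
    return cutoff_days

# Registration (step 2) then happens in worker/settings.py, roughly:
#   class WorkerSettings:
#       functions = [prune_dead_proxies]
#       cron_jobs = [cron(prune_dead_proxies, hour=3, minute=0)]
```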
# Operations guide

## Deployment

### Docker Compose (single-server)

The simplest deployment for small-to-medium workloads. All services run on a single machine.

```bash
# Clone and configure
git clone <repo-url> proxy-pool && cd proxy-pool
cp .env.example .env
# Edit .env with production values

# Build and start
docker compose build
docker compose --profile migrate up -d migrate  # Run migrations
docker compose up -d api worker                 # Start services
```
### Production considerations

**API scaling**: Run multiple API instances behind a load balancer. The API is stateless — any instance can handle any request. In Docker Compose, use `docker compose up -d --scale api=3`.

**Worker scaling**: Typically 1-2 worker instances are sufficient. ARQ deduplicates jobs via Redis, so multiple workers don't cause duplicate work. Scale workers if validation throughput is a bottleneck.

**Database**: Use a managed PostgreSQL service (AWS RDS, GCP Cloud SQL, etc.) for production. Enable connection pooling (PgBouncer) if running more than ~10 API instances.

**Redis**: A single Redis instance is sufficient for most workloads. Enable persistence (AOF or RDB snapshots) if you want lease state to survive Redis restarts. For high availability, use Redis Sentinel or a managed Redis service.
## Configuration reference

All configuration is via environment variables, parsed by `pydantic-settings`.

### Required

| Variable | Description | Example |
|----------|-------------|---------|
| `DATABASE_URL` | PostgreSQL connection string | `postgresql+asyncpg://user:pass@host:5432/db` |
| `REDIS_URL` | Redis connection string | `redis://host:6379/0` |
| `SECRET_KEY` | Used for internal signing (API key generation) | Random 64+ character string |

### Application

| Variable | Default | Description |
|----------|---------|-------------|
| `APP_NAME` | `proxy-pool` | Application name (appears in logs, OpenAPI docs) |
| `LOG_LEVEL` | `INFO` | Logging level: `DEBUG`, `INFO`, `WARNING`, `ERROR` |
| `CORS_ORIGINS` | `[]` | Comma-separated list of allowed CORS origins |
| `API_KEY_PREFIX` | `pp_` | Prefix for generated API keys |

### Proxy pipeline

| Variable | Default | Description |
|----------|---------|-------------|
| `SCRAPE_TIMEOUT_SECONDS` | `30` | HTTP timeout when fetching proxy sources |
| `SCRAPE_USER_AGENT` | `ProxyPool/0.1` | User-Agent header for scrape requests |
| `CHECK_TCP_TIMEOUT` | `5.0` | Timeout for TCP connect checks |
| `CHECK_HTTP_TIMEOUT` | `10.0` | Timeout for HTTP-level checks |
| `CHECK_PIPELINE_TIMEOUT` | `120` | Overall pipeline timeout per proxy |
| `JUDGE_URL` | `http://httpbin.org/ip` | URL used by the HTTP anonymity checker to determine exit IP |
| `REVALIDATE_ACTIVE_INTERVAL_MINUTES` | `10` | How often active proxies are re-checked |
| `REVALIDATE_DEAD_INTERVAL_HOURS` | `6` | How often dead proxies are re-checked |
| `REVALIDATE_BATCH_SIZE` | `200` | Max proxies per revalidation sweep |
| `POOL_LOW_THRESHOLD` | `100` | Emit `proxy.pool_low` event when active count drops below this |

### Accounts

| Variable | Default | Description |
|----------|---------|-------------|
| `DEFAULT_CREDITS` | `100` | Credits granted to new accounts |
| `MAX_LEASE_DURATION_SECONDS` | `3600` | Maximum allowed lease duration |
| `DEFAULT_LEASE_DURATION_SECONDS` | `300` | Default lease duration if not specified |
| `CREDIT_LOW_THRESHOLD` | `10` | Emit `credits.low_balance` when balance drops below this |

### Cleanup

| Variable | Default | Description |
|----------|---------|-------------|
| `PRUNE_DEAD_AFTER_DAYS` | `30` | Delete dead proxies older than this |
| `PRUNE_CHECKS_AFTER_DAYS` | `7` | Delete check history older than this |
| `PRUNE_CHECKS_KEEP_LAST` | `100` | Always keep at least this many checks per proxy |

### Notifications

| Variable | Default | Description |
|----------|---------|-------------|
| `SMTP_HOST` | (empty) | SMTP server. If empty, SMTP notifier is disabled. |
| `SMTP_PORT` | `587` | SMTP port |
| `SMTP_USER` | (empty) | SMTP username |
| `SMTP_PASSWORD` | (empty) | SMTP password |
| `ALERT_EMAIL` | (empty) | Recipient for alert emails |
| `WEBHOOK_URL` | (empty) | Webhook URL. If empty, webhook notifier is disabled. |

### Redis cache

| Variable | Default | Description |
|----------|---------|-------------|
| `CACHE_PROXY_LIST_TTL` | `60` | TTL in seconds for cached proxy query results |
| `CACHE_CREDIT_BALANCE_TTL` | `300` | TTL in seconds for cached credit balances |
## Monitoring

### Health check

```bash
curl http://localhost:8000/health
```

Returns `200` with connection status for PostgreSQL and Redis. Use this as a Docker/Kubernetes health check and load balancer target.

### Key metrics to watch

**Pool health** (`GET /stats/pool`):
- `by_status.active` — The number of working proxies. If this drops suddenly, investigate source failures or upstream blocks.
- `last_scrape_at` — If this is stale, the worker may be down or the scrape task is failing.
- `last_validation_at` — If this is stale, validation is backed up or the worker is stuck.

**Plugin health** (`GET /stats/plugins`):
- Check `notifiers[].healthy` — if a notifier is unhealthy, alerts won't be delivered.

**Worker job queue**: Monitor Redis keys `arq:queue:default` (pending jobs) and `arq:result:*` (completed/failed jobs). A growing queue indicates the worker can't keep up.
### Log format

Logs are structured JSON in production (`LOG_LEVEL=INFO`):

```json
{
  "timestamp": "2025-01-15T10:30:00Z",
  "level": "INFO",
  "message": "scrape_source completed",
  "source_id": "abc-123",
  "proxies_new": 23,
  "duration_ms": 1540
}
```
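A formatter along these lines produces that shape with only the standard library; it is a sketch, and the project's actual logging setup (including the `extra_fields` attribute used here) may differ.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render log records as single-line JSON (illustrative sketch)."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Task code can attach structured fields to the record.
        entry.update(getattr(record, "extra_fields", {}))
        return json.dumps(entry)
```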
### Alerting

The built-in notification system handles operational alerts:

- `proxy.pool_low` — Active proxy count below threshold. Action: add more sources or investigate why proxies are dying.
- `source.failed` — A scrape failed. Usually transient (upstream 503). Investigate if persistent.
- `source.stale` — A source hasn't produced results in N hours. The source may be dead or blocking your scraper.
- `credits.low_balance` / `credits.exhausted` — User account alerts. No operational action needed unless it's your own account.
## Troubleshooting

### Proxies are all dying

**Symptoms**: `by_status.active` dropping, `by_status.dead` increasing.

**Possible causes**:
- The judge URL (`JUDGE_URL`) is down or rate-limiting you. Check if `httpbin.org/ip` is accessible from your server.
- Your server's IP is blocked by proxy providers. Try from a different IP or use a self-hosted judge endpoint.
- Proxy sources are returning stale lists. Check `last_scraped_at` on sources.

**Fix**: Self-host a simple judge endpoint (a Flask/FastAPI app that returns `{"ip": request.remote_addr}`) to eliminate the dependency on httpbin.
### Worker is not processing jobs

**Symptoms**: `last_scrape_at` and `last_validation_at` are stale. Redis queue is growing.

**Check**:
```bash
docker compose logs worker --tail=50
docker compose exec redis redis-cli LLEN arq:queue:default
```

**Possible causes**:
- Worker process crashed. Restart it: `docker compose restart worker`.
- Redis connection lost. Check Redis health: `docker compose exec redis redis-cli ping`.
- A task is stuck (infinite loop or hung network call). Check `CHECK_PIPELINE_TIMEOUT`.
### Database connections exhausted

**Symptoms**: `asyncpg.exceptions.TooManyConnectionsError` or slow queries.

**Fix**: Reduce the connection pool size in `DATABASE_URL` parameters, or deploy PgBouncer. The default asyncpg pool size is 10 connections per process — with 3 API instances and 1 worker, that's 40 connections. PostgreSQL's default limit is 100.

```env
# In DATABASE_URL or via SQLAlchemy pool config
DATABASE_POOL_SIZE=5
DATABASE_MAX_OVERFLOW=10
```
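The back-of-envelope math, assuming each process can hold up to `pool_size + max_overflow` connections at peak:

```python
# Sizing sketch: worst-case connection count across all processes.
def peak_connections(processes: int, pool_size: int, max_overflow: int) -> int:
    return processes * (pool_size + max_overflow)

# 3 API instances + 1 worker with the settings above:
print(peak_connections(4, 5, 10))  # 60, under PostgreSQL's default 100
```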
### Redis memory growing

**Symptoms**: Redis memory usage increasing over time.

**Possible causes**:
- ARQ job results not expiring. Check `keep_result` setting.
- Proxy cache not being invalidated. Verify `CACHE_PROXY_LIST_TTL` is set.
- Lease keys not expiring (should auto-expire via TTL).

**Fix**: Set a Redis `maxmemory` policy:
```
maxmemory 256mb
maxmemory-policy allkeys-lru
```
### Migration failed

**Symptoms**: `alembic upgrade head` errors.

**Steps**:
1. Check the current state: `uv run alembic current`.
2. Look at the error — usually a constraint violation or type mismatch.
3. If the migration is partially applied, you may need to manually fix the state: `uv run alembic stamp <revision>`.
4. For production, always test migrations against a copy of the production database first.
## Backup and recovery

### Database backup

```bash
# Dump
docker compose exec postgres pg_dump -U proxypool proxypool > backup.sql

# Restore
docker compose exec -T postgres psql -U proxypool proxypool < backup.sql
```

### Redis

For proxy pool, Redis data is ephemeral (cache + queue). Losing Redis state means:
- Cached proxy lists are rebuilt on next query (minor latency spike).
- Active leases are lost (the `expire_leases` task will clean up PostgreSQL state).
- Pending ARQ jobs are lost (the next cron cycle will re-enqueue them).

If lease integrity is critical, enable Redis persistence (AOF recommended):
```
appendonly yes
appendfsync everysec
```