# Architecture overview
## Purpose
Proxy Pool is a FastAPI backend that discovers, validates, and serves free proxy servers. It scrapes configurable proxy list sources, runs a multi-stage validation pipeline to determine which proxies are alive and usable, and exposes a query API for consumers to find and acquire working proxies.
## System components
The application is a single FastAPI process with a companion ARQ worker process. Both share the same codebase and connect to the same PostgreSQL database and Redis instance.
### FastAPI application
The HTTP API layer. Handles all inbound requests, dependency injection, middleware (CORS, rate limiting, request logging), and the app lifespan (startup/shutdown). Built with the app factory pattern via `create_app()` so tests can instantiate independent app instances.
### ARQ worker
A separate process running ARQ (async Redis queue). Executes background tasks: periodic source scraping, proxy validation sweeps, stale proxy cleanup, and lease expiration. Tasks are defined in `proxy_pool.worker` and import from domain service layers — they contain orchestration, not business logic.
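A sketch of what "orchestration, not business logic" looks like in a task module; the task name matches the one mentioned below, but the context keys and the injected service callable are assumptions for illustration:

```python
async def scrape_all(ctx: dict) -> int:
    """Orchestration only: iterate active sources and delegate each one to
    the domain service layer (`scrape_source` here is a hypothetical
    service-layer callable injected via the worker context)."""
    scraped = 0
    for source in ctx["sources"]:
        await ctx["scrape_source"](source)  # business logic lives in the service
        scraped += 1
    return scraped


class WorkerSettings:
    """ARQ discovers tasks (and cron schedules) from a settings class like
    this; the schedule line is commented because it needs `from arq import cron`."""
    functions = [scrape_all]
    # cron_jobs = [cron(scrape_all, minute=0)]
```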
### PostgreSQL (asyncpg)
Primary data store. All proxy data, user accounts, credit ledger, check history, and source configuration. Accessed through SQLAlchemy 2.0 async ORM with asyncpg as the dialect driver. Schema managed by Alembic.
### Redis
Serves three roles:
- **Task queue**: ARQ uses Redis as its message broker for enqueuing and scheduling background tasks.
- **Cache**: Hot proxy lists (the "top N good proxies" result), user credit balances, and recently-validated proxy sets are cached in Redis with TTL-based expiration.
- **Lease manager**: The proxy acquire endpoint uses Redis `SET key value EX ttl NX` for atomic lease creation. This prevents two consumers from acquiring the same proxy simultaneously.
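The lease primitive can be sketched in one function; the key format and default TTL are assumptions, and `redis` is any client exposing redis-py's async `set` signature:

```python
async def acquire_lease(redis, proxy_id: str, user_id: str, ttl: int = 300) -> bool:
    """Atomically lease a proxy for `ttl` seconds.

    SET key value EX ttl NX writes the key only if it does not already
    exist, so two concurrent consumers can never hold the same proxy;
    redis-py returns None on a failed NX set, hence the bool() coercion.
    """
    return bool(await redis.set(f"lease:{proxy_id}", user_id, ex=ttl, nx=True))
```

When the TTL expires, Redis drops the key and the proxy becomes acquirable again without any cleanup code on the hot path.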
### Plugin registry
A runtime plugin system that manages three plugin types: source parsers, proxy checkers, and notifiers. Plugins are discovered at startup by scanning the `plugins/builtin/` and `plugins/contrib/` directories. The registry validates plugins against Python Protocol classes and stores them for use by the pipeline and event bus.
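The Protocol-based validation can be sketched as follows; the `SourceParser` method names and the registry API are illustrative, not the project's actual definitions:

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class SourceParser(Protocol):
    def supports(self, url: str) -> bool: ...
    def parse(self, raw: str) -> list: ...


class PluginRegistry:
    """Validates plugins against a Protocol before storing them.

    Caveat: a runtime_checkable isinstance() check only verifies that the
    methods exist, not that their signatures match.
    """

    def __init__(self) -> None:
        self._parsers: dict[str, SourceParser] = {}

    def register(self, name: str, plugin: object) -> None:
        if not isinstance(plugin, SourceParser):
            raise TypeError(f"{name!r} does not satisfy SourceParser")
        self._parsers[name] = plugin

    def find(self, url: str) -> "SourceParser | None":
        # Auto-detection path: first parser whose supports() accepts the URL.
        return next((p for p in self._parsers.values() if p.supports(url)), None)
```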
## Domain modules
The codebase is split into two domain modules with a shared infrastructure layer:
### Proxy domain (`proxy_pool.proxy`)
Owns all proxy-related data and logic: sources, proxies, check history, tags, scoring, and the validation pipeline. Exposes routes for CRUD on sources, querying/filtering proxies, running on-demand site reachability tests, and viewing pool health stats.
### Accounts domain (`proxy_pool.accounts`)
Owns user accounts, API key authentication, and the credit ledger. Exposes routes for key management, account info, and credit balance/history. Provides a FastAPI dependency (`get_current_user`) that resolves an API key header into an authenticated `User`.
### Integration boundary
The two domains never import from each other directly. The single integration point is the `POST /proxies/acquire` endpoint, which:
1. Resolves the API key via the accounts auth dependency
2. Checks the user's credit balance (Redis cache, DB fallback)
3. Selects the best matching proxy via the proxy service layer
4. Creates an atomic lease in Redis
5. Debits the credit ledger and logs the lease in a single DB transaction
6. Invalidates the cached credit balance
This is the only place where both domains participate in the same request.
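Steps 2-6 can be sketched as a single orchestration function; every collaborator here (`credits`, `proxies`, `leases`, `ledger`, `cache`) is a hypothetical injected service, not the project's actual API, and step 1 (auth) happens in the FastAPI dependency before this runs:

```python
async def acquire_proxy(user_id: int, filters: dict, *, credits, proxies,
                        leases, ledger, cache) -> dict:
    """Orchestrate the acquire flow across both domains."""
    if await credits.balance(user_id) < 1:              # step 2: cache/DB balance
        raise PermissionError("insufficient credits")
    proxy = await proxies.select_best(filters)          # step 3: proxy service
    if not await leases.create(proxy["id"], user_id):   # step 4: atomic SET NX
        raise RuntimeError("proxy already leased")
    await ledger.debit_and_log(user_id, proxy["id"])    # step 5: one DB tx
    await cache.invalidate_balance(user_id)             # step 6: drop stale cache
    return proxy
```

Injecting the collaborators keeps the endpoint thin and lets tests exercise the full flow with fakes.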
## Data flow
### Discovery
A periodic ARQ cron task (`scrape_all`) iterates over active `ProxySource` records. For each source, it fetches the URL, selects the appropriate `SourceParser` plugin (by `parser_name` or auto-detection via `supports()`), parses the raw content into a list of `DiscoveredProxy` objects, and upserts them into the `proxies` table using `ON CONFLICT (ip, port, protocol) DO UPDATE`.
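The upsert can be sketched with SQLAlchemy's PostgreSQL `insert` construct; the table definition below is illustrative (the real `proxies` table carries more columns), but the conflict target matches the one above:

```python
from sqlalchemy import Column, Integer, MetaData, String, Table
from sqlalchemy.dialects.postgresql import insert

metadata = MetaData()
proxies = Table(
    "proxies", metadata,
    Column("ip", String, primary_key=True),
    Column("port", Integer, primary_key=True),
    Column("protocol", String, primary_key=True),
    Column("last_seen_at", String),
)


def upsert_discovered(rows: list) :
    """Insert newly discovered proxies; on a duplicate (ip, port, protocol),
    refresh the last-seen timestamp instead of failing the batch."""
    stmt = insert(proxies).values(rows)
    return stmt.on_conflict_do_update(
        index_elements=["ip", "port", "protocol"],
        set_={"last_seen_at": stmt.excluded.last_seen_at},
    )
```

`stmt.excluded` refers to the row that would have been inserted, which is what PostgreSQL's `EXCLUDED` pseudo-table exposes.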
### Validation
Newly discovered proxies (status `UNCHECKED`) and proxies due for revalidation are fed into the checker pipeline. The pipeline runs all registered `ProxyChecker` plugins in stage order:
- **Stage 1** — Quick liveness checks (TCP connect, SOCKS handshake). Run concurrently within the stage.
- **Stage 2** — HTTP-level checks (exit IP detection, anonymity classification, GeoIP lookup). Run concurrently within the stage.
- **Stage 3** — Optional site-specific reachability checks.
If any checker in a stage fails, all subsequent stages are skipped. Every check result is logged to `proxy_checks` for historical analysis. After the pipeline completes, a composite score is computed from latency, uptime percentage, and proxy age, then persisted to the `proxies` table.
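The fail-fast staging described above can be sketched in a few lines; checker and result shapes are assumptions (real checkers return richer objects than booleans):

```python
import asyncio


async def run_pipeline(proxy, stages):
    """Run checker stages in order.

    Checkers within a stage run concurrently via asyncio.gather; if any
    checker in a stage fails, all later stages are skipped.
    """
    results = []
    for stage in stages:
        stage_results = await asyncio.gather(*(check(proxy) for check in stage))
        results.extend(stage_results)
        if not all(stage_results):
            break  # e.g. TCP connect failed: no point running HTTP checks
    return results
```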
### Serving
The query API (`GET /proxies`) filters by protocol, country, anonymity level, minimum score, latency range, and "last verified within N minutes." Results are served from a Redis cache when available, with cache invalidation triggered by validation sweeps.
The acquire endpoint (`POST /proxies/acquire`) selects a proxy matching the requested filters, creates a time-limited lease, debits one credit, and returns the proxy details. The lease is tracked in both Redis (for fast exclusion from future queries) and PostgreSQL (for audit trail and cleanup).
### Notification
The event bus dispatches events (`proxy.pool_low`, `source.failed`, `credits.low_balance`, etc.) to registered `Notifier` plugins. Notifications are fire-and-forget via `asyncio.create_task` — they never block the main request or pipeline path.
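A minimal sketch of that fire-and-forget dispatch; the bus API is hypothetical, but the `asyncio.create_task` pattern is the one described above:

```python
import asyncio


class EventBus:
    """Dispatch events to notifiers without ever awaiting them."""

    def __init__(self) -> None:
        self._notifiers = []

    def subscribe(self, notifier) -> None:
        self._notifiers.append(notifier)

    def emit(self, event: str, payload: dict) -> None:
        # Fire-and-forget: schedule each notifier coroutine and return
        # immediately, so emitting never blocks the request/pipeline path.
        for notifier in self._notifiers:
            task = asyncio.create_task(notifier(event, payload))
            # Retrieve the exception (if any) so a failed notifier never
            # surfaces as an "exception was never retrieved" warning.
            task.add_done_callback(lambda t: t.exception())
```

Note `emit()` must be called from within a running event loop, since `create_task` requires one.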
## Technology decisions
| Component | Choice | Rationale |
|-----------|--------|-----------|
| Web framework | FastAPI | Async-native, dependency injection, automatic OpenAPI docs |
| ORM | SQLAlchemy 2.0 async | Mapped classes, native async session, Alembic integration |
| DB driver | asyncpg | Fastest async PostgreSQL driver for Python |
| Migrations | Alembic | Industry standard, autogeneration from SA models |
| Task queue | ARQ + Redis | Lightweight async task queue, cron scheduling built in |
| HTTP client | httpx + httpx-socks | Async, SOCKS proxy support, connection pooling |
| Settings | pydantic-settings | Env-driven config with type validation |
| Package manager | uv | Fast, lockfile-based, handles editable installs |
| Containerization | Docker + Compose | Multi-stage builds, service orchestration |
## Deployment topology
```
                ┌───────────────┐
                │ Load balancer │
                └───────┬───────┘
        ┌───────────────┼───────────────┐
        │               │               │
  ┌─────┴─────┐   ┌─────┴─────┐   ┌─────┴─────┐
  │  API (N)  │   │  API (N)  │   │  API (N)  │
  └─────┬─────┘   └─────┬─────┘   └─────┬─────┘
        └───────────────┼───────────────┘
        ┌───────────────┼───────────────┐
        │               │               │
  ┌─────┴─────┐   ┌─────┴─────┐   ┌─────┴─────┐
  │ Worker (M)│   │PostgreSQL │   │   Redis   │
  └───────────┘   └───────────┘   └───────────┘
```
API processes are stateless and can be scaled horizontally. Workers typically run as one or two instances; ARQ deduplicates jobs via Redis, so additional workers do not repeat scrape or validation work, but they add little benefit. PostgreSQL and Redis are single-instance in the default setup but can be replicated per standard practices.