# Operations guide

## Deployment

### Docker Compose (single-server)

The simplest deployment for small-to-medium workloads. All services run on a single machine.

```bash
# Clone and configure
git clone proxy-pool && cd proxy-pool
cp .env.example .env
# Edit .env with production values

# Build and start
docker compose build
docker compose --profile migrate up -d migrate  # Run migrations
docker compose up -d api worker                 # Start services
```

### Production considerations

**API scaling**: Run multiple API instances behind a load balancer. The API is stateless — any instance can handle any request. In Docker Compose, use `docker compose up -d --scale api=3`.

**Worker scaling**: Typically 1-2 worker instances are sufficient. ARQ deduplicates jobs via Redis, so multiple workers don't cause duplicate work. Scale workers if validation throughput is a bottleneck.

**Database**: Use a managed PostgreSQL service (AWS RDS, GCP Cloud SQL, etc.) for production. Enable connection pooling (PgBouncer) if running more than ~10 API instances.

**Redis**: A single Redis instance is sufficient for most workloads. Enable persistence (AOF or RDB snapshots) if you want lease state to survive Redis restarts. For high availability, use Redis Sentinel or a managed Redis service.

## Configuration reference

All configuration is via environment variables, parsed by `pydantic-settings`.
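For orientation, a minimal production `.env` might look like the following sketch (all values are placeholders; the full variable list is in the tables below):

```env
DATABASE_URL=postgresql+asyncpg://proxypool:changeme@db.internal:5432/proxypool
REDIS_URL=redis://redis.internal:6379/0
SECRET_KEY=replace-with-a-random-64-plus-character-string
LOG_LEVEL=INFO
CORS_ORIGINS=https://app.example.com
POOL_LOW_THRESHOLD=100
```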
### Required

| Variable | Description | Example |
|----------|-------------|---------|
| `DATABASE_URL` | PostgreSQL connection string | `postgresql+asyncpg://user:pass@host:5432/db` |
| `REDIS_URL` | Redis connection string | `redis://host:6379/0` |
| `SECRET_KEY` | Used for internal signing (API key generation) | Random 64+ character string |

### Application

| Variable | Default | Description |
|----------|---------|-------------|
| `APP_NAME` | `proxy-pool` | Application name (appears in logs, OpenAPI docs) |
| `LOG_LEVEL` | `INFO` | Logging level: `DEBUG`, `INFO`, `WARNING`, `ERROR` |
| `CORS_ORIGINS` | `[]` | Comma-separated list of allowed CORS origins |
| `API_KEY_PREFIX` | `pp_` | Prefix for generated API keys |

### Proxy pipeline

| Variable | Default | Description |
|----------|---------|-------------|
| `SCRAPE_TIMEOUT_SECONDS` | `30` | HTTP timeout when fetching proxy sources |
| `SCRAPE_USER_AGENT` | `ProxyPool/0.1` | User-Agent header for scrape requests |
| `CHECK_TCP_TIMEOUT` | `5.0` | Timeout for TCP connect checks |
| `CHECK_HTTP_TIMEOUT` | `10.0` | Timeout for HTTP-level checks |
| `CHECK_PIPELINE_TIMEOUT` | `120` | Overall pipeline timeout per proxy |
| `JUDGE_URL` | `http://httpbin.org/ip` | URL used by the HTTP anonymity checker to determine exit IP |
| `REVALIDATE_ACTIVE_INTERVAL_MINUTES` | `10` | How often active proxies are re-checked |
| `REVALIDATE_DEAD_INTERVAL_HOURS` | `6` | How often dead proxies are re-checked |
| `REVALIDATE_BATCH_SIZE` | `200` | Max proxies per revalidation sweep |
| `POOL_LOW_THRESHOLD` | `100` | Emit `proxy.pool_low` event when active count drops below this |

### Accounts

| Variable | Default | Description |
|----------|---------|-------------|
| `DEFAULT_CREDITS` | `100` | Credits granted to new accounts |
| `MAX_LEASE_DURATION_SECONDS` | `3600` | Maximum allowed lease duration |
| `DEFAULT_LEASE_DURATION_SECONDS` | `300` | Default lease duration if not specified |
| `CREDIT_LOW_THRESHOLD` | `10` | Emit `credits.low_balance` when balance drops below this |

### Cleanup

| Variable | Default | Description |
|----------|---------|-------------|
| `PRUNE_DEAD_AFTER_DAYS` | `30` | Delete dead proxies older than this |
| `PRUNE_CHECKS_AFTER_DAYS` | `7` | Delete check history older than this |
| `PRUNE_CHECKS_KEEP_LAST` | `100` | Always keep at least this many checks per proxy |

### Notifications

| Variable | Default | Description |
|----------|---------|-------------|
| `SMTP_HOST` | (empty) | SMTP server. If empty, the SMTP notifier is disabled. |
| `SMTP_PORT` | `587` | SMTP port |
| `SMTP_USER` | (empty) | SMTP username |
| `SMTP_PASSWORD` | (empty) | SMTP password |
| `ALERT_EMAIL` | (empty) | Recipient for alert emails |
| `WEBHOOK_URL` | (empty) | Webhook URL. If empty, the webhook notifier is disabled. |

### Redis cache

| Variable | Default | Description |
|----------|---------|-------------|
| `CACHE_PROXY_LIST_TTL` | `60` | TTL in seconds for cached proxy query results |
| `CACHE_CREDIT_BALANCE_TTL` | `300` | TTL in seconds for cached credit balances |

## Monitoring

### Health check

```bash
curl http://localhost:8000/health
```

Returns `200` with connection status for PostgreSQL and Redis. Use this as a Docker/Kubernetes health check and load balancer target.

### Key metrics to watch

**Pool health** (`GET /stats/pool`):

- `by_status.active` — the number of working proxies. If this drops suddenly, investigate source failures or upstream blocks.
- `last_scrape_at` — if this is stale, the worker may be down or the scrape task is failing.
- `last_validation_at` — if this is stale, validation is backed up or the worker is stuck.

**Plugin health** (`GET /stats/plugins`):

- Check `notifiers[].healthy` — if a notifier is unhealthy, alerts won't be delivered.

**Worker job queue**: Monitor Redis keys `arq:queue:default` (pending jobs) and `arq:result:*` (completed/failed jobs). A growing queue indicates the worker can't keep up.
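As a sketch of how the staleness checks above could be automated against `GET /stats/pool` (the field names follow the stats response; the 30-minute threshold is an arbitrary assumption):

```python
from datetime import datetime, timedelta, timezone

def is_stale(ts_iso: str, max_age: timedelta) -> bool:
    """Return True if an ISO-8601 timestamp is older than max_age."""
    ts = datetime.fromisoformat(ts_iso.replace("Z", "+00:00"))
    return datetime.now(timezone.utc) - ts > max_age

# Example: flag a pool whose last scrape is older than 30 minutes
stats = {"last_scrape_at": "2025-01-15T10:30:00Z", "by_status": {"active": 142}}
if is_stale(stats["last_scrape_at"], timedelta(minutes=30)):
    print("ALERT: scrape task appears stuck")
```

Wire this into whatever cron or monitoring agent you already run; it only needs the JSON body of the stats endpoint.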
### Log format

Logs are structured JSON in production (`LOG_LEVEL=INFO`):

```json
{
  "timestamp": "2025-01-15T10:30:00Z",
  "level": "INFO",
  "message": "scrape_source completed",
  "source_id": "abc-123",
  "proxies_new": 23,
  "duration_ms": 1540
}
```

### Alerting

The built-in notification system handles operational alerts:

- `proxy.pool_low` — Active proxy count below threshold. Action: add more sources or investigate why proxies are dying.
- `source.failed` — A scrape failed. Usually transient (upstream 503). Investigate if persistent.
- `source.stale` — A source hasn't produced results in N hours. The source may be dead or blocking your scraper.
- `credits.low_balance` / `credits.exhausted` — User account alerts. No operational action needed unless it's your own account.

## Troubleshooting

### Proxies are all dying

**Symptoms**: `by_status.active` dropping, `by_status.dead` increasing.

**Possible causes**:

- The judge URL (`JUDGE_URL`) is down or rate-limiting you. Check if `httpbin.org/ip` is accessible from your server.
- Your server's IP is blocked by proxy providers. Try from a different IP or use a self-hosted judge endpoint.
- Proxy sources are returning stale lists. Check `last_scraped_at` on sources.

**Fix**: Self-host a simple judge endpoint (a Flask/FastAPI app that returns `{"ip": request.remote_addr}`) to eliminate the dependency on httpbin.

### Worker is not processing jobs

**Symptoms**: `last_scrape_at` and `last_validation_at` are stale. Redis queue is growing.

**Check**:

```bash
docker compose logs worker --tail=50
docker compose exec redis redis-cli LLEN arq:queue:default
```

**Possible causes**:

- Worker process crashed. Restart it: `docker compose restart worker`.
- Redis connection lost. Check Redis health: `docker compose exec redis redis-cli ping`.
- A task is stuck (infinite loop or hung network call). Check `CHECK_PIPELINE_TIMEOUT`.

### Database connections exhausted

**Symptoms**: `asyncpg.exceptions.TooManyConnectionsError` or slow queries.
**Fix**: Reduce the connection pool size in `DATABASE_URL` parameters, or deploy PgBouncer. The default asyncpg pool size is 10 connections per process — with 3 API instances and 1 worker, that's 40 connections. PostgreSQL's default limit is 100.

```env
# In DATABASE_URL or via SQLAlchemy pool config
DATABASE_POOL_SIZE=5
DATABASE_MAX_OVERFLOW=10
```

### Redis memory growing

**Symptoms**: Redis memory usage increasing over time.

**Possible causes**:

- ARQ job results not expiring. Check the `keep_result` setting.
- Proxy cache not being invalidated. Verify `CACHE_PROXY_LIST_TTL` is set.
- Lease keys not expiring (they should auto-expire via TTL).

**Fix**: Set a Redis `maxmemory` policy:

```
maxmemory 256mb
maxmemory-policy allkeys-lru
```

### Migration failed

**Symptoms**: `alembic upgrade head` errors.

**Steps**:

1. Check the current state: `uv run alembic current`.
2. Look at the error — usually a constraint violation or type mismatch.
3. If the migration is partially applied, you may need to manually fix the state: `uv run alembic stamp <revision>`.
4. For production, always test migrations against a copy of the production database first.

## Backup and recovery

### Database backup

```bash
# Dump
docker compose exec postgres pg_dump -U proxypool proxypool > backup.sql

# Restore
docker compose exec -T postgres psql -U proxypool proxypool < backup.sql
```

### Redis

For proxy pool, Redis data is ephemeral (cache + queue). Losing Redis state means:

- Cached proxy lists are rebuilt on the next query (minor latency spike).
- Active leases are lost (the `expire_leases` task will clean up PostgreSQL state).
- Pending ARQ jobs are lost (the next cron cycle will re-enqueue them).

If lease integrity is critical, enable Redis persistence (AOF recommended):

```
appendonly yes
appendfsync everysec
```
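To take database dumps on a schedule, a crontab entry along these lines could work (the paths are placeholders; note that `%` must be escaped as `\%` inside crontab):

```
# Nightly at 03:00: datestamped, compressed dump
0 3 * * * cd /opt/proxy-pool && docker compose exec -T postgres pg_dump -U proxypool proxypool | gzip > /var/backups/proxypool-$(date +\%F).sql.gz
```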