# Operations guide
## Deployment
### Docker Compose (single-server)
The simplest deployment for small-to-medium workloads. All services run on a single machine.
```bash
# Clone and configure
git clone <repo-url> proxy-pool && cd proxy-pool
cp .env.example .env
# Edit .env with production values
# Build and start
docker compose build
docker compose --profile migrate up -d migrate # Run migrations
docker compose up -d api worker # Start services
```
### Production considerations
**API scaling**: Run multiple API instances behind a load balancer. The API is stateless — any instance can handle any request. In Docker Compose, use `docker compose up -d --scale api=3`.
**Worker scaling**: Typically 1-2 worker instances are sufficient. ARQ deduplicates jobs via Redis, so multiple workers don't cause duplicate work. Scale workers if validation throughput is a bottleneck.
**Database**: Use a managed PostgreSQL service (AWS RDS, GCP Cloud SQL, etc.) for production. Enable connection pooling (PgBouncer) if running more than ~10 API instances.
**Redis**: A single Redis instance is sufficient for most workloads. Enable persistence (AOF or RDB snapshots) if you want lease state to survive Redis restarts. For high availability, use Redis Sentinel or a managed Redis service.
## Configuration reference
All configuration is via environment variables, parsed by `pydantic-settings`.
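To illustrate how this parsing behaves, here is a stdlib-only sketch of the idea (the `Settings` dataclass, its field subset, and `from_env` are illustrative, not the project's actual model; pydantic-settings additionally validates types and reports all errors at once):

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    """Illustrative subset of the app's configuration."""
    database_url: str
    redis_url: str
    log_level: str = "INFO"
    pool_low_threshold: int = 100

    @classmethod
    def from_env(cls) -> "Settings":
        # Required variables raise KeyError if missing, mirroring the
        # startup validation error pydantic-settings would produce.
        return cls(
            database_url=os.environ["DATABASE_URL"],
            redis_url=os.environ["REDIS_URL"],
            log_level=os.environ.get("LOG_LEVEL", "INFO"),
            pool_low_threshold=int(os.environ.get("POOL_LOW_THRESHOLD", "100")),
        )
```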
### Required
| Variable | Description | Example |
|----------|-------------|---------|
| `DATABASE_URL` | PostgreSQL connection string | `postgresql+asyncpg://user:pass@host:5432/db` |
| `REDIS_URL` | Redis connection string | `redis://host:6379/0` |
| `SECRET_KEY` | Used for internal signing (API key generation) | Random 64+ character string |
### Application
| Variable | Default | Description |
|----------|---------|-------------|
| `APP_NAME` | `proxy-pool` | Application name (appears in logs, OpenAPI docs) |
| `LOG_LEVEL` | `INFO` | Logging level: `DEBUG`, `INFO`, `WARNING`, `ERROR` |
| `CORS_ORIGINS` | `[]` | Comma-separated list of allowed CORS origins |
| `API_KEY_PREFIX` | `pp_` | Prefix for generated API keys |
### Proxy pipeline
| Variable | Default | Description |
|----------|---------|-------------|
| `SCRAPE_TIMEOUT_SECONDS` | `30` | HTTP timeout when fetching proxy sources |
| `SCRAPE_USER_AGENT` | `ProxyPool/0.1` | User-Agent header for scrape requests |
| `CHECK_TCP_TIMEOUT` | `5.0` | Timeout for TCP connect checks |
| `CHECK_HTTP_TIMEOUT` | `10.0` | Timeout for HTTP-level checks |
| `CHECK_PIPELINE_TIMEOUT` | `120` | Overall pipeline timeout per proxy |
| `JUDGE_URL` | `http://httpbin.org/ip` | URL used by the HTTP anonymity checker to determine exit IP |
| `REVALIDATE_ACTIVE_INTERVAL_MINUTES` | `10` | How often active proxies are re-checked |
| `REVALIDATE_DEAD_INTERVAL_HOURS` | `6` | How often dead proxies are re-checked |
| `REVALIDATE_BATCH_SIZE` | `200` | Max proxies per revalidation sweep |
| `POOL_LOW_THRESHOLD` | `100` | Emit `proxy.pool_low` event when active count drops below this |
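The two revalidation intervals and the batch cap combine into a "due for recheck" rule. A hypothetical sketch of that selection logic (the `select_due` function and the dict fields are illustrative; the real task presumably queries PostgreSQL):

```python
from datetime import datetime, timedelta

REVALIDATE_ACTIVE_INTERVAL = timedelta(minutes=10)
REVALIDATE_DEAD_INTERVAL = timedelta(hours=6)
REVALIDATE_BATCH_SIZE = 200

def select_due(proxies: list[dict], now: datetime) -> list[dict]:
    """Pick proxies whose last check is older than the interval
    for their status, capped at the batch size."""
    due = []
    for p in proxies:
        interval = (REVALIDATE_ACTIVE_INTERVAL if p["status"] == "active"
                    else REVALIDATE_DEAD_INTERVAL)
        if now - p["last_checked_at"] >= interval:
            due.append(p)
    # Oldest checks first, so no proxy starves behind the batch cap.
    due.sort(key=lambda p: p["last_checked_at"])
    return due[:REVALIDATE_BATCH_SIZE]
```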
### Accounts
| Variable | Default | Description |
|----------|---------|-------------|
| `DEFAULT_CREDITS` | `100` | Credits granted to new accounts |
| `MAX_LEASE_DURATION_SECONDS` | `3600` | Maximum allowed lease duration |
| `DEFAULT_LEASE_DURATION_SECONDS` | `300` | Default lease duration if not specified |
| `CREDIT_LOW_THRESHOLD` | `10` | Emit `credits.low_balance` when balance drops below this |
### Cleanup
| Variable | Default | Description |
|----------|---------|-------------|
| `PRUNE_DEAD_AFTER_DAYS` | `30` | Delete dead proxies older than this |
| `PRUNE_CHECKS_AFTER_DAYS` | `7` | Delete check history older than this |
| `PRUNE_CHECKS_KEEP_LAST` | `100` | Always keep at least this many checks per proxy |
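Note that `PRUNE_CHECKS_KEEP_LAST` takes precedence over `PRUNE_CHECKS_AFTER_DAYS`: a check inside the newest 100 survives even if it is older than the cutoff. A sketch of that selection rule (the function name and in-memory representation are illustrative; the real task presumably does this in SQL):

```python
from datetime import datetime, timedelta

PRUNE_CHECKS_AFTER_DAYS = 7
PRUNE_CHECKS_KEEP_LAST = 100

def checks_to_prune(check_times: list[datetime], now: datetime) -> list[datetime]:
    """Return timestamps eligible for deletion: older than the cutoff,
    but never cutting into the newest KEEP_LAST checks."""
    cutoff = now - timedelta(days=PRUNE_CHECKS_AFTER_DAYS)
    newest_first = sorted(check_times, reverse=True)
    protected = set(newest_first[:PRUNE_CHECKS_KEEP_LAST])
    return [t for t in newest_first if t < cutoff and t not in protected]
```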
### Notifications
| Variable | Default | Description |
|----------|---------|-------------|
| `SMTP_HOST` | (empty) | SMTP server. If empty, SMTP notifier is disabled. |
| `SMTP_PORT` | `587` | SMTP port |
| `SMTP_USER` | (empty) | SMTP username |
| `SMTP_PASSWORD` | (empty) | SMTP password |
| `ALERT_EMAIL` | (empty) | Recipient for alert emails |
| `WEBHOOK_URL` | (empty) | Webhook URL. If empty, webhook notifier is disabled. |
### Redis cache
| Variable | Default | Description |
|----------|---------|-------------|
| `CACHE_PROXY_LIST_TTL` | `60` | TTL in seconds for cached proxy query results |
| `CACHE_CREDIT_BALANCE_TTL` | `300` | TTL in seconds for cached credit balances |
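Conceptually these are per-key expiries, as Redis provides via `SETEX`. A local stand-in showing the semantics (the `TTLCache` class is a hypothetical illustration; the service stores entries in Redis, not in-process):

```python
import time

CACHE_PROXY_LIST_TTL = 60  # seconds

class TTLCache:
    """Minimal cache with per-entry expiry, approximating
    what the Redis layer does with SETEX."""
    def __init__(self) -> None:
        self._store: dict[str, tuple[float, object]] = {}

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # lazy eviction on read
            return None
        return value

    def set(self, key: str, value, ttl: float = CACHE_PROXY_LIST_TTL) -> None:
        self._store[key] = (time.monotonic() + ttl, value)
```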
## Monitoring
### Health check
```bash
curl http://localhost:8000/health
```
Returns `200` with connection status for PostgreSQL and Redis. Use this as a Docker/Kubernetes health check and load balancer target.
### Key metrics to watch
**Pool health** (`GET /stats/pool`):
- `by_status.active` — The number of working proxies. If this drops suddenly, investigate source failures or upstream blocks.
- `last_scrape_at` — If this is stale, the worker may be down or the scrape task is failing.
- `last_validation_at` — If this is stale, validation is backed up or the worker is stuck.
**Plugin health** (`GET /stats/plugins`):
- Check `notifiers[].healthy` — if a notifier is unhealthy, alerts won't be delivered.
**Worker job queue**: Monitor Redis keys `arq:queue:default` (pending jobs) and `arq:result:*` (completed/failed jobs). A growing queue indicates the worker can't keep up.
### Log format
Logs are emitted as structured JSON in production (where `LOG_LEVEL` is typically `INFO`):
```json
{
  "timestamp": "2025-01-15T10:30:00Z",
  "level": "INFO",
  "message": "scrape_source completed",
  "source_id": "abc-123",
  "proxies_new": 23,
  "duration_ms": 1540
}
```
### Alerting
The built-in notification system handles operational alerts:
- `proxy.pool_low` — Active proxy count below threshold. Action: add more sources or investigate why proxies are dying.
- `source.failed` — A scrape failed. Usually transient (upstream 503). Investigate if persistent.
- `source.stale` — A source hasn't produced results in N hours. The source may be dead or blocking your scraper.
- `credits.low_balance` / `credits.exhausted` — User account alerts. No operational action needed unless it's your own account.
## Troubleshooting
### Proxies are all dying
**Symptoms**: `by_status.active` dropping, `by_status.dead` increasing.
**Possible causes**:
- The judge URL (`JUDGE_URL`) is down or rate-limiting you. Check if `httpbin.org/ip` is accessible from your server.
- Your server's IP is blocked by proxy providers. Try from a different IP or use a self-hosted judge endpoint.
- Proxy sources are returning stale lists. Check `last_scraped_at` on sources.
**Fix**: Self-host a simple judge endpoint (a Flask/FastAPI app that returns `{"ip": request.remote_addr}`) to eliminate dependency on httpbin.
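A stdlib-only sketch of such a judge endpoint (the doc suggests Flask/FastAPI; this hypothetical version avoids extra dependencies and returns the same `{"ip": ...}` shape, so `JUDGE_URL` can point at it):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class JudgeHandler(BaseHTTPRequestHandler):
    """Minimal self-hosted judge: echo the client's source IP.
    When reached through a proxy, the proxy's exit IP is the
    connecting address, which is exactly what the checker needs."""
    def do_GET(self) -> None:
        body = json.dumps({"ip": self.client_address[0]}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args) -> None:  # silence per-request stderr noise
        pass

def serve(port: int = 8080) -> HTTPServer:
    return HTTPServer(("0.0.0.0", port), JudgeHandler)
```

Run it behind your load balancer and set `JUDGE_URL` accordingly.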
### Worker is not processing jobs
**Symptoms**: `last_scrape_at` and `last_validation_at` are stale. Redis queue is growing.
**Check**:
```bash
docker compose logs worker --tail=50
docker compose exec redis redis-cli LLEN arq:queue:default
```
**Possible causes**:
- Worker process crashed. Restart it: `docker compose restart worker`.
- Redis connection lost. Check Redis health: `docker compose exec redis redis-cli ping`.
- A task is stuck (infinite loop or hung network call). Check `CHECK_PIPELINE_TIMEOUT`.
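A pipeline timeout of this kind is typically enforced with `asyncio.wait_for`, which cancels a hung check instead of wedging the worker. A hypothetical sketch (`run_with_deadline` and the `"dead"` sentinel are illustrative, not the project's actual task code):

```python
import asyncio

CHECK_PIPELINE_TIMEOUT = 120  # seconds

async def run_with_deadline(coro, timeout: float = CHECK_PIPELINE_TIMEOUT) -> str:
    """Run one proxy's check pipeline under a single deadline.
    On timeout the coroutine is cancelled and the proxy is
    reported dead rather than blocking the worker forever."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout)
    except asyncio.TimeoutError:
        return "dead"
```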
### Database connections exhausted
**Symptoms**: `asyncpg.exceptions.TooManyConnectionsError` or slow queries.
**Fix**: Reduce the connection pool size in `DATABASE_URL` parameters, or deploy PgBouncer. The default asyncpg pool size is 10 connections per process — with 3 API instances and 1 worker, that's 40 connections. PostgreSQL's default limit is 100.
```env
# In DATABASE_URL or via SQLAlchemy pool config
DATABASE_POOL_SIZE=5
DATABASE_MAX_OVERFLOW=10
```
### Redis memory growing
**Symptoms**: Redis memory usage increasing over time.
**Possible causes**:
- ARQ job results not expiring. Check `keep_result` setting.
- Proxy cache not being invalidated. Verify `CACHE_PROXY_LIST_TTL` is set.
- Lease keys not expiring (should auto-expire via TTL).
**Fix**: Set a Redis `maxmemory` policy:
```
maxmemory 256mb
maxmemory-policy allkeys-lru
```
### Migration failed
**Symptoms**: `alembic upgrade head` errors.
**Steps**:
1. Check the current state: `uv run alembic current`.
2. Look at the error — usually a constraint violation or type mismatch.
3. If the migration is partially applied, you may need to manually fix the state: `uv run alembic stamp <revision>`.
4. For production, always test migrations against a copy of the production database first.
## Backup and recovery
### Database backup
```bash
# Dump
docker compose exec postgres pg_dump -U proxypool proxypool > backup.sql
# Restore
docker compose exec -T postgres psql -U proxypool proxypool < backup.sql
```
### Redis
For the proxy pool, Redis data is ephemeral (cache plus queue). Losing Redis state means:
- Cached proxy lists are rebuilt on next query (minor latency spike).
- Active leases are lost (the `expire_leases` task will clean up PostgreSQL state).
- Pending ARQ jobs are lost (the next cron cycle will re-enqueue them).
If lease integrity is critical, enable Redis persistence (AOF recommended):
```
appendonly yes
appendfsync everysec
```