# Harvester
A Python package for harvesting commonly available data, such as lists of free proxy servers.

## Modules
### Proxy
#### fetch_list
The `fetch_list` function in the `proxy` module harvests proxies from a single URL.

It runs a regular expression against the HTTP response and extracts strings that
match the pattern `username:password@ip:port`, where the `username:password@` part
is optional.

```python
from harvester.proxy import fetch_list


URLS = [
    'https://api.openproxylist.xyz/socks4.txt',
    'https://api.openproxylist.xyz/socks5.txt',
    'https://api.proxyscrape.com/?request=displayproxies&proxytype=socks4',
]


def main():
    """Main entry point."""
    for url in URLS:
        proxies = fetch_list(url)
        print(proxies)


if __name__ == '__main__':
    main()
```
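
The pattern matching described above can be sketched as follows. This is a minimal illustration, not the package's actual regular expression: `PROXY_RE` and `extract_proxies` are hypothetical names, and the pattern assumes IPv4 addresses without validating octet or port ranges.

```python
import re

# Hypothetical pattern for illustration only; the real regex used by
# harvester.proxy may differ. The "username:password@" prefix is optional.
PROXY_RE = re.compile(
    r'(?:(?P<username>[^\s:@]+):(?P<password>[^\s:@]+)@)?'
    r'(?P<ip>\d{1,3}(?:\.\d{1,3}){3}):(?P<port>\d{1,5})'
)


def extract_proxies(text):
    """Return every proxy-like string found in text, in order of appearance."""
    # Octet and port ranges are not checked, so 999.999.999.999:99999 matches.
    return [match.group(0) for match in PROXY_RE.finditer(text)]
```

Given a response containing both `1.2.3.4:1080` and `user:pass@5.6.7.8:3128`, this sketch would pick up both entries, since the credentials group is allowed to be absent.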

#### fetch_all
The `fetch_all` function fetches proxies from multiple source URLs.

It takes a list of URLs and an optional `max_workers` parameter, and fetches from
the source URLs concurrently using a `ThreadPoolExecutor`:

```python
from harvester.proxy import fetch_all


URLS = [
    'https://api.openproxylist.xyz/socks4.txt',
    'https://api.openproxylist.xyz/socks5.txt',
    'https://api.proxyscrape.com/?request=displayproxies&proxytype=socks4',
]


def main():
    """Main entry point."""
    proxies = fetch_all(URLS)
    print(proxies)


if __name__ == '__main__':
    main()
```
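
To show how concurrent fetching with a `ThreadPoolExecutor` might work, here is a minimal sketch. It is not the package's implementation: `fetch_all_sketch` and its `fetch_one` parameter are hypothetical, and it assumes each per-URL call returns an iterable of proxy strings.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all_sketch(urls, fetch_one, max_workers=None):
    """Call fetch_one(url) for each URL in a thread pool and merge the results.

    fetch_one is a hypothetical stand-in for a per-URL fetcher and is assumed
    to return an iterable of proxy strings.
    """
    proxies = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map runs the calls concurrently but yields results
        # in the same order as the input URLs.
        for result in executor.map(fetch_one, urls):
            proxies.extend(result)
    return proxies
```

Because the work is I/O-bound (waiting on HTTP responses), threads are a reasonable fit here. Leaving `max_workers` as `None` lets the executor pick its default pool size.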

## Testing
```shell
pip install -r requirements.txt
pip install -r requirement-dev.txt
pytest -v
```