harvester/README.md
2023-11-07 19:02:52 -05:00

Harvester

Python package for harvesting commonly available data, such as free proxy servers.

Modules

Proxy

fetch_list

The proxy module harvests proxies from a URL with the fetch_list function.

It works by running a regular expression against the HTTP response and collecting every string that matches the pattern username:password@ip:port, where the username:password@ prefix is optional.

from harvester.proxy import fetch_list


URLS = [
    'https://api.openproxylist.xyz/socks4.txt',
    'https://api.openproxylist.xyz/socks5.txt',
    'https://api.proxyscrape.com/?request=displayproxies&proxytype=socks4',
]


def main():
    """Main entry point."""
    for url in URLS:
        proxies = fetch_list(url)
        print(proxies)


if __name__ == '__main__':
    main()
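
As a rough illustration of the matching step described above, the following sketch pulls proxy strings out of a block of text. The pattern and the extract_proxies helper are assumptions made for this example; the regular expression harvester actually uses may differ.

```python
import re

# Hypothetical pattern: an optional "username:password@" prefix followed by
# an IPv4 address and a port. Not taken from harvester's source.
PROXY_PATTERN = re.compile(
    r'(?:(?P<username>[^\s:@]+):(?P<password>[^\s:@]+)@)?'
    r'(?P<ip>\d{1,3}(?:\.\d{1,3}){3}):(?P<port>\d{1,5})'
)


def extract_proxies(text):
    """Return every proxy-like string found in text, in order of appearance."""
    return [match.group(0) for match in PROXY_PATTERN.finditer(text)]
```

This pattern does not check that each octet is at most 255 or that the port is at most 65535, so a stricter check may be needed before using the results.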

fetch_all

Proxies can be fetched from multiple source URLs by using the fetch_all function.

It takes a list of URLs and an optional max_workers parameter. Proxies will be fetched from the source URLs concurrently using a ThreadPoolExecutor:

from harvester.proxy import fetch_all


URLS = [
    'https://api.openproxylist.xyz/socks4.txt',
    'https://api.openproxylist.xyz/socks5.txt',
    'https://api.proxyscrape.com/?request=displayproxies&proxytype=socks4',
]


def main():
    """Main entry point."""
    proxies = fetch_all(URLS)
    print(proxies)


if __name__ == '__main__':
    main()
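
The concurrent fetching described above could be sketched roughly as follows. This is not harvester's implementation: fetch_all_sketch and its fetch_one parameter are hypothetical names, and the per-URL fetcher is passed in so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all_sketch(urls, fetch_one, max_workers=None):
    """Fetch every URL concurrently and merge the results into one list.

    fetch_one is any callable that takes a URL and returns a list of
    proxies (an assumption made for this sketch).
    """
    proxies = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the same order as urls,
        # regardless of which request finishes first.
        for result in executor.map(fetch_one, urls):
            proxies.extend(result)
    return proxies
```

Threads suit this workload because each worker spends most of its time waiting on an HTTP response. Note that executor.map re-raises the first exception from any worker, so a single failing source aborts the whole call in this sketch.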

validate_socks

SOCKS5 proxies can be tested with the validate_socks function. It takes a proxy string as its only argument and returns a requests.Response object if the request succeeds. Otherwise it raises an exception, and the caller decides how to proceed.

For an example implementation, see main.py.
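
One way a caller might handle the raise-on-failure behavior described above is to sort proxies into working and failed groups. The filter_working helper below is a hypothetical sketch, not part of harvester; it accepts any validator callable so it can be run without the package installed.

```python
def filter_working(proxies, validator):
    """Split proxies into those that validate and those that raise.

    validator is any callable that raises an exception for a bad proxy,
    which is how validate_socks is described as behaving.
    """
    working = []
    failed = {}
    for proxy in proxies:
        try:
            validator(proxy)
        except Exception as exc:
            # Broad on purpose: the exception types raised are not documented
            # here, so record whatever was raised and move on.
            failed[proxy] = exc
        else:
            working.append(proxy)
    return working, failed
```

With harvester installed, the call would presumably look like filter_working(proxies, validate_socks), assuming validate_socks is imported from harvester.proxy in the same way as fetch_list.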

Testing

pip install -r requirements.txt
pip install -r requirement-dev.txt
pytest -v