# Harvester

Python package for harvesting commonly available data, such as free proxy servers.
## Modules

### Proxy

#### fetch_list

The `proxy` module harvests proxies from URLs with the `fetch_list` function. It works by running a regular expression against the HTTP response, looking for strings that match a `username:password@ip:port` pattern, where the username and password are optional.
```python
from harvester.proxy import fetch_list

URLS = [
    'https://api.openproxylist.xyz/socks4.txt',
    'https://api.openproxylist.xyz/socks5.txt',
    'https://api.proxyscrape.com/?request=displayproxies&proxytype=socks4',
]


def main():
    """Main entry point."""
    for url in URLS:
        proxies = fetch_list(url)
        print(proxies)


if __name__ == '__main__':
    main()
```
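The exact pattern is internal to the library, but a regex in this spirit might look like the sketch below. The pattern, group names, and sample strings are illustrative assumptions, not the library's actual implementation:

```python
import re

# Illustrative pattern (an assumption, not harvester's actual regex):
# an optional "username:password@" prefix followed by "ip:port".
PROXY_RE = re.compile(
    r'(?:(?P<user>[^\s:@]+):(?P<password>[^\s:@]+)@)?'    # optional credentials
    r'(?P<ip>\d{1,3}(?:\.\d{1,3}){3}):(?P<port>\d{1,5})'  # ip:port
)

for line in ('127.0.0.1:1080', 'alice:secret@10.0.0.2:9050'):
    match = PROXY_RE.search(line)
    print(match.group('user'), match.group('ip'), match.group('port'))
```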
#### fetch_all

Proxies can be fetched from multiple source URLs with the `fetch_all` function. It takes a list of URLs and an optional `max_workers` parameter. Proxies are fetched from the source URLs concurrently using a `ThreadPoolExecutor`:
```python
from harvester.proxy import fetch_all

URLS = [
    'https://api.openproxylist.xyz/socks4.txt',
    'https://api.openproxylist.xyz/socks5.txt',
    'https://api.proxyscrape.com/?request=displayproxies&proxytype=socks4',
]


def main():
    """Main entry point."""
    proxies = fetch_all(URLS)
    print(proxies)


if __name__ == '__main__':
    main()
```
#### validate_socks

SOCKS5 proxies can be tested with the `validate_socks` function. It takes a proxy string as its only argument and returns a `requests.Response` object if the request succeeds; otherwise it raises an exception, and the caller can decide how to proceed. For an example implementation, see `main.py`.
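Because `validate_socks` raises on failure, a caller can use it to filter a harvested list down to working proxies. The helper below is a sketch of that pattern; it takes the validator as a parameter so it is self-contained, and the commented usage with the harvester API is an assumption rather than code from `main.py`:

```python
def first_working(proxies, validate):
    """Return the first proxy for which validate(proxy) does not raise,
    or None if every candidate fails."""
    for proxy in proxies:
        try:
            validate(proxy)  # e.g. harvester.proxy.validate_socks
        except Exception:
            continue  # dead or misbehaving proxy; try the next one
        return proxy
    return None

# Assumed usage with the harvester API:
#   from harvester.proxy import fetch_all, validate_socks
#   proxy = first_working(fetch_all(URLS), validate_socks)
```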
## Testing

```shell
pip install -r requirements.txt
pip install -r requirements-dev.txt
pytest -v
```