
Harvester

Python package for harvesting commonly available data, such as free proxy servers.

Modules

Proxy

fetch_list

The proxy module harvests proxies from a URL with the fetch_list function.

It works by running a regular expression against the HTTP response body, looking for strings that match a username:password@ip:port pattern, where username and password are optional.
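That matching step can be sketched with a pattern like the one below. This is an illustration only; the exact expression used by harvester may differ:

```python
import re

# Illustrative pattern: an optional "username:password@" prefix,
# followed by a dotted-quad IP and a port number.
PROXY_PATTERN = re.compile(
    r'(?:(?P<username>[^\s:@]+):(?P<password>[^\s:@]+)@)?'
    r'(?P<ip>\d{1,3}(?:\.\d{1,3}){3}):(?P<port>\d{1,5})'
)

text = "user:pass@10.0.0.1:1080\n203.0.113.5:9050\n"
matches = [m.group(0) for m in PROXY_PATTERN.finditer(text)]
print(matches)  # ['user:pass@10.0.0.1:1080', '203.0.113.5:9050']
```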

from harvester.proxy import fetch_list


URLS = [
    'https://api.openproxylist.xyz/socks4.txt',
    'https://api.openproxylist.xyz/socks5.txt',
    'https://api.proxyscrape.com/?request=displayproxies&proxytype=socks4',
]


def main():
    """Main entry point."""
    for url in URLS:
        proxies = fetch_list(url)
        print(proxies)


if __name__ == '__main__':
    main()

fetch_all

Proxies can be fetched from multiple source URLs by using the fetch_all function.

It takes a list of URLs and an optional max_workers parameter. Proxies will be fetched from the source URLs concurrently using a ThreadPoolExecutor:

from harvester.proxy import fetch_all


URLS = [
    'https://api.openproxylist.xyz/socks4.txt',
    'https://api.openproxylist.xyz/socks5.txt',
    'https://api.proxyscrape.com/?request=displayproxies&proxytype=socks4',
]


def main():
    """Main entry point."""
    proxies = fetch_all(URLS)
    print(proxies)


if __name__ == '__main__':
    main()
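Conceptually, the concurrent fetch can be sketched as follows. This is a simplified illustration rather than harvester's actual implementation, and fetch_one is a hypothetical stand-in for a single-URL fetch:

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_one(url):
    # Hypothetical stand-in for an HTTP fetch plus regex parse;
    # returns the list of proxies found at one URL.
    return [f'proxy-from-{url}']


def fetch_all_sketch(urls, max_workers=4):
    """Fetch proxy lists from all URLs concurrently and merge the results."""
    proxies = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map runs fetch_one across the worker threads
        # and yields results in the order of the input URLs.
        for result in executor.map(fetch_one, urls):
            proxies.extend(result)
    return proxies


print(fetch_all_sketch(['a', 'b']))  # ['proxy-from-a', 'proxy-from-b']
```

Because executor.map yields results in input order, the merged list is deterministic even though the fetches run concurrently.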

Testing

pip install -r requirements.txt
pip install -r requirements-dev.txt
pytest -v