# Harvester

Python package for harvesting commonly available data, such as free proxy servers.
## Modules

### Proxy

#### fetch_list

The `proxy` module harvests proxies from URLs with the `fetch_list` function. It works by running a regular expression against the HTTP response, looking for strings that match a `username:password@ip:port` pattern, where the username and password are optional.
```python
from harvester.proxy import fetch_list

URLS = [
    'https://api.openproxylist.xyz/socks4.txt',
    'https://api.openproxylist.xyz/socks5.txt',
    'https://api.proxyscrape.com/?request=displayproxies&proxytype=socks4',
]


def main():
    """Main entry point."""
    for url in URLS:
        proxies = fetch_list(url)
        print(proxies)


if __name__ == '__main__':
    main()
```
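The exact pattern is internal to the library, but a regex in this spirit might look like the sketch below. The pattern, group names, and sample strings are illustrative assumptions, not the library's actual implementation:

```python
import re

# Illustrative pattern (an assumption, not harvester's actual regex):
# an optional "username:password@" prefix followed by "ip:port".
PROXY_RE = re.compile(
    r'(?:(?P<user>[^\s:@]+):(?P<password>[^\s:@]+)@)?'    # optional credentials
    r'(?P<ip>\d{1,3}(?:\.\d{1,3}){3}):(?P<port>\d{1,5})'  # ip:port
)

for line in ('127.0.0.1:1080', 'alice:secret@10.0.0.2:9050'):
    match = PROXY_RE.search(line)
    print(match.group('user'), match.group('ip'), match.group('port'))
```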
#### fetch_all

Proxies can be fetched from multiple source URLs with the `fetch_all` function. It takes a list of URLs and an optional `max_workers` parameter. Proxies are fetched from the source URLs concurrently using a `ThreadPoolExecutor`:
```python
from harvester.proxy import fetch_all

URLS = [
    'https://api.openproxylist.xyz/socks4.txt',
    'https://api.openproxylist.xyz/socks5.txt',
    'https://api.proxyscrape.com/?request=displayproxies&proxytype=socks4',
]


def main():
    """Main entry point."""
    proxies = fetch_all(URLS)
    print(proxies)


if __name__ == '__main__':
    main()
```
#### validate_socks

SOCKS5 proxies can be tested with the `validate_socks` function. It takes a proxy string as its only argument and returns a `requests.Response` object if the request succeeds; otherwise it raises an exception, and the caller can decide how to proceed. For an example implementation, see `main.py`.
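Because `validate_socks` raises on failure, a caller can use it to filter a harvested list down to working proxies. The helper below is a sketch of that pattern; it takes the validator as a parameter so it is self-contained, and the commented usage with the harvester API is an assumption rather than code from `main.py`:

```python
def first_working(proxies, validate):
    """Return the first proxy for which validate(proxy) does not raise,
    or None if every candidate fails."""
    for proxy in proxies:
        try:
            validate(proxy)  # e.g. harvester.proxy.validate_socks
        except Exception:
            continue  # dead or misbehaving proxy; try the next one
        return proxy
    return None

# Assumed usage with the harvester API:
#   from harvester.proxy import fetch_all, validate_socks
#   proxy = first_working(fetch_all(URLS), validate_socks)
```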
## Testing

```shell
pip install -r requirements.txt
pip install -r requirements-dev.txt
pytest -v
```