# Harvester
Python package for harvesting commonly available data, such as free proxy servers.
## Modules
### Proxy
#### fetch_list
The `proxy` module harvests proxies from a URL with the `fetch_list` function.
It works by running a regular expression against the HTTP response, looking for
strings that match a `username:password@ip:port` pattern, where the username and
password are optional.
```python
from harvester.proxy import fetch_list

URLS = [
    'https://api.openproxylist.xyz/socks4.txt',
    'https://api.openproxylist.xyz/socks5.txt',
    'https://api.proxyscrape.com/?request=displayproxies&proxytype=socks4',
]

def main():
    """Main entry point."""
    for url in URLS:
        proxies = fetch_list(url)
        print(proxies)

if __name__ == '__main__':
    main()
```
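For reference, the matched strings have roughly this shape. The pattern below is only an illustration of the format, not the package's actual expression:
```python
import re

# Illustrative pattern only -- not the package's actual expression.
# Matches "ip:port" with an optional "username:password@" prefix.
PROXY_RE = re.compile(
    r'(?:(?P<username>[^\s:@]+):(?P<password>[^\s:@]+)@)?'
    r'(?P<ip>\d{1,3}(?:\.\d{1,3}){3}):(?P<port>\d{1,5})'
)

match = PROXY_RE.search('user:pass@203.0.113.7:1080')
print(match.group(0))  # user:pass@203.0.113.7:1080
```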
#### fetch_all
Proxies can be fetched from multiple source URLs by using the `fetch_all` function.
It takes a list of URLs and an optional `max_workers` parameter. Proxies will be fetched from
the source URLs concurrently using a `ThreadPoolExecutor`:
```python
from harvester.proxy import fetch_all

URLS = [
    'https://api.openproxylist.xyz/socks4.txt',
    'https://api.openproxylist.xyz/socks5.txt',
    'https://api.proxyscrape.com/?request=displayproxies&proxytype=socks4',
]

def main():
    """Main entry point."""
    proxies = fetch_all(URLS)
    print(proxies)

if __name__ == '__main__':
    main()
```
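The `max_workers` argument caps the size of the thread pool. For instance (the value 8 here is arbitrary):
```python
proxies = fetch_all(URLS, max_workers=8)
```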
#### validate_socks
SOCKS5 proxies can be tested with the `validate_socks` function. It takes a proxy
string as its only argument and returns a `requests.Response` object if the request
succeeds; otherwise it raises an exception and the caller can decide how to proceed.
For an example implementation, see [main.py](main.py).
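A minimal sketch of that flow is shown below. The proxy address is made up, and the exact string format expected (for example, whether a `socks5://` prefix is required) may differ from what is shown here:
```python
from harvester.proxy import validate_socks

# Hypothetical proxy address used purely for illustration.
PROXY = 'socks5://127.0.0.1:1080'

def main():
    """Check a single SOCKS5 proxy and report the result."""
    try:
        response = validate_socks(PROXY)
        print(f'{PROXY} is usable (status {response.status_code})')
    except Exception as exc:  # the caller decides how to handle failures
        print(f'{PROXY} failed validation: {exc}')

if __name__ == '__main__':
    main()
```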
## Testing
```
pip install -r requirements.txt
pip install -r requirement-dev.txt
pytest -v
```