Anonymous scraping through the TOR network
The project for this post can be downloaded by clicking HERE.
In this tutorial we explain how to configure a proxy server for scraping websites anonymously through the TOR network, using TOR (https://www.torproject.org), Privoxy (http://www.privoxy.org) and the Python Stem library (https://stem.torproject.org). This server will run on a Linux distribution based on Ubuntu packages (Linux Mint 18.1 was used for this tutorial).
The most common use case when scraping a website is the following: after making many requests per unit of time from the same IP to a site (such as Google), that site may block your connection. By changing your identity (IP) with TOR (or a proxy that rotates IPs), you avoid the block and can continue scraping the site.
Configuration
To develop agents that scrape anonymously through a proxy server, you need to install a Linux distribution (with Ubuntu packages) with the following tools:
- TOR: an abbreviation of "The Onion Router", a project that seeks to create a low-latency distributed communication network on top of the Internet layer so that the data of its users is never revealed, thus maintaining a private and anonymous network.
- Stem: a Python controller library for TOR.
- Privoxy: Privoxy is a non-caching web proxy with advanced filtering capabilities for enhancing privacy, modifying web page data and HTTP headers, controlling access, and removing ads and other obnoxious Internet junk. Privoxy has a flexible configuration and can be customized to suit individual needs and tastes. It has application for both stand-alone systems and multi-user networks.
TOR (Install and configuration)
We open a terminal and install TOR as follows:
sudo apt-get update
sudo apt-get install tor
sudo /etc/init.d/tor restart
Next, do the following:
- Enable the "ControlPort" listener for TOR on port 9051, since this is the port on which TOR listens for communication from applications talking to the Tor controller.
- Hash a new password that prevents random access to the port by outside agents.
- Implement cookie authentication as well.
We create a hashed version of the password using:
tor --hash-password my_password
In this case we use the test password 1234. The generated hashed password is:
16:9529EB03A306DE6F60171DE514EA2FCD49235BAF1E1E55897209679683
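The shape of this hash can be reproduced in Python. The sketch below re-implements what we understand to be the S2K scheme `tor --hash-password` uses (this is an assumption based on the OpenPGP S2K algorithm Tor's control spec references; the 8-byte salt and the 0x60 iteration indicator belong to that scheme). For production, always use `tor --hash-password` itself:

```python
import hashlib
import os


def hash_tor_password(password, salt=None):
    """Sketch of Tor's S2K control-port password hash (assumed scheme)."""
    if salt is None:
        salt = os.urandom(8)           # Tor uses a random 8-byte salt
    indicator = 0x60                   # Tor's fixed S2K iteration indicator
    # Number of bytes to hash, derived from the indicator (OpenPGP S2K rule)
    count = (16 + (indicator & 15)) << ((indicator >> 4) + 6)
    data = salt + password.encode()
    digest = hashlib.sha1()
    # Feed salt+password repeatedly until `count` bytes have been hashed
    while count > 0:
        digest.update(data[:count])
        count -= len(data)
    return "16:" + (salt + bytes([indicator]) + digest.digest()).hex().upper()


print(hash_tor_password("1234"))
```

The output has the same "16:" prefix and 58-hex-digit body as the hash above; since the salt is random, every run produces a different (but equally valid) hash for the same password.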
Finally, in the TOR configuration file located at /etc/tor/torrc, update the port and the password (hash), and enable cookie authentication. Open the file for editing with vim (or another editor: vi, nano, gedit, etc.) as follows:
sudo vim /etc/tor/torrc
In this file we have to uncomment and modify the following lines:
ControlPort 9051
# hashed password below is obtained via `tor --hash-password my_password`
HashedControlPassword 16:9529EB03A306DE6F60171DE514EA2FCD49235BAF1E1E55897209679683
CookieAuthentication 1
Restart TOR so that the configuration changes are applied.
sudo /etc/init.d/tor restart
If you have any problems, you can enable the control port using the --controlport flag as follows:
tor --controlport 9051 &
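Before pointing any code at the control port, it can be useful to check that something is actually listening on it. The helper below is illustrative (the function name is ours, not part of the project); it is demonstrated against a throwaway local listener so the snippet runs anywhere, but against a configured TOR you would call it with port 9051:

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP service is accepting connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# For the Tor ControlPort you would run: is_port_open("127.0.0.1", 9051)
# Demo against a throwaway local listener:
server = socket.socket()
server.bind(("127.0.0.1", 0))        # port 0: the OS picks a free port
server.listen(1)
port = server.getsockname()[1]
print(is_port_open("127.0.0.1", port))   # True: something is listening
server.close()
```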
Python-Stem
Install python-stem, a Python module used to interact with the Tor controller; it lets us send and receive commands to and from the Tor control port programmatically.
sudo apt-get install python-stem
Privoxy
Tor itself is not an HTTP proxy, so in order to access the Tor network we use Privoxy as an HTTP proxy over SOCKS5. Install Privoxy with the following command:
sudo apt-get install privoxy
Now, tell Privoxy to use TOR by routing all traffic through the SOCKS server at localhost, port 9050. Edit its configuration file:
sudo vim /etc/privoxy/config
Enable forward-socks5 by uncommenting the following line of the file (the trailing dot is part of the directive and must be kept):
forward-socks5 / 127.0.0.1:9050 .
Restart privoxy after making the change to the configuration file:
sudo /etc/init.d/privoxy restart
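Once Privoxy is forwarding to TOR, Python code only needs to treat 127.0.0.1:8118 as an ordinary HTTP proxy. A minimal Python 3 sketch (the article's own code does the same with the Python 2 urllib2 equivalent):

```python
import urllib.request

# Privoxy listens on 127.0.0.1:8118 by default, so any HTTP client can
# reach the TOR network simply by using it as a normal HTTP proxy.
proxy_support = urllib.request.ProxyHandler({
    "http": "http://127.0.0.1:8118",
    "https": "http://127.0.0.1:8118",
})
opener = urllib.request.build_opener(proxy_support)
# install_opener makes every urllib.request.urlopen() call use the proxy:
urllib.request.install_opener(opener)
```

After this, a call such as `urllib.request.urlopen("http://icanhazip.com/")` would go out through Privoxy and TOR (assuming both services are running).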
Scraping
Once our proxy server is configured, we will develop an agent that scrapes web pages, changing the IP every X requests. For this purpose, a class called "ConnectionManager.py" has been developed to manage connections, IP changes, requests, etc. The class is implemented as follows:
# -*- coding: utf-8 -*-
__author__ = 'RicardoMoya'

import time
import urllib2

from stem import Signal
from stem.control import Controller


class ConnectionManager:
    def __init__(self):
        self.new_ip = "0.0.0.0"
        self.old_ip = "0.0.0.0"
        self.new_identity()

    @classmethod
    def _get_connection(self):
        """ TOR new connection """
        with Controller.from_port(port=9051) as controller:
            controller.authenticate(password="1234")
            controller.signal(Signal.NEWNYM)
            controller.close()

    @classmethod
    def _set_url_proxy(self):
        """ Request to URL through local proxy """
        proxy_support = urllib2.ProxyHandler({"http": "127.0.0.1:8118"})
        opener = urllib2.build_opener(proxy_support)
        urllib2.install_opener(opener)

    @classmethod
    def request(self, url):
        """ TOR communication through local proxy
        :param url: web page to parse
        :return: request
        """
        try:
            self._set_url_proxy()
            request = urllib2.Request(url, None, {
                'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) "
                              "AppleWebKit/535.11 (KHTML, like Gecko) "
                              "Ubuntu/10.10 Chromium/17.0.963.65 "
                              "Chrome/17.0.963.65 Safari/535.11"})
            request = urllib2.urlopen(request)
            return request
        except urllib2.HTTPError, e:
            return e.message

    def new_identity(self):
        """ new connection with new IP """
        # First connection
        if self.new_ip == "0.0.0.0":
            self._get_connection()
            self.new_ip = self.request("http://icanhazip.com/").read()
        else:
            self.old_ip = self.new_ip
            self._get_connection()
            self.new_ip = self.request("http://icanhazip.com/").read()
            seg = 0
            # If we get the same IP, we wait 5 seconds and request a new one
            while self.old_ip == self.new_ip:
                time.sleep(5)
                seg += 5
                print("Waiting to obtain new IP: %s Seconds" % seg)
                self.new_ip = self.request("http://icanhazip.com/").read()
        print("New connection with IP: %s" % self.new_ip)
In the class constructor we have two attributes (two IPs) that maintain the state of the IP (or identity) we currently have and the new IP we want TOR to assign us. If TOR assigns us the same IP we had before, we discard it and request a new one.
On the other hand, we have two public methods: request(url) and new_identity(). The request(url) method is used to request the web page passed as a parameter, and the new_identity() method asks TOR for a new IP. The private methods _get_connection() and _set_url_proxy() are used to establish a new connection through the TOR network and to make requests through the local proxy.
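The "discard the IP if it did not change" bookkeeping can be isolated from the network calls. Below is a hypothetical pure-logic sketch of that state handling (the class name IdentityTracker is ours, not part of the project's code):

```python
class IdentityTracker:
    """Pure-logic sketch of ConnectionManager's IP bookkeeping:
    no Tor calls, just the 'did the exit IP really change?' state."""

    def __init__(self):
        self.old_ip = None
        self.new_ip = None

    def record(self, ip):
        """Store the IP observed after requesting a new circuit.
        Return True if the identity actually changed (or is the first one)."""
        self.old_ip, self.new_ip = self.new_ip, ip
        return self.old_ip is None or self.old_ip != self.new_ip


tracker = IdentityTracker()
print(tracker.record("1.2.3.4"))  # True: first identity
print(tracker.record("1.2.3.4"))  # False: TOR handed back the same exit IP
print(tracker.record("5.6.7.8"))  # True: identity really changed
```

When `record()` returns False, the real class sleeps 5 seconds and asks TOR for another circuit, which is exactly the `while self.old_ip == self.new_ip` loop above.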
Below we show a simple example of identity (IP) change after making 3 requests to a web page; in this case, the "icanhazip" page itself. The code for this example is as follows (Example.py):
# -*- coding: utf-8 -*-
__author__ = 'RicardoMoya'

from ConnectionManager import ConnectionManager

cm = ConnectionManager()
for j in range(5):
    for i in range(3):
        print("\t\t" + cm.request("http://icanhazip.com/").read())
    cm.new_identity()
As shown in the code, after making 3 requests to the "icanhazip" page, we ask for an IP change by calling the new_identity() method of the ConnectionManager class. The result of running this code is the following:
New connection with IP: 185.38.14.171
        185.38.14.171
        185.38.14.171
        185.38.14.171
Waiting to obtain new IP: 5 Seconds
Waiting to obtain new IP: 10 Seconds
New connection with IP: 94.23.173.249
        94.23.173.249
        94.23.173.249
        94.23.173.249
Waiting to obtain new IP: 5 Seconds
New connection with IP: 144.217.99.46
        144.217.99.46
        144.217.99.46
        144.217.99.46
Waiting to obtain new IP: 5 Seconds
Waiting to obtain new IP: 10 Seconds
New connection with IP: 62.210.129.246
        62.210.129.246
        62.210.129.246
        62.210.129.246
Waiting to obtain new IP: 5 Seconds
Waiting to obtain new IP: 10 Seconds
New connection with IP: 185.34.33.2
        185.34.33.2
        185.34.33.2
        185.34.33.2
Let's look at a more practical example, based on the second example of the tutorial "Scraping en Python (BeautifulSoup), con ejemplos". In that example we obtained all the posts of this website. In this case we modify the code so that the IP changes every 5 requests. The code is as follows (Scraping_All_Post.py):
# -*- coding: utf-8 -*-
__author__ = 'RicardoMoya'

from bs4 import BeautifulSoup

from ConnectionManager import ConnectionManager

URL_BASE = "http://jarroba.com/"
MAX_PAGES = 30
counter_post = 0

cm = ConnectionManager()

for i in range(1, MAX_PAGES):

    # Build URL
    if i > 1:
        url = "%spage/%d/" % (URL_BASE, i)
    else:
        url = URL_BASE
    print(url)

    # Do the request
    req = cm.request(url)
    status_code = req.code if req != '' else -1

    if status_code == 200:
        html = BeautifulSoup(req.read(), "html.parser")
        posts = html.find_all('div', {'class': 'col-md-4 col-xs-12'})
        for post in posts:
            counter_post += 1
            title = post.find('span', {'class': 'tituloPost'}).getText()
            author = post.find('span', {'class': 'autor'}).getText()
            date = post.find('span', {'class': 'fecha'}).getText()
            print(str(counter_post) + ' - ' + title + ' | ' +
                  author + ' | ' + date)
    else:
        # if the status code is different from 200, stop
        break

    # obtain a new IP if 5 requests have already been made
    if i % 5 == 0:
        cm.new_identity()
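If BeautifulSoup is not available, the same title extraction can be sketched with the standard library's html.parser. The sample markup below is a hypothetical fragment mirroring the post structure the script parses, so the snippet runs without any network access:

```python
from html.parser import HTMLParser

# Hypothetical fragment with the same structure the script above parses
SAMPLE = """
<div class="col-md-4 col-xs-12">
  <span class="tituloPost">Bit</span>
  <span class="autor">Por: Ramon Invarato</span>
</div>
"""


class TitleExtractor(HTMLParser):
    """Collect the text of every <span class="tituloPost">."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "tituloPost") in attrs:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.titles.append(data.strip())


parser = TitleExtractor()
parser.feed(SAMPLE)
print(parser.titles)  # ['Bit']
```

BeautifulSoup remains the more convenient choice for real scraping; this stdlib version just shows that the extraction logic itself is simple.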
As expected, every 5 requests a new IP is obtained, and the following 5 requests are made with that new identity. The (partially shown) result is as follows:
New connection with IP: 192.42.116.16
http://jarroba.com/
1 - Bit | Por: Ramón Invarato | 02-Abr-2017
...........
http://jarroba.com/page/5/
37 - Multitarea e Hilos en Java con ejemplos II (Runnable & Executors) | Por: Ricardo Moya | 06-Dic-2014
...........
45 - MEAN (Mongo-Express-Angular-Node) Desarrollo Full Stack JavaScript (Parte I) | Por: Ricardo Moya | 09-Jul-2014
Waiting to obtain new IP: 5 Seconds
New connection with IP: 178.175.131.194
http://jarroba.com/page/6/
46 - Atributos para diseñadores Android (tools:xxxxx) | Por: Ramón Invarato | 26-May-2014
...........
http://jarroba.com/page/10/
82 - Error Android – java.lang.NoClassDefFoundError sin motivo aparente | Por: Ramón Invarato | 30-May-2013
...........
90 - ArrayList en Java, con ejemplos | Por: Ricardo Moya | 28-Mar-2013
New connection with IP: 216.239.90.19
http://jarroba.com/page/11/
91 - Intent – Pasar datos entre Activities – App Android (Video) | Por: Ricardo Moya | 03-Mar-2013
...........
http://jarroba.com/page/15/
127 - Modelo “4+1” vistas de Kruchten (para Dummies) | Por: Ricardo Moya | 31-Mar-2012
...........
135 - Aprender a programar conociendo lo que es un Entorno de Desarrollo Integrado (IDE) | Por: Ramón Invarato | 14-Feb-2012
Waiting to obtain new IP: 5 Seconds
New connection with IP: 144.217.99.46
http://jarroba.com/page/16/
136 - Instalación del XAMPP para Windows | Por: Ricardo Moya | 13-Feb-2012
...........
151 - Error Android – Aplicación no especifica el nivel de la API | Por: Ramón Invarato | 12-Dic-2011
http://jarroba.com/page/18/
References
This article was written taking the content of the following links as references:
1.- Article Crawling anonymously with Tor in Python: http://sacharya.com/crawling-anonymously-with-tor-in-python/
2.- Project "PyTorStemPrivoxy" of FrackingAnalysis github account: https://github.com/FrackingAnalysis/PyTorStemPrivoxy