Anonymous scraping by TOR network


The project for this post can be downloaded by clicking HERE.

In this tutorial we explain how to configure a proxy server for scraping websites anonymously through the TOR network, using TOR (https://www.torproject.org), Privoxy (http://www.privoxy.org) and the python-Stem library (https://stem.torproject.org). This server will run on a Linux distribution based on Ubuntu packages (a Linux Mint 18.1 has been used in this tutorial).

The most common use case when scraping a website is needing to change your identity (IP) using TOR (or a proxy that rotates IPs) after making multiple requests per unit of time from the same IP, so that the website (Google, for example) does not block your connection and you can keep scraping it.

Configuration

To develop agents that perform scraping anonymously through a proxy server, it is necessary to install a Linux distribution (with Ubuntu packages) with the following tools:

  • TOR: an abbreviation of "The Onion Router", a project that seeks to create a low-latency distributed communication network on top of the Internet layer in which the data of the users is never revealed, thus maintaining a private and anonymous network.
  • Stem: a Python controller library for TOR.
  • Privoxy: a non-caching web proxy with advanced filtering capabilities for enhancing privacy, modifying web page data and HTTP headers, controlling access, and removing ads and other obnoxious Internet junk. Privoxy has a flexible configuration and can be customized to suit individual needs and tastes. It is useful both for stand-alone systems and for multi-user networks.

TOR (Install and configuration)

We open a terminal and install TOR as follows:

sudo apt-get update
sudo apt-get install tor
sudo /etc/init.d/tor restart

Next, do the following:

  1. Enable the "ControlPort" listener for TOR on port 9051, as this is the port on which TOR listens for communication from applications talking to the Tor controller.
  2. Hash a new password to prevent random access to the port by outside agents.
  3. Enable cookie authentication as well.

We create a hashed version of our password using:

tor --hash-password my_password

In this case we use 1234 as a test password. The generated password hash is:

16:9529EB03A306DE6F60171DE514EA2FCD49235BAF1E1E55897209679683
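Aside: the 16: string produced by tor --hash-password is not a plain hash. Per the Tor control-spec it is an RFC 2440 salted, iterated S2K digest: 8 salt bytes, one iteration-indicator byte (0x60 by default), and a 20-byte SHA-1 digest, all hex-encoded. A minimal Python sketch of how such a hash is derived (the function name hash_tor_password is ours; this is illustrative, not a replacement for the tor binary):

```python
import hashlib
import os


def hash_tor_password(password, salt=None, indicator=0x60):
    """Derive a Tor-style HashedControlPassword ("16:...") string.

    Format after "16:": 8 salt bytes + 1 iteration-indicator byte
    + 20-byte SHA-1 digest, all hex-encoded (per the control-spec).
    """
    if salt is None:
        salt = os.urandom(8)
    # Iteration count is encoded in the indicator byte (EXPBIAS = 6)
    count = (16 + (indicator & 15)) << ((indicator >> 4) + 6)
    material = salt + password.encode()
    # SHA-1 over salt+password repeated until `count` bytes are hashed
    d = hashlib.sha1()
    while count > 0:
        if count >= len(material):
            d.update(material)
            count -= len(material)
        else:
            d.update(material[:count])
            count = 0
    return "16:%s%02X%s" % (salt.hex().upper(), indicator,
                            d.hexdigest().upper())


# With the salt from the hash above, the same password always
# reproduces the same "16:..." string
print(hash_tor_password("1234", salt=bytes.fromhex("9529EB03A306DE6F")))
```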

Finally, in the TOR configuration file located at /etc/tor/torrc, we update the port and the password (hash), and enable cookie authentication. To do this we open the file for editing with vim (you can use other editors: vi, nano, gedit, etc.) as follows:

sudo vim /etc/tor/torrc

In this file we have to uncomment and modify the following:

ControlPort 9051
# hashed password below is obtained via `tor --hash-password my_password`
HashedControlPassword 16:9529EB03A306DE6F60171DE514EA2FCD49235BAF1E1E55897209679683
CookieAuthentication 1

Restart TOR again so that the configuration changes are applied.

sudo /etc/init.d/tor restart

If you have any problems, you can enable the control port using the --controlport flag as follows:

tor --controlport 9051 &

Python-Stem

Install python-stem, a Python module used to interact with the Tor controller, which lets us send commands to and receive replies from the Tor control port programmatically.

sudo apt-get install python-stem

Privoxy

Tor itself is not an HTTP proxy, so in order to give HTTP clients access to the Tor network we use Privoxy as an HTTP proxy that forwards through SOCKS5. Install privoxy with the following command:

sudo apt-get install privoxy

Now, tell privoxy to use TOR by routing all traffic through the SOCKS servers at localhost port 9050:

sudo vim /etc/privoxy/config

Enable forward-socks5 by uncommenting the following line of the file (the trailing dot is part of Privoxy's syntax and must be kept):

forward-socks5 / 127.0.0.1:9050 .

Restart privoxy after making the change to the configuration file:

sudo /etc/init.d/privoxy restart

Scraping

Once our proxy server is configured, we will develop an agent that scrapes web pages, changing the IP every X requests. For this, a class called "ConnectionManager" (ConnectionManager.py) has been developed to manage connections, IP changes, requests, etc. This class has been implemented in the following way:

# -*- coding: utf-8 -*-
__author__ = 'RicardoMoya'

import time
import urllib2
from stem import Signal
from stem.control import Controller


class ConnectionManager:
    def __init__(self):
        self.new_ip = "0.0.0.0"
        self.old_ip = "0.0.0.0"
        self.new_identity()

    @classmethod
    def _get_connection(cls):
        """
        Open a new TOR connection (request a new identity)
        """
        with Controller.from_port(port=9051) as controller:
            controller.authenticate(password="1234")
            controller.signal(Signal.NEWNYM)

    @classmethod
    def _set_url_proxy(cls):
        """
        Route requests through the local proxy (Privoxy on port 8118)
        """
        proxy_support = urllib2.ProxyHandler({"http": "127.0.0.1:8118"})
        opener = urllib2.build_opener(proxy_support)
        urllib2.install_opener(opener)

    @classmethod
    def request(cls, url):
        """
        TOR communication through the local proxy
        :param url: web page to parse
        :return: response object
        """
        try:
            cls._set_url_proxy()
            request = urllib2.Request(url, None, {
                'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) "
                              "AppleWebKit/535.11 (KHTML, like Gecko) "
                              "Ubuntu/10.10 Chromium/17.0.963.65 "
                              "Chrome/17.0.963.65 Safari/535.11"})
            return urllib2.urlopen(request)
        except urllib2.HTTPError, e:
            return e.message

    def new_identity(self):
        """
        new connection with new IP
        """
        # First Connection
        if self.new_ip == "0.0.0.0":
            self._get_connection()
            self.new_ip = self.request("http://icanhazip.com/").read()
        else:
            self.old_ip = self.new_ip
            self._get_connection()
            self.new_ip = self.request("http://icanhazip.com/").read()

        seconds = 0

        # If we get the same IP, wait 5 seconds and request a new one
        while self.old_ip == self.new_ip:
            time.sleep(5)
            seconds += 5
            print ("Waiting to obtain new IP: %s Seconds" % seconds)
            self.new_ip = self.request("http://icanhazip.com/").read()

        print ("New connection with IP: %s" % self.new_ip)

Note 1: Hardcoding values in the code (IP, port, password, etc.) is not good programming practice; since this is a tutorial for didactic purposes, it has been done this way to make the code easier to understand.

Note 2: http://icanhazip.com is a web page that shows the IP address of the client making the request. When making requests through the TOR network we cannot know (by asking the O.S.) what our IP is, because the TOR network assigns us another IP (in this case, the IP shown by the icanhazip webpage).

In the class constructor we have two attributes (two IPs) that maintain the state of the identity (IP) we currently have and of the new IP that we want TOR to assign us. If TOR assigns us the same IP we had before, we discard it and request a new IP.
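This wait-and-retry part of the logic can be exercised without a running TOR by injecting the IP-lookup and sleep functions. A sketch under that assumption (the helper name wait_for_new_ip and its parameters are ours, not part of the class above):

```python
import time


def wait_for_new_ip(fetch_ip, old_ip, delay=5, sleep=time.sleep):
    """Poll fetch_ip() until it returns an IP different from old_ip.

    fetch_ip: callable returning the current external IP as a string
              (in the class above this is the icanhazip request).
    sleep:    injectable sleep function, so tests don't really wait.
    """
    waited = 0
    new_ip = fetch_ip()
    while new_ip == old_ip:
        sleep(delay)
        waited += delay
        print("Waiting to obtain new IP: %s Seconds" % waited)
        new_ip = fetch_ip()
    return new_ip


# Simulate TOR needing two polls before handing out a fresh exit IP
ips = iter(["1.1.1.1", "1.1.1.1", "2.2.2.2"])
print(wait_for_new_ip(lambda: next(ips), "1.1.1.1", sleep=lambda s: None))
# prints 2.2.2.2
```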

On the other hand, we have two public methods, request(url) and new_identity(). The request(url) method is used to request the web page passed as a parameter, and the new_identity() method gets us a new IP assigned. The private methods _get_connection() and _set_url_proxy() are used to establish a new connection through the TOR network and to route requests through the local proxy.
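Note that the class targets Python 2 (urllib2). If you are on Python 3, urllib2 was split into urllib.request, and the equivalent of _set_url_proxy() would look roughly like this (a sketch, not tested against a live Privoxy; 8118 is Privoxy's default port):

```python
import urllib.request

# Route all urllib requests through the local Privoxy instance
proxy_support = urllib.request.ProxyHandler({"http": "127.0.0.1:8118"})
opener = urllib.request.build_opener(proxy_support)
urllib.request.install_opener(opener)
```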

Below we show a simple example of identity (IP) change after making 3 requests to a webpage; in this case, the same "icanhazip" page. The code for this example is as follows (Example.py):

# -*- coding: utf-8 -*-
__author__ = 'RicardoMoya'

from ConnectionManager import ConnectionManager

cm = ConnectionManager()
for j in range(5):
    for i in range(3):
        print ("\t\t" + cm.request("http://icanhazip.com/").read())
    cm.new_identity()

As shown in the code, after making 3 requests to the "icanhazip" webpage, we ask for an IP change by calling the new_identity() method of the ConnectionManager class. The result of executing this code is the following:

New connection with IP: 185.38.14.171
		185.38.14.171
		185.38.14.171
		185.38.14.171

Waiting to obtain new IP: 5 Seconds
Waiting to obtain new IP: 10 Seconds
New connection with IP: 94.23.173.249
		94.23.173.249
		94.23.173.249
		94.23.173.249

Waiting to obtain new IP: 5 Seconds
New connection with IP: 144.217.99.46
		144.217.99.46
		144.217.99.46
		144.217.99.46

Waiting to obtain new IP: 5 Seconds
Waiting to obtain new IP: 10 Seconds
New connection with IP: 62.210.129.246
		62.210.129.246
		62.210.129.246
		62.210.129.246

Waiting to obtain new IP: 5 Seconds
Waiting to obtain new IP: 10 Seconds
New connection with IP: 185.34.33.2
		185.34.33.2
		185.34.33.2
		185.34.33.2
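The pattern in Example.py, make N requests and then rotate the identity, can be factored into a small reusable wrapper. A sketch with hypothetical names (rotate_every), testable with stub functions instead of a live TOR:

```python
def rotate_every(n, request_fn, new_identity_fn):
    """Wrap request_fn so that new_identity_fn() is called
    automatically after every n requests (names are ours)."""
    state = {"count": 0}

    def wrapped(url):
        result = request_fn(url)
        state["count"] += 1
        if state["count"] % n == 0:
            new_identity_fn()
        return result

    return wrapped


# Stub demo: count identity rotations over 7 requests, rotating every 3
rotations = []
fetch = rotate_every(3, lambda url: "ok", lambda: rotations.append(1))
for _ in range(7):
    fetch("http://icanhazip.com/")
print(len(rotations))  # prints 2 (after requests 3 and 6)
```

With the real class you would pass cm.request and cm.new_identity instead of the stubs.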

Let's look at a more practical example, based on the second example of the tutorial "Scraping en Python (BeautifulSoup), con ejemplos", in which we obtained all the posts of this website. In this case we will modify the code so that the IP changes every 5 requests. The code is as follows (Scraping_All_Post.py):

# -*- coding: utf-8 -*-
__author__ = 'RicardoMoya'

from bs4 import BeautifulSoup
from ConnectionManager import ConnectionManager

URL_BASE = "http://jarroba.com/"
MAX_PAGES = 30
counter_post = 0

cm = ConnectionManager()
for i in range(1, MAX_PAGES):

    # Build URL
    if i > 1:
        url = "%spage/%d/" % (URL_BASE, i)
    else:
        url = URL_BASE
    print (url)

    # Do the request
    req = cm.request(url)
    status_code = req.code if hasattr(req, "code") else -1
    if status_code == 200:
        html = BeautifulSoup(req.read(), "html.parser")
        posts = html.find_all('div', {'class': 'col-md-4 col-xs-12'})
        for post in posts:
            counter_post += 1
            title = post.find('span', {'class': 'tituloPost'}).getText()
            author = post.find('span', {'class': 'autor'}).getText()
            date = post.find('span', {'class': 'fecha'}).getText()
            print (str(counter_post) + ' - ' + title + ' | ' +
                   author + ' | ' + date)

    else:
        # if status code is different from 200
        break

    # obtain new ip if 5 requests have already been made
    if i % 5 == 0:
        cm.new_identity()

As expected, every 5 requests a new IP is obtained, and the next 5 requests are made with the new identity. The result obtained (shown partially) is as follows:

New connection with IP: 192.42.116.16

http://jarroba.com/
1 - Bit | Por: Ramón	Invarato | 02-Abr-2017

...........

http://jarroba.com/page/5/
37 - Multitarea e Hilos en Java con ejemplos II (Runnable & Executors) | Por: Ricardo	Moya | 06-Dic-2014

...........

45 - MEAN (Mongo-Express-Angular-Node) Desarrollo Full Stack JavaScript (Parte I) | Por: Ricardo	Moya | 09-Jul-2014

Waiting to obtain new IP: 5 Seconds
New connection with IP: 178.175.131.194

http://jarroba.com/page/6/
46 - Atributos para diseñadores Android (tools:xxxxx) | Por: Ramón	Invarato | 26-May-2014

...........

http://jarroba.com/page/10/
82 - Error Android – java.lang.NoClassDefFoundError sin motivo aparente | Por: Ramón	Invarato | 30-May-2013

...........

90 - ArrayList en Java, con ejemplos | Por: Ricardo	Moya | 28-Mar-2013

New connection with IP: 216.239.90.19

http://jarroba.com/page/11/
91 - Intent – Pasar datos entre Activities – App Android (Video) | Por: Ricardo	Moya | 03-Mar-2013

...........

http://jarroba.com/page/15/
127 - Modelo “4+1” vistas de Kruchten (para Dummies) | Por: Ricardo	Moya | 31-Mar-2012

...........

135 - Aprender a programar conociendo lo que es un Entorno de Desarrollo Integrado (IDE) | Por: Ramón	Invarato | 14-Feb-2012

Waiting to obtain new IP: 5 Seconds
New connection with IP: 144.217.99.46

http://jarroba.com/page/16/
136 - Instalación del XAMPP para Windows | Por: Ricardo	Moya | 13-Feb-2012

...........

151 - Error Android – Aplicación no especifica el nivel de la API | Por: Ramón	Invarato | 12-Dic-2011
http://jarroba.com/page/18/

References

This article was written taking the content of the following links as reference:

1.- Article "Crawling anonymously with Tor in Python": http://sacharya.com/crawling-anonymously-with-tor-in-python/

2.- Project "PyTorStemPrivoxy" from the FrackingAnalysis GitHub account: https://github.com/FrackingAnalysis/PyTorStemPrivoxy

"Anonymous scraping by TOR network" by www.jarroba.com is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. Created from the work at www.jarroba.com.
