Writing A Proxy Checker In Python – Part 1

26 Jul

This is a deep dive into a proxy checker I wrote here. It assumes the reader already knows what proxies are and wants to see how they can be analyzed with Python 3. In this first part we’ll explore what’s involved in proxy checking and look at the code for establishing a connection to a network resource through a proxy.

Starting Out

When asking whether a proxy is working, we need to distinguish between the point of view of the source and that of the target. Confirming that a proxy works from the source’s point of view means successfully requesting an external or internal network resource through it. If a connection through the proxy is established and the resource is obtained (a file, a printer connection, a web service connection, etc.), the proxy is considered functional. Checking whether a proxy is working from the target’s point of view is more involved and may consist of asking one of the following:

Is the requestor a proxy, and if so, what level of anonymity does it provide?

Does the requestor appear not to be a proxy, i.e. are we testing an elite proxy?

For the purposes of this post we are focusing on testing HTTP proxies, also known as web proxies, which understand the HTTP protocol and are able to alter the headers of the requests they forward. This gives rise to proxy anonymity classification levels such as Transparent, Anonymous, Elite, etc.

Headers

For a quick overview of headers involved in the analysis we’ll focus on the most common subset that proxies include in their requests to target servers.

Via
X-Via
X-Forwarded-For
X-Forwarded-Proto
X-Forwarded-Port
Proxy-Connection
Cdn-Src-Ip
X-IMForwards
HTTP_VIA
HTTP_FORWARDED_FOR
Forwarded

The top half of these appear in the majority of HTTP requests from transparent or anonymous proxies. A proxy that doesn’t hide the source IP and that labels itself a proxy will most likely include this information in the Via and X-Forwarded-For headers. At the other extreme, an elite proxy is likely to omit these fields to masquerade as the original requestor. Headers starting with X are non-standard but still widely used. The last field – Forwarded – is a proposed standard and applications wishing to be standards compliant should use it.
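To make the distinction concrete, here is a minimal sketch of how a target server might grade an incoming request using the headers listed above. The function name classify_anonymity, the three-way rule (source IP visible means transparent, proxy headers without the source IP means anonymous, no proxy headers means elite) and the IPv4-only matching are assumptions made for illustration, not a definitive classification scheme.

```python
import re

# Header names from the list above, lower-cased for case-insensitive lookup.
PROXY_HEADERS = {
    "via", "x-via", "x-forwarded-for", "x-forwarded-proto",
    "x-forwarded-port", "proxy-connection", "cdn-src-ip",
    "x-imforwards", "http_via", "http_forwarded_for", "forwarded",
}

# Assumption: only IPv4 addresses are considered in this sketch.
IPV4_PATTERN = re.compile(r"(?<![\d.])\d{1,3}(?:\.\d{1,3}){3}(?![\d.])")


def classify_anonymity(headers, source_ip):
    """Grade a request as 'transparent', 'anonymous' or 'elite'.

    headers   -- dict of header names to values as seen by the target
    source_ip -- the real IPv4 address of the original requestor
    """
    proxy_values = [
        value for name, value in headers.items()
        if name.lower() in PROXY_HEADERS
    ]
    if not proxy_values:
        # Nothing reveals a proxy: the request looks like a direct one.
        return "elite"
    for value in proxy_values:
        if source_ip in IPV4_PATTERN.findall(value):
            # The proxy announces itself and leaks the source address.
            return "transparent"
    # The proxy announces itself but hides the source address.
    return "anonymous"
```

A real checker would run this on the target side, where the source IP is known from a direct request made without the proxy.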

Making The Call

We begin as a source host requesting an external network resource – the HTML from http://www.google.com . The source host needs to accomplish two objectives:

1. Connect successfully to the proxy host

2. Receive the correct HTML

When exploring proxies I was surprised to find that some proxies were not returning the HTML requested. The response would be either an error message from the proxy itself or the requested HTML surrounded by additional tags. So a proxy cannot be called functional just because it accepts connection requests from the source host.
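One rough way to catch both failure modes is to sanity-check the body that comes back. The sketch below is only an illustration: the function name html_looks_correct, the idea of searching for a caller-supplied marker string, and the start/end checks are all assumptions, and a stricter checker might compare against a copy fetched without the proxy.

```python
def html_looks_correct(body, expected_marker):
    """Return True if body looks like the unmodified page we asked for.

    body            -- the decoded response text received through the proxy
    expected_marker -- a string known to appear in the real page
                       (hypothetical example: '<title>Google</title>')
    """
    text = body.strip().lower()
    # A page wrapped in extra tags will not start or end with the html element.
    starts_ok = text.startswith("<!doctype html") or text.startswith("<html")
    ends_ok = text.endswith("</html>")
    # An error page generated by the proxy will not contain the marker.
    has_marker = expected_marker.lower() in text
    return starts_ok and ends_ok and has_marker
```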

In our code we’ll make use of Python’s urllib.request module, although the Requests package is a popular alternative for handling HTTP with a simplified interface.

Line 3: A urllib.request.ProxyHandler accepts a dictionary of protocol and proxy pairs. Handlers, as described here, do the heavy lifting of opening URLs for specific schemes (https, ftp, etc.). For our purposes, we are configuring a handler with a single proxy only. Note also the use of 'http'. Remember that proxies operate across multiple protocols and we are focusing on processing HTTP in this example.

Line 4: urllib.request.build_opener() is a convenience function for creating instances of the urllib.request.OpenerDirector class. Openers manage chains of handlers and, by default, build_opener() creates an opener with a few default handlers pre-installed, including a default ProxyHandler. The default ProxyHandler configures itself with proxies from the environment variables. Since we supply a proxy programmatically in this example, a custom ProxyHandler is created instead.
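The environment-driven default can be observed directly. In this small sketch the proxy address is a hypothetical placeholder and the variable is set from within the script purely for demonstration; normally it would come from the shell.

```python
import os
import urllib.request

# Hypothetical proxy address, set here only to demonstrate the lookup.
os.environ["http_proxy"] = "http://127.0.0.1:3128"

# getproxies() returns the scheme-to-proxy mapping found in the environment.
env_proxies = urllib.request.getproxies()
print(env_proxies.get("http"))

# An opener built with no arguments gets a default ProxyHandler that
# picks up the same mapping.
default_opener = urllib.request.build_opener()
default_proxy_handlers = [
    handler for handler in default_opener.handlers
    if isinstance(handler, urllib.request.ProxyHandler)
]
print(default_proxy_handlers[0].proxies.get("http"))
```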

Line 5: Calling install_opener() makes the provided opener the default used by urlopen().
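Put together, the handler, opener and install steps described above might look like the following sketch. The proxy address is a hypothetical placeholder and the variable names are illustrative.

```python
import urllib.request

# Hypothetical proxy address, for illustration only.
PROXY = "http://127.0.0.1:8080"

# Map the 'http' scheme to our single proxy.
proxy_handler = urllib.request.ProxyHandler({"http": PROXY})

# Passing our own ProxyHandler replaces the default, environment-based one.
opener = urllib.request.build_opener(proxy_handler)

# From here on, urllib.request.urlopen() routes http:// URLs through PROXY.
urllib.request.install_opener(opener)
```

If a global default is not wanted, calling opener.open() directly has the same effect for a single request without affecting other callers of urlopen().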

Line 6: The urllib.request.Request object represents the HTTP request to be made. Passing a Request object into urlopen() fetches the URL and returns a response in a single call. Additional features provided by the Request object include passing custom headers as well as POST data. Unless data is supplied or a method is specified, urllib uses the GET method when making a request.

Note the use of the User-Agent header in the last line. When omitted, urllib supplies its own, for example: User-Agent: Python-urllib/3.4. In an effort to prevent programmatic connections, some websites block requests containing such default headers.
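As a small illustration of the Request object with a custom User-Agent, consider the sketch below. The User-Agent string is an arbitrary example value, not one taken from the original listing.

```python
import urllib.request

# Arbitrary example value; any browser-like string could be used here.
USER_AGENT = "Mozilla/5.0 (compatible; proxy-checker-example)"

req = urllib.request.Request(
    "http://www.google.com",
    headers={"User-Agent": USER_AGENT},
)

# No data was supplied, so the request will be sent as a GET.
print(req.get_method())

# urllib stores header names capitalized, hence 'User-agent' on lookup.
print(req.get_header("User-agent"))
```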

The try ... except block makes the actual call using urlopen() to fetch the HTML. Should the connection to the proxy fail (we are assuming that a connection to Google will go through), some of the possible errors are handled in the except clauses. It is critical to confirm that the HTML returned through the proxy is the same HTML you get when manually checking the URL source. The purpose of decode("utf-8") will be explained in Part 2 of this series.
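A minimal sketch of such a guarded fetch is shown below. It is not the original listing: the function name fetch_html, the timeout value, the User-Agent string and the particular exceptions caught are assumptions. It also calls opener.open() instead of installing the opener and using urlopen(), so that the example leaves no global state behind.

```python
import socket
import urllib.error
import urllib.request


def fetch_html(url, proxy, timeout=10):
    """Fetch url through an HTTP proxy; return the HTML, or None on failure."""
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy})
    )
    req = urllib.request.Request(
        url,
        # Arbitrary example User-Agent value.
        headers={"User-Agent": "Mozilla/5.0 (compatible; proxy-checker-example)"},
    )
    try:
        with opener.open(req, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except urllib.error.HTTPError as error:
        # The proxy or the target answered with an HTTP error status.
        # HTTPError is a subclass of URLError, so it must be caught first.
        print("HTTP error:", error.code, error.reason)
    except urllib.error.URLError as error:
        # Connection refused, DNS failure, connect timeout and similar.
        print("URL error:", error.reason)
    except socket.timeout:
        # The connection was made but reading the response timed out.
        print("Timed out while reading the response")
    except ConnectionError as error:
        # The connection was reset or aborted mid-transfer.
        print("Connection error:", error)
    return None
```

A return value of None only says that the request failed; a non-None result still has to be compared against the expected HTML before the proxy is declared functional.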

For more information about proxies, anonymity and headers take a look at the excellent resources below:

Proxy Server Wiki
AnonymousProxy
Forwarded HTTP Extension

Subscribe to the RSS feed or check back soon for Part 2!
