When I started working with the awesome Scrapy project, one of my frustrations was how to unit test my CSS/XPath selectors.
I am aware of spider contracts, but I just wanted to write standard Python unit tests for my HTML parsing.

This article covers how I unit tested the spider I wrote to parse the free proxy servers at https://free-proxy-list.net

The pertinent data items that we want (ip, port, country, etc.) are defined in a table like this:

    IP Address   Port   Code  Country  Anonymity    Google  Https  Last Checked
    ...          53281  EC    Ecuador  elite proxy  no      no     11 seconds ago

In items.py I created a ProxyServerItem for this using the Scrapy Item class:

    import scrapy

    class ProxyServerItem(scrapy.Item):
        ip = scrapy.Field()
        port = scrapy.Field()
        code = scrapy.Field()
        country = scrapy.Field()
        anonymity = scrapy.Field()
        google = scrapy.Field()
        https = scrapy.Field()
        last_checked = scrapy.Field()

Initially, here is how I wrote the parse method in my spider class:

    def parse(self, response):
        for tr in response.xpath('//table[@id="proxylisttable"]/tbody/tr'):
            yield ProxyServerItem(
                ip=tr.xpath('td[1]/text()').extract_first(),
                # ...and so on for the remaining fields, one td per field
            )

Then, I read about the ItemLoader class and changed my implementation:

    def parse(self, response):
        for tr in response.xpath('//table[@id="proxylisttable"]/tbody/tr'):
            l = ProxyServerLoader(item=ProxyServerItem(), selector=tr)
            l.add_xpath('ip', 'td[1]/text()')
            l.add_xpath('port', 'td[2]/text()')
            l.add_xpath('code', 'td[3]/text()')
            l.add_xpath('country', 'td[4]/text()')
            l.add_xpath('anonymity', 'td[5]/text()')
            l.add_xpath('google', 'td[6]/text()')
            l.add_xpath('https', 'td[7]/text()')
            l.add_xpath('last_checked', 'td[8]/text()')
            yield l.load_item()

I started thinking about how to unit test the parse method above. I would have to instantiate the spider class (FreeProxyListSpider) and pass in sample HTML.

However, what I really wanted to test was my XPath selectors. There are two main parts to the XPath: the row selector in the for loop, and the per-field XPaths used to populate each item.
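Those two parts can be sanity-checked without Scrapy at all. The paths used here (an attribute predicate plus positional td indexing) fall inside the XPath subset that Python's standard-library ElementTree supports, so a throwaway script can confirm the selector logic against a made-up miniature of the table (the markup below is an assumption, not the real page):

```python
import xml.etree.ElementTree as ET

# Made-up miniature of the proxy table; only the structure matters here.
html = """<html><body>
<table id="proxylisttable">
  <tbody>
    <tr><td>1.2.3.4</td><td>53281</td></tr>
    <tr><td>5.6.7.8</td><td>8080</td></tr>
  </tbody>
</table>
</body></html>"""

root = ET.fromstring(html)
# Row selector: same idea as the spider's, with './/' in place of '//'.
rows = root.findall(".//table[@id='proxylisttable']/tbody/tr")
# Per-field selector: positional td indexing within each row.
ips = [tr.find('td[1]').text for tr in rows]
```

This only proves the path logic, not Scrapy's extraction behaviour, so it complements rather than replaces a test of the spider itself.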

At first, after much googling, I wrote a test for the whole spider class using HtmlResponse. The test uses the standard Python unittest.TestCase class and includes an HTML snippet I cobbled together:

    import unittest

    from scrapy.http import HtmlResponse

    # FreeProxyListSpider is imported from the spider module under test.

    class TestFreeProxyListSpider(unittest.TestCase):
        def test_parse(self):
            body = b"""
            <table id="proxylisttable">
              <tbody>
                <tr><td>ip1</td><td>port1</td><td>code1</td><td>country1</td>
                  <td>anonymity1</td><td>google1</td><td>https1</td><td>1 days</td></tr>
                <tr><td>ip2</td><td>port2</td><td>code2</td><td>country2</td>
                  <td>anonymity2</td><td>google2</td><td>https2</td><td>2 days</td></tr>
              </tbody>
            </table>
            """
            response = HtmlResponse(url="a test", body=body)
            items = FreeProxyListSpider().parse(response)
            for i, item in enumerate(items, start=1):
                self.assertEqual(item['ip'], "ip%s" % i)
                self.assertEqual(item['port'], "port%s" % i)
                self.assertEqual(item['code'], "code%s" % i)
                self.assertEqual(item['country'], "country%s" % i)
                self.assertEqual(item['anonymity'], "anonymity%s" % i)
                self.assertEqual(item['google'], "google%s" % i)
                self.assertEqual(item['https'], "https%s" % i)
                self.assertEqual(item['last_checked'], '%s days' % i)

If you really wanted to be certain the HTML is correct, you could copy it from the actual website and load it from a file. However, I think inlining the HTML makes the test self-contained and easier to understand.

In the next refactor of the code I will move the code that creates the ItemLoader into a factory.

Stay tuned.
