Blog

Install Scrapy in AWS using Ansible Playbook

Here is a simple Ansible playbook to install Scrapy on an Amazon Web Services (AWS) EC2 instance. It includes all the prerequisite packages for Scrapy and is based on Python 3.5.
Note: this installs the scrapy app and not scrapyd, the Scrapy daemon.

The EC2 image I used is an AWS 64-bit Linux AMI (ami-ebd02392), but it should work on any recent YUM-based image.

Just save the contents below as part of an Ansible role or a single playbook; a standalone playbook wrapper is sketched after the tasks.

---
- name: Install Scrapy prereqs
  yum: pkg={{item}} state=present
  with_items:
    - gcc
    - python35
    - python35-devel 
    - python35-pip 
    - libxml2 
    - libxml2-devel 
    - libxslt
    - libxslt-devel
    - libffi 
    - libffi-devel 
    - openssl 
    - openssl-devel

- name: Install Scrapy
  pip:
    executable: pip-3.5
    name: scrapy
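
If you would rather run this as a single playbook than drop the tasks into a role, a minimal sketch of a standalone wrapper looks like this (the hosts value, the become setting and the file name are assumptions to adjust for your environment):

---
- hosts: all          # assumption: replace with your inventory group
  become: yes         # assumption: yum/pip installs need root
  tasks:
    - name: Install Scrapy prereqs
      yum: pkg={{item}} state=present
      with_items: [gcc, python35, python35-devel, python35-pip, libxml2,
                   libxml2-devel, libxslt, libxslt-devel, libffi,
                   libffi-devel, openssl, openssl-devel]

    - name: Install Scrapy
      pip:
        executable: pip-3.5
        name: scrapy

Save it as, say, scrapy.yml and run it with ansible-playbook -i <your-inventory> scrapy.yml.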

Unit Test Scrapy CSS/XPath using ItemLoader and HtmlResponse

When I started working with the awesome scrapy project, one of the frustrations I had was how to unit test my CSS/XPath selectors.
I am aware of spider contracts, but I just wanted to write standard Python unit tests for my HTML parsing.

This article covers how I unit tested the spider I wrote to parse the list of free proxy servers at https://free-proxy-list.net

The pertinent data items that we want (ip, port, country, etc.) are defined in a table like this:

IP Address       Port   Code  Country  Anonymity    Google  Https  Last Checked
181.211.187.250  53281  EC    Ecuador  elite proxy  no      no     11 seconds ago

In items.py I created a ProxyServerItem for this using the scrapy Item class:


import scrapy


class ProxyServerItem(scrapy.Item):
    ip = scrapy.Field()
    port = scrapy.Field()
    code = scrapy.Field()
    country = scrapy.Field()
    anonymity = scrapy.Field()
    google = scrapy.Field()
    https = scrapy.Field()
    last_checked = scrapy.Field()

Initially, here is how I wrote the parse method in my spider class:

def parse(self, response):
    for tr in response.xpath('//table[@id="proxylisttable"]/tbody/tr'):
        yield ProxyServerItem(
            ip=(tr.xpath('./td[1]/text()').extract_first()),
            port=(tr.xpath('./td[2]/text()').extract_first()),
            code=(tr.xpath('./td[3]/text()').extract_first()),
            country=(tr.xpath('./td[4]/text()').extract_first()),
            anonymity=(tr.xpath('./td[5]/text()').extract_first()),
            google=(tr.xpath('./td[6]/text()').extract_first()),
            https=(tr.xpath('./td[7]/text()').extract_first()),
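            # dateparser turns relative strings like "11 seconds ago" into a datetime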
            last_checked=(str(dateparser.parse(
                tr.xpath('./td[8]/text()').extract_first()))))

Then, I read about the ItemLoader class and changed my implementation:

def parse(self, response):
    tbody = response.xpath('//table[@id="proxylisttable"]/tbody')
    l = ProxyServerLoader(item=ProxyServerItem(), selector=tbody)
    l.add_xpath('ip', './tr/td[1]')
    l.add_xpath('port', './tr/td[2]')
    l.add_xpath('code', './tr/td[3]')
    l.add_xpath('country', './tr/td[4]')
    l.add_xpath('anonymity', './tr/td[5]')
    l.add_xpath('google', './tr/td[6]')
    l.add_xpath('https', './tr/td[7]')
    l.add_xpath('last_checked', './tr/td[8]')
    return l.load_item()
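
ProxyServerLoader isn't shown above. As a minimal sketch, think of it as an ItemLoader subclass along these lines (the processor choices here are an assumption, not the exact class from the project):

from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst
from w3lib.html import remove_tags


class ProxyServerLoader(ItemLoader):
    # assumption: the add_xpath() calls above select whole <td> cells,
    # so strip the markup and surrounding whitespace on the way in...
    default_input_processor = MapCompose(remove_tags, str.strip)
    # ...and keep a single value per field on the way out
    default_output_processor = TakeFirst()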

I started thinking about how to unit test the parse method above. I would have to instantiate the class (FreeProxyListSpider) and pass in the sample HTML.

However, what I really wanted to test was my XPath selectors. There are two main parts to the XPath: the row selector in the for loop, and the per-field XPaths used to populate each item.

At first, after much googling, I wrote a test for the whole spider class using HtmlResponse. The test uses the standard Python unittest.TestCase class and includes an HTML snippet that I cobbled together:


import unittest

from scrapy.http import HtmlResponse

# (FreeProxyListSpider is imported from wherever the spider module lives)


class TestFreeProxyListSpider(unittest.TestCase):
    def test_parse(self):
        body = b"""
        <table id="proxylisttable">
          <tbody>
            <tr>
              <td>ip1</td><td>port1</td><td>code1</td><td>country1</td>
              <td>anonymity1</td><td>google1</td><td>https1</td><td>1 days</td>
            </tr>
            <tr>
              <td>ip2</td><td>port2</td><td>code2</td><td>country2</td>
              <td>anonymity2</td><td>google2</td><td>https2</td><td>2 days</td>
            </tr>
          </tbody>
        </table>
        """
        response = HtmlResponse(url="a test", body=body)
        items = FreeProxyListSpider().parse(response)
        for i, item in enumerate(items, start=1):
            self.assertEquals(item['ip'], "ip%s" % i)
            self.assertEquals(item['port'], "port%s" % i)
            self.assertEquals(item['code'], "code%s" % i)
            self.assertEquals(item['country'], "country%s" % i)
            self.assertEquals(item['anonymity'], "anonymity%s" % i)
            self.assertEquals(item['google'], "google%s" % i)
            self.assertEquals(item['https'], "https%s" % i)
            self.assertEquals(item['last_checked'], '%s days' % i)

If you really wanted to be certain the HTML is correct, you could copy it from the actual website and load it from a file. However, I think inlining the HTML makes the test self-contained and easier to understand.

In the next refactor of the code I will move the code that creates the ItemLoader into a factory.
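
As a rough sketch of where that refactor is heading (the function name below is illustrative, not final code), the idea is to pull the loader construction out of parse() so it can be driven directly from a test:

def build_proxy_server_loader(response):
    # same XPaths as parse(), but callable on its own from a unit test
    tbody = response.xpath('//table[@id="proxylisttable"]/tbody')
    l = ProxyServerLoader(item=ProxyServerItem(), selector=tbody)
    l.add_xpath('ip', './tr/td[1]')
    l.add_xpath('port', './tr/td[2]')
    l.add_xpath('code', './tr/td[3]')
    l.add_xpath('country', './tr/td[4]')
    l.add_xpath('anonymity', './tr/td[5]')
    l.add_xpath('google', './tr/td[6]')
    l.add_xpath('https', './tr/td[7]')
    l.add_xpath('last_checked', './tr/td[8]')
    return l

parse() then shrinks to return build_proxy_server_loader(response).load_item(), and a test can build an HtmlResponse from an inline snippet and assert on the loaded item without touching the spider class at all.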

Stay tuned.

VIM: insert text on multiple lines

If you need to add the same text on multiple lines, e.g. to comment out multiple lines, insert semicolons, etc., then visual block mode is your friend!

Let’s take a look at the following Ansible YAML playbook:
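
The part of the playbook we care about is its roles list, which sits around lines 12 to 14 and looks something like this (an illustrative excerpt; only the role names come from the description below, the rest of the file is assumed):

  roles:
    - nginx
    - php-fpm
    - wordpress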

Assume we want to comment out the nginx, php-fpm and wordpress roles (lines 12 – 14). In YAML the comment character is ‘#’.

We need to go to line 12, start visual block mode, select the 3 lines, insert a ‘#’ character at the beginning of each line, and exit back to normal mode.

In vim you start visual block mode by pressing <Ctrl+v>. Now run these commands:

12G
<Ctrl+v>jj$
I#
<Esc>

Lines 12 to 14 should now be preceded by a ‘#’. This is because in visual block mode, insert commands are repeated for each line in the block. Nice!