How to write a scraper

What are scrapers?

To be able to display and send alerts for planning applications, PlanningAlerts needs to download applications from as many councils as possible. As the vast majority of councils don't supply the data in a reusable, machine-readable format, we need to write web scrapers for each local government authority.

These scrapers fetch the data from council web pages and present it in a structured format so we can load it into the PlanningAlerts database.

How can I help?

If you have some computer programming experience, you should be able to work out how to prepare a scraper for PlanningAlerts. All PlanningAlerts scrapers are hosted on our morph.io scraping platform, which takes care of all the boring bits of scraping for you (well, most of the boring bits!).

The next thing to do is to decide which council to scrape. Quickly double-check that it isn't covered already, then look it up on our crowd-sourced list of councils and have a look at its published planning applications.

An introduction to scraping with morph.io

With morph.io, you can choose to write your scraper in Ruby, Python, PHP or Perl, so there's a good chance you're already familiar with one of the available languages. Since all of the code is hosted on GitHub, you're probably also already familiar with how to share and collaborate on your scraper code.

morph.io provides an easy migration path from our previous scraper host, ScraperWiki Classic, and offers the same great conveniences: it takes care of saving your data, runs your scraper regularly, and emails you when there's a problem.

You can find out more in the morph.io documentation. If you're a novice scraper writer, the tutorials on scraping with ScraperWiki Classic for Ruby, Python or PHP are still helpful resources, despite being out of date.

Now it's time to scrape

Make sure you have a GitHub account, then you can use it to sign in to morph.io and create a new scraper that downloads and saves the following information:

The following fields are required. All development applications should have these bits of information.

council_reference (e.g. TA/00323/2012)

The ID that the council has given the planning application. This must also be the unique key for this data set.

address (e.g. 1 Sowerby St, Goulburn, NSW)

The physical address that this application relates to. This will be geocoded, so it doesn't need to be in a specific format, but the more explicit it is the more likely it is to geocode successfully. If the original address did not include the state (e.g. "QLD") at the end, then add it; the sketch after this list shows one way to do that.

description (e.g. Ground floor alterations to rear and first floor addition)

A text description of what the planning application seeks to carry out.

info_url (e.g. http://foo.gov.au/app?key=527230)

A URL that provides more information about the planning application.

This should be a persistent URL, preferably one specific to this particular application. In many cases councils force users to click through a license agreement before they can access planning applications. If that's the case, be careful about which URL you provide: test the link in a browser that hasn't established a session with the council's site, to make sure PlanningAlerts users can follow it without being shown an error.

comment_url (e.g. http://foo.gov.au/comment?key=527230)

A URL where users can provide a response to council about this particular application.

As with info_url, this needs to be a persistent URL and should be specific to this particular application if possible. mailto (email) links are also valid in this field.

date_scraped (e.g. 2012-08-01)

The date that your scraper is collecting this data (i.e. today). It should be in ISO 8601 format; in Ruby, Date.today.to_s gives you this.
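To make the required fields concrete, here's a minimal Ruby sketch of assembling a record. It simply reuses the example values from the list above, and the state-appending check assumes NSW; a real scraper would extract these values from the council's pages.

require 'date'

# Example values only; a real scraper extracts these from the council's pages.
address = "1 Sowerby St, Goulburn"

# If the council leaves the state off the address, append it yourself so the
# address is more likely to geocode correctly (we assume NSW here).
address += ", NSW" unless address =~ /\b(NSW|QLD|VIC|TAS|SA|WA|NT|ACT)\b/i

record = {
  "council_reference" => "TA/00323/2012",
  "address"           => address,
  "description"       => "Ground floor alterations to rear and first floor addition",
  "info_url"          => "http://foo.gov.au/app?key=527230",
  "comment_url"       => "http://foo.gov.au/comment?key=527230",
  "date_scraped"      => Date.today.to_s
}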

The following fields are optional because not every planning authority provides them. Please do include them if the data is available.

date_received (e.g. 2012-06-23)

The date this application was received by council. Should be in ISO 8601 format.

on_notice_from (e.g. 2012-08-01)

The date from when public submissions can be made about this application. Should be in ISO 8601 format.

on_notice_to (e.g. 2012-08-14)

The date until when public submissions can be made about this application. Should be in ISO 8601 format.
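Councils rarely publish dates in ISO 8601, so you'll usually need to reformat them. Here's a small Ruby sketch, assuming a hypothetical council that displays dates as day/month/year:

require 'date'

# "23/06/2012" is a made-up example of a day/month/year date from a council page.
date_received = Date.strptime("23/06/2012", "%d/%m/%Y").to_s
puts date_received  # => "2012-06-23", i.e. ISO 8601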

To ensure the date_scraped field isn't overwritten on subsequent scrapes, add code similar to this to save your planning application to the morph.io datastore:

require 'scraperwiki'

# Only save the record if we haven't seen this council_reference before, so
# date_scraped keeps the value from the first scrape. The rescue covers the
# very first run, when the data table doesn't exist yet.
if (ScraperWiki.select("* from data where `council_reference`='#{record['council_reference']}'").empty? rescue true)
  ScraperWiki.save_sqlite(['council_reference'], record)
else
  puts "Skipping already saved record " + record['council_reference']
end
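Putting the pieces together, a skeleton scraper might look something like the sketch below. The list-page URL, the CSS selectors and the comment email address are all invented for illustration, and it assumes the mechanize and scraperwiki gems; every real council site needs its own parsing logic.

require 'scraperwiki'
require 'mechanize'
require 'date'

# Hypothetical list page and selectors, for illustration only.
agent = Mechanize.new
page = agent.get("http://foo.gov.au/applications/current")

page.search(".application").each do |row|
  record = {
    "council_reference" => row.at(".reference").text.strip,
    "address"           => row.at(".address").text.strip + ", NSW",
    "description"       => row.at(".description").text.strip,
    "info_url"          => page.uri.merge(row.at(".reference a")["href"]).to_s,
    # comment_url can also be a mailto link (made-up address here)
    "comment_url"       => "mailto:dacomments@foo.gov.au",
    "date_scraped"      => Date.today.to_s
  }

  # Same guard as above: only save applications we haven't already stored.
  if (ScraperWiki.select("* from data where `council_reference`='#{record['council_reference']}'").empty? rescue true)
    ScraperWiki.save_sqlite(["council_reference"], record)
  end
end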

If you get stuck, have a look at the scrapers already written for PlanningAlerts, and email the community mailing list if you have any questions.

Scheduling the scraper

Set the scraper to run once per day. You can do this on the scraper's settings page on morph.io.

Finishing up

Once you've finished your scraper and it's successfully downloading planning applications, simply email the OpenAustralia Community mailing list or contact@planningalerts.org.au and we'll fork it into the planningalerts-scrapers organization and import it into PlanningAlerts.

The last thing to do is to look up on Wikipedia how many people live in the council area you've just covered, so you can pat yourself on the back knowing that you've just helped tens of thousands of people get PlanningAlerts.
