How to write a scraper
What are scrapers?
To display and send alerts for planning applications, PlanningAlerts needs to download applications from as many councils as possible. As the vast majority of councils don't supply the data in a reusable, machine-readable format, we need to write a web scraper for each local government authority.
These scrapers fetch the data from council web pages and present it in a structured format so we can load it into the PlanningAlerts database.
How can I help?
If you have some computer programming experience, you should be able to work out how to prepare a scraper for PlanningAlerts. All PlanningAlerts scrapers are hosted on our morph.io scraping platform, which takes care of all the boring bits of scraping for you (well, most of the boring bits!).
The next thing to do is to decide what council to scrape. Once you've picked one, look it up on our crowd-sourced list of councils and have a look at their published planning applications. Quickly double-check that the council isn't covered already.
Some systems for displaying development applications on council websites are widely used. For most of those we have already developed scrapers that can scrape many authorities using the same system. Check whether the council you want to scrape uses one of these systems: Masterview, Civica, Icon, ATDIS, Horizon, Technology One or Epathway. We don't yet have good documentation on how to recognise these different systems, but that's something we want to create. Or maybe you can help?
An introduction to scraping with morph.io
With morph.io, you can choose to write your scraper in Ruby, Python, PHP or Perl, so there's a good chance you're already familiar with one of the available programming languages. Since all of the code is hosted on GitHub, you're probably also already familiar with how to share and collaborate on your scraper code.
morph.io provides great conveniences like taking care of saving your data, running your scraper regularly, and emailing you when there's a problem.
You can find out more in the morph.io documentation.
Now it's time to scrape
The following fields are required. All development applications should have these bits of information.
council_reference
The ID that the council has given the planning application. This must also be the unique key for this data set.
address (example: 1 Sowerby St, Goulburn, NSW)
The physical address that this application relates to. This will be geocoded, so it doesn't need to be in a specific format, but the more explicit it is, the more likely it is to be geocoded successfully. If the original address did not include the state (e.g. "QLD") at the end, then add it.
description (example: Ground floor alterations to rear and first floor addition)
A text description of what the planning application seeks to carry out.
info_url
A URL that provides more information about the planning application.
This should be a persistent URL that preferably is specific to this particular application. In many cases councils force users to click through a license to access planning applications. In this case be careful about what URL you provide. Test the link in a browser that hasn't established a session with the council's site to ensure users of PlanningAlerts will be able to click the link without being presented with an error.
date_scraped
The date that your scraper is collecting this data (i.e. now). Should be in ISO 8601 format.
Use the following Ruby code:
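A minimal sketch using Ruby's standard `date` library. `Date#to_s` returns the date in ISO 8601 form (YYYY-MM-DD); the variable name `date_scraped` is an assumption for illustration.

```ruby
require 'date'

# Today's date as an ISO 8601 string (YYYY-MM-DD).
date_scraped = Date.today.to_s
puts date_scraped
```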
Note that there used to be a field "comment_url" above that was required. This is no longer used though you might still see it referenced in older scrapers.
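The required fields can be gathered into a single record before saving. The following is only a sketch: the keys `info_url` and `date_scraped`, the example reference and URL, and the helper names `with_state` and `missing_fields` are assumptions for illustration, and the state check is a simple whole-word heuristic rather than a full address parser.

```ruby
require 'date'

# Keys of the required fields (info_url and date_scraped are assumed names).
REQUIRED_FIELDS = %w[council_reference address description info_url date_scraped].freeze

# Append the state abbreviation unless it already appears as a whole word.
def with_state(address, state)
  address = address.strip
  address.match?(/\b#{Regexp.escape(state)}\b/) ? address : "#{address}, #{state}"
end

# Return the names of required fields that are missing or blank.
def missing_fields(record)
  REQUIRED_FIELDS.select { |field| record[field].to_s.strip.empty? }
end

record = {
  'council_reference' => 'DA/0001/2012', # made-up reference
  'address'           => with_state('1 Sowerby St, Goulburn', 'NSW'),
  'description'       => 'Ground floor alterations to rear and first floor addition',
  'info_url'          => 'https://council.example/applications/DA-0001-2012', # placeholder URL
  'date_scraped'      => Date.today.to_s
}

missing = missing_fields(record)
raise "Missing required fields: #{missing.join(', ')}" unless missing.empty?
```

Checking for blank values before saving makes a broken selector show up as a failed run rather than as silently incomplete records.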
The following fields are optional because not every planning authority provides them. Please do include them if data is available.
date_received
The date this application was received by council. Should be in ISO 8601 format.
on_notice_from
The date from which public submissions can be made about this application. Should be in ISO 8601 format.
on_notice_to
The date until which public submissions can be made about this application. Should be in ISO 8601 format.
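Councils rarely publish dates in ISO 8601 form, so the optional dates usually need converting. The sketch below assumes the council writes dates as day/month/year (for example "23/06/2012"); that format, the keys `date_received`, `on_notice_from` and `on_notice_to`, and the helper names are all assumptions for illustration.

```ruby
require 'date'

# Convert a council-formatted date string to ISO 8601. Returns nil when the
# value is missing or cannot be parsed. The default format is an assumption;
# adjust it to match the council's site.
def iso_date(text, format = '%d/%m/%Y')
  return nil if text.nil? || text.strip.empty?

  Date.strptime(text.strip, format).to_s
rescue ArgumentError
  nil
end

# Add the optional date fields only when the council publishes them.
def add_optional_dates(record, raw)
  %w[date_received on_notice_from on_notice_to].each do |field|
    value = iso_date(raw[field])
    record[field] = value if value
  end
  record
end

record = add_optional_dates(
  { 'council_reference' => 'DA/0001/2012' }, # made-up reference
  { 'date_received' => '23/06/2012', 'on_notice_from' => '', 'on_notice_to' => nil }
)
puts record['date_received'] # 2012-06-23
```

Fields with no usable value are left out of the record instead of being stored as empty strings.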
Versioning application data
It's important that scrapers collect the most up-to-date information. If the information about an application changes (because, for instance, a council updates the wording or corrects a mistake), your scraper should pick up the change.
For that reason, it's good practice for your scraper to look back a reasonable amount of time (one month is good) and scrape all applications that might have changed in that time. That way you're most likely to catch any changes. Often it's not possible to simply get a list of applications that recently changed. Instead you have to scrape, say, a list of applications that were recently received and a list of applications that have recently been determined (whether they were approved or not).
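The look-back window can be computed with Ruby's standard `date` library. This is a sketch only: the helper name `lookback_range` is hypothetical, and how the range is passed to the council's search form depends on the site.

```ruby
require 'date'

# Return the inclusive date range to re-scrape, looking back `months`
# whole months from `today`. Date#<< shifts a date back by whole months.
def lookback_range(today = Date.today, months = 1)
  (today << months)..today
end

range = lookback_range(Date.new(2012, 8, 1))
puts "Scrape applications from #{range.first} to #{range.last}"
# prints "Scrape applications from 2012-07-01 to 2012-08-01"
```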
When you save an updated version of an application make sure you use the council_reference field as the unique id. That way you don't end up with multiple versions of the same record. If you're writing your scraper in Ruby that will look something like:
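As a self-contained illustration of saving with a unique key, the sketch below uses a plain Hash as a stand-in for the scraper's data table; `save_record` is a hypothetical helper. On morph.io, scrapers using the scraperwiki gem would typically call `ScraperWiki.save_sqlite(['council_reference'], record)` instead, which replaces the existing row for that key.

```ruby
# In-memory stand-in for the data table, keyed on council_reference.
# Saving a record whose key already exists replaces the stored version,
# so the table never holds two rows for the same application.
def save_record(table, record)
  key = record.fetch('council_reference')
  table[key] = record
  table
end

table = {}
save_record(table, { 'council_reference' => 'DA/0001/2012', 'description' => 'Ground floor alterations' })
save_record(table, { 'council_reference' => 'DA/0001/2012', 'description' => 'Ground floor alterations to rear' })

puts table.size                           # 1
puts table['DA/0001/2012']['description'] # Ground floor alterations to rear
```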
When the main PlanningAlerts system reads the latest application data from your scraper on morph.io, it automatically keeps track of the changes that occur on individual applications. That way nothing truly gets overwritten: there is always a history of which fields changed and when. At the moment this information is recorded in the database but isn't yet exposed to users in the main application or in the data published through the API.
Scheduling the scraper
Set the scraper to run once per day. This can be done on morph.io on the settings page of the scraper.
Once you've finished your scraper and it's successfully downloading planning applications, simply email firstname.lastname@example.org and we'll fork it into the planningalerts-scrapers organization and import it into PlanningAlerts.
The last thing to do is look up on Wikipedia how many people live within the council you've just covered so you can pat yourself on the back knowing that you've just helped tens of thousands of people get PlanningAlerts.