What is Screenscraping?

It’s programmatically gathering data from a website. A software developer writes a bit of code (sketched below) to:

  1. Start up a browser
  2. Click around the website as a normal user would
  3. Download the HTML, or extract specific data, depending on the particular website
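
For illustration, here’s a minimal sketch of those three steps using Selenium, one common browser-automation library. The URL and link text are placeholders rather than a real scraping target.

```python
# A minimal sketch of the three steps above, using Selenium.
# The URL and link text below are placeholders, not a real target.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()                # 1. Start up a browser
try:
    driver.get("https://example.com")      # 2. Navigate as a user would
    driver.find_element(By.LINK_TEXT, "More information...").click()
    html = driver.page_source              # 3. Download the HTML
    print(html[:200])                      # hand off to parsing/reporting
finally:
    driver.quit()                          # always close the browser
```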


Why would you consider it?

  • When you need data from a particular website fed automatically into your reporting or workflow
  • When that website doesn’t offer an API

What are the pros?

  • It’s usually free
  • It’s handy for getting data when no API is available


What are the cons, and why would I bother using an API then?

An API is a contract: it’s usually a lot more reliable and robust, so the systems you build on top of an API will run far more reliably.

A website, by contrast, can be changed by its owner at any stage. Websites are usually not as reliable as APIs, so your downstream systems need to manage this carefully.

So should I use screenscraping as a data-gathering technique?

Absolutely, as long as you manage expectations and get a proper bit of code written to do it.

Any tips on getting it done?

  • Don’t use a hand-cranked script running from someone’s desktop; it won’t work reliably
  • Do make sure the screenscraping code runs from a server
  • Do make sure there are notifications on both success and error paths
  • Do make sure there are retries built in (see the sketch after this list)
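
For illustration, here’s a minimal sketch of the retry-and-notify pattern from the tips above. The notify() function is a hypothetical stand-in for whatever alerting channel your team uses (email, a Slack webhook, and so on).

```python
# A minimal sketch of retries plus success/error notifications.
# notify() is a hypothetical stand-in for your real alerting channel.
import time
import requests

def notify(message: str) -> None:
    # Replace with email, a Slack webhook, etc.
    print(f"[ALERT] {message}")

def scrape_with_retries(url: str, max_attempts: int = 3) -> str:
    for attempt in range(1, max_attempts + 1):
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            notify(f"Scrape succeeded on attempt {attempt}")   # success path
            return response.text
        except requests.RequestException as exc:
            notify(f"Attempt {attempt} failed: {exc}")         # error path
            time.sleep(2 ** attempt)   # back off before retrying
    raise RuntimeError(f"All {max_attempts} attempts failed for {url}")
```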


If you have questions on your data lake and how to make it more reliable and valuable, get in touch; we do this for a living!