What is Screenscraping?
It’s programmatically gathering data from a website. A software developer writes a bit of code to:
- Start up a browser
- Click around the website as a normal user would
- Download the HTML, or whatever data the particular website exposes
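The extraction step above can be sketched with nothing but Python's standard library. This is a minimal, illustrative example: the page markup and the "price" class are made up, and in practice you'd fetch live HTML first (e.g. with an HTTP client or a browser-automation tool) rather than use a hardcoded string.

```python
from html.parser import HTMLParser

# Hypothetical page snippet, standing in for HTML downloaded from a site.
SAMPLE_HTML = """
<html><body>
  <span class="price">19.99</span>
  <span class="price">4.50</span>
</body></html>
"""

class PriceScraper(HTMLParser):
    """Collects the text of every <span class="price"> element."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == "span" and ("class", "price") in attrs:
            self._in_price = True

    def handle_data(self, data):
        if self._in_price:
            self.prices.append(data.strip())
            self._in_price = False

scraper = PriceScraper()
scraper.feed(SAMPLE_HTML)
print(scraper.prices)  # → ['19.99', '4.50']
```

The point is not the parser itself but the shape of the job: download markup, pick out the elements you care about, hand the values on to your reporting or workflow.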
Why would you consider it?
- When you need data from a particular website fed automatically into your reporting or workflow
- When API data isn’t available for that particular website
What are the pros?
- The data itself is usually free; you only pay for the development effort
- Handy to get data when API data isn’t available
What are the cons and why would I bother using an API then?
An API is a contract: it’s usually far more stable and robust, so the systems you build on top of it run far more reliably.
A website can be changed by its owner at any time, so websites are (usually) not as reliable as APIs, and your downstream systems need to manage this carefully.
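One way downstream systems can manage that risk is to validate scraped data before it goes anywhere. Here is a hedged sketch: the field names and price bounds are hypothetical, but the idea is to fail loudly the moment the site's layout changes, rather than silently feed bad data into reports.

```python
def validate_row(row: dict) -> dict:
    """Raise ValueError if a scraped row doesn't look like what we expect.

    The expected fields and the plausible price range are assumptions
    for this example, not rules from any real site.
    """
    required = {"product", "price"}
    missing = required - row.keys()
    if missing:
        raise ValueError(f"missing fields (site layout changed?): {missing}")
    price = float(row["price"])  # raises ValueError if no longer numeric
    if not 0 < price < 10_000:
        raise ValueError(f"price {price} outside plausible range")
    return {"product": row["product"], "price": price}

# A row that matches expectations passes through, cleaned up.
good = validate_row({"product": "widget", "price": "19.99"})
print(good)  # → {'product': 'widget', 'price': 19.99}

# A row missing a field (e.g. the price column vanished) fails fast.
try:
    validate_row({"product": "widget"})
except ValueError as e:
    print("caught:", e)
```

A check like this turns a silent data-quality problem into an immediate, visible error you can alert on.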
So should I use screenscraping as a data-gathering technique?
Absolutely, as long as you manage expectations and have a proper piece of code written to do it.
Any tips on getting it done?
- Don’t use a hand-cranked script running from someone’s desktop; it won’t run reliably
- Do make sure the screenscraping code is running from a server
- Do make sure there are notifications on both success and error paths
- Do make sure there are retries built in
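The retry and notification tips above can be sketched in a few lines of standard-library Python. The function names and retry policy here are illustrative; in a real deployment the log calls would be wired to email or chat alerts rather than the console.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("scraper")

def run_with_retries(job, attempts=3, delay_seconds=1.0):
    """Run a scraping job, retrying on failure, logging both outcomes.

    `job` is any zero-argument callable that returns the scraped data.
    """
    for attempt in range(1, attempts + 1):
        try:
            result = job()
            # Success path notification
            log.info("scrape succeeded on attempt %d", attempt)
            return result
        except Exception:
            # Error path notification, with traceback
            log.exception("scrape attempt %d/%d failed", attempt, attempts)
            if attempt < attempts:
                time.sleep(delay_seconds)  # simple fixed back-off between tries
    raise RuntimeError(f"scrape failed after {attempts} attempts")

# Usage: a hypothetical flaky job that fails once, then succeeds.
calls = {"n": 0}
def flaky_job():
    calls["n"] += 1
    if calls["n"] < 2:
        raise ConnectionError("transient network error")
    return "page html"

print(run_with_retries(flaky_job, delay_seconds=0.01))  # → page html
```

Running this from a scheduled job on a server, with the log output routed to a notification channel, covers all three "Do" items at once.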
If you have questions about your data lake and how to make it more reliable and valuable, get in touch — we do this for a living!