Web Scraping 101

July 12, 2016 | by Luis Carvajal

In developers, QA testing, App development

Web scraping is the process of automatically collecting information from websites.

Every Friday, one of the technical experts at Definity First gives a short talk about a subject they have mastered. This week, we take a look at web scraping.

Web Scraping - Do We Really Need an API?

Because it’s all about the HTTP, HTTP, HTTP…

If you can view content on a website, the information can be scraped. It takes a little more work than manually copying and pasting, however. By programming specific HTTP requests, anyone can retrieve that information automatically, far faster than they could by hand.
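As a rough illustration of what "programming an HTTP request" can look like, here is a minimal sketch using only Python’s standard library. The function names and the User-Agent string are assumptions made up for this example, not something from the talk, and a real scraper would likely need error handling on top.

```python
import urllib.request

# Hypothetical identifier for this sketch; pick something that describes your tool.
DEFAULT_USER_AGENT = "example-scraper/0.1"


def build_request(url, user_agent=DEFAULT_USER_AGENT):
    """Build a GET request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10.0):
    """Return the body found at `url`, decoded as text."""
    request = build_request(url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Use the charset the server declares, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `fetch()` with the address of a page you are allowed to scrape returns its HTML as a string, ready to be parsed.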

But why should I?

Websites are more important than the sum of their APIs. With web scraping, access can be anonymous and fast. Sure, APIs organize data and provide uniform output, but when you need information quickly, web scraping is your tool. It cuts out the middleman and lets you pull data from any website you can view.

Web scraping, benefits and relevance

Is it even useful?

In a word? Yes. Web scraping is an extremely versatile tool. It doesn’t matter who you are: if you have access to the website, you can pull the data. With an API, you pass specific criteria in your code and get back a list of results in a fixed format. Screen scraping lets you take whatever the page displays and use it however you need.

What does the process look like?

Begin by identifying your endpoints, that is, the URLs that contain the data you need. Once you have ‘scraped’ that data, you still have to organize and interpret it. Structure the data the way you need it.
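The "organize and interpret" step might look something like the sketch below, which turns raw HTML into a list of records using Python’s built-in `html.parser`. The sample markup, the `LinkExtractor` class, and the `url`/`title` field names are all invented for illustration; a real page will need its own extraction rules.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the target and visible text of every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the link currently being read, if any
        self._text = []     # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            self.links.append({"url": self._href, "title": title})
            self._href = None


def extract_links(html):
    """Parse an HTML string and return a list of {'url', 'title'} records."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up markup standing in for a scraped page.
SAMPLE_HTML = """
<ul>
  <li><a href="/posts/1">First post</a></li>
  <li><a href="/posts/2">Second <b>post</b></a></li>
  <li>No link here</li>
</ul>
"""
```

Running `extract_links(SAMPLE_HTML)` yields two records, one per link, with nested tags such as `<b>` flattened into the title text. From there the data can be sorted, filtered, or written out in whatever structure you need.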

Traps to avoid:

Watch out for common pitfalls in the scraping business. Some websites strictly prohibit screen scraping, so double-check your target website’s Terms of Use page before you start. Spoof request headers if you need to, watch out for rate limits, and expect poorly formed markup.
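One simple way to stay under a rate limit is to pause between requests. The sketch below is an assumption about how that could be done, not a prescribed method: the `Throttle` class and its parameters are hypothetical, and the right interval depends entirely on the site you are scraping.

```python
import time


class Throttle:
    """Enforce a minimum number of seconds between consecutive wait() calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        # clock and sleep are injectable so the class can be tested without waiting.
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

In a scraping loop, you would create one `Throttle(2.0)` (two seconds is an arbitrary choice) and call its `wait()` method before each request, so the first request goes out immediately and later ones are spaced apart.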