Web scraping and crawling have played a major role in creating the internet we see today. While the technology, the process, and the results remain invisible to most, all of it is here to stay. I’d even say that scraping will never go ‘out of fashion’, barring some extreme regulatory changes.
Of course, over its history, web scraping has undergone significant changes, primarily due to the ever-increasing complexity of the internet. I think relatively few remember the magnificent simplicity of web pages from the 90s. Scraping was a little easier back then.
Starting in tandem
If you were to ask around for the origin story of web scraping, most people would point to relatively new inventions or products. Most likely, you’d get the answer everyone knows – Google. It is definitely the most successful crawling-based company, but far from the first.
As far as we know, the first web crawling application was developed in 1993. Matthew Gray built the fittingly named ‘World Wide Web Wanderer’, which was used to discover new websites and estimate the size of the World Wide Web. It should come as no surprise that Matthew is now the Engineering Director for Search at Google.
Evidently, web scraping kicked off soon after the creation of the internet (or, to be exact, the World Wide Web) in 1989. It took just a few odd years before someone started collecting data stored on the internet.
Of course, it was driven primarily by curiosity and passion. There was likely little financial value on the internet in 1993. In the age of Netscape Navigator, a lot of websites were still far from resembling a business.
It didn’t take long for the usefulness of web scraping to be discovered, though: that same year, JumpStation launched – the first crawler-driven search engine. Upgrades, competitors, and new technologies followed suit.
Most of the early search engines used rudimentary scraping to collect and index pages. Rankings could usually be gamed by stuffing keywords everywhere – an issue that arose from the lack of sophisticated data analysis.
Arguably the most significant early advancement in scraping was Larry Page’s PageRank algorithm, which Google adopted. Instead of going purely by keywords, inbound and outbound links became a measure of a website’s importance.
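To make the idea concrete, here’s a minimal, illustrative sketch of link-based ranking in the spirit of PageRank – not Google’s actual implementation. The tiny link graph and the damping factor are assumptions chosen purely for demonstration.

```python
# Power-iteration sketch of the PageRank idea: a page's score comes from
# the scores of the pages linking to it, not from its keywords.
# The link graph below is made up purely for illustration.
links = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
}

def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    rank = {page: 1 / len(pages) for page in pages}
    for _ in range(iterations):
        # Each page keeps a small base score and receives a share of the
        # score of every page that links to it.
        new_rank = {page: (1 - damping) / len(pages) for page in pages}
        for page, outgoing in links.items():
            share = rank[page] / len(outgoing)
            for target in outgoing:
                new_rank[target] += damping * share
        rank = new_rank
    return rank

print(pagerank(links))  # c.example, with two inbound links, scores highest
```

Of course, those links can only be discovered by crawling the pages in the first place.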
The professional WWW
Yet, web scraping never really caught on back then. Search engines and companies that profit from data were the only ones that truly engaged with scraping and crawling. For a large part of that early history, no one else had a reason to scrape.
As the internet moved away from glorified TXT-files-as-websites, Geocities, and AngelFire towards professionally built pages with payment gateways and products, business interest rose. The possibility of reaching new audiences and buyers revealed itself, and companies began going digital.
Suddenly, monitoring specific pages on the internet became something that might be useful. Data on the internet was no longer just information. Data had gained utility. It could be analysed for profit or for research.
There was (and still is) one problem, though. While regular internet users would create simplistic websites back in the day, doing business meant doing marketing and sales. Companies had lifted all the best practices from regular advertising and moved them online. That meant shiny, sleek websites – ones optimised for viewing, browsing, and buying.
The professionalisation of the internet had led to the creation of websites that were much more than just glorified Excel spreadsheets. As a result, the underlying HTML became more intricate, which meant that data extraction became significantly more difficult.
We were left with an interesting dilemma. On the one hand, the internet became a treasure trove of incredibly useful data. On the other hand, getting to that data became unreasonably difficult. It was made even more complicated by the ever-increasing pace at which websites change.
Dedicated scraping
As a result, scraping had to become highly specialised and dedicated. Scrapers and parsers had to be written for specific websites. A lot of homebrew projects still go through the same process.
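As a rough illustration, a homebrew, site-specific scraper often looks something like the sketch below. The URL and CSS selectors are hypothetical, and requests with BeautifulSoup are just one common choice of libraries – the point is that the parsing logic is welded to one particular page layout.

```python
# Hypothetical site-specific scraper: the URL and selectors below only
# make sense for one imagined page layout.
import requests
from bs4 import BeautifulSoup

def scrape_product_page(url):
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # These selectors are tied to one site's markup; as soon as the site
    # changes its HTML, they stop matching and the parser breaks.
    return {
        "title": soup.select_one("h1.product-title").get_text(strip=True),
        "price": soup.select_one("span.price").get_text(strip=True),
    }

# scrape_product_page("https://shop.example.com/item/123")  # hypothetical URL
```

Every markup change on the target site means going back and rewriting those selectors, which is why such projects never really stop being maintained.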
Funnily enough, many industry-level scrapers haven’t moved that much further. Some dedicated scrapers can take care of specific types of pages. For example, at Oxylabs we have SERP Scraper API, E-Commerce Scraper API, and Web Scraper API – dedicated scrapers for search engines, e-commerce pages, and generic websites respectively.
These splits are required due to the nature of the pages. Product pages, by their end-goal, differ greatly from search engine pages, which makes their structure different by necessity. Theoretically, as the differences between page structures grow, so does the complexity of an all-in-one scraper and parser. Since so many types and variations of pages exist, the complexity of an all-in-one scraper and parser that never breaks would be near infinite.
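One way to picture this is a separate parser per page type, with a small dispatcher choosing between them. The sketch below is purely hypothetical – the selectors and page-type labels are made up – but it shows how every additional page type adds another parser to write and maintain.

```python
from bs4 import BeautifulSoup

def parse_search_page(html):
    # Hypothetical selector for result titles on a search results page.
    soup = BeautifulSoup(html, "html.parser")
    return [a.get_text(strip=True) for a in soup.select("a.result-title")]

def parse_product_page(html):
    # Hypothetical selectors for an e-commerce product page.
    soup = BeautifulSoup(html, "html.parser")
    return {
        "title": soup.select_one("h1").get_text(strip=True),
        "price": soup.select_one(".price").get_text(strip=True),
    }

# Every new page type (and every variation of an existing one) means
# another entry here – the 'all-in-one' scraper grows without bound.
PARSERS = {
    "search": parse_search_page,
    "ecommerce": parse_product_page,
}

def parse(page_type, html):
    return PARSERS[page_type](html)
```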
In practice, that means dedicated scrapers and parsers are and will be required for the foreseeable future. There is some hope that AI and machine learning based solutions might make the process easier. Our own tests have shown some promising results for ML-based parsing.
Scraping is (now) forever
Some may say that there is a growing global demand for data. I think it would be slightly misleading to put it that way. The demand for data has always existed and always will. There’s nothing more valuable for any activity, business or otherwise, than being able to understand the environment.
Sentiments about ‘growing demand for data’ are not unlike looking into a warped mirror. What they reflect exists (and is true), but not in its entirety. Data has always been the foundation of business, research, and government. Even relatively simple businesses use ledgers, write invoices, and manage inventory.
As such, data has always had its place. What changed with the appearance of the internet and the evolution of digital businesses is the breaking away from the restrictions of geographical space (and, in some sense, time). Businesses no longer have to be tied to a physical location.
Businesses were, in some sense, liberated and granted better access to other markets. On the other hand, more sources of data became relevant, because the field of competition and the pool of available resources grew as well. As such, digitalisation accelerated the demand for data.
Previously, there was no reason to compete with a business on the other end of the world. Any data about them might have been interesting at best, useless at worst. Now, such data is interesting at worst and vital at best.
Web scraping is the way to meet that demand. There’s no reason to believe the demand will decelerate, either. Digitalisation, the opening up of new markets, and the importance of having even more data all go hand in hand. Thus, web scraping, barring extreme regulatory oversight or a global apocalypse, is now forever.