From AI as a Service by Peter Elger and Eóin Shanaghy
This article discusses gathering data for real-world AI projects and platforms.
Gathering Data from the Web
This article looks in detail at gathering data from websites. Whilst some data may be available in pre-packaged, structured formats, accessible as either flat files or through an API, this isn’t the case with web pages.
Web pages are an unstructured source of information such as product data, news articles and financial data. Finding the right web pages, retrieving them and extracting relevant information is non-trivial. The processes required to do this are known as web crawling and web scraping.
- Web Crawling is the process of fetching web content and navigating to linked pages according to a specific strategy.
- Web Scraping follows the crawling process to extract specific data from content which has been fetched.
Figure 1 shows how the two processes combine to produce meaningful, structured data.
Figure 1. Webpage crawling and scraping process overview. In this article we’re concerned with the crawler part of this picture and the pages it produces as output.
Let’s imagine that we are conference organizers who want to scrape some data for our conference web page. Our first step in creating a solution for this scenario is to build a serverless web crawling system.
Introduction to Web Crawling
The crawler for our scenario is a generic crawler. Generic crawlers can crawl any site with an unknown structure. Site-specific crawlers are usually created for large sites, with specific selectors for finding links and content. An example of a site-specific crawler could be one written to crawl particular products from amazon.com, or auctions from ebay.com.
Examples of well-known crawlers include:
- Search Engines such as Google, Bing, Yandex or Baidu
- GDELT Project, an open database of human society and global events
- OpenCorporates, the largest open database of companies in the world
- Internet Archive, a digital library of Internet sites and other cultural artifacts in digital form
- CommonCrawl, an open repository of web crawl data
One challenge for web crawling is the sheer number of web pages to visit and analyze. When we’re performing the crawling task, we may need arbitrarily large compute resources. Once the crawling process is complete, our compute resource requirement drops. This sort of scalable, burst-prone computing requirement is an ideal fit for on-demand cloud computing and serverless.
Typical Web Crawler Process
To understand how a web crawler might work, consider how a web browser allows a user to navigate a webpage manually.
- The user enters a webpage URL into a web browser.
- The browser fetches the page’s first HTML file.
- Links are rendered. When the user clicks on a link, the process is repeated for a new URL.
Listing 1 shows the HTML source for a simple example webpage.
Listing 1. Example Webpage HTML Source
❶ External link
❷ Absolute internal link
❸ Relative internal link
❹ Image resource
❺ Paragraph text
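To make the annotated link types concrete, here is a minimal sketch in Python that collects links from a page and labels each one as internal or external. The page URL, the HTML, and the names `LinkCollector` and `classify_links` are made up for illustration; the HTML only mirrors the five annotated element types and is not the original listing.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical page (an assumption, not the original listing).
PAGE_URL = "https://conference.example/index.html"
PAGE_HTML = """
<html>
  <body>
    <a href="https://other.example/about">External link</a>
    <a href="https://conference.example/speakers">Absolute internal link</a>
    <a href="schedule.html">Relative internal link</a>
    <img src="/images/logo.png" alt="Logo">
    <p>Paragraph text that a scraper might extract later.</p>
  </body>
</html>
"""


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags and src values from <img> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            self.links.append(attributes["href"])
        elif tag == "img" and attributes.get("src"):
            self.images.append(attributes["src"])


def classify_links(page_url, html):
    """Resolve each link against page_url and label it internal or external."""
    collector = LinkCollector()
    collector.feed(html)
    page_host = urlparse(page_url).netloc
    classified = []
    for href in collector.links:
        absolute = urljoin(page_url, href)  # turns relative links into absolute URLs
        kind = "internal" if urlparse(absolute).netloc == page_host else "external"
        classified.append((absolute, kind))
    return classified


if __name__ == "__main__":
    for url, kind in classify_links(PAGE_URL, PAGE_HTML):
        print(kind, url)
```

Resolving relative links against the page URL before comparing hosts means absolute and relative internal links are treated the same way.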
We’ve shown the structure of a basic page. In reality, a single HTML page can contain hundreds of hyperlinks, both internal and external. The set of pages required to be crawled for a given application is known as the crawl space. Let’s talk about the architecture of a typical web crawler and how it’s structured to deal with various sizes of crawl space.
Web Crawler Architecture
A typical web crawler architecture is illustrated in Figure 2. Let’s get an understanding of each component of the architecture and how it relates to our conference website scenario before describing how this might be realized with a serverless approach.
Figure 2. Components of a Web Crawler. The distinct responsibilities for each component can guide us in our software architecture.
- The Frontier maintains a database of URLs to be crawled. It is initially populated with the conference websites. URLs of individual pages on each site are then added as they are discovered.
- The Fetcher takes a URL and retrieves the corresponding document.
- The Parser takes the fetched document, parses it and extracts required information from it. We won’t look for specific speaker details or anything conference specific at this point.
- The Strategy Worker or Generator is one of the most crucial components of a web crawler, because it determines the crawl space. URLs generated by the Strategy Worker are fed back into the Frontier. The Strategy Worker decides:
  - which links should be followed
  - the priority of links to be crawled
  - the crawl depth
  - when to revisit/re-crawl pages if required
- The Item Store is where the extracted documents or data are stored.
- The Scheduler takes a set of URLs, initially the seed URLs, and schedules the Fetcher to download resources. The scheduler is responsible for ensuring that the crawler behaves politely towards web servers, that no duplicate URLs are fetched, and that URLs are normalized.
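The normalization and duplicate-suppression duties described above can be sketched with a small in-memory Frontier. This is an illustration under stated assumptions, not the book's implementation: the names `normalize_url` and `Frontier` are hypothetical, and the normalization rules (lower-case scheme and host, drop the fragment, default an empty path to `/`) are one simple choice among many.

```python
from collections import deque
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lower-case scheme and network location, drop the fragment, default path to '/'."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.netloc.lower()
    path = parts.path or "/"
    return urlunsplit((scheme, host, path, parts.query, ""))


class Frontier:
    """In-memory stand-in for the Frontier: a queue of pending URLs plus a seen set."""

    def __init__(self, seed_urls):
        self._pending = deque()
        self._seen = set()
        for url in seed_urls:
            self.add(url)

    def add(self, url):
        """Queue a URL unless its normalized form was already seen.

        Returns True if the URL was queued, False if it was a duplicate.
        """
        normalized = normalize_url(url)
        if normalized in self._seen:
            return False
        self._seen.add(normalized)
        self._pending.append(normalized)
        return True

    def next_url(self):
        """Return the next URL to fetch, or None when the frontier is empty."""
        return self._pending.popleft() if self._pending else None
```

Normalizing before the duplicate check is what lets two spellings of the same address, such as one with a `#fragment` and one without, count as a single URL.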
For our web crawler, we’re dealing with conferences. Because these constitute a minority of all web pages, there’s no need to crawl the entire web for such sites. Instead, we’ll provide the crawler with a “seed” URL.
On the conference sites, we’ll crawl local hyperlinks. We won’t follow hyperlinks to external domains. Our goal is to find the pages which contain the required data such as speaker information and dates. We aren’t interested in crawling the entire conference site, and for this reason we’ll also use a depth limit. The crawl depth is the number of links which have been followed from the seed URL, and the depth limit stops the process from going beyond a specified depth in the link graph.
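The strategy just described (local links only, with a depth limit) can be sketched as a breadth-first traversal. The site map `FAKE_SITE`, the function names, and the choice of breadth-first order are assumptions made for illustration; a dictionary stands in for the Fetcher and Parser so the example runs without network access.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

# Hypothetical site: each URL maps to the hrefs found on that page.
FAKE_SITE = {
    "https://conference.example/": ["/speakers", "/venue", "https://other.example/"],
    "https://conference.example/speakers": ["/speakers/alice", "/"],
    "https://conference.example/venue": [],
    "https://conference.example/speakers/alice": ["/speakers/alice/talks"],
    "https://conference.example/speakers/alice/talks": [],
}


def fake_fetch_links(url):
    """Stand-in for the Fetcher and Parser: returns the hrefs on a page."""
    return FAKE_SITE.get(url, [])


def crawl(seed_url, max_depth, fetch_links=fake_fetch_links):
    """Breadth-first crawl from seed_url, following only same-host links.

    Returns a dict mapping each crawled URL to the depth at which it was reached.
    """
    seed_host = urlparse(seed_url).netloc
    visited = {seed_url: 0}
    queue = deque([(seed_url, 0)])
    while queue:
        url, depth = queue.popleft()
        links = fetch_links(url)  # every queued page is fetched
        if depth >= max_depth:
            continue  # depth limit reached: do not follow links from this page
        for href in links:
            target = urljoin(url, href)
            if urlparse(target).netloc != seed_host:
                continue  # skip hyperlinks to external domains
            if target in visited:
                continue  # skip pages already queued or crawled
            visited[target] = depth + 1
            queue.append((target, depth + 1))
    return visited


if __name__ == "__main__":
    for url, depth in crawl("https://conference.example/", max_depth=2).items():
        print(depth, url)
```

With a depth limit of 2, the page three links away from the seed is never queued, and the external link is dropped regardless of depth.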
Basic Crawlers vs. Rendering Crawlers
Several options exist for rendering web pages when there’s no user or screen available:
- Splash, a browser designed for web scraping applications
- Headless Chrome with the Puppeteer API. This runs the popular Chrome browser and allows us to control it programmatically.
- Headless Firefox with Selenium. This option is a Firefox-based alternative to Puppeteer.
For our solution, we’re going to use headless Chrome. We chose this option as there are readily-available Serverless Framework plugins for use with AWS Lambda.
Serverless Web Crawler Architecture
Let’s take a look at how we map our system to a canonical architecture. Figure 3 provides us with a breakdown of the system’s layers and how services collaborate to deliver the solution.
Figure 3. Serverless Web Crawler System Architecture. The system is composed of custom services implemented using AWS Lambda and AWS Step Functions. SQS and the CloudWatch Events service are used for asynchronous communication. Internal API Gateways are used for synchronous communication. S3 and DynamoDB are used for data storage.
The system architecture shows the layers of the system across all services. Note that, in this system, we have no front-end web application.
- Synchronous tasks in the Frontier and Fetch services are implemented using AWS Lambda. For the first time, we introduce AWS Step Functions to implement the Scheduler. The Scheduler is responsible for orchestrating the Fetcher based on data in the Frontier.
- The Strategy service is asynchronous and reacts to events on the event bus indicating that new URLs have been discovered.
- Synchronous communication between internal services in our system is handled with API Gateway. We have chosen CloudWatch Events and SQS for asynchronous communication.
- Shared parameters are published to Systems Manager Parameter Store. IAM is used to manage privileges between services.
- DynamoDB is used for Frontier URL storage. An S3 bucket is used as our Item Store.
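As a rough illustration of the asynchronous Strategy service, the sketch below shows a Lambda-style handler that reads messages delivered by SQS and selects which discovered URLs should go back to the Frontier. The outer `Records` and `body` fields follow the standard shape of an SQS event passed to Lambda; everything inside the message body (`seedUrl`, `depth`, `discoveredUrls`), the `MAX_DEPTH` value, and the function names are hypothetical and not taken from the book. Writing to the Frontier is left out.

```python
import json
from urllib.parse import urlparse

MAX_DEPTH = 3  # assumed depth limit, for illustration only


def select_urls(message, max_depth=MAX_DEPTH):
    """Keep discovered URLs that stay on the seed's host and within the depth limit."""
    seed_host = urlparse(message["seedUrl"]).netloc
    next_depth = message["depth"] + 1
    if next_depth > max_depth:
        return []
    return [
        {"url": url, "depth": next_depth}
        for url in message["discoveredUrls"]
        if urlparse(url).netloc == seed_host
    ]


def handler(event, context):
    """Lambda entry point for SQS-delivered messages; each record body is a JSON string."""
    selected = []
    for record in event["Records"]:
        selected.extend(select_urls(json.loads(record["body"])))
    # A real service would add these entries to the Frontier at this point.
    return selected
```

Keeping the selection logic in a plain function such as `select_urls` makes it possible to test the strategy without any AWS resources.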
That’s all for now.