Collect data across numerically-paginated webpages
Some sites split content up across multiple pages. When working with paginated websites that use numerical navigation (listing each page as a number), you can use the paginate function from the AgentQL SDK to collect data from all pages.
Overview
This guide shows how to use the paginate function to collect data from all pages of a paginated website and save the data to a JSON file.
Writing the query
For this guide, the goal is to query all post titles from the first 3 pages of the Hacker News feed.
First, you need to write a query that returns the post titles on each page:
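A minimal sketch of such a query, held in a Python string so it can be passed to the SDK later. The field names posts and title are assumptions chosen for this example; in AgentQL query syntax, the [] suffix asks for a list and the nested braces name the fields to return for each item.

```python
# AgentQL query asking for a list of posts, each with its title.
# The names "posts" and "title" are assumptions chosen for this example.
QUERY = """
{
    posts[] {
        title
    }
}
"""
```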
Using the pagination function
Next, you can use the paginate function to step through a specified number of pages automatically and retrieve the aggregated data.
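A sketch of the call, assuming the synchronous helper can be imported from agentql.tools.sync and that page is an AgentQL page already loaded on the first page of results. The wrapper name collect_post_titles is hypothetical, and the query field names are illustrative.

```python
# Query for the post titles on one page (field names are assumptions).
QUERY = """
{
    posts[] {
        title
    }
}
"""


def collect_post_titles(page, number_of_pages=3):
    """Run QUERY on consecutive pages and return the aggregated data."""
    # Assumed import path for the synchronous paginate helper. It is imported
    # lazily so this module still loads where the SDK isn't installed.
    from agentql.tools.sync import paginate

    # Positional arguments: the page, the query string, the number of pages.
    return paginate(page, QUERY, number_of_pages)
```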
In this example, the paginate function takes the following arguments:
page: An AgentQL Page object of the webpage you want to scrape.
query: An AgentQL query in String format that specifies the data to extract on each page.
number_of_pages: Number of pages to paginate over.
Internally, the paginate function first attempts to find the operable element to navigate to the next page and clicks it, then uses the provided query to extract the data from the page. The function then repeats this process for the specified number of pages.
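The navigate-and-extract loop can be illustrated with a small self-contained sketch. This is not the SDK's implementation: paginate_sketch, FakeSite, and the two callables are hypothetical stand-ins for "run the query" and "find and click the next-page control". The sketch also assumes extraction happens before each navigation so that the starting page is included; the SDK's actual ordering and internals may differ.

```python
from typing import Callable, List


def paginate_sketch(
    extract: Callable[[], dict],
    go_to_next_page: Callable[[], bool],
    number_of_pages: int,
) -> List[dict]:
    """Collect one result per page, moving forward between extractions."""
    results = []
    for page_index in range(number_of_pages):
        # Run the query on whatever page is currently loaded.
        results.append(extract())
        # Don't navigate past the last page that was requested.
        if page_index == number_of_pages - 1:
            break
        # Stop early if no next-page control could be found or clicked.
        if not go_to_next_page():
            break
    return results


class FakeSite:
    """In-memory stand-in for a browser page, used only to exercise the loop."""

    def __init__(self, pages):
        self.pages = pages
        self.current = 0

    def extract(self):
        return {"posts": [{"title": t} for t in self.pages[self.current]]}

    def go_to_next_page(self):
        if self.current + 1 >= len(self.pages):
            return False
        self.current += 1
        return True


# Four fake pages, of which only the first three are requested.
site = FakeSite([["A", "B"], ["C"], ["D", "E"], ["F"]])
data = paginate_sketch(site.extract, site.go_to_next_page, 3)
titles = [post["title"] for page in data for post in page["posts"]]
print(titles)
```

Passing the two steps in as callables keeps the loop independent of any browser, which makes the stopping conditions easy to check in isolation.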
Finally, here's the complete script to save the paginated data into a JSON file:
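A sketch of what such a script could look like, under several assumptions: the SDK is driven through Playwright's synchronous API, agentql.wrap turns a Playwright page into an AgentQL page, paginate lives in agentql.tools.sync, and the feed URL and output filename are illustrative. The helper save_to_json is a hypothetical name introduced here.

```python
import json

# Query for the post titles on one page (field names are assumptions).
QUERY = """
{
    posts[] {
        title
    }
}
"""


def save_to_json(data, path):
    """Write the aggregated data to `path` as indented JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4)


def main():
    # Third-party imports are kept local so the helper above can be used
    # (and tested) without the SDK or a browser installed.
    import agentql
    from agentql.tools.sync import paginate  # assumed import path
    from playwright.sync_api import sync_playwright

    with sync_playwright() as playwright:
        browser = playwright.chromium.launch(headless=False)
        # Wrap the Playwright page so it can be used with AgentQL.
        page = agentql.wrap(browser.new_page())
        page.goto("https://news.ycombinator.com/")  # assumed feed URL

        # Collect data from the first 3 pages using the query above.
        paginated_data = paginate(page, QUERY, 3)
        browser.close()

    # Output filename is an arbitrary choice for this example.
    save_to_json(paginated_data, "hackernews_paginated_data.json")
```

Call main() to run the scrape, for instance from under an if __name__ == "__main__": guard at the bottom of the script.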
If you want to take a look at the final version of this example, it's available in AgentQL's GitHub examples repo.