Collect data across numerically-paginated webpages

Some sites split content up across multiple pages. When working with paginated websites that use numerical navigation (listing each page as a number), you can use the paginate function from the AgentQL SDK to collect data from all pages.

note

The paginate function only supports web pages that use numerical pagination or provide links/buttons to navigate to the next page. It doesn't support other forms of pagination, like alphabetically paginated web pages.

Overview

This guide shows how to use the paginate function to collect data from all pages of a paginated website and save the data to a JSON file.

Writing the query

For this guide, the goal is to query all post titles from the first three pages of the Hacker News feed.

First, you need to write a query that returns the post titles on each page:

{
    posts[] {
        title
    }
}

Using the paginate function

Next, you can use the paginate function to step through a specified number of pages automatically and retrieve the aggregated data. In this example, the paginate function takes the following arguments:

  • page: An AgentQL Page object of the webpage you want to scrape.
  • query: An AgentQL query, passed as a string, that specifies the data to extract on each page.
  • number_of_pages: Number of pages to paginate over.
pagination_function.py
python
paginated_data = paginate(page, QUERY, 3)

Internally, the paginate function first attempts to find the operable element to navigate to the next page and clicks it, then uses the provided query to extract the data from the page. The function then repeats this process for the specified number of pages.
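
The repeat-per-page idea can be illustrated with a small stand-alone sketch. This is not the SDK's implementation: `paginate_sketch`, `extract_current`, and `click_next` are hypothetical names, the "site" is an in-memory list, and the sketch extracts before navigating so that the first page is included, which may differ from the ordering inside the real paginate function.

```python
from typing import Any, Callable, List


def paginate_sketch(
    extract: Callable[[], Any],
    go_to_next_page: Callable[[], bool],
    number_of_pages: int,
) -> List[Any]:
    """Collect one result per page, for at most number_of_pages pages."""
    results: List[Any] = []
    for page_index in range(number_of_pages):
        # Run the extraction step on whatever page is currently shown.
        results.append(extract())
        # Stop after the last requested page, or when no next page exists.
        is_last_requested = page_index == number_of_pages - 1
        if is_last_requested or not go_to_next_page():
            break
    return results


# Hypothetical stand-in for a paginated site: three pages of post titles.
fake_pages = [["a", "b"], ["c"], ["d", "e"]]
position = {"index": 0}


def extract_current() -> dict:
    """Return the current fake page in a shape resembling a query result."""
    return {"posts": [{"title": t} for t in fake_pages[position["index"]]]}


def click_next() -> bool:
    """Advance to the next fake page; return False if there is none."""
    if position["index"] + 1 >= len(fake_pages):
        return False
    position["index"] += 1
    return True


demo = paginate_sketch(extract_current, click_next, 2)
```

Here `demo` holds one entry per visited page, and the loop stops early if the site runs out of pages before `number_of_pages` is reached.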

Finally, here's the complete script to save the paginated data into a JSON file:

hackernews_pagination.py
python
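import json
import logging

import agentql
# Import path for paginate is assumed here; check it against your installed SDK version.
from agentql.tools.sync_api import paginate
from playwright.sync_api import sync_playwright

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger(__name__)

with sync_playwright() as playwright, playwright.chromium.launch(headless=False) as browser:
    page = agentql.wrap(browser.new_page())
    page.goto("https://news.ycombinator.com/")

    # Query that extracts the title of every post on a page.
    QUERY = """
    {
        posts[] {
            title
        }
    }
    """
    # Collect the query results from the first 3 pages.
    paginated_data = paginate(page, QUERY, 3)

    # Save the aggregated data to a JSON file.
    with open("./hackernews_paginated_data.json", "w") as f:
        json.dump(paginated_data, f, indent=4)
    log.debug("Paginated data has been saved to hackernews_paginated_data.json")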

If you want to take a look at the final version of this example, it's available in AgentQL's GitHub examples repo.
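
Once the data is saved, you may want a flat list of titles rather than per-page results. The following is a minimal sketch that assumes the aggregated data is a list with one entry per page, each shaped like the query (a `posts` list of objects with a `title`). That shape is an assumption, not something this guide confirms, and `collect_titles` is a hypothetical helper, so inspect your own JSON file before relying on it.

```python
import json
from typing import Any, List


def collect_titles(paginated_data: List[Any]) -> List[str]:
    """Flatten per-page results into a single list of post titles.

    Assumes each page result is a dict with a "posts" list (hypothetical shape).
    """
    titles: List[str] = []
    for page_result in paginated_data:
        for post in page_result.get("posts", []):
            title = post.get("title")
            if title:  # Skip posts with a missing or empty title.
                titles.append(title)
    return titles


# Sample data in the assumed shape, round-tripped through JSON as a file would be.
sample = json.loads(json.dumps([
    {"posts": [{"title": "First post"}, {"title": "Second post"}]},
    {"posts": [{"title": "Third post"}, {"title": None}]},
    {"posts": []},
]))

all_titles = collect_titles(sample)
```

To apply this to the saved file, load it with `json.load` and pass the result to `collect_titles`.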