Handling Infinite Scroll

Modern websites often have content that's dynamically loaded as you scroll down the page.

Overview

This guide shows you how to load this type of content on most pages and covers some of the challenges involved.

Get infinite scroll page in ready state

First, start with a script that loads the page and waits for it to reach a ready state.

infinite_scroll.py
python
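import agentql
from playwright.sync_api import sync_playwright

# Launch a visible Chromium window and wrap the page so AgentQL methods are available.
with sync_playwright() as playwright, playwright.chromium.launch(headless=False) as browser:
    page = agentql.wrap(browser.new_page())
    page.goto("https://infinite-scroll.com/demo/full-page/")
    page.wait_for_page_ready_state()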

If you run a data query against this page directly:

infinite_scroll.py
python
QUERY = """
{
    page_title
    post_headers[]
}
"""

response = page.query_data(QUERY)

The query then returns the following:

{
  "page_title": "Full page demo",
  "post_headers": ["1a - Infinite Scroll full page demo", "1b - RGB Schemes logo in Computer Arts"]
}

This indicates your browser has only loaded the first page of this site. You'll need to leverage the Playwright SDK's ability to send input events to the browser page to load more content on this page.

Trigger content load on the page

There are a few options for scrolling down the page, but the simplest one is to:

  1. Use a key press input for End, which takes you to the bottom of the page that's currently loaded.
  2. Give the content time to load by leveraging wait_for_page_ready_state().

infinite_scroll.py
python
page.keyboard.press("End")
page.wait_for_page_ready_state()

If you run the same query again, the response now includes more posts:

{
  "page_title": "Full page demo",
  "post_headers": [
    "1a - Infinite Scroll full page demo",
    "1b - RGB Schemes logo in Computer Arts",
    "2a - RGB Schemes logo",
    "2b - Masonry gets horizontalOrder",
    "2c - Every vector 2016"
  ]
}

You've successfully loaded one additional "page" of content on this site, but what if you need to load additional "pages" of content?

Load multiple pages of content with looping

To load multiple pages of content, put the same scroll-and-wait steps inside a loop.

The following example loads three additional pages of content:

infinite_scroll.py
python
num_extra_pages_to_load = 3

for _ in range(num_extra_pages_to_load):
    page.keyboard.press("End")
    page.wait_for_page_ready_state()

If you look at the response, you'll see it's much more comprehensive than before.

{
  "page_title": "Infinite Scroll · Full page demo",
  "post_headers": [
    "1a - Infinite Scroll full page demo",
    "1b - RGB Schemes logo in Computer Arts",
    "2a - RGB Schemes logo",
    "2b - Masonry gets horizontalOrder",
    "2c - Every vector 2016",
    "3a - Logo Pizza delivered",
    "3b - Some CodePens",
    "3c - 365daysofmusic.com",
    "3d - Holograms",
    "4a - Huebee: 1-click color picker",
    "4b - Word is Flickity is good"
  ]
}

Putting it all together

If you want to take a look at the final version of this example, it's available in AgentQL's GitHub examples repo.
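As a rough recap, the steps from this guide could be combined into a single helper like the one below. This is only a sketch and may differ from the version in the examples repo: `scroll_and_query` is a hypothetical name, and `page` is assumed to be an AgentQL-wrapped Playwright page that exposes the `keyboard.press`, `wait_for_page_ready_state`, and `query_data` calls used earlier.

```python
from typing import Any

QUERY = """
{
    page_title
    post_headers[]
}
"""


def scroll_and_query(page: Any, extra_pages: int = 3, query: str = QUERY) -> dict:
    """Scroll to the bottom `extra_pages` times, then run the data query.

    Hypothetical helper: `page` is assumed to be an AgentQL-wrapped Playwright page.
    """
    for _ in range(extra_pages):
        # Jump to the bottom of the currently loaded content to trigger a load.
        page.keyboard.press("End")
        # Give the newly requested content time to arrive.
        page.wait_for_page_ready_state()
    return page.query_data(query)
```

You would call it after the page has reached its ready state, for example `response = scroll_and_query(page, extra_pages=3)`.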

Conclusion

Pagination on the web can be tricky because websites implement it in different ways. While the End key press works on many sites, others may require a combination of Playwright mouse move and mouse wheel events to emulate hovering over a specific scrolling container and scrolling it.

Here is a basic example of using the mouse wheel to scroll down the page:

infinite_scroll.py
python
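import random
import time

from playwright.sync_api import Page

def mouse_wheel_scroll(page: Page):
    # Read the viewport height, the total document height, and the current scroll offset.
    viewport_height, total_height, scroll_position = page.evaluate(
        "() => [window.innerHeight, document.body.scrollHeight, window.scrollY]"
    )
    # Scroll one viewport at a time until the bottom of the initially measured page.
    while scroll_position < total_height:
        scroll_position += viewport_height
        page.mouse.wheel(delta_x=0, delta_y=viewport_height)
        time.sleep(random.uniform(0.05, 0.1))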
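When the scrollable content lives in a nested container rather than the page itself, the hovering step mentioned above might look roughly like the following sketch. `scroll_inside` is a hypothetical helper, the selector passed to it is an assumption about the target site, and it assumes the selector matches exactly one element. It relies only on Playwright's `locator(...).bounding_box()`, `mouse.move`, and `mouse.wheel` calls.

```python
from typing import Any


def scroll_inside(page: Any, selector: str, delta_y: float = 600) -> bool:
    """Hover over the element matching `selector`, then send a wheel event.

    Returns False if the element has no bounding box (for example, it is hidden).
    """
    box = page.locator(selector).bounding_box()
    if box is None:
        return False
    # Move the pointer to the element's center so the wheel event targets it.
    center_x = box["x"] + box["width"] / 2
    center_y = box["y"] + box["height"] / 2
    page.mouse.move(center_x, center_y)
    page.mouse.wheel(0, delta_y)
    return True
```

For example, `scroll_inside(page, "#feed")` would scroll a container with the (hypothetical) id `feed` by 600 pixels.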

In addition, it's tricky to detect if all of the content has loaded, or if it's even possible to load "all" of the content. On some pages, you can look for loading indicators and placeholders such as "Scroll to load more" to detect whether more content is available.
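One heuristic for deciding when to stop, sketched below, is to compare the document height before and after each scroll and give up once it stops growing or a cap is reached. This is an assumption-laden sketch, not a reliable end-of-content detector: `scroll_until_stable` and `max_scrolls` are hypothetical names, `page` is assumed to be an AgentQL-wrapped Playwright page, and a slow network could make the height appear stable before new content arrives.

```python
from typing import Any


def scroll_until_stable(page: Any, max_scrolls: int = 10) -> int:
    """Press End until the document height stops growing or the cap is hit.

    Returns the number of scrolls after which the page grew taller.
    """
    loaded = 0
    last_height = page.evaluate("() => document.body.scrollHeight")
    for _ in range(max_scrolls):
        page.keyboard.press("End")
        page.wait_for_page_ready_state()
        new_height = page.evaluate("() => document.body.scrollHeight")
        if new_height <= last_height:
            # No growth: assume (heuristically) that nothing more was loaded.
            break
        last_height = new_height
        loaded += 1
    return loaded
```

The cap keeps the loop bounded on pages that never run out of content.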

Be mindful when working with infinite scroll pages so that you craft the right level of automation for your desired outcome.

tip

As the amount of content on a page grows, AgentQL queries can slow down significantly, so it's generally a good idea to cap the number of additional pages to load. The right number depends on the website and the data you're looking for.