Handling Infinite Scroll
Modern websites often have content that's dynamically loaded as you scroll down the page.
Overview
This guide shows you how to load this type of content on most pages and covers some of the challenges involved.
Get infinite scroll page in ready state
First, start with a script that loads the page and waits for it to be in a ready state.
import agentql
from playwright.sync_api import sync_playwright

with sync_playwright() as playwright, playwright.chromium.launch(headless=False) as browser:
    page = agentql.wrap(browser.new_page())
    page.goto("https://infinite-scroll.com/demo/full-page/")
    page.wait_for_page_ready_state()
If you run a data query against this page directly:
QUERY = """
{
    page_title
    post_headers[]
}
"""
response = page.query_data(QUERY)
The query then returns the following:
{
  "page_title": "Full page demo",
  "post_headers": [
    "1a - Infinite Scroll full page demo",
    "1b - RGB Schemes logo in Computer Arts"
  ]
}
This indicates your browser has only loaded the first page of this site. You'll need to leverage the Playwright SDK's ability to send input events to the browser page to load more content on this page.
Trigger content load on the page
There are a few options for scrolling down the page, but the simplest one is to:
- Use a key press input for End, which takes you to the bottom of the page that's currently loaded.
- Give the content time to load by leveraging wait_for_page_ready_state().
page.keyboard.press("End")
page.wait_for_page_ready_state()
If you run the same query, you'll receive a different response.
{
  "page_title": "Full page demo",
  "post_headers": [
    "1a - Infinite Scroll full page demo",
    "1b - RGB Schemes logo in Computer Arts",
    "2a - RGB Schemes logo",
    "2b - Masonry gets horizontalOrder",
    "2c - Every vector 2016"
  ]
}
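One way to confirm that the key press actually loaded new content is to compare the post_headers lists from the two responses. The sketch below assumes both responses are plain dicts shaped like the output above; newly_loaded is a hypothetical helper name, not part of AgentQL.

```python
def newly_loaded(before, after):
    """Return the items in `after` that were not in `before`, keeping their order."""
    seen = set(before)
    return [item for item in after if item not in seen]


# Sample data copied from the two responses shown above.
first_response = {
    "page_title": "Full page demo",
    "post_headers": [
        "1a - Infinite Scroll full page demo",
        "1b - RGB Schemes logo in Computer Arts",
    ],
}
second_response = {
    "page_title": "Full page demo",
    "post_headers": [
        "1a - Infinite Scroll full page demo",
        "1b - RGB Schemes logo in Computer Arts",
        "2a - RGB Schemes logo",
        "2b - Masonry gets horizontalOrder",
        "2c - Every vector 2016",
    ],
}

added = newly_loaded(first_response["post_headers"], second_response["post_headers"])
print(added)
```

An empty result would suggest that the scroll did not trigger a load, or that the content had not finished loading when the query ran.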
You've successfully loaded one additional "page" of content on this site, but what if you need to load additional "pages" of content?
Load multiple pages of content with looping
To load multiple pages of content, you can repeat the scroll-and-wait step inside a loop.
The following example shows how you can load three additional pages of content:
num_extra_pages_to_load = 3
for _ in range(num_extra_pages_to_load):
    page.keyboard.press("End")
    page.wait_for_page_ready_state()
If you look at the response, you'll see it's much more comprehensive than before.
{
  "page_title": "Infinite Scroll · Full page demo",
  "post_headers": [
    "1a - Infinite Scroll full page demo",
    "1b - RGB Schemes logo in Computer Arts",
    "2a - RGB Schemes logo",
    "2b - Masonry gets horizontalOrder",
    "2c - Every vector 2016",
    "3a - Logo Pizza delivered",
    "3b - Some CodePens",
    "3c - 365daysofmusic.com",
    "3d - Holograms",
    "4a - Huebee: 1-click color picker",
    "4b - Word is Flickity is good"
  ]
}
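The fixed loop above always presses End the same number of times, even if the site runs out of content sooner. A minimal sketch of a variant that stops early follows. It assumes that page is the AgentQL-wrapped page from earlier and that newly loaded content makes document.body.scrollHeight grow, which may not hold on every site. load_more_pages is a hypothetical helper name.

```python
def load_more_pages(page, max_extra_pages=3):
    """Press End up to `max_extra_pages` times; stop early if the page stops growing.

    Returns the number of key presses that made the document taller.
    """
    loaded = 0
    last_height = page.evaluate("() => document.body.scrollHeight")
    for _ in range(max_extra_pages):
        page.keyboard.press("End")
        page.wait_for_page_ready_state()
        new_height = page.evaluate("() => document.body.scrollHeight")
        if new_height <= last_height:
            # Assumption: no growth means no more content was loaded.
            break
        last_height = new_height
        loaded += 1
    return loaded
```

Calling load_more_pages(page, max_extra_pages=3) before page.query_data(QUERY) keeps the same upper bound as the loop above while avoiding extra key presses once the page has stopped growing.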
Putting it all together
If you want to take a look at the final version of this example, it's available in AgentQL's GitHub examples repo.
Conclusion
Pagination on the web can be tricky because websites implement it in different ways. As a result, while the End key press works on many sites, other sites may require a combination of Playwright mouse move and mouse wheel events to emulate hovering over a particular scrolling container and scrolling it.
Here is a basic example of using the mouse wheel to scroll down the page:
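import random
import time
from playwright.sync_api import Page


def mouse_wheel_scroll(page: Page):
    # Read the viewport height, the total document height, and the current scroll offset.
    viewport_height, total_height, scroll_height = page.evaluate(
        "() => [window.innerHeight, document.body.scrollHeight, window.scrollY]"
    )
    # Scroll one viewport at a time until the offset reaches the height measured above.
    while scroll_height < total_height:
        scroll_height = scroll_height + viewport_height
        page.mouse.wheel(delta_x=0, delta_y=viewport_height)
        time.sleep(random.uniform(0.05, 0.1))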
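When the scrollable area is a container inside the page instead of the page itself, the mouse usually has to hover over that container before wheel events reach it. The sketch below illustrates one way to do this with Playwright's locator, mouse move, and mouse wheel calls. The function name, the selector argument, and the default step values are assumptions for illustration, and the selector is expected to match exactly one element.

```python
def wheel_scroll_container(page, selector, steps=5, delta_y=600):
    """Hover over the element matching `selector` and send `steps` wheel events.

    Returns False if the element has no bounding box (for example, it is not visible).
    """
    box = page.locator(selector).bounding_box()
    if box is None:
        return False
    # Move the mouse to the center of the container so wheel events target it.
    center_x = box["x"] + box["width"] / 2
    center_y = box["y"] + box["height"] / 2
    page.mouse.move(center_x, center_y)
    for _ in range(steps):
        page.mouse.wheel(0, delta_y)
    return True
```

As with the End key press, giving the page time to load between wheel events (for example with wait_for_page_ready_state()) is likely to be needed on real sites.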
In addition, it's tricky to detect if all of the content has loaded, or if it's even possible to load "all" of the content. On some pages, you can look for loading indicators and placeholders such as "Scroll to load more" to detect whether more content is available.
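The indicator check described above can be sketched as follows. This is an illustration under assumptions, not a general solution: the indicator text, whether it is rendered as visible text at all, and both helper names are hypothetical and vary from site to site. It uses Playwright's get_by_text locator and the End key press from earlier in this guide.

```python
def more_content_available(page, indicator_text="Scroll to load more"):
    """Return True if an element containing `indicator_text` is currently visible."""
    # `.first` avoids an error when several elements contain the same text.
    return page.get_by_text(indicator_text).first.is_visible()


def scroll_while_indicator(page, indicator_text="Scroll to load more", max_pages=10):
    """Press End while the indicator is visible, up to `max_pages` times."""
    pages_loaded = 0
    while pages_loaded < max_pages and more_content_available(page, indicator_text):
        page.keyboard.press("End")
        page.wait_for_page_ready_state()
        pages_loaded += 1
    return pages_loaded
```

The max_pages cap keeps the loop bounded on sites where the indicator never disappears.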
As a result, be mindful when working with infinite scroll pages so that you craft the right level of automation for the desired outcome.
As the amount of content on a page grows, AgentQL queries can slow down significantly, so it's generally a good idea to cap the number of additional pages to load. The right number depends on the website and the data that you're looking for.