Extracting data from PDFs and image files with AgentQL

AgentQL supports extracting data from PDFs and image files.

Overview

This guide shows you how to use the REST API to extract data from documents like PDFs and images, customize parameters for enhanced scraping capabilities, and retrieve structured data in JSON format with AgentQL queries. You can also test it out in the Playground.

Defining the REST API request structure

The following fields outline the high-level structure of a data scraping request:

  • file: The file to extract data from.
  • query: An AgentQL query that defines the data to extract and the format for the retrieved output.
  • params: (Optional) Additional settings for enhanced data retrieval, such as fast or standard mode. See the API Reference for more details about params.
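As a rough illustration of how these three fields fit together, the sketch below groups them in a small Python container. The `DocumentQuery` class and its `text_fields` method are hypothetical names introduced here, not part of AgentQL, and encoding `params` as a JSON string is an assumption about how it travels in the request.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class DocumentQuery:
    """Hypothetical container for the three request fields listed above."""

    file_path: str                 # file: the document to extract data from
    query: str                     # query: the AgentQL query
    params: Optional[dict] = None  # params: optional settings such as the mode

    def text_fields(self) -> dict:
        """Return the non-file fields as strings, ready to send as form fields."""
        fields = {"query": self.query}
        if self.params is not None:
            # Assumption: params is sent as a JSON-encoded string.
            fields["params"] = json.dumps(self.params)
        return fields


request = DocumentQuery(
    file_path="/path/to/file.pdf",
    query="{ products[] { product_name product_price(integer) } }",
    params={"mode": "fast"},
)
print(request.text_fields())
```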

Constructing the API request

To perform a basic data scraping request, start by specifying the file to extract data from and the query that defines the data you want to retrieve in the request body.

  1. Example REST API Request

Define parameters for the request body.

Request body parameters:

  • query: AgentQL query to extract data from the file
  • file: Path of the file to extract data from
  • params: (Optional) Additional settings for enhanced data retrieval, such as fast mode. See the API Reference (/rest-api/api-reference#request-body) for more details about params.

  2. Setting Request Headers

Before making the API request, include the necessary headers for authentication and content type. These headers authorize the request and specify the data format.

  • X-API-Key: this header should have your AgentQL API key for authentication.

  • Content-Type: set it to multipart/form-data, because the request uploads a file alongside the query as form fields. This lets the server separate the file from the other fields and interpret each one correctly.
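A minimal Python sketch of the authentication header, assuming the API key is kept in an environment variable named AGENTQL_API_KEY. The `build_headers` helper is a hypothetical name used only for illustration. Content-Type is left out on purpose: when a file is attached as a multipart upload, most HTTP libraries generate that header themselves, including the boundary it needs.

```python
import os


def build_headers(api_key=None):
    """Build the authentication header for an AgentQL REST API request.

    Falls back to the AGENTQL_API_KEY environment variable (an assumed
    convention) when no key is passed explicitly.
    """
    key = api_key or os.environ.get("AGENTQL_API_KEY")
    if not key:
        raise ValueError("Set AGENTQL_API_KEY or pass api_key explicitly")
    return {"X-API-Key": key}


# Placeholder key for illustration only.
headers = build_headers(api_key="example-key")
print(headers)
```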

  3. Making the API Request

Using your preferred HTTP client (like curl, Postman, or an HTTP library in Python or your preferred language), you can make a POST request to the AgentQL REST API endpoint.

terminal
# Assumption: params is sent as a JSON-encoded form field; see the API Reference for the exact layout.
curl -X POST https://api.agentql.com/v1/query-document \
  -H "Content-Type: multipart/form-data" \
  -H "X-API-Key: $AGENTQL_API_KEY" \
  -F "query={ products[] { product_name product_price(integer) } }" \
  -F "file=@/path/to/file.pdf" \
  -F 'params={"mode": "fast"}'
Note: Make sure to replace $AGENTQL_API_KEY with your actual API key.
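For readers who prefer Python over curl, the same request can be sketched with the standard library alone. This is an illustration under assumptions, not an official client: the `encode_multipart` and `build_request` helpers are hypothetical, the form field names (query, file, params) follow the curl example, and sending params as a JSON-encoded form field is an assumption. The request is only constructed here; the network call is left commented out.

```python
import json
import uuid
import urllib.request

ENDPOINT = "https://api.agentql.com/v1/query-document"  # from the curl example


def encode_multipart(fields, filename, file_bytes, file_type="application/pdf"):
    """Encode text fields plus one file as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            (
                f"--{boundary}\r\n"
                f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
                f"{value}\r\n"
            ).encode("utf-8")
        )
    parts.append(
        (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
            f"Content-Type: {file_type}\r\n\r\n"
        ).encode("utf-8")
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_request(api_key, filename, file_bytes, query, params=None):
    """Build (but do not send) a POST request for the document endpoint."""
    fields = {"query": query}
    if params is not None:
        # Assumption: params travels as a JSON-encoded form field.
        fields["params"] = json.dumps(params)
    body, content_type = encode_multipart(fields, filename, file_bytes)
    return urllib.request.Request(
        ENDPOINT,
        data=body,
        method="POST",
        headers={"X-API-Key": api_key, "Content-Type": content_type},
    )


req = build_request(
    "example-key",                 # placeholder, not a real key
    "file.pdf",
    b"%PDF-1.4 placeholder bytes",  # stand-in for the real file contents
    "{ products[] { product_name product_price(integer) } }",
    {"mode": "fast"},
)

# To actually send the request (requires network access and a valid key):
# with urllib.request.urlopen(req) as resp:
#     result = json.load(resp)
```

In practice you would read the file contents with `open(path, "rb").read()` and pass your real API key instead of the placeholders.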

  4. Reviewing the API Response

If the request is successful, the API returns a JSON response with the extracted data.

Example Response

response
json
{
  "data": {
    "products": [
      {
        "product_name": "Qwilfish",
        "product_price": 77
      },
      {
        "product_name": "Huntail",
        "product_price": 52
      },
      ...
    ]
  },
  "metadata": {
    "request_id": "ecab9d2c-0212-4b70-a5bc-0c821fb30ae3"
  }
}

You can read more about the response structure and metadata fields in the API Reference.
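As a small illustration of working with this response, the Python sketch below parses a copy of the example above (trimmed to the two products shown) and pulls out the product rows and the request ID. The `summarize` helper is a hypothetical name, and the keys it reads (`products`, `product_name`, `product_price`) simply mirror the example query, so they will differ for other queries.

```python
import json

# Copy of the example response, limited to the two products shown above.
response_text = """
{
  "data": {
    "products": [
      {"product_name": "Qwilfish", "product_price": 77},
      {"product_name": "Huntail", "product_price": 52}
    ]
  },
  "metadata": {"request_id": "ecab9d2c-0212-4b70-a5bc-0c821fb30ae3"}
}
"""


def summarize(payload):
    """Return (name, price) pairs and the request ID from a parsed response."""
    products = payload.get("data", {}).get("products", [])
    request_id = payload.get("metadata", {}).get("request_id")
    rows = [(p["product_name"], p["product_price"]) for p in products]
    return rows, request_id


rows, request_id = summarize(json.loads(response_text))
for name, price in rows:
    print(f"{name}: {price}")
print("request_id:", request_id)
```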

Test this feature in Playground!

  1. Go to AgentQL's Playground.
  2. Click the "Documented (Experimental)" toggle.
  3. Either click "Choose file" to upload your PDF, JPG, or PNG file or drag and drop your file into the target preview area.
  4. Add an AgentQL query to the query box (or use the "Suggest a Query" button to have AgentQL craft a query for you).
  5. Click the "Fetch Data" button.

Check the results box for your extracted data, and please let the AgentQL team know your feedback. If you would like to access the feature via the SDK, please reach out to join Tiny Fish's Beta Access Program.