Overview

Our API lets you extract data in the shape you need by supplying a custom schema with each request. This guide explains how to structure a schema for the best extraction results.

Schema Structure

A document schema is a JSON object with a name, a description, and a list of fields. Each field has its own name, description, and type. Here’s a basic example:

{
  "name": "Invoice",
  "description": "Schema for extracting invoice details",
  "fields": [
    {
      "name": "invoice_number",
      "description": "The unique identifier for the invoice",
      "type": "string"
    },
    {
      "name": "issue_date",
      "description": "The date when the invoice was issued",
      "type": "date"
    }
  ]
}
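If you build schemas in code rather than by hand, the example above can be sketched in Python as follows. The make_field helper is a hypothetical convenience, not part of the API, and this guide does not show how the schema is attached to a request, so the sketch stops at producing the JSON text.

```python
import json


def make_field(name, description, field_type):
    """Return one field definition in the shape used by the example above.

    Hypothetical helper for illustration only; not part of the API.
    """
    return {"name": name, "description": description, "type": field_type}


invoice_schema = {
    "name": "Invoice",
    "description": "Schema for extracting invoice details",
    "fields": [
        make_field("invoice_number", "The unique identifier for the invoice", "string"),
        make_field("issue_date", "The date when the invoice was issued", "date"),
    ],
}

# Serialize to JSON text; how it is sent with a request is not covered here.
schema_json = json.dumps(invoice_schema, indent=2)
print(schema_json)
```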

Primitive fields

  • string: Generic text data
  • number: Numeric values
  • email: Email addresses
  • phone: Phone numbers
  • date: Date values

Objects and arrays

  • object: A nested group of fields. Each nested field must be a primitive type.
  • array: A list of items, where each item is an object. As with object, each field within an item must be a primitive type.
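The type rules above can be checked before a schema is sent. The following is a minimal sketch, assuming the seven type names listed here are the complete set and that object and array fields may contain only primitive fields, as stated. The function names are hypothetical, and the API may apply different or additional validation.

```python
PRIMITIVE_TYPES = {"string", "number", "email", "phone", "date"}
CONTAINER_TYPES = {"object", "array"}


def validate_field(field, nested=False):
    """Return a list of problems found in one field definition."""
    problems = []
    name = field.get("name", "<unnamed>")
    field_type = field.get("type")

    if field_type in PRIMITIVE_TYPES:
        if "fields" in field:
            problems.append(f"{name}: primitive type '{field_type}' cannot have nested fields")
    elif field_type in CONTAINER_TYPES:
        if nested:
            # Per the rules above, fields inside an object or array are primitives.
            problems.append(f"{name}: '{field_type}' is not allowed inside an object or array")
            return problems
        children = field.get("fields", [])
        if not children:
            problems.append(f"{name}: '{field_type}' needs at least one nested field")
        for child in children:
            problems.extend(validate_field(child, nested=True))
    else:
        problems.append(f"{name}: unknown type {field_type!r}")
    return problems


def validate_schema(schema):
    """Collect problems for every top-level field in a schema."""
    problems = []
    for field in schema.get("fields", []):
        problems.extend(validate_field(field))
    return problems
```

An empty list means the schema follows the rules as written in this section. If your account allows deeper nesting, relax the nested check accordingly.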

Working with Objects

Use objects when you need to group related fields together. Here’s how to structure an object type:

{
  "name": "billing_address",
  "description": "Customer's billing address details",
  "type": "object",
  "fields": [
    {
      "name": "street",
      "description": "Street address",
      "type": "string"
    },
    {
      "name": "city",
      "description": "City name",
      "type": "string"
    },
    {
      "name": "postal_code",
      "description": "Postal/ZIP code",
      "type": "string"
    }
  ]
}

Working with Arrays

Use arrays when you need to extract repeating elements, such as line items in an invoice:

{
  "name": "line_items",
  "description": "Individual items in the invoice",
  "type": "array",
  "fields": [
    {
      "name": "description",
      "description": "Item description",
      "type": "string"
    },
    {
      "name": "quantity",
      "description": "Number of items",
      "type": "number"
    },
    {
      "name": "unit_price",
      "description": "Price per unit",
      "type": "number"
    }
  ]
}

Response Format

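The extraction system returns results in the following format. Each extracted field carries its value, a confidence score, the source text it was found in, and the page number. Object fields wrap their children in a nested "fields" key: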

{
  "schema_name": "Invoice",
  "extracted_data": {
    "invoice_number": {
      "value": "INV-2024-001",
      "confidence": 0.95,
      "source_context": "Invoice #INV-2024-001",
      "page_number": 1
    },
    "billing_address": {
      "fields": {
        "street": {
          "value": "123 Main St",
          "confidence": 0.92,
          "source_context": "Billing Address: 123 Main St",
          "page_number": 1
        },
        "city": {
          "value": "San Francisco",
          "confidence": 0.94,
          "source_context": "San Francisco, CA",
          "page_number": 1
        }
      }
    }
  },
  "processing_metadata": {
    "processing_time": "2.5s",
    "engine_version": "1.0.0"
  }
}
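A common next step is to reduce this response to plain values and drop low-confidence results. The sketch below assumes the response has already been parsed into a Python dict with exactly the shape shown above. The flatten_values function and the dotted key naming are illustrative choices, not part of the API, and array results are not handled because their response shape is not shown in this guide.

```python
def flatten_values(extracted, min_confidence=0.0, prefix=""):
    """Map field paths to values, skipping results below min_confidence."""
    flat = {}
    for name, entry in extracted.items():
        path = f"{prefix}{name}"
        if "fields" in entry:
            # Object results wrap their children in a "fields" key.
            flat.update(flatten_values(entry["fields"], min_confidence, path + "."))
        elif entry.get("confidence", 0.0) >= min_confidence:
            flat[path] = entry.get("value")
    return flat


# Trimmed copy of the sample response above.
response = {
    "schema_name": "Invoice",
    "extracted_data": {
        "invoice_number": {"value": "INV-2024-001", "confidence": 0.95},
        "billing_address": {
            "fields": {
                "street": {"value": "123 Main St", "confidence": 0.92},
                "city": {"value": "San Francisco", "confidence": 0.94},
            }
        },
    },
}

values = flatten_values(response["extracted_data"], min_confidence=0.93)
print(values)
# {'invoice_number': 'INV-2024-001', 'billing_address.city': 'San Francisco'}
```

The 0.93 threshold is arbitrary and only chosen so that one sample field is filtered out; pick a cutoff that suits your own tolerance for errors.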

Best Practices

Field Names

  • Use clear, descriptive names
  • Use snake_case for consistency
  • Avoid special characters

Descriptions

  • Provide detailed descriptions
  • Include format examples
  • Specify any expected patterns

Nested Structures

  • Keep nesting depth reasonable (max 3-4 levels)
  • Use objects for logical grouping
  • Use arrays for repeated structures
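The naming, description, and nesting guidelines above lend themselves to an automated check. Here is a minimal sketch of such a linter. The lint_fields function is hypothetical, the snake_case pattern is one reasonable reading of the guideline, and counting top-level fields as depth 1 is an assumption, since this guide does not say how levels are counted.

```python
import re

# Lowercase words separated by single underscores, e.g. "invoice_number".
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def lint_fields(fields, depth=1, max_depth=4, path=""):
    """Return warnings for names, descriptions, and nesting depth."""
    warnings = []
    for field in fields:
        name = field.get("name", "")
        full = f"{path}{name}"
        if not SNAKE_CASE.match(name):
            warnings.append(f"{full}: name is not snake_case")
        if not field.get("description", "").strip():
            warnings.append(f"{full}: description is missing")
        children = field.get("fields")
        if children:
            if depth + 1 > max_depth:
                warnings.append(f"{full}: nesting deeper than {max_depth} levels")
            else:
                warnings.extend(lint_fields(children, depth + 1, max_depth, full + "."))
    return warnings
```

Calling lint_fields(schema["fields"]) on a parsed schema returns an empty list when every field passes these checks.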

Schema Examples

Invoice Schema

Receipt Schema

{
  "name": "Receipt",
  "description": "Schema for extracting receipt details",
  "fields": [
    {
      "name": "merchant",
      "description": "Merchant information",
      "type": "object",
      "fields": [
        {
          "name": "name",
          "description": "Merchant name",
          "type": "string"
        },
        {
          "name": "phone",
          "description": "Merchant phone number",
          "type": "phone"
        }
      ]
    },
    {
      "name": "transaction_date",
      "description": "Date of purchase",
      "type": "date"
    },
    {
      "name": "items",
      "description": "Purchased items",
      "type": "array",
      "fields": [
        {
          "name": "name",
          "description": "Item name",
          "type": "string"
        },
        {
          "name": "price",
          "description": "Item price",
          "type": "number"
        }
      ]
    },
    {
      "name": "total",
      "description": "Total amount",
      "type": "number"
    }
  ]
}