Tutorials
26 February, 2025 · 3 min read

How to Extract Data from PDFs Using Documind’s API

Tami

Product

Extracting structured data from PDFs and other document types is a common challenge. Instead of relying on error-prone OCR tools or copying and pasting data by hand, you can use Documind’s API to automate the extraction.

In this guide, you’ll learn how to:

  • Set up authentication to securely access the API

  • Create an extraction job for processing

  • Poll for job completion and retrieve the extracted data

To make this practical, we’ll extract key details from this invoice, including:

  • Invoice number

  • Invoice date

  • Due date

  • Supplier's name

  • Supplier's email

  • Supplier's VAT number

  • Payment method

  • Bank name

  • IBAN

  • SWIFT/BIC code

  • Payment reference

  • Subtotal

  • Tax amount

  • Total amount

  • Items (Name, SKU, Quantity, Unit price, Discount, Total price)

Once the data is extracted, it can be stored in a database for tracking invoices and payments or sent directly to accounting platforms like Xero, QuickBooks, and other financial tools. This guide will take you through the complete process so you can integrate Documind’s API into your own applications.

Step 1: Setting Up Authentication

Before making requests to the Documind API, you'll need to install the necessary dependencies and obtain your API key for authentication.

1.1 Install Required Packages

Ensure you have Node.js installed, then install axios (for making HTTP requests) and dotenv (for securely storing API keys):

bash
npm install axios dotenv

1.2 Get Your API Key

  1. Sign in to the Documind Dashboard

  2. Navigate to the Settings page

  3. Copy your Secret API Key

To keep your key secure, create a .env file in your project directory and store the key there:

Text
DOCUMIND_API_KEY=your-secret-api-key
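
A missing or empty key usually only shows up later as an authentication error, so it can help to fail fast at startup. The following is a minimal sketch, assuming a hypothetical helper named requireEnv (it is not part of dotenv or of Documind):

```javascript
// Hypothetical helper: read a required environment variable or fail fast.
// `env` defaults to process.env; passing it explicitly makes the function easy to test.
function requireEnv(name, env = process.env) {
  const value = env[name];
  if (typeof value !== 'string' || value.trim() === '') {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  // Trim stray whitespace that sometimes sneaks into .env files
  return value.trim();
}

// Usage (once dotenv has loaded the .env file):
// const API_KEY = requireEnv('DOCUMIND_API_KEY');
```

With a check like this, a typo in the variable name stops the script immediately instead of producing a rejected request.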

1.3 Create an Axios Instance

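Now, set up an axios instance with the base URL of the API and authentication headers. The snippets in this guide use ES module syntax (import), so save your script with an .mjs extension or set "type": "module" in your package.json: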

javascript
import 'dotenv/config';
import axios from 'axios';

// Load API key from environment variables
const API_KEY = process.env.DOCUMIND_API_KEY;

// Preconfigured client: every request carries the base URL and auth header
const documindAPI = axios.create({
  baseURL: 'https://api.documind.xyz',
  headers: {
    'Authorization': `Bearer ${API_KEY}`,
    'Content-Type': 'application/json'
  }
});

Step 2: Creating an Extraction Job

To extract data from the invoice, you need to create an extraction job. This involves sending the document's URL to Documind along with a schema that defines the structure of the data you want to extract.

2.1 Defining the Schema

The schema acts as a blueprint, specifying what information should be extracted from the document. It outlines the key fields and their expected data types.

For a detailed breakdown of schema definitions and best practices, check out this guide. You can also find additional schema examples in the documentation.

Here’s how you can define a schema for an invoice:

javascript
const schema = [
  {
    "name": "invoiceNumber",
    "type": "string",
    "description": "Unique identifier for the invoice"
  },
  {
    "name": "invoiceDate",
    "type": "string",
    "description": "Date when the invoice was issued"
  },
  {
    "name": "dueDate",
    "type": "string",
    "description": "Payment due date for the invoice"
  },
  {
    "name": "supplier",
    "type": "object",
    "description": "Details of the supplier issuing the invoice",
    "children": [
      {
        "name": "name",
        "type": "string",
        "description": "Supplier's name"
      },
      {
        "name": "email",
        "type": "string",
        "description": "Supplier's email address"
      },
      {
        "name": "vatNumber",
        "type": "string",
        "description": "Supplier's VAT or Tax ID"
      }
    ]
  },
  {
    "name": "payment",
    "type": "object",
    "description": "Payment details for the invoice",
    "children": [
      {
        "name": "paymentMethod",
        "type": "enum",
        "description": "Payment method",
        "values": ["Bank Transfer", "Credit Card", "Cheque"]
      },
      {
        "name": "bankDetails",
        "type": "object",
        "description": "Bank details for wire transfers",
        "children": [
          {
            "name": "bankName",
            "type": "string",
            "description": "Name of the bank"
          },
          {
            "name": "iban",
            "type": "string",
            "description": "International Bank Account Number (IBAN)"
          },
          {
            "name": "swift",
            "type": "string",
            "description": "SWIFT/BIC code for international transfers"
          },
          {
            "name": "reference",
            "type": "string",
            "description": "Reference text for the payment"
          }
        ]
      }
    ]
  },
  {
    "name": "financialSummary",
    "type": "object",
    "description": "Breakdown of financial details",
    "children": [
      {
        "name": "subtotal",
        "type": "number",
        "description": "Total amount before taxes"
      },
      {
        "name": "tax",
        "type": "number",
        "description": "Total tax amount applied"
      },
      {
        "name": "totalAmount",
        "type": "number",
        "description": "Final total amount due after taxes"
      }
    ]
  },
  {
    "name": "items",
    "type": "array",
    "description": "List of purchased items in the invoice",
    "children": [
      {
        "name": "name",
        "type": "string",
        "description": "Name of the item"
      },
      {
        "name": "sku",
        "type": "string",
        "description": "Stock Keeping Unit (SKU) identifier"
      },
      {
        "name": "quantity",
        "type": "number",
        "description": "Number of units purchased"
      },
      {
        "name": "unitPrice",
        "type": "number",
        "description": "Price per unit"
      },
      {
        "name": "discount",
        "type": "number",
        "description": "Discount applied per unit"
      },
      {
        "name": "totalPrice",
        "type": "number",
        "description": "Total price for the item"
      }
    ]
  }
];
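
Because a schema is a nested structure, small mistakes (an enum without values, an object without children) are easy to miss. The sketch below is a local sanity check you could run before sending a schema. It is an illustration only: the function name findSchemaProblems is hypothetical, and the list of types covers just the ones used in this guide, so the API may well accept others.

```javascript
// Types that appear in this guide's example schema (an assumption, not an official list).
const KNOWN_TYPES = ['string', 'number', 'enum', 'object', 'array'];

// Hypothetical helper: walk a schema array and collect human-readable problems.
function findSchemaProblems(fields, path = '') {
  if (!Array.isArray(fields)) {
    return [`${path || 'schema'}: expected an array of fields`];
  }
  const problems = [];
  for (const field of fields) {
    if (field === null || typeof field !== 'object') {
      problems.push(`${path || 'schema'}: field is not an object`);
      continue;
    }
    const where = path ? `${path}.${field.name}` : String(field.name);
    if (typeof field.name !== 'string' || field.name === '') {
      problems.push(`${path || 'schema'}: field is missing a name`);
    }
    if (!KNOWN_TYPES.includes(field.type)) {
      problems.push(`${where}: unrecognized type "${field.type}"`);
    }
    if (field.type === 'enum' && !(Array.isArray(field.values) && field.values.length > 0)) {
      problems.push(`${where}: enum needs a non-empty "values" array`);
    }
    if (field.type === 'object' || field.type === 'array') {
      // Nested fields are validated recursively; a missing "children" array is reported too
      problems.push(...findSchemaProblems(field.children, where));
    }
  }
  return problems;
}

// Usage: an empty array means no problems were found.
// console.log(findSchemaProblems(schema));
```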

2.2 Send the PDF to Documind

Now, let's create the extraction job by sending the document URL and the schema above:

javascript
async function createJob(file) {
  try {
    // Submit the document URL together with the extraction schema
    const response = await documindAPI.post('/run-job', {
      file,
      schema
    });

    return response.data.id; // Store Job ID for polling
  } catch (error) {
    // On failure, log the error; the function then returns undefined
    console.error('Error creating extraction job:', error.response ? error.response.data : error.message);
  }
}

What Happens Here?

  • We send a POST request to /run-job

  • The file URL points to the document to be processed

  • The schema defines the expected structure of extracted data

  • The API returns a Job ID, which we use to check the status and get the results
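
Since the job is created from a file URL, a malformed value (or a forgotten placeholder) is worth catching before the request is sent. Here is a minimal sketch, assuming the document is reachable over HTTP or HTTPS; assertHttpUrl is a hypothetical helper, not part of Documind's API:

```javascript
// Hypothetical pre-flight check: ensure `file` is an absolute http(s) URL.
function assertHttpUrl(file) {
  let url;
  try {
    // The built-in URL constructor throws on strings it cannot parse
    url = new URL(file);
  } catch {
    throw new Error(`Not a valid URL: ${file}`);
  }
  if (url.protocol !== 'http:' && url.protocol !== 'https:') {
    throw new Error(`Unsupported URL scheme: ${url.protocol}`);
  }
  return url.href;
}

// Usage:
// const jobId = await createJob(assertHttpUrl(file));
```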

Step 3: Polling for Job Completion

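Next, poll the API until the job is complete. The function below checks the job status at a fixed interval (every 5 seconds, up to 5 attempts by default) and returns the result once the status is COMPLETED.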

javascript
async function pollJob(jobId, maxRetries = 5, delay = 5000) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    let data;
    try {
      ({ data } = await documindAPI.get(`/job/${jobId}`));
    } catch (error) {
      // Network or HTTP error: log it and try again on the next attempt
      console.error("Error retrieving job status:", error.response?.data || error.message);
    }

    if (data?.status === "COMPLETED") {
      return data.result;
    }

    // A failed job will not recover, so stop polling immediately
    if (data?.status === "FAILED") {
      throw new Error(`Extraction failed for Job ID: ${jobId}`);
    }

    // Wait before the next attempt (no need to wait after the last one)
    if (attempt < maxRetries) {
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }

  // The job never reached COMPLETED within the allowed attempts
  throw new Error(`Max retries reached. Job ID: ${jobId}`);
}
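
A fixed delay works for short jobs. For longer documents, one common alternative is exponential backoff, where the wait doubles after each attempt. The sketch below shows that swapped-in technique; it is not something the Documind API requires. The function name pollWithBackoff and its options are hypothetical, and the status source is passed in as a function so the logic does not depend on any particular HTTP client.

```javascript
// Sketch: poll with exponential backoff instead of a fixed delay.
// `getStatus` is any async function resolving to { status, result }, for example:
//   () => documindAPI.get(`/job/${jobId}`).then(res => res.data)
async function pollWithBackoff(getStatus, { maxAttempts = 6, baseDelay = 1000, maxDelay = 30000 } = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const { status, result } = await getStatus();

    if (status === 'COMPLETED') return result;
    if (status === 'FAILED') throw new Error('Extraction job failed');

    if (attempt < maxAttempts - 1) {
      // Delay doubles on every attempt (baseDelay, 2x, 4x, ...) up to maxDelay
      const delay = Math.min(baseDelay * 2 ** attempt, maxDelay);
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
  throw new Error(`Job did not complete after ${maxAttempts} attempts`);
}
```

With the defaults shown, the waits are 1, 2, 4, 8 and 16 seconds, so a quick job returns early while a slow one is polled less aggressively.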

Step 4: Putting Everything Together

javascript
import fs from 'node:fs';

async function extractData(file) {
  const jobId = await createJob(file);
  if (!jobId) throw new Error("Failed to create extraction job.");

  const result = await pollJob(jobId);

  // You can save the extracted data to a JSON file to see the results
  fs.writeFileSync("invoice.json", JSON.stringify(result, null, 2));
}

// Usage
const file = "<Add your file URL here>";
extractData(file)
  .then(() => console.log("Extraction process completed."))
  .catch(error => console.error("Error:", error));

The Result

Once completed, you should receive structured JSON data like this:

JSON
{
  "items": [
    {
      "sku": "SRV-1001",
      "name": "Cloud Server Hosting",
      "discount": 0,
      "quantity": 1,
      "unitPrice": 3000,
      "totalPrice": 3000
    },
    {
      "sku": "LIC-4587",
      "name": "Software Licensing",
      "discount": 50,
      "quantity": 5,
      "unitPrice": 400,
      "totalPrice": 1750
    },
    {
      "sku": "CNS-2003",
      "name": "Consulting Services",
      "discount": 20,
      "quantity": 10,
      "unitPrice": 100,
      "totalPrice": 980
    }
  ],
  "dueDate": "March 1, 2024",
  "payment": {
    "bankDetails": {
      "iban": "GB29HBUK40127612345678",
      "swift": "HBUKGB4B",
      "bankName": "HSBC Bank",
      "reference": "Invoice #INV-2024-019"
    },
    "paymentMethod": "Bank Transfer"
  },
  "supplier": {
    "name": "Tech Solutions",
    "email": "accounts@techsolutions.com",
    "vatNumber": "GB123456789"
  },
  "invoiceDate": "February 1, 2024",
  "invoiceNumber": "INV-2024-019",
  "financialSummary": {
    "tax": 286.5,
    "subtotal": 5730,
    "totalAmount": 6016.5
  }
}
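
Before passing extracted figures to an accounting system, it may be worth cross-checking them against each other. The sketch below assumes the field names from the schema in Step 2; checkInvoiceTotals is a hypothetical helper, and the tolerance value is an arbitrary choice to absorb rounding.

```javascript
// Sketch: verify that the extracted amounts are internally consistent.
function checkInvoiceTotals(invoice, tolerance = 0.01) {
  const { subtotal, tax, totalAmount } = invoice.financialSummary;
  // Sum of the per-item totals should equal the subtotal
  const itemsSum = invoice.items.reduce((sum, item) => sum + item.totalPrice, 0);

  const problems = [];
  if (Math.abs(itemsSum - subtotal) > tolerance) {
    problems.push(`Item totals (${itemsSum}) do not match subtotal (${subtotal})`);
  }
  // Subtotal plus tax should equal the final amount
  if (Math.abs(subtotal + tax - totalAmount) > tolerance) {
    problems.push(`Subtotal + tax (${subtotal + tax}) does not match total (${totalAmount})`);
  }
  return problems; // an empty array means the numbers agree
}
```

For the sample result above, the item totals add up to 3000 + 1750 + 980 = 5730, and 5730 + 286.5 = 6016.5, so the function returns an empty array.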

What Next?

  • Import the extracted data into databases like PostgreSQL, Firebase, or MongoDB

  • Sync payments with accounting software like Xero or QuickBooks

  • Generate financial reports

  • Trigger payment processing workflows
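
As one illustration of the database route, the nested result can be flattened into one row per line item, a shape that maps onto a SQL table or a CSV export. This is a sketch only: toItemRows is a hypothetical helper, and the column names are illustrative rather than anything Documind prescribes.

```javascript
// Sketch: turn the nested extraction result into flat rows, one per invoice item.
function toItemRows(invoice) {
  return invoice.items.map((item, index) => ({
    invoice_number: invoice.invoiceNumber, // repeated on every row as the join key
    line_number: index + 1,                // 1-based position within the invoice
    sku: item.sku,
    name: item.name,
    quantity: item.quantity,
    unit_price: item.unitPrice,
    discount: item.discount,
    total_price: item.totalPrice,
  }));
}

// Usage:
// const rows = toItemRows(result);
// rows can then be passed to the bulk-insert method of your database client.
```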

Now that you've seen how to extract structured data from PDFs, you can test it out yourself! Try uploading an invoice in the playground to see how the extraction works in real time. When you're ready to integrate this into your own applications, sign up on Documind to start extracting structured data from PDFs and other documents.
