Extracting structured data from PDFs and other document types is a common challenge. Instead of relying on unreliable OCR solutions or manually copying and pasting data, Documind provides a powerful API that allows you to automate data extraction seamlessly.
In this guide, you’ll learn how to:
Set up authentication to securely access the API
Create an extraction job for processing
Poll for job completion and retrieve the extracted data
To make this practical, we’ll extract key details from this invoice, including:
Invoice number
Invoice date
Due date
Supplier's name
Supplier's email
Supplier's VAT number
Payment method
Bank name
IBAN
SWIFT/BIC code
Payment reference
Subtotal
Tax amount
Total amount
Items (Name, SKU, Quantity, Unit price, Discount, Total price)
Once the data is extracted, it can be stored in a database for tracking invoices and payments or sent directly to accounting platforms like Xero, QuickBooks, and other financial tools. This guide will take you through the complete process so you can integrate Documind’s API into your own applications.
Step 1: Setting Up Authentication
Before making requests to the Documind API, you'll need to install the necessary dependencies and obtain your API key for authentication.
1.1 Install Required Packages
Ensure you have Node.js installed, then install axios (for making HTTP requests) and dotenv (for securely storing API keys):
npm install axios dotenv
1.2 Get Your API Key
Sign in to the Documind Dashboard
Navigate to the Settings page
Copy your Secret API Key
To keep your key secure, create a .env file in your project directory and store the key there:
DOCUMIND_API_KEY=your-secret-api-key
Documind is in private beta; sign up here to get access.
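If the .env file is missing or the variable name is misspelled, the key silently becomes undefined and every request fails with an authentication error. A minimal sketch of a fail-fast check is shown here; requireEnv is a hypothetical helper written for this guide, not part of dotenv or Documind.

```javascript
// Hypothetical helper (not part of Documind or dotenv): fail fast when a
// required environment variable is missing or blank.
function requireEnv(name, env = process.env) {
  const value = env[name];
  if (typeof value !== 'string' || value.trim() === '') {
    throw new Error(`Missing environment variable: ${name}. Add it to your .env file.`);
  }
  return value.trim();
}

// Usage (after dotenv has loaded the .env file):
// const API_KEY = requireEnv('DOCUMIND_API_KEY');
```

Throwing at startup makes a misconfigured key obvious before any request is sent.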
1.3 Create an Axios Instance
Now, set up an axios instance with the base URL of the API and authentication headers:
// ES module syntax: set "type": "module" in package.json or use a .mjs file
import 'dotenv/config';
import axios from 'axios';

// Load API key from environment variables
const API_KEY = process.env.DOCUMIND_API_KEY;

// Reusable client: every request shares the base URL and auth headers
const documindAPI = axios.create({
  baseURL: 'https://api.documind.xyz',
  headers: {
    'Authorization': `Bearer ${API_KEY}`,
    'Content-Type': 'application/json'
  }
});
Step 2: Creating an Extraction Job
To extract data from the invoice, you need to create an extraction job. This involves sending the document's URL to Documind along with a schema that defines the structure of the data you want to extract.
2.1 Defining the Schema
The schema acts as a blueprint, specifying what information should be extracted from the document. It outlines the key fields and their expected data types.
For a detailed breakdown of schema definitions and best practices, check out this guide. You can also find additional schema examples in the documentation.
Here’s how you can define a schema for an invoice:
const schema = [
  {
    "name": "invoiceNumber",
    "type": "string",
    "description": "Unique identifier for the invoice"
  },
  {
    "name": "invoiceDate",
    "type": "string",
    "description": "Date when the invoice was issued"
  },
  {
    "name": "dueDate",
    "type": "string",
    "description": "Payment due date for the invoice"
  },
  {
    "name": "supplier",
    "type": "object",
    "description": "Details of the supplier issuing the invoice",
    "children": [
      {
        "name": "name",
        "type": "string",
        "description": "Supplier's name"
      },
      {
        "name": "email",
        "type": "string",
        "description": "Supplier's email address"
      },
      {
        "name": "vatNumber",
        "type": "string",
        "description": "Supplier's VAT or Tax ID"
      }
    ]
  },
  {
    "name": "payment",
    "type": "object",
    "description": "Payment details for the invoice",
    "children": [
      {
        "name": "paymentMethod",
        "type": "enum",
        "description": "Payment method",
        "values": ["Bank Transfer", "Credit Card", "Cheque"]
      },
      {
        "name": "bankDetails",
        "type": "object",
        "description": "Bank details for wire transfers",
        "children": [
          {
            "name": "bankName",
            "type": "string",
            "description": "Name of the bank"
          },
          {
            "name": "iban",
            "type": "string",
            "description": "International Bank Account Number (IBAN)"
          },
          {
            "name": "swift",
            "type": "string",
            "description": "SWIFT/BIC code for international transfers"
          },
          {
            "name": "reference",
            "type": "string",
            "description": "Reference text for the payment"
          }
        ]
      }
    ]
  },
  {
    "name": "financialSummary",
    "type": "object",
    "description": "Breakdown of financial details",
    "children": [
      {
        "name": "subtotal",
        "type": "number",
        "description": "Total amount before taxes"
      },
      {
        "name": "tax",
        "type": "number",
        "description": "Total tax amount applied"
      },
      {
        "name": "totalAmount",
        "type": "number",
        "description": "Final total amount due after taxes"
      }
    ]
  },
  {
    "name": "items",
    "type": "array",
    "description": "List of purchased items in the invoice",
    "children": [
      {
        "name": "name",
        "type": "string",
        "description": "Name of the item"
      },
      {
        "name": "sku",
        "type": "string",
        "description": "Stock Keeping Unit (SKU) identifier"
      },
      {
        "name": "quantity",
        "type": "number",
        "description": "Number of units purchased"
      },
      {
        "name": "unitPrice",
        "type": "number",
        "description": "Price per unit"
      },
      {
        "name": "discount",
        "type": "number",
        "description": "Discount applied per unit"
      },
      {
        "name": "totalPrice",
        "type": "number",
        "description": "Total price for the item"
      }
    ]
  }
];
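A typo in a nested field is easy to miss in a schema this long, so it can help to check the structure locally before sending it. The sketch below is an illustration only: validateSchema and KNOWN_TYPES are hypothetical names, and the rules (five field types, "children" for objects and arrays, "values" for enums) are inferred from the example in this guide rather than taken from Documind's specification, which may allow more.

```javascript
// Hypothetical validator: the rules below are inferred from the example schema
// in this guide, not taken from Documind's specification.
const KNOWN_TYPES = ['string', 'number', 'object', 'array', 'enum'];

function validateSchema(fields, path = 'schema') {
  if (!Array.isArray(fields)) {
    return [`${path}: expected an array of field definitions`];
  }
  const errors = [];
  fields.forEach((field, index) => {
    const where = `${path}[${index}]`;
    if (field === null || typeof field !== 'object') {
      errors.push(`${where}: expected an object`);
      return;
    }
    if (typeof field.name !== 'string' || field.name === '') {
      errors.push(`${where}: missing "name"`);
    }
    if (!KNOWN_TYPES.includes(field.type)) {
      errors.push(`${where}: unexpected type "${field.type}"`);
    }
    // Nested types carry their sub-fields in "children"
    if (field.type === 'object' || field.type === 'array') {
      errors.push(...validateSchema(field.children, `${where}.children`));
    }
    // Enums list their allowed values in "values"
    if (field.type === 'enum' && (!Array.isArray(field.values) || field.values.length === 0)) {
      errors.push(`${where}: enum needs a non-empty "values" array`);
    }
  });
  return errors;
}

// A deliberately broken sample to show what gets reported
const sample = [
  { name: 'invoiceNumber', type: 'string', description: 'Unique identifier for the invoice' },
  { name: 'paymentMethod', type: 'enum', description: 'Payment method' }, // "values" is missing
  { name: 'items', type: 'array', description: 'Line items', children: [
    { name: 'sku', type: 'text', description: 'SKU' } // "text" is not a type used in this guide
  ] }
];

console.log(validateSchema(sample));
```

An empty array means no problems were found under these assumed rules; each message includes the path to the offending field.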
2.2 Send the PDF to Documind
Now, let's create the extraction job by sending the document URL and the schema above:
async function createJob(file) {
  try {
    const response = await documindAPI.post('/run-job', {
      file,
      schema
    });

    return response.data.id; // Store Job ID for polling
  } catch (error) {
    console.error('Error creating extraction job:', error.response ? error.response.data : error.message);
  }
}
What Happens Here?
We send a POST request to /run-job
The file URL points to the document to be processed
The schema defines the expected structure of extracted data
The API returns a Job ID, which we use to check the status and get the results
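Because the request carries a URL rather than the file itself, a malformed value only surfaces once the API rejects it. A small pre-flight check can catch obvious mistakes earlier. This is a sketch: isHttpUrl is a hypothetical helper, and the restriction to http and https is an assumption, since this guide does not say which URL schemes Documind accepts.

```javascript
// Hypothetical pre-flight check: confirm the value passed as `file` is an
// absolute http(s) URL. Which schemes Documind accepts is an assumption here.
function isHttpUrl(value) {
  try {
    const url = new URL(value);
    return url.protocol === 'http:' || url.protocol === 'https:';
  } catch {
    return false; // new URL() throws a TypeError on unparseable input
  }
}

console.log(isHttpUrl('https://example.com/invoice.pdf')); // true
console.log(isHttpUrl('invoice.pdf'));                     // false (relative path)
```

You could call isHttpUrl(file) at the top of createJob and throw before making the request.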
Step 3: Polling for Job Completion
Next, poll the API until the job completes, fails, or the retry limit is reached.
async function pollJob(jobId, maxRetries = 5, delay = 5000) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    let data;
    try {
      ({ data } = await documindAPI.get(`/job/${jobId}`));
    } catch (error) {
      // Network or HTTP error: log it and retry on the next attempt
      console.error("Error retrieving job status:", error.response?.data || error.message);
    }

    if (data?.status === "COMPLETED") {
      return data.result;
    }

    // A failed job will not recover, so stop polling immediately
    if (data?.status === "FAILED") {
      throw new Error(`Extraction failed for Job ID: ${jobId}`);
    }

    // Wait before the next attempt (no need to wait after the last one)
    if (attempt < maxRetries) {
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }

  // Still not completed after all attempts
  throw new Error(`Max retries reached. Job ID: ${jobId}`);
}
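The polling loop above waits a fixed interval between attempts. For documents that take longer to process, a common alternative is exponential backoff, where the wait doubles after each attempt up to a cap. The sketch below shows only the delay calculation; backoffDelay and its default values are hypothetical and not something Documind prescribes.

```javascript
// Hypothetical helper: delay before retry number `attempt` (1-based), doubling
// each time and capped at `maxDelay` milliseconds.
function backoffDelay(attempt, baseDelay = 1000, maxDelay = 30000) {
  return Math.min(maxDelay, baseDelay * 2 ** (attempt - 1));
}

// Delays for the first six attempts with the defaults above
const delays = [1, 2, 3, 4, 5, 6].map(n => backoffDelay(n));
console.log(delays); // [1000, 2000, 4000, 8000, 16000, 30000]
```

To use it, you would replace the fixed delay in the setTimeout call with backoffDelay(attempt) and raise maxRetries to suit your documents.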
Step 4: Putting Everything Together
import fs from 'fs'; // Place this with the other imports at the top of your file

async function extractData(file) {
  const jobId = await createJob(file);
  if (!jobId) throw new Error("Failed to create extraction job.");

  const result = await pollJob(jobId);

  // You can save the extracted data to a JSON file to see the results
  fs.writeFileSync("invoice.json", JSON.stringify(result, null, 2));
}

// Usage
const file = "<Add your file URL here>";
extractData(file)
  .then(() => console.log("Extraction process completed."))
  .catch(error => console.error("Error:", error));
The Result
Once completed, you should receive structured JSON data like this:
{
  "items": [
    {
      "sku": "SRV-1001",
      "name": "Cloud Server Hosting",
      "discount": 0,
      "quantity": 1,
      "unitPrice": 3000,
      "totalPrice": 3000
    },
    {
      "sku": "LIC-4587",
      "name": "Software Licensing",
      "discount": 50,
      "quantity": 5,
      "unitPrice": 400,
      "totalPrice": 1750
    },
    {
      "sku": "CNS-2003",
      "name": "Consulting Services",
      "discount": 20,
      "quantity": 10,
      "unitPrice": 100,
      "totalPrice": 980
    }
  ],
  "dueDate": "March 1, 2024",
  "payment": {
    "bankDetails": {
      "iban": "GB29HBUK40127612345678",
      "swift": "HBUKGB4B",
      "bankName": "HSBC Bank",
      "reference": "Invoice #INV-2024-019"
    },
    "paymentMethod": "Bank Transfer"
  },
  "supplier": {
    "name": "Tech Solutions",
    "email": "accounts@techsolutions.com",
    "vatNumber": "GB123456789"
  },
  "invoiceDate": "February 1, 2024",
  "invoiceNumber": "INV-2024-019",
  "financialSummary": {
    "tax": 286.5,
    "subtotal": 5730,
    "totalAmount": 6016.5
  }
}
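Before passing extracted figures downstream, it is worth confirming that they agree with each other. The sketch below checks that the line-item totals add up to the subtotal and that subtotal plus tax equals the total. checkTotals is a hypothetical helper, and it assumes the result has the same shape as the sample above.

```javascript
// Hypothetical sanity check on an extraction result shaped like the one above.
function checkTotals(result, tolerance = 0.01) {
  const { subtotal, tax, totalAmount } = result.financialSummary;
  const itemsSum = result.items.reduce((sum, item) => sum + item.totalPrice, 0);
  return {
    itemsMatchSubtotal: Math.abs(itemsSum - subtotal) <= tolerance,
    totalMatches: Math.abs(subtotal + tax - totalAmount) <= tolerance
  };
}

// Figures taken from the sample result in this guide
const sampleResult = {
  items: [{ totalPrice: 3000 }, { totalPrice: 1750 }, { totalPrice: 980 }],
  financialSummary: { tax: 286.5, subtotal: 5730, totalAmount: 6016.5 }
};

console.log(checkTotals(sampleResult));
// { itemsMatchSubtotal: true, totalMatches: true }
```

A false value is a signal to review the document manually rather than proof that the extraction is wrong, since invoices themselves sometimes contain rounding differences.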
What Next?
Import the extracted data into databases like PostgreSQL, Firebase, or MongoDB
Sync payments with accounting software like Xero or QuickBooks
Generate financial reports
Trigger payment processing workflows
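As one example of the database route, the nested result usually needs flattening into rows first. The sketch below turns the items array into one row per line item. toItemRows and the snake_case column names are hypothetical; adapt them to whatever table or API you are writing to.

```javascript
// Hypothetical mapping: column names such as invoice_number are illustrative,
// not a schema required by any database or accounting platform.
function toItemRows(result) {
  return result.items.map((item, index) => ({
    invoice_number: result.invoiceNumber,
    line_no: index + 1,
    sku: item.sku,
    name: item.name,
    quantity: item.quantity,
    unit_price: item.unitPrice,
    discount: item.discount,
    total_price: item.totalPrice
  }));
}

// One line item taken from the sample result in this guide
const rows = toItemRows({
  invoiceNumber: 'INV-2024-019',
  items: [
    { sku: 'SRV-1001', name: 'Cloud Server Hosting', discount: 0, quantity: 1, unitPrice: 3000, totalPrice: 3000 }
  ]
});

console.log(rows);
```

Each row repeats the invoice number so the line items can be joined back to their invoice.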
Now that you've seen how to extract structured data from PDFs, you can test it out yourself! Try uploading an invoice in the playground to see how the extraction works in real time. When you're ready to integrate this into your own applications, sign up on Documind to start extracting structured data from PDFs and other documents.