1. Intelligent Document Processing Documentation

1.1 General Concepts of Intelligent Document Processing

Intelligent Document Processing (IDP) is a solution operating on Enso, ExponentialAI’s platform.

IDP extracts and organizes information from any raw document into a standardized data model defined by the user. In this document, this standardized data model will be referred to as “domain object”.

The high-level structure of the IDP API contains:

Input
  1. Document to be consumed, provided as a file path to the document
  2. Solution ID of the solution that has been created
  3. Pipeline ID of the pipeline where the document is to be ingested
Output
  1. Processing Status
  2. Document type identified by ENSO classification models
  3. Confidence score of extraction at a document level
  4. Document elements (Tables, Lists, …) extracted from document with their individually associated confidence scores
  5. Entities extracted from document with their individually associated confidence scores
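
To make this contract concrete, the sketch below pairs the inputs with the call that consumes them. This is a minimal sketch: the endpoint, solution ID, pipeline ID, and file path are placeholder values, and the exact URL structure and payloads are specified in Section 2.

import requests

# Placeholder values; see Section 2 for the real endpoint structure and payloads
endpointURI = 'https://mq-proxy.IDP.SIT.exponentialai.com/'
docJSON = {'solution_id': 'mySolution',                                   # input 2: solution ID
           'data': {'inputs': {'file_path': 'minio://bucket/doc.pdf'}}}   # input 1: document path
res = requests.post(endpointURI + 'pipeline/dag/execute/PIPELINE_ID', json=docJSON)   # input 3: pipeline ID
# Polling the returned job_id later yields the outputs: processing status, document type,
# document-level confidence, and the extracted elements/entities with their scores.
jobID = res.json()['job_id']
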
1.2 Architecture Blueprints
1.2.1 Sequence Diagram

Sequence diagrams will be added in the next release of this document.

1.2.2 Deployment Diagrams

Please refer to the IDP/Enso installation manual for deployment information; the present document focuses on accessing and using IDP/Enso via its exposed APIs.

1.3 Documentation Convention
  • In this document, all code examples are in Python.

2. APIs

2.1 APIs Endpoint

ExponentialAI’s APIs follow RESTful conventions and are designed to have a predictable URL structure. The APIs use standard HTTP features, including the POST and GET methods and standard error response codes.

All ExponentialAI’s API calls are structured as follows, and all responses return standard JSON.

POST https://mq-proxy.<namespace>.<client>/pipeline/dag/execute/<pipeline_id>
GET https://mq-proxy.<namespace>.<client>/<job_id>

Where:

  1. namespace: Unique environment identifier configured when setting up IDP/Enso.
  2. client: Unique client identifier configured when setting up IDP/Enso.
  3. pipeline_id: UUID of a pipeline once created in IDP/Enso. Typically, this represents a pipeline dedicated to a document type.
  4. job_id: Once the digitization job is instantiated for the uploaded document, a unique job ID is generated to refer to the digitization in progress, and is delivered via the response payload (see the example below).
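
For instance, with the environment used throughout the examples in this document (namespace IDP, client SIT, and the exponentialai.com domain, all of which are illustrative), the concrete URLs would be assembled as follows:

namespace = 'IDP'          # configured at IDP/Enso setup
client = 'SIT'             # configured at IDP/Enso setup
pipelineID = '37692cf6-166c-47a3-b74a-48936fecbc4f'   # UUID of a created pipeline
jobID = '40dacefe-0377-470b-9f50-dfceaeb51b71'        # returned in the POST response

base = 'https://mq-proxy.' + namespace + '.' + client + '.exponentialai.com/'
postURL = base + 'pipeline/dag/execute/' + pipelineID   # trigger digitization
getURL = base + jobID                                   # poll the digitization job
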
2.2 Document Upload
2.2.1 Description

By default, two native mechanisms can be used to provide documents to a pipeline for processing:

  1. Minio [url]: IDP/Enso is fully Kubernetes compatible and leverages Minio as an interface to a vast number of storage technologies, such as AWS S3, Azure Blob Storage, Hadoop DFS, etc.
  2. AWS S3: while Minio can be used as an interface to an S3 backend, in cases where IDP/Enso is set up within AWS with S3 procured, a direct S3 storage path can be provided.

In the event you require access to a storage technology that is not supported natively, a custom connector will need to be added to the processing pipeline in order to access the content to be processed. Please refer to the solutions / pipelines customization documentation for more details on how to achieve this.

The solution / pipeline needs to be provided with a fully qualified path / filename to the file that will be processed. Additionally, IDP/Enso must have adequate access rights to the path / file in order to access and manipulate it.
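
For example, when Minio is the storage mechanism, the document can be staged with the Minio Python client before the pipeline is invoked. A minimal sketch, where the storage endpoint, credentials, and bucket name are illustrative assumptions:

from minio import Minio

# Illustrative endpoint, credentials, and bucket; use the values from your IDP/Enso storage setup
client = Minio('storage.IDP.SIT.exponentialai.com',
               access_key='ACCESS_KEY', secret_key='SECRET_KEY')
client.fput_object('documents', 'filename123.pdf', '/local/path/filename123.pdf')

# Fully qualified path subsequently handed to the solution / pipeline
filePath = 'minio://storage.IDP.SIT.exponentialai.com/documents/filename123.pdf'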

2.2.2 API Definition

Request:

POST https://mq-proxy.<namespace>.<client>/pipeline/dag/execute/<pipeline_id>

POST PAYLOAD
{
    "ref_id":"string",
    "solution_id":"string",
    "data":{
        "inputs":{
            "file_path":"string"
        }
    }
}
  1. ref_id: Optional. Unique ID created by a developer to keep track of the uploaded document and its processing. Unless there is a need to manage an internal UID in applications calling IDP’s APIs, the recommended implementation pattern is to leverage the job_id provided via the POST response to tie the upload and its processing together.
  2. solution_id: Mandatory. Identifies the solution that will process the document. Solution IDs are assigned when the solution is created in IDP/Enso.
  3. data/inputs/file_path: Mandatory. Fully qualified location that can be accessed by the solution/pipeline, and where the file to process is located.

Response:

{
    "status": {
        "success": boolean,
        "code": numerical,
        "msg": "string"
    },
    "job_id":"string"
}

Where:

  1. status/success: true/false.
  2. status/code: Numerical description of the upload/digitization trigger status.
    1. 200: file upload and digitization pipeline were successful.
    2. 500: an error has occurred during the document upload or the initialization of the digitization pipeline.
  3. status/msg: Literal description of the upload/digitization trigger status. Provides insight into a 500 failure code.
  4. job_id: Once the digitization job is instantiated for the uploaded document, a unique job ID is generated to refer to the digitization in progress (see the handling sketch below).
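
A caller would typically branch on status/code and keep the job_id for subsequent polling. A minimal sketch, assuming res holds the requests response of an upload POST as in the example in 2.2.3:

payload = res.json()

if payload['status']['code'] == 200:
    jobID = payload['job_id']          # keep for status polling (Section 2.3)
else:
    # status/msg carries the literal description of the 500 failure
    raise RuntimeError(payload['status']['msg'])
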
2.2.3 Example

Request:

import requests

endpointURI = 'https://mq-proxy.IDP.SIT.exponentialai.com/'
actionURI = 'pipeline/dag/execute/'
solutionID = 'backofficeSolution'
pipelineID = '37692cf6-166c-47a3-b74a-48936fecbc4f'
filePath = 'minio://storage.IDP.SIT.exponentialai.com/'
fileName = 'filename123.pdf'
docJSON = {'data': {'inputs': {'file_path': filePath + fileName}},
           'ref_id': 'refID123',
           'solution_id': solutionID}
# json= serializes the payload as a JSON body, as expected by the API
res = requests.post(endpointURI + actionURI + pipelineID, json=docJSON)
payload = res.json()

Response:

{
    "status": {
        "success": true,
        "code": 200,
        "msg": ""
    },
    "job_id":"40dacefe-0377-470b-9f50-dfceaeb51b71"
}

The file was successfully uploaded, the digitization pipeline was successfully triggered, and a job_id was provided back as part of the response for tracking purposes.

2.3 Document Status, Job Polling
2.3.1 Description

Integrating with IDP/Enso is done in an asynchronous fashion.

To that end, an application leveraging IDP/Enso for its digitization capabilities needs to periodically check on the digitization progress in order to know when the process has terminated, whether the uploaded file failed to process, and, if it succeeded, to access the resulting digital version of the file along with any other available metadata.
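
A minimal polling sketch is shown below; a complete version with exponential back-off appears in Section 3.2. The helper name and the fixed 5-second interval are illustrative choices:

import time
import requests

def waitForJob(endpointURI, jobID, interval=5):
    # Poll the job until digitization either completes or fails
    while True:
        payload = requests.get(endpointURI + jobID).json()
        if payload['process_status'] != 'in-progress':
            return payload             # 'processed' or 'failed'
        time.sleep(interval)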

2.3.2 API Definition

Request:

GET https://mq-proxy.<namespace>.<client>/<job_id>

Response:

{
    "status": {
        "success": boolean,
        "code": numerical,
        "msg": "string"
    },
    "process_status": "string",
    "result": {
        "data": {},
        "metadata": {
            "execution_id": "string",
            "output": {
                "result": [{
                    "domain": [{
                        "doc_id": "string",
                        "solution_id": "string",
                        "doc_state": "string",
                        "confidence": "float/int",
                        "data": {
                            "doc_type": "string",
                            "data": {}
                        }
                    }]
                }]
            }
        }
    }
}

Where:

  1. status/success: true/false.
  2. status/code: Numerical description of the digitization status.
    1. 200: file is processing as expected OR processing was finalized.
    2. 500: an error has occurred during the document digitization.
  3. status/msg: Literal description of the digitization status. Provides insight into the returned code, in particular a 500 failure.
  4. process_status:
    1. in-progress: file digitization is underway
    2. processed: file digitization was successfully performed
    3. failed: file digitization has failed
  5. result/data:
  6. result/metadata/execution_id: Unique execution ID associated to each API call
  7. result/metadata/output/result/domain/doc_id: Unique ID referring to the source document
  8. result/metadata/output/result/domain/solution_id: Solution ID provided when the job was invoked
  9. result/metadata/output/result/domain/doc_state:
    1. processing: file digitization is underway
    2. processed: file digitization was successfully performed
    3. failed: file digitization has failed
  10. result/metadata/output/result/domain/confidence: Provides the digitization confidence score at the document level
  11. result/metadata/output/result/domain/data: Extraction payload, containing the identified document type and the extracted data
  12. result/metadata/output/result/domain/data/doc_type: Document type identified by the ENSO classification models
  13. result/metadata/output/result/domain/data/entities: Contains the extracted entities as key-value pairs, with their individually associated confidence scores (a parsing sketch follows this list)
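
A minimal sketch of navigating this structure, assuming the job has finished processing and reusing the endpoint and job ID from the example in 2.3.3 (field names follow the definitions above):

import requests

# endpointURI / jobID as in the example in 2.3.3
endpointURI = 'https://mq-proxy.IDP.SIT.exponentialai.com/'
jobID = 'SAMPLE-0377-470b-9f50-dfceaeb51b71'

payload = requests.get(endpointURI + jobID).json()

for result in payload['result']['metadata']['output']['result']:
    for domain in result['domain']:
        print(domain['doc_id'], domain['doc_state'], domain['confidence'])
        print('doc_type:', domain['data']['doc_type'])
        print('entities:', domain['data']['data'])   # extracted key-value pairs
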
2.3.3 Example

Request:

import requests

endpointURI = 'https://mq-proxy.IDP.SIT.exponentialai.com/'
# jobID would have been provided via the response of a prior document upload / POST request
jobID = 'SAMPLE-0377-470b-9f50-dfceaeb51b71'
res = requests.get(endpointURI + jobID)
payload = res.json()
Response:

{
    "status": {
        "success": true,
        "code": 200,
        "msg": ""
    },
    "process_status": "processed",
    "result": {
        "data": {},
        "doc_score": 0.85,
        "source_case_form_sender_tel": [[
            "9178291596",
            "99f2e6d1-2c19-406a-9092-d87ac8fbaf15",
            0.11199999999999999,
            "terminology"
        ]],
        "source_case_form_sender_name": [[
            "John Doe",
            "5f53674c-4fc5-4f67-8168-2412badb057a",
            0.11199999999999999,
            "terminology"
        ]],
        "source_case_form_sender_organization": [[
            "Internal",
            "78aa6eb1-671a-4199-90d0-243a27c30157",
            0.11199999999999999,
            "terminology"
        ]],
        "sponsor_study_number": [[
            "ABC Support Services",
            "8426afcf-fc59-4562-a1c0-07db176dc0b4",
            0.11199999999999999,
            "terminology"
        ]],
        "first_receive_datetime": [[
            "MM/dd/yyyy",
            "8c889a61-06a2-4e24-8a49-aadf49f87d95",
            0.10880000000000001,
            "terminology"
        ]]
    }
}
2.4 Document Retrieval
2.4.1 Description

For version 9.x of IDP’s APIs, document retrieval is embedded in the job-progress polling call. Please refer to the specification of the document status / job polling calls for document retrieval details.

2.5 Security
2.5.1 Description

API keys can be used to restrict access to specific API methods or all methods.

In the event authentication is required, every API request must be authenticated with an API key and username.

The authentication information (API_Key) can be provided either via the HTTP Basic authentication header, or via a parameter of the query string or request body.

Additional details on security schemes will be added to the next release of this document.
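
Both transport options map directly onto the requests library. A minimal sketch, where USERNAME and API_KEY are placeholders and the endpoint/payload are reused from the example in 2.2.3:

import requests

# Endpoint and payload as in the example in 2.2.3
url = 'https://mq-proxy.IDP.SIT.exponentialai.com/pipeline/dag/execute/37692cf6-166c-47a3-b74a-48936fecbc4f'
docJSON = {'data': {'inputs': {'file_path': 'minio://storage.IDP.SIT.exponentialai.com/filename123.pdf'}},
           'ref_id': 'refID123', 'solution_id': 'backofficeSolution'}

# Option 1: credentials via the HTTP Basic authentication header
res = requests.post(url, json=docJSON, auth=('USERNAME', 'API_KEY'))

# Option 2: credentials inside the request body (see the API definition below)
docJSON['auth'] = {'username': 'USERNAME', 'api_key': 'API_KEY'}
res = requests.post(url, json=docJSON)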

2.5.2 API Definition

Request:

POST https://mq-proxy.<namespace>.<client>/pipeline/dag/execute/<pipeline_id>

POST Payload
{
    "auth": {
        "username": "string",
        "api_key": "string"
    }
}

Response:

{
    "status": {
        "success": boolean,
        "code": numerical,
        "msg": "string"
    },
...
}

Where:

  1. status/success: true/false.
  2. status/code: Numerical status of the API call.
    1. 200: success code for the target API.
    2. 500: an error has occurred during the execution of the call, including authentication issues.
  3. status/msg: Literal description of the call status. Provides insight into a 500 failure code. The message reflects an authentication issue if an authentication scheme is required and the API is invoked without (adequate) credentials.
2.5.3 Example

Request:

import requests

endpointURI = 'https://mq-proxy.IDP.SIT.exponentialai.com'
# Authenticated upload via HTTP Basic authentication
res = requests.post(endpointURI + '/pipeline/dag/execute/PIPELINE_ID', auth=('USERNAME', 'API_KEY'))
print(res.json())
# The follow-up GET is issued without credentials, so it fails when authentication is enforced
res = requests.get(endpointURI + '/' + res.json()['job_id'])
print(res.json())

Response:

// the first call succeeds: the document upload is accepted and a job ID is returned
{
    "status": {
        "success": true,
        "code": 200,
        "msg": ""
    },
    "job_id": "40dacefe-0377-470b-9f50-dfceaeb51b71"
}

// the second call fails with a 500 code, and the literal message indicates that authentication failed
{
    "status": {
        "success": false,
        "code": 500,
        "msg": "Authentication failed"
    }
}

3. Sample Implementation with IDP

3.1 Description

The example below demonstrates the development of a script that:

  1. Uploads a file named “filename123.pdf” stored on a Minio endpoint
  2. Checks on the file’s digitization progress, with an increasing break duration between checks
  3. Returns the simplified digitized file if successful, or a 500 error code otherwise.
3.2 Sample Implementation


import time
import requests

# Variables for the server / API endpoint / processing solution & pipeline
endpointURI = 'https://mq-proxy.IDP.SIT.exponentialai.com/'
actionURI = 'pipeline/dag/execute/'
solutionID = 'backofficeSolution'
pipelineID = '37692cf6-166c-47a3-b74a-48936fecbc4f'

# Variables for the file we want to upload
filePath = 'minio://storage.IDP.SIT.exponentialai.com/'
fileName = 'filename123.pdf'
docJSON = {'data': {'inputs': {'file_path': filePath + fileName}},
           'ref_id': 'refID123',
           'solution_id': solutionID}

def digitizeDocument():
    # Execute the upload and retrieve the response
    res = requests.post(endpointURI + actionURI + pipelineID, json=docJSON)
    payload = res.json()

    # If the upload was NOT successful, exit with the error code
    if payload['status']['code'] != 200:
        return payload['status']['code']

    # Otherwise, continue processing
    jobID = payload['job_id']

    # Poll job_id as long as digitization is in progress and hasn't exceeded 1024 s runtime
    sleepDuration = 1  # Start with a 1-second sleep, doubling after each check
    processStatus = requests.get(endpointURI + jobID).json()['process_status']

    while processStatus == 'in-progress' and sleepDuration <= 1024:
        time.sleep(sleepDuration)
        sleepDuration *= 2
        processStatus = requests.get(endpointURI + jobID).json()['process_status']

    if sleepDuration > 1024:
        return 500

    # Return the simplified digitized result
    payload = requests.get(endpointURI + jobID).json()
    return payload['result']['metadata']['output']['result'][0]