Leveraging AWS AI/ML services to automate the identification of various cost savings opportunities for customers in the energy space

Challenge
Our client provides their partners in the energy industry with a variety of financial solutions to uncover overlooked savings. One of the primary systems that identifies those savings requires data from hundreds of different document types to be consolidated into a single common format before any analysis can be performed. This required an entire team dedicated to reading the financial documents and manually entering the data into this common format.
As they evaluated the future growth of their products and services, they quickly realized that the current manual solution for capturing data from the various document types would not scale. They looked to Protagona to design and build an automated solution that accurately captures the relevant data from hundreds of document types and consolidates it into a centralized data lake.

Solution
Protagona worked closely with the client to quickly identify an appropriate sample of documents in these very complex formats to begin training models on. Proofs of concept were then run on various AI/ML services within AWS to validate the raw data output and to design an automated data pipeline that integrates each service into its corresponding stage of the data lake. The fully built data pipeline now allows Merit to upload documents to S3, where a series of Textract, Comprehend and Glue jobs take the raw data from an image and transform it into the common format their systems need in order to identify cost savings.
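
As a rough illustration of the first stage of such a pipeline, the sketch below shows a Lambda-style handler that reacts to an S3 upload and asks Textract for the text lines in the image. It is a minimal sketch under stated assumptions, not the pipeline that was actually built: the function names (`s3_objects_from_event`, `lines_from_textract`, `handler`) and the optional `textract` argument are hypothetical, the synchronous `detect_document_text` call is assumed because the documents are described as images, and the Comprehend and Glue stages are left out.

```python
from urllib.parse import unquote_plus


def s3_objects_from_event(event):
    """Yield (bucket, key) pairs from an S3 event notification payload."""
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3 event notifications.
        yield s3["bucket"]["name"], unquote_plus(s3["object"]["key"])


def lines_from_textract(response):
    """Return the text of every LINE block in a Textract response."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


def handler(event, context, textract=None):
    """Hypothetical Lambda entry point: one Textract call per uploaded object.

    The `textract` parameter is an assumption added for illustration so a
    stub client can be passed in; Lambda itself only supplies event/context.
    """
    if textract is None:
        import boto3  # imported lazily so the helpers work without AWS access
        textract = boto3.client("textract")

    extracted = {}
    for bucket, key in s3_objects_from_event(event):
        response = textract.detect_document_text(
            Document={"S3Object": {"Bucket": bucket, "Name": key}}
        )
        extracted[key] = lines_from_textract(response)
    return extracted
```

In a fuller pipeline, the extracted lines would presumably be written to a raw stage of the data lake, where later jobs could classify them and map them into the common format.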

Tech Stack
- Amazon Textract
- Amazon Comprehend
- Amazon SageMaker
- AWS Glue
- Amazon S3
- AWS Lambda
- Amazon DynamoDB
- Amazon Athena
- Amazon QuickSight
- Python
- Terraform

Outcome
Business Agility
By reducing manual document processing and introducing more automation to the process, Merit is able to extract data into a data lake and make data-driven business decisions.
Cost Optimization
Leveraging AWS cloud-native services has brought cost efficiency to OCR, which is otherwise an expensive capability. The serverless architecture scales with usage, allowing them to grow their customer base and business without concern about licensing or unforeseen costs.
Data Integrity
The deployment and configuration of all components of the new architecture are fully automated. All changes run through a multi-stage CI/CD pipeline that provides consistent deployments to each environment, so new document formats are onboarded consistently and with shorter lead time.