Proving the "Impossible" Possible: AI Document Processing for Congressional Filings

Protagona partnered with a leading nonpartisan money-in-politics organization to build an AI-powered document processing pipeline on AWS, automating extraction from the most hostile congressional financial disclosures — including handwritten, degraded, and multi-hundred-page filings.

Industry

Nonprofit

Teams & Services

Data Engineering, Cloud Architecture, AI/ML, Delivery Management

Tech & Tools

Amazon Bedrock, Amazon Textract, AWS Lambda, Amazon S3, AWS Step Functions, Amazon Bedrock Agents, Multi-modal Foundation Models

Key Data Points

Proof of concept completed on schedule in a three-week sprint and deployed directly within the client's existing AWS environment, with full implementation documentation.

Confidence-based routing pipeline automatically processes high-confidence extractions while flagging uncertain results for human review, protecting decades of institutional credibility.

System demonstrated technical feasibility on documents previously deemed unprocessable — multi-generation photocopied handwritten filings and 300-plus-page brokerage statement attachments.

The Vision

One of the most trusted nonpartisan sources of money-in-politics data in the United States, this organization tracks campaign contributions, lobbying activity, and the personal financial disclosures of elected officials. Their credibility depends entirely on the accuracy of what they publish. For years, processing mandatory congressional financial disclosures required dedicated researchers working manually through documents — some hundreds of pages long, some handwritten and deliberately degraded through repeated photocopying. They needed a partner who could prove intelligent document processing was viable before committing to a full build.

The Goal

Protagona was engaged to achieve three concrete objectives: prove that an AI-powered pipeline could extract structured financial data from the full range of congressional disclosure formats, including the most difficult handwritten and degraded documents; implement confidence scoring that routes uncertain extractions to human review before they reach the public dataset; and deliver a working proof of concept deployed inside the organization's own AWS environment within three weeks.

The Challenge

Congressional financial disclosures are among the most hostile documents for automated processing. Filings range from clean machine-typed forms to handwritten submissions photocopied so many times that the text becomes ambiguous. Some pages are submitted upside down. Others are buried inside brokerage statement attachments from dozens of financial institutions, each with its own format. Some filings exceed three hundred pages. The organization's own technical leadership had previously attempted extraction with available tools and concluded it could not be done reliably. That skepticism defined the engagement's starting point. The accuracy bar was unambiguous: data published for journalists, researchers, and the public is treated as factual record, and any error reaching the platform would damage institutional credibility built over decades.

The Solution

Protagona designed an intelligent document processing pipeline that automatically extracts financial data the moment a filing is uploaded. A coordinating AI agent breaks each document into stages — extraction, validation, and confidence assessment — and routes each one to the right tool for the job. Standard text and structure are extracted directly, while handwritten entries and degraded scans, including documents blurred through repeated photocopying, are processed using AI models built to interpret visual content beyond what traditional text recognition can handle.
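
The routing idea described above can be illustrated with a minimal Python sketch. It is not the delivered implementation: the page categories, the extractor names (`text_extraction`, `multimodal_model`), and the `plan_document` helper are all hypothetical, and a real pipeline would call the relevant AWS services rather than return labels.

```python
from dataclasses import dataclass
from enum import Enum

# Stage names taken from the description above.
STAGES = ("extraction", "validation", "confidence_assessment")


class PageKind(Enum):
    """Hypothetical page categories assigned by an upstream classifier."""
    TYPED = "typed"
    HANDWRITTEN = "handwritten"
    DEGRADED = "degraded"


@dataclass(frozen=True)
class Page:
    number: int
    kind: PageKind


def choose_extractor(page: Page) -> str:
    """Pick a (hypothetical) extractor label for one page.

    Clean typed pages go to standard text extraction; handwritten or
    degraded pages go to a multi-modal model.
    """
    if page.kind is PageKind.TYPED:
        return "text_extraction"
    return "multimodal_model"


def plan_document(pages: list[Page]) -> list[dict]:
    """Build a per-page plan listing the extractor and the stages to run."""
    return [
        {"page": p.number, "extractor": choose_extractor(p), "stages": list(STAGES)}
        for p in pages
    ]


# Invented two-page example: one typed page, one handwritten page.
plan = plan_document([Page(1, PageKind.TYPED), Page(2, PageKind.HANDWRITTEN)])
```

Keeping the routing decision in one small function means a new page category or extractor can be added without touching the stage list.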

The confidence-scoring system was the strategic centerpiece of the design. Rather than treating every extraction the same way, the pipeline scores each data point individually — high-confidence results move straight through automated processing, while anything uncertain is flagged for human review before it reaches the public dataset. This preserves the accuracy the organization depends on without requiring every filing to be reviewed from scratch. The full system was delivered with complete documentation, so the internal team can operate and extend it independently.
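
A minimal Python sketch of field-level confidence routing follows, under stated assumptions: the 0.90 threshold, the `ExtractedField` type, and the sample values are invented for illustration and do not come from the engagement.

```python
from dataclasses import dataclass

# Assumed threshold for illustration only; a real value would be tuned
# against reviewed filings.
REVIEW_THRESHOLD = 0.90


@dataclass(frozen=True)
class ExtractedField:
    name: str
    value: str
    confidence: float  # 0.0 to 1.0, as reported by the extractor


def route_fields(fields, threshold=REVIEW_THRESHOLD):
    """Split fields into (auto_accepted, needs_review) by confidence."""
    auto_accepted, needs_review = [], []
    for field in fields:
        if field.confidence >= threshold:
            auto_accepted.append(field)
        else:
            needs_review.append(field)
    return auto_accepted, needs_review


# Invented sample data: one clearly read field, one ambiguous field.
sample = [
    ExtractedField("asset_name", "Example Fund A", 0.98),
    ExtractedField("transaction_amount", "$1,001 - $15,000", 0.62),
]
auto_accepted, needs_review = route_fields(sample)
```

Because each field is scored on its own, a single ambiguous value sends only that value to a reviewer instead of holding back the whole filing.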

Automated Processing of Unprocessable Filings

Multi-modal AI successfully extracted structured data from handwritten, degraded, and multi-hundred-page filings that prior tooling had failed to handle reliably.

Confidence-Based Triage Protecting Data Integrity

Field-level confidence scoring automatically routes uncertain extractions for human review, ensuring that no low-confidence data reaches the public dataset without verification.

Production-Ready Architecture, Client-Owned

Fully documented POC deployed within the client's own AWS account, with README guides enabling internal staff to operate, extend, and build toward a full MVP independently.