Digital library

From Data Silos to Unified Publishing Intelligence

Protagona partnered with a leading library services and publishing organization to design and validate a cloud-native data lake architecture — consolidating title metadata from disconnected internal systems and external sources into a single, queryable foundation.

Industry

Startups & Software

Teams & Services

Data Engineering, Cloud Architecture, Solutions Architecture

Tech & Tools

Amazon S3, AWS Glue, Amazon Athena, AWS Lambda, Amazon RDS, AWS Glue Data Catalog, Amazon QuickSight, Amazon Q Chat, Python

Key Data Points

Approximately four terabytes of raw title data and metadata identified across disconnected systems, targeted for consolidation into a unified AWS data lake.
Data previously siloed across at least four separate systems, requiring manual reconciliation estimated at 15 hours of meetings and 40 hours of work per cross-team project.
POC validated ISBN linkage patterns across library, trade, and audio editions — establishing the technical foundation for automated title reconciliation via Amazon QuickSight and Q Chat.

The Vision

This organization curates and delivers book collections for school and public libraries across the United States. Editorial and curation teams depend on rich, accurate title metadata, sourced from publishers, industry databases, and proprietary systems, to build collections and serve library readers. As product lines expanded, so did the number of systems holding fragments of that metadata: an ERP, a proprietary title management platform, a vendor submission portal, external sources such as the Library of Congress and BISAC, and span planning documents living only in spreadsheets. The goal was a single, authoritative source of truth.

The Goal

Protagona was engaged to validate the architecture and data ingestion patterns needed to consolidate a fragmented title and metadata landscape into a unified data lake. Success meant demonstrating that data from two core internal systems — Myriad and Jedi — alongside external sources could be normalized, linked by ISBN, and made queryable, establishing the foundation for automated metadata enrichment and cross-team intelligence.

The Challenge

The core complexity was not moving data but reconciling data that was never designed to interoperate. Multiple systems held overlapping but inconsistent representations of the same titles, with no shared key reliably connecting them. ISBN management alone was a significant architectural challenge: a single title can carry multiple ISBNs across formats and editions, and different business units care about different ones. Any unification strategy had to preserve that multiplicity. The two systems in scope for the POC, Myriad and Jedi, sat at very different points in their lifecycles, with inconsistent versioning practices and varying retention requirements spanning up to seven years. External sources added further variation, ranging from well-structured APIs to no programmatic access at all.

The Solution

Protagona designed a cloud-native data lake on AWS, using Amazon S3 as the central storage layer organized to accommodate varying retention policies and data formats. An ingestion framework pulled from both internal databases and external metadata sources, applying normalization logic to reconcile inconsistent publisher-supplied data — particularly around ISBN variants and imprint hierarchies. AWS Glue cataloged and transformed incoming data; Amazon Athena enabled querying without requiring a separate analytical database.
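To make the ISBN normalization step more concrete, here is a minimal Python sketch of one common approach: strip formatting, validate check digits, and store every identifier in ISBN-13 form so that records from different systems can be compared. The function names are hypothetical and this is not the project's actual normalization logic; only the check-digit arithmetic comes from the ISBN standard.

```python
import re
from typing import Optional


def isbn13_check_digit(first12: str) -> str:
    """Compute the ISBN-13 check digit from the first 12 digits."""
    # Weights alternate 1, 3, 1, 3, ... across the 12 digits.
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(first12))
    return str((10 - total % 10) % 10)


def isbn10_is_valid(isbn10: str) -> bool:
    """Validate an ISBN-10, where a trailing 'X' stands for the value 10."""
    if not re.fullmatch(r"\d{9}[\dX]", isbn10):
        return False
    # Weights run from 10 down to 1; the weighted sum must be divisible by 11.
    total = sum((10 - i) * (10 if c == "X" else int(c)) for i, c in enumerate(isbn10))
    return total % 11 == 0


def normalize_isbn(raw: str) -> Optional[str]:
    """Return a canonical ISBN-13 string, or None if the input is not a valid ISBN.

    Hypothetical helper: the name and the None-on-failure convention are
    assumptions made for this sketch.
    """
    cleaned = re.sub(r"[\s-]", "", raw).upper()
    if len(cleaned) == 10 and isbn10_is_valid(cleaned):
        # ISBN-10 converts to ISBN-13 by prefixing 978 and recomputing the check digit.
        first12 = "978" + cleaned[:9]
        return first12 + isbn13_check_digit(first12)
    if re.fullmatch(r"\d{13}", cleaned) and isbn13_check_digit(cleaned[:12]) == cleaned[12]:
        return cleaned
    return None
```

With this sketch, `normalize_isbn("0-306-40615-2")` and `normalize_isbn("978-0-306-40615-7")` both return `"9780306406157"`, so a publisher feed that uses the older 10-digit form still matches a catalog record stored in 13-digit form. Returning `None` for invalid input lets an ingestion job route bad identifiers to a review queue instead of silently dropping or mis-linking them.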

Title unification by ISBN was handled at the Amazon QuickSight layer, with Amazon Q Chat used to associate alternate ISBNs across library, trade, and audio editions to a parent title record. This preserved each business unit's distinct view of the same title while surfacing unified intelligence through system-specific dashboards — eliminating the need to log into multiple systems for cross-title queries.
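The parent-title pattern described above can be sketched as a small data model in Python. This is an illustration under stated assumptions, not the QuickSight or Q Chat implementation: the `work_id` linking key, the class and function names, the placeholder ISBNs, and the assignment of editions to Myriad or Jedi are all hypothetical. The point is that every edition keeps its own ISBN while also resolving to one parent record.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Edition:
    isbn13: str          # normalized identifier for this specific edition
    edition_type: str    # e.g. "library", "trade", or "audio"
    source_system: str   # system the record came from (illustrative values below)
    work_id: str         # hypothetical key tying editions to one parent title


@dataclass
class TitleRecord:
    work_id: str
    editions: List[Edition] = field(default_factory=list)

    def isbns_for(self, edition_type: str) -> List[str]:
        """Return only the ISBNs a given business unit cares about."""
        return [e.isbn13 for e in self.editions if e.edition_type == edition_type]


def unify_by_work(editions: Iterable[Edition]) -> Dict[str, TitleRecord]:
    """Group edition records under their parent title without discarding any ISBN."""
    titles: Dict[str, TitleRecord] = {}
    for edition in editions:
        record = titles.setdefault(edition.work_id, TitleRecord(edition.work_id))
        record.editions.append(edition)
    return titles


def parent_lookup(titles: Dict[str, TitleRecord]) -> Dict[str, str]:
    """Build a reverse index from any edition ISBN to its parent work_id."""
    return {e.isbn13: t.work_id for t in titles.values() for e in t.editions}


# Placeholder data: these ISBNs are made up and are not check-digit valid.
sample = [
    Edition("9780000000001", "library", "Myriad", "W1"),
    Edition("9780000000002", "trade", "Jedi", "W1"),
    Edition("9780000000003", "audio", "Jedi", "W1"),
    Edition("9780000000004", "trade", "Jedi", "W2"),
]

titles = unify_by_work(sample)
```

Here `titles["W1"].isbns_for("audio")` gives the audio team only its own ISBN, while `parent_lookup(titles)` lets any edition's ISBN resolve to the shared parent. Keeping the editions as a list on the parent, instead of collapsing them to a single "primary" ISBN, is what preserves each business unit's distinct view.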

OUTCOMES
