RAG Content Provenance — Use Case

Problem

Enterprise RAG systems index internal policies, contracts, clinical trial protocols, SOPs, and regulatory documents daily. These documents are often the final authority that AI agents cite in business decisions and customer responses.

At the moment of indexing, documents undergo a quiet transformation:

Identity with originals is implicit. There is no verification path to confirm that a chunk in the index matches the "authoritative copy" held by the issuer.
Issuer signatures are stripped. Metadata indicating who issued the document, when, and under what authority is largely lost during the indexing process.
Version history collapses. When the RAG index is rebuilt, reproducing which version was "authoritative" at a given historical point becomes practically impossible.
Post-ingestion tampering goes undetected. The index itself is defenseless against storage-level writes or intentional document swaps by insiders.

The EU AI Act, ISO 42001, pharmaceutical GxP regulations, and FSA AI governance guidelines are all moving toward requiring that the sources behind AI decisions and AI-presented information be demonstrably authentic. Unless provenance is fixed at the point of indexing, no amount of after-the-fact audit logging can establish what those sources actually were.

Scenario

A pharmaceutical company's clinical team continuously indexes ongoing trial protocols, SOPs, and regulatory submission documents into an internal RAG. An AI agent responds to healthcare professional inquiries by citing the applicable protocol.

Eighteen months into the trial, a regulatory audit arrives. The auditor demands: "Prove that the AI's response to healthcare professionals on August 18, 2026 was based on the protocol version approved at that time."

Over those 18 months, the protocol has been revised 7 times. The RAG index has been rebuilt with each update, so no historical state remains. The document management ledger suggests "v3.2 should have been current on August 18," but there is no cryptographic evidence. Auditors do not accept estimates.

With Lemma, the following would have been recorded at the moment each document was indexed:

Issuer signature and issuance timestamp
Original document docHash and CID
Cryptographic binding between each indexed chunk and the original
Indexing timestamp
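
As a rough illustration of the four items above, the sketch below assembles one record per indexed document. It is not Lemma's actual data model: the `IndexingRecord` type, the `build_record` helper, and the choice of SHA-256 for docHash and chunk digests are all assumptions made for this example, and the issuer signature and CID are treated as opaque values supplied from elsewhere.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class IndexingRecord:
    """Hypothetical record captured when one document is indexed."""
    issuer_signature: bytes         # produced by the issuer; opaque here
    issued_at: datetime             # issuance timestamp
    doc_hash: str                   # SHA-256 of the original bytes (assumed)
    cid: str                        # content identifier, supplied by storage
    chunk_hashes: tuple[str, ...]   # one digest per indexed chunk
    indexed_at: datetime            # indexing timestamp


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def build_record(original, chunks, issuer_signature, issued_at, cid, indexed_at=None):
    """Hash the original and each chunk, then freeze the result."""
    return IndexingRecord(
        issuer_signature=issuer_signature,
        issued_at=issued_at,
        doc_hash=sha256_hex(original),
        cid=cid,
        chunk_hashes=tuple(sha256_hex(c) for c in chunks),
        indexed_at=indexed_at or datetime.now(timezone.utc),
    )
```

Because the record stores digests rather than content, it can be kept next to the index without exposing the original document.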

When the AI agent claims "On August 18, I cited Section 4.2 of protocol v3.2," the regulator can independently verify that the citation was bit-for-bit identical to the original at that time. No matter how many times the RAG index is rebuilt, the authoritative historical state remains permanently verifiable.

The auditor sees not estimates, but cryptographically sealed facts.

Architecture

Lemma's four cryptographic layers cover the entire RAG document lifecycle.

1. ENCRYPT — Sealing at Ingestion Time

At the moment a document enters the indexing pipeline, the original is encrypted with AES-GCM. The original remains under the issuer's control; only docHash and CID flow into the RAG infrastructure. The indexing platform never holds the original content in plaintext.

2. PROVE — Cryptographic Binding of Index to Original

A ZK circuit produces a proof that binds four elements together: (a) the issuer signature, (b) the docHash, (c) the generated embedding vectors, and (d) the indexed chunk set. Afterwards, a verifier can check which original a chunk came from, and from where in that original, without the original being disclosed.
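
The ZK circuit itself is beyond a short example, but the idea of binding a chunk set to a single commitment can be sketched with a plain Merkle tree. This is a swapped-in, non-zero-knowledge technique used only for illustration: it covers the chunk set alone, proves nothing about the issuer signature or embeddings, and the function names are hypothetical.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def _leaf_level(chunks):
    # Domain-separate leaves (0x00) from inner nodes (0x01).
    return [_h(b"\x00" + c) for c in chunks]


def _parent_level(level):
    return [_h(b"\x01" + level[i] + level[i + 1]) for i in range(0, len(level), 2)]


def merkle_root(chunks):
    """Commitment to the whole chunk set; odd nodes are paired with themselves."""
    if not chunks:
        raise ValueError("at least one chunk is required")
    level = _leaf_level(chunks)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = _parent_level(level)
    return level[0]


def merkle_proof(chunks, index):
    """Sibling path for chunks[index] as (sibling_hash, sibling_is_left) pairs."""
    if not 0 <= index < len(chunks):
        raise IndexError("chunk index out of range")
    level = _leaf_level(chunks)
    path = []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        sibling = index ^ 1
        path.append((level[sibling], sibling < index))
        level = _parent_level(level)
        index //= 2
    return path


def verify_chunk(chunk, path, root):
    """Recompute the root from one chunk and its sibling path."""
    node = _h(b"\x00" + chunk)
    for sibling, sibling_is_left in path:
        node = _h(b"\x01" + sibling + node) if sibling_is_left else _h(b"\x01" + node + sibling)
    return node == root
```

A verifier holding only the root, one chunk, and its path can confirm the chunk belongs to the committed set without seeing the other chunks. Hiding the chunk itself, or proving statements about the signature, is where a ZK proof would be needed.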

3. DISCLOSE — Selective Disclosure per Verifier

At audit time, the verifier's authority determines the disclosure scope. The regulator receives full chunks and issuer signatures; internal auditors receive metadata only; AI response viewers receive only a proof-of-existence for the cited source. Each disclosure is enforced with issuer signatures.
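
The three disclosure tiers can be pictured as a simple policy table. The sketch below is a plain field filter with hypothetical field names and a hypothetical `disclose` helper; it illustrates the scoping only, not the signature-based enforcement described above.

```python
from enum import Enum


class Verifier(Enum):
    REGULATOR = "regulator"
    INTERNAL_AUDITOR = "internal_auditor"
    VIEWER = "viewer"


# Field names are illustrative; the split mirrors the three tiers in the text.
DISCLOSURE_POLICY = {
    Verifier.REGULATOR: {"chunks", "issuer_signature"},
    Verifier.INTERNAL_AUDITOR: {"metadata"},
    Verifier.VIEWER: {"existence_proof"},
}


def disclose(record: dict, verifier: Verifier) -> dict:
    """Return only the fields the given verifier is allowed to see."""
    allowed = DISCLOSURE_POLICY[verifier]
    return {key: value for key, value in record.items() if key in allowed}
```

Keeping the policy as data rather than branching logic makes it easy to review which verifier sees which attribute.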

4. PROVENANCE — Permanent Historical Record

docHash, CID, issuer signature, indexing timestamp, and chunk bindings are anchored on-chain. Even if RAG indexes, vector stores, and LLM backends are entirely replaced, the question of which document was authoritative at a given point in time remains permanently verifiable.
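
The point-in-time question can be sketched as a lookup over an append-only, time-ordered log. The `ProvenanceLog` class below is a hypothetical in-memory stand-in for the on-chain anchor, and the versions and dates used with it are illustrative rather than taken from a real ledger.

```python
from bisect import bisect_right
from datetime import datetime, timezone


class ProvenanceLog:
    """In-memory stand-in for an append-only anchored log (hypothetical)."""

    def __init__(self):
        self._times = []     # effective timestamps, strictly increasing
        self._entries = []   # (version, doc_hash) pairs, same order

    def anchor(self, effective_at, version, doc_hash):
        """Append a new authoritative version; history is never rewritten."""
        if self._times and effective_at <= self._times[-1]:
            raise ValueError("entries must be appended in chronological order")
        self._times.append(effective_at)
        self._entries.append((version, doc_hash))

    def authoritative_at(self, when):
        """Return the (version, doc_hash) in force at `when`, or None."""
        i = bisect_right(self._times, when)
        return self._entries[i - 1] if i else None


# Illustrative usage with made-up effective dates and placeholder hashes.
log = ProvenanceLog()
log.anchor(datetime(2026, 6, 1, tzinfo=timezone.utc), "v3.1", "hash-of-v3.1")
log.anchor(datetime(2026, 8, 1, tzinfo=timezone.utc), "v3.2", "hash-of-v3.2")
log.anchor(datetime(2026, 9, 10, tzinfo=timezone.utc), "v3.3", "hash-of-v3.3")
current = log.authoritative_at(datetime(2026, 8, 18, tzinfo=timezone.utc))
```

Because the lookup depends only on the anchored log, it gives the same answer regardless of how many times the index has been rebuilt since.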

┌──────────────────────────────────────────────────────────┐
│  Document Sources (policies, contracts, protocols, SOPs)  │
└───────────────────────┬──────────────────────────────────┘
                        │ Indexing pipeline input
                        ▼
┌──────────────────────────────────────────────────────────┐
│  ENCRYPT (AES-GCM)                                       │
│  • Encrypt original document                              │
│  • Seal issuer signature                                  │
│  → Only docHash + CID flow into RAG infrastructure        │
│  → Original content never held in plaintext               │
└───────────────────────┬──────────────────────────────────┘
                        │ docHash + CID + chunks
                        ▼
┌──────────────────────────────────────────────────────────┐
│  PROVE (ZK Circuit)                                      │
│  Binding: (a) issuer signature (b) docHash                │
│           (c) embedding vectors (d) chunk set              │
│  → Proves "which original this chunk came from, where"    │
│  → Verifiable without disclosing original                 │
└───────────────────────┬──────────────────────────────────┘
                        │ ZK proof + chunk binding
                        ▼
┌──────────────────────────────────────────────────────────┐
│  DISCLOSE (Selective Disclosure)                          │
│  Regulator → full chunks + issuer signature               │
│  Internal auditor → metadata only                         │
│  Viewer → proof-of-existence for cited source             │
└───────────────────────┬──────────────────────────────────┘
                        │ Disclosed attributes
                        ▼
┌──────────────────────────────────────────────────────────┐
│  PROVENANCE (On-chain)                                   │
│  docHash / CID / issuer signature / indexing timestamp    │
│  / chunk bindings                                         │
│  → Immutable even if index/vector store/LLM are replaced  │
└──────────────────────────────────────────────────────────┘

Proven Facts

Lemma cryptographically guarantees the following facts in RAG content provenance:

Document issuer identity and signature
Original document docHash and CID (storage-level identity)
Indexing timestamp
Version number that the issuer approved as current at that time
Cryptographic binding between original and indexed chunks
Authoritative state at a given historical point
Post-ingestion tamper detection, verifiable without disclosing originals

Get Started

Ready to prove?

Talk to us about your use case. We respond within one business day.

Related Use Cases

RAG Source Attestation — Citation Verification

AI Audit Log Proof — Permanent Decision Record

Supply Chain Component Provenance — Multi-Tier Tamper Resistance
