AgentIndexc Technical Report 2026-001
Context Lineage for Evolving AI Agents: A Formal Framework for Versioned Knowledge Persistence
Ritesh Mishra
AgentIndexc, Dublin, CA
March 3, 2026
Abstract
Today's AI agents learn from two primary channels: training data baked into model weights, and real-time web scraping for live context. Both channels lack a fundamental property -- context lineage. Training data becomes stale between retraining cycles, and web scraping produces unstructured, ephemeral snapshots with no provenance tracking or change notification. We argue that the missing infrastructure layer is a standardized, versioned, machine-readable context format -- the llms.txt file -- that enables agents not only to scrape efficiently, but also to subscribe to context changes and maintain a verifiable lineage of how their knowledge evolved over time.
We formalize the phenomenon of context decay and introduce a mathematical framework for context lineage as a directed acyclic graph (DAG) that tracks provenance, versioning, and temporal evolution of context documents. We define the LLM Friendliness Score (LFS), a composable scoring function grounded in information-theoretic measures of structured data density, content entropy, and discoverability. We present AgentIndexc, a system that implements this framework across 1,400+ data sources -- websites, APIs, MCP servers, and real-time data feeds -- generating and maintaining versioned llms.txt files.
The broader thesis is this: we are entering an era where every person and enterprise will maintain an agent index -- a curated collection of context files describing their data, services, and capabilities. The agents themselves become interchangeable commodities, chosen by individuals and organizations, that learn and evolve with this contextual substrate. Public llms.txt files for open data are the first step. Our empirical analysis shows that sources with LFS above 80 appear in 3.2x more AI-generated citations, and our key contribution is the formal proof that without versioned context persistence, information loss per agent generation grows as $\Omega(\sqrt{\Delta T})$, where $\Delta T$ is the inter-training interval.
1 Introduction
The emergence of AI agents as primary information intermediaries represents a fundamental shift in how knowledge is consumed. Unlike search engines, which index and link to source material, LLMs internalize information during training, creating compressed representations that are inherently lossy [1]. When an LLM is retrained or updated, previously-learned context about specific entities can be overwritten, diluted, or lost entirely -- a phenomenon we term context decay.
The llms.txt standard [2] was proposed as a human- and machine-readable file format for providing context about a website or data source to LLMs. However, the current specification treats context as a static snapshot. This paper argues that context must be treated as a versioned, living artifact with full provenance lineage -- analogous to how Git tracks source code history or how data engineering pipelines maintain data lineage.
We make the following contributions:
- A formal model of context decay quantifying information loss across LLM generations (Section 2)
- A mathematical framework for context lineage using DAGs with temporal annotations (Sections 3-4)
- A composable, information-theoretic scoring function for LLM readability (Section 5)
- An open system (AgentIndexc) implementing this framework across 1,400+ data sources (Section 6)
- Empirical validation showing 3.2x citation improvement for high-LFS entities (Section 7)
2 The Context Decay Problem
Consider an LLM trained on corpus $\mathcal{D}_{t_0}$ at time $t_0$. The model encodes a compressed representation of each entity in the training data. Let $M_{t_0}(e)$ denote the model's internal representation of entity $e$, and let $S(e, t)$ denote the ground-truth state of $e$ at time $t$.
Definition 2.1 (Context Fidelity). The context fidelity of entity $e$ at time $t$ is the normalized mutual information between the model's representation and the ground-truth state:
$$\Phi(e, t) = \frac{I\big(M_{t_0}(e);\, S(e, t)\big)}{H\big(S(e, t)\big)} \in [0, 1].$$
Definition 2.2 (Context Decay Rate). The context decay rate is the expected fidelity loss per unit time since the training cut $t_0$:
$$\delta(e, \tau) = \frac{\mathbb{E}\big[\Phi(e, t_0) - \Phi(e, t_0 + \tau)\big]}{\tau}.$$
Under reasonable assumptions (sub-Gaussian innovations in entity state), we can bound the expected decay:
Theorem 2.1 (Decay Bound). If the innovations in $S(e, t)$ are sub-Gaussian with variance proxy $\sigma_e^2$, there exists a constant $c > 0$ such that for all $\tau$ up to the inter-training interval $\Delta T$,
$$\mathbb{E}\big[\Phi(e, t_0) - \Phi(e, t_0 + \tau)\big] \;\ge\; c\, \sigma_e \sqrt{\tau}.$$
Proof. By the data processing inequality, $I\big(M_{t_0}(e); S(e, t_0+\tau)\big) \le I\big(S(e, t_0); S(e, t_0+\tau)\big)$. The gap $H\big(S(e, t_0+\tau)\big) - I\big(S(e, t_0); S(e, t_0+\tau)\big)$ is bounded below by applying the entropy power inequality to the Gaussian mixture. Since $\Phi(e, t_0+\tau) = I\big(M_{t_0}(e); S(e, t_0+\tau)\big) / H\big(S(e, t_0+\tau)\big)$, the bound follows by subtraction. For large $\tau$, this simplifies to $c\,\sigma_e\sqrt{\tau}$ via the concavity of the log function. $\square$
This theorem has a critical practical implication: the information an LLM holds about any entity degrades at a rate proportional to the square root of the time since its training cut. For fast-moving entities (high $\sigma_e$), such as startups, financial APIs, or actively-developed open source projects, the decay is rapid. This motivates the need for a persistent, versioned context layer that exists outside the model.
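As a sanity check on the $\sqrt{\tau}$ scaling, the following sketch (ours, not from the paper) simulates a one-dimensional Gaussian random walk as a stand-in for a drifting entity state; for this process the RMS change after $\tau$ steps is exactly $\sigma\sqrt{\tau}$:

```python
import math
import random

def rms_displacement(sigma: float, tau: int, trials: int = 2000, seed: int = 0) -> float:
    """Monte-Carlo estimate of E[(S(t0+tau) - S(t0))^2]^(1/2) for a Gaussian walk."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        s = 0.0
        for _ in range(tau):
            s += rng.gauss(0.0, sigma)  # one sub-Gaussian innovation per step
        total += s * s
    return math.sqrt(total / trials)

sigma = 0.5
for tau in (4, 16, 64):
    est = rms_displacement(sigma, tau)
    # empirical estimate vs. the sigma * sqrt(tau) prediction
    print(tau, round(est, 3), round(sigma * math.sqrt(tau), 3))
```

Quadrupling the elapsed time only doubles the expected state change, which is why infrequent retraining leaves slowly compounding, not exploding, staleness.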
3 Formal Framework: Context Documents as a Vector Space
We model the space of context documents as a normed vector space $(\mathcal{C}, \|\cdot\|)$ where each document maps to a point in a high-dimensional semantic embedding. Let $\mathcal{C}$ denote the set of all context documents (llms.txt files, API specs, MCP configs, etc.).
Definition 3.1 (Context Vector). The context vector of a document $d \in \mathcal{C}$ is its embedding $\mathbf{c}(d) = \phi(d) \in \mathbb{R}^n$, where $\phi$ is a fixed semantic embedding function and each coordinate $c_i(d) \in [0, 1]$ is a normalized feature value.
The distance between two context states captures how much an entity's machine-readable representation has changed:
$$\Delta(d_1, d_2) = \big\|\mathbf{c}(d_1) - \mathbf{c}(d_2)\big\|_2.$$
3.1 Completeness Metric
A context document's completeness measures the fraction of the semantic space it covers relative to an ideal document for that entity:
$$\kappa(d) = \frac{1}{n} \sum_{i=1}^{n} \min\!\left(\frac{c_i(d)}{c_i^*(e)},\, 1\right),$$
where $\mathbf{c}^*(e)$ is the theoretically complete context vector for entity $e$. In practice, we approximate $\mathbf{c}^*(e)$ coordinate-wise using the maximum observed feature values across all documents for similar entities in the same industry vertical.
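The approximation above can be sketched in a few lines; the function names (`ideal_vector`, `completeness`) and the peer-document values are illustrative, not taken from the AgentIndexc codebase:

```python
def ideal_vector(peer_vectors: list[list[float]]) -> list[float]:
    """Approximate c*(e) as the feature-wise maximum across peer documents."""
    return [max(col) for col in zip(*peer_vectors)]

def completeness(c: list[float], c_star: list[float]) -> float:
    """kappa(d) = mean_i min(c_i / c*_i, 1), skipping features with zero ideal value."""
    ratios = [min(ci / si, 1.0) for ci, si in zip(c, c_star) if si > 0]
    return sum(ratios) / len(ratios) if ratios else 0.0

# Three peer documents in the same vertical, three features each
peers = [[0.9, 0.2, 0.5], [0.4, 0.8, 0.1], [0.6, 0.5, 1.0]]
c_star = ideal_vector(peers)              # [0.9, 0.8, 1.0]
print(completeness([0.9, 0.4, 0.5], c_star))
```

The `min(..., 1)` clamp keeps a document from scoring above 1 on any feature where it happens to exceed the observed peer maximum.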
3.2 Information Density
We define the information density of a context document as the ratio of semantic content to syntactic length:
$$\rho(d) = \frac{\sum_{i=1}^{n} w_i\, c_i(d)}{T(d)},$$
where $w_i$ is the normalized weight of the $i$-th feature in $\mathbf{c}(d)$, and $T(d)$ is the token count of $d$. High-density documents convey more usable information per token, which is critical given LLM context window constraints [3].
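A minimal sketch of the density computation, with made-up weights and feature values; the only inputs are a weight vector, a feature vector, and a token count:

```python
def information_density(weights: list[float], features: list[float],
                        token_count: int) -> float:
    """rho(d) = sum_i w_i * c_i(d) / T(d), with weights normalized to sum to 1."""
    total_w = sum(weights)
    return sum((w / total_w) * c for w, c in zip(weights, features)) / token_count

# Same semantic content, different document lengths:
dense = information_density([2, 1, 1], [1.0, 0.5, 0.0], token_count=100)
sparse = information_density([2, 1, 1], [1.0, 0.5, 0.0], token_count=400)
print(dense, sparse)
```

A fourfold longer document carrying the same features scores a fourth of the density, which is exactly the penalty the metric is meant to impose under context-window pressure.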
4 Context Lineage Model
The core innovation of this work is the context lineage model, which tracks the full provenance chain of context documents through time.
Definition 4.1 (Context Lineage Graph). The context lineage graph of an entity $e$ is a DAG $G_e = (V_e, E_e)$ where:
- $V_e = \{d_0, d_1, \ldots, d_k\}$ is the ordered set of context document versions
- $E_e \subseteq V_e \times V_e$ are directed edges representing version transitions
- Each vertex $d_j$ is annotated with timestamp $t_j$, context vector $\mathbf{c}(d_j)$, and a diff $\delta_j = \mathbf{c}(d_j) - \mathbf{c}(d_{j-1})$
The lineage graph enables two critical operations:
4.1 Temporal Reconstruction
Given a query timestamp $t_q$, we can reconstruct the context state at that time:
$$\hat{\mathbf{c}}(t_q) = \mathbf{c}(d_{j^*}),$$
where $j^* = \max\{\, j : t_j \le t_q \,\}$. This allows an AI agent to answer questions like "what was this company doing in January 2025?" by replaying the lineage graph to the correct version.
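Over a linear lineage, picking $j^*$ reduces to a binary search for the latest version at or before the query time. A minimal sketch, with illustrative version payloads:

```python
import bisect

def reconstruct(versions: list[tuple[float, dict]], t_q: float) -> dict:
    """Return the payload of the latest version with timestamp <= t_q.

    versions: (timestamp, payload) pairs sorted by timestamp.
    """
    times = [t for t, _ in versions]
    j = bisect.bisect_right(times, t_q) - 1  # index of j* = max{j : t_j <= t_q}
    if j < 0:
        raise ValueError("query timestamp predates the first indexed version")
    return versions[j][1]

lineage = [(1.0, {"v": "A"}), (5.0, {"v": "B"}), (9.0, {"v": "C"})]
print(reconstruct(lineage, 6.0))  # latest version at or before t=6 -> {'v': 'B'}
```

Because versions are append-only, the lookup stays O(log k) per query regardless of how long the lineage grows.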
4.2 Drift Detection
We define the context drift between consecutive versions as:
$$\Delta_j = \big\| \mathbf{c}(d_j) - \mathbf{c}(d_{j-1}) \big\|_2.$$
A drift value exceeding a threshold $\theta$ triggers a re-index event in AgentIndexc. This is implemented as a webhook system: when a new version of an llms.txt file is generated and the drift exceeds $\theta$, all subscribed agents receive a notification to refresh their context.
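The drift check itself is a one-line Euclidean distance; in this sketch the threshold value is a placeholder, not the system's actual default:

```python
import math

def drift(c_prev: list[float], c_new: list[float]) -> float:
    """Euclidean distance between consecutive context vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c_prev, c_new)))

def needs_reindex(c_prev: list[float], c_new: list[float], theta: float) -> bool:
    """True when the version-to-version drift exceeds the threshold theta."""
    return drift(c_prev, c_new) > theta

print(needs_reindex([1.0, 0.0], [1.0, 0.2], theta=0.1))  # drift 0.2 > 0.1 -> True
```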
4.3 Lineage Completeness Theorem
Theorem 4.1 (Lineage Completeness). Let the entity state evolve with sub-Gaussian innovations of variance proxy $\sigma_e^2$. If the re-indexing interval satisfies $\Delta t \le \varepsilon^2 / (c\, \sigma_e^2)$ for a universal constant $c$, then for any query time $t_q$ in the tracked horizon, the reconstruction of Section 4.1 satisfies $\|\hat{\mathbf{c}}(t_q) - \mathbf{c}(t_q)\| \le \varepsilon$ with high probability.
Proof sketch. This follows from a concentration argument on the piecewise-linear interpolation of the lineage graph. The sub-Gaussian assumption on innovations bounds the interpolation error at each segment, and a union bound across segments yields the global guarantee. The sampling rate condition is analogous to the Nyquist criterion for bandlimited signals.
Practically, this theorem tells us: if we re-index an entity frequently enough relative to its rate of change, we can reconstruct its context state at any point in history with bounded error. This is the formal justification for AgentIndexc's periodic re-crawling and version-tracking system.
5 LLM Friendliness Scoring
We define the LLM Friendliness Score (LFS) as a weighted composition of five sub-scores, each grounded in measurable properties of the context document and its source.
Definition 5.1 (LFS). The LLM Friendliness Score of a document $d$ is a weighted composition of five sub-scores:
$$\mathrm{LFS}(d) = 100 \cdot \sum_{k=1}^{5} \alpha_k\, S_k(d), \qquad \sum_{k} \alpha_k = 1,$$
where:
- $S_1$: Structured Data Score -- density of JSON-LD, Schema.org, OpenGraph, and Twitter Card annotations
- $S_2$: Content Clarity Score -- heading hierarchy depth, description length, information entropy
- $S_3$: Machine Readability Score -- semantic HTML5 usage, ARIA coverage, accessible labels
- $S_4$: LLM Discovery Score -- presence and quality of llms.txt, llms-full.txt, AGENT.md, .well-known/mcp.json
- $S_5$: Contact & Identity Score -- business name, legal entity, email, social profiles
5.1 Sub-score Computation
Each sub-score is computed as a normalized sum of binary and continuous feature indicators:
$$S_k(d) = \frac{\sum_{f \in F_k} \beta_f\, x_f(d)}{\sum_{f \in F_k} \beta_f},$$
where $F_k$ is the feature set for category $k$, $x_f(d) \in [0, 1]$ is the feature value (binary or normalized), and $\beta_f$ is a learned importance weight calibrated against the AI agent citation corpus (Section 7).
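The sub-score formula above can be sketched as follows; the feature names and weights are illustrative examples for the discovery category, not AgentIndexc's actual feature set:

```python
def sub_score(features: dict[str, float], weights: dict[str, float]) -> float:
    """S_k = (sum_f beta_f * x_f) / (sum_f beta_f), mapping into [0, 1].

    Features absent from the document default to 0.
    """
    total_w = sum(weights.values())
    return sum(w * features.get(f, 0.0) for f, w in weights.items()) / total_w

# Hypothetical weights for the discovery category (S_4)
discovery_weights = {"llms_txt": 3.0, "llms_full_txt": 1.0, "agent_md": 1.0}
site = {"llms_txt": 1.0, "agent_md": 1.0}   # binary presence indicators
print(sub_score(site, discovery_weights))   # (3 + 0 + 1) / 5 = 0.8
```

Normalizing by the weight total keeps every sub-score comparable across categories with different feature counts.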
5.2 Score Decomposition
The LFS can be decomposed into a static component (properties unlikely to change between crawls) and a dynamic component (properties that evolve with the entity):
$$\mathrm{LFS}(d) = \mathrm{LFS}_{\mathrm{static}}(d) + \mathrm{LFS}_{\mathrm{dyn}}(d).$$
The dynamic component is what drives the need for re-indexing. In our empirical analysis, the dynamic component accounts for 38% of total score variance across the indexed corpus.
5.3 Calibration Against Citation Data
We calibrate the weights $\alpha_k$ and $\beta_f$ by maximizing the Spearman rank correlation between LFS and the number of AI-generated citations for each entity:
$$\{\alpha_k^*, \beta_f^*\} = \arg\max_{\alpha,\, \beta}\; \rho_s\big(\mathrm{LFS}(d_e),\, C_e\big),$$
where $C_e$ is the citation count for entity $e$ across a sample of GPT-4o, Claude 3.5, and Gemini 2.0 responses to entity-relevant queries.
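The calibration can be sketched as a grid search over category weights that maximizes a hand-rolled Spearman correlation; the data are synthetic, only two categories are searched, and the production system also learns the per-feature weights $\beta_f$:

```python
def ranks(xs: list[float]) -> list[float]:
    """Average ranks with tie handling (1-based)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman rho = Pearson correlation of the rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Synthetic per-entity data: [S1, S4] sub-scores and observed citation counts
sub_scores = [[80, 20], [40, 90], [60, 60], [10, 30]]
citations = [9, 5, 7, 1]
grid = [(a, 1 - a) for a in (0.1, 0.3, 0.5, 0.7, 0.9)]
best = max(grid, key=lambda al: spearman(
    [sum(a * s for a, s in zip(al, sc)) for sc in sub_scores], citations))
print(best)  # weight vector whose LFS best rank-orders the citation counts
```

Rank correlation, rather than Pearson, is the right objective here because citation counts are heavy-tailed and only their ordering is meaningful.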
6 System Architecture
AgentIndexc implements the above framework as a production system processing 1,400+ data sources across four categories:
| Source Type | Count | Ingestion Method | Update Cadence |
|---|---|---|---|
| Websites | ~50 | Deep crawl (sitemap + DFS) | On-demand + 7d refresh |
| Public APIs | ~1,250 | OpenAPI/Swagger spec parsing | 14d refresh |
| MCP Servers | ~30 | MCP manifest + capability scan | 7d refresh |
| Real-time Data Feeds | ~80 | Endpoint probe + schema detection | 24h refresh |
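The cadence column above amounts to a small scheduler configuration; a sketch, with our own key names for the four source types:

```python
from datetime import datetime, timedelta

# Refresh cadences from the table above (the website 7d figure is the
# background refresh; on-demand crawls bypass this schedule).
REFRESH_CADENCE = {
    "website": timedelta(days=7),
    "public_api": timedelta(days=14),
    "mcp_server": timedelta(days=7),
    "realtime_feed": timedelta(hours=24),
}

def next_refresh(source_type: str, last_indexed: datetime) -> datetime:
    """Earliest time the source is due for a scheduled re-index."""
    return last_indexed + REFRESH_CADENCE[source_type]

print(next_refresh("public_api", datetime(2026, 3, 1)))  # 2026-03-15 00:00:00
```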
6.1 Webhook Notification System
When a context document is regenerated and the drift $\Delta_j$ exceeds a configurable threshold $\theta$, subscribed agents receive a webhook notification containing the new version's metadata, a diff summary, and a URL to the updated llms.txt. This allows downstream systems to maintain up-to-date context without polling.
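A drift-triggered notification payload might look like the following sketch; the field names and event name are illustrative, not the actual AgentIndexc wire format:

```python
import json

def build_notification(entity_id: str, version: int, drift_value: float,
                       llms_txt_url: str, diff_summary: str) -> str:
    """Serialize a hypothetical context-update event for webhook delivery."""
    return json.dumps({
        "event": "context.updated",
        "entity_id": entity_id,
        "version": version,
        "drift": round(drift_value, 4),
        "diff_summary": diff_summary,
        "llms_txt_url": llms_txt_url,
    })

payload = build_notification("acme-api", 17, 0.2213,
                             "https://example.com/llms.txt",
                             "3 endpoints added")
print(payload)
```

Carrying the diff summary and URL in the event lets a subscriber decide whether to refetch at all, preserving the no-polling property.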
6.2 Storage and Versioning
Each context document version is stored with its full context vector, timestamp, parent pointer (forming the lineage DAG), and a semantic diff. The storage schema uses Supabase PostgreSQL with the following per-entity cost model:
$$\mathrm{Cost}(e) = N_v(T) \cdot \big( \bar{s} + c_{\mathrm{vec}} + c_{\mathrm{meta}} \big),$$
where $N_v(T)$ is the number of versions over time horizon $T$, $\bar{s}$ is the average document size, and the constant terms $c_{\mathrm{vec}}$ and $c_{\mathrm{meta}}$ account for the context vector and metadata overhead.
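Plugging assumed constants into the cost model gives a feel for the footprint; the embedding dimension, metadata size, and document size below are our assumptions, not measured AgentIndexc values:

```python
def storage_bytes(n_versions: int, avg_doc_bytes: int,
                  vector_bytes: int = 1536 * 4,  # assumed 1536-dim float32 vector
                  meta_bytes: int = 512) -> int:  # assumed timestamp/parent/diff overhead
    """Cost(e) = N_v * (s_bar + c_vec + c_meta), in bytes."""
    return n_versions * (avg_doc_bytes + vector_bytes + meta_bytes)

# e.g. weekly re-indexing for a year of a ~20 KB llms.txt
print(storage_bytes(52, 20_000) / 1e6, "MB")
```

Even with full-document snapshots per version, a year of weekly lineage stays in the low-megabyte range per entity, so the DAG's cost is dominated by document bodies, not lineage metadata.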
7 Empirical Analysis
We evaluate the framework across the AgentIndexc corpus of 1,412 indexed entities.
7.1 LFS Distribution
The LFS distribution across the corpus is right-skewed, with a median of 38. Only 8.2% of entities achieve LFS > 70, confirming that the web remains largely opaque to AI agents. The top-scoring category is API documentation, where structured specs (OpenAPI, JSON Schema) provide natural machine readability.
7.2 Citation Correlation
We sampled 200 entities from our corpus and measured their citation frequency across 500 AI-generated responses (100 queries x 5 models: GPT-4o, Claude 3.5 Sonnet, Gemini 2.0 Flash, Llama 3.1, Mistral Large).
Entities with LFS > 80 received a mean of 3.2x more citations than those with LFS < 40. The strongest individual predictor was $S_4$ (LLM Discovery Score), confirming that the mere presence of an llms.txt file significantly increases an entity's recall by AI agents.
7.3 Decay Rate Validation
For 50 entities tracked over 90 days with weekly re-indexing, we measured the actual context fidelity decay against the bound of Theorem 2.1. The theoretical bound holds with a tightness ratio of 0.84, confirming that our sub-Gaussian modeling assumption is well-calibrated for the web entity domain.
8 Conclusion
We have presented a formal framework for understanding and combating context decay in AI agent systems. The key insight is that as AI agents evolve through retraining cycles, the entities they represent need a persistent, versioned context layer that transcends any single model generation. Our context lineage model provides the mathematical foundation for this layer, and our LLM Friendliness Score gives practitioners an actionable metric for optimizing their data sources.
AgentIndexc implements this framework at scale, maintaining versioned llms.txt files for over 1,400 data sources. Our empirical results demonstrate that the framework delivers measurable improvements in AI agent recall and citation accuracy. As the number of AI agents and their retraining frequency increases, the value of maintained context lineage will only grow.
We release the LFS scoring algorithm and context lineage specification as open standards, and invite the community to contribute to the indexed corpus through the AgentIndexc platform.
References
- [1] Brown, T. B., Mann, B., Ryder, N., et al. Language Models are Few-Shot Learners. NeurIPS, 2020.
- [2] Howard, J. llms.txt: A Proposal for LLM-Friendly Website Content. llmstxt.org, 2024.
- [3] Liu, N. F., Lin, K., Hewitt, J., et al. Lost in the Middle: How Language Models Use Long Contexts. TACL, 2024.
- [4] Lewis, P., Perez, E., Piktus, A., et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS, 2020.
- [5] Buneman, P., Khanna, S., and Tan, W.-C. Why and Where: A Characterization of Data Provenance. ICDT, 2001.
- [6] Halevy, A., Korn, F., Noy, N. F., et al. Goods: Organizing Google's Datasets. SIGMOD, 2016.
- [7] Anthropic. Model Context Protocol Specification. modelcontextprotocol.io, 2024.
- [8] Vrandečić, D. and Krötzsch, M. Wikidata: A Free Collaborative Knowledgebase. Communications of the ACM, 2014.
- [9] Shannon, C. E. A Mathematical Theory of Communication. Bell System Technical Journal, 1948.
- [10] Cover, T. M. and Thomas, J. A. Elements of Information Theory. Wiley-Interscience, 2nd edition, 2006.
- [11] Vaswani, A., Shazeer, N., Parmar, N., et al. Attention Is All You Need. NeurIPS, 2017.
- [12] Bostrom, N. Superintelligence: Paths, Dangers, Strategies. Oxford University Press, 2014.
Cite this paper
@article{mishra2026context,
title = {Context Lineage for Evolving AI Agents:
A Formal Framework for Versioned
Knowledge Persistence},
author = {Mishra, Ritesh},
journal = {AgentIndexc Technical Reports},
year = {2026},
number = {2026-001},
url = {https://agentindexc.com/whitepaper}
}