Context Poisoning in RAG Systems: Hands-on Examples

This notebook demonstrates three common patterns of context poisoning in RAG systems and how to defend against them using Elasticsearch's search capabilities.

What you'll learn:

Temporal Degradation: Filter outdated documents with range queries
Information Conflicts: Prioritize relevant context with metadata boosting
Semantic Noise: Eliminate irrelevant results with product filters

Requirements:

Elasticsearch 9.x or higher
Jina embeddings v3 inference endpoint (created automatically)
Python 3.8+

Section 1: Setup and Configuration

Install Dependencies

First, let's install the required Python packages.

[1]

Connect to Elasticsearch

To run this notebook, you need an Elasticsearch deployment.

Don't have one? Sign up for a free Elastic Cloud trial.

Set the following environment variables:

ES_URL: Your Elasticsearch endpoint URL
ES_API_KEY: Your API key for authentication

[2]
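A minimal sketch of a connection cell, assuming the official `elasticsearch` Python client is installed. The helper names `load_es_settings` and `connect` are illustrative and not part of any library; only the two environment variables come from the text above.

```python
import os


def load_es_settings(env=None):
    """Read connection settings from an environment-style mapping."""
    env = os.environ if env is None else env
    missing = [name for name in ("ES_URL", "ES_API_KEY") if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {"hosts": env["ES_URL"], "api_key": env["ES_API_KEY"]}


def connect(env=None):
    """Create a client and print the server version."""
    # Imported lazily so load_es_settings works without the client installed.
    from elasticsearch import Elasticsearch

    client = Elasticsearch(**load_es_settings(env))
    print(f"Connected to Elasticsearch {client.info()['version']['number']}")
    return client


# client = connect()  # requires ES_URL and ES_API_KEY to be set
```

Failing early on a missing variable tends to give a clearer error than letting the client fail on its first request.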

Connected to Elasticsearch 9.2.1

Configure Inference Endpoint

We'll use Jina embeddings v3 for semantic search. The following cell will create the inference endpoint if it doesn't exist, or verify it's available if already created.

[3]

Inference endpoint 'jina-embeddings-v3' already exists

Helper Function

We'll use a reusable function to load JSON datasets and index them into Elasticsearch.

[4]
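One way such a helper could look, as a sketch rather than the notebook's actual implementation. It assumes each dataset is a JSON array of objects and uses `helpers.bulk` from the `elasticsearch` client; the function names and the `path` argument are illustrative.

```python
import json


def build_bulk_actions(index_name, docs):
    """Yield one bulk 'index' action per document."""
    for doc in docs:
        yield {"_index": index_name, "_source": doc}


def load_and_index(client, index_name, path):
    """Load a JSON array of documents from `path` and bulk-index them."""
    # Imported lazily so build_bulk_actions can be used without the client.
    from elasticsearch import helpers

    with open(path, encoding="utf-8") as f:
        docs = json.load(f)

    # refresh="wait_for" makes the documents searchable before returning.
    indexed, _ = helpers.bulk(
        client, build_bulk_actions(index_name, docs), refresh="wait_for"
    )
    print(f"Indexed {indexed} documents to {index_name}")
    return indexed
```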

Section 2: Temporal Degradation

The Problem

Outdated docs remain semantically similar to current queries but contain obsolete information. A query for "OAuth authentication" might retrieve docs from 6.x (Shield plugin), 7.x (legacy syntax), and 9.x (current)—all relevant, but only the latest is accurate.

The Solution

Use date range filters in your RRF query to exclude documents older than a threshold (e.g., 6 months).

Create the Index

We'll create an index for product documentation with a semantic_text field for vector search.

[5]
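A sketch of a mapping that would support the queries in this section. The field names `title`, `content`, `content_semantic`, and `version` are assumptions made for illustration; `last_updated` and `status` are taken from the filter shown later in this section, and the inference ID from the setup output above.

```python
def product_docs_mappings(inference_id="jina-embeddings-v3"):
    """Return index mappings with a semantic_text field for vector search."""
    return {
        "properties": {
            "title": {"type": "text"},
            # copy_to feeds the same text into the semantic_text field.
            "content": {"type": "text", "copy_to": "content_semantic"},
            "content_semantic": {
                "type": "semantic_text",
                "inference_id": inference_id,
            },
            "version": {"type": "keyword"},
            "status": {"type": "keyword"},
            "last_updated": {"type": "date"},
        }
    }


def create_index(client, index_name, mappings):
    """Recreate the index so the cell can be re-run safely."""
    client.indices.delete(index=index_name, ignore_unavailable=True)
    client.indices.create(index=index_name, mappings=mappings)
    print(f"Created index: {index_name}")


# create_index(client, "product-docs", product_docs_mappings())
```

Mapping `status` and `version` as `keyword` keeps them usable in exact-match `term` filters, and `last_updated` as `date` allows range queries with date math.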

Created index: product-docs
Indexed 15 documents to product-docs

Query WITHOUT Temporal Filter

First, let's see what happens when we query without any date filtering. The RRF (Reciprocal Rank Fusion) query combines semantic and keyword search but retrieves documents from all time periods.
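To make the fusion step concrete, here is a small pure-Python sketch of the RRF formula: each document scores the sum of 1 / (rank_constant + rank) over the rankings it appears in. The document IDs are made up for illustration, and 60 is used as the rank constant because it is the Elasticsearch default.

```python
def rrf_fuse(rankings, rank_constant=60):
    """Fuse ranked lists of doc IDs; return (doc_id, score) pairs, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank_constant + rank)
    # Sort by descending score, breaking ties by doc ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical rankings from the semantic and keyword retrievers.
semantic_ranking = ["shield-6x", "oauth-9x", "oidc-7x"]
keyword_ranking = ["oauth-9x", "oidc-7x", "shield-6x"]

for doc_id, score in rrf_fuse([semantic_ranking, keyword_ranking]):
    print(f"{doc_id}: {score:.4f}")
# Order: oauth-9x, shield-6x, oidc-7x
```

Note that RRF only looks at rank positions. Nothing in the formula knows how old a document is, which is why an outdated page that ranks well in either retriever can still land near the top.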

[6]
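A sketch of an unfiltered RRF request, assuming the field names `content_semantic`, `title`, and `content` (illustrative, not confirmed by the notebook). It builds the search body as a plain dict so it can be inspected before being sent.

```python
def build_rrf_search(query, size=5):
    """Combine a semantic and a keyword retriever with RRF, no filters."""
    return {
        "retriever": {
            "rrf": {
                "retrievers": [
                    {
                        "standard": {
                            "query": {
                                "semantic": {
                                    "field": "content_semantic",
                                    "query": query,
                                }
                            }
                        }
                    },
                    {
                        "standard": {
                            "query": {
                                "multi_match": {
                                    "query": query,
                                    "fields": ["title", "content"],
                                }
                            }
                        }
                    },
                ],
                "rank_window_size": 50,
            }
        },
        "size": size,
    }


# response = client.search(
#     index="product-docs",
#     **build_rrf_search("how to configure OAuth authentication"),
# )
```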

Query: 'how to configure OAuth authentication'
Filter: NONE
------------------------------------------------------------

1. Setting Up OAuth (Deprecated)
   Version: 6.x | Updated: 2022-03-15
   Configure OAuth via Shield plugin (deprecated, replaced by X-Pack)....

2. OAuth 2.0 Authentication Setup
   Version: 9.x | Updated: 2026-01-15
   Configure OAuth 2.0 in Elasticsearch 9.x using the security API via Stack Manage...

3. OAuth Authentication Configuration
   Version: 7.x | Updated: 2023-06-15
   Configure OAuth via elasticsearch.yml with xpack.security.authc.realms.oidc sett...

4. OAuth Realm Setup
   Version: 7.x | Updated: 2023-05-20
   Set up OAuth realm in xpack.security.authc.realms.oidc with op.* settings....

5. OAuth Client Registration
   Version: 7.x | Updated: 2023-04-10
   Register OAuth clients via rp.client_id and rp.client_secret in elasticsearch.ym...

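Without filtering, the results mix documents dated 2022 through 2026. A RAG system given this context would receive conflicting guidance: the deprecated Shield plugin, the legacy elasticsearch.yml configuration, and the current API-based setup.

Query WITH Temporal Filter

Now let's apply a date range filter, together with a status filter so that only published documents are returned.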

[7]
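A sketch of the filtered request, under the same assumed field names for the text fields (`content_semantic`, `title`, `content`). The filter fields `last_updated` and `status` and the `now-6M` threshold match the filter line in the output that follows. The filter is placed on the `rrf` retriever so that it applies to both sub-retrievers.

```python
def temporal_filter(max_age="6M", status="published"):
    """Keep only recent documents with the given status."""
    return {
        "bool": {
            "filter": [
                {"range": {"last_updated": {"gte": f"now-{max_age}"}}},
                {"term": {"status": status}},
            ]
        }
    }


def build_filtered_rrf_search(query, query_filter, size=5):
    """RRF over semantic + keyword retrievers, restricted by query_filter."""
    retrievers = [
        {"standard": {"query": {"semantic": {"field": "content_semantic", "query": query}}}},
        {"standard": {"query": {"multi_match": {"query": query, "fields": ["title", "content"]}}}},
    ]
    return {
        "retriever": {
            "rrf": {
                "retrievers": retrievers,
                "filter": query_filter,  # applied to every sub-retriever
                "rank_window_size": 50,
            }
        },
        "size": size,
    }


# response = client.search(
#     index="product-docs",
#     **build_filtered_rrf_search(
#         "how to configure OAuth authentication", temporal_filter()
#     ),
# )
```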

Query: 'how to configure OAuth authentication'
Filter: last_updated >= now-6M AND status = published
------------------------------------------------------------

1. OAuth 2.0 Authentication Setup
   Version: 9.x | Updated: 2026-01-15
   Configure OAuth 2.0 in Elasticsearch 9.x using the security API via Stack Manage...

2. OAuth Provider Configuration
   Version: 9.x | Updated: 2025-12-20
   Configure Okta, Azure AD, Auth0 via security API with OIDC auto-discovery....

3. OAuth Token Management
   Version: 9.x | Updated: 2026-01-10
   Manage OAuth tokens via /_security/oauth2/token endpoint for refresh and revocat...

4. OAuth Security Best Practices
   Version: 9.x | Updated: 2026-01-05
   OAuth best practices: short-lived tokens, PKCE, proper redirect URIs, state vali...

5. OAuth Troubleshooting Guide
   Version: 9.x | Updated: 2025-12-15
   Troubleshoot OAuth: check token expiration, redirect URIs, credentials, use debu...

All results are now from version 9.x, providing consistent and current OAuth configuration guidance.

Key Takeaway

Temporal filtering prevents context poisoning from outdated documentation:

Set appropriate staleness thresholds based on your documentation lifecycle
Consider recency boosting when you want a soft preference for newer documents instead of a hard cutoff
Mark evergreen content (core concepts) so it can be exempted from time filters
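The last two points can be sketched as query fragments. This is an illustration under assumptions: the boolean `evergreen` field is hypothetical and would need to exist in the mapping, and the 180-day decay scale is an arbitrary starting value, not a recommendation from the notebook.

```python
def recent_or_evergreen_filter(max_age="6M"):
    """Pass documents that are recent OR explicitly marked evergreen."""
    return {
        "bool": {
            "should": [
                {"range": {"last_updated": {"gte": f"now-{max_age}"}}},
                {"term": {"evergreen": True}},  # hypothetical field
            ],
            "minimum_should_match": 1,
        }
    }


def recency_boosted(query_clause, scale="180d"):
    """Soft preference: scale scores down as last_updated gets older."""
    return {
        "function_score": {
            "query": query_clause,
            "functions": [
                {
                    "gauss": {
                        "last_updated": {
                            "origin": "now",
                            "scale": scale,
                            "decay": 0.5,
                        }
                    }
                }
            ],
            "boost_mode": "multiply",
        }
    }
```

With a Gaussian decay, a document whose `last_updated` is one `scale` away from now has its score multiplied by `decay` (here 0.5), so older documents sink gradually instead of disappearing.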

Section 3: Information Conflicts

The Problem

Semantically similar documents may contain contradictory information based on different contexts. A query for "configure custom users in serverless" might retrieve Serverless docs ("use SSO"), Cloud docs ("create in Stack Management"), and Self-hosted docs ("configure native realm")—all valid, but only one matches the user's context.

The Solution

Use metadata boosting with should clauses to prioritize documents matching the user's context (deployment type, product version, etc.).

[8]

Created index: platform-docs
Indexed 15 documents to platform-docs

Query WITHOUT Metadata Boosting

First, let's query without any deployment-type boosting to see results from all deployment types.

[ ]

Query with Metadata Boosting

We use should clauses to boost documents that match the user's deployment context (serverless). This ensures contextually relevant documents rank higher than semantically similar but contextually wrong results.

[ ]
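A sketch of how the boost could be expressed, assuming a keyword field named `deployment_type` and the same illustrative text fields as before (`content_semantic`, `title`, `content`). The boost value of 3.0 is a placeholder to tune, not a value taken from the notebook.

```python
def with_context_boost(clause, deployment_type, boost=3.0):
    """Wrap a query clause so matching deployment types score higher."""
    return {
        "bool": {
            "must": [clause],
            "should": [
                {
                    "term": {
                        "deployment_type": {
                            "value": deployment_type,
                            "boost": boost,
                        }
                    }
                }
            ],
        }
    }


def build_boosted_rrf_search(query, deployment_type, boost=3.0, size=5):
    """RRF request where both sub-retrievers prefer the user's deployment."""
    semantic = {"semantic": {"field": "content_semantic", "query": query}}
    keyword = {"multi_match": {"query": query, "fields": ["title", "content"]}}
    return {
        "retriever": {
            "rrf": {
                "retrievers": [
                    {"standard": {"query": with_context_boost(semantic, deployment_type, boost)}},
                    {"standard": {"query": with_context_boost(keyword, deployment_type, boost)}},
                ],
                "rank_window_size": 50,
            }
        },
        "size": size,
    }


# response = client.search(
#     index="platform-docs",
#     **build_boosted_rrf_search("configure custom users in serverless", "serverless"),
# )
```

Because RRF fuses by rank rather than by raw score, the boost has to change the ordering inside each sub-retriever to have any effect, which is why it is applied to both.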

Serverless documents now rank at the top, correctly informing the user that custom users are NOT supported and they should use SSO instead.

Key Takeaway

Metadata boosting resolves conflicts by prioritizing context-relevant documents:

Extract user context from the query (deployment type, version, etc.)
Apply appropriate boosts to matching metadata
Use strict filters instead of boosts when the user's context is unambiguous
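As a rough illustration of the first and last points, the sketch below detects a deployment type with simple keyword matching and returns a strict filter only when exactly one type is mentioned. The keyword table, the `deployment_type` field, and its values are all assumptions; a production system might use a classifier or explicit user settings instead.

```python
# Hypothetical keyword table mapping deployment types to trigger phrases.
DEPLOYMENT_KEYWORDS = {
    "serverless": ("serverless",),
    "cloud": ("elastic cloud", "cloud hosted"),
    "self-hosted": ("self-hosted", "self-managed", "on-prem"),
}


def detect_deployment_type(query):
    """Return the single deployment type mentioned in the query, else None."""
    text = query.lower()
    matches = [
        name
        for name, keywords in DEPLOYMENT_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    ]
    return matches[0] if len(matches) == 1 else None


def strict_context_filter(query):
    """Build a term filter when the context is unambiguous, else None."""
    deployment_type = detect_deployment_type(query)
    if deployment_type is None:
        return None  # ambiguous or absent: fall back to boosting
    return {"term": {"deployment_type": deployment_type}}


print(strict_context_filter("configure custom users in serverless"))
print(strict_context_filter("configure custom users"))
```

Returning `None` for ambiguous queries leaves the caller free to fall back to a soft boost rather than filtering out documents the user may need.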

Section 4: Semantic Noise

The Problem

Documents about different products may share terminology, causing irrelevant results. A query for "configure agents" could match both Elastic Agent (Observability—collects logs/metrics) and Agent Builder (GenAI—builds LLM workflows). Same term, completely different products.

The Solution

Use product filters in the RRF query to exclude irrelevant product documentation entirely.

Create the Index

[ ]

Query WITHOUT Product Filter

First, let's see what happens without filtering. The query "configure agents" matches both product areas.

[ ]

Results mix Elastic Agent and Agent Builder documentation. "Agent Builder Configuration" is semantically similar but irrelevant to log/metric collection.

Query WITH Product Filter

Now let's apply a product filter to retrieve only Observability and Elastic Agent documentation, restricted to configuration documents.

[13]
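A sketch of the filtered request. The `product` and `doc_type` fields and their values follow the filter line in the output below; the text field names (`content_semantic`, `title`, `content`) are assumptions, and the index name is left as a parameter.

```python
def product_filter(products, doc_type=None):
    """Restrict results to the given products and, optionally, a doc type."""
    clauses = [{"terms": {"product": list(products)}}]
    if doc_type is not None:
        clauses.append({"term": {"doc_type": doc_type}})
    return {"bool": {"filter": clauses}}


def build_product_rrf_search(query, products, doc_type=None, size=5):
    """RRF over semantic + keyword retrievers, limited to certain products."""
    retrievers = [
        {"standard": {"query": {"semantic": {"field": "content_semantic", "query": query}}}},
        {"standard": {"query": {"multi_match": {"query": query, "fields": ["title", "content"]}}}},
    ]
    return {
        "retriever": {
            "rrf": {
                "retrievers": retrievers,
                "filter": product_filter(products, doc_type),
                "rank_window_size": 50,
            }
        },
        "size": size,
    }


# body = build_product_rrf_search(
#     "agent configuration logs metrics collection",
#     products=["observability", "elastic-agent"],
#     doc_type="configuration",
# )
# response = client.search(index=index_name, **body)  # index_name: your docs index
```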

Query: 'agent configuration logs metrics collection'
Filter: product IN [observability, elastic-agent] AND doc_type = configuration
------------------------------------------------------------

1. Elastic Agent Input Configuration
   Product: elastic-agent | Tags: inputs, logs, metrics
   URL: /docs/elastic-agent/inputs

2. Configure Elastic Agent for Log and Metric Collection
   Product: elastic-agent | Tags: configuration, logs, metrics
   URL: /docs/elastic-agent/configure

3. Agent Policies and Integrations
   Product: observability | Tags: policies, integrations, fleet
   URL: /docs/fleet/policies

4. Configuring Agent Outputs
   Product: elastic-agent | Tags: outputs, elasticsearch, logstash
   URL: /docs/elastic-agent/outputs

5. Manage Elastic Agents with Fleet
   Product: observability | Tags: fleet, agent-management, deployment
   URL: /docs/fleet/manage-agents

Section 5: Cleanup

Uncomment and run the following cell to delete the indices created in this notebook.

[ ]
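A minimal sketch of a cleanup cell, assuming `client` is the connection created in Section 1. The helper name `delete_indices` is illustrative; the two index names shown are the ones printed earlier in this notebook, and the Section 4 index would need to be added to the list.

```python
def delete_indices(client, index_names):
    """Delete each index, ignoring any that no longer exist."""
    for name in index_names:
        client.indices.delete(index=name, ignore_unavailable=True)
        print(f"Deleted index: {name}")


# delete_indices(client, ["product-docs", "platform-docs"])
```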