Context Poisoning in RAG Systems: Hands-on Examples
This notebook demonstrates three common patterns of context poisoning in RAG systems and how to defend against them using Elasticsearch's search capabilities.
What you'll learn:
- Temporal Degradation: Filter outdated documents with range queries
- Information Conflicts: Prioritize relevant context with metadata boosting
- Semantic Noise: Eliminate irrelevant results with product filters
Requirements:
- Elasticsearch 9.x or higher
- Jina embeddings v3 inference endpoint (created automatically)
- Python 3.8+
Section 1: Setup and Configuration
Install Dependencies
First, let's install the required Python packages.
Connect to Elasticsearch
To run this notebook, you need an Elasticsearch deployment.
Don't have one? Sign up for a free Elastic Cloud trial.
Set the following environment variables:
- ES_URL: Your Elasticsearch endpoint URL
- ES_API_KEY: Your API key for authentication
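As a minimal sketch of the connection step, the cell below reads the two environment variables and opens a client. The helper names `es_connection_settings` and `connect` are illustrative, not part of any library, and the sketch assumes the official `elasticsearch` Python package is installed.

```python
import os


def es_connection_settings(env=os.environ):
    """Read ES_URL and ES_API_KEY and return keyword arguments for the client."""
    missing = [name for name in ("ES_URL", "ES_API_KEY") if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {"hosts": env["ES_URL"], "api_key": env["ES_API_KEY"]}


def connect(env=os.environ):
    """Open a client and print the server version (requires a reachable cluster)."""
    # Imported lazily so the settings helper works without the client installed.
    from elasticsearch import Elasticsearch

    es = Elasticsearch(**es_connection_settings(env))
    print(f"Connected to Elasticsearch {es.info()['version']['number']}")
    return es
```

Calling `es = connect()` once gives the client object that the later cells are assumed to share.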
Connected to Elasticsearch 9.2.1
Configure Inference Endpoint
We'll use Jina embeddings v3 for semantic search. The following cell will create the inference endpoint if it doesn't exist, or verify it's available if already created.
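One possible shape for this check-then-create step is sketched below. It assumes the endpoint is backed by the `jinaai` inference service and a Jina API key that you supply. Your deployment may expose the model differently, so treat the service settings and the helper names (`jina_endpoint_config`, `ensure_inference_endpoint`) as assumptions.

```python
INFERENCE_ID = "jina-embeddings-v3"


def jina_endpoint_config(jina_api_key, model_id="jina-embeddings-v3"):
    """Build the endpoint definition (assumes the 'jinaai' inference service)."""
    return {
        "service": "jinaai",
        "service_settings": {"api_key": jina_api_key, "model_id": model_id},
    }


def ensure_inference_endpoint(es, inference_id, config):
    """Create the text_embedding endpoint only if it does not exist yet."""
    # Imported lazily so the config helper works without the client installed.
    from elasticsearch import NotFoundError

    try:
        es.inference.get(inference_id=inference_id)
        print(f"Inference endpoint '{inference_id}' already exists")
    except NotFoundError:
        es.inference.put(
            task_type="text_embedding",
            inference_id=inference_id,
            inference_config=config,
        )
        print(f"Created inference endpoint '{inference_id}'")
```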
Inference endpoint 'jina-embeddings-v3' already exists
Helper Function
We'll use a reusable function to load JSON datasets and index them into Elasticsearch.
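A helper along these lines would do the job. It is a sketch that assumes each dataset is a JSON file containing a list of objects. The names `build_actions` and `load_and_index` are hypothetical.

```python
import json


def build_actions(index_name, docs):
    """Yield one bulk action per document."""
    for doc in docs:
        yield {"_index": index_name, "_source": doc}


def load_and_index(es, index_name, json_path):
    """Load a JSON list of documents and bulk-index it; returns the indexed count."""
    # Imported lazily so build_actions can be used without the client installed.
    from elasticsearch import helpers

    with open(json_path, encoding="utf-8") as f:
        docs = json.load(f)
    # refresh="wait_for" makes the documents searchable before the call returns.
    indexed, _errors = helpers.bulk(
        es, build_actions(index_name, docs), refresh="wait_for"
    )
    print(f"Indexed {indexed} documents to {index_name}")
    return indexed
```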
Section 2: Temporal Degradation
The Problem
Outdated docs remain semantically similar to current queries but contain obsolete information. A query for "OAuth authentication" might retrieve docs from 6.x (Shield plugin), 7.x (legacy syntax), and 9.x (current). All of them are relevant to the query, but only the latest is accurate.
The Solution
Use date range filters in your RRF query to exclude documents older than a threshold (e.g., 6 months).
Create the Index
We'll create an index for product documentation with a semantic_text field for vector search.
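A plausible mapping for this index is sketched below. The `last_updated` and `status` fields match the filter used later in this section, while `title`, `content`, `version`, and `semantic_content` are assumed names chosen for illustration.

```python
def product_docs_mappings(inference_id="jina-embeddings-v3"):
    """Mapping with a semantic_text field fed from title and content via copy_to."""
    return {
        "properties": {
            "title": {"type": "text", "copy_to": "semantic_content"},
            "content": {"type": "text", "copy_to": "semantic_content"},
            # Embeddings are generated at index time by the inference endpoint.
            "semantic_content": {
                "type": "semantic_text",
                "inference_id": inference_id,
            },
            "version": {"type": "keyword"},
            "status": {"type": "keyword"},
            "last_updated": {"type": "date"},
        }
    }


def create_index(es, index_name, mappings):
    """Recreate the index from scratch so the notebook can be re-run safely."""
    es.indices.delete(index=index_name, ignore_unavailable=True)
    es.indices.create(index=index_name, mappings=mappings)
    print(f"Created index: {index_name}")
```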
Created index: product-docs
Indexed 15 documents to product-docs
Query WITHOUT Temporal Filter
First, let's see what happens when we query without any date filtering. The RRF (Reciprocal Rank Fusion) query combines semantic and keyword search but retrieves documents from all time periods.
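An unfiltered RRF request can be sketched as a retriever with two sub-retrievers, one semantic and one keyword. The field names (`semantic_content`, `title`, `content`) and the `rank_window_size` and `rank_constant` values are assumptions for illustration.

```python
def rrf_retriever(query_text, semantic_field="semantic_content",
                  text_fields=("title", "content")):
    """Combine a semantic retriever and a keyword retriever with RRF, no filter."""
    return {
        "rrf": {
            "retrievers": [
                # Vector side: semantic query against the semantic_text field.
                {"standard": {"query": {
                    "semantic": {"field": semantic_field, "query": query_text}
                }}},
                # Keyword side: BM25 match over the plain text fields.
                {"standard": {"query": {
                    "multi_match": {"query": query_text, "fields": list(text_fields)}
                }}},
            ],
            "rank_window_size": 50,
            "rank_constant": 60,
        }
    }


# With a connected client, the request would look like:
# es.search(index="product-docs",
#           retriever=rrf_retriever("how to configure OAuth authentication"),
#           size=5)
```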
Query: 'how to configure OAuth authentication'
Filter: NONE
------------------------------------------------------------
1. Setting Up OAuth (Deprecated)
   Version: 6.x | Updated: 2022-03-15
   Configure OAuth via Shield plugin (deprecated, replaced by X-Pack)....
2. OAuth 2.0 Authentication Setup
   Version: 9.x | Updated: 2026-01-15
   Configure OAuth 2.0 in Elasticsearch 9.x using the security API via Stack Manage...
3. OAuth Authentication Configuration
   Version: 7.x | Updated: 2023-06-15
   Configure OAuth via elasticsearch.yml with xpack.security.authc.realms.oidc sett...
4. OAuth Realm Setup
   Version: 7.x | Updated: 2023-05-20
   Set up OAuth realm in xpack.security.authc.realms.oidc with op.* settings....
5. OAuth Client Registration
   Version: 7.x | Updated: 2023-04-10
   Register OAuth clients via rp.client_id and rp.client_secret in elasticsearch.ym...
Without filtering, the results mix documents dated from 2022 to 2026. The RAG system would receive conflicting guidance covering the deprecated Shield plugin, the legacy elasticsearch.yml configuration, and the current API-based setup.
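To apply the date range filter described in The Solution, one option is to attach a filter to the RRF retriever so that it constrains every sub-retriever. The sketch below assumes the same illustrative field names as before and wraps the conditions in a single `bool` query. The 6-month threshold and the `published` status mirror the filter shown in the next output.

```python
def temporal_filters(max_age="6M", status="published"):
    """Keep only documents updated within max_age that have the given status."""
    return [
        {"range": {"last_updated": {"gte": f"now-{max_age}"}}},
        {"term": {"status": status}},
    ]


def filtered_rrf_retriever(query_text, filters,
                           semantic_field="semantic_content",
                           text_fields=("title", "content")):
    """RRF retriever whose filter applies to both sub-retrievers."""
    return {
        "rrf": {
            "retrievers": [
                {"standard": {"query": {
                    "semantic": {"field": semantic_field, "query": query_text}
                }}},
                {"standard": {"query": {
                    "multi_match": {"query": query_text, "fields": list(text_fields)}
                }}},
            ],
            # A single bool query keeps the filter valid as one query object.
            "filter": {"bool": {"filter": list(filters)}},
        }
    }


# es.search(index="product-docs", size=5,
#           retriever=filtered_rrf_retriever(
#               "how to configure OAuth authentication", temporal_filters()))
```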
Query: 'how to configure OAuth authentication'
Filter: last_updated >= now-6M AND status = published
------------------------------------------------------------
1. OAuth 2.0 Authentication Setup
   Version: 9.x | Updated: 2026-01-15
   Configure OAuth 2.0 in Elasticsearch 9.x using the security API via Stack Manage...
2. OAuth Provider Configuration
   Version: 9.x | Updated: 2025-12-20
   Configure Okta, Azure AD, Auth0 via security API with OIDC auto-discovery....
3. OAuth Token Management
   Version: 9.x | Updated: 2026-01-10
   Manage OAuth tokens via /_security/oauth2/token endpoint for refresh and revocat...
4. OAuth Security Best Practices
   Version: 9.x | Updated: 2026-01-05
   OAuth best practices: short-lived tokens, PKCE, proper redirect URIs, state vali...
5. OAuth Troubleshooting Guide
   Version: 9.x | Updated: 2025-12-15
   Troubleshoot OAuth: check token expiration, redirect URIs, credentials, use debu...
All results are now from version 9.x, providing consistent and current OAuth configuration guidance.
Key Takeaway
Temporal filtering prevents context poisoning from outdated documentation:
- Set appropriate staleness thresholds based on your documentation lifecycle
- Consider recency boosting when you want a soft preference for newer documents rather than a hard cutoff
- Mark evergreen content (core concepts) so it is exempt from time filters
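The last two points can be sketched as query fragments. The first helper exempts evergreen documents from the date cutoff, and the second uses a `distance_feature` clause to favor recent documents without excluding older ones. The `is_evergreen` field, the pivot, and the boost are hypothetical and would need tuning against your own corpus.

```python
def staleness_filter(max_age="6M", evergreen_field="is_evergreen"):
    """Pass documents that are recent enough OR explicitly marked evergreen."""
    return {
        "bool": {
            "should": [
                {"range": {"last_updated": {"gte": f"now-{max_age}"}}},
                {"term": {evergreen_field: True}},
            ],
            "minimum_should_match": 1,
        }
    }


def recency_boosted_query(query_text, pivot="180d", boost=2.0,
                          text_fields=("title", "content")):
    """Soft preference: newer documents score higher, older ones still match."""
    return {
        "bool": {
            "must": [
                {"multi_match": {"query": query_text, "fields": list(text_fields)}}
            ],
            "should": [
                # The score contribution halves at one pivot away from now.
                {"distance_feature": {
                    "field": "last_updated",
                    "origin": "now",
                    "pivot": pivot,
                    "boost": boost,
                }}
            ],
        }
    }
```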
Section 3: Information Conflicts
The Problem
Semantically similar documents may contain contradictory information because they target different contexts. A query for "configure custom users in serverless" might retrieve Serverless docs ("use SSO"), Cloud docs ("create in Stack Management"), and Self-hosted docs ("configure native realm"). All are valid, but only one matches the user's context.
The Solution
Use metadata boosting with should clauses to prioritize documents matching the user's context (deployment type, product version, etc.).
Created index: platform-docs
Indexed 15 documents to platform-docs
Query WITHOUT Metadata Boosting
First, let's query without any deployment-type boosting to see results from all deployment types.
Query with Metadata Boosting
We use should clauses to boost documents that match the user's deployment context (serverless). This ensures contextually relevant documents rank higher than semantically similar but contextually wrong results.
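A boosted query of this kind might look like the sketch below. The `deployment_type` field name and the boost value are assumptions. Because the boost sits in a `should` clause, documents from other deployment types still match and simply score lower.

```python
def metadata_boosted_query(query_text, deployment_type, boost=3.0,
                           text_fields=("title", "content")):
    """Keyword query that ranks documents for the user's deployment type higher."""
    return {
        "bool": {
            # Every result must match the text of the question.
            "must": [
                {"multi_match": {"query": query_text, "fields": list(text_fields)}}
            ],
            # Optional clause: matching documents get extra score.
            "should": [
                {"term": {"deployment_type": {
                    "value": deployment_type,
                    "boost": boost,
                }}}
            ],
        }
    }


# The result can be used as the query of a standard retriever, for example:
# {"standard": {"query": metadata_boosted_query(
#     "configure custom users in serverless", "serverless")}}
```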
Serverless documents now rank at the top, correctly informing the user that custom users are NOT supported and they should use SSO instead.
Key Takeaway
Metadata boosting resolves conflicts by prioritizing context-relevant documents:
- Extract user context from the query (deployment type, version, etc.)
- Apply appropriate boosts to matching metadata
- Use strict filters when context is unambiguous
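The three steps above can be sketched with a simple keyword lookup. The keyword table, the `deployment_type` field, and the helper names are all hypothetical. A production system might use a classifier or session metadata instead of string matching.

```python
# Hypothetical mapping from deployment type to phrases that signal it.
DEPLOYMENT_KEYWORDS = {
    "serverless": ("serverless",),
    "cloud": ("elastic cloud", "cloud hosted"),
    "self-hosted": ("self-hosted", "self-managed", "on-prem"),
}


def detect_deployment_type(query_text):
    """Return the first deployment type whose keywords appear in the query."""
    lowered = query_text.lower()
    for deployment, keywords in DEPLOYMENT_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return deployment
    return None


def context_clauses(query_text, strict=False, boost=3.0):
    """Return bool-query clauses: a filter when strict, a boost otherwise."""
    deployment = detect_deployment_type(query_text)
    if deployment is None:
        return {}  # No context detected: leave the query unchanged.
    if strict:
        return {"filter": [{"term": {"deployment_type": deployment}}]}
    return {"should": [
        {"term": {"deployment_type": {"value": deployment, "boost": boost}}}
    ]}
```

The returned dictionary is meant to be merged into the `bool` section of the search query, so a strict filter and a soft boost share one code path.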
Section 4: Semantic Noise
The Problem
Documents about different products may share terminology, causing irrelevant results. A query for "configure agents" could match both Elastic Agent (Observability: collects logs and metrics) and Agent Builder (GenAI: builds LLM workflows). Same term, completely different products.
The Solution
Use product filters in the RRF query to exclude irrelevant product documentation entirely.
Create the Index
Query WITHOUT Product Filter
First, let's see what happens without filtering. The query "configure agents" matches both product areas.
Results mix Elastic Agent and Agent Builder documentation. "Agent Builder Configuration" is semantically similar but irrelevant to log/metric collection.
Query WITH Product Filter
Now let's apply a product filter to only retrieve Observability and Elastic Agent documentation.
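A product filter for this query could be built as follows. The `product` and `doc_type` field names come from the filter line in the output below, while the helper names and the text field names are assumed.

```python
def product_filters(products, doc_type=None):
    """Restrict results to the given products and, optionally, one doc type."""
    filters = [{"terms": {"product": list(products)}}]
    if doc_type is not None:
        filters.append({"term": {"doc_type": doc_type}})
    return filters


def product_filtered_retriever(query_text, filters,
                               semantic_field="semantic_content",
                               text_fields=("title", "content")):
    """RRF retriever that drops documents from other products before ranking."""
    return {
        "rrf": {
            "retrievers": [
                {"standard": {"query": {
                    "semantic": {"field": semantic_field, "query": query_text}
                }}},
                {"standard": {"query": {
                    "multi_match": {"query": query_text, "fields": list(text_fields)}
                }}},
            ],
            "filter": {"bool": {"filter": list(filters)}},
        }
    }


# filters = product_filters(["observability", "elastic-agent"], "configuration")
# retriever = product_filtered_retriever(
#     "agent configuration logs metrics collection", filters)
```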
Query: 'agent configuration logs metrics collection'
Filter: product IN [observability, elastic-agent] AND doc_type = configuration
------------------------------------------------------------
1. Elastic Agent Input Configuration
   Product: elastic-agent | Tags: inputs, logs, metrics
   URL: /docs/elastic-agent/inputs
2. Configure Elastic Agent for Log and Metric Collection
   Product: elastic-agent | Tags: configuration, logs, metrics
   URL: /docs/elastic-agent/configure
3. Agent Policies and Integrations
   Product: observability | Tags: policies, integrations, fleet
   URL: /docs/fleet/policies
4. Configuring Agent Outputs
   Product: elastic-agent | Tags: outputs, elasticsearch, logstash
   URL: /docs/elastic-agent/outputs
5. Manage Elastic Agents with Fleet
   Product: observability | Tags: fleet, agent-management, deployment
   URL: /docs/fleet/manage-agents
Section 5: Cleanup
Uncomment and run the following cell to delete the indices created in this notebook.
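A cleanup cell might look like the sketch below. The helper name `delete_indices` is illustrative. The two index names in the commented call are the ones created in Sections 2 and 3. Add the Section 4 index name yourself, since it is not shown in the outputs above.

```python
def delete_indices(es, index_names):
    """Delete each index, ignoring names that no longer exist."""
    for name in index_names:
        es.indices.delete(index=name, ignore_unavailable=True)
        print(f"Deleted index: {name}")


# Uncomment to run against your deployment:
# delete_indices(es, ["product-docs", "platform-docs"])
```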