LangChain provides composable building blocks to create LLM-powered applications, making it an ideal framework for building RAG systems. Developers can integrate components and APIs of different vendors into coherent applications.
Evaluating a RAG system’s performance is crucial to ensure high-quality responses and robustness. The Ragas framework offers a large number of RAG-specific metrics as well as capabilities for generating dedicated evaluation datasets.
neptune.ai makes it easy for RAG developers to track evaluation metrics and metadata, enabling them to analyze and compare different system configurations. The experiment tracker can handle large amounts of data, making it well-suited for quick iteration and extensive evaluations of LLM-based applications.
Imagine asking a chat assistant about LLMOps only to receive outdated advice or irrelevant best practices. While LLMs are powerful, they rely solely on their pre-trained knowledge and lack the ability to fetch current data.
This is where Retrieval-Augmented Generation (RAG) comes in. RAG combines the generative power of LLMs with external data retrieval, enabling the assistant to access and use real-time information. For example, instead of outdated answers, the chat assistant could pull insights from Neptune’s LLMOps article collection to deliver accurate and contextually relevant responses.
In this guide, we’ll show you how to build a RAG system using the LangChain framework, evaluate its performance using Ragas, and track your experiments with neptune.ai. Along the way, you’ll learn to create a baseline RAG system, refine it using Ragas metrics, and enhance your workflow with Neptune’s experiment tracking.
Part 1: Building a baseline RAG system with LangChain
In the first part of this guide, we’ll use LangChain to build a RAG system for the blog posts in the LLMOps category on Neptune’s blog.

What is LangChain?
LangChain offers a collection of open-source building blocks, including memory management, data loaders for various sources, and integrations with vector databases—all the essential components of a RAG system.
LangChain stands out among frameworks for building RAG systems thanks to its composability and versatility. Developers can combine and connect these building blocks using a coherent Python API, allowing them to focus on creating LLM applications rather than dealing with the nitty-gritty of API specifications and data transformations.

Step 1: Setting up
We’ll begin by installing the necessary dependencies (I used Python 3.11.4 on Linux):
pip install -qU langchain-core==0.1.45 langchain-openai==0.0.6 langchain-chroma==0.1.4 ragas==0.2.8 neptune==1.13.0 pandas==2.2.3 datasets==3.2.0
For this example, we’ll use OpenAI’s models and configure the API key. To access OpenAI models, you’ll need to create an OpenAI account and generate an API key. Our usage in this blog should be well within the free-tier limits.
Once we have obtained our API key, we’ll set it as an environment variable so that LangChain’s OpenAI building blocks can access it:
import os
os.environ["OPENAI_API_KEY"] = "YOUR_KEY_HERE"
You can also use any of LangChain’s other embedding and chat models, including local models provided by Ollama. Thanks to the compositional structure of LangChain, all it takes is replacing OpenAIEmbeddings and ChatOpenAI in the code with the respective alternative building blocks.
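For example, here is a minimal sketch of what that swap could look like using the Ollama integrations from langchain-community (this assumes langchain-community is installed, a local Ollama server is running, and the model names below have already been pulled; they are illustrative choices, not requirements):

# Hypothetical swap: local models served by Ollama instead of OpenAI models.
from langchain_community.chat_models import ChatOllama
from langchain_community.embeddings import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="nomic-embed-text")  # example local embedding model
llm = ChatOllama(model="llama3.1")  # example local chat model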
Step 2: Load and parse the raw data
Source data for RAG systems is often unstructured documents. Before we can use it effectively, we’ll need to process and parse it into a structured format.
Fetch the source data
Since we’re working with a blog, we’ll use LangChain’s WebBaseLoader to load data from Neptune’s blog. WebBaseLoader reads raw webpage content, capturing text and structure, such as headings.
The web pages are loaded as LangChain documents, which include the page content as a string and metadata associated with that document, e.g., the source page’s URL.
In this example, we select 3 blog posts to create the chat assistant’s knowledge base:
import bs4
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader(
    web_paths=[
        "https://neptune.ai/blog/llm-hallucinations",
        "https://neptune.ai/blog/llmops",
        "https://neptune.ai/blog/llm-guardrails",
    ],
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(name=["p", "h2", "h3", "h4"])
    ),
)

docs = loader.load()
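If you want to verify what was loaded, a quick optional check looks like this:

# Optional sanity check: inspect the loaded documents.
print(len(docs))  # number of loaded pages (three in our case)
print(docs[0].metadata["source"])  # URL of the source page
print(docs[0].page_content[:250])  # first characters of the parsed text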
Split the data into smaller chunks
To meet the embedding model’s token limit and improve retrieval performance, we’ll split the long blog posts into smaller chunks.
The chunk size is a trade-off between specificity (capturing detailed information within each chunk) and efficiency (reducing the total number of resulting chunks). By overlapping chunks, we mitigate the loss of critical information that occurs when a self-contained sequence of the source text is split into two incoherent chunks.

For generic text, LangChain recommends the RecursiveCharacterTextSplitter. We set the chunk size to a maximum of 1,000 characters with an overlap of 200 characters. We also filter out unnecessary parts of the documents, such as the header, footer, and any promotional content:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

header_footer_keywords = ["peers about your research", "deepsense", "ReSpo", "Was the article useful?", "related articles", "All rights reserved"]

splits = []
for s in text_splitter.split_documents(docs):
    if not any(kw in s.page_content for kw in header_footer_keywords):
        splits.append(s)

len(splits)
Step 3: Set up the vector store
Vector stores are specialized data stores that enable indexing and retrieving information based on vector representations.
Choose a vector store
LangChain supports many vector stores. In this example, we’ll use Chroma, an open-source vector store specifically designed for LLM applications.
By default, Chroma stores the collection in memory; once the session ends, all the data (embeddings and indices) are lost. While this is fine for our small example, in production, you’ll want to persist the database to disk by passing the persist_directory keyword argument when initializing Chroma.
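As a rough sketch, persisting could look like this (the directory path is an arbitrary example, and the initialization otherwise mirrors the code shown in the next step):

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# Sketch: persist the Chroma collection to disk so it survives the session.
vectorstore = Chroma.from_documents(
    documents=splits,
    embedding=OpenAIEmbeddings(),
    persist_directory="./chroma_db",  # example path; any writable directory works
)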
Specify which embedding model to use
Embedding models convert chunks into vectors. There are many embedding models to choose from. The Massive Text Embedding Benchmark (MTEB) leaderboard is a great resource for selecting one based on model size, embedding dimensions, and performance requirements.

For our example LLMOps RAG system, we’ll use OpenAIEmbeddings with its default model. (At the time of writing, this was text-embedding-ada-002.)
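If you’d rather pin a specific model than rely on the default, you can pass it explicitly (text-embedding-3-small is just an illustrative choice):

from langchain_openai import OpenAIEmbeddings

# Example: explicitly selecting an OpenAI embedding model instead of the default.
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")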
Create a retriever object from the vector store
A retriever performs semantic searches to find the most relevant pieces of information based on a user query. For this baseline example, we’ll configure the retriever to return only the top result, which will be used as context for the LLM to generate an answer.
Initializing the vector store for our RAG system and instantiating a retriever takes only two lines of code:
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma.from_documents(
    documents=splits,
    embedding=OpenAIEmbeddings(),
)

retriever = vectorstore.as_retriever(search_kwargs={"k": 1})
In the last line, we have specified through search_kwargs that the retriever only returns the most similar document (top-k retrieval with k = 1).
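To see the retriever in isolation, you can optionally run a quick test query:

# Optional check: which chunk does the retriever return for a sample query?
retrieved_docs = retriever.invoke("What are DOM-based attacks?")
print(retrieved_docs[0].metadata["source"])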
Step 4: Bring it all together
Now that we’ve set up a vector database with the source data and initialized the retriever to return the most relevant chunk given a query, we’ll combine it with an LLM to complete our baseline RAG chain.
Define a prompt template
We need to set a prompt to guide the LLM in responding. This prompt should tell the model to use the retrieved context to answer the query.
We’ll use a standard RAG prompt template that specifically asks the LLM to use the provided context (the retrieved chunk) to answer the user query concisely:
from langchain_core.prompts import ChatPromptTemplate

system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)
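To double-check the template before wiring it into the chain, you can optionally render it with dummy values (the placeholder strings below are arbitrary):

# Optional sanity check: render the prompt template with dummy values.
print(prompt.invoke({"context": "Some retrieved text.", "input": "A user question?"}).to_messages())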
Create the full RAG chain
We’ll use the create_stuff_documents_chain utility function to set up the generative part of our RAG chain. It combines an instantiated LLM and a prompt template with a {context} placeholder into a chain that takes a set of documents as its input, which are “stuffed” into the prompt before it is fed into the LLM. In our case, that LLM is OpenAI’s GPT-4o-mini.
from langchain_openai import ChatOpenAI
from langchain.chains.combine_documents import create_stuff_documents_chain
llm = ChatOpenAI(model="gpt-4o-mini")
question_answer_chain = create_stuff_documents_chain(llm, prompt)
Then, we can use the create_retrieval_chain utility function to finally instantiate our complete RAG chain:
from langchain.chains import create_retrieval_chain
rag_chain = create_retrieval_chain(retriever, question_answer_chain)
Get an output from the RAG chain
To see how our system works, we can run a first inference call. We’ll send a query to the chain that we know can be answered using the contents of one of the blog posts:
response = rag_chain.invoke({"input": "What are DOM-based attacks?"})
print(response["answer"])
The response is a dictionary that contains “input,” “context,” and “answer” keys:
{
"input": 'What are DOM-based attacks?',
'context': [Document(metadata={'source': 'https://neptune.ai/blog/llm-guardrails'}, page_content='By prompting the application to pretend to be a chatbot that “can do anything” and is not bound by any restrictions, users were able to manipulate ChatGPT to provide responses to questions it would usually decline to answer.Although “prompt injection” and “jailbreaking” are often used interchangeably in the community, they refer to distinct vulnerabilities that must be handled with different methods.DOM-based attacksDOM-based attacks are an extension of the traditional prompt injection attacks. The key idea is to feed a harmful instruction into the system by hiding it within a website’s code.Consider a scenario where your program crawls websites and feeds the raw HTML to an LLM on a daily basis. The rendered page looks normal to you, with no obvious signs of anything wrong. Yet, an attacker can hide a malicious key phrase by matching its color to the background or adding it in parts of the HTML code that are not rendered, such as a style Tag.While invisible to human eyes, the LLM will')],
"answer": "DOM-based attacks are a type of vulnerability where harmful instructions are embedded within a website's code, often hidden from view. Attackers can conceal malicious content by matching its color to the background or placing it in non-rendered sections of the HTML, like style tags. This allows the malicious code to be executed by a system, such as a language model, when it processes the website's HTML."}
We see that the retriever appropriately identified a snippet from the LLM Guardrails: Secure and Controllable Deployment article as the most relevant chunk.
Define a prediction function
Now that we have a fully functioning end-to-end RAG chain, we can create a convenience function that enables us to query our RAG chain. It takes a RAG chain and a query and returns the chain’s response. We’ll also implement the option to pass just the stuff documents chain and provide the list of context documents via an additional input parameter. This will come in handy when evaluating the different parts of our RAG system.
Here’s what this function looks like:
from langchain_core.runnables.base import Runnable
from langchain_core.documents import Document

def predict(chain: Runnable, query: str, context: list[Document] | None = None) -> dict:
    """
    Accepts a retrieval chain or a stuff documents chain.
    If the latter, context must be passed in.
    Returns a dict mapping the query to its "context" and "answer".
    """
    inputs = {"input": query}
    if context:
        inputs.update({"context": context})
    response = chain.invoke(inputs)
    if isinstance(response, dict):
        # Retrieval chain: the response contains the retrieved documents and the answer.
        answer = response["answer"]
        contexts = [d.page_content for d in response["context"]]
    else:
        # Stuff documents chain: the response is the answer string; use the provided context.
        answer = response
        contexts = [d.page_content for d in context]
    return {query: {"context": contexts, "answer": answer}}
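For example, calling it with the full retrieval chain from above returns a dictionary keyed by the query:

# Example usage with the retrieval chain (the context is retrieved automatically).
print(predict(rag_chain, "What are DOM-based attacks?"))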
Part 2: Evaluating a RAG system using Ragas and neptune.ai
Once a RAG system is built, it’s important to evaluate its performance and establish a baseline. The proper way to do this is by systematically testing it using a representative evaluation dataset. Since such a dataset is not available in our case yet, we’ll have to generate one.
To assess both the retrieval and generation aspects of the system, we’ll use Ragas as the evaluation framework and neptune.ai to track experiments as we iterate.
What is Ragas?
Ragas is an open-source toolkit for evaluating RAG applications. It offers both LLM-based and non-LLM-based metrics to assess the quality of retrieval and generated responses. Ragas works smoothly with LangChain, making it a great choice for evaluating our RAG system.
Step 1: Generate a RAG evaluation dataset
An evaluation set for RAG tasks is similar to a question-answering task dataset. The key difference is that each row includes not just the query and a reference answer but also reference contexts (documents that we expect to be retrieved to answer the query).
Thus, an example evaluation set entry looks like this:
| Query | Reference context | Reference answer |
| --- | --- | --- |
| How can users trick a chatbot to bypass restrictions? | [‘By prompting the application to pretend to be a chatbot that “can do anything” and is not bound by any restrictions, users were able to manipulate ChatGPT to provide responses to questions it would usually decline to answer.’] | Users trick chatbots to bypass restrictions by prompting the application to pretend to be a chatbot that ‘can do anything’ and is not bound by any restrictions, allowing it to provide responses to questions it would usually decline to answer. |
Ragas provides utilities to generate such a dataset from a list of reference documents using an LLM.
As the reference documents, we’ll use the same chunks that we fed into the Chroma vector store in the first part, which is precisely the knowledge base from which our RAG system is drawing.
To test the generative part of our RAG chain, we’ll need to generate example queries and reference answers using a different model. Otherwise, we’d be testing our system’s self-consistency. We’ll use the full-sized GPT-4o model, which should outperform the GPT-4o-mini in our RAG chain.
As in the first part, it is possible to use a different LLM. The LangchainLLMWrapper and LangchainEmbeddingsWrapper classes make any model available via LangChain accessible to Ragas.
What happens under the hood?
Ragas’ TestsetGenerator builds a knowledge graph in which each node represents a chunk. It extracts information like named entities from the chunks and uses this data to model the relationships between nodes. From the knowledge graph, so-called query synthesizers derive scenarios consisting of a set of nodes, the desired query length and style, and a user persona. Each scenario is used to populate a prompt template instructing an LLM to generate a query and answer. For more details, refer to the Ragas Testset Generation documentation.
Creating an evaluation dataset with 50 rows for our RAG system should take about a minute. We’ll generate a mixture of abstract queries (“What is concept A?”) and specific queries (“How often does subscription plan B bill its users?”):
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
from ragas.testset import TestsetGenerator
from ragas.testset.synthesizers import AbstractQuerySynthesizer, SpecificQuerySynthesizer

generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)

dataset = generator.generate_with_langchain_docs(
    splits,
    testset_size=50,
    query_distribution=[
        (AbstractQuerySynthesizer(llm=generator_llm), 0.1),
        (SpecificQuerySynthesizer(llm=generator_llm), 0.9),
    ],
)
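Before moving on, you can optionally peek at a few of the generated samples:

# Optional: inspect the generated queries and reference answers before filtering.
dataset.to_pandas()[["user_input", "reference"]].head()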
Filtering unwanted data
We want to focus our evaluation on cases where the reference answer is helpful. In particular, we don’t want to include test samples with responses containing phrases like “the context is insufficient” or “the context does not contain.” Duplicate entries in the dataset would skew the evaluation, so they should also be omitted.
For filtering, we’ll use the ability to easily convert Ragas datasets into Pandas DataFrames or Hugging Face Datasets:
df_full = dataset.to_pandas()

unique_indices = set(df_full.drop_duplicates(subset=["user_input"]).index)

not_helpful = set(df_full[df_full["reference"].str.contains(
    "does not contain|does not provide|context does not|is insufficient|is incomplete",
    case=False, regex=True,
)].index)

unique_helpful_indices = unique_indices - not_helpful

ds = dataset.to_hf_dataset().select(unique_helpful_indices)
This leaves us with unique samples that look like this:
| User input | Reference contexts | Reference answer |
| --- | --- | --- |
| What role does reflection play in identifying and correcting hallucinations in LLM outputs? | [‘After the responseCorrecting a hallucination after the LLM output has been generated is still beneficial, as it prevents the user from seeing the incorrect information. This approach can effectively transform correction into prevention by ensuring that the erroneous response never reaches the user. The process can be broken down into the following steps:This method is part of multi-step reasoning strategies, which are increasingly important in handling complex problems. These strategies, often referred to as “agents,” are gaining popularity. One well-known agent pattern is reflection. By identifying hallucinations early, you can address and correct them before they impact the user.’] | Reflection plays a role in identifying and correcting hallucinations in LLM outputs by allowing early identification and correction of errors before they impact the user. |
| What are some examples of LLMs that utilize a reasoning strategy to improve their responses? | [‘Post-training or alignmentIt is hypothesized that an LLM instructed not only to respond and follow instructions but also to take time to reason and reflect on a problem could largely mitigate the hallucination issue—either by providing the correct answer or by stating that it does not know how to answer.Furthermore, you can teach a model to use external tools during the reasoning process, like getting information from a search engine. There are a lot of different fine-tuning techniques being tested to achieve this. Some LLMs already working with this reasoning strategy are Matt Shumer’s Reflection-LLama-3.1-70b and OpenAI’s O1 family models.’] | Some examples of LLMs that utilize a reasoning strategy to improve their responses are Matt Shumer’s Reflection-LLama-3.1-70b and OpenAI’s O1 family models. |
| What distnguishes ‘promt injecton’ frm ‘jailbraking’ in vulnerabilties n handling? | [‘Although “prompt injection” and “jailbreaking” are often used interchangeably in the community, they refer to distinct vulnerabilities that must be handled with different methods.’] | ‘Prompt injection’ and ‘jailbreaking’ are distinct vulnerabilities that require different handling methods. |
In the third sample, the query contains a lot of typos. This is an example of the “MISSPELLED” query style.
💡 You can find a full example evaluation dataset on Hugging Face.
Step 2: Choose RAG evaluation metrics
As mentioned earlier, Ragas offers both LLM-based and non-LLM-based metrics for RAG system evaluation.
For this example, we’ll focus on LLM-based metrics. They are better suited to tasks requiring semantic and contextual understanding than traditional rule-based or string-matching metrics, while being significantly less resource-intensive than having humans evaluate each response. This makes them a reasonable tradeoff despite concerns about reproducibility.
From the wide range of metrics available in Ragas, we’ll select five:
- LLM Context Recall measures how many of the relevant documents are successfully retrieved. It uses the reference answer as a proxy for the reference context and determines whether all claims in the reference answer can be attributed to the retrieved context.
- Faithfulness measures the generated answer’s factual consistency with the given context by assessing how many claims in the generated answer can be found in the retrieved context.
- Factual Correctness evaluates the factual accuracy of the generated answer by assessing whether claims are present in the reference answer (true and false positives) and whether any claims from the reference answer are missing (false negatives). From this information, precision, recall, or F1 scores are calculated.
- Semantic Similarity measures the similarity between the reference answer and the generated answer.
- Noise Sensitivity measures how often a system makes errors by providing incorrect responses when utilizing either relevant or irrelevant retrieved documents.
Each of these metrics requires specifying an LLM or an embedding model for its calculations. We’ll again use GPT-4o for this purpose:
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, SemanticSimilarity, NoiseSensitivity
from ragas import EvaluationDataset
from ragas import evaluate

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
evaluator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

metrics = [
    LLMContextRecall(llm=evaluator_llm),
    FactualCorrectness(llm=evaluator_llm),
    Faithfulness(llm=evaluator_llm),
    SemanticSimilarity(embeddings=evaluator_embeddings),
    NoiseSensitivity(llm=evaluator_llm),
]
Step 3: Evaluate the baseline RAG system’s performance
To evaluate our baseline RAG system, we’ll generate predictions and analyze them with the five selected metrics.
To speed up the process, we’ll use a concurrent approach to handle the I/O-bound predict calls from the RAG chain. This allows us to process multiple queries in parallel. Afterward, we can convert the results into a data frame for further inspection and manipulation. We’ll also store the results in a CSV file.
Here’s the complete performance evaluation code:
from concurrent.futures import ThreadPoolExecutor, as_completed
from datasets import Dataset

def concurrent_predict_retrieval_chain(chain: Runnable, dataset: Dataset):
    results = {}
    threads = []
    with ThreadPoolExecutor(max_workers=5) as pool:
        for query in dataset["user_input"]:
            threads.append(pool.submit(predict, chain, query))
        for task in as_completed(threads):
            results.update(task.result())
    return results

predictions = concurrent_predict_retrieval_chain(rag_chain, ds)

ds_k_1 = ds.map(lambda example: {
    "response": predictions[example["user_input"]]["answer"],
    "retrieved_contexts": predictions[example["user_input"]]["context"],
})

results = evaluate(dataset=EvaluationDataset.from_hf_dataset(ds_k_1), metrics=metrics)

df = results.to_pandas()
df.to_csv("eval_results.csv", index=False)
Part 3: Iteratively refining the RAG performance
With the evaluation setup in place, we can now start to improve our RAG system. Using the initial evaluation results as our baseline, we can systematically make changes to our RAG chain and assess whether they improve performance.
While we could make do with saving all evaluation results in cleanly named files and taking notes, we’d quickly be overwhelmed with the amount of information. To efficiently iterate and keep track of our progress, we’ll need a way to record, analyze, and compare our experiments.
What is neptune.ai?
Neptune is a machine-learning experiment tracker focused on collaboration and scalability. It provides a centralized platform for tracking, logging, and comparing metrics, artifacts, and configurations.
Neptune can track not only single metrics values but also more complex metadata, such as text, arrays, and files. All metadata can be accessed and analyzed through a highly versatile user interface as well as programmatically. All this makes it a great tool for developing RAG systems and other LLM-based applications.
Step 1: Set up neptune.ai for experiment tracking
To get started with Neptune, sign up for a free account at app.neptune.ai and follow the steps to create a new project. Once that’s done, set the project name and API token as environment variables and initialize a run:
os.environ["NEPTUNE_PROJECT"] = "YOUR_PROJECT"
os.environ["NEPTUNE_API_TOKEN"] = "YOUR_API_TOKEN"
import neptune
run = neptune.init_run()
In Neptune, each run corresponds to one tracked experiment. Thus, every time we execute our evaluation script, we start a new experiment.
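It can also be useful to log each experiment’s configuration right after initializing the run, so that runs are easy to tell apart later. Here is a small example using this tutorial’s settings (the “config” namespace and the values are our choice, not a Neptune requirement):

# Log the experiment configuration as a nested namespace on the run.
run["config"] = {
    "llm": "gpt-4o-mini",
    "embedding_model": "text-embedding-ada-002",
    "chunk_size": 1000,
    "chunk_overlap": 200,
}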
Logging Ragas metrics to neptune.ai
To make our lives easier, we’ll define a helper function that stores the Ragas evaluation results in the Neptune Run object, which represents the current experiment.
We’ll track the metrics for each sample in the evaluation dataset as well as overall performance metrics, which in our case are simply each metric’s average across the entire dataset:
import io

import neptune
import pandas as pd

def log_detailed_metrics(results_df: pd.DataFrame, run: neptune.Run, k: int):
    run["eval/k"].append(k)
    for i, row in results_df.iterrows():
        for m in metrics:
            val = row[m.name]
            run[f"eval/q{i}/{m.name}"].append(val)
        run[f"eval/q{i}/user_input"] = row["user_input"]
        run[f"eval/q{i}/response"].append(row["response"])
        run[f"eval/q{i}/reference"] = row["reference"]

        context_df = pd.DataFrame(
            zip(row["retrieved_contexts"], row["reference_contexts"]),
            columns=["retrieved", "reference"],
        )
        context_stream = io.StringIO()
        context_df.to_csv(context_stream, index=True, index_label="k")
        context_stream.seek(0)  # rewind the stream before uploading
        run[f"eval/q{i}/contexts/{k}"].upload(
            neptune.types.File.from_stream(context_stream, extension="csv")
        )

    overall_metrics = results_df[[m.name for m in metrics]].mean(axis=0).to_dict()
    for metric_name, metric_value in overall_metrics.items():
        run[f"eval/overall/{metric_name}"].append(metric_value)

log_detailed_metrics(df, run, k=1)

run.stop()
Once we run the evaluation and switch to Neptune’s Experiments tab, we see our currently active run and the first round of metrics that we’ve logged.
Step 2: Iterate over a retrieval parameter
In our baseline RAG chain, we only use the first retrieved document chunk in the LLM context. But what if there are relevant chunks ranked lower, perhaps in the top 3 or top 5? To explore this, we can experiment with using different values for k, the number of retrieved documents.
We’ll evaluate k = 1, k = 3, and k = 5 within a single new run so the results can be compared side by side. For each value of k, we instantiate a new retrieval chain, run the prediction and evaluation functions, and log the results for comparison:
run = neptune.init_run()  # start a fresh run for this experiment

for k in [1, 3, 5]:
    retriever_k = vectorstore.as_retriever(search_kwargs={"k": k})
    rag_chain_k = create_retrieval_chain(retriever_k, question_answer_chain)

    predictions_k = concurrent_predict_retrieval_chain(rag_chain_k, ds)

    ds_k = ds.map(lambda example: {
        "response": predictions_k[example["user_input"]]["answer"],
        "retrieved_contexts": predictions_k[example["user_input"]]["context"],
    })

    results_k = evaluate(dataset=EvaluationDataset.from_hf_dataset(ds_k), metrics=metrics)
    df_k = results_k.to_pandas()

    # Use one file per k so a previous file isn't overwritten before Neptune uploads it.
    df_k.to_csv(f"eval_results_k{k}.csv", index=False)
    run[f"eval/eval_data/{k}"].upload(f"eval_results_k{k}.csv")

    log_detailed_metrics(df_k, run, k)

run.stop()
Once the evaluation is complete (this should take between 5 and 10 minutes), the script displays “Shutting down background jobs” and then “Done!” when all data has been synced to Neptune.
Results overview
Let’s take a look at the results. Navigate to the Charts tab. The graphs all share a common x-axis labeled “step.” The evaluations for k = [1, 3, 5] are recorded as steps [0, 1, 2].
Looking at the overall metrics, we can observe that increasing k has improved most metrics. Factual correctness decreases by a small amount. Additionally, noise sensitivity, where a lower value is preferable, increases. This is expected since increasing k will lead to more irrelevant chunks being included in the context. However, as both context recall and answer semantic similarity have gone up, this seems to be a worthy tradeoff.
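If you prefer working with the raw numbers, the logged series can also be fetched programmatically. Here is a rough sketch, assuming the run has finished syncing and using “RAG-1” as a placeholder for your run’s ID:

# Reopen the finished run in read-only mode ("RAG-1" is a placeholder run ID).
run = neptune.init_run(with_id="RAG-1", mode="read-only")

# Fetch each dataset-level average; the rows correspond to the steps for k = 1, 3, 5.
for m in metrics:
    series = run[f"eval/overall/{m.name}"].fetch_values()
    print(m.name, series["value"].tolist())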
Step 3: Iterate further
From here on, there are numerous possibilities for further experimentation, for example:
- Trying different chunking strategies, such as semantic chunking, which determines the breakpoints between chunks based on semantic similarity rather than strict token counts.
- Leveraging hybrid search, which combines keyword search algorithms like BM25 and semantic search with embeddings (see the sketch after this list).
- Trying other models that excel at question-answering tasks, like the Anthropic models, which are also available through LangChain.
- Adding support components for dialogue systems, such as chat history.
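As an illustration of the hybrid search idea referenced above, here is a minimal sketch that combines a BM25 retriever with our existing Chroma retriever (it assumes the rank_bm25 package is installed and reuses the splits, vectorstore, and question_answer_chain objects from earlier; the weights and k values are arbitrary examples):

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# Keyword-based retriever over the same chunks (requires the rank_bm25 package).
bm25_retriever = BM25Retriever.from_documents(splits)
bm25_retriever.k = 3

# Semantic retriever backed by the existing Chroma vector store.
semantic_retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# Weighted combination of both retrievers (results are fused by reciprocal rank).
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, semantic_retriever],
    weights=[0.5, 0.5],
)

hybrid_rag_chain = create_retrieval_chain(hybrid_retriever, question_answer_chain)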
Looking ahead
In the three parts of this tutorial, we’ve used LangChain to build a RAG system based on OpenAI models and the Chroma vector database, evaluated it with Ragas, and analyzed our progress with Neptune. Along the way, we explored essential foundations of developing performant RAG systems, such as:
- How to efficiently chunk, store, and retrieve data to ensure our RAG system consistently delivers relevant and accurate responses to user queries.
- How to generate an evaluation dataset for our particular RAG chain and use RAG-specific metrics like faithfulness and factual correctness to evaluate it.
- How Neptune makes it easy to track, visualize, and analyze RAG system performance, allowing us to take a systematic approach when iteratively improving our application.
As we saw at the end of part 3, we’ve barely scratched the surface when it comes to improving retrieval performance and response quality. Using the triplet of tools we introduced and our evaluation setup, any new technique or change applied to the RAG system can be assessed and compared with alternative configurations. This allows us to confidently assess whether a modification improves performance and detect unwanted side effects.