Feat: Opik Integration #1256

Merged (10 commits) on Sep 7, 2024
1 change: 1 addition & 0 deletions docs/getstarted/monitoring.md
@@ -18,6 +18,7 @@ In addition, you can use the RAG metrics with other LLM observability tools like
- [Phoenix (Arize)](../howtos/integrations/ragas-arize.ipynb)
- [Langfuse](../howtos/integrations/langfuse.ipynb)
- [OpenLayer](https://openlayer.com/)
- [Opik](../howtos/integrations/opik.ipynb)

These tools can provide model-based feedback about various aspects of your application, such as the ones mentioned below:

6 changes: 3 additions & 3 deletions docs/howtos/applications/tracing.md
@@ -1,7 +1,7 @@

# Explainability through Logging and tracing

Logging and tracing results from llm are important for any language model-based application. This is a tutorial on how to do tracing with Ragas. Ragas provides `callbacks` functionality which allows you to hook various tracers like Langmsith, wandb, etc easily. In this notebook, I will be using Langmith for tracing
Logging and tracing results from LLMs is important for any language model-based application. This is a tutorial on how to do tracing with Ragas. Ragas provides `callbacks` functionality which allows you to easily hook in various tracers like LangSmith, wandb, Opik, etc. In this notebook, we will be using LangSmith for tracing.

To set up LangSmith, we need to set a few environment variables. For more information, you can refer to the [docs](https://docs.smith.langchain.com/).

@@ -12,7 +12,7 @@ export LANGCHAIN_API_KEY=<your-api-key>
export LANGCHAIN_PROJECT=<your-project> # if not specified, defaults to "default"
```

Now we have to import the required tracer from langchain, here we are using `LangChainTracer` but you can similarly use any tracer supported by langchain like [WandbTracer](https://python.langchain.com/docs/integrations/providers/wandb_tracing)
Now we have to import the required tracer from LangChain. Here we are using `LangChainTracer`, but you can similarly use any tracer supported by LangChain, such as [WandbTracer](https://python.langchain.com/docs/integrations/providers/wandb_tracing) or [OpikTracer](https://comet.com/docs/opik/tracing/integrations/ragas?utm_source=ragas&utm_medium=github&utm_campaign=opik&utm_content=tracing_how_to).

```{code-block} python
# langsmith
@@ -38,4 +38,4 @@ evaluate(dataset["train"],metrics=[context_precision],callbacks=[tracer])
![](./../../_static/imgs/trace-langsmith.png)


You can also write your own custom callbacks using langchain’s `BaseCallbackHandler`, refer [here](https://www.notion.so/Docs-logging-and-tracing-6f21cde9b3cb4d499526f48fd615585d?pvs=21) to read more about it.
You can also write your own custom callbacks using LangChain's `BaseCallbackHandler`; refer [here](https://www.notion.so/Docs-logging-and-tracing-6f21cde9b3cb4d499526f48fd615585d?pvs=21) to read more about it.
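
As a rough illustration (not part of this diff), a custom handler built on LangChain's `BaseCallbackHandler` might look like the sketch below; `RagasLoggingHandler` is a hypothetical name and the `print` calls stand in for whatever logging backend you use:

```python
# Minimal sketch of a custom callback handler for ragas evaluations.
from langchain_core.callbacks import BaseCallbackHandler


class RagasLoggingHandler(BaseCallbackHandler):
    def on_chain_start(self, serialized, inputs, **kwargs):
        # Called for every chain ragas starts (the evaluation, each row, each metric).
        print(f"chain started: inputs={inputs}")

    def on_chain_end(self, outputs, **kwargs):
        # Called when a chain finishes with its outputs.
        print(f"chain finished: outputs={outputs}")


# It can then be passed to evaluate() like any other tracer:
# evaluate(dataset["train"], metrics=[context_precision], callbacks=[RagasLoggingHandler()])
```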
1 change: 1 addition & 0 deletions docs/howtos/integrations/index.md
@@ -17,4 +17,5 @@ tonic-validate.ipynb
haystack.ipynb
openlayer.ipynb
helicone.ipynb
opik.ipynb
:::
328 changes: 328 additions & 0 deletions docs/howtos/integrations/opik.ipynb
@@ -0,0 +1,328 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Opik by Comet\n",
"\n",
"In this notebook, we will showcase how to use Opik with Ragas for monitoring and evaluation of RAG (Retrieval-Augmented Generation) pipelines.\n",
"\n",
"There are two main ways to use Opik with Ragas:\n",
"\n",
"1. Using Ragas metrics to score traces\n",
"2. Using the Ragas `evaluate` function to score a dataset\n",
"\n",
"## Setup\n",
"\n",
"[Comet](https://www.comet.com/site?utm_medium=github&utm_source=ragas&utm_campaign=opik) provides a hosted version of the Opik platform, [simply create an account](https://www.comet.com/signup?from=llm&utm_medium=github&utm_source=ragas&utm_campaign=opik) and grab you API Key.\n",
"\n",
"> You can also run the Opik platform locally, see the [installation guide](https://www.comet.com/docs/opik/self-host/self_hosting_opik?utm_medium=github&utm_source=ragas&utm_campaign=opik/) for more information."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import getpass\n",
"\n",
"os.environ[\"OPIK_API_KEY\"] = getpass.getpass(\"Opik API Key: \")\n",
"os.environ[\"OPIK_WORKSPACE\"] = input(\"Comet workspace (often the same as your username): \")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you are running the Opik platform locally, simply set:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# import os\n",
"# os.environ[\"OPIK_URL_OVERRIDE\"] = \"http://localhost:5173/api\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Preparing our environment\n",
"\n",
"First, we will install the necessary libraries, configure the OpenAI API key and create a new Opik dataset."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"%pip install opik --quiet\n",
"\n",
"import os\n",
"import getpass\n",
"\n",
"os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"Enter your OpenAI API key: \")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Integrating Opik with Ragas\n",
"\n",
"### Using Ragas metrics to score traces\n",
"\n",
"Ragas provides a set of metrics that can be used to evaluate the quality of a RAG pipeline, including but not limited to: `answer_relevancy`, `answer_similarity`, `answer_correctness`, `context_precision`, `context_recall`, `context_entity_recall`, `summarization_score`. You can find a full list of metrics in the [Ragas documentation](https://docs.ragas.io/en/latest/references/metrics.html#).\n",
"\n",
"These metrics can be computed on the fly and logged to traces or spans in Opik. For this example, we will start by creating a simple RAG pipeline and then scoring it using the `answer_relevancy` metric.\n",
"\n",
"#### Create the Ragas metric\n",
"\n",
"In order to use the Ragas metric without using the `evaluate` function, you need to initialize the metric with a `RunConfig` object and an LLM provider. For this example, we will use LangChain as the LLM provider with the Opik tracer enabled.\n",
"\n",
"We will first start by initializing the Ragas metric:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Import the metric\n",
"from ragas.metrics import AnswerRelevancy\n",
"\n",
"# Import some additional dependencies\n",
"from langchain_openai.chat_models import ChatOpenAI\n",
"from langchain_openai.embeddings import OpenAIEmbeddings\n",
"from ragas.llms import LangchainLLMWrapper\n",
"from ragas.embeddings import LangchainEmbeddingsWrapper\n",
"\n",
"# Initialize the Ragas metric\n",
"llm = LangchainLLMWrapper(ChatOpenAI())\n",
"emb = LangchainEmbeddingsWrapper(OpenAIEmbeddings())\n",
"\n",
"answer_relevancy_metric = AnswerRelevancy(llm=llm, embeddings=emb)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once the metric is initialized, you can use it to score a sample question. Given that the metric scoring is done asynchronously, you need to use the `asyncio` library to run the scoring function."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# Run this cell first if you are running this in a Jupyter notebook\n",
"import nest_asyncio\n",
"\n",
"nest_asyncio.apply()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Answer Relevancy score: 0.9616931041269692\n"
]
}
],
"source": [
"import asyncio\n",
"from ragas.integrations.opik import OpikTracer\n",
"\n",
"# Define the scoring function\n",
"def compute_metric(opik_tracer, metric, row):\n",
" async def get_score(opik_tracer, metric, row):\n",
" score = await metric.ascore(row, callbacks=[opik_tracer])\n",
" return score\n",
"\n",
" # Run the async function using the current event loop\n",
" loop = asyncio.get_event_loop()\n",
" result = loop.run_until_complete(get_score(opik_tracer, metric, row))\n",
" return result\n",
"\n",
"# Score a simple example\n",
"row = {\n",
" \"question\": \"What is the capital of France?\",\n",
" \"answer\": \"Paris\",\n",
" \"contexts\": [\"Paris is the capital of France.\", \"Paris is in France.\"]\n",
"}\n",
"\n",
"opik_tracer = OpikTracer()\n",
"score = compute_metric(opik_tracer, answer_relevancy_metric, row)\n",
"print(\"Answer Relevancy score:\", score)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you now navigate to Opik, you will be able to see that a new trace has been created in the `Default Project` project.\n",
"\n",
"#### Score traces\n",
"\n",
"You can score traces by using the `get_current_trace` function to get the current trace and then calling the `log_feedback_score` function.\n",
"\n",
"The advantage of this approach is that the scoring span is added to the trace allowing for a more fine-grained analysis of the RAG pipeline. It will however run the Ragas metric calculation synchronously and so might not be suitable for production use-cases."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Paris'"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from opik import track\n",
"from opik.opik_context import get_current_trace\n",
"\n",
"@track\n",
"def retrieve_contexts(question):\n",
" # Define the retrieval function, in this case we will hard code the contexts\n",
" return [\"Paris is the capital of France.\", \"Paris is in France.\"]\n",
"\n",
"@track\n",
"def answer_question(question, contexts):\n",
" # Define the answer function, in this case we will hard code the answer\n",
" return \"Paris\"\n",
"\n",
"@track(name=\"Compute Ragas metric score\", capture_input=False)\n",
"def compute_rag_score(answer_relevancy_metric, question, answer, contexts):\n",
" # Define the score function\n",
" row = {\"question\": question, \"answer\": answer, \"contexts\": contexts}\n",
" score = compute_metric(answer_relevancy_metric, row)\n",
" return score\n",
"\n",
"@track\n",
"def rag_pipeline(question):\n",
" # Define the pipeline\n",
" contexts = retrieve_contexts(question)\n",
" answer = answer_question(question, contexts)\n",
"\n",
" trace = get_current_trace()\n",
" score = compute_rag_score(answer_relevancy_metric, question, answer, contexts)\n",
" trace.log_feedback_score(\"answer_relevancy\", round(score, 4), category_name=\"ragas\")\n",
" \n",
" return answer\n",
"\n",
"rag_pipeline(\"What is the capital of France?\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Evaluating datasets\n",
"\n",
"If you looking at evaluating a dataset, you can use the Ragas `evaluate` function. When using this function, the Ragas library will compute the metrics on all the rows of the dataset and return a summary of the results.\n",
"\n",
"You can use the OpikTracer callback to log the results of the evaluation to the Opik platform. For this we will configure the OpikTracer"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"passing column names as 'ground_truths' is deprecated and will be removed in the next version, please use 'ground_truth' instead. Note that `ground_truth` should be of type string and not Sequence[string] like `ground_truths`\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "985d2e27ce8a48daad673666e6e6e953",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Evaluating: 0%| | 0/9 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'context_precision': 1.0000, 'faithfulness': 0.8250, 'answer_relevancy': 0.9755}\n"
]
}
],
"source": [
"from datasets import load_dataset\n",
"from ragas.metrics import context_precision, answer_relevancy, faithfulness\n",
"from ragas import evaluate\n",
"from ragas.integrations.opik import OpikTracer\n",
"\n",
"fiqa_eval = load_dataset(\"explodinggradients/fiqa\", \"ragas_eval\")\n",
"\n",
"opik_tracer_eval = OpikTracer(tags=[\"ragas_eval\"], metadata={\"evaluation_run\": True})\n",
"\n",
"result = evaluate(\n",
" fiqa_eval[\"baseline\"].select(range(3)),\n",
" metrics=[context_precision, faithfulness, answer_relevancy],\n",
" callbacks=[opik_tracer_eval]\n",
")\n",
"\n",
"print(result)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "py312_llm_eval",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
4 changes: 3 additions & 1 deletion src/ragas/evaluation.py
@@ -41,6 +41,8 @@

from ragas.cost import CostCallbackHandler, TokenUsageParser

RAGAS_EVALUATION_CHAIN_NAME = "ragas evaluation"


@track_was_completed
def evaluate(
@@ -230,7 +232,7 @@ def evaluate(
# new evaluation chain
row_run_managers = []
evaluation_rm, evaluation_group_cm = new_group(
name="ragas evaluation", inputs={}, callbacks=callbacks
name=RAGAS_EVALUATION_CHAIN_NAME, inputs={}, callbacks=callbacks
)
for i, sample in enumerate(dataset):
row = t.cast(t.Dict[str, t.Any], sample.dict())
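
Side note (not taken from the PR itself): exposing the chain name as the module-level constant `RAGAS_EVALUATION_CHAIN_NAME` lets integrations such as the Opik tracer recognise the root evaluation run by comparing names instead of re-hard-coding the string. A minimal sketch of how downstream code might use it, where `is_root_evaluation_run` is a hypothetical helper:

```python
# After this PR, the constant is importable from ragas.evaluation.
from ragas.evaluation import RAGAS_EVALUATION_CHAIN_NAME


def is_root_evaluation_run(chain_name: str) -> bool:
    # True only for the top-level "ragas evaluation" chain created by evaluate().
    return chain_name == RAGAS_EVALUATION_CHAIN_NAME
```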