Feat: Opik Integration #1256

Merged (10 commits) on Sep 7, 2024
1 change: 1 addition & 0 deletions docs/getstarted/monitoring.md
@@ -18,6 +18,7 @@ In addition, you can use the RAG metrics with other LLM observability tools like
- [Phoenix (Arize)](../howtos/integrations/ragas-arize.ipynb)
- [Langfuse](../howtos/integrations/langfuse.ipynb)
- [OpenLayer](https://openlayer.com/)
- [Opik](../howtos/integrations/opik.ipynb)

These tools can provide model-based feedback about various aspects of your application, such as the ones mentioned below:

6 changes: 3 additions & 3 deletions docs/howtos/applications/tracing.md
@@ -1,7 +1,7 @@

# Explainability through Logging and tracing

Logging and tracing results from llm are important for any language model-based application. This is a tutorial on how to do tracing with Ragas. Ragas provides `callbacks` functionality which allows you to hook various tracers like Langmsith, wandb, etc easily. In this notebook, I will be using Langmith for tracing
Logging and tracing results from LLMs is important for any language model-based application. This is a tutorial on how to do tracing with Ragas. Ragas provides `callbacks` functionality which allows you to easily hook in various tracers like LangSmith, wandb, Opik, etc. In this notebook, we will be using LangSmith for tracing.

To set up LangSmith, we need to set a few environment variables. For more information, you can refer to the [docs](https://docs.smith.langchain.com/).

@@ -12,7 +12,7 @@ export LANGCHAIN_API_KEY=<your-api-key>
export LANGCHAIN_PROJECT=<your-project> # if not specified, defaults to "default"
```

Now we have to import the required tracer from langchain, here we are using `LangChainTracer` but you can similarly use any tracer supported by langchain like [WandbTracer](https://python.langchain.com/docs/integrations/providers/wandb_tracing)
Now we have to import the required tracer from LangChain. Here we are using `LangChainTracer`, but you can similarly use any tracer supported by LangChain, such as [WandbTracer](https://python.langchain.com/docs/integrations/providers/wandb_tracing) or [OpikTracer](https://comet.com/docs/opik/tracing/integrations/ragas?utm_source=ragas&utm_medium=github&utm_campaign=opik&utm_content=tracing_how_to).

```{code-block} python
# langsmith
@@ -38,4 +38,4 @@ evaluate(dataset["train"],metrics=[context_precision],callbacks=[tracer])
![](./../../_static/imgs/trace-langsmith.png)


You can also write your own custom callbacks using langchain’s `BaseCallbackHandler`, refer [here](https://www.notion.so/Docs-logging-and-tracing-6f21cde9b3cb4d499526f48fd615585d?pvs=21) to read more about it.
You can also write your own custom callbacks using LangChain's `BaseCallbackHandler`; refer [here](https://www.notion.so/Docs-logging-and-tracing-6f21cde9b3cb4d499526f48fd615585d?pvs=21) to read more about it.
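
As a rough illustration (not part of this diff), a custom handler built on LangChain's `BaseCallbackHandler` might look like the sketch below; `RagasLoggingHandler` is a hypothetical name and the `print` calls stand in for whatever logging backend you use:

```python
# Minimal sketch of a custom callback handler for ragas evaluations.
from langchain_core.callbacks import BaseCallbackHandler


class RagasLoggingHandler(BaseCallbackHandler):
    def on_chain_start(self, serialized, inputs, **kwargs):
        # Called for every chain ragas starts (the evaluation, each row, each metric).
        print(f"chain started: inputs={inputs}")

    def on_chain_end(self, outputs, **kwargs):
        # Called when a chain finishes with its outputs.
        print(f"chain finished: outputs={outputs}")


# It can then be passed to evaluate() like any other tracer:
# evaluate(dataset["train"], metrics=[context_precision], callbacks=[RagasLoggingHandler()])
```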
1 change: 1 addition & 0 deletions docs/howtos/integrations/index.md
@@ -17,4 +17,5 @@ tonic-validate.ipynb
haystack.ipynb
openlayer.ipynb
helicone.ipynb
opik.ipynb
:::
328 changes: 328 additions & 0 deletions docs/howtos/integrations/opik.ipynb
@@ -0,0 +1,328 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Opik by Comet\n",
"\n",
"In this notebook, we will showcase how to use Opik with Ragas for monitoring and evaluation of RAG (Retrieval-Augmented Generation) pipelines.\n",
"\n",
"There are two main ways to use Opik with Ragas:\n",
"\n",
"1. Using Ragas metrics to score traces\n",
"2. Using the Ragas `evaluate` function to score a dataset\n",
"\n",
"## Setup\n",
"\n",
"[Comet](https://www.comet.com/site?utm_medium=github&utm_source=ragas&utm_campaign=opik) provides a hosted version of the Opik platform, [simply create an account](https://www.comet.com/signup?from=llm&utm_medium=github&utm_source=ragas&utm_campaign=opik) and grab you API Key.\n",
"\n",
"> You can also run the Opik platform locally, see the [installation guide](https://www.comet.com/docs/opik/self-host/self_hosting_opik?utm_medium=github&utm_source=ragas&utm_campaign=opik/) for more information."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import getpass\n",
"\n",
"os.environ[\"OPIK_API_KEY\"] = getpass.getpass(\"Opik API Key: \")\n",
"os.environ[\"OPIK_WORKSPACE\"] = input(\"Comet workspace (often the same as your username): \")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you are running the Opik platform locally, simply set:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# import os\n",
"# os.environ[\"OPIK_URL_OVERRIDE\"] = \"http://localhost:5173/api\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Preparing our environment\n",
"\n",
"First, we will install the necessary libraries, configure the OpenAI API key and create a new Opik dataset."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"%pip install opik --quiet\n",
"\n",
"import os\n",
"import getpass\n",
"\n",
"os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"Enter your OpenAI API key: \")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Integrating Opik with Ragas\n",
"\n",
"### Using Ragas metrics to score traces\n",
"\n",
"Ragas provides a set of metrics that can be used to evaluate the quality of a RAG pipeline, including but not limited to: `answer_relevancy`, `answer_similarity`, `answer_correctness`, `context_precision`, `context_recall`, `context_entity_recall`, `summarization_score`. You can find a full list of metrics in the [Ragas documentation](https://docs.ragas.io/en/latest/references/metrics.html#).\n",
"\n",
"These metrics can be computed on the fly and logged to traces or spans in Opik. For this example, we will start by creating a simple RAG pipeline and then scoring it using the `answer_relevancy` metric.\n",
"\n",
"#### Create the Ragas metric\n",
"\n",
"In order to use the Ragas metric without using the `evaluate` function, you need to initialize the metric with a `RunConfig` object and an LLM provider. For this example, we will use LangChain as the LLM provider with the Opik tracer enabled.\n",
"\n",
"We will first start by initializing the Ragas metric:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Import the metric\n",
"from ragas.metrics import AnswerRelevancy\n",
"\n",
"# Import some additional dependencies\n",
"from langchain_openai.chat_models import ChatOpenAI\n",
"from langchain_openai.embeddings import OpenAIEmbeddings\n",
"from ragas.llms import LangchainLLMWrapper\n",
"from ragas.embeddings import LangchainEmbeddingsWrapper\n",
"\n",
"# Initialize the Ragas metric\n",
"llm = LangchainLLMWrapper(ChatOpenAI())\n",
"emb = LangchainEmbeddingsWrapper(OpenAIEmbeddings())\n",
"\n",
"answer_relevancy_metric = AnswerRelevancy(llm=llm, embeddings=emb)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once the metric is initialized, you can use it to score a sample question. Given that the metric scoring is done asynchronously, you need to use the `asyncio` library to run the scoring function."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# Run this cell first if you are running this in a Jupyter notebook\n",
"import nest_asyncio\n",
"\n",
"nest_asyncio.apply()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Answer Relevancy score: 0.9616931041269692\n"
]
}
],
"source": [
"import asyncio\n",
"from ragas.integrations.opik import OpikTracer\n",
"\n",
"# Define the scoring function\n",
"def compute_metric(opik_tracer, metric, row):\n",
" async def get_score(opik_tracer, metric, row):\n",
" score = await metric.ascore(row, callbacks=[opik_tracer])\n",
" return score\n",
"\n",
" # Run the async function using the current event loop\n",
" loop = asyncio.get_event_loop()\n",
" result = loop.run_until_complete(get_score(opik_tracer, metric, row))\n",
" return result\n",
"\n",
"# Score a simple example\n",
"row = {\n",
" \"question\": \"What is the capital of France?\",\n",
" \"answer\": \"Paris\",\n",
" \"contexts\": [\"Paris is the capital of France.\", \"Paris is in France.\"]\n",
"}\n",
"\n",
"opik_tracer = OpikTracer()\n",
"score = compute_metric(opik_tracer, answer_relevancy_metric, row)\n",
"print(\"Answer Relevancy score:\", score)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you now navigate to Opik, you will be able to see that a new trace has been created in the `Default Project` project.\n",
"\n",
"#### Score traces\n",
"\n",
"You can score traces by using the `get_current_trace` function to get the current trace and then calling the `log_feedback_score` function.\n",
"\n",
"The advantage of this approach is that the scoring span is added to the trace allowing for a more fine-grained analysis of the RAG pipeline. It will however run the Ragas metric calculation synchronously and so might not be suitable for production use-cases."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Paris'"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from opik import track\n",
"from opik.opik_context import get_current_trace\n",
"\n",
"@track\n",
"def retrieve_contexts(question):\n",
" # Define the retrieval function, in this case we will hard code the contexts\n",
" return [\"Paris is the capital of France.\", \"Paris is in France.\"]\n",
"\n",
"@track\n",
"def answer_question(question, contexts):\n",
" # Define the answer function, in this case we will hard code the answer\n",
" return \"Paris\"\n",
"\n",
"@track(name=\"Compute Ragas metric score\", capture_input=False)\n",
"def compute_rag_score(answer_relevancy_metric, question, answer, contexts):\n",
" # Define the score function\n",
" row = {\"question\": question, \"answer\": answer, \"contexts\": contexts}\n",
" score = compute_metric(answer_relevancy_metric, row)\n",
" return score\n",
"\n",
"@track\n",
"def rag_pipeline(question):\n",
" # Define the pipeline\n",
" contexts = retrieve_contexts(question)\n",
" answer = answer_question(question, contexts)\n",
"\n",
" trace = get_current_trace()\n",
" score = compute_rag_score(answer_relevancy_metric, question, answer, contexts)\n",
" trace.log_feedback_score(\"answer_relevancy\", round(score, 4), category_name=\"ragas\")\n",
" \n",
" return answer\n",
"\n",
"rag_pipeline(\"What is the capital of France?\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Evaluating datasets\n",
"\n",
"If you looking at evaluating a dataset, you can use the Ragas `evaluate` function. When using this function, the Ragas library will compute the metrics on all the rows of the dataset and return a summary of the results.\n",
"\n",
"You can use the OpikTracer callback to log the results of the evaluation to the Opik platform. For this we will configure the OpikTracer"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"passing column names as 'ground_truths' is deprecated and will be removed in the next version, please use 'ground_truth' instead. Note that `ground_truth` should be of type string and not Sequence[string] like `ground_truths`\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "985d2e27ce8a48daad673666e6e6e953",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Evaluating: 0%| | 0/9 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'context_precision': 1.0000, 'faithfulness': 0.8250, 'answer_relevancy': 0.9755}\n"
]
}
],
"source": [
"from datasets import load_dataset\n",
"from ragas.metrics import context_precision, answer_relevancy, faithfulness\n",
"from ragas import evaluate\n",
"from ragas.integrations.opik import OpikTracer\n",
"\n",
"fiqa_eval = load_dataset(\"explodinggradients/fiqa\", \"ragas_eval\")\n",
"\n",
"opik_tracer_eval = OpikTracer(tags=[\"ragas_eval\"], metadata={\"evaluation_run\": True})\n",
"\n",
"result = evaluate(\n",
" fiqa_eval[\"baseline\"].select(range(3)),\n",
" metrics=[context_precision, faithfulness, answer_relevancy],\n",
" callbacks=[opik_tracer_eval]\n",
")\n",
"\n",
"print(result)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "py312_llm_eval",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
4 changes: 3 additions & 1 deletion src/ragas/evaluation.py
@@ -41,6 +41,8 @@

from ragas.cost import CostCallbackHandler, TokenUsageParser

RAGAS_EVALUATION_CHAIN_NAME = "ragas evaluation"


@track_was_completed
def evaluate(
@@ -230,7 +232,7 @@ def evaluate(
# new evaluation chain
row_run_managers = []
evaluation_rm, evaluation_group_cm = new_group(
name="ragas evaluation", inputs={}, callbacks=callbacks
name=RAGAS_EVALUATION_CHAIN_NAME, inputs={}, callbacks=callbacks
)
for i, sample in enumerate(dataset):
row = t.cast(t.Dict[str, t.Any], sample.dict())
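
Side note (not taken from the PR itself): exposing the chain name as the module-level constant `RAGAS_EVALUATION_CHAIN_NAME` lets integrations such as the Opik tracer recognise the root evaluation run by comparing names instead of re-hard-coding the string. A minimal sketch of how downstream code might use it, where `is_root_evaluation_run` is a hypothetical helper:

```python
# After this PR, the constant is importable from ragas.evaluation.
from ragas.evaluation import RAGAS_EVALUATION_CHAIN_NAME


def is_root_evaluation_run(chain_name: str) -> bool:
    # True only for the top-level "ragas evaluation" chain created by evaluate().
    return chain_name == RAGAS_EVALUATION_CHAIN_NAME
```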