Getting nan(not a number) when evaluating faithfulness #733
Comments
Funny, because most of my faithfulness scores are actually just NaNs, even though I have a valid answer with claims. |
Hey @Hungreeee @aldrinjenson Which models are you guys using underneath? |
Also, could you expand on this? @aldrinjenson |
I am primarily using GPT-3.5 Turbo based models, both the normal one and the 16k one. The pattern I noticed is that it usually comes up when asking questions which are out of context, i.e. about data that is not present inside the chunks. |
I see @aldrinjenson, if you have an example of this it would be helpful for me to reproduce the issue. In situations like this, we chose to give the score as NaN because it's not that the answer is unfaithful; it simply can't be judged against the context, so assigning it a numeric score wouldn't be right. |
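For readers skimming this thread, a minimal sketch of how such a NaN shows up once the result is converted with .to_pandas(), and how to separate unscored rows from scored ones (the data below is made up; I'm assuming the metric column is named faithfulness):

```python
import pandas as pd

# Made-up results: NaN marks a sample the metric could not score
# (e.g. an out-of-context question), not a score of zero.
df = pd.DataFrame({
    "question": ["in-context question", "out-of-context question"],
    "faithfulness": [0.92, float("nan")],
})

# Split scored and unscored rows so the unscored ones stay visible;
# note that pandas' mean() would silently skip the NaN rows anyway.
scored = df[df["faithfulness"].notna()]
unscored = df[df["faithfulness"].isna()]
print(scored["faithfulness"].mean())
print(len(unscored), "sample(s) could not be scored")
```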
Hi, |
Hello, I am also getting the faithfulness score shown as NaN when I use the RAGAS evaluation metric with GPT-3.5 Turbo models, including both the standard and 16k versions. This happens even when the questions I ask are related to the content of the uploaded document and its chunks. Can you suggest a way to fix this? |
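One workaround that comes up later in this thread is to hand the metric a stronger judge model. A minimal sketch, assuming an OpenAI API key is configured and gpt-4 is available to you (the dataset columns mirror the reproduction below):

```python
from datasets import Dataset
from langchain_openai import ChatOpenAI
from ragas import evaluate
from ragas.metrics import faithfulness

dataset = Dataset.from_dict({
    "question": ["When was the first super bowl?"],
    "answer": ["The first superbowl was held on Jan 15, 1967"],
    "contexts": [[
        "The First AFL-NFL World Championship Game was played on January 15, 1967."
    ]],
})

# A GPT-4-class judge reportedly follows the metric's JSON-formatted
# prompts more reliably than GPT-3.5, which is what avoids the NaN.
result = evaluate(
    dataset=dataset,
    metrics=[faithfulness],
    llm=ChatOpenAI(model="gpt-4"),
)
print(result.to_pandas())
```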
I have been experiencing the same issue with gpt-3.5-turbo-instruct. I added a callback to capture the request and response from the LLM, and I discovered that the LLM is returning multiple questions, answers, and statements in addition to the expected one. Here is an example. The following code:

```python
from langchain_core.callbacks import BaseCallbackHandler

class TestCallback(BaseCallbackHandler):
    def on_llm_start(self, serialized, prompts, **kwargs):
        print(f"Prompts: {prompts}\n\n")

    def on_llm_end(self, response, **kwargs):
        print(f"Response: {response}\n\n")

from datasets import Dataset

data_samples = {
    'question': [
        'When was the first super bowl?',
        'Who won the most super bowls?'
    ],
    'answer': [
        'The first superbowl was held on Jan 15, 1967',
        'The most super bowls have been won by The New England Patriots'
    ],
    'contexts': [
        [
            'The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles,'
        ],
        [
            'The Green Bay Packers...Green Bay, Wisconsin.',
            'The Packers compete...Football Conference'
        ]
    ],
}

dataset = Dataset.from_dict(data_samples)
df_dataset = dataset.to_pandas()
df_dataset.head()

from langchain_openai import AzureOpenAI, AzureOpenAIEmbeddings
from ragas import evaluate, RunConfig
from ragas.metrics import faithfulness

score = evaluate(
    dataset=dataset,
    metrics=[faithfulness],
    llm=AzureOpenAI(
        deployment_name="gpt-35-turbo"
    ),
    embeddings=AzureOpenAIEmbeddings(
        azure_deployment="text-embedding-ada-002"
    ),
    raise_exceptions=True,
    callbacks=[TestCallback()],
    is_async=True,
    run_config=RunConfig(
        timeout=10,
        max_retries=10,
        max_wait=60,
        max_workers=1
    )
)
```
score.to_pandas() is generating NaN for faithfulness, and the callback outputs the following logs:
Note that we are not directly using OpenAI, but an Azure deployment instead, because of our company restrictions. However, gpt-3.5-turbo-instruct is invoked under the hood. Edit: Using AzureChatOpenAI instead of AzureOpenAI seems to work much better. |
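For anyone on Azure hitting the same thing, a minimal sketch of the swap described in that edit; the deployment name, endpoint and API version below are placeholders for your own Azure resources, and `dataset` is the one defined in the snippet above:

```python
from langchain_openai import AzureChatOpenAI
from ragas import evaluate
from ragas.metrics import faithfulness

# Chat-completions wrapper instead of the legacy completions wrapper;
# the chat endpoint reportedly sticks to the JSON format that the
# faithfulness prompts ask for more reliably.
llm = AzureChatOpenAI(
    azure_deployment="gpt-35-turbo",                             # placeholder
    azure_endpoint="https://<your-resource>.openai.azure.com/",  # placeholder
    api_version="2024-02-01",                                    # placeholder
)

score = evaluate(dataset=dataset, metrics=[faithfulness], llm=llm)
print(score.to_pandas())
```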
Hello, I have met the same problem. Does anybody have an update about the root cause? Please provide suggestions on how to resolve this. |
Hi @RinneHan, I've noticed that this issue is rare when using GPT-4. It occurs mostly with GPT-3.5 Turbo. |
@aldrinjenson is right, it does depend on which model you use and the JSON support for that model. Which model are you using, @RinneHan? |
I am experiencing the same issue using a local Llama 3 model, even when the context does contain the answer. Any suggestions as to why faithfulness returns NaN? |
@parham-box it's mostly because of the lack of JSON following in those models. How large were the models you used? |
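To make the "JSON following" point concrete, a rough sketch of the kind of check you can run on the raw completions captured by a callback like the one earlier in this thread; the two strings are invented examples, and the exact schema ragas expects can differ between versions:

```python
import json

# Invented raw completions: one parses cleanly, one trails off into extra
# questions/answers the way smaller models often do.
good = '{"statements": ["The first Super Bowl was played on January 15, 1967."]}'
bad = '{"statements": ["..."]} Here is another question: Who won? Answer: ...'

for raw in (good, bad):
    try:
        json.loads(raw)
        print("parses as JSON")
    except json.JSONDecodeError:
        # If the judge's output cannot be parsed, the metric has nothing
        # to score and that sample comes back as NaN.
        print("not valid JSON -> likely a NaN for this sample")
```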
Any solution? |
@Senthselvi with #1232 there will be improvements, but this also depends on which model you are using. |
Hi,
I'm trying to evaluate faithfulness by giving some out-of-context questions and I was getting the score as NaN (not a number).
I took a look at the code and found that this line is causing the issue:
Why is it present? What does it mean?
Ragas version: 0.1.3
Python version: 3.11.4
Code to Reproduce
Ask questions out of context in the RAG pipeline. Occasionally the faithfulness score comes back as NaN.
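A minimal sketch of that reproduction; the model choice and API setup are assumptions, and the only important part is that the answer talks about data that is not present in the supplied context:

```python
from datasets import Dataset
from langchain_openai import ChatOpenAI
from ragas import evaluate
from ragas.metrics import faithfulness

# The answer makes a claim the context cannot support, so the judge model
# has nothing to verify against the chunks.
dataset = Dataset.from_dict({
    "question": ["What is the capital of Australia?"],
    "answer": ["The capital of Australia is Canberra."],
    "contexts": [[
        "The First AFL-NFL World Championship Game was played on January 15, 1967."
    ]],
})

result = evaluate(
    dataset=dataset,
    metrics=[faithfulness],
    llm=ChatOpenAI(model="gpt-3.5-turbo"),
)
print(result.to_pandas()["faithfulness"])  # occasionally NaN in this setup
```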