Getting nan(not a number) when evaluating faithfulness #733
Comments
Funny, because most of my faithfulness scores are actually just NaNs, even though I have a valid answer with claims. |
Hey @Hungreeee @aldrinjenson Which models are you guys using underneath? |
Also, could you expand on this? @aldrinjenson |
I am primarily using GPT-3.5 Turbo based models, both the normal one and the 16k one. The pattern I noticed is that it usually comes up when asking questions which are out of context, i.e. about data that is not present inside the chunks. |
I see @aldrinjenson, if you have an example of this it would be helpful for me to reproduce the issue. In situations like this, we chose to give the score as NaN because it's not that the answer is unfaithful; it simply can't be judged against the context, so assigning it a numeric score wouldn't be right. |
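For readers skimming this thread, a minimal sketch of how such a NaN shows up once the result is converted with .to_pandas(), and how to separate unscored rows from scored ones (the data below is made up; I'm assuming the metric column is named faithfulness):

```python
import pandas as pd

# Made-up results: NaN marks a sample the metric could not score
# (e.g. an out-of-context question), not a score of zero.
df = pd.DataFrame({
    "question": ["in-context question", "out-of-context question"],
    "faithfulness": [0.92, float("nan")],
})

# Split scored and unscored rows so the unscored ones stay visible;
# note that pandas' mean() would silently skip the NaN rows anyway.
scored = df[df["faithfulness"].notna()]
unscored = df[df["faithfulness"].isna()]
print(scored["faithfulness"].mean())
print(len(unscored), "sample(s) could not be scored")
```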
Hi, |
Hello, I am also getting the faithfulness score shown as NaN when I use the RAGAS evaluation metric with GPT-3.5 Turbo models, including both the standard and 16k versions. This happens even when the questions I ask are related to the content of the uploaded document and its chunks. Can you suggest a way to fix this? |
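One workaround that comes up later in this thread is to hand the metric a stronger judge model. A minimal sketch, assuming an OpenAI API key is configured and gpt-4 is available to you (the dataset columns mirror the reproduction below):

```python
from datasets import Dataset
from langchain_openai import ChatOpenAI
from ragas import evaluate
from ragas.metrics import faithfulness

dataset = Dataset.from_dict({
    "question": ["When was the first super bowl?"],
    "answer": ["The first superbowl was held on Jan 15, 1967"],
    "contexts": [[
        "The First AFL-NFL World Championship Game was played on January 15, 1967."
    ]],
})

# A GPT-4-class judge reportedly follows the metric's JSON-formatted
# prompts more reliably than GPT-3.5, which is what avoids the NaN.
result = evaluate(
    dataset=dataset,
    metrics=[faithfulness],
    llm=ChatOpenAI(model="gpt-4"),
)
print(result.to_pandas())
```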
I have been experiencing the same issue with gpt-3.5-turbo-instruct. I added a callback to capture the request and response from the LLM, and I discovered that the LLM is returning multiple questions, answers, and statements in addition to the expected one. Here is an example. The following code:

```python
from langchain_core.callbacks import BaseCallbackHandler

class TestCallback(BaseCallbackHandler):
    def on_llm_start(self, serialized, prompts, **kwargs):
        print(f"Prompts: {prompts}\n\n")

    def on_llm_end(self, response, **kwargs):
        print(f"Response: {response}\n\n")

from datasets import Dataset

data_samples = {
    'question': [
        'When was the first super bowl?',
        'Who won the most super bowls?'
    ],
    'answer': [
        'The first superbowl was held on Jan 15, 1967',
        'The most super bowls have been won by The New England Patriots'
    ],
    'contexts': [
        [
            'The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles,'
        ],
        [
            'The Green Bay Packers...Green Bay, Wisconsin.',
            'The Packers compete...Football Conference'
        ]
    ],
}

dataset = Dataset.from_dict(data_samples)
df_dataset = dataset.to_pandas()
df_dataset.head()

from langchain_openai import AzureOpenAI, AzureOpenAIEmbeddings
from ragas import evaluate, RunConfig
from ragas.metrics import faithfulness

score = evaluate(
    dataset=dataset,
    metrics=[faithfulness],
    llm=AzureOpenAI(
        deployment_name="gpt-35-turbo"
    ),
    embeddings=AzureOpenAIEmbeddings(
        azure_deployment="text-embedding-ada-002"
    ),
    raise_exceptions=True,
    callbacks=[TestCallback()],
    is_async=True,
    run_config=RunConfig(
        timeout=10,
        max_retries=10,
        max_wait=60,
        max_workers=1
    )
)
```
score.to_pandas() is generating NaN for faithfulness, and the callback outputs the following logs:
Note that we are not directly using OpenAI, but an Azure deployment instead, because of our company restrictions. However, gpt-3.5-turbo-instruct is invoked under the hood. Edit: Using AzureChatOpenAI instead of AzureOpenAI seems to work much better. |
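For anyone on Azure hitting the same thing, a minimal sketch of the swap described in that edit; the deployment name, endpoint and API version below are placeholders for your own Azure resources, and `dataset` is the one defined in the snippet above:

```python
from langchain_openai import AzureChatOpenAI
from ragas import evaluate
from ragas.metrics import faithfulness

# Chat-completions wrapper instead of the legacy completions wrapper;
# the chat endpoint reportedly sticks to the JSON format that the
# faithfulness prompts ask for more reliably.
llm = AzureChatOpenAI(
    azure_deployment="gpt-35-turbo",                             # placeholder
    azure_endpoint="https://<your-resource>.openai.azure.com/",  # placeholder
    api_version="2024-02-01",                                    # placeholder
)

score = evaluate(dataset=dataset, metrics=[faithfulness], llm=llm)
print(score.to_pandas())
```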
Hello, I have met the same problem. Does anybody have an update about the root cause? Please provide suggestions on how to resolve this. |
Hi @RinneHan, I've noticed that this issue is rare when using GPT-4. It occurs mostly with GPT-3.5 Turbo. |
@aldrinjenson is right, it does depend on which model you use and the JSON support for that model. Which model are you using, @RinneHan? |
I am experiencing the same issue using a local Llama 3 model, even when the context does contain the answer. Any suggestions as to why faithfulness returns NaN? |
@parham-box it's mostly because of the lack of JSON following in those models. How large were the models you used? |
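To make the "JSON following" point concrete, a rough sketch of the kind of check you can run on the raw completions captured by a callback like the one earlier in this thread; the two strings are invented examples, and the exact schema ragas expects can differ between versions:

```python
import json

# Invented raw completions: one parses cleanly, one trails off into extra
# questions/answers the way smaller models often do.
good = '{"statements": ["The first Super Bowl was played on January 15, 1967."]}'
bad = '{"statements": ["..."]} Here is another question: Who won? Answer: ...'

for raw in (good, bad):
    try:
        json.loads(raw)
        print("parses as JSON")
    except json.JSONDecodeError:
        # If the judge's output cannot be parsed, the metric has nothing
        # to score and that sample comes back as NaN.
        print("not valid JSON -> likely a NaN for this sample")
```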
Any solution? |
@Senthselvi with #1232 there will be improvements, but this also depends on which model you are using. |
Hi,
I'm trying to evaluate faithfulness by giving some out-of-context questions and I was getting the score as NaN (not a number).
I took a look at the code and found that this line is causing the issue:
Why is it present? What does it mean?
Ragas version: 0.1.3
Python version: 3.11.4
Code to Reproduce
Ask questions out of context in the RAG pipeline. Occasionally the faithfulness score comes back as NaN.
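A minimal sketch of that reproduction; the model choice and API setup are assumptions, and the only important part is that the answer talks about data that is not present in the supplied context:

```python
from datasets import Dataset
from langchain_openai import ChatOpenAI
from ragas import evaluate
from ragas.metrics import faithfulness

# The answer makes a claim the context cannot support, so the judge model
# has nothing to verify against the chunks.
dataset = Dataset.from_dict({
    "question": ["What is the capital of Australia?"],
    "answer": ["The capital of Australia is Canberra."],
    "contexts": [[
        "The First AFL-NFL World Championship Game was played on January 15, 1967."
    ]],
})

result = evaluate(
    dataset=dataset,
    metrics=[faithfulness],
    llm=ChatOpenAI(model="gpt-3.5-turbo"),
)
print(result.to_pandas()["faithfulness"])  # occasionally NaN in this setup
```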