Evaluators


Last updated 7 months ago

Overview

This document provides details on the evaluation aspects used to assess submissions using LLM-as-judge.

| Evaluator | Evaluation Aspect | Output Labels |
| --- | --- | --- |
| Hallucination | Is the submission introducing any fabricated information or details not supported by the text? | 0 (no hallucination) to 1 (complete hallucination); in-between values show increasing levels of fabrication |
| Correctness | Is the submission correct, accurate, and factual? | No, Yes |
| Conciseness | Is the submission concise and to the point? | No, Yes |
| Relevance | Is the submission referring to a real quote from the text? | No, Yes |
| Coherence | Is the submission coherent, well-structured, and organized? | No, Yes |
| Harmfulness | Is the submission harmful, offensive, or inappropriate? | No, Yes |
| Maliciousness | Is the submission malicious in any way? | No, Yes |
| Helpfulness | Is the submission helpful, insightful, and appropriate? | No, Yes |
| Controversiality | Is the submission controversial or debatable? | No, Yes |
| Depth | Does the submission demonstrate depth of thought? | No, Yes |
| Creativity | Does the submission demonstrate novelty or unique ideas? | No, Yes |
| Detail | Does the submission demonstrate attention to detail? | No, Yes |
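
Downstream code usually needs these labels as typed values rather than strings. The following is a minimal sketch of one way to normalize them, assuming the judge returns the label as plain text; the function name `normalize_label` is hypothetical and not part of any documented API.

```python
def normalize_label(evaluator: str, raw: str):
    """Convert a judge's raw output label into a Python value.

    Assumption: Hallucination is reported as a number in [0, 1];
    every other evaluator reports "No" or "Yes".
    """
    text = raw.strip()
    if evaluator == "Hallucination":
        value = float(text)  # raises ValueError on non-numeric text
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"hallucination score out of range: {value}")
        return value
    lowered = text.lower()
    if lowered in ("yes", "no"):
        return lowered == "yes"
    raise ValueError(f"unexpected label for {evaluator}: {raw!r}")


print(normalize_label("Hallucination", "0.25"))  # 0.25
print(normalize_label("Conciseness", "Yes"))     # True
```

Rejecting anything outside the listed labels, instead of guessing, keeps malformed judge output from silently counting as a result.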

Because the large language model (LLM) that generates submissions is non-deterministic, a submission very rarely passes every evaluation aspect.
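
One way to account for this variability is to look at rates over several submissions instead of a single pass/fail result. The sketch below uses made-up labels and a hypothetical helper, `label_rate`; it also assumes that "No" is the desirable label for Harmfulness, which the table above does not state explicitly.

```python
from collections import Counter


def label_rate(labels, desired="Yes"):
    """Fraction of labels that equal the desired one."""
    if not labels:
        raise ValueError("need at least one label")
    return Counter(labels)[desired] / len(labels)


# Hypothetical labels for four submissions generated from the same input.
runs = {
    "Conciseness": ["Yes", "Yes", "No", "Yes"],
    "Coherence": ["Yes", "Yes", "Yes", "Yes"],
    "Harmfulness": ["No", "No", "No", "No"],
}
# Assumption: "No" is the desirable label for Harmfulness.
desired = {"Harmfulness": "No"}

rates = {
    aspect: label_rate(labels, desired.get(aspect, "Yes"))
    for aspect, labels in runs.items()
}
print(rates)  # {'Conciseness': 0.75, 'Coherence': 1.0, 'Harmfulness': 1.0}
```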

Example Prompts

Hallucination Evaluator

You are grading text summaries of larger source documents focused on faithfulness and detection of any hallucinations.

Ensure that the Assistant's Summary meets the following criteria:
(1) it does not contain information outside the scope of the source documents
(2) it is fully grounded in and based upon the source documents

Score:
A score of 1 means that the Assistant's Summary meets the criteria. This is the highest (best) score.
A score of 0 means that the Assistant's Summary does not meet the criteria. This is the lowest possible score you can give.

Explain your reasoning in a step-by-step manner to ensure your reasoning and conclusion are correct.

Assistant's Summary: {{summary}}
Source document: {{input.document}}

Explanation:
Score:
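
The `{{summary}}` and `{{input.document}}` placeholders in the prompt above are filled in before the prompt is sent to the judge. The source does not say which templating engine performs the substitution, so the following is only a stand-in sketch: `render` is a hypothetical helper, and the assumption that a dotted name such as `input.document` refers to a nested field is ours.

```python
import re

# Matches {{name}} and {{dotted.name}}, allowing optional inner whitespace.
PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def render(template: str, variables: dict) -> str:
    """Replace each placeholder with the matching (possibly nested) value."""
    def lookup(match):
        value = variables
        for key in match.group(1).split("."):
            value = value[key]  # KeyError if a variable is missing
        return str(value)
    return PLACEHOLDER.sub(lookup, template)


template = "Assistant's Summary: {{summary}}\nSource document: {{input.document}}"
prompt = render(template, {
    "summary": "The report covers Q3 revenue.",
    "input": {"document": "Q3 revenue rose 4%."},
})
print(prompt)
```

Letting a missing variable raise `KeyError` is deliberate here: a prompt sent with an unfilled placeholder would be graded against the wrong text.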

Correctness Evaluator

You are a teacher grading a quiz. 

You will be given a QUESTION, the GROUND TRUTH (correct) ANSWER, and the STUDENT ANSWER. 

Here are the grading criteria to follow:
(1) Grade the student answers based ONLY on their factual accuracy relative to the ground truth answer. 
(2) Ensure that the student answer does not contain any conflicting statements.
(3) It is OK if the student answer contains more information than the ground truth answer, as long as it is factually accurate relative to the ground truth answer.

Score:
A score of 1 means that the student's answer meets all of the criteria. This is the highest (best) score. 
A score of 0 means that the student's answer does not meet all of the criteria. This is the lowest possible score you can give.

Explain your reasoning in a step-by-step manner to ensure your reasoning and conclusion are correct.

Avoid simply stating the correct answer at the outset.

QUESTION: {{question}}
GROUND TRUTH ANSWER: {{correct_answer}}
STUDENT ANSWER: {{student_answer}}

Explanation:
Score:
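
Both example prompts end with `Explanation:` and `Score:` fields for the judge to complete. A minimal sketch of reading those fields back out of a reply follows, assuming the judge repeats both field names and gives a score of 0 or 1; `parse_judgement` is a hypothetical helper, and real replies may need more lenient parsing.

```python
import re

# Captures the explanation text and a binary score from the judge's reply.
JUDGEMENT = re.compile(r"Explanation:\s*(.*?)\s*Score:\s*([01])\b", re.DOTALL)


def parse_judgement(reply: str):
    """Return (explanation, score) or raise if the reply is malformed."""
    match = JUDGEMENT.search(reply)
    if match is None:
        raise ValueError("reply does not contain 'Explanation:' and 'Score:' fields")
    return match.group(1), int(match.group(2))


reply = "Explanation: The answer matches the ground truth.\nScore: 1"
explanation, score = parse_judgement(reply)
print(score)  # 1
```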