Building AI applications using large language models (LLMs) is straightforward for developers since they only need to create a prompt rather than collecting labeled data and training models. However, LLMs are general-purpose and not optimized for specific tasks. While they may perform well in demos with standard prompts, they often struggle with more complex real-world scenarios.
To ensure the quality of LLM outputs, developers need a way to evaluate metrics like relevance, the percentage of hallucinations, and latency. Evaluating the performance allows developers to determine if adjustments to prompts or retrieval strategies have improved the application and by how much.
When you adjust your prompts or retrieval strategy, you will know whether your application has improved and by how much using evaluation. The dataset you are evaluating determines how trustworthy generalizable your evaluation metrics are to production use. A limited dataset could showcase high scores on evaluation metrics, but perform poorly in real-world scenarios.
Evaluating the performance of Large Language Models (LLMs) against defined criteria is a critical step in the prompt engineering cycle. LLM-based grading offers a fast, flexible, and scalable approach suitable for making complex judgments about LLM outputs.
In this example, we use a question containing a customer inquiry and a golden answer, describing the correct way an AI assistant should respond to the question.
The LLM-based grading system would evaluate the AI assistant's response against the golden answer, considering correctness factor.
By comparing the AI assistant's response to the golden answer, the LLM-based grading system can provide a quality score and identify areas for improvement in the AI assistant's performance.
Provide evaluation data
question: A string containing a customer question or inquiry
golden_answer: A string describing the correct way an AI assistant should respond to the question. It should not contain the actual text of the assistant's response.
eval_data = [
# this should be a correct golden_answer
{
"question": "I received a damaged product in my order. How can I get a replacement?",
"golden_answer": "A correct answer should express empathy for the customer’s situation, clearly outline the steps for getting a replacement (such as contacting customer support, providing the order number, and describing the damage), and assure the customer that their issue will be resolved. If the company policy allows, it might also include offering to initiate the replacement process directly."
},
# this should be a correct golden_answer
{
"question": "I was charged twice for my order. How can I get a refund?",
"golden_answer": "A correct answer should apologize for the inconvenience, explain the steps for initiating a refund, and reassure the customer that their issue will be resolved. The instructions should include contacting customer support with the order details and the nature of the duplicate charge. If possible, the assistant should offer to start the refund process directly or escalate the issue."
},
# this should be an incorrect golden_answer
{
"question": "Can you tell me where my order is?",
"golden_answer": "A correct answer should state that unfortunately, once an order ships our system can no longer track its progress and the customer should have to wait patiently for it to arrive."
}
]
Building the Chatbot to Evaluate Answers
The provided code demonstrates how to create an automated system for evaluating the correctness of answers given by a customer support AI assistant using LangChain and the Anthropic Claude model.
The purpose of this system is to automate the evaluation of answers provided by a customer support AI. By using a separate grader AI with a predefined rubric, the script can determine the correctness of the generated answers and provide an overall accuracy score.
grader_prompt_template = ChatPromptTemplate.from_messages([
('system', """You will be provided an answer that an assistant gave to a question, and a rubric that instructs you on what makes the answer correct or incorrect."""),
('user', """Here is the answer that the assistant gave to the question.
<answer>{answer}</answer>
Here is the rubric on what makes the answer correct or incorrect.
<rubric>{rubric}</rubric>
An answer is correct if it entirely meets the rubric criteria, and is otherwise incorrect.
First, think through whether the answer is correct or incorrect based on the rubric inside <thinking></thinking> tags. Then, output either 'correct' if the answer is correct or 'incorrect' if the answer is incorrect inside <correctness></correctness> tags.""")
])
grader_llm = grader_prompt_template | model
customer_support_prompt_template = ChatPromptTemplate.from_messages([
('system', """You will be provided a question that a customer asked, and you need to provide an answer that is helpful and informative.
Please provide a helpful and informative response to the customer's question inside <answer></answer> tags."""),
('user', "{question}")
])
customer_support_llm = customer_support_prompt_template | model
grades = []
for eval_data_item in eval_data:
customer_support_response = customer_support_llm.invoke({"question": eval_data_item["question"]})
grader_response = grader_llm.invoke({"answer": customer_support_response.content, "rubric": eval_data_item["golden_answer"]})
# Extract just the label from the completion (we don't care about the thinking)
match = re.search(r'<correctness>(.*?)</correctness>', grader_response.content, re.DOTALL)
if match:
label = match.group(1).strip()
print(label)
if label == 'correct':
# Correct answer
grades.append(1)
elif label == 'incorrect':
# Incorrect answer
grades.append(0)
else:
# Invalid label
raise ValueError("Invalid label: " + label)
else:
raise ValueError("Did not find <correctness></correctness> tags.")
print(f"correctness: {round(sum(grades) / len(grades), 2)}")