LLM Eval Overview
Reading notes on two blog posts: (1) "Evaluating Large Language Model (LLM) systems: Metrics, challenges, and best practices" and (2) "LLM Evaluation Metrics: The Ultimate LLM Evaluation Guide".
LLM system evaluation
Definition
The process of evaluating an LLM-based system that integrates the LLM into a broader application or workflow.
Some of the most widely recognized evaluation frameworks
Azure AI Studio Evaluation (Microsoft)
Prompt Flow (Microsoft)
Weights & Biases
LangSmith (LangChain)
Vertex AI Studio (Google)
DeepEval (Confident AI)
Categories of scorers
Statistical Scorers
Characteristics
Lack reasoning and semantic understanding
Not accurate, but reliable (consistent scores)
Methods
BLEU (Bilingual Evaluation Understudy)
ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
Levenshtein distance
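These scorers operate purely on token overlap and edit distance. A minimal sketch in Python, assuming the nltk and rouge-score packages are installed (the example strings are made up):

```python
# Sketch of the statistical scorers listed above; example strings are made up.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "the cat sat on the mat"
candidate = "the cat is sitting on the mat"

# BLEU: n-gram precision of the candidate against the reference.
bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE: recall-oriented n-gram / longest-common-subsequence overlap.
rouge = rouge_scorer.RougeScorer(["rouge1", "rougeL"]).score(reference, candidate)

# Levenshtein distance: minimum number of single-character edits.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

print(bleu, rouge["rougeL"].fmeasure, levenshtein(reference, candidate))
```

None of these scores involve any model or reasoning, which is exactly why they are consistent but often inaccurate for open-ended LLM output.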
Model-Based Scorers
Non-LLM Based
Characteristics
Less reliable than statistical scorers, but more accurate
Still not accurate enough, due to limited reasoning ability
Natural Language Inference models
An NLP classification model that classifies whether an LLM output is logically consistent (entailment), contradictory (contradiction), or unrelated (neutral) with respect to a given reference text.
struggle with accuracy when processing long texts
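A minimal sketch of such an NLI-based check, assuming the Hugging Face transformers package and the public roberta-large-mnli checkpoint (any MNLI-style model would work similarly; the example strings are made up):

```python
# NLI-based consistency check: premise = reference text, hypothesis = LLM output.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

reference = "The invoice was paid on March 3rd."          # premise
llm_output = "The payment was completed in early March."  # hypothesis

# Encode the (premise, hypothesis) pair and read off class probabilities.
inputs = tokenizer(reference, llm_output, return_tensors="pt")
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)[0]

# Labels are read from the model config to avoid hard-coding their order
# (entailment / neutral / contradiction for MNLI-style models).
for idx, p in enumerate(probs.tolist()):
    print(f"{model.config.id2label[idx]}: {p:.3f}")
```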
BERTScore scorer
Uses pre-trained language models like BERT to compute the cosine similarity between the contextual embeddings of words in the reference and generated texts
Limitation: relies on contextual embeddings from pre-trained models like BERT, not on actual reasoning about the text
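A short sketch, assuming the bert-score package (pip install bert-score); the candidate/reference strings are made up and the underlying pre-trained model is downloaded on first use:

```python
# BERTScore: embedding-based similarity between candidate and reference texts.
from bert_score import score

candidates = ["The payment was completed in early March."]
references = ["The invoice was paid on March 3rd."]

# Returns precision, recall, and F1 tensors computed from cosine similarity
# of contextual token embeddings.
P, R, F1 = score(candidates, references, lang="en")
print(F1.mean().item())
```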
Choosing Your Evaluation Metrics
1-2 custom metrics (G-Eval or DAG) that are use-case specific
2-3 generic metrics (RAG, agentic, or conversational) that are system-specific (see the DeepEval sketch after these metric lists)
RAG Metrics
Faithfulness
Answer Relevancy
Contextual Precision
Contextual Recall
Contextual Relevancy
Agentic Metrics
Tool Correctness
Task Completion
Fine-Tuning Metrics
Hallucination
Toxicity
Bias
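To make the RAG metrics above concrete, here is a hedged sketch of scoring a single test case with DeepEval (listed among the frameworks earlier). The class and argument names follow DeepEval's documented API at the time of writing and may differ across versions; the test data is invented, and a configured judge model (e.g., an OpenAI API key) is assumed:

```python
# Scoring one RAG test case with DeepEval's Answer Relevancy and Faithfulness
# metrics. Test data below is made up; a judge LLM must be configured.
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is our refund window?",
    actual_output="You can request a refund within 30 days of purchase.",
    retrieval_context=["Refunds are accepted within 30 days of purchase."],
)

metrics = [
    AnswerRelevancyMetric(threshold=0.7),  # is the answer relevant to the input?
    FaithfulnessMetric(threshold=0.7),     # is the answer grounded in the retrieved context?
]

evaluate(test_cases=[test_case], metrics=metrics)
```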
LLM evaluation
Definition
The process of assessing a standalone Large Language Model (LLM) independent of any application or system it is integrated into.
Metrics
General Ability metrics
GLUE Benchmark
General Language Understanding Evaluation
HellaSwag
Evaluates how well an LLM can complete a sentence
TruthfulQA
Measures truthfulness of model responses
MMLU (Massive Multitask Language Understanding)
Evaluates knowledge and problem solving across a wide range of subjects and tasks
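As an illustration of how a general-ability benchmark such as MMLU is typically scored (multiple-choice accuracy), here is a sketch assuming the datasets package and the cais/mmlu dataset layout on the Hugging Face Hub (question, choices, answer index); ask_llm is a hypothetical stand-in for whatever model call you use:

```python
# Multiple-choice accuracy on a slice of MMLU; field names assume the
# "cais/mmlu" dataset layout, and ask_llm is a hypothetical placeholder.
from datasets import load_dataset

LETTERS = "ABCD"

def ask_llm(prompt: str) -> str:
    """Hypothetical placeholder: return the model's answer letter (A-D)."""
    raise NotImplementedError

def mmlu_accuracy(subject: str = "abstract_algebra", limit: int = 50) -> float:
    rows = load_dataset("cais/mmlu", subject, split="test").select(range(limit))
    correct = 0
    for row in rows:
        options = "\n".join(f"{LETTERS[i]}. {c}" for i, c in enumerate(row["choices"]))
        prompt = f"{row['question']}\n{options}\nAnswer with a single letter."
        if ask_llm(prompt).strip().upper().startswith(LETTERS[row["answer"]]):
            correct += 1
    return correct / limit
```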
Responsible AI metrics
Harmful content, regulation, and bias
