LLM Eval Overview
Reading notes on two blog posts: (1) "Evaluating Large Language Model (LLM) systems: Metrics, challenges, and best practices" and (2) "LLM Evaluation Metrics: The Ultimate LLM Evaluation Guide".
LLM system evaluation
Definition
The process of evaluating an LLM-based system that integrates the LLM into a broader application or workflow.
Some of the most widely recognized evaluation frameworks
Azure AI Studio Evaluation (Microsoft)
Prompt Flow (Microsoft)
Weights & Biases
LangSmith (LangChain)
Vertex AI Studio (Google)
DeepEval (Confident AI)
Categories of scorers
Statistical Scorers
Characteristics
Lack reasoning and semantic understanding
Not accurate, but reliable (consistent scores)
Methods
BLEU (Bilingual Evaluation Understudy)
ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
Levenshtein distance
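These scorers operate purely on token overlap and edit distance. A minimal sketch in Python, assuming the nltk and rouge-score packages are installed (the example strings are made up):

```python
# Sketch of the statistical scorers listed above; example strings are made up.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "the cat sat on the mat"
candidate = "the cat is sitting on the mat"

# BLEU: n-gram precision of the candidate against the reference.
bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE: recall-oriented n-gram / longest-common-subsequence overlap.
rouge = rouge_scorer.RougeScorer(["rouge1", "rougeL"]).score(reference, candidate)

# Levenshtein distance: minimum number of single-character edits.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

print(bleu, rouge["rougeL"].fmeasure, levenshtein(reference, candidate))
```

None of these scores involve any model or reasoning, which is exactly why they are consistent but often inaccurate for open-ended LLM output.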
Model-Based Scorers
Non-LLM Based
Characteristics
Less reliable than statistical scorers, but more accurate
Still not accurate enough, due to limited reasoning ability
Natural Language Inference models
An NLP classification model that classifies whether an LLM output is logically consistent (entailment), contradictory (contradiction), or unrelated (neutral) with respect to a given reference text.
struggle with accuracy when processing long texts
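A minimal sketch of such an NLI-based check, assuming the Hugging Face transformers package and the public roberta-large-mnli checkpoint (any MNLI-style model would work similarly; the example strings are made up):

```python
# NLI-based consistency check: premise = reference text, hypothesis = LLM output.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

reference = "The invoice was paid on March 3rd."          # premise
llm_output = "The payment was completed in early March."  # hypothesis

# Encode the (premise, hypothesis) pair and read off class probabilities.
inputs = tokenizer(reference, llm_output, return_tensors="pt")
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)[0]

# Labels are read from the model config to avoid hard-coding their order
# (entailment / neutral / contradiction for MNLI-style models).
for idx, p in enumerate(probs.tolist()):
    print(f"{model.config.id2label[idx]}: {p:.3f}")
```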
BERTScore scorer
Uses pre-trained language models like BERT to compute the cosine similarity between the contextual embeddings of words in the reference and generated texts
Limitation: relies on contextual embeddings from pre-trained models like BERT, not on actual reasoning about the text
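A short sketch, assuming the bert-score package (pip install bert-score); the candidate/reference strings are made up and the underlying pre-trained model is downloaded on first use:

```python
# BERTScore: embedding-based similarity between candidate and reference texts.
from bert_score import score

candidates = ["The payment was completed in early March."]
references = ["The invoice was paid on March 3rd."]

# Returns precision, recall, and F1 tensors computed from cosine similarity
# of contextual token embeddings.
P, R, F1 = score(candidates, references, lang="en")
print(F1.mean().item())
```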
Choosing Your Evaluation Metrics
1-2 custom metrics (G-Eval or DAG) that are use-case specific
2-3 generic metrics (RAG, agentic, or conversational) that are system-specific (see the DeepEval sketch after these metric lists)
RAG Metrics
Faithfulness
Answer Relevancy
Contextual Precision
Contextual Recall
Contextual Relevancy
Agentic Metrics
Tool Correctness
Task Completion
Fine-Tuning Metrics
Hallucination
Toxicity
Bias
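To make the RAG metrics above concrete, here is a hedged sketch of scoring a single test case with DeepEval (listed among the frameworks earlier). The class and argument names follow DeepEval's documented API at the time of writing and may differ across versions; the test data is invented, and a configured judge model (e.g., an OpenAI API key) is assumed:

```python
# Scoring one RAG test case with DeepEval's Answer Relevancy and Faithfulness
# metrics. Test data below is made up; a judge LLM must be configured.
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is our refund window?",
    actual_output="You can request a refund within 30 days of purchase.",
    retrieval_context=["Refunds are accepted within 30 days of purchase."],
)

metrics = [
    AnswerRelevancyMetric(threshold=0.7),  # is the answer relevant to the input?
    FaithfulnessMetric(threshold=0.7),     # is the answer grounded in the retrieved context?
]

evaluate(test_cases=[test_case], metrics=metrics)
```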
LLM evaluation
Definition
The process of assessing a standalone Large Language Model (LLM) independent of any application or system it is integrated into.
Metrics
General Ability metrics
GLUE Benchmark
General Language Understanding Evaluation
HellaSwag
Evaluates how well an LLM can complete a sentence
TruthfulQA
Measures truthfulness of model responses
MMLU (Massive Multitask Language Understanding)
Evaluates knowledge and problem solving across a wide range of subjects and tasks
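As an illustration of how a general-ability benchmark such as MMLU is typically scored (multiple-choice accuracy), here is a sketch assuming the datasets package and the cais/mmlu dataset layout on the Hugging Face Hub (question, choices, answer index); ask_llm is a hypothetical stand-in for whatever model call you use:

```python
# Multiple-choice accuracy on a slice of MMLU; field names assume the
# "cais/mmlu" dataset layout, and ask_llm is a hypothetical placeholder.
from datasets import load_dataset

LETTERS = "ABCD"

def ask_llm(prompt: str) -> str:
    """Hypothetical placeholder: return the model's answer letter (A-D)."""
    raise NotImplementedError

def mmlu_accuracy(subject: str = "abstract_algebra", limit: int = 50) -> float:
    rows = load_dataset("cais/mmlu", subject, split="test").select(range(limit))
    correct = 0
    for row in rows:
        options = "\n".join(f"{LETTERS[i]}. {c}" for i, c in enumerate(row["choices"]))
        prompt = f"{row['question']}\n{options}\nAnswer with a single letter."
        if ask_llm(prompt).strip().upper().startswith(LETTERS[row["answer"]]):
            correct += 1
    return correct / limit
```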
Responsible AI metrics
Harmful content, regulation, and bias
