
Benchmark LLMs - LM Harness, FastEval, FLASK

LM Harness Benchmarks

Evaluate LLMs 20x faster with TGI via the litellm proxy's /completions endpoint.

This tutorial assumes you're using the big-refactor branch of lm-evaluation-harness.

Step 1: Start the local proxy (see supported models here)

$ litellm --model huggingface/bigcode/starcoder

Using a custom API base:

$ export HUGGINGFACE_API_KEY=my-api-key #[OPTIONAL]
$ litellm --model huggingface/tinyllama --api_base https://k58ory32yinf1ly0.us-east-1.aws.endpoints.huggingface.cloud

OpenAI Compatible Endpoint at http://0.0.0.0:8000
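
Before pointing a benchmark harness at the proxy, you can sanity-check it by calling the /completions route directly. The sketch below uses the requests package; the model name and prompt are placeholders, and the API key can be any value since the proxy holds the real credentials.

# Quick sanity check against the proxy's OpenAI-compatible /completions route.
# Assumes the proxy from Step 1 is listening on http://0.0.0.0:8000.
import requests

resp = requests.post(
    "http://0.0.0.0:8000/completions",
    headers={"Authorization": "Bearer anything"},  # any value; the proxy holds the credentials
    json={
        "model": "huggingface/bigcode/starcoder",  # whatever model the proxy was started with
        "prompt": "def fibonacci(n):",
        "max_tokens": 20,
    },
)
print(resp.status_code)
print(resp.json())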

Step 2: Set OpenAI API Base & Key

$ export OPENAI_API_BASE=http://0.0.0.0:8000

LM Harness requires you to set an OpenAI API key, OPENAI_API_SECRET_KEY, to run benchmarks. Since the proxy holds the credentials, this can be any value:

export OPENAI_API_SECRET_KEY=anything

Step 3: Run LM-Eval-Harness

python3 -m lm_eval \
--model openai-completions \
--model_args engine=davinci \
--tasks crows_pairs_english_age
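
If you prefer to drive the benchmark from Python, the exports from Step 2 and the command above can be combined in one script. This is a minimal sketch using subprocess; it assumes the proxy from Step 1 is still running.

# Steps 2-3 in a single script, assuming the proxy is up on http://0.0.0.0:8000.
import os
import subprocess

os.environ["OPENAI_API_BASE"] = "http://0.0.0.0:8000"
os.environ["OPENAI_API_SECRET_KEY"] = "anything"  # placeholder; the proxy holds the real key

subprocess.run(
    [
        "python3", "-m", "lm_eval",
        "--model", "openai-completions",
        "--model_args", "engine=davinci",
        "--tasks", "crows_pairs_english_age",
    ],
    check=True,
)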

FastEval

Step 1: Start the local proxy (see supported models here)

$ litellm --model huggingface/bigcode/starcoder

Step 2: Set OpenAI API Base & Key

$ export OPENAI_API_BASE=http://0.0.0.0:8000

Set this to anything, since the proxy has the credentials:

export OPENAI_API_KEY=anything

Step 3: Run with FastEval

Clone FastEval

# Clone this repository, make it the current working directory
git clone --depth 1 https://github.com/FastEval/FastEval.git
cd FastEval

Set API Base on FastEval

In FastEval, make the following 2-line code change to set OPENAI_API_BASE:

https://github.com/FastEval/FastEval/pull/90/files

try:
    api_base = os.environ.get("OPENAI_API_BASE")  # changed: read API base from the environment
    if api_base is None:
        api_base = "https://api.openai.com/v1"
    response = await self.reply_two_attempts_with_different_max_new_tokens(
        conversation=conversation,
        api_base=api_base,  # changed: pass api_base
        api_key=os.environ["OPENAI_API_KEY"],
        temperature=temperature,
        max_new_tokens=max_new_tokens,
    )

Run FastEval. Set -b to the benchmark you want to run. Possible values are mt-bench, human-eval-plus, ds1000, cot, cot/gsm8k, cot/math, cot/bbh, cot/mmlu, and custom-test-data.

Since LiteLLM provides an OpenAI-compatible proxy, -t and -m don't need to change: -t stays openai and -m stays gpt-3.5-turbo.

./fasteval -b human-eval-plus -t openai -m gpt-3.5-turbo

FLASK - Fine-grained Language Model Evaluation

Use litellm to evaluate any LLM with FLASK: https://github.com/kaistAI/FLASK

Step 1: Start the local proxy

$ litellm --model huggingface/bigcode/starcoder

Step 2: Set OpenAI API Base & Key

$ export OPENAI_API_BASE=http://0.0.0.0:8000
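
Exporting OPENAI_API_BASE is enough because the pre-1.0 openai Python package, which FLASK's review scripts are assumed to use here, reads that variable at import time and sends all requests to it. A quick way to confirm the redirect, assuming openai<1.0 is installed:

# Confirms the openai package will talk to the proxy instead of api.openai.com.
# Assumes openai<1.0 and that OPENAI_API_BASE was exported as above.
import openai

print(openai.api_base)  # expected: http://0.0.0.0:8000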

Step 3: Run with FLASK

git clone https://github.com/kaistAI/FLASK
cd FLASK/gpt_review

Run the eval

python gpt4_eval.py -q '../evaluation_set/flask_evaluation.jsonl'

Debugging

Making a test request to your proxy

This command makes a test Completion and ChatCompletion request to your proxy server:

litellm --test
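
You can also make the same kind of test request by hand from Python. The sketch below assumes the pre-1.0 openai package is installed and that the proxy from the steps above is running; the model name is a placeholder.

# Manual equivalent of `litellm --test`: one ChatCompletion request through the proxy,
# using the pre-1.0 openai package.
import openai

openai.api_base = "http://0.0.0.0:8000"  # the proxy started in the steps above
openai.api_key = "anything"              # any value; the proxy holds the real credentials

completion = openai.ChatCompletion.create(
    model="huggingface/bigcode/starcoder",  # placeholder model name
    messages=[{"role": "user", "content": "Hey, how's it going?"}],
    max_tokens=30,
)
print(completion)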