Docs: https://phasellm.com/docs/phasellm/eval.html

This project provides a unified framework for testing generative language models on a large number of evaluation tasks.

Features:

  • 200+ tasks implemented. See the task table for a complete list.
  • Support for models loaded via transformers (including quantization via AutoGPTQ), GPT-NeoX, and Megatron-DeepSpeed, with a flexible tokenization-agnostic interface.
  • Support for commercial APIs including OpenAI, goose.ai, and TextSynth.
  • Support for evaluation on adapters (e.g., LoRA) supported in Hugging Face’s PEFT library.
  • Evaluation with publicly available prompts, ensuring reproducibility and comparability between papers.
  • Task versioning to ensure reproducibility when tasks are updated.
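To illustrate why task versioning helps reproducibility, here is a minimal sketch of the idea, not the project's actual API. The names `Task`, `report`, and `comparable`, the task `toy_qa`, and the scores are all hypothetical placeholders. It assumes each result records the version of the task that produced it, so scores are compared only when both the task name and version match.

```python
from dataclasses import dataclass


# Hypothetical names throughout; this is NOT the project's real interface.
@dataclass(frozen=True)
class Task:
    name: str
    version: int          # bumped whenever the prompt or scoring changes
    prompt_template: str


def report(task: Task, score: float) -> dict:
    """Attach the task version to a result so it can be checked later."""
    return {"task": task.name, "version": task.version, "score": score}


def comparable(a: dict, b: dict) -> bool:
    """Two results are comparable only if task name and version both match."""
    return (a["task"], a["version"]) == (b["task"], b["version"])


# Placeholder scores, made up purely for illustration.
old = report(Task("toy_qa", 0, "Q: {question}\nA:"), 0.41)
new = report(Task("toy_qa", 1, "Question: {question}\nAnswer:"), 0.47)
same = report(Task("toy_qa", 1, "Question: {question}\nAnswer:"), 0.45)

print(comparable(old, new))   # False: the task changed between versions
print(comparable(new, same))  # True: same task, same version
```

Storing the version next to each score means a changed prompt shows up as a version mismatch instead of being mistaken for a difference between models.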