Introducing ArmBench-LLM

published on 17 March 2025

We are thrilled to announce the release of ArmBench-LLM v0.1, the first benchmark designed specifically to evaluate large language models (LLMs) on Armenian-language tasks. Developed by the Metric AI research lab, this open-source initiative aims to advance AI capabilities for the Armenian language and to support the broader community of Armenian-language researchers and developers.

What is ArmBench-LLM?

ArmBench-LLM is a comprehensive evaluation framework that tests language models on their ability to understand, reason with, and generate Armenian text across various domains. The benchmark includes challenging tasks derived from real-world applications. ArmBench-LLM v0.1 features two main evaluation categories:

  1. Armenian Unified Test Exams. The benchmark includes actual university entrance examinations in Armenian Literature and Language, Armenian History, and Mathematics. These exams are authentic, challenging tasks that largely require a deep understanding of the Armenian language and culture.
  2. MMLU-Pro-Hy. A stratified sample of 1,000 questions translated into Armenian from the MMLU-Pro benchmark, an extension of the Massive Multitask Language Understanding (MMLU) benchmark, covering diverse domains of knowledge. The translation was performed with several LLMs, and the data was extensively post-processed and cleaned afterwards (a scoring sketch follows this list).
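
To make the question format concrete, here is a minimal scoring sketch for multiple-choice items such as those in MMLU-Pro-Hy. The field names (question, options, answer) and the prompt layout are illustrative assumptions, not the benchmark's actual schema or evaluation harness.

```python
# Minimal accuracy-scoring sketch for multiple-choice items.
# NOTE: field names and prompt format are illustrative assumptions,
# not the actual ArmBench-LLM schema or evaluation harness.
from typing import Callable


def format_prompt(item: dict) -> str:
    """Render a multiple-choice question with lettered options."""
    letters = "ABCDEFGHIJ"
    lines = [item["question"]]
    lines += [f"{letters[i]}. {opt}" for i, opt in enumerate(item["options"])]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


def accuracy(items: list[dict], predict: Callable[[str], str]) -> float:
    """Fraction of items where the model's letter matches the gold answer."""
    correct = 0
    for item in items:
        prediction = predict(format_prompt(item)).strip().upper()[:1]
        if prediction == item["answer"].strip().upper():
            correct += 1
    return correct / len(items) if items else 0.0


# Tiny self-contained check with a stand-in "model" that always answers "A".
sample = [{"question": "2 + 2 = ?", "options": ["4", "5", "3", "22"], "answer": "A"}]
print(accuracy(sample, lambda prompt: "A"))  # 1.0
```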

Initial Results

Our initial evaluation reveals several insights into current LLM capabilities in Armenian:

  • Claude 3.7 Sonnet leads the overall test-exam leaderboard, demonstrating superior performance across most tasks.
  • Claude 3.5 Sonnet is the only model that passes all of the university entrance exams, scoring at least 8 out of the 20 possible points on each.
  • Gemini 2.0 Flash leads the MMLU-Pro-Hy leaderboard, while slightly lagging behind Sonnet on the test exams.

Open-Source Resources

As part of our commitment to advancing Armenian NLP research, we're releasing:

  1. An interactive Hugging Face Space featuring the complete leaderboard and metrics.
  2. A GitHub repository with evaluation code for contributing new models.
  3. The datasets powering the benchmark, for research and development purposes (a loading sketch follows this list).
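
As a rough illustration of how the released data could be pulled into your own evaluation script, the snippet below uses the Hugging Face datasets library. The dataset identifier and split name are placeholders, not confirmed paths; check the Hugging Face Space and GitHub repository for the actual ones.

```python
# Sketch: loading an ArmBench-LLM dataset from the Hugging Face Hub.
# The repository id and split below are placeholders, not confirmed paths.
from datasets import load_dataset

dataset = load_dataset("metric-ai/armbench-llm-mmlu-pro-hy", split="test")  # placeholder id

print(len(dataset))  # number of questions in the split
print(dataset[0])    # inspect the fields of one example
```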

Get Involved

  • Evaluate your models on ArmBench-LLM.
  • Contribute improvements to the benchmark.
  • Help expand the dataset with new, challenging tasks.

Experience the Leaderboard Below
