Introducing Armenian Text Embeddings 1: an Armenian embedding model with strong multilingual knowledge

published on 25 November 2024
TL;DR - We are excited to announce armenian-text-embeddings-1, a new state of the art for Armenian embeddings and a big leap toward the development of Armenian AI. armenian-text-embeddings-1 can be used for applications ranging from RAG-based Generative AI (such as chatbots and document Q&A) to semantic search, text classification, and more.
Access the Model on HuggingFace 🤗

Five months ago we set ourselves the goal of contributing to the development of Armenian AI by training foundation models that can extend Generative AI applications to the Armenian market.

Today, we are excited to finally introduce our first model, armenian-text-embeddings-1, a first-of-its-kind embedding model achieving state-of-the-art performance on Armenian embedding tasks. The model is trained on a quarter-billion-token dataset and is fully open source and available for your use on HuggingFace.
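If you want to try it right away, here is a minimal usage sketch with the transformers library. It assumes the model follows the conventions of its E5 base (mean pooling and "query: "/"passage: " prefixes); the repo id below is illustrative, so check the model card on HuggingFace for the exact identifier.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "Metric-AI/armenian-text-embeddings-1"  # illustrative id; see the model card

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

def embed(texts: list[str]) -> torch.Tensor:
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    # Mean-pool the token embeddings, ignoring padding positions.
    mask = batch["attention_mask"].unsqueeze(-1).float()
    emb = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
    return F.normalize(emb, dim=-1)  # unit vectors, so dot product = cosine

q = embed(["query: Ինչպե՞ս բացել բանկային հաշիվ"])  # "How do I open a bank account?"
p = embed(["passage: Հաշիվ բացելու համար այցելեք մոտակա մասնաճյուղ"])  # "To open an account, visit the nearest branch"
print((q @ p.T).item())  # cosine similarity between query and passage
```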

To substantiate this claim, we carefully curated a benchmark dataset that allows us to evaluate the quality of different embedding models for Armenian. On this benchmark, armenian-text-embeddings-1 outperforms all other models, including OpenAI's text embedding models, by a significant margin, reaching a whopping 89% retrieval accuracy.

About embedding models

Generative AI applications are everywhere. 2024 is the era of RAG (retrieval-augmented generation) systems, which give GenAI products the ability to enhance their knowledge with proprietary or external data. While the user-facing interface is powered by LLMs (such as GPT-4o or Claude), the so-called retrieval part is handled by language models known as text embedding models.

Text embedding models are the Swiss Army knife of AI. They can be used in applications ranging from chatbots to e-commerce search engines. While they may not be as consumerized as LLMs, their quality is the key driving force behind many AI applications.

Unfortunately, there was no embedding model specifically trained for the Armenian language. Well, at least up until today.

We have trained a specialized model with outstanding performance on Armenian text that preserves strong multilingual abilities in languages such as English and Russian.

Technical details

The model is forked from Microsoft's multilingual-e5-base and trained for 5 epochs on 4x A100 40GB GPUs. We followed the standard practice for training embedding models and used the InfoNCE loss with in-batch negatives. Each batch consisted of 4x256 (query, passage) pairs, prefixed with "query: " and "passage: " as in the E5 family (aka asymmetric training).
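For concreteness, here is a minimal sketch of that objective; the tensor names and the temperature value are illustrative, as the exact hyperparameters are not stated in this post.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, passage_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE with in-batch negatives.

    query_emb, passage_emb: (B, D) L2-normalized embeddings, where row i of
    each tensor forms a matching (query, passage) pair.
    """
    # (B, B) similarity matrix: the diagonal holds the positives, and every
    # off-diagonal entry serves as an in-batch negative.
    logits = query_emb @ passage_emb.T / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```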

The training data was generated synthetically by translating more than 2M Reddit (title, body) pairs from English to Armenian using Gemma 2 27b-it. The synthetic data was rigorously cleaned and post-processed. The resulting training set contained around a quarter billion tokens, which were used to train the embedding model.

The fine-tuned model was merged with the base model using Stochastic Weight Averaging (SWA) to preserve the multilingual capabilities. The resulting model has 12 layers and 278M parameters, with an embedding size of 768, and supports a context of up to 512 tokens.
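As a sketch of the merge step: SWA generally averages weights across several checkpoints, and the simplest two-checkpoint case is an element-wise average of the base and fine-tuned weights. The local checkpoint path and the 0.5 mixing coefficient below are assumptions, not the authors' exact recipe.

```python
import torch
from transformers import AutoModel

base = AutoModel.from_pretrained("intfloat/multilingual-e5-base")
tuned = AutoModel.from_pretrained("./finetuned-armenian-e5")  # hypothetical local path

tuned_state = tuned.state_dict()
merged = {}
for name, base_param in base.state_dict().items():
    # Element-wise average of the two sets of weights.
    merged[name] = 0.5 * base_param + 0.5 * tuned_state[name]

tuned.load_state_dict(merged)
tuned.save_pretrained("./armenian-text-embeddings-1-merged")
```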

It outperforms all open-source and closed-source models that we tested on the Armenian benchmarks, regardless of their size, with an 89% accuracy on the hand-crafted multidisciplinary retrieval benchmark.

Evaluation details

To evaluate the model, we carefully curated a strong and general benchmark for Armenian embedding models, inspired by MTEB. The benchmark includes datasets for 4 tasks: retrieval, sentence similarity, classification, and paraphrasing. Some of the datasets were created manually, while others were generated using machine translation (this time with Gemini Pro). The translated synthetic data used in the benchmark is unrelated to the Reddit translations used in training.
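To illustrate how a retrieval score can be computed on such a benchmark, here is a sketch of accuracy@1 over (query, passage) pairs, reusing the embed helper from the usage example above. The benchmark's exact metric definition is not specified in this post, so this is just one common formulation.

```python
import torch

def retrieval_accuracy(queries: list[str], passages: list[str]) -> float:
    """Accuracy@1: queries[i] should retrieve passages[i] from the full pool."""
    q = embed([f"query: {t}" for t in queries])     # (N, D), normalized
    p = embed([f"passage: {t}" for t in passages])  # (N, D), normalized
    sims = q @ p.T                                  # (N, N) cosine similarities
    predicted = sims.argmax(dim=1)                  # top-ranked passage per query
    gold = torch.arange(len(queries))
    return (predicted == gold).float().mean().item()
```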

Results

The bar chart above compares the performance of our model (blue) against OpenAI's model (orange) on the 4 benchmark tasks. armenian-text-embeddings-1 exhibits superior performance on 3 of the 4 tasks. On the paraphrasing task, the OpenAI model performs slightly better, which we attribute to the asymmetric training setup behind armenian-text-embeddings-1: paraphrasing is a symmetric task, while our model was trained on asymmetric (query, passage) pairs. The green bar shows the performance of Okapi BM25, the strong baseline that was widely used for such tasks before the era of contextual dense embeddings, and is still often used in combination with them.

Use cases

Here are some use cases where you could apply armenian-text-embeddings-1:

  • GenAI Chatbots - banks, insurance companies, and telcos can use embedding models to introduce customer-facing chatbots grounded in their own website content and documentation.
  • Semantic Search - supermarkets, retailers, and e-commerce shops can leverage embedding models to improve product search and recommendations on their platforms (see the sketch after this list).
  • Document Similarity - companies can use vector embeddings to find similar documents, cluster them together, and boost the productivity of document-heavy workflows.
  • Retrieval - Armenian text embeddings can be used to build internal AI assistants over internal/proprietary documents without your data ever leaving your servers.
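To make the semantic search and retrieval use cases concrete, here is a sketch of the typical pipeline: embed the collection once, then rank documents by cosine similarity against the embedded query. It reuses the embed helper from the usage example above, and the example documents are made up.

```python
import torch

# Toy document collection; in practice these would be product descriptions,
# FAQ entries, or internal documents.
documents = [
    "Վարկի դիմումի համար անհրաժեշտ փաստաթղթերի ցանկը",       # documents needed for a loan application
    "Առաքումը կատարվում է 2-3 աշխատանքային օրվա ընթացքում",  # delivery within 2-3 business days
]
doc_emb = embed([f"passage: {d}" for d in documents])  # precompute offline

def search(question: str, k: int = 3):
    q = embed([f"query: {question}"])        # (1, D), normalized
    scores = (q @ doc_emb.T).squeeze(0)      # cosine similarity per document
    top = torch.topk(scores, k=min(k, len(documents)))
    return [(documents[i], scores[i].item()) for i in top.indices]

print(search("Ե՞րբ կստանամ իմ պատվերը"))  # "When will I receive my order?"
```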

Future work

We are working tirelessly to extend the Armenian AI space with additional models and use cases. Stay tuned for what we are shipping next.

Access the Model on HuggingFace 🤗