OpenSearchCon 2024 Session: RAGElo: An Elo Rating-based Evaluation Toolkit for RAG

Retrieval Augmented Generation (RAG) has become the workhorse of Large Language Models (LLMs) for Question Answering and Chat grounded in private data sets. On the R side, search engines provide many different retrieval strategies for finding relevant information: vector search, BM25, hybrid search, re-ranking, etc. On the G side, prompt engineering is more of an art than a science; small variations in the prompt can lead to wildly different results. When combined with agent-style generation, where the LLM is in charge of deciding the query, search filters, and retrieval strategy based on the user intent, the number of possible solution variations becomes astronomical. On top of all this, the standard evaluation technique of comparing against “gold standard” answers is not always feasible, as the answer might not be known or might be too expensive to obtain. This is where RAGElo comes in. RAGElo creates an Elo ranking of the different RAG solutions: powerful LLMs employ reasoning techniques to evaluate pairs of answers across a set of questions, taking into account the information retrieved by the search engine.
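To make the Elo idea concrete, here is a minimal sketch of how pairwise verdicts could be turned into a ranking of RAG pipelines. It uses the textbook Elo update and is not RAGElo's actual API. The pipeline names, the K-factor of 32, the starting rating of 1000, and the hard-coded verdicts (which stand in for the output of an LLM judge) are all assumptions made for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings: dict, a: str, b: str, score_a: float, k: float = 32.0) -> None:
    """Update ratings in place after one comparison.

    score_a is 1.0 if A's answer was preferred, 0.0 if B's was, 0.5 for a tie.
    """
    delta = k * (score_a - expected_score(ratings[a], ratings[b]))
    ratings[a] += delta
    ratings[b] -= delta


def rank_pipelines(games, initial: float = 1000.0, k: float = 32.0):
    """Replay (pipeline_a, pipeline_b, score_a) verdicts and return pipelines sorted by rating."""
    ratings = {}
    for a, b, score_a in games:
        ratings.setdefault(a, initial)
        ratings.setdefault(b, initial)
        update_elo(ratings, a, b, score_a, k)
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


# Hypothetical verdicts; in practice each would come from an LLM judge
# comparing two pipelines' answers to the same question.
games = [
    ("hybrid", "bm25", 1.0),    # hybrid's answer preferred
    ("hybrid", "vector", 0.5),  # tie
    ("vector", "bm25", 1.0),    # vector's answer preferred
]

ranking = rank_pipelines(games)
for name, rating in ranking:
    print(f"{name}: {rating:.1f}")
```

Elo ratings depend on the order in which games are replayed, so with only a few verdicts it is common to shuffle the games and average the ratings over several runs.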