RouteLLM: How I Route to The Best Model to Cut API Costs
RouteLLM is an open-source framework developed by LMSYS Org that aims to reduce operating costs while maintaining high-quality responses
Large language models have shown impressive capabilities across a wide variety of tasks, but they differ enormously in cost and capability.
Claude 3 Opus, GPT-4, and other frontier models deliver high performance, but at a high price. That forces a trade-off: use the best, brightest, and most expensive model, or settle for something cheaper, faster, and less capable.
But what if there were a better way? This is the dilemma of deploying LLMs in the real world.
Whether you're building a business tool or a web-research assistant, routing every query to the biggest, most capable model gives you the highest-quality responses, but it can be costly.
Some projects burn through thousands of dollars simply because every request goes to GPT-4.
Of course, you can save money by routing queries to smaller models, but response quality can suffer. GPT-3.5 is cheap, yet it falls short on harder tasks.
That's where something like RouteLLM comes in.
In this story, we will give an easy-to-understand explanation of RouteLLM: what it is, how it works, and what its features are, and we will even build a working application.
What is RouteLLM?
RouteLLM is an open-source framework developed by LMSYS Org that aims to reduce operating costs while maintaining high-quality responses by distributing queries between different language models through intelligent routing. Simply put, RouteLLM flexibly selects an appropriate model for each query based on its complexity, thereby saving resources.
The problem it solves
Using high-performance AI to process every query is like consulting a genius professor for simple questions like “What’s the weather like today?” — unnecessary and expensive.
In contrast, relying on a basic AI to process complex queries produces poor results. RouteLLM optimizes both cost and response quality by intelligently matching each query with an appropriate AI model.
How RouteLLM works
Query Analysis: RouteLLM first analyzes the complexity and intent of each query using natural language processing techniques.
Win rate prediction model: It uses predictive modeling to determine the likelihood that the advanced AI will provide a significantly better response.
Learning from preference data: RouteLLM is trained on historical data, learning from past queries and user feedback to improve its decisions.
Dynamic routing: Based on the predictions, the system routes the query to the most appropriate AI model.
Continuous Improvement: RouteLLM continuously updates its algorithms to enhance routing accuracy and efficiency as new data is added.
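The steps above can be sketched as a toy threshold router. This is purely illustrative and not RouteLLM's actual implementation: the `predict_win_rate` heuristic and the `"strong"`/`"weak"` labels are assumptions made for the example; the real system uses a learned win-rate model trained on preference data.

```python
def predict_win_rate(query: str) -> float:
    """Toy stand-in for a learned win-rate model: estimates how likely
    the strong model is to give a significantly better answer.
    Longer or keyword-flagged queries score higher."""
    complexity_hints = ("prove", "derive", "optimize", "explain why")
    score = min(len(query) / 200.0, 0.6)
    if any(hint in query.lower() for hint in complexity_hints):
        score += 0.4
    return min(score, 1.0)


def route(query: str, threshold: float = 0.3) -> str:
    """Send the query to the strong model only when the predicted
    win rate clears the calibrated threshold."""
    return "strong" if predict_win_rate(query) >= threshold else "weak"


print(route("Hello!"))                                           # weak
print(route("Prove that the sum of two even numbers is even."))  # strong
```

A trivial greeting stays on the cheap model, while a proof request clears the threshold and goes to the strong one, which is exactly the dynamic-routing behavior described above.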
Core features of RouteLLM
Cost-Effectiveness: Leverage cheaper models for simple queries and use expensive, high-performance models only when necessary.
Efficient routing: By using preference data to train the router, it learns the strengths and weaknesses of different models in processing different queries.
Data augmentation: Data augmentation technology is used to improve the model's routing performance, including golden-label datasets and LLM-judge-labeled datasets.
Advantages of RouteLLM
RouteLLM performs well on multiple benchmarks. For example, with GPT-4 Turbo as the strong model and Mixtral 8x7B as the weak model, RouteLLM reports cost savings of roughly 75% compared to random routing while maintaining high response quality.
How do you set up and use RouteLLM?
1. Cloning the GitHub Repository:
git clone https://github.com/lm-sys/RouteLLM.git
git clone is a Git command that creates a local copy of a repository from GitHub (or another Git-based hosting service).
2. Navigating to the Cloned Directory:
cd RouteLLM
To use RouteLLM, you first need to install it with the command:
pip install "routellm[serve,eval]"
Basic Configuration
import os
from routellm.controller import Controller
os.environ["OPENAI_API_KEY"] = "sk-XXXXXX"
os.environ["ANYSCALE_API_KEY"] = "esecret_XXXXXX"
client = Controller(
    routers=["mf"],
    strong_model="gpt-4-1106-preview",
    weak_model="groq/llama3-8b-8192",
)
Here, ‘mf’ is the recommended router. The strong_model parameter specifies the advanced AI (in this case, GPT-4 Turbo), and weak_model specifies a cheaper, less capable AI (in this case, Llama 3 8B served via Groq).
Router settings
RouteLLM provides a variety of router options, including a matrix-factorization-based router (mf), a BERT-based classifier (bert), an LLM-based classifier (causal_llm), and weighted Elo calculation (sw_ranking). You can choose the most suitable router for your needs:
# Setting Different Routers
routers = [
    'mf',           # Matrix Factorization
    'sw_ranking',   # Weighted Elo Calculation
    'bert',         # BERT Classifier
    'causal_llm',   # LLM-based Classifier
]
# Selecting a Router
chosen_router = 'mf'
Setting the Threshold
python -m routellm.calibrate_threshold --routers mf --strong-model-pct 0.5 --config config.example.yaml
This command calibrates the routing threshold, which determines how difficult a question must be before it is sent to the advanced AI. Here it is tuned so that roughly 50% of queries go to the strong model.
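The idea behind calibration can be shown with a minimal sketch: given predicted win rates for a batch of sample queries, pick the cutoff so that a target fraction of them clears it. The quantile approach and the sample numbers below are my assumptions for illustration, not the tool's exact algorithm.

```python
def calibrate_threshold(win_rates, strong_model_pct):
    """Pick a cutoff so that roughly `strong_model_pct` of queries
    (those with the highest predicted win rates) are routed
    to the strong model."""
    ranked = sorted(win_rates, reverse=True)
    k = max(1, round(len(ranked) * strong_model_pct))
    return ranked[k - 1]


# Hypothetical predicted win rates for eight sample queries
sample = [0.05, 0.12, 0.30, 0.45, 0.62, 0.71, 0.83, 0.90]
threshold = calibrate_threshold(sample, 0.5)
print(threshold)  # 0.62 -> exactly half of these queries clear the cutoff
```

Raising --strong-model-pct lowers the threshold (more queries reach the expensive model); lowering it does the opposite.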
Usage:
response = client.chat.completions.create(
    model="router-mf-0.11593",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)
With this in place, RouteLLM analyzes each question and directs it to the appropriate AI model. The model name router-mf-0.11593 encodes the chosen router (mf) and the calibrated threshold. Since the prompt here is just "Hello!", any model can answer it, so there is no need to pay for an expensive model.
import os
from routellm.controller import Controller
# Set environment variables for API keys
os.environ["OPENAI_API_KEY"] = "sk-XXXXXX"   # replace with your own OpenAI key
os.environ["GROQ_API_KEY"] = "gsk_XXXXXX"    # replace with your own Groq key
# Check if the environment variable is set correctly
api_key = os.getenv("OPENAI_API_KEY")
print(f"OPENAI_API_KEY: {api_key}")
# Initialize the Controller
client = Controller(
    routers=["mf"],                      # List of routers, e.g., "mf" for matrix factorization
    strong_model="gpt-4-1106-preview",   # Specify the strong model to use
    weak_model="groq/llama3-8b-8192",    # Specify the weak model to use
)
# Selecting a Router
chosen_router = 'mf'
response = client.chat.completions.create(
    model="router-mf-0.11593",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)
# The Controller mirrors the OpenAI client's interface, so the
# response is an object with attributes, not a plain dict
output_content = response.choices[0].message.content
print(output_content)
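To see why routing is worth the setup, here is a back-of-the-envelope cost comparison. The per-query prices below are illustrative placeholders, not real API rates, and the 20% strong-model fraction is an assumed routing outcome.

```python
def blended_cost(n_queries, strong_pct, strong_price, weak_price):
    """Total cost for a batch when `strong_pct` of queries hit the
    strong model and the rest hit the weak one (price = $/query)."""
    n_strong = n_queries * strong_pct
    n_weak = n_queries - n_strong
    return n_strong * strong_price + n_weak * weak_price


# Illustrative per-query prices (placeholders, not real rates)
STRONG, WEAK = 0.03, 0.001

all_strong = blended_cost(1000, 1.0, STRONG, WEAK)  # everything to GPT-4-class
routed = blended_cost(1000, 0.2, STRONG, WEAK)      # 20% strong, 80% weak
print(f"savings: {1 - routed / all_strong:.0%}")    # savings: 77%
```

Even under these rough assumptions, sending only the hard 20% of traffic to the expensive model cuts the bill by more than three quarters, which is the economic argument behind RouteLLM.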
Conclusion:
RouteLLM is an innovative tool that allows you to use AI technology more economically and efficiently. It seems particularly useful for companies operating large-scale AI services or startups that want to provide high-quality AI services on a limited budget. Whether you are working on daily applications or handling complex AI tasks, RouteLLM is an option worth considering.