Medium-to-Large AIs, Mixture-of-Experts, and Even 'Thinking' Models for Free

Introduction

Artificial Intelligence (AI) has made tremendous progress in recent years, particularly in the field of Large Language Models (LLMs). These models are designed to process and understand human language, enabling applications such as chatbots, content generation, and language translation. At their core, LLMs work by predicting the next word in a sequence, given the context of the conversation or text. This prediction is based on complex algorithms and massive amounts of training data.
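That next-word prediction step can be illustrated with a toy sketch: the model assigns a score (logit) to every candidate token, a softmax turns those scores into probabilities, and decoding picks from that distribution. The scores below are made up for illustration; a real LLM computes them from billions of parameters.

```python
import math

def softmax(logits):
    # Convert raw scores into a probability distribution over candidates.
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Toy scores a model might assign to next-word candidates after "The cat sat on the".
logits = {"mat": 4.0, "sofa": 2.5, "moon": 0.5}
probs = softmax(logits)

# Greedy decoding simply picks the most likely token.
next_word = max(probs, key=probs.get)
```

Real models sample from this distribution (with temperature, top-p, and so on) rather than always taking the maximum, which is what makes their output varied.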

Mixture of Experts Architecture: A Dream Team of Nerds Inside Your AI Assistant

Before the introduction of the Mixture of Experts (MoE) architecture, LLMs were typically designed as a single dense model in which every parameter is active for every token. MoE models instead combine many smaller expert networks inside one model, with a learned router activating only the few experts most relevant to each input. This approach has led to significant improvements in performance, efficiency, and flexibility.
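The routing idea can be sketched in a few lines. This is a hypothetical, drastically simplified MoE layer: four scalar "experts", a fixed list of gate scores standing in for a learned router, and top-2 selection, so only two experts actually run per input. That selective activation is why MoE models can be large yet fast per token.

```python
import math

# Four toy "experts" (a real model uses neural sub-networks, not lambdas).
experts = [
    lambda x: 2 * x,   # expert 0
    lambda x: x + 10,  # expert 1
    lambda x: x * x,   # expert 2
    lambda x: -x,      # expert 3
]

# Scores a learned gating network would produce for one input (made up here).
gate_scores = [0.1, 2.0, 1.5, -1.0]

# Keep only the top-k experts; the rest stay idle and cost no compute.
top_k = 2
top = sorted(range(len(gate_scores)), key=gate_scores.__getitem__, reverse=True)[:top_k]

# Softmax over the selected experts' scores gives the mixing weights.
exps = [math.exp(gate_scores[i]) for i in top]
weights = [e / sum(exps) for e in exps]

# The layer's output is the weighted sum of the chosen experts' outputs.
x = 3.0
output = sum(w * experts[i](x) for i, w in zip(top, weights))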

Thinking Before Answering Decreases Error Rate

One of the most exciting developments in LLMs is the emergence of "Thinking" models, which can generate step-by-step reasoning and explanations for their responses. These models have the potential to greatly enhance the transparency and trustworthiness of AI systems. In this article, we'll explore the advantages of MoE and Thinking models, and highlight some of the best options available, with speeds over 100 tokens per second and completely free.
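In practice, R1-style "thinking" models emit their reasoning before the final answer, commonly wrapped in `<think>` tags, and clients split the two in post-processing. The tag format and sample text below are assumptions for illustration; check your model's documentation for its actual delimiter.

```python
def split_thinking(text):
    # Separate the model's visible reasoning from its final answer,
    # assuming an R1-style "<think>...</think>" delimiter.
    start, end = "<think>", "</think>"
    if start in text and end in text:
        reasoning = text.split(start, 1)[1].split(end, 1)[0].strip()
        answer = text.split(end, 1)[1].strip()
        return reasoning, answer
    return "", text.strip()

# Hypothetical raw model output.
raw = "<think>12 * 12 = 144, minus 4 is 140.</think>The answer is 140."
reasoning, answer = split_thinking(raw)
```

Keeping the reasoning around is useful for debugging and auditing answers, even if you only show the final part to end users.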

Here are some LLMs you can use for FREE at OpenRouter.

The table lists, for each model: index, release date, company, model size (B parameters), engine, context size (K tokens), max output (K tokens), throughput (tokens per second), latency (s), input cost (USD per 1M tokens), and output cost (USD per 1M tokens).
These are typical values. Check the OpenRouter website for the latest statistics.
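Trying one of these free models takes only a short script, since OpenRouter exposes an OpenAI-compatible chat endpoint. The model slug and the `:free` suffix below are assumptions; check the OpenRouter model list for current IDs, and substitute your own API key.

```python
import json
import urllib.request

def chat(prompt,
         model="deepseek/deepseek-r1-distill-qwen-32b:free",  # assumed slug; verify on OpenRouter
         api_key="YOUR_OPENROUTER_KEY"):
    # Minimal sketch of a chat completion request against OpenRouter's
    # OpenAI-compatible endpoint.
    req = urllib.request.Request(
        "https://openrouter.ai/api/v1/chat/completions",
        data=json.dumps({
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        }).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Any OpenAI-compatible client library works the same way by pointing its base URL at `https://openrouter.ai/api/v1`.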

Selection

When selecting an LLM for a specific task, there are several factors to consider. One of the most important is speed, since it directly affects the user experience. At 100 to 300 tokens per second, users can hold casual conversations and generate content, but a long output (2,000+ words) can still take many seconds. A customer support chatbot can respond quickly to user queries, and a creative writer can generate text at a rapid pace, though not instantaneously. If you need to generate large texts quickly, I cover a few options in my article: Lightning-Fast AI Assistants at OpenRouter.
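A quick back-of-the-envelope calculation shows why long outputs feel slow even on a fast model. Using a rough rule of thumb of about 1.3 tokens per English word (an approximation; real tokenizers vary):

```python
# Estimate the wait for a 2000-word answer at 150 tokens per second.
words = 2000
tokens = int(words * 1.3)   # ~1.3 tokens/word is a rough English estimate
throughput_tps = 150
wait_seconds = tokens / throughput_tps  # well over ten seconds
```

So a model in the 100-300 tps range is fine for chat turns, but multi-thousand-word generations will always involve a noticeable wait.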

Size Matters

The size of the model, measured in billions of parameters, also matters. Larger models tend to be more accurate and capable, but may require more computational resources. A model with 10 billion parameters may be sufficient for simple tasks, while a model with 400 billion parameters may be needed for more complex applications.

Methodology for Picking a Model

  1. Define the task: Identify the specific application or use case for the LLM.
  2. Determine the required speed: Consider the desired response time and throughput.
  3. Choose a model size: Select a model with a suitable number of parameters for the task complexity.
  4. Evaluate the model's capabilities: Consider the model's performance on relevant benchmarks and tasks.
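The four steps above can be sketched as a small helper: filter a candidate list by a throughput floor, then prefer the largest model that still meets it. The candidate entries are illustrative, with fields loosely modeled on OpenRouter's published stats.

```python
# Hypothetical candidates (name, size in billions of parameters, throughput).
candidates = [
    {"name": "distill-qwen", "params_b": 32, "tps": 147},
    {"name": "llama-4-maverick", "params_b": 400, "tps": 161},
    {"name": "small-chat", "params_b": 8, "tps": 290},
]

def pick(models, min_tps=100):
    # Step 2: enforce the required speed; steps 3-4: among the models that
    # qualify, take the largest as a proxy for capability.
    fast_enough = [m for m in models if m["tps"] >= min_tps]
    return max(fast_enough, key=lambda m: m["params_b"])

best = pick(candidates)
```

In a real evaluation you would replace the size-as-proxy heuristic with benchmark scores relevant to your task.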

Editor's Pick

After evaluating several LLMs, we recommend the following two models:

Distill Qwen Model from DeepSeek

  • 32 B parameters
  • 147 tps
  • "thinking" reasoning
  • multi-step problem-solving

Llama 4 Maverick Model from Meta

  • 400 B parameters
  • 161 tps
  • high accuracy
  • broad expertise

Both models are free to use and offer exceptional performance, making them attractive options for a wide range of applications. Whether you need a Thinking model for complex problem-solving or a highly accurate MoE model for demanding tasks, these two models are definitely worth considering.