Why Is Llama 4 Much Better Than GPT-4o and Gemini 2.0 Flash — Here Is the Result
Take note, tech enthusiasts and AI-curious minds! The world of artificial intelligence is evolving at the speed of light, and it feels like every few months a strong new contender enters the ring. If you’ve dabbled in AI chat at all, you will have felt that pace firsthand.
Recently, Meta announced the latest in its large language model (LLM) series, Llama 4. Llama 4 is a next-generation AI model designed to be natively multimodal, not just a text-processing model. This means it can understand and process both text and images.
Meta said Llama 4 is a multimodal large model that can process and integrate various types of data, including text, video, images and audio, and can convert content between these formats.
The Llama 4 series is Meta’s first to use a mixture of experts (MoE) architecture, which makes both training and answering user queries more efficient. The architecture divides the model into multiple “expert” sub-models that focus on specific tasks.
Mark Zuckerberg said in the announcement:
“Our goal is to build the world’s leading AI, open source it, and make it available to everyone.”
Llama 4 comes in three main variations:
Llama 4 Scout: A small but powerful multimodal model
Llama 4 Maverick: A mid-range flagship multimodal model
Llama 4 Behemoth: A super-large-scale “teacher” model (still in training)
In addition, Meta plans to release a reasoning-focused model called “Llama 4 Reasoning” in the near future.
These new models represent the confluence of three key innovations: native multimodality with early fusion techniques, a sparse mixture of experts (MoE) architecture that dramatically improves parameter efficiency, and an expanded context window that scales to an unprecedented 10 million tokens.
The MoE architecture improves overall computational efficiency by breaking data processing into subtasks and delegating them to smaller, specialized “expert” models, which makes both training and answering user queries cheaper.
In benchmark tests, Llama 4 Maverick surpassed comparable models such as GPT-4o, the GPT-4.5 preview, and DeepSeek’s v3.1 and R1 models, and left Google’s Gemini 2.0 Flash far behind.
In this article, we will cover exactly what Llama 4 is, what its features are, and why it is so much better than GPT-4o and Gemini 2.0 Flash.
What exactly is Llama 4?
Llama 4 is not a single large language model (LLM), but a “family”. The series released by Meta includes multiple AI models of different sizes, each a “generalist” with multilingual and multimodal capabilities.
Key Takeaways
Llama 4 Scout
The Llama 4 Scout is the smallest model in the series, but it is no slouch when it comes to performance.
Main Specifications
Active parameters: 17 billion
Number of experts: 16
Total parameters: 109 billion
Context Length: 10 million tokens
Hardware requirements: Single NVIDIA H100 GPU (with Int4 quantization)
Main Applications
Highly efficient inference on a single GPU
Processing and summarizing very long documents
Understanding and analyzing the entire code base
Basic Image Understanding Tasks
Scout offers better performance than other models in its class (such as Gemma 3, Gemini 2.0 Flash-Lite, and Mistral 3.1), and also outperforms all previous Llama models.
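Since Scout is pitched as a single-GPU model with Int4 quantization, here is a minimal sketch of how you might load it in 4-bit precision with Hugging Face transformers and bitsandbytes. The model ID and the use of AutoModelForCausalLM are assumptions on my part; check Meta’s official model card for the exact identifier and recommended loading class.
python
# Minimal sketch: loading a Llama 4 Scout checkpoint with 4-bit quantization.
# Assumptions: the model ID below is hypothetical (use the ID from Meta's model
# card), and a GPU with enough memory (e.g., a single H100) is available.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"  # hypothetical ID

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit weights to fit on one GPU
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for stability
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # let accelerate place the layers automatically
)

prompt = "Summarize the key ideas of mixture-of-experts models in two sentences."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))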
Llama 4 Maverick
The Llama 4 Maverick is positioned as the “flagship” model of the Llama 4 series.
Main Specifications
Active parameters: 17 billion
Number of experts: 128
Total parameters: 400 billion
Context Length: 1 million tokens
Hardware requirements: Single NVIDIA H100 DGX host
Main Applications
Advanced Multimodal Tasks
Precise Image Understanding
Creative Writing
Coding and inference tasks
Maverick outperforms GPT-4o and Gemini 2.0 Flash, and performs on par with DeepSeek v3 while using fewer than half the active parameters. It achieves an Elo score of 1417 on LMArena.
Use cases: corporate customer support that understands screenshots and error screens sent by users and suggests appropriate solutions, or e-commerce tools that analyze multiple product images to generate detailed comparison reports.
Llama 4 Behemoth
Behemoth is a “teacher” model still in training, but it is already showing incredible performance.
Main Specifications
Active parameters: 288 billion
Number of experts: 16
Total parameters: Approximately 2 trillion
Features: Outperforms GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro in many STEM benchmarks
The primary purpose of Behemoth is to act as a “teacher” for smaller models, improving the performance of Scout and Maverick through knowledge distillation.
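To make the “teacher” idea concrete, here is a minimal sketch of knowledge distillation with soft targets in PyTorch. This is a generic illustration of the technique, not Meta’s actual training recipe; the temperature and loss weighting are arbitrary choices.
python
# Minimal sketch of knowledge distillation: a small "student" model learns to
# match the softened output distribution of a large, frozen "teacher" model.
# Generic illustration only — temperature and loss weight are arbitrary here.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between softened teacher and student distributions
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the ground-truth tokens
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Example with random tensors standing in for real model outputs
vocab_size, batch = 32000, 4
student_logits = torch.randn(batch, vocab_size, requires_grad=True)
teacher_logits = torch.randn(batch, vocab_size)          # frozen teacher, no grad
labels = torch.randint(0, vocab_size, (batch,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
print(f"distillation loss: {loss.item():.4f}")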
What are the Llama 4 features?
To understand Llama 4’s innovative features, we’ll introduce some key technical concepts.
Mixture of Experts (MoE) Architecture
The MoE architecture is like an intelligent “dispatching centre”: each token only activates a subset of the experts, which greatly improves operating efficiency.
For example, the Llama 4 Maverick model has approximately 400 billion parameters in total, but only about 17 billion are activated at any one time, allowing for much more efficient inference compared to traditional “dense” Transformer models.
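Here is a minimal sketch of top-k expert routing, the core idea behind an MoE layer: a small router scores the experts for each token and only the best few are actually run. It is a toy illustration, not Llama 4’s real implementation; the dimensions and expert count are arbitrary.
python
# Minimal sketch of a mixture-of-experts layer with top-k routing.
# Toy illustration only — sizes, expert count, and top_k are arbitrary.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)   # scores each expert per token
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                      # x: (tokens, d_model)
        scores = self.router(x)                # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalize over the chosen experts
        out = torch.zeros_like(x)
        # Each token is processed only by its top-k experts, then mixed by weight
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out

tokens = torch.randn(10, 64)                   # 10 tokens, hidden size 64
print(TinyMoELayer()(tokens).shape)            # torch.Size([10, 64])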
Native multimodality
Llama 4 fuses text and vision into a single unified backbone through joint pre-training. The vision encoder, based on MetaCLIP, is fine-tuned together with the LLM and handles multiple image inputs well: the models were pre-trained with up to 48 images per example and tested with up to 8 images after post-training. (A simplified sketch of the early-fusion idea follows the list below.)
This allows Llama 4 to perform tasks such as:
Answering questions about elements in images 📷❓
Analyzing relationships between multiple images 🖼️🔄
Interpreting and explaining graphs and charts 📊📈
Locating specific elements in an image (grounding) 🎯
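Here is a highly simplified sketch of what early fusion means in practice: image patches are projected into the same embedding space as text tokens and concatenated into one sequence before the transformer backbone processes them. All layer sizes and module choices are illustrative, not Llama 4’s actual architecture.
python
# Highly simplified sketch of early fusion: image patch embeddings and text token
# embeddings are merged into ONE sequence and fed to a single transformer backbone.
# All dimensions and modules here are illustrative, not Llama 4's real design.
import torch
import torch.nn as nn

d_model = 256
text_embed = nn.Embedding(32000, d_model)              # text token embeddings
patch_proj = nn.Linear(3 * 16 * 16, d_model)           # flattened 16x16 RGB patches -> tokens
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2
)

text_ids = torch.randint(0, 32000, (1, 12))            # 12 text tokens
patches = torch.randn(1, 49, 3 * 16 * 16)              # 49 image patches (7x7 grid)

text_tokens = text_embed(text_ids)                     # (1, 12, d_model)
image_tokens = patch_proj(patches)                     # (1, 49, d_model)

# Early fusion: one concatenated sequence goes through the same backbone
fused = torch.cat([image_tokens, text_tokens], dim=1)  # (1, 61, d_model)
out = backbone(fused)
print(out.shape)                                       # torch.Size([1, 61, 256])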
Extra-long context window
Llama 4 Scout’s 10-million-token context window is remarkable. Ten million tokens correspond to several million words of text, which is enough to cover an entire set of documents or a very large code base.
What does a long context window mean?
Processing documents with hundreds of pages
Analyzing large code bases
Maintaining a long conversation history
Multi-Document Summarization
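As a back-of-the-envelope check of what fits in such a window, here is a small script that walks a project folder and estimates its token count using the rough rule of thumb of about four characters per token. The heuristic and the placeholder path are assumptions; a real tokenizer would give exact numbers.
python
# Rough estimate of whether a code base fits in a 10M-token context window.
# Uses the common ~4 characters-per-token heuristic; this is an approximation,
# and the path below is just a placeholder.
from pathlib import Path

CONTEXT_WINDOW = 10_000_000       # Llama 4 Scout's advertised context length
CHARS_PER_TOKEN = 4               # rough heuristic, not an exact tokenizer

def estimate_tokens(root: str, extensions=(".py", ".js", ".ts", ".java", ".md")) -> int:
    total_chars = 0
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in extensions:
            total_chars += len(path.read_text(errors="ignore"))
    return total_chars // CHARS_PER_TOKEN

tokens = estimate_tokens("./my_project")      # placeholder path
print(f"~{tokens:,} tokens "
      f"({tokens / CONTEXT_WINDOW:.1%} of a 10M-token window)")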
Llama 4 Vs GPT-4o Vs Gemini 2.0 Flash
Let’s compare Llama 4 with GPT-4o and Gemini 2.0 Flash and see how it actually performs.
We will test the following tasks.
Summarise the PDF
In my last video, I talked about Graphiti; if you haven’t watched it yet, feel free to check it out. In this story, I will ask each model to summarise the Graphiti paper PDF.
Llama 4
The paper presents Graphiti, an automated reasoning technique for checking equivalence between graph database queries written in Cypher and relational queries in SQL.
The authors introduce the concept of database transformers, which transform database instances between graph and relational models. They propose a verification methodology that checks equivalence modulo a given transformer by reducing the original problem to verifying equivalence between a pair of SQL queries.
The approach involves embedding a subset of Cypher into SQL through syntax-directed translation, allowing the authors to leverage existing research on automated reasoning for SQL.
Graphiti has been implemented and used to check equivalence between graph and relational queries. The experiments demonstrate that Graphiti is useful for both verification and refutation and can uncover subtle bugs.
The proposed approach has two key advantages:
It can perform correct-by-construction transpilation when the user wants to translate a Cypher query to SQL but does not care about the underlying relational schema.
Given any arbitrary correctness specification, it can leverage existing automated reasoning tools for SQL to reason about equivalence between any pair of Cypher and SQL queries.
The authors have evaluated Graphiti on 410 benchmarks, revealing equivalence violations in 34 benchmarks.
About 80% of supported queries are verified as equivalent in a push-button manner.
Graphiti can generate SQL queries that are competitive with manually-written ones in terms of execution efficiency.
The key contributions of this paper include:
A technique for reasoning about equivalence between graph and relational queries based on a formal definition of equivalence modulo database transformer.
The concept of standard database transformer, which acts as the default correctness specification for equivalence between Cypher and SQL queries.
A sound and complete transpilation technique that translates a subset of Cypher queries into equivalent SQL queries.
A verification methodology that reduces the equivalence checking problem to checking equivalence between a pair of SQL queries.
GPT-4o
The paper titled "Graphiti: Bridging Graph and Relational Database Queries" introduces a novel method for verifying the equivalence between graph queries (Cypher) and relational queries (SQL). Here's a concise summary:
🔍 Purpose
To enable automated checking of whether a graph query in Cypher is equivalent to a relational query in SQL, which is crucial for database migrations and correctness guarantees.
🛠️ Key Contributions
Database Transformers: A formal framework for translating between graph and relational data models.
Cypher-to-SQL Translation: A syntax-directed method to embed a subset of Cypher into SQL, simplifying verification.
Verification via SQL Equivalence: Reduces the problem of comparing graph and SQL queries to checking equivalence between two SQL queries.
🧪 Implementation
Developed a tool called Graphiti.
Uses the proposed method to verify or refute the equivalence of Cypher and SQL queries.
Found subtle bugs in example queries from tutorials and research papers.
🧠 Significance
First approach to formally verify equivalence between graph and relational queries.
Useful for developers transitioning between database types or embedding graph capabilities in SQL-based systems.
Gemini 2.0 Flash
This paper presents GRAPHITI, a new automated reasoning technique for checking equivalence between graph database queries written in Cypher and relational queries in SQL. The approach centers around the concept of "database transformers" to map between graph and relational models. The core method involves embedding Cypher into SQL through syntax-directed translation, allowing the use of existing SQL verification tools. GRAPHITI is implemented as a tool and evaluated on a large benchmark of Cypher and SQL queries, demonstrating its ability to verify and refute equivalence and uncover subtle bugs, including those in existing Cypher tutorials. Key benefits include the ability to translate Cypher to SQL, find inconsistencies, and verify equivalencies in graph and relational queries. It automates a manual task that's prone to errors.
Llama 4 — Most complete and technically accurate.
Gemini 2.0 Flash — Balanced and readable, but slightly less detailed.
GPT-4o — Very concise and readable, but omits technical depth.
Data Science
I went to the LeetCode platform and selected a hard-level problem related to sorting algorithms: “Given two sorted arrays nums1 and nums2 of size m and n respectively, return the median of the two sorted arrays. The overall run time complexity should be O(log (m+n)).”
Llama 4
Median of Two Sorted Arrays
Problem Statement
Given two sorted arrays nums1 and nums2 of size m and n respectively, return the median of the two sorted arrays. The overall run time complexity should be O(log (m+n)).
Solution
We can solve this problem using binary search. The idea is to partition both arrays into two parts such that the elements on the left side of the partition in both arrays are less than or equal to the elements on the right side.
Code
python
def findMedianSortedArrays(nums1, nums2):
    """
    Returns the median of two sorted arrays.

    Args:
        nums1 (list): The first sorted array.
        nums2 (list): The second sorted array.

    Returns:
        float: The median of the two sorted arrays.
    """
    # Make sure that nums1 is the smaller array
    if len(nums1) > len(nums2):
        nums1, nums2 = nums2, nums1

    x, y = len(nums1), len(nums2)
    start = 0
    end = x

    while start <= end:
        # Partition nums1
        partitionX = (start + end) // 2
        # Partition nums2
        partitionY = ((x + y + 1) // 2) - partitionX

        # Calculate the values at the partitions
        maxLeftX = float('-inf') if partitionX == 0 else nums1[partitionX - 1]
        minRightX = float('inf') if partitionX == x else nums1[partitionX]
        maxLeftY = float('-inf') if partitionY == 0 else nums2[partitionY - 1]
        minRightY = float('inf') if partitionY == y else nums2[partitionY]

        # Check if the partitions are correct
        if maxLeftX <= minRightY and maxLeftY <= minRightX:
            # Calculate the median
            if (x + y) % 2 == 0:
                return (max(maxLeftX, maxLeftY) + min(minRightX, minRightY)) / 2
            else:
                return max(maxLeftX, maxLeftY)
        # Move the partition to the left
        elif maxLeftX > minRightY:
            end = partitionX - 1
        # Move the partition to the right
        else:
            start = partitionX + 1


# Example usage
nums1 = [1, 3]
nums2 = [2]
print(findMedianSortedArrays(nums1, nums2))  # Output: 2.0

nums1 = [1, 2]
nums2 = [3, 4]
print(findMedianSortedArrays(nums1, nums2))  # Output: 2.5
Explanation
We first make sure that nums1 is the smaller array. This is done to simplify the logic and reduce the number of edge cases.
We then initialize the start and end pointers for the binary search. The start pointer is set to 0, and the end pointer is set to the length of nums1.
We loop while the start pointer is less than or equal to the end pointer.
Inside the loop, we calculate the partition points for both arrays. The partition point for nums1 is calculated as the average of the start and end pointers. The partition point for nums2 is calculated such that the total number of elements on the left side of the partitions is equal to half of the total number of elements.
We then calculate the values at the partitions. If the partition point is 0, we set the value to negative infinity. If the partition point is equal to the length of the array, we set the value to positive infinity.
We check if the partitions are correct by comparing the values at the partitions. If the maximum value on the left side of the partition in nums1 is less than or equal to the minimum value on the right side of the partition in nums2, and the maximum value on the left side of the partition in nums2 is less than or equal to the minimum value on the right side of the partition in nums1, then the partitions are correct.
If the partitions are correct, we calculate the median. If the total number of elements is even, the median is the average of the maximum value on the left side and the minimum value on the right side. If the total number of elements is odd, the median is the maximum value on the left side.
If the partitions are not correct, we move the partition to the left or right based on the comparison of the values at the partitions.
Time Complexity
The time complexity of this solution is O(log (min(m, n))), where m and n are the lengths of the two input arrays. This is because we are using binary search to find the correct partition.
Space Complexity
The space complexity of this solution is O(1), which means the space required does not change with the size of the input arrays.
GPT-4o
To solve this in O(log(min(m, n))) time, we can use a binary search approach on the smaller array. The key idea is to partition the two arrays such that the elements on the left of the partition are less than or equal to those on the right and the number of elements on each side is balanced.
Here’s the Python code implementing this:
python
def findMedianSortedArrays(nums1, nums2):
    # Ensure nums1 is the smaller array to optimize binary search
    if len(nums1) > len(nums2):
        nums1, nums2 = nums2, nums1

    m, n = len(nums1), len(nums2)
    total = m + n
    half = total // 2

    left, right = 0, m

    while True:
        i = (left + right) // 2  # partition nums1
        j = half - i             # partition nums2

        # Edge elements of the partition
        nums1Left = nums1[i - 1] if i > 0 else float("-infinity")
        nums1Right = nums1[i] if i < m else float("infinity")
        nums2Left = nums2[j - 1] if j > 0 else float("-infinity")
        nums2Right = nums2[j] if j < n else float("infinity")

        # Check if correct partition
        if nums1Left <= nums2Right and nums2Left <= nums1Right:
            # Odd total
            if total % 2:
                return min(nums1Right, nums2Right)
            # Even total
            return (max(nums1Left, nums2Left) + min(nums1Right, nums2Right)) / 2
        elif nums1Left > nums2Right:
            right = i - 1
        else:
            left = i + 1
How it works:
We binary search on the smaller array nums1.
At each step, we try a partition i in nums1 and j = half - i in nums2.
We check if the left elements of both arrays are less than or equal to the right elements.
Once we find the right partition, we compute the median based on even or odd total length.
Gemini 2.0 Flash
def findMedianSortedArrays(nums1, nums2):
    """
    Finds the median of two sorted arrays.

    Args:
        nums1: The first sorted array.
        nums2: The second sorted array.

    Returns:
        The median of the two sorted arrays.
    """
    if len(nums1) > len(nums2):
        nums1, nums2 = nums2, nums1  # Ensure nums1 is the shorter array

    m, n = len(nums1), len(nums2)
    low, high = 0, m

    while low <= high:
        partitionX = (low + high) // 2
        partitionY = (m + n + 1) // 2 - partitionX

        maxLeftX = nums1[partitionX - 1] if partitionX > 0 else float('-inf')
        minRightX = nums1[partitionX] if partitionX < m else float('inf')
        maxLeftY = nums2[partitionY - 1] if partitionY > 0 else float('-inf')
        minRightY = nums2[partitionY] if partitionY < n else float('inf')

        if maxLeftX <= minRightY and maxLeftY <= minRightX:
            if (m + n) % 2 == 0:
                return (max(maxLeftX, maxLeftY) + min(minRightX, minRightY)) / 2
            else:
                return max(maxLeftX, maxLeftY)
        elif maxLeftX > minRightY:
            high = partitionX - 1
        else:
            low = partitionX + 1
All three versions are correct, optimal, and well-structured; there is no significant difference between them.
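As a quick sanity check of that claim, here is a small property-style test that compares any of the three findMedianSortedArrays implementations above against Python's statistics.median on random inputs (it assumes one of the definitions above is present in the same script).
python
# Quick sanity check: compare findMedianSortedArrays (any of the three versions
# above, defined in the same script) against statistics.median on random inputs.
import random
import statistics

random.seed(0)
for _ in range(1000):
    a = sorted(random.sample(range(100), random.randint(0, 10)))
    b = sorted(random.sample(range(100), random.randint(1, 10)))  # at least 1 element
    expected = statistics.median(a + b)
    got = findMedianSortedArrays(a, b)
    assert abs(got - expected) < 1e-9, (a, b, got, expected)
print("All 1000 random cases match statistics.median")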
Conclusion :
Meta’s move is undoubtedly an attempt to catch up with or even surpass the leaders in the AI industry. By introducing the Mixture of Experts architecture, Llama 4 is no longer just a “bigger” model, but a “smarter” one.
Whether Llama 4 can truly challenge the top players in the AI field remains to be seen, but one thing is certain: Meta’s llama is no longer a docile animal.
🧙‍♂️ I am a Generative AI expert! If you want to collaborate on a project, drop an inquiry here or Book a 1-on-1 Consulting Call With Me.