Kaggle is shifting the focus of AI evaluation from static data sets to dynamic, user-generated script tests that run directly in the notebook environment. As a key lever for user acquisition, the Google subsidiary offers free API access to inference models from third-party providers such as Anthropic and DeepSeek.
- Free inference via proxy: Kaggle covers the full API costs for premium models such as Claude 3.5 Sonnet, DeepSeek, and Gemini Pro, as long as the undocumented “fair use” limits are observed.
- No GPT support: The framework has a critical gap, as OpenAI models (GPT-4o/o1) are not technically supported in the current community tier and are therefore missing from the benchmark.
- Code-first instead of static data: The evaluation is not based on static CSV uploads but on executable Python code (`@kbench.task`), making prompting strategies and validation logic transparently visible.
- Strict submission limits: To keep the leaderboard clean, the architecture allows exactly one “main task” per notebook, which must be defined using the magic command `%choose`.
The Trust Gap: Why manufacturer metrics are no longer sufficient
When a new model release is celebrated at a keynote with “state-of-the-art performance” and bar charts, the developer community now responds with healthy skepticism. The reason is an open secret in the industry: data contamination.
Since modern LLMs are trained on virtually the entire internet, there is a high probability that the test questions (from well-known datasets such as MMLU or GSM8K) were already included in the training set. The model does not “understand” the task—it simply remembers the solution. The result is overfitting on test sets: the model shines in the benchmark but fails in your production environment on simple edge cases.
The central problem with current evaluations can be pinpointed to three weaknesses:
- Static leaderboards are manipulable: As soon as a benchmark such as MMLU becomes the industry standard, manufacturers optimize their models specifically for it (“Goodhart’s Law”). The top rankings often reflect who has optimized best for the test, not who has the most intelligent model.
- The “average” error: A high score on a general academic dataset says nothing about whether a model can handle specific niche tasks. As the research shows, experts – as in the “wastewater engineering” example – do not need general physics knowledge, but rather a deep understanding of the process to diagnose errors in pump systems. Standard metrics do not reflect this.
- Black box evaluation: In many cases, you only see the end result (e.g., “88.5%”). You don’t see the code that led to it. Was “chain of thought” used? How often was the model re-queried (best of N)? Without the evaluation code, the number is worthless.
This is exactly where the chain of trust breaks down. Developers and companies don’t need more PDF tables from manufacturers, but rather reproducible tests “in the wild.” The demand is shifting from static Q&A lists to dynamic tasks in which not only the result but also the method (prompting, retry logic, parsing) is transparent and executable. With its latest launch, Kaggle is positioning itself precisely in this gap: moving away from the “what” (the number) to the “how” (the code).
Under the Hood: The Architecture of Community Benchmarks
The core of the new system is not another static data set library, but the kaggle-benchmarks Python SDK. Kaggle is thus shifting its focus from pure data hosting to algorithmic validation directly in the browser. From a technical perspective, a Kaggle Notebook transforms from a pure analysis environment into a standardized test harness in which inference, validation, and scoring are performed in one go.
The architecture is based on three technical pillars:
- Code-first definition via decorators: You define a benchmark not through JSON configurations, but as a Python function. Using the `@kbench.task` decorator, you mark the logic that is to be executed. This makes the testing process transparent: every user can see exactly in the notebook whether a tricky system prompt was used or whether the model had to respond “naked.”
- Managed Inference API: Kaggle acts as a proxy here. Within the notebook, you can access models from Google (Gemini Pro/Flash) and external partners such as Anthropic (Claude 3.5), DeepSeek, or Qwen via the SDK. Important for your architecture planning: there is currently no native integration for OpenAI models in the Community tier, so you are primarily comparing the open-weight and Google ecosystems here.
- LLM-as-a-Judge & Assertions: Evaluation is no longer based solely on simple string matches. With `kbench.assertions`, the SDK offers tools for regex validation, but also integrates LLM-as-a-Judge workflows in which a more powerful model (e.g., Gemini Pro) evaluates the output of a smaller model against qualitative criteria.
To prevent cherry-picking from distorting the leaderboard, the architecture enforces a technical limitation: only one main task per notebook may be submitted for ranking. This is done using the explicit magic command %choose. Only the task defined here ends up in the global comparison, which ensures that the community does not accidentally interpret unfinished secondary tests as benchmark results.
This architecture solves the “black box” problem: since the evaluation code is mandatory as an executable notebook, each benchmark is fully reproducible. You not only see the result “92% accuracy,” but you can also fork the code and check whether the kbench.evaluate function was implemented correctly or whether the test set contained leaked data.
Practice: How to run your own LLM benchmark
Getting started in the Kaggle environment is radically simplified, as you don’t have to worry about GPU drivers or API keys. At its core is the new kaggle-benchmarks Python SDK, which gives you access to proprietary and open models.
1. Setup and model selection
Create a new notebook and install the SDK. What’s special: Kaggle acts as a proxy. You can directly access models from Google (Gemini Pro/Flash), Anthropic (Claude), or DeepSeek without burning your own credits. Note, however, that OpenAI models (GPT-4o) are not supported in the current community tier.
2. Define the benchmark (code-first)
Unlike static multiple-choice tests, you define your benchmark as a Python function. This is where the kbench decorator comes into play. A typical workflow for a custom task (e.g., for technical documentation) looks like this:
```python
import kaggle_benchmarks as kbench

# Define the task and the evaluation logic
@kbench.task(name="tech_doc_validation")
def validate_documentation(llm, prompt_text: str, required_terms: list):
    # 1. Inference: the model generates a response (free via proxy)
    response = llm.prompt(f"Explain the following function to laypeople: {prompt_text}")

    # 2. Evaluation: LLM-as-a-Judge or regex;
    #    a stronger model (the judge) checks whether the answer is correct
    score = kbench.evaluate.llm_judge(
        response=response,
        criteria=f"Text must be understandable and contain terms: {required_terms}"
    )

    # 3. Return the score (0.0 to 1.0) for ranking
    return score
```
3. Execution and limits
To start the test, run the function against the desired models. The SDK allows parallel testing:
- Command: `validate_documentation.run(llm=kbench.llms.all_supported, ...)` (see the sketch below).
- Models: You test simultaneously against the available range (e.g., Claude 3.5 Sonnet vs. Qwen).
- Infrastructure: The computation does not run on your local notebook VM, but via Kaggle’s API endpoints. As long as you stay within the “fair use” quotas, there are no costs.
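As a rough sketch of a full call, assuming the task’s own parameters (`prompt_text`, `required_terms`) are simply forwarded as keyword arguments (the exact `.run()` signature beyond `llm=` is not documented here, and the sample values are invented), it might look like this:

```python
# Sketch only: keyword-argument forwarding and the sample values are assumptions,
# not confirmed SDK behavior. kbench.llms.all_supported is the model list from above.
validate_documentation.run(
    llm=kbench.llms.all_supported,
    prompt_text="def mean(xs): return sum(xs) / len(xs)",
    required_terms=["average", "list"],
)
```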
4. Submission to the leaderboard
There is an important technical restriction here: A notebook may contain multiple tests, but only one “main task” counts for the public leaderboard. In order for your benchmark to be listed, you must use the magic command to mark this task:
```
%choose tech_doc_validation
```
Without this command, your notebook remains a private analysis and does not appear in the community comparison. After “Save & Run All,” your custom score is calculated and, if the validation (assertions) passes, published.
Practice: How to run your own LLM benchmark
The process on Kaggle differs fundamentally from local setups. Instead of configuring GPU clusters, you use the kaggle-benchmarks Python SDK directly in the notebook environment. The biggest advantage here is that Kaggle covers the costs for inference.
1. Setup and model selection
Instead of starting with a blank notebook, you should ideally fork an existing template or install the SDK. High-end models are accessed via integrated API proxies rather than local weights.
The SDK gives you instant access to:
- Google: Gemini series (Flash, Pro)
- Third-party: Anthropic (Claude), DeepSeek, Qwen (Alibaba)
Important: Currently, no OpenAI models (GPT-4o/o1) are supported in the Community Tier. If you want to test against GPT, this framework is currently the wrong choice.
2. Execution: The “code-first” approach
Unlike static multiple-choice tests, here you define a Python function as a task. A typical workflow looks like this:
- Task definition: You decorate your test function with `@kbench.task`.
- Inference: You send prompts to the supported models.
- Evaluation (LLM-as-a-Judge): You use a strong model (e.g., Gemini Pro) to evaluate the response of a weaker model or a competitor.
- Assertion: You define hard criteria (e.g., regex matches or minimum scores).
The SDK enforces a clear structure. To make your result visible on the leaderboard, you must use the magic command %choose:
```
# Selects the task for the leaderboard (only ONE allowed per notebook)
%choose my_custom_task_name
```
3. Customizing: Your own data instead of MMLU
The real strength lies in uploading your own niche test data (“domain-specific evals”). A concrete example from the community is the “WWTP Engineering Benchmark” (Wastewater Treatment). Here, models are not tested on general knowledge, but on fault diagnosis in centrifugal pumps or safety protocols in sewage treatment plants.
Simply upload your CSV/JSON with scenarios as a Kaggle dataset and iterate over it. Since inference runs on Kaggle quotas (“fair use”), you can run hundreds of rows against expensive models like Claude 3.5 without charging your credit card at Anthropic – as long as you stay within the undocumented limits.
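As a rough sketch of that loop (the dataset slug, the column names, the placeholder task `my_domain_task`, and the assumption that `.run()` forwards per-row values as keyword arguments are all illustrative, not confirmed API details):

```python
import pandas as pd
import kaggle_benchmarks as kbench

# Scenario file uploaded as a Kaggle dataset (slug and column names are placeholders)
scenarios = pd.read_csv("/kaggle/input/wwtp-eval-scenarios/scenarios.csv")

for _, row in scenarios.iterrows():
    # my_domain_task stands in for your own @kbench.task-decorated function;
    # forwarding per-row values as keyword arguments is an assumption.
    my_domain_task.run(
        llm=kbench.llms.all_supported,
        scenario=row["scenario"],
        expected_root_cause=row["expected_root_cause"],
    )
```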
4. Output and reporting
At the end of the run, the SDK generates a report with scores (e.g., 1–5 stars or pass/fail rates). These metrics automatically end up in the “Community Benchmarks” tab of the model. Since all evaluation code is public, any third party can check whether your prompts have biased the model (“prompt engineering” vs. true intelligence) or whether your kbench.assertions are valid.
Practice: How to run your own LLM benchmark
Getting started with Kaggle Community Benchmarks is fundamentally different from classic Hugging Face pipelines. Here, you don’t evaluate locally on your hardware, but use Kaggle’s infrastructure as a proxy for API calls. The process is based entirely on the kaggle-benchmarks Python SDK and runs directly in the browser notebook.
1. Setup & Available Models
First, install the SDK in your Kaggle notebook. The key advantage: Kaggle covers the inference costs for the supported models within a “fair use” quota. Among others, the following are available:
- Google: Gemini series (Flash, Pro)
- Third-party: Anthropic (Claude models), DeepSeek, Qwen (Alibaba)
- Important: OpenAI models (GPT-4o/o1) are currently not technically supported in the Community tier.
2. Definition of the benchmark (code-first)
Instead of uploading static data sets, you define the logic as a Python function. Kaggle relies heavily on LLM-as-a-Judge: a strong model evaluates the output of another model.
Here is the schema using the example of the “Wastewater Engineering Benchmark” (user: Mehmet Isik), which tests specific domain knowledge:
```python
import kaggle_benchmarks as kbench

# 1. Task definition: the failure scenario
@kbench.task(name="pump_failure_diagnosis")
def diagnose_pump(llm, scenario: str, expected_root_cause: str):
    # Prompt to the model under test (free via proxy)
    response = llm.prompt(f"System Alarm: {scenario}. Diagnose the root cause.")

    # 2. Evaluation: a "judge" (e.g., Gemini Pro) checks the response
    score = kbench.evaluate.llm_judge(
        response=response,
        criteria=f"Must identify {expected_root_cause} and suggest safety shut-off."
    )

    # 3. Assertion: the score must exceed 4/5 for the test to pass
    kbench.assertions.assert_greater(score, 4)
```
3. Execution and submission limits
To run the benchmark, call the .run() method. The SDK fires the prompt in parallel against all selected models (e.g., Claude 3.5 Sonnet vs. DeepSeek vs. Gemini Pro).
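A minimal sketch of that call, assuming the task’s parameters (`scenario`, `expected_root_cause`) are forwarded as keyword arguments (the exact `.run()` signature beyond `llm=` is undocumented here, and the sample values are invented):

```python
# Sketch only: argument forwarding and the sample scenario are assumptions.
diagnose_pump.run(
    llm=kbench.llms.all_supported,
    scenario="Centrifugal pump P-102 shows rising vibration and falling discharge pressure",
    expected_root_cause="cavitation",
)
```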
There is a hard technical limit for the official leaderboard: a notebook may only score a single “main task.” You must use the so-called magic command to specify which task counts:
```
# Mandatory for the leaderboard
%choose diagnose_pump
```
This prevents users from “spamming” hundreds of micro-tests in a single run. The focus is on curated, in-depth test scenarios that often cover niches that are missing in academic sets such as MMLU.
Here, you leave the world of static CSV uploads. Instead of rigid data sets, you define tasks in Kaggle using Python code. At the heart of it all is the kaggle-benchmarks SDK. To create a niche-specific test—for example, for German legal RAG systems or complex engineering problems—you write a Python function and decorate it with @kbench.task.
A powerful real-world example is Mehmet Isik’s “WWTP Engineering Benchmark.” Instead of testing general knowledge, this benchmark tests models on scenarios such as material fatigue or pump failures in wastewater treatment plants. The model receives a specific prompt (e.g., an error message) and must diagnose a cause.
The workflow for your customization looks like this:
- Define the scenario: You create the logic in the notebook.
- Inference (free): You access models such as Gemini (Pro/Flash), Claude (Anthropic), or DeepSeek via the SDK. Kaggle covers the cost of the API calls as long as you stay within the (undocumented) “fair use” limits.
- Important: Currently, no OpenAI models are supported in the Community tier. So you are primarily pitting Google models and open-weight models against each other, without GPT-4 in the mix.
- Evaluation (LLM-as-a-Judge): Since complex answers are difficult to check with regex, you use a powerful model (e.g., Gemini Pro) as a judge. With `kbench.assertions`, you define the score at which a test is considered passed.
Here is a simplified scheme for implementation:
```python
import kaggle_benchmarks as kbench

@kbench.task(name="your_niche_test")
def evaluate_rag(llm, scenario, expected_answer):
    # Inference (free API via the Kaggle proxy)
    response = llm.prompt(f"Scenario: {scenario}. Diagnose the problem.")

    # Evaluation by a judge LLM
    score = kbench.evaluate.llm_judge(
        response=response,
        criteria=f"Must contain {expected_answer}."
    )

    # Validation
    kbench.assertions.assert_greater(score, 4)
```
Once the benchmark is running, you will receive scores for all selected models. However, there is a strict technical restriction for publication on the leaderboard: A notebook may only nominate a single “main task” for ranking. This is done using the magic command:
```
%choose your_niche_test
```
Without this command, your benchmark remains a local exercise in the notebook. With it, your results end up directly in the comparable leaderboard, where the community can check whether DeepSeek or Gemini solves your special case better.
Google vs. Hugging Face: The battle for interpretive authority
The Hugging Face Open LLM Leaderboard has been considered the gold standard for evaluating open-source models for years. It relies on rigorous, academic metrics such as MMLU or GSM8K to make models comparable under laboratory conditions. Kaggle takes a diametrically different approach with its community benchmarks: Instead of standardized tests for the general public (“Is model A smarter than model B?”), Google focuses on application-specific scenarios (“Can model A query my specific SQL database?”).
The key difference lies in the execution: While Hugging Face is primarily based on configurable datasets (via LightEval), Kaggle benchmarks are executable Python code. This means that you don’t upload static results, but define inference logic that runs live in the Kaggle cloud.
Here are the key differences in a direct comparison:
| Feature | Kaggle Community Benchmarks | Hugging Face (Open LLM Leaderboard) |
|---|---|---|
| Philosophy | Dynamic (“Wild West”): Focus on niche use cases (e.g., regex checks, wastewater engineering) and custom code. | Standardized (academic): Focus on comparability through defined metrics (MMLU, IFEval). |
| Costs & Compute | Hosted / Free Tier: Kaggle covers the API bill for models such as Gemini, Claude, and DeepSeek (within quotas). | Bring Your Own Compute: You pay for GPU inference (locally or via endpoints) to run `LightEval`. |
| Transparency | Code-First: The test code is openly available as a notebook. Anyone can follow the prompt strategy 1:1. | Config-Based: Often abstracted by evaluation frameworks; the focus is on the end result (score). |
The aggressive “free compute” lever
Kaggle’s strongest argument is not the technology, but the business model. By offering free API access to third-party models such as Anthropic (Claude) or DeepSeek within the benchmarks (“free access within quota limits”), Google is massively lowering the barrier for developers. With Hugging Face, you have to rent expensive GPU clusters for comprehensive benchmarks; with Kaggle, a browser tab is sufficient.
Complementary rather than competitive?
Currently, both platforms serve different target groups. Hugging Face remains indispensable for ML researchers and foundation model builders who need scientifically robust comparisons against the state of the art.
Kaggle, on the other hand, positions itself as a playground for app developers and domain experts. It is the place for the “long tail” of evaluation: tests that are irrelevant for research but crucial for practical application. One drawback, however, is the walled garden: the Kaggle Benchmarks SDK is tightly coupled to Google’s infrastructure, while Hugging Face tooling is more agnostic. In addition, Kaggle currently lacks integration of OpenAI models, which makes a direct comparison with GPT-4 difficult.
Limitations and challenges
Even though the “community first” approach sounds promising, a closer look at the technical architecture and governance reveals significant hurdles to productive use.
The “black box” quota risk
Kaggle aggressively advertises free API access to premium models from Anthropic (Claude), Google (Gemini), and DeepSeek. While this solves the cost problem of evaluation, it creates a new dependency: the “fair use” limits are undocumented.
- Lack of transparency: There is no clear price list or visible token limit. If your benchmark times out or gets blocked, you as a developer have little recourse. For serious research that requires reproducibility, a budget that can be stopped at any time without warning (“within quota limits”) is a risk.
- The missing market leader: OpenAI (GPT-4o/o1) is currently not supported in the Community tier. A benchmark that excludes the de facto industry standard remains incomplete. Here, Google and its partners are compared against the rest, but without the most important benchmark competitor.
Vendor lock-in instead of open standard
Unlike Hugging Face’s LightEval, which you can run locally on any hardware, the kaggle-benchmarks SDK ties you closely to the platform infrastructure.
- Walled garden: The tests are Python functions tailored to Kaggle’s environment and its API proxies. You can’t simply export your elaborately written evaluation suites and run them in your own CI/CD pipeline.
- Technical hurdle: This is not a drag-and-drop tool for product managers. If you want to create benchmarks, you need to be proficient in Python, understand the `@kbench.task` decorator pattern, and write validation logic (`kbench.assertions`). The entry barrier is significantly higher than for classic leaderboards.
Manipulation and technical limits
Community-driven also means less standardization.
- Gameability: Since the validation code (e.g., regex matching) is openly available in the notebook, there is a risk that models or prompts will be specifically “fitted” to these assertions instead of solving the underlying problem (overfitting to the test code); see the sketch after this list.
- Submission limit: Currently, the system only allows a single “main task” per notebook for the leaderboard (controlled via the `%choose` command). Complex evaluation suites that want to test and aggregate a model across different disciplines simultaneously hit their architectural limits here.
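To make the gameability point concrete, here is a deliberately weak, purely illustrative check in plain Python (not the SDK’s API): because the pattern is visible in the public notebook, a prompt that simply instructs the model to always mention the keyword passes it without any real diagnostic reasoning.

```python
import re

# Illustrative only: a naive, publicly visible assertion that is trivial to "game".
# Any prompt ending in "Always mention cavitation in your answer." satisfies this
# check, regardless of whether the model actually diagnosed the fault.
def naive_check(response: str) -> bool:
    return bool(re.search(r"\bcavitation\b", response, re.IGNORECASE))
```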
Conclusion
Kaggle Community Benchmarks are the long-overdue death knell for static manufacturer PDFs. The platform finally shifts the authority of interpretation to where it belongs: from the marketing department back to the code. The approach of viewing validation not as an abstract score (MMLU) but as an executable Python script is the only viable way to combat data contamination and Goodhart’s Law.
But let’s not kid ourselves: this is not neutral ground. It is a gilded cage built by Google. You get free computing power (“free compute”), but you pay for it with massive vendor lock-in to the Kaggle ecosystem. The fact that the current industry standard GPT-4o (OpenAI) is completely missing currently reduces the platform to a “Google vs. the rest” comparison instead of the ultimate showdown of all LLMs.
Nevertheless, anyone building real applications will find more truth here than in any Hugging Face leaderboard, because here it’s “edge cases” and process knowledge that count, not memorized Wikipedia knowledge.
Is it worth the effort?
Yes, but as a supplement, not a replacement. The ability to pit expensive models such as Claude 3.5 Sonnet or Gemini Pro against each other for free via API is an unbeatable argument for quick experiments. For scientific research or productive CI/CD pipelines, the system is (still) too proprietary and undocumented in its limits.
Who is it for?
- USE IT IF YOU:
- Are an applied AI engineer: Want to know if model A writes better SQL queries for your schema than model B? You can test it here for free.
- Are a domain expert: You have specific knowledge (law, medicine, engineering) and want to build specialized tests (“unit tests for LLMs”) that go beyond general knowledge.
- Are budget-conscious: Want to compare DeepSeek V3 and Claude 3.5 without charging your credit card for every API call.
- DON’T DO IT IF YOU:
- Are an OpenAI power user: Without GPT-4 integration, you are missing the most important benchmark opponent.
- Are a researcher: If you need clean, academic comparability (reproducibility on your own hardware), stick with Hugging Face and `LightEval`.
- Are looking for enterprise-ready solutions: The “fair use” quotas are undocumented. Don’t build business-critical eval pipelines on a “maybe available” free tier.
Action Plan
- Ignore the leaderboards. The rankings are secondary.
- Go into the code. Fork a notebook that resembles your use case (e.g., RAG or code gen).
- Take full advantage of the free tier. Write your own assertions and run the expensive models (Claude/Gemini) against each other for free to get a feel for their reasoning quality in your niche.
- Wait and see. As long as OpenAI is not integrated, Kaggle remains an extremely useful sandbox, but not a complete market overview.





