v4.0.0 - Nejumi LLM Leaderboard 4

@nejumi released this 27 Aug 05:06
5ffd16e

Nejumi LLM Leaderboard 4 Release Notes

Tokyo, Japan — August 27, 2025

Weights & Biases Japan has released Nejumi LLM Leaderboard 4, the third major update to Japan’s largest open benchmarking platform for Japanese-language LLMs. This update significantly expands evaluation coverage to keep pace with cutting-edge models, focusing on advanced reasoning, deep knowledge, application development capabilities, and safety assessments. The goal is to provide enterprises with more practical benchmarks for selecting and deploying LLMs.


Key Updates

  1. Advanced Reasoning Benchmarks

    • Added ARC-AGI and ARC-AGI-2 to evaluate mathematical and abstract reasoning.
  2. Deeper Knowledge Evaluation

    • Introduced JMMLU-Pro and Humanity’s Last Exam to assess PhD-level knowledge and reasoning.
  3. Application Development Capabilities

    • New evaluation category for LLM-based application development.
    • Includes SWE-Bench Verified, JHumanEval, MT-Bench Coding, and BFCL for function/tool use.
  4. Enhanced Safety Assessments

    • Strengthened reproducible safety evaluations with widely available datasets:

      • M-IFEVAL (multilingual instruction following)
      • HalluLens (truthfulness / hallucination detection)
  5. Enterprise & Developer Support

    • Fully open source, with faster evaluation pipelines and unified interfaces.
    • Private leaderboard deployments available under W&B Enterprise licenses.
    • Interactive analytics on the W&B platform for task-level comparisons and decision support.

Insights from Nejumi 4

  • Performance gaps are clearer again: New reasoning and coding tasks differentiate frontier models.
  • GPT-5 vs Claude Opus 4.1: Results are nearly tied. Opus 4.1 is stronger in application tasks, while GPT-5 still leads in knowledge and QA. However, Opus 4.1 comes at a significantly higher cost per token.
  • Category-level variance: Translation tasks are nearly saturated, while abstract reasoning, domain knowledge, coding, and function calling remain frontier areas with large headroom for improvement.

Links