A new community-driven initiative evaluates large language models using Italian-native tasks, with AI translation among the ...
Over the last decade, artificial intelligence (AI) has been largely built around large language models (LLMs). These systems are trained on large bodies of text and generate output by predicting one token after another in sequence.
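The token-by-token generation described above can be sketched in a few lines. This is a toy illustration, not any real model: the probability tables below are hard-coded assumptions standing in for what an actual LLM learns from data.

```python
import random

# Hypothetical next-token probability tables for a toy "model".
# A real LLM computes these distributions with a neural network;
# here they are hard-coded for illustration only.
NEXT_TOKEN_PROBS = {
    ("the",): {"cat": 0.6, "dog": 0.4},
    ("the", "cat"): {"sat": 0.7, "ran": 0.3},
    ("the", "cat", "sat"): {"<eos>": 1.0},
    ("the", "dog"): {"ran": 1.0},
    ("the", "dog", "ran"): {"<eos>": 1.0},
}

def generate(prompt, greedy=True):
    """Autoregressively extend the prompt one token at a time."""
    tokens = list(prompt)
    while True:
        probs = NEXT_TOKEN_PROBS.get(tuple(tokens))
        if probs is None:
            break  # context not in our toy table; stop
        if greedy:
            nxt = max(probs, key=probs.get)  # most likely next token
        else:
            nxt = random.choices(list(probs), weights=list(probs.values()))[0]
        if nxt == "<eos>":
            break  # end-of-sequence token terminates generation
        tokens.append(nxt)
    return tokens

print(generate(["the"]))  # greedy decoding yields ['the', 'cat', 'sat']
```

Greedy decoding always picks the highest-probability token; sampling (`greedy=False`) instead draws from the distribution, which is why the same prompt can produce different continuations.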
MLCommons today released AILuminate, a new benchmark test for evaluating the safety of large language models. Launched in 2020, MLCommons is an industry consortium backed by several dozen tech firms.
Overview: Large language models predict text; they do not truly calculate or verify math. High scores on known datasets do not ...
New York, NY, Dec. 09, 2025 (GLOBE NEWSWIRE) -- Sword Health, the world’s leading AI Health company, today unveiled MindEval, the industry’s first benchmark designed to evaluate how large language ...
Abu Dhabi's Technology Innovation Institute unveiled Falcon-H1 Arabic, a powerful new AI model excelling in Arabic language ...
Large language models (LLMs) show promise in assisting knowledge-intensive fields such as oncology, where up-to-date information and multidisciplinary expertise are critical. Traditional LLMs risk ...
This study introduces MathEval, a comprehensive benchmarking framework designed to systematically evaluate the mathematical reasoning capabilities of large language models (LLMs). Addressing key ...
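Benchmarks like those above typically score a model by comparing its answers against ground-truth references. The sketch below shows one common scoring scheme, exact-match accuracy after light normalization; it is a generic illustration under assumed inputs, not MathEval's actual methodology, and the example answers are hypothetical.

```python
def score_exact_match(predictions, references):
    """Fraction of model answers that exactly match the reference
    answer after light normalization (strip whitespace, lowercase)."""
    def norm(s):
        return s.strip().lower()
    correct = sum(norm(p) == norm(r) for p, r in zip(predictions, references))
    return correct / len(references)

# Hypothetical model outputs vs. ground-truth answers.
preds = ["42", " 7 ", "x = 3", "0.5"]
refs  = ["42", "7",   "x = 2", "0.5"]
print(score_exact_match(preds, refs))  # 3 of 4 match -> 0.75
```

Exact match is strict by design: the third answer is marked wrong even though it is close in form, which is one reason math benchmarks often add answer-extraction and normalization steps before comparison.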