The growing prevalence of large language models in real-world applications necessitates a deeper understanding of their contextual reasoning capabilities. Despite impressive performance on a variety of tasks, these models often struggle to consistently interpret and integrate complex contextual information, highlighting a critical gap in current evaluation practices. This paper introduces a novel suite of robust benchmarks specifically designed to assess contextual reasoning in large language models. By incorporating diverse and challenging test cases that mirror real-world ambiguity and multi-layered context, our benchmarks aim to uncover both the strengths and limitations of these systems. Extensive experimental evaluations reveal significant variability in performance across different models, emphasizing the need for standardized, context-aware assessment tools. The insights gained from this study not only advance our understanding of contextual reasoning in AI but also provide a solid foundation for the development of next-generation models with improved interpretative and reasoning capabilities.
Bing Zhang, Mikio Takeuchi, Ryo Kawahara, Shubhi Asthana, M. Shamim Hossain, Ge Ren, Kate Soule, Yifan Mai, Yada Zhu