
Leading Artificial Intelligence (AI) models may perform impressively on basic scientific tasks, but they still fall short when it comes to deeper reasoning, a new study by researchers from IIT Delhi and Friedrich Schiller University (FSU) Jena, Germany, has found. Published in Nature Computational Science, the study reveals that although current AI systems can handle simple perception-based tasks with near-perfect accuracy, they fail to demonstrate genuine scientific understanding.
The research team, led by IIT Delhi Associate Professor N. M. Anoop Krishnan and FSU Jena Professor Kevin Maik Jablonka, developed MaCBench, the first comprehensive benchmark designed to assess how vision–language models perform on real-world chemistry and materials science tasks. Their results revealed a striking paradox: while the models performed nearly flawlessly at identifying laboratory equipment, they struggled with spatial reasoning, cross-modal information integration, and multistep logical inference, skills essential for authentic scientific discovery.
“Our findings represent a crucial reality check for the scientific community. While these AI systems show remarkable capabilities in routine data processing tasks, they are not yet ready for autonomous scientific reasoning,” Krishnan said, as reported by PTI.
He added, “The strong correlation we observed between model performance and internet data availability suggests these systems may be relying more on pattern matching than genuine scientific understanding.”
Safety Assessments Reveal Gaps
Highlighting one of the most worrying outcomes, Jablonka said, “While models excelled at identifying laboratory equipment with 77% accuracy, they performed poorly when evaluating safety hazards in similar laboratory setups, achieving only 46% accuracy. This disparity between equipment recognition and safety reasoning is particularly alarming.” He added that such shortcomings reveal that AI cannot yet replace the tacit knowledge scientists rely on for safe laboratory operations. “Scientists must understand these limitations before integrating AI into safety-critical research environments,” Jablonka noted.
Future of AI in Science Requires Human Guidance
The research team also conducted extensive ablation studies to identify specific failure modes. They found that the AI systems performed better when identical information was presented as text rather than as images, indicating that their multimodal integration, a key requirement for scientific work, remains incomplete.
The study’s implications stretch beyond chemistry and materials science, highlighting similar challenges for AI adoption across other research disciplines. According to IIT Delhi PhD scholar Indrajeet Mandal, “Our work provides a roadmap for both the capabilities and limitations of current AI systems in science. While these models show promise as assistive tools for routine tasks, human oversight remains essential for complex reasoning and safety-critical decisions.”
The researchers concluded that the next step towards reliable AI scientific assistants lies in developing training approaches that prioritise understanding over pattern matching, along with stronger frameworks for human–AI collaboration.