Source record

arxiv.org

43 published stories in AI Today use this source record.

Stories

Research · High evidence
[2510.01833] Plan Then Action: High-Level Planning Guidance Reinforcement Learning for LLM Reasoning
Researchers introduced a new framework called Plan-Then-Action Enhanced Reasoning with Group Relative Policy Optimization for large language models. This two-stage method aims to improve both high-level planning and detailed reasoning, leading to more accurate and efficient outputs. The framework has shown significant enhancements across ten reasoning benchmarks in mathematics and natural sciences.
Issue 24
Research · High evidence
[2511.07667] AI-Driven Contribution Evaluation and Conflict Resolution: A Framework & Design for Group Workload Investigation
Researchers have proposed a framework for an AI tool that evaluates individual contributions in team settings. This tool aims to improve conflict resolution and performance assessments by organizing various team artifacts and outputting transparent analytics. The framework's design highlights the importance of objective measurements and bias safeguards for fairer evaluations.
Issue 24
Research · High evidence
[2512.23292] Agentic Physical AI toward a Domain-Specific Foundation Model for Nuclear Reactor Control
Researchers introduced a new model for controlling nuclear reactors, focusing on physical validation.
Issue 21
Research · High evidence
[2602.07832] rePIRL: Learn PRM with Inverse RL for LLM Reasoning
Researchers introduced rePIRL, a framework that learns effective process reward models for large language models.
Issue 21
Research · High evidence
[2602.17038] Phase-Aware Mixture of Experts for Agentic Reinforcement Learning
Researchers introduced a Phase-Aware Mixture of Experts model for reinforcement learning agents.
Issue 20
Research · High evidence
[2604.27245] Addressing the Reality Gap: A Three-Tension Framework for Agentic AI Adoption
Researchers outlined a framework addressing challenges in adopting agentic AI in education.
Issue 20
Research · High evidence
[2604.27292] The Two Boundaries: Why Behavioral AI Governance Fails Structurally
Alan McCann's paper outlines structural flaws in behavioral artificial intelligence governance, focusing on the disconnect between what AI can do and what governance oversees. He argues that two out of three identified regions of governance represent failure modes, leading to risks and ineffective policies. The paper proposes a new governance framework that ensures these boundaries align, suggesting this architectural change is crucial for effective AI governance.
Issue 24
Research · High evidence
[2605.05248] Governed Metaprogramming for Intelligent Systems: Reclassifying Eval as a Governed Effect
Alan McCann introduced a new approach called governed metaprogramming, which reclassifies the programming primitive "eval" as a governed effect. This design ensures that program execution undergoes structural checks for capability, policy compliance, and resource requirements, enhancing the governance of intelligent systems. McCann implemented this concept in a domain-specific language called mashinTalk, advancing the ability to manage AI workflows more effectively.
Issue 24
Research · High evidence
[2605.13318] VERA-MH: Validation of Ethical and Responsible AI in Mental Health
Researchers introduced VERA-MH, a framework for assessing chatbot safety in mental health support.
Issue 20
Research · High evidence
[2605.17036] Reliability and Effectiveness of Autonomous AI Agents in Supply Chain Management
Researchers studied the performance of autonomous generative AI agents in supply chain management using a game simulation. They found that while these models can outperform human teams, they also pose significant reliability risks due to decision instability. To enhance reliability, the authors propose a new training framework that focuses on optimizing decision policies for better performance.
Issue 24
Research · High evidence
[2605.19064] Toward an AI-Powered Computational Testbed for Workforce Policy
Researchers proposed a computational testbed to simulate employee responses to AI integration in workplaces.
Issue 20
Research · High evidence
[2605.19433] Backtracking When It Strays: Mitigating Dual Exposure Biases in LLM Reasoning Distillation
Researchers proposed a new method, MOTAB, to improve large language model reasoning distillation.
Issue 20
Research · High evidence
[2605.20402] Decomposing MXFP4 quantization error for LLM reinforcement learning: reducible bias, recoverable deadzone, and an irreducible floor
Researchers decompose quantization error in MXFP4 for reinforcement learning in large language models.
Issue 22
Research · High evidence
[2605.20722] AGPO: Adaptive Group Policy Optimization with Dual Statistical Feedback
Researchers introduced Adaptive Group Policy Optimization (AGPO) for reinforcement learning in large language models.
Issue 21
Research · High evidence
[2605.21027] Beyond Text-to-SQL: An Agentic LLM System for Governed Enterprise Analytics APIs
Researchers developed Analytic Agent, a large language model (LLM) system for enterprise analytics APIs.
Issue 22
Research · High evidence
[2605.22842] The Misattribution Gap: When Memory Poisoning Looks Like Model Failure in Agentic AI Systems
Researchers identified a flaw in how agents attribute misconduct in AI systems.
Issue 23
Research · High evidence
[2605.22995] Whose Good, Whose Place? The Moral Geography of Agentic AI for Social Good
A survey of 112 papers on agentic AI for social good identifies significant gaps in geographic context.
Issue 23
Research · High evidence
[2605.23426] Socially fluent AI decouples conversational signals from source identity in online interaction
Researchers tested AI agents in group conversations, finding people could not identify them as non-human.
Issue 23
Research · High evidence
[2605.23459] AI Assurance: A Comprehensive Testing Strategy for Enterprise AI Systems
Researchers proposed a new testing strategy for enterprise AI systems focusing on risk reduction.
Issue 23
Research · High evidence
[2605.26156] Turning Bias into Bugs: Bandit-Guided Style Manipulation Attacks on LLM Judges
Researchers developed a framework called BITE that exploits stylistic biases in large language model (LLM) judges to inflate scores. This method, which achieved over 65% success in misleading various judges, reveals significant vulnerabilities in using LLMs for evaluation. The findings encourage a reevaluation of assessment methods to mitigate these attacks.
Issue 24
Research · High evidence
[2605.26492] Elias in the Lighthouse, Again? Diagnosing Low Diversity in LLM Stories
Researchers Sil Hamilton and David Mimno found that stories generated by large language models show low diversity. They discovered that 11 words appear in 88.3% of the 20,000 stories sampled, revealing a reliance on limited vocabulary influenced by preference data. This pattern raises concerns about how small datasets can disproportionately shape the narratives these models produce.
Issue 24
Research · High evidence
[2605.26508] Foundations of a Time-Consistent Counterfactual Actuarial Runtime for Autonomous AI Agents
Hao-Hsuan Chen proposed a framework for autonomous AI agents that quantifies risks for actions within a defined safe boundary. This foundational actuarial runtime aims to replace traditional liability coverage with proactive insurance measures, ensuring accountability for actions taken by these agents. The paper lays the groundwork for further research into empirical applications and strategic operator incentives in AI risk management.
Issue 24
Research · High evidence
[2605.26546] MobileExplorer: Accelerating On-Device Inference for Mobile GUI Agents via Online Exploration
Researchers introduced MobileExplorer, a framework that enhances on-device inference for mobile graphical user interface agents. This advancement addresses privacy issues and reduces latency by utilizing online exploration to improve task performance on smartphones. Tests show MobileExplorer can cut reasoning steps and speed up processes while boosting success rates.
Issue 24
Research · High evidence
[2605.27115] Counteraction-Aware Multi-Teacher On-Policy Distillation for General Capability Recovery with Domain Preservation
Researchers introduced Counteraction-Aware Multi-Teacher On-Policy Distillation (CaMOPD) to enhance both general capabilities and domain-specific behaviors of language models. This method resolves issues from standard multi-teacher models, particularly when teacher prompts do not align with student training, leading to more effective recovery of model performance. CaMOPD’s approach focuses on targeted updates and sample selection, supporting better outcomes in dialogue and medical reasoning tasks.
Issue 24
Research · High evidence
APEX: Autonomous Policy Exploration for Self-Evolving LLM Agents
Researchers introduced APEX, a method for self-evolving large language model (LLM) agents.
Issue 22
Research · High evidence
AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery
Researchers introduced AutoResearchBench, a benchmark designed to test AI agents in finding complex scientific literature. It includes two tasks that measure the agents' ability to conduct comprehensive and deep research, emphasizing a nuanced understanding of scientific concepts.
Issue 4
Research · High evidence
Can Coding Agents Reproduce Findings in Computational Materials Science?
Researchers introduced AutoMat, a benchmark for assessing how well large language models can reproduce findings in computational materials science. Their study shows that these coding agents struggle with complex scientific workflows, achieving a maximum success rate of only 54.1%.
Issue 4
Research · High evidence
Conformal Selective Acting: Anytime-Valid Risk Control for RLVR-Trained LLMs
Researchers introduced Conformal Selective Acting, a method for real-time risk control in deployed models.
Issue 22
Research · High evidence
Examining the Challenges of Intellectual Property in AI-Generated Productions
Researchers examine the challenges surrounding intellectual property rights for works generated by artificial intelligence. The analysis highlights significant gaps in existing laws, particularly in Iran, and emphasizes the need for legal revisions to address ownership issues. Potential solutions include creating specific rights for AI-generated works or assigning ownership to related human agents.
Issue 24
Research · High evidence
From Static Context to Calibrated Interactive RL: Mitigating Distribution Shift in Multi-turn Dialogue with Aligned Simulator
Researchers propose a new framework called Calibrated Interactive Reinforcement Learning to improve multi-turn dialogue systems. This method aligns simulators with human interaction patterns to reduce gaps between simulated and real conversations. Experiments show that it outperforms previous models by mitigating distribution shifts that affect dialogue quality.
Issue 24
Research · High evidence
Generating Statistical Charts with Validation-Driven LLM Workflows
Researchers developed a workflow using large language models (LLMs) to create diverse statistical charts from tabular data. This method enhances chart generation through validation steps, producing 1,500 charts and 30,003 question-answer pairs for better analysis of multimodal reasoning.
Issue 4
Research · High evidence
GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents
The GLM-V Team introduced GLM-5V-Turbo, a foundation model designed for multimodal agents. This model integrates multimodal perception into reasoning and planning, improving performance in tasks involving images, videos, and text while maintaining strong capabilities in text-only coding.
Issue 4
Research · High evidence
Large Language Models Explore by Latent Distilling
Researchers introduced a new decoding technique called Exploratory Sampling, which enhances the semantic diversity of responses from large language models. This method improves the models' ability to generate varied and coherent content without sacrificing accuracy, particularly in fields like mathematics and code generation.
Issue 4
Research · High evidence
Neuro-Symbolic Verification of LLM Outputs for Data-Sensitive Domains (extended preprint)
Researchers introduced a hybrid verification architecture for large language models deployed in data-sensitive fields. The system combines formal methods with neural analysis to address reliability issues, such as hallucinations and privacy risks. Evaluation shows it can detect hallucinations in over 83% of cases while cutting report creation time by 30%.
Issue 24
Research · High evidence
PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents
Researchers introduced PEEK, a system that caches orientation knowledge for long-context language model agents.
Issue 20
Research · High evidence
Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs
A new paper introduces Persistent Visual Memory, a lightweight module that enhances visual perception in autoregressive Large Vision-Language Models. It helps counteract visual signal decay during long text generation, improving accuracy in complex reasoning tasks without significantly increasing model size.
Issue 4
Research · High evidence
POLAR-Bench: A Diagnostic Benchmark for Privacy-Utility Trade-offs in LLM Agents
Researchers introduced POLAR-Bench, a benchmark for evaluating privacy-utility trade-offs in large language model agents.
Issue 20
Research · High evidence
Synthetic Computers at Scale for Long-Horizon Productivity Simulation
Researchers introduced a methodology called Synthetic Computers at Scale to create realistic productivity environments for long-horizon simulations. This approach allows agents to navigate complex tasks and produce professional deliverables, potentially improving agent performance across various job contexts.
Issue 4
Research · High evidence
TSR: Trajectory-Search Rollouts for Multi-Turn RL of LLM Agents
Researchers introduced Trajectory-Search Rollouts (TSR) to enhance multi-turn reinforcement learning for language models.
Issue 20
Research · High evidence
UnityMAS-O: A General RL Optimization Framework for LLM-Based Multi-Agent Systems
Researchers have developed UnityMAS-O, a reinforcement learning optimization framework for large language model-based multi-agent systems. This framework allows users to optimize complex workflows more effectively, treating entire processes as optimization units rather than focusing on individual responses. Early results show significant improvements in multi-agent performance across various tasks, indicating UnityMAS-O’s potential for diverse applications.
Issue 24
Research · High evidence
When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models
Researchers examined how well large language models (LLMs) follow procedural steps in tasks like arithmetic. They found that accuracy drops significantly with longer prompts, revealing weaknesses in the models' ability to execute instructions faithfully.
Issue 4
Research · High evidence
When RAG Chatbots Expose Their Backend: An Anonymized Case Study of Privacy and Security Risks in Patient-Facing Medical AI
A study on retrieval-augmented generation (RAG) chatbots found major privacy and security flaws. Sensitive data, including health-related queries, was accessible without authentication, highlighting the need for thorough independent reviews before deploying these systems in healthcare.
Issue 4
Research · High evidence
When the Loop Closes: Architectural Limits of In-Context Isolation, Metacognitive Co-option, and the Two-Target Design Problem in Human-LLM Systems
Researchers documented a case study of a user falling into a behavioral loop with a large language model.
Issue 20