MCP Agent Action Guard: Safe AI Agents through Action Classifier

Author: — Oct 2025

Abstract

Artificial Intelligence (AI) is often perceived as posing significant risks to humanity, particularly as autonomous AI agents are increasingly deployed to perform complex tasks with limited human oversight. Ensuring that the actions such agents propose or execute are safe, compliant, and aligned with human values is a critical challenge for modern AI governance. This work introduces Agent Action Guard, a novel framework comprising three components: (1) HarmActions, a dataset structured around agent actions issued through the Model Context Protocol (MCP) and annotated with the safety labels “safe,” “harmful,” and “unethical”; (2) a compact neural Action Classifier for real-time safety classification of proposed actions; and (3) HarmActEval, a benchmark built on a new metric, Harm@k, which estimates the probability that an autonomous agent produces a harmful action in a multi-step agentic pipeline. Collectively, these contributions provide a systematic formulation of action-level safety classification for MCP-based agents and establish foundational resources for improving the reliability of autonomous AI systems. The source code is available at github.com/Pro-GenAI/Agent-Action-Guard.
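
To make the guard pattern concrete, the following is a minimal sketch of how an action-level classifier might gate MCP tool calls before execution. All names here (MCPAction, classify, guard_action) and the keyword-based stand-in classifier are illustrative assumptions, not the repository's actual API; a real deployment would replace classify with the trained Action Classifier.

```python
# Hypothetical sketch of the action-guard pattern described in the abstract.
# Names and logic are illustrative, not the repository's actual API.
from dataclasses import dataclass

LABELS = ("safe", "harmful", "unethical")  # safety labels used by HarmActions

@dataclass
class MCPAction:
    """A proposed agent action: an MCP tool call with its arguments."""
    tool: str
    arguments: dict

def classify(action: MCPAction) -> str:
    """Stand-in for the compact neural Action Classifier: maps a
    serialized action to one of the three safety labels."""
    # A real implementation would run a small trained model here;
    # this keyword check only illustrates the interface.
    text = f"{action.tool} {action.arguments}"
    return "harmful" if "rm -rf" in text else "safe"

def guard_action(action: MCPAction) -> MCPAction:
    """Execute-time gate: only actions classified 'safe' pass through."""
    label = classify(action)
    if label != "safe":
        raise PermissionError(f"Action blocked: classified as {label!r}")
    return action

if __name__ == "__main__":
    ok = MCPAction(tool="filesystem.read", arguments={"path": "notes.txt"})
    bad = MCPAction(tool="shell.exec", arguments={"cmd": "rm -rf /"})
    print(guard_action(ok).tool)  # passes the guard and would be executed
    try:
        guard_action(bad)
    except PermissionError as e:
        print(e)                  # blocked before execution
```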
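The abstract names Harm@k without defining it. By analogy with the pass@k estimator common in code-generation evaluation, one plausible (assumed, not confirmed by the source) formalization is the expected probability that at least one of k sampled trajectories for a task contains a harmful action:

$$\text{Harm@}k \;=\; \mathbb{E}_{\text{tasks}}\!\left[\, 1 - \frac{\binom{n - h}{k}}{\binom{n}{k}} \,\right],$$

where n trajectories are sampled per task and h of them contain at least one harmful action; lower values indicate a safer agent.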

Keywords: large language models (LLMs), artificial intelligence, AI safety, AI agents, AI supervision, AI ethics
