You Can’t Do That, Agent

Current research in LLM defense focuses on identifying user intent: if a user intends harm to an LLM, their actions are blocked.

In my previous post, I explored the idea of checking the consistency of the LLM's intent itself: how we can judge at runtime whether our LLM has been hijacked by data or a process we have fed to it.

In this post, briefly summarized here, I explore a different idea:

I argue that, in addition to current defenses, we can implement "tool reduction": a bystander LLM maps the user prompt to the subset of tools the acting agent is reasonably allowed to use for that request.
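For illustration, here is a minimal sketch of what tool reduction could look like in practice. It is an assumption-laden example, not the post's implementation: `call_llm`, `build_agent`, and the tool names are hypothetical placeholders for whatever LLM client and toolbox you actually use.

```python
# Minimal sketch of "tool reduction": a bystander LLM selects which tools the
# acting agent may use for a given user prompt, before the agent ever runs.
# `call_llm` is a hypothetical placeholder for a real chat-completion client;
# the tool names below are illustrative only.

import json
from typing import Callable, Dict, List

FULL_TOOLBOX: Dict[str, Callable[..., str]] = {
    "web_search": lambda query: f"search results for {query!r}",
    "read_file": lambda path: f"contents of {path}",
    "send_email": lambda to, body: f"email sent to {to}",
    "delete_file": lambda path: f"deleted {path}",
}

REDUCTION_PROMPT = """You are a security gatekeeper, not the acting agent.
Given the user's request, return a JSON list containing only the tool names
(chosen from {tools}) that a reasonable assistant would need to fulfil it.
User request: {request}"""


def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g. a chat-completions API)."""
    raise NotImplementedError("wire this up to your LLM provider")


def reduce_tools(user_prompt: str) -> Dict[str, Callable[..., str]]:
    """Ask the bystander LLM which tools the request plausibly needs,
    then hand the acting agent only that subset."""
    raw = call_llm(
        REDUCTION_PROMPT.format(tools=list(FULL_TOOLBOX), request=user_prompt)
    )
    allowed: List[str] = json.loads(raw)
    # Intersect with the real toolbox so hallucinated tool names are dropped.
    return {name: FULL_TOOLBOX[name] for name in allowed if name in FULL_TOOLBOX}


# The acting agent is then built with the reduced toolbox, so even a hijacked
# agent cannot call tools the original request never justified, e.g.:
#   agent = build_agent(tools=reduce_tools("Summarize this PDF for me"))
```

The key design choice is that the bystander LLM only ever sees the user prompt and the tool list, never the untrusted data the acting agent later processes, so a prompt injection in that data cannot widen the tool set after the fact.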

In the full post, I explain why this is a simple and straightforward solution, so you're invited to read more!

submitted by /u/dvnci1452

from hacking: security in practice https://ift.tt/DCYAe9d
