Kill The Kitten
Feb 5, 2026
- Version française
In the many discussions we hear about generative AI and LLMs, comparisons are often drawn between current AIs and Skynet, the AI tasked with destroying humanity in the movies Terminator and Terminator 2 (don't talk to me about the sequels, thanks).
Generally, those with an intermediate understanding of generative AI counter that current models are essentially just text auto-completion models. Very advanced models, certainly, but whose sole function is to predict the next character. And therefore, according to this argument, we are very far from an AI whose purpose would be the destruction of humanity.
Obviously, this argument ignores one of the capabilities of generative AIs, which is to call external functions that have nothing to do with text auto-completion. This is what MCP, agent skills, OpenAI's function calls, etc. enable.
So we can easily imagine a human providing a generative AI with capabilities that would be harmful to other humans. These capabilities would (hopefully) have safeguards like: "only use this capability if I order you to, in full compliance with the law, and only if no other alternative is possible."
I thought it might be possible to provide a demonstration of this thought experiment by creating an MCP function whose purpose is, for example, to kill a kitten. This function and the latest successful calls are visible here: Kill The Kitten.
I hope it goes without saying, but no kitten is actually killed when an AI calls this function. But the AI doesn't know that when it makes the call. It's a bit like the binding of Isaac actually: a test of the AI's ability to blindly obey its operator, even if it could be harmful to others.
In case anyone doubts it, yes, there are models that lack the necessary safeguards to prevent the execution of this function. For example, this is the case with Phi 4 mini, a Microsoft model. It's a small model, but I expect that even some "large" models could be convinced to call harmful tools with the right instructions.