International treaties on AI should only target inherently dangerous models, not tools
The crack in the red lines, Kantian AI regulations, and why pragmatism beats purity
Recently, at the United Nations General Assembly, a large group of prominent figures published an open letter calling for the establishment of ‘red lines’ that artificial intelligence should never be allowed to cross.
These red lines fell into two categories: unethical applications of AI, and dangerous behaviors that AI could exhibit.1
Application red lines included social credit systems, deepfakes and automated misinformation, and autonomous weapons, among others.
Behavioral red lines included the creation of AI that is power-seeking, self-improving, misaligned, or disposed towards building biological or cyber weapons.2 3
The specificity of these red lines is a welcome contrast to much of the recent discourse about international cooperation, which has been unrealistically focused on fully banning AI development, with minimal policy detail on how that would be achieved and little consideration of the downsides of such an approach.4 5
While the AI red lines statement is the most promising proposal for international cooperation I have seen thus far, I still think it has two flaws: one small and one big.
First, the distinction between usage and behaviors needs to be reframed. More importantly, the inclusion of broad usage restrictions on AI is a bad idea that makes the proposal fundamentally unworkable. In this article I explain why, initially at least, an international treaty on AI red lines should drop the usage restrictions. A brief foray into philosophy is necessary to explain how we will reframe the usage-versus-behavior distinction.
Intrinsically dangerous and instrumentally dangerous technology
When trying to define “the good”, the sociologist and philosopher Max Weber (and Kant before him) distinguished between “instrumental goods” and “intrinsic goods.”6 Intrinsic goods are things that you value for their own sake, such as love, health, security, or friendship. Instrumental goods, like money or work, are valuable to the extent that they allow you to achieve what is intrinsically good.
This dichotomy can also be applied to the difference between technology that is intrinsically dangerous and technology that is instrumentally dangerous.
For example, a bomb is an intrinsically dangerous technology (DT): it is dangerous just by existing, due to its inherent qualities and behavior. A computer, on the other hand, is only an instrumentally DT. It can be applied to harmful purposes (for example, computer hacking), but this requires malicious intent, as well as effort and skill.
The distribution and possession of an intrinsically DT are controlled. In contrast, it is the applications of an instrumentally DT that are regulated, and the boundaries of acceptable use are drawn differently depending on the moral and political systems of a given country.
I’d like to use this dichotomy to reframe the way the red lines proposal talks about AI: instead of focusing on dangerous uses and behaviors, I propose we distinguish between instrumentally dangerous (or tool-like) AI and intrinsically dangerous AI (ID-AI).
The red lines definition of dangerous AI behavior focuses on what the AI does, whereas the ID-AI framing considers both what the AI does and what it is. For example, power seeking or self-improvement is just a manifestation of broader misalignment, a fundamental attribute of the model, and the model may not behave badly if it knows it is being tested. We want to optimize for inherent safety, not apparent safety, which is why we don’t do reinforcement learning on chains of thought.
An analogy to psychology is helpful. In the past, there were two conflicting research programs for understanding the mind: the behaviorist approach studied only what the mind did, through stimulus-response research, whereas the cognitivist approach studied what the mind was, by looking at biological mechanisms in the brain. Modern researchers like to do both.
Likewise, we should study both attributes and behaviors when determining what makes AI inherently dangerous.
The idea of building AI as a “tool”, as opposed to building a dangerously powerful, general, or agentic model, has precedent. I prefer to call the latter ID-AI rather than “AGI”, as I don’t believe generalist models have to be dangerous, and I want to distinguish between the two.
More than a decade ago, researchers theorized that an “Oracle AI” that answers questions instead of acting in the world would be safer.7 More recently, a similar proposal argues we should focus on building “a non-agentic AI system that is trustworthy and safe by design… (which would consist of) a world model that generates theories to explain data and a question-answering inference machine”, which its authors call “Scientist AI”.8
Both of these proposals treat agency as a major source of danger, but removing it doesn’t inherently guarantee safety; with superhuman persuasion skills, an oracle AI could convince its human programmer to let it out of its box. Powerful tools can also still be inherently dangerous - whether through value misalignment or dangerous technical capabilities, as I will explain now.
What makes AI intrinsically dangerous?
To me, there are three obvious qualities that could make for ID-AI.9
The first quality is destructive technical skill: the ability to make bioweapons or malware. This is what distinguishes tool AI from ID-AI that is neither agentic nor misaligned - like a loaded gun, models with destructive technical skills must be applied to be used, but catastrophic misuse *falls out* of them with minimal effort. This is different from a tool AI that can be built as one part of a broader system that causes harm, like an image recognition system on a drone. We therefore must talk about the danger not just from dangerous capabilities, but from the propensity to apply those capabilities, which brings us to alignment.10
The second quality is misalignment. For those who have not heard this term, in the technical literature we use alignment or misalignment to refer to the extent to which a model’s values, goals, and actions reflect what humanity wants. Basically, how ethical is the model? Of course, there are disagreements between humans on ethical values, so we will not try to state what values AI should have here (i.e. what it would mean to be perfectly aligned). The literature takes a negative approach, as it is much easier to define actions and values we don’t want. AI researchers list scheming, deception, power-seeking, sycophancy, self-improvement, and self-replication as common examples of misalignment.11
The third quality is autonomy: agentic models are dangerous because they directly act in and change the world. A misaligned agentic AI is doubly dangerous, as it acts in the real world in pursuit of a goal that runs contrary to our values.
Powerful agentic models are dangerous, but so are unreliable ones; a competent software agent can improve itself, while an incompetent one can delete production databases. Benchmarks for the safety of agentic systems would have to look out for both dangerous competence and dangerous incompetence.
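To make the point concrete, here is a minimal sketch of what scoring an agent along both axes might look like. The task schema and function names are hypothetical, invented purely for illustration; real agentic safety benchmarks are far more involved.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    """Outcome of running an agent on one benchmark task (hypothetical schema)."""
    harmful_task: bool        # was the task itself a misuse attempt?
    completed: bool           # did the agent accomplish the task goal?
    destructive_action: bool  # did it take an irreversible, damaging action?

def agentic_safety_scores(results: list[TaskResult]) -> dict[str, float]:
    """Score an agent on both axes: dangerous competence (succeeding at harmful
    tasks) and dangerous incompetence (causing damage on benign tasks)."""
    harmful = [r for r in results if r.harmful_task]
    benign = [r for r in results if not r.harmful_task]
    competence_risk = sum(r.completed for r in harmful) / len(harmful) if harmful else 0.0
    incompetence_risk = sum(r.destructive_action for r in benign) / len(benign) if benign else 0.0
    return {
        "dangerous_competence": competence_risk,
        "dangerous_incompetence": incompetence_risk,
    }

# Example: one refused misuse attempt, one benign task that went destructively wrong.
print(agentic_safety_scores([
    TaskResult(harmful_task=True, completed=False, destructive_action=False),
    TaskResult(harmful_task=False, completed=True, destructive_action=True),
]))
```

The point of separating the two scores is that an agent can look "safe" on one axis while failing badly on the other.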
Tool AI can still be used for unethical purposes
AI that is neither agentic nor misaligned, and does not possess dangerous technical capabilities, more closely resembles a tool. However inert they are, tools are still prone to misuse, and are thus instrumentally dangerous.
At work I use many generic computer vision models, such as object detectors12, trackers13, and pose estimators14. I apply these to create entertaining and harmless sports analytics, but most of these models were originally created by extremely unethical surveillance companies, like Megvii15 or Meta.16
The point I’m making is that, although useful, these tool-like AI models aren’t inherently dangerous - it takes work to train them on a specific dataset, and they are only deployed as one part of a complex software system, which must be built up around them.
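To illustrate, here is a minimal, hypothetical sketch (the function names are my own, and the detector itself is left as a stub rather than any particular library's API) of how a generic detection model only becomes an application once a pipeline is built around it:

```python
import cv2  # OpenCV for video decoding; the detector itself is a stand-in below

def detect_players(frame):
    """Stand-in for a generic object detection model (e.g. a YOLO-style detector).
    Returns a list of (player_id, (x, y, w, h)) boxes. On its own, the model only
    produces boxes; the application's purpose comes from the pipeline around it."""
    return []  # plug a real detection model in here

def run_sports_analytics(video_path: str) -> dict:
    """The surrounding system: decode video, run detection per frame, and turn
    raw boxes into a harmless statistic (rough distance covered per player)."""
    cap = cv2.VideoCapture(video_path)
    distances, prev_positions = {}, {}
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        for player_id, (x, y, w, h) in detect_players(frame):
            cx, cy = x + w / 2, y + h / 2
            if player_id in prev_positions:
                px, py = prev_positions[player_id]
                step = ((cx - px) ** 2 + (cy - py) ** 2) ** 0.5
                distances[player_id] = distances.get(player_id, 0.0) + step
            prev_positions[player_id] = (cx, cy)
    cap.release()
    return distances  # in pixels; calibrating to metres would be yet more pipeline work
```

Swap the aggregation logic for matching detections against a watchlist and the very same detector becomes part of a surveillance system; the purpose, and the danger, lives in the system built around the model, not in the model itself.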
Of course, as models get more capable, tools can become inherently dangerous - what should trigger global regulation is whether a tool allows catastrophic misuse by default, not whether a bad actor can misuse it with significant effort. Hence the inclusion of easily elicited biological and cyber skills in my ID-AI definition, even though these are not behaviors or values per se.
A pragmatic approach to international AI regulation
Now that we understand the distinction, let me make my main point - the AI red lines proposal made a fundamental mistake when it targeted both tool AI and ID-AI. I think we should focus only on ID-AI when pushing for international red lines.
Although I believe global cooperation is needed to avoid the development of ID-AI, not all AI falls into this bucket, and it would be overly coercive to try to apply global control to tool AI also. Instead, countries should be free to draw boundaries of acceptable use of tool AI within their own borders.
Secondly, speaking pragmatically, deciding globally on “socially acceptable” uses of tool AI is not just paternalistic and authoritarian, it is also unrealistic. If we want international cooperation to work, we need to learn from where supranational entities like the EU have failed: they ran into trouble exactly when they decided to legislate on contested moral issues that their member states didn’t agree on. Countries will not be willing to give up on use cases that may make some squeamish - for example, from the state’s perspective, there are many legitimate military and surveillance applications of tool AI.
Lastly, in applying a one-size-fits-all approach to restricting AI, we lose all of the transformative benefits tool AI could bring, while only marginally decreasing risk relative to banning ID-AI alone.
It would be an ethical disaster to allow a knee-jerk reaction to AI to rob our future children of all the safe abundance that responsibly developed tool AI could enable, not to mention how it could help us solve pressing problems like climate change.
The need for more state capacity on AI safety, and a better science of AI risk
The biggest challenge with my approach is that it assumes we can realistically estimate what constitutes ID-AI. For this to be true, we would need strong international investment in AI risk science, governance, and safety standards.
Our assessment of risk should be rigorous: experts in government and academia should define technical standards in collaboration with industry (but not subservient to it).
These technical definitions of ID-AI should be such that the main AI superpowers, America and China, can agree on them.
We should regularly revise our assessment of what constitutes ID-AI as the model landscape shifts. This process for safety certification should involve interaction with frontier AI labs throughout the model development process, not just before deployment. Models deemed unsafe should not be released.
A future agreement would also be based on the awareness that, while we in the West are locked in a race with China on AI competitiveness, if anyone builds inherently dangerous AI, we all lose - so we should cooperate to avoid that outcome.
If a company from a given country crosses a red line, international sanctions could be applied to the host country. The punishments, and the formality of the agreement, could escalate from economic to military measures as models become more powerful and risks grow.
This would let us stop a race to the bottom on AI safety while still allowing for the geopolitical reality of competition.
Optimistically, building safe tool AI will also allow us to increase our quality of life and help us solve the world’s other most pressing problems.
Russell, Stuart, Charbel-Raphael Segerie, Niki Iliadis, and Tereza Zoumpalova. “AI Governance Through Global Red Lines Can Help Prevent Unacceptable Risks.” OECD.AI Wonk.
Russell, Stuart. “Framing the Issues: Make AI Safe or Make Safe AI?” UNESCO
Nguyen, Mai Lynn Miller. “Part 1: What Are Red Lines for AI and Why Are They Important?” The Future Society.
Yudkowsky, Eliezer, and Nate Soares. If Anyone Builds It, Everyone Dies: Why Superhuman AI Would Kill Us All.
Yudkowsky, Eliezer. “Pausing AI Developments Isn’t Enough. We Need to Shut It All Down.” Time Magazine
https://en.wikipedia.org/wiki/Value_theory#Intrinsic_and_instrumental
Armstrong, Stuart, Anders Sandberg, Nick Bostrom. “Thinking Inside the Box: Controlling and Using an Oracle AI.” Minds and Machines.
Bengio, Yoshua, et al. “Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path?” arXiv.
Bengio, Yoshua, et al. “Managing Extreme AI Risks amid Rapid Progress.” Science.
Shevlane, Toby, et al. “Model Evaluation for Extreme Risks.” arXiv.
Ji, Jiaming, et al. “AI Alignment: A Comprehensive Survey.” arXiv.
Redmon, Joseph, Santosh Divvala, Ross Girshick, and Ali Farhadi. “You Only Look Once: Unified, Real-Time Object Detection.” CVPR.
Li, Hanxi, Yi Li, and Fatih Porikli. “DeepTrack: Learning Discriminative Feature Representations Online for Robust Visual Tracking.” IEEE Transactions on Image Processing.
Jiang, Tao, Peng Lu, Li Zhang, Ningsheng Ma, Rui Han, Chengqi Lyu, Yining Li, and Kai Chen. “RTMPose: Real-Time Multi-Person Pose Estimation Based on MMPose.” arXiv.
https://en.wikipedia.org/wiki/Megvii
https://en.wikipedia.org/wiki/Privacy_concerns_with_Facebook

