On AI Risk

Posted on December 2, 2023

The mechanism underlying human cognition, and all biological intelligence, is physical. If we accept this, ruling out supernatural and otherwise exotic explanations of human intelligence 1, then no fundamental physical principle prevents the creation of machines with all of the capabilities of human intelligence - memory, problem solving, logical reasoning, creativity, and so on. If, as seems likely, the physical process can be mapped onto a computational one, then these capabilities could be possessed by digital machines, that is, by computers.

If achievable at all, digital intelligence may have some advantages over biological intelligence. Because the limits engineering imposes on the size or energy consumption of a computer are far looser than those evolution imposed on biological organisms 2, the total amount of computation leveraged by a digital mind could be extremely large compared to a human. The serial speed of digital minds could be faster, as signal propagation in the brain is much slower than the near-light speed at which electrical signals can propagate. And digital intelligence might be functionally immortal, as digital representations can be coded redundantly to allow recovery from hardware failure in a fashion unavailable to analog computing. The possibility of digital replication also means that large numbers of instances of digital intelligences could be created very quickly, without the need for them to mature like biological life. Geoff Hinton has an excellent talk at Cambridge going over this argument in more detail. This suggests that digital intelligences could eventually exceed the capacity of humans by significant margins.
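
To put rough numbers on the serial speed point, here is an order-of-magnitude comparison; the figures are loose assumptions (axonal conduction at roughly 1-100 m/s, electrical or optical signalling at an appreciable fraction of the speed of light), not precise measurements.

```latex
% Order-of-magnitude comparison of signalling speeds (assumed figures, not measurements):
%   axonal conduction velocity in the brain:  v_brain   ~ 1--100 m/s
%   signal propagation in digital hardware:   v_digital ~ 2 x 10^8 m/s
\[
  \frac{v_{\text{digital}}}{v_{\text{brain}}}
  \;\approx\; \frac{2 \times 10^{8}\ \mathrm{m/s}}{10^{2}\ \mathrm{m/s}}
  \;\approx\; 10^{6}
\]
```

On these assumptions the raw signalling-speed gap is around six orders of magnitude, before even considering clock rates or other architectural differences.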

It is possible that digital intelligence might not rival biological intelligence soon because brains possess some enormous efficiency advantage which has not been fully understood. The brain is plausibly extremely efficient by some metrics relative to current AI. Brains may also plausibly possess some architectural advantages, such as much better interconnect between different brain regions. However, I think all of this is poorly understood (it’s certainly poorly understood by me) and fairly speculative. I certainly would not want to rely too much on the claim that the brain is inherently so much more efficient than digital intelligence. In addition, digital intelligence does not necessarily have to be superior on every single metric to open a large capability gap over humans in practice. For instance, even if the brain is more efficient by some metric like ‘intelligence per watt’, digital intelligence could still be much more powerful simply by being much larger and consuming much more energy, provided the efficiency gap is not ridiculously large. Whether this would confer an overwhelming advantage in a conflict would depend on a lot of variables which are difficult to predict.
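
As a very rough illustration of the ‘larger and more energy-hungry’ point, consider the following back-of-the-envelope arithmetic; the numbers (a ~20 W brain, a data centre drawing tens of megawatts, and an assumed 1000x per-watt efficiency deficit for the digital system) are illustrative assumptions rather than measurements.

```latex
% Back-of-the-envelope arithmetic with assumed figures:
%   brain power budget:              P_brain ~ 2 x 10^1 W
%   large data centre power budget:  P_dc    ~ 2 x 10^7 W
%   assumed per-watt efficiency deficit of the digital system: 10^3
\[
  \frac{P_{\text{dc}}}{10^{3} \times P_{\text{brain}}}
  \;\approx\; \frac{2 \times 10^{7}}{10^{3} \times 2 \times 10^{1}}
  \;=\; 10^{3}
\]
% Even under a 1000x efficiency disadvantage, the much larger energy budget
% could still leave roughly a 1000x advantage in total capacity.
```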

There is a natural comparison between the risks of AI and the risks of synthetic biology. Synthetic organisms, once released into the wild, have the obvious potential to get irreversibly out of control due to their capacity for autonomous action, independent reproduction and eventually evolution. Similarly, autonomous intelligent machines, by definition, have the potential to act outside human control. Just as with synthetic biology, this is not guaranteed to cause disaster, but it is an obvious possibility that should be taken seriously. Especially as these machines become more capable, flexible and general, the possibility of them escaping control becomes more serious. This potential for autonomy is a distinct risk of AI and biology that is not present with other technological developments, and is a reason why it is reasonable to treat these two with extra caution. It is important to note that the possibility of autonomous action does not necessarily mean a machine spontaneously developing outright malicious or power-seeking goals (though this possibility, while perhaps less likely, does present a very severe risk). As with viruses, bacteria or computer worms, even relatively simple systems acting autonomously can lead to harmful and unpredictable outcomes, especially if they are capable of autonomous reproduction.

There is also the risk of proliferation of capabilities, not all of which are desirable. We must be careful here, though, because almost any technological advance increases the power of a single individual, including, in general, their ability to pursue undesirable goals. So this argument is not necessarily an argument against AI specifically, but against technological development in general, as these risks are present with any advanced technology. It is plausible, of course, that AI will be a particularly powerful and impactful technology and so deserve proportionate caution. However, I do think one needs to be careful here to avoid arguments which are ‘too general’, and would justify restricting almost any technological development. There is an idea that increased technological power will lead to catastrophe, as increasing the power available to a single individual makes society or the ecosystem less stable, by giving an ever larger number of actors sufficient power to destabilise the system as a whole. Advanced civilisation might be inherently fragile if there exist low-cost technologies which could cause widespread destruction, and advanced AI might get us to this point extremely quickly. (A thoughtful reflection on this idea is Michael Nielsen’s notes on existential risk here.) I think this is an interesting point, but I’m not sure I find it particularly convincing in the case of AI specifically. We have quite plausibly already reached this point - nuclear weapons are already a technology with a ‘blast radius’ which might well include the functional end of industrial civilisation. Returning to some pastoral equilibrium to avoid the general risk of technological advancement, however, is likely to be both impossible without the intervention of some non-fatal catastrophe and undesirable. And without reasons to expect AI in particular to be unusually prone to raising the level of background risk, I’m not really sure that this line of argument is particularly actionable.

As an aside, while they are often used as a metaphor for AI, nuclear weapons provide an interesting historical example of how difficult consequences are to predict: many intellectuals (notably Russell) were convinced that the only possible outcomes of a nuclear world were either a utopian world government or certain annihilation. While their arguments would have been quite convincing at the time, so far, fortunately, they have been mistaken, though I suppose a cynic may say that there is still time for their pessimism to prove correct. However, I do think that whether AI in general, and particular AI applications especially, change the offence/defence balance in a desirable way is worth thinking about carefully, and appropriate differential regulation or funding of development could be good.

Nevertheless, I think that the possibility of developing autonomous digital agents remains a significant risk that needs particular policy attention. I am a bit skeptical of open-sourcing powerful models for the indefinite future because of this; a few bad actors might create such agents for malicious or military purposes, which may have far-reaching consequences.

I have so far not mentioned decision-theoretic arguments based on ideas like ‘instrumental convergence’, which try to analyse the actions of a hypothetical superintelligence, arguing that a wide variety of possible superintelligent agents would be incentivised to pursue goals like self-preservation and power-seeking. This line of argument is closely associated with thinking on AI risk, as it was the dominant paradigm used by thinkers like Bostrom and Yudkowsky, who have (arguably) been thinking about AI risk the longest and have written on it extensively. I’m less convinced of the strength of these arguments personally 3, and I think that it’s unwise to focus the argument about AI risks too heavily on them. People often dismiss these arguments (somewhat plausibly in my view) and then go on to dismiss the case for AI risk in general, as the two are so heavily associated with one another. I think this is a mistake; as I have argued above, there are reasonable arguments for thinking AI might present unique dangers that do not require you to buy much of the framework used in (for example) Bostrom’s Superintelligence. It’s also possible that, even if such goals are not as convergent as (for example) Yudkowsky has argued, risks like this can’t be entirely dismissed. For instance, Hendrycks has argued that in a world with many AIs, they may come to face something like evolutionary pressures. Even if Yudkowsky’s arguments are weak and most powerful AI agents do not pursue dangerous goals, we may still be in a dangerous situation if even a small minority do. In this case, AIs working in human interests - which may be bottlenecked by things like waiting for human input on their actions - could face a severe disadvantage in a conflict against unrestricted AI actors.

I think the case for these risks being possible is fairly robust. This does not necessarily mean that any of them are likely, or that they will all necessarily happen soon, or say anything about the relative difficulty of avoiding them. For this we need to get into specifics.

I think that it’s important to distinguish the following questions:

  1. Is strong AI possible? (I would argue obviously yes)

  2. If it is possible, are we likely to develop it soon?

  3. If we develop it, what are the risks?

  4. Are we able to do useful technical work on these?

  5. What are the sensible policy choices?

In my opinion, these questions are frequently conflated in people’s arguments in a way that makes very little sense. For example, the idea that the undesirability of regulation means that AI risks should be dismissed is obviously fatuous, but this argument is not unheard of. Similarly, in the ML field, I think that question 2 is frequently conflated with 3, 4 and 5. This is somewhat more understandable. The position that strong AI could be dangerous in theory if developed, but that we are not close to developing such systems and so such worries are unfounded, is a reasonable, internally consistent one, and it is held by many in the field, including myself until fairly recently. However, while it’s far from certain that AGI is an imminent prospect, I think it’s hard to consistently defend the idea that it’s definitely far away these days. It’s also important not to conflate the idea that thinking about AI risk is silly because we are so far away from its development (worrying about AI risk is like ‘worrying about overpopulation on Mars’, in the words of Andrew Ng) with denying the idea that, if achieved, strong AI would be a potentially dangerous technology, the control of which would pose special challenges. I think the latter is so obviously true that people who deny it are likely either arguing in questionable faith or from very strange premises 4. As we have seen, people’s estimations of how quickly (potentially) dangerous AI can be developed are not particularly good. It’s also important to consider that just because AI risk is possible, that does not mean it’s inevitable. Equally, just because AI risk is not certain, that does not mean it is negligible. To borrow a line from Katja Grace’s interesting post on the idea of slowing down AI development, she notes “a kind of bias at play in AI risk thinking in general, where any force that isn’t zero is taken to be arbitrarily intense”. In contrast, she notes that in practice things come down to finite forces: efforts that make AIs safer, or more understandable, or more controllable, do not have to be watertight against every possible scenario to make things appreciably safer on the margin. Again, I think there is an interesting analogy to Russell’s nuclear fears; things like arms control treaties do not solve the problem of nuclear weapons absolutely and forever, but that does not mean they are futile!

I would say I used to be much more of the opinion that AI risk was plausible but unlikely to be imminent. On an outside view I still think it’s plausible that strong AI might take a bit longer than many people today seem to think, but I find the idea that we are definitely a really long way from very strong general AI a difficult position to defend these days; I don’t think it’s at all certain that LLMs will scale to more general intelligence, but equally, I think it’s harder to hold with certainty that they won’t, or at least that a similar approach won’t lead to potentially dangerous systems. I recall that a few years ago I was very skeptical of the idea that a scaled-up GPT would generate coherent novel code, for example. We should honestly reckon with the fact that, for all its obvious remaining limitations, a system like GPT-4 really does meet a lot of past criteria for ‘AGI’ already! The advent of modern deep learning - AlexNet winning the ImageNet challenge, say - was only around a decade ago. If we took GPT-4 back unmodified to 2012 (or 2016, when I started my PhD), I think all but the most hardcore deep learning optimists would probably have been incredulous that something like it would be produced so soon. While many previous AI paradigms have stalled out before yielding machines capable of dangerous levels of autonomy (while developing many useful technologies along the way), I think it is rash to dismiss the possibility that the modern paradigm will just keep on improving, up to systems which are comparable to a human in many important domains. Therefore, considering policies to control and oversee AI development strikes me as both important and reasonable. As I mentioned, an important observation is that it is not necessary for an AI to be superhuman in every respect to be a potentially dangerous technology which should be developed with caution.

A full human level of generality is probably not necessary for a system to pose risks worth thinking about. Similarly, many capabilities may in fact be safe at highly superhuman levels. For instance, systems like AlphaGo or AlphaFold are highly superhuman but narrow in domain, and GPT-4 is certainly superhuman in the breadth of knowledge it can recall. Few would argue, though, that these capabilities pose a threat of catastrophe in isolation. Conversely, to return to the biology example, a virus or a rat is a much less sophisticated organism than a human, but is still perfectly capable of causing havoc if released into an unprepared ecosystem.

Like Bengio and others, I tend to think that there is a continuum between ‘mundane’ and ‘catastrophic’ risks. Many AI risks can arise from autonomous systems acting together in unpredictable ways, or from evolutionary pressures coming to bear on AI development in some way (for example via an AI capable of autonomous reproduction). I think that the focus on very speculative risks from extremely advanced AI has not been altogether productive, in part because it is difficult to make any concrete progress without real models of how such systems would work. A lot of early work on AI safety, inspired by models proposed by figures like Bostrom and Yudkowsky, focussed heavily on a model of advanced intelligence as arbitrarily powerful maximisers of utility functions: while not an unreasonable line of research at the time, I am unconvinced that actual existing powerful intelligence, either digital or biological, is well modelled by utility function maximisation.

This informs my view that one of the more productive things we can do to prevent catastrophic risks from AI development is scientific progress on better understanding how AIs work and develop, which is one reason I’m moving to Google DeepMind to work on exactly this problem. Better models here would allow a more grounded evaluation of which risk models identified by theorists are more or less plausible, should ideally allow better monitoring and prediction of AI capabilities as they develop during model training, would give us a better understanding of how models work that can be used to settle arguments over how much cause for concern a given risk is, and could potentially provide more tooling to improve control and supervision of models during training. This is sometimes derided in safety circles as being too close to ‘capabilities’, but I think that this is a counterproductive viewpoint. Apart from anything else, it is extremely important to ground any work on understanding AI in empiricism of some kind; it is very, very easy to fool yourself about intelligence, and the track record of previous theoretical models of intelligence is poor. Improving our understanding only helps safety if it actually improves our understanding, rather than providing the appearance of understanding.

One cause for pessimism about interpretability, particularly as it relates to better methods for control, is that ‘better than gradient descent’ is a pretty high bar for any control method for neural networks to clear. Gradient descent is extremely effective! So it is likely that effective interpretability or control methods won’t be some totally non-gradient-based thing. I think that circuit-style analysis is extremely interesting, and very important as a proof of concept for finding particular structures inside trained models, but I doubt that the be-all and end-all of model interpretability is ‘finding more circuits’, at least in the limit.

On the other hand, the position sometimes argued that ‘we already have white-box access to neural network weights and we can optimise with gradients, therefore we already understand them and everything is fine’ seems obviously wrong to me. ‘Emergent behaviour’ is kind of a generic catch-all for ‘stuff we don’t understand’, but it is surely quite obvious that neural networks do display emergent behaviour, in the sense that many people failed to correctly predict that various capabilities would be attainable from large-scale pre-training, and that this is very much a non-trivial function of several aspects of the training data, architecture, and so on. I do think that gradient descent makes some risk models that early AI thinkers considered, like deceptive alignment, somewhat unlikely, but that doesn’t mean they can be dismissed out of hand.

Something I sometimes worry about with models of interpretability, and indeed with alignment and safety work in general, is the seduction of returning to good old-fashioned AI. In particular, when I read something like Anthropic’s mechanistic interpretability agenda, while I think a lot of the work they have put out is really great, it sometimes seems to me that Olah’s picture of idealised mechanistic interpretability - the model as a giant list of interpretable circuits - is almost certainly a mirage. I think that a powerful, general AI is almost certainly not reducible to a list of easily interpretable circuits 5, for the same reason that it is almost impossible to make a good language model by stacking up simple rules by hand. If identifying a complex concept (like ‘nationalism’ or ‘science’) were reducible to a set of simple criteria, then GOFAI would have worked. This doesn’t mean that circuit interpretability is useless, but we should be wary of the dream of recovering a GOFAI knowledge graph from our trained model; my guess is that knowledge graphs aren’t the right way to represent the problem. I think that the circuits idea can be useful in terms of identifying how part of a model works in a somewhat understandable way, which could be a powerful input to something like process supervision; maybe you don’t need to understand the whole model, just the bits which are relevant to a particular behaviour you are trying to avoid. The other interesting thing about circuits is the role they might play in a developmental model of phase transitions in learning, which could allow more insight into why a particular architecture works. In general, I think that building a better understanding of neural networks is an extremely important area, but we should be wary of ideas which look too much like ‘wouldn’t it be great if neural networks weren’t neural networks’!

In a similar vein, I am skeptical of theories that aim to find a ‘grand unified theory’ of deep learning. On this I think Olah is absolutely right; a science of deep learning may well resemble biology or neuroscience more than physics, though there may well be areas where physics-style modelling plays an important role. Of course, a massively important input into neural network behaviour is data, and our best model of the data distribution is the neural networks themselves, which limits how much a quantitative theory of how training proceeds will give us in isolation. Better interpretability and understanding of exactly how neural networks work should be helpful for safety in so far as it de-mystifies various aspects of training, gives us more tools to use for control and safety, and lets us discover what models are capable of before they are deployed.

On a less gloomy note, it also helps us build a case for why powerful AI is safe rather than dangerous! I think that it is totally reasonable for the burden of proof to be on the developers of AI to prove that it’s reasonably safe, rather than on everyone else to prove that there’s a danger; having better models of why and how these systems work should let us make the case for safety, if they are indeed safe, and realise the potential benefits of this technology.


  1. For example, the idea that quantum phenomena are somehow essential to describing brain behaviour in a way that makes the simulation of all human behaviour on classical computers computationally totally infeasible (i.e. the brain uses quantum computation). I am going to ignore this possibility because I think it’s very unlikely, but I’m not an expert. Briefly, reasons to think that it’s not a very important consideration are:

    1. the brain is generally thought to be too wet and noisy an environment for quantum superposition to be maintained, and while respectable scientists have argued for it, the idea remains a bit of a fringe theory at present
    2. AI exhibiting pretty complex behaviour has already been created using classical computers. While current systems obviously differ from human intelligence in many important respects, it gets harder to argue that meaningful intelligence can’t be created using digital computers. So even if the brain in fact does operate on quantum principles, this suggests that it may be possible to create powerful digital intelligence without replicating the brain.
    3. in any case, if quantum computers can be fabricated at large scale, then presumably strong AI would still be possible using a quantum computer even if the brain does rely on quantum effects, though this would certainly push back the likely timetable for achieving such systems.
  2. Animals face several practical constraints on the brain size that evolution can favour. Larger brains use lots of energy and take a long time to develop to maturity. In humans, the size of the birth canal places constraints on head size.

  3. I don’t think they are totally without merit, but I think that they lean a bit heavily on the idea that a superintelligent entity will necessarily be a hyper-rational utility function maximiser. This isn’t totally implausible, but I’m unconvinced it’s a good model for intelligence because I don’t think that any actual intelligent entities are well modelled this way!

  4. Some, for example, have explicitly argued that the replacement of human life by AI is a desirable outcome, provided the AI is in some sense ‘more advanced’. Rich Sutton, the RL pioneer, has argued for something like this viewpoint. It’s hard to know whether this argument is intended entirely seriously, but it seems obviously deranged to me.

  5. At least, unless the list of such circuits were so incomprehensibly long as to not really be much more understandable than the original model.