Existential risks and AI

Polemic

Originally published at https://www.cevast.org/cz/news/106-existencni-rizika-a-ai

This text was written primarily as a response to the brochure Should We Be Afraid of Artificial Intelligence?, which was published by a group from the Karel Čapek Center for the Study of Values in Science and Technology. The brochure is nice and informative. It describes both the basic principles of AI systems and the ethical problems that are already troubling humanity today. And I agree that we need to face the future challenges concerning the labor market, disinformation, and discrimination. However, I disagree with some specific claims, and especially with the overall message of the section on existential risks (pp. 10–13), which concludes that “considerations about risks associated with general artificial intelligence are premature and divert our attention from problems that are all too real.”

My thesis can be summarized in the following points, where I diverge from the authors of the brochure, and which I will elaborate on at several levels later in the text:

  1. Highly advanced AI cannot be ruled out.
  2. The probability of its emergence within the next few decades is non-negligible.
  3. Such systems represent an existential risk.
  4. Currently, only hundreds of people worldwide work on the problem, which is very few given its seriousness, especially since most of them are pursuing technical solutions, while international consensus and cooperation will also be needed.

Preface on the state of the debate


Before getting into the substantive arguments, I can’t resist a few comments on the current state of the debate regarding AI risks.

At least two camps of opinion are forming, whose members often define themselves in opposition to each other, even though in an ideal world they would understand each other best and support each other. It is easy to speculate that this is driven by competition for resources and attention. Be that as it may, the disagreement comes down to different non-empirical assumptions, which lead the two groups to assess the probabilities of various risks differently.

Let us be epistemically humble, because no one currently knows exactly how AI will continue to develop. We don’t know how hard it is to create a system capable of long-term planning; whether systems will be able to recursively self-improve, and for how long; how many attempts we will get at creating an aligned AGI; or whether the first such system will deprive us of control over the future.

An observation from a completely different angle, regarding the dynamics of the AI safety debate, is that I. J. Good and other pioneers of computer science saw the existential risks we discuss today more than 60 years ago. However, the systems of that time were not powerful enough and, simply put, just didn’t work. Then came a long period of so-called “AI winters”, when research continued, but without major discoveries or exciting breakthroughs. About twenty years ago, the situation began to change. The growth in computing power suddenly made it possible to do interesting things, like recognizing a human face in an image. And a few individuals began to sound the alarm. Eliezer Yudkowsky founded MIRI, the first organization with the explicit goal of helping develop safe advanced artificial intelligence. About 10 years ago, multidisciplinary scientists such as Max Tegmark of MIT and Nick Bostrom of Oxford began to address the topic. Both founded institutes at their universities dealing with existential risks, and the topic slowly began to enter the academic environment.

Another important milestone is the publication of the book Human Compatible, whose author Stuart Russell, a professor at Berkeley, represents the widely recognized and absolutely mainstream machine learning researchers who are starting to warn about major risks. Among other things, Russell wrote the most widely used textbook on artificial intelligence in the world.

Today, declarations are being issued that state: “Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war.” This statement has been supported by leading figures in artificial intelligence research, including top researchers and directors of research labs such as Demis Hassabis of DeepMind, Dario Amodei of Anthropic, and Ilya Sutskever of OpenAI. Among the signatories are also Geoffrey Hinton and Yoshua Bengio, both laureates of the Turing Award (often considered the computer-science equivalent of the Nobel Prize) and among the most cited researchers in the field of machine learning. Russell, mentioned above, signed it as well. Prominent philosophers such as David Chalmers and Daniel Dennett also signed the appeal, and the signatories further include biologists and experts in international relations and nuclear security.

From a few individuals, we have arrived at a time when the absolute top of the field warns about existential risks. Grant agencies are issuing calls for proposals, and companies like OpenAI are publishing announcements that they are investing 20% of their computing power in research on aligning powerful AI systems, which translates into a lot of money.

What to take from this?


On one side, Eliezer Yudkowsky shouts that AI will almost certainly kill us all. On the other, the authors of the brochure say that these worries are foolish and divert attention.

I offer the following analogy (imperfect, like every analogy): imagine that the heads of vaccine development at Pfizer and Moderna, together with the most cited scientists in epidemiology, virology, and vaccinology, signed a statement such as: “There is a 10% chance that vaccines will exert dramatic evolutionary pressure on COVID-19 toward greater transmissibility and lethality.” How much would such a statement complicate the discussion about deploying mass vaccination, and how much more pressure on further development and on mitigating these risks would be appropriate? Or another example: you are on the twentieth floor of a skyscraper and want to take the elevator, sparing yourself the walk down the stairs and saving time. But at the elevator there is a group of people: engineers, elevator designers, structural engineers, and inspectors. Some of them claim that the elevator is safe, and some that it can fall and kill everyone in it. Would you get into such an elevator?

What I want to say is that given the seriousness of the existential risks raised by a non-negligible portion of leading and relevant experts, it makes sense to listen to them even in the absence of a broader consensus. Especially when the dynamics of the discourse show that the number of relevant people who have these concerns is growing. The resolution of the situation will only come in the future; right now both sides have only qualified estimates, intuitions, and considerations.

Response to selected parts of the brochure

They don’t understand, but they work


On page 12 it says “they exhibit intelligent behavior, but without having the properties we associate with human intelligence. (…) They don’t know the meanings of words and sentences, they don’t really understand us, they don’t know what they are answering us (in fact, they don’t even know that they are answering)”, and Searle’s thought experiment is then discussed.

My response is that I agree that the GPT-4 language model does not represent an existential threat, but not because it lacks understanding, consciousness, or intentionality. The topic is nicely covered, for example, by DeepMind’s study Model evaluation for extreme risks, in which risks are framed primarily as sets of model capabilities, not as internal states like consciousness. From the perspective of existential risks, GPT-4 is unproblematic simply because it is not yet sufficiently capable in the key dangerous areas.

However, it must be said that the claims in the brochure are very easy to disagree with. GPT-4 understands the world on many levels, understands words and concepts, and can reason. One can express this even in terms that we do not associate so strongly with the human form of intelligence, but which I consider largely interchangeable: large language models like GPT-4 have a good internal representation of the world, can generalize the solution to a problem from a handful of examples, can navigate complex problems, and can abstract their important aspects. If you let them think through a problem step by step, they give better answers than when forced to react immediately. At the same time, they are fine-tuned so that they know they are a predictive artificial intelligence. In future training runs, once they are trained directly on data containing articles and thousands of scientific studies about themselves, they will have a much more accurate representation of themselves in the context of the world.

Just as the Chinese Room knows Chinese as far as an outside observer is concerned, GPT-4 evidently understands a great many things to an outside observer. And that is what matters.

Intentionality, or rather striving to achieve a goal, is a property of all systems based on reinforcement learning (RL). In GPT-4, however, it is heavily suppressed by the fact that its primary goal is the very short-term “generate a probable next text token”, and RL from human feedback (RLHF) was only used to fine-tune it toward something like “behave like a pleasant, non-controversial chatbot.”

However, what can effectively be called intentionality can easily be added, at least conceptually. It is enough to plug GPT-4 into a loop, tell it to come up with some goal and try to fulfill it, and give it access to the command line.
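A minimal sketch of what such a loop might look like follows (my own illustration, not anything from the brochure): the query_llm helper is hypothetical and stands in for whatever API actually serves the model, and the prompts are purely illustrative.

    import subprocess

    def query_llm(prompt: str) -> str:
        """Hypothetical helper: send the prompt to a language model and return its reply.
        Stubbed out here with a harmless placeholder so the sketch runs as-is."""
        return "echo 'model reply would go here'"

    def agent_loop(goal: str, max_steps: int = 10) -> None:
        history = f"Your goal: {goal}"
        for _ in range(max_steps):
            # Ask the model for the next shell command that moves it toward its goal.
            command = query_llm(history + "\nReply with a single shell command that brings you closer to the goal.")
            # Execute the proposed command and capture its output.
            result = subprocess.run(command, shell=True, capture_output=True, text=True)
            # Feed the observation back so the model can plan its next step.
            history += f"\n$ {command}\n{result.stdout}{result.stderr}"

    agent_loop("come up with a goal of your own and pursue it")

Agentic wrappers such as AutoGPT and ChaosGPT, discussed below, are essentially elaborations of this pattern: a planning prompt, access to tools, and a feedback loop.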

People are already trying it and giving systems like ChaosGPT tasks like “destroy the world.” The fact that we are all still here is, in my opinion, the best empirical evidence that GPT-4 is not skilled enough in the critical areas to destroy the world. But not because of a lack of intentionality. After all, one could philosophize at length about how much intentionality humans have or whether free will exists.

Concretely: AI systems have been beating us at games like chess and go for years, and we cannot be sure they will not learn to manipulate us, self-improve, conduct cyberattacks, or, for example, develop advanced biological pathogens.

For example, AlphaGo doesn’t understand the game of go the way humans do, but that is largely irrelevant. It will beat any human and even a group of humans.

To summarize: arguing from the Chinese Room can lead to dangerous self-reassurance. By emphasizing AI’s lack of understanding or intentionality, we risk ignoring the real problem, namely the potential for harm these systems can cause regardless of their internal qualitative states. While philosophical debates about consciousness and AI’s understanding are fascinating, they should not be conflated with the question of whether AGI systems are safe, and we should not be reassured by the consideration that the internal understanding of AI systems is not of the same kind as ours. After all, GPT-4 does not need to “understand” the concept of disinformation in order to produce it, which is something the authors of the brochure agree with as well.

Why would advanced AI systems want to harm us?


If we assume that AI systems are fully aligned, that is, they do what we tell them in the way we want, then the situation is simple. There will be people who tell such an AI system “destroy the world”, “produce a biological weapon with the contagiousness of measles and the lethality of prion disease” or something similar. Already today, people are trying to enter exactly such prompts into agentic versions of GPT, such as AutoGPT and ChaosGPT.

To be only slightly less cynical: there will be people who give such a system a task like “earn me as much money, power, or other personal benefit as possible, regardless of others.” Such a prompt need not be entered by an individual; it might come from a sufficiently large corporation with the means to train its own, unrestricted system. Just as ExxonMobil, as one of the first institutions in the world, knew about climate change but decided to do nothing about it and keep making money. Or the companies producing CFCs and other ozone-depleting substances, which happily continued until the Montreal Protocol was reached, an agreement referred to as “perhaps the single most successful international agreement to date.” It is remarkable that we live in a world where agreeing to stop destroying the ozone layer is celebrated as an extraordinary success. On any reasonable planet whose inhabitants have the ambition to survive on it long-term, this should be a complete matter of course.

So if the subjective benefit to selfish individuals and organizations from the unscrupulous use of advanced AI is large enough, they will certainly try to squeeze the maximum out of it at the expense of others.

Another important category is advanced AI systems to which we try to assign generally beneficial goals, but which are not aligned. Typical examples of unaligned systems are GPT-4 and Claude. Despite great effort, their creators are not succeeding at instilling “good manners” in these chatbots. OpenAI spent months before the public release applying various safety techniques, only for Twitter users to need mere tens of minutes to find the first so-called jailbreak. After 8 months of fixing safety flaws based on feedback from millions of users, jailbreaks still exist. Researchers are beginning to identify whole categories of approaches that can bypass the safety guardrails.

How can AI get out of our hands?


The answer is that we do not know how to specify goals well, and even if we did, we could not guarantee that the internal representation of those goals inside the AI system corresponds to our specification.

Humanity understands this concept very well. It is the principle behind all the stories about genies in bottles and golden fish that grant wishes. A powerful AI will be just such a fish, but on steroids. It will fulfill what we tell it, but there is a high chance that it will do so in a way we will not like. It is necessary to realize that if a very powerful system is even partially misaligned with human interests, then human interests will lose.
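To make the specification problem concrete, here is a toy sketch with entirely invented numbers (a hypothetical recommender system, not an example from the brochure). We can only measure a proxy for what we actually care about, and an optimizer pushed hard against that proxy ends up exactly where the true objective suffers:

    # Toy illustration of goal misspecification (all numbers invented).
    # True goal: user well-being. Measurable proxy: time spent in the app.

    def true_value(addictiveness: float) -> float:
        # Well-being benefits from mild engagement but drops once content becomes addictive.
        return addictiveness - addictiveness ** 2

    def proxy_value(addictiveness: float) -> float:
        # Time spent keeps growing with addictiveness, so the proxy never registers the harm.
        return addictiveness

    candidates = [i / 10 for i in range(11)]           # candidate "policies"
    best_for_proxy = max(candidates, key=proxy_value)  # what a naive optimizer picks
    best_for_us = max(candidates, key=true_value)      # what we actually wanted

    print(f"proxy optimum: {best_for_proxy}, true value there: {true_value(best_for_proxy):.2f}")
    print(f"true optimum:  {best_for_us}, true value there: {true_value(best_for_us):.2f}")
    # The harder the proxy is optimized, the further the outcome drifts from the goal we meant.

The genie of the previous paragraph works the same way: it fulfills the letter of the proxy, not the spirit of the goal.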

A more abstract analogy can be found in the relationship between evolution and humans. Evolution, understood as an optimization process for spreading genes, created the species Homo sapiens with a brain and general intelligence. At the individual level, however, a person can decide to undergo sterilization, thereby acting in direct opposition to the original optimization process. At the societal level, we engage in potentially catastrophic activities such as driving climate change or stockpiling thousands of nuclear weapons. AI systems, too, can easily develop some heuristic or internal optimization mechanism which at first, second, and perhaps even third glance appears useful for fulfilling the goals we set, but which may later turn out to be in direct conflict with them.

The least sci-fi scenario, however, is that AI systems will work well enough that we will hand them more and more responsibility. In the first step, companies that integrate AI into their business will achieve better results more cheaply than companies relying only on human labor. Beyond a certain level of capability, humans will mostly just get in the way of AI systems. Companies that are not deterred by this will have a great competitive advantage. The danger is that humanity will consciously and voluntarily hand over more and more control to AI systems: from generating entertaining content, through medical advice, to running corporations or governing. But since we cannot guarantee complete alignment of an AI system’s goals with the will of whoever assigns them, in the long term these systems will almost certainly do things that we as humanity will not like. Just as the current economic and political system, like it or not, destroys the planet, AI systems can similarly act contrary to human values. And if there is no one who understands what exactly the AI systems are doing, because they will be more capable than humans in some areas, who will dare to switch off systems that will literally run the planet?

What exactly is at stake?

Here it makes no sense, nor is it possible, to be specific. The abstract principle holds that smarter entities have an advantage over less smart ones, and if the difference between human and AI capabilities is too great, we will not be able to defend ourselves and perhaps not even understand what is happening to us and the world around us.

If we stand in the way of the AI’s goals in any form, it will simply take steps so that we are not in the way. In the most extreme case, it will exterminate us; in less extreme cases, it will deprive us of control over the future and over the world. The AI will want to fulfill its goals and will need various resources to do so. In the human world, the resources are particularly money and power.

Just as humans don’t pay attention to anthills when they want to build a highway, AI doesn’t have to pay attention to humans. Not because it would be evil and wouldn’t like humans, but because it may simply have unaligned goals. Even the animal species closest to us, like bonobos and chimpanzees, have practically no agenda in the world, because we as humanity don’t grant it to them. Their existence is allowed by humans, but if they got too much in the way, we could easily wipe them all out, in the better case just leaving them to survive in zoos. But by this I don’t mean to say that humanity is necessarily facing one of these alternatives.

So we won’t be surprised by how fast it comes

On page 12 it is written: “we are not yet anywhere near general artificial intelligence”, and then on page 13 “we will introduce several current problems”, in a context implying that the existential ones are not current.

What counts as “current” deserves a discussion of its own. Garry Kasparov’s defeat at chess in 1997 was predictable. The playing strength of chess programs grew roughly linearly over time, and the defeat of top human players before the end of the millennium was practically inevitable (at least to those without romanticized ideas about the abilities of the human brain). The exponential growth of computing power did run up against the exponential state space of the game, but even so, chess programs improved steadily, especially with the support of better algorithms and heuristics designed by grandmasters.

Today we live in a time when very wild things are happening in the field of artificial intelligence. Several exponential trends can be observed. They support each other and contribute to making models more and more capable.

  1. The computing power achievable per dollar doubles every 2.5 years.
  2. Every two years, large institutions are willing to spend ten times as much on a training run.
  3. Algorithms are improving (allowing training on more data and making training and inference cheaper and faster…). According to one study using image classification as an example, the computing power needed to reach the same model accuracy fell by a factor of one hundred thousand between 2012 and 2021; in other words, the required computing power halved every 9 months. The amount of data needed halved every two years. We can expect a similar trend to continue, to a large extent, in other areas of machine learning.
  4. With growing interest, the number of people working on improving model capabilities increases. For example, the number of publications on machine learning grew from a few thousand at the turn of the millennium to more than one hundred thousand in 2022 (at least according to analytics from app.dimensions.ai).
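To get a feel for how these trends compound, here is a rough back-of-the-envelope sketch. The doubling and halving times are the ones quoted above; the ten-year horizon and the assumption that the three factors simply multiply are my own simplifications, so the result is an illustration rather than a forecast:

    # Back-of-the-envelope compounding of the trends listed above
    # (simplifying assumption: the three factors multiply independently).
    years = 10

    price_performance = 2 ** (years / 2.5)    # compute per dollar doubles every 2.5 years
    spending = 10 ** (years / 2)              # willingness to pay grows 10x every 2 years
    algorithmic_gain = 2 ** (years * 12 / 9)  # required compute halves every 9 months

    effective_growth = price_performance * spending * algorithmic_gain
    print(f"~{effective_growth:.1e}x more effective training compute after {years} years")
    # With these assumed rates, the combined factor comes out on the order of 10^10.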

A sense of safety requires a very strong belief that no sufficiently large breakthrough will occur, and at the same time that potentially very dangerous systems are so far away that not even the current exponential rush on many fronts can surprise us. I have to add that this reminds me, to a large extent, of statements like “There is no reason for concern. We are prepared,” as Babiš emphasized after the first confirmed COVID-19 infections in early March 2020.

For the game of go, the exponential growth of computing power alone was not enough. That is precisely why Lee Sedol’s defeat by the AlphaGo system in 2016 came as a complete surprise. Sedol, one of the most successful go players in history, lost 4:1. It was not clear how many algorithmic and conceptual breakthroughs would be needed, and estimates said it would not happen for decades. Then DeepMind created a system built on deep neural networks that surpassed human intuition, creativity, and precision in playing go.

“It is difficult to make predictions, especially about the future,” various authors.

Predicting major breakthroughs in science is hard. It is equally hard to predict that no breakthrough will happen. I have already mentioned the example of the game of go. Wilbur Wright’s statement that humans would not fly for at least a thousand years is also amusing: two years later, he and his brother built the first successful prototype of an airplane.

Another canonical example is the story of Leo Szilard, the Hungarian physicist who stood at the birth of nuclear energy and the atomic bomb. He was working out the basic theoretical principles literally in the same days and weeks in which Lord Ernest Rutherford, a respected representative of the older, more settled generation of physicists, was declaring that trying to tame nuclear energy was unrealistic foolishness. Albert Einstein shared that sentiment at the time, until within a few years scientific progress convinced him that there was an acute threat of Nazi Germany developing an atomic bomb.

The existential risk stemming from AGI is, from a certain point of view, not current, because we are not yet at this level of AI advancement. However, if we use the word current less strictly, then it is of course reasonable to prepare for problems that are not yet here, but can be expected over time.

The claim that we are not approaching AGI is the opinion of the brochure’s authors, which they back up with their own reasoning, but it is no more than a conjecture. There are many legitimate reasons to think that we are approaching AGI, and they cannot simply be brushed aside.

A survey of 738 publishing machine-learning experts gives a median probability of an existential catastrophe of 5% to 10%. In another survey, 57% of researchers in the field of natural language processing think that recent advances are leading us toward AGI.

Metaculus, a forecasting platform and community of people who enjoy making predictions, puts the median arrival of AGI at 2032. Platforms of this type often produce better predictions than narrowly focused experts.

Conclusion

Image generation, explaining textual and visual jokes, programming tasks, problems from the international round of the mathematical olympiad, creativity, and much more: these are the kinds of tasks that AI systems have recently largely conquered. Moreover, these capabilities are arriving surprisingly quickly. Various metrics and benchmarks are being reached earlier than predicted not only by experts but also by expert forecasters.

In language models, most of the breakthroughs are largely emergent properties that appeared as models were scaled up, trained on larger amounts of data, and given more computing power for training. The models find the necessary patterns in the data themselves and do not need linguists, artists, or programmers to help them internalize various concepts. To understand programming, explain humor, or give advice on how to do well in a job interview, there is no need to build specialized tools and architectures; the same architecture handles it all.

In other words, progress is moving forward unexpectedly quickly. We cannot be sure that it will stop before we are able to create truly dangerous systems. At the same time, we know that current approaches to AI development do not robustly lead to what could be described as aligned artificial intelligence. At best, we can apply various techniques that push the bar of dangerousness perhaps high enough. At worst, these patches will lull us into complacency, and we will not be careful enough when we develop an AI advanced enough to overcome all the barriers we have set up.

Because of the uncertainty about future development, the problem must be approached rationally. Just as it is reasonable to insure a car against an accident, we as humanity should “insure” the future by working adequately on the problem of existential risks. According to estimates, existential risks are currently being addressed by hundreds of people, a completely negligible number compared to other serious problems. Prague’s public transport company has 11,000 employees. As meritorious as public transportation is, I find it terrifying that roughly 30 times more people work on running it than on a problem that could cause the destruction of humanity.

I’m not saying we should panic and tear down data centers. But if humanity wants to survive as a species, we must approach risks rationally. A lot of relevant people think that in the foreseeable future we will create entities smarter than ourselves, and that it will be a problem to ensure that these systems do not irreversibly harm humanity. Let’s not marginalize this problem, but make enough effort to reduce the probability of the problem to an acceptable level.

Michal Keda

