torsdag 29 februari 2024

On OpenAI's report on biorisk from their large language models

Aligning AIs with whatever values it is we need them to have in order to ensure good outcomes is a difficult task. Already today's state-of-the-art Large Language Models (LLMs) present alignment challenges that their developers are unable to meet, and yet they release their poorly aligned models in their crazy race with each other where first prize is a potentially stupendously profitable position of market dominance. Over the past two weeks, we have witnessed a particularly striking example of this inability, with Google's release of their Gemini 1.5, and the bizarre results of their attempts to make sure images produced by the model exhibit an appropriate amount of demographic diversity among the people portrayed. This turned into quite a scandal, which quickly propagated from the image generation part of the model to the likewise bizarre behavior in parts of its text generation.1

But while the incident is a huge embarrassment to Google, it is unlikely to do much real damage to society or the world at large. This can quickly change with future more capable LLMs and other AIs. The extreme speed at which AI capabilies are currently advancing is therefore a cause for concern, especially as alignment is expected to become not easier but more difficult as the AIs become more capable at tasks such as planning, persuasion and deception.2 I think it's fair to say that the AI safety community is nowhere near a solution to the problem of aligning a superhumanly capable AI. As an illustration of how dire the situation is, consider that when OpenAI in July last year announced their Superalignment four-year plan for solving the alignment problem in a way that scales all the way up to such superhumanly capable AI, the core of their plan turns out to be essentially "since no human knows how to make advanced AI safe, let's build an advanced AI to which we can delegate the task of solving the safety problem".3 It might work, of course, but there's no denying that it's a huge leap into the dark, and it's certainly not a project whose success we should feel comfortable hinging the future survival of our species upon.

Given the lack of clear solutions to the alignment problem on the table, in combination with how rapidly AI capabilities are advancing, it is important that we have mechanisms for carefully monitoring these advances and making sure that they do not cross a threshold where they become able to cause catastrophe. Ideally this monitoring and prevention should come from a state or (even better) intergovernmental actor, but since for the time being no such state-sanctioned mechanisms are in place, it's a very welcome thing that some of the leading AI developers are now publicizing their own formalized protocols for this. Anthropic pioneered this in September last year with their so-called Responsible Scaling Policy,4 and just three months later OpenAI publicized their counterpart, called their Preparedness Framework.5

Since I recorded a lecture about OpenAI's Preparedness Framework last month - concluding that the framework is much better than nothing, yet way too lax to reliably protect us from global catastrophe - I can be brief here. The framework is based on evaluating their frontier models on a four-level risk scale (low risk, medium risk, high risk, and critical risk) along each of four dimensions: Cybersecurity, CBRN (Chemical, Biological, Radiological, Nuclear), Persuasion and Model Autonomy.6 The overall risk level of a model (which then determines how OpenAI may proceed with deployment and/or further capabilities development) is taken to be the maximum among the four dimensions. All four dimensions are in my opinion highly relevant and in fact indipensable in the evaluation of the riskiness of a frontier model, but the one to be discussed in what follows is CBRN, which is all about the model's ability to create or deploy (or assisting the creation or deployment of) non-AI weapons of mass destruction.

The Preparedness Framework report contains no concrete risk analysis of GPT-4, the company's current flagship LLM. A partial such analysis did however appear later, in the report Building an early warning system for LLM-aided biological threat creation released in January this year. The report is concerned with the B aspect of the CBRN risk factor - biological weapons. It describes an ambitious study in which 100 subjects (50 biology undergraduates, and 50 Ph.D. level experts in microbiology and related fields) are given tasks relevant to biological threats, and randomly assigned to have either access to both GPT-4 and the Internet, or just to the Internet. The question is: does GPT-4 access make subjects more capable at their tasks? It seems yes, but the report remains inconclusive.

It's an interesting and important study, and many aspects of it deserve praise. Here, however, I will focus on two aspects where I am more critical.

The first aspect is how the report goes on and on about whether the observed positive effect of GPT-4 on subjects' skills in biological threat creation is statistically significant.7 Of course, this obsession with statistical significance is shared with a kazillion other research reports in virtually all empirically oriented disciplines, but in this particular setting it is especially misplaced. Let me explain.

In all scientific studies meant to detect whether some effect is present or not, there are two distinct ways in which the result can come out erroneous. A type I error is to deduce the presence of a nonzero effect when it is in fact zero, while a type II error is is to fail to recognize a nonzero effect which is in fact present. The concept of statistical significance is designed to control the risk for type I errors; roughly speaking, employing statistical significance methodology at significance level 0.05 means making sure that if the effect is zero, the probability of erroneously concluding a nonzero effect should be at most 0.05. This gives a kind of primacy to avoiding type I errors over avoiding type II errors, laying the burden of proof on whoever argues for the existence of a nonzero effect. This makes a good deal of sense in a scientific community where an individual scientific career tends to consist largely of discoveries of various previously unknown effects, creating an incentive that in the absence of a rigorous system for avoiding type I errors might overwhelm scientific journals with a flood of erroneous claims about such discoveries.8 In a risk analysis context such as in the present study, however, it makes no sense at all, because here it is type II errors that mainly need to be avoided - because they may lead to irresponsible deployment and global catastrophe, whereas consequences of a type I error are comparatively trivial. The burden of proof in this kind of risk analysis needs to be laid on whoever argues that risk is zero or negligible, whence the primary focus on type I errors that follows implicitly from a statistical significance testing methodology gets things badly backwards.9

Here it should also be noted that, given how widely useful GPT-4 has turned out to be across a wide range of intellectual tasks, the null hypothesis that it would be of zero use to a rogue actor wishing to build biological weapons is highly far fetched. The failure of the results in the study to exhibit statistical significance is best explained not by the absence of a real effect but by the smallness of the sample size. To the extent that the failure to produce statistical significance is a real problem (rather than a red herring, as I think it is), it is exacerbated by another aspect of the study design, namely the use of multiple measurements on each subject. I am not at all against such multiple measurements, but if one is fixated on the statistical significance methodology, it leads to dependencies in the data that force the statistical analyst to employ highly conservative10 p-value calculations, as well as to multiple inference adjustments. Both of these complications lead to worse prospects for statistically significant detection of nonzero effects.

The second aspect is what the study does to its test subjects. I'm sure most of them are fine, but what is the risk that one of them gets a wrong idea and picks up inspiration from the study to later go on to develop and deploy their own biological pathogens? I expect the probability to be low, but given the stakes at hand, the expected cost might be large. In practice, the study can serve as a miniature training camp for potential bioterrorists. Before being admitted to the study, test subjects were screened, e.g., for criminal records. That is a good thing of course, but it would be foolish to trust OpenAI (or anyone else) to have an infallible way of catching any would-be participant with a potential bioterrorist living in the back of their mind.

One possible reaction OpenAI might have to the above discussion about statistical aspects is that in order to produce more definite results they will scale up their study with more participants. To which I would say please don't increase the number of young biologists exposed to your bioterrorism training. To which they might reply by insisting that this is something they need to do to evaluate their models' safety, and surely I am not opposed to such safety precautions? To which my reply would be that if you've built something you worry might cause catastrophe if you deploy it, a wise move would be to not deploy it. Or even better, to not build it in the first place.


1) See the twin blogposts (first, second) by Zvi Mowshowitz on the Gemini incident for an excellent summary of what has happened so far.

2) Regarding the difficulty of AI alignment, see, e.g., Roman Yampolskiy's brand-new book AI: Unexplainable, Unpredictable, Uncontrollable for a highly principled abstract approach, and the latest 80,000 Hours Podcast conversation with Ajeya Cotra for an equally incisive but somewhat more practically oriented discussion.

3) As with so much else happening in the AI sphere at present, Zvi Mowshowitz has some of the most insightful comments on the Superalignment announcement.

4) I say "so-called" here because the policy is not stringent enough to make their use of the term "responsible" anything other than Orwellian.

5) The third main competitor, besides OpenAI and Anthropic, in the race towards superintelligent AI is Google/DeepMind, who have not publicized any corresponding such framework. However, in a recent interview with Dwarkesh Patel, their CEO Demis Hassabis assures us that they employ similar frameworks in their internal work, and that they will go public with these some time this year.

6) Wisely, they emphasize the preliminary nature of the framwork, and in particular the possibility of adding further risk dimensions that turn out in future work to be relevant.

7) Statistical significance is mentioned 20 times in the report.

8) Unfortunately, this has worked less well than one might have hoped; hence the ongoing replication crisis in many areas of science.

9) The burden of proof location issue here is roughly similar to the highly instructive Fermi-Szilard disagreement regarding the development of nuclear fission, which I've written about elsewhere, and where Szilard was right and Fermi was wrong.

10) "Conservative" is here in the Fermian rather than the Szilardian sense; see Footnote 9.