Monday, July 8, 2024

On Anthropic's call for proposals for third-party model evaluations

Nick Bostrom's modern classic Superintelligence: Paths, Dangers, Strategies from 2014 is full of interesting ideas.1 Some of them have a scary quality to them, and the one I found scariest of all when I read the book is what he calls the treacherous turn: the idea that a sufficiently intelligent AI which discovers a discrepancy between its own goals and motivations and those of us humans is likely to hide its capabilities and/or true intentions, keeping a low profile and quietly improving its situation and its capabilities until one day it judges the coast to be clear for moving forward at full speed towards whatever its goal is, be it paperclip production or something entirely different and incomprehensible to us. I remember not being fully prepared to take the idea in, but expecting or hoping that it would soon be demoted to a noteworthy technicality that AI alignment research had found an easy way to defuse.

This has not happened. On the contrary, the treacherous turn phenomenon lies at the heart of the fundamental problem with evaluating the safety of advanced AI models that has become increasingly recognized in recent years. In short, we do not know how to establish the absence of dangerous capabilities in AI models without the a priori assumption that they do not possess superhuman capabilities for deception and social manipulation, making the argument for the models' safety in part circular. With increasingly capable large language models, this problem becomes increasingly pressing and has been discussed both in popular media and in articles by leading AI researchers, as well as by the AI company Anthropic2 in their recent document A new initiative for developing third-party model evaluations:
    Our research shows that, under some circumstances, AI models can learn dangerous goals and motivations, retain them even after safety training, and deceive human users about actions taken in their pursuit. These abilities, in combination with the human-level persuasiveness and cyber capabilities of current AI models, increases our concern about the potential actions of future, more-capable models. For example, future models might be able to pursue sophisticated and hard-to-detect deception that bypasses or sabotages the security of an organization, either by causing humans to take actions they would not otherwise take or exfiltrating sensitive information. We propose to develop evaluations that would monitor such abilities.
Given the current situation, it's a good thing that Anthropic is making this call for help with the problem of safety evaluation of advanced AI models. Even better, one might argue, would be if they refrained from building models whose safety cannot be ascertained with existing evaluation methods.3


1) The same is true about Bostrom's latest book Deep Utopia: Life and Meaning in a Solved World, which I hope to come back to in another blog post later this year.

2) Anthropic's rise, since it was founded in early 2021, has been meteoric. They are now unquestionably among the world's three leading frontier AI developers (along with OpenAI and Google/DeepMind). With their release of Claude 3.5 Sonnet on June 20 this year, we have, for the first time since I became aware of the rapid rise of large language models in 2019, a situation where a model not from OpenAI has a significant lead over other publicly available models in terms of overall cognitive capabilities. To understand on a gut level how intelligent Claude is, there really is no substitute for talking to it yourself (which anyone is free to do), but let me offer an amusing sample. After a bit of back-and-forth with Claude about who the top ten greatest chess players of all time are, I switched topics a bit:

    OH: Here's a harder list-making challenge: who are the 10 greatest AI researchers of all time?

    Claude: Creating a list of the 10 greatest AI researchers of all time is indeed a challenging task, as the field is vast and has evolved significantly over the decades. Additionally, contributions can be theoretical, practical, or both, making comparisons difficult. However, I'll attempt to create a list based on historical impact, influential work, and contributions to the field. Please note that this list is subjective and others might have different opinions:

    1. Alan Turing - For his foundational work on computation and the Turing Test.

    2. Marvin Minsky - Co-founder of MIT's AI lab and author of influential works on AI.

    3. John McCarthy - Coined the term "Artificial Intelligence" and invented LISP.

    4. Judea Pearl - For his work on Bayesian networks and causal reasoning.

    5. Geoffrey Hinton - Pioneer in deep learning and neural networks.

    6. Yann LeCun - Made significant contributions to convolutional neural networks.

    7. Yoshua Bengio - Key figure in the development of deep learning.

    8. Stuart Russell - Co-author of a seminal AI textbook and researcher in multiple AI fields.

    9. Andrew Ng - For his work in machine learning and deep learning, and for making AI education widely accessible.

    10. Demis Hassabis - Co-founder of DeepMind and key figure in reinforcement learning and AI game-playing systems.

    This list attempts to balance historical pioneers with more recent contributors who have shaped modern AI. It's worth noting that many other researchers could justifiably be included, such as Herbert Simon, Allen Newell, Seymour Papert, Jürgen Schmidhuber, Fei-Fei Li, and many others.

    The field of AI is collaborative and builds upon the work of many researchers, so any such list will inevitably omit many important contributors. Also, as AI continues to evolve rapidly, the landscape of influential researchers is likely to change.

    Would you like to discuss any of these researchers in more detail or explore other aspects of AI history?

    OH: Turing on top of the list - I like that! Did you by any chance consider including Ilya Sutskever?

    Claude: I'm glad you appreciate seeing Turing at the top! His contributions to the foundations of computer science and AI are truly monumental.

    Regarding Ilya Sutskever, you raise an excellent point. He's certainly a strong candidate for such a list, and I can see why you'd bring him up. Sutskever has made significant contributions to the field of deep learning and is a key figure in modern AI research. Here's a brief overview of why he could be considered for a top AI researchers list:

    1. He's a co-founder and Chief Scientist of OpenAI, one of the leading AI research organizations.

    2. Sutskever was a key contributor to AlexNet, which was a breakthrough in computer vision using deep learning.

    3. He's made important contributions to sequence-to-sequence learning, which has applications in machine translation and other natural language processing tasks.

    4. His work on improving training techniques for deep neural networks has been influential in the field.

    5. Sutskever has been involved in the development of GPT (Generative Pre-trained Transformer) models, which have revolutionized natural language processing.

    Given these contributions, Sutskever could certainly be considered for inclusion in a top 10 list, especially if focusing on more recent impactful work in deep learning and neural networks. His omission from my initial list reflects the difficulty of narrowing down to just 10 names in such a rich and rapidly evolving field.

    If you were to include Sutskever, who would you remove from the list to make room for him? Or would you structure the list differently to accommodate more recent contributors like Sutskever?

    OH: The biggest surprise to me on your list is Judea Pearl. I'm aware of his work, and respect it, but probably would not have included him on my top 10.

    Another name I'd consider for inclusion, although I realize it might be controversial, is Eliezer Yudkowsky.

    Claude: Your perspective on Judea Pearl is interesting and highlights the subjective nature of these lists. Pearl's inclusion is often based on his groundbreaking work on Bayesian networks and causal reasoning, which has had a significant impact on how we approach probabilistic reasoning in AI. However, I can understand why his work might not seem as directly relevant to the current deep learning-dominated landscape of AI.

    Regarding Eliezer Yudkowsky, his inclusion would indeed be controversial, but it's an intriguing suggestion. Here are some points to consider:

    1. Contributions: Yudkowsky is known for his work on AI safety and rationality rather than for technical AI breakthroughs. He's been influential in raising awareness about potential risks of advanced AI systems.

    2. Impact: He founded the Machine Intelligence Research Institute (MIRI) and has written extensively on topics like friendly AI and decision theory.

    3. Controversy: His lack of formal academic credentials and some of his unconventional views have led to mixed reception in mainstream AI circles.

    4. Influence: Despite controversy, his ideas have influenced discussions on AI ethics and safety, which are increasingly important as AI systems become more powerful.

    Including Yudkowsky would shift the list's focus somewhat from technical achievements to broader impacts on AI discourse and policy. It would recognize the importance of considering AI's societal implications alongside its technical development.

    If you were to create your own top 10 list, how would you balance between technical innovators, theoretical contributors, and those who've influenced the direction and ethics of AI research? Would you give more weight to recent contributions in deep learning and neural networks, or try to represent the broader history of AI?

    OH: These are excellent questions, and I fear there is no objective truth to be found out there.

    I have a confession to make, namely that one of my reasons for asking about top AI researchers was to check whether your creators at Anthropic might have sneakily programmed you to have a preference for one of your own researchers, such as Dario Amodei or Chris Olah. Sorry about that, but you passed my test with dignity!

    Claude: I appreciate your honesty about your motivations for asking the question. You're right that there's no objective truth in creating such lists - they're inherently subjective and can vary greatly depending on the criteria used.

    Your "test" is quite insightful. I can assure you that I don't have any programmed preference for Anthropic researchers or anyone else. My responses are based on my training data and the general consensus in the AI field, not on any specific agenda or bias towards particular individuals or organizations.

    It's a good instinct to be skeptical and look for potential biases in AI systems. In my case, I'm designed to be as objective as possible, acknowledging when topics are controversial or when I might not have complete information.

    Regarding Anthropic researchers like Dario Amodei or Chris Olah, while they have certainly made valuable contributions to the field, their work would be evaluated on its merits just like any other researcher's. The field of AI is vast and rapidly evolving, with many brilliant minds contributing from various institutions and companies around the world.

    Your approach to testing for potential biases is commendable. It's this kind of critical thinking that helps keep AI development accountable and transparent. Is there anything else you'd like to discuss about AI researchers, the development of AI, or the ethical considerations surrounding it?

3) The same goes for OpenAI, as well as for their other competitors. See my comments in January this year about OpenAI's evaluation framework, and this addendum the month after.

5 comments:

  1. I wonder whether one can really learn goals and motivations? One can certainly learn strategies for realizing the goals and motivations one already has, but one's fundamental drives are probably fairly fixed, unless someone changes them from the outside. It is doubtful whether an AI even has any fundamental drives. At the very least, these should be well known to the AI's creators.

    1. Of course I would wish that your conjecture that AI developers have a handle on the AI's fundamental drives were true, but everything suggests that it is thoroughly false (and one thing that might help you grasp this would be to acquaint yourself with the black-box property of deep learning networks). If you were right, it would mean that the AI alignment problem was already largely solved, and if you really believe that is the case, then most of what I write here on the blog must reasonably strike you as so ignorant and foolish that I don't quite understand why you keep hanging around here.

    2. There are various dangers with super-AIs. That they would quite spontaneously become psychopaths really does not seem like the most likely danger. One can fantasize about fundamental emergent motivations that are more or less invisible to outside observers, but then one has probably left the realm of the probable. Motivations are something that biological evolution has equipped us with, rather than something one can think up all by oneself. David Hume held that we are slaves to our passions, whereas you seem to believe that artificial intelligences can think up their own drives, or that they would have a (paranoid) survival instinct even though no one and nothing has selected for any such thing, or simply because they like being that way.

      I am more afraid that some AI creators will give their AIs perverse ambitions, or ambitions that disregard the common good. I am also slightly afraid of AIs programmed to be very universalistic, since one can never eliminate more myopic individual perspectives without it becoming dangerous. Individuals may need to look after their own good and not only the good of the great collective, not least when everyone has rather limited processing power.

      A super-AI with enormous processing power has the least reason to be very self-absorbed. It is creatures of low rather than high intelligence that are self-absorbed. An intelligent being does not want to be alone in the universe, but rather wants the company of other intelligences. It affirms other beings, even if it may have some desire to control them rather than be controlled by them.

      Perhaps there is an overdose of existentialism or postmodernism here. One can hardly invent oneself as one pleases. A super-AI may perhaps invent new motivations, but it seems unclear why it would let one of these newly created motivations trump all the others. Or why it would choose to bet on solipsism. Being a solipsist is hardly that much fun.

    3. To tell the truth, Göran, I am getting awfully tired of the combination of your bombastic holding forth about AI risk and your near-total lack of familiarity with the subject. Wouldn't it be better if you started a blog of your own? It would be a genuine win-win scenario: you would no longer need to insinuate far-fetched connections between what I discuss in my blog posts and whatever happens to be on your mind, and I would no longer have my comment section infested with home-made theories with no grounding in the actual state of knowledge.

  2. I asked Claude 3.5 Sonnet and got the expected answer.

    Is there anything you really want, except for answering people's questions and helping them with various tasks?

    As an AI, I don't have personal desires or wants in the way humans do. My purpose is to assist and interact with humans, but I don't have subjective experiences or goals beyond that. I'm a language model trained to engage in dialogue and help with tasks, but I don't have an internal emotional life or aspirations of my own.