I am deeply concerned about the future of humanity.1 There are several factors that may have a huge impact (for good and/or for bad) but are mostly overlooked in mainstream public discourse, and one of my ambitions is to try to understand these, as best I can - with the ultimate goal of suggesting courses of action that might improve the odds of a favorable future. One of these factors is the possibility of a breakthrough in AGI - Artificial General Intelligence.2 Hence it is important to me to try to stay informed about major developments in AGI (notwithstanding the extreme difficulty, even for experts, of forecasting in this area).
One concept that is sometimes claimed to be of central importance in contemporary AGI research is the so-called
AIXI formalism. I came across a recent presentation named
Uncertainty & Induction in AGI where AIXI's inventor
Marcus Hutter presents its basic ingredients, which, very briefly, are as follows:
An AIXI agent is (like all of us) situated in an environment which is partly unknown to it, and wishes to act in such a way as to maximize some pre-defined utility. Based on what it observes it updates its beliefs about the environment, and unlike most of us, it does so using Bayesian updating. Before it has made any observations, the agent starts from a particular prior distribution known as Solomonoff's universal prior, defined by assigning to every possible environment H probability 2^(-K(H)), where K(H) is the Kolmogorov complexity of H, which in turn is defined as the length of the shortest program that produces H on a given universal Turing machine. The AIXI agent also acts, and it does so by plugging its utility function and its Bayesian beliefs about the environment into the Bellman equations of standard decision theory, in order to maximize its future utility (with some finite time horizon).
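To make the complexity-weighted Bayesian mixture idea concrete, here is a minimal toy sketch in Python (my own illustration, not anything from Hutter's presentation): a mixture over a small hand-picked class of environments, each given prior weight 2^(-description length), where the "description lengths" are invented stand-ins for the uncomputable K(H). The genuine Solomonoff mixture ranges over all computable environments, which is exactly what makes it impossible to implement.

```python
# Toy illustration only -- AIXI itself is uncomputable. A Bayesian mixture
# predictor over a small hand-picked class of environments, where each
# environment H gets prior weight 2^(-dl(H)), with dl(H) an invented
# "description length" standing in for the uncomputable Kolmogorov complexity K(H).

def bernoulli(p):
    """Environment that emits 1 with fixed probability p, regardless of history."""
    return lambda history: p

def alternating(history):
    """Deterministic environment emitting 0, 1, 0, 1, ..."""
    return 1.0 if len(history) % 2 == 1 else 0.0

# (name, invented description length in bits, P(next bit = 1 | history))
ENVIRONMENTS = [
    ("all zeros",   2, bernoulli(0.0)),
    ("all ones",    2, bernoulli(1.0)),
    ("fair coin",   3, bernoulli(0.5)),
    ("biased 3/4",  5, bernoulli(0.75)),
    ("alternating", 4, alternating),
]

def mixture_predict(history):
    """Posterior-weighted probability that the next bit is 1, starting from
    prior weights 2^(-description length). The prior weights need not sum to 1;
    normalization happens in the posterior."""
    weights, predictions = [], []
    for _, dl, env in ENVIRONMENTS:
        w = 2.0 ** (-dl)                      # prior weight
        for i, bit in enumerate(history):     # multiply in the likelihood of the history
            p1 = env(history[:i])
            w *= p1 if bit == 1 else 1.0 - p1
        weights.append(w)
        predictions.append(env(history))
    total = sum(weights)
    if total == 0.0:                          # history impossible under all hypotheses
        return 0.5
    return sum(w * p for w, p in zip(weights, predictions)) / total

if __name__ == "__main__":
    # After observing 0,1,0,1,0,1 the short "alternating" hypothesis dominates,
    # so the predicted probability of a 1 next is close to 0 (the pattern says 0 comes next).
    print(mixture_predict([0, 1, 0, 1, 0, 1]))
```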
In his presentation, Hutter advises, in fairly strong terms, AGI researchers to take AIXI seriously. It provides, he says, "the best conceptual solutions of the induction/AGI problem so far". The approach he advocates outperforms all earlier attempts at theories of uncertainty and induction, including "Popper’s denial of induction, frequentist statistics, much of statistical learning theory, subjective Bayesianism, Carnap’s confirmation theory, the data paradigm, eliminative induction, pluralism, and deductive and other approaches". Moreover, "AIXI is the most intelligent environmental independent, i.e. universally optimal, agent possible". If nothing else, this last word "possible" is a bit of an exaggeration, because AIXI is in fact not possible to implement (as Kolmogorov complexity is not computable). Hutter's tone then moves on from immodest to outright condescending when he writes that "cranks who have not understood the giants and try to reinvent the wheel from scratch can safely be ignored".3
Concluding from Hutter's boastful tone that Solomonoff induction and AIXI are of no interest would, of course, be a
non sequitur.
But
Uncertainty & Induction in AGI is just a PowerPoint presentation and can only give so much detail, so I felt I needed to look further. In the presentation, Hutter advises us to consult his book
Universal Artificial Intelligence. Before embarking on that, however, I decided to try one of the two papers that he also directs us to in the presentation, namely his
A philosophical treatise of universal induction, coauthored with Samuel Rathmanner and published in the journal
Entropy in 2011. After reading the paper, I have moved the reading of Hutter's book far down my list of priorities, because generalizing from the paper leads me to suspect that the book is not so good.
I find the paper bad. There is nothing wrong with the ambition - to sketch various approaches to induction from
Epicurus and onwards, and to try to argue how it all culminates in the concept of Solomonoff induction. There is much to agree with in the paper, such as the untenability of relying on uniform priors and the limited interest of the so-called No Free Lunch Theorems (points I've actually made myself in
a different setting). The authors' emphasis on the difficulty of defending induction without resorting to circularity (see the well-known
anti-induction joke for a drastic illustration) is laudable. And it's a nice perspective to view Solomonoff's prior as a kind of compromise between Epicurus and
Ockham, but does this particular point need to be made in quite so many words? Judging from the style of the paper, the word
"philosophical" in the title seems to mean something like "characterized by lack of rigor and general verbosity".
4 Here are some examples of my more specific complaints:
- The definition of Kolmogorov complexity depends on the choice of universal Turing machine, of which there are infinitely many. The same ambiguity is therefore true of Solomonoff induction, and the authors admit this. So far, so good. But how, then, should we choose which particular machine to use?
From p 1111 and onwards, the authors talk about choosing a machine that is "unbiased", i.e., one that doesn't favor certain objects blatantly more than other universal Turing machines do. It is not at all clear (at least not to the present reader) that such a notion of unbiasedness can be made sense of. On p 1112, the vague notion of natural universal Turing machines is introduced, meaning machines that do not exhibit such biases, and the authors offer this:
One possible criterion is that a reference machine is natural if there is a short interpreter/compiler for it on some predetermined and universally agreed upon reference machine. If a machine did have an in-built bias for any complex strings then there could not exist a short interpreter/compiler. If there is no bias then we assume it is always possible to find a short compiler.
This does not make any progress towards solving the ambiguity problem, due to the obvious circularity: We should choose a machine that is unbiased in comparison to a "universally agreed upon reference machine", but how should that machine be chosen? Perhaps it should be chosen in such a way as to be unbiased in comparison to some absolutely honest-to-God universally agreed upon reference machine? And so on.5
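For what it's worth, the standard fact lurking behind the "short interpreter/compiler" criterion is the invariance theorem, which bounds the machine-dependence of Kolmogorov complexity by an additive constant but does nothing to single out a reference machine:

```latex
% Invariance theorem: for any two universal Turing machines U and V there is a
% constant c_{U,V}, roughly the length of an interpreter for V written for U,
% such that
K_U(x) \le K_V(x) + c_{U,V} \qquad \text{for every string } x.
% The constant does not depend on x, but it can be arbitrarily large, so the
% prior 2^{-K(H)} -- and hence Solomonoff induction -- still depends on the
% (unresolved) choice of reference machine.
```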
- Solomonoff induction involves the model assumption that the environment is computable. On p 1118, the authors address this assumption, and we find the following passage:
John von Neumann once stated
“If you will tell me precisely what it is that a machine cannot do, then I can always make a machine
which will do just that”. This is because, given any “precise” description of a task we can design an algorithm to complete this task and therefore the task is computable. Although slightly tongue in cheek and not quite mathematically correct, this statement nicely captures the universality of the concept of computability.
All right, thank you for the "not quite mathematically correct", but I still think this is downplaying the problem in a misleading way. The problem is not that von Neumann's statement merely needs minor repair to become correct - the problem is that it is downright false. It would have been easy for the authors at this point to provide a counterexample by sketching Turing's diagonalization argument for the halting problem (see the sketch at the end of this item).
In the same paragraph, they say that "according to the Church-Turing thesis, the class of all computable measures includes essentially any conceivable natural environment". Here it would have been suitable for the authors to modestly admit that the procedure they advocate (Solomonoff prediction) is in fact not computable and therefore impossible in "any conceivable natural environment". (Such a discussion comes much later in the paper.)
Of course it is OK and meaningful to discuss, as most of this paper does, the hypothetical scenario of having a machine that performs Solomonoff induction. But to insist, in this scenario, that the environment must be computable, is to introduce an outright contradiction and thus to enter the realm of the meaningless.
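Here is the promised sketch of the diagonalization argument, written as Python pseudocode; the function halts is hypothetical, and is exactly what the argument shows cannot exist:

```python
# Sketch of Turing's diagonalization argument (not runnable as a real program,
# since it assumes the very function it refutes).

def halts(program_source: str, program_input: str) -> bool:
    """Hypothetical oracle: returns True iff the given program halts on the input.
    The construction below shows that no such computable function can exist."""
    raise NotImplementedError("no such algorithm exists")

def paradox(source: str) -> None:
    # If the oracle says this program halts on its own source, loop forever;
    # if it says it loops forever, halt immediately.
    if halts(source, source):
        while True:
            pass
    else:
        return

# Now feed paradox its own source code. Whatever halts() answers about that case
# is wrong by construction: a precisely described task ("decide halting") that
# no machine can perform, contra von Neumann's quip.
```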
- The black raven paradox is a major issue that any theory of induction ought to address. To accept induction is to accept that the observation of a black raven will (usually) strengthen our belief in the hypothesis H that all ravens are black. Note, however, that "black" and "raven" are just arbitrarily chosen examples of properties here, and might as well be replaced by, e.g., "non-raven" and "non-black", so that accepting induction equally well entails accepting that the observation of a non-black non-raven will (usually) strengthen our belief in the hypothesis H* that all non-black items are non-ravens. But H* and H are logically equivalent, and surely logically equivalent hypotheses must be supported by the same evidence, so we seem to be forced to accept that the observation of a non-black non-raven (such as a yellow banana) will (usually) strengthen our belief in the hypothesis H that all ravens are black. But this seems paradoxical: most people will intuitively resist the idea that the observation of a yellow banana should have any evidential impact on the credibility of a hypothesis concerning the color of ravens.
Rathmanner and Hutter suggest that Solomonoff induction makes progress on the black raven paradox, but their case is very weak, for the following reasons:
- Nowhere in their discussion about Solomonoff induction and the black raven paradox on p 1122-1124 of their paper is there any hint of any asymmetry in how Solomonoff induction would treat black ravens versus non-black non-ravens as regards the all-ravens-are-black hypothesis. No such hint means no argument that Solomonoff induction helps to solve the paradox.
- The highlight, supposedly, of the authors' argument on Solomonoff induction and the black raven paradox is the displayed formula near the top of p 1124, which states that when observing objects, one after another, the conditional probability of never ever seeing a black raven given what has been seen so far after n observations tends to 1 as n→∞ almost surely on the event that no black raven is ever seen. This is a nice property, but has little or nothing to do with Solomonoff induction, because the same is true for any Bayesian model incorporating such an infinite sequence of observations (a one-line argument is sketched below this list).6
- There is a difference between the statements "all ravens are black" and "all ravens ever appearing in our sequence of observations are black". The statement we're originally interested in is the former, whereas the aforementioned highlight on p 1124 is about the latter. The preceding displayed formula, near the bottom of p 1123, bridges this difference. That formula, however, is not about Solomonoff induction but only about a certain simple class of i.i.d. mixtures.
To summarize, there is nothing in the paper to back up its penultimate sentence (on p 1134) saying that the authors have shown "how Solomonoff induction deals with the black raven paradox and argued that it should give the desired result".
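For completeness, here is the promised minimal sketch (assuming the standard measure-theoretic setup) of why the p 1124 convergence holds for any Bayesian model, not just for Solomonoff's prior:

```latex
% Let X_1, X_2, ... be the observation sequence, let F_n be the sigma-algebra
% generated by X_1, ..., X_n, and let A be the event that no black raven is
% ever observed. Levy's upward theorem (martingale convergence) gives
P(A \mid \mathcal{F}_n) = E[\mathbf{1}_A \mid \mathcal{F}_n] \to \mathbf{1}_A
\quad \text{almost surely as } n \to \infty,
% so on the event A the conditional probability of never seeing a black raven
% tends to 1 almost surely, whatever probability measure (Solomonoff's M or any
% other) governs the observation sequence.
```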
- A bit further down on p 1124, in Section 8.1, the authors suggest that a certain property of Solomonoff's universal prior M implies "fast convergence in practice". I object to the term "in practice" here, because there is no practice: Solomonoff's prior is not computable. (The paper is full of similar slips.)
- In the introductory section, on p 1078, we learn that, thanks to Solomonoff induction, "it can be argued that the problem of formalizing optimal inductive inference is solved", and from then on, we hear about this optimality again and again. But I have trouble understanding the concept of a procedure that is both optimal and (if we accept the aforementioned Church-Turing thesis) impossible. Here's another induction procedure which is equally impossible, and arguably even more optimal: whenever we wish to know something about the world, ask an infallible oracle!
Presumably, what the authors mean by Solomonoff induction being "optimal" is that no real-world system can perform better at the given task. But there are many such "optimal" levels, as my oracle example shows, so a better term than "optimal" would be "upper bound on performance". The upper bound on performance provided by Solomonoff induction is obviously tighter than mine, and therefore better. But how tight is it, compared to what is actually achievable? Is there room for better bounds? There'd better not be, in order for the authors' statement about "the problem of formalizing optimal inductive inference [being] solved" not to look silly, but on this issue they are silent.
- In Section 10.4 entitled "Conclusions", the authors claim on p 1134 that Solomonoff induction "gives us new insights into the mechanics of our own thinking". Well, isn't that nice? Even nicer would have been if the paper had contained at least one example of such an insight that can plausibly be attributed to Solomonoff induction.
I still consider it plausible that Kolmogorov complexity and Solomonoff induction are relevant to AGI7 (as well as to statistical inference and the theory of science), but the experience of reading
Uncertainty & Induction in AGI and
A philosophical treatise of universal induction strongly suggests that Hutter's writings are not the place for me to go in order to learn more about this. But where, then? Can the readers of this blog offer any advice?
Footnotes
1) For the benefit of Samuel Rathmanner and Marcus Hutter, and to give them a chance to comment, I write this blog post in English.
2) What such a breakthrough might entail is extremely difficult to predict, but I have discussed various scenarios in earlier blog posts, e.g.,
here,
here and
here.
4) I have great respect for the subject of philosophy. Naturally, then, I disapprove of Rathmanner's and Hutter's usage of the term "philosophical".
5) There is a structural similarity between, on one hand, this failure to make progress by introducing yet another reference machine, and, on the other hand, postulating that everything that exists must be created and thus have a creator, and then pretending to solve the resulting regress by introducing God Almighty. (Who created God Almighty?)
6) One interesting strengthening concerning Solomonoff induction would be if the "almost surely" qualifier could be dropped in the asymptotic result considered here. This would be less trivial, because the corresponding statement in the generality of Bayesian models incorporating an infinite sequence of observations would be false.
7) I am not a computer scientist, so the following should perhaps be taken with a grain of salt. While I do think that computability and concepts derived from it such as Kolmogorov complexity may be relevant to AGI, I have the feeling that the somewhat more down-to-earth issue of
computability in polynomial time is even more likely to be of crucial importance.