GPT-3 is scary because it’s a tiny model compared to what’s possible, with a simple uniform architecture trained in the dumbest way possible (prediction of the next text token) on a single impoverished modality (random Internet text dumps) on tiny data (fits on a laptop), and yet, the first version already manifests crazy runtime meta-learning; and the scaling curves are still not bending! [...] In 2010, who would have predicted these enormous models would just develop all these capabilities spontaneously, aside from a few diehard connectionists written off as willfully-deluded old-school fanatics by the rest of the AI community? [...] GPT-3 is hamstrung by its training & data, but simply training a big model on a lot of data induces meta-learning without even the slightest bit of meta-learning architecture being built in; and in general, training on more and harder tasks creates ever more human-like performance, generalization, and robustness.
NN: I still think GPT-3 is a brute-force statistical pattern matcher which blends up the internet and gives you back a slightly unappetizing slurry of it when asked.
SA: Yeah, well, your mom is a brute-force statistical pattern matcher which blends up the internet and gives you back a slightly unappetizing slurry of it when asked.