Damien Benveniste provides us with two articles about the history and taxonomy of large language models. In a post on LinkedIn, he leads us from 2017's "Attention Is All You Need" to 2018's ELMo and BERT. He references Facebook's XLM (2019), which demonstrated the use of transformers for cross-lingual language representation, and then, of course, GPT and ChatGPT. He provides us with great diagrams of the transformer architecture, including this one:
In The ChatGPT Models Family, he provides us with some fascinating charts showing the taxonomy of various large language models. He notes that the GPT models (GPT-1, GPT-2, GPT-3) differ mostly in the amount of training data and the number of transformer blocks: GPT-1 has 12 transformer blocks and 117 million parameters, GPT-2 has 48 blocks and 1.5 billion parameters, and GPT-3 has 96 blocks and 175 billion parameters. He also tells us about some ChatGPT alternatives, including Google's LaMDA and Meta's PEER. He finishes by pointing out that, since the original 2017 paper, not much has changed in the underlying transformer architecture of large language models.
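If you want to see that scaling for yourself, here is a minimal sketch, assuming the Hugging Face transformers package is installed and the public GPT-2 configuration files can be downloaded. It simply reads the configs for the standard GPT-2 checkpoints and prints their depth and width; the largest one, gpt2-xl, is the 48-block, 1.5-billion-parameter GPT-2 mentioned above.

```python
from transformers import GPT2Config

# Pull the published config for each public GPT-2 checkpoint and show that
# the variants differ only in depth (n_layer) and width (n_embd); the
# transformer block itself is the same throughout.
for name in ["gpt2", "gpt2-medium", "gpt2-large", "gpt2-xl"]:
    config = GPT2Config.from_pretrained(name)
    print(f"{name}: {config.n_layer} transformer blocks, hidden size {config.n_embd}")
```

The output makes Benveniste's point concrete: the architecture stays the same from one size to the next; only the depth, width, and training data grow.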
This reminds me of a graphic by Max Roser showing just how quickly technology in general accelerates.
Throughout most of human history, the pace of technological change was glacial. Now technologies are emerging that would have been unimaginable just a few decades ago. Roser explains his chart:
Given all of this, it's fun to consider where this technology may take us in the next five or ten years. If history is any guide, we may well be at the start of a steeply rising hockey-stick curve.