At some point, you won’t want to keep building one huge model that’s good at everything. You may want to over-emphasize a context that is present but rare in the overall training dataset. This is often discussed in terms of societal bias and impolite speech, rightfully so. But the internet has many other biases. Very little of the content in OpenAI’s giant WebText dataset, for example, is about science. Even less is about something niche like chess, math, or machine learning, ironically enough. Almost none of the dataset consists of financial reports, even though some of those are publicly available. And much of the content isn’t dated.
Back to Twitter, and to some extent Reddit. Imagine training a single large model to learn “general language,” then customizing it to focus on your chosen community. Or even customizing the model to a single person. You would do so gradually, shifting the mix of content as you go, from random text toward content more and more like your target.
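A minimal sketch of that kind of schedule, assuming two pre-collected corpora and a hypothetical `train_step` function (none of these names come from anywhere official); the niche fraction of each batch ramps up linearly over fine-tuning:

```python
import random

def mixed_batches(general, niche, total_steps, batch_size=32,
                  start_frac=0.0, end_frac=0.9):
    """Yield batches whose niche fraction ramps linearly over training.

    `general` and `niche` are lists of training examples in whatever
    format your trainer expects; the mixing schedule is the point here,
    not the data format.
    """
    for step in range(total_steps):
        # Linear interpolation from mostly-general to mostly-niche.
        frac = start_frac + (end_frac - start_frac) * step / max(1, total_steps - 1)
        batch = [
            random.choice(niche) if random.random() < frac else random.choice(general)
            for _ in range(batch_size)
        ]
        yield step, frac, batch

# Usage sketch, with a hypothetical fine-tuning step:
# for step, frac, batch in mixed_batches(general_corpus, niche_corpus, 10_000):
#     train_step(model, batch)
```

Nothing about the linear ramp is canonical; a stepwise schedule would express the same idea.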
This may sound like a small change, but the result is not so tiny. It’s hard to grok a 100:1 ratio in volume of content, much less the 10,000:1 ratios that are pretty common on the internet. The good news is that multi-billion-parameter LMs are big enough to keep all that global context and still have room for your niche, without forgetting anything, given a proper training procedure.
Reddit is the “easy” large dataset to organize by niche, and it is already used to that effect, especially for chatbots.
Twitter would be better, if well organized. It has the highest-quality temporal content, albeit in short form.
The problem with Twitter data is that you’d need to index it to make it useful, and no such index ships with the data. The vast majority of tweets are useless, carrying no information and getting no engagement. This isn’t obvious as a user, since the tweets you see are over-sampled from the best content, based on social proof and your previous interactions.
Twitter will sell you the firehose and historical data, but it’s on you to sort that data, which you don’t own and can’t redistribute without Twitter’s prior approval. Moreover, many of the categorizations you’d want, by topic, by clout, and so on, are imprecise, and perhaps could not be offered by a central provider, much less by Twitter itself. We may be egalitarian at heart, but not all tweets are created equal. As it stands, I can’t train a model only on full threads about biology and tech whose content is roughly as high-quality as Balaji’s, measured by reputation or by user impressions.

There are public Twitter datasets, but they are nothing like what you’d want: all tweets over a relatively short window that match a specific keyword or hashtag, perhaps with a minimum number of retweets. It’s hard to overstate how much more useful “the best of Twitter” would be as a dataset than this glorified random sample.
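To make that concrete, here is a hedged sketch of the kind of filter I mean, over a hypothetical tweet record. The fields and thresholds below are illustrative assumptions, not real Twitter API fields; building them from the raw firehose is exactly the indexing work that doesn’t ship with the data:

```python
from dataclasses import dataclass

@dataclass
class Tweet:
    # Hypothetical record; deriving fields like these from the firehose
    # is the hard, unshipped part.
    text: str
    author_followers: int
    likes: int
    topics: set[str]
    is_full_thread: bool

def keep(tweet: Tweet,
         wanted_topics: frozenset = frozenset({"biology", "tech"}),
         min_followers: int = 10_000,
         min_likes: int = 50) -> bool:
    """Crude proxy for 'best of Twitter': topical full threads from
    reputable authors with real engagement."""
    return (tweet.is_full_thread
            and not tweet.topics.isdisjoint(wanted_topics)
            and tweet.author_followers >= min_followers
            and tweet.likes >= min_likes)

# corpus = [t.text for t in firehose if keep(t)]
```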
One of my few regrets is not trying harder to publish a good Twitter NLP dataset when I was on the Twitter Cortex team. We also trained graph-NN models (very simple, based on semi-random walks) that produced author-embedding similarities accurate down to micro-communities, like specific programming languages or bloggers for specific professional sports teams. It helps that Twitter authors mostly stay in their lane 😜
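The idea is close in spirit to DeepWalk-style graph embeddings: run short random walks over an author graph, then train word2vec on the walks, so authors who co-occur on walks end up with similar vectors. A minimal sketch with networkx and gensim, assuming a hypothetical `follow_graph`; this is not the actual Cortex model:

```python
import random
import networkx as nx
from gensim.models import Word2Vec

def random_walks(graph: nx.Graph, num_walks: int = 10, walk_len: int = 20):
    """Short uniform random walks from every node; each walk becomes a
    'sentence' of author ids for word2vec."""
    walks = []
    for _ in range(num_walks):
        for node in graph.nodes():
            walk = [node]
            for _ in range(walk_len - 1):
                neighbors = list(graph.neighbors(walk[-1]))
                if not neighbors:
                    break
                walk.append(random.choice(neighbors))
            walks.append([str(n) for n in walk])
    return walks

# `follow_graph` is a hypothetical author graph built from follows/replies.
# walks = random_walks(follow_graph)
# model = Word2Vec(walks, vector_size=64, window=5, min_count=1, sg=1)
# model.wv.most_similar("some_author_id")  # nearest authors in embedding space
```

Because authors mostly stay in their lane, the walks rarely leave a community, which is why even this simple setup resolves micro-communities.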