1/41
@karpathy
It's a bit sad and confusing that LLMs ("Large Language Models") have little to do with language; it's just historical. They are a highly general-purpose technology for statistical modeling of token streams. A better name would be Autoregressive Transformers or something.
They don't care if the tokens happen to represent little text chunks. It could just as well be little image patches, audio chunks, action choices, molecules, or whatever. If you can reduce your problem to that of modeling token streams (for any arbitrary vocabulary of some set of discrete tokens), you can "throw an LLM at it".
Actually, as the LLM stack becomes more and more mature, we may see a convergence of a large number of problems into this modeling paradigm. That is, the problem is fixed at that of "next token prediction" with an LLM, it's just the usage/meaning of the tokens that changes per domain.
If that is the case, it's also possible that deep learning frameworks (e.g. PyTorch and friends) are way too general for what most problems want to look like over time. What's up with thousands of ops and layers that you can reconfigure arbitrarily if 80% of problems just want to use an LLM?
I don't think this is true but I think it's half true.
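The "any vocabulary of discrete tokens" claim can be made concrete with a toy sketch (illustrative only, not from the thread): here the "tokens" are quantized sine-wave levels rather than text, and a trivial bigram count model stands in for the Transformer — the autoregressive machinery never needs to know what the tokens mean.

```python
# Minimal sketch: next-token prediction is domain-agnostic.
# The tokens here are quantized signal levels, not text chunks.
import math
from collections import Counter, defaultdict

# 1. Reduce the problem to a token stream: quantize a signal into 8 discrete levels.
signal = [math.sin(0.3 * t) for t in range(500)]
tokens = [int((x + 1) / 2 * 7 + 0.5) for x in signal]  # levels 0..7

# 2. Fit a trivial autoregressive model (bigram counts stand in for a Transformer).
counts = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    counts[prev][nxt] += 1

def predict_next(prev_token):
    """Greedy next-token prediction from the fitted counts."""
    return counts[prev_token].most_common(1)[0][0]

# 3. "Decode" autoregressively from a seed token.
seq = [tokens[0]]
for _ in range(20):
    seq.append(predict_next(seq[-1]))
print(seq)
```

Swapping the quantizer for a text, audio, or action tokenizer changes nothing about steps 2 and 3 — which is the point of the tweet above.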
2/41
@itsclivetime
on the other hand maybe everything that you can express autoregressively is a language
and everything can be stretched out into a stream of tokens, so everything is language!
3/41
@karpathy
Certainly you could think about "speaking textures", or "speaking molecules", or etc. What I've seen though is that the word "language" is misleading people to think LLMs are restrained to text applications.
4/41
@elonmusk
Definitely needs a new name. “Multimodal LLM” is extra silly, as the first word contradicts the third word!
5/41
@yacineMTB
even the "large" is suspect because what is large today will seem small in the future
6/41
@rasbt
> A better name would be Autoregressive Transformers or something
Mamba, Jamba, and Samba would like to have a word.
But yes, I agree!
7/41
@BenjaminDEKR
Andrej, what are the most interesting "token streams" which haven't had an LLM properly thrown at them yet?
8/41
@marktenenholtz
Time series forecasting reincarnated
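One concrete reading of "time series as a token stream" is to bin real values into a discrete vocabulary and detokenize back to bin centers — a hypothetical sketch (bin count and ranges are arbitrary choices for illustration, not any specific system's scheme):

```python
# Sketch: turning a time series into LLM-style tokens and back.
N_BINS = 256  # vocabulary size, chosen arbitrarily for illustration

def tokenize(series, lo, hi):
    """Map real values into integer tokens 0..N_BINS-1 by uniform binning."""
    scale = (N_BINS - 1) / (hi - lo)
    return [min(N_BINS - 1, max(0, round((x - lo) * scale))) for x in series]

def detokenize(tokens, lo, hi):
    """Map tokens back to representative real values."""
    step = (hi - lo) / (N_BINS - 1)
    return [lo + t * step for t in tokens]

series = [12.0, 13.5, 11.2, 14.8]
toks = tokenize(series, lo=0.0, hi=20.0)
recovered = detokenize(toks, lo=0.0, hi=20.0)
# Quantization error is bounded by half a bin width.
assert all(abs(a - b) <= 0.5 * 20.0 / (N_BINS - 1) for a, b in zip(series, recovered))
```

Once the series is a token stream, forecasting is literally next-token prediction.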
9/41
@nearcyan
having been here prior to the boom i find it really odd that vocab like LLMs, GPT, and RLHF is 'mainstream'
this is usually not how a field presents itself to the broader world (and imo it's also a huge branding failure of a few orgs)
10/41
@cHHillee
> It's also possible that deep learning frameworks are way too general ...
In some sense, this is true. But even for just an LLM, the actual operators you run vary a lot: new attention ops, MoE, different variants of activation checkpointing, different positional embeddings, etc.
11/41
@headinthebox
> deep learning frameworks (e.g. PyTorch and friends) are way too general for what most problems want to look like over time.
Could not agree more; I warned the PyTorch team about that a couple of years ago. They should move up the stack.
12/41
@FMiskawi
Has anyone trained a model with DNA sequences, associated proteins and discovered functions as tokens yet to highlight relationships and predict DNA sequences?
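DNA is a natural fit for the same reduction. A common way genomic models discretize sequences is overlapping k-mer tokenization (the sketch below is generic and illustrative, not any particular model's tokenizer):

```python
def kmer_tokenize(seq, k=3):
    """Slide a window of size k over a DNA string to produce overlapping k-mer tokens."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(kmer_tokenize("ATGCGT"))  # ['ATG', 'TGC', 'GCG', 'CGT']
```

With a vocabulary of 4^k k-mers (plus special tokens for proteins or annotations), the sequence becomes just another token stream to model.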
13/41
@axpny
@readwise save
14/41
@Niklas_Sikorra
The main question to me is: is what we do in our brains next-token prediction, or something else?
15/41
@0bgabo
My (naive) intuition is that diffusion more closely resembles the task of creation compared to next token prediction. You start with a rough idea, a high level structure, then you get into details, you rewrite the beginning of an email, you revisit an earlier part of your song…
16/41
@shoecatladder
seems like the real value is all the tools we developed for working with high dimensional space
LLMs may end up being one application of that
17/41
@dkardonsky_
Domain specific models
18/41
@FlynnVIN10
X
TOKEN STREAM
STREAM TEN OK
19/41
@nadirabid
Have you looked into the research by Numenta and Jeff Hawkins?
I think their research into modeling the neocortex is very compelling.
It's not the impractical "let's recreate the brain" kind of neuroscience.
20/41
@DrKnowItAll16
Excellent points. ATs would be a better term, and data is data so next token prediction should work quite well for many problems. The one issue that plagues them is system 2 thinking which requires multiple runs over the tokens rather than one. Do you think tricks with current models will work to allow this type of thought or will we need another breakthrough architecture? I incline toward the latter but curious what you think.
21/41
@geoffreyirving
"Let's think pixel by pixel."
22/41
@cplimon
LLM is a name so bad that it's really really good.
IMO it added to the public phenomenon
At the peak of attention, the only comps i’ve lived through are Bitcoin (‘21), covid, and the Macarena (‘96)
23/41
@brainstormvince
The use of the word Language at least allows some degree of common understanding - you could use Autoregressive Transformers, but it would leave 85% of the world baffled, which seems to come from the same place as the Vatican arguing against scripture being translated from Latin because the translations failed to preserve the real meaning of the words.
The fact that the LLM doesn't care what the tokens represent is all the more reason to keep "language" as something the mass of people can comprehend, even if it loses some precision.
AI in general is already invoking a large degree of existential angst; the last thing we need is for it to become even more obscurant or arcane.
We need Feynman's dictum more than ever: "if we can't explain it to a freshman, we don't understand what we are talking about."
24/41
@danieltvela
It's funny how languages are characterized by the ability to distinguish between words, while tokenizers seem to remove that ability.
Perhaps we're removing something important that's impeding Transformers' innovation.
I'd love to be able to train LLMs whose tokens are words, and the tokenization is based on whitespace and other punctuation.
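A whitespace-plus-punctuation tokenizer like the one described is easy to sketch (illustrative only; a caveat not raised above is that word-level vocabularies grow huge, which is one reason subword tokenizers won out):

```python
import re

def word_tokenize(text):
    """Split on whitespace, peeling punctuation off as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def build_vocab(corpus):
    """Assign each distinct word/punctuation token an integer id."""
    vocab = {}
    for tok in word_tokenize(corpus):
        vocab.setdefault(tok, len(vocab))
    return vocab

corpus = "Tokens are words. Words are tokens!"
vocab = build_vocab(corpus)
ids = [vocab[t] for t in word_tokenize(corpus)]
print(ids)
```

Any model that consumes integer id streams could be trained on these ids as readily as on subword ids.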
25/41
@jakubvlasek
Large Token Models
26/41
@localghost
yeah, similar to how we still call our pocket computers "phones" (and probably will continue to). seems like "large world model" is appearing as a contender for multimodal ais that do everything though
27/41
@srikmisra
deep learning for generative ai is perhaps way too general and reliant on brute-force computing - language is more complex & nuanced than finding sequential relationships - so yes, a more appropriate name is needed
28/41
@43_waves
LLM 4astronomy LSM
29/41
@sauerlo
"If you can reduce your problem to that of modeling token streams"
Wouldn't that be literally everything? With a nod to Gödel's theorem and Turing completeness.
Basically anything outside it is "fantasy", which itself has to have language and run on a Turing machine.
30/41
@s_batzoglou
And yet the token streams that are amenable to LLM modeling can be thought of as languages, with language-like structure: word-like concepts, higher-level phrase-like concepts, and contexts.
31/41
@swyx
do you have a guess as to the next most promising objective after “next token prediction”?
v influenced by @yitayml that objectives are everything. feels like a local maximum (fruitful! but not global)
32/41
@Joe_Anandarajah
I think "Large Language Model" or "Large Multimodal Model" work best for common and enterprise users. Even if inputs and outputs are multimodal; customization, reasoning and integration remain language oriented for consumer and enterprise apps.
33/41
@miklelalak
Tokens are partially defined by their contextual relationship to other tokens, though, right? Through the collection of weights? I'm not even sure if I'm asking this question right.
34/41
@DanielMiessler
I vote for “Generalized Answer Predictors”
Because the acronym is GAP—as in—
“gaps in our knowledge.”
35/41
@ivanku
+1 I've been struggling with the name for quite some time. Doesn't make much sense if the model is operating with images, time-series, or any other data types
36/41
@jerrythom11
Somewhere in this book Eco says "Semiotics is physical anthropology." A programmer who read it and met him in 2000 said, "ah, that's just programming."
37/41
@kkrun
the way i try to understand it is that it can interpolate n-dimensionally within the entire web. who knew the difference between a joke and a tragic story is about statistical correlation among words?
hopefully someday it can do product concept sketches - that's my wish
38/41
@simoarcher
LIM (Large Information Model), which funnily enough means 'glue' in Norwegian, might be a better term for LLMs. It reflects how these models are more abstract in terms of language, understanding and creating any type of content — whether it's text, images, audio, or molecules.
39/41
@poecidebrain
Long ago, a friend who had gone back to school was complaining about how hard it was to learn the math. I said, "Once you learn the language, it's not that hard." Math IS a language. Think about that.
40/41
@JonTeets005
Call it something that highlights its back-grounded nature, like, idunno, "Jost"
41/41
@jabowery
Dynamical, not statistical modeling. This is no mere pedantry, Andrej. It is the difference between the Algorithmic Information Criterion for causal discovery at the heart of science and pseudoscience based on statistics, such as sociology.
GitHub - jabowery/HumesGuillotine: Hume's Guillotine: Beheading the social pseudo-sciences with the Algorithmic Information Criterion for CAUSAL model selection.