Quote:
Originally Posted by oska
Garry, I said fancy pattern matching. Obviously that's not simple grep or even google's search thingy, that's just silly.
In its barest engineering essence without using any hype, buzzwords, overly technical terms or overstating anything, in as few words as possible what is it technically then?
What are "language models" and "deep learning algorithms" if not a bunch of weighted links on how phrases relate and fancy pattern matchers?
Thanks Steve
Look, I know where you are coming from, and I'm following up here not
to point-score, but rather to have a stab at explaining why I would
not refer to ChatGPT as even the fanciest of pattern matchers.
My hope is that any interested reader who takes the time to read this
overly long response may come away with some appreciation that the way
ChatGPT operates is significantly different in many key ways from the
convolutional neural networks presented in the video.
And for the record, all the following text is mine, as tempting as it may
be to just get ChatGPT to do it. Thus any factual errors or terrible
grammar are purely mine.
I note that the video was made in 2016, which at the current pace is very
old. It predates a seminal paper that came out of Google in 2017, a game
changer that made large language models like GPT-2/GPT-3 and ChatGPT
practically possible.
One feature of something that is purely a pattern matcher is that if you
gave it a specific input, it would produce an identical result each time.
The convolutional neural network in the video, processing images, is
precisely that type of system.
If you present identical text to ChatGPT in new chat sessions, there is no
guarantee that the output will be identical.
In fact, if you were to build an exact clone of ChatGPT's hardware and
gave both systems the same input conversation, there is no guarantee that
the outputs would be the same.
I will touch upon why that is a little later.
You mention the word "phrases" here and in your first post, where you
offered an explanation that, quoting, "these phrases are associated with
these other phrases. The filtered phrases are passed on to another pattern
matcher that makes reasonable sentences. Done.".
That is incorrect, because during real-time operation ChatGPT remarkably
doesn't deal with phrases at all.
To understand why, it is useful to appreciate the size of the problem if
it did try to deal directly with phrases.
If you go back to the 2016 video link you posted, the presenter gives
a hypothetical example of a neural network that might be used to
estimate the price of a house given a set of input parameters it derives
from an image: number of windows, width and height of the building and so
on (2m33s in the video). In actual fact, he goes on to say that's not how
his image processing system works, but he does propose to the viewer that
it might be some "smart way of doing it".
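Just to make that hypothetical concrete, a toy version of the kind of
estimator he describes might look like the sketch below. The features,
weights and bias are entirely made up for illustration; a real network
would learn them from data and have vastly more of them.

    # Toy "house price" estimator of the sort described in the video.
    # The features, weights and bias are invented purely for illustration.
    def estimate_price(windows, width_m, height_m):
        # A single linear "neuron": a weighted sum of the inputs plus a bias.
        weights = {"windows": 5000.0, "width": 12000.0, "height": 8000.0}
        bias = 50000.0
        return (weights["windows"] * windows
                + weights["width"] * width_m
                + weights["height"] * height_m
                + bias)

    print(estimate_price(windows=8, width_m=10, height_m=6))  # 258000.0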
Now consider the problem of natural language processing. If one were
to use the phrase approach, there is an impractically enormous number of
inputs to deal with, because there are essentially an infinite number of
possible input phrases.
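A quick back-of-the-envelope figure (the numbers are only there for
scale): with a vocabulary of, say, 50,000 words there are 50,000 to the
power of 10 possible ten-word sequences, around 10 to the 47. No lookup
table of phrases is ever going to cover that.

    vocab_size = 50_000        # a rough English vocabulary, just for scale
    phrase_length = 10         # a modest ten-word phrase
    print(vocab_size ** phrase_length)   # ~9.8e46 possible phrases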
So what does ChatGPT do? Instead of dealing with phrases, it deals with
one word at a time (strictly, sub-word "tokens", but "words" will do here)
and uses probabilities as to which word is most likely to come next.
So far, so good. But if you build a system that relies only on which word
is statistically most likely to come next, it tends to quickly drift off
topic onto some confusing tangent.
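A cartoon of that "always take the most likely next word" idea is below.
The little probability table is invented; a real model learns its
probabilities from mountains of text, but the failure mode is similar.

    # A cartoon of "always pick the statistically most likely next word".
    # The probability table is invented; real models learn it from data.
    next_word_probs = {
        "penguins": {"are": 0.5, "live": 0.3, "eat": 0.2},
        "are":      {"birds": 0.6, "cold": 0.4},
        "birds":    {"that": 0.7, "are": 0.3},
        "that":     {"are": 0.5, "fly": 0.5},
    }

    word = "penguins"
    sentence = [word]
    for _ in range(5):
        candidates = next_word_probs.get(word)
        if not candidates:
            break
        word = max(candidates, key=candidates.get)  # greedy: take the single most likely word
        sentence.append(word)

    print(" ".join(sentence))  # prints "penguins are birds that are birds", it wanders in circles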
So you then think, "Okay, for it to stay on topic, each time I ask a new
question in a dialogue on a subject, I will simply re-parse the entire
conversation up to and including the new question, and so it will be more
likely to stay on topic".
The problem with that approach is that with each new input from the user
and each new output from ChatGPT, if it were to naively push the entire
conversation back through the neural network every time, the amount of
work rapidly blows out. There is just too much data.
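To get a feel for how quickly it blows out (the 200 words per turn is an
invented figure): if every turn means re-reading everything said so far,
the total work grows with the square of the number of turns.

    # Rough cost of re-reading the whole conversation on every turn.
    # Assume each turn adds ~200 words (an invented figure for illustration).
    words_per_turn = 200
    total_processed = 0
    for turn in range(1, 101):
        conversation_length = words_per_turn * turn   # everything said so far
        total_processed += conversation_length        # re-read all of it this turn
    print(total_processed)  # 1,010,000 words processed for a 100-turn chat of 20,000 words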
So what do you do? You keep focus on the conversation by distilling the
whole conversation up to a given point into a smaller form of data:
a "hidden state", something small enough to hold in memory, which can be
thought of as an abstract vector summarizing the information from the
conversation so far.
The hidden state is updated at each time step based on the input (i.e. the
message from the user) and the previous hidden state, allowing the model
to keep track of the context of the conversation and make informed
predictions about the next message.
Neural networks built around this idea are known as Recurrent Neural
Networks (RNNs).
Unlike the convolutional network in the video, an RNN is 'stateful'.
Things don't simply pass in and "filter through" a series of filters or
pattern matchers. The hidden memory component feeding back as one of the
inputs makes the system a state machine.
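A bare-bones sketch of that recurrence is below. The sizes and random
weights are placeholders (a real network learns its weights); the point is
just that each new input is combined with the previous hidden state.

    import numpy as np

    # Minimal recurrent step: new hidden state from the current input plus
    # the previous hidden state.
    hidden_size, input_size = 4, 3
    rng = np.random.default_rng(0)
    W_x = rng.normal(size=(hidden_size, input_size))   # input weights (random, untrained)
    W_h = rng.normal(size=(hidden_size, hidden_size))  # hidden-to-hidden weights
    b = np.zeros(hidden_size)

    def rnn_step(x, h_prev):
        # The hidden state is the "distillation" of everything seen so far.
        return np.tanh(W_x @ x + W_h @ h_prev + b)

    h = np.zeros(hidden_size)                    # empty memory at the start of a conversation
    for x in rng.normal(size=(5, input_size)):   # five dummy "word" vectors
        h = rnn_step(x, h)
    print(h)                                     # the state after "reading" five inputs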
Now RNNs have been around a long time, since the 1980s.
But a key question is: how do you decide what to retain in the hidden
state, and how do you do it? You only have limited memory and limited
processing capability.
So deciding what to keep in this hidden state vector and what to throw
away is the tough part. It is the part that, at any instant, is trying to
remember the things that are important to keep the conversation on track.
Does this sound familiar? Probably, because it feels eerily similar to what
we do as humans when having a conversation with someone or reading
a novel.
As a conversation with another person progresses, we don't record the
entire conversation in our heads but instead pay attention to the details
we are talking about. In a similar way, if we are halfway through a thick
novel and open it up again, we don't run through every word in the book up
to that point in our minds. We have some distillation of the plot and what
the characters were up to, so when we start reading again where we left
off last time, that distillation forms an input into the neural network in
our brains, combined with the new words in the book, and the text makes
sense to us.
Sure, we do pattern matching when we see the printed words on the page.
For example, it might say, "She turned on me with real fury as though I
were a child who had carelessly broken some vase she had cherished over
the years for its beauty and the memories it contained". We do the pattern
matching to pick out the space-delimited words "She", "child", "vase" and
so on, but for an abstract sentence like this, which we are unlikely to
have seen before, we don't do any fancy phrase pattern matching. Nor, for
that matter, does ChatGPT.
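That word-level "pattern matching" part really is mundane, along the
lines of:

    sentence = ("She turned on me with real fury as though I were a child "
                "who had carelessly broken some vase she had cherished")
    words = sentence.split()   # picking out the space-delimited words is the easy bit
    print(words[:6])           # ['She', 'turned', 'on', 'me', 'with', 'real']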
What we do do is retain some distillation of the novel up to that point,
so we know who "she" is and who the person telling the story is, and as we
parse the sentence we process it serially, a word at a time, maintaining
"attention" on what is important. In this case, "She" (who we know from
earlier) is furious with the person telling the story (who we also know
from earlier).
ChatGPT, like us, does the same during its conversation.
Now the really, really tricky part of this small distillation we keep in
our heads, of a conversation or of a novel up to the point where we left
off, is deciding what is important to retain and what is not.
When processing language, we look for what is important: "She", "me",
"fury". That's what we want to keep in short-term memory, at least for the
next paragraph to make sense. Who "she" is and why she is furious will
already have been distilled into our compacted memory of the book.
That this is probably the way we process language as humans is often
demonstrated when we are having a conversation, become distracted and
then say, "What were we talking about again?" Our state vector of the
conversation to date requires a refresh. Sometimes neither party can
recollect what they were talking about.
Now the tough part is knowing what to retain in that distillation
of the conversation. What to keep and what to throw away.
Over the years with RNNs there were several approaches: architectures
with names such as Long Short-Term Memory (LSTM) cells and Gated Recurrent
Units (GRUs). These add "gating" mechanisms, learned switches that decide
at each step what to write into that memory and what to forget. They are
the machinery for staying focused when processing language.
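As a sketch of the gating idea, here is a simplified GRU-style update
(one gate only, random untrained weights, made-up sizes), just to show how
"keep the old memory or overwrite it" is expressed:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    hidden_size, input_size = 4, 3
    rng = np.random.default_rng(1)
    W_z = rng.normal(size=(hidden_size, input_size + hidden_size))  # the "how much to update" gate
    W_c = rng.normal(size=(hidden_size, input_size + hidden_size))  # the candidate new memory

    def gated_step(x, h_prev):
        xh = np.concatenate([x, h_prev])
        z = sigmoid(W_z @ xh)            # per element: near 0 = keep old memory, near 1 = overwrite
        h_candidate = np.tanh(W_c @ xh)
        return (1 - z) * h_prev + z * h_candidate   # blend the old memory with the new candidate

    h = np.zeros(hidden_size)
    x = rng.normal(size=input_size)
    print(gated_step(x, h))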
Then in 2017, a year after that YouTube video, came a breakthrough:
a paper entitled "Attention Is All You Need" by Vaswani et al.
It proposed a brand new architecture called the "Transformer", built
entirely around attention. It not only does a much better job than an RNN
of staying on track during a conversation, it is computationally efficient
and parallelizable, making it much faster to train and run.
The key innovation of the transformer architecture is the use of
"self-attention mechanisms", which allow the network to weigh the
importance of different parts of the input sequence when making
predictions. The self-attention mechanism allows the network to focus on
the most relevant parts of the input when making predictions, rather than
simply processing the entire sequence in a fixed order as in traditional
RNNs or convolutional neural networks.
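A stripped-down sketch of that self-attention calculation is below:
random untrained weights, five stand-in "word" vectors, and none of the
multiple heads, layers or other machinery of the real thing, just the core
query/key/value mechanics.

    import numpy as np

    rng = np.random.default_rng(2)
    seq_len, d = 5, 8                      # five "words", each an 8-dimensional vector
    X = rng.normal(size=(seq_len, d))      # stand-in word vectors
    W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

    Q, K, V = X @ W_q, X @ W_k, X @ W_v    # queries, keys, values
    scores = Q @ K.T / np.sqrt(d)          # how strongly each word should attend to every other word
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over each row
    output = weights @ V                   # each word's output is a weighted mix of the value vectors

    print(weights.round(2))                # the "attention" each word pays to the others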
So GPT-2/GPT-3/ChatGPT are examples of language models built on this
"Transformer" architecture. In fact, the Transformer did away with the
recurrent hidden state altogether: rather than squeezing the conversation
into one small vector, it attends directly over the whole conversation
(up to a context limit), and does so efficiently and in parallel.
It gets more complex than that. ChatGPT also uses random number
generators to mix things up when it picks each next word. Hence even if
you built an identical clone, it is likely to produce a differently worded
conversation.
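That randomness typically comes from sampling the next word from the
predicted probabilities rather than always taking the top one. The tiny
distribution below is invented, and "temperature" is the usual knob, but
the idea is this:

    import numpy as np

    rng = np.random.default_rng()   # no fixed seed: each run can differ, like separate chat sessions
    words = ["fish", "krill", "squid"]
    probs = np.array([0.5, 0.3, 0.2])    # invented next-word probabilities
    temperature = 0.8                    # below 1 sharpens the distribution, above 1 flattens it

    adjusted = probs ** (1.0 / temperature)
    adjusted /= adjusted.sum()
    print(rng.choice(words, p=adjusted))  # may print a different word on each run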
So ChatGPT does not pattern match on phrases. It generates text a word at
a time, with the "Transformer" attention mechanism continually deciding
which parts of the conversation matter for choosing the next word.
Mechanisms such as "attention" are key to language models and
hence the term "language model" is not a buzzword substitute
for a "fancy pattern matcher". They are two entirely different
concepts.
In fact, it is no more a really fancy pattern matcher than a computer is
a fancy typewriter.
With a typewriter, you press the Q key and a Q is printed, and so on.
Deterministic. Place your finger here, it recognizes which key is pressed,
and predictably the same letter is printed.
By comparison, on the old Enigma machine you would press Q and, depending
on the settings, some other letter would come out, say K. Its gears would
turn, you would press Q again, and some other letter might come out, say
B.
The statefulness of ChatGPT, its ability to effectively change its
internal state on the fly, plus the addition of a random number generator,
make it like an Enigma machine on steroids. Billions of times bigger.
No fancy matching of phrases at all.
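For fun, here is a toy of that statefulness: one made-up "rotor", nothing
like the real Enigma wiring, just enough to show the same key producing a
different letter each press because the internal state has moved on.

    import string

    class ToyRotorMachine:
        # A toy stateful cipher: same input, different output, because the state changes.
        alphabet = string.ascii_uppercase
        rotor = "KZBQWVXJSNMPHOAYCULRGETDIF"   # an invented fixed scramble of the alphabet

        def __init__(self):
            self.offset = 0                    # the machine's internal state

        def press(self, key):
            out = self.rotor[(self.alphabet.index(key) + self.offset) % 26]
            self.offset += 1                   # the "gears turn": the state changes after every press
            return out

    m = ToyRotorMachine()
    print(m.press("Q"), m.press("Q"), m.press("Q"))  # same key, three different letters: C U L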
Let's not beat about the bush. ChatGPT is certainly the most impressive
demonstration of software of any type I have ever seen.
Attached, over two files, is a conversation I had with ChatGPT a short
while ago which demonstrates its "attention" mechanism by way of the
"Transformer" architecture.
I only mention the word "penguin" once, at the start of the conversation.
Despite the fact that I do not use the word "penguin" again, but instead
ask questions such as "What do they eat?" and "How do they withstand the
cold?", notice how ChatGPT understands we are talking about penguins and
not, say, the "someone" who suggested a trip to the zoo.