LLM, softmax and temperature
Marcello Vitali-Rosati
Montréal - DHSI - 02-06-2025
Arguments from Various Disabilities. These arguments take the form, “I grant you that you can make machines do all the things you have mentioned but you will never be able to make one to do X”. Numerous features X are suggested in this connexion. I offer a selection:
Be kind, resourceful, beautiful, friendly, have initiative, have a sense of humour, tell right from wrong, make mistakes, fall in love, enjoy strawberries and cream, make some one fall in love with it, learn from experience, use words properly, be the subject of its own thought, have as much diversity of behaviour as a man, do something really new. […]
No support is usually offered for these statements. I believe they are mostly founded on the principle of scientific induction. A man has seen thousands of machines in his lifetime. From what he sees of them he draws a number of general conclusions. (Turing, Computing Machinery and Intelligence, 1950)
You insist that there is something that a machine can't do. If you will tell me precisely what it is that a machine cannot do, then I can always make a machine which will do just that. (von Neumann)
How do we define intelligence?
How do we define the different forms of intelligent behavior?
Can a machine be creative?
Is ChatGPT creative?
What is creativity?
How can we give a formal definition of creativity?
\[\sigma(z_i) = \frac{e^{\beta z_{i}}}{\sum_{j=1}^K e^{\beta z_{j}}} \ \ \ for\ i=1,2,\dots,K\]
Guessing the most probable next token
nights = 3
laughter = 2
duck = 1
friends = 4
An LLM is a model that learns the probability of a token \(e_t\) given the previous tokens \(e_1, \dots, e_{t-1}\). That is:
\[P(e_t | e_1, \dots, e_{t-1})\]
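As a minimal sketch (the numbers are the illustrative scores from above, not the output of any real model): given some context, the model assigns a score to each candidate next token, and "guessing the most probable token" means taking the argmax.

```python
# Illustrative next-token scores (logits) from the example above;
# made-up numbers, not the output of a real language model.
logits = {"nights": 3, "laughter": 2, "duck": 1, "friends": 4}

# Greedy decoding: pick the candidate with the highest score.
print(max(logits, key=logits.get))  # -> friends
```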
\[\sigma(z_i) = \frac{e^{z_{i}}}{\sum_{j=1}^K e^{z_{j}}} \ \ \ for\ i=1,2,\dots,K\]
\[z_1 = 3\] \[z_2 = 2\] \[z_3 = 1\] \[z_4 = 4\]
The softmax is the function \[\sigma(z_i)\]
We want:
\[\sigma(3) = ?\]
\[\sigma(2) = ?\]
\[\sigma(1) = ?\]
\[\sigma(4) = ?\]
The sum of these 4 values must be 1 (multiplying by 100 then gives percentages)
\[\frac{e^{z_{i}}}{\sum_{j=1}^K e^{z_{j}}}\]
The numerator of the fraction is the exponential function with base \(e\), Euler's number, and exponent \(z_i\), the score to which we are applying the function.
The denominator is the sum of the exponential function applied to all the scores we want to process.
In our case, the denominator will be:
\(e^3 + e^2 + e^1 + e^4 = 84.791024884\)
(\(e = 2.71828...\))
And so:
\(\sigma(3) = \frac{e^3}{84.791024884} = 0.24\)
\(\sigma(2) = \frac{e^2}{84.791024884} = 0.09\)
\(\sigma(1) = \frac{e^1}{84.791024884} = 0.03\)
\(\sigma(4) = \frac{e^4}{84.791024884} = 0.64\)
We therefore have the following probabilities:
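A minimal Python sketch (standard library only, not part of the original slides) reproduces these probabilities:

```python
import math

logits = [3, 2, 1, 4]  # nights, laughter, duck, friends

# Softmax: exponentiate each score, then divide by the sum of exponentials.
exps = [math.exp(z) for z in logits]
total = sum(exps)  # ~ 84.791024884
probs = [x / total for x in exps]

print([round(p, 2) for p in probs])  # [0.24, 0.09, 0.03, 0.64]
print(sum(probs))                    # ~ 1.0: it is a probability distribution
```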
\[\sigma(z_i) = \frac{e^{\beta z_{i}}}{\sum_{j=1}^K e^{\beta z_{j}}} \ \ \ for\ i=1,2,\dots,K\]
\[T = \frac{1}{\beta}\]
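Substituting \(\beta = \frac{1}{T}\) into the formula above gives the more common "divide the logits by the temperature" form (a standard rewriting, added here for reference):

\[\sigma(z_i) = \frac{e^{z_{i}/T}}{\sum_{j=1}^K e^{z_{j}/T}} \ \ \ for\ i=1,2,\dots,K\]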
the higher the temperature, the more disorganized the system is
If we increase the temperature (and thus decrease \(\beta\)), the difference between the percentages will be reduced
Let's set the temperature to 5, and thus \(\beta = \frac{1}{5} = 0.2\)
Our 4 scores, multiplied by \(\beta = 0.2\), are transformed as follows:
\[\beta z_1 = 0.6\] \[\beta z_2 = 0.4\] \[\beta z_3 = 0.2\] \[\beta z_4 = 0.8\]
Our denominator will therefore be:
\(e^{0.6} + e^{0.4} + e^{0.2} + e^{0.8} = 6.760887185\)
And if we redo the calculations with these new numbers (intuitively, since the exponents are smaller, the exponentials are closer together, so the gaps between the probabilities shrink):
\(\sigma(3) = \frac{e^{0.6}}{6.760887185} = 0.27\)
\(\sigma(2) = \frac{e^{0.4}}{6.760887185} = 0.22\)
\(\sigma(1) = \frac{e^{0.2}}{6.760887185} = 0.18\)
\(\sigma(4) = \frac{e^{0.8}}{6.760887185} = 0.33\)
The new probabilities:
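The same sketch, dividing the logits by the temperature first, reproduces the flattened distribution (again a standard-library check, not the slides' own code):

```python
import math

def softmax_t(logits, temperature=1.0):
    # Temperature-scaled softmax: divide each logit by T (i.e. multiply
    # by beta = 1/T) before exponentiating, then normalize.
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [x / total for x in exps]

print([round(p, 2) for p in softmax_t([3, 2, 1, 4], temperature=5)])
# [0.27, 0.22, 0.18, 0.33] -- much flatter than [0.24, 0.09, 0.03, 0.64]
```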
smoothing…
And new words? Also possible!
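Concretely, "new words" become possible because generation samples from this distribution instead of always taking the argmax; at temperature 5 even "duck" (probability 0.18) comes up regularly. A minimal sampling sketch with the same illustrative numbers:

```python
import math
import random

tokens = ["nights", "laughter", "duck", "friends"]
logits = [3, 2, 1, 4]
T = 5

# Temperature-scaled softmax, as above.
exps = [math.exp(z / T) for z in logits]
total = sum(exps)
probs = [x / total for x in exps]

# Sampling instead of argmax: each token is drawn with its softmax
# probability, so low-scoring tokens appear from time to time.
print(random.choices(tokens, weights=probs, k=10))
# e.g. ['friends', 'nights', 'duck', 'friends', 'laughter', ...]
```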
A behavior that deviates from a normal probability distribution
Wrong question
GIVE THE F***(ORMAL) DEFINITION!