LLM, softmax and temperature
Marcello Vitali-Rosati
Montréal - DHSI - 02-06-2025
Arguments from Various Disabilities. These arguments take the form, “I grant you that you can make machines do all the things you have mentioned but you will never be able to make one to do X”. Numerous features X are suggested in this connexion. I offer a selection:
Be kind, resourceful, beautiful, friendly, have initiative, have a sense of humour, tell right from wrong, make mistakes, fall in love, enjoy strawberries and cream, make some one fall in love with it, learn from experience, use words properly, be the subject of its own thought, have as much diversity of behaviour as a man, do something really new. […]
No support is usually offered for these statements. I believe they are mostly founded on the principle of scientific induction. A man has seen thousands of machines in his lifetime. From what he sees of them he draws a number of general conclusions. (Turing, Computing Machinery and Intelligence, 1950)
You insist that there is something that a machine can't do. If you will tell me precisely what it is that a machine cannot do, then I can always make a machine which will do just that. (von Neumann)
How do we define intelligence?
How do we define the different forms of intelligent behavior?
Can a machine be creative?
Is ChatGPT creative?
What is creativity?
How can we give a formal definition of creativity?
\[\sigma(z_i) = \frac{e^{\beta z_{i}}}{\sum_{j=1}^K e^{\beta z_{j}}} \ \ \ for\ i=1,2,\dots,K\]
Guessing the most probable next token
nights = 3
laughter = 2
duck = 1
friends = 4
An LLM is a model that learns the probability of a token \(e_t\) given the previous tokens \(e_1, \dots, e_{t-1}\). That is:
\[P(e_t | e_1, \dots, e_{t-1})\]
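As a minimal sketch (the numbers are the illustrative scores from above, not the output of any real model): given some context, the model assigns a score to each candidate next token, and "guessing the most probable token" means taking the argmax.

```python
# Illustrative next-token scores (logits) from the example above;
# made-up numbers, not the output of a real language model.
logits = {"nights": 3, "laughter": 2, "duck": 1, "friends": 4}

# Greedy decoding: pick the candidate with the highest score.
print(max(logits, key=logits.get))  # -> friends
```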
\[\sigma(z_i) = \frac{e^{z_{i}}}{\sum_{j=1}^K e^{z_{j}}} \ \ \ for\ i=1,2,\dots,K\]
\[z_1 = 3\] \[z_2 = 2\] \[z_3 = 1\] \[z_4 = 4\]
The softmax is the function \[\sigma(z_i)\]
We want:
\[\sigma(3) = ?\]
\[\sigma(2) = ?\]
\[\sigma(1) = ?\]
\[\sigma(4) = ?\]
The sum of these 4 values must be 1 (multiplying by 100 then gives percentages)
\[\frac{e^{z_{i}}}{\sum_{j=1}^K e^{z_{j}}}\]
The numerator of the fraction is the exponential function with base \(e\), Euler's number, and exponent \(z_i\), the score to which we are applying the function.
The denominator is the sum of the exponential function applied to all the scores we want to process.
In our case, the denominator will be:
\(e^3 + e^2 + e^1 + e^4 = 84.791024884\)
(\(e = 2.71828...\))
And so:
\(\sigma(3) = \frac{e^3}{84.791024884} = 0.24\)
\(\sigma(2) = \frac{e^2}{84.791024884} = 0.09\)
\(\sigma(1) = \frac{e^1}{84.791024884} = 0.03\)
\(\sigma(4) = \frac{e^4}{84.791024884} = 0.64\)
We therefore have the following probabilities:
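A minimal Python sketch (standard library only, not part of the original slides) reproduces these probabilities:

```python
import math

logits = [3, 2, 1, 4]  # nights, laughter, duck, friends

# Softmax: exponentiate each score, then divide by the sum of exponentials.
exps = [math.exp(z) for z in logits]
total = sum(exps)  # ~ 84.791024884
probs = [x / total for x in exps]

print([round(p, 2) for p in probs])  # [0.24, 0.09, 0.03, 0.64]
print(sum(probs))                    # ~ 1.0: it is a probability distribution
```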
\[\sigma(z_i) = \frac{e^{\beta z_{i}}}{\sum_{j=1}^K e^{\beta z_{j}}} \ \ \ for\ i=1,2,\dots,K\]
\[T = \frac{1}{\beta}\]
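Substituting \(\beta = \frac{1}{T}\) into the formula above gives the more common "divide the logits by the temperature" form (a standard rewriting, added here for reference):

\[\sigma(z_i) = \frac{e^{z_{i}/T}}{\sum_{j=1}^K e^{z_{j}/T}} \ \ \ for\ i=1,2,\dots,K\]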
the higher the temperature, the more disorganized the system is
If we increase the temperature (and thus decrease \(\beta\)), the difference between the percentages will be reduced
Let's set the temperature to 5, and thus \(\beta = \frac{1}{5} = 0.2\)
Our 4 scores, multiplied by \(\beta = 0.2\), are transformed as follows:
\[\beta z_1 = 0.6\] \[\beta z_2 = 0.4\] \[\beta z_3 = 0.2\] \[\beta z_4 = 0.8\]
Our denominator will therefore be:
\(e^{0.6} + e^{0.4} + e^{0.2} + e^{0.8} = 6.760887185\)
And if we redo the calculations with these new numbers (intuitively, since the exponents are smaller, the exponentials are closer together, so the gaps between the probabilities shrink):
\(\sigma(3) = \frac{e^{0.6}}{6.760887185} = 0.27\)
\(\sigma(2) = \frac{e^{0.4}}{6.760887185} = 0.22\)
\(\sigma(1) = \frac{e^{0.2}}{6.760887185} = 0.18\)
\(\sigma(4) = \frac{e^{0.8}}{6.760887185} = 0.33\)
The new probabilities:
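The same sketch, dividing the logits by the temperature first, reproduces the flattened distribution (again a standard-library check, not the slides' own code):

```python
import math

def softmax_t(logits, temperature=1.0):
    # Temperature-scaled softmax: divide each logit by T (i.e. multiply
    # by beta = 1/T) before exponentiating, then normalize.
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [x / total for x in exps]

print([round(p, 2) for p in softmax_t([3, 2, 1, 4], temperature=5)])
# [0.27, 0.22, 0.18, 0.33] -- much flatter than [0.24, 0.09, 0.03, 0.64]
```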
smoothing…
And new words? Also possible!
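Concretely, "new words" become possible because generation samples from this distribution instead of always taking the argmax; at temperature 5 even "duck" (probability 0.18) comes up regularly. A minimal sampling sketch with the same illustrative numbers:

```python
import math
import random

tokens = ["nights", "laughter", "duck", "friends"]
logits = [3, 2, 1, 4]
T = 5

# Temperature-scaled softmax, as above.
exps = [math.exp(z / T) for z in logits]
total = sum(exps)
probs = [x / total for x in exps]

# Sampling instead of argmax: each token is drawn with its softmax
# probability, so low-scoring tokens appear from time to time.
print(random.choices(tokens, weights=probs, k=10))
# e.g. ['friends', 'nights', 'duck', 'friends', 'laughter', ...]
```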
A behavior that deviates from a normal probability distribution
Wrong question
GIVE THE F***(ORMAL) DEFINITION!