Word embeddings

Word embeddings (word vectorization) are a technique for transforming a set of words into a set of vectors.

A vector is a series of numbers:

\[\begin{split} \begin{bmatrix} 5\\ 1\\ 8\\ 2\\ 0\\ \vdots \\ \end{bmatrix} \end{split}\]

We can picture a vector in a Cartesian space with two or more dimensions. For example, the vector:

\[\begin{split} \begin{bmatrix} 2\\ 3\\ \end{bmatrix} \end{split}\]

will be represented as:

[Figure: the vector (2, 3) drawn as an arrow in the two-dimensional Cartesian plane]
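
As a quick illustrative sketch (assuming matplotlib is available; this is not part of the original example), the vector (2, 3) can be drawn as an arrow starting at the origin:

import matplotlib.pyplot as plt

# Draw the vector (2, 3) as an arrow from the origin of a 2D Cartesian plane.
fig, ax = plt.subplots()
ax.quiver(0, 0, 2, 3, angles='xy', scale_units='xy', scale=1)
ax.set_xlim(0, 4)
ax.set_ylim(0, 4)
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.grid(True)
plt.show()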

But how can we turn words into vectors?

Let us imagine two sentences:

Literature and cinema are arts.

and

Tomatoes and zucchini are vegetables.

We then have a dictionary composed of the following words (after removing stopwords and lemmatizing):

mon_dictionnaire = ['literature', 'cinema', 'art', 'tomato', 'zucchini', 'vegetable']
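
The stopword removal and lemmatization could be done, for instance, with spaCy; this is only a sketch, assuming the small English model en_core_web_sm is installed, and the exact lemmas it produces may differ slightly from the hand-made list above.

import spacy

# Illustrative preprocessing sketch: keep only the lemmas of content words.
nlp = spacy.load('en_core_web_sm')
sentences = ["Literature and cinema are arts.", "Tomatoes and zucchini are vegetables."]
lemmas = [token.lemma_.lower()
          for sentence in sentences
          for token in nlp(sentence)
          if not token.is_stop and not token.is_punct]
print(lemmas)  # expected to give roughly the dictionary above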

We can represent it with a “one-hot encoding” as follows:

|            | literature | cinema | art | tomato | zucchini | vegetable |
|------------|------------|--------|-----|--------|----------|-----------|
| literature | 1          | 0      | 0   | 0      | 0        | 0         |
| cinema     | 0          | 1      | 0   | 0      | 0        | 0         |
| art        | 0          | 0      | 1   | 0      | 0        | 0         |
| tomato     | 0          | 0      | 0   | 1      | 0        | 0         |
| zucchini   | 0          | 0      | 0   | 0      | 1        | 0         |
| vegetable  | 0          | 0      | 0   | 0      | 0        | 1         |
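
In code, a minimal sketch of this encoding (standard library only; the one_hot helper below is just illustrative) could look like this:

mon_dictionnaire = ['literature', 'cinema', 'art', 'tomato', 'zucchini', 'vegetable']

# Illustrative helper: a 1 at the word's position in the dictionary, 0 everywhere else.
def one_hot(word, vocabulary):
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector

one_hot('cinema', mon_dictionnaire)
[0, 1, 0, 0, 0, 0]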

But this is not very useful; see a short introductory article here.

What we would like is something more meaningful… for example:

|           | literature | cinema | art | tomato | zucchini | vegetable |
|-----------|------------|--------|-----|--------|----------|-----------|
| culture   | 0.7        | 0.6    | 0.8 | 0.3    | 0.2      | 0.1       |
| food      | 0.2        | 0.3    | 0.1 | 0.8    | 0.8      | 0.9       |
| art       | 0.8        | 0.8    | 1   | 0.2    | 0.1      | 0.1       |
| vegetable | 0.1        | 0.1    | 0.1 | 0.8    | 0.8      | 1         |
| pleasure  | 0.6        | 0.8    | 0.6 | 0.4    | 0.4      | 0.4       |
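
To see why such a representation is more useful, here is a small illustrative sketch (assuming numpy; the toy_vectors values are simply the columns of the table above) that compares words with cosine similarity: culturally related words point in similar directions, while unrelated words do not.

import numpy as np

# Columns of the table above, as dense vectors (culture, food, art, vegetable, pleasure).
toy_vectors = {
    'literature': np.array([0.7, 0.2, 0.8, 0.1, 0.6]),
    'cinema':     np.array([0.6, 0.3, 0.8, 0.1, 0.8]),
    'tomato':     np.array([0.3, 0.8, 0.2, 0.8, 0.4]),
}

def cosine(u, v):
    # Cosine similarity: close to 1 for vectors pointing in the same direction.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

cosine(toy_vectors['literature'], toy_vectors['cinema'])   # ≈ 0.98: both cultural
cosine(toy_vectors['literature'], toy_vectors['tomato'])   # ≈ 0.55: different domains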

But how can we do that?

Let's first see what this might look like…

import gensim

# I load the Google model, limiting it to the first 100,000 words for memory reasons
model = gensim.models.KeyedVectors.load_word2vec_format('/home/marcello/Downloads/GoogleNews-vectors-negative300.bin', binary=True, limit=100000)  
model.most_similar('literature')
[('Literature', 0.5977999567985535),
 ('poetry', 0.5864308476448059),
 ('literary', 0.5674947500228882),
 ('scholarly', 0.5518804788589478),
 ('books', 0.5446845293045044),
 ('writings', 0.5330445170402527),
 ('manuscripts', 0.5211322903633118),
 ('humanities', 0.5207687020301819),
 ('journals', 0.5085226893424988),
 ('pamphlets', 0.5083926916122437)]
model.most_similar('cinema')
[('cinemas', 0.6968410015106201),
 ('Cinema', 0.6812142729759216),
 ('film', 0.6696189641952515),
 ('films', 0.6530312299728394),
 ('theater', 0.6515336036682129),
 ('movie', 0.631748378276825),
 ('filmmaking', 0.6314999461174011),
 ('multiplex', 0.6233758926391602),
 ('movies', 0.6107324957847595),
 ('multiplexes', 0.5891655683517456)]
model.most_similar('art')
[('visual_arts', 0.6560915112495422),
 ('artworks', 0.6340675354003906),
 ('artwork', 0.6317088603973389),
 ('printmaking', 0.6287328004837036),
 ('Art', 0.6284820437431335),
 ('art_gallery', 0.6192224025726318),
 ('arts', 0.6003197431564331),
 ('paintings', 0.578402578830719),
 ('artist', 0.570186197757721),
 ('portraiture', 0.5575430989265442)]
model.most_similar('vegetable')
[('vegetables', 0.7447986602783203),
 ('Vegetable', 0.6995902061462402),
 ('veggie', 0.6504706740379333),
 ('tomato', 0.6411334276199341),
 ('potato', 0.6291499137878418),
 ('edible', 0.6108502745628357),
 ('broccoli', 0.6093160510063171),
 ('tomatoes', 0.6031390428543091),
 ('sweet_potato', 0.5973166823387146),
 ('sweet_potatoes', 0.5968406796455383)]
model.similar_by_vector(model['Paris'] - model['France'] + model['Canada'])
[('Canada', 0.6761579513549805),
 ('Toronto', 0.6313107013702393),
 ('Montreal', 0.6235082149505615),
 ('Ottawa', 0.6182886362075806),
 ('Calgary', 0.5934532880783081),
 ('Winnipeg', 0.5831626653671265),
 ('Saskatoon', 0.5748216509819031),
 ('Vancouver', 0.5672988891601562),
 ('Edmonton', 0.5661673545837402),
 ('Canadian', 0.5638517141342163)]
model.similar_by_vector(model['Sarkozy'] - model['France'] + model['Canada'])
[('Sarkozy', 0.7502914667129517),
 ('Ignatieff', 0.7185558676719666),
 ('Prime_Minister_Stephen_Harper', 0.7165366411209106),
 ('Stephen_Harper', 0.6793030500411987),
 ('Duceppe', 0.6656790375709534),
 ('McGuinty', 0.6300479173660278),
 ('Doer', 0.6249755024909973),
 ('Michael_Ignatieff', 0.6148149967193604),
 ('Charest', 0.6127935647964478),
 ('Stephane_Dion', 0.6024216413497925)]
model.similar_by_vector(model['pizza'] - model['Italy'] + model['Canada'])
[('pizza', 0.6648130416870117),
 ('pizzas', 0.5082672238349915),
 ('donuts', 0.5068162679672241),
 ('Tim_Hortons', 0.5026943683624268),
 ('donut', 0.4815843999385834),
 ('burrito', 0.47208717465400696),
 ('sandwich', 0.4705365002155304),
 ('hamburger', 0.4596938192844391),
 ('burger', 0.4569506347179413),
 ('hotdogs', 0.4563262462615967)]
model.similar_by_vector(model['king'] - model['man'] + model['woman'])
[('king', 0.8449392914772034),
 ('queen', 0.7300516963005066),
 ('monarch', 0.6454660892486572),
 ('princess', 0.6156250834465027),
 ('crown_prince', 0.5818676948547363),
 ('prince', 0.5777117609977722),
 ('kings', 0.5613663196563721),
 ('sultan', 0.5376776456832886),
 ('queens', 0.5289887189865112),
 ('ruler', 0.5247419476509094)]
model.similar_by_vector(model['zucchini'] - model['vegetable'] + model['literature'])
[('literature', 0.6343990564346313),
 ('zucchini', 0.5891058444976807),
 ('poetry', 0.42693835496902466),
 ('dictionaries', 0.4173279106616974),
 ('dictionary', 0.4157540500164032),
 ('texts', 0.40229904651641846),
 ('humanities', 0.39698630571365356),
 ('journals', 0.3911541700363159),
 ('anthologies', 0.38955801725387573),
 ('poems', 0.3868855834007263)]

But how is this possible? Words are vectors… and these vectors express relations. But if we take a look at these vectors…

model['passion']
array([ 1.15722656e-01, -1.65039062e-01,  1.43554688e-01,  2.23388672e-02,
        1.12304688e-02, -1.00097656e-01,  2.12890625e-01, -4.64843750e-01,
       -1.45507812e-01, -6.49414062e-02, -1.67968750e-01, -1.19628906e-01,
        2.08007812e-01, -2.17773438e-01, -1.31835938e-01,  1.10839844e-01,
       -2.41699219e-02,  3.61328125e-01,  7.08007812e-02, -1.99218750e-01,
        3.24707031e-02, -6.25000000e-02,  4.37011719e-02,  1.35742188e-01,
       -1.66992188e-01,  1.00097656e-01,  3.80859375e-02, -3.95507812e-02,
       -7.51953125e-02, -1.04980469e-01, -6.39648438e-02,  1.77734375e-01,
        1.12792969e-01,  1.38671875e-01,  2.57812500e-01,  1.33789062e-01,
        4.51660156e-03,  4.78515625e-02,  7.42187500e-02,  1.10351562e-01,
        3.84765625e-01, -6.39648438e-02, -3.16406250e-01, -4.36401367e-03,
       -1.95312500e-01, -1.16699219e-01, -3.02734375e-02,  9.03320312e-02,
        1.17187500e-01,  8.59375000e-02, -1.15234375e-01,  2.23632812e-01,
       -1.48437500e-01, -8.10546875e-02,  1.26953125e-01,  4.17480469e-02,
       -4.78515625e-02, -2.03125000e-01, -2.33398438e-01, -5.24902344e-02,
        2.18750000e-01, -8.54492188e-02, -1.37695312e-01, -9.39941406e-03,
       -2.18750000e-01,  1.52343750e-01, -1.91406250e-01, -1.13281250e-01,
        7.56835938e-02, -3.83300781e-02, -3.22265625e-02, -1.50390625e-01,
        4.68750000e-02,  1.83593750e-01, -1.47460938e-01, -1.31835938e-01,
        9.96093750e-02, -9.96093750e-02,  7.56835938e-02,  1.98242188e-01,
        5.10253906e-02,  1.01074219e-01,  5.93261719e-02, -4.95605469e-02,
       -9.81445312e-02, -1.78222656e-02, -3.02734375e-01,  2.96875000e-01,
        1.81640625e-01, -7.32421875e-02,  9.27734375e-03,  4.34570312e-02,
       -3.47656250e-01,  5.43212891e-03, -2.44140625e-02, -9.71679688e-02,
       -8.78906250e-02,  1.61132812e-01, -2.92968750e-02, -6.98242188e-02,
       -1.29882812e-01,  3.28125000e-01,  6.07910156e-02,  1.09375000e-01,
        6.88476562e-02, -3.12500000e-02,  1.36718750e-01,  7.81250000e-02,
        2.98828125e-01, -1.74804688e-01,  3.95507812e-02,  3.73535156e-02,
       -1.02539062e-01,  2.75390625e-01,  3.88671875e-01,  3.10058594e-02,
       -1.00585938e-01, -2.48046875e-01, -1.13281250e-01,  4.19921875e-02,
       -8.00781250e-02, -2.42187500e-01,  4.05273438e-02,  9.08203125e-02,
       -2.40234375e-01, -8.39843750e-02,  3.29589844e-03,  9.52148438e-02,
       -2.36328125e-01,  3.02734375e-01, -1.92382812e-01, -2.87109375e-01,
       -2.63671875e-01,  5.63964844e-02,  7.37304688e-02,  3.26171875e-01,
       -5.07812500e-02, -3.06396484e-02, -1.09863281e-02,  2.13867188e-01,
       -6.78710938e-02, -4.29687500e-02, -4.29687500e-02, -1.28906250e-01,
       -1.73828125e-01,  2.80761719e-02, -2.70996094e-02,  2.50000000e-01,
        1.28746033e-04, -3.88671875e-01,  2.61718750e-01,  4.66796875e-01,
        1.69921875e-01,  3.30078125e-01, -3.88671875e-01,  1.25976562e-01,
        5.10253906e-02, -3.39843750e-01, -1.97265625e-01, -3.86718750e-01,
        1.25000000e-01,  5.17578125e-02, -2.41210938e-01,  4.27734375e-01,
        1.12792969e-01, -1.45874023e-02,  1.24511719e-01, -1.14746094e-01,
        1.03515625e-01,  3.29589844e-02,  1.12792969e-01,  1.52343750e-01,
       -2.34375000e-01, -1.87500000e-01, -1.12304688e-01,  1.16699219e-01,
       -1.26953125e-01, -1.25976562e-01, -7.03125000e-02, -4.31640625e-01,
       -3.67187500e-01,  3.75976562e-02,  2.30468750e-01, -1.14257812e-01,
       -1.97265625e-01, -6.88476562e-02, -1.34765625e-01, -3.06640625e-01,
        5.98144531e-03, -2.73437500e-02, -9.91210938e-02,  2.18750000e-01,
        1.80664062e-01,  3.43750000e-01, -7.08007812e-02,  7.71484375e-02,
        1.76757812e-01,  2.14843750e-01, -4.39453125e-02,  7.76367188e-02,
       -1.97265625e-01,  2.02636719e-02,  4.24804688e-02, -1.59179688e-01,
        1.17187500e-01, -1.32812500e-01, -2.17773438e-01,  6.88476562e-02,
        8.93554688e-02, -1.05468750e-01, -1.83593750e-01, -3.97949219e-02,
       -1.28173828e-02,  1.25000000e-01, -1.40625000e-01, -2.33154297e-02,
        1.29394531e-02,  2.79296875e-01, -1.97265625e-01, -6.44531250e-02,
       -2.63671875e-01, -3.12500000e-02, -9.22851562e-02,  1.41601562e-01,
       -2.65625000e-01, -8.44726562e-02,  9.22851562e-02, -1.85546875e-02,
        1.48437500e-01, -5.61523438e-03, -2.91015625e-01,  1.25976562e-01,
        8.25195312e-02,  1.72851562e-01,  1.90429688e-02, -1.29882812e-01,
       -1.77001953e-02, -1.83105469e-03, -6.93359375e-02, -1.75781250e-01,
        3.18908691e-03, -1.18652344e-01, -4.41406250e-01,  3.32031250e-01,
        8.25195312e-02,  3.24218750e-01,  1.43554688e-01, -1.28906250e-01,
        1.65039062e-01, -2.98828125e-01,  2.42187500e-01,  3.08593750e-01,
        4.10156250e-02,  1.70898438e-01,  2.23632812e-01, -1.32812500e-01,
       -5.93261719e-02, -1.36718750e-01, -2.69531250e-01, -3.32031250e-02,
       -2.49023438e-02, -7.32421875e-02, -1.23535156e-01,  4.06250000e-01,
       -2.79296875e-01,  8.25195312e-02,  1.03027344e-01, -6.34765625e-02,
        2.71484375e-01, -5.39550781e-02, -3.73535156e-02, -5.34667969e-02,
        4.29687500e-02,  2.48046875e-01, -8.44726562e-02, -6.44531250e-02,
       -3.18359375e-01,  1.02539062e-01, -1.78710938e-01, -6.44531250e-02,
        2.75390625e-01, -1.38671875e-01,  2.41210938e-01,  2.56347656e-02,
        8.23974609e-03,  1.05957031e-01, -8.44726562e-02,  4.51171875e-01,
        6.77490234e-03,  2.75390625e-01, -7.71484375e-02, -5.34667969e-02,
       -3.55468750e-01, -4.41406250e-01, -5.93261719e-02, -2.91015625e-01,
        3.51562500e-01, -9.61914062e-02, -1.05468750e-01,  1.47705078e-02],
      dtype=float32)
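
These 300 numbers say little on their own; what matters is how they relate to the other vectors. The scores returned by most_similar and similar_by_vector are essentially cosine similarities between such raw vectors. A rough sketch (assuming numpy, and using only words that the queries above already showed to be in the loaded model):

import numpy as np

def cosine(u, v):
    # Cosine of the angle between two raw word vectors.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

cosine(model['passion'], model['art'])       # expected to be noticeably higher...
cosine(model['passion'], model['zucchini'])  # ...than this unrelated pair
model.similarity('passion', 'art')           # gensim computes the same cosine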