User talk:TJones (WMF)/Notes/Potential Applications of Natural Language Processing to On-Wiki Search

WordEmbeddings

Justin Ormont (talkcontribs)

Word embedding models, like en:fastText and en:GloVe, can be used to generate synonyms and similar words. When trained on text with many misspellings, they can also suggest spelling fixes.

Facebook's FAIR lab produced pre-trained models for 294 languages of Wikipedia (link). Because these are trained on Wikipedia text, they won't be very good at spelling mistakes, since the source material is very clean. It would be very interesting to train a fastText model on your query logs and see what the nearest-neighbor search in the word embedding space produces; see the sketch after the neighbor list below.

You can explore the nearest neighbor search by grabbing one of the models and running fastText:

./fasttext nn en.wiki.bin 50

Searching for "imbedding", the closest 50 words in the 300-dimensional word embedding space are:

Query word? imbedding

  1. imbeddings 0.941056
  2. embedding 0.880808
  3. embeddings 0.875705
  4. compactification 0.732301
  5. diffeomorphism 0.729409
  6. compactifying 0.729186
  7. antihomomorphism 0.726086
  8. compactifications 0.724407
  9. geometrization 0.721966
  10. biholomorphism 0.721854
  11. isomorphism 0.721106
  12. homeomorphism 0.719762
  13. homotopic 0.717359
  14. parametrization 0.717293
  15. parametrizations 0.716476
  16. injective 0.715966
  17. diffeomorphisms 0.715271
  18. automorphism 0.714177
  19. biholomorphisms 0.71407
  20. submanifold 0.711693
  21. antiholomorphic 0.711509
  22. topological 0.711504
  23. geometrizable 0.710431
  24. automorphisms 0.708235
  25. homeomorphisms 0.708069
  26. codimension 0.706777
  27. projective 0.7067
  28. generalizes 0.706284
  29. endomorphism 0.705661
  30. simplicial 0.705504
  31. reparametrizations 0.7055
  32. hypersurface 0.705288
  33. parametrizing 0.704711
  34. codimensional 0.704644
  35. reparametrization 0.703381
  36. quasitopological 0.703158
  37. nullhomotopic 0.703086
  38. quasiconformal 0.703035
  39. hypersurfaces 0.700519
  40. biholomorphic 0.69997
  41. antiautomorphism 0.699786
  42. geometrizes 0.699575
  43. submanifolds 0.699203
  44. compactified 0.69918
  45. conformal 0.699034
  46. embeddability 0.69899
  47. pseudoholomorphic 0.698393
  48. complexification 0.698191
  49. holomorphicity 0.698155
  50. nonsingularity 0.697529
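
To try this on the query logs directly, here's a minimal sketch using the official fasttext Python bindings (pip install fasttext). The input file queries.txt is a hypothetical one-query-per-line export of the search logs, and the hyperparameters are just the usual fastText defaults:

import fasttext

# Skip-gram with subword n-grams of 3-6 characters (the fastText
# defaults, also used for the pre-trained Wikipedia models); 100
# dimensions rather than the 300 of the pre-trained models keeps
# the experiment quick.
model = fasttext.train_unsupervised(
    "queries.txt", model="skipgram", dim=100, minn=3, maxn=6
)

# Nearest neighbors of a misspelling; trained on noisy query text,
# the top hits should include the intended spelling.
for score, word in model.get_nearest_neighbors("imbedding", k=10):
    print(word, score)
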
Smalyshev (WMF) (talkcontribs)

Sorry if I am asking something very obvious, but I understand that these models are based on word co-occurrence. For running text that makes a lot of sense, but queries are usually very short and frequently omit words. Would we have enough data in the query corpus to get good word relationships?

Also, fastText seems to split words into character n-grams, which should work OK with misspellings (at least ones that do not make the word completely unrecognizable).
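
As a rough illustration of that, here's a pure-Python sketch (assuming fastText's default 3-to-6-character n-grams and its < and > word-boundary markers) of how many subword n-grams the misspelling shares with the correct spelling:

def char_ngrams(word, n_min=3, n_max=6):
    # fastText pads each word with boundary markers before extracting
    # all character n-grams of length n_min..n_max.
    w = "<" + word + ">"
    return {w[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)}

a, b = char_ngrams("imbedding"), char_ngrams("embedding")
shared = a & b
# The two spellings differ only in their first letter, so most
# n-grams (everything drawn from "mbedding>") are shared, and the
# summed subword vectors end up close together.
print(len(shared), "of", len(a | b), "n-grams shared;",
      "Jaccard %.2f" % (len(shared) / len(a | b)))
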

Reply to "WordEmbeddings"

Should typo correction/detection have its own top-level bullet?

DCausse (WMF) (talkcontribs)

It's sometimes mentioned in the notes, but only as a possible outcome of some of the techniques you mention. Would it deserve its own bullet?

TJones (WMF) (talkcontribs)

David and I talked about this, and I should make it clearer that the order and the groupings aren't about what's more important or what we should work on. Putting "Spelling correction" under "Query rewriting" is just a way for me to organize all the information.

Reply to "Should typo correction/detection have its own top level bullet."
There are no older topics