Yorùbá Word Vector Representation with fastText
I'm a native speaker of Yorùbá. One of the interesting things about the language is that words may share the same base spelling yet mean entirely different things; their spelling and intonation are differentiated with diacritic marks.
For example:
- ogún: twenty
- ògùn: medicine
- ògún: deity of iron
I'd like to know what interesting things I'll find when Yorùbá words' vectors are learned using a word embedding method like fastText. Luckily fastText has a repository of pretrained vectors in different languages.
What is fastText?
fastText is a library for learning word embeddings and text classification, created by Facebook's AI Research (FAIR) lab. It provides both unsupervised and supervised learning algorithms for obtaining vector representations of words, and Facebook makes pretrained models available for 294 languages. fastText learns the embeddings with a neural network.
Data and Setup
The sample data for this article was downloaded from Niger-Volta-LTI GitHub repository.
!pip install fasttext gensim
!wget https://raw.githubusercontent.com/Niger-Volta-LTI/yoruba-text/master/Iroyin/yoglobalvoices.txt
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.yo.300.bin.gz
!gunzip cc.yo.300.bin.gz
Comparing Two Models
For this article, we'll compare the performance of two models:
- trained_model: trained on the sample data (about 27,400 tokens)
- pretrained_model: trained on Common Crawl and Wikipedia using fastText
The pretrained model is expected to perform much better, with higher confidence scores, because it was trained on a far larger corpus than the one used for the trained model.
Finding the Odd Word Out
Let's attempt to use the models to pick the odd word out.
Test 1: Ìjọba (Government), Ìbílẹ (Local), Abùjá (Nigeria's Capital), ọkọ (Husband)
trained_model.wv.doesnt_match(["Ìjọba", "Ìbílẹ", "Abùjá", "ọkọ"])
→ 'Ìbílẹ'
pretrained_model.wv.doesnt_match(["Ìjọba", "Ìbílẹ", "Abùjá", "ọkọ"])
→ 'ọkọ'
Test 2: Ìjọba (Government), ìyàwó (Wife), Ọmọ (Child), ọkọ (Husband)
trained_model.wv.doesnt_match(["Ìjọba", "ìyàwó", "Ọmọ", "ọkọ"])
→ 'ìyàwó'
pretrained_model.wv.doesnt_match(["Ìjọba", "ìyàwó", "Ọmọ", "ọkọ"])
→ 'Ìjọba'
For both word sets, the pretrained_model picked the odd word out correctly, while the trained_model did not.
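Under the hood, doesnt_match roughly works by averaging the unit-normalized word vectors and returning the word least similar to that mean. A numpy sketch with made-up 3-d vectors (the real models use 300 and 50 dimensions):

```python
import numpy as np

def doesnt_match(words, vectors):
    """Return the word least similar to the mean of the group's unit vectors."""
    unit = {w: v / np.linalg.norm(v) for w, v in vectors.items()}
    mean = np.mean([unit[w] for w in words], axis=0)
    # Cosine similarity of each word to the group mean (the mean is not
    # re-normalized, which does not change the ranking)
    sims = {w: float(unit[w] @ mean) for w in words}
    return min(sims, key=sims.get)

# Made-up vectors: the three "family" words cluster along one direction,
# the political word Ìjọba points elsewhere
vectors = {
    "ìyàwó": np.array([1.0, 0.1, 0.0]),
    "Ọmọ":   np.array([0.9, 0.2, 0.1]),
    "ọkọ":   np.array([1.0, 0.0, 0.2]),
    "Ìjọba": np.array([0.0, 1.0, 0.9]),
}
print(doesnt_match(["Ìjọba", "ìyàwó", "Ọmọ", "ọkọ"], vectors))  # → Ìjọba
```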
Word Vector Arithmetic
I wanted to see whether either model would predict ìyá (Mother) when given the query:
most_similar(positive=['ọkùnrin', 'bàbá'], negative=['obìrin'])
where ọkùnrin is man, bàbá is father, and obìrin is woman.
The trained model didn't do a fantastic job at this, but the pretrained model made a good attempt by predicting words like:
- Omobìnrin: Young woman
- arákùnrin: Young man
- aránbìnrin: Young woman
I later found out a spelling mistake—'obìrin' was written instead of 'obìnrin'.
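This kind of query is just vector arithmetic: most_similar adds the positive vectors, subtracts the negative ones, and ranks the remaining vocabulary by cosine similarity to the result. A toy numpy sketch with made-up 2-d vectors, where axis 0 encodes gender and axis 1 parenthood (here the arrangement bàbá + obìnrin − ọkùnrin recovers ìyá):

```python
import numpy as np

# Made-up 2-d vectors: axis 0 ≈ gender, axis 1 ≈ parenthood
vocab = {
    "ọkùnrin": np.array([ 1.0, 0.0]),  # man
    "obìnrin": np.array([-1.0, 0.0]),  # woman
    "bàbá":    np.array([ 1.0, 1.0]),  # father
    "ìyá":     np.array([-1.0, 1.0]),  # mother
    "ọkọ":     np.array([ 1.0, 0.5]),  # husband
}

def analogy(positive, negative):
    """Nearest word (excluding the query words) to sum(positive) - sum(negative)."""
    target = sum(vocab[w] for w in positive) - sum(vocab[w] for w in negative)
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    candidates = {w: cos(v, target) for w, v in vocab.items()
                  if w not in positive and w not in negative}
    return max(candidates, key=candidates.get)

print(analogy(positive=["bàbá", "obìnrin"], negative=["ọkùnrin"]))  # → ìyá
```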
Finding Most Similar Words
For the word ọkọ (husband), it's interesting how confident the trained model is about all of its wrong predictions. The characters in the word have many diacritic variations in Yorùbá. I can see why the pretrained model would assume it's related to ojú-omi, since canoe/boat is 'ọkọ̀ ojú-omi'.
Comparing Similarity
To compare similarity, and to exercise my command of the Yorùbá language, I tested some rare words.
aṣo is the contemporary word for cloth, but it can also be called ẹ̀wù.
trained_model.wv.similarity('aṣo','ẹ̀wù')
→ KeyError: 'all ngrams for word ẹ̀wù absent from model'
pretrained_model.wv.similarity('aṣo','ẹ̀wù')
→ 0.0286972
The pretrained model assigned a similarity score, but a very low one for two words that are synonyms. This is understandable: 'ẹ̀wù' is not a common word in text or everyday use, so it is difficult for the model to learn its relationship to 'aṣo'.
bùbá (Top) and ṣòkòtò (Trousers) are words that appear together often:
trained_model.wv.similarity('bùbá','ṣòkòtò')
→ 0.0588443
pretrained_model.wv.similarity('bùbá','ṣòkòtò')
→ 0.499416
The trained model calculated a low similarity score because its training data is small and not diverse. The pretrained model did a better job, producing a reasonable similarity score.
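For reference, wv.similarity is just the cosine of the angle between the two word vectors. A numpy sketch with made-up 3-d vectors (near-parallel vectors score close to 1, unrelated ones near 0):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b, as wv.similarity computes."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up vectors pointing in roughly the same direction
bùbá   = np.array([0.9, 0.1, 0.3])
ṣòkòtò = np.array([0.8, 0.2, 0.4])
print(round(cosine_similarity(bùbá, ṣòkòtò), 3))  # → 0.984
```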
Conclusion
I agree that a word is known by the company it keeps, and I know vector representation is mathematics, but I'm relieved it works well for my language too.
I hope native speakers write more in Yorùbá and add the proper intonations as much as they can.