Simple Natural Language Processing Projects for Health Sciences
Part 2: How to train fastText Word Embedding on a Standard Treatment Guideline
This tutorial is the second in a series for people in the health sciences doing Natural Language Processing. The principles are general and anyone can follow the train of thought; only the datasets are health-sciences specific.
In this tutorial we'll:
- Train fastText word embedding on Nigeria's 2008 standard treatment guidelines
- See the effect of training fastText embedding models for three different numbers of epochs (10, 20, 30) on this particular text; the default is 5
- Find semantically similar words generated based on the trained models
- Use the trained models to pick out the word that does not belong in a series
- Calculate the similarity between two words
- Create a visualization of some semantically similar words after dimensionality reduction with PCA
What are Standard Treatment Guidelines?
Standard treatment guidelines (STGs) list the preferred pharmaceutical and nonpharmaceutical treatments for common health problems experienced by people in a specific health system. As such, they represent one approach to promoting therapeutically effective and economically efficient prescribing.
— Management Sciences for Health and World Health Organization, 2007. Drug and Therapeutics Committee Training Course
What is fastText?
fastText is a library for learning word embeddings and text classification, created by Facebook's AI Research (FAIR) lab. It supports both unsupervised and supervised learning for obtaining vector representations of words. Facebook makes pretrained models available for 294 languages. fastText uses a neural network for word embedding.
Data Preparation
The guideline was downloaded online, the index and appendices were removed, and the text was extracted from the PDF using Python Tika. The extracted text was converted from a list to a string, lowercased, and stripped of punctuation.
Stop words were not removed and words were not lemmatized (I have not seen a good, easy-to-use lemmatizer for medical words yet; I'm not ready for words like gastritis to come out as gastiti).
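The cleaning steps above can be sketched as follows. This is a minimal sketch: the PDF filename is an assumption, and the Tika extraction (shown commented out) requires the `tika` package plus a Java runtime, so only the lowercase-and-strip-punctuation step is run here. Note that digits are deliberately kept, since doses and strengths matter in clinical text.

```python
import string

# Extraction with Python Tika would look like this (assumed filename):
# from tika import parser
# text = parser.from_file("nigeria_stg_2008.pdf")["content"]

def clean_text(text: str) -> str:
    """Lowercase and strip punctuation; digits are kept (doses matter)."""
    text = text.lower()
    return text.translate(str.maketrans("", "", string.punctuation))

sample = "Give Ciprofloxacin 500 mg, every 12 hours."
print(clean_text(sample))  # give ciprofloxacin 500 mg every 12 hours
```

Stop-word removal and lemmatization are intentionally absent, matching the preprocessing described above.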
Training the Models
The three models were trained using the original fastText library; the only varying factor was the number of epochs (10, 20, 30). The time taken for each training run is noted in the code snippet. The models were saved as .bin files.
For ease of use, the models were loaded using gensim's FastText Implementation.
Finding Semantically Similar Words
Check the top five words similar to "dizziness"; it's a common medical symptom and side effect, so we expect it to be predicted as similar to other symptoms and words like drowsiness, fainting, maybe headache.
All three models did a good job, predicting "drowsiness" and "syncope". The 10 epochs model had the highest confidence in its predictions, but the 20 epochs model surfaced more related words such as "syncope" and "lethargy". The 30 epochs model made good predictions too, but training up to 30 epochs brought no improvement in performance.
For 10 epochs model: [('drowsiness', 0.958), ('headache', 0.936), ('dryness', 0.920), ('headaches', 0.908), ('nausea', 0.898)]
For 20 epochs model: [('drowsiness', 0.812), ('shortness', 0.726), ('syncope', 0.725), ('dryness', 0.687), ('lethargy', 0.661)]
For 30 epochs model: [('drowsiness', 0.657), ('shortness', 0.640), ('dryness', 0.567), ('syncope', 0.550), ('weakness', 0.550)]
Results for Different Medical Terms
For the word "pain", the most similar words "pains" and "painless" appeared as expected. Had the words been lemmatized, they would probably have been reduced to the root word "pain". The 20 epochs model captured words like "painful" and "complaints". Cool. The 30 epochs model didn't give an extraordinary outcome.
For "ciprofloxacin", the models were smart enough to give other antibiotics as similar words; the 20 epochs model picked up details like associating "ciprofloxacin" with part of its dosing regimen, "every12". While preprocessing this text, numbers were retained because doses and medication strengths are very important in clinical texts.
In this example, I misspelt "gonorrhoea" as "gonorhhoea". All the models suggested the right spelling as an option; the 30 epochs model even suggested the genus name "Neisseria".
Finding Words That Don't Match
For words that do not match, I tried ["paracetamol", "headache", "diarrhoea", "dizziness"]; paracetamol is the only drug, all the others are symptoms. All three models predicted paracetamol as the odd one out.
All the drugs on this list are antihypertensives: ["Hydrochlorothiazide", "Furosemide", "Amlodipine"], but the first two are diuretics and the third is a calcium channel blocker. All three models predicted the third as the odd one out.
Similarity Calculations
The similarity between "drowsiness" and "dizziness" was predicted to be 0.95 by the 10 epochs model. I'll agree, they are close in meaning and use.
The similarity between "drowsiness" and "amlodipine" is about 0.53 by the 10 epochs model. Drowsiness is a well-known side effect of amlodipine.
Visualization After Dimensionality Reduction
For visualization, I generated the top similar words for 10 words, then reduced the vectors from the 300 dimensions used in model training to two principal components using scikit-learn's Principal Component Analysis (PCA) implementation.
Conclusion
I hope you had a great time working through this. As with all machine learning tasks, more quality data means better performance. The text used here is high quality because it's a standard medical guideline put together by professionals. To improve model performance, more data could be added, perhaps from other medical references such as pharmacopoeias and drug formularies. Other parameters, like the learning rate and minimum word count, can also be tweaked.
For this task you will agree with me that the model with 10 epochs did just fine, so save your compute.
Happy Holidays!