Exploring Topic Modelling with Gensim on the Essential Science Indicators Journals List
The Dataset
As stated on the Thomson Reuters website, "Essential Science Indicators is a unique compilation of performance statistics and trends extrapolated from counts of articles published in scholarly journals and the citations to those articles." For this article, only the names of the journals listed will be used. There are more than 11,000 instances in the data.
The dataset can be found here.
Tools and Models
Gensim is a Python library that is optimized for topic modelling. I will be using the Latent Dirichlet Allocation (LDA), Latent Semantic Indexing (LSI), and Hierarchical Dirichlet Process (HDP) models.
Coherence will be used as the metric of comparison between the topic models.
Import and Inspect Data
Data Pre-Processing
Since this dataset is a list of journals, the word "journal" appears very frequently, so I wrote a function to filter it out and applied it to the DataFrame with pandas' .apply and a lambda. The function returned a list of words, which caused errors because the next step expects strings rather than lists, so I joined each returned list back into a string.
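A minimal sketch of that step, assuming a DataFrame with a hypothetical `title` column (the column name and sample rows are illustrative, not the actual dataset):

```python
import pandas as pd

# Illustrative stand-in for the ESI journal list.
df = pd.DataFrame({"title": ["Journal of Applied Physics",
                             "Nature",
                             "Journal of Cell Biology"]})

def remove_journal_words(title):
    """Drop 'journal'/'journals' tokens, then rejoin into a single
    string, since the next pre-processing step expects strings."""
    kept = [w for w in title.split() if w.lower() not in ("journal", "journals")]
    return " ".join(kept)

df["title"] = df["title"].apply(lambda t: remove_journal_words(t))
```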
Gensim has tidy, ready-made modules for most text pre-processing needs (I will definitely use this library more!). When I applied simple_preprocess from gensim.utils, it was simple indeed, but it didn't do the whole job. When I applied preprocess_string from gensim.parsing.preprocessing, it overdid the job. Then I found out from the documentation how to exclude the filters I didn't want, and that is what I did here.
Feature Engineering
Then some feature engineering was done: a dictionary mapping each unique token to an integer id was created, and each document was then converted into a bag of words. Both steps are readily available in gensim too.
I defined a simple custom function to pre-process test documents.
The LDA Model
The dataset contains 22 unique categories, but after trying different numbers of topics, I found the models performed better with just 20.
Testing the LDA Model
For the first test document, I made up a random journal name to see whether the model could predict the probability of that name belonging to a group of topics.
For the first test document, the model predicted the name most likely belongs to topic 4 (index 3) with a probability of 0.35, which I agree with. The second document gave a similar result. So the model actually learnt something, which is a good sign.
Interactive Visualization with pyLDAvis
Here is an interactive visualization of the LDA model using pyLDAvis. You should try this in a notebook; it is insightful and fun to play around with the interactive chart.
Generate a coherence model and value for LDA:
Latent Semantic Indexing (LSI) Model
Testing LSI Model
The LSI model's predictions aren't as sharp as LDA's for test document 1; it performed better with test_doc2. Comparing coherence values will be a better metric.
Coherence value for the LSI model. Its lower coherence compared to LDA matches what the test documents suggested.
Hierarchical Dirichlet Process (HDP) Model
Testing the HDP Model
The HDP model has the highest coherence value. I agree with this, because the keywords in the topics produced by HDP are much more varied than those from the other models. I think this model learnt better than LDA and LSI did.
Coherence Comparison
Visualizing the coherence comparison of the three models:
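A simple bar chart is enough for this comparison. The scores below are placeholders consistent with the ordering described above (LSI lowest, HDP highest); substitute the actual coherence values computed earlier:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

# Placeholder scores; replace with the computed coherence values.
scores = {"LDA": 0.46, "LSI": 0.40, "HDP": 0.58}

bars = plt.bar(list(scores.keys()), list(scores.values()))
plt.ylabel("Coherence")
plt.title("Coherence comparison of LDA, LSI and HDP")
plt.show()
```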
Conclusion
I had a great time exploring gensim. Pre-processing text data with its modules was remarkably easy, and pandas' .apply(lambda) didn't fail me. I wish there were a pyLSIvis and a pyHDPvis, like pyLDAvis. There is much more that could be done with this dataset, and I hope to do more.
Check out the complete code on GitHub.
Thanks for reading.