Opinion Mining using the UCI Drug Review Dataset (Part 1)
Data Loading and Pre-processing using Python's Pandas, VADER, NLTK and other random stuff
Earlier in the week, I was searching for Opinion Mining datasets other than the usual IMDB or Yelp reviews, and voilà! I found the Drug Review Dataset on the UCI Machine Learning Repository.
Yes, I'm a Pharmacist. It's nice to work with a dataset that I can relate to and actively engage with.
Tutorial Goals
When I started learning Data Science, I always hoped for a tutorial that showed many parts of data pre-processing, working with Notebooks (a Google Colaboratory Notebook was used for this article), and the many small things a newbie might need.
This dataset is not the 'dirtiest'. It comes as tidy '.tsv' files, is already divided into train and test sets, and has no null values. Still, building a working model requires further processing and wrangling, which I hope we will do together.
Loading Data
The first step is to load the data. There are several ways to load data into a Notebook:
Uploading it directly through the File upload icon,
Downloading it with '!wget url',
Mounting Google Drive (there is a pre-saved code snippet in Colaboratory for this).
For this exercise, I used '!wget' to download the zipped file and '!unzip' to decompress it, then read the files into Pandas DataFrames.
I combined the two unzipped files (Test and Train) with pd.concat. This gives a larger dataset (we will split it again later, maybe in another proportion) and lets us preprocess everything together.
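The loading and combining steps could look roughly like the sketch below. The helper name `load_reviews` and the two made-up rows are assumptions for illustration only. In the notebook you would pass the paths of the two unzipped '.tsv' files instead of the in-memory stand-ins.

```python
import io

import pandas as pd

# In Colab the download step would look like this (URL and archive name left as placeholders):
#   !wget <dataset-zip-url>
#   !unzip <downloaded-archive>.zip


def load_reviews(train_source, test_source):
    """Read the train and test TSV files and stack them into one DataFrame."""
    train = pd.read_csv(train_source, sep="\t")
    test = pd.read_csv(test_source, sep="\t")
    # ignore_index=True gives the combined frame a fresh 0..n-1 index
    return pd.concat([train, test], ignore_index=True)


# Tiny in-memory stand-ins for the real files (made-up rows, not from the dataset)
train_tsv = io.StringIO("drugName\treview\trating\nDrugA\tWorked well\t9\n")
test_tsv = io.StringIO("drugName\treview\trating\nDrugB\tNo change\t5\n")

combined = load_reviews(train_tsv, test_tsv)
print(combined.shape)  # (2, 3)
```

Resetting the index matters here because both files start counting rows from zero, so a plain concatenation would leave duplicate index labels.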
Data Inspection and Preparation
After inspecting the DataFrames, the columns were renamed.
A new DataFrame was created containing just the Id, review, and rating columns.
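A minimal sketch of the renaming and column selection follows. It assumes the raw file's first column has a blank header, which Pandas reads as "Unnamed: 0". That column name and the toy rows are assumptions, so check `df.columns` on the real data first.

```python
import pandas as pd

# Toy frame standing in for the combined data (made-up rows, assumed column names)
raw = pd.DataFrame({
    "Unnamed: 0": [101, 102],
    "drugName": ["DrugA", "DrugB"],
    "review": ["Worked well", "No change"],
    "rating": [9, 5],
})

# Give the unnamed identifier column a readable name
renamed = raw.rename(columns={"Unnamed: 0": "Id"})

# Keep only the columns needed for sentiment analysis; .copy() avoids
# chained-assignment warnings when new columns are added later
reviews = renamed[["Id", "review", "rating"]].copy()
print(list(reviews.columns))  # ['Id', 'review', 'rating']
```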
Sentiment Analysis with VADER
The vaderSentiment library was installed with pip.
Stopwords were downloaded from NLTK. You can create your own list of stopwords or use another precompiled list.
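Here is a small sketch of the "create your own list" option. The helper `remove_stopwords` and the hand-made word list are hypothetical, and the comments show how NLTK's precompiled English list could be swapped in. The article does not say at which step the stopwords are applied, so treat this only as an illustration of the filtering itself.

```python
def remove_stopwords(text, stop_words):
    """Return text with every token found in stop_words dropped (case-insensitive)."""
    kept = [tok for tok in text.split() if tok.lower() not in stop_words]
    return " ".join(kept)


# A small hand-made list, purely illustrative. To use NLTK's list instead:
#   import nltk
#   nltk.download("stopwords")
#   from nltk.corpus import stopwords
#   stop_words = set(stopwords.words("english"))
custom_stop_words = {"the", "a", "an", "and", "of", "to", "for"}

print(remove_stopwords("The pain went away after a week", custom_stop_words))
# pain went away after week
```

Note that NLTK's English list includes negations such as "not" and "no", so it is worth reviewing any list before applying it to text that will be scored for sentiment.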
vaderSentiment's SentimentIntensityAnalyzer was used to generate the compound sentiment polarity scores based on the review.
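The scoring step might be sketched as follows. The helper names `make_analyzer` and `add_vader_scores` are hypothetical. The output column name vaderReviewScore follows the article, and `polarity_scores` is the analyzer method that returns the 'compound' value.

```python
import pandas as pd


def make_analyzer():
    # Imported here so the helper below can also be exercised with a stand-in analyzer
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
    return SentimentIntensityAnalyzer()


def add_vader_scores(df, analyzer, text_col="review", out_col="vaderReviewScore"):
    """Return a copy of df with the VADER compound score (-1 to 1) for each review."""
    scored = df.copy()
    scored[out_col] = scored[text_col].map(
        # polarity_scores returns a dict with 'neg', 'neu', 'pos' and 'compound'
        lambda text: analyzer.polarity_scores(str(text))["compound"]
    )
    return scored


# Usage, after `pip install vaderSentiment`:
#   reviews = add_vader_scores(reviews, make_analyzer())
```

Creating the analyzer once and reusing it for every row avoids reloading the lexicon for each review.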
Sentiment Labeling
The sentiment polarity scores were grouped into 3 polarity labels. Many analyses are based on binary (positive or negative), but I chose the (positive, neutral and negative) labels because in practice, people can be neutral about their medications and I would like that to be learned by the model.
Readers can choose binary polarity labels instead: the vaderReviewScore can be regrouped into just two labels and mapped as positive or negative.
The punctuation in the review was not removed because VADER was designed to be able to analyze social media data, and factors like exclamation marks, emoticons and special characters are considered in calculating the polarity scores.
The vaderReviewScore column was mapped to numeric sentiment codes (2 for positive, 1 for negative, 0 for neutral) in a new vaderSentiment column.
The vaderSentiment codes were then labelled positive, negative, or neutral in a vaderSentimentLabel column.
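The mapping from score to code to label can be sketched like this. The cut-offs of +0.05 and -0.05 are the thresholds commonly used with VADER's compound score and are an assumption here, since the article does not state which values were used. The helper `score_to_sentiment` and the toy scores are likewise illustrative.

```python
import pandas as pd


def score_to_sentiment(score, threshold=0.05):
    """Map a compound score to 2 (positive), 1 (negative) or 0 (neutral)."""
    if score >= threshold:
        return 2
    if score <= -threshold:
        return 1
    return 0


# Numeric code -> human-readable label, matching the coding described above
LABELS = {2: "positive", 1: "negative", 0: "neutral"}

# Made-up scores, one per class
scores = pd.DataFrame({"vaderReviewScore": [0.80, -0.40, 0.01]})
scores["vaderSentiment"] = scores["vaderReviewScore"].map(score_to_sentiment)
scores["vaderSentimentLabel"] = scores["vaderSentiment"].map(LABELS)
print(scores["vaderSentimentLabel"].tolist())  # ['positive', 'negative', 'neutral']
```

For the binary variant, a single cut-off (for example at 0) would replace the two thresholds, and the neutral code would disappear.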
Analyzing Ratings vs. Reviews
A similar analysis was carried out on the rating. Sentiment generated from review is considered implicit while the rating is considered explicit. It is not unusual for their polarity labels to be different.
For example, someone may review a particular item in a positive tone but still give it an average rating, while some can review an item in a seemingly negative tone and rate the product highly.
This is why it is important to infer sentiment from the review text (using libraries like VADER), even though analyzing the star rating values is easier and more straightforward.
I hope to analyze the data generated from both values and see how it goes.
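One way to put the explicit and implicit labels side by side is sketched below. The rating cut-offs (7 and above positive, 4 and below negative, on a 1 to 10 scale) are assumptions chosen for illustration, as are the column name ratingSentimentLabel and the toy rows.

```python
import pandas as pd


def rating_to_label(rating, low=4, high=7):
    """Map a 1-10 star rating to a polarity label (cut-offs are assumed)."""
    if rating >= high:
        return "positive"
    if rating <= low:
        return "negative"
    return "neutral"


# Made-up rows: ratings next to labels that would come from the VADER step
df = pd.DataFrame({
    "rating": [9, 5, 2, 10],
    "vaderSentimentLabel": ["positive", "positive", "negative", "negative"],
})
df["ratingSentimentLabel"] = df["rating"].map(rating_to_label)

# Flag rows where the explicit (rating) and implicit (review) labels match
df["labelsAgree"] = df["ratingSentimentLabel"] == df["vaderSentimentLabel"]
print(df["labelsAgree"].tolist())  # [True, False, True, False]

# Cross-tabulation shows where the two views diverge
print(pd.crosstab(df["ratingSentimentLabel"], df["vaderSentimentLabel"]))
```

The last toy row mirrors the case described above: a review that reads as negative paired with a top rating.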
Saving Processed Data
I showed ways to save the pre-processed data both as CSV and gzip files.
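A minimal sketch of both save options, using `DataFrame.to_csv` with and without gzip compression, is shown below. The file names and the temporary directory are placeholders, and the rows are made up.

```python
import os
import tempfile

import pandas as pd

# Made-up rows standing in for the processed data
df = pd.DataFrame({
    "Id": [1, 2],
    "review": ["Worked well", "No change"],
    "rating": [9, 5],
})

# Placeholder output location; in Colab this could be a mounted Drive folder
out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, "processed_reviews.csv")
gz_path = os.path.join(out_dir, "processed_reviews.csv.gz")

# index=False keeps the row index out of the saved file
df.to_csv(csv_path, index=False)
df.to_csv(gz_path, index=False, compression="gzip")

# read_csv infers gzip compression from the .gz extension
restored = pd.read_csv(gz_path)
print(restored.shape)  # (2, 3)
```

The gzip version is usually much smaller for review text, at the cost of not being directly readable in a text editor.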
Conclusion
I hope I did a fantastic job taking you through this process!
Code on GitHub: https://github.com/WuraolaOyewusi/Opinion-Mining-using-the-UCI-Drug-Review-Dataset