SENTIMENT ANALYSIS: COMPARING TWO MODELS
Text Mining Classification With Movie Reviews
These are the results of a sentiment analysis of a data set of over 150,000 movie reviews. The analysis compares two models: a support vector machine model and a multinomial naive Bayes model. Each model was tuned to improve accuracy using unigram and bigram feature sets, along with cross-validation.
The objective was to select the model that best fits the data set, as well as to determine which specific words signal a positive or negative experience when watching movies.
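For readers unfamiliar with cross-validation, the idea is to split the data into k folds and let each fold serve once as the held-out test set. The following is a minimal sketch of that split, not the code used in this analysis; the function name k_fold_indices and the choice of k=5 are assumptions made for illustration.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs that partition range(n_samples) into k folds."""
    # Spread any remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# Toy usage: 10 reviews split into 5 folds of 2 reviews each.
folds = list(k_fold_indices(10, k=5))
```

Averaging a model's accuracy over the k held-out folds gives a steadier estimate than a single train/test split. In practice the reviews would usually be shuffled before splitting.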
RESULTS
Multinomial Naïve Bayes
A unigram feature set and a bigram feature set are compared in this section. Each has a confusion matrix image showing how accurately the model predicted each review's rating. Reviews were rated from 0 to 4, with 0 being a very negative review and 4 being a very positive review.

UNIGRAM
Confusion Matrix
The first method used a unigram feature set with the multinomial naive Bayes (MNB) model. With unigrams, each individual word is treated as its own feature, independent of the words around it.
This model scored an accuracy of 60.4%.

BIGRAM
Confusion Matrix
This method added bigrams (pairs of adjacent words) to the feature set while keeping the unigrams.
This model scored an accuracy of 59.4%.
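To make the difference between the two feature sets concrete, here is a minimal sketch of counting unigrams and bigrams in a single review. It assumes simple lowercasing and whitespace tokenization, and the function name ngram_features is hypothetical; the tokenizer actually used in this analysis is not stated.

```python
from collections import Counter


def ngram_features(text, max_n=2):
    """Count every n-gram of length 1..max_n in a whitespace-tokenized review."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Unigrams only, then unigrams plus bigrams, for the same toy review.
unigrams = ngram_features("a dull and long movie", max_n=1)
both = ngram_features("a dull and long movie", max_n=2)
```

Adding bigrams enlarges the feature set considerably (here from 5 features to 9), which gives the model more context but also many more rare features to estimate.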

TOP WORDS: UNIGRAM
These are the top words related to very positive and very negative reviews, ranked by their log probabilities.
Very Negative:
rbb, long, action, time, dull, worst, characters, minutes, comedy, bad
Very Positive:
movies, work, performance, great, performances, comedy, well, good, funny, best
Words pertaining to duration (long, time, minutes) seem to be linked to bad reviews, while funny movies with good performances seem to be linked to good reviews. The word comedy showed up in both lists, perhaps because many of the reviews were for comedy movies. Taken together, the lists give a rough picture of what reviewers react to.
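The ranking by log probability can be sketched as follows. This is a toy multinomial naive Bayes with Laplace smoothing written from scratch, not the implementation used in this analysis; the names train_mnb and top_words, the smoothing value alpha=1.0, and the toy reviews are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_mnb(docs, labels, alpha=1.0):
    """Return {label: {word: log P(word | label)}} using Laplace smoothing."""
    counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        counts[label].update(doc.lower().split())
    vocab = {word for c in counts.values() for word in c}
    log_prob = {}
    for label, c in counts.items():
        total = sum(c.values()) + alpha * len(vocab)
        log_prob[label] = {w: math.log((c[w] + alpha) / total) for w in vocab}
    return log_prob


def top_words(log_prob, label, k=3):
    """Return the k words with the highest log probability for a label."""
    # Ties are broken alphabetically so the ranking is deterministic.
    ranked = sorted(log_prob[label].items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]


# Toy data: label 0 is very negative, label 4 is very positive.
toy_docs = ["dull dull long", "bad dull", "great funny great", "funny best"]
toy_labels = [0, 0, 4, 4]
model = train_mnb(toy_docs, toy_labels)
negative_top = top_words(model, 0, k=2)
positive_top = top_words(model, 4, k=2)
```

Because this ranking uses the raw log probability of a word within a class, it favors words that are simply frequent in that class. That may be one reason a common word such as comedy can appear in both lists.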

TOP WORDS: BIGRAM
These are the top words related to very positive and very negative reviews, ranked by their log probabilities.
Very Negative:
rbb, long, action, time, dull, worst, characters, minutes, comedy, bad
Very Positive:
movies, work, performance, great, performances, comedy, well, good, funny, best
The words are the same as with the unigram features: no bigram ranked in the top ten of either list. This is consistent with the unigram model performing slightly better.
A confusion matrix is a visualization that compares predicted values with actual values. The Y axis shows the predicted values and the X axis shows the actual values. The goal is for the predicted values to match the actual values, which would place most of the counts along the diagonal from the top left corner to the bottom right corner.
We want to match 0 with 0, 1 with 1, etc. Any value outside of these diagonal boxes is an incorrect prediction.
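The relationship between the confusion matrix and the accuracy score can be sketched in a few lines. This is an illustrative example on made-up ratings, not the data from this analysis; the function names are hypothetical, and the rows and columns follow the orientation described above (predicted on the Y axis, actual on the X axis).

```python
def confusion_matrix(actual, predicted, n_classes=5):
    """Build a matrix where rows are predicted ratings and columns are actual ratings."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for a, p in zip(actual, predicted):
        matrix[p][a] += 1
    return matrix


def accuracy(matrix):
    """Share of predictions that fall on the diagonal (predicted == actual)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Toy ratings on the 0 to 4 scale used in this analysis.
actual = [0, 1, 2, 2, 3, 4, 4, 2]
predicted = [0, 2, 2, 2, 3, 4, 3, 1]
cm = confusion_matrix(actual, predicted)
score = accuracy(cm)
```

Five of the eight toy predictions land on the diagonal, so the accuracy is 5 / 8 = 0.625. The off-diagonal cells show which ratings get confused with which.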
RESULTS
Support Vector Machine
A unigram feature set and a bigram feature set are compared in this section. Each has a confusion matrix image showing how accurately the model predicted each review's rating. Reviews were rated from 0 to 4, with 0 being a very negative review and 4 being a very positive review.

UNIGRAM
Confusion Matrix
This method used a unigram feature set with the support vector machine (SVM) model.
This model scored an accuracy of 62.2%.

BIGRAM
Confusion Matrix
This method added bigrams to the feature set for the SVM model.
This model scored an accuracy of 63.4%.

TOP WORDS: UNIGRAM
These are the top words related to very positive and very negative reviews, ranked by their log probabilities.
Very Negative:
loathsome, ungainly, awfulness, disappointment, grotesquely, atrocious, worthless, unappealing, sucked, unwatchable
Very Positive:
scars, glorious, enriched, excellent, masterfully, masterful, flawless, magnificent, awesome, perfection
These words appear to be more expressive than the words from the MNB model.

TOP WORDS: BIGRAM
These are the top words related to very positive and very negative reviews, ranked by their log probabilities.
Very Negative:
utterly incompetent, grotesquely, unappealing, unbearable, disappointment, thumbs, unwatchable, appalling, sucked, worthless
Very Positive:
superb, masterful, flawless, screenplay die, masterpiece, awesome, excellent, magnificent
Unlike the MNB model, this model did contain a few bigrams in its top word lists. Again, these n-grams appear to be very expressive compared to those from the MNB model.