MOVIE REVIEWS SENTIMENT ANALYSIS

September, 2019

Project that performs a sentiment analysis comparing a support vector machine model with a multinomial naïve Bayes model on textual data containing movie reviews.

SENTIMENT ANALYSIS: COMPARING TWO MODELS

Text Mining Classification With Movie Reviews

These are the results of a sentiment analysis of a data set of over 150,000 movie reviews. The analysis compares two models: a support vector machine model and a multinomial naïve Bayes model. Each model was run with different feature sets (unigrams only, and unigrams plus bigrams) to improve accuracy, and both models were also evaluated with cross-validation.

The objective was to compare the models and select the one that best fits the data set, and to determine which specific words are associated with positive or negative reviews.

RESULTS

Multinomial Naïve Bayes

A unigram feature set and a bigram feature set were compared in this section. For each one, a confusion matrix image shows how the model's predicted ratings compare with the actual ratings. Reviews were rated 0 to 4, with 0 being a very negative review and 4 being a very positive review.

[Figure: MNB unigram confusion matrix (mnbuni1.png)]

UNIGRAM

Confusion Matrix

The first method used a unigram feature set with the multinomial naïve Bayes (MNB) model. With unigrams, each individual word is its own feature.

This model scored an accuracy of 60.4%.

[Figure: MNB bigram confusion matrix (mnbbi1.png)]

BIGRAM

Confusion Matrix

This method added bigrams (pairs of adjacent words) to the feature set while keeping the unigrams.

This model scored an accuracy of 59.4%.
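To make the difference between the two feature sets concrete, here is a minimal sketch of unigram and bigram extraction. It is not the code used for this project; the function names, the whitespace tokenizer, and the sample review text are assumptions made for illustration.

```python
def ngrams(tokens, n):
    """Return the n-grams (joined with a space) found in a list of tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def extract_features(text, use_bigrams=False):
    """Split on whitespace and return unigram (and optionally bigram) features."""
    tokens = text.lower().split()
    features = ngrams(tokens, 1)
    if use_bigrams:
        # Bigrams are added on top of the unigrams, not in place of them.
        features += ngrams(tokens, 2)
    return features


review = "utterly incompetent screenplay"  # hypothetical review text
unigram_features = extract_features(review)
bigram_features = extract_features(review, use_bigrams=True)
print(unigram_features)
print(bigram_features)
```

In scikit-learn, the comparable switch is the ngram_range parameter of CountVectorizer: (1, 1) keeps unigrams only and (1, 2) keeps unigrams plus bigrams. Which library and settings were used here is not stated on this page.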

TOP WORDS: UNIGRAM

These are the top words associated with very positive and very negative reviews, ranked by their log probabilities.

Very Negative:

rbb, long, action, time, dull, worst, characters, minutes, comedy, bad

Very Positive:

movies, work, performance, great, performances, comedy, well, good, funny, best

Words pertaining to duration (long, time, minutes) seem to be linked to bad reviews, while funny movies with good performances seem to be linked to good reviews. The word "comedy" shows up in both lists, perhaps because many of the reviews were for comedy movies. These lists give a rough sense of which words the model associates with each rating.
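As an illustration of how a top-words list can be ranked by log probability, here is a minimal sketch that estimates log P(word | rating) from word counts. It assumes add-one smoothing and uses a tiny made-up data set; the function names and the data are hypothetical, and this is not the project's actual code.

```python
import math
from collections import Counter, defaultdict


def word_log_probs(docs, labels, alpha=1.0):
    """Return {label: {word: log P(word | label)}} using add-alpha smoothing."""
    counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        counts[label].update(tokens)
    vocab = sorted({w for c in counts.values() for w in c})
    log_probs = {}
    for label, c in counts.items():
        total = sum(c.values()) + alpha * len(vocab)
        log_probs[label] = {w: math.log((c[w] + alpha) / total) for w in vocab}
    return log_probs


def top_words(log_probs, label, k=3):
    """Return the k words with the highest log probability for a label."""
    # Ties are broken alphabetically so the ranking is deterministic.
    ranked = sorted(log_probs[label].items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:k]]


# Hypothetical toy data: 0 = very negative, 4 = very positive.
docs = [["dull", "long", "bad"], ["bad", "long", "comedy"],
        ["great", "funny", "comedy"], ["great", "best", "funny"]]
labels = [0, 0, 4, 4]

lp = word_log_probs(docs, labels)
print(top_words(lp, 0))
print(top_words(lp, 4))
```

Ranking by log P(word | rating) favors words that are simply frequent within a rating, which is one possible reason a general word can rank highly for more than one rating.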

TOP WORDS: BIGRAM

These are the top words associated with very positive and very negative reviews, ranked by their log probabilities.

Very Negative:

rbb, long, action, time, dull, worst, characters, minutes, comedy, bad

Very Positive:

movies, work, performance, great, performances, comedy, well, good, funny, best

The top words are the same as those from the unigram feature set; no bigrams reached the top ten. This is consistent with the unigram model performing slightly better (60.4% vs. 59.4%).

A confusion matrix is a table that compares predicted values against actual values. The Y axis shows the predicted values and the X axis shows the actual values. The goal is for the predicted values to match the actual values, which places most of the counts along the diagonal from the top left corner to the bottom right corner.

We want to match 0 with 0, 1 with 1, etc. Any count outside of these diagonal boxes is an incorrect prediction.
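The relationship between a confusion matrix and the accuracy score can be sketched in a few lines. The example below follows the layout described above (rows are predicted ratings, columns are actual ratings). The ratings are made up for illustration and the function names are hypothetical.

```python
def confusion_matrix(actual, predicted, n_classes=5):
    """Count predictions: rows are predicted ratings, columns are actual ratings."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for a, p in zip(actual, predicted):
        matrix[p][a] += 1
    return matrix


def accuracy(matrix):
    """Share of all predictions that fall on the diagonal (predicted == actual)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical ratings on the 0-4 scale, for illustration only.
actual = [0, 1, 2, 2, 3, 4, 4, 2]
predicted = [0, 2, 2, 2, 3, 3, 4, 1]

cm = confusion_matrix(actual, predicted)
for row in cm:
    print(row)
print(accuracy(cm))
```

Here 5 of the 8 predictions land on the diagonal, so the accuracy is 0.625. Libraries differ on orientation: scikit-learn's confusion_matrix, for example, puts the actual labels on the rows and the predicted labels on the columns.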

RESULTS

Support Vector Machine

A unigram feature set and a bigram feature set were compared in this section. For each one, a confusion matrix image shows how the model's predicted ratings compare with the actual ratings. Reviews were rated 0 to 4, with 0 being a very negative review and 4 being a very positive review.

[Figure: SVM unigram confusion matrix (svmuni.png)]

UNIGRAM

Confusion Matrix

This method used a unigram feature set with the support vector machine (SVM) model.

This model scored an accuracy of 62.2%.

[Figure: SVM bigram confusion matrix (svmbi.png)]

BIGRAM

Confusion Matrix

This method added bigrams to the feature set for the SVM model.

This model scored an accuracy of 63.4%.

TOP WORDS: UNIGRAM

These are the top words associated with very positive and very negative reviews, ranked by their feature weights in the model.

Very Negative:

loathsome, ungainly, awfulness, disappointment, grotesquely, atrocious, worthless, unappealing, sucked, unwatchable

Very Positive:

scars, glorious, enriched, excellent, masterfully, masterful, flawless, magnificent, awesome, perfection

These words appear to be more expressive than the words from the MNB model. 

TOP WORDS: BIGRAM

These are the top words and word pairs associated with very positive and very negative reviews, ranked by their feature weights in the model.

Very Negative:

utterly incompetent, grotesquely, unappealing, unbearable, disappointment, thumbs, unwatchable, appalling, sucked, worthless

Very Positive:

superb, masterful, flawless, screenplay die, masterpiece, awesome, excellent, magnificent

Unlike the MNB model, this model did place a few bigrams in its top ten word lists. Again, these n-grams appear to be much more expressive than those from the MNB model.

RESULTS

Cross-Validation

Rather than using a single split of the data into training and testing sets, here we used ten-fold cross-validation with each model.

The chart shows that the support vector machine model performed better in most folds.

[Figure: cross-validation accuracy by fold for both models (crossval1_edited.png)]
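For readers unfamiliar with the procedure, here is a minimal sketch of ten-fold cross-validation. It assumes contiguous, unshuffled folds and uses a stand-in "majority label" model on made-up data. None of the names below come from the project, and the actual fold assignment used is not stated on this page.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def cross_validate(samples, labels, fit, predict, k=10):
    """Train on k-1 folds, test on the held-out fold, and return one accuracy per fold."""
    scores = []
    for train, test in k_fold_indices(len(samples), k):
        model = fit([samples[i] for i in train], [labels[i] for i in train])
        hits = sum(predict(model, samples[i]) == labels[i] for i in test)
        scores.append(hits / len(test))
    return scores


# Hypothetical stand-in model: always predicts the most common training label.
def fit_majority(train_samples, train_labels):
    return max(set(train_labels), key=train_labels.count)


def predict_majority(model, sample):
    return model


samples = list(range(20))          # placeholder inputs
labels = [2] * 14 + [3] * 6        # made-up ratings
scores = cross_validate(samples, labels, fit_majority, predict_majority, k=10)
print(scores)
```

Each review is used for testing exactly once, and the ten per-fold scores are what a chart like the one above would plot for each model. In scikit-learn, cross_val_score in sklearn.model_selection provides comparable per-fold scores.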
©2020 by Data Science Projects.
