Using the Kaggle API to obtain the Sentiment Analysis on Movie Reviews dataset
!pip install -q kaggle
# upload the Kaggle API JSON file (kaggle.json) to access the datasets
from google.colab import files
files.upload()
! mkdir ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
! kaggle datasets list
ref title size lastUpdated downloadCount voteCount usabilityRating
-------------------------------------------------- ------------------------------------ ----- ------------------- ------------- --------- ---------------
ruchi798/data-science-job-salaries Data Science Job Salaries 7KB 2022-06-15 08:59:12 17615 549 1.0
aravindas01/monkeypox-cases-countrywise-data MonkeyPox Cases_Countrywise Data 6KB 2022-08-10 17:12:36 524 27 0.9117647
faryarmemon/usa-housing-market-factors U.S. Housing Market Factors 32KB 2022-08-03 02:19:31 425 30 1.0
zzettrkalpakbal/full-filled-brain-stroke-dataset Brain stroke prediction dataset 52KB 2022-07-16 09:57:08 1968 62 0.9705882
himanshunakrani/student-study-hours Student Study Hours 276B 2022-07-20 13:17:29 1497 57 1.0
jillanisofttech/brain-stroke-dataset Brain Stroke Dataset 47KB 2022-08-04 18:02:56 666 29 0.9705882
nancyalaswad90/diamonds-prices Diamonds Prices 711KB 2022-07-09 14:59:21 1942 88 1.0
erqizhou/students-data-analysis Students Data Analysis 2KB 2022-07-20 03:54:13 776 28 1.0
dansbecker/melbourne-housing-snapshot Melbourne Housing Snapshot 451KB 2018-06-05 12:52:24 92793 1118 0.7058824
gabrielabilleira/football-manager-2022-player-data Football Manager 2022 Player Data 94KB 2022-07-26 09:49:50 546 28 1.0
mukuldeshantri/ecommerce-fashion-dataset E-commerce Dataset with 30K Products 546KB 2022-07-08 12:28:18 2259 68 1.0
datasnaek/youtube-new Trending YouTube Video Statistics 201MB 2019-06-03 00:56:47 181067 4628 0.7941176
zynicide/wine-reviews Wine Reviews 51MB 2017-11-27 17:08:04 164231 3348 0.7941176
residentmario/ramen-ratings Ramen Ratings 40KB 2018-01-11 16:04:39 35106 802 0.7058824
rtatman/188-million-us-wildfires 1.88 Million US Wildfires 168MB 2020-05-12 21:03:49 20978 1026 0.8235294
datasnaek/chess Chess Game Dataset (Lichess) 3MB 2017-09-04 03:09:09 30770 1023 0.8235294
jpmiller/publicassistance US Public Food Assistance 703KB 2020-08-21 16:51:18 16824 401 0.9117647
dansbecker/powerlifting-database powerlifting-database 9MB 2019-04-30 21:07:41 5207 64 0.5882353
nasa/kepler-exoplanet-search-results Kepler Exoplanet Search Results 1MB 2017-10-10 18:26:59 10720 667 0.8235294
residentmario/things-on-reddit Things on Reddit 16MB 2017-10-26 14:10:15 8769 219 0.5882353
!kaggle competitions download -c 'sentiment-analysis-on-movie-reviews'
#unzip the files on Colab
inflating: sampleSubmission.csv
inflating: train.tsv
inflating: test.tsv
Load the data into Pandas Dataframes
import pandas as pd
df = pd.read_csv("train.tsv", sep="\t")
 | PhraseId | SentenceId | Phrase | Sentiment |
0 | 1 | 1 | A series of escapades demonstrating the adage ... | 1 |
1 | 2 | 1 | A series of escapades demonstrating the adage ... | 2 |
2 | 3 | 1 | A series | 2 |
3 | 4 | 1 | A | 2 |
4 | 5 | 1 | series | 2 |
Check the distribution of the sentiment classes to see whether the classes are imbalanced
[Plot: distribution of the Sentiment classes]
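A minimal sketch of the class-balance check, using only the standard library. The `sentiments` list is a made-up stand-in for the `Sentiment` column; on the real DataFrame, `df['Sentiment'].value_counts()` should give the same kind of counts.

```python
from collections import Counter

# Hypothetical stand-in for df['Sentiment']; the real column holds labels 0-4.
sentiments = [2, 2, 1, 3, 2, 0, 4, 2, 3, 1]

counts = Counter(sentiments)
total = len(sentiments)

# Share of each class, in label order; one dominant class signals imbalance.
shares = {label: counts[label] / total for label in sorted(counts)}
for label, share in shares.items():
    print(f"class {label}: {counts[label]} samples ({share:.0%})")

majority_class = max(counts, key=counts.get)
```

If one class dominates, plain accuracy can look better than the model really is, so it is worth keeping the majority-class share in mind as a baseline.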
For BERT, we need to tokenize the text to create two input tensors: the input IDs and the attention mask.
Each tensor has shape len(df) * 512: len(df) is the number of training samples, and 512 is the sequence length of the tokenized sequences for BERT.
import numpy as np
seq_len = 512
num_samples = len(df)
Tokenize with BertTokenizer
!pip install transformers
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
The tokenizer returns three NumPy arrays: input_ids, token_type_ids, and attention_mask
tokens = tokenizer(df['Phrase'].tolist(), max_length=seq_len, truncation=True,
                   padding='max_length', add_special_tokens=True,
                   return_tensors='np')
tokens.keys()
dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
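To make the two tensors concrete, here is a toy sketch of what padding to a fixed length and the matching attention mask look like. The token IDs and the `pad_and_mask` helper are hypothetical, and a padding ID of 0 is an assumption; the real IDs come from `BertTokenizer`.

```python
def pad_and_mask(token_ids, seq_len, pad_id=0):
    """Truncate or pad token_ids to seq_len and build the attention mask."""
    ids = token_ids[:seq_len]                        # truncation
    n_real = len(ids)
    ids = ids + [pad_id] * (seq_len - n_real)        # pad up to the fixed length
    mask = [1] * n_real + [0] * (seq_len - n_real)   # 1 = real token, 0 = padding
    return ids, mask


# Made-up IDs for a short phrase, padded to length 8 instead of 512.
ids, mask = pad_and_mask([101, 138, 1326, 102], seq_len=8)
print(ids)
print(mask)
```

The mask tells the model which positions to attend to, so the padding added to reach the fixed sequence length does not influence the result.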
Save the arrays as NumPy binary (.npy) files. This format persists a single arbitrary NumPy array on disk and stores all of the shape and dtype information necessary to reconstruct the array correctly, even on another machine with a different architecture. The format is designed to be as simple as possible while achieving its limited goals (reference: NumPy documentation on the .npy format).
with open('movie-xids.npy', 'wb') as f:
    np.save(f, tokens['input_ids'])
with open('movie-xmask.npy', 'wb') as f:
    np.save(f, tokens['attention_mask'])
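The round trip through a .npy file can be sketched on a tiny array. The array contents and the temporary file name below are made up for illustration; only `np.save` and `np.load` mirror the calls used in this walkthrough.

```python
import os
import tempfile

import numpy as np

# Small stand-in for tokens['input_ids'] (2 samples, sequence length 4).
ids = np.array([[101, 138, 102, 0], [101, 102, 0, 0]], dtype=np.int64)

path = os.path.join(tempfile.mkdtemp(), "demo-xids.npy")
with open(path, "wb") as f:
    np.save(f, ids)

with open(path, "rb") as f:
    restored = np.load(f)

# Shape and dtype survive the round trip without being specified again.
print(restored.shape, restored.dtype)
```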
We also need to extract the label values and one-hot encode them into another NumPy array of shape len(df) * 5, with one column per label class.
arr = df["Sentiment"].values
#initialize the zero array
labels = np.zeros((num_samples, arr.max()+1))
Use the current values in arr (each one of 0, 1, 2, 3, 4) as column indices to place a 1 in the right position of each row
labels[np.arange(num_samples), arr] = 1
array([[0., 1., 0., 0., 0.],
[0., 0., 1., 0., 0.],
[0., 0., 1., 0., 0.],
[0., 0., 0., 1., 0.],
[0., 0., 1., 0., 0.],
[0., 0., 1., 0., 0.]])
with open("movie-labels.npy", "wb") as f:
    np.save(f, labels)
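The indexing trick used for the one-hot encoding is easier to follow on a handful of labels. This is a small sketch with hypothetical label values standing in for `df["Sentiment"].values`.

```python
import numpy as np

# Hypothetical labels standing in for df["Sentiment"].values.
arr = np.array([1, 2, 2, 4, 0])
num_samples = len(arr)

labels = np.zeros((num_samples, arr.max() + 1))
# Row indices and column indices are paired up: row i gets a 1 in column arr[i].
labels[np.arange(num_samples), arr] = 1

# argmax along each row recovers the original class labels.
recovered = labels.argmax(axis=1)
print(labels)
```

Note that `arr.max() + 1` only yields five columns if the highest class actually occurs in the data; passing the number of classes explicitly would be a safer choice on small samples.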
Load the arrays back and use TensorFlow to build a tf.data.Dataset object from them
with open('movie-xids.npy', 'rb') as f:
    Xids = np.load(f, allow_pickle=True)
with open('movie-xmask.npy', 'rb') as f:
    Xmask = np.load(f, allow_pickle=True)
with open('movie-labels.npy', 'rb') as f:
    labels = np.load(f, allow_pickle=True)
import tensorflow as tf
dataset = tf.data.Dataset.from_tensor_slices((Xids, Xmask, labels))
<TakeDataset element_spec=(TensorSpec(shape=(512,), dtype=tf.int64, name=None), TensorSpec(shape=(512,), dtype=tf.int64, name=None), TensorSpec(shape=(5,), dtype=tf.float64, name=None))>
Each sample of the dataset now contains an Xids, an Xmask, and a labels tensor.
def map_func(input_ids, masks, labels):
    # we convert our three-item tuple into a two-item tuple where the input item is a dictionary
    return {'input_ids': input_ids, 'attention_mask': masks}, labels
# use the dataset map method to apply this transformation
dataset = dataset.map(map_func)
<TakeDataset element_spec=({'input_ids': TensorSpec(shape=(512,), dtype=tf.int64, name=None), 'attention_mask': TensorSpec(shape=(512,), dtype=tf.int64, name=None)}, TensorSpec(shape=(5,), dtype=tf.float64, name=None))>
Use a batch size of 16 and drop any samples that don’t fit into chunks of 16
batch_size = 16
dataset = dataset.shuffle(10000).batch(batch_size, drop_remainder=True)
<TakeDataset element_spec=({'input_ids': TensorSpec(shape=(16, 512), dtype=tf.int64, name=None), 'attention_mask': TensorSpec(shape=(16, 512), dtype=tf.int64, name=None)}, TensorSpec(shape=(16, 5), dtype=tf.float64, name=None))>
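What `drop_remainder=True` does can be sketched in plain Python. The `batch` helper below is hypothetical and ignores shuffling; it only illustrates how the final, incomplete chunk is discarded.

```python
def batch(samples, batch_size, drop_remainder=True):
    """Group samples into consecutive chunks of batch_size."""
    batches = [samples[i:i + batch_size]
               for i in range(0, len(samples), batch_size)]
    if drop_remainder and batches and len(batches[-1]) < batch_size:
        batches.pop()  # discard the final, incomplete batch
    return batches


samples = list(range(10))
kept = batch(samples, batch_size=4)
all_batches = batch(samples, batch_size=4, drop_remainder=False)
print(kept)
print(all_batches)
```

Dropping the remainder keeps every batch the same shape, which is why the element spec can state a fixed first dimension of 16.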
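Split the data into training and validation sets with a 90-10 split. Note that shuffle() reshuffles on every iteration by default, so take() and skip() are only guaranteed to give disjoint sets if the shuffle order is fixed, for example with reshuffle_each_iteration=False.
split = 0.9
# calculate how many batches go to the training set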
size = int((Xids.shape[0] / batch_size) * split)
train_ds = dataset.take(size)
val_ds = dataset.skip(size)
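As a worked check of the split arithmetic, the sketch below assumes train.tsv holds 156,060 phrases. That count is an inference from test.tsv starting at PhraseId 156061, not a figure stated in this walkthrough.

```python
num_samples = 156060   # assumed row count of train.tsv (see the note above)
batch_size = 16
split = 0.9

# Full batches left after batching with drop_remainder=True.
total_batches = num_samples // batch_size
# Same formula as the size calculation in the walkthrough.
train_batches = int((num_samples / batch_size) * split)
# What dataset.skip(size) leaves for validation.
val_batches = total_batches - train_batches

print(total_batches, train_batches, val_batches)
```

Under that assumption the training set gets 8778 batches, which agrees with the 8778 steps per epoch reported in the training log.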
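Save the datasets with tf.data.experimental.save. When loading the files, the element_spec, which describes the tensor shapes and dtypes, must be specified.
tf.data.experimental.save(train_ds, 'train')
tf.data.experimental.save(val_ds, 'val')
train_ds.element_spec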
({'attention_mask': TensorSpec(shape=(16, 512), dtype=tf.int64, name=None),
'input_ids': TensorSpec(shape=(16, 512), dtype=tf.int64, name=None)},
TensorSpec(shape=(16, 5), dtype=tf.float64, name=None))
from transformers import TFAutoModel
bert = TFAutoModel.from_pretrained('bert-base-cased')
Some layers from the model checkpoint at bert-base-cased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-cased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.
Model: "tf_bert_model"
Layer (type) Output Shape Param #
bert (TFBertMainLayer) multiple 108310272
Total params: 108,310,272
Trainable params: 108,310,272
Non-trainable params: 0
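To define the frame around BERT, we need two input layers (one for the input IDs and one for the attention mask), BERT itself, and a classification head. A post-BERT dropout layer can reduce the likelihood of overfitting and improve generalization, and a max pooling layer can convert the 3D tensors output by BERT to 2D. The code below instead takes BERT's pooled output (index 1), which is already 2D, and passes it through a dense layer. Softmax is the output activation, giving categorical probabilities.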
input_ids = tf.keras.layers.Input(shape=(512,), name='input_ids', dtype='int32')
mask = tf.keras.layers.Input(shape=(512,), name='attention_mask', dtype='int32')
embeddings = bert.bert(input_ids, attention_mask=mask)[1]
x = tf.keras.layers.Dense(1024, activation='relu')(embeddings)
y = tf.keras.layers.Dense(5, activation='softmax', name='outputs')(x)
model = tf.keras.Model(inputs=[input_ids, mask], outputs=y)
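For readers unfamiliar with the output activation, here is a minimal pure-Python sketch of softmax. The input scores are hypothetical; in the model, Keras applies the same transformation inside the `outputs` layer.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for the five sentiment classes.
probs = softmax([0.2, 1.5, 3.0, 0.7, -1.0])
print([round(p, 3) for p in probs])
```

The largest score always receives the largest probability, so the predicted class is simply the position of the maximum.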
BERT is already pre-trained and has a very large number of parameters (about 108 million in this checkpoint), which take a long time to train further. We can freeze those parameters and train only the new layers on top for this task.
#freeze bert layer, optional
model.layers[2].trainable = False
Model: "model"
Layer (type) Output Shape Param # Connected to
input_ids (InputLayer) [(None, 512)] 0 []
attention_mask (InputLayer) [(None, 512)] 0 []
bert (TFBertMainLayer) TFBaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=(None, 512, 768), pooler_output=(None, 768), past_key_values=None, hidden_states=None, attentions=None, cross_attentions=None) 108310272 ['input_ids[0][0]', 'attention_mask[0][0]']
dense (Dense) (None, 1024) 787456 ['bert[0][1]']
outputs (Dense) (None, 5) 5125 ['dense[0][0]']
Total params: 109,102,853
Trainable params: 792,581
Non-trainable params: 108,310,272
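The parameter counts in the summary can be reproduced by hand. The sketch below uses only numbers taken from the summary itself: a 768-wide pooled output, a 1024-unit dense layer, and 5 output classes.

```python
hidden_size = 768          # width of BERT's pooled output, per the summary
dense_units = 1024
num_classes = 5
bert_params = 108_310_272  # frozen BERT parameters, per the summary

# A Dense layer has (inputs * units) weights plus one bias per unit.
dense_params = hidden_size * dense_units + dense_units
output_params = dense_units * num_classes + num_classes

# BERT is frozen, so only the two Dense layers are trainable.
trainable = dense_params + output_params
total = bert_params + trainable
print(dense_params, output_params, trainable, total)
```

With BERT frozen, well under 1% of the model's parameters are updated during training.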
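# learning_rate replaces the deprecated lr argument; decay is only accepted by older (legacy) Keras optimizers
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5, decay=1e-6)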
loss = tf.keras.losses.CategoricalCrossentropy()
acc = tf.keras.metrics.CategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[acc])
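To give the loss values some scale, here is a minimal pure-Python sketch of categorical cross-entropy for a single sample. The helper and the prediction vectors are hypothetical; Keras computes the same quantity averaged over each batch.

```python
import math


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Cross-entropy between a one-hot label and predicted probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


y_true = [0, 0, 1, 0, 0]                     # one-hot label for class 2
confident = [0.05, 0.05, 0.80, 0.05, 0.05]   # hypothetical prediction
uniform = [0.2] * 5                          # no preference among 5 classes

loss_confident = categorical_crossentropy(y_true, confident)
loss_uniform = categorical_crossentropy(y_true, uniform)
print(round(loss_confident, 4), round(loss_uniform, 4))
```

A uniform guess over five classes gives a loss of ln 5, roughly 1.609, which is a useful reference point when reading a training log.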
Load in the training and validation datasets and specify the element specs
element_spec = ({'input_ids': tf.TensorSpec(shape=(16, 512), dtype=tf.int64, name=None),
'attention_mask': tf.TensorSpec(shape=(16, 512), dtype=tf.int64, name=None)},
tf.TensorSpec(shape=(16, 5), dtype=tf.float64, name=None))
# load the training and validation sets
train_ds = tf.data.experimental.load('train', element_spec=element_spec)
val_ds = tf.data.experimental.load('val', element_spec=element_spec)
# view the input format
<TakeDataset element_spec=({'input_ids': TensorSpec(shape=(16, 512), dtype=tf.int64, name=None), 'attention_mask': TensorSpec(shape=(16, 512), dtype=tf.int64, name=None)}, TensorSpec(shape=(16, 5), dtype=tf.float64, name=None))>
history = model.fit(train_ds, validation_data=val_ds, epochs=3)
Epoch 1/3
8778/8778 [==============================] - 7199s 819ms/step - loss: 1.1577 - accuracy: 0.5383 - val_loss: 1.1420 - val_accuracy: 0.5208
Epoch 2/3
8778/8778 [==============================] - 7155s 815ms/step - loss: 1.0935 - accuracy: 0.5596 - val_loss: 1.0990 - val_accuracy: 0.5362
Epoch 3/3
8778/8778 [==============================] - 7184s 818ms/step - loss: 1.0659 - accuracy: 0.5693 - val_loss: 1.0809 - val_accuracy: 0.5412
# save the model for future use
model.save("sentiment_model")
model = tf.keras.models.load_model('sentiment_model')
Model: "model"
Layer (type) Output Shape Param # Connected to
input_ids (InputLayer) [(None, 512)] 0 []
attention_mask (InputLayer) [(None, 512)] 0 []
bert (TFBertMainLayer) TFBaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=(None, 512, 768), pooler_output=(None, 768), past_key_values=None, hidden_states=None, attentions=None, cross_attentions=None) 108310272 ['input_ids[0][0]', 'attention_mask[0][0]']
dense (Dense) (None, 1024) 787456 ['bert[0][1]']
outputs (Dense) (None, 5) 5125 ['dense[0][0]']
Total params: 109,102,853
Trainable params: 792,581
Non-trainable params: 108,310,272
For the predictions, we need to format the data the same way: tokenize it with the bert-base-cased tokenizer and transform it into a dictionary that holds the “input_ids” and “attention_mask” tensors.
def prep_data(text):
    tokens = tokenizer.encode_plus(text, max_length=512,
                                   truncation=True, padding='max_length',
                                   add_special_tokens=True, return_token_type_ids=False,
                                   return_tensors='tf')
    # the tokenizer returns int32 tensors, but we need int64, so we use tf.cast
    return {'input_ids': tf.cast(tokens['input_ids'], tf.int64),
            'attention_mask': tf.cast(tokens['attention_mask'], tf.int64)}
df = pd.read_csv("test.tsv", sep="\t")
 | PhraseId | SentenceId | Phrase |
0 | 156061 | 8545 | An intermittently pleasing but mostly routine ... |
1 | 156062 | 8545 | An intermittently pleasing but mostly routine ... |
2 | 156063 | 8545 | An |
3 | 156064 | 8545 | intermittently pleasing but mostly routine effort |
4 | 156065 | 8545 | intermittently pleasing but mostly routine |
df['Sentiment'] = None
for i, row in df.iterrows():
    # get token tensors
    tokens = prep_data(row['Phrase'])
    # get probabilities
    probs = model.predict(tokens)
    # find argmax for winning class
    pred = np.argmax(probs)
    # add to dataframe
    df.at[i, 'Sentiment'] = pred
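The step from probabilities to a class label can be sketched on made-up numbers. The arrays below are hypothetical stand-ins for what `model.predict` returns, assuming one row of five probabilities per phrase.

```python
import numpy as np

# Hypothetical output of model.predict for one phrase: shape (1, 5).
probs = np.array([[0.05, 0.20, 0.55, 0.15, 0.05]])

# np.argmax flattens by default, which is fine for a single row.
pred = int(np.argmax(probs))

# For several rows at once, take the argmax along the class axis instead.
batch_probs = np.array([[0.1, 0.6, 0.1, 0.1, 0.1],
                        [0.0, 0.1, 0.1, 0.2, 0.6]])
batch_preds = batch_probs.argmax(axis=1)
print(pred, batch_preds)
```

Predicting many phrases in one call and taking the argmax per row would likely be much faster than looping over the DataFrame one row at a time, though that is a design suggestion rather than something measured here.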
 | PhraseId | SentenceId | Phrase | Sentiment |
0 | 156061 | 8545 | An intermittently pleasing but mostly routine ... | 2 |
1 | 156062 | 8545 | An intermittently pleasing but mostly routine ... | 2 |
2 | 156063 | 8545 | An | 2 |
3 | 156064 | 8545 | intermittently pleasing but mostly routine effort | 2 |
4 | 156065 | 8545 | intermittently pleasing but mostly routine | 2 |
 | PhraseId | SentenceId | Phrase | Sentiment |
0 | 156061 | 8545 | An intermittently pleasing but mostly routine ... | 2 |
1 | 156062 | 8545 | An intermittently pleasing but mostly routine ... | 2 |
2 | 156063 | 8545 | An | 2 |
3 | 156064 | 8545 | intermittently pleasing but mostly routine effort | 2 |
4 | 156065 | 8545 | intermittently pleasing but mostly routine | 2 |
... | ... | ... | ... | ... |
66287 | 222348 | 11855 | A long-winded , predictable scenario . | 2 |
66288 | 222349 | 11855 | A long-winded , predictable scenario | 2 |
66289 | 222350 | 11855 | A long-winded , | 2 |
66290 | 222351 | 11855 | A long-winded | 2 |
66291 | 222352 | 11855 | predictable scenario | 2 |
[66292 rows x 4 columns]