The IMDB Dataset

The IMDB dataset is one of the most popular sentiment analysis datasets. It contains movie reviews, each annotated with a positive or a negative label.

In this example, we build a classifier that predicts whether a review is positive or negative.

Text Preprocessing

To preprocess the text, first lowercase all words and remove any HTML residue and punctuation. Then tokenize the text and feed the tokens into an Embedding layer, which maps each token to a vector.
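The preprocessing steps above can be sketched in plain Python. This is a minimal illustration, not the repository's actual code: the function names, the regular expression, and the padding and out-of-vocabulary ids (0 and 1) are assumptions made for the example.

```python
import html
import re
import string
from typing import Dict, List

TAG_RE = re.compile(r"<[^>]+>")  # matches HTML tags such as <br />
PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def preprocess(review: str) -> List[str]:
    """Lowercase, strip HTML residue and punctuation, then split into tokens."""
    text = review.lower()
    text = TAG_RE.sub(" ", text)         # drop tags left over from the web page
    text = html.unescape(text)           # turn entities such as &amp; into characters
    text = text.translate(PUNCT_TABLE)   # replace ASCII punctuation with spaces
    return text.split()


def encode(tokens: List[str], vocab: Dict[str, int], max_len: int = 8,
           pad_id: int = 0, oov_id: int = 1) -> List[int]:
    """Map tokens to integer ids (the input an Embedding layer expects).

    pad_id and oov_id are assumed conventions for this sketch.
    """
    ids = [vocab.get(tok, oov_id) for tok in tokens][:max_len]
    return ids + [pad_id] * (max_len - len(ids))


tokens = preprocess("A <br />GREAT movie, isn't it?")
print(tokens)
print(encode(tokens, {"great": 2, "movie": 3}))
```

Note that replacing punctuation with spaces splits contractions ("isn't" becomes "isn" and "t"). Whether that is desirable depends on the tokenizer and vocabulary used.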

Densely Connected Model

A straightforward approach to this problem is a densely connected model. The proposed model is composed of a Dense layer followed by a GlobalAveragePooling1D layer.
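To make the data flow concrete, here is a small pure-Python sketch of what these two layers compute. A Dense layer applied to a sequence acts on each token embedding independently, and GlobalAveragePooling1D then averages the results over the token axis. The weights, the ReLU activation, and the two-dimensional embeddings are hypothetical values chosen for illustration. The final classification layer is omitted because the text does not describe it.

```python
from typing import List

Vector = List[float]


def dense_relu(x: Vector, weights: List[Vector], bias: Vector) -> Vector:
    """Apply one Dense layer with ReLU to a single token embedding.

    weights holds one weight vector per output unit.
    """
    return [
        max(0.0, sum(xi * wi for xi, wi in zip(x, unit_weights)) + b)
        for unit_weights, b in zip(weights, bias)
    ]


def global_average_pooling_1d(sequence: List[Vector]) -> Vector:
    """Average every feature over the sequence (token) axis."""
    n = len(sequence)
    return [sum(step[j] for step in sequence) / n for j in range(len(sequence[0]))]


def forward(embedded_review: List[Vector], weights: List[Vector], bias: Vector) -> Vector:
    """Dense on every token, then one pooled vector for the whole review."""
    hidden = [dense_relu(token, weights, bias) for token in embedded_review]
    return global_average_pooling_1d(hidden)


# Three tokens with hypothetical 2-dimensional embeddings.
review = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [[1.0, -1.0], [0.5, 0.5]]  # hypothetical weights for two units
bias = [0.0, 0.0]
print(forward(review, weights, bias))
```

Because the pooled vector is an average, shuffling the tokens of a review yields the same output. Such a model has no direct way to represent word order.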

Below is the model’s block representation within our platform:

Training Metrics

During training, our platform collects various performance metrics and metadata and presents them interactively in a customizable Dashboard. This enables further analysis of both performance and data.

After training the model for 20 epochs, the accuracy is 0.9595 and the error loss is 0.149.
Visualization of the error loss / accuracy vs batch:

Error Analysis

Population Exploration

The plot below is a population exploration plot. It shows a similarity map of the samples based on the model’s latent space, built from features extracted from the trained model.

The samples shown below were taken from the training set and are colored by their ground truth class. The size of each dot represents the model’s error loss for that sample. The samples are clearly separated by class. There are also some high-loss samples (large dots).

Unsupervised Clustering

Tensorleap’s platform provides an unsupervised clustering of samples based on their similarities.

In our use case, this clustering technique grouped together samples with common features. For example, in this Population Exploration plot, each cluster has its own color, while the size of each dot represents the accuracy for that sample:

It can be seen that the left-side groups consist of positively labelled samples.
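The text does not say which clustering algorithm the platform uses. As a stand-in, the sketch below runs plain k-means on a handful of hypothetical 2-dimensional latent vectors to show the general idea of grouping samples by similarity in latent space. The points, initial centroids, and function names are all made up for the example.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points: List[Point], centroids: List[Point],
            iterations: int = 10) -> Tuple[List[int], List[List[float]]]:
    """Plain k-means: alternate between assigning points and moving centroids."""
    centers = [list(c) for c in centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centers)), key=lambda k: squared_distance(p, centers[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # an empty cluster keeps its previous centroid
                centers[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centers


# Hypothetical latent vectors forming two well-separated groups.
latent = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centers = k_means(latent, centroids=[latent[0], latent[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

In practice the latent vectors would come from an intermediate layer of the trained model and would have many more dimensions.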

Sample Analysis

The Tensorleap platform provides us with a way to further explore the model’s response to specific data samples.

For example, performing Sample Analysis of one of the failing samples shows a false-positive prediction made by the model, with a heat-map overlay that scores the significance of each word to the positive prediction:

The heat-map highlights words such as “appreciated”, “favourite”, and “appreciate”, which are indeed positive on their own. But in context they read “other comments… appreciated”, “favourite… but”, and “appreciate… but do not”, which come across as negative.
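How the platform computes its heat-map is not described here. One common way to score word significance is occlusion: remove one word at a time and measure how much the positive score drops. The sketch below applies that idea to a toy, order-insensitive scorer. The word weights, the example sentence, and the function names are hypothetical and only serve to show why a model that ignores word order can be misled by a phrase like "appreciate… but do not".

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical per-word weights standing in for a trained model.
WORD_WEIGHTS: Dict[str, float] = {"appreciate": 1.0, "favourite": 1.0, "not": -0.5}


def toy_positive_score(tokens: List[str]) -> float:
    """Order-insensitive stand-in for a model: sum of per-word weights."""
    return sum(WORD_WEIGHTS.get(tok, 0.0) for tok in tokens)


def occlusion_importance(tokens: List[str],
                         score_fn: Callable[[List[str]], float]) -> List[Tuple[str, float]]:
    """Score each word by how much the output drops when that word is removed."""
    base = score_fn(tokens)
    return [
        (tok, base - score_fn(tokens[:i] + tokens[i + 1:]))
        for i, tok in enumerate(tokens)
    ]


# Hypothetical sentence, not taken from the dataset.
sentence = "i appreciate the effort but do not recommend it".split()
for word, importance in occlusion_importance(sentence, toy_positive_score):
    print(f"{word:>10}  {importance:+.2f}")
```

Here "appreciate" receives the highest score (+1.00) while "but" receives 0.00, because the toy scorer has no notion of context. A context-aware model can let "but do not" change how "appreciate" is read.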

BERT Model

We also tried another model: a pretrained BERT (Bidirectional Encoder Representations from Transformers) model.

Running Sample Analysis on the same sample with the BERT model gave a different result: the model correctly predicted the sample as “negative”.

It is evident that terms such as “but”, “not”, “waste of”, and “the worst” contribute to a lower loss.

The image below shows another loss analysis, this time for a mislabeled sample. The test sample is labeled ‘positive’ and the model predicted ‘negative’, so it registers as an error. The text is in fact a negative review of the movie, so the model’s prediction is correct and the label is wrong.

Data Exploration

The Tensorleap Dashboard lets you see how your data is distributed across various features. Below is a dashboard showing five histograms of informative features against loss:

This visualization reveals several correlations:

out-of-vocabulary – the more out-of-vocabulary words a review has, the higher its loss.
polarity – an external (TextBlob) polarity analysis shows that sentences with neutral polarity have a higher loss.
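As an illustration of how such a metadata feature can be computed and compared against loss, the sketch below counts out-of-vocabulary tokens per review and averages the loss within fixed-width buckets of that count. The function names, vocabulary, counts, and loss values are hypothetical and are not results from this project.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def oov_count(tokens: Iterable[str], vocab: Set[str]) -> int:
    """Number of tokens in a review that are missing from the vocabulary."""
    return sum(1 for tok in tokens if tok not in vocab)


def mean_loss_by_bucket(values: List[int], losses: List[float],
                        bucket_size: int) -> Dict[int, float]:
    """Average the loss of all samples whose metadata value falls in the same bucket."""
    totals = defaultdict(lambda: [0.0, 0])  # bucket start -> [loss sum, sample count]
    for value, loss in zip(values, losses):
        bucket = (value // bucket_size) * bucket_size
        totals[bucket][0] += loss
        totals[bucket][1] += 1
    return {bucket: s / n for bucket, (s, n) in sorted(totals.items())}


vocab = {"the", "great", "movie"}
print(oov_count(["the", "grate", "moovie"], vocab))  # 2

# Hypothetical per-sample OOV counts and losses, for illustration only.
counts = [0, 1, 2, 3, 5]
losses = [0.1, 0.3, 0.4, 0.6, 0.9]
print(mean_loss_by_bucket(counts, losses, bucket_size=2))
```

The same bucketing can be applied to any numeric metadata, for example a polarity score, to check whether loss varies with that feature.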

Summary

The Tensorleap platform provides powerful tools for analyzing and understanding deep learning models. In this example, we presented only a few of the insights that can be gained using the platform.

Getting Started with Tensorleap Project

This quick start guide walks you through the steps to get started with this example repository.

Prerequisites

Before you begin, ensure that you have the following prerequisites installed:

Python (version 3.7 or higher).
Poetry.
Tensorleap platform access. To request a free trial click here.
Tensorleap CLI.

Tensorleap CLI Installation

With curl:

curl -s https://raw.githubusercontent.com/tensorleap/leap-cli/master/install.sh | bash

Tensorleap CLI Usage

Tensorleap Login

To log in to Tensorleap:

tensorleap auth login [api key] [api url]

API Key is your Tensorleap token (see how to generate a CLI token in the section below).
API URL is your Tensorleap environment URL: https://api.CLIENT_NAME.tensorleap.ai/api/v2

How To Generate CLI Token from the UI

Log in to the platform at ‘CLIENT_NAME.tensorleap.ai’.
Scroll down to the bottom of the Resources Management page, then click GENERATE CLI TOKEN in the bottom-left corner.
Once a CLI token is generated, just copy the whole text and paste it into your shell.

Tensorleap Project Deployment

To deploy your local changes:

leap project push

Tensorleap files

Tensorleap files in the repository include leap_binder.py and leap.yaml. These files contain the configuration required to integrate the code with the Tensorleap engine:

leap.yaml

The leap.yaml file is configured for a dataset in your Tensorleap environment and is synced to the dataset saved in that environment.

For each additional file the project uses, add its path under the include parameter:

include:
    - leap_binder.py
    - IMDb/configs.py
    - [...]

leap_binder.py file

leap_binder.py configures all the binding functions used to bind to the Tensorleap engine. These are the functions used to evaluate and train the model, visualize the variables, and enrich the analysis with external metadata variables.

Testing

To test the system, run the leap_test.py file using Poetry:

poetry run test

This file executes several tests on the leap_binder.py script to assert that the implemented binding functions (preprocess, encoders, metadata, etc.) run smoothly.

For further explanation, please refer to the docs.

