
Introduction
Video classification is a common yet complex area of machine learning. Unlike vision tasks that involve only spatial information (segmentation, object detection, monocular depth estimation), the added temporal dimension makes error analysis and pinpointing a failure’s root cause especially challenging.
In this blog, we explore that challenge through the lens of Tensorleap, a powerful platform for visualizing and interpreting deep learning models. We will use Tensorleap to peer inside the mind of a video classifier.
Along the way, we’ll use Tensorleap’s explainability engine to explore concepts in the data, balance the dataset, and characterize failures. By the end, you’ll understand how Tensorleap can be used not just to interpret model outputs, but to actively debug and improve your video classification pipelines.
Dataset
For this exploration, we turned to Kinetics-400 – a widely used dataset from DeepMind featuring around 240,000 short video clips, each labeled with one of 400 human action classes. These videos, originally sourced from YouTube, serve as a rich and diverse benchmark for human activity recognition.
To keep our analysis focused, we limited our scope to five classes: abseiling, disc golfing, playing basketball, rock climbing, snowboarding.
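The class filtering itself is easy to reproduce. Here is a minimal sketch, assuming the standard Kinetics annotation CSV with a label column (file names are illustrative):

```python
import pandas as pd

# The five action classes kept for this analysis.
KEPT_CLASSES = {
    "abseiling",
    "disc golfing",
    "playing basketball",
    "rock climbing",
    "snowboarding",
}

# Kinetics-400 annotations ship as CSV files with a 'label' column.
annotations = pd.read_csv("kinetics400_train.csv")  # illustrative file name
subset = annotations[annotations["label"].isin(KEPT_CLASSES)]
subset.to_csv("kinetics400_5class_train.csv", index=False)
print(f"Kept {len(subset)} of {len(annotations)} clips")
```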
Model
To tackle video classification, we used X3D from FAIR’s PyTorchVideo library. X3D is a lightweight 3D convolutional network built specifically for efficient video classification, designed to strike a balance: capture the rich spatiotemporal patterns in video clips while keeping computational costs low.
For this experiment, we used the medium-sized variant of X3D (X3D-M), which packs around 3.8 million parameters and achieves 74.6% top-1 accuracy on Kinetics-400.
Each input to the model consists of 16 frames at 256×256 resolution, representing the first 2 seconds of a video. The output is a prediction across all 400 action classes. Note: while we limited our analysis to a subset of five classes for interpretability and focus, the model itself still performs classification across the full set of 400 Kinetics labels.
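As a concrete reference, here is a minimal sketch of loading this variant through PyTorchVideo’s torch.hub entry point and running it on a dummy clip shaped like the inputs described above (real inputs would also need the model’s resizing and normalization preprocessing):

```python
import torch

# Load the pretrained X3D-M model from the PyTorchVideo hub.
model = torch.hub.load("facebookresearch/pytorchvideo", "x3d_m", pretrained=True)
model.eval()

# Dummy input: (batch, channels, frames, height, width) = (1, 3, 16, 256, 256),
# i.e. 16 RGB frames covering the first 2 seconds of a clip.
clip = torch.randn(1, 3, 16, 256, 256)

with torch.no_grad():
    logits = model(clip)           # (1, 400): one logit per Kinetics-400 class
    probs = logits.softmax(dim=-1)

top5 = probs.topk(5)
print("Top-5 class indices:", top5.indices.tolist())
```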
Latent Space Exploration
One of Tensorleap’s most powerful features is its ability to extract and visualize a model’s latent space – not just the output of the last layer, but a smart, high-dimensional representation of how the model internally understands each video.
Using dimensionality reduction (PCA or t-SNE), Tensorleap projects this latent space into an interactive 2D panel that can be used to explore the dataset. Figure 1 shows that the five selected action classes form distinct clusters, suggesting that the model has learned strong semantic separations between them.
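Outside the platform, the same idea can be approximated with off-the-shelf tools. A rough sketch using scikit-learn’s t-SNE, assuming per-video latent vectors and labels have already been exported as NumPy arrays (file names are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Per-video latent vectors (num_samples, dim) and class indices (num_samples,).
embeddings = np.load("video_embeddings.npy")  # illustrative file name
labels = np.load("video_labels.npy")

# Project the latent space down to 2D for plotting.
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=8)
plt.title("2D projection of the video latent space")
plt.show()
```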
Latent Space Semantics
Once we had the latent space mapped out, we used Tensorleap to dig deeper – not just into class clusters, but into semantic patterns within those clusters.
One of the metadata fields we explored was the foreground–background ratio, which Tensorleap automatically extracts from each video. This metadata estimates how much of each frame is occupied by the moving subject (foreground) versus the surrounding environment (background), then averages that across the video.
When we colored the latent space based on this ratio, a new pattern emerged: samples with low foreground–background ratios appeared in one region, while those with high ratios clustered elsewhere (Figure 2). In other words, the latent space wasn’t just separating classes – it was also organizing samples based on internal visual characteristics.
To illustrate, Figure 2 presents two video samples (16 frames presented as a 4×4 grid) from opposite ends of the latent space:
- Low foreground–background ratio (upper region) – Left side: The subject occupies a small portion of the frame.
- High foreground–background ratio (lower region) – Right side: The moving subjects occupy a large portion of the frame.
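Tensorleap extracts this field automatically; purely to illustrate the underlying idea (this is not Tensorleap’s implementation), a naive frame-differencing estimate could look like this:

```python
import numpy as np

def foreground_background_ratio(frames: np.ndarray, threshold: float = 25.0) -> float:
    """Rough estimate of the fraction of each frame occupied by moving content.

    frames: (T, H, W) uint8 grayscale clip.
    threshold: pixel-difference magnitude above which a pixel counts as moving.
    """
    diffs = np.abs(frames[1:].astype(np.int16) - frames[:-1].astype(np.int16))
    moving = diffs > threshold                  # boolean motion mask per frame pair
    per_frame_ratio = moving.mean(axis=(1, 2))  # fraction of moving pixels per frame
    return float(per_frame_ratio.mean())        # averaged across the clip
```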
Similar Class Confusion
Some action classes are just visually similar, even to humans. For example, in our case: Rock Climbing and Abseiling. Both involve vertical surfaces, ropes, harnesses, and often very similar environments. Not surprisingly, these two classes ended up close together in latent space – and even partially overlapping in some regions (Figure 3).
To illustrate what the latent space represents, Figure 4 presents three samples (showing the key frame from each sample). Two samples are from the distinct class areas and one is from the class-mixed area.
The right part of Figure 4 presents Tensorleap’s heatmaps alongside the corresponding images, showing where the model focused when making predictions. The heatmaps reveal that the model’s focus areas are consistent with logical visual cues. In the distinct-class samples (left and middle columns), the model attends to the correct features – for the rock climbing example, it concentrates on wall textures and body positioning; for the abseiling example, it focuses on ropes and harnesses. However, in the sample from the class-mixed region (right column), the model primarily fixates on the surrounding rocks, explaining why it confuses abseiling with rock climbing.
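Tensorleap generates these heatmaps with its own explainability engine. For readers without access, a crude stand-in is Grad-CAM applied to a late 3D convolutional layer; the sketch below assumes a PyTorch model and a hand-picked conv module, and is not Tensorleap’s method:

```python
import torch

def grad_cam_3d(model, clip, target_class, conv_layer):
    """Minimal Grad-CAM for a 3D CNN (a stand-in, not Tensorleap's method).

    clip: (1, 3, T, H, W) input tensor; conv_layer: a late conv module in `model`.
    Returns a (T', H', W') heat volume at that layer's resolution.
    """
    activations, gradients = [], []
    fwd = conv_layer.register_forward_hook(lambda m, i, o: activations.append(o))
    bwd = conv_layer.register_full_backward_hook(lambda m, gi, go: gradients.append(go[0]))

    logits = model(clip)
    model.zero_grad()
    logits[0, target_class].backward()  # gradient of the target class score
    fwd.remove()
    bwd.remove()

    acts, grads = activations[0], gradients[0]         # both (1, C, T', H', W')
    weights = grads.mean(dim=(2, 3, 4), keepdim=True)  # per-channel importance
    cam = (weights * acts).sum(dim=1).clamp(min=0)[0]  # weighted sum + ReLU
    return cam / (cam.max() + 1e-8)                    # normalize to [0, 1]
```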
This analysis suggests that proximity in the latent space often mirrors visual and contextual similarities, and that Tensorleap’s tooling allows us to zoom into both cluster-level structure and sample-level model reasoning.
Tensorleap’s Insights
Tensorleap doesn’t just help visualize what a model is doing – it actively highlights where things might be going wrong. By analyzing the structure of the latent space, Tensorleap automatically flags clusters that might signal issues like overfitting, low accuracy, or repetitive data. Even better, it correlates these problem areas with automatically extracted metadata, offering a quick way to form a hypothesis about the root cause of the issue.
Insight: Over-Represented Cluster
One of the insights Tensorleap highlighted is the existence of a cluster with redundant information – a group of samples that is unusually dense in latent space. Insight number 7 flagged this cluster, which contains 122 samples.
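Tensorleap flags such clusters automatically; one simple proxy for “unusually dense” is the average pairwise cosine similarity of a cluster’s embeddings. A sketch, assuming the embeddings are exported as a NumPy array:

```python
import numpy as np

def mean_pairwise_cosine(embeddings: np.ndarray) -> float:
    """Average pairwise cosine similarity within a set of latent vectors.

    Values close to 1.0 suggest an unusually dense, likely redundant cluster.
    """
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    n = len(embeddings)
    # Exclude the diagonal (self-similarity) from the average.
    return float((sims.sum() - n) / (n * (n - 1)))
```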
The model achieves notably higher accuracy on this particular cluster – 92.6% compared to the dataset-wide average of 78.5%. At first glance, this suggests strong performance. However, a closer look at the cluster’s content tells a different story.
The Tensorleap insight indicates that this cluster primarily consists of samples from the “snowboarding” class. Visual inspection of the cluster reveals that many of these clips feature snowboarders filming themselves with selfie sticks (Figure 6 – top row), resulting in a set of videos that are visually and semantically repetitive: same camera angles, same movement, same framing.
This redundancy in the data raises questions about what the model is actually learning. Tensorleap’s Sample Analysis tool provides key insights here. The heatmaps, generated with respect to the snowboarding class, consistently highlight the selfie stick as a focal point in the model’s decision-making process (Figure 6 – bottom row). In other words, the model may be performing well – but for the wrong reasons.
Such overfitting to a specific visual cue can hurt generalization. If the model has learned to associate snowboarding with the presence of a selfie stick, it may struggle to correctly classify snowboarders in more varied, realistic contexts – particularly those without selfie sticks.
This insight showcases how Tensorleap’s layered analysis – from latent structure to visual inspection and attention heatmaps – can reveal subtle forms of overfitting that might otherwise go unnoticed, helping ensure that models learn features that support generalization.
Insight: High Loss Cluster
Looking only at misclassified training samples, Tensorleap highlighted a cluster of samples with high loss values. This “low performance” cluster made up about 8.4% of the filtered set.
The cause of the low performance wasn’t immediately obvious – but Tensorleap pointed us in the right direction. The platform automatically correlated this cluster with several metadata fields. Two in particular stood out (the 5th and 7th rows in the red box in Figure 7):
- SSIM mean – a measure of visual similarity between consecutive frames.
- Frame diff mean – the average pixel-level difference between frames.
In this cluster, SSIM was high and frame difference was low, meaning that the videos had little variation over time. In other words, these samples might be visually static – showing minimal motion or scene change between frames.
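Both fields are easy to approximate outside the platform. A sketch using scikit-image, assuming grayscale frames (not necessarily Tensorleap’s exact computation):

```python
import numpy as np
from skimage.metrics import structural_similarity

def temporal_variation_stats(frames: np.ndarray) -> tuple[float, float]:
    """Mean SSIM and mean absolute pixel difference between consecutive frames.

    frames: (T, H, W) uint8 grayscale clip.
    High mean SSIM plus low mean diff indicates a near-static video.
    """
    ssims, diffs = [], []
    for prev, cur in zip(frames[:-1], frames[1:]):
        ssims.append(structural_similarity(prev, cur))
        diffs.append(np.abs(cur.astype(np.int16) - prev.astype(np.int16)).mean())
    return float(np.mean(ssims)), float(np.mean(diffs))
```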
That led us to a hypothesis: these videos might not contain enough visual signal for the model to work with. And indeed, visual inspection of the cluster confirmed it – many samples consisted of an almost static background with some text overlaid (Figure 8 – top row).
Tensorleap’s heatmaps backed this up. The model focused on the only features these clips had, but those areas were static and uninformative for the task (Figure 8 – bottom row). These findings reinforce the value of automated insights: the model struggles not due to semantic ambiguity, but due to a lack of sufficient visual signal in the data.
Here we can see a classic use of Tensorleap’s error analysis flow: the insight flagged the issue, metadata correlations pointed us toward a lack of visual signal, and heatmaps confirmed the model’s attention on static, uninformative areas. Tensorleap provided all the information needed to identify, characterize, and then resolve the flagged issue.
Wrapping Up
Building robust video classifiers goes far beyond reaching high accuracy. Tensorleap takes you beyond metrics, revealing why your video classifier fails – and how to fix it. Whether it’s overfitting to a selfie stick, mixing up similar actions, or struggling with empty frames, Tensorleap surfaces the root causes behind your model’s behavior and provides actionable items.
In this post, we explored how seemingly strong performance can mask critical weaknesses, such as overfitting to redundant visual cues like selfie sticks. With Tensorleap, these issues surface early in the development cycle. By leveraging tools like Sample Analysis, latent space visualization, and heatmap overlays, teams can pinpoint failure modes, balance datasets, and avoid blind spots that would otherwise emerge in production.
Tensorleap shortens model development time by surfacing root causes of error and suggesting actionable improvements. Instead of spending weeks manually debugging performance drops, teams can quickly iterate on both models and data. The result: fewer surprises in production, better generalization, and more trustworthy AI systems.
Next Steps
Curious how you can benefit from Tensorleap? Reach out for a demo.
Want to explore this Tensorleap use-case yourself? Check out our implementation.