YouTube Video Content Cluster Analysis

Introduction

As the most popular video hosting and sharing website, YouTube hosts a huge number of videos covering diverse content types. It has long been difficult to generate or collect data about the content type of a video, and therefore to analyse the major content groups on YouTube and their characteristics. This year, Google launched a YouTube video understanding competition on Kaggle, challenging machine learning practitioners around the world to tackle the video content tagging problem. The competition provides a large sample (~4 million) of YouTube videos together with their corresponding tags. While the task of generating tags for videos is beyond the scope of this project, we can still use this labelled sample to gain some insight into the types of content that exist on YouTube. We are especially interested in the following questions:

What are the most popular types of videos on YouTube?
Of the most common video content tags, which tags often appear on the same video? What are the content clusters and what are the dominant tags in each cluster?
Are there any peculiarities (length, views, likes, comments, likes vs dislikes, etc.) in videos of a particular popular tag or cluster?
Are there any correlations between video title words and popularity?

Information about the video sampling can be found at https://research.google.com/youtube8m/. Here is the original description:

The videos are sampled uniformly to preserve the diverse distribution of popular content on YouTube, subject to a few constraints selected to ensure dataset quality and stability:

Each video must be public and have at least 1000 views
Each video must be between 120 and 500 seconds long
Each video must be associated with at least one entity from our target vocabulary
Adult & sensitive content is removed (as determined by automated classifiers)

The label data is taken from https://www.kaggle.com/c/youtube8m/data. We only use the labelled section of the data, in train_labels.csv.

We retrieve additional metadata about videos (such as title, channel, duration, views, likes, dislikes and comments) using the YouTube Data API found at https://developers.google.com/youtube/v3/.
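
The metadata retrieval described above can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes the API's videos.list endpoint with the snippet, contentDetails and statistics parts, and the helper names (build_request_url, duration_seconds, flatten_item, fetch_metadata) are hypothetical. An API key is required, and some statistics may be absent from a response, so missing counts are returned as None.

```python
import json
import re
from urllib.parse import urlencode
from urllib.request import urlopen

API_URL = "https://www.googleapis.com/youtube/v3/videos"

# ISO 8601 durations as returned in contentDetails.duration, e.g. "PT4M13S".
# Durations of a day or more are not handled in this sketch.
_DURATION = re.compile(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def build_request_url(video_ids, api_key):
    """Build a videos.list request URL for a batch of video IDs."""
    params = {
        "part": "snippet,contentDetails,statistics",
        "id": ",".join(video_ids),
        "key": api_key,
    }
    return API_URL + "?" + urlencode(params)


def duration_seconds(iso_duration):
    """Convert a duration such as "PT4M13S" to seconds, or None if unparsable."""
    match = _DURATION.match(iso_duration)
    if not match:
        return None
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def flatten_item(item):
    """Reduce one API response item to the fields used in this analysis."""
    snippet = item.get("snippet", {})
    stats = item.get("statistics", {})

    def count(name):
        # Counts arrive as strings; a missing field becomes None.
        value = stats.get(name)
        return int(value) if value is not None else None

    return {
        "id": item.get("id"),
        "title": snippet.get("title"),
        "channel": snippet.get("channelTitle"),
        "duration_s": duration_seconds(
            item.get("contentDetails", {}).get("duration", "")
        ),
        "views": count("viewCount"),
        "likes": count("likeCount"),
        "dislikes": count("dislikeCount"),
        "comments": count("commentCount"),
    }


def fetch_metadata(video_ids, api_key):
    """Fetch and flatten metadata for a batch of videos (performs a network call)."""
    with urlopen(build_request_url(video_ids, api_key)) as response:
        payload = json.load(response)
    return [flatten_item(item) for item in payload.get("items", [])]
```

Keeping the URL construction and response flattening separate from the network call makes the parsing logic easy to check without an API key.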

Description

This bundled hierarchical plot shows how closely the various YouTube video content tags are related. A connection between two tags indicates that the proportion of videos in the sample carrying both tags exceeds the given percentage. Closely related content tags are clustered into groups.

Hover over tag names to highlight related tags.

Adjust the overlap percentage to change the threshold of
#( overlapping videos ) / #( total videos )
at which we consider two tags related. Lower the percentage to discover more related tags. Increase the percentage to find more sharply defined clusters.

Change the number of tags to focus on the most common tags or to investigate a broader variety of topics.
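
The overlap threshold and the tag count can be illustrated with a small sketch. It assumes that "total videos" means all videos in the sample and that a pair counts as related when its ratio reaches the threshold; the function name related_tag_pairs and its parameters are hypothetical, not taken from the project.

```python
from collections import Counter
from itertools import combinations


def related_tag_pairs(video_tags, threshold, top_n):
    """Return {(tag_a, tag_b): overlap_ratio} for related pairs of common tags.

    video_tags: one collection of tags per video.
    threshold:  minimum #(overlapping videos) / #(total videos), as a fraction.
    top_n:      only the top_n most frequent tags are considered.
    """
    total = len(video_tags)

    # Count how many videos carry each tag, then keep the most common tags.
    tag_counts = Counter(tag for tags in video_tags for tag in set(tags))
    top_tags = {tag for tag, _ in tag_counts.most_common(top_n)}

    # Count the videos on which each pair of kept tags appears together.
    pair_counts = Counter()
    for tags in video_tags:
        kept = sorted(set(tags) & top_tags)
        pair_counts.update(combinations(kept, 2))

    # Assumption: ">=" is used here; a strict ">" is an equally valid reading.
    return {
        pair: n / total
        for pair, n in pair_counts.items()
        if n / total >= threshold
    }


videos = [
    {"Game", "Minecraft"},
    {"Game", "Minecraft"},
    {"Game", "Car"},
    {"Car"},
]
print(related_tag_pairs(videos, threshold=0.3, top_n=3))
```

Lowering the threshold to 0.2 in this toy sample would also admit the ("Car", "Game") pair, which mirrors how a lower percentage reveals more related tags in the plot.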

Controls

Group (Cluster)
Order: Words of Fame / Words of Shame

Description

This bubble chart shows the result of a latent semantic analysis (LSA). We compute TF-IDF weights for the top 10,000 bi-grams across all English video titles to build a bag-of-words (BOW) representation of each title, then apply a singular value decomposition (SVD, similar to PCA) to combine similar bi-grams and reduce the dimensionality to 100. This also reduces the correlation between the features of the BOW model. Finally, we perform a linear regression of log(views) on the title representation in the 100-dimensional space.
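
The pipeline above can be sketched with NumPy as follows. This is a simplified illustration under several assumptions, not the project's implementation: bi-grams are ranked by document frequency, IDF is computed as log(N / df) + 1, lemmatisation and stop-word removal are omitted, and a dense SVD is used, which would not scale to millions of titles (a sparse truncated SVD would be needed there). All function names are hypothetical.

```python
import math
from collections import Counter

import numpy as np


def bigrams(title):
    """Split a title into lower-cased word bi-grams."""
    words = title.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]


def tfidf_matrix(titles, vocab_size):
    """Build a dense TF-IDF matrix over the most common bi-grams."""
    docs = [Counter(bigrams(title)) for title in titles]
    doc_freq = Counter(term for doc in docs for term in doc)
    vocab = [term for term, _ in doc_freq.most_common(vocab_size)]
    index = {term: j for j, term in enumerate(vocab)}
    matrix = np.zeros((len(docs), len(vocab)))
    for i, doc in enumerate(docs):
        for term, tf in doc.items():
            if term in index:
                idf = math.log(len(docs) / doc_freq[term]) + 1.0
                matrix[i, index[term]] = tf * idf
    return matrix, vocab


def lsa_regression(titles, views, vocab_size=10000, n_components=100):
    """Reduce titles with SVD, then regress log(views) on the reduced features."""
    matrix, vocab = tfidf_matrix(titles, vocab_size)
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    k = min(n_components, len(s))
    reduced = u[:, :k] * s[:k]  # title coordinates in the k-dimensional space
    design = np.hstack([reduced, np.ones((len(titles), 1))])  # add intercept
    coef, *_ = np.linalg.lstsq(design, np.log(views), rcond=None)
    return coef[:k], vt[:k], vocab


def top_terms(component, vocab, n=10):
    """Bi-grams with the largest absolute loading in one SVD component."""
    order = np.argsort(-np.abs(component))[:n]
    return [(vocab[j], float(component[j])) for j in order]
```

In this sketch, components with the largest positive regression coefficients would correspond to "Words of Fame" and those with the most negative coefficients to "Words of Shame"; top_terms returns the signed loadings that a detailed view could display for one component.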

How to read:

The vectors most correlated with more views (or fewer views, depending on whether "Words of Fame" or "Words of Shame" is selected) are shown in the bubble chart, each captioned with the most significant word in the vector. The caption words offer a quick way to identify the most popular or least popular (but still common) topics on YouTube.

Click on a bubble to zoom in to the detailed view. In the detailed view, the 10 most significant words are displayed as child bubbles. Bubble size is proportional to the contribution of that word to the vector. Colour indicates whether the word contributes positively or negatively to the vector. A green word bubble implies either that the word commonly appears together with the other positive (green) words overall, or that when they appear together in a title, they are more likely to contribute to higher (or lower) views. Conversely, an orange word bubble indicates that the word does not normally appear together with the positive (green) words, or that the positive words tend to contribute more to views without it.

Click again in the detailed view to return to the default view.

Change the group selection option to focus on individual clusters.

Note: due to the use of a lemmatiser and the removal of stop words, some words may not appear in their correct form within a phrase and some words might be missing (e.g. "league legend" or "star war"). Also, because the language detector makes mistakes, some non-English words end up on the bubble chart.