Music Industry Analysis With Unsupervised and Supervised Machine Learning — Text Classification
To confirm the finding from the previous two parts (the clustering and classification models) that whether a song is explicit affects its chance of winning a Grammy, we ran a text classification model.
TEXT CLASSIFICATION
Data Preprocessing
We combined the billboardHot100_1999–2019.csv dataset from Kaggle with the dataset we have been using for the previous models, keeping only four variables (Name, Lyrics, Explicit, Won_grammy) and 7,689 data points in total.
The link for the billboardHot100_1999–2019.csv can be found below.
Exploratory Data Analysis (EDA)
Before modeling, the EDA shows that songs without explicit content have a higher chance of winning a Grammy award.
Word Clouds
Text cleaning steps for the word clouds (a code sketch follows this list):
- lowercase the text
- tokenize the text (split it into words) and remove the punctuation
- remove tokens that contain numbers
- remove stop words such as 'the', 'a', 'this', etc.
- Part-Of-Speech (POS) tagging: assign a tag to every word indicating whether it is a noun, a verb, etc., using the WordNet lexical database
- lemmatize the text: transform every word into its root form (e.g. rooms -> room, slept -> sleep)
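As a reference, here is a minimal sketch of this cleaning pipeline using NLTK; the function and variable names are illustrative, not our original code.

```python
# A sketch of the cleaning steps above using NLTK (names are illustrative).
import string
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

for resource in ("punkt", "stopwords", "averaged_perceptron_tagger", "wordnet"):
    nltk.download(resource)

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def to_wordnet_pos(treebank_tag):
    """Map a Penn Treebank POS tag to the matching WordNet POS constant."""
    if treebank_tag.startswith("J"):
        return wordnet.ADJ
    if treebank_tag.startswith("V"):
        return wordnet.VERB
    if treebank_tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN

def clean_lyrics(text):
    text = text.lower()                                    # lowercase
    tokens = nltk.word_tokenize(text)                      # tokenize
    tokens = [t for t in tokens if t not in string.punctuation]      # drop punctuation
    tokens = [t for t in tokens if not any(c.isdigit() for c in t)]  # drop tokens with numbers
    tokens = [t for t in tokens if t not in STOP_WORDS]    # drop stop words
    tagged = nltk.pos_tag(tokens)                          # POS tagging
    return [LEMMATIZER.lemmatize(t, to_wordnet_pos(tag))   # lemmatize (slept -> sleep)
            for t, tag in tagged]
```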
Wordcloud for explicit songs
Wordcloud for non-explicit songs
Some words appear in both the explicit and the non-explicit word clouds. This is because most songs in this dataset have two versions, explicit and clean: artists usually only mask the dirty words when producing the clean version, so most of the lyrics are identical. Still, some bad words such as 'nigga' are captured in the first word cloud. There are also some errors in these word clouds, but they don't hurt our modeling.
Feature Engineering
1. Adding two more columns to the dataset (a short sketch follows the list). We don't expect these to affect model performance; they are just a routine part of text classification:
- number of characters in the text
- number of words in the text
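A quick sketch of these two columns with pandas; the DataFrame `df` and its 'Lyrics' column are assumptions:

```python
# Hypothetical DataFrame `df` with a 'Lyrics' column of raw song lyrics.
df["char_count"] = df["Lyrics"].str.len()               # number of characters
df["word_count"] = df["Lyrics"].str.split().str.len()   # number of words
```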
2. Word embeddings
The next step consists of extracting vector representations of the lyrics. The Gensim module builds a numerical vector representation of every word in the corpus from the contexts in which it appears (Word2Vec). This is done with shallow neural networks, and similar words end up with similar representation vectors.
Each text can also be transformed into a numerical vector built from its word vectors (Doc2Vec). Similar texts then have similar representations, which is why we can use those vectors as training features.
We first had to train a Doc2Vec model on our text data. Applying this model to our texts yields the representation vectors.
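A minimal Doc2Vec sketch with Gensim; the vector size and other parameters are assumptions, and `cleaned_lyrics` stands in for our tokenized lyrics:

```python
# Doc2Vec sketch with Gensim; vector_size and the other parameters are assumptions.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# cleaned_lyrics: list of token lists, e.g. the output of clean_lyrics() above.
documents = [TaggedDocument(words=tokens, tags=[i])
             for i, tokens in enumerate(cleaned_lyrics)]

d2v = Doc2Vec(documents, vector_size=100, window=5, min_count=2, workers=4)

# One representation vector per song; these become training features.
doc_vectors = [d2v.infer_vector(tokens) for tokens in cleaned_lyrics]
```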
3. TF-IDF (Term Frequency–Inverse Document Frequency)
Finally, we added TF-IDF values for every word and every document. TF counts how many times a word appears in a text; IDF weighs the word's relative importance by how many texts it appears in. This method surfaces rare words that carry more meaning than common ones. We could have simply counted how many times each word appears in every document, but that would ignore the relative importance of words in the texts: a word that appears in almost every text is unlikely to bring useful information for analysis.
We added TF-IDF columns only for words that appear in at least 10 different texts, which filters some of them out and reduces the size of the final output.
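This filtering maps directly to scikit-learn's min_df parameter; a minimal sketch, assuming the lyrics are available as raw strings:

```python
# TF-IDF features; min_df=10 keeps only words found in at least 10 documents.
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=10)
tfidf = vectorizer.fit_transform(df["Lyrics"])  # sparse matrix, one row per song
print(tfidf.shape)                              # (7689, number_of_kept_words)
```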
Modeling
1. Random Forest
We used a regular random forest model without tuning hyperparameters. The result is not ideal: prediction on the true class (Grammy winners) is poor, with an F1 score of only 5%.
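A sketch of this baseline, assuming `X` holds the engineered features (Doc2Vec vectors, TF-IDF columns, and the count features) and `y` the Won_grammy labels:

```python
# Baseline random forest with default hyperparameters; X and y are assumptions.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
print(classification_report(y_test, rf.predict(X_test)))  # per-class F1 scores
```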
Feature Importance
From the result below, we can see that the vector representations of the texts carry a lot of importance in training. However, this model only captures very common words and misses special words such as explicit terms, so based on this model alone we are not confident saying that explicit content impacts the chance of winning a Grammy.
2. Random Forest with Undersampling
We first used the RandomUnderSampler function to balance the two classes 50/50, then grid searched max_depth and min_samples_split. This model performs much better than the previous one: the F1 score of the true class improved by 25%, while the F1 score of the false class decreased by 17%.
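A sketch of this setup using imbalanced-learn and scikit-learn; the candidate values in the grid are assumptions, only the two tuned hyperparameters come from our run:

```python
# 50/50 undersampling with imbalanced-learn, then a grid search over the
# two hyperparameters named above (the candidate values are assumptions).
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rus = RandomUnderSampler(sampling_strategy=1.0, random_state=42)  # 1.0 -> 50/50
X_res, y_res = rus.fit_resample(X_train, y_train)

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"max_depth": [5, 10, 20, None],
                "min_samples_split": [2, 5, 10]},
    scoring="f1",
    cv=5,
)
grid.fit(X_res, y_res)
print(grid.best_params_)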
Feature Importance
From the result below, we can see that the vector representations of the texts have some importance in training. The model also captures special words such as 'shit' and 'fuck', which tells us that explicit content does impact the chance of winning a Grammy.
3. XGBoost
We random searched 9 hyperparameters ('max_depth', 'learning_rate', 'n_estimators', 'min_child_weight', 'subsample', 'colsample_bytree', 'colsample_bylevel', 'reg_alpha', 'reg_lambda') and used scale_pos_weight=3.5 for the XGBoost model. The benefits of XGBoost were explained in the previous part. Surprisingly, this model does not improve the F1 score of the true class, but rather that of the false class; as a result, the overall accuracy increased by 6%.
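A sketch of the random search; only scale_pos_weight=3.5 and the nine parameter names come from our run, while the search distributions and n_iter are assumptions:

```python
# Random search over the nine XGBoost hyperparameters; the ranges are assumptions.
from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

param_distributions = {
    "max_depth": randint(3, 10),
    "learning_rate": uniform(0.01, 0.3),
    "n_estimators": randint(100, 1000),
    "min_child_weight": randint(1, 10),
    "subsample": uniform(0.5, 0.5),          # samples values in [0.5, 1.0]
    "colsample_bytree": uniform(0.5, 0.5),
    "colsample_bylevel": uniform(0.5, 0.5),
    "reg_alpha": uniform(0.0, 1.0),
    "reg_lambda": uniform(0.0, 1.0),
}

search = RandomizedSearchCV(
    XGBClassifier(scale_pos_weight=3.5, random_state=42),
    param_distributions,
    n_iter=50,
    scoring="f1",
    cv=5,
    random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_)
```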
Feature Importance
From the result below, the vector representations of the texts do not show importance in training for this model, and it does not capture special words the way the previous model does; mostly common words are captured.
Conclusion
We can conclude that explicit content impacts the chance of winning a Grammy award, even though the XGBoost model does not show this result; the random forest with undersampling does confirm our hypothesis. We believe that with further improvements the XGBoost model would reach the same finding. In the future, we could grid search the hyperparameters; we only used random search this time, and tuning still took roughly 15 hours.
To be continued — The changes in streaming habits