Music Industry Analysis With Unsupervised and Supervised Machine Learning — Text Classification
To confirm the finding from the previous two parts (the clustering and classification models) that whether a song is explicit affects its chance of winning a Grammy, we ran a text classification model.
TEXT CLASSIFICATION
Data Preprocessing
We combined the billboardHot100_1999–2019.csv dataset from Kaggle with the dataset we have been using for the previous models, keeping only four variables (Name, Lyrics, Explicit, Won_grammy) and 7,689 data points in total.
The link for the billboardHot100_1999–2019.csv can be found below.
Exploratory Data Analysis (EDA)
Before modeling, the EDA shows that songs without explicit content have a higher chance of winning a Grammy award.
Word Clouds
Text cleaning steps for the word clouds (a code sketch follows this list):
- lowercase the text
- tokenize the text (split it into words) and remove the punctuation
- remove tokens that contain numbers
- remove stop words such as 'the', 'a', 'this', etc.
- Part-Of-Speech (POS) tagging: assign a tag to every word indicating whether it is a noun, a verb, etc., using the WordNet lexical database
- lemmatize the text: transform every word into its root form (e.g. rooms -> room, slept -> sleep)
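As a reference, here is a minimal sketch of this cleaning pipeline using NLTK; the function and variable names are illustrative, not our original code.

```python
# A sketch of the cleaning steps above using NLTK (names are illustrative).
import string
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

for resource in ("punkt", "stopwords", "averaged_perceptron_tagger", "wordnet"):
    nltk.download(resource)

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def to_wordnet_pos(treebank_tag):
    """Map a Penn Treebank POS tag to the matching WordNet POS constant."""
    if treebank_tag.startswith("J"):
        return wordnet.ADJ
    if treebank_tag.startswith("V"):
        return wordnet.VERB
    if treebank_tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN

def clean_lyrics(text):
    text = text.lower()                                    # lowercase
    tokens = nltk.word_tokenize(text)                      # tokenize
    tokens = [t for t in tokens if t not in string.punctuation]      # drop punctuation
    tokens = [t for t in tokens if not any(c.isdigit() for c in t)]  # drop tokens with numbers
    tokens = [t for t in tokens if t not in STOP_WORDS]    # drop stop words
    tagged = nltk.pos_tag(tokens)                          # POS tagging
    return [LEMMATIZER.lemmatize(t, to_wordnet_pos(tag))   # lemmatize (slept -> sleep)
            for t, tag in tagged]
```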
Wordcloud for explicit songs
Wordcloud for non-explicit songs
Some words appear in both the explicit and the non-explicit word clouds. This is because most songs in this dataset have two versions, explicit and clean: artists usually only mask the dirty words when producing the clean version, so most of the lyrics are identical. Still, some bad words such as 'nigga' are captured in the first word cloud. There are also some errors in these word clouds, but they don't hurt our modeling.
Feature Engineering
1. Adding two more columns to the dataset (a short sketch follows the list). We don't expect these to affect model performance; they are just a routine part of text classification:
- number of characters in the text
- number of words in the text
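A quick sketch of these two columns with pandas; the DataFrame `df` and its 'Lyrics' column are assumptions:

```python
# Hypothetical DataFrame `df` with a 'Lyrics' column of raw song lyrics.
df["char_count"] = df["Lyrics"].str.len()               # number of characters
df["word_count"] = df["Lyrics"].str.split().str.len()   # number of words
```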
2. Word embeddings
The next step consists of extracting vector representations of the lyrics. The Gensim module builds a numerical vector representation of every word in the corpus from the contexts in which it appears (Word2Vec). This is done with shallow neural networks, and similar words end up with similar representation vectors.
Each text can also be transformed into a numerical vector built from its word vectors (Doc2Vec). Similar texts then have similar representations, which is why we can use those vectors as training features.
We first had to train a Doc2Vec model on our text data. Applying this model to our texts yields the representation vectors.
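A minimal Doc2Vec sketch with Gensim; the vector size and other parameters are assumptions, and `cleaned_lyrics` stands in for our tokenized lyrics:

```python
# Doc2Vec sketch with Gensim; vector_size and the other parameters are assumptions.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# cleaned_lyrics: list of token lists, e.g. the output of clean_lyrics() above.
documents = [TaggedDocument(words=tokens, tags=[i])
             for i, tokens in enumerate(cleaned_lyrics)]

d2v = Doc2Vec(documents, vector_size=100, window=5, min_count=2, workers=4)

# One representation vector per song; these become training features.
doc_vectors = [d2v.infer_vector(tokens) for tokens in cleaned_lyrics]
```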
3. TF-IDF (Term Frequency–Inverse Document Frequency)
Finally, we added TF-IDF values for every word and every document. TF counts how many times a word appears in a text; IDF weighs the word's relative importance by how many texts it appears in. This method surfaces rare words that carry more meaning than common ones. We could have simply counted how many times each word appears in every document, but that would ignore the relative importance of words in the texts: a word that appears in almost every text is unlikely to bring useful information for analysis.
We added TF-IDF columns only for words that appear in at least 10 different texts, which filters some of them out and reduces the size of the final output.
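This filtering maps directly to scikit-learn's min_df parameter; a minimal sketch, assuming the lyrics are available as raw strings:

```python
# TF-IDF features; min_df=10 keeps only words found in at least 10 documents.
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=10)
tfidf = vectorizer.fit_transform(df["Lyrics"])  # sparse matrix, one row per song
print(tfidf.shape)                              # (7689, number_of_kept_words)
```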
Modeling
1. Random Forest
We used a regular random forest model without tuning hyperparameters. The result is not ideal: prediction on the true class (Grammy winners) is poor, with an F1 score of only 5%.
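A sketch of this baseline, assuming `X` holds the engineered features (Doc2Vec vectors, TF-IDF columns, and the count features) and `y` the Won_grammy labels:

```python
# Baseline random forest with default hyperparameters; X and y are assumptions.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
print(classification_report(y_test, rf.predict(X_test)))  # per-class F1 scores
```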
Feature Importance
From the result below, we can see that the vector representations of the texts carry a lot of importance in training. However, this model only captures very common words and misses special words such as explicit terms, so based on this model alone we are not confident saying that explicit content impacts the chance of winning a Grammy.
2. Random Forest with Undersampling
We first used the RandomUnderSampler function to balance the two classes 50/50, then grid searched max_depth and min_samples_split. This model performs much better than the previous one: the F1 score of the true class improved by 25%, while the F1 score of the false class decreased by 17%.
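A sketch of this setup using imbalanced-learn and scikit-learn; the candidate values in the grid are assumptions, only the two tuned hyperparameters come from our run:

```python
# 50/50 undersampling with imbalanced-learn, then a grid search over the
# two hyperparameters named above (the candidate values are assumptions).
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rus = RandomUnderSampler(sampling_strategy=1.0, random_state=42)  # 1.0 -> 50/50
X_res, y_res = rus.fit_resample(X_train, y_train)

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"max_depth": [5, 10, 20, None],
                "min_samples_split": [2, 5, 10]},
    scoring="f1",
    cv=5,
)
grid.fit(X_res, y_res)
print(grid.best_params_)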
Feature Importance
From the result below, we can see that the vector representations of the texts have some importance in training. The model also captures special words such as 'shit' and 'fuck', which tells us that explicit content does impact the chance of winning a Grammy.
3. XGBoost
We random searched 9 hyperparameters ('max_depth', 'learning_rate', 'n_estimators', 'min_child_weight', 'subsample', 'colsample_bytree', 'colsample_bylevel', 'reg_alpha', 'reg_lambda') and used scale_pos_weight=3.5 for the XGBoost model. The benefits of XGBoost were explained in the previous part. Surprisingly, this model does not improve the F1 score of the true class, but rather that of the false class; as a result, the overall accuracy increased by 6%.
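A sketch of the random search; only scale_pos_weight=3.5 and the nine parameter names come from our run, while the search distributions and n_iter are assumptions:

```python
# Random search over the nine XGBoost hyperparameters; the ranges are assumptions.
from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

param_distributions = {
    "max_depth": randint(3, 10),
    "learning_rate": uniform(0.01, 0.3),
    "n_estimators": randint(100, 1000),
    "min_child_weight": randint(1, 10),
    "subsample": uniform(0.5, 0.5),          # samples values in [0.5, 1.0]
    "colsample_bytree": uniform(0.5, 0.5),
    "colsample_bylevel": uniform(0.5, 0.5),
    "reg_alpha": uniform(0.0, 1.0),
    "reg_lambda": uniform(0.0, 1.0),
}

search = RandomizedSearchCV(
    XGBClassifier(scale_pos_weight=3.5, random_state=42),
    param_distributions,
    n_iter=50,
    scoring="f1",
    cv=5,
    random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_)
```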
Feature Importance
From the result below, the vector representations of the texts do not show importance in training for this model, and it does not capture special words the way the previous model does; mostly common words are captured.
Conclusion
We can conclude that explicit content impacts the chance of winning a Grammy award, even though the XGBoost model does not show this result; the random forest with undersampling does confirm our hypothesis. We believe that with further improvements the XGBoost model would reach the same finding. In the future, we could grid search the hyperparameters; we only used random search this time, and tuning still took roughly 15 hours.
To be continued — The changes in streaming habits