Music Industry Analysis With Unsupervised and Supervised Machine Learning, Part 2: Binary Classification
This is part two of our project.
SUPERVISED MACHINE LEARNING
Classification models
Our goal is to predict which songs are going to win Grammy Awards based on track attributes such as Liveness, Loudness, Mode, Popularity, Speechiness, Tempo, Time Signature, and Valence. This is a binary classification problem: the label is 1 for the True class (won a Grammy) and 0 for the False class (did not win). The best model is the one that gives the highest F1 score without overfitting.
Exploratory Data Analysis (EDA)
The most important EDA finding for modeling is that this dataset has moderately imbalanced classes, which is common in machine learning. The classes split 87% False to 13% True: 136,464 songs did not win a Grammy and 18,467 songs did, so the False class is roughly 7 times larger than the True class.
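As a quick check, here is a minimal sketch of how this distribution can be verified with pandas; the file name and the label column name ('Grammy') are assumptions used only for illustration.

```python
import pandas as pd

# Hypothetical file and label column names, for illustration only.
df = pd.read_csv("songs.csv")
print(df["Grammy"].value_counts())                 # absolute counts per class
print(df["Grammy"].value_counts(normalize=True))   # class proportions (~87% / ~13%)
```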
Data Preprocessing
This dataset has 18 variables, both categorical and numerical, and no data quality problems such as missing values. For this problem we drop the unnecessary identifier variables ('Album', 'Artist', 'Name'), leaving the 14 modeling variables listed under the goal above. Next, we one-hot encode (dummy-encode) the dataset, because two of the modeling variables are categorical, split it into predictors (X) and the target variable (y), and then split X and y into a training set (75%) and a testing set (25%).
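The steps above can be sketched roughly as follows; the target column name ('Grammy') is an assumption, and the exact split options may differ from what we actually ran.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Drop identifier columns that do not help prediction.
df = df.drop(columns=["Album", "Artist", "Name"])

# One-hot encode the two categorical variables.
df = pd.get_dummies(df, drop_first=True)

# Separate predictors from the binary target (1 = won a Grammy).
X = df.drop(columns=["Grammy"])
y = df["Grammy"]

# 75/25 train/test split; stratify keeps the class ratio in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
```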
Feature Selection
We use a heatmap to examine the correlations between variables. The heatmap shows that only two variables ('Loudness' and 'Energy') are highly correlated with each other, with a correlation of 0.78. We keep both for prediction, because removing one of a pair of correlated variables does not always change performance, so we train the model on all 14 variables.
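A minimal sketch of the correlation heatmap, assuming seaborn is available and reusing X_train from the preprocessing step above.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise correlations between the numeric predictors.
corr = X_train.corr(numeric_only=True)

plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation between predictors")
plt.tight_layout()
plt.show()
```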
Methodologies
The methods we use for modeling are cross validation, grid search and random search over hyperparameters, undersampling, oversampling, feature selection using models, and feature importance evaluation.
Modeling
We train Random Forest, Balanced Random Forest, Random Forest + undersampling + grid search, Random Forest + oversampling, Extreme Gradient Boosting (XGBoost) + undersampling, and XGBoost + random search as our models. We chose XGBoost + random search as the final model because it gave us the highest F1 score for the minority class.
1. Random Forest
The first model we tried is a Random Forest with class_weight={0:1, 1:7} and no hyperparameter tuning. The overall testing accuracy is good, but the recall is extremely low, which drags down the F1 score of the True class as well. In addition, this model overfits: the training F1 score for the True class is extremely high (98%), while the testing result is poor (13%). Since we care most about the True class, we do not keep this model.
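A sketch of this baseline, reusing the train/test split from above; apart from the class weight, hyperparameters are left at their defaults.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Weight the True class about 7x to mirror the class ratio.
rf = RandomForestClassifier(class_weight={0: 1, 1: 7}, random_state=42)
rf.fit(X_train, y_train)

# Comparing train vs test reports exposes the overfitting described above.
print(classification_report(y_train, rf.predict(X_train)))
print(classification_report(y_test, rf.predict(X_test)))
```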
2. Balanced Random Forest
The Balanced Random Forest model with sampling_strategy='auto' performs better than the previous Random Forest on the True class: its F1 score improves by 18%, and comparing training and testing F1 scores shows it is not overfitting. However, the F1 score for the False class drops by 20%. Although this model correctly identifies more Grammy-winning songs, we still want to improve the precision of the True class, because the model produces many false positive misclassifications.
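A sketch using imbalanced-learn's BalancedRandomForestClassifier, which undersamples the majority class within each bootstrap sample so every tree sees a roughly balanced distribution.

```python
from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.metrics import classification_report

brf = BalancedRandomForestClassifier(sampling_strategy="auto", random_state=42)
brf.fit(X_train, y_train)

print(classification_report(y_test, brf.predict(X_test)))
```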
3. Random Forest & Undersampling
To improve the precision of the True class, we tune the hyperparameters of the Random Forest model combined with undersampling. We first use the RandomUnderSampler function to rebalance the classes to a 50/50 distribution, then use GridSearchCV with five-fold cross validation to tune max_depth, min_samples_split, n_estimators, and bootstrap. This fits five folds for each of 56 candidates, totaling 280 fits. The model performs better than the Balanced Random Forest: the F1 scores of the True class and False class improve by 3% and 2%, respectively. However, the model still overfits on the True class.
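A sketch of the undersampling plus grid search step; the parameter grid below is illustrative rather than the exact 56-candidate grid we ran.

```python
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Undersample the majority class down to a 50/50 distribution.
rus = RandomUnderSampler(sampling_strategy=1.0, random_state=42)
X_res, y_res = rus.fit_resample(X_train, y_train)

# Illustrative grid over the four hyperparameters named above.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [5, 10, None],
    "min_samples_split": [2, 10],
    "bootstrap": [True, False],
}

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid, scoring="f1", cv=5, n_jobs=-1,
)
grid.fit(X_res, y_res)
print(grid.best_params_)
```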
4. Extreme Gradient Boosting (XGBoost)
To improve the F1 score while addressing the concern of overfitting, we use the XGBoost algorithm to boost the True class and constrain the model from overfitting. The algorithm uses advanced regularization (L1 and L2), which improves generalization and helps avoid overfitting. XGBoost builds on ideas from bagging, random forests, boosting, and gradient boosting, so one algorithm performs all of these tasks. The chart below summarizes each tree-based algorithm and why XGBoost is the appropriate choice for our concerns.
XGBoost with the default binary:logistic objective gives us an F1 score of 88% for the False class and 39% for the True class. These results are the best of all the models, even though the recall of the True class is not the highest. This model gives us the best balance between precision and recall: although the previous models achieved higher recall, meaning they identified more of the true Grammy winners, they also had lower precision, meaning more false positive misclassifications.
We use the default tree booster for this model and tune 9 hyperparameters: 'max_depth', 'learning_rate', 'n_estimators', 'min_child_weight', 'subsample', 'colsample_bytree', 'colsample_bylevel', 'reg_alpha', and 'reg_lambda'. These are the most commonly tuned parameters, and the tuning range of each one follows the recommendations in the Amazon developer guide. Tuning them helps us avoid overfitting and improve performance: parameters such as max_depth, min_child_weight, and the regularization parameters (lambda, alpha) control model complexity, while subsample, colsample_bytree, and a reduced learning rate add randomness that makes training robust to noise. We used RandomizedSearchCV instead of GridSearchCV because of the large number of data points; this fits 5 folds for each of 10 candidates, totaling 50 fits. After obtaining the best parameters, and because the dataset has imbalanced classes, we ran a for loop over different class weights, comparing F1 scores and plotting Precision-Recall (PR) curves to compare the results visually. From this, a weight of 7.5 is the optimal value that gives the highest F1 score.
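A sketch of the random search and the weight sweep; the search ranges and the list of candidate weights are illustrative, not the exact values we used.

```python
from scipy.stats import randint, uniform
from sklearn.metrics import f1_score
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

# Illustrative ranges for the nine hyperparameters named above.
param_dist = {
    "max_depth": randint(3, 10),
    "learning_rate": uniform(0.01, 0.3),
    "n_estimators": randint(100, 500),
    "min_child_weight": randint(1, 10),
    "subsample": uniform(0.5, 0.5),
    "colsample_bytree": uniform(0.5, 0.5),
    "colsample_bylevel": uniform(0.5, 0.5),
    "reg_alpha": uniform(0.0, 1.0),
    "reg_lambda": uniform(0.0, 2.0),
}

search = RandomizedSearchCV(
    XGBClassifier(objective="binary:logistic"),
    param_dist, n_iter=10, scoring="f1", cv=5, n_jobs=-1, random_state=42,
)
search.fit(X_train, y_train)

# With the best parameters fixed, sweep the class weight and compare F1 scores.
for w in [1, 5, 7.5, 10, 15, 30]:
    model = XGBClassifier(objective="binary:logistic",
                          scale_pos_weight=w, **search.best_params_)
    model.fit(X_train, y_train)
    print(w, round(f1_score(y_test, model.predict(X_test)), 3))
```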
Feature Importance
We use the SelectFromModel function to calculate the feature importance of each input variable, which lets us test successive subsets of features ranked by importance, starting with all features and ending with only the most important one. The chart below shows that, when we model with all 14 variables, the most important feature is Explicit_True with 71% importance; in effect, the model relies almost entirely on Explicit_True for its predictions. To see whether the model would improve, we ran another model keeping only the top 7 variables and performed a new random search for hyperparameters. However, the F1 score of the True class drops to 32%, 7 percentage points lower than with all variables, and Explicit_True is still the most important feature.
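A sketch of the SelectFromModel loop: features are ranked by importance from a fitted XGBoost model, and the model is retrained on each nested subset.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import f1_score
from xgboost import XGBClassifier

# Fit once to obtain feature importances.
base = XGBClassifier(objective="binary:logistic")
base.fit(X_train, y_train)

# Each threshold keeps only features with importance >= threshold,
# moving from all 14 features down to the single most important one.
for thresh in np.sort(base.feature_importances_):
    selector = SelectFromModel(base, threshold=thresh, prefit=True)
    X_tr = selector.transform(X_train)
    X_te = selector.transform(X_test)

    model = XGBClassifier(objective="binary:logistic")
    model.fit(X_tr, y_train)
    score = f1_score(y_test, model.predict(X_te))
    print(f"threshold={thresh:.3f}, n_features={X_tr.shape[1]}, f1={score:.3f}")
```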
Other Thoughts:
1. Oversampling vs Undersampling
We also tried oversampling, using the SMOTE function to balance the classes before modeling. However, the results were worse than with random undersampling, and the method took more time to run. We also tried combining oversampling and undersampling in a Pipeline to find the best oversampling and undersampling rates, but that model did not perform well either. Many researchers recommend undersampling when time is limited, and it often returns better results than oversampling, but there is no guarantee that one method always outperforms the other on imbalanced data. If you have enough time, you should try several approaches: oversampling, undersampling, and their combination.
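A sketch of the combined approach using an imbalanced-learn Pipeline; the two sampling ratios are illustrative placeholders for the rates we searched over.

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

pipeline = Pipeline(steps=[
    # Oversample the minority class up to 30% of the majority...
    ("over", SMOTE(sampling_strategy=0.3, random_state=42)),
    # ...then undersample the majority until the ratio is 1:2.
    ("under", RandomUnderSampler(sampling_strategy=0.5, random_state=42)),
    ("model", XGBClassifier(objective="binary:logistic")),
])

scores = cross_val_score(pipeline, X_train, y_train, scoring="f1", cv=5, n_jobs=-1)
print(scores.mean())
```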
2. Handle Imbalanced Data
Following the XGBoost documentation, we tried both the scale_pos_weight and max_delta_step parameters to handle the class imbalance. However, the performance with max_delta_step was much poorer than with scale_pos_weight.
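For reference, the two options look like this; the specific values are illustrative, not tuned.

```python
from xgboost import XGBClassifier

# Reweight positive examples by roughly the negative/positive ratio.
weighted = XGBClassifier(objective="binary:logistic", scale_pos_weight=7.5)

# Alternative from the docs: cap each tree's weight update (values around 1-10).
capped = XGBClassifier(objective="binary:logistic", max_delta_step=1)
```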
3. PR Curve vs ROC Curve
The AUC score and ROC curve are two common ways to evaluate classification performance in machine learning. However, that is not the case when classes are highly imbalanced. Our dataset is only moderately imbalanced, but we still find the PR curve a better fit than the ROC curve. The charts below confirm this.
For a PR curve, a good classifier aims for the upper right corner of the chart, while for a ROC curve it aims for the upper left.
While the PR and ROC curves use the same data, the two charts tell different stories, with some weights appearing to perform better on the ROC curve than on the PR curve. For example, as the weight increases the PR curve tends to get worse, except when the weight equals 1, yet it is hard to tell from the ROC curves which of the weights 7.5, 15, and 30 is best. The ROC curve is not a good visual illustration for moderately or highly imbalanced data, because the False Positive Rate (false positives / total real negatives) barely rises even when the classifier makes many false positive errors, since the number of real negatives is huge. Precision (true positives / (true positives + false positives)), by contrast, is highly sensitive to false positives and is not diluted by a large real-negative denominator.
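A small numeric illustration of this point: the negative count matches our dataset, while the prediction counts are hypothetical.

```python
# Why the ROC curve can look deceptively good on imbalanced data.
negatives = 136_464       # songs that did not win a Grammy (from our dataset)
true_positives = 3_000    # hypothetical correct Grammy predictions
false_positives = 3_000   # hypothetical incorrect Grammy predictions

fpr = false_positives / negatives                                 # about 0.022
precision = true_positives / (true_positives + false_positives)   # 0.50

# The ROC x-axis barely moves, while precision makes the same errors obvious.
print(f"FPR = {fpr:.3f}, precision = {precision:.2f}")
```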
Recommendations
From our classification model, whether a song is explicit has the biggest impact on winning a Grammy award, which is consistent with the clustering output above. An explicit track is one that contains curse words or language or art that is sexual, violent, or offensive in nature, so we can recommend that song producers avoid making songs explicit.

Winning a Grammy is a major accomplishment. Songwriters, producers, and other behind-the-scenes roles often enjoy the biggest benefits from Grammy wins; the awards are a bigger deal for them, and regardless of the category, a Grammy opens all kinds of doors in just about any industry. Grammy winners do not take cash home the moment they receive the award, but money comes later: before winning a Grammy, producers charge on average $30,000 to $50,000 per track, and after winning, that rises to about $75,000 per track (Marketplace, 2021). Any performer or producer who wins on the music industry's biggest night can expect at least 55% more in concert ticket sales compared with before the win (Harding, 2021). Beyond concerts, more businesses invite winners to commercial events. During social distancing, the music industry is struggling to survive. Our classification model can help music producers understand their chances of winning a Grammy based on song attributes and which attributes matter most. There are still many other factors that affect winning a Grammy; this simply opens a direction for music producers making songs in the future.
Further Improvement
We can try grid search over the hyperparameters with a wider range for each parameter in the XGBoost model. We did not use this method this time because, with our large dataset, it would take at least 25 hours to finish the tuning; we initially tried to tune the nine hyperparameters this way, but after 11 hours it still had not finished. To save time, we used random search instead, which took 2 hours to train.
Second, we believe that if other features, such as song genre, were added to this dataset, performance might improve further. In the future, we can try adding more features to the model.
Third, we can also try tuning the hyperparameters of the Balanced Random Forest model, or tuning the Random Forest model combined with undersampling. We did not do this here, first because it would also take a long time, and second because many researchers have noted that even with tuning, the low precision of the minority class may persist.
Fourth, we can also try a deep learning model such as a Long Short-Term Memory (LSTM) network.
To be continued: Text Classification