Music Industry Analysis With Unsupervised and Supervised Machine Learning: Clustering
INTRODUCTION
The music industry has changed considerably over the past decade because of the digitization of music and the evolution of peer-to-peer sharing. For a long time, the effects of digitization on the profitability of music and on customers’ purchase intentions were ambiguous, but the trend has turned positive, with streaming platforms now driving the majority of revenue. Increased audience engagement and easier access to music have largely eliminated the problems the industry faced because of piracy. The role of data in the music industry has evolved as well: music producers and distributors can now make more data-driven decisions and align their content closely with the kind of music that appeals to their target market. Although many different dimensions and metrics are available, this project aims to summarize the current music industry by understanding the probability of a song winning critical acclaim, popularity, or higher visibility. The Covid-19 pandemic has also changed the way musicians interact with their audience, since tours came to a complete halt and digital platforms became the most widespread channel. This has created a need to close the strategic gap between the content produced and user demand and listening behavior.
Additionally, this project tries to answer the following questions to understand managerial implications:
1. Classification: which attributes of a song contribute to the likelihood of it winning a Grammy?
2. Clustering: which attributes of a song contribute to its popularity?
3. Recommendation system: how can a recommendation system be used to increase the reach of the music?
Using these results and findings, the project aims to identify the key determinants of a song’s success so that songs can be made with more of these features, giving a competitive advantage in a growing industry landscape.
This project has five parts: part one is Clustering; part two is Binary Classification; part three is Text Classification; part four is Exploratory Data Analysis; part five is the Recommendation System.
DATA WRANGLING
For the purpose of this project, we have used the following datasets from Kaggle:
1. spotifyWeeklyTop200Streams.csv
2. songAttributes_1999–2019.csv
3. grammyAlbums_199–2019.csv
4. Spotify Dataset 1922–2021, ~600k Tracks
The datasets songAttributes_1999–2019.csv and grammyAlbums_199–2019.csv were merged on the columns “Album” and “Artist”, since the Grammy dataset did not include audio characteristics. The merged dataset let us analyze which audio attributes Grammy-winning songs have and how popular those songs are.
We used this dataset for three parts of the analysis: Clustering, Binary Classification, and the Recommendation System.
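The merge on “Album” and “Artist” can be sketched as follows. This is a minimal illustration on toy data, not the project's actual code: the join type (left), the `Award` and `Name` columns, and the `GrammyWinner` flag are assumptions made for the example; only the two key columns come from the description above.

```python
import pandas as pd

# Toy stand-ins for the two Kaggle files. Every column other than
# "Album" and "Artist" is a hypothetical placeholder.
song_attributes = pd.DataFrame({
    "Album": ["Album A", "Album A", "Album B", "Album C"],
    "Artist": ["Artist X", "Artist X", "Artist Y", "Artist Z"],
    "Name": ["Song 1", "Song 2", "Song 3", "Song 4"],
    "Acousticness": [0.82, 0.75, 0.10, 0.05],
    "Energy": [0.30, 0.41, 0.88, 0.93],
})
grammy_albums = pd.DataFrame({
    "Album": ["Album A", "Album C"],
    "Artist": ["Artist X", "Artist Z"],
    "Award": ["Placeholder Award 1", "Placeholder Award 2"],
})

# A left join keeps every song; songs whose album has no Grammy entry
# get NaN in the "Award" column.
merged = song_attributes.merge(grammy_albums, on=["Album", "Artist"], how="left")

# Hypothetical binary label derived from the join result.
merged["GrammyWinner"] = merged["Award"].notna().astype(int)

print(merged[["Name", "Album", "Artist", "Acousticness", "GrammyWinner"]])
```

Joining on both columns rather than on “Album” alone avoids matching two different artists who happen to share an album title.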
Software and Tools
All parts of the project were implemented in Python.
Exploratory Data Analysis (skipped)
UNSUPERVISED MACHINE LEARNING
K-Means Clustering
K-means clustering is an unsupervised learning method used to find groups in data, with the number of groups given by the variable K. The algorithm works iteratively: it assigns each data point to one of K groups based on the provided features, so that points are clustered by feature similarity. In this project, we used it to see how the songs group together and which features characterize each group.
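To make the iterative assignment concrete, here is a minimal sketch of the standard K-means loop (Lloyd's algorithm) in NumPy. It is an illustration on synthetic data, not the implementation used in this project; the function name `kmeans` and its parameters are chosen for this example.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        # (Euclidean distance in feature space).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its points.
        # An empty cluster keeps its previous centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving, so assignments are stable
        centroids = new_centroids
    return labels, centroids

# Synthetic data: two well-separated groups of 50 points each.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(50, 2)),
])
labels, centroids = kmeans(X, k=2)
print(np.bincount(labels, minlength=2))
```

In practice a library implementation such as scikit-learn's `KMeans` is the more common choice, since it adds better initialization and multiple restarts.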
Standard scaling of data
Standardization is a scaling technique that centers values around the mean and scales them to unit standard deviation, so each attribute ends up with a mean of zero and a standard deviation of one. Because K-means relies on the distance between data points, features with larger ranges would otherwise dominate, so we scaled the data before clustering.
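As a small sketch of what standardization does, the example below applies z = (x - mean) / std to a toy table. The values are made up, and the `Tempo` column is a hypothetical addition used only to show a feature on a much larger scale than the others.

```python
import numpy as np
import pandas as pd

# Toy feature table with made-up values. "Tempo" is hypothetical and is
# included because its range is far larger than the 0-1 features.
features = pd.DataFrame({
    "Acousticness": [0.82, 0.10, 0.05, 0.60],
    "Energy": [0.30, 0.88, 0.93, 0.45],
    "Tempo": [92.0, 128.0, 140.0, 101.0],
})

# z-score per column, using the population standard deviation (ddof=0).
scaled = (features - features.mean()) / features.std(ddof=0)

print(scaled.round(3))
print("column means:", scaled.mean().round(6).tolist())
print("column stds: ", scaled.std(ddof=0).round(6).tolist())
```

scikit-learn's `StandardScaler` performs the same computation and also remembers the fitted mean and standard deviation, which is convenient when the same scaling must be applied to new data.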
Elbow method to find the optimal number of clusters for K-Means
Based on the elbow plot, we decided on three clusters.
We then plotted a histogram of each column for each cluster to make the differences easier to see.
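The elbow method fits K-means for a range of K values and looks for the point where the within-cluster sum of squares (inertia) stops dropping sharply. The sketch below shows the idea on synthetic data with three groups; the data, the range of K, and the `random_state` values are assumptions for illustration and do not reproduce the project's results.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic, already-scaled data drawn around three centers.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [6.0, 0.0], [3.0, 6.0]])
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in centers])

# Fit K-means for each candidate K and record the inertia
# (sum of squared distances from each point to its centroid).
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(f"k={k}: inertia={inertia:.1f}")
```

Plotting inertia against K gives the elbow curve; the "elbow" is the K after which adding more clusters yields only small improvements.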
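The per-cluster histograms boil down to binning each column separately for each cluster label. The sketch below computes those bin counts on random toy data; the `cluster` column name, the bin edges, and the helper `cluster_histograms` are hypothetical, and in a real run the labels would come from the fitted K-means model.

```python
import numpy as np
import pandas as pd

# Random toy data; "cluster" stands in for the labels K-means would assign.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "Acousticness": rng.uniform(0, 1, size=300),
    "Energy": rng.uniform(0, 1, size=300),
    "cluster": rng.integers(0, 3, size=300),
})

bins = np.linspace(0.0, 1.0, 6)  # five equal-width bins on [0, 1]

def cluster_histograms(frame, column, bins):
    """Return bin counts with one row per cluster and one column per bin."""
    rows = {
        cluster: np.histogram(group[column], bins=bins)[0]
        for cluster, group in frame.groupby("cluster")
    }
    labels = [f"{lo:.1f}-{hi:.1f}" for lo, hi in zip(bins[:-1], bins[1:])]
    return pd.DataFrame.from_dict(rows, orient="index", columns=labels)

for column in ["Acousticness", "Energy"]:
    print(column)
    print(cluster_histograms(df, column, bins))
```

Each row of the resulting table can be passed to a plotting library as a bar chart, which gives one histogram per cluster for the chosen column.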
Interpretation
Cluster 0:
· Highest number of Grammy-winning songs.
· Less explicit content.
· Fewest popular songs.
· Highest Acousticness.
· Low-medium Energy.
Cluster 1:
· Fewest Grammy-winning songs.
· Most explicit content.
· Medium popularity (higher than Cluster 0).
· Low Acousticness.
· High Energy.
· Highest Speechiness.
Cluster 2:
· Fewer Grammy-winning songs.
· Least explicit content.
· Most popular.
· Low Acousticness.
· High Energy.
The key differences among the three clusters are clear. Cluster 0 and Cluster 1 are the most interesting for analyzing the likelihood of a song winning a Grammy. Songs in Cluster 0, which contains the most Grammy winners, have little explicit content (content that is sexual, violent, or offensive in nature) and high acousticness (acoustic instruments rather than electronic ones such as the electric guitar or synthesizer). These songs are more likely to have a classy and sophisticated vibe.
Interestingly, Cluster 1, which contains the fewest Grammy-winning songs, is the complete opposite of Cluster 0. We can also see that a song’s popularity doesn’t guarantee a Grammy win: Cluster 2 is the most popular, yet it has fewer Grammy-winning songs than Cluster 0.
To be continued: Binary Classification.