ANALYZING SPOTIFY API SONG DATA PT.1

COLLECTING, STORING, AND EXPLORING THE DATA

PYTHON, POSTGRESQL

MEDIUM

last hacked on Oct 14, 2017

Spotify quantifies each of its songs in terms of features such as danceability, liveness, acousticness, and valence. The values of these features are available for every song through the Spotify API, so we can build data sets consisting of songs and their respective features. This report covers the logistics of collecting and storing song data, followed by some exploratory analysis of curated playlists from Spotify. After going through it, you should be ready to apply your own analysis techniques to Spotify song data.

# The Spotify API

You can find documentation on the Spotify API <a href="https://developer.spotify.com/web-api/" target="_blank" >here</a>. It is useful to go through the <a href="https://developer.spotify.com/web-api/tutorial/" target="_blank" >tutorial</a> to better understand access headers if you choose to use API calls that go beyond what is already in the getData module. There are a lot of options in the API, but what we are mostly concerned with is getting the <a href="https://developer.spotify.com/web-api/get-audio-features/" target="_blank" >audio features</a>. Each request to the API takes approximately one second to return a response, so loading a large data set directly from the API can take a long time. To work around this, I created a Python module called <a href="https://raw.githubusercontent.com/UCSB-dataScience-ProjectGroup/spotify-dashboard/master/spotilyzer/getData.py" target="_blank">getData.py</a>.

# Attaining Data

To make attaining data easier, we created the getData module. The module maintains a local database on the computer where you run your analysis script. When you request a song, getData first looks in that database; if the song is not there, it makes an API call and stores the data it receives in the database. For example, if you have 3000 songs that you want to work with, you will only need to wait ~3000 seconds the first time you run your analysis script; the next time you run your script, all of your data will be in your local database.

## getData Module Description

Note that before you use the getData module, you must have PostgreSQL running on your computer and create a user named "spotilyzer" with password "spotipass". To install PostgreSQL you can follow <a target="_blank" href="https://www.digitalocean.com/community/tutorials/how-to-install-and-use-postgresql-on-ubuntu-16-04">this tutorial</a>.

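
Before moving on to the database setup, it may help to see the cache-first lookup described above in miniature. The sketch below is not the getData implementation: a plain dictionary stands in for the PostgreSQL database, and `make_cached_getter`, `fake_fetch`, and `local_db` are hypothetical names used only for illustration.

```python
def make_cached_getter(fetch_from_api, cache):
    """Return a function that looks songs up in `cache` before calling the API."""
    def get_songs(song_ids):
        results = []
        for song_id in song_ids:
            if song_id not in cache:
                # Cache miss: pay the slow API cost once, then remember the result.
                cache[song_id] = fetch_from_api(song_id)
            results.append(cache[song_id])
        return results
    return get_songs


# Hypothetical stand-in for the slow Spotify request; it records every call.
api_calls = []


def fake_fetch(song_id):
    api_calls.append(song_id)
    return {"songid": song_id, "danceability": 0.5}


local_db = {}  # stand-in for the local PostgreSQL table
get_songs = make_cached_getter(fake_fetch, local_db)

first = get_songs(["a", "b"])   # two simulated API calls
second = get_songs(["a", "b"])  # served entirely from the cache
print(len(api_calls))  # prints 2
```

The second call returns the same data without touching the simulated API, which is why only the first run of an analysis script has to wait on the network.
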
Once you have PostgreSQL installed, use the following commands to create the spotilyzer user:

```
psql postgres
postgres=> CREATE ROLE spotilyzer WITH LOGIN PASSWORD 'spotipass';
CREATE ROLE
postgres=> ALTER ROLE spotilyzer CREATEDB;
ALTER ROLE
postgres=# \du
List of roles
 Role name  | Attributes                                                 | Member of
------------+------------------------------------------------------------+-----------
 spotilyzer | Create DB                                                  | {}
 timmy      | Superuser, Create role, Create DB, Replication, Bypass RLS | {}

postgres=# \q
```

The getData module will use those credentials to manage your local database of song feature data.

|Function|Input|Output|Usage|
|---|---|---|---|
|getSongs(songList)|A list of song IDs|A list of dictionaries where each dictionary contains the data for a song ID in the input list.|This should be used whenever you want to get data for a list of song IDs.|
|getSongsInCategory(category)|The string associated with a Spotify category. For example, "Jazz" or "Pop"|A list of dictionaries that contain the song feature data for each song under the respective category at the time of execution. This will return data for around 500 songs most times.|Useful when building data sets associated with a specific category.|
|getAllSongsInDB()|N/A|A list of dictionaries where each dictionary contains the data for one song in your local database.|Useful when you have a well organized local database.|
|getAccessHeader()|N/A|Access header used for HTTP request API calls.|Useful for writing your own functions in getData.py that require API calls.|

The getData module can be expanded as necessary and there are several helper functions not described here. These are simply the functions of most interest.

# Loading Data

We'll use two examples to demonstrate how to use the getData module to load data from the Spotify API.

## Example 1

First, imagine the simple case where we have a list of song IDs and we want the data associated with them.

```python
import getData as gd

songIDs = ["3n41HT8DnPHOBb1zcliJOD", "2wGLmEiWTc6q4EZhWy1Ltd"]
data = gd.getSongs(songIDs)
```

Here is how we would access the data:

```
(Pdb) data[0]['songid']
'3n41HT8DnPHOBb1zcliJOD'
(Pdb) data[0]['song_title']
'26 Inches'
(Pdb) data[0]['danceability']
0.545
(Pdb) data[0]['acousticness']
0.67
(Pdb) data[1]['songid']
'2wGLmEiWTc6q4EZhWy1Ltd'
(Pdb) data[1]['song_title']
'Folk-Metaphysics'
(Pdb) data[1]['danceability']
0.555
```

## Example 2

Now imagine we want to retrieve the data for songs in a specific category.

```python
import getData as gd

popSongs = gd.getSongsInCategory("Pop")
```

Here is how it looks:

```
(Pdb) len(popSongs)
789
(Pdb) popSongs[700]['song_title']
'Make Me (Cry) - Acoustic'
(Pdb) popSongs[700]['acousticness']
0.886
```

As you can see, using the getData module makes loading your data easy. The table below lists all of the keys for an arbitrary song data dictionary:

|Keys|
|---|
|time_signature, danceability, acousticness, duration, mode, artistids, popularity, songid, loudness, instrumentalness, energy, liveness, available_markets, key, speechiness, tempo, song_title, albumid, valence|

# Pre-Processing

The getData module handles most of the pre-processing with respect to getting the data from Spotify and presenting it to us in the form of a list of dictionaries. From here we'll need to put our data into a dataframe and normalize it. You can translate the data that getData returns to you into a dataframe in several different ways, but here is an example to get you started.

## Example

This example takes advantage of the dataframe constructor that turns a dictionary into a dataframe. Consider the following function that takes a list of categories and a list of features, then constructs a dataframe consisting of songs in the specified categories with the specified features.

```python
import getData as gd
import pandas as pd
from sklearn import preprocessing


def createCategoriesDataFrame(categories, features):
    # One column (list) per requested feature, plus song ID and category label
    preFrameDict = {}
    for i in features:
        preFrameDict[i] = []
    preFrameDict["songid"] = []
    preFrameDict["category"] = []

    # Append one row per song in each category
    for i in categories:
        songs = gd.getSongsInCategory(i)
        for j in songs:
            preFrameDict["category"].append(i)
            preFrameDict["songid"].append(j["songid"])
            for q in features:
                preFrameDict[q].append(j[q])

    df = pd.DataFrame(preFrameDict)

    # Normalize data: rescale each feature column to zero mean and unit variance.
    # StandardScaler expects 2D input, so all feature columns are scaled together.
    df[features] = preprocessing.StandardScaler().fit_transform(df[features])

    return df.set_index("songid")
```

# Exploratory Analysis

Now that the logistics are taken care of, we can start having some fun with our data.

## Exploration of Genre Clusters

Please use <a href="https://raw.githubusercontent.com/UCSB-dataScience-ProjectGroup/spotify-dashboard/master/spotilyzer/DemoKNN.py" target="_blank">DemoKNN.py</a> as a reference for the functions used in the following example. Let's first look at two categories, Jazz and Rock.

```
categories = ["Jazz", "Rock"]
allFeatures = ["popularity", "danceability", "energy", "key", "loudness", "speechiness", "acousticness", "instrumentalness", "liveness", "valence", "tempo", "time_signature"]
cdf = createCategoriesDataFrame(categories, allFeatures)
```

We can get a crude idea of how the data will cluster by selecting a couple of features and creating scatter plots. Here we choose danceability and acousticness.

```
graph2DPlotlyCategoriesDifferentColors(cdf, ['danceability','acousticness'], categories)
```

<iframe width="900" height="800" frameborder="0" scrolling="no" src="//plot.ly/~kaizentowfiq9/10.embed"></iframe>

We can add in the valence feature and view the data in 3 dimensions.

```
graph3DPlotlyCategoriesDifferentColors(cdf, ['danceability','acousticness', 'valence'], categories)
```

<iframe width="900" height="800" frameborder="0" scrolling="no" src="//plot.ly/~kaizentowfiq9/4.embed"></iframe>

So we can definitely see that these genres are forming different clusters, but maybe we can create tighter clusters using Principal Component Analysis (PCA). If you don't know about PCA, I highly recommend learning more about it. Long story short, PCA projects high-dimensional data onto a lower-dimensional space while retaining as much of the variance as possible. Each feature of our songs represents one dimension. To view our data in 3 dimensions we have to pick 3 of our 12 features, which means discarding whatever the other 9 could tell us. PCA instead builds 2 or 3 new axes that each combine all 12 features, which should result in tighter clusters.

```
pcadf = PCAOnDataFrame(cdf, allFeatures, 2)
graph2DPlotlyCategoriesDifferentColors(pcadf, ['1','2'], categories)
```

Projecting onto 2 dimensions

<iframe width="900" height="800" frameborder="0" scrolling="no" src="//plot.ly/~kaizentowfiq9/12.embed"></iframe>

```
pcadf = PCAOnDataFrame(cdf, allFeatures, 3)
graph3DPlotlyCategoriesDifferentColors(pcadf, ['1','2', '3'], categories)
```

Projecting onto 3 dimensions

<iframe width="900" height="800" frameborder="0" scrolling="no" src="//plot.ly/~kaizentowfiq9/6.embed"></iframe>

Although not perfectly distinct, it's clear that PCA has helped us tighten our clusters. This suggests that even a simple classification method such as KNN should give us decent accuracy when classifying between Jazz and Rock. If we add in a third genre (Hip Hop in red), we can see that there's still obvious clustering, but overlapping clusters can become a big problem very quickly.

```
categories = ["Jazz", "Rock", "Hip-Hop"]
cdf = createCategoriesDataFrame(categories, allFeatures)
pcadf = PCAOnDataFrame(cdf, allFeatures, 2)
graph2DPlotlyCategoriesDifferentColors(pcadf, ['1','2'], categories)
```

<iframe width="900" height="800" frameborder="0" scrolling="no" src="//plot.ly/~kaizentowfiq9/14.embed"></iframe>

```
pcadf = PCAOnDataFrame(cdf, allFeatures, 3)
graph3DPlotlyCategoriesDifferentColors(pcadf, ['1','2', '3'], categories)
```

<iframe width="900" height="800" frameborder="0" scrolling="no" src="//plot.ly/~kaizentowfiq9/8.embed"></iframe>
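
The PCA helper used above lives in DemoKNN.py and is not reproduced here. To make the idea concrete, here is a minimal sketch of what a PCA projection followed by a KNN check might look like, written with NumPy only. It is not the author's `PCAOnDataFrame`: `pca_project` and `knn_predict` are hypothetical names, and the two "genres" are synthetic Gaussian clouds with 12 features rather than real Spotify data.

```python
import numpy as np


def pca_project(X, n_components):
    """Project the rows of X onto the top principal components.

    Returns the projected data and the fraction of total variance
    retained by each kept component.
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # SVD of the centered data: the rows of vt are the principal directions.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    variance = singular_values ** 2
    explained_ratio = variance / variance.sum()
    projected = centered @ vt[:n_components].T
    return projected, explained_ratio[:n_components]


def knn_predict(train_X, train_y, query_X, k=5):
    """Label each query row by majority vote among its k nearest training rows."""
    train_y = np.asarray(train_y)
    predictions = []
    for row in query_X:
        distances = np.linalg.norm(train_X - row, axis=1)
        nearest = train_y[np.argsort(distances)[:k]]
        labels, counts = np.unique(nearest, return_counts=True)
        predictions.append(labels[np.argmax(counts)])
    return np.array(predictions)


# Synthetic stand-in for two genres with 12 standardized features each.
rng = np.random.default_rng(0)
genre_a = rng.normal(loc=-1.0, scale=1.0, size=(200, 12))
genre_b = rng.normal(loc=1.0, scale=1.0, size=(200, 12))
X = np.vstack([genre_a, genre_b])
y = np.array(["A"] * 200 + ["B"] * 200)

projected, kept = pca_project(X, 2)

# Hold out every fourth song for testing and train on the rest.
test_mask = np.arange(len(y)) % 4 == 0
predicted = knn_predict(projected[~test_mask], y[~test_mask],
                        projected[test_mask], k=5)
accuracy = (predicted == y[test_mask]).mean()

print("variance kept by 2 components:", kept.sum())
print("held-out accuracy:", accuracy)
```

On real data the clusters overlap far more than these synthetic ones, so the held-out accuracy is worth checking per genre pair before trusting a classifier built on the projected features.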

COMMENTS


I am trying to use the Spotify API also, and thank you for your ideas above. One question I do have is about demographic information. I was unable to find much on that. Is there a way to harness that information better? Luay
The 3D graphs look good.




