Going through my Spotify history

I take a look at my streaming history, big data style.


Premise

I am a giant music nerd and recently I am going through a reminiscing nostalgic trip. I want to know what I used to listen. I found out that Spotify allows you to request a copy of your extended streaming history.

(It is impossible to know my listening history before Spotify. 8 year old me relied on family for music and 12 year old me used to play music on YouTube and Google Music (RIP). I requested a copy of my YouTube data using takeout but none of them are useful (they only preserve 1 year of history it seems.))

The data

The data came in a zip file. Inside is a README PDF that contains descriptions of the columns. The actual dataset spans through multiple JSON files, each around 12.8MB big.

An example of a data point may look like this:

[
  {
    "ts": "2020-04-11T00:27:53Z",
    "platform": "iOS 13.4 (iPhone11,8)",
    "ms_played": 174674,
    "conn_country": "SG",
    "ip_addr": "203.0.113.x",
    "master_metadata_track_name": "Jack To The Future (feat. Russoul)",
    "master_metadata_album_artist_name": "Alex Metric",
    "master_metadata_album_album_name": "Jack To The Future (feat. Russoul)",
    "spotify_track_uri": "spotify:track:74cksX4vgcm0cHKaV7XE2L",
    "episode_name": null,
    "episode_show_name": null,
    "spotify_episode_uri": null,
    "audiobook_title": null,
    "audiobook_uri": null,
    "audiobook_chapter_uri": null,
    "audiobook_chapter_title": null,
    "reason_start": "trackdone",
    "reason_end": "trackdone",
    "shuffle": false,
    "skipped": false,
    "offline": false,
    "offline_timestamp": null,
    "incognito_mode": false
  },
  // ...
]

A few interesting things:

  • The IP address can be IPv4 and IPv6. A potential normalisation might be using IPv6's IPv4-Mapped address prefix like ::ffff:1.2.3.4. The database may already have types to deal with IP addresses though.
  • The IP address may get anonymised at the Spotify side. The last 8-bit of the subnet is replaced with a x.
    • You can still tell which ISP you were on by using bgp.he.net.
  • Spotify is being smart and gave us the album artist's name. Makes sense because there could be a featuring artist on the track, or it could be a compilation album.

"Quack"

Even though it is being given to me in JSON, this looks like tabular data to me. The only reasonable move is to put it into a database and query it instead of using grep.

For the data size (around 640k rows), I could have also just used SQLite with a CLI but it really isn't suitable for analytic workloads, especially if I am enriching the data with more metadata from Spotify later, the joins alone in SQLite would slow down the queries a lot.

At $work we use ClickHouse for large scale data analysis. I am very much used to ClickHouse SQL syntax. But if this is a personal project, why not use this opportunity to learn a new technology?

I turned to DuckDB as I want a database that's as easy to use as SQLite but can handle the data I gave it.

I ingested the data into DuckDB with a tiny bit of normalisation around the IP addresses and I skipped all the audiobook stuff because I never used it, as well as the podcast stuff.

File size

FormatSize (MB)
JSON513
DuckDB38
Arrow53.2
Parquet34.1

Even with compression, my dataset still weighs 34.1MB. I did some optimisation such as decoding the spotify track URI's base62 ID component into blobs and store the blobs, and normalisations like storing tracks in a different table and create a view to join the history table with the track table back together. This took the file size down to 21.8MB. I guess I can further normalise by having the user agent in a different table / enum, storing reason_start and reason_end as enums, but that's an exercise for later.

Excerpt on what I found

select 
  distinct artist_name,
  count() as count 
FROM spotify_history_view 
WHERE ts between '2020-01-01' and '2026-04-01' 
GROUP BY artist_name 
ORDER BY count desc 
limit 10;

My top 10 artists between 2020 and now are

  • deadmau5,
  • Flume,
  • Skrillex,
  • Bring Me The Horizon,
  • Lana Del Rey,
  • London Grammar,
  • Kendrick Lamar,
  • Radiohead,
  • ODESZA, and
  • The Prodigy.

Not surprising. But when did I actually start listening to them?

with top_artists AS (
select
  distinct artist_name,
  count() as count,
FROM spotify_history_view
WHERE timestamp between '2020-01-01' and '2026-04-01'
GROUP BY artist_name
ORDER BY count desc
limit 10
)   
SELECT 
artist_name, min(timestamp) 
from spotify_history_view 
JOIN top_artists 
using (artist_name) 
group by artist_name;
artist_namemin("timestamp")
Lana Del Rey2014-03-17 06:44:11
The Prodigy2014-12-20 14:24:23
Bring Me The Horizon2015-02-14 14:45:11
London Grammar2014-03-17 20:19:13
deadmau52014-04-14 21:42:46
Flume2015-01-11 02:35:49
ODESZA2014-12-30 01:52:43
Radiohead2014-05-13 21:03:06
Kendrick Lamar2015-01-17 23:39:01
Skrillex2014-04-12 12:04:00

Exciting. I have been listening to these artists for more than 10 years.

How about top albums? This is harder to answer with this dataset because there is no play count for each album so we can only estimate based on play counts for each tracks in each album. You can skew this data if you just play one song multiple times. (There might be a way but the SQL is too annoying to conjure.)

select date(min(timestamp)) as ts_date, count() as album_count, album_name, artist_name from spotify_history_view group by artist_name, album_name order by album_count desc limit 20;
ts_datealbum_countalbum_nameartist_name
2019-03-211225Hi This Is FlumeFlume
2016-12-021167W:/2016ALBUM/deadmau5
2017-06-041098hopeless fountain kingdomHalsey
2014-03-17885If You WaitLondon Grammar
2015-09-13885Oh WonderOh Wonder
2014-05-01794No Mythologies to Follow (Deluxe)
2015-08-03790Down to EarthFlight Facilities
2015-09-21787BADLANDSHalsey
2016-05-19772The Life Of PabloKanye West
2017-06-16771MelodramaLorde
2015-01-20757My Beautiful Dark Twisted FantasyKanye West
2015-01-30741while(1<2)deadmau5
2014-03-08731TrueAvicii
2018-07-13710where's the drop?deadmau5
2014-03-17699SettleDisclosure
2014-03-31686Days Are GoneHAIM
2019-03-29679WHEN WE ALL FALL ASLEEP, WHERE DO WE GO?Billie Eilish
2017-06-07677I See YouThe xx
2014-04-15670> Album Title Goes Here <deadmau5
2016-09-16661New Age | Dark Age (Deluxe Version)Karma Fields

After words

This dataset is basically Spotify Wrapped but better.

Spotify recently greatly reduced the amount of data they serve over the Web API. They no longer serve metadata such as genres, BPM, etc. My original plan was to write a function (either ClickHouse UDF or some JS) to fetch these metadata on-demand but that's no longer viable.

I might jot down the process of getting more metadata out of Spotify, MusicBrainz and Last.fm in a future post. It is still in progress and involves setting up a MusicBrainz database locally, attempting to pglite and a bit of LLM wrangling. I also want to look into fuzzy text search.

I also have friends asking me how to make this. I might make a little client-side web app for people to play with. DuckDB has a WASM version so this can all be done without a server.

In the mean time, I'd like to leave a toy app that generates wallpapers from album arts here. (source)

N.B. I want to share this but nevermind

I had a toy site ready to go, that lets people go and query an anonymised version of the data, but even then LLMs reminded me that AI giants and scrapers could just download the parquet file and use it for commercial purposes. :/