How I Made an Anthony Fantano Neural Network

I’ve been watching Anthony Fantano of theneedledrop tear apart my favorite albums for years. From House of Balloons by The Weeknd getting a three to 2014 Forest Hills Drive getting a six, I could write a whole list of anime betrayals that Melon has put me through. I’ve finished many reviews salty, until I read those famous words: “Y’all know this is just my opinion, right?” Then all my angers melthony awaytano.

Introduction

I study machine learning and had the idea to make a bot that rates songs like Fantano. I thought this could be a fun and interactive way to hone my skills in data processing, developing ML models, and deploying them for public use. This article will focus on the research and development of the neural network. The next article in the series covers how I built a Flask web application around my Keras machine learning model. The final article covers the challenges I encountered deploying my web app to Heroku.

Try The Needle Bot yourself!

Disclaimer: This was just a fun project for me to experiment with. I put it out into the world because I think it’s fun to play with but the neural network’s accuracy is kind of dookie. More on that later.

Article Overview

Finding the Dataset

The dataset that I selected for this project comes from Jared Arcilla on Kaggle. He compiled the dataset directly from Fantano’s YouTube descriptions. It includes 1733 unique album reviews up until 2018.

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# First we read in the dataset of all of Melons reviews

df = pd.read_csv('fantano_reviews.csv', encoding='latin-1')
df.head()

I started by reading in the dataset and calling .head() to get an idea of the columns and their datatypes.

The columns include title, artist, review date, review type, score, word score, best tracks, worst tracks and link. I then removed the columns that would not be useful for the analysis. I really only need the title, artist, score, and best tracks from the available options.

The worst track and word score columns would be useful but unfortunately they are empty for every row. So, I removed those as well.

# Now lets get rid of the columns that we don't care about

df = df.drop(['Unnamed: 0', 'review_date', 'review_type', 'word_score', 'worst_track', 'link'], axis=1)

So, the next task was answering the question “How do I download all the songs from all the albums in this list?”. I broke this question down into two parts.

How do I find all the songs in an album using python?

To accomplish this, I use the Genius lyrics API and this python package, lyricsgenius, which wraps the API.

import lyricsgenius

genius = lyricsgenius.Genius("<API_KEY>")
# search_album takes the album name first, then the artist
album = genius.search_album("The English Riviera", "Metronomy")

print(album)

Here I am testing getting a given album from the Genius API. The response is:

Album(id, artist, …)

This is a successful request which returns an Album object. Now I can iterate through all the songs in a given album as follows:

for track in album.tracks:
    print(track.song.title)

From here, I moved on to answering the next question.

How to search YouTube for a song in python?

I have the name of the song and the artist who wrote it. I want to find the corresponding youtube video so that I can download the audio file.

import urllib.request
from urllib.parse import urlencode
import re


def search_youtube_for_song(query):
    try:
        # URL-encode the query so spaces and special characters are handled correctly
        url = "https://www.youtube.com/results?" + urlencode({"search_query": query})
        html = urllib.request.urlopen(url)
        # Video IDs are 11 characters long and appear as watch?v=<id> in the results page
        video_ids = re.findall(r"watch\?v=(\S{11})", html.read().decode())

        if len(video_ids) > 0:
            return video_ids[0]
        else:
            return None
    except Exception:
        print(query, "failed to search")
        return None

This function returns the unique video identifier of the top search result for a given query. I can then navigate to the url ‘http://www.youtube.com/watch?v=‘ + hashcode to access the video and download it.
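As a quick sanity check, calling the function and building the watch URL looks like this (the query below is just a made-up example):

video_id = search_youtube_for_song("Metronomy The Look")
if video_id is not None:
    print("https://www.youtube.com/watch?v=" + video_id)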

Downloading songs using youtube-dl

In order to download the audio from a youtube video, I use the library youtube-dl. It offers both a command line utility and a Python interface that can be called directly from a script.

from __future__ import unicode_literals
import youtube_dl
import random

def download_song_from_youtube_to_bucket(hashcode, bucket):
    # Random number used as the output filename so songs with duplicate titles don't collide
    random_number = random.randint(0, 99999)
    ydl_opts = {
        'format': 'bestaudio/best',
        'postprocessors': [{
            'key': 'FFmpegExtractAudio',
            'preferredcodec': 'wav',
            'preferredquality': '96'
        }],
        'postprocessor_args': [
            '-ar', '16000'
        ],
        'prefer_ffmpeg': True,
        'keepvideo': False,

        # Save the audio into the folder named after the album's rating (the "bucket")
        'outtmpl': str(bucket) + '/' + str(random_number) + '.%(ext)s'
    }

    try:
        with youtube_dl.YoutubeDL(ydl_opts) as ydl:
            ydl.download(['http://www.youtube.com/watch?v=' + hashcode])
    except Exception:
        print("error occurred downloading", hashcode)

This function takes in the unique hashcode from the previous section, as well as a bucket. The bucket is simply the rating of the album the song comes from. This lets me organize the songs into folders by their rating. So, I end up with a 1.0 folder, a 2.0 folder, and so on.
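Putting these pieces together, the per-album loop looks roughly like the sketch below. It assumes the dataframe columns are named title, artist, and score, and it is a simplification rather than my exact pipeline code.

def download_album_to_buckets(row):
    # row is one row of the reviews dataframe (assumed columns: title, artist, score)
    album = genius.search_album(row['title'], row['artist'])
    if album is None:
        return
    for track in album.tracks:
        video_id = search_youtube_for_song(track.song.title + " " + row['artist'])
        if video_id is not None:
            download_song_from_youtube_to_bucket(video_id, row['score'])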

With all the puzzle pieces needed to go from a row in the csv to all the corresponding audio files, I am ready to start thinking about feature extraction. Neural networks have a hard time dealing with audio data directly, because of the notion of time. In order to analyze the signal directly, I would need to use some kind of Recurrent Neural Network, but that would be too computationally expensive. The question I am asking is “how can I represent these audio files in another way that is more friendly for a neural network?”.

Method #1: Image Analysis

My first idea was to convert the audio files into image representations. This would allow me to write a Convolutional Neural Network which generally has good performance on classification tasks. In order to go from audio file to image representation, I used the python library librosa to create a spectrogram.

Converting audio files to an image representation

A spectrogram is a visual representation of the spectrum of frequencies of a signal as it varies with time. To learn more about spectrograms, check out this resource. I’ve included a sample spectrogram that I generated in python below.

The x-axis represents time. The y-axis represents the frequency. The color represents the amplitude (“loudness”) of a particular frequency at a particular time.

While this spectrogram representation extracts lots of information from the audio file, the resultant image is very large. The output image is 806 × 359 pixels. If I were to use this for a neural network, my input layer would need to be 806 x 359 x 3 to accommodate the color channels of the image. This is far too large and would result in millions of tunable parameters in our CNN.
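To put a rough number on that: flattening an 806 x 359 x 3 image and feeding it into even a single modest dense layer already produces hundreds of millions of weights.

# Back-of-the-envelope parameter count for a flattened 806 x 359 x 3 input
# connected to a single Dense(256) layer (weights only, ignoring biases)
input_size = 806 * 359 * 3       # 868,062 values per image
dense_units = 256
print(input_size * dense_units)  # roughly 222 million weights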

So, I had to get creative. I wrote some code to convert only the middle of a given audio file to a spectrogram, using a window of fixed size. This resulted in an image that was 64 x 64 pixels. However, in this process a lot of valuable data about each song was lost.

import os
import librosa
import librosa.display

def scale_minmax(X, min=0.0, max=1.0):
    # min-max scale an array into the given range
    X_std = (X - X.min()) / (X.max() - X.min())
    return X_std * (max - min) + min

def spectrogram_image(y, sr, out, hop_length, n_mels):
    # use log-melspectrogram
    mels = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                            n_fft=hop_length*2, hop_length=hop_length)
    mels = np.log(mels + 1e-9) # add small number to avoid log(0)

    # min-max scale to fit inside 8-bit range
    img = scale_minmax(mels, 0, 255).astype(np.uint8)
    img = np.flip(img, axis=0) # put low frequencies at the bottom in image
    img = 255-img # invert. make black==more energy

    # save as PNG
    plt.imsave(out, img)

x, sr = librosa.load('twenty-one-pilots-Stressed-Out-OFFICIAL-VIDEO.wav', sr=44100)

hop_length = 8196 # number of samples per time-step in spectrogram
n_mels = 64 # number of bins in spectrogram. Height of image
time_steps = 64 # number of time-steps. Width of image

start_sample = int((len(x) / 2) - ((time_steps*hop_length) / 2))  # start the window at the middle of the track

length_samples = time_steps*hop_length
 
window = x[start_sample:start_sample+length_samples]

spectrogram_image(window, sr=sr, out='out.png', hop_length=hop_length, n_mels=n_mels)

The code above generates a much more limited representation of a song. A sample of the 64 x 64 pixel spectrogram is shown below.

I generated 3000+ of these 64 x 64 images and organized them into buckets corresponding to the rating of the song. Now, with all the data I was ready to write the CNN.

Creating a CNN for image analysis

In order to make the convolutional neural network, I am using Keras. I tried many different configurations of the CNN layers before I gave up on this approach. One such architecture is shown below.

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D
from keras.layers import Activation, Dropout, Flatten, Dense

from keras.preprocessing.image import ImageDataGenerator, array_to_img, img_to_array, load_img

image_generator = ImageDataGenerator(rescale=1/255, validation_split=0.2)    

train_dataset = image_generator.flow_from_directory(batch_size=32,
                                                 directory='train',
                                                 shuffle=True,
                                                 target_size=(64, 64), 
                                                 subset="training",
                                                 class_mode='categorical')

validation_dataset = image_generator.flow_from_directory(batch_size=32,
                                                 directory='train',
                                                 shuffle=True,
                                                 target_size=(64, 64), 
                                                 subset="validation",
                                                 class_mode='categorical')



model = Sequential([
    Conv2D(16, (2, 2), activation='relu', input_shape=(64, 64, 3)),
    MaxPooling2D(2, 2),
    
    Conv2D(32, (2, 2), activation='relu'),
    MaxPooling2D(2, 2),
    
    Conv2D(64, (2, 2), activation='relu'),
    Conv2D(64, (2, 2), activation='relu'),
    MaxPooling2D(2, 2),
    
    Flatten(),
    Dense(256, activation='relu'),
    Dense(256, activation='relu'),
    Dense(8, activation='softmax')
])

model.summary()

model.compile(optimizer='SGD', loss='categorical_crossentropy', metrics =['accuracy'])

An important note here is that my softmax layer is only 8 classes large. This means that the model only outputs 1 of 8 classification options for the final score. The available classes are the whole numbers 2, 3, 4, 5, 6, 7, 8, and 9 only. There were no examples for a 10 nor a 1 in the original dataset.

Initially, I had been adding an additional 0.5 to the score of any song that was listed as a best track. This meant that there were 17 classes over which the CNN had to predict. I opted to remove the intermediary ratings such as 6.5 in order to make the classification problem less sparse.
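Conceptually, that 0.5 bump was as simple as the sketch below. The helper and its arguments are just for illustration; in practice the best tracks came from the best tracks column of the dataset.

def song_bucket(album_score, song_title, best_tracks):
    # Illustrative helper: bump a song's bucket by 0.5 if it was listed as a best track
    bucket = float(album_score)
    if song_title in best_tracks:
        bucket += 0.5
    return bucket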

Results and Limitations

If I had to rate this machine learning model I would give it a Not Good out of 10. The main issue, as it usually goes, is that the data I was passing the model simply did not hold enough information. A 64 x 64 representation of a given audio file is good from a computational complexity perspective, but we were essentially throwing out 95% of the data about each song. The spectrogram only captured information about the center of each song.

As a result, the model would overfit on the data every time I trained it. Since there was not an equal number of samples for each output class, the model learned to output the most frequent class every time. So, regardless of what song I pass in, the model would always say it was an 8.
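You can see the imbalance for yourself by counting the images in each rating folder, as in the quick sketch below (it assumes the spectrograms live in a train directory with one folder per rating).

import os

# Count spectrogram images per rating folder, e.g. train/8.0, train/7.0, ...
for bucket in sorted(os.listdir('train')):
    bucket_path = os.path.join('train', bucket)
    if os.path.isdir(bucket_path):
        print(bucket, len(os.listdir(bucket_path)))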

Method #2: Audio Feature Analysis

Going back to the drawing board, I decided I needed to extract numeric features from each song so that the representation would cover the entire song without being too complex. This is where audio feature extraction came into play. I determined that I could extract the following features from wav files:

  • Chroma Vector – Captures harmonic and melodic characteristics of music using pitch.
  • RMS – Root-Mean-Square (RMS) energy for each frame
  • Spectral Centroid – The frequency that the energy of a spectrum is centered upon
  • Spectral Bandwidth – The width of the band of frequencies around the spectral centroid
  • Zero Crossing Rate – The rate at which the signal changes sign, a rough measure of its noisiness
  • Spectral Rolloff – The frequency below which a set percentage of the total spectral energy lies
  • Mel-Frequency Cepstral Coefficients – A set of features which describe the overall shape of a spectral envelope

For an in-depth look at each of these audio features check out this resource.

Extracting audio features from a song in Python

First, I set up my CSV with the appropriate headers:

import csv

header = 'filename chroma_stft rmse spectral_centroid spectral_bandwidth rolloff zero_crossing_rate'
for i in range(1, 21):
    header += f' mfcc{i}'
header += ' rating'
header = header.split()
print(header)

file = open('dataset.csv', 'w', newline='')
with file:
    writer = csv.writer(file)
    writer.writerow(header)

Now, I was able to create a pipeline that iterated through all the rows in the original dataset, extracted and downloaded all the songs from each album, and then wrote the audio features to my new dataset.

# curdir is the rating folder (e.g. './8.0') and filename is a .wav file inside it
x, sr = librosa.load(curdir + '/' + filename, offset=1.0, sr=22050)

rmse = librosa.feature.rms(y=x)
chroma_stft = librosa.feature.chroma_stft(y=x, sr=sr)
spec_cent = librosa.feature.spectral_centroid(y=x, sr=sr)
spec_bw = librosa.feature.spectral_bandwidth(y=x, sr=sr)
rolloff = librosa.feature.spectral_rolloff(y=x, sr=sr)
zcr = librosa.feature.zero_crossing_rate(x)
mfcc = librosa.feature.mfcc(y=x, sr=sr)

# Average each feature over time and append the rating (taken from the folder name)
to_append = f'{filename.replace(" ", "")} {np.mean(chroma_stft)} {np.mean(rmse)} {np.mean(spec_cent)} {np.mean(spec_bw)} {np.mean(rolloff)} {np.mean(zcr)}'
for e in mfcc:
    to_append += f' {np.mean(e)}'
to_append += f' {curdir[2:]}'

file = open('dataset.csv', 'a', newline='')
with file:
    writer = csv.writer(file)
    writer.writerow(to_append.split())

If you want to use the dataset that I created with the features described above, as well as the corresponding score that Fantano gave each song, you can download it below.

Training a Neural Network for Audio Analysis

The first thing I did after collecting the new dataset was to read it with pandas and remove the intermediary classes, i.e. song ratings such as 5.5.

df_for_analysis = pd.read_csv('dataset.csv')

df_for_analysis = df_for_analysis[df_for_analysis.rating != 2.5]
df_for_analysis = df_for_analysis[df_for_analysis.rating != 3.5]
df_for_analysis = df_for_analysis[df_for_analysis.rating != 4.5]
# ... repeat for all the rest
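An equivalent, more compact way to drop all the half-point ratings at once is to filter with isin:

half_scores = [x + 0.5 for x in range(1, 10)]  # 1.5, 2.5, ..., 9.5
df_for_analysis = df_for_analysis[~df_for_analysis.rating.isin(half_scores)]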

With that done, the next thing I did was to take a look at what a sample datapoint looked like.

print(df_for_analysis.iloc[0])

filename 94450.wav
chroma_stft 0.424663
rmse 0.300071
spectral_centroid 1932.29
spectral_bandwidth 1997.59
rolloff 4223.32
zero_crossing_rate 0.0903412
mfcc1 -78.1964
mfcc2 122.64
...
mfcc20 0.217367
rating 8

Name: 0, dtype: object

Next, I dropped all the rows that have NaN as a rating. I popped the rating column, saved it as y, and transformed it with the LabelEncoder. I removed the filename column and turned the datatype of every column into float32 so they would work with Tensors.

from sklearn import preprocessing

df_for_analysis.dropna(subset = ['rating'], inplace=True)

y = df_for_analysis.pop("rating")
le = preprocessing.LabelEncoder()
le.fit(y)

y_en = le.transform(y)

df_for_analysis.pop('filename')
float_cols = ['chroma_stft', 'rmse', 'spectral_centroid', 'spectral_bandwidth',
              'rolloff', 'zero_crossing_rate'] + [f'mfcc{i}' for i in range(1, 21)]
df_for_analysis = df_for_analysis.astype({col: 'float32' for col in float_cols})

After that, I split my dataset into training and testing sets. Then, I defined the architecture of the neural network and fit it on the training set.

from sklearn.model_selection import train_test_split

# The features are everything left in the dataframe after popping rating and filename
X = df_for_analysis

X_train, X_test, y_train, y_test = train_test_split(X, y_en, test_size=0.33, random_state=42)

# build a model
model = Sequential()
model.add(Dense(16, input_shape=(X_train.shape[1],), activation='relu')) # input shape is (features,)
model.add(Dense(16, activation='relu'))
model.add(Dense(8, activation='softmax'))
model.summary()

# compile the model
model.compile(optimizer='rmsprop', 
              loss='sparse_categorical_crossentropy', 
# sparse_categorical_crossentropy because the labels are integer-encoded classes
              metrics=['accuracy'])

history = model.fit(X_train,
                    y_train,
                    epochs=15, 
                    batch_size=100,
                    shuffle=True,
                    validation_split=0.2,
                    verbose=1)

Results and Limitations

After fitting the neural network on over 3000 samples, the validation accuracy was 13.5%. Yes, you are reading that correctly and yes, it is quite bad. In fact, there are 8 classes from which the model chooses its output and 13.5% accuracy is just a little bit better than guessing at random. Realistically, the samples are not evenly distributed across the classes so the 13.5% is quite easily achievable and does not indicate that any learning has taken place.

The accuracy of the model doesn’t tell the whole story here. This is a sparse classification problem whose classes are not only related but can also be ordered. So, it would be better if the accuracy took into account how close a guess was to the true value instead of a binary correct / incorrect. For example, if the neural network said a song was a 6 when it was really a 5, that is better than saying the same song is a 10.
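One way to capture that intuition is to measure, on average, how far the predicted rating lands from the true one rather than counting a binary hit or miss. A minimal sketch, assuming the X_test and y_test split from above and the LabelEncoder fit earlier:

# Mean absolute error between predicted and true ratings, in rating units
pred_classes = np.argmax(model.predict(X_test), axis=1)
pred_ratings = le.inverse_transform(pred_classes)
true_ratings = le.inverse_transform(y_test)
print(np.mean(np.abs(pred_ratings - true_ratings)))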

Acknowledging the shortcomings of my model, I settled on the final model that I released for one reason only. Simply put, the model I selected had the most “fantano-like” distribution of ratings. This means that it didn’t output the same rating every time like the overfitted models in Method #1. It generally rated songs towards median values with a skew towards the lower tail classes like 2 and 3. The distribution of class ratings for 1000+ samples is shown below.
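The plot itself is just a histogram of the decoded predictions, along these lines (a sketch of how such a distribution can be produced, not the exact plotting code I used):

# Histogram of predicted ratings across the evaluation samples
pred_ratings = le.inverse_transform(np.argmax(model.predict(X_test), axis=1))
plt.hist(pred_ratings, bins=8)
plt.xlabel('Predicted rating')
plt.ylabel('Number of songs')
plt.show()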

As you can see, getting a good score from The Needle Bot is pretty difficult. At the very least, this model captures Fantano’s selectiveness.

Acceptance

I spent quite a while working on this project and was initially disappointed by the results. I really wanted to release a model that rated songs like Anthony Fantano with high accuracy. However, machine learning is primarily about the data you can gather, and I simply couldn’t extract enough information from the long form audio content for the model to make any real high level associations.

In the process, I created a neural network that takes in your song and spits out a number. Now, does that number really mean anything about the quality of your song? The answer is no. But, it’s still kind of cool to get a rating from The Needle Bot. If he gives you a bad score, don’t get angry. Remember, it’s just ‘his’ opinion.

Check out part two of the series where I detail how I took my Keras model out of my Jupyter notebook and packaged it into a web app using Flask.

Avi Arora

Avi is a Computer Science student at the Georgia Institute of Technology pursuing a Masters in Machine Learning. He is a software engineer working at Capital One, and the Co Founder of the company Octtone. His company creates software products in the Health & Wellness space.