Vector Databases

Jeremy Bennett

Exploring the role of Vector Databases in Recommendation Systems


Exploring Recommendation Systems

Ever wondered how Spotify, Apple Music, YouTube Music and so many other apps recommend your next song? How do they keep finding music that you like and that is similar to what you've been listening to? Is it magic? The answer: vector databases! Throughout this blog, we delve into the inner workings of recommendation systems, shedding light on the orchestration facilitated by vector databases. 

Vector Databases

These databases serve as repositories for high-dimensional vectors, adept at storing a myriad of data types. The vectors themselves are mathematical entities with magnitude and direction, serving as vessels for encapsulating contextual information. These versatile databases span domains from 3D modelling to music generation. 

The real power comes from embedding – a transformative process that converts raw data into vectors. Embedding is the cornerstone of vector databases, distilling the essence of data into numerical representations. From images to textual snippets, embedding navigates the diverse terrain of data, harmonizing its elements into cohesive vectors. 
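As a toy illustration (not a production embedding model), here is a tiny "embedding" that turns a snippet of text into a vector by counting vocabulary words. Real systems use learned models, but the core idea of mapping raw data to numbers is the same; the vocabulary and lyrics here are made up:

```python
from collections import Counter

def embed(text, vocab):
    # Toy "embedding": count how often each vocabulary word appears.
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

vocab = ["guitar", "drums", "piano", "synth"]
print(embed("Guitar guitar drums", vocab))  # [2, 1, 0, 0]
print(embed("synth synth piano", vocab))    # [0, 0, 1, 2]
```

Two snippets that mention similar instruments end up with similar vectors – exactly the property a vector database exploits.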

What does this look like?

Navigating the Seas of Similarity

Within the vast expanse of a database, songs are transformed into vectors, enabling seamless discovery and recommendations based on audio wave structures. 

This process transcends traditional metadata-driven recommendations, offering users an intuitive exploration of content akin to their preferences. 

While not magic, these databases are truly out of this world, shaping our digital experiences with precision and finesse, guiding us through mountains of available content with an elegance reminiscent of a well-composed melody. 

How audio analysis works with vector databases 

Vector databases are used to organize and map entities based on similarity, with the principle that more similar entities are positioned closer to each other in vector space.

In the context of a vector database storing embeddings for a music library, you would expect songs by the same artist or within the same genre to sit closer to each other in this space. Each song is represented by a single vector, created through the embedding process, which points to a specific location in the multidimensional vector space.

This spatial arrangement facilitates various music information retrieval tasks, such as recommending similar songs, classifying tracks into genres, or identifying songs by the same artist. The effectiveness of a vector database in these tasks relies heavily on the quality and dimensionality of the embeddings, where each dimension captures different characteristics of the songs, enabling the system to discern and quantify similarities and differences accurately.
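To make "closer in vector space" concrete, here is a minimal sketch of cosine similarity – the distance metric we configure Qdrant with later – applied to three made-up 3-dimensional song vectors:

```python
import math

def cosine_similarity(u, v):
    # Cosine similarity: dot product divided by the product of vector lengths.
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

song_a = [0.9, 0.1, 0.3]  # made-up feature vectors
song_b = [0.8, 0.2, 0.4]  # similar to song_a
song_c = [0.1, 0.9, 0.0]  # quite different

print(round(cosine_similarity(song_a, song_b), 3))  # 0.984
print(round(cosine_similarity(song_a, song_c), 3))  # 0.208
```

A score near 1 means two songs point in almost the same direction in vector space; the database's job is to find the highest-scoring neighbours quickly.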

Vectors are essentially just arrows pointing to a particular place in an imaginary space. During the embedding process, we can capture detailed contextual information about a song and turn it into a vector.

In the context of audio analysis, this often involves mathematical analysis, one common method being Mel-frequency cepstral coefficients (MFCCs). To create a simple recommender system, you don’t need to know too much about this as we can leverage existing libraries to handle the actual audio analysis, but to summarize, these coefficients represent the spectral properties of the audio, offering insights into various aspects like the texture and colour of the sound.

How can I run this myself?  

Let’s jump into a simple tutorial that demonstrates how to build a recommendation system for a small audio library – you can run this on your own machine quite easily. 

First off – we need to choose a vector database. There are many great options to choose from, but in this tutorial we will use Qdrant – it is easy to run locally in Docker and scales to handle large workloads. 

To spin this up locally, you need Docker installed, and then run this command in your terminal: 

docker run -p 6333:6333 qdrant/qdrant 

This will spin up our database, as well as a UI dashboard that we can access in the browser (by default on port 6333). 

Now that we have our database running, we can start loading vectors into it using a Python script.  

For this to work, we will need a local library of .mp3 files to use. We will be looping through each of them, creating a vector and then loading it into the database. 

Let’s start writing the Python script. First, we need to import some libraries: 

# Import necessary libraries 
import os 
import librosa 
import numpy as np 
from qdrant_client import QdrantClient 
from qdrant_client.http.models import PointStruct 
from qdrant_client.models import Distance, VectorParams 

We are importing Librosa to perform audio analysis and Numpy to convert the MFCC audio analysis into vectors, as well as other libraries for interacting with the database and our file system. 

The first thing our script needs to do is connect to our local Qdrant database; we can do that with this snippet: 

# Directory containing MP3 files – edit this to the location of your local music library 
directory = 'music' 

# Initialize Qdrant client 
client = QdrantClient(host='localhost', port=6333) 

collection_name = "audio_collection" 
# 20 MFCC features x 4 statistics (mean, std, max, min) = 80 dimensions 
size = 80

Now, we can create a collection in Qdrant to hold our vector embeddings: 

client.create_collection( 
    collection_name=collection_name, 
    vectors_config=VectorParams(size=80, distance=Distance.COSINE), 
) 

Now we are ready to loop through our music library, create a vector for each song, and store them in our database. Let’s define our function to perform the audio analysis: 

def read_audio(file_path): 
    """Read audio from a file and extract features, expanding the feature vector.""" 
    y, sr = librosa.load(file_path) 
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)  # Analyse the audio with MFCCs using Librosa. 

    # Calculate multiple statistics about the audio files. 
    mfcc_mean = np.mean(mfcc, axis=1) 
    mfcc_std = np.std(mfcc, axis=1) 
    mfcc_max = np.max(mfcc, axis=1) 
    mfcc_min = np.min(mfcc, axis=1) 

    # Combine these features into a single vector 
    expanded_features = np.concatenate([mfcc_mean, mfcc_std, mfcc_max, mfcc_min]) 

    return expanded_features 

This code reads an audio file, extracts 20 MFCCs to capture the sound's spectral features, and combines four statistics per coefficient (mean, standard deviation, max and min) into a single 80-dimensional feature vector. This vector, encapsulating the song's spectral characteristics, is then stored in our vector database and can be used for similarity searches or other analyses.  
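As a quick sanity check on the dimensions (and why the collection is created with a vector size of 80): a stand-in MFCC matrix with 20 coefficients, summarized with four statistics each, concatenates into an 80-dimensional vector. The frame count of 430 below is just an arbitrary example of what librosa might return for a few seconds of audio:

```python
import numpy as np

# Stand-in for librosa's output: 20 MFCC coefficients over ~430 frames.
fake_mfcc = np.random.rand(20, 430)

stats = [np.mean, np.std, np.max, np.min]
vector = np.concatenate([f(fake_mfcc, axis=1) for f in stats])
print(vector.shape)  # (80,)
```

If you change `n_mfcc` or add more statistics, the collection's vector size must change to match, or Qdrant will reject the upserts.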

Next, we can define our function to store the vector in our Qdrant database: 

def store_in_qdrant(client, collection_name, embeddings, file_path, point_id): 
    """Store the embeddings in Qdrant.""" 
    embeddings_list = embeddings.tolist()  # Convert numpy array to list 
    payload = {"file_path": file_path} 
    point = PointStruct(id=point_id, vector=embeddings_list, payload=payload) 

    # Insert into Qdrant 
    client.upsert(points=[point], collection_name=collection_name)

This function takes our embeddings and imports them into our Qdrant collection. 

Finally, we can bring it all together, and loop through our music library and process the music using the functions which we have just defined: 

# Loop through the audio files, process them and load them into our database. 
point_id = 1 

for root, dirs, files in os.walk(directory): 
    for file in files: 
        if file.endswith('.mp3'): 
            file_path = os.path.join(root, file) 
            print(f"Processing file: {file_path}") 
            embeddings = read_audio(file_path) 
            store_in_qdrant(client, collection_name, embeddings, file_path, point_id) 
            point_id += 1 

Once we run this, we should now have our library of music successfully processed into vectors and uploaded into our Qdrant collection. We can now use the Qdrant dashboard to perform similarity searches, as well as visualize a map of similarity. 

To perform a similarity search, click on your collection, and then the ‘Find Similar’ button. You should get a result that looks like this. In my music collection, the most similar songs are mostly from the same album, with a few exceptions, which is just what we would expect: 

Another cool feature in the Qdrant dashboard is the visualizer. This allows us to view a map of our songs, with the distance between the points correlating to how similar the songs are. To use this, click the ‘Visualize’ button, and on the right-hand side of the screen, there is a query editor window. Hit the small ‘Run’ button, and you should see a screen like this: 

Congratulations, you now have your own personal music recommender! 


As we've journeyed through the workings of vector databases, we have seen how the seemingly magical music discovery on platforms like Spotify is crafted. These systems don't just guess our musical tastes; they intricately map them to generate personalized music experiences. Vector databases act as architects of our digital experiences, tailoring many platforms to our individual desires.