My students are learning about multidimensional scaling (MDS) this week. In previous semesters I’ve had them look at campaign donations, but I’ve learned that the data analytics students don’t always have the requisite background to easily interpret what the resulting scales mean. After spending a day or two searching the internet for different options I came across this dataset of movie summaries, which includes plot summaries scraped from Wikipedia. I thought it would be interesting to see whether it is possible to identify genres through similarities in plot summaries.
This process requires a few steps:
Download and process the dataset to make it more manageable for my students.
Embed the summaries using a sentence embedding model.
Collapse those embeddings to avoid the curse of dimensionality.
Calculate the pairwise distances between each summary (what my students will actually use).
We’ll start by downloading and processing the data.
1. Loads the dataset using the `datasets` package from Hugging Face.
2. Filters down to just one year of films produced in the US, and drops some of the stranger films that have an unknown genre (a sketch of both steps follows).
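A minimal sketch of these two steps, assuming a hypothetical dataset id and column names (and guessing the year from the films that appear later):

```python
from datasets import load_dataset

# Hypothetical dataset id; substitute the actual movie-summaries dataset.
ds = load_dataset("some-user/movie-summaries", split="train")

# Keep one year of US-produced films and drop entries with an unknown genre.
# The column names and the year (2016 fits Deadpool) are assumptions.
ds = ds.filter(
    lambda row: row["year"] == 2016
    and row["country"] == "United States"
    and row["genre"] != "unknown"
)
```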
To embed the summaries I use the SentenceTransformer package. This package makes it easy to embed “sentences” (really just longer blocks of text) using a variety of pre-trained models. These embeddings place the sentences into a multi-dimensional space where similar content should be near each other. There are a lot of potential models you can use to do this (the range of options remains one of the more overwhelming aspects of using LLM-related techniques for me). The maintainers of SentenceTransformer suggest using the MTEB rankings, which summarize how well a range of models do on a wide variety of tasks.
I selected Lajavaness’s bilingual-embedding-large as it runs quickly on my laptop and seems to perform well. The biggest limitation is its 512-token context window, meaning longer summaries will be truncated.
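Loading the model looks something like this (a sketch; whether `trust_remote_code` is needed depends on the model, so treat it as an assumption):

```python
from sentence_transformers import SentenceTransformer

# Load the pre-trained embedding model from the Hugging Face hub.
model = SentenceTransformer(
    "Lajavaness/bilingual-embedding-large",
    trust_remote_code=True,  # assumption: some hub models require this
)

# Confirm the 512-token limit; text beyond it is silently truncated.
print(model.max_seq_length)
```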
We use a lambda function (a fancy term for a function you don’t bother naming) to encode each plot summary. The function needs to return a dictionary.
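Concretely, something like the following, where the `summary` column name is an assumption:

```python
# datasets.map expects the lambda to return a dict, which becomes
# a new "embedding" column on the dataset.
ds = ds.map(lambda row: {"embedding": model.encode(row["summary"])})
```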
We’ve now embedded each summary into a 1,024-dimensional space. I could hand this over to my students as is, but I wanted to make sure the data would lead them to some interesting patterns. High-dimensional spaces are bad because everything ends up very far from everything else. There are a variety of ways of dealing with this, but I’ve come to appreciate the UMAP approach, which reduces the dimensionality. Here we drop it down to 5 dimensions (a sketch of the call follows the notes below).
1. Set up the UMAP function to scale the data down to 5 dimensions (n_components). n_neighbors is another parameter that controls how much of the local space (versus global space) to preserve around each observation. The default is 15; I lowered it to 5 as this is a relatively small dataset, and playing around with it, this led to clearer clustering in the end.
2. Reduce the dimensions.
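A minimal sketch with umap-learn, assuming the embeddings are stacked into a NumPy array (the random_state is my addition, for reproducibility):

```python
import numpy as np
import umap

embeddings = np.array(ds["embedding"])  # shape: (n_films, 1024)

# 1. Set up the reducer: 5 output dimensions, smaller-than-default neighborhood.
reducer = umap.UMAP(n_components=5, n_neighbors=5, random_state=42)

# 2. Reduce the dimensions.
reduced = reducer.fit_transform(embeddings)  # shape: (n_films, 5)
```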
Finally, I calculate the Euclidean distance between each pair of observations and save the data.
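One way to do this with SciPy (the `title` column name is an assumption):

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Pairwise Euclidean distances in the reduced 5-dimensional space.
dist = squareform(pdist(reduced, metric="euclidean"))

# Label rows and columns by film title and save for the students.
dist_df = pd.DataFrame(dist, index=ds["title"], columns=ds["title"])
dist_df.to_csv("movie_distances.csv")
```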
As a check, Table 1 shows the five closest and five furthest movies from Deadpool. The closest ones (except for Zoolander 2) are also action films featuring a superhero (or something close to it). The furthest are all over the place, including comedies and more traditional dramas.
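Pulling those rows out of the distance matrix is straightforward:

```python
# Distances from Deadpool to every other film, sorted ascending.
deadpool = dist_df["Deadpool"].drop("Deadpool").sort_values()
print(deadpool.head(5))  # five closest
print(deadpool.tail(5))  # five furthest
```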
Table 1: Distance from Deadpool (five closest and five furthest films; columns: Title, Distance)