Data-Driven Movie Recommendations: Scrapy Web Scraping and Actor Network Graphs with NetworkX

Scrapy
NetworkX
Plotly
Pandas
NumPy
Author

Emil Blaignan

Published

February 8, 2025

Network Graph of Movie Data

Introduction

TMDB (The Movie Database) is a comprehensive online database that provides detailed information on movies, TV shows, actors, and production crews. In this project, we will develop a Scrapy-based web crawler to extract movie and actor data from TMDB. The crawler is designed to navigate from a movie’s main page to its full cast list, retrieve actor profiles, and extract their filmographies. We will employ CSS selectors for precise data extraction, implement depth-controlled crawling to avoid unnecessary requests, and optimize request handling to improve performance. Finally, we will use the extracted data to identify patterns in shared actor collaborations across movies using NetworkX. Through this approach, we demonstrate how Scrapy can be leveraged to build a structured dataset for analyzing relationships within the film industry.

Setting Up the Web Scraper

In this section, we’ll first initialize the Scrapy framework, examine how TMDB organizes its data, and prepare our project for extracting meaningful movie-actor relationships.

Initializing the Scrapy Project

We’ll start by creating a new Scrapy project.

We can do this by opening the terminal and activating the relevant Python environment:

conda activate PIC16B-25W

Next, create a Scrapy project by running:

scrapy startproject TMDB_scraper
cd TMDB_scraper

Running this will create a folder named “TMDB_scraper” with the following contents inside:

TMDB_scraper/
├── scrapy.cfg
└── TMDB_scraper/
    ├── spiders/
    │   └── tmdb_spider.py    # Our main spider script (we'll create this file)
    ├── settings.py           # Scrapy settings file
    ├── middlewares.py        
    ├── pipelines.py          
    └── __init__.py           

We’ll mainly work in the tmdb_spider.py file, where we will write the logic for our scraper. We’ll also make minor changes to our spider’s settings in settings.py.

Understanding the TMDB Website Structure

In order to extract the correct data, we need to analyze how TMDB organizes and displays its information. A movie page typically exhibits the following structure:

Movie Starting Page
Once on a movie’s starting page, we observe that the URL contains a movie ID. Our project will focus on Harry Potter and the Sorcerer’s Stone, whose TMDB URL is:

https://www.themoviedb.org/movie/671-harry-potter-and-the-philosopher-s-stone/

The subdirectory 671-harry-potter-and-the-philosopher-s-stone (the movie ID followed by a URL slug) will be passed as an argument to our scraper.

Navigate to the Cast & Crew Page
Next, we click the “Full Cast & Crew” link, which takes us to the following page:

https://www.themoviedb.org/movie/671-harry-potter-and-the-philosopher-s-stone/cast

Notice that the URL is the same as the previous page, except for the /cast appended to the end. This page is where our spider will extract the list of actors (ignoring crew members).

Extracting Actor Profiles
Clicking on an actor’s name leads to the actor profile page:

https://www.themoviedb.org/person/10980-daniel-radcliffe

On this page, we need to find the “Acting” section, which contains a list of movies and TV shows the actor has performed in. There are multiple sections such as Production, Writing, and Acting. We must only extract movie titles listed under “Acting”.

Configuring the Scrapy Crawler Settings

To prevent excessive data requests while we are testing the crawler, we’ll temporarily add the following line in settings.py:

CLOSESPIDER_PAGECOUNT = 20

This stops the spider after 20 pages so we don’t accidentally send too many requests while debugging.

When scraping, we want every response to come back with a 200 status code. A 403 Forbidden error means TMDB has detected that we’re a bot. A simple fix is changing the User-Agent in settings.py:

USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36"

This makes our scraper appear as a normal browser request, reducing the chance of being blocked.
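Alongside the page cap and User-Agent above, we can optionally throttle the crawl so it stays polite while we iterate. These are standard Scrapy settings, though the values here are just illustrative:

# settings.py -- optional politeness settings while developing (illustrative values)
DOWNLOAD_DELAY = 1                    # wait ~1 second between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 4    # limit parallel requests to TMDB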

Building the Scrapy Spider

Now that our Scrapy project is set up and we have analyzed the TMDB website structure, we can implement the three parsing methods in our spider:

  1. parse(): Navigates from the movie page to the Full Cast & Crew page.
  2. parse_full_credits(): Extracts the list of actors in the movie.
  3. parse_actor_page(): Extracts movies from the actor’s Acting section.

Each of these methods will systematically extract the data we need. We’ll write them into the tmdb_spider.py file.
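To make the structure concrete, here is a minimal sketch of the spider skeleton and the first two methods. The start URL and the /cast navigation follow directly from the URL patterns above; the CSS selector for actor links is an assumption about TMDB’s markup and should be verified in the scrapy shell (see below). parse_actor_page() is developed in the next section.

# tmdb_spider.py -- skeleton sketch; parse_actor_page() is added below
import scrapy

class TmdbSpider(scrapy.Spider):
    name = "tmdb_spider"

    def __init__(self, subdir="", *args, **kwargs):
        # The movie subdirectory is passed on the command line via -a subdir=...
        self.start_urls = [f"https://www.themoviedb.org/movie/{subdir}/"]
        super().__init__(*args, **kwargs)

    def parse(self, response):
        # The Full Cast & Crew page is the movie URL with "cast" appended
        yield scrapy.Request(response.urljoin("cast"), callback=self.parse_full_credits)

    def parse_full_credits(self, response):
        # ASSUMPTION: actor (not crew) links sit inside ol.people.credits lists;
        # verify this selector in the scrapy shell before running a full crawl
        for href in response.css("ol.people.credits:not(.crew) div.info a::attr(href)").getall():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_actor_page)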

Extracting Movie and TV Show Titles

parse_actor_page()
On an actor’s TMDB page, movies and TV shows are categorized under multiple sections like Acting, Writing, Directing, and Producing. We must extract only titles under the “Acting” section.

Implementation:

def parse_actor_page(self, response):
    """
    Extracts the actor's name and their acting roles, yielding
    (actor, movie/TV title) pairs.
    """
    # Selector extracts the actor's name
    actor_name = response.css("h2.title a::text").get()

    # Selector extracts the section headers (e.g., "Acting", "Production")
    section_headers = response.css("div.credits_list h3::text").getall()

    # Locate the "Acting" section; skip pages without one to avoid a ValueError
    if "Acting" not in section_headers:
        return
    acting_index = section_headers.index("Acting")

    # Get all tables and extract titles from the table matching the Acting header
    all_tables = response.css("div.credits_list table.card")
    acting_table = all_tables[acting_index]
    titles = acting_table.css("a.tooltip bdi::text").getall()

    for movie_or_TV_name in titles:
        yield {"actor": actor_name, "movie_or_TV_name": movie_or_TV_name}

This method works by:

  • Extracting the actor’s name from the <h2 class="title"> section.
  • Finding all section headers (e.g., “Production”, “Acting”) to locate the Acting section.
  • Finding the matching table for “Acting” and extracting all movie/TV show titles inside.
  • Yielding a dictionary containing (actor, movie/TV show) pairs.

This method ensures that only acting credits are included.

Running the Scraper

Before we run our scraper, we can check that our CSS selectors are working by running this in the terminal:

scrapy shell "your_page_link"

Within the shell for each page, we can run each selector in our three parse methods to see if it yields the desired output.
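For example, on Daniel Radcliffe’s profile page we might verify the two selectors used in parse_actor_page():

scrapy shell "https://www.themoviedb.org/person/10980-daniel-radcliffe"
>>> response.css("h2.title a::text").get()              # expect the actor's name
>>> response.css("div.credits_list h3::text").getall()  # expect a list including "Acting"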

After checking our selectors, we are ready to run the scraper with the following line:

scrapy crawl tmdb_spider -o results.csv -a subdir=671-harry-potter-and-the-philosopher-s-stone

This command will:

  1. Start at the Harry Potter and the Sorcerer’s Stone movie page.
  2. Extract all actors from the Full Cast & Crew page.
  3. Visit each actor’s profile and extract all movies in which they have acted.
  4. Save the results in results.csv.

Our CSV file should look like:

      actor            movie_or_TV_name
0     Paul Marc Davis  Arthur & Merlin: Knights of Camelot
1     Paul Marc Davis  Class
2     Paul Marc Davis  Son of God
3     Paul Marc Davis  The Bible
4     Paul Marc Davis  The Sky in Bloom
...   ...              ...
3121  Rupert Grint     The View
3122  Rupert Grint     GMTV
3123  Rupert Grint     The Tonight Show with Jay Leno
3124  Rupert Grint     An Audience with...
3125  Rupert Grint     Today

3126 rows × 2 columns
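As a quick sanity check, we can load the output with pandas and confirm the row count and column names:

import pandas as pd

df = pd.read_csv("results.csv")
print(df.shape)    # (3126, 2) for this crawl
print(df.columns)  # Index(['actor', 'movie_or_TV_name'], dtype='object')
df.head()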

Now that we have successfully scraped the data, the next step is to analyze the results and create a visualization of shared actors across movies. In the next section, we will:

  • Filter the data to find movies with the most shared actors.
  • Create a network graph to visually represent relationships between movies and actors.

Visualizing Movie Recommendations

Once the scraper is complete, we need to analyze the results and present them in a meaningful way. Since our dataset consists of movies and actors, a network graph is an intuitive choice for visualization. A network graph allows us to see which movies share actors and how strongly they are connected.

Processing the Scraped Data

The CSV output from our Scrapy spider contains two columns:

  • actor: the name of the actor.
  • movie_or_TV_name: the name of the movie or TV show they appeared in.

To build a movie recommendation system, we’ll identify movies that share multiple actors using the following steps (sketched in code after the list):

  1. Group actors by movie: Create a mapping of each movie to the set of actors who starred in it.
  2. Compute shared actors: Identify pairs of movies that share actors.
  3. Filter on shared actor count: Only include movies that share at least 5 actors with another movie in the final visualization.
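Here is a compact sketch of these three steps using pandas and itertools (assuming the column names from our CSV); the full plotting function below performs the same computation inline:

import pandas as pd
from itertools import combinations

df = pd.read_csv("results.csv")

# Step 1: map each movie to the set of actors who appeared in it
movie_actor_map = df.groupby("movie_or_TV_name")["actor"].apply(set).to_dict()

# Step 2: count shared actors for every pair of distinct movies
shared_counts = {
    (m1, m2): len(movie_actor_map[m1] & movie_actor_map[m2])
    for m1, m2 in combinations(movie_actor_map, 2)
}

# Step 3: keep only pairs sharing at least 5 actors
strong_pairs = {pair: n for pair, n in shared_counts.items() if n >= 5}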

Building the Network Graph

We’ll use NetworkX to create a graph where:

  • Nodes represent movies and actors.
  • Edges exist between a movie node and an actor node if the actor starred in the movie.
  • Movies with more shared actors are emphasized by larger node size and distinct colors.

Additionally, in order to enhance readability, we’ll apply:

  • Node scaling to make movies with more shared actors appear larger.
  • Color encoding to assign colors to movie nodes using the viridis colormap, where warmer colors (yellow) indicate more shared actors and cooler colors (dark purple) indicate fewer.

Deploying the Movie Network Graph

First, we’ll need to import the necessary packages to create our visualization:

import networkx as nx
import plotly.graph_objects as go
import pandas as pd
import random
import matplotlib.pyplot as plt
import numpy as np

Next, we will write a function that takes our CSV file and a minimum number of shared actors as inputs, so that we only recommend movies and TV shows with substantial cast overlap.

The following function plots our interactive movie network graph with our desired specs:

def generate_movie_actor_network(csv_file, min_shared_actors):
    """
    Generates a movie-actor network graph from a CSV file containing movie and actor relationships.
    
    """

    # Loads in the dataset
    df = pd.read_csv(csv_file)

    # Create a graph
    G = nx.Graph()

    # Build the mapping of movies to actors
    movie_actor_map = {}
    for _, row in df.iterrows():
        movie = row["movie_or_TV_name"]
        actor = row["actor"]
        if movie not in movie_actor_map:
            movie_actor_map[movie] = set()
        movie_actor_map[movie].add(actor)

    # Identify valid movies that share at least `min_shared_actors` actors with another movie
    valid_movies = set()
    for movie1 in movie_actor_map:
        for movie2 in movie_actor_map:
            if movie1 != movie2:
                shared_actors = movie_actor_map[movie1] & movie_actor_map[movie2]
                if len(shared_actors) >= min_shared_actors:
                    valid_movies.add(movie1)
                    valid_movies.add(movie2)

    # Compute the number of unique actors in each valid movie. Since every actor
    # in the dataset appeared in the starting film, this equals the number of
    # actors the movie shares with our starting movie.
    movie_shared_counts = {
        movie: len(movie_actor_map[movie]) for movie in valid_movies
    }
    max_shared = max(movie_shared_counts.values(), default=1)  # Avoid division by zero

    # Normalize node sizes (scale between 10 and 55 for visualization purposes);
    # linearly maps shared-actor counts from 3 up to the maximum onto the size range
    def scale_size(value, min_size=10, max_size=55):
        if max_shared <= 3:
            return min_size
        return min_size + ((value - 3) / (max_shared - 3)) * (max_size - min_size)

    # Assign movie colors based on shared actor count using the viridis colormap
    norm = plt.Normalize(vmin=min(movie_shared_counts.values(), default=0), vmax=max_shared)
    thermal_colormap = plt.get_cmap("viridis")  # dark purple (low) to yellow (high)

    movie_colors = {
        movie: f'rgb({int(r*255)},{int(g*255)},{int(b*255)})'
        for movie, (r, g, b, _) in zip(
            movie_shared_counts.keys(),
            [thermal_colormap(norm(v)) for v in movie_shared_counts.values()]
        )
    }

    # Add nodes and edges to the graph
    for movie in valid_movies:
        G.add_node(movie, type="movie", size=scale_size(movie_shared_counts[movie]), color=movie_colors[movie])

    for movie in valid_movies:
        for actor in movie_actor_map[movie]:
            if actor not in G:
                G.add_node(actor, type="actor", size=8, color="lightgray")  # Smaller nodes for actors
            G.add_edge(movie, actor)

    # Compute graph layout for better spacing
    pos = nx.spring_layout(G, seed=42, k=1.7)  # 'k' controls the node spacing

    # Create scatter plot for nodes
    node_trace = []
    for node in G.nodes():
        node_type = G.nodes[node]["type"]
        x, y = pos[node]
        color = G.nodes[node]["color"]
        size = G.nodes[node]["size"]

        # Set hover text for nodes
        hover_text = node if node_type == "actor" else f"{node}<br>Shared Actors: {movie_shared_counts[node]}"

        node_trace.append(go.Scatter(
            x=[x], y=[y],
            mode="markers" if node_type == "actor" else "markers+text",
            text=[node] if node_type == "movie" else None,  # label movie nodes
            marker=dict(size=size, color=color, opacity=0.9, showscale=False),
            hovertext=hover_text,
            textposition="top center" if node_type == "movie" else None
        ))

    # Create edges for connections between movies and actors
    edge_trace = []
    for edge in G.edges():
        x0, y0 = pos[edge[0]]
        x1, y1 = pos[edge[1]]
        edge_trace.append(go.Scatter(
            x=[x0, x1, None], y=[y0, y1, None],
            mode="lines",
            line=dict(width=1, color="gray"),
            hoverinfo="none",
            opacity=0.2  # Light transparency for better visibility
        ))

    # Add a color bar to indicate shared actors
    colorbar_trace = go.Scatter(
        x=[None], y=[None],
        mode='markers',
        marker=dict(
            colorscale="viridis",  # Thermal color scheme
            cmin=min(movie_shared_counts.values(), default=0),
            cmax=max_shared,
            showscale=True,
            colorbar=dict(
                title="# of Shared Actors",
                titleside="right",  # Aligns title to the right
                tickmode="array",
                tickvals=list(range(0, int(max_shared) + 1, 10)),
                ticktext=[str(tick) for tick in range(0, int(max_shared) + 1, 10)],
                tickfont=dict(size=10),  
                len=1,  
                thickness=20,  
                outlinewidth=1.3,
                xpad=6,  
                x=0.97  
            )
        ),
        hoverinfo='none'
    )

    # Create final figure
    fig = go.Figure(data=edge_trace + node_trace + [colorbar_trace])
    fig.update_layout(
        title=dict(
            text="Movie Recs: Movie-Actor Network Graph",
            x=0.427,  
            xanchor="center",
            yanchor="top",
            font=dict(size=20)  
        ),
        showlegend=False,
        hovermode="closest",
        margin=dict(b=20, l=5, r=140, t=40),
        xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
        yaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
        plot_bgcolor="white",  
        paper_bgcolor="white"  
    )

    # Return the figure for display
    return fig

We can call our plotting function with our CSV and minimum shared actors by running the following:

fig = generate_movie_actor_network(csv_file = "results.csv", 
                                   min_shared_actors=5)

fig.show()
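To embed the interactive graph in a blog post, we can also export it as a standalone HTML file (the filename here is just an example):

fig.write_html("movie_network.html", include_plotlyjs="cdn")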

Interpreting the Visualization

By analyzing the graph, we can identify:

  • Highly interconnected movies: Franchises and sequels (e.g., Harry Potter Series) tend to cluster together.
  • Unexpected connections: Some movies share actors despite belonging to different genres.
  • Strongest recommendations: The best recommendations are movies that share the most actors with the chosen starting movie; a ranked list can be computed directly from the scraped CSV, as sketched below.
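Since every actor in our dataset appeared in the starting film, we can turn this last point into a concrete ranked list with a few lines of pandas. This is a sketch, and the starting title string is an assumption about how TMDB names the film in our scraped data:

import pandas as pd

df = pd.read_csv("results.csv")
movie_actor_map = df.groupby("movie_or_TV_name")["actor"].apply(set).to_dict()

# ASSUMPTION: the starting film appears under this exact title in results.csv
start = "Harry Potter and the Sorcerer's Stone"
start_cast = movie_actor_map.get(start, set())

# Rank all other titles by the number of cast members shared with the start film
ranking = sorted(
    ((title, len(actors & start_cast))
     for title, actors in movie_actor_map.items() if title != start),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranking[:10])  # the ten strongest recommendations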

This approach demonstrates how web scraping and network analysis can be combined to generate meaningful movie recommendations based purely on cast overlap, without using machine learning.

Resources Used (For NetworkX Visualization):

  • https://plotly.com/python/network-graphs/
  • https://community.plotly.com/t/is-plotly-able-to-integrate-with-networkx-graph/59115