Netflix Movies and TV Shows Analysis

Datasets from: https://www.kaggle.com/shivamb/netflix-shows

https://population.un.org/wpp/Download/Standard/CSV/

https://data.worldbank.org/indicator/NY.GDP.MKTP.CD?view=map

https://datahub.io/JohnSnowLabs/population-figures-by-country

Questions being asked

1: Where are most Netflix titles being produced?
    a: When were titles from each country added to Netflix's library?
2: What does the network of directors look like?

We will first import our dataset using pandas and get it ready for processing.

Let's first look at what kinds of data attributes we have.

In [1]:
# %env BOKEH_RESOURCES=inline 
import pandas as pd
import geopandas as gpd
import numpy as np
import matplotlib.pyplot as plt
import json
from bokeh.io import output_notebook, show, output_file
from bokeh.plotting import figure
from bokeh.models import GeoJSONDataSource, ColorBar, LogColorMapper
from bokeh.palettes import brewer
# output_notebook()

# Display fewer rows, since the full default output was pretty obnoxious
pd.set_option('display.max_rows', 6)
data = pd.read_csv('netflix_titles.csv')
display(data)
     | show_id  | type    | title                                       | director                 | cast                                              | country                                  | date_added        | release_year | rating   | duration   | listed_in                                 | description
0    | 81145628 | Movie   | Norm of the North: King Sized Adventure     | Richard Finn, Tim Maltby | Alan Marriott, Andrew Toth, Brian Dobson, Cole... | United States, India, South Korea, China | September 9, 2019 | 2019         | TV-PG    | 90 min     | Children & Family Movies, Comedies        | Before planning an awesome wedding for his gra...
1    | 80117401 | Movie   | Jandino: Whatever it Takes                  | NaN                      | Jandino Asporaat                                  | United Kingdom                           | September 9, 2016 | 2016         | TV-MA    | 94 min     | Stand-Up Comedy                           | Jandino Asporaat riffs on the challenges of ra...
2    | 70234439 | TV Show | Transformers Prime                          | NaN                      | Peter Cullen, Sumalee Montano, Frank Welker, J... | United States                            | September 8, 2018 | 2013         | TV-Y7-FV | 1 Season   | Kids' TV                                  | With the help of three human allies, the Autob...
...  | ...      | ...     | ...                                         | ...                      | ...                                               | ...                                      | ...               | ...          | ...      | ...        | ...                                       | ...
6231 | 80116008 | Movie   | Little Baby Bum: Nursery Rhyme Friends      | NaN                      | NaN                                               | NaN                                      | NaN               | 2016         | NaN      | 60 min     | Movies                                    | Nursery rhymes and original music for children...
6232 | 70281022 | TV Show | A Young Doctor's Notebook and Other Stories | NaN                      | Daniel Radcliffe, Jon Hamm, Adam Godley, Chris... | United Kingdom                           | NaN               | 2013         | TV-MA    | 2 Seasons  | British TV Shows, TV Comedies, TV Dramas  | Set during the Russian Revolution, this comic ...
6233 | 70153404 | TV Show | Friends                                     | NaN                      | Jennifer Aniston, Courteney Cox, Lisa Kudrow, ... | United States                            | NaN               | 2003         | TV-14    | 10 Seasons | Classic & Cult TV, TV Comedies            | This hit sitcom follows the merry misadventure...

6234 rows × 12 columns

Here we want to try to figure out what this dataset can tell us.

The first thing I want to look at is which country each TV show and movie was produced in.

Let's first get a cleaner dataset by grabbing only the country values.

Steps to take:

1. Obtain only the country column
2. Remove all rows where the country value is NaN
In [2]:
# Only grab the country column
df = pd.DataFrame(data, columns=['country'])
# Drop all rows where the country is NaN
df = df.dropna()
#display(df)

Now that we have a cleaner dataset, let's do some calculations on it. First, let's count each time we see a country to get an idea of how frequently a Netflix show or movie was made there. Since a title can list several countries, we split the comma-separated field before counting.

In [3]:
# Split multi-country entries on commas (with or without a following space),
# then count how often each country appears
frequencyOfCountriesDict = df['country'].str.split(', |,', expand=True).stack().value_counts().to_dict()

# Build the rows in a new list so we don't clobber the original `data` DataFrame
rows = []
for item in frequencyOfCountriesDict:
    rows.append([item, frequencyOfCountriesDict[item]])

newdf = pd.DataFrame(rows, columns=['CountryCode', 'frequency'])
#display(newdf)

We have the frequency of Netflix titles in each country. Let's now put this into a chart so it is easier to visualize the magnitudes of the countries in relation to each other.

In [31]:
plt.figure(figsize=(20,20))
plt.xticks(rotation='vertical')
plt.bar(newdf['CountryCode'], newdf['frequency'], width=.8, color='r')  # 'r' (lowercase) is the valid red shorthand
Out[31]:
<BarContainer object of 111 artists>

Double-click the image above to enlarge it.

This bar chart doesn't give a good representation of the data in relation to other countries, so let's try making a heat map of the world and see if we can draw some more conclusions from that.

We start off by converting each country name into its respective three-letter country code and seeing which names we can map and which we can't.

In [5]:
import pycountry

countries = {}
for country in pycountry.countries:
    countries[country.name] = country.alpha_3

swapList = {}
unknownCodes = {}
for index, row in newdf.iterrows():
    code = countries.get(row['CountryCode'], 'Unknown code')
    if code == 'Unknown code':
        unknownCodes[row['CountryCode']] = code
    else:
        swapList[row['CountryCode']] = code

# The country name does not always map to a country code, so we need to identify the
# ones we missed. These end up in unknownCodes and are defined by hand in the next block.
#display(swapList, unknownCodes)

Some of the countries did not have a code attached to their name because the naming did not match up, so let's set up a dictionary to explicitly map the correct countries to their respective codes.

In [6]:
unknownCodeDict = {
    'South Korea': 'KOR',
    'Taiwan': 'TWN',
    'Russia': 'RUS',
    'Czech Republic': 'CZE',
    'West Germany': 'DEU',
    'Vietnam': 'VNM',
    'Iran': 'IRN',
    'Soviet Union': 'RUS',
    'Venezuela': 'VEN',
    'Vatican City': 'VAT',
    'East Germany': 'DEU',
    'Syria': 'SYR'
}

for country in unknownCodes:
    if country in unknownCodeDict:
        swapList[country] = unknownCodeDict[country]
    
#display(swapList)

We now have a full dictionary of country codes

We need to create a new dataframe (rather than replacing the old one; modifying values in place is frowned upon) with the new country codes that we acquired.

In [7]:
dataFinal = []

for index, item in newdf.iterrows():
    # Use label-based access on the row; positional indexing into a Series is deprecated
    if item['CountryCode'] in swapList:
        newCode = swapList[item['CountryCode']]
        dataFinal.append([newCode, item['frequency']])


dfFinal = pd.DataFrame(dataFinal, columns=['CountryCode', 'frequency'])
display(dfFinal)
                   
    CountryCode  frequency
0           USA       2610
1           IND        838
2           GBR        602
..          ...        ...
107         CYP          1
108         SDN          1
109         MUS          1

110 rows × 2 columns

This is where we try to render our data onto the map.

In [8]:
def createFrequencyMap(data, nameOfColumnForCountryCode, frequencyColumnName, title="Default Title"):
    shapefile = './GeoShapeFile/ne_110m_admin_0_countries.shp'
    #Read shapefile using Geopandas
    gdf = gpd.read_file(shapefile)[['ADMIN', 'ADM0_A3', 'geometry']]
    gdf.columns = ['country', 'country_code', 'geometry']

    merged = gdf.merge(data, left_on='country_code', right_on=nameOfColumnForCountryCode)
    merged_json = json.loads(merged.to_json())
    json_data = json.dumps(merged_json)


    #Input GeoJSON source that contains features for plotting.
    geosource = GeoJSONDataSource(geojson = json_data)

    #Define a sequential color palette.
    palette = brewer['Reds'][8]

    #Reverse the color order so that dark red corresponds to the highest value.
    palette = palette[::-1]

    #Instantiate a LogColorMapper that maps numbers in a range, on a log scale, into a sequence of colors.
    color_mapper = LogColorMapper(palette = palette, low = min(data[frequencyColumnName]), high = max(data[frequencyColumnName]))

    #Create color bar. 
    color_bar = ColorBar(color_mapper=color_mapper, label_standoff=10,width = 10, height = 400,
    border_line_color=None,location = (0,0), orientation = 'vertical' )

    #Create figure object.
    p = figure(title = title, plot_height = 500 , plot_width = 1000, toolbar_location = None)
    p.xgrid.grid_line_color = None
    p.ygrid.grid_line_color = None

    #Add patch renderer to figure. 
    p.patches('xs','ys', source = geosource,fill_color = {'field' :frequencyColumnName, 'transform' : color_mapper},
    line_color = 'black', line_width = 0.2, fill_alpha = 1)

    #Specify figure layout.
    p.add_layout(color_bar, 'right')

    #Display figure inline in Jupyter Notebook.
    output_notebook()

    #Display figure.
    show(p)
    
createFrequencyMap(dfFinal, 'CountryCode', 'frequency', 'Number of Netflix Titles Per Country')

Now, by looking at the map and also the bar chart, we can identify a few things:

  1. America is the main location for producing titles (which makes sense, as Netflix is an American-based company), and proximity likely boosts the numbers for Mexico and Canada.
  2. We can also see that India and China both have a good number of titles, but India far surpasses China while having a smaller population; the reason for that is something we can explore.
  3. I suspect the frequency correlates with the GDP of the country, its population, and maybe the most spoken languages.
  4. This does not take time into account, so we could look at when Netflix added each title to its library to see if it initially started with a core audience and then branched out (a starting sketch follows this list).
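
As a starting point for item 4, here is a minimal sketch (reloading netflix_titles.csv so it is self-contained, and relying only on the 'country' and 'date_added' columns shown above) that finds the first year a title from each country was added to the library:

In [ ]:
# Sketch: first year a title produced in each country was added to Netflix
import pandas as pd

titles = pd.read_csv('netflix_titles.csv')
titles = titles.dropna(subset=['country', 'date_added'])

# Parse "September 9, 2019"-style strings; unparseable rows become NaT and are dropped
titles['date_added'] = pd.to_datetime(titles['date_added'].str.strip(), errors='coerce')
titles = titles.dropna(subset=['date_added'])

# A title can list several countries, so split the comma-separated field and explode
titles['country'] = titles['country'].str.split(', |,')
titles = titles.explode('country')

# Earliest year each producing country first appeared in the library
firstYearByCountry = titles.groupby('country')['date_added'].min().dt.year.sort_values()
print(firstYearByCountry.head(10))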

Let's first look at the second observation and do a little research to see if we can figure anything out.

I quickly found https://www.techinasia.com/netflix-china-nope, which shows this is an already-researched topic: there are more restrictions in China on distributing content, so Netflix takes on fewer Chinese titles because it would not be able to reach that user base anyway. We can assume the same effect may apply to other countries. If I were Netflix deciding whether to invest in titles produced in a given country, I would want to take into account the population inside that nation and the global population of people from that nation. Later on, we can try to formulate a method relating the number of titles produced in a country to factors like these.

Let's take a look at another heat map, this time of every country's GDP, and see if it lines up with the one above.

In [9]:
countryGDPData = pd.read_csv('country_gdp.csv')
countryGDPDataDf = pd.DataFrame(countryGDPData, columns= ['Country Code', '2018'])
countryGDPDataDf = countryGDPDataDf.dropna()
#display(countryGDPDataDf)
In [10]:
createFrequencyMap(countryGDPDataDf, 'Country Code', '2018', "Map of every country's GDP in 2018")

Now let's take a look at population.

In [11]:
# Load data
populationData = pd.read_csv('world_pop.csv')

# Process data: grab the country code and the 2016 population columns
populationFrequencyDf = pd.DataFrame(populationData, columns=['Country_Code', 'Year_2016'])
# display(populationData)
createFrequencyMap(populationFrequencyDf, 'Country_Code', 'Year_2016', "Map of every country's population in 2016")
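
Before moving on to correlations, one quick, hypothetical sanity check is to normalize title counts by population. The sketch below reuses the dfFinal and populationFrequencyDf frames built above and assumes Year_2016 is a raw head count (if the file stores thousands, scale accordingly):

In [ ]:
# Sketch: Netflix titles per million inhabitants, to highlight countries that are
# over- or under-represented relative to their population size
perCapita = pd.merge(dfFinal, populationFrequencyDf,
                     left_on='CountryCode', right_on='Country_Code')
# Assumes Year_2016 is an absolute population count
perCapita['titles_per_million'] = perCapita['frequency'] / (perCapita['Year_2016'] / 1e6)
display(perCapita.sort_values('titles_per_million', ascending=False)
                 [['CountryCode', 'frequency', 'titles_per_million']])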

We now have three charts from three datasets, and we can loosely see the correlation. But rather than eyeballing these charts, let's get concrete numbers to validate that they are correlated.

In [12]:
# Merge on country code before correlating; comparing the unmerged frames
# row-by-row would pair up unrelated countries
s1 = pd.merge(dfFinal, countryGDPDataDf, how='outer', left_on=['CountryCode'], right_on=['Country Code'])
s1 = s1.dropna()
s2 = pd.merge(dfFinal, populationFrequencyDf, how='outer', left_on=['CountryCode'], right_on=['Country_Code'])
s2 = s2.dropna()
print("Correlation between the number of titles in a country and that country's GDP in 2018")
display(s1['frequency'].corr(s1['2018']))
print("Correlation between the number of titles in a country and that country's population")
display(s2['frequency'].corr(s2['Year_2016']))
Correlation between the number of titles in a country and that country's GDP in 2018
0.807888646323319
Correlation between the number of titles in a country and that country's population
0.3515759506206489
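
Both GDP and title counts are heavily skewed by a few large countries, so the Pearson correlations above can be dominated by outliers like the United States. As a hypothetical robustness check, a rank-based (Spearman) correlation on the same merged frames is a reasonable follow-up:

In [ ]:
# Sketch: Spearman rank correlation, which is less sensitive to the handful of
# very large countries that dominate the Pearson estimate
print("Spearman correlation, titles vs. GDP (2018):",
      s1['frequency'].corr(s1['2018'], method='spearman'))
print("Spearman correlation, titles vs. population (2016):",
      s2['frequency'].corr(s2['Year_2016'], method='spearman'))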
In [ ]: