Datasets from: https://www.kaggle.com/shivamb/netflix-shows
https://population.un.org/wpp/Download/Standard/CSV/
https://data.worldbank.org/indicator/NY.GDP.MKTP.CD?view=map
https://datahub.io/JohnSnowLabs/population-figures-by-country
1: Where are most Netflix titles produced?
    a: When were titles from that country added to Netflix?
2: Network of the directors
We will first import our dataset using pandas and get it ready for processing. Let's first look at what kinds of data attributes we have.
# %env BOKEH_RESOURCES=inline
import pandas as pd
import geopandas as gpd
import numpy as np
import matplotlib.pyplot as plt
import json
from bokeh.io import output_notebook, show, output_file
from bokeh.plotting import figure
from bokeh.models import GeoJSONDataSource, ColorBar, LogColorMapper
from bokeh.palettes import brewer
# output_notebook()
# Displaying fewer rows, as the previous output size was pretty obnoxious
pd.set_option('display.max_rows', 6)
data = pd.read_csv(r'netflix_titles.csv')
display(data)
The first thing that I want to look at is which country each TV show and movie was produced in. Let's first get a cleaner dataset by grabbing only the country values.
Steps to take:
1. Obtain only the country column
2. Remove all rows where the country value is missing (NA)
# Only grab the country column
df = pd.DataFrame(data, columns= ['country'])
# Drop all values that are NA
df = df.dropna()
#display(df)
Now that we have a cleaner dataset, let's do some calculations on it. First, let's count every occurrence of each country to get an idea of how frequently a Netflix show or movie was made there.
# A title can list several countries, so split on commas before counting
frequencyOfCountriesDict = df['country'].str.split(', |,', expand=True).stack().value_counts().to_dict()
data = []
for item in frequencyOfCountriesDict:
    data.append([item, frequencyOfCountriesDict[item]])
# Note: the 'CountryCode' column holds country names at this point; we convert them to codes later
newdf = pd.DataFrame(data, columns=['CountryCode', 'frequency'])
#display(newdf)
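As a side note, the round-trip through a dictionary above is not strictly necessary: `value_counts()` can be turned straight into a two-column DataFrame. A minimal sketch with made-up stand-in data for the split-and-stacked country Series:

```python
import pandas as pd

# Toy stand-in for the stacked country Series produced above
countries = pd.Series(["United States", "India", "United States",
                       "United Kingdom", "India", "United States"])

# value_counts() -> DataFrame in one step, no intermediate dict needed
newdf = (countries.value_counts()
                  .rename_axis("CountryCode")
                  .reset_index(name="frequency"))
print(newdf)
```

This produces the same `['CountryCode', 'frequency']` layout, sorted by frequency in descending order.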
We now have the frequency of Netflix titles in each country. Let's put this into a chart so it is easier to visualize the magnitudes of the countries in relation to each other.
plt.figure(figsize=(20, 20))
plt.xticks(rotation='vertical')
plt.bar(newdf['CountryCode'], newdf['frequency'], width=.8, color='r')
Double-click on the image above to enlarge it.
This bar chart doesn't give a good representation of the data in relation to other countries, so let's try making a heat map of the Earth and see if we can draw some more conclusions from that.
We start by converting each country name into its respective three-letter country code and seeing which names map cleanly and which don't.
import pycountry

# Build a name -> alpha-3 code lookup from pycountry
countries = {}
for country in pycountry.countries:
    countries[country.name] = country.alpha_3

swapList = {}      # names we could resolve to a code
unknownCodes = {}  # names pycountry did not recognize
for item in newdf.iterrows():
    if countries.get(item[1]['CountryCode'], 'Unknown code') == 'Unknown code':
        unknownCodes[item[1]['CountryCode']] = countries.get(item[1]['CountryCode'], 'Unknown code')
    else:
        swapList[item[1]['CountryCode']] = countries.get(item[1]['CountryCode'], 'Unknown code')
# We see that the country name does not always map to a country code so we need to identify which ones we missed
# This is in unknownCodes and then we need to define them as seen in the next block
#display(swapList, unknownCodes)
Some of the countries did not have a code attached to their name because the naming did not match up, so let's set up a dictionary to explicitly map the correct countries to their respective codes.
unknownCodeDict = {
    'South Korea': 'KOR',
    'Taiwan': 'TWN',
    'Russia': 'RUS',
    'Czech Republic': 'CZE',
    'West Germany': 'DEU',
    'Vietnam': 'VNM',
    'Iran': 'IRN',
    'Soviet Union': 'RUS',
    'Venezuela': 'VEN',
    'Vatican City': 'VAT',
    'East Germany': 'DEU',
    'Syria': 'SYR'
}
for country in unknownCodes:
    if country in unknownCodeDict:
        swapList[country] = unknownCodeDict[country]
#display(swapList)
We now have a full dictionary of country codes.
We need to create a new dataframe with the country codes we acquired (rather than mutating the old one in place, which I was told is frowned upon).
dataFinal = []
for index, item in newdf.iterrows():
    if item['CountryCode'] in swapList:
        newCode = swapList[item['CountryCode']]
        dataFinal.append([newCode, item['frequency']])
dfFinal = pd.DataFrame(dataFinal, columns=['CountryCode', 'frequency'])
display(dfFinal)
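For reference, the row-by-row loop above can also be written with pandas' `Series.map`, which turns unmapped names into NaN so they are removed by `dropna()`. A minimal sketch with toy stand-ins for `newdf` and `swapList` (the figures are made up for illustration):

```python
import pandas as pd

# Toy versions of the frequency table and the name -> code mapping
newdf = pd.DataFrame({"CountryCode": ["United States", "Atlantis", "India"],
                      "frequency": [2032, 5, 777]})
swapList = {"United States": "USA", "India": "IND"}

# map() swaps names for codes; names missing from swapList become NaN and are dropped
dfFinal = newdf.assign(CountryCode=newdf["CountryCode"].map(swapList)).dropna()
print(dfFinal)
```

This keeps only countries with a known code, just like the explicit loop.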
Now we render our data onto the map. The helper below merges a frequency table with a world shapefile and draws a Bokeh choropleth.
def createFrequencyMap(data, nameOfColumnForCountryCode, frequencyColumnName, title="Default Title"):
    shapefile = './GeoShapeFile/ne_110m_admin_0_countries.shp'
    # Read shapefile using Geopandas
    gdf = gpd.read_file(shapefile)[['ADMIN', 'ADM0_A3', 'geometry']]
    gdf.columns = ['country', 'country_code', 'geometry']
    # Merge the country shapes with our frequency data on country code
    merged = gdf.merge(data, left_on='country_code', right_on=nameOfColumnForCountryCode)
    merged_json = json.loads(merged.to_json())
    json_data = json.dumps(merged_json)
    # Input GeoJSON source that contains features for plotting
    geosource = GeoJSONDataSource(geojson=json_data)
    # Define a sequential multi-hue color palette
    palette = brewer['Reds'][8]
    # Reverse color order so that dark red marks the highest values
    palette = palette[::-1]
    # Instantiate LogColorMapper that logarithmically maps numbers in a range into a sequence of colors
    color_mapper = LogColorMapper(palette=palette, low=min(data[frequencyColumnName]), high=max(data[frequencyColumnName]))
    # Create color bar
    color_bar = ColorBar(color_mapper=color_mapper, label_standoff=10, width=10, height=400,
                         border_line_color=None, location=(0, 0), orientation='vertical')
    # Create figure object
    p = figure(title=title, plot_height=500, plot_width=1000, toolbar_location=None)
    p.xgrid.grid_line_color = None
    p.ygrid.grid_line_color = None
    # Add patch renderer to figure
    p.patches('xs', 'ys', source=geosource, fill_color={'field': frequencyColumnName, 'transform': color_mapper},
              line_color='black', line_width=0.2, fill_alpha=1)
    # Specify figure layout
    p.add_layout(color_bar, 'right')
    # Display figure inline in Jupyter Notebook
    output_notebook()
    # Display figure
    show(p)
createFrequencyMap(dfFinal, 'CountryCode', 'frequency', 'Number of Netflix Titles Per Country')
Now, by looking at the map as well as the bar chart, we can identify a few things.
Let's first look at the second question and do a little research to see if we can figure anything out.
A quick search turned up https://www.techinasia.com/netflix-china-nope, which shows this is an already researched topic: China places heavy restrictions on foreign content, so Netflix takes on fewer Chinese titles because it would not be able to reach that user base anyway. The same effect may apply to other countries. If I were Netflix deciding whether to invest in titles produced in a given country, I would want to take into account both the population inside that nation and the global population of people from that nation. Let's try to formulate a way to compare the number of titles produced in a country against those figures.
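One way to turn that intuition into a number is a titles-per-million-residents figure. A minimal sketch, assuming the title counts and populations have already been merged on country code; the country codes and figures below are made up for illustration:

```python
import pandas as pd

# Hypothetical merged table: title counts plus population, toy numbers only
merged = pd.DataFrame({
    "CountryCode": ["USA", "IND", "ISL"],
    "frequency":   [2000, 800, 4],
    "population":  [330_000_000, 1_380_000_000, 360_000],
})

# Titles per million residents highlights small countries that punch above their weight
merged["titles_per_million"] = merged["frequency"] / (merged["population"] / 1_000_000)
print(merged.sort_values("titles_per_million", ascending=False))
```

On these toy numbers the small country comes out on top, which is exactly the kind of signal raw counts would hide.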
Let's take a look at another heat map, this time of every country's GDP, and see if it lines up with the one above.
countryGDPData = pd.read_csv(r'country_gdp.csv')
countryGDPDataDf = pd.DataFrame(countryGDPData, columns=['Country Code', '2018'])
countryGDPDataDf = countryGDPDataDf.dropna()
#display(countryGDPDataDf)
createFrequencyMap(countryGDPDataDf, 'Country Code', '2018', "Map of every country's GDP in 2018")
Now let's take a look at population.
# Load data
populationData = pd.read_csv(r'world_pop.csv')
# Process data: we only need the country code and the 2016 population column
populationFrequencyDf = pd.DataFrame(populationData, columns=['Country_Code', 'Year_2016'])
# display(populationData)
createFrequencyMap(populationFrequencyDf, 'Country_Code', 'Year_2016', "Map of every country's population in 2016")
We now have charts from three datasets, and we can loosely see the correlation. But rather than eyeballing the charts, let's get concrete numbers to check whether they really are correlated.
# Note: calling dfFinal.corrwith(...) on the raw frames would pair rows by index position,
# not by country, so we first merge each pair of tables on country code and then correlate
s1 = pd.merge(dfFinal, countryGDPDataDf, how='outer', left_on=['CountryCode'], right_on=['Country Code'])
s1 = s1.dropna()
s2 = pd.merge(dfFinal, populationFrequencyDf, how='outer', left_on=['CountryCode'], right_on=['Country_Code'])
s2 = s2.dropna()
print("Correlation between the number of titles in a country and that country's GDP in 2018")
display(s1['frequency'].corr(s1['2018']))
print("Correlation between the number of titles in a country and that country's population")
display(s2['frequency'].corr(s2['Year_2016']))
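One caveat when reading these numbers: Pearson's r is heavily influenced by the few very large countries, so a rank-based measure such as Spearman's rho is a useful cross-check (pandas supports it via `method='spearman'`). A minimal sketch on made-up data showing how a single extreme point can inflate Pearson's r while the rank-based measure is barely moved:

```python
import pandas as pd

# Toy data: five ordinary countries plus one huge outlier that dominates both variables
titles = pd.Series([10, 12, 8, 11, 9, 2000])
gdp    = pd.Series([5,  3,  6,  4,  5, 9000])

pearson  = titles.corr(gdp)                     # pulled toward 1 by the outlier
spearman = titles.corr(gdp, method="spearman")  # uses ranks, far less affected
print(f"Pearson r = {pearson:.3f}, Spearman rho = {spearman:.3f}")
```

If the two measures disagree sharply on the real data, the apparent correlation is probably being carried by a handful of countries rather than a broad trend.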