Datasets from: https://www.kaggle.com/shivamb/netflix-shows
https://population.un.org/wpp/Download/Standard/CSV/
https://data.worldbank.org/indicator/NY.GDP.MKTP.CD?view=map
https://datahub.io/JohnSnowLabs/population-figures-by-country
1: Where are most Netflix titles produced?
    a: When were titles from that country added to Netflix?
2: Network of the directors
We will first import our dataset using pandas and get it ready for processing. Let's first look at what kinds of data attributes we have.
# %env BOKEH_RESOURCES=inline
import pandas as pd
import geopandas as gpd
import numpy as np
import matplotlib.pyplot as plt
import json
from bokeh.io import output_notebook, show, output_file
from bokeh.plotting import figure
from bokeh.models import GeoJSONDataSource, ColorBar, LogColorMapper
from bokeh.palettes import brewer
# output_notebook()
# Displaying fewer rows, as the previous output size was pretty obnoxious
pd.set_option('display.max_rows', 6)
data = pd.read_csv(r'netflix_titles.csv')
display(data)
The first thing that I want to look at is which country each TV show and movie was produced in. Let's first get a cleaner dataset by grabbing only the country values.
Steps to take:
1. Obtain only the country column
2. Remove all rows where the country value is missing (NA)
# Only grab the country column
df = pd.DataFrame(data, columns= ['country'])
# Drop all values that are NA
df = df.dropna()
#display(df)
Now that we have a cleaner dataset, let's do some calculations on it. First, let's count every occurrence of each country to get an idea of how frequently a Netflix show or movie was made there.
# A title can list several countries, so split on commas before counting
frequencyOfCountriesDict = df['country'].str.split(', |,', expand=True).stack().value_counts().to_dict()
data = []
for item in frequencyOfCountriesDict:
    data.append([item, frequencyOfCountriesDict[item]])
# Note: the 'CountryCode' column holds country names at this point; we convert them to codes later
newdf = pd.DataFrame(data, columns=['CountryCode', 'frequency'])
#display(newdf)
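As a side note, the round-trip through a dictionary above is not strictly necessary: `value_counts()` can be turned straight into a two-column DataFrame. A minimal sketch with made-up stand-in data for the split-and-stacked country Series:

```python
import pandas as pd

# Toy stand-in for the stacked country Series produced above
countries = pd.Series(["United States", "India", "United States",
                       "United Kingdom", "India", "United States"])

# value_counts() -> DataFrame in one step, no intermediate dict needed
newdf = (countries.value_counts()
                  .rename_axis("CountryCode")
                  .reset_index(name="frequency"))
print(newdf)
```

This produces the same `['CountryCode', 'frequency']` layout, sorted by frequency in descending order.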
We now have the frequency of Netflix titles in each country. Let's put this into a chart so it is easier to visualize the magnitudes of the countries in relation to each other.
plt.figure(figsize=(20, 20))
plt.xticks(rotation='vertical')
plt.bar(newdf['CountryCode'], newdf['frequency'], width=.8, color='r')
Double-click on the image above to enlarge it.
This bar chart doesn't give a good representation of the data in relation to other countries, so let's try making a heat map of the Earth and see if we can draw some more conclusions from that.
We start by converting each country name into its respective three-letter country code and seeing which names map cleanly and which don't.
import pycountry

# Build a name -> alpha-3 code lookup from pycountry
countries = {}
for country in pycountry.countries:
    countries[country.name] = country.alpha_3

swapList = {}      # names we could resolve to a code
unknownCodes = {}  # names pycountry did not recognize
for item in newdf.iterrows():
    if countries.get(item[1]['CountryCode'], 'Unknown code') == 'Unknown code':
        unknownCodes[item[1]['CountryCode']] = countries.get(item[1]['CountryCode'], 'Unknown code')
    else:
        swapList[item[1]['CountryCode']] = countries.get(item[1]['CountryCode'], 'Unknown code')
# We see that the country name does not always map to a country code so we need to identify which ones we missed
# This is in unknownCodes and then we need to define them as seen in the next block
#display(swapList, unknownCodes)
Some of the countries did not have a code attached to their name because the naming did not match up, so let's set up a dictionary to explicitly map the correct countries to their respective codes.
unknownCodeDict = {
    'South Korea': 'KOR',
    'Taiwan': 'TWN',
    'Russia': 'RUS',
    'Czech Republic': 'CZE',
    'West Germany': 'DEU',
    'Vietnam': 'VNM',
    'Iran': 'IRN',
    'Soviet Union': 'RUS',
    'Venezuela': 'VEN',
    'Vatican City': 'VAT',
    'East Germany': 'DEU',
    'Syria': 'SYR'
}
for country in unknownCodes:
    if country in unknownCodeDict:
        swapList[country] = unknownCodeDict[country]
#display(swapList)
We now have a full dictionary of country codes.
We need to create a new dataframe with the country codes we acquired (rather than mutating the old one in place, which I was told is frowned upon).
dataFinal = []
for index, item in newdf.iterrows():
    if item['CountryCode'] in swapList:
        newCode = swapList[item['CountryCode']]
        dataFinal.append([newCode, item['frequency']])
dfFinal = pd.DataFrame(dataFinal, columns=['CountryCode', 'frequency'])
display(dfFinal)
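For reference, the row-by-row loop above can also be written with pandas' `Series.map`, which turns unmapped names into NaN so they are removed by `dropna()`. A minimal sketch with toy stand-ins for `newdf` and `swapList` (the figures are made up for illustration):

```python
import pandas as pd

# Toy versions of the frequency table and the name -> code mapping
newdf = pd.DataFrame({"CountryCode": ["United States", "Atlantis", "India"],
                      "frequency": [2032, 5, 777]})
swapList = {"United States": "USA", "India": "IND"}

# map() swaps names for codes; names missing from swapList become NaN and are dropped
dfFinal = newdf.assign(CountryCode=newdf["CountryCode"].map(swapList)).dropna()
print(dfFinal)
```

This keeps only countries with a known code, just like the explicit loop.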
Now we render our data onto the map. The helper below merges a frequency table with a world shapefile and draws a Bokeh choropleth.
def createFrequencyMap(data, nameOfColumnForCountryCode, frequencyColumnName, title="Default Title"):
    shapefile = './GeoShapeFile/ne_110m_admin_0_countries.shp'
    # Read shapefile using Geopandas
    gdf = gpd.read_file(shapefile)[['ADMIN', 'ADM0_A3', 'geometry']]
    gdf.columns = ['country', 'country_code', 'geometry']
    # Merge the country shapes with our frequency data on country code
    merged = gdf.merge(data, left_on='country_code', right_on=nameOfColumnForCountryCode)
    merged_json = json.loads(merged.to_json())
    json_data = json.dumps(merged_json)
    # Input GeoJSON source that contains features for plotting
    geosource = GeoJSONDataSource(geojson=json_data)
    # Define a sequential multi-hue color palette
    palette = brewer['Reds'][8]
    # Reverse color order so that dark red marks the highest values
    palette = palette[::-1]
    # Instantiate LogColorMapper that logarithmically maps numbers in a range into a sequence of colors
    color_mapper = LogColorMapper(palette=palette, low=min(data[frequencyColumnName]), high=max(data[frequencyColumnName]))
    # Create color bar
    color_bar = ColorBar(color_mapper=color_mapper, label_standoff=10, width=10, height=400,
                         border_line_color=None, location=(0, 0), orientation='vertical')
    # Create figure object
    p = figure(title=title, plot_height=500, plot_width=1000, toolbar_location=None)
    p.xgrid.grid_line_color = None
    p.ygrid.grid_line_color = None
    # Add patch renderer to figure
    p.patches('xs', 'ys', source=geosource, fill_color={'field': frequencyColumnName, 'transform': color_mapper},
              line_color='black', line_width=0.2, fill_alpha=1)
    # Specify figure layout
    p.add_layout(color_bar, 'right')
    # Display figure inline in Jupyter Notebook
    output_notebook()
    # Display figure
    show(p)
createFrequencyMap(dfFinal, 'CountryCode', 'frequency', 'Number of Netflix Titles Per Country')
Now, by looking at the map as well as the bar chart, we can identify a few things.
Let's first look at the second question and do a little research to see if we can figure anything out.
A quick search turned up https://www.techinasia.com/netflix-china-nope, which shows this is an already researched topic: China places heavy restrictions on foreign content, so Netflix takes on fewer Chinese titles because it would not be able to reach that user base anyway. The same effect may apply to other countries. If I were Netflix deciding whether to invest in titles produced in a given country, I would want to take into account both the population inside that nation and the global population of people from that nation. Let's try to formulate a way to compare the number of titles produced in a country against those figures.
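One way to turn that intuition into a number is a titles-per-million-residents figure. A minimal sketch, assuming the title counts and populations have already been merged on country code; the country codes and figures below are made up for illustration:

```python
import pandas as pd

# Hypothetical merged table: title counts plus population, toy numbers only
merged = pd.DataFrame({
    "CountryCode": ["USA", "IND", "ISL"],
    "frequency":   [2000, 800, 4],
    "population":  [330_000_000, 1_380_000_000, 360_000],
})

# Titles per million residents highlights small countries that punch above their weight
merged["titles_per_million"] = merged["frequency"] / (merged["population"] / 1_000_000)
print(merged.sort_values("titles_per_million", ascending=False))
```

On these toy numbers the small country comes out on top, which is exactly the kind of signal raw counts would hide.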
Let's take a look at another heat map, this time of every country's GDP, and see if it lines up with the one above.
countryGDPData = pd.read_csv(r'country_gdp.csv')
countryGDPDataDf = pd.DataFrame(countryGDPData, columns=['Country Code', '2018'])
countryGDPDataDf = countryGDPDataDf.dropna()
#display(countryGDPDataDf)
createFrequencyMap(countryGDPDataDf, 'Country Code', '2018', "Map of every country's GDP in 2018")
Now let's take a look at population.
# Load data
populationData = pd.read_csv(r'world_pop.csv')
# Process data: we only need the country code and the 2016 population column
populationFrequencyDf = pd.DataFrame(populationData, columns=['Country_Code', 'Year_2016'])
# display(populationData)
createFrequencyMap(populationFrequencyDf, 'Country_Code', 'Year_2016', "Map of every country's population in 2016")
We now have charts from three datasets, and we can loosely see the correlation. But rather than eyeballing the charts, let's get concrete numbers to check whether they really are correlated.
# Note: calling dfFinal.corrwith(...) on the raw frames would pair rows by index position,
# not by country, so we first merge each pair of tables on country code and then correlate
s1 = pd.merge(dfFinal, countryGDPDataDf, how='outer', left_on=['CountryCode'], right_on=['Country Code'])
s1 = s1.dropna()
s2 = pd.merge(dfFinal, populationFrequencyDf, how='outer', left_on=['CountryCode'], right_on=['Country_Code'])
s2 = s2.dropna()
print("Correlation between the number of titles in a country and that country's GDP in 2018")
display(s1['frequency'].corr(s1['2018']))
print("Correlation between the number of titles in a country and that country's population")
display(s2['frequency'].corr(s2['Year_2016']))
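One caveat when reading these numbers: Pearson's r is heavily influenced by the few very large countries, so a rank-based measure such as Spearman's rho is a useful cross-check (pandas supports it via `method='spearman'`). A minimal sketch on made-up data showing how a single extreme point can inflate Pearson's r while the rank-based measure is barely moved:

```python
import pandas as pd

# Toy data: five ordinary countries plus one huge outlier that dominates both variables
titles = pd.Series([10, 12, 8, 11, 9, 2000])
gdp    = pd.Series([5,  3,  6,  4,  5, 9000])

pearson  = titles.corr(gdp)                     # pulled toward 1 by the outlier
spearman = titles.corr(gdp, method="spearman")  # uses ranks, far less affected
print(f"Pearson r = {pearson:.3f}, Spearman rho = {spearman:.3f}")
```

If the two measures disagree sharply on the real data, the apparent correlation is probably being carried by a handful of countries rather than a broad trend.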