How to Make Jupyter Notebooks Extensible and Reusable?
Anyone who uses Jupyter notebooks for data analysis or machine learning workloads knows the pain of the copy-paste cycle when reusing a notebook.
Let's start with a simple example:
The following code analyses how frequently people communicated with their family before and after COVID-19. The dataset is loaded from data.world.
import pandas as pd

# Data for the month of April 2020
df = pd.read_csv('https://query.data.world/s/ukmb3b5jkp5okj6m4oyxf5pukjsa5g')
df.SOC3B.value_counts().sort_values().plot(kind='barh', title="Before COVID - How often did you talk with any of your Family?")
df.SOC3A.value_counts().sort_values().plot(kind='barh', title="How often did you talk with any of your Family?")
Wow, people really have communicated more often with their families since COVID-19.
That's great news, isn't it?
Now, if we want to see these results for May, June, and onwards, what should we do?
Option 1: Add code for the new dataset every month
import pandas as pd

# Data for the month of April 2020
df = pd.read_csv('https://query.data.world/s/ukmb3b5jkp5okj6m4oyxf5pukjsa5g')
df.SOC3B.value_counts().sort_values().plot(kind='barh', title="Before COVID - How often did you talk with any of your Family?")
df.SOC3A.value_counts().sort_values().plot(kind='barh', title="How often did you talk with any of your Family?")

#-------------------------------------------------------------#

# Data for the month of May 2020
df = pd.read_csv('https://query.data.world/s/g6gsty3xrfaxthwefimuzbi2xi4hrv')
df.SOC3B.value_counts().sort_values().plot(kind='barh', title="Before COVID - How often did you talk with any of your Family?")
df.SOC3A.value_counts().sort_values().plot(kind='barh', title="How often did you talk with any of your Family?")
Option 2: Copy the code and create a new notebook for every month
Neither option scales as the code base grows, and both make it difficult to keep track of the changes made.
What if you had to create automated jobs to run these notebooks monthly?
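Option 3 (The Netflix Way): Parameterize the notebooks and reuse the template
Fortunately, there is an easy and convenient way to parameterize and reuse notebooks: the papermill library. Let's jump into an example.
Step 1: Create a Template Notebook
Add a “parameters” tag to the notebook cell that defines the default parameter values (here, data_url). When the notebook is executed, papermill inserts a new cell right after the tagged one that overrides those defaults with the values you pass in.
Now your notebook code should look like this: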
import pandas as pd

# data_url is defined in the cell tagged "parameters"
df = pd.read_csv(data_url)
df.SOC3B.value_counts().sort_values().plot(kind='barh', title="Before COVID - How often did you talk with any of your Family?")
df.SOC3A.value_counts().sort_values().plot(kind='barh', title="How often did you talk with any of your Family?")
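If you are curious where the tag actually ends up, it is stored in the cell's metadata inside the .ipynb file, which is plain JSON. The snippet below is only a rough sketch of that structure, built with the standard library and assuming the nbformat 4 layout. The helper names (code_cell, make_template) and the choice of the April URL as the default value are illustrative assumptions, not part of papermill.

```python
import json

# Illustrative default: the April 2020 URL used earlier in this post.
DEFAULT_DATA_URL = "https://query.data.world/s/ukmb3b5jkp5okj6m4oyxf5pukjsa5g"


def code_cell(source, tags=None):
    """Return a minimal nbformat-4 code cell; tags are stored in the cell metadata."""
    metadata = {"tags": list(tags)} if tags else {}
    return {
        "cell_type": "code",
        "execution_count": None,
        "metadata": metadata,
        "outputs": [],
        "source": source,
    }


def make_template():
    """Build a tiny template notebook as a plain dict (hypothetical helper)."""
    return {
        "nbformat": 4,
        "nbformat_minor": 4,
        "metadata": {},
        "cells": [
            # The cell tagged "parameters" holds the defaults that get overridden.
            code_cell(f'data_url = "{DEFAULT_DATA_URL}"', tags=["parameters"]),
            code_cell("import pandas as pd\ndf = pd.read_csv(data_url)"),
        ],
    }


template = make_template()
# JSON text that could be saved as an .ipynb file.
notebook_json = json.dumps(template, indent=1)
```

In practice you would normally add the tag through the Jupyter UI rather than by writing JSON by hand; the sketch is only meant to show what the tag looks like on disk.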
Step 2: Create a Driver Notebook
- Create all the variables and add them to a dictionary
- Run “papermill.execute_notebook”, passing the dictionary as the parameters
import papermill as pm

# Run the template with the April 2020 data
april_data_url = "https://query.data.world/s/ukmb3b5jkp5okj6m4oyxf5pukjsa5g"
parameters = dict(
    data_url=april_data_url
)
exe = pm.execute_notebook(
    'template/Analysis.ipynb',
    'generated/Analysis_April.ipynb',
    parameters=parameters,
    log_output=False
)

#-------------------------------------------------------------#

# Run the same template with the May 2020 data
may_data_url = "https://query.data.world/s/g6gsty3xrfaxthwefimuzbi2xi4hrv"
parameters = dict(
    data_url=may_data_url
)
exe = pm.execute_notebook(
    'template/Analysis.ipynb',
    'generated/Analysis_May.ipynb',
    parameters=parameters,
    log_output=False
)
The corresponding notebooks are created in the “generated” folder.
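Since the two calls in the driver differ only in the month name and the URL, one possible refinement is to drive them from a dictionary. This is a sketch rather than the only way to do it: MONTH_URLS, build_runs, and run_all are hypothetical names introduced here, while the paths and URLs are the ones used above.

```python
# Month label -> dataset URL (both URLs are the ones used earlier in this post).
MONTH_URLS = {
    "April": "https://query.data.world/s/ukmb3b5jkp5okj6m4oyxf5pukjsa5g",
    "May": "https://query.data.world/s/g6gsty3xrfaxthwefimuzbi2xi4hrv",
}

TEMPLATE = "template/Analysis.ipynb"


def build_runs(month_urls, template=TEMPLATE, out_dir="generated"):
    """Return one (input_path, output_path, parameters) tuple per month."""
    return [
        (template, f"{out_dir}/Analysis_{month}.ipynb", {"data_url": url})
        for month, url in month_urls.items()
    ]


def run_all(runs):
    """Execute every planned run with papermill."""
    # Imported here so the planning step above works even without papermill installed.
    import papermill as pm

    for input_path, output_path, parameters in runs:
        pm.execute_notebook(
            input_path,
            output_path,
            parameters=parameters,
            log_output=False,
        )


runs = build_runs(MONTH_URLS)
for _, output_path, parameters in runs:
    # Show what would be generated, without executing anything yet.
    print(output_path, parameters["data_url"])
```

Calling run_all(runs) in the driver notebook would then execute the template once per month. Adding June becomes a one-line change to the dictionary. It is probably safest to create the “generated” folder before running.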
papermill can also be run from the command line or from other Python programs, which gives you the flexibility to build automated, reusable, and extensible notebook workflows.
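As a rough illustration of the command-line route, the papermill CLI takes the input notebook, the output notebook, and one “-p name value” pair per parameter. The sketch below only builds and prints that command from Python; papermill_command and run_from_cli are hypothetical helper names, and the paths and URL are the ones from the driver notebook.

```python
import shlex
import subprocess


def papermill_command(template, output, parameters):
    """Build the argument list: papermill INPUT OUTPUT -p name value ..."""
    argv = ["papermill", template, output]
    for name, value in parameters.items():
        argv += ["-p", name, str(value)]
    return argv


def run_from_cli(template, output, parameters):
    """Run papermill as a subprocess; raises CalledProcessError if it fails."""
    subprocess.run(papermill_command(template, output, parameters), check=True)


cmd = papermill_command(
    "template/Analysis.ipynb",
    "generated/Analysis_May.ipynb",
    {"data_url": "https://query.data.world/s/g6gsty3xrfaxthwefimuzbi2xi4hrv"},
)
# Print the equivalent shell command, e.g. for pasting into a scheduled job.
print(shlex.join(cmd))
```

A scheduler such as cron could run the printed command directly each month, or a Python job could call run_from_cli, assuming papermill is installed in that environment.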
Check out the papermill documentation for more details.