How to make Jupyter Notebooks Extensible and Reusable?

Neha Jirafe
Jul 10, 2020

Anyone who uses Jupyter notebooks for data analysis or machine learning workloads knows the pain of the “copy-paste” cycle when reusing a notebook.

Let's start with a simple example:

The following code analyses how frequently people communicated with their family before and after COVID-19. The data is hosted on data.world, at the URL shown in the code.

import pandas as pd

# Data for the month of April 2020
df = pd.read_csv('https://query.data.world/s/ukmb3b5jkp5okj6m4oyxf5pukjsa5g')
df.SOC3B.value_counts().sort_values().plot(kind='barh', title="Before COVID - How often did you talk with any of your Family?")
df.SOC3A.value_counts().sort_values().plot(kind='barh', title="How often did you talk with any of your Family?")

Wow, people really have communicated more often with their families since COVID-19.

That's great news, isn't it?

Now, what should we do if we want to see these results for May, June and onwards?

Option 1: Add code for the new dataset every month

import pandas as pd

# Data for the month of April 2020
df = pd.read_csv('https://query.data.world/s/ukmb3b5jkp5okj6m4oyxf5pukjsa5g')
df.SOC3B.value_counts().sort_values().plot(kind='barh', title="Before COVID - How often did you talk with any of your Family?")
df.SOC3A.value_counts().sort_values().plot(kind='barh', title="How often did you talk with any of your Family?")
#-------------------------------------------------------------#
# Data for the month of May 2020
df = pd.read_csv('https://query.data.world/s/g6gsty3xrfaxthwefimuzbi2xi4hrv')
df.SOC3B.value_counts().sort_values().plot(kind='barh', title="Before COVID - How often did you talk with any of your Family?")
df.SOC3A.value_counts().sort_values().plot(kind='barh', title="How often did you talk with any of your Family?")

Option 2: Copy the code and create a new notebook for every month

Neither option scales as the code base grows, and both make it difficult to keep track of the changes made.

What if you had to create automated jobs to run these notebooks monthly?

Option 3: The Netflix Way - Parameterize the notebooks and reuse the template

Fortunately, there is a very easy and convenient way of parameterizing and reusing notebooks: the papermill library. Let's jump into examples.

Step 1: Create a Template Notebook

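Add a “parameters” tag to the notebook cell that holds the default parameter values. When papermill runs the notebook, it injects the values you pass in a new cell right after the tagged one.

Now your notebook code should look like this: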

import pandas as pd

df = pd.read_csv(data_url)
df.SOC3B.value_counts().sort_values().plot(kind='barh', title="Before COVID - How often did you talk with any of your Family?")
df.SOC3A.value_counts().sort_values().plot(kind='barh', title="How often did you talk with any of your Family?")
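
For reference, the tag itself lives in the cell's metadata inside the .ipynb file, which is plain JSON. The sketch below builds a minimal template using only the standard library. It is an illustration, not the original notebook: the empty default for data_url, the python3 kernelspec, and the helper names build_template and write_template are all assumptions. In practice you would usually add the tag through the Jupyter UI.

```python
import json
from pathlib import Path


def build_template(default_url: str = "") -> dict:
    """Return a minimal nbformat-4 notebook dict with a tagged parameters cell."""

    def code_cell(source: str, tags=None) -> dict:
        return {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {"tags": list(tags or [])},
            "outputs": [],
            "source": source,
        }

    return {
        "nbformat": 4,
        "nbformat_minor": 4,
        # Assumption: a kernel named "python3" is installed on the machine.
        "metadata": {
            "kernelspec": {
                "name": "python3",
                "display_name": "Python 3",
                "language": "python",
            }
        },
        "cells": [
            # Defaults live in the tagged cell; papermill adds its overrides
            # in a new cell placed right after this one.
            code_cell(f"data_url = {default_url!r}", tags=["parameters"]),
            code_cell("import pandas as pd\ndf = pd.read_csv(data_url)"),
        ],
    }


def write_template(path: str, default_url: str = "") -> Path:
    """Write the template to disk, creating the parent folder if needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(build_template(default_url), indent=1))
    return target
```

Calling write_template('template/Analysis.ipynb') would produce a file whose first cell carries the “parameters” tag, which is what papermill looks for.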

Step 2: Create a Driver Notebook

  • Create all the variables and add them to a dictionary
  • Run “papermill.execute_notebook”, passing the parameters
import papermill as pm

april_data_url = "https://query.data.world/s/ukmb3b5jkp5okj6m4oyxf5pukjsa5g"
parameters = dict(
    data_url=april_data_url
)
exe = pm.execute_notebook(
    'template/Analysis.ipynb',
    'generated/Analysis_April.ipynb',
    parameters=parameters,
    log_output=False
)
#-------------------------------------------------------------#
may_data_url = "https://query.data.world/s/g6gsty3xrfaxthwefimuzbi2xi4hrv"
parameters = dict(
    data_url=may_data_url
)
exe = pm.execute_notebook(
    'template/Analysis.ipynb',
    'generated/Analysis_May.ipynb',
    parameters=parameters,
    log_output=False
)
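
Since the April and May blocks differ only in the URL and the output name, the driver could also loop over the months. The following is a minimal sketch of that idea rather than the author's code: MONTHLY_URLS, build_jobs and run_jobs are hypothetical names, and creating the output folder up front is a precaution in case it does not exist yet.

```python
from pathlib import Path

# URLs taken from the driver code above, keyed by month name.
MONTHLY_URLS = {
    "April": "https://query.data.world/s/ukmb3b5jkp5okj6m4oyxf5pukjsa5g",
    "May": "https://query.data.world/s/g6gsty3xrfaxthwefimuzbi2xi4hrv",
}


def build_jobs(monthly_urls, template="template/Analysis.ipynb", out_dir="generated"):
    """Return one (input_path, output_path, parameters) tuple per month."""
    return [
        (template, f"{out_dir}/Analysis_{month}.ipynb", {"data_url": url})
        for month, url in monthly_urls.items()
    ]


def run_jobs(jobs):
    """Execute every job with papermill (imported lazily, only when running)."""
    import papermill as pm

    for input_path, output_path, parameters in jobs:
        # Make sure the output folder exists before papermill writes to it.
        Path(output_path).parent.mkdir(parents=True, exist_ok=True)
        pm.execute_notebook(
            input_path,
            output_path,
            parameters=parameters,
            log_output=False,
        )


jobs = build_jobs(MONTHLY_URLS)
# run_jobs(jobs)  # uncomment to execute; needs papermill and network access
```

Adding June would then mean adding one entry to the dictionary instead of copying another block.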

The corresponding notebooks are created in the “generated” folder.

papermill can easily be executed from the command line or from other Python programs, which gives you the flexibility to run automated, reusable and extensible notebooks.
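
As a rough sketch of the command-line route, the snippet below builds the papermill invocation for one month and can run it as a subprocess. The papermill CLI takes the input notebook, the output notebook, and one “-p name value” pair per parameter; the helper names papermill_command and run_from_cli are made up for this example.

```python
import subprocess


def papermill_command(template, output, parameters):
    """Build the argv list for `papermill TEMPLATE OUTPUT -p name value ...`."""
    argv = ["papermill", template, output]
    for name, value in parameters.items():
        argv += ["-p", name, str(value)]
    return argv


def run_from_cli(template, output, parameters):
    """Run papermill as a subprocess and raise if it exits with an error."""
    subprocess.run(papermill_command(template, output, parameters), check=True)


cmd = papermill_command(
    "template/Analysis.ipynb",
    "generated/Analysis_April.ipynb",
    {"data_url": "https://query.data.world/s/ukmb3b5jkp5okj6m4oyxf5pukjsa5g"},
)
# run_from_cli(...) with the same arguments would execute it; needs papermill installed
```

A scheduler such as cron could call the equivalent shell command directly, which is one way to set up the monthly automated jobs mentioned earlier.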

Check out more details here
