The Jupyter notebook project is designed to be a 'language-agnostic' web application front-end for any one of many possible language kernels. We have mostly been using Python, but there are in fact several dozen other language kernels that can be made to work with it, including Julia, R, MATLAB, C, Go, Fortran, and Stata.
The ecosystem of Python libraries and packages for scientific computing is huge and constantly growing, but there are still many statistics and econometrics applications, available as built-in commands or user-written modules in Stata, that have not yet been ported to Python or are simply easier to use in Stata. On the other hand, Python libraries such as pandas, and visualization libraries such as seaborn and matplotlib, offer features that are not available in Stata.
Fortunately, you don't have to choose between Stata and Python: you can use them together and get the best of both worlds.
R is a powerful open-source software environment for statistical computing. R Markdown lets you create notebooks similar in concept to Jupyter notebooks, but you can also run R inside a Jupyter notebook (indeed, the name 'Jupyter' comes from Julia, Python, and R).
See my notebook with notes on Regression Discontinuity Design for an example of a Jupyter notebook running R. To install an R kernel, see the IRkernel project.
Kyle Barron has created stata_kernel, a Stata kernel for Jupyter that offers several useful features, including code autocompletion, inline graphics, and generally fast response times.
For this to work you must have a working, licensed copy of Stata (version 14 or greater) on your machine.
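If you want to try stata_kernel, installation is typically a two-step process. This is only a sketch, assuming pip and the stata_kernel package from PyPI; check the project's documentation for platform-specific details:

```shell
# Install the kernel package from PyPI
pip install stata_kernel

# Register the kernel with Jupyter so it shows up in the kernel list
python -m stata_kernel.install
```

On some setups you may also need to point stata_kernel at your Stata executable in its configuration file; again, see the project's documentation.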
Sometimes it may be useful to combine Python and Stata in the same notebook. Ties de Kok has written a nice Python library called ipystata that allows one to execute Stata code in cells of an IPython notebook when they are preceded by the %%stata magic command.
This workflow allows you to pass data between the Python and Stata sessions and to display Stata plots inline. Compared to the stata_kernel option, the response times are not quite as fast.
The remainder of this notebook illustrates the use of ipystata.
For more details see the example notebook and documentation on the ipystata repository.
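ipystata itself can be installed from PyPI. A minimal sketch, assuming pip is available; see the repository's README for details:

```shell
# Install the ipystata library
pip install ipystata
```

On some platforms you may also need to tell ipystata where your Stata executable lives; the repository's documentation covers this configuration step.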
In [1]:
%matplotlib inline
import seaborn as sns
import pandas as pd
import statsmodels.formula.api as smf
import ipystata
The following opens a Stata session where we load a dataset and summarize the data. The -o flag following the `%%stata` magic instructs it to output (return) the dataset in Stata memory as a pandas DataFrame in Python.
In [2]:
%%stata -o life_df
sysuse lifeexp.dta
summarize
Let's confirm the data was returned as a pandas DataFrame:
In [3]:
life_df.head(3)
Out[3]:
A simple generate-variable command and OLS regression in Stata:
In [4]:
%%stata -o life_df
gen lngnppc = ln(gnppc)
regress lexp lngnppc
And the same regression using statsmodels and pandas:
In [5]:
model = smf.ols(formula='lexp ~ lngnppc', data=life_df)
results = model.fit()
print(results.summary())
In [6]:
life_df['popgrowth'] = life_df['popgrowth'] * 100
In [7]:
life_df.popgrowth.mean()
Out[7]:
And now let's push the modified DataFrame back into the Stata dataset with the -d flag:
In [8]:
%%stata -d life_df
summarize
A Stata plot:
In [9]:
%%stata -d life_df --graph
graph twoway (scatter lexp lngnppc) (lfit lexp lngnppc)
Now, on the Python side, use lmplot from the seaborn library to graph a similar scatter and fitted line, but separately by region.
In [10]:
sns.set_style("whitegrid")
g = sns.lmplot(y='lexp', x='lngnppc', col='region', data=life_df, col_wrap=2)