Introduction to Data Analysis and Plotting in Python

This tutorial is...

introductory and geared towards people new to Python, but familiar with STATA, MATLAB, R, etc.
interactive with simple exercises along the way

We'll be introducing Pandas for data handling and analysis, and using Matplotlib, Seaborn and Bokeh for visualization

Please execute sequentially

What is pandas?

Pandas provides objects for working with data - Series and DataFrame

Data structures have methods for manipulating data eg. indexing, sorting, grouping, filling in missing data

Pandas does not provide modeling tools eg. regression, prediction

These tools are found in packages such as scikit-learn and statsmodels, which are built on top of pandas

DataFrames

A DataFrame combines multiple 'columns' of data into a two-dimensional object, similar to a spreadsheet



In [1]:

    
from IPython.display import Image

Image('./dataframe.jpg')









    Out[1]:

We will create a DataFrame by reading in a CSV file and assigning it to the variable name data

Info on the data set

The data come from 'Assessing the Rate of Replication in Economics', American Economics Review: Papers & Proceedings, 2017
The dataframe data contains 70 empirical papers from AER's 100th volume
Web of Science: number of Web of Science citations
Top 200: number of Top 200 economics journal citations
Google Scholar: number of Google Scholar citations
Replications: number of citations that are replications of the paper
Extensions: number of citations that are extensions of the paper
Both: number of citations that are either replications or extensions of the paper



In [2]:

    
import pandas as pd  # Import the package

data = pd.read_csv('data.csv')
data.head()









    Out[2]:







  
    
      
      ID
      Title
      Web of Science
      Top 200
      Google Scholar
      Field
      Replications
      Extensions
      Both
    
  
  
    
      0
      2
      Learning about a New Technology: Pineapple in ...
      176
      108.0
      1246
      Development
      1.0
      10.0
      11.0
    
    
      1
      3
      Multiple-Product Firms and Product Switching
      111
      74.0
      764
      Labor/IO
      1.0
      2.0
      3.0
    
    
      2
      4
      Momma's Got the Pill": How Anthony Comstock an...
      23
      15.0
      98
      Labor/IO
      2.0
      0.0
      2.0
    
    
      3
      5
      Matching and Sorting in Online Dating
      75
      23.0
      326
      Labor/IO
      0.0
      5.0
      5.0
    
    
      4
      6
      Entry, Exit, and Investment-Specific Technical...
      9
      9.0
      38
      Macro/Int/Trade
      0.0
      0.0
      0.0



In [3]:

    
data.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70 entries, 0 to 69
Data columns (total 9 columns):
ID                70 non-null int64
Title             70 non-null object
Web of Science    70 non-null int64
Top 200           70 non-null float64
Google Scholar    70 non-null int64
Field             70 non-null object
Replications      70 non-null float64
Extensions        70 non-null float64
Both              70 non-null float64
dtypes: float64(4), int64(3), object(2)
memory usage: 5.0+ KB

We can access individual columns of data, returning a Series



In [4]:

    
data['Top 200'].head()









    Out[4]:





0    108.0
1     74.0
2     15.0
3     23.0
4      9.0
Name: Top 200, dtype: float64

We can also select multiple columns, returning a new dataframe



In [5]:

    
data[['Title', 'Top 200']].head()









    Out[5]:







  
    
      
      Title
      Top 200
    
  
  
    
      0
      Learning about a New Technology: Pineapple in ...
      108.0
    
    
      1
      Multiple-Product Firms and Product Switching
      74.0
    
    
      2
      Momma's Got the Pill": How Anthony Comstock an...
      15.0
    
    
      3
      Matching and Sorting in Online Dating
      23.0
    
    
      4
      Entry, Exit, and Investment-Specific Technical...
      9.0

We can add a new column to our dataframe like so



In [6]:

    
data['Total'] = data['Web of Science'] + data['Top 200'] + data['Google Scholar']
data.head()









    Out[6]:







  
    
      
      ID
      Title
      Web of Science
      Top 200
      Google Scholar
      Field
      Replications
      Extensions
      Both
      Total
    
  
  
    
      0
      2
      Learning about a New Technology: Pineapple in ...
      176
      108.0
      1246
      Development
      1.0
      10.0
      11.0
      1530.0
    
    
      1
      3
      Multiple-Product Firms and Product Switching
      111
      74.0
      764
      Labor/IO
      1.0
      2.0
      3.0
      949.0
    
    
      2
      4
      Momma's Got the Pill": How Anthony Comstock an...
      23
      15.0
      98
      Labor/IO
      2.0
      0.0
      2.0
      136.0
    
    
      3
      5
      Matching and Sorting in Online Dating
      75
      23.0
      326
      Labor/IO
      0.0
      5.0
      5.0
      424.0
    
    
      4
      6
      Entry, Exit, and Investment-Specific Technical...
      9
      9.0
      38
      Macro/Int/Trade
      0.0
      0.0
      0.0
      56.0

If we want to know the average number of Web of Science citations...



In [7]:

    
data.mean()









    Out[7]:





ID                 52.485714
Web of Science     42.071429
Top 200            22.257143
Google Scholar    227.571429
Replications        0.742857
Extensions          1.728571
Both                2.471429
Total             291.900000
dtype: float64

.describe() returns useful summary statistics



In [8]:

    
data.describe()









    Out[8]:







  
    
      
      ID
      Web of Science
      Top 200
      Google Scholar
      Replications
      Extensions
      Both
      Total
    
  
  
    
      count
      70.000000
      70.000000
      70.000000
      70.000000
      70.000000
      70.000000
      70.000000
      70.000000
    
    
      mean
      52.485714
      42.071429
      22.257143
      227.571429
      0.742857
      1.728571
      2.471429
      291.900000
    
    
      std
      31.465068
      43.070660
      23.182131
      257.258375
      1.741944
      2.868729
      3.794433
      319.403935
    
    
      min
      2.000000
      1.000000
      0.000000
      7.000000
      0.000000
      0.000000
      0.000000
      8.000000
    
    
      25%
      23.250000
      17.250000
      9.000000
      73.250000
      0.000000
      0.000000
      0.000000
      101.000000
    
    
      50%
      48.500000
      28.500000
      15.000000
      139.000000
      0.000000
      0.000000
      1.000000
      178.000000
    
    
      75%
      81.750000
      51.000000
      23.750000
      267.250000
      1.000000
      2.750000
      3.750000
      330.750000
    
    
      max
      104.000000
      195.000000
      108.000000
      1246.000000
      9.000000
      16.000000
      18.000000
      1530.000000

Exercises

Find the means of Top 200 and Google Scholar citations



In [ ]:

data[['Top 200', 'Google Scholar']].mean()

Top 200            22.257143
Google Scholar    227.571429
dtype: float64

Create a table of summary statistics of Top 200 citations and Google Scholar citations



In [ ]:

data[['Top 200', 'Google Scholar']].describe()

	Top 200	Google Scholar
count	70.000000	70.000000
mean	22.257143	227.571429
std	23.182131	257.258375
min	0.000000	7.000000
25%	9.000000	73.250000
50%	15.000000	139.000000
75%	23.750000	267.250000
max	108.000000	1246.000000

Create a table of summary statistics of the paper's fields (according to the variable `Field`)



In [ ]:

data['Field'].describe()

count                  70
unique                  5
top       Macro/Int/Trade
freq                   25
Name: Field, dtype: object

Selecting and filtering

We can use integer slicing to select rows as follows



In [9]:

    
data[:5]









    Out[9]:







  
    
      
      ID
      Title
      Web of Science
      Top 200
      Google Scholar
      Field
      Replications
      Extensions
      Both
      Total
    
  
  
    
      0
      2
      Learning about a New Technology: Pineapple in ...
      176
      108.0
      1246
      Development
      1.0
      10.0
      11.0
      1530.0
    
    
      1
      3
      Multiple-Product Firms and Product Switching
      111
      74.0
      764
      Labor/IO
      1.0
      2.0
      3.0
      949.0
    
    
      2
      4
      Momma's Got the Pill": How Anthony Comstock an...
      23
      15.0
      98
      Labor/IO
      2.0
      0.0
      2.0
      136.0
    
    
      3
      5
      Matching and Sorting in Online Dating
      75
      23.0
      326
      Labor/IO
      0.0
      5.0
      5.0
      424.0
    
    
      4
      6
      Entry, Exit, and Investment-Specific Technical...
      9
      9.0
      38
      Macro/Int/Trade
      0.0
      0.0
      0.0
      56.0

We might want to find the top cited papers

First we will sort our values by a column in the dataframe



In [10]:

    
pd.set_option('max_colwidth', 100)  # adjust column width

data[['Title', 'Google Scholar']].sort_values(by='Google Scholar', ascending=False)[:5]









    Out[10]:







  
    
      
      Title
      Google Scholar
    
  
  
    
      0
      Learning about a New Technology: Pineapple in Ghana
      1246
    
    
      21
      The Macroeconomic Effects of Tax Changes: Estimates Based on a New Measure of Fiscal Shocks
      1245
    
    
      33
      What Causes Industry Agglomeration? Evidence from Coagglomeration Patterns
      766
    
    
      1
      Multiple-Product Firms and Product Switching
      764
    
    
      15
      Social Preferences, Beliefs, and the Dynamics of Free Riding in Public Goods Experiments
      736

Another way to select rows is to use row labels, ie. set a row index

Similar to the column labels, we can add row labels (the index)



In [11]:

    
data.set_index('Title').head()









    Out[11]:







  
    
      
      ID
      Web of Science
      Top 200
      Google Scholar
      Field
      Replications
      Extensions
      Both
      Total
    
    
      Title
      
      
      
      
      
      
      
      
      
    
  
  
    
      Learning about a New Technology: Pineapple in Ghana
      2
      176
      108.0
      1246
      Development
      1.0
      10.0
      11.0
      1530.0
    
    
      Multiple-Product Firms and Product Switching
      3
      111
      74.0
      764
      Labor/IO
      1.0
      2.0
      3.0
      949.0
    
    
      Momma's Got the Pill": How Anthony Comstock and Griswold v. Connecticut Shaped US Childbearing
      4
      23
      15.0
      98
      Labor/IO
      2.0
      0.0
      2.0
      136.0
    
    
      Matching and Sorting in Online Dating
      5
      75
      23.0
      326
      Labor/IO
      0.0
      5.0
      5.0
      424.0
    
    
      Entry, Exit, and Investment-Specific Technical Change
      6
      9
      9.0
      38
      Macro/Int/Trade
      0.0
      0.0
      0.0
      56.0

Note: we haven't actually changed data



In [12]:

    
data.head()









    Out[12]:







  
    
      
      ID
      Title
      Web of Science
      Top 200
      Google Scholar
      Field
      Replications
      Extensions
      Both
      Total
    
  
  
    
      0
      2
      Learning about a New Technology: Pineapple in Ghana
      176
      108.0
      1246
      Development
      1.0
      10.0
      11.0
      1530.0
    
    
      1
      3
      Multiple-Product Firms and Product Switching
      111
      74.0
      764
      Labor/IO
      1.0
      2.0
      3.0
      949.0
    
    
      2
      4
      Momma's Got the Pill": How Anthony Comstock and Griswold v. Connecticut Shaped US Childbearing
      23
      15.0
      98
      Labor/IO
      2.0
      0.0
      2.0
      136.0
    
    
      3
      5
      Matching and Sorting in Online Dating
      75
      23.0
      326
      Labor/IO
      0.0
      5.0
      5.0
      424.0
    
    
      4
      6
      Entry, Exit, and Investment-Specific Technical Change
      9
      9.0
      38
      Macro/Int/Trade
      0.0
      0.0
      0.0
      56.0

We need to reassign the variable name data to the new copy



In [13]:

    
# data.reset_index(inplace=True)   # Uncomment this if you receive an error

data = data.set_index('Title')   # Can also use data.set_index('Title', inplace=True)
data.head()









    Out[13]:







  
    
      
      ID
      Web of Science
      Top 200
      Google Scholar
      Field
      Replications
      Extensions
      Both
      Total
    
    
      Title
      
      
      
      
      
      
      
      
      
    
  
  
    
      Learning about a New Technology: Pineapple in Ghana
      2
      176
      108.0
      1246
      Development
      1.0
      10.0
      11.0
      1530.0
    
    
      Multiple-Product Firms and Product Switching
      3
      111
      74.0
      764
      Labor/IO
      1.0
      2.0
      3.0
      949.0
    
    
      Momma's Got the Pill": How Anthony Comstock and Griswold v. Connecticut Shaped US Childbearing
      4
      23
      15.0
      98
      Labor/IO
      2.0
      0.0
      2.0
      136.0
    
    
      Matching and Sorting in Online Dating
      5
      75
      23.0
      326
      Labor/IO
      0.0
      5.0
      5.0
      424.0
    
    
      Entry, Exit, and Investment-Specific Technical Change
      6
      9
      9.0
      38
      Macro/Int/Trade
      0.0
      0.0
      0.0
      56.0



In [14]:

    
data.loc['Matching and Sorting in Online Dating']









    Out[14]:





ID                       5
Web of Science          75
Top 200                 23
Google Scholar         326
Field             Labor/IO
Replications             0
Extensions               5
Both                     5
Total                  424
Name: Matching and Sorting in Online Dating, dtype: object



In [15]:

    
data.loc[['Matching and Sorting in Online Dating', 
          'Learning about a New Technology: Pineapple in Ghana']]









    Out[15]:







  
    
      
      ID
      Web of Science
      Top 200
      Google Scholar
      Field
      Replications
      Extensions
      Both
      Total
    
    
      Title
      
      
      
      
      
      
      
      
      
    
  
  
    
      Matching and Sorting in Online Dating
      5
      75
      23.0
      326
      Labor/IO
      0.0
      5.0
      5.0
      424.0
    
    
      Learning about a New Technology: Pineapple in Ghana
      2
      176
      108.0
      1246
      Development
      1.0
      10.0
      11.0
      1530.0

Alternatively, we can filter our dataframe (select rows) using boolean conditions



In [16]:

    
data.reset_index(inplace=True)

data['Field'] == 'Development'









    Out[16]:





0      True
1     False
2     False
3     False
4     False
5     False
6     False
7      True
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
      ...  
40    False
41    False
42     True
43    False
44    False
45     True
46    False
47    False
48    False
49    False
50    False
51    False
52    False
53    False
54    False
55     True
56    False
57    False
58    False
59    False
60    False
61    False
62    False
63    False
64     True
65    False
66    False
67    False
68    False
69    False
Name: Field, Length: 70, dtype: bool

Selecting rows with this boolean condition will return only rows of the dataframe where Field == 'Development' is True



In [17]:

    
data[data['Field'] == 'Development']









    Out[17]:







  
    
      
      Title
      ID
      Web of Science
      Top 200
      Google Scholar
      Field
      Replications
      Extensions
      Both
      Total
    
  
  
    
      0
      Learning about a New Technology: Pineapple in Ghana
      2
      176
      108.0
      1246
      Development
      1.0
      10.0
      11.0
      1530.0
    
    
      7
      Multinationals and Anti-Sweatshop Activism
      9
      21
      8.0
      134
      Development
      0.0
      2.0
      2.0
      163.0
    
    
      42
      Determinants of Redistributive Politics: An Empirical Analysis of Land Reforms in West Bengal, I...
      64
      12
      9.0
      63
      Development
      0.0
      1.0
      1.0
      84.0
    
    
      45
      Watta Satta: Bride Exchange and Women's Welfare in Rural Pakistan
      73
      6
      4.0
      37
      Development
      0.0
      0.0
      0.0
      47.0
    
    
      55
      Inherited Trust and Growth
      85
      102
      44.0
      536
      Development
      3.0
      3.0
      6.0
      682.0
    
    
      64
      Can Higher Prices Stimulate Product Use? Evidence from a Field Experiment in Zambia
      97
      44
      21.0
      265
      Development
      0.0
      0.0
      0.0
      330.0



In [18]:

    
data[(data['Field'] == 'Development') & (data['Web of Science'] > 100)]









    Out[18]:







  
    
      
      Title
      ID
      Web of Science
      Top 200
      Google Scholar
      Field
      Replications
      Extensions
      Both
      Total
    
  
  
    
      0
      Learning about a New Technology: Pineapple in Ghana
      2
      176
      108.0
      1246
      Development
      1.0
      10.0
      11.0
      1530.0
    
    
      55
      Inherited Trust and Growth
      85
      102
      44.0
      536
      Development
      3.0
      3.0
      6.0
      682.0

Exercises

How many papers in the sample have over 100 Google Scholar citations?

Hint: use len() to find the length of a dataframe



In [ ]:

len(data[data['Google Scholar'] > 100])

How many 'Labor/IO' papers have over 100 Google Scholar citations?



In [ ]:

len(data[(data['Google Scholar'] > 100) & (data['Field'] == 'Labor/IO')])

Grouping and aggregating data

We might want to summarize our data by grouping it by fields

To do this, we will use the .groupby() function



In [19]:

    
data.reset_index(inplace=True)

grouped = data.groupby('Field')
grouped









    Out[19]:





<pandas.core.groupby.DataFrameGroupBy object at 0x114a2b588>

To return an aggregated dataframe, we need to specify the function we would like pandas to use to aggregate our groups



In [20]:

    
grouped.mean()









    Out[20]:







  
    
      
      index
      ID
      Web of Science
      Top 200
      Google Scholar
      Replications
      Extensions
      Both
      Total
    
    
      Field
      
      
      
      
      
      
      
      
      
    
  
  
    
      Applied (gen.)
      32.428571
      49.285714
      33.428571
      13.142857
      199.714286
      0.857143
      0.571429
      1.428571
      246.285714
    
    
      Behavioral/Exp.
      42.631579
      65.000000
      48.842105
      24.000000
      202.789474
      1.631579
      3.421053
      5.052632
      275.631579
    
    
      Development
      35.500000
      55.000000
      60.166667
      32.333333
      380.166667
      0.666667
      2.666667
      3.333333
      472.666667
    
    
      Labor/IO
      23.846154
      35.923077
      46.384615
      26.461538
      254.384615
      0.615385
      0.769231
      1.384615
      327.230769
    
    
      Macro/Int/Trade
      34.200000
      51.880000
      32.760000
      18.880000
      203.640000
      0.120000
      1.040000
      1.160000
      255.280000



In [21]:

    
grouped['Both'].mean()









    Out[21]:





Field
Applied (gen.)     1.428571
Behavioral/Exp.    5.052632
Development        3.333333
Labor/IO           1.384615
Macro/Int/Trade    1.160000
Name: Both, dtype: float64



In [22]:

    
grouped['Both'].agg(['mean', 'median', 'count'])









    Out[22]:







  
    
      
      mean
      median
      count
    
    
      Field
      
      
      
    
  
  
    
      Applied (gen.)
      1.428571
      1.0
      7
    
    
      Behavioral/Exp.
      5.052632
      4.0
      19
    
    
      Development
      3.333333
      1.5
      6
    
    
      Labor/IO
      1.384615
      1.0
      13
    
    
      Macro/Int/Trade
      1.160000
      0.0
      25

A list of built-in aggregatation functions can be found here

Plotting

Pandas provides a built-in plotting command to create matplotlib plots from your dataframes

To display our plots, we need to import first matplotlib



In [23]:

    
import matplotlib.pyplot as plt

grouped['Google Scholar'].mean().plot(kind='bar')
plt.show()



In [24]:

    
data.plot(x='Google Scholar', y='Both', kind='scatter', alpha=0.6)
plt.ylabel('Total replications and extensions')
plt.xlabel('Number of citations')
plt.grid()
plt.show()

Exercise

Plot the average number of replications and extensions for each field, sorted from highest to lowest



In [ ]:

grouped['Both'].mean().sort_values(ascending=False).plot(kind='bar')
plt.title('Average number of replications/extensions by field')
plt.show()

Another plotting library: `seaborn`

seaborn is a plotting library built on top of matplotlib
It is geared towards producing pretty plots for statistical applications
You can find an example gallery of seaborn plots here



In [26]:

    
import seaborn as sns  # Import the package



In [27]:

    
sns.jointplot(x='Google Scholar', y='Both', data=data)
plt.show()



In [28]:

    
sns.kdeplot(data=data[data['Replications'] > 0].loc[:, 'Web of Science'], cumulative=True, bw=2)
sns.kdeplot(data=data[data['Replications'] == 0].loc[:, 'Web of Science'], cumulative=True, linestyle='--', bw=2)
plt.xlim(0, 200)
plt.xlabel('Web of Science Citations')
plt.ylabel('CDF')
plt.legend(['One or more replications', 'No replications'])
plt.show()

Bokeh example

Bokeh is a Python library that makes creating interactive plots super easy - an example gallery is here



In [30]:

    
from bokeh.plotting import figure, output_notebook, show
from bokeh.models import ColumnDataSource, HoverTool
output_notebook()

# Add data source
source = ColumnDataSource(data[['Title', 'Web of Science', 'Replications', 'Extensions']])
colors = ['red', 'blue']

# Add title tooltips
hover = HoverTool(tooltips=[
    ("Title", "@Title"),
    ("(x,y)", "($x, $y)")
])

# Create figure
p = figure(tools=[hover])

# Plot scatter
for to_plot, color in zip(['Replications', 'Extensions'], colors):
    p.circle(x='Web of Science', y=to_plot, source=source, size=7, alpha=0.5, legend=[to_plot.title()], color=color)

p.legend.click_policy = 'hide'
p.xaxis.axis_label = 'Number of citations'
p.yaxis.axis_label = 'Number of replications or extensions'

show(p)









    





    
        
        Loading BokehJS ...

Other resources

Here are some free, online resources for learning pandas/matplotlib:

QuantEcon lectures: Pandas, Pandas for Panel Data, Matplotlib
QuantEcon Stata-R-Pandas cheatsheet
SciPy 2017: Anatomy of Matplotlib
Coursera/University of Michigan: Introduction to Data Science in Python

	ID	Title	Web of Science	Top 200	Google Scholar	Field	Replications	Extensions	Both
0	2	Learning about a New Technology: Pineapple in ...	176	108.0	1246	Development	1.0	10.0	11.0
1	3	Multiple-Product Firms and Product Switching	111	74.0	764	Labor/IO	1.0	2.0	3.0
2	4	Momma's Got the Pill": How Anthony Comstock an...	23	15.0	98	Labor/IO	2.0	0.0	2.0
3	5	Matching and Sorting in Online Dating	75	23.0	326	Labor/IO	0.0	5.0	5.0
4	6	Entry, Exit, and Investment-Specific Technical...	9	9.0	38	Macro/Int/Trade	0.0	0.0	0.0

	Title	Google Scholar
0	Learning about a New Technology: Pineapple in Ghana	1246
21	The Macroeconomic Effects of Tax Changes: Estimates Based on a New Measure of Fiscal Shocks	1245
33	What Causes Industry Agglomeration? Evidence from Coagglomeration Patterns	766
1	Multiple-Product Firms and Product Switching	764
15	Social Preferences, Beliefs, and the Dynamics of Free Riding in Public Goods Experiments	736

	ID	Web of Science	Top 200	Google Scholar	Field	Replications	Extensions	Both	Total
Title
Learning about a New Technology: Pineapple in Ghana	2	176	108.0	1246	Development	1.0	10.0	11.0	1530.0
Multiple-Product Firms and Product Switching	3	111	74.0	764	Labor/IO	1.0	2.0	3.0	949.0
Momma's Got the Pill": How Anthony Comstock and Griswold v. Connecticut Shaped US Childbearing	4	23	15.0	98	Labor/IO	2.0	0.0	2.0	136.0
Matching and Sorting in Online Dating	5	75	23.0	326	Labor/IO	0.0	5.0	5.0	424.0
Entry, Exit, and Investment-Specific Technical Change	6	9	9.0	38	Macro/Int/Trade	0.0	0.0	0.0	56.0

	index	ID	Web of Science	Top 200	Google Scholar	Replications	Extensions	Both	Total
Field
Applied (gen.)	32.428571	49.285714	33.428571	13.142857	199.714286	0.857143	0.571429	1.428571	246.285714
Behavioral/Exp.	42.631579	65.000000	48.842105	24.000000	202.789474	1.631579	3.421053	5.052632	275.631579
Development	35.500000	55.000000	60.166667	32.333333	380.166667	0.666667	2.666667	3.333333	472.666667
Labor/IO	23.846154	35.923077	46.384615	26.461538	254.384615	0.615385	0.769231	1.384615	327.230769
Macro/Int/Trade	34.200000	51.880000	32.760000	18.880000	203.640000	0.120000	1.040000	1.160000	255.280000

Introduction to Data Analysis and Plotting in Python

What is pandas?

DataFrames

Info on the data set

Exercises

Find the means of Top 200 and Google Scholar citations

Create a table of summary statistics of Top 200 citations and Google Scholar citations

Create a table of summary statistics of the paper's fields (according to the variable Field)

Selecting and filtering

Exercises

How many papers in the sample have over 100 Google Scholar citations?

How many 'Labor/IO' papers have over 100 Google Scholar citations?

Grouping and aggregating data

Plotting

Exercise

Plot the average number of replications and extensions for each field, sorted from highest to lowest

Another plotting library: seaborn

Bokeh example

Other resources

Create a table of summary statistics of the paper's fields (according to the variable `Field`)

Another plotting library: `seaborn`