Data Bootcamp Final Project

Kavya Beheraj | May 13, 2016

The goal of my final project for Data Bootcamp was to develop useful data visualizations for my current internship, Student Success Network (SSN). I took two typical datasets currently used by SSN, extracted two variables for both males and females, and created two graphs comparing the network average to an individual organization's outcome. The purpose of such a visualization is to allow SSN's clients, education organizations across New York, to see how well they are doing in their students' social-emotional learning and make decisions on where to distribute resources for targeted programs or interventions.

Contents:

Background
About the Data
- 2.1 | Links to My Data
- 2.2 | Dataframes
Organizational Data
- 3.1 | Slicing the Organizational Data
- 3.2 | Joining the Organizational Data
Network Summary Data
Creating the Final Output
Visualizing the Data
- 6.1 | Visualizing the Percent Positive, Organization vs. Network
- 6.2 | Visualizing the Mean, Organization vs. Network



In [217]:

    
import sys                             
import pandas as pd                   
import matplotlib.pyplot as plt    
import datetime as dt                
import seaborn as sns
import numpy as np          

%matplotlib inline                     

print('Python version:', sys.version)
print('Pandas version: ', pd.__version__)
print('Today: ', dt.date.today())









    



Python version: 3.5.1 |Anaconda 2.4.1 (x86_64)| (default, Dec  7 2015, 11:24:55) 
[GCC 4.2.1 (Apple Inc. build 5577)]
Pandas version:  0.17.1
Today:  2016-05-13

1 | Background

Student Success Network (SSN) is a nonprofit that helps 40 partner organizations measure seven indicators of social-emotional learning (SEL) in students using a survey:

Academic Behaviors
Academic Self-Efficacy
Growth Mindset
Interpersonal Skills
Problem-Solving
Self-Advocacy
Belonging

Social-emotional learning has a huge impact on student outcomes later in life, often comparable to academic outcomes like test scores. SSN has developed a survey to measure social-emotional learning, which they distribute to their partner organizations.

SSN sends the survey responses to another company (the Research Alliance for NYC Schools) for the descriptive statistics, and what they receive is a large, unwieldy spreadsheet that they must translate into easy-to-understand and actionable visualizations. They also provide partner organizations with visualizations of specific subgroups like gender, race, and school, and compare the organization's results on SEL indicators to a network-wide average.

With this project, I attempted to create a uniform method of taking those spreadsheets and turning them into helpful data visualizations.

Source: SSN Website

2 | About the Data

I received my data directly from Student Success Network, and as such, it is not online. They gave me two spreadsheets in .csv format: a sample organizational output (real data from a de-identified partner organization) and a network wide average.

The most important outcomes to partner organizations are:

Percent positive: the percentage of students whose responses are positive for that SEL indicator
Mean: the average result on a scale of 1 to 5 for that SEL indicator

These two variables provide are the most helpful for organizations to make decisions on resource allocation for student social-emotional learning.

I decided to focus on the gender subgroup for the purposes of this project, to reduce the number of different variables at play. Since all of the descriptive statistics they receive for each partner organization are within the same format, I believe SSN can easily translate this program to different subgroups.

2.1 | Links to my data:

2.2 | Dataframes

These are the dataframes I created in order to extract percent positive and mean data for males and females at the organization and network-wide level. The final output is a dataframe summarizing the mean and percent positive for all variables at both levels.

Organization Data

Label	Dataframe
MaleFemale	Total Organization Data for Males and Females
mean	Mean for Males and Females
ppos	Percent Positive for Males and Females
male	Male Data (Mean and Percent Positive)
female	Female Data (Mean and Percent Positive)
mlmean	Mean Data for Males
mlppos	Percent Positive Data for Males
fmmean	Mean Data for Females
fmppos	Percent Positive Data for Females

Network Data

Label	Dataframe
network	Total Network Data
mlnet	Male Network Data (Mean and Percent Positive)
fmnet	Female Network Data (Mean and Percent Positive)
output	Network and Organizational Data, Males and Females

Total Data: Dataframes for both the organization and the network

Label	Dataframe
output	Network and Organizational Data, Males and Females
mean_output	Mean Data for Network and Organization, Males and Females
pp_output	Percent Positive Data for Network and Organization, Males and Females
pp_net	Mean Data for Network, Males and Females
mean_net	Percent Positive Data for Network, Males and Females

3 | Organizational Data

I first focused on cleaning up the Sample Organizational Output for Males and Females. I wanted to extract the two outcomes (percent positive and mean) for both males and females along the seven SEL indicators.

The results include data from 230 males and 334 females, or 564 students in total.



In [47]:

    
MaleFemale = "/Users/kavyabeheraj/Desktop/Current Classes/Data Bootcamp/Male_Female_Sample_Org_Output.csv"

# Sample organizational output for males and females

df = pd.read_csv(MaleFemale)
df

3.1 | Slicing the Organizational Data

I noticed that the male outcomes consisted of the first 14 rows of the spreadsheet, while the female outcomes consisted of the last 14. The 14 rows of outcomes were each divided into 7 rows for the mean and 7 rows for the percent positive (called percent perfect in the dataframe).

I sliced the spreadsheet into four separate dataframes:

Male Mean (mlmean)
Male Percent Positive (mlpp)
Female Mean (fmmean)
Female Percent Positive (fmpp)

I also set the SEL indicators as the index renamed all of them for consistency.



In [56]:

    
male = pd.read_csv(MaleFemale).head(14)
female = pd.read_csv(MaleFemale).tail(14)



In [4]:

    
mlmean = male.head(7)                 # Reads the first seven lines of the dataframe
mlmean = mlmean[["Label","Mean"]].set_index("Label")     # Slices only two columns and sets the index to be "Label"
mlmean = mlmean.rename(index={"Academic Behavior" : "AcaBeh", 
                              "Academic Self-efficacy" : "AcaEf", 
                              "Growth Mindset" : "Growth", 
                              "Interpersonal Skills" : "Intp", 
                              "Problem Solving" : "Prob", 
                              "SELF-ADVOCACY" : "SelfAd", 
                              "BELONGING" : "Belong"},
                      columns={"Mean" : "Male Mean"})
mlmean



In [5]:

    
mlpp = male.tail(7)                 # Reads the first seven lines of the dataframe
mlpp = mlpp[["Label","Mean"]].set_index("Label")     # Slices only two columns and sets the index to be "Label"
mlpp = mlpp.rename(index={"Academic Behavior Percent Perfect" : "AcaBeh", 
                          "Academic Self-efficacy Percent Perfect" : "AcaEf", 
                          "Growth Mindset Percent Perfect" : "Growth", 
                          "Interpersonal Skills Percent Perfect" : "Intp", 
                          "Problem Solving Percent Perfect" : "Prob", 
                          "SELF ADVOCACY PERCENT PERFECT" : "SelfAd", 
                          "BELONGING PERCENT PERFECT ge 4" : "Belong"},
                   columns={"Mean" : "Male Percent Positive"})

mlpp









    Out[5]:






  
    
      
      Male Percent Positive
    
    
      Label
      
    
  
  
    
      AcaBeh
      0.262009
    
    
      AcaEf
      0.432314
    
    
      Growth
      0.510917
    
    
      Intp
      0.456522
    
    
      Prob
      0.330435
    
    
      SelfAd
      0.333333
    
    
      Belong
      0.591304



In [57]:

    
fmmean = female.head(7)                 # Reads the first seven lines of the dataframe
fmmean = fmmean[["Label","Mean"]].set_index("Label")     # Slices only two columns and sets the index to be "Label"
fmmean = fmmean.rename(index={"Academic Behavior" : "AcaBeh", 
                              "Academic Self-efficacy" : "AcaEf", 
                              "Growth Mindset" : "Growth", 
                              "Interpersonal Skills" : "Intp", 
                              "Problem Solving" : "Prob", 
                              "SELF-ADVOCACY" : "SelfAd", 
                              "BELONGING" : "Belong"},
                      columns={"Mean" : "Female Mean"})
fmmean









    Out[57]:






  
    
      
      Female Mean
    
    
      Label
      
    
  
  
    
      AcaBeh
      3.519144
    
    
      AcaEf
      2.640891
    
    
      Growth
      2.825781
    
    
      Intp
      4.817563
    
    
      Prob
      2.672490
    
    
      SelfAd
      2.834629
    
    
      Belong
      3.927492



In [58]:

    
fmpp = female.tail(7)
fmpp = fmpp[["Label","Mean"]].set_index("Label")
fmpp = fmpp.rename(index={"Academic Behavior Percent Perfect" : "AcaBeh", 
                          "Academic Self-efficacy Percent Perfect" : "AcaEf", 
                          "Growth Mindset Percent Perfect" : "Growth", 
                          "Interpersonal Skills Percent Perfect" : "Intp", 
                          "Problem Solving Percent Perfect" : "Prob", 
                          "SELF ADVOCACY PERCENT PERFECT" : "SelfAd", 
                          "BELONGING PERCENT PERFECT ge 4" : "Belong"},
                   columns={"Mean" : "Female Percent Positive"}) 
fmpp









    Out[58]:






  
    
      
      Female Percent Positive
    
    
      Label
      
    
  
  
    
      AcaBeh
      0.293210
    
    
      AcaEf
      0.323263
    
    
      Growth
      0.506061
    
    
      Intp
      0.496988
    
    
      Prob
      0.322289
    
    
      SelfAd
      0.389058
    
    
      Belong
      0.567976

3.2 | Joining the Organizational Data

After creating four separate dataframes, I decided to concatenate them along the seven SEL indicators. I envisioned a problem in graphing both outcomes (mean and percent positive) within the same graph, since they had different scales, but seeing all of the data within one dataframe is easier to understand.

I created two dataframes summarizing the mean and percent positive (ppos), as well as one with both outcomes (meanppos).



In [212]:

    
mean = pd.concat([mlmean, fmmean], axis=1)
mean









    Out[212]:






  
    
      
      Male Mean
      Female Mean
    
    
      Label
      
      
    
  
  
    
      AcaBeh
      3.414368
      3.519144
    
    
      AcaEf
      2.770742
      2.640891
    
    
      Growth
      2.864493
      2.825781
    
    
      Intp
      4.733817
      4.817563
    
    
      Prob
      2.711232
      2.672490
    
    
      SelfAd
      2.791545
      2.834629
    
    
      Belong
      3.897101
      3.927492



In [42]:

    
ppos = pd.concat([mlpp, fmpp], axis=1)
ppos









    Out[42]:






  
    
      
      Male Percent Positive
      Female Percent Positive
    
    
      Label
      
      
    
  
  
    
      AcaBeh
      0.262009
      0.293210
    
    
      AcaEf
      0.432314
      0.323263
    
    
      Growth
      0.510917
      0.506061
    
    
      Intp
      0.456522
      0.496988
    
    
      Prob
      0.330435
      0.322289
    
    
      SelfAd
      0.333333
      0.389058
    
    
      Belong
      0.591304
      0.567976



In [43]:

    
meanppos = pd.concat([mlpp, fmpp, mlmean, fmmean], axis=1)
meanppos









    Out[43]:






  
    
      
      Male Percent Positive
      Female Percent Positive
      Male Mean
      Female Mean
    
    
      Label
      
      
      
      
    
  
  
    
      AcaBeh
      0.262009
      0.293210
      3.414368
      3.414368
    
    
      AcaEf
      0.432314
      0.323263
      2.770742
      2.770742
    
    
      Growth
      0.510917
      0.506061
      2.864493
      2.864493
    
    
      Intp
      0.456522
      0.496988
      4.733817
      4.733817
    
    
      Prob
      0.330435
      0.322289
      2.711232
      2.711232
    
    
      SelfAd
      0.333333
      0.389058
      2.791545
      2.791545
    
    
      Belong
      0.591304
      0.567976
      3.897101
      3.897101



In [ ]:

    
mean.plot.barh(figsize = (10,7))

4 | Network Summary Data

I read in the male and female summary data for the entire network. I then extracted the data for males and females, avoiding the rows which had a blank for "isFemale". Unlike the organizational data, SSN network data has a separate column for percent positive, which meant that I did not have to create as many dataframes to get the same output.

I created two dataframes, one summarizing male network data (mlnet) and one for female network data (fmnet).



In [80]:

    
df2 = "/Users/kavyabeheraj/Desktop/Current Classes/Data Bootcamp/Network_Summary_Gender.csv"

network = pd.read_csv(df2)
network



In [61]:

    
mlnet = network.tail(7)                 
mlnet = mlnet[["label","mean", "percentPositive"]].set_index("label")     
mlnet = mlnet.rename(index={"Academic Behavior" : "AcaBeh", 
                          "Academic Self-efficacy" : "AcaEf", 
                          "Growth Mindset" : "Growth", 
                          "Interpersonal Skills" : "Intp", 
                          "Problem Solving" : "Prob", 
                          "Self-Advocacy" : "SelfAd", 
                          "Belonging" : "Belong"},
                   columns={"mean" : "Male Mean, Network",
                           "percentPositive" : "Male Percent Positive, Network"})
mlnet









    Out[61]:






  
    
      
      Male Mean, Network
      Male Percent Positive, Network
    
    
      label
      
      
    
  
  
    
      AcaBeh
      3.545661
      0.285714
    
    
      AcaEf
      2.600877
      0.263158
    
    
      Belong
      3.392157
      0.055556
    
    
      Growth
      2.411111
      0.200000
    
    
      Intp
      4.387238
      0.294118
    
    
      Prob
      2.645614
      0.190476
    
    
      SelfAd
      2.786946
      0.222222



In [64]:

    
fmnet = network[7:14]              
fmnet = fmnet[["label","mean", "percentPositive"]].set_index("label")     
fmnet = fmnet.rename(index={"Academic Behavior" : "AcaBeh", 
                          "Academic Self-efficacy" : "AcaEf", 
                          "Growth Mindset" : "Growth", 
                          "Interpersonal Skills" : "Intp", 
                          "Problem Solving" : "Prob", 
                          "Self-Advocacy" : "SelfAd", 
                          "Belonging" : "Belong"},
                   columns={"mean" : "Female Mean, Network", "percentPositive" : "Female Percent Positive, Network"})
fmnet









    Out[64]:






  
    
      
      Female Mean, Network
      Female Percent Positive, Network
    
    
      label
      
      
    
  
  
    
      AcaBeh
      3.508376
      0.221106
    
    
      AcaEf
      2.660160
      0.228070
    
    
      Belong
      3.548576
      0.187970
    
    
      Growth
      2.442761
      0.188917
    
    
      Intp
      4.670432
      0.384422
    
    
      Prob
      2.695707
      0.203008
    
    
      SelfAd
      2.789838
      0.195980

5 | Slicing the Final Output

After I cleaned and sliced both the organization and network-wide data, I joined them along the SEL indicators and created separate dataframes for mean and percent positive.



In [66]:

    
output = pd.concat([meanppos, fmnet, mlnet ], axis=1)
output









    Out[66]:






  
    
      
      Male Percent Positive
      Female Percent Positive
      Male Mean
      Female Mean
      Female Mean, Network
      Female Percent Positive, Network
      Male Mean, Network
      Male Percent Positive, Network
    
  
  
    
      AcaBeh
      0.262009
      0.293210
      3.414368
      3.414368
      3.508376
      0.221106
      3.545661
      0.285714
    
    
      AcaEf
      0.432314
      0.323263
      2.770742
      2.770742
      2.660160
      0.228070
      2.600877
      0.263158
    
    
      Belong
      0.591304
      0.567976
      3.897101
      3.897101
      3.548576
      0.187970
      3.392157
      0.055556
    
    
      Growth
      0.510917
      0.506061
      2.864493
      2.864493
      2.442761
      0.188917
      2.411111
      0.200000
    
    
      Intp
      0.456522
      0.496988
      4.733817
      4.733817
      4.670432
      0.384422
      4.387238
      0.294118
    
    
      Prob
      0.330435
      0.322289
      2.711232
      2.711232
      2.695707
      0.203008
      2.645614
      0.190476
    
    
      SelfAd
      0.333333
      0.389058
      2.791545
      2.791545
      2.789838
      0.195980
      2.786946
      0.222222



In [74]:

    
mean_output = output[["Male Mean", 
                      "Female Mean", 
                      "Male Mean, Network", 
                      "Female Mean, Network"]]
mean_output









    Out[74]:






  
    
      
      Male Mean
      Female Mean
      Male Mean, Network
      Female Mean, Network
    
  
  
    
      AcaBeh
      3.414368
      3.414368
      3.545661
      3.508376
    
    
      AcaEf
      2.770742
      2.770742
      2.600877
      2.660160
    
    
      Belong
      3.897101
      3.897101
      3.392157
      3.548576
    
    
      Growth
      2.864493
      2.864493
      2.411111
      2.442761
    
    
      Intp
      4.733817
      4.733817
      4.387238
      4.670432
    
    
      Prob
      2.711232
      2.711232
      2.645614
      2.695707
    
    
      SelfAd
      2.791545
      2.791545
      2.786946
      2.789838



In [88]:

    
pp_output = output[["Male Percent Positive",  
                    "Male Percent Positive, Network",
                    "Female Percent Positive",
                    "Female Percent Positive, Network"]]
pp_output









    Out[88]:






  
    
      
      Male Percent Positive
      Male Percent Positive, Network
      Female Percent Positive
      Female Percent Positive, Network
    
  
  
    
      AcaBeh
      0.262009
      0.285714
      0.293210
      0.221106
    
    
      AcaEf
      0.432314
      0.263158
      0.323263
      0.228070
    
    
      Belong
      0.591304
      0.055556
      0.567976
      0.187970
    
    
      Growth
      0.510917
      0.200000
      0.506061
      0.188917
    
    
      Intp
      0.456522
      0.294118
      0.496988
      0.384422
    
    
      Prob
      0.330435
      0.190476
      0.322289
      0.203008
    
    
      SelfAd
      0.333333
      0.222222
      0.389058
      0.195980



In [152]:

    
mean_net = output[["Male Mean, Network",
                   "Female Mean, Network"]]
mean_net









    Out[152]:






  
    
      
      Male Mean, Network
      Female Mean, Network
    
  
  
    
      AcaBeh
      3.545661
      3.508376
    
    
      AcaEf
      2.600877
      2.660160
    
    
      Belong
      3.392157
      3.548576
    
    
      Growth
      2.411111
      2.442761
    
    
      Intp
      4.387238
      4.670432
    
    
      Prob
      2.645614
      2.695707
    
    
      SelfAd
      2.786946
      2.789838



In [169]:

    
ppos_net = output[["Male Percent Positive, Network",
                   "Female Percent Positive, Network"]]
ppos_net









    Out[169]:






  
    
      
      Male Percent Positive, Network
      Female Percent Positive, Network
    
  
  
    
      AcaBeh
      0.285714
      0.221106
    
    
      AcaEf
      0.263158
      0.228070
    
    
      Belong
      0.055556
      0.187970
    
    
      Growth
      0.200000
      0.188917
    
    
      Intp
      0.294118
      0.384422
    
    
      Prob
      0.190476
      0.203008
    
    
      SelfAd
      0.222222
      0.195980

6 | Visualizing the Data

I created two visualizations of the data, one for percent positive and another for the mean. I compared male vs. female and the organization's data vs. the network-wide outcome.

6.1 | Visualizing the Percent Positive, Organization vs. Network



In [354]:

    
plt.style.use('seaborn-pastel')

ax = ppos_net.plot(linestyle='-', marker='o', colormap = "Accent")
ppos.plot(kind='bar', colormap = "Pastel2",
          ax=ax,
         figsize = (10,7))

ax.set_ylim(0, 0.8)
ax.set_title("Percent Positive Male and Female SEL Outcomes, Organization vs. Network")









    Out[354]:





<matplotlib.text.Text at 0x13479ea90>

From the data above, we can see that this organization has a greater percentage of students who meet or exceed requirements for the 7 SEL indicators, except for Problem-Solving.

5.1 | Visualizing the Mean, Organization vs. Network



In [353]:

    
plt.style.use('seaborn-pastel')

ax = mean_net.plot(linestyle='-', marker='o', colormap = "Accent")
mean.plot(kind='bar', colormap = "Pastel2",
          ax=ax,
         figsize = (10,7))

ax.set_ylim(0, 5)
ax.set_title("Mean Male and Female SEL Outcomes, Organization vs. Network")









    Out[353]:





<matplotlib.text.Text at 0x1346f2a90>

Looking at the mean for the organization compared to the network, we see that the students at this organization generally perform below-average on SEL indicators, except for Interpersonal Skills. If I was advising this organization, I would suggest that they distribute more resources towards enhancing their students' problem-solving skills, as well as their academic behaviors in general.



In [ ]:

	Male Mean
Label
AcaBeh	3.414368
AcaEf	2.770742
Growth	2.864493
Intp	4.733817
Prob	2.711232
SelfAd	2.791545
Belong	3.897101

	Male Percent Positive
Label
AcaBeh	0.262009
AcaEf	0.432314
Growth	0.510917
Intp	0.456522
Prob	0.330435
SelfAd	0.333333
Belong	0.591304

	Female Mean
Label
AcaBeh	3.519144
AcaEf	2.640891
Growth	2.825781
Intp	4.817563
Prob	2.672490
SelfAd	2.834629
Belong	3.927492

	Female Percent Positive
Label
AcaBeh	0.293210
AcaEf	0.323263
Growth	0.506061
Intp	0.496988
Prob	0.322289
SelfAd	0.389058
Belong	0.567976

	Male Mean, Network	Male Percent Positive, Network
label
AcaBeh	3.545661	0.285714
AcaEf	2.600877	0.263158
Belong	3.392157	0.055556
Growth	2.411111	0.200000
Intp	4.387238	0.294118
Prob	2.645614	0.190476
SelfAd	2.786946	0.222222

	Female Mean, Network	Female Percent Positive, Network
label
AcaBeh	3.508376	0.221106
AcaEf	2.660160	0.228070
Belong	3.548576	0.187970
Growth	2.442761	0.188917
Intp	4.670432	0.384422
Prob	2.695707	0.203008
SelfAd	2.789838	0.195980