The goal of my final project for Data Bootcamp was to develop useful data visualizations for my current internship, Student Success Network (SSN). I took two typical datasets currently used by SSN, extracted two variables for both males and females, and created two graphs comparing the network average to an individual organization's outcome. The purpose of such a visualization is to allow SSN's clients, education organizations across New York, to see how well they are doing in their students' social-emotional learning and make decisions on where to distribute resources for targeted programs or interventions.
In [217]:
import sys
import pandas as pd
import matplotlib.pyplot as plt
import datetime as dt
import seaborn as sns
import numpy as np
%matplotlib inline
print('Python version:', sys.version)
print('Pandas version: ', pd.__version__)
print('Today: ', dt.date.today())
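Student Success Network (SSN) is a nonprofit that helps 40 partner organizations measure seven indicators of social-emotional learning (SEL) in students using a survey: Academic Behavior, Academic Self-Efficacy, Growth Mindset, Interpersonal Skills, Problem Solving, Self-Advocacy, and Belonging.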
Social-emotional learning has a huge impact on student outcomes later in life, often comparable to academic outcomes like test scores. SSN has developed a survey to measure social-emotional learning, which they distribute to their partner organizations.
SSN sends the survey responses to another company (the Research Alliance for NYC Schools) for the descriptive statistics, and what they receive is a large, unwieldy spreadsheet that they must translate into easy-to-understand and actionable visualizations. They also provide partner organizations with visualizations of specific subgroups like gender, race, and school, and compare the organization's results on SEL indicators to a network-wide average.
With this project, I attempted to create a uniform method of taking those spreadsheets and turning them into helpful data visualizations.
Source: SSN Website
I received my data directly from Student Success Network, and as such, it is not online. They gave me two spreadsheets in .csv format: a sample organizational output (real data from a de-identified partner organization) and a network-wide average.
The most important outcomes to partner organizations are the mean score and the percent positive for each SEL indicator.
These two variables are the most helpful for organizations making decisions on resource allocation for student social-emotional learning.
I decided to focus on the gender subgroup for the purposes of this project, to reduce the number of different variables at play. Since all of the descriptive statistics they receive for each partner organization are within the same format, I believe SSN can easily translate this program to different subgroups.
These are the dataframes I created in order to extract percent positive and mean data for males and females at the organization and network-wide level. The final output is a dataframe summarizing the mean and percent positive for all variables at both levels.
Label | Dataframe |
---|---|
MaleFemale | Total Organization Data for Males and Females |
mean | Mean for Males and Females |
ppos | Percent Positive for Males and Females |
male | Male Data (Mean and Percent Positive) |
female | Female Data (Mean and Percent Positive) |
mlmean | Mean Data for Males |
mlpp | Percent Positive Data for Males |
fmmean | Mean Data for Females |
fmpp | Percent Positive Data for Females |
meanppos | Mean and Percent Positive for Males and Females |
Label | Dataframe |
---|---|
network | Total Network Data |
mlnet | Male Network Data (Mean and Percent Positive) |
fmnet | Female Network Data (Mean and Percent Positive) |
output | Network and Organizational Data, Males and Females |
Label | Dataframe |
---|---|
output | Network and Organizational Data, Males and Females |
mean_output | Mean Data for Network and Organization, Males and Females |
pp_output | Percent Positive Data for Network and Organization, Males and Females |
ppos_net | Percent Positive Data for Network, Males and Females |
mean_net | Mean Data for Network, Males and Females |
I first focused on cleaning up the Sample Organizational Output for Males and Females. I wanted to extract the two outcomes (percent positive and mean) for both males and females along the seven SEL indicators.
The results include data from 230 males and 334 females, or 564 students in total.
In [47]:
MaleFemale = "/Users/kavyabeheraj/Desktop/Current Classes/Data Bootcamp/Male_Female_Sample_Org_Output.csv"
# Sample organizational output for males and females
df = pd.read_csv(MaleFemale)
df
I noticed that the male outcomes consisted of the first 14 rows of the spreadsheet, while the female outcomes consisted of the last 14. The 14 rows of outcomes were each divided into 7 rows for the mean and 7 rows for the percent positive (called percent perfect in the dataframe).
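The layout described above (one block per subgroup, with the mean rows followed by the percent positive rows) can be split by row position. The following is only a sketch under that assumption: `split_blocks` is a hypothetical helper, not part of SSN's tooling, and the toy dataframe holds made-up numbers with two indicators instead of seven.

```python
import pandas as pd

def split_blocks(df, groups, n_indicators=7):
    """Split a stacked output into {(group, outcome): DataFrame} by row position.

    Assumes each group occupies 2 * n_indicators consecutive rows:
    the mean rows first, then the percent positive rows.
    """
    block = 2 * n_indicators
    out = {}
    for i, group in enumerate(groups):
        rows = df.iloc[i * block:(i + 1) * block]
        out[(group, "mean")] = rows.iloc[:n_indicators]
        out[(group, "ppos")] = rows.iloc[n_indicators:]
    return out

# Toy stand-in for the real spreadsheet (made-up values, 2 indicators per block)
toy = pd.DataFrame({
    "Label": ["A", "B", "A Percent Perfect", "B Percent Perfect"] * 2,
    "Mean": [3.1, 3.4, 0.5, 0.6, 3.3, 3.2, 0.55, 0.58],
})
blocks = split_blocks(toy, ["male", "female"], n_indicators=2)
print(blocks[("female", "mean")])
```

With the real file, the same call with `groups=["male", "female"]` and the default `n_indicators=7` would yield the four slices built by hand in the cells below, provided the row order matches the assumption.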
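I sliced the spreadsheet into four separate dataframes: male mean (mlmean), male percent positive (mlpp), female mean (fmmean), and female percent positive (fmpp).
I also set the SEL indicators as the index and renamed all of them for consistency.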
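The labels in the spreadsheet are inconsistent in casing and suffixes (for example "SELF-ADVOCACY" and "BELONGING PERCENT PERFECT ge 4"), so the renaming could also be done with one function instead of a separate dictionary per dataframe. This is a sketch only: `short_label` and `SHORT_NAMES` are hypothetical names, and the rules assume the label spellings that appear in this notebook.

```python
# Short names used throughout this notebook, keyed by lower-cased indicator name
SHORT_NAMES = {
    "academic behavior": "AcaBeh",
    "academic self-efficacy": "AcaEf",
    "growth mindset": "Growth",
    "interpersonal skills": "Intp",
    "problem solving": "Prob",
    "self-advocacy": "SelfAd",
    "belonging": "Belong",
}

def short_label(label):
    """Map a raw spreadsheet label to its short name, ignoring case and suffixes."""
    text = label.lower().replace("self advocacy", "self-advocacy")
    # Drop the "percent perfect ..." suffix used on percent positive rows
    text = text.split(" percent perfect")[0].strip()
    return SHORT_NAMES[text]

print(short_label("BELONGING PERCENT PERFECT ge 4"))
print(short_label("Academic Self-efficacy"))
```

Because `DataFrame.rename` accepts a function for `index`, a call such as `mlmean.rename(index=short_label)` would apply the same mapping to any of the four dataframes.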
In [56]:
male = df.head(14)    # First 14 rows: male outcomes (reuses df read above)
female = df.tail(14)  # Last 14 rows: female outcomes
In [4]:
mlmean = male.head(7) # Reads the first seven lines of the dataframe
mlmean = mlmean[["Label","Mean"]].set_index("Label") # Slices only two columns and sets the index to be "Label"
mlmean = mlmean.rename(index={"Academic Behavior" : "AcaBeh",
"Academic Self-efficacy" : "AcaEf",
"Growth Mindset" : "Growth",
"Interpersonal Skills" : "Intp",
"Problem Solving" : "Prob",
"SELF-ADVOCACY" : "SelfAd",
"BELONGING" : "Belong"},
columns={"Mean" : "Male Mean"})
mlmean
Out[4]:
In [5]:
mlpp = male.tail(7) # Reads the last seven lines of the dataframe
mlpp = mlpp[["Label","Mean"]].set_index("Label") # Slices only two columns and sets the index to be "Label"
mlpp = mlpp.rename(index={"Academic Behavior Percent Perfect" : "AcaBeh",
"Academic Self-efficacy Percent Perfect" : "AcaEf",
"Growth Mindset Percent Perfect" : "Growth",
"Interpersonal Skills Percent Perfect" : "Intp",
"Problem Solving Percent Perfect" : "Prob",
"SELF ADVOCACY PERCENT PERFECT" : "SelfAd",
"BELONGING PERCENT PERFECT ge 4" : "Belong"},
columns={"Mean" : "Male Percent Positive"})
mlpp
Out[5]:
In [57]:
fmmean = female.head(7) # Reads the first seven lines of the dataframe
fmmean = fmmean[["Label","Mean"]].set_index("Label") # Slices only two columns and sets the index to be "Label"
fmmean = fmmean.rename(index={"Academic Behavior" : "AcaBeh",
"Academic Self-efficacy" : "AcaEf",
"Growth Mindset" : "Growth",
"Interpersonal Skills" : "Intp",
"Problem Solving" : "Prob",
"SELF-ADVOCACY" : "SelfAd",
"BELONGING" : "Belong"},
columns={"Mean" : "Female Mean"})
fmmean
Out[57]:
In [58]:
fmpp = female.tail(7)
fmpp = fmpp[["Label","Mean"]].set_index("Label")
fmpp = fmpp.rename(index={"Academic Behavior Percent Perfect" : "AcaBeh",
"Academic Self-efficacy Percent Perfect" : "AcaEf",
"Growth Mindset Percent Perfect" : "Growth",
"Interpersonal Skills Percent Perfect" : "Intp",
"Problem Solving Percent Perfect" : "Prob",
"SELF ADVOCACY PERCENT PERFECT" : "SelfAd",
"BELONGING PERCENT PERFECT ge 4" : "Belong"},
columns={"Mean" : "Female Percent Positive"})
fmpp
Out[58]:
After creating four separate dataframes, I decided to concatenate them along the seven SEL indicators. I anticipated a problem graphing both outcomes (mean and percent positive) on the same graph, since they have different scales, but seeing all of the data in one dataframe makes it easier to understand.
I created two dataframes summarizing the mean and percent positive (ppos), as well as one with both outcomes (meanppos).
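Concatenating with `axis=1` places the dataframes side by side and aligns rows on the index labels, which is why renaming every indicator to the same short name matters. A small illustration with made-up numbers (the frames `a` and `b` are toy stand-ins, not SSN data):

```python
import pandas as pd

a = pd.DataFrame({"Male Mean": [3.1, 3.4]}, index=["AcaBeh", "Growth"])
# Same indicators, but listed in a different row order
b = pd.DataFrame({"Female Mean": [3.2, 3.3]}, index=["Growth", "AcaBeh"])

combined = pd.concat([a, b], axis=1)
print(combined)
```

Each value lands in the row of its own indicator regardless of the order in the source frames. If the labels did not match, the unmatched rows would be filled with NaN rather than silently misaligned.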
In [212]:
mean = pd.concat([mlmean, fmmean], axis=1)
mean
Out[212]:
In [42]:
ppos = pd.concat([mlpp, fmpp], axis=1)
ppos
Out[42]:
In [43]:
meanppos = pd.concat([mlpp, fmpp, mlmean, fmmean], axis=1)
meanppos
Out[43]:
In [ ]:
mean.plot.barh(figsize = (10,7))
I read in the male and female summary data for the entire network. I then extracted the data for males and females, avoiding the rows which had a blank for "isFemale". Unlike the organizational data, SSN network data has a separate column for percent positive, which meant that I did not have to create as many dataframes to get the same output.
I created two dataframes, one summarizing male network data (mlnet) and one for female network data (fmnet).
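The cells below select the male and female rows by position. An alternative that does not depend on row order is to filter on the "isFemale" column directly. This is a sketch with made-up values, and it assumes that "isFemale" is coded as 1 for females and 0 for males, which I have not confirmed against the real file.

```python
import pandas as pd

# Toy stand-in for the network summary (made-up values, 2 indicators)
toy_network = pd.DataFrame({
    "label": ["Growth Mindset", "Belonging"] * 3,
    "isFemale": [float("nan"), float("nan"), 1, 1, 0, 0],
    "mean": [3.5, 3.6, 3.6, 3.7, 3.4, 3.5],
    "percentPositive": [0.60, 0.62, 0.63, 0.65, 0.57, 0.59],
})

known = toy_network.dropna(subset=["isFemale"])  # drop rows with a blank isFemale
fm = known[known["isFemale"] == 1].set_index("label")  # assumed coding: 1 = female
ml = known[known["isFemale"] == 0].set_index("label")  # assumed coding: 0 = male
print(fm[["mean", "percentPositive"]])
```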
In [80]:
df2 = "/Users/kavyabeheraj/Desktop/Current Classes/Data Bootcamp/Network_Summary_Gender.csv"
network = pd.read_csv(df2)
network
In [61]:
mlnet = network.tail(7)
mlnet = mlnet[["label","mean", "percentPositive"]].set_index("label")
mlnet = mlnet.rename(index={"Academic Behavior" : "AcaBeh",
"Academic Self-efficacy" : "AcaEf",
"Growth Mindset" : "Growth",
"Interpersonal Skills" : "Intp",
"Problem Solving" : "Prob",
"Self-Advocacy" : "SelfAd",
"Belonging" : "Belong"},
columns={"mean" : "Male Mean, Network",
"percentPositive" : "Male Percent Positive, Network"})
mlnet
Out[61]:
In [64]:
fmnet = network[7:14]
fmnet = fmnet[["label","mean", "percentPositive"]].set_index("label")
fmnet = fmnet.rename(index={"Academic Behavior" : "AcaBeh",
"Academic Self-efficacy" : "AcaEf",
"Growth Mindset" : "Growth",
"Interpersonal Skills" : "Intp",
"Problem Solving" : "Prob",
"Self-Advocacy" : "SelfAd",
"Belonging" : "Belong"},
columns={"mean" : "Female Mean, Network", "percentPositive" : "Female Percent Positive, Network"})
fmnet
Out[64]:
In [66]:
output = pd.concat([meanppos, fmnet, mlnet ], axis=1)
output
Out[66]:
In [74]:
mean_output = output[["Male Mean",
"Female Mean",
"Male Mean, Network",
"Female Mean, Network"]]
mean_output
Out[74]:
In [88]:
pp_output = output[["Male Percent Positive",
"Male Percent Positive, Network",
"Female Percent Positive",
"Female Percent Positive, Network"]]
pp_output
Out[88]:
In [152]:
mean_net = output[["Male Mean, Network",
"Female Mean, Network"]]
mean_net
Out[152]:
In [169]:
ppos_net = output[["Male Percent Positive, Network",
"Female Percent Positive, Network"]]
ppos_net
Out[169]:
In [354]:
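# 'seaborn-pastel' was renamed 'seaborn-v0_8-pastel' in newer matplotlib releases
pastel = 'seaborn-v0_8-pastel' if 'seaborn-v0_8-pastel' in plt.style.available else 'seaborn-pastel'
plt.style.use(pastel)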
ax = ppos_net.plot(linestyle='-', marker='o', colormap = "Accent")
ppos.plot(kind='bar', colormap = "Pastel2",
ax=ax,
figsize = (10,7))
ax.set_ylim(0, 0.8)
ax.set_title("Percent Positive Male and Female SEL Outcomes, Organization vs. Network")
Out[354]:
From the data above, we can see that this organization has a greater percentage of students who meet or exceed requirements than the network average on the 7 SEL indicators, except for Problem Solving.
In [353]:
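# 'seaborn-pastel' was renamed 'seaborn-v0_8-pastel' in newer matplotlib releases
pastel = 'seaborn-v0_8-pastel' if 'seaborn-v0_8-pastel' in plt.style.available else 'seaborn-pastel'
plt.style.use(pastel)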
ax = mean_net.plot(linestyle='-', marker='o', colormap = "Accent")
mean.plot(kind='bar', colormap = "Pastel2",
ax=ax,
figsize = (10,7))
ax.set_ylim(0, 5)
ax.set_title("Mean Male and Female SEL Outcomes, Organization vs. Network")
Out[353]:
Looking at the mean for the organization compared to the network, we see that the students at this organization generally perform below average on SEL indicators, except for Interpersonal Skills. If I were advising this organization, I would suggest that they direct more resources towards enhancing their students' problem-solving skills, as well as their academic behaviors in general.
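The comparison read off the graph can also be computed directly by subtracting the network columns from the organization columns. The sketch below uses the column names from the output dataframe, but the numbers are made up for illustration and `toy_output` is a stand-in, so the result says nothing about the real organization.

```python
import pandas as pd

# Toy stand-in for `output` (made-up values, 2 indicators)
toy_output = pd.DataFrame(
    {"Male Mean": [3.2, 3.0], "Male Mean, Network": [3.4, 2.9],
     "Female Mean": [3.5, 3.1], "Female Mean, Network": [3.6, 3.3]},
    index=["AcaBeh", "Intp"])

# Negative values mean the organization scores below the network average
gaps = pd.DataFrame({
    "Male gap": toy_output["Male Mean"] - toy_output["Male Mean, Network"],
    "Female gap": toy_output["Female Mean"] - toy_output["Female Mean, Network"],
})

# Rank indicators from largest shortfall to smallest, averaging both genders
worst = gaps.mean(axis=1).sort_values()
print(gaps.round(2))
print(worst.round(2))
```

Sorting the gaps this way gives a ranked list of indicators to prioritize, which may be easier for a partner organization to act on than comparing bar heights by eye.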