Charter school identities and outcomes in the accountability era:
Preliminary results

April 19th, 2017
By Jaren Haber, PhD Candidate, Dept. of Sociology, UC Berkeley

[Image: outdated graphic, courtesy of U.S. News & World Report, 2009]

Research questions

How are charter schools different from each other in terms of ideology? How do these differences shape their survival and their outcomes, and what does this reveal about current educational policy?

The corpus

  • Website self-descriptions of all 6,753 charter schools open in 2014-15 (identified using the NCES Public School Universe Survey)
  • Charter school websites are a publicly visible proclamation of identity attempting to impress parents, regulators, etc.
  • This study is the first to use this contemporary, comprehensive data source on U.S. charter school identities
  • My research team and I are using BeautifulSoup and requests.get to webscrape the full sample

Motivation

  • Too much focus on test scores in education, too little on organizational aspects
  • Are charter schools innovative? How?
  • How does educational policy shape ed. philosophy? Organization? Outcomes?
  • No one has studied charters' public image as expressed in their OWN words

Methods

  • NLP: Word frequencies, distinctive words, etc.
  • Supervised: Custom dictionaries
  • Unsupervised: Topic models, word embeddings
  • Later: statistical regression to test, e.g., whether progressivist schools perform better in liberal communities than they do in other places

Preliminary analysis: website self-descriptions of non-random sample of 196 schools

  • Early-stage sample: NOT representative!
  • About half randomly selected, half tracked down (many through Internet Archive) because of missing URLs
  • Closed schools over-represented

Preliminary conclusions:

Word counts:

  • Website self-descriptions tend to be longest for schools in mid-sized cities and suburbs, followed by other urban and suburban schools, then schools in towns; rural schools tend to have the shortest
  • Charter schools in cities and suburbs have the highest textual redundancy (lowest ratio of types to tokens)

Word embeddings:

  • The two educational philosophies I'm interested in--progressivism and essentialism--can be distinguished using semantic vectors
  • A useful way to create and check my dictionaries

Topic modeling:

  • Urban charter schools' websites emphasize GOALS (topic 0)
  • Suburban charter schools' websites emphasize CURRICULUM (topic 1) in addition to goals

Next steps:

  • Working with custom dictionaries, POS tagging
  • Webscraping and parsing HTML to get full sample
  • Match website text with data on test scores and community characteristics (e.g., race, class, political leanings) --> test hypotheses with statistical regression

  • More long-term: Collect longitudinal mission statement data from the Internet Archive --> look at survival and geographic dispersion of identity categories over time (especially pre-NCLB if possible)

In [1]:
# The keyword categories to help parse website text:
mission = ['mission',' vision ', 'vision:', 'mission:', 'our purpose', 'our ideals', 'ideals:', 'our cause', 'cause:', 'goals', 'objective']
curriculum = ['curriculum', 'curricular', 'program', 'method', 'pedagogy', 'pedagogical', 'approach', 'model', 'system', 'structure']
philosophy = ['philosophy', 'philosophical', 'beliefs', 'believe', 'principles', 'creed', 'credo', 'value',  'moral']
history = ['history', 'our story', 'the story', 'school story', 'background', 'founding', 'founded', 'established', 'establishment', 'our school began', 'we began', 'doors opened', 'school opened']
general =  ['about us', 'our school', 'who we are', 'overview', 'general information', 'our identity', 'profile', 'highlights']
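
How these keyword lists get applied is not shown in this notebook. As a minimal sketch, assuming they are meant to flag which kind of self-description a page or link contains, one could match them against lower-cased text. The function name `match_categories` and the shortened lists below are hypothetical, for illustration only:

```python
# Hypothetical sketch: tag a snippet of website text with the keyword
# categories it mentions. The lists are shortened copies of the ones above.
keyword_categories = {
    'mission': ['mission', ' vision ', 'our purpose', 'goals'],
    'curriculum': ['curriculum', 'pedagogy', 'approach'],
    'history': ['history', 'our story', 'founded'],
}

def match_categories(text, categories=keyword_categories):
    """Return the sorted names of all categories with at least one keyword in text."""
    padded = ' ' + text.lower() + ' '  # pad so keywords like ' vision ' can match at the edges
    return sorted(name for name, keywords in categories.items()
                  if any(keyword in padded for keyword in keywords))

print(match_categories("Our Story: founded in 2005 with a mission to serve"))
```

Plain substring matching is crude (it ignores word boundaries except where a keyword carries its own spaces), so results like these would still need manual checking.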

Initializing Python


In [2]:
#!/usr/bin/env python
# -*- coding: UTF-8 -*-

In [3]:
# IMPORTING KEY PACKAGES
import csv # for reading in CSVs and turning them into dictionaries
import re # for regular expressions
import os # for navigating file trees
import nltk # for natural language processing tools
import pandas # for working with dataframes
import numpy as np # for working with numbers

In [4]:
# FOR CLEANING, TOKENIZING, AND STEMMING THE TEXT
from nltk import word_tokenize, sent_tokenize # widely used text tokenizer
from nltk.stem.porter import PorterStemmer # an approximate method of stemming words (it just cuts off the ends)
from nltk.corpus import stopwords # for one method of eliminating stop words, to clean the text
stopenglish = list(stopwords.words("english")) # store the list of English stopwords in a variable
import string # for one method of eliminating punctuation
punctuations = list(string.punctuation) # assign the string of common punctuation symbols to a variable and turn it into a list

In [5]:
# FOR ANALYZING THE TEXT
from sklearn.feature_extraction.text import CountVectorizer # for building document-term matrices
countvec = CountVectorizer(tokenizer=nltk.word_tokenize)
from sklearn.feature_extraction.text import TfidfVectorizer # for creating TF-IDFs
tfidfvec = TfidfVectorizer()
from sklearn.decomposition import LatentDirichletAllocation # for topic modeling

import gensim # for word embedding models
from scipy.spatial.distance import cosine # for cosine similarity
from sklearn.metrics import pairwise # for pairwise similarity
from sklearn.manifold import MDS, TSNE # for multi-dimensional scaling

In [6]:
# FOR VISUALIZATIONS
import matplotlib
import matplotlib.pyplot as plt

# Visualization parameters
%pylab inline
%matplotlib inline
matplotlib.style.use('ggplot')


Populating the interactive namespace from numpy and matplotlib

Reading in preliminary data


In [7]:
sample = [] # make empty list
with open('../data_URAP_etc/mission_data_prelim.csv', 'r', encoding='Latin-1') as csvfile: # open file
    reader = csv.DictReader(csvfile) # create a reader
    for row in reader: # loop through rows
        sample.append(row) # append each row to the list

In [8]:
sample[0]


Out[8]:
{'ADDRESS': '308 SOUTH BLAKE ST, PINE BLUFF, AR',
 'AM': '0',
 'AM01F': '-2',
 'AM01M': '-2',
 'AM02F': '-2',
 'AM02M': '-2',
 'AM03F': '-2',
 'AM03M': '-2',
 'AM04F': '-2',
 'AM04M': '-2',
 'AM05F': '0',
 'AM05M': '0',
 'AM06F': '0',
 'AM06M': '0',
 'AM07F': '0',
 'AM07M': '0',
 'AM08F': '0',
 'AM08M': '0',
 'AM09F': '-2',
 'AM09M': '-2',
 'AM10F': '-2',
 'AM10M': '-2',
 'AM11F': '-2',
 'AM11M': '-2',
 'AM12F': '-2',
 'AM12M': '-2',
 'AMALF': '0',
 'AMALM': '0',
 'AMKGF': '-2',
 'AMKGM': '-2',
 'AMPKF': '-2',
 'AMPKM': '-2',
 'AMUGF': '-2',
 'AMUGM': '-2',
 'AS01F': '-2',
 'AS01M': '-2',
 'AS02F': '-2',
 'AS02M': '-2',
 'AS03F': '-2',
 'AS03M': '-2',
 'AS04F': '-2',
 'AS04M': '-2',
 'AS05F': '0',
 'AS05M': '0',
 'AS06F': '0',
 'AS06M': '0',
 'AS07F': '0',
 'AS07M': '0',
 'AS08F': '0',
 'AS08M': '0',
 'AS09F': '-2',
 'AS09M': '-2',
 'AS10F': '-2',
 'AS10M': '-2',
 'AS11F': '-2',
 'AS11M': '-2',
 'AS12F': '-2',
 'AS12M': '-2',
 'ASALF': '0',
 'ASALM': '0',
 'ASIAN': '0',
 'ASKGF': '-2',
 'ASKGM': '-2',
 'ASPKF': '-2',
 'ASPKM': '-2',
 'ASUGF': '-2',
 'ASUGM': '-2',
 'BIES': '2',
 'BL01F': '-2',
 'BL01M': '-2',
 'BL02F': '-2',
 'BL02M': '-2',
 'BL03F': '-2',
 'BL03M': '-2',
 'BL04F': '-2',
 'BL04M': '-2',
 'BL05F': '5',
 'BL05M': '12',
 'BL06F': '13',
 'BL06M': '11',
 'BL07F': '14',
 'BL07M': '11',
 'BL08F': '13',
 'BL08M': '11',
 'BL09F': '-2',
 'BL09M': '-2',
 'BL10F': '-2',
 'BL10M': '-2',
 'BL11F': '-2',
 'BL11M': '-2',
 'BL12F': '-2',
 'BL12M': '-2',
 'BLACK': '90',
 'BLALF': '45',
 'BLALM': '45',
 'BLKGF': '-2',
 'BLKGM': '-2',
 'BLPKF': '-2',
 'BLPKM': '-2',
 'BLUGF': '-2',
 'BLUGM': '-2',
 'CDCODE': '504',
 'CHARTAUTH1': '0500A',
 'CHARTAUTH2': '0500B',
 'CHARTR': '1',
 'CONAME': 'JEFFERSON COUNTY',
 'CONUM': '5069',
 'CUSTOMID': 'AR3542702',
 'FIPST': '5',
 'FRELCH': '77',
 'FTE': '5.01',
 'G01': '-2',
 'G01OFFRD': '2',
 'G02': '-2',
 'G02OFFRD': '2',
 'G03': '-2',
 'G03OFFRD': '2',
 'G04': '-2',
 'G04OFFRD': '2',
 'G05': '18',
 'G05OFFRD': '1',
 'G06': '25',
 'G06OFFRD': '1',
 'G07': '25',
 'G07OFFRD': '1',
 'G08': '24',
 'G08OFFRD': '1',
 'G09': '-2',
 'G09OFFRD': '2',
 'G10': '-2',
 'G10OFFRD': '2',
 'G11': '-2',
 'G11OFFRD': '2',
 'G12': '-2',
 'G12OFFRD': '2',
 'GSHI': '8',
 'GSLO': '5',
 'HI01F': '-2',
 'HI01M': '-2',
 'HI02F': '-2',
 'HI02M': '-2',
 'HI03F': '-2',
 'HI03M': '-2',
 'HI04F': '-2',
 'HI04M': '-2',
 'HI05F': '1',
 'HI05M': '0',
 'HI06F': '1',
 'HI06M': '0',
 'HI07F': '0',
 'HI07M': '0',
 'HI08F': '0',
 'HI08M': '0',
 'HI09F': '-2',
 'HI09M': '-2',
 'HI10F': '-2',
 'HI10M': '-2',
 'HI11F': '-2',
 'HI11M': '-2',
 'HI12F': '-2',
 'HI12M': '-2',
 'HIALF': '2',
 'HIALM': '0',
 'HIKGF': '-2',
 'HIKGM': '-2',
 'HIPKF': '-2',
 'HIPKM': '-2',
 'HISP': '2',
 'HIUGF': '-2',
 'HIUGM': '-2',
 'HP01F': '-2',
 'HP01M': '-2',
 'HP02F': '-2',
 'HP02M': '-2',
 'HP03F': '-2',
 'HP03M': '-2',
 'HP04F': '-2',
 'HP04M': '-2',
 'HP05F': '0',
 'HP05M': '0',
 'HP06F': '0',
 'HP06M': '0',
 'HP07F': '0',
 'HP07M': '0',
 'HP08F': '0',
 'HP08M': '0',
 'HP09F': '-2',
 'HP09M': '-2',
 'HP10F': '-2',
 'HP10M': '-2',
 'HP11F': '-2',
 'HP11M': '-2',
 'HP12F': '-2',
 'HP12M': '-2',
 'HPALF': '0',
 'HPALM': '0',
 'HPKGF': '-2',
 'HPKGM': '-2',
 'HPPKF': '-2',
 'HPPKM': '-2',
 'HPUGF': '-2',
 'HPUGM': '-2',
 'ISFLE': 'PS',
 'ISFTEPUP': 'PS',
 'ISMEMPUP': 'PS',
 'ISPELM': 'PS',
 'ISPFEMALE': 'PS',
 'ISPWHITE': 'PS',
 'KG': '-2',
 'KGOFFRD': '2',
 'LATCOD': '34.2275',
 'LCITY': 'PINE BLUFF',
 'LEAID': '500410',
 'LEANM': 'RESPONSIVE ED SOLUTIONS QUEST MIDDLE SCHOOL OF PINE BLUFF',
 'LEVEL': '2',
 'LONCOD': '-92.0436',
 'LSTATE': 'AR',
 'LSTREE': '308 SOUTH BLAKE ST',
 'LZIP': '71601',
 'LZIP4': '',
 'MAGNET': '2',
 'MCITY': 'PINE BLUFF',
 'MEMBER': '92',
 'MSTATE': 'AR',
 'MSTREE': '308 SOUTH BLAKE ST',
 'MZIP': '71601',
 'MZIP4': '',
 'NCESSCH': '50041001581',
 'NSLPSTATUS': 'NSLPWOPRO',
 'PACIFIC': '0',
 'PHONE': '8703293310',
 'PK': '-2',
 'PKOFFRD': '2',
 'RECONSTF': '2',
 'RECONSTY': 'N',
 'REDLCH': '5',
 'SCHNAM': 'QUEST MIDDLE SCHOOL OF PINE BLUFF',
 'SCHNO': '1581',
 'SEARCH': 'QUEST MIDDLE SCHOOL OF PINE BLUFF 308 SOUTH BLAKE ST, PINE BLUFF, AR',
 'SEASCH': '3542702',
 'SFLE': '2',
 'SFTEPUP': '2',
 'SHARED': '2',
 'SMEMPUP': '2',
 'SPELM': '2',
 'SPFEMALE': '2',
 'SPWHITE': '2',
 'STATUS': '3',
 'STID': '3542700',
 'STITLI': '1',
 'SURVYEAR': '2013',
 'TITLEI': '1',
 'TITLEISTAT': '3',
 'TOTETH': '92',
 'TOTFRL': '82',
 'TR': '0',
 'TR01F': '-2',
 'TR01M': '-2',
 'TR02F': '-2',
 'TR02M': '-2',
 'TR03F': '-2',
 'TR03M': '-2',
 'TR04F': '-2',
 'TR04M': '-2',
 'TR05F': '0',
 'TR05M': '0',
 'TR06F': '0',
 'TR06M': '0',
 'TR07F': '0',
 'TR07M': '0',
 'TR08F': '0',
 'TR08M': '0',
 'TR09F': '-2',
 'TR09M': '-2',
 'TR10F': '-2',
 'TR10M': '-2',
 'TR11F': '-2',
 'TR11M': '-2',
 'TR12F': '-2',
 'TR12M': '-2',
 'TRALF': '0',
 'TRALM': '0',
 'TRKGF': '-2',
 'TRKGM': '-2',
 'TRPKF': '-2',
 'TRPKM': '-2',
 'TRUGF': '-2',
 'TRUGM': '-2',
 'TYPE': '1',
 'UG': '-2',
 'UGOFFRD': '2',
 'ULOCAL': '13',
 'UNION': '0',
 'URL': 'http://responsiveed.com/questpinebluff/',
 'VIRTUALSTAT': 'VIRTUALNO',
 'WEBTEXT': 'Quest Middle Schools¨ are schools focused on high expectations for behavior and academics. Students must work hard to meet their goals. To fully succeed in a Quest Middle School, students must consistently show leadership skills, good behavior, and a work ethic to meet expectations. Beyond this, Quest Schools provides curriculum designed to teach wisdom. Knowledge is crucial, but wisdom is a vital part of a middle school studentÕs growth and maturity. Character education is taught at all levels. Students are taught leadership skills through our 7 Habits of Highly Effective Teens* environment.\nOur administrators and teachers care about students and have a passion to see them reach their full potential. While providing quality education for all students, Quest educators collaborate to make sure each child receives the attention necessary to be successful. Quest provides a safe environment committed to learning. Educators work with students and parents to meet the rigorous academic standards. Quest Combines the Teaching of Knowledge and Wisdom.\nQuest Middle Schools use a variety of curriculum to ensure that middle school students have a solid foundation of content learning above traditional curriculum.\nBeyond this, Quest Middle Schools provide curriculum designed to teach wisdom. Knowledge is crucial, but wisdom is a vital part of a middle school studentÕs growth and maturity. Character education is taught at all levels. Students are taught leadership skills through our 7 Habits of Highly Effective Teens* environment.\nQuest has a Private School Atmosphere Without the Tuition Cost.\nThe campus is dedicated to the idea that education can have a great connection with the home and family. Though the atmosphere feels like a private school, there is no tuition to attend a Quest Middle School.\nQuest is a public school chartered by the State Board of Education. 
As a public school, the campus has the responsibility to ensure all students meet the standards created by the Texas Education Agency. ',
 'WH01F': '-2',
 'WH01M': '-2',
 'WH02F': '-2',
 'WH02M': '-2',
 'WH03F': '-2',
 'WH03M': '-2',
 'WH04F': '-2',
 'WH04M': '-2',
 'WH05F': '0',
 'WH05M': '0',
 'WH06F': '0',
 'WH06M': '0',
 'WH07F': '0',
 'WH07M': '0',
 'WH08F': '0',
 'WH08M': '0',
 'WH09F': '-2',
 'WH09M': '-2',
 'WH10F': '-2',
 'WH10M': '-2',
 'WH11F': '-2',
 'WH11M': '-2',
 'WH12F': '-2',
 'WH12M': '-2',
 'WHALF': '0',
 'WHALM': '0',
 'WHITE': '0',
 'WHKGF': '-2',
 'WHKGM': '-2',
 'WHPKF': '-2',
 'WHPKM': '-2',
 'WHUGF': '-2',
 'WHUGM': '-2'}
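
The WEBTEXT above contains stray characters such as `studentÕs` and `Schools¨`. These look like Mac OS Roman bytes that were decoded as Latin-1 (an assumption; the file's true encoding is not documented here). A minimal sketch of how such text could be repaired after the fact, with `repair_mojibake` as a hypothetical helper name:

```python
# Sketch, assuming the CSV was saved as Mac OS Roman but read as Latin-1.
def repair_mojibake(text):
    """Re-encode as Latin-1 to recover the raw bytes, then decode as Mac OS Roman."""
    return text.encode('latin-1').decode('mac_roman')

print(repair_mojibake('studentÕs growth'))  # the stray character becomes a curly apostrophe
```

If the assumption holds, passing `encoding='mac_roman'` when reading the CSV would avoid the problem at the source; the outputs in this notebook were produced with Latin-1 and so still show the stray characters.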

In [9]:
# Take a look at the most important contents and the variables list
# in our sample (a list of dictionaries)--let's look at just the second entry (index 1)
print(sample[1]["SCHNAM"], "\n", sample[1]["URL"], "\n", sample[1]["WEBTEXT"], "\n")
print(sample[1].keys()) # look at all the variables!


THE ACADEMIES AT JONESBORO HIGH SCHOOL 
 http://www.jonesboroschools.net/schools/academies_at_jonesboro_high_school 
 The mission of the Academies at Jonesboro High School is to provide a high quality, research-based education for all students in order to equip them with the essential skills necessary to be successful in todayÕs changing global community. Through strong partnerships with business and community stakeholders, the Academies at Jonesboro High School will ensure high achievement in all subjects through an expanded curriculum and the use of data-driven methods to evaluate and implement proven instructional strategies. The Academies at JHS will foster respect for global diversity and maintain a commitment to create exceptional opportunities for the educational growth of every child.  Excellence is our Standard, not our Goal, for All Students   

dict_keys(['BL06F', 'BLPKM', 'HIALM', 'TR05F', 'HI12F', 'AM01F', 'HI01M', 'ASPKM', 'RECONSTY', 'AS02M', 'HPKGM', 'BL12F', 'BL10F', 'TR03F', 'FRELCH', 'PHONE', 'AM03M', 'AS05M', 'AS08M', 'SPELM', 'TR09F', 'HIPKF', 'AS08F', 'SHARED', 'AS11F', 'G01', 'AM06F', 'MSTATE', 'HP04M', 'SURVYEAR', 'AM04F', 'HI05F', 'HP09M', 'AS09F', 'AM11M', 'G10OFFRD', 'TR06F', 'HI10M', 'G07', 'HP12F', 'HPUGF', 'AM02F', 'HP07M', 'AS05F', 'FIPST', 'HIKGM', 'ASIAN', 'CHARTAUTH1', 'HI07F', 'HI06M', 'TRPKF', 'HP11M', 'BL09M', 'HI11F', 'LSTATE', 'LEAID', 'G02', 'TR08F', 'TR10F', 'AM05M', 'WH10M', 'AM07F', 'TR', 'WH02M', 'G12OFFRD', 'AM12M', 'TR01M', 'AS02F', 'HP06F', 'HP03M', 'TR06M', 'HP12M', 'RECONSTF', 'WH07F', 'HPKGF', 'WH09F', 'AS06M', 'G01OFFRD', 'STID', 'HIUGM', 'HIKGF', 'HI08F', 'SMEMPUP', 'MCITY', 'WH06F', 'ISPWHITE', 'BL08M', 'G06', 'AS11M', 'HIUGF', 'AM04M', 'HI04F', 'TYPE', 'G06OFFRD', 'WH12F', 'WEBTEXT', 'PKOFFRD', 'G05', 'AM10F', 'TR02M', 'HPALF', 'G03OFFRD', 'HP05M', 'TR11M', 'TRUGM', 'AS12F', 'BL03M', 'HI08M', 'TR02F', 'HI09F', 'BL02F', 'AM05F', 'AMKGF', 'HP01F', 'LEVEL', 'HI07M', 'STITLI', 'HP06M', 'TR04M', 'TITLEISTAT', 'WH08F', 'G10', 'AMUGM', 'BL09F', 'LATCOD', 'WH06M', 'WH12M', 'HIALF', 'HI09M', 'BL07F', 'HISP', 'G12', 'TRKGF', 'AS04M', 'HI04M', 'GSLO', 'TRKGM', 'HI03F', 'CONUM', 'WH01F', 'BL01M', 'ASUGF', 'MZIP', 'WH05M', 'BL04F', 'SCHNO', 'WH10F', 'HP05F', 'LZIP4', 'ISPFEMALE', 'HI05M', 'TR05M', 'HP04F', 'AM02M', 'ASUGM', 'TR08M', 'BL03F', 'AM', 'WHALF', 'BLPKF', 'AMALF', 'BL10M', 'WHPKF', 'TOTETH', 'CDCODE', 'WHUGF', 'CUSTOMID', 'HPPKF', 'WHKGF', 'G08', 'BLALF', 'AMKGM', 'WH03M', 'HI06F', 'WHITE', 'WHALM', 'TOTFRL', 'TR04F', 'ISMEMPUP', 'AS10F', 'WHPKM', 'G07OFFRD', 'G03', 'MSTREE', 'LEANM', 'WHUGM', 'AS06F', 'AS03M', 'SCHNAM', 'G04OFFRD', 'SPWHITE', 'ISFLE', 'SFTEPUP', 'G04', 'TR11F', 'AMPKF', 'SFLE', 'BL01F', 'HP02F', 'AM08M', 'HPPKM', 'HI01F', 'HPUGM', 'BL11M', 'MAGNET', 'AM06M', 'TRPKM', 'ISPELM', 'BL06M', 'BLUGM', 'BLACK', 'AMPKM', 'WH04M', 'ASKGF', 'MEMBER', 'HI10F', 
'AM09F', 'PACIFIC', 'AMUGF', 'TR12M', 'WHKGM', 'HI02F', 'ULOCAL', 'CHARTR', 'HI12M', 'KGOFFRD', 'G11', 'AM09M', 'WH09M', 'HPALM', 'NSLPSTATUS', 'TR12F', 'ASPKF', 'WH11M', 'UGOFFRD', 'AM07M', 'WH04F', 'ISFTEPUP', 'HI11M', 'WH03F', 'STATUS', 'AM08F', 'BL04M', 'WH11F', 'HP01M', 'BL05F', 'UNION', 'UG', 'WH02F', 'TR09M', 'HIPKM', 'HP08F', 'HP08M', 'SEARCH', 'G05OFFRD', 'FTE', 'WH01M', 'REDLCH', 'AS01F', 'WH07M', 'AS12M', 'KG', 'ASKGM', 'VIRTUALSTAT', 'AS10M', 'HI03M', 'BL11F', 'LONCOD', 'BL07M', 'TR01F', 'AM11F', 'BL02M', 'BL05M', 'HP03F', 'CONAME', 'SPFEMALE', 'BIES', 'ASALF', 'AM03F', 'HP07F', 'LSTREE', 'TR03M', 'HP10M', 'G09', 'HP11F', 'TR10M', 'HP09F', 'TR07F', 'WH05F', 'PK', 'BL12M', 'AS03F', 'AS07F', 'ADDRESS', 'HP10F', 'MZIP4', 'TITLEI', 'BLUGF', 'G08OFFRD', 'HP02M', 'BLKGF', 'BL08F', 'LZIP', 'AS01M', 'NCESSCH', 'G02OFFRD', 'G11OFFRD', 'AS07M', 'AMALM', 'TR07M', 'CHARTAUTH2', 'URL', 'TRALF', 'LCITY', 'AM10M', 'SEASCH', 'AS09M', 'WH08M', 'AM01M', 'GSHI', 'TRALM', 'BLALM', 'G09OFFRD', 'BLKGM', 'HI02M', 'ASALM', 'AS04F', 'TRUGF', 'AM12F'])

In [10]:
# Read the data in as a pandas dataframe
df = pandas.read_csv("../data_URAP_etc/mission_data_prelim.csv", encoding = 'Latin-1')
df = df.dropna(subset=["WEBTEXT"]) # drop any schools with no webtext that might have snuck in (none currently)

In [11]:
# Add additional variables for analysis:
# PCTETH = percentage of enrolled students belonging to a racial minority
# this includes American Indian, Asian, Hispanic, Black, Hawaiian, or Pacific Islander
df["PCTETH"] = (df["AM"] + df["ASIAN"] + df["HISP"] + df["BLACK"] + df["PACIFIC"]) / df["MEMBER"]

df["STR"] = df["MEMBER"] / df["FTE"] # Student/teacher ratio
df["PCTFRPL"] = df["TOTFRL"] / df["MEMBER"] # Percent of students receiving FRPL

# Another interesting variable: 
# TYPE = type of school, where 1 = regular, 2 = special ed, 3 = vocational, 4 = other/alternative, 5 = reportable program

In [12]:
## Print the webtext from the first school in the dataframe
print(df.iloc[0]["WEBTEXT"])


Quest is a public school chartered by the State Board of Education. As a public school, the campus has the responsibility to ensure all students meet the standards created by the Texas Education Agency. 

Descriptive statistics

How urban proximity is coded: Lower number = more urban (closer to large city)

More specifically, it uses two digits with distinct meanings:

  • the first digit:
    • 1 = city
    • 2 = suburb
    • 3 = town
    • 4 = rural
  • the second digit:
    • 1 = large or fringe
    • 2 = mid-size or distant
    • 3 = small/remote
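
The two-digit scheme above can be unpacked in code. This is a minimal sketch, assuming that the second digit describes size for cities and suburbs and distance for towns and rural areas; the function name `decode_ulocal` and the label dictionaries are hypothetical:

```python
# Sketch: decode a two-digit urban-locale code into readable labels.
# The label split (size for cities/suburbs, distance for towns/rural areas)
# is an assumption based on the coding described above.
AREA_TYPES = {1: 'city', 2: 'suburb', 3: 'town', 4: 'rural'}
SIZE_LABELS = {1: 'large', 2: 'mid-size', 3: 'small'}        # cities and suburbs
DISTANCE_LABELS = {1: 'fringe', 2: 'distant', 3: 'remote'}   # towns and rural areas

def decode_ulocal(code):
    """Return (area type, size or distance label) for a code such as 11 or 42."""
    first, second = divmod(int(code), 10)  # split into first and second digit
    labels = SIZE_LABELS if first in (1, 2) else DISTANCE_LABELS
    return AREA_TYPES[first], labels[second]

for code in (11, 21, 33, 42):
    print(code, decode_ulocal(code))
```

Readable labels like these could replace the raw codes on the axes of the bar charts that group schools by ULOCAL.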

In [13]:
print(df.describe()) # get descriptive statistics for all numerical columns
print()
print(df['ULOCAL'].value_counts()) # frequency counts for categorical data
print()
print(df['LEVEL'].value_counts()) # treat grade range served as categorical
# Codes for level/grade range served: 1 = Elementary, 2 = Middle school, 3 = High school, 4 = Other
print()
print(df['LSTATE'].mode()) # find the most common state represented in these data
print(df['ULOCAL'].mode()) # find the most common urbanicity code in these data
# print(df['FTE'].mean()) # What's the average number of full-time employees by school?
# print(df['STR'].mean()) # And the average student-teacher ratio?


       SURVYEAR       NCESSCH       FIPST           LEAID         SCHNO  \
count       196  1.960000e+02  196.000000      196.000000    196.000000   
mean       2013  2.510655e+11   25.035714  2511089.642857   5845.489796   
std           0  1.771059e+11   17.759468  1771302.884773   4214.934368   
min        2013  4.001010e+10    4.000000   400101.000000     16.000000   
25%        2013  6.402526e+10    6.000000   640252.500000   2243.500000   
50%        2013  2.500000e+11   25.000000  2500284.000000   4600.500000   
75%        2013  4.200000e+11   42.000000  4200094.500000   8657.250000   
max        2013  5.510000e+11   55.000000  5514220.000000  13727.000000   

              PHONE          MZIP        MZIP4         LZIP        LZIP4  \
count  1.960000e+02    196.000000   141.000000    196.00000   141.000000   
mean   5.800643e+09  61181.617347  2800.822695  61034.77551  2700.843972   
std    2.457133e+09  27190.843275  2819.455282  27277.06302  2724.089340   
min    2.022488e+09   1035.000000     0.000000   1035.00000     0.000000   
25%    3.236360e+09  34111.750000     0.000000  34126.25000     0.000000   
50%    6.022434e+09  70553.000000  2513.000000  70553.00000  2230.000000   
75%    8.033513e+09  85339.000000  3941.000000  85356.25000  3941.000000   
max    9.854463e+09  97497.000000  9999.000000  97497.00000  9999.000000   

          ...         PACIFIC       HPALM       HPALF          TR       TRALM  \
count     ...      196.000000  196.000000  196.000000  196.000000  196.000000   
mean      ...        1.142857    0.525510    0.510204    9.219388    4.387755   
std       ...       10.247201    5.406014    5.001015   18.875240    9.955421   
min       ...       -9.000000   -9.000000   -9.000000   -9.000000   -9.000000   
25%       ...        0.000000    0.000000    0.000000    1.000000    0.000000   
50%       ...        0.000000    0.000000    0.000000    4.000000    2.000000   
75%       ...        0.000000    0.000000    0.000000   11.250000    5.250000   
max       ...      142.000000   74.000000   68.000000  220.000000  116.000000   

            TRALF       TOTETH      PCTETH         STR     PCTFRPL  
count  196.000000   196.000000  196.000000  196.000000  196.000000  
mean     4.724490   303.316327    0.625529  -10.081974    0.577878  
std      9.383325   274.841366    1.059312  112.926532    0.319359  
min     -9.000000    -9.000000  -11.250000 -635.000000   -0.022222  
25%      0.000000   105.750000    0.324186   10.822209    0.286746  
50%      2.000000   235.500000    0.703642   15.118598    0.653312  
75%      6.000000   398.000000    0.955846   20.376488    0.877253  
max    104.000000  1542.000000    5.000000  632.000000    1.000000  

[8 rows x 297 columns]

11    78
21    36
13    21
12    19
41    11
42     9
33     6
32     6
23     4
31     3
22     3
Name: ULOCAL, dtype: int64

1    82
3    46
4    41
2    26
N     1
Name: LEVEL, dtype: int64

0    CA
dtype: object
0    11
dtype: int64
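
The negative minimums for PCTETH and STR in the table above suggest that placeholder codes (values such as -2 and -9) were treated as real counts when the ratios were computed. Below is a minimal plain-Python sketch of guarding a ratio against such codes, assuming every negative value marks missing or not-applicable data. The helper name `safe_ratio` is hypothetical:

```python
import math

def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when either value is a
    negative placeholder code or the denominator is zero."""
    if numerator < 0 or denominator <= 0:
        return math.nan
    return numerator / denominator

print(safe_ratio(82, 92))   # TOTFRL over MEMBER for the first school in the sample
print(safe_ratio(-9, 92))   # a placeholder code in the numerator gives NaN
```

In pandas the same idea could be applied column-wise by replacing negative entries with NaN before dividing, for example with `DataFrame.mask` on the relevant columns, so that these schools drop out of the means instead of distorting them.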

In [14]:
# here's the number of schools from each state, in a graph:
grouped_state = df.groupby('LSTATE')
grouped_state['WEBTEXT'].count().sort_values(ascending=True).plot(kind = 'bar', title='Schools mostly in CA, TX, AZ, FL--similar to national trend')
plt.show()



In [15]:
# and here's the number of schools in each urban category, in a graph:
grouped_urban = df.groupby('ULOCAL')
grouped_urban['WEBTEXT'].count().sort_values(ascending=True).plot(kind = 'bar', title='Most schools are in large cities or large suburbs')
plt.show()


What these numbers say about the charter schools in the sample:

  • Most are located in large cities, followed by large suburbs, then small and mid-sized cities, and then rural areas.
  • The means for percent minorities and students receiving free- or reduced-price lunch are both about 60%.
  • Most are in CA, TX, AZ, and FL
  • Primary schools are the largest group in the sample (82 of 196)

This suggests that the sample roughly tracks national patterns on these dimensions. In that sense, this non-random sample isn't so bad for a preliminary analysis.

Cleaning, tokenizing, and stemming the text


In [16]:
# Now we clean the webtext by rendering each word lower-case then removing punctuation. 
df['webtext_lc'] = df['WEBTEXT'].str.lower() # make the webtext lower case
df['webtokens'] = df['webtext_lc'].apply(nltk.word_tokenize) # tokenize the lower-case webtext by word
df['webtokens_nopunct'] = df['webtokens'].apply(lambda x: [word for word in x if word not in punctuations]) # remove punctuation

In [17]:
print(df.iloc[0]["webtokens"]) # the tokenized, lower-cased text (punctuation still included)


['quest', 'middle', 'schools¨', 'are', 'schools', 'focused', 'on', 'high', 'expectations', 'for', 'behavior', 'and', 'academics', '.', 'students', 'must', 'work', 'hard', 'to', 'meet', 'their', 'goals', '.', 'to', 'fully', 'succeed', 'in', 'a', 'quest', 'middle', 'school', ',', 'students', 'must', 'consistently', 'show', 'leadership', 'skills', ',', 'good', 'behavior', ',', 'and', 'a', 'work', 'ethic', 'to', 'meet', 'expectations', '.', 'beyond', 'this', ',', 'quest', 'schools', 'provides', 'curriculum', 'designed', 'to', 'teach', 'wisdom', '.', 'knowledge', 'is', 'crucial', ',', 'but', 'wisdom', 'is', 'a', 'vital', 'part', 'of', 'a', 'middle', 'school', 'studentõs', 'growth', 'and', 'maturity', '.', 'character', 'education', 'is', 'taught', 'at', 'all', 'levels', '.', 'students', 'are', 'taught', 'leadership', 'skills', 'through', 'our', '7', 'habits', 'of', 'highly', 'effective', 'teens*', 'environment', '.', 'our', 'administrators', 'and', 'teachers', 'care', 'about', 'students', 'and', 'have', 'a', 'passion', 'to', 'see', 'them', 'reach', 'their', 'full', 'potential', '.', 'while', 'providing', 'quality', 'education', 'for', 'all', 'students', ',', 'quest', 'educators', 'collaborate', 'to', 'make', 'sure', 'each', 'child', 'receives', 'the', 'attention', 'necessary', 'to', 'be', 'successful', '.', 'quest', 'provides', 'a', 'safe', 'environment', 'committed', 'to', 'learning', '.', 'educators', 'work', 'with', 'students', 'and', 'parents', 'to', 'meet', 'the', 'rigorous', 'academic', 'standards', '.', 'quest', 'combines', 'the', 'teaching', 'of', 'knowledge', 'and', 'wisdom', '.', 'quest', 'middle', 'schools', 'use', 'a', 'variety', 'of', 'curriculum', 'to', 'ensure', 'that', 'middle', 'school', 'students', 'have', 'a', 'solid', 'foundation', 'of', 'content', 'learning', 'above', 'traditional', 'curriculum', '.', 'beyond', 'this', ',', 'quest', 'middle', 'schools', 'provide', 'curriculum', 'designed', 'to', 'teach', 'wisdom', '.', 'knowledge', 'is', 'crucial', 
',', 'but', 'wisdom', 'is', 'a', 'vital', 'part', 'of', 'a', 'middle', 'school', 'studentõs', 'growth', 'and', 'maturity', '.', 'character', 'education', 'is', 'taught', 'at', 'all', 'levels', '.', 'students', 'are', 'taught', 'leadership', 'skills', 'through', 'our', '7', 'habits', 'of', 'highly', 'effective', 'teens*', 'environment', '.', 'quest', 'has', 'a', 'private', 'school', 'atmosphere', 'without', 'the', 'tuition', 'cost', '.', 'the', 'campus', 'is', 'dedicated', 'to', 'the', 'idea', 'that', 'education', 'can', 'have', 'a', 'great', 'connection', 'with', 'the', 'home', 'and', 'family', '.', 'though', 'the', 'atmosphere', 'feels', 'like', 'a', 'private', 'school', ',', 'there', 'is', 'no', 'tuition', 'to', 'attend', 'a', 'quest', 'middle', 'school', '.', 'quest', 'is', 'a', 'public', 'school', 'chartered', 'by', 'the', 'state', 'board', 'of', 'education', '.', 'as', 'a', 'public', 'school', ',', 'the', 'campus', 'has', 'the', 'responsibility', 'to', 'ensure', 'all', 'students', 'meet', 'the', 'standards', 'created', 'by', 'the', 'texas', 'education', 'agency', '.']

In [18]:
# Now we remove stopwords and stem, which cuts noise and merges inflected forms of the same word
stemmer = PorterStemmer() # create the stemmer once instead of once per word
df['webtokens_clean'] = df['webtokens_nopunct'].apply(lambda x: [word for word in x if word not in stopenglish]) # remove stopwords
df['webtokens_stemmed'] = df['webtokens_clean'].apply(lambda x: [stemmer.stem(word) for word in x]) # stem each remaining word

In [19]:
# Some analyses require a string version of the webtext without punctuation or numbers.
# To get this, we join together the cleaned and stemmed tokens created above, and then remove numbers and punctuation:
df['webtext_stemmed'] = df['webtokens_stemmed'].apply(lambda x: ' '.join(x)) # join the stemmed tokens into one string
df['webtext_stemmed'] = df['webtext_stemmed'].apply(lambda x: ''.join(char for char in x if char not in punctuations))
df['webtext_stemmed'] = df['webtext_stemmed'].apply(lambda x: ''.join(char for char in x if not char.isdigit()))

In [20]:
df['webtext_stemmed'][0]


Out[20]:
'quest middl schools¨ school focus high expect behavior academ student must work hard meet goal fulli succeed quest middl school student must consist show leadership skill good behavior work ethic meet expect beyond quest school provid curriculum design teach wisdom knowledg crucial wisdom vital part middl school studentõ growth matur charact educ taught level student taught leadership skill  habit highli effect teens environ administr teacher care student passion see reach full potenti provid qualiti educ student quest educ collabor make sure child receiv attent necessari success quest provid safe environ commit learn educ work student parent meet rigor academ standard quest combin teach knowledg wisdom quest middl school use varieti curriculum ensur middl school student solid foundat content learn tradit curriculum beyond quest middl school provid curriculum design teach wisdom knowledg crucial wisdom vital part middl school studentõ growth matur charact educ taught level student taught leadership skill  habit highli effect teens environ quest privat school atmospher without tuition cost campu dedic idea educ great connect home famili though atmospher feel like privat school tuition attend quest middl school quest public school charter state board educ public school campu respons ensur student meet standard creat texa educ agenc'

In [21]:
# Some analyses require tokenized sentences. I'll do this with the list of dictionaries.
# I'll use cleaned, tokenized sentences (with stopwords) to create both a dictionary variable and a separate list for word2vec

words_by_sentence = [] # initialize the list of tokenized sentences as an empty list
for school in sample:
    school["sent_toksclean"] = []
    school["sent_tokens"] = [word_tokenize(sentence) for sentence in sent_tokenize(school["WEBTEXT"])] 
    for sent in school["sent_tokens"]:
        school["sent_toksclean"].append([PorterStemmer().stem(word.lower()) for word in sent if (word not in punctuations)]) # for each word: stem, lower-case, and remove punctuations
        words_by_sentence.append([PorterStemmer().stem(word.lower()) for word in sent if (word not in punctuations)])

In [22]:
words_by_sentence[:2]


Out[22]:
[['quest',
  'middl',
  'schools¨',
  'are',
  'school',
  'focus',
  'on',
  'high',
  'expect',
  'for',
  'behavior',
  'and',
  'academ'],
 ['student', 'must', 'work', 'hard', 'to', 'meet', 'their', 'goal']]

Counting document lengths


In [23]:
# We can also count document lengths. I'll mostly use the version with punctuation removed but including stopwords,
# because stopwords are also part of these schools' public image/ self-presentation to potential parents, regulators, etc.

df['webstem_count'] = df['webtokens_stemmed'].apply(len) # word count of stemmed tokens (no stopwords or punctuation)
df['webpunct_count'] = df['webtokens_nopunct'].apply(len) # word count with stopwords still in there (but no punctuation)
df['webclean_count'] = df['webtokens_clean'].apply(len) # word count without stopwords or punctuation (unstemmed)

In [24]:
# For which urban status are website self-descriptions the longest?
print(grouped_urban['webpunct_count'].mean().sort_values(ascending=False))


ULOCAL
12    941.421053
22    780.000000
21    593.472222
41    576.909091
11    571.500000
13    530.809524
31    408.666667
33    364.500000
32    292.333333
42    257.444444
23    210.000000
Name: webpunct_count, dtype: float64

In [25]:
# here's the mean website self-description word count for schools grouped by urban proximity, in a graph
# (error bars use the standard deviation within each urban category, so they match the bars):
grouped_urban['webpunct_count'].mean().sort_values(ascending=True).plot(kind = 'bar', title='Schools in mid-sized cities and suburbs have longer self-descriptions than in fringe areas', yerr = grouped_urban["webpunct_count"].std())
plt.show()



In [26]:
# Look at 'FTE' (proxy for # administrators) grouped by urban proximity, to see whether it explains the pattern in description length
grouped_urban['FTE'].mean().sort_values(ascending=True).plot(kind = 'bar', title='Mean FTE by urban locale', yerr = grouped_urban["FTE"].std())
plt.show()



In [27]:
# Now let's calculate the type-token ratio (TTR) for each school, which compares
# the number of types (unique words used) with the number of words (including repetitions of words).

df['numtypes'] = df['webtokens_nopunct'].apply(lambda x: len(set(x))) # this is the number of unique words per site
df['TTR'] =  df['numtypes'] / df['webpunct_count'] # calculate TTR
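
To make the type-token ratio concrete, here is a minimal sketch on a made-up token list. The `type_token_ratio` helper and the toy tokens are assumptions for illustration and are not part of the analysis above. A lower TTR means more repetition.

```python
def type_token_ratio(tokens):
    """Return (number of unique tokens) / (total tokens); 0.0 for an empty list.

    Hypothetical helper, written only to illustrate the calculation.
    """
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Toy example: 8 tokens, of which 5 are distinct types
toy_tokens = ["students", "learn", "students", "grow",
              "students", "learn", "and", "lead"]
toy_ttr = type_token_ratio(toy_tokens)
print(toy_ttr)
```

TTR generally falls as a text gets longer, so differences in TTR between groups of schools may partly reflect the differences in description length shown earlier.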

In [28]:
# here's the mean TTR for schools grouped by urban category:
grouped_urban = df.groupby('ULOCAL')
grouped_urban['TTR'].mean().sort_values(ascending=True).plot(kind = 'bar', title='Charters in cities and suburbs have higher textual redundancy than in fringe areas', yerr = grouped_urban["TTR"].std())
plt.show()


(Excessively) Frequent words


In [29]:
# First, aggregate all the cleaned webtext into one flat list of tokens:
webtext_all = []
for tokens in df['webtokens_clean']:
    webtext_all.extend(tokens)
webtext_all[:20]


Out[29]:
['quest',
 'middle',
 'schools¨',
 'schools',
 'focused',
 'high',
 'expectations',
 'behavior',
 'academics',
 'students',
 'must',
 'work',
 'hard',
 'meet',
 'goals',
 'fully',
 'succeed',
 'quest',
 'middle',
 'school']

In [30]:
# Now apply the nltk function FreqDist to count the number of times each token occurs.
word_frequency = nltk.FreqDist(webtext_all)

#print out the 50 most frequent words using the function most_common
print(word_frequency.most_common(50))


[('students', 1739), ('school', 1661), ('learning', 736), ('education', 539), ('charter', 476), ('community', 470), ('student', 466), ('high', 415), ('program', 395), ('academic', 390), ('schools', 344), ('academy', 342), ('curriculum', 340), ('college', 328), ('skills', 320), ('teachers', 295), ('children', 280), ('grade', 267), ('environment', 241), ('provide', 238), ('success', 226), ('educational', 224), ('every', 217), ('work', 213), ('support', 207), ('staff', 207), ('leadership', 198), ('year', 198), ('arts', 190), ('parents', 185), ('development', 185), ('develop', 184), ('learn', 183), ('state', 182), ('public', 182), ('grades', 176), ('standards', 174), ('needs', 170), ('instruction', 169), ('core', 169), ('science', 168), ('world', 153), ('mission', 152), ('programs', 152), ('life', 151), ('new', 150), ('opportunities', 150), ('social', 148), ('one', 148), ('also', 146)]

These are prolific, ritual, empty words and will be excluded from topic models!
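
One possible way to drop such high-frequency words before modeling is sketched below with the standard library's `collections.Counter`. The toy documents and the cutoff `n_drop` are assumptions for illustration; the topic-model cell later in this notebook instead relies on the `max_df` and `min_df` settings of CountVectorizer.

```python
from collections import Counter

# Toy corpus of already-tokenized documents (made up for illustration)
toy_docs = [
    ["students", "school", "arts", "students"],
    ["students", "school", "science"],
    ["school", "students", "montessori"],
]

# Count every token across the whole corpus
counts = Counter(word for doc in toy_docs for word in doc)

# Treat the n_drop most common tokens as words to exclude
n_drop = 2  # assumed cutoff; the notebook does not specify one
too_frequent = {word for word, _ in counts.most_common(n_drop)}

# Remove the excluded words from every document
filtered_docs = [[w for w in doc if w not in too_frequent] for doc in toy_docs]
print(too_frequent)
print(filtered_docs)
```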

Distinctive words (mostly place names)


In [31]:
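# Build a document-term matrix (DTM) of raw word counts from the stemmed text.
# Printing the sparse matrix lists (document index, vocabulary index) pairs with their counts.
sklearn_dtm = countvec.fit_transform(df['webtext_stemmed'])
print(sklearn_dtm)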


  (0, 3848)	11
  (0, 3003)	8
  (0, 4223)	1
  (0, 4214)	13
  (0, 1883)	1
  (0, 2208)	1
  (0, 1703)	2
  (0, 456)	2
  (0, 25)	2
  (0, 4679)	9
  (0, 3130)	2
  (0, 5432)	3
  (0, 2157)	1
  (0, 2956)	4
  (0, 2045)	1
  (0, 1968)	1
  (0, 4716)	1
  (0, 1005)	1
  (0, 4380)	1
  (0, 2697)	3
  (0, 4438)	3
  (0, 2055)	1
  (0, 1646)	1
  (0, 480)	2
  (0, 3802)	4
  :	:
  (195, 1421)	1
  (195, 3963)	1
  (195, 3768)	1
  (195, 3894)	1
  (195, 2069)	1
  (195, 3020)	2
  (195, 4618)	1
  (195, 2060)	1
  (195, 3197)	1
  (195, 710)	1
  (195, 1605)	1
  (195, 1120)	1
  (195, 2224)	1
  (195, 401)	1
  (195, 5380)	2
  (195, 2102)	1
  (195, 4935)	1
  (195, 5576)	1
  (195, 320)	1
  (195, 5598)	1
  (195, 1704)	1
  (195, 3096)	1
  (195, 3391)	1
  (195, 4715)	1
  (195, 3810)	2

In [32]:
# What are some of the words in the DTM? 
print(countvec.get_feature_names()[:10])


['a', 'aaec', 'ab', 'abandon', 'abbi', 'abbott', 'abc', 'abernathi', 'abid', 'abil']

In [33]:
# now we can create the DTM again, but with cells weighted by the TF-IDF score.
dtm_tfidf_df = pandas.DataFrame(tfidfvec.fit_transform(df.webtext_stemmed).toarray(), columns=tfidfvec.get_feature_names(), index = df.index)

dtm_tfidf_df[:20] # let's take a look!


Out[33]:
aaec ab abandon abbi abbott abc abernathi abid abil abl ... ômi ôno ôsave ôsearchõ ôsign ôsigninõ ôsuper ôtapestryõ ôthreadsõ ôwatt
0 0.000000 0 0 0 0 0 0 0 0.000000 0 ... 0 0 0 0 0 0 0 0 0 0
1 0.000000 0 0 0 0 0 0 0 0.000000 0 ... 0 0 0 0 0 0 0 0 0 0
2 0.000000 0 0 0 0 0 0 0 0.000000 0 ... 0 0 0 0 0 0 0 0 0 0
3 0.413013 0 0 0 0 0 0 0 0.000000 0 ... 0 0 0 0 0 0 0 0 0 0
4 0.000000 0 0 0 0 0 0 0 0.000000 0 ... 0 0 0 0 0 0 0 0 0 0
5 0.000000 0 0 0 0 0 0 0 0.000000 0 ... 0 0 0 0 0 0 0 0 0 0
6 0.000000 0 0 0 0 0 0 0 0.000000 0 ... 0 0 0 0 0 0 0 0 0 0
7 0.000000 0 0 0 0 0 0 0 0.000000 0 ... 0 0 0 0 0 0 0 0 0 0
8 0.000000 0 0 0 0 0 0 0 0.031577 0 ... 0 0 0 0 0 0 0 0 0 0
9 0.000000 0 0 0 0 0 0 0 0.000000 0 ... 0 0 0 0 0 0 0 0 0 0
10 0.000000 0 0 0 0 0 0 0 0.000000 0 ... 0 0 0 0 0 0 0 0 0 0
11 0.000000 0 0 0 0 0 0 0 0.000000 0 ... 0 0 0 0 0 0 0 0 0 0
12 0.000000 0 0 0 0 0 0 0 0.000000 0 ... 0 0 0 0 0 0 0 0 0 0
13 0.000000 0 0 0 0 0 0 0 0.000000 0 ... 0 0 0 0 0 0 0 0 0 0
14 0.000000 0 0 0 0 0 0 0 0.047121 0 ... 0 0 0 0 0 0 0 0 0 0
15 0.000000 0 0 0 0 0 0 0 0.000000 0 ... 0 0 0 0 0 0 0 0 0 0
16 0.000000 0 0 0 0 0 0 0 0.000000 0 ... 0 0 0 0 0 0 0 0 0 0
17 0.000000 0 0 0 0 0 0 0 0.000000 0 ... 0 0 0 0 0 0 0 0 0 0
18 0.000000 0 0 0 0 0 0 0 0.000000 0 ... 0 0 0 0 0 0 0 0 0 0
19 0.000000 0 0 0 0 0 0 0 0.000000 0 ... 0 0 0 0 0 0 0 0 0 0

20 rows × 5629 columns


In [34]:
# What are the 20 words with the highest TF-IDF scores?
print(dtm_tfidf_df.max().sort_values(ascending=False)[:20])


we                0.959053
telesi            0.877403
wra               0.787935
action            0.775454
swp               0.733787
ywlc              0.732152
tapestri          0.720522
quest             0.718819
bam               0.710280
englishspanish    0.699484
waysid            0.698282
slam              0.693069
treknorth         0.688578
renaiss           0.675415
rcsa              0.667000
graham            0.664289
taylion           0.662956
ivi               0.648762
somerset          0.648687
scholar           0.647563
dtype: float64

Like the frequent words above, these highly "unique" words are empty of meaning and will be excluded from topic models!
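
To see why school names and acronyms end up with the highest scores, here is a hand-rolled TF-IDF sketch on toy documents. It uses the plain textbook formula tf × log(N / df); scikit-learn's TfidfVectorizer uses a smoothed and normalized variant, so its numbers will differ. The `tf_idf` helper and the toy data are assumptions for illustration.

```python
import math

# Toy tokenized documents (made up): one school repeats its own name a lot
toy_docs = {
    "school_a": ["quest", "quest", "quest", "students", "learn"],
    "school_b": ["students", "learn", "arts"],
    "school_c": ["students", "science", "learn"],
}

def tf_idf(term, doc_id, docs):
    """Textbook TF-IDF; assumes the term occurs in at least one document."""
    tokens = docs[doc_id]
    tf = tokens.count(term) / len(tokens)                   # share of this document
    df = sum(1 for toks in docs.values() if term in toks)   # documents containing the term
    idf = math.log(len(docs) / df)                          # rarer across documents -> larger
    return tf * idf

quest_score = tf_idf("quest", "school_a", toy_docs)
students_score = tf_idf("students", "school_a", toy_docs)
print(quest_score, students_score)
```

A word used heavily by one school and by no other gets a high score, while a word that appears in every document scores zero under this formula.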

Word Embeddings with word2vec

Word2Vec features

  • size: Number of dimensions of each word vector
  • window: Maximum number of context words considered on each side of the target word
  • min_count: Minimum frequency for a word to be included in the model
  • sg: '0' selects the CBOW model; '1' selects Skip-Gram
  • alpha: Initial learning rate; smaller values keep the model from over-correcting and allow finer tuning
  • iter: Number of passes through the dataset
  • batch_words: Target number of words per batch of examples handed to worker threads
  • workers: Number of worker threads; using a single worker helps make runs reproducible
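
To illustrate what the window setting controls, the sketch below lists the (target, context) pairs that a skip-gram model would draw from one sentence with a fixed window. This is a simplification written for illustration: the `skipgram_pairs` helper is hypothetical, and real implementations add details that are omitted here.

```python
def skipgram_pairs(tokens, window):
    """Return (target, context) pairs within `window` positions on each side."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

sentence = ["student", "must", "work", "hard"]
pairs_w1 = skipgram_pairs(sentence, window=1)
pairs_w2 = skipgram_pairs(sentence, window=2)
print(pairs_w1)
print(len(pairs_w1), len(pairs_w2))
```

A larger window yields more pairs per sentence and ties each word to more distant neighbors.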

In [35]:
# train the model, keeping only words that appear at least 2 times (min_count=2)
model = gensim.models.Word2Vec(words_by_sentence, size=100, window=5,
                               min_count=2, sg=1, alpha=0.025, iter=5, batch_words=10000, workers=1)

In [36]:
# dictionary of words in model (may not work for old gensim)
# print(len(model.vocab))
# model.vocab

In [37]:
# Find cosine similarity between two given word vectors (values closer to 1 mean more similar)
print(model.similarity('college-prep','align')) # these two are close to essentialism
print(model.similarity('emot', 'curios')) # these two are close to progressivism


0.914012837905
0.934911449012
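
The similarity scores above are cosine similarities between word vectors. As a minimal sketch of that calculation, assuming small made-up vectors rather than the model's 100-dimensional ones:

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]  # points the same way as vec_a
vec_c = [0.0, 0.0, 3.0]  # orthogonal to vec_a

sim_ab = cosine_similarity(vec_a, vec_b)
sim_ac = cosine_similarity(vec_a, vec_c)
print(sim_ab, sim_ac)
```

Cosine similarity ranges from -1 to 1 and depends only on the direction of the vectors, not their length.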

In [38]:
# create some rough dictionaries for our contrasting educational philosophies
essentialism = ['excel', 'perform', 'prep', 'rigor', 'standard', 'align', 'comprehens', 'content', \
                               'data-driven', 'market', 'research', 'research-bas', 'program', 'standards-bas']
progressivism = ['inquir', 'curios', 'project', 'teamwork', 'social', 'emot', 'reflect', 'creat',\
                'ethic', 'independ', 'discov', 'deep', 'problem-solv', 'natur']

In [39]:
# Let's look at two vectors that demonstrate the binary between these philosophies: align and emot
print(model.most_similar('align')) # words core to essentialism
print()
print(model.most_similar('emot')) # words core to progressivism


[('across', 0.962225079536438), ('common', 0.9599748253822327), ('design', 0.9472075700759888), ('compon', 0.9386098980903625), ('research-bas', 0.9384891986846924), ('sequenc', 0.9361364841461182), ('util', 0.9324259161949158), ('philosophi', 0.9307288527488708), ('exceed', 0.9298505187034607), ('pennsylvania', 0.9288857579231262)]

[('creativ', 0.9845424294471741), ('intellectu', 0.9743475914001465), ('strategi', 0.9633899927139282), ('tool', 0.9565004110336304), ('basic', 0.9564352035522461), ('compet', 0.9557984471321106), ('awar', 0.9557099342346191), ('practic', 0.9553041458129883), ('critic', 0.9519203901290894), ('studentsõ', 0.951266884803772)]

In [41]:
# Let's work with the binary between progressivism vs. essentialism
# first let's find the 50 words closest to each philosophy using the two 14-term dictionaries defined above
prog_words = model.most_similar(progressivism, topn=50)
prog_words = [word for word, similarity in prog_words]
for word in progressivism:
    prog_words.append(word)
print(prog_words[:20])


['deeper', 'acquir', 'appreci', 'disciplin', 'real-world', 'awar', 'cognit', 'trait', 'human', 'mind', 'differenti', 'defin', 'strengthen', 'play', 'authent', 'self-confid', 'show', 'studentsõ', 'explor', 'scientif']

In [42]:
ess_words = model.most_similar(essentialism, topn=50) # now let's get the 50 most similar words for our essentialist dictionary
ess_words = [word for word, similarity in ess_words]
for word in essentialism:
    ess_words.append(word)
print(ess_words[:20])


['acceler', 'blend', 'compon', 'rich', 'intens', 'sequenc', 'infus', 'coursework', 'framework', 'proven', 'college-preparatori', 'across', 'key', 'student-cent', 'fulli', 'industri', 'aim', 'util', 'self-pac', 'profici']

In [43]:
# construct a combined dictionary
phil_words = ess_words + prog_words

In [44]:
# preparing for visualizing this binary with word2vec
x = [model.similarity('emot', word) for word in phil_words]
y = [model.similarity('align', word) for word in phil_words]

In [45]:
# here's a visual of the progressivism/essentialism binary: 
# top-left half is essentialism, bottom-right half is progressivism
_, ax = plt.subplots(figsize=(20,20))
ax.scatter(x, y, alpha=1, color='b')
for i in range(len(phil_words)):
    ax.annotate(phil_words[i], (x[i], y[i]))
ax.set_xlim(.635, 1.005)
ax.set_ylim(.635, 1.005)
plt.plot([0, 1], [0, 1], linestyle='--');


Binary of essentialist (top-left) and progressivist (bottom-right) word vectors
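
The plot places each word by its similarity to 'emot' (x-axis) and to 'align' (y-axis), so a word above the dashed diagonal is closer to 'align' and a word below it is closer to 'emot'. A minimal sketch of that reading, using made-up similarity values and a hypothetical `lean` helper:

```python
# Made-up (similarity to 'emot', similarity to 'align') values for illustration
toy_scores = {
    "rigor": (0.80, 0.95),
    "curios": (0.97, 0.85),
    "framework": (0.88, 0.93),
}

def lean(x, y):
    """Label a point by which side of the y = x diagonal it falls on."""
    if y > x:
        return "essentialist"   # above the diagonal: closer to 'align'
    if x > y:
        return "progressivist"  # below the diagonal: closer to 'emot'
    return "neutral"            # exactly on the diagonal

leanings = {word: lean(x, y) for word, (x, y) in toy_scores.items()}
print(leanings)
```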

Topic Modeling with scikit-learn

For documentation on this topic modeling (TM) package, which uses Latent Dirichlet Allocation (LDA), see the scikit-learn documentation for LatentDirichletAllocation.

And for documentation on the vectorizer package, see the scikit-learn documentation for CountVectorizer.


In [46]:
#### Adapted from:
#Author: Olivier Grisel <olivier.grisel@ensta.org>
#         Lars Buitinck
#         Chyi-Kwei Yau <chyikwei.yau@gmail.com>
# License: BSD 3 clause

# Initialize the variables needed for the topic models
n_samples = 2000
n_topics = 3
n_top_words = 50

# Create helper function that prints out the top words for each topic in a pretty way
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("\nTopic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()
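
The slice `[:-n_top_words - 1:-1]` in the helper above can be hard to read: it walks backward from the end of the ascending sort order and so returns the indices of the largest weights first. A minimal sketch of the same idea with plain lists, where `sorted` stands in for `argsort` and the toy weights and words are made up:

```python
weights = [0.1, 2.5, 0.7, 1.9]
feature_names = ["arts", "science", "music", "math"]

# Indices sorted by ascending weight, which is what argsort returns
order = sorted(range(len(weights)), key=lambda i: weights[i])

# Take the last n_top indices in reverse order: largest weight first
n_top = 2
top_idx = order[:-n_top - 1:-1]
top_words = [feature_names[i] for i in top_idx]
print(top_idx, top_words)
```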

In [47]:
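# Vectorize our text using CountVectorizer
# max_df=70 drops words found in more than 70 schools; min_df=4 drops words found in fewer than 4
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=70, min_df=4,
                                max_features=None,
                                stop_words=stopenglish, lowercase=True
                                )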

tf = tf_vectorizer.fit_transform(df.WEBTEXT)


Extracting tf features for LDA...

In [48]:
print("Fitting LDA models with tf features, "
      "n_samples=%d and n_topics=%d..."
      % (n_samples, n_topics))

# define the lda function, with desired options
lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=20,
                                learning_method='online',
                                learning_offset=80.,
                                total_samples=n_samples,
                                random_state=0)
#fit the model
lda.fit(tf)


Fitting LDA models with tf features, n_samples=2000 and n_topics=3...
Out[48]:
LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='online', learning_offset=80.0,
             max_doc_update_iter=100, max_iter=20, mean_change_tol=0.001,
             n_jobs=1, n_topics=3, perp_tol=0.1, random_state=0,
             topic_word_prior=None, total_samples=2000, verbose=0)

In [49]:
# print the top words per topic, using the function defined above.

print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)


Topics in LDA model:

Topic #0:
scholars character child leadership new achievement district vision believe day values help programs self excellence families teacher ensure leaders strong respect preparatory individual small responsibility safe best others quality free learners take also process teaching campus board become family successful time texas experience reading rigorous meet focus math building prepare

Topic #1:
arts science technology leadership reading language action services provides music also young art writing middle 12 district programs day math history class time study career use english campus elementary areas classes international using knowledge course teacher understanding including activities focus new teaching first content physical personal service may challenging ib

Topic #2:
science new arts leaders summit also teacher leadership time teaching young throughout programs achievement math study approach part building class believe families focus use language elementary small renaissance child help parent experiences technology future activities reading day provides values instructional process quality best behavior rigorous art service 12 thinking experience

These topics seem to mean:

  • topic 0 relates to GOALS,
  • topic 1 relates to CURRICULUM, and
  • topic 2 relates to PHILOSOPHY or learning process (but this topic is less clear and more muddled)

In [50]:
# Preparation for looking at distribution of topics over schools
topic_dist = lda.transform(tf) # compute each document's topic weights (one row per school, one column per topic)
topic_dist_df = pandas.DataFrame(topic_dist) # turn into a df
df_w_topics = topic_dist_df.join(df) # merge with charter MS dataframe
df_w_topics[:20] # check out the merged df with topics!


Out[50]:
0 1 2 SCHNAM ADDRESS URL SEARCH CUSTOMID WEBTEXT LEVEL ... webtokens webtokens_nopunct webtokens_clean webtokens_stemmed webtext_stemmed webstem_count webpunct_count webclean_count numtypes TTR
0 131.261273 0.394623 0.344103 QUEST MIDDLE SCHOOL OF PINE BLUFF 308 SOUTH BLAKE ST, PINE BLUFF, AR http://responsiveed.com/questpinebluff/ QUEST MIDDLE SCHOOL OF PINE BLUFF 308 SOUTH BL... AR3542702 Quest Middle Schools¨ are schools focused on h... 2 ... [quest, middle, schools¨, are, schools, focuse... [quest, middle, schools¨, are, schools, focuse... [quest, middle, schools¨, schools, focused, hi... [quest, middl, schools¨, school, focus, high, ... quest middl schools¨ school focus high expect ... 200 314 200 139 0.442675
1 43.251431 0.404550 0.344018 THE ACADEMIES AT JONESBORO HIGH SCHOOL 301 HURRICANE DR, JONESBORO, AR http://www.jonesboroschools.net/schools/academ... THE ACADEMIES AT JONESBORO HIGH SCHOOL 301 HUR... AR1608703 The mission of the Academies at Jonesboro High... 3 ... [the, mission, of, the, academies, at, jonesbo... [the, mission, of, the, academies, at, jonesbo... [mission, academies, jonesboro, high, school, ... [mission, academi, jonesboro, high, school, pr... mission academi jonesboro high school provid h... 63 111 63 71 0.639640
2 23.290207 0.367458 0.342335 A CHILD'S VIEW SCHOOL 2846 DREXEL RD, TUCSON, AZ http://childcarecenter.us/provider_detail/a_ch... A CHILD'S VIEW SCHOOL 2846 DREXEL RD, TUCSON, AZ AZ87345 We believe that every child needs a well-round... 1 ... [we, believe, that, every, child, needs, a, we... [we, believe, that, every, child, needs, a, we... [believe, every, child, needs, well-rounded, e... [believ, everi, child, need, well-round, educ,... believ everi child need wellround educ famili ... 38 77 38 61 0.792208
3 13.599131 5.047494 0.353375 AAEC - PARADISE VALLEY 17811 NORTH 32ND ST, PHOENIX, AZ http://www.aaechighschools.com/ AAEC - PARADISE VALLEY 17811 NORTH 32ND ST, PH... AZ6344 AAEC Early College High School prepares young ... 3 ... [aaec, early, college, high, school, prepares,... [aaec, early, college, high, school, prepares,... [aaec, early, college, high, school, prepares,... [aaec, earli, colleg, high, school, prepar, yo... aaec earli colleg high school prepar young adu... 31 45 31 38 0.844444
4 24.045600 18.605429 0.348971 ABRAHAM LINCOLN TRADITIONAL SCHOOL 10444 NORTH 39TH AVE, PHOENIX, AZ http://abrahamlincoln.wesdschools.org/ ABRAHAM LINCOLN TRADITIONAL SCHOOL 10444 NORTH... AZ5274 The mission of the Abraham Lincoln Traditional... 1 ... [the, mission, of, the, abraham, lincoln, trad... [the, mission, of, the, abraham, lincoln, trad... [mission, abraham, lincoln, traditional, schoo... [mission, abraham, lincoln, tradit, school, gu... mission abraham lincoln tradit school guid cha... 75 106 75 71 0.669811
5 0.437738 93.213344 0.348919 ACADEMY DEL SOL 4525 EAST BROADWAY BLVD, TUCSON, AZ http://www.academydelsol.com/ ACADEMY DEL SOL 4525 EAST BROADWAY BLVD, TUCSO... AZ90200 Academy Del SolÕs mission is to provide a rigo... 1 ... [academy, del, solõs, mission, is, to, provide... [academy, del, solõs, mission, is, to, provide... [academy, del, solõs, mission, provide, rigoro... [academi, del, solõ, mission, provid, rigor, s... academi del solõ mission provid rigor superior... 147 226 147 140 0.619469
6 196.920592 33.728109 0.351299 ACADEMY OF TUCSON ELEMENTARY SCHOOL 9209 EAST WRIGHTSTOWN RD, TUCSON, AZ http://www.academyoftucson.com/elementary-scho... ACADEMY OF TUCSON ELEMENTARY SCHOOL 9209 EAST ... AZ81130 Mission:\rIt is the purpose of the Academy of ... 1 ... [mission, :, it, is, the, purpose, of, the, ac... [mission, it, is, the, purpose, of, the, acade... [mission, purpose, academy, tucson, provide, p... [mission, purpos, academi, tucson, provid, pre... mission purpos academi tucson provid prepar gr... 356 620 356 278 0.448387
7 24.971913 14.676298 0.351789 DESERT MOSAIC SCHOOL 5757 WEST AJO HWY, TUCSON, AZ http://desertmosaic.com/Home_Page.php DESERT MOSAIC SCHOOL 5757 WEST AJO HWY, TUCSON... AZ79118 Desert Mosaic School commits to creating a tea... 4 ... [desert, mosaic, school, commits, to, creating... [desert, mosaic, school, commits, to, creating... [desert, mosaic, school, commits, creating, te... [desert, mosaic, school, commit, creat, teach,... desert mosaic school commit creat teach enviro... 67 115 67 74 0.643478
8 179.458463 56.191420 0.350117 KAIZEN EDUCATION FOUNDATION DBA SUMMIT HIGH SC... 728 EAST MCDOWELL RD, PHOENIX, AZ http://www.summiths.com/ KAIZEN EDUCATION FOUNDATION DBA SUMMIT HIGH SC... AZ10749 Summit High SchoolÕs Mission and Vision is to ... 3 ... [summit, high, schoolõs, mission, and, vision,... [summit, high, schoolõs, mission, and, vision,... [summit, high, schoolõs, mission, vision, prov... [summit, high, schoolõ, mission, vision, provi... summit high schoolõ mission vision provid safe... 363 580 363 275 0.474138
9 47.900249 20.751915 0.347836 OASIS HIGH SCHOOL 8632 WEST NORTHERN AVE, GLENDALE, AZ https://web.archive.org/web/20120617204246/htt... OASIS HIGH SCHOOL 8632 WEST NORTHERN AVE, GLEN... AZ78955 Omega's mission is to provide an optimal teach... 3 ... [omega, 's, mission, is, to, provide, an, opti... [omega, 's, mission, is, to, provide, an, opti... [omega, 's, mission, provide, optimal, teachin... [omega, 's, mission, provid, optim, teach, lea... omega s mission provid optim teach learn envir... 133 221 133 137 0.619910
10 0.424860 138.224560 0.350580 SAGE ACADEMY 1055 EAST HEARN RD, SCOTTSDALE, AZ http://www.sage-academy.org/ SAGE ACADEMY 1055 EAST HEARN RD, SCOTTSDALE, AZ AZ89415 River Valley Charter School (RVCS) is a ... 1 ... [river, valley, charter, school, (, rvcs, ), i... [river, valley, charter, school, rvcs, is, a, ... [river, valley, charter, school, rvcs, public,... [river, valley, charter, school, rvc, public, ... river valley charter school rvc public charter... 206 341 206 185 0.542522
11 401.374463 198.273555 0.351982 SCHOOL FOR INTEGRATED ACADEMICS AND TECHNOLOGIES 518 SOUTH 3RD ST, PHOENIX, AZ http://www.siatech.org/ SCHOOL FOR INTEGRATED ACADEMICS AND TECHNOLOGI... AZ79450 As a nonprofit organization, we create the env... 3 ... [as, a, nonprofit, organization, ,, we, create... [as, a, nonprofit, organization, we, create, t... [nonprofit, organization, create, environments... [nonprofit, organ, creat, environ, tool, techn... nonprofit organ creat environ tool techniqu re... 964 1476 964 634 0.429539
12 9.411709 89.237486 0.350806 SEQUOIA VILLAGE SCHOOL 982 FULL HOUSE LN, SHOW LOW, AZ http://www.sequoiavillageschool.org/ SEQUOIA VILLAGE SCHOOL 982 FULL HOUSE LN, SHOW... AZ10848 Building a Better World One Student at a Time"... 1 ... [building, a, better, world, one, student, at,... [building, a, better, world, one, student, at,... [building, better, world, one, student, time, ... [build, better, world, one, student, time, '',... build better world one student time seed publ... 188 263 188 167 0.634981
13 52.244098 0.404307 0.351595 SOUTH POINTE HIGH SCHOOL 8325 SOUTH CENTRAL AVE, PHOENIX, AZ http://www.southpointehs.com/ SOUTH POINTE HIGH SCHOOL 8325 SOUTH CENTRAL AV... AZ80990 We nurture our students academically, behavior... 3 ... [we, nurture, our, students, academically, ,, ... [we, nurture, our, students, academically, beh... [nurture, students, academically, behaviorally... [nurtur, student, academ, behavior, emot, serv... nurtur student academ behavior emot serv produ... 65 117 65 71 0.606838
14 91.595543 80.053909 0.350548 SOUTH POINTE HIGH SCHOOL 8325 SOUTH CENTRAL AVE, PHOENIX, AZ http://www.southpointehs.com/ SOUTH POINTE HIGH SCHOOL 8325 SOUTH CENTRAL AV... AZ80990 We nurture our students academically, behavior... 3 ... [we, nurture, our, students, academically, ,, ... [we, nurture, our, students, academically, beh... [nurture, students, academically, behaviorally... [nurtur, student, academ, behavior, emot, serv... nurtur student academ behavior emot serv produ... 265 466 265 233 0.500000
15 14.281958 0.371092 0.346950 SOUTH POINTE JUNIOR HIGH SCHOOL 217 EAST OLYMPIC DR, PHOENIX, AZ http://www.southpointejh.com/ SOUTH POINTE JUNIOR HIGH SCHOOL 217 EAST OLYMP... AZ79178 To provide an education and support students d... 2 ... [to, provide, an, education, and, support, stu... [to, provide, an, education, and, support, stu... [provide, education, support, students, develo... [provid, educ, support, student, develop, self... provid educ support student develop selfreli s... 28 48 28 37 0.770833
16 18.623108 3.020419 0.356473 SOUTHSIDE COMMUNITY SCHOOL 2701 SOUTH CAMPBELL AVE, TUCSON, AZ http://www.ade.az.gov/edd/NewDetails.asp?Entit... SOUTHSIDE COMMUNITY SCHOOL 2701 SOUTH CAMPBELL... AZ79432 Southside Community School is a free, public c... 4 ... [southside, community, school, is, a, free, ,,... [southside, community, school, is, a, free, pu... [southside, community, school, free, public, c... [southsid, commun, school, free, public, chart... southsid commun school free public charter sch... 29 41 29 33 0.804878
17 18.192244 426.458469 0.349286 SUN VALLEY CHARTER SCHOOL 5806 SOUTH 35TH AVE BLDG EAST, PHOENIX, AZ http://www.sunvalleycharterschool.com/ SUN VALLEY CHARTER SCHOOL 5806 SOUTH 35TH AVE ... AZ90193 Sun Valley Charter School has made a commitmen... 1 ... [sun, valley, charter, school, has, made, a, c... [sun, valley, charter, school, has, made, a, c... [sun, valley, charter, school, made, commitmen... [sun, valley, charter, school, made, commit, p... sun valley charter school made commit provid s... 666 986 666 383 0.388438
18 71.231727 0.418614 0.349659 TANQUE VERDE ELEMENTARY SCHOOL 2600 NORTH FENNIMOREA AVE, TUCSON, AZ http://www.tanqueverdeschools.org/ TANQUE VERDE ELEMENTARY SCHOOL 2600 NORTH FENN... AZ5829 "Excellence is our goal, understanding our fou... 1 ... [``, excellence, is, our, goal, ,, understandi... [``, excellence, is, our, goal, understanding,... [``, excellence, goal, understanding, foundati... [``, excel, goal, understand, foundat, '', tvu... excel goal understand foundat tvusd ensur ef... 103 153 103 109 0.712418
19 49.240595 0.410438 0.348967 TARTESSO ELEMENTARY SCHOOL 29677 WEST INDIANOLA RD, BUCKEYE, AZ http://tartesso.smusd90.org/ TARTESSO ELEMENTARY SCHOOL 29677 WEST INDIANOL... AZ89596 We are located in Buckeye, Arizona and enjoy t... 1 ... [we, are, located, in, buckeye, ,, arizona, an... [we, are, located, in, buckeye, arizona, and, ... [located, buckeye, arizona, enjoy, benefits, a... [locat, buckey, arizona, enjoy, benefit, activ... locat buckey arizona enjoy benefit activ suppo... 86 145 86 88 0.606897

20 rows × 347 columns
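
The topic weights in columns 0, 1 and 2 above are not proportions: in each row they sum to well over 1, and the sums appear to grow with document length. If per-school topic shares are wanted, one option is to divide each row by its total. A minimal sketch, using the first school's weights from the output above; the `normalize` helper is hypothetical and not part of the notebook.

```python
def normalize(weights):
    """Rescale a list of non-negative weights so that they sum to 1."""
    total = sum(weights)
    if total == 0:
        return [0.0 for _ in weights]
    return [w / total for w in weights]

# Topic weights for the first school (row 0) in the output above
quest_weights = [131.261273, 0.394623, 0.344103]
quest_shares = normalize(quest_weights)
print(quest_shares)
```

After rescaling, the shares for each school sum to 1, which makes schools with long and short self-descriptions directly comparable.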


In [51]:
topic_columns = range(0,n_topics) # Set numerical range of topic columns for use in analyses, using n_topics from above

In [52]:
# Which schools are weighted highest for topic 0? How do they trend with regard to urban proximity and student class? 
print(df_w_topics[['LSTATE', 'ULOCAL', 'PCTETH', 'PCTFRPL', 0, 1, 2]].sort_values(by=[0], ascending=False))


    LSTATE  ULOCAL     PCTETH   PCTFRPL           0            1         2
73      FL      12   0.457797  0.440629  836.884989   178.763940  0.351071
87      IL      12   0.733624  0.532751  807.866177    85.781466  0.352358
52      CA      11   0.997093  0.973837  682.823243     0.824402  0.352355
84      FL      11   0.556028  0.173050  660.449380    78.198383  0.352236
51      CA      11   0.557576  0.232323  656.232871     0.417456  0.349673
23      AZ      13   0.174312 -0.009174  571.003730     8.642115  0.354154
127     NY      11   0.989899  0.818182  563.216136    10.432775  0.351089
44      CA      11   0.810409  0.483271  530.244377     0.404822  0.350801
20      AZ      13   0.297222  0.008333  515.216903    96.431358  0.351739
180     TX      12   0.894057  0.744186  473.231776     0.421162  0.347063
22      AZ      41   0.341346  0.552885  446.228030     0.420970  0.351000
11      AZ      11 -11.250000  0.000000  401.374463   198.273555  0.351982
165     TX      11   0.961957  0.739130  391.239505     0.412921  0.347574
128     NY      11   0.996667  0.930000  382.700076    95.953308  0.346616
76      FL      21   0.850288  0.382917  371.228569     0.421452  0.349978
149     PA      11   1.000000 -0.013575  359.454317    12.198365  0.347319
182     TX      21   0.994253  0.948276  352.248564     0.403317  0.348119
195     WI      11   0.353234  0.343284  345.233831     0.419037  0.347131
101     MI      12   0.884232  0.934132  328.253226     0.398792  0.347982
160     TX      21   0.882096  0.895197  319.249176     0.404244  0.346579
159     TX      11   0.976143  0.940358  303.255555     0.397995  0.346450
82      FL      21   0.976744  0.941860  292.288126     2.359365  0.352509
140     OH      12   0.328358  0.776119  291.066628    22.584135  0.349237
104     MI      12   0.753769  0.773869  274.249669     0.402176  0.348154
174     TX      21   0.744076  0.677725  273.262781     0.392607  0.344612
119     NC      21   0.088095  0.095238  266.783878   126.866013  0.350109
63      DE      13   0.965157  0.606272  264.251378     0.401792  0.346830
142     OR      41   0.168224  0.224299  263.240002     0.411085  0.348913
194     WI      22   0.354232  0.360502  262.266485     0.388329  0.345186
179     TX      21   0.991489  0.927660  260.245051     0.407997  0.346951
..     ...     ...        ...       ...         ...          ...       ...
154     SC      13   0.392857  0.535714    0.433510    68.220386  0.346105
72      FL      32   0.324503  0.675497    0.428338    67.225470  0.346192
37      CA      21   0.193878  0.034014    0.425801   200.225710  0.348489
10      AZ      11   0.369427  0.656051    0.424860   138.224560  0.350580
45      CA      11   0.967273  0.709091    0.422800     6.234783  0.342417
172     TX      21   0.337386  0.118541    0.422751   136.230226  0.347023
65      FL      22   0.208333  0.370833    0.422699   367.225716  0.351586
184     UT      41   0.060311  0.130350    0.422111   555.231439  0.346449
192     WI      32   0.107143  0.500000    0.419505    13.238635  0.341860
88      IL      11   0.995968  0.784274    0.417534   185.231030  0.351436
64      FL      21   0.978964  0.621359    0.415587    17.240277  0.344136
31      CA      33   0.286184  0.644737    0.412016    49.240774  0.347210
54      CA      22   0.420664  0.553506    0.407155   306.245094  0.347752
145     PA      21   0.140662  0.159574    0.400168    63.253037  0.346794
50      CA      12   0.494755  0.312937    0.397237   851.253728  0.349034
148     PA      11   0.931104  0.853090    0.396484   260.257830  0.345686
152     RI      12   0.900763  0.877863    0.396424   973.256774  0.346802
110     MN      21   0.084507  0.070423    0.396098  1895.255444  0.348458
67      FL      11   0.382022  0.292135    0.395258    24.261845  0.342896
81      FL      13   0.564706  0.701176    0.394519   142.251015  0.354467
68      FL      11   0.487395  0.291317    0.394138    24.263576  0.342285
97      MA      21   0.257764  0.177019    0.394035   390.259808  0.346158
150     PA      11   0.998494  0.936747    0.392511   660.261869  0.345620
153     SC      21   0.652174  0.231884    0.392209    93.262182  0.345609
74      FL      21   0.965517  0.775862    0.391582   126.259423  0.348995
181     TX      11   0.762821  0.588675    0.389915   468.265444  0.344641
191     WI      33   1.000000  0.000000    0.387120   450.268534  0.344346
168     TX      33   0.191057  0.069106    0.387115    27.269994  0.342890
151     PA      21   0.269565  0.608696    0.387079   308.265419  0.347502
95      LA      13   5.000000  1.000000    0.375334    74.275300  0.349365

[196 rows x 7 columns]

In [53]:
# Preparation for comparing total number of words aligned with each topic
# To weight each topic by its prevalence in the corpus, multiply each school's topic weight by its word count from above

col_list = []
for num in topic_columns:
    col = "%d_wc" % num
    col_list.append(col)
    df_w_topics[col] = df_w_topics[num] * df_w_topics['webpunct_count']
    
df_w_topics[:20]


Out[53]:
0 1 2 SCHNAM ADDRESS URL SEARCH CUSTOMID WEBTEXT LEVEL ... webtokens_stemmed webtext_stemmed webstem_count webpunct_count webclean_count numtypes TTR 0_wc 1_wc 2_wc
0 131.261273 0.394623 0.344103 QUEST MIDDLE SCHOOL OF PINE BLUFF 308 SOUTH BLAKE ST, PINE BLUFF, AR http://responsiveed.com/questpinebluff/ QUEST MIDDLE SCHOOL OF PINE BLUFF 308 SOUTH BL... AR3542702 Quest Middle Schools¨ are schools focused on h... 2 ... [quest, middl, schools¨, school, focus, high, ... quest middl schools¨ school focus high expect ... 200 314 200 139 0.442675 41216.039781 123.911766 108.048453
1 43.251431 0.404550 0.344018 THE ACADEMIES AT JONESBORO HIGH SCHOOL 301 HURRICANE DR, JONESBORO, AR http://www.jonesboroschools.net/schools/academ... THE ACADEMIES AT JONESBORO HIGH SCHOOL 301 HUR... AR1608703 The mission of the Academies at Jonesboro High... 3 ... [mission, academi, jonesboro, high, school, pr... mission academi jonesboro high school provid h... 63 111 63 71 0.639640 4800.908882 44.905089 38.186028
2 23.290207 0.367458 0.342335 A CHILD'S VIEW SCHOOL 2846 DREXEL RD, TUCSON, AZ http://childcarecenter.us/provider_detail/a_ch... A CHILD'S VIEW SCHOOL 2846 DREXEL RD, TUCSON, AZ AZ87345 We believe that every child needs a well-round... 1 ... [believ, everi, child, need, well-round, educ,... believ everi child need wellround educ famili ... 38 77 38 61 0.792208 1793.345929 28.294249 26.359822
3 13.599131 5.047494 0.353375 AAEC - PARADISE VALLEY 17811 NORTH 32ND ST, PHOENIX, AZ http://www.aaechighschools.com/ AAEC - PARADISE VALLEY 17811 NORTH 32ND ST, PH... AZ6344 AAEC Early College High School prepares young ... 3 ... [aaec, earli, colleg, high, school, prepar, yo... aaec earli colleg high school prepar young adu... 31 45 31 38 0.844444 611.960911 227.137229 15.901860
4 24.045600 18.605429 0.348971 ABRAHAM LINCOLN TRADITIONAL SCHOOL 10444 NORTH 39TH AVE, PHOENIX, AZ http://abrahamlincoln.wesdschools.org/ ABRAHAM LINCOLN TRADITIONAL SCHOOL 10444 NORTH... AZ5274 The mission of the Abraham Lincoln Traditional... 1 ... [mission, abraham, lincoln, tradit, school, gu... mission abraham lincoln tradit school guid cha... 75 106 75 71 0.669811 2548.833565 1972.175517 36.990918
5 0.437738 93.213344 0.348919 ACADEMY DEL SOL 4525 EAST BROADWAY BLVD, TUCSON, AZ http://www.academydelsol.com/ ACADEMY DEL SOL 4525 EAST BROADWAY BLVD, TUCSO... AZ90200 Academy Del SolÕs mission is to provide a rigo... 1 ... [academi, del, solõ, mission, provid, rigor, s... academi del solõ mission provid rigor superior... 147 226 147 140 0.619469 98.928743 21066.215642 78.855615
6 196.920592 33.728109 0.351299 ACADEMY OF TUCSON ELEMENTARY SCHOOL 9209 EAST WRIGHTSTOWN RD, TUCSON, AZ http://www.academyoftucson.com/elementary-scho... ACADEMY OF TUCSON ELEMENTARY SCHOOL 9209 EAST ... AZ81130 Mission:\rIt is the purpose of the Academy of ... 1 ... [mission, purpos, academi, tucson, provid, pre... mission purpos academi tucson provid prepar gr... 356 620 356 278 0.448387 122090.767038 20911.427627 217.805335
7 24.971913 14.676298 0.351789 DESERT MOSAIC SCHOOL 5757 WEST AJO HWY, TUCSON, AZ http://desertmosaic.com/Home_Page.php DESERT MOSAIC SCHOOL 5757 WEST AJO HWY, TUCSON... AZ79118 Desert Mosaic School commits to creating a tea... 4 ... [desert, mosaic, school, commit, creat, teach,... desert mosaic school commit creat teach enviro... 67 115 67 74 0.643478 2871.770000 1687.774263 40.455737
8 179.458463 56.191420 0.350117 KAIZEN EDUCATION FOUNDATION DBA SUMMIT HIGH SC... 728 EAST MCDOWELL RD, PHOENIX, AZ http://www.summiths.com/ KAIZEN EDUCATION FOUNDATION DBA SUMMIT HIGH SC... AZ10749 Summit High SchoolÕs Mission and Vision is to ... 3 ... [summit, high, schoolõ, mission, vision, provi... summit high schoolõ mission vision provid safe... 363 580 363 275 0.474138 104085.908809 32591.023578 203.067613
9 47.900249 20.751915 0.347836 OASIS HIGH SCHOOL 8632 WEST NORTHERN AVE, GLENDALE, AZ https://web.archive.org/web/20120617204246/htt... OASIS HIGH SCHOOL 8632 WEST NORTHERN AVE, GLEN... AZ78955 Omega's mission is to provide an optimal teach... 3 ... [omega, 's, mission, provid, optim, teach, lea... omega s mission provid optim teach learn envir... 133 221 133 137 0.619910 10585.954983 4586.173248 76.871769
10 0.424860 138.224560 0.350580 SAGE ACADEMY 1055 EAST HEARN RD, SCOTTSDALE, AZ http://www.sage-academy.org/ SAGE ACADEMY 1055 EAST HEARN RD, SCOTTSDALE, AZ AZ89415 River Valley Charter School (RVCS) is a ... 1 ... [river, valley, charter, school, rvc, public, ... river valley charter school rvc public charter... 206 341 206 185 0.542522 144.877221 47134.575023 119.547755
11 401.374463 198.273555 0.351982 SCHOOL FOR INTEGRATED ACADEMICS AND TECHNOLOGIES 518 SOUTH 3RD ST, PHOENIX, AZ http://www.siatech.org/ SCHOOL FOR INTEGRATED ACADEMICS AND TECHNOLOGI... AZ79450 As a nonprofit organization, we create the env... 3 ... [nonprofit, organ, creat, environ, tool, techn... nonprofit organ creat environ tool techniqu re... 964 1476 964 634 0.429539 592428.707375 292651.766967 519.525658
12 9.411709 89.237486 0.350806 SEQUOIA VILLAGE SCHOOL 982 FULL HOUSE LN, SHOW LOW, AZ http://www.sequoiavillageschool.org/ SEQUOIA VILLAGE SCHOOL 982 FULL HOUSE LN, SHOW... AZ10848 Building a Better World One Student at a Time"... 1 ... [build, better, world, one, student, time, '',... build better world one student time seed publ... 188 263 188 167 0.634981 2475.279353 23469.458783 92.261863
13 52.244098 0.404307 0.351595 SOUTH POINTE HIGH SCHOOL 8325 SOUTH CENTRAL AVE, PHOENIX, AZ http://www.southpointehs.com/ SOUTH POINTE HIGH SCHOOL 8325 SOUTH CENTRAL AV... AZ80990 We nurture our students academically, behavior... 3 ... [nurtur, student, academ, behavior, emot, serv... nurtur student academ behavior emot serv produ... 65 117 65 71 0.606838 6112.559448 47.303895 41.136657
14 91.595543 80.053909 0.350548 SOUTH POINTE HIGH SCHOOL 8325 SOUTH CENTRAL AVE, PHOENIX, AZ http://www.southpointehs.com/ SOUTH POINTE HIGH SCHOOL 8325 SOUTH CENTRAL AV... AZ80990 We nurture our students academically, behavior... 3 ... [nurtur, student, academ, behavior, emot, serv... nurtur student academ behavior emot serv produ... 265 466 265 233 0.500000 42683.523220 37305.121560 163.355220
15 14.281958 0.371092 0.346950 SOUTH POINTE JUNIOR HIGH SCHOOL 217 EAST OLYMPIC DR, PHOENIX, AZ http://www.southpointejh.com/ SOUTH POINTE JUNIOR HIGH SCHOOL 217 EAST OLYMP... AZ79178 To provide an education and support students d... 2 ... [provid, educ, support, student, develop, self... provid educ support student develop selfreli s... 28 48 28 37 0.770833 685.533972 17.812416 16.653613
16 18.623108 3.020419 0.356473 SOUTHSIDE COMMUNITY SCHOOL 2701 SOUTH CAMPBELL AVE, TUCSON, AZ http://www.ade.az.gov/edd/NewDetails.asp?Entit... SOUTHSIDE COMMUNITY SCHOOL 2701 SOUTH CAMPBELL... AZ79432 Southside Community School is a free, public c... 4 ... [southsid, commun, school, free, public, chart... southsid commun school free public charter sch... 29 41 29 33 0.804878 763.547425 123.837179 14.615396
17 18.192244 426.458469 0.349286 SUN VALLEY CHARTER SCHOOL 5806 SOUTH 35TH AVE BLDG EAST, PHOENIX, AZ http://www.sunvalleycharterschool.com/ SUN VALLEY CHARTER SCHOOL 5806 SOUTH 35TH AVE ... AZ90193 Sun Valley Charter School has made a commitmen... 1 ... [sun, valley, charter, school, made, commit, p... sun valley charter school made commit provid s... 666 986 666 383 0.388438 17937.552914 420488.050703 344.396383
18 71.231727 0.418614 0.349659 TANQUE VERDE ELEMENTARY SCHOOL 2600 NORTH FENNIMOREA AVE, TUCSON, AZ http://www.tanqueverdeschools.org/ TANQUE VERDE ELEMENTARY SCHOOL 2600 NORTH FENN... AZ5829 "Excellence is our goal, understanding our fou... 1 ... [``, excel, goal, understand, foundat, '', tvu... excel goal understand foundat tvusd ensur ef... 103 153 103 109 0.712418 10898.454184 64.048014 53.497802
19 49.240595 0.410438 0.348967 TARTESSO ELEMENTARY SCHOOL 29677 WEST INDIANOLA RD, BUCKEYE, AZ http://tartesso.smusd90.org/ TARTESSO ELEMENTARY SCHOOL 29677 WEST INDIANOL... AZ89596 We are located in Buckeye, Arizona and enjoy t... 1 ... [locat, buckey, arizona, enjoy, benefit, activ... locat buckey arizona enjoy benefit activ suppo... 86 145 86 88 0.606897 7139.886291 59.513470 50.600239

20 rows × 350 columns
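
As a sanity check on the weighting step, here is a minimal sketch that reproduces the first two 0_wc values shown above. The numbers are copied (rounded as displayed) from rows 0 and 1 of the output; the frame name `check` is hypothetical, and the vectorized `mul`/`add_suffix` form is an alternative to the loop in the cell, not the code that was actually run.

```python
import pandas as pd

# Topic-0 weights and word counts copied from rows 0 and 1 of the output above
check = pd.DataFrame({
    0: [131.261273, 43.251431],
    "webpunct_count": [314, 111],
})

# Multiply every topic column by the word count, row by row,
# then label the result "<topic>_wc" as in the notebook
topic_columns = [0]
weighted_cols = check[topic_columns].mul(check["webpunct_count"], axis=0).add_suffix("_wc")
check = check.join(weighted_cols)

print(check)
```

The resulting 0_wc values agree with the displayed 41216.039781 and 4800.908882 up to the rounding of the inputs.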


In [54]:
# Now we can see the prevalence of each topic over words for each urban category and state
grouped_urban = df_w_topics.groupby('ULOCAL')
for e in col_list:
    print(e)
    print(grouped_urban[e].sum()/grouped_urban['webpunct_count'].sum())

grouped_state = df_w_topics.groupby('LSTATE')
for e in col_list:
    print(e)
    print(grouped_state[e].sum()/grouped_state['webpunct_count'].sum())


0_wc
ULOCAL
11    252.460199
12    344.588359
13    240.022477
21    127.201643
22     79.306683
23    138.867584
31    100.670414
32    137.660762
33     34.277186
41    146.551647
42     67.639781
dtype: float64
1_wc
ULOCAL
11    171.386633
12    318.874117
13    124.322144
21    567.652927
22    237.680326
23      0.399967
31    252.846123
32     41.157828
33    231.109223
41    228.613868
42    324.782390
dtype: float64
2_wc
ULOCAL
11    0.349413
12    0.349343
13    0.350095
21    0.348660
22    0.348461
23    0.347925
31    0.348880
32    0.348457
33    0.347473
41    0.349770
42    0.349948
dtype: float64
0_wc
LSTATE
AR    108.275173
AZ    307.060573
CA    244.894963
CO    120.827397
DC     73.747499
DE    264.251378
FL    371.423011
GA     20.119871
IL    476.761298
IN    121.664297
LA     29.510303
MA     42.715775
MD    154.503844
MI    195.752094
MN     19.651640
MO    134.226657
NC    192.330028
NJ     97.425364
NM     62.503755
NV      1.570714
NY    351.901927
OH    174.489852
OR    203.211717
PA     81.033768
RI      0.396424
SC      0.411729
TN     74.079174
TX    210.712480
UT     38.839149
WI    154.229482
dtype: float64
1_wc
LSTATE
AR       0.397216
AZ      98.357483
CA     303.382596
CO       3.559631
DC     237.404967
DE       0.401792
FL     129.540301
GA       0.527551
IL     264.466043
IN      63.021684
LA      44.251730
MA     278.693377
MD     220.144010
MI     304.035136
MN    1563.666227
MO       0.421632
NC     136.675891
NJ      28.225592
NM     130.136175
NV      94.078004
NY      74.740746
OH      19.871231
OR       7.582450
PA     321.091370
RI     973.256774
SC      81.426507
TN       2.251475
TX     122.607045
UT     400.402277
WI     176.445612
dtype: float64
2_wc
LSTATE
AR    0.344081
AZ    0.351287
CA    0.350417
CO    0.349726
DC    0.349568
DE    0.346830
FL    0.351154
GA    0.352578
IL    0.351628
IN    0.348872
LA    0.349275
MA    0.348913
MD    0.352145
MI    0.349951
MN    0.348572
MO    0.351711
NC    0.350324
NJ    0.349044
NM    0.351285
NV    0.351282
NY    0.348722
OH    0.346984
OR    0.347862
PA    0.346778
RI    0.346802
SC    0.345843
TN    0.347922
TX    0.346956
UT    0.345979
WI    0.346329
dtype: float64
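
To make clear what the ratio printed above measures, here is a small sketch on made-up data (the values and the names `toy`, `topic`, and `topic_wc` are hypothetical, not from the corpus). Dividing the summed word-weighted topic column by the summed word count gives a word-weighted mean of the topic per group, so schools with longer self-descriptions count for more than they do in a plain per-school mean.

```python
import pandas as pd

# Hypothetical toy data: three schools in two locale codes
toy = pd.DataFrame({
    "ULOCAL": [11, 11, 21],
    "topic": [0.2, 0.8, 0.5],           # made-up topic weights
    "webpunct_count": [100, 300, 50],   # made-up word counts
})
toy["topic_wc"] = toy["topic"] * toy["webpunct_count"]

grouped = toy.groupby("ULOCAL")

# Plain mean: every school counts equally
unweighted = grouped["topic"].mean()

# Word-weighted mean: same ratio as in the cell above
weighted = grouped["topic_wc"].sum() / grouped["webpunct_count"].sum()

print(pd.DataFrame({"unweighted": unweighted, "weighted": weighted}))
```

In locale 11 the longer page pulls the weighted value toward its own topic weight (0.65 versus a plain mean of 0.5), while a group with a single school gets the same value either way.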

In [55]:
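# Here's the mean weight of each of the three topics by urban proximity category (error bars show the standard deviation):
fig1 = plt.figure()
chrt = 0
for num in topic_columns:
    chrt += 1
    ax = fig1.add_subplot(2,3, chrt)
    # Pass the title as a string so the panel for topic 0 is labeled too (a bare 0 is falsy)
    grouped_urban[num].mean().plot(kind = 'bar', yerr = grouped_urban[num].std(), ylim=0, ax=ax, title=str(num))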

fig1.tight_layout()
plt.show()



In [56]:
# Here's the distribution of each topic over words, for each urban category:
fig2 = plt.figure()
chrt = 0
for e in col_list:
    chrt += 1 
    ax2 = fig2.add_subplot(2,3, chrt)
    (grouped_urban[e].sum()/grouped_urban['webpunct_count'].sum()).plot(kind = 'bar', ylim=0, ax=ax2, title=e)

fig2.tight_layout()
plt.show()