This notebook explores the relationship between 'customers', schools sites, account_uuids and meter_uuids in the prop 39 data.
In general, an account_uuid generally corresponds with an individual school, although there are some exceptions (which generally look to be typos and other data quality issues).
'Customers' may be school districts or other entities, but may also be schools themselves.
There is a strictly heirarchical tree relationship between the following three columns:
customer_name > account_uuid > meter_uuid
Other notes from looking at the PEPS_Data.xlsx definitions table:
In [1]:
db_pass = ""
with open('data_db.cfg','r') as f:
for line in f:
s = line.split("=")
if s[0].strip() == "user":
db_user = s[1].strip()
if s[0].strip() == "pass":
db_pass = ":" + s[1].strip()
if s[0].strip() == "db":
db_db = s[1].strip()
In [2]:
%matplotlib inline
import pandas as pd
import numpy as np
import mysql.connector, os, datetime
from sqlalchemy import create_engine
In [3]:
engine = create_engine('mysql+mysqlconnector://' + db_user + db_pass + '@localhost/' + db_db, echo=False)
conn = engine.connect()
In [4]:
account_df = pd.read_sql('SELECT * FROM account;', con=engine)
print len(account_df)
account_df.head(3)
Out[4]:
In [5]:
df = pd.read_sql('SELECT * FROM account LIMIT 1;', con=engine)
for c in df.columns:
df[c] = pd.read_sql('SELECT COUNT(DISTINCT(' + c + ')) FROM account;', con=engine)
df
Out[5]:
In [6]:
query = '''
SELECT COUNT(count)
FROM (
SELECT COUNT(*) AS count
FROM account
GROUP BY school_city, school_site_name
) AS t1;
'''
pd.read_sql(query, con=engine)
Out[6]:
Based on the above, we summarize:
In [7]:
account_df['site_pair'] = account_df['school_city'] + "_" + account_df['school_site_name']
In [8]:
comp = account_df[['site_pair','account_uuid']]
print len(comp)
comp = comp.drop_duplicates()
print len(comp)
comp.head(3)
Out[8]:
Note that 1083 is greater than 1078, which is the number of distinct account_uuids: there are thus some account_uuids with multiple site_pair associations. As shown below, these cases look to be caused by discrepencies in site naming texts.
In [9]:
account_uuid_doubles = comp[comp.duplicated(subset='account_uuid', keep=False)].sort_values(by='account_uuid')
account_uuid_doubles
Out[9]:
There are more cases of duplicated site_pair values. These look to be caused by either typos in the account_uuid values, or by the use of "LEA" in the site_name category.
In [10]:
site_pair_doubles = comp[comp.duplicated(subset='site_pair', keep=False)].sort_values(by='site_pair')
site_pair_doubles
Out[10]:
Testing this hypothesis:
Note that this would mean that (a) each distinct meter_uuid must be associated with just one account_uuid and that (b) each distinct account_uuid must be associated with just one customer_name.
Also note that (a) is true since there is only one instance of each meter_uuid in the table, so we just need to test (b).
In [11]:
customer_name_account_df = account_df[['customer_name','account_uuid']]
customer_name_account_df = customer_name_account_df.drop_duplicates()
print len(customer_name_account_df)
customer_name_account_df.head(3)
Out[11]:
Note from the 'lay of the land' section above that there are 1078 distinct account_uuids in the table, so they are all unique and thus must each be associated with just one customer_name.
So the hierarchical relationship holds.
There are various ways of visualizing the overall mapping - e.g. as a tree or sunburst, below we just consider the distributions of accounts per customer and meters per account.
In [12]:
customer_name_groups = customer_name_account_df.groupby('customer_name').count()
customer_name_groups.head()
Out[12]:
In [13]:
customer_name_groups.hist(bins=50)
Out[13]:
In [14]:
account_groups = account_df[['account_uuid','meter_uuid']].groupby('account_uuid').count()
account_groups.head()
Out[14]:
In [15]:
account_groups.hist(bins=16)
Out[15]: