Thesis Progress Updates

Progress updates on my master's thesis. The latest version is on GitHub.




Wed Jan 20th

  • (DONE) your abstract is assuming way too much knowledge and shouldn't be quoting someone else at length. Change it to be a description of the existing model, and then a description of what you did here

  • (DONE) 1.1: don't say 'the original paper'. It's Schultheis, Glasmeier & Nadeu [2014]

  • (DONE) 3.4 'linearly impute' should be 'linearly imputing'

  • (DONE) 4.1 typo in Mercer County (also, County should be capitalized or plural)

  • (Ask For Feedback) 4.2 it still has a TODO

  • (DONE) 4.3 is the choropleth scale consistent for all three charts? if so, that's fantastic. (yes, I should copy the legend over, but it's definitely the same scale)

  • (DONE) 4.5 the scale for these charts is kinda hilarious. I think you should get rid of it and just depend on the title. I guess it's your choice. (same applies for figure 17)

  • (DONE) 4.6 you have a ref to 'table ??'.

  • (DONE) 4.6 Also, I have no idea what you're trying to say with the most1, most2 groups. can you improve that explanation?

  • (DONE) 4.6 table 1 - in the Variable names, 2nd column, you have 'most' where I think you want 'most1'

  • (DONE) I don't understand Figure 20: what are the two distributions?

    • added a better title and a legend. Due to how the data is formatted, I can only make estimates here, so the two distributions represent an underestimate and an overestimate. I have an explanation in the paragraph for Figure 20, does it need more explanation?
  • (DONE) 6.1 - you should professionalize the language here: 'not sure why' is not good enough.

I think this needs a little bit more polishing/reviewing, but it's in a pretty good place.

Wed Jan 6th

Finished loading in minimum and median wage data. Will check whether there is a breakdown of wage by race by county / state. Finished creating a custom violin plot that takes weights into account, and pushed violin plots of the living wage breakdown by race. Need to fix it so that all races show; there is a bug in it now.

Added first map plot to show county living wages for 2014. Noticed that a portion of the map is not there, which confirms that there is something wrong with "East Coast" states (I noticed this earlier in the "regional" section). I have to go back over my previous steps and debug.

Final Output TODO:

  • Adding a title
  • Adding findings to the abstract
  • Add a writeup of what the wage-gap is with references to your sources (I think most of this can be copied from your proposal)
  • Include a graph with living wage by state or region
  • Move all code and all the print outs of data sets (final_model_values_df, etc) to the appendix
  • Add a 'further analysis' section with things that you wanted to look at but fell out of scope
  • Add references/citations/links
  • General clean-up (typos, formatting, drop the todos, etc)

Overall / Project TODO:

  • Boxplot for pop weighted race breakdown
  • Heatmap for state averages?
  • Regional weighting for the CEX data. No response from original author, but should do some kind of weighting to diversify numbers
  • Issue with county consistency from 2004 - 2005; affects regional and race section

Sunday Jan 3rd

Loaded minimum wage data and median wage data. For each, I need to do some data validation and confirm:

  • What to do when minimum wage is -1? Load federal minimum wage?
  • Some cells in the median wage data are blank, what to do?
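One possible answer to the -1 question, sketched under assumptions: treat -1 as a sentinel for "no state minimum wage" and fall back to the federal rate. The state labels and the state value below are illustrative, `effective_minimum_wage` is a hypothetical helper name, and the federal rate shown is the 2014 one; earlier years would need their own rate.

```python
# Federal minimum wage in effect in 2014; earlier years need their own value.
FEDERAL_MINIMUM_WAGE = 7.25


def effective_minimum_wage(state_value, federal_value=FEDERAL_MINIMUM_WAGE):
    """Return the state minimum wage, or the federal one for the -1 sentinel."""
    # Assumption: -1 (or a missing value) means the state sets no minimum wage.
    if state_value is None or state_value < 0:
        return federal_value
    return state_value


# Illustrative data only: hypothetical state labels and values.
state_minimum_wage = {"state_a": 8.00, "state_b": -1}

effective = {
    state: effective_minimum_wage(value)
    for state, value in state_minimum_wage.items()
}
print(effective)
```

Whether the fallback is the right call depends on what -1 actually encodes in the source file, which still needs to be confirmed.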

Finished up some more of the race breakdown graphs. I want to do a boxplot or violin plot of the county breakdown, but matplotlib does not do a weighted version. Taking the data and 'expanding' it based on the integer weights would work in theory, but the list ends up too large. Both types of plots can take in a custom function to return custom values for where the box is placed, etc. Might open source the solution, but I need to move forward on looking at the median and minimum wage comparisons.
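A minimal sketch of how the box positions could be computed without expanding the data: a weighted quantile that walks the sorted values and accumulates weight. This is an assumption about one workable approach, not the code in the thesis; `weighted_quantile` is a hypothetical helper and the numbers are made up.

```python
def weighted_quantile(values, weights, q):
    """Smallest value whose cumulative weight share reaches q (0 <= q <= 1)."""
    pairs = sorted(zip(values, weights))
    total = sum(weight for _, weight in pairs)
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= q * total:
            return value
    return pairs[-1][0]


# Illustrative data: three counties, the third with most of the population.
values = [10, 20, 30]
weights = [1, 1, 8]

median = weighted_quantile(values, weights, 0.5)
print(median)
```

For integer weights this gives the same median as the 'expanded' list would, while only ever holding one entry per county.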

Friday Jan 1st

Over the past week I have done quite a lot more analysis of the living wage. I am now trying to move on the living wage gap, which will have a similar analysis. Starting to download wage data now.

Overall / Project TODO:

  • Boxplot for pop weighted race breakdown
  • Heatmap for state averages?
  • Regional weighting for the CEX data. No response from original author, but should do some kind of weighting to diversify numbers
  • Issue with county consistency from 2004 - 2005; affects regional and race section

Immediate TODO:

  • Prep wage data

Sunday Dec 28th

Added proper population-weighted averages to 2 out of 4 sections; will finish that up tomorrow. Using population data from housing data. Cleaned up quite a few of the plots, using better colors and dashed lines so as not to insinuate that I am interpolating data between years.

Added a national average and broke it down by model variables to see which of them increased the most from 2004 - 2014. Not surprising that it's rent. Interesting that other_costs have gone down (is this people cutting back on non-needed items to pay for increasing housing costs?).
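For reference, a population-weighted average is each county's value multiplied by its population, summed, and divided by the total population. A minimal sketch with made-up numbers; `weighted_average` is a hypothetical helper name, not a function from the thesis code.

```python
def weighted_average(values, weights):
    """Average of values where each value counts in proportion to its weight."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Illustrative data: two counties with different populations.
living_wages = [10.0, 20.0]
populations = [3000, 1000]

average = weighted_average(living_wages, populations)
print(average)
```

The unweighted mean of these two values would be 15.0; weighting pulls the result toward the larger county.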

Saturday Dec 27th

Holiday weekend is over so back to work. Expanded on the analysis between population subgroups of counties, and added some visualizations. Will expand on the discussion, including limitations of analysis due to a lack of regional weighting for the CEX variables. Tomorrow I will move on to downloading wage data (minimum, median), calculate the gap, and do a similar analysis.

Think about population weighted averages.

Show error bars in plots?

Sunday Dec 20th

Completed IS622 project, so now I have only this left.

Plotted some counties by hand. Working on deriving state averages, and plotting regional averages as well. Importing a list of the most populous counties to look at differences between populous and non-populous counties.

The issue with 2004 - 2005 - 2006 in the FMR data mostly affects eastern states, so I need to go back and look at what is wrong. I think county breakdowns for these states differ, and deriving FIPS codes seems to be problematic.

Tuesday Dec 15th

Finished TODOs from previous entry. Will move on to visualizations and analysis:

  • Separate counties by majority race from census 2010 data
    • Average by year and race, plot versus each other
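A rough sketch of the planned grouping, assuming each row carries a year, a majority-group label for the county, and a living wage value. The labels and numbers are placeholders, `mean_by_year_and_group` is a hypothetical helper, and this is a plain (unweighted) mean.

```python
from collections import defaultdict


def mean_by_year_and_group(rows):
    """Average the wage for every (year, group) pair found in rows."""
    sums = defaultdict(lambda: [0.0, 0])  # (year, group) -> [total, count]
    for year, group, wage in rows:
        sums[(year, group)][0] += wage
        sums[(year, group)][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


# Illustrative rows: (year, majority group of the county, living wage).
rows = [
    (2004, "group_a", 10.0),
    (2004, "group_a", 12.0),
    (2004, "group_b", 9.0),
    (2005, "group_a", 11.0),
]

averages = mean_by_year_and_group(rows)
print(averages)
```

Each group's yearly averages can then be plotted as one line per group.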

Sunday Dec 13th - Monday 14th

Spent Friday and Saturday on IS622 project. Almost done with it, will work on it at end of this week. Need to step up work on thesis now. Current plan is to:

  • Load in CEX data and forget regional weights for now (DONE)

  • Produce visualization of all model variables in appendix (DONE)

  • Redo 2002 fips matching

    • DONE but still issues with 2002 and 2003 matching. Data is definitely different; restricting to 2004 - 2014
  • Find counties that appear in all years for housing data (DONE)

  • Create two final dataframes with counties as rows: (Started ...)

    • One with model variables (DONE)
    • The other with final model value for living wage (DONE)
  • Produce visualization of living wage for one county (DONE)
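The "counties that appear in all years" step can be sketched as a set intersection over the FIPS codes seen in each year. The FIPS codes and years below are illustrative only, and the dictionary layout is an assumption about how the housing data might be held.

```python
# Illustrative data: FIPS codes present in each year's housing file.
fips_by_year = {
    2004: {"36047", "36061", "34021"},
    2005: {"36047", "36061"},
    2006: {"36047", "36061", "34021"},
}

# Keep only the counties that are present in every year.
common_fips = set.intersection(*fips_by_year.values())
print(sorted(common_fips))
```

Filtering every yearly table down to `common_fips` keeps the rows of the final dataframes aligned across years.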

Wednesday Dec 9th

Finished importing state tax data from spreadsheet. Loaded as data frame in main section and displayed in appendix. Did some more formatting.

At this point, going to skip the regional weighting for the CEX data and just load in what I can. I will have to revisit this

Monday Dec 7th

Created some better formatting for the document. Got the latest inflation numbers from BLS calculator.

Adjusted for inflation on food data, which is now in data frame form. Also adjusted inflation for housing data, and created the multi-level dataframe (and created more consistent column names).

Going through all variables to confirm that the methodology works, is adjusted for inflation and the final data is in data frame format.
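The inflation adjustment amounts to multiplying each nominal value by the multiplier for its year. A minimal sketch, using the multipliers quoted in the Oct 19th data-source notes as placeholder values (the newer numbers pulled from the BLS calculator are not recorded in this log); `adjust_for_inflation` is a hypothetical helper name.

```python
# Placeholder multipliers, copied from the Oct 19th data-source notes.
INFLATION_MULTIPLIER = {
    2010: 1.092609,
    2011: 1.059176,
    2012: 1.037701,
    2013: 1.022721,
}


def adjust_for_inflation(amount, year, multipliers=INFLATION_MULTIPLIER):
    """Scale a nominal amount from `year` into the target year's dollars."""
    return amount * multipliers[year]


adjusted = adjust_for_inflation(100.0, 2013)
print(round(adjusted, 4))
```

Keeping the multipliers in one dictionary means that switching to the newer BLS numbers only touches one place.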

Sunday Dec 6th

  • All issues with insurance data are now fixed
    • have pre-2006 data
    • inflation adjusted
    • filled in missing states in early data
  • Worked on taxes data
    • Imported federal numbers on payroll taxes (FICA)
    • Imported federal numbers on income tax
    • Still working on tax rate per state

TODO:

  • finish grunt work for loading in tax rate per state
  • what are the extra counties from 2005 - 2006 in fmr data
  • regional weighting
  • Merge all data and move on to visualization

Tuesday Nov 30th

Solved issue with 2002 FMR data by using Levenshtein distance to match counties to 2003 counties and copying the FIPS code from there. Will continue with TODO items from before for FMR data.
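A minimal sketch of the matching idea, not the thesis code: compute the edit distance between a 2002 county name and every 2003 county name, then copy the FIPS code of the closest one. The county names, the misspelling, and the helper names `levenshtein` and `closest_fips` are illustrative assumptions.

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions, substitutions from a to b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def closest_fips(name, fips_by_name):
    """Return the FIPS code of the reference name closest to `name`."""
    best = min(fips_by_name, key=lambda c: levenshtein(name.lower(), c.lower()))
    return fips_by_name[best]


# Illustrative 2003 reference data and a differently spelled 2002 name.
counties_2003 = {"Kings County": "36047", "Queens County": "36081"}
matched = closest_fips("Kings Cnty", counties_2003)
print(matched)
```

In practice the candidates would probably need to be limited to counties in the same state, since county names repeat across states.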

Sun Nov 29th

Finished loading data from housing files after exporting them all to CSV. The data for 2002 is not loaded since it does not include a FIPS column. I will check to see if an alternate download exists, but may have to match based on county name.

Need to clean this data up and use a multi-level index. Created a state to code mapping, as well as a state to region mapping (to do regional weighting of model variables that need it). Mapped each county to a region by adding a new column.

Tomorrow:

  • Continue todo list for housing data, which is mostly data cleanup and confirming there are no surprises. Looks like there is a change in 2005 - 2006 with regards to the number of counties (mostly handled by filtering 'sub-county' rows, but still some other changes)

Wed Nov 25th

I did some more work for IS622 to get some more work off my plate. Most of the homework for the rest of the semester is done; I just need to work on the project. This helps free more time for continuous work going forward. I have been feeling ill the past two to three days, so I didn't get as much work done as I would like. This holiday weekend I need to, and will, make significant progress.

Saturday Nov 21st

Got off track again, but going forward I will have another night (Friday) to work on this, as I dropped some extracurricular activities to make more time for it.

Updated schedule:

  • weekend
    • Start loading housing data
    • email researchers again
  • monday
    • taxes data
  • tuesday
    • Minimum / median wage
    • start merging data into single data frame

Wednesday Nov 18th

Started loading in housing data from FMR data. Converted XLS files to CSV, imported them into pandas, and filtered out columns we do not need. Going to figure out multi-level indexes to start storing this in a more convenient fashion (as well as other forms of data). Confirmed data matches listed data for one county, but need to confirm methodology for HUD areas (like NYC, which uses a population-weighted average).

Tuesday Nov 17th

Some success: went over the food data and am now getting exact values. There is a note in the USDA PDFs about adding 20% to the values when looking at individuals, since their calculations are for individuals in families. This corrects the discrepancy and I am confident this portion is now officially done. This also confirms that my model should be able to get the exact same output from the posted model data.

Insurance data has been downloaded and parsed into a data frame. Main problem here is data only goes back to 2006. Either need to find another data source to go back to 2001, or limit the model to 8 years (2006 - 2014).

Added some more to the outline of general steps I need to take at end of other notebook

Updated Schedule this week:

  • wednesday / thur
    • Start loading housing data
    • email researchers again
  • thursday / friday
    • taxes data
  • Weekend
    • Minimum / median wage
    • start merging data into single data frame

Monday Nov 16th

Looked at the CEX data again, but I do not see how to get the numbers to line up exactly with the model numbers I see on the county websites (i.e. http://livingwage.mit.edu/counties/36047). They are not far off, but the south and midwest seem not to be regionally weighted correctly.

Since the 'other' variable comes from the CEX, I took a look at it as well but same thing as transportation. Similar numbers but something seems off about the south and midwest.

Schedule this week:

  • tomorrow

    • look at food data again, make sure the calculation looks correct. Regional weights are given so numbers should align better (DONE)
    • take break from cex data
    • email researchers again to see if data on county sites is what I should be comparing to (since data is now 'proprietary', I wonder why this data is still there)
  • wednesday / thur

    • Start loading health insurance data
  • thursday / friday

    • Start loading housing data

TODO - compare food and cex regional definitions (http://www.bls.gov/cex/csxgloss.htm)

TODO - the inflation numbers used are old; use their numbers to confirm model accuracy but then will have to use new numbers to scale to 2015 dollars

Sunday Nov 15th

I got a bit delayed this week, but my plan is to finish up Week13 and Week14 homeworks by Monday night, which will free me up for two weeks with no homework to worry about. The plan then is to spend contiguous chunks of time on the thesis and make significant progress by the end of this week.

Currently, I am in an email conversation with one of the model authors about regional weights. She has described the methodology, but need to confirm that this works with the data (as I thought I tried what she suggested).

Monday Nov 9th

Downloading other data sets as I think about how to use the Consumer Expenditure Survey correctly (with respect to regional differences). Started with child care and had to manually download PDFs from ChildCareAware.org. Sadly, they only go back to 2010. I can now either:

  • find other estimates of child care costs from pre-2010 (preferred)
  • check if the Consumer Expenditure Survey has data on this
  • impute the data (don't think this is a good idea)
  • limit the analysis going back to 2010 (which seems limiting since other data, like the Consumer Expenditure Survey in 2014 provides 2013 data and that is the latest currently).

Currently I am only focusing on modeling costs for a single adult (an assumption I made early on) since I am interested in trends, and the other 'family configurations' are just linear combinations of the costs for one adult and for one child. However if I wanted to extend the numbers for 1 adult + 1 child, I would have to look into this further. For now I'll move on.

Downloaded all the housing data, will determine what we need to extract.

Will work on the insurance component tomorrow

Saturday, Nov 7th

Tried loading in the transportation cost data from the Consumer Expenditure Survey for 2014. The data is in Excel files, which makes it OK to pull data out, but I am still unsure how the original model deals with regional differences. I emailed the model author, hoping for a response soon. The numbers I get do not line up well for all regions.

Also need to go over what year the model is for. I thought the model reports 2014 estimates, and in the case of data that is only available for 2013, we adjust for inflation to get an estimate for 2014. Need to go over the details carefully.

For my current theory of how to do regional differences, I figured out how to get the values I think I need from the aggregate files since those are the only files going back to 2001.

TODO

  • Load more transport data after confirming regional data issue
  • Go over food data and confirm that the years / inflation calculations make sense

Thur Nov 5th

No progress to speak of. I got an email back from the library, with some hints to find more journals to look into. Will look into this sometime this weekend. I need to start setting deadlines for the data to be ingested, or I will never get this done. Will come up with a schedule for this weekend tomorrow night.

Tuesday Oct 27th

Sadly, due to family events and issues with my stomach, I have not been able to do much work. Starting work tonight to finish up loading of food data into a dataframe. Eventually got it done and the food data is loaded; it just needs to be adjusted for inflation.

Question: for inflation, everything in the original model is in 2014 dollars since they did not do this for 2015. To test if I am getting the same values as the original model, I should inflate all past values to 2014. But what about 2015? Exclude for now? Deflate back to 2014 dollars?

Journals: Will look into some journals as well. Emailed the librarians at the Newman Library to see if they can help pinpoint journals. The Economic Policy Institute looks interesting, though clearly a bit partisan.

Using the Elsevier Journal Finder, I found the following journals that might be useful:

Some I think might be related but I do not think this paper would meet requirements or would only be tangentially related:

Tuesday Oct 20th

Did some updates on the proposal, but no real work on data. Will work on that later in this week.

Monday Oct 19th

Most costs across counties in NY state seem to be the same, with the biggest county-level changes coming from housing. As a matter of fact, it looks like only housing changes. This means the living wage model does come up with a county-level approximation, but most of the variables are state or national level averages, with housing being the only true county-level data. In some instances, regional differences are accounted for, like food.

Accomplished

  • Updated thesis document with some dictionaries for regional food and inflation multipliers
  • Downloaded food data from 2001 to 2015
  • Outlined data sources below:

Data Sources:

  • Child Care Costs: The child care component is constructed from
    • 2013 state level estimates published by the National Association of Child Care Resource and Referral Agencies.
    • Inflation adjusted
  • Health definition: "The health component of the basic needs budget includes: (1) health insurance costs for employer sponsored plans, (2) medical services, (3) drugs, (4) medical supplies."

    • Costs for (2) medical services , (3) drugs and (4) medical supplies were derived from 2013 national expenditure estimates by household size provided in the 2014 Bureau of Labor Statistics Consumer Expenditure Survey. These estimates were further adjusted for regional differences using annual income expenditure shares reported by region. Values were inflated to 2014 dollars using the Consumer Price Index inflation multiplier from the Bureau of Labor Statistics

      • Data Source: 2014 Consumer Expenditure Survey, Table 1400
      • Regional Diff: 2014 Consumer Expenditure Survey, Table 1800
      • Inflation from 2013 to 2015 dollars: Inflation multiplier for 2010 = 1.092609, 2011 = 1.059176, 2012 = 1.037701, and 2013 = 1.022721. BLS inflation calculator is available at: http://www.bls.gov/data/inflation_calculator.htm
    • Costs for (1) health insurance calculated using the Health Insurance Component Analytical Tool (MEPSnet/IC) provided online by the Agency for Healthcare Research and Quality

  • Housing: captures the likely cost of rental housing in a given area in 2014 using HUD Fair Market Rents (FMR) estimates. The FMR estimates are produced at the sub-county and county level.
    • NOTE: County FMRs were obtained by aggregating sub-county estimates (where sub-county estimates existed) using a population-weighted average.
    • NOTE: The FMR estimates include utility costs and vary depending on the number of bedrooms in each unit, from zero to four bedrooms. We assumed that a one adult family would rent a single occupancy unit (zero bedrooms)
  • Transportation: Transportation costs cover operational expenses such as fuel and routine maintenance as well as vehicle financing and vehicle insurance but do not include the costs of purchasing a new automobile. These costs were further adjusted for regional differences using annual expenditure shares reported by region
    • Data Source: 2014 Consumer Expenditure Survey, Table 1400
    • Regional Diff: 2014 Consumer Expenditure Survey, Table 1800
    • Inflation from 2013 to 2015 dollars: Inflation multiplier for 2010 = 1.092609, 2011 = 1.059176, 2012 = 1.037701, and 2013 = 1.022721. BLS inflation calculator is available at: http://www.bls.gov/data/inflation_calculator.htm
  • Other: Expenditures for other necessities are based on 2013 data by household size from the 2014 Bureau of Labor Statistics Consumer Expenditure Survey including: (1) Apparel and services, (2) Housekeeping supplies, (3) Personal care products and services, (4) Reading, and (5) Miscellaneous.
    • Data Source: 2014 Consumer Expenditure Survey, Table 1400
    • Regional Diff: 2014 Consumer Expenditure Survey, Table 1800
    • Inflation from 2013 to 2015 dollars: Inflation multiplier for 2010 = 1.092609, 2011 = 1.059176, 2012 = 1.037701, and 2013 = 1.022721. BLS inflation calculator is available at: http://www.bls.gov/data/inflation_calculator.htm
  • Taxes: Estimates for payroll taxes, state income tax, and federal income tax rates are included in the calculation of a living wage.
    • NOTE: Property and sales taxes are estimated under housing costs and 'other' costs
    • Payroll tax is a nationally representative rate as specified in the Federal Insurance Contributions Act.
      • Data Source: The payroll tax rate (Social Security and Medicare taxes) is 6.2% of total wages as of 2014.
    • The state tax rate is taken from the second lowest income tax rate for 2011 for the state as reported by the CCH State Tax Handbook (the lowest bracket was used if the second lowest bracket was for incomes of over 30,000 dollars) (we assume no deductions).
      • Data Source: State income tax rates are for the 2011 tax year. These rates were taken from the 2011 CCH Tax Handbook (various organizations provide the CCH State Tax Handbook rates (including The Tax Foundation))
    • The federal income tax rate is calculated as a percentage of total income based on the average tax paid by median-income four-person families as reported by the Tax Policy Center of the Urban Institute and Brookings Institution for 2013
      • Data Source: The Tax Policy Center reported that the average federal income tax rate for 2013 was 5.32%.

Saturday Oct 17th

Worked on the thesis a little, but didn't have much time. Did some research on methodology and realized I can make a simplifying assumption. The model calculates living wages for 12 different sets of families, based on the number of adults and children. Since each of the 12 combinations is a linear combination of adult and child costs from the model, and I am looking for trends and correlations, I will only calculate numbers based on a single adult. This has the side benefit of removing a variable from the model, child care costs.

I also looked over the methodology for the food cost variable, and the logic is simple: food costs are taken from the second cheapest meal plan in the USDA outline, and are taken to represent a national average. Each county takes this national average and weights it by a regional factor.

Thursday Oct 15th - Discussion

  • look at whether data is on county or state level (DONE)
  • expand on what presenting (DONE)
  • expand on more variables to correlate to (DONE)
  • Google Scholar / LexisNexis --> search for journal
