Home Sales Project so far

As many readers / listeners already know, Linh Da and I are in the process of looking to buy a house here in Los Angeles. I set out to collect historical data on property sales so that I could leverage the techniques of data science to help make a more informed decision about the price of homes.

Now, I had no illusions about the results of my model. I realize that there are many well-qualified data scientists who spend their entire careers working on home price prediction. I'm no more likely to outperform their results than I am to beat a Wall Street quant who has better and faster data access than I do, let alone the experience of studying that data seven days a week for an entire career.

This project really has four intentions:

  1. Give me the familiarity with the data that one gains only after a respectable amount of time spent playing with it
  2. Produce a tool that could assist in our home evaluation process
  3. Leverage the entire project as a means of teaching various techniques and pitfalls via Data Skeptic (podcast and otherwise)
  4. See how successful I could be organizing a volunteer project and fostering a community under the Data Skeptic banner

To be honest, we're off to a slow start in two respects.

First, data is nowhere near as available as I thought it might be. I started the project out of frustration at not being able to find the datasets I was seeking. I find only filtered, pruned, active listings. Modeling only these will surely introduce a significant bias.

I knew we would have a hard time acquiring this data, and it's proving more difficult than I expected. Second, the "feel free to volunteer to do any idea you like" approach has been a failure. Thankfully, Linh Da (a project manager, as keen listeners will know) is giving me some tips on how to better organize. I sent out a spreadsheet asking people to list their skills and interest in contributing. Very soon, I'm going to ask people to take responsibility for specific tasks in the overall project.

That means I also need to formalize this as a more concrete project with tangible objectives and goals. I hope to provide such an outline in this blog post.

One issue I've had so far in the project is that the majority of feedback takes the form of a would-be contributor doing a single Google search and sending me the top link, which undoubtedly doesn't provide what I'm looking for in the project. While I deeply appreciate the effort, I can do my own Googling. If people want to contribute, I'd like them to do so in measurably meaningful ways, so the first order of business for me is to set up a system that allows them to do that.

Thus, my first milestone goal is to set up a public-facing API that accepts property information submissions and returns them to requesters as well. Consider it a sort of wiki for property data. Anyone can consume the API and initiate CRUD (create, read, update, delete) operations.
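To make the CRUD idea concrete, here is a minimal sketch of the storage layer such an API might sit on top of. Everything in it is an assumption for illustration: the `PropertyStore` class, its method names, and the `address` / `sale_price` fields are hypothetical, and the data lives in memory rather than in a real database or behind a real web framework.

```python
from typing import Dict, Optional


class PropertyStore:
    """Hypothetical in-memory stand-in for the storage behind the API."""

    def __init__(self) -> None:
        self._records: Dict[int, dict] = {}
        self._next_id = 1

    def create(self, listing: dict) -> int:
        """Store a copy of the submitted listing and return its new id."""
        record_id = self._next_id
        self._next_id += 1
        self._records[record_id] = dict(listing)
        return record_id

    def read(self, record_id: int) -> Optional[dict]:
        """Return a copy of the listing, or None if the id is unknown."""
        record = self._records.get(record_id)
        return dict(record) if record is not None else None

    def update(self, record_id: int, changes: dict) -> bool:
        """Merge changes into an existing listing; False if it is missing."""
        if record_id not in self._records:
            return False
        self._records[record_id].update(changes)
        return True

    def delete(self, record_id: int) -> bool:
        """Remove a listing; False if there was nothing to remove."""
        return self._records.pop(record_id, None) is not None


# Example usage with made-up values.
store = PropertyStore()
listing_id = store.create({"address": "123 Example St", "sale_price": 500000})
store.update(listing_id, {"sale_price": 510000})
print(store.read(listing_id))
store.delete(listing_id)
```

A real version would put HTTP endpoints in front of these four methods and persist the records somewhere durable, but the shape of the operations would likely stay about the same.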

This, naturally, presents a few problems once created:

  • How do I avoid spam and vandalism?
  • How do I detect duplicate content?
  • How do we control and throttle access since I'll be running this API on services paid out of pocket?
  • What format shall the listings be stored in?

And from these questions, some other projects can arise as well, including:

  • Converters from existing formats to the Home Sales Project format
  • Visualization tools
  • Perhaps leveraging Torrent technology to distribute the data with streaming updates (wow, I hope we get to this one, it seems super interesting!)

These are the key questions I'll be tackling in the near future. If you're interested in contributing, join the Slack group, and in particular, check for our next live sessions and come participate.