PYT-DS SAISOFT

DATA SCIENCE WITH PYTHON

Where Have We Been, What Have We Seen?

Data Science includes Data Management. That means we might call a DBA (Database Administrator) a kind of data scientist. Why not? Their speciality is warehousing data efficiently, meaning the same information is not redundantly scattered.

In terms of rack space and data center security, of course we want redundancy, but in databases the risk of data corruption grows with every additional place the same information must be kept up to date. If a person changes their legal name, you don't want to have to break your primary key, which should be based on something less mutable.
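
Here is a minimal sketch of that primary key idea, using plain dictionaries instead of a real database. The table names, the `person_id` key and the guests themselves are all made up for illustration.

```python
# Hypothetical tables: the surrogate key person_id never changes,
# while the legal name is stored in exactly one place.
people = {
    101: {"legal_name": "Pat Smith"},
    102: {"legal_name": "Lee Wong"},
}

# Other tables refer to people only by key, never by name.
rsvps = [
    {"person_id": 101, "year": 2017},
    {"person_id": 101, "year": 2018},
    {"person_id": 102, "year": 2018},
]

# A legal name change is a single update; no RSVP row is touched.
people[101]["legal_name"] = "Pat Jones"

# Look the current names up through the key when they are needed.
names_2018 = sorted(
    people[row["person_id"]]["legal_name"]
    for row in rsvps
    if row["year"] == 2018
)
print(names_2018)
```

Had the name itself been the key, every RSVP row would have needed the same edit, and any row we missed would now point at nobody.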

Concepts of mutability versus immutability are important in data science. In consulting, I would often advertise spreadsheets as ideal for "what if" scenarios, but if the goal is to chronicle "what was" then the mutability of a spreadsheet becomes a liability. The bookkeeping community always encourages databases over spreadsheets when it comes to keeping a company or agency's books.
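
Python lets us rehearse the "what was" discipline in miniature. The sketch below assumes a hypothetical `LedgerEntry` record with invented amounts: entries are frozen, so a mistake gets corrected by appending a new entry, not by overwriting an old one.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class LedgerEntry:
    """One immutable line in a hypothetical set of books."""
    date: str
    account: str
    amount_cents: int


# A tuple of frozen entries: neither the sequence nor its rows can be edited.
ledger = (
    LedgerEntry("2018-01-02", "catering", -45000),
    LedgerEntry("2018-01-05", "ticket sales", 120000),
)

# Trying to rewrite history fails loudly.
try:
    ledger[0].amount_cents = -40000
    edit_succeeded = True
except FrozenInstanceError:
    edit_succeeded = False

# The correction is recorded as a new entry, so the audit trail survives.
ledger = ledger + (LedgerEntry("2018-01-09", "catering", 5000),)

balance = sum(entry.amount_cents for entry in ledger)
print(edit_succeeded, balance)
```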

DBAs also concern themselves with missing data. If the data is increasingly full of holes, that's a sign the database may no longer be loved. DBAs engage in load balancing, meaning they must give priority to services most in demand. However, "what's in demand" may be a changing vista.
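
One way to keep an eye on the holes is to measure them. This sketch assumes a small, invented guest table; `isna()` marks the missing cells and `mean()` turns the marks into a share per column.

```python
import numpy as np
import pandas as pd

# Hypothetical guest table with some holes in it.
guests = pd.DataFrame(
    {
        "name": ["Pat", "Lee", "Sam", "Kim"],
        "email": ["pat@example.com", None, None, "kim@example.com"],
        "age": [34, np.nan, 41, np.nan],
    }
)

# Share of missing cells per column: 0.0 means complete, 1.0 means empty.
missing_share = guests.isna().mean()
print(missing_share)
```

Tracking that share over time would show whether the table is getting more or less loved.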


In [5]:
import pandas as pd
import numpy as np

In [6]:
rng_years = pd.period_range('1/1/2000', '1/1/2018', freq='Y')

People needing to divide a fiscal year starting in July into quarters are in luck with pandas. I've been looking for lunar year and other periodic progressions. The whole timeline thing still seems difficult, even with a proleptic Gregorian calendar plus UTC time zones.
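
As a sketch of the fiscal year case: the quarterly frequency `Q-JUN` means "quarters of a year that ends in June", so the first quarter starts in July. The choice of fiscal year 2018 is just an example; pandas labels the fiscal year by the calendar year in which it ends.

```python
import pandas as pd

# Four quarters of the fiscal year ending June 2018.
fiscal_quarters = pd.period_range("2018Q1", "2018Q4", freq="Q-JUN")

for quarter in fiscal_quarters:
    print(quarter, quarter.start_time.date(), quarter.end_time.date())

# Any date can be placed in its fiscal quarter the same way.
party_day = pd.Period("2017-12-31", freq="Q-JUN")
print(party_day, party_day.qyear, party_day.quarter)
```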


In [8]:
head_count = np.random.randint(10,35, size=19)
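
A side note on reproducibility: the cell above draws fresh numbers on every run. Note that `randint` excludes its upper bound, so 34 is the largest head count it can produce. If you want the same party every time, one option is a seeded generator, sketched here with NumPy's newer `default_rng` interface (NumPy 1.17 or later). The seed value is arbitrary.

```python
import numpy as np

seed = 2018  # arbitrary; any fixed integer gives a repeatable sequence

# integers() excludes the upper bound, just like randint().
rng = np.random.default_rng(seed)
seeded_head_count = rng.integers(10, 35, size=19)

# A second generator with the same seed reproduces the same numbers.
repeat = np.random.default_rng(seed).integers(10, 35, size=19)

print(seeded_head_count)
print(np.array_equal(seeded_head_count, repeat))
```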

As usual, I'm recommending telling yourself a story, in this case about an exclusive party you've been hosting ever since 2000, all the way up to 2018. Once you get the interactive version of this Notebook, you'll be able to extend this record by as many more years as you want.


In [16]:
new_years_party = pd.DataFrame(head_count, index = rng_years,
                               columns=["Attenders"])
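
Extending the record by a few more years might look like the sketch below. It rebuilds a stand-in for the table so that it runs on its own; the names `party`, `more_party` and `extended`, the seed, and the three extra years are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability

# Stand-in for the original record, 2000 through 2018.
years = pd.period_range("2000", "2018", freq="Y")
party = pd.DataFrame(rng.integers(10, 35, size=len(years)),
                     index=years, columns=["Attenders"])

# Three more years of made-up head counts.
more_years = pd.period_range("2019", "2021", freq="Y")
more_party = pd.DataFrame(rng.integers(10, 35, size=len(more_years)),
                          index=more_years, columns=["Attenders"])

# Stack the two records; the yearly index simply continues.
extended = pd.concat([party, more_party])
print(extended.tail())
```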

DBAs who know SQL / noSQL will find pandas somewhat familiar, especially its inner, outer, left and right merge possibilities. We learn about the set type through maths and through Python, and come to understand unions, intersections and differences.
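
The set vocabulary maps onto the join vocabulary fairly directly. A small sketch with two invented guest lists:

```python
# Hypothetical guest lists for two consecutive parties.
guests_2017 = {"Pat", "Lee", "Sam"}
guests_2018 = {"Lee", "Sam", "Kim"}

both_years = guests_2017 & guests_2018    # intersection: who an inner join keeps
either_year = guests_2017 | guests_2018   # union: who an outer join keeps
only_2017 = guests_2017 - guests_2018     # difference: came once, not again
only_2018 = guests_2018 - guests_2017

print(sorted(both_years))
print(sorted(either_year))
print(sorted(only_2017), sorted(only_2018))
```

A left merge keeps everyone in the first set and a right merge keeps everyone in the second, whether or not they have a match on the other side.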

We did a fair amount of practicing with merge, appreciating that pandas pays a lot of attention to the DataFrame labels, synchronizing along indexes and columns, creating NaN empty cells where needed.
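
For review, here is a sketch of the four merge styles on two tiny, invented tables. The column names are assumptions; `indicator=True` adds a `_merge` column saying which side each row came from.

```python
import pandas as pd

# Hypothetical RSVP tables sharing a "guest" column.
rsvp_2017 = pd.DataFrame({"guest": ["Pat", "Lee", "Sam"],
                          "dish_2017": ["pie", "salad", "chips"]})
rsvp_2018 = pd.DataFrame({"guest": ["Lee", "Sam", "Kim"],
                          "dish_2018": ["soup", "chips", "cake"]})

# The outer merge keeps every guest and fills the gaps with NaN.
outer = rsvp_2017.merge(rsvp_2018, on="guest", how="outer", indicator=True)
print(outer)

# Row counts for each kind of merge.
row_counts = {}
for how in ("inner", "left", "right", "outer"):
    row_counts[how] = len(rsvp_2017.merge(rsvp_2018, on="guest", how=how))
print(row_counts)
```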

We're spared a lot of programming, and yet even so these patchings-together can become messy and disorganized. At least the steps are chronicled. That's why spreadsheets are not a good idea for this kind of work. You lose your audit trail. There's no good way to find and debug your mistakes.

Keep the whole pipeline in view, from raw data sources through numerous cleaning and filtering steps. The linked YouTube video is a good example: the data scientist vastly shrinks the data needed by weeding out what's irrelevant. Data science is all about dismissing the irrelevant, which takes work, real energy.


In [17]:
new_years_party


Out[17]:
Attenders
2000 31
2001 16
2002 11
2003 10
2004 24
2005 14
2006 26
2007 18
2008 26
2009 25
2010 10
2011 17
2012 19
2013 12
2014 33
2015 16
2016 24
2017 14
2018 33

What's the average number of party-goers over this 19-year period?


In [28]:
np.round(new_years_party.Attenders.mean())


Out[28]:
20.0

Might you also want the median and mode? Do you remember what those are?


In [29]:
new_years_party.Attenders.mode()


Out[29]:
0    10
1    14
2    16
3    24
4    26
5    33
dtype: int64

Now that seems strange. Isn't the mode of a column of numbers a number?

We're looking at the numbers that appear most often. Six values tie for first place, each appearing twice in the table above. Rather than apply a tie-breaking rule, pandas returns every tied value, sorted in ascending order.
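
We can check the tie by counting. This sketch copies the head counts from Out[17] into a plain list and uses only the standard library. It assumes Python 3.8 or later, where `statistics.mode` does apply a tie-breaking rule: it returns whichever mode it encounters first.

```python
from collections import Counter
from statistics import mode, multimode

# Head counts copied from Out[17], 2000 through 2018.
attenders = [31, 16, 11, 10, 24, 14, 26, 18, 26, 25,
             10, 17, 19, 12, 33, 16, 24, 14, 33]

# How often does each head count occur, and which ones share the top spot?
counts = Counter(attenders)
top_count = max(counts.values())
tied = sorted(value for value, n in counts.items() if n == top_count)
print(top_count, tied)

# multimode() lists every tied value; mode() picks the first one encountered.
print(sorted(multimode(attenders)))
print(mode(attenders))
```

So pandas and `multimode` report all six values, while `mode` settles on 16 only because 16 happens to show up first in the record.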


In [30]:
new_years_party.Attenders.median()


Out[30]:
18.0
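
As a sanity check on the mean and median, here is the same arithmetic done by hand on the numbers from Out[17], using only the standard library.

```python
from statistics import median

# Head counts copied from Out[17], 2000 through 2018.
attenders = [31, 16, 11, 10, 24, 14, 26, 18, 26, 25,
             10, 17, 19, 12, 33, 16, 24, 14, 33]

n_years = len(attenders)
total = sum(attenders)
average = total / n_years

# The median is the middle value once the list is sorted.
middle = sorted(attenders)[n_years // 2]

print(n_years, total, round(average, 3))
print(middle, median(attenders))
```

With 19 values the middle one is the tenth in sorted order, leaving nine head counts below it and nine above.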

That's not years of age, let's remember, but the number clocked in at just after midnight, so still there for the beginning of the New Year. What other columns (features) were collected on these people? Do they know they were being surveilled? Would they recognize themselves, even if they saw this data? Or is this data totally made up? Come to think of it, we did use a randomizer now, didn't we?