PYT-DS SAISOFT

DATA SCIENCE WITH PYTHON

Where Have We Been, What Have We Seen?

Data Science includes Data Management. Does this mean we might call a DBA (Database Administrator) a kind of data scientist? Why not? Their specialty is warehousing data efficiently, meaning the same information is not redundantly scattered across tables.

When it comes to rack space and data center security, of course we want redundancy, but within a database the potential for data corruption grows with every additional place the same information must be kept up to date. If a person changes their legal name, you don't want to have to break your primary key, which should be based on something less mutable.
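Here's a minimal sketch of that idea in pandas. The table names, the `person_id` key, and the people in it are all hypothetical, made up for illustration. The name is stored once, keyed by an ID that never changes, so a name change touches a single cell.

```python
import pandas as pd

# Hypothetical tables: person_id is a surrogate key that never changes.
people = pd.DataFrame(
    {"person_id": [101, 102], "legal_name": ["Pat Jones", "Lee Smith"]}
).set_index("person_id")

# Attendance rows refer to people by key only, so the name lives in one place.
attendance = pd.DataFrame(
    {"person_id": [101, 102, 101], "year": [2016, 2016, 2017]}
)

# A legal name change touches exactly one cell.
people.loc[101, "legal_name"] = "Pat Rivera"

# Joining on the key picks up the new name for every past record.
report = attendance.join(people, on="person_id")
print(report)
```

Had the name itself been the key, every attendance row would have needed updating, and any row missed would be a quiet corruption.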

Concepts of mutability versus immutability are important in data science. In consulting, I would often advertise spreadsheets as ideal for "what if" scenarios, but if the goal is to chronicle "what was" then the mutability of a spreadsheet becomes a liability. The bookkeeping community always encourages databases over spreadsheets when it comes to keeping a company or agency's books.

DBAs also concern themselves with missing data. If the data is increasingly full of holes, that's a sign the database may no longer be loved. DBAs also engage in load balancing, meaning they must give priority to the services most in demand. However, "what's in demand" may be a changing vista.
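One way to put a number on "full of holes" is to measure the share of missing cells per column. This is only a sketch: the `records` frame and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical records with gaps; NaN or None marks a value nobody entered.
records = pd.DataFrame(
    {
        "attenders": [31, 16, np.nan, 10],
        "caterer": ["Acme", None, None, "Acme"],
    },
    index=[2000, 2001, 2002, 2003],
)

# isna() gives a frame of booleans; mean() turns each column into a fraction.
missing_share = records.isna().mean()
print(missing_share)
```

Tracking that fraction over time would show whether the holes are growing.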



In [5]:

    
import pandas as pd
import numpy as np



In [6]:

    
rng_years = pd.period_range('1/1/2000', '1/1/2018', freq='Y')

People who need to divide a fiscal year starting in July into quarters are in luck with pandas. I've been looking for lunar year and other periodic progressions. The whole timeline thing still seems difficult, even with a proleptic Gregorian calendar plus UTC timezones.
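For the fiscal year case, a sketch might look like the following. The choice of fiscal year 2018 is arbitrary. The `Q-JUN` frequency tells pandas the fiscal year ends in June, which is the same as saying it starts in July.

```python
import pandas as pd

# 'Q-JUN' means quarterly periods in a fiscal year that ends in June,
# so fiscal year 2018 runs from July 2017 through June 2018.
fiscal_quarters = pd.period_range("2018Q1", "2018Q4", freq="Q-JUN")

# Show each quarter with the calendar dates it starts and ends on.
for q in fiscal_quarters:
    print(q, q.start_time.date(), q.end_time.date())
```

Note that pandas labels the fiscal year by the calendar year in which it ends.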



In [8]:

    
head_count = np.random.randint(10,35, size=19)

As usual, I'm recommending telling yourself a story, in this case about an exclusive party you've been hosting ever since 2000, all the way up to 2018. Once you get the interactive version of this Notebook, you'll be able to extend this record by as many more years as you want.



In [16]:

    
new_years_party = pd.DataFrame(head_count, index = rng_years,
                               columns=["Attenders"])
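Extending the record past 2018 could go something like this. It's a sketch, not the only way: the seed, the three extra years, and the names `party`, `extra`, and `extended` are my own assumptions, and the head counts are as made up as the originals.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2018)  # seeded so reruns give the same made-up counts

# Rebuild a 2000-2018 record, along the lines of the cells above.
years = pd.period_range("2000", "2018", freq="Y")
party = pd.DataFrame(rng.integers(10, 35, size=len(years)),
                     index=years, columns=["Attenders"])

# Three more years of hypothetical head counts, appended with concat.
more_years = pd.period_range("2019", "2021", freq="Y")
extra = pd.DataFrame(rng.integers(10, 35, size=len(more_years)),
                     index=more_years, columns=["Attenders"])
extended = pd.concat([party, extra])
print(extended.tail())
```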

DBAs who know SQL or NoSQL will find pandas somewhat familiar, especially its inner, outer, left, and right merge possibilities. We learn about the set type through maths and through Python, and so understand unions, intersections, and differences.

We did a fair amount of practicing with merge, appreciating that pandas pays a lot of attention to the DataFrame labels, synchronizing along indexes and columns, creating NaN empty cells where needed.
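As a reminder of how the four merge types relate to set operations, here's a small sketch. The three head counts are copied from the party table shown in this notebook; the `bottles` frame is hypothetical.

```python
import pandas as pd

# Head counts copied from the party table; the bottle counts are made up.
attenders = pd.DataFrame({"year": [2015, 2016, 2017], "Attenders": [16, 24, 14]})
bottles = pd.DataFrame({"year": [2016, 2017, 2018], "Bottles": [5, 3, 8]})

# The set view: which years appear in both frames, and which in either.
both = set(attenders.year) & set(bottles.year)
either = set(attenders.year) | set(bottles.year)

# The same idea as merges: 'how' picks which keys survive.
merged = {how: pd.merge(attenders, bottles, on="year", how=how)
          for how in ("inner", "outer", "left", "right")}

for how, frame in merged.items():
    print(how, len(frame), "rows")
print(merged["outer"])  # NaN marks the cells with no matching row
```

The inner merge keeps the intersection of the keys, the outer merge keeps the union, and left and right keep whichever frame's keys you name.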

We're spared a lot of programming, and yet even so these patchings-together can become messy and disorganized. At least the steps are chronicled. That's why spreadsheets are not a good idea: you lose your audit trail, and there's no good way to find and debug your mistakes.

Keep the whole pipeline in view, from raw data sources through numerous cleaning and filtering steps. The linked YouTube video is a good example: the data scientist vastly shrinks the data needed by weeding out what's irrelevant. Data science is all about dismissing the irrelevant, which takes work, real energy.
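A pipeline kept in view might be sketched as a single method chain, where each cleaning or filtering step is one visible line. The `raw` frame, its columns, and its values are hypothetical, invented for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical raw feed: more rows and columns than the question needs.
raw = pd.DataFrame({
    "year": [2016, 2016, 2017, 2017, 2018],
    "event": ["new_years", "picnic", "new_years", "picnic", "new_years"],
    "head_count": [24, 40, 14, np.nan, 33],
    "notes": ["", "rained", "", "cancelled", ""],
})

# Each step is a named, repeatable line, so the audit trail is the code itself.
clean = (
    raw
    .dropna(subset=["head_count"])        # drop rows with no count recorded
    .query("event == 'new_years'")        # keep only the event being studied
    .loc[:, ["year", "head_count"]]       # discard columns irrelevant here
    .astype({"head_count": int})          # counts are whole numbers
    .set_index("year")
)
print(clean)
```

The raw frame is never modified, so any step can be rerun or questioned later.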



In [17]:

    
new_years_party

What's the average number of party-goers over this 19-year period?



In [28]:

    
np.round(new_years_party.Attenders.mean())

    Out[28]:

20.0

Might you also want the median and mode? Do you remember what those are?



In [29]:

    
new_years_party.Attenders.mode()

    Out[29]:

0    10
1    14
2    16
3    24
4    26
5    33
dtype: int64

Now that seems strange. Isn't the mode of a column of numbers a number?

We're looking at the numbers that appear most often. Six values tie for first place, each appearing twice, and pandas does not break the tie: the mode method returns every tied value, in sorted order.
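To check what the mode is reporting, we can count occurrences ourselves. This sketch retypes the head counts from the table this notebook displays; the names `counts`, `top`, and `tied` are my own.

```python
import pandas as pd

# Head counts for 2000-2018, copied from the table this notebook displays.
attenders = pd.Series(
    [31, 16, 11, 10, 24, 14, 26, 18, 26, 25, 10, 17, 19, 12, 33, 16, 24, 14, 33]
)

counts = attenders.value_counts()   # how many times each head count occurs
top = counts.max()                  # the highest frequency
tied = sorted(counts[counts == top].index)

print(top, tied)
print(attenders.mode().tolist())    # every value tied for most frequent, sorted
```

With random integers and only 19 rows, ties like this are to be expected.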



In [30]:

    
new_years_party.Attenders.median()

    Out[30]:

18.0

That's not years of age, let's remember, but the head count clocked just after midnight, so those still there for the beginning of the New Year. What other columns (features) were collected on these people? Did they know they were being surveilled? Would they recognize themselves, even if they saw this data? Or is this data totally made up? Come to think of it, we did use a randomizer, now didn't we?

	Attenders
2000	31
2001	16
2002	11
2003	10
2004	24
2005	14
2006	26
2007	18
2008	26
2009	25
2010	10
2011	17
2012	19
2013	12
2014	33
2015	16
2016	24
2017	14
2018	33