Data Science includes data management. Might we then call a DBA (Database Administrator) a kind of data scientist? Why not? Their specialty is warehousing data efficiently, so that the same information is not redundantly scattered.
In terms of rack space and data center security, of course we want redundancy, but in databases the potential for data corruption grows with every additional place the same information must be kept up to date. If a person changes their legal name, you don't want to have to break your primary key, which should be based on something less mutable.
Concepts of mutability versus immutability are important in data science. In consulting, I would often advertise spreadsheets as ideal for "what if" scenarios, but if the goal is to chronicle "what was", then the mutability of a spreadsheet becomes a liability. The bookkeeping community always encourages databases over spreadsheets when it comes to keeping a company or agency's books.
DBAs also concern themselves with missing data. If the data is increasingly full of holes, that's a sign the database may no longer be loved. DBAs engage in load balancing, meaning they must give priority to services most in demand. However "what's in demand" may be a changing vista.
In :import pandas as pd
    import numpy as np
In :rng_years = pd.period_range('1/1/2000', '1/1/2018', freq='Y')
People needing to divide a fiscal year starting in July into quarters are in luck with pandas. I've been looking for lunar-year and other periodic progressions. The whole timeline business still seems difficult, even with a proleptic Gregorian calendar plus UTC time zones.
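pandas handles the July-start fiscal year through its quarterly frequency aliases: `freq='Q-JUN'` means the fiscal year *ends* in June, so Q1 runs July through September. A minimal sketch (the starting date here is just an illustration):

```python
import pandas as pd

# Fiscal quarters for a year beginning in July:
# 'Q-JUN' anchors the fiscal year-end to June.
fiscal_q = pd.period_range('2017-07-01', periods=4, freq='Q-JUN')
print(fiscal_q)

# Each quarter's calendar start date confirms the July alignment.
print([str(q.start_time.date()) for q in fiscal_q])
```

Note the quarters are labeled by the fiscal year in which they end, so July-September 2017 prints as 2018Q1.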
In :head_count = np.random.randint(10, 35, size=19)
As usual, I'm recommending telling yourself a story, in this case about an exclusive party you've been hosting ever since 2000, all the way up to 2018. Once you get the interactive version of this Notebook, you'll be able to extend this record by as many more years as you want.
In :new_years_party = pd.DataFrame(head_count, index = rng_years, columns=["Attenders"])
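Extending the record by more years, as suggested above, might look like the following sketch. The variable names beyond those already in the notebook (`more_years`, `extension`, `longer_record`) are my own assumptions, and the head counts come from the same randomizer, so your numbers will differ:

```python
import numpy as np
import pandas as pd

# Rebuild the party record from the cells above.
rng_years = pd.period_range('1/1/2000', '1/1/2018', freq='Y')
head_count = np.random.randint(10, 35, size=19)
new_years_party = pd.DataFrame(head_count, index=rng_years,
                               columns=["Attenders"])

# Extend by three more years of (randomized) attendance.
more_years = pd.period_range('2019', '2021', freq='Y')
extension = pd.DataFrame(np.random.randint(10, 35, size=3),
                         index=more_years, columns=["Attenders"])

# pd.concat stacks the two frames along the shared PeriodIndex.
longer_record = pd.concat([new_years_party, extension])
print(longer_record.tail())
```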
DBAs who know SQL / noSQL will find pandas, especially its right-merge and other join possibilities, somewhat familiar. We learn about the set type through maths and through Python, and come to understand unions, intersections, and differences.
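Python's built-in set operators cover exactly those three ideas. A quick sketch with made-up guest lists:

```python
# Two guest lists as Python sets (names are illustrative only).
a = {"alice", "bob", "carol"}
b = {"bob", "carol", "dave"}

print(a | b)  # union: everyone on either list
print(a & b)  # intersection: on both lists
print(a - b)  # difference: on a's list but not b's
```

SQL joins map onto the same vocabulary: an inner join keeps the intersection of keys, an outer join the union.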
We did a fair amount of practicing with merge, appreciating that pandas pays a lot of attention to the DataFrame labels, synchronizing along indexes and columns, creating NaN empty cells where needed.
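That label-synchronizing behavior is easy to see in a small sketch. The two frames and their columns here are invented for illustration; an outer merge keeps every key from both sides and fills NaN where only one side has a row:

```python
import pandas as pd

left = pd.DataFrame({"guest": ["alice", "bob"],
                     "rsvp": ["yes", "no"]})
right = pd.DataFrame({"guest": ["bob", "carol"],
                      "plus_one": [1, 0]})

# Outer merge: union of keys, NaN wherever a guest
# appears in only one of the two frames.
merged = pd.merge(left, right, on="guest", how="outer")
print(merged)
```

Only "bob" appears in both frames, so "alice" gets NaN for plus_one and "carol" gets NaN for rsvp.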
We're spared a lot of programming, and yet these patchings-together can still become messy and disorganized. At least the steps are chronicled. That's one more reason spreadsheets are not a good idea: you lose your audit trail, and there's no good way to find and debug your mistakes.
Keep the whole pipeline in view, from raw data sources through numerous cleaning and filtering steps. The linked YouTube video is a good example: the data scientist vastly shrinks the data needed by weeding out what's irrelevant. Data science is all about dismissing the irrelevant, which takes work, real energy.
Out:
     Attenders
2000        31
2001        16
2002        11
2003        10
2004        24
2005        14
2006        26
2007        18
2008        26
2009        25
2010        10
2011        17
2012        19
2013        12
2014        33
2015        16
2016        24
2017        14
2018        33
What's the average number of party-goers over this nineteen-year period?
Might you also want the median and mode? Do you remember what those are?
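Presumably a cell much like the following produced the output shown next. The head counts are copied from the printed table above; since the notebook generated them randomly, your own numbers would differ:

```python
import pandas as pd

# Head counts transcribed from the party table (2000-2018).
head_count = [31, 16, 11, 10, 24, 14, 26, 18, 26, 25,
              10, 17, 19, 12, 33, 16, 24, 14, 33]
attenders = pd.Series(head_count)

print(attenders.mean())    # the average
print(attenders.median())  # the middle value when sorted
print(attenders.mode())    # every value tied for most frequent
```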
Out:0    10
    1    14
    2    16
    3    24
    4    26
    5    33
    dtype: int64
Now that seems strange. Isn't the mode of a column of numbers, a number?
We're looking at the numbers that appear most often. These six values are tied, each occurring the same number of times, so rather than applying some tie-breaking rule, pandas returns them all.
That's not years of age, let's remember, but the number of guests clocked in just after midnight, still there for the beginning of the New Year. What other columns (features) were collected on these people? Do they know they were being surveilled? Would they recognize themselves if they saw this data? Or is this data totally made up? Come to think of it, we did use a randomizer, didn't we?