pandas Series Basics


In [3]:
%matplotlib inline
import sys
print(sys.version)
import numpy as np
print(np.__version__)
import pandas as pd
print(pd.__version__)
import matplotlib.pyplot as plt


3.3.2 (v3.3.2:d047928ae3f6, May 13 2013, 13:52:24) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
1.9.2
0.16.2

The Series is one of the foundations of pandas, as we saw in the previous video. It’s got a lot of helpful add-ons that bring more expressive power to the NumPy array.

We’re still going to be using randomly generated data in this video. Personally, I always get tired when we use a lot of made-up data, but I promise that we’re going to get to the good stuff very soon. It’s important to cover these bases before you get in over your head. I know it’s helped me a lot.

So let’s get started. You can see we’ve got our standard import. This is the import that I’ll be using from here on out: it prints the Python, NumPy, and pandas versions and sets some default styling, which I’ll get to when we cover plotting.

I’m going to generate 26 random integers between 1 and 20, inclusive.


In [4]:
np.random.seed(125)
raw_np_range = np.random.random_integers(1,20,26)
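
As a side note, random_integers is deprecated in newer NumPy releases. Here is a minimal sketch of the same kind of draw with np.random.randint, whose upper bound is exclusive. The name raw_np_range_alt is just illustrative, and I am not assuming the values match the ones in this video across NumPy versions.

```python
import numpy as np

np.random.seed(125)

# randint excludes the upper bound, so 21 is needed to allow the value 20
raw_np_range_alt = np.random.randint(1, 21, size=26)

print(len(raw_np_range_alt), raw_np_range_alt.min(), raw_np_range_alt.max())
```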

Now I’ll convert that into a pandas Series using pd.Series.from_array.


In [5]:
data = pd.Series.from_array(raw_np_range)

One thing to note is that we can just call pd.Series directly, which is more common and what I’ll be using from now on. Newer pandas releases have removed from_array, so pd.Series is the safer habit.


In [6]:
pd.Series(raw_np_range)


Out[6]:
0      3
1      4
2     15
3     14
4     12
5      1
6      6
7      3
8     14
9      1
10    19
11    13
12     1
13     3
14    10
15     4
16    13
17    16
18    13
19     8
20    10
21    10
22    18
23     5
24     5
25    12
dtype: int64

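Typically pandas will do its best to figure out the type of the data that you’re bringing in. When the values don’t share a type, like a string next to numbers, it falls back to the generic object dtype.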


In [7]:
pd.Series(['hello',1,1.0])


Out[7]:
0    hello
1        1
2        1
dtype: object

If the list mixes integers with a float, pandas will upcast everything to float64.


In [8]:
pd.Series([1.0,2,3,4,5])


Out[8]:
0    1
1    2
2    3
3    4
4    5
dtype: float64
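
To make the inference rules a bit more concrete, here is a small sketch that builds three Series and checks the dtype pandas picked for each. The variable names are made up for illustration.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])                # all integers
mixed_num = pd.Series([1.0, 2, 3])         # one float upcasts the rest
mixed_obj = pd.Series(['hello', 1, 1.0])   # no common numeric type

print(ints.dtype)        # an integer dtype
print(mixed_num.dtype)   # float64
print(mixed_obj.dtype)   # object
```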

You can also set the index when you create the Series. This lets us look up values by those labels.


In [9]:
pd.Series([1.0,2,3,4,5], index=['a','b','c','d','e'])


Out[9]:
a    1
b    2
c    3
d    4
e    5
dtype: float64
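
As a quick peek at what those labels buy us, here is a minimal sketch of looking values up by label. The names labeled, one, and some are only for illustration.

```python
import pandas as pd

labeled = pd.Series([1.0, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

one = labeled['c']            # a single label gives back a scalar
some = labeled[['a', 'e']]    # a list of labels gives back a smaller Series

print(one)
print(some)
```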

We can also override the data type, for example converting our original integer array to floats. We can do this with whichever NumPy data type we choose.


In [10]:
pd.Series(raw_np_range, dtype=np.float16)


Out[10]:
0      3
1      4
2     15
3     14
4     12
5      1
6      6
7      3
8     14
9      1
10    19
11    13
12     1
13     3
14    10
15     4
16    13
17    16
18    13
19     8
20    10
21    10
22    18
23     5
24     5
25    12
dtype: float16
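
There are two common ways to end up with a different dtype, sketched below under the assumption that a short hand-typed array stands in for our random one: pass dtype when building the Series, or convert afterwards with astype.

```python
import numpy as np
import pandas as pd

ints = np.array([3, 4, 15, 14, 12])  # stand-in for raw_np_range

as_half = pd.Series(ints, dtype=np.float16)      # set the dtype at construction
as_double = pd.Series(ints).astype(np.float64)   # or convert an existing Series

print(as_half.dtype, as_double.dtype)
```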

Now that we know how to get data into a Series, we can start using some Series methods.


In [11]:
data


Out[11]:
0      3
1      4
2     15
3     14
4     12
5      1
6      6
7      3
8     14
9      1
10    19
11    13
12     1
13     3
14    10
15     4
16    13
17    16
18    13
19     8
20    10
21    10
22    18
23     5
24     5
25    12
dtype: int64

First, it can be helpful to get the size of the Series. We can do this with len() or with the .shape property.


In [12]:
data.shape


Out[12]:
(26,)

In [13]:
len(data)


Out[13]:
26
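
The difference between the two is only in what comes back. A small sketch, using a short made-up Series:

```python
import pandas as pd

s = pd.Series([3, 4, 15, 14, 12])

n_rows = len(s)    # a plain Python int
shape = s.shape    # a one-element tuple, since a Series is one-dimensional

print(n_rows, shape)
```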

head and tail return the first and last n values of the Series, respectively. By default n is 5.


In [16]:
print(data.head())
print(data.tail())


0     3
1     4
2    15
3    14
4    12
dtype: int64
21    10
22    18
23     5
24     5
25    12
dtype: int64

However, we can ask for any number of items, like 10.


In [17]:
data.head(10)


Out[17]:
0     3
1     4
2    15
3    14
4    12
5     1
6     6
7     3
8    14
9     1
dtype: int64
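
Here is a small sketch of head and tail with an explicit n, on a made-up Series. Notice that the original index labels come along with the values.

```python
import pandas as pd

s = pd.Series(range(10, 20))   # values 10..19 with index 0..9

first_three = s.head(3)
last_two = s.tail(2)

print(first_three)
print(last_two)
```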

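Now, since we’ve got a list of numbers, we might want to take the mean, median, and mode. We can do that extremely easily with the mean, median, and mode methods. Note that mode returns a Series rather than a single number, because several values can tie for most frequent.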


In [18]:
data.mean()


Out[18]:
8.9615384615384617

In [19]:
data.median()


Out[19]:
10.0

In [20]:
data.mode()


Out[20]:
0     1
1     3
2    10
3    13
dtype: int64
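
To see why mode hands back a Series while the other two give single numbers, here is a minimal sketch on a tiny made-up Series where two values tie.

```python
import pandas as pd

s = pd.Series([1, 1, 2, 2, 3])

avg = s.mean()      # a single number
mid = s.median()    # a single number
modes = s.mode()    # a Series, because 1 and 2 both appear twice

print(avg, mid)
print(modes)
```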

We can also get the count of values. This is similar to shape, except that it returns a plain integer and only counts the non-null values.


In [21]:
data.count()


Out[21]:
26
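
Our data has no missing values, so count and len agree. A small sketch with a made-up Series that has a gap shows where they differ:

```python
import numpy as np
import pandas as pd

with_gap = pd.Series([1.0, np.nan, 3.0])

non_null = with_gap.count()   # skips the NaN
total = len(with_gap)         # counts every row

print(non_null, total)
```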

We can also find the unique values in a Series with the unique method.


In [22]:
data.unique()


Out[22]:
array([ 3,  4, 15, 14, 12,  1,  6, 19, 13, 10, 16,  8, 18,  5])
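
Two details worth a quick sketch: unique keeps values in order of first appearance, and nunique gives just the number of distinct values. The Series here is made up for illustration.

```python
import pandas as pd

s = pd.Series([3, 4, 3, 1, 4])

uniq = s.unique()      # NumPy array, in order of first appearance
n_uniq = s.nunique()   # just the count of distinct values

print(uniq, n_uniq)
```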

Now if we wanted a frequency distribution, it would be helpful to see all the counts. We can do that with the value_counts method. We can see, like we saw above, that 1, 3, 10, and 13 are all tied for the mode.


In [23]:
data.value_counts()


Out[23]:
13    3
10    3
3     3
1     3
14    2
12    2
5     2
4     2
19    1
18    1
16    1
15    1
8     1
6     1
dtype: int64
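
The link between value_counts and mode can be sketched like this: the modes are simply the values whose count equals the largest count. The Series and the names counts and top are made up for illustration.

```python
import pandas as pd

s = pd.Series([3, 3, 1, 1, 2])

counts = s.value_counts()                        # index = value, data = frequency
top = counts[counts == counts.max()].index      # values tied for most frequent

print(counts)
print(sorted(top))
print(s.mode().tolist())
```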

We can get a lot of these values and get a good sense of the data with the “describe” method. This method allows you to get a lot of key statistics about the data and is one that you’ll likely use every time you start working with a data set.


In [24]:
data.describe()


Out[24]:
count    26.000000
mean      8.961538
std       5.574806
min       1.000000
25%       4.000000
50%      10.000000
75%      13.000000
max      19.000000
dtype: float64
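
Since describe returns a Series itself, you can pull individual statistics out by label. A minimal sketch on a made-up Series:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

summary = s.describe()

# Each statistic is stored under a label such as 'mean' or '50%'
print(summary['mean'])
print(summary['50%'])   # the 50th percentile is the median
```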

Now, while we’re at it, I think this would be an appropriate time to show you our first graphical representation. We just created a frequency distribution, or the number of occurrences per value. It would be helpful to see that graphically as well, as a histogram.

We can make that very simply with the .hist() method.


In [25]:
data.hist()


Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x1078d6f50>
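
By default hist picks its own bins, so several values can share a bar. Here is a rough sketch, assuming you want one bar per integer from 1 to 20, that passes explicit bin edges and labels the axes. The Agg backend line is only an assumption for running this as a script; in the notebook, %matplotlib inline takes its place.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: running outside a notebook

import pandas as pd

s = pd.Series([3, 4, 15, 14, 12, 1, 6, 3, 14, 1])  # first ten values from above

# Edges 1, 2, ..., 21 give 20 bins, each one integer wide
ax = s.hist(bins=list(range(1, 22)))
ax.set_xlabel("value")
ax.set_ylabel("count")
```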

Now we’ve got our first graph! On that note we’ll end this video, but I hope you are starting to see how expressive pandas is. In the next video we will cover querying data in pandas Series through lookups, selections, and indexing.

