This is a hands on workshop aimed at getting you comfortable with the the syntax of core data analysis concepts in Python. Some background in base Python is useful, but not required to learn from this workshop.
| Mac | PC | |
|---|---|---|
| Command Mode | esc | |
| Delete | d, d | |
| Markdown | m | |
| Run Cell | control, return | |
| Run Cell and Insert Below | option, return | |
| Insert Above | a | |
| Insert Below | b | |
| Edit Mode | return | Enter |
| Run Cell | control, return | |
| Run Cell and Insert Below | option, return |
In [1]:
# Import the pandas and numpy libraries
import pandas as pd
import numpy as np
In [2]:
# Read a file with an absolute path
ufo = pd.read_csv('/Users/josiahdavis/Documents/GitHub/python_data_analysis/ufo_sightings.csv')
In [50]:
# Alterntively, read the the file using a relative path
ufo = pd.read_csv('ufo_sightings.csv')
In [51]:
# Alterntively read in the file from the internet
ufo = pd.read_csv('https://raw.githubusercontent.com/josiahdavis/python_data_analysis/master/ufo_sightings.csv')
In [52]:
# Get help on a function
help(pd.read_csv)
Help on function read_csv in module pandas.io.parsers:
read_csv(filepath_or_buffer, sep=',', dialect=None, compression=None, doublequote=True, escapechar=None, quotechar='"', quoting=0, skipinitialspace=False, lineterminator=None, header='infer', index_col=None, names=None, prefix=None, skiprows=None, skipfooter=None, skip_footer=0, na_values=None, na_fvalues=None, true_values=None, false_values=None, delimiter=None, converters=None, dtype=None, usecols=None, engine=None, delim_whitespace=False, as_recarray=False, na_filter=True, compact_ints=False, use_unsigned=False, low_memory=True, buffer_lines=None, warn_bad_lines=True, error_bad_lines=True, keep_default_na=True, thousands=None, comment=None, decimal='.', parse_dates=False, keep_date_col=False, dayfirst=False, date_parser=None, memory_map=False, float_precision=None, nrows=None, iterator=False, chunksize=None, verbose=False, encoding=None, squeeze=False, mangle_dupe_cols=True, tupleize_cols=False, infer_datetime_format=False, skip_blank_lines=True)
Read CSV (comma-separated) file into DataFrame
Also supports optionally iterating or breaking of the file
into chunks.
Parameters
----------
filepath_or_buffer : string or file handle / StringIO
The string could be a URL. Valid URL schemes include
http, ftp, s3, and file. For file URLs, a
host is expected. For instance, a local file could be
file ://localhost/path/to/table.csv
sep : string, default ','
Delimiter to use. If sep is None, will try to automatically determine
this. Regular expressions are accepted.
engine : {'c', 'python'}
Parser engine to use. The C engine is faster while the python engine is
currently more feature-complete.
lineterminator : string (length 1), default None
Character to break file into lines. Only valid with C parser
quotechar : string (length 1)
The character used to denote the start and end of a quoted item. Quoted
items can include the delimiter and it will be ignored.
quoting : int or csv.QUOTE_* instance, default None
Control field quoting behavior per ``csv.QUOTE_*`` constants. Use one of
QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).
Default (None) results in QUOTE_MINIMAL behavior.
skipinitialspace : boolean, default False
Skip spaces after delimiter
escapechar : string (length 1), default None
One-character string used to escape delimiter when quoting is QUOTE_NONE.
dtype : Type name or dict of column -> type
Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32}
(Unsupported with engine='python')
compression : {'gzip', 'bz2', None}, default None
For on-the-fly decompression of on-disk data
dialect : string or csv.Dialect instance, default None
If None defaults to Excel dialect. Ignored if sep longer than 1 char
See csv.Dialect documentation for more details
header : int, list of ints
Row number(s) to use as the column names, and the start of the
data. Defaults to 0 if no ``names`` passed, otherwise ``None``. Explicitly
pass ``header=0`` to be able to replace existing names. The header can be
a list of integers that specify row locations for a multi-index on the
columns E.g. [0,1,3]. Intervening rows that are not specified will be
skipped (e.g. 2 in this example are skipped). Note that this parameter
ignores commented lines and empty lines if ``skip_blank_lines=True``, so header=0
denotes the first line of data rather than the first line of the file.
skiprows : list-like or integer
Line numbers to skip (0-indexed) or number of lines to skip (int)
at the start of the file
index_col : int or sequence or False, default None
Column to use as the row labels of the DataFrame. If a sequence is given, a
MultiIndex is used. If you have a malformed file with delimiters at the end
of each line, you might consider index_col=False to force pandas to _not_
use the first column as the index (row names)
names : array-like
List of column names to use. If file contains no header row, then you
should explicitly pass header=None
prefix : string, default None
Prefix to add to column numbers when no header, e.g 'X' for X0, X1, ...
na_values : list-like or dict, default None
Additional strings to recognize as NA/NaN. If dict passed, specific
per-column NA values
true_values : list
Values to consider as True
false_values : list
Values to consider as False
keep_default_na : bool, default True
If na_values are specified and keep_default_na is False the default NaN
values are overridden, otherwise they're appended to
parse_dates : boolean, list of ints or names, list of lists, or dict
If True -> try parsing the index.
If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column.
If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.
{'foo' : [1, 3]} -> parse columns 1, 3 as date and call result 'foo'
A fast-path exists for iso8601-formatted dates.
keep_date_col : boolean, default False
If True and parse_dates specifies combining multiple columns then
keep the original columns.
date_parser : function
Function to use for converting a sequence of string columns to an
array of datetime instances. The default uses dateutil.parser.parser
to do the conversion.
dayfirst : boolean, default False
DD/MM format dates, international and European format
thousands : str, default None
Thousands separator
comment : str, default None
Indicates remainder of line should not be parsed. If found at the
beginning of a line, the line will be ignored altogether. This parameter
must be a single character. Like empty lines (as long as ``skip_blank_lines=True``),
fully commented lines are ignored by the parameter `header`
but not by `skiprows`. For example, if comment='#', parsing
'#empty\na,b,c\n1,2,3' with `header=0` will result in 'a,b,c' being
treated as the header.
decimal : str, default '.'
Character to recognize as decimal point. E.g. use ',' for European data
nrows : int, default None
Number of rows of file to read. Useful for reading pieces of large files
iterator : boolean, default False
Return TextFileReader object
chunksize : int, default None
Return TextFileReader object for iteration
skipfooter : int, default 0
Number of lines at bottom of file to skip (Unsupported with engine='c')
converters : dict, default None
Dict of functions for converting values in certain columns. Keys can either
be integers or column labels
verbose : boolean, default False
Indicate number of NA values placed in non-numeric columns
delimiter : string, default None
Alternative argument name for sep. Regular expressions are accepted.
encoding : string, default None
Encoding to use for UTF when reading/writing (ex. 'utf-8'). `List of Python
standard encodings
<https://docs.python.org/3/library/codecs.html#standard-encodings>`_
squeeze : boolean, default False
If the parsed data only contains one column then return a Series
na_filter : boolean, default True
Detect missing value markers (empty strings and the value of na_values). In
data without any NAs, passing na_filter=False can improve the performance
of reading a large file
usecols : array-like
Return a subset of the columns.
Results in much faster parsing time and lower memory usage.
mangle_dupe_cols : boolean, default True
Duplicate columns will be specified as 'X.0'...'X.N', rather than 'X'...'X'
tupleize_cols : boolean, default False
Leave a list of tuples on columns as is (default is to convert to
a Multi Index on the columns)
error_bad_lines : boolean, default True
Lines with too many fields (e.g. a csv line with too many commas) will by
default cause an exception to be raised, and no DataFrame will be returned.
If False, then these "bad lines" will dropped from the DataFrame that is
returned. (Only valid with C parser)
warn_bad_lines : boolean, default True
If error_bad_lines is False, and warn_bad_lines is True, a warning for each
"bad line" will be output. (Only valid with C parser).
infer_datetime_format : boolean, default False
If True and parse_dates is enabled for a column, attempt to infer
the datetime format to speed up the processing
skip_blank_lines : boolean, default True
If True, skip over blank lines rather than interpreting as NaN values
Returns
-------
result : DataFrame or TextParser
In [53]:
ufo.head(10) # Look at the top 10 observations
Out[53]:
City
Colors Reported
Shape Reported
State
Time
0
Ithaca
NaN
TRIANGLE
NY
6/1/1930 22:00
1
Willingboro
NaN
OTHER
NJ
6/30/1930 20:00
2
Holyoke
NaN
OVAL
CO
2/15/1931 14:00
3
Abilene
NaN
DISK
KS
6/1/1931 13:00
4
New York Worlds Fair
NaN
LIGHT
NY
4/18/1933 19:00
5
Valley City
NaN
DISK
ND
9/15/1934 15:30
6
Crater Lake
NaN
CIRCLE
CA
6/15/1935 0:00
7
Alma
NaN
DISK
MI
7/15/1936 0:00
8
Eklutna
NaN
CIGAR
AK
10/15/1936 17:00
9
Hubbard
NaN
CYLINDER
OR
6/15/1937 0:00
In [55]:
ufo.tail() # Bottom x observations (defaults to 5)
Out[55]:
City
Colors Reported
Shape Reported
State
Time
80538
Neligh
NaN
CIRCLE
NE
9/4/2014 23:20
80539
Uhrichsville
NaN
LIGHT
OH
9/5/2014 1:14
80540
Tucson
RED BLUE
NaN
AZ
9/5/2014 2:40
80541
Orland park
RED
LIGHT
IL
9/5/2014 3:43
80542
Loughman
NaN
LIGHT
FL
9/5/2014 5:30
In [56]:
ufo.describe() # get summary statistics for columns
Out[56]:
City
Colors Reported
Shape Reported
State
Time
count
80496
17034
72141
80543
80543
unique
13504
31
27
52
68901
top
Seattle
ORANGE
LIGHT
CA
7/4/2014 22:00
freq
646
5216
16332
10743
45
In [57]:
ufo.index # "the index" (aka "the labels")
Out[57]:
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...], dtype='int64')
In [58]:
ufo.columns # column names (which is "an index")
Out[58]:
Index([u'City', u'Colors Reported', u'Shape Reported', u'State', u'Time'], dtype='object')
In [59]:
ufo.dtypes # data types of each column
Out[59]:
City object
Colors Reported object
Shape Reported object
State object
Time object
dtype: object
In [60]:
ufo.values # underlying numpy array
Out[60]:
array([['Ithaca', nan, 'TRIANGLE', 'NY', '6/1/1930 22:00'],
['Willingboro', nan, 'OTHER', 'NJ', '6/30/1930 20:00'],
['Holyoke', nan, 'OVAL', 'CO', '2/15/1931 14:00'],
...,
['Tucson', 'RED BLUE', nan, 'AZ', '9/5/2014 2:40'],
['Orland park', 'RED', 'LIGHT', 'IL', '9/5/2014 3:43'],
['Loughman', nan, 'LIGHT', 'FL', '9/5/2014 5:30']], dtype=object)
In [61]:
ufo.info() # concise summary
<class 'pandas.core.frame.DataFrame'>
Int64Index: 80543 entries, 0 to 80542
Data columns (total 5 columns):
City 80496 non-null object
Colors Reported 17034 non-null object
Shape Reported 72141 non-null object
State 80543 non-null object
Time 80543 non-null object
dtypes: object(5)
memory usage: 3.7+ MB
In [62]:
# Select a single column
ufo['State']
Out[62]:
0 NY
1 NJ
2 CO
3 KS
4 NY
5 ND
6 CA
7 MI
8 AK
9 OR
10 CA
11 AL
12 SC
13 IA
14 MI
...
80528 OH
80529 WA
80530 FL
80531 VA
80532 MA
80533 IA
80534 TX
80535 KY
80536 PA
80537 NE
80538 NE
80539 OH
80540 AZ
80541 IL
80542 FL
Name: State, Length: 80543, dtype: object
In [63]:
ufo.State # This is equivalent
Out[63]:
0 NY
1 NJ
2 CO
3 KS
4 NY
5 ND
6 CA
7 MI
8 AK
9 OR
10 CA
11 AL
12 SC
13 IA
14 MI
...
80528 OH
80529 WA
80530 FL
80531 VA
80532 MA
80533 IA
80534 TX
80535 KY
80536 PA
80537 NE
80538 NE
80539 OH
80540 AZ
80541 IL
80542 FL
Name: State, Length: 80543, dtype: object
In [64]:
# Select multiple columns
ufo[['State', 'City','Shape Reported']]
Out[64]:
State
City
Shape Reported
0
NY
Ithaca
TRIANGLE
1
NJ
Willingboro
OTHER
2
CO
Holyoke
OVAL
3
KS
Abilene
DISK
4
NY
New York Worlds Fair
LIGHT
5
ND
Valley City
DISK
6
CA
Crater Lake
CIRCLE
7
MI
Alma
DISK
8
AK
Eklutna
CIGAR
9
OR
Hubbard
CYLINDER
10
CA
Fontana
LIGHT
11
AL
Waterloo
FIREBALL
12
SC
Belton
SPHERE
13
IA
Keokuk
OVAL
14
MI
Ludington
DISK
15
CA
Forest Home
CIRCLE
16
CA
Los Angeles
NaN
17
GA
Hapeville
NaN
18
TN
Oneida
RECTANGLE
19
AK
Bering Sea
OTHER
20
NE
Nebraska
DISK
21
LA
NaN
NaN
22
LA
NaN
LIGHT
23
KY
Owensboro
RECTANGLE
24
WV
Wilderness
DISK
25
CA
San Diego
CIGAR
26
WV
Wilderness
DISK
27
NM
Clovis
DISK
28
NM
Los Alamos
DISK
29
UT
Ft. Duschene
DISK
...
...
...
...
80513
NJ
Manahawkin
CIRCLE
80514
MA
New Bedford
LIGHT
80515
VA
Woodbridge
TRIANGLE
80516
CA
Glendale
CYLINDER
80517
NH
Laconia
CIGAR
80518
PA
Langhorne
OTHER
80519
IL
Glen Ellyn
CIRCLE
80520
PA
Waynesburg
FLASH
80521
OH
Canal Winchester
TRIANGLE
80522
MA
Lawrence
LIGHT
80523
MD
Ellicott City
LIGHT
80524
WA
Olympia
LIGHT
80525
IA
Iowa City
LIGHT
80526
MA
Marshfield
LIGHT
80527
WA
Seattle
DIAMOND
80528
OH
North Royalton
TRIANGLE
80529
WA
Deer Park
LIGHT
80530
FL
Glen St. Mary
DISK
80531
VA
Abingdon
CIRCLE
80532
MA
Lawrence
LIGHT
80533
IA
Melbourne
OVAL
80534
TX
Burleson
LIGHT
80535
KY
Elizabethtown
CIRCLE
80536
PA
Wyoming
DISK
80537
NE
Neligh
CIRCLE
80538
NE
Neligh
CIRCLE
80539
OH
Uhrichsville
LIGHT
80540
AZ
Tucson
NaN
80541
IL
Orland park
LIGHT
80542
FL
Loughman
LIGHT
80543 rows × 3 columns
In [65]:
my_cols = ['State', 'City', 'Shape Reported']
ufo[my_cols] # This is equivalent
Out[65]:
State
City
Shape Reported
0
NY
Ithaca
TRIANGLE
1
NJ
Willingboro
OTHER
2
CO
Holyoke
OVAL
3
KS
Abilene
DISK
4
NY
New York Worlds Fair
LIGHT
5
ND
Valley City
DISK
6
CA
Crater Lake
CIRCLE
7
MI
Alma
DISK
8
AK
Eklutna
CIGAR
9
OR
Hubbard
CYLINDER
10
CA
Fontana
LIGHT
11
AL
Waterloo
FIREBALL
12
SC
Belton
SPHERE
13
IA
Keokuk
OVAL
14
MI
Ludington
DISK
15
CA
Forest Home
CIRCLE
16
CA
Los Angeles
NaN
17
GA
Hapeville
NaN
18
TN
Oneida
RECTANGLE
19
AK
Bering Sea
OTHER
20
NE
Nebraska
DISK
21
LA
NaN
NaN
22
LA
NaN
LIGHT
23
KY
Owensboro
RECTANGLE
24
WV
Wilderness
DISK
25
CA
San Diego
CIGAR
26
WV
Wilderness
DISK
27
NM
Clovis
DISK
28
NM
Los Alamos
DISK
29
UT
Ft. Duschene
DISK
...
...
...
...
80513
NJ
Manahawkin
CIRCLE
80514
MA
New Bedford
LIGHT
80515
VA
Woodbridge
TRIANGLE
80516
CA
Glendale
CYLINDER
80517
NH
Laconia
CIGAR
80518
PA
Langhorne
OTHER
80519
IL
Glen Ellyn
CIRCLE
80520
PA
Waynesburg
FLASH
80521
OH
Canal Winchester
TRIANGLE
80522
MA
Lawrence
LIGHT
80523
MD
Ellicott City
LIGHT
80524
WA
Olympia
LIGHT
80525
IA
Iowa City
LIGHT
80526
MA
Marshfield
LIGHT
80527
WA
Seattle
DIAMOND
80528
OH
North Royalton
TRIANGLE
80529
WA
Deer Park
LIGHT
80530
FL
Glen St. Mary
DISK
80531
VA
Abingdon
CIRCLE
80532
MA
Lawrence
LIGHT
80533
IA
Melbourne
OVAL
80534
TX
Burleson
LIGHT
80535
KY
Elizabethtown
CIRCLE
80536
PA
Wyoming
DISK
80537
NE
Neligh
CIRCLE
80538
NE
Neligh
CIRCLE
80539
OH
Uhrichsville
LIGHT
80540
AZ
Tucson
NaN
80541
IL
Orland park
LIGHT
80542
FL
Loughman
LIGHT
80543 rows × 3 columns
In [74]:
# Logical filtering
ufo[ufo.State == 'TX'] # Select only rows where State == 'TX'
Out[74]:
City
Colors Reported
Shape Reported
State
Time
37
Dallas
NaN
SPHERE
TX
7/15/1945 14:00
43
Alice
NaN
DISK
TX
3/15/1946 15:30
49
Conroe
NaN
OTHER
TX
1/10/1947 20:00
92
Borger
NaN
DISK
TX
6/15/1948 16:00
114
Post
NaN
DISK
TX
9/15/1949 21:00
115
San Marcos
NaN
CYLINDER
TX
10/10/1949 20:30
116
Lackland AFB
NaN
LIGHT
TX
10/10/1949 21:00
132
Fort Worth
NaN
DISK
TX
6/15/1950 20:00
134
Galveston
NaN
SPHERE
TX
6/20/1950 17:00
174
Greenville
ORANGE
TRIANGLE
TX
4/15/1952 16:00
189
Pecos
NaN
DISK
TX
6/30/1952 3:00
201
Arlington
NaN
DISK
TX
7/7/1952 13:00
213
Smithson's Valley
NaN
CIGAR
TX
8/15/1952 17:00
223
Oklahoma
NaN
CIGAR
TX
1/1/1953 14:00
235
Fort Worth
NaN
SPHERE
TX
6/15/1953 13:00
237
Galveston
NaN
CIGAR
TX
6/17/1953 7:30
252
Mauriceville
NaN
DISK
TX
10/15/1953 14:00
257
Lackland Air Force Base
NaN
DISK
TX
11/25/1953 4:25
267
Houston
NaN
DISK
TX
6/1/1954 0:00
280
Ellington AFB
NaN
DISK
TX
6/15/1954 16:30
283
Thrifty
NaN
CIRCLE
TX
6/25/1954 3:30
304
Beaumont
RED
DISK
TX
9/9/1954 12:30
307
Blanco
NaN
OTHER
TX
10/15/1954 16:30
310
Thelma
NaN
DISK
TX
11/14/1954 1:00
330
Lubbock
NaN
CIGAR
TX
6/30/1955 13:00
331
Dumas
NaN
LIGHT
TX
6/30/1955 21:00
340
Eagle Pass
NaN
CIGAR
TX
8/15/1955 0:01
359
Edroy
NaN
NaN
TX
6/15/1956 1:00
373
Houston
NaN
TRIANGLE
TX
6/30/1956 18:30
387
Edna
NaN
CIRCLE
TX
10/10/1956 21:00
...
...
...
...
...
...
79676
Austin
RED
NaN
TX
8/1/2014 20:55
79697
Fort Worth
BLUE
CIRCLE
TX
8/2/2014 5:58
79708
Von Ormy
NaN
LIGHT
TX
8/2/2014 21:20
79746
Cleburne
NaN
LIGHT
TX
8/3/2014 20:15
79763
Cypress
NaN
FIREBALL
TX
8/4/2014 0:00
79774
Port Neches
NaN
LIGHT
TX
8/5/2014 0:00
79790
Amarillo
NaN
TRIANGLE
TX
8/5/2014 23:00
79803
Allen
BLUE
RECTANGLE
TX
8/6/2014 22:30
79848
Allen
NaN
LIGHT
TX
8/9/2014 4:00
79874
San Angelo
NaN
LIGHT
TX
8/9/2014 23:30
79891
Katy
NaN
FIREBALL
TX
8/10/2014 23:00
79919
McKinney
RED GREEN
RECTANGLE
TX
8/13/2014 2:05
79921
Cypress
NaN
TRIANGLE
TX
8/13/2014 8:45
79933
Comfort
NaN
LIGHT
TX
8/13/2014 22:00
79938
Pearland
NaN
CONE
TX
8/14/2014 7:30
79965
Austin
NaN
CIRCLE
TX
8/15/2014 6:45
79969
Austin
RED
CIRCLE
TX
8/15/2014 18:45
79986
Houston
YELLOW
CIRCLE
TX
8/15/2014 23:30
80038
Borger
NaN
LIGHT
TX
8/17/2014 21:15
80075
Round Rock
ORANGE
LIGHT
TX
8/19/2014 1:32
80113
Menard
RED
LIGHT
TX
8/20/2014 21:40
80134
Morton
NaN
SPHERE
TX
8/21/2014 20:15
80141
Port Neches
GREEN BLUE
LIGHT
TX
8/21/2014 23:00
80240
New Braunfels
RED
LIGHT
TX
8/24/2014 21:05
80241
Atascocita
NaN
CIRCLE
TX
8/24/2014 21:15
80287
Lago Vista
NaN
TRIANGLE
TX
8/26/2014 3:35
80350
Houston
NaN
TRIANGLE
TX
8/28/2014 19:55
80414
Eustace
NaN
LIGHT
TX
8/30/2014 23:00
80476
Fort Worth
NaN
OTHER
TX
9/2/2014 16:05
80534
Burleson
NaN
LIGHT
TX
9/4/2014 22:30
4186 rows × 5 columns
In [76]:
ufo[~(ufo.State == 'TX')] # Select everything where the test fails
Out[76]:
City
Colors Reported
Shape Reported
State
Time
0
Ithaca
NaN
TRIANGLE
NY
6/1/1930 22:00
1
Willingboro
NaN
OTHER
NJ
6/30/1930 20:00
2
Holyoke
NaN
OVAL
CO
2/15/1931 14:00
3
Abilene
NaN
DISK
KS
6/1/1931 13:00
4
New York Worlds Fair
NaN
LIGHT
NY
4/18/1933 19:00
5
Valley City
NaN
DISK
ND
9/15/1934 15:30
6
Crater Lake
NaN
CIRCLE
CA
6/15/1935 0:00
7
Alma
NaN
DISK
MI
7/15/1936 0:00
8
Eklutna
NaN
CIGAR
AK
10/15/1936 17:00
9
Hubbard
NaN
CYLINDER
OR
6/15/1937 0:00
10
Fontana
NaN
LIGHT
CA
8/15/1937 21:00
11
Waterloo
NaN
FIREBALL
AL
6/1/1939 20:00
12
Belton
RED
SPHERE
SC
6/30/1939 20:00
13
Keokuk
NaN
OVAL
IA
7/7/1939 2:00
14
Ludington
NaN
DISK
MI
6/1/1941 13:00
15
Forest Home
NaN
CIRCLE
CA
7/2/1941 11:30
16
Los Angeles
NaN
NaN
CA
2/25/1942 0:00
17
Hapeville
NaN
NaN
GA
6/1/1942 22:30
18
Oneida
NaN
RECTANGLE
TN
7/15/1942 1:00
19
Bering Sea
RED
OTHER
AK
4/30/1943 23:00
20
Nebraska
NaN
DISK
NE
6/1/1943 15:00
21
NaN
NaN
NaN
LA
8/15/1943 0:00
22
NaN
NaN
LIGHT
LA
8/15/1943 0:00
23
Owensboro
NaN
RECTANGLE
KY
10/15/1943 11:00
24
Wilderness
NaN
DISK
WV
1/1/1944 10:00
25
San Diego
NaN
CIGAR
CA
1/1/1944 12:00
26
Wilderness
NaN
DISK
WV
1/1/1944 12:00
27
Clovis
NaN
DISK
NM
4/2/1944 11:00
28
Los Alamos
NaN
DISK
NM
6/1/1944 12:00
29
Ft. Duschene
NaN
DISK
UT
6/30/1944 10:00
...
...
...
...
...
...
80512
Malta
NaN
LIGHT
ID
9/4/2014 5:41
80513
Manahawkin
NaN
CIRCLE
NJ
9/4/2014 8:30
80514
New Bedford
NaN
LIGHT
MA
9/4/2014 9:47
80515
Woodbridge
NaN
TRIANGLE
VA
9/4/2014 16:30
80516
Glendale
NaN
CYLINDER
CA
9/4/2014 17:38
80517
Laconia
NaN
CIGAR
NH
9/4/2014 19:02
80518
Langhorne
NaN
OTHER
PA
9/4/2014 19:30
80519
Glen Ellyn
RED
CIRCLE
IL
9/4/2014 20:20
80520
Waynesburg
NaN
FLASH
PA
9/4/2014 20:25
80521
Canal Winchester
NaN
TRIANGLE
OH
9/4/2014 20:30
80522
Lawrence
ORANGE
LIGHT
MA
9/4/2014 20:30
80523
Ellicott City
NaN
LIGHT
MD
9/4/2014 20:45
80524
Olympia
RED
LIGHT
WA
9/4/2014 21:10
80525
Iowa City
BLUE
LIGHT
IA
9/4/2014 21:11
80526
Marshfield
NaN
LIGHT
MA
9/4/2014 21:15
80527
Seattle
NaN
DIAMOND
WA
9/4/2014 21:15
80528
North Royalton
RED
TRIANGLE
OH
9/4/2014 21:30
80529
Deer Park
NaN
LIGHT
WA
9/4/2014 22:00
80530
Glen St. Mary
NaN
DISK
FL
9/4/2014 22:00
80531
Abingdon
NaN
CIRCLE
VA
9/4/2014 22:05
80532
Lawrence
NaN
LIGHT
MA
9/4/2014 22:15
80533
Melbourne
NaN
OVAL
IA
9/4/2014 22:15
80535
Elizabethtown
NaN
CIRCLE
KY
9/4/2014 22:30
80536
Wyoming
RED
DISK
PA
9/4/2014 23:00
80537
Neligh
NaN
CIRCLE
NE
9/4/2014 23:20
80538
Neligh
NaN
CIRCLE
NE
9/4/2014 23:20
80539
Uhrichsville
NaN
LIGHT
OH
9/5/2014 1:14
80540
Tucson
RED BLUE
NaN
AZ
9/5/2014 2:40
80541
Orland park
RED
LIGHT
IL
9/5/2014 3:43
80542
Loughman
NaN
LIGHT
FL
9/5/2014 5:30
76357 rows × 5 columns
In [68]:
ufo[ufo.State != 'TX'] # Same thing as before
Out[68]:
City
Colors Reported
Shape Reported
State
Time
0
Ithaca
NaN
TRIANGLE
NY
6/1/1930 22:00
1
Willingboro
NaN
OTHER
NJ
6/30/1930 20:00
2
Holyoke
NaN
OVAL
CO
2/15/1931 14:00
3
Abilene
NaN
DISK
KS
6/1/1931 13:00
4
New York Worlds Fair
NaN
LIGHT
NY
4/18/1933 19:00
5
Valley City
NaN
DISK
ND
9/15/1934 15:30
6
Crater Lake
NaN
CIRCLE
CA
6/15/1935 0:00
7
Alma
NaN
DISK
MI
7/15/1936 0:00
8
Eklutna
NaN
CIGAR
AK
10/15/1936 17:00
9
Hubbard
NaN
CYLINDER
OR
6/15/1937 0:00
10
Fontana
NaN
LIGHT
CA
8/15/1937 21:00
11
Waterloo
NaN
FIREBALL
AL
6/1/1939 20:00
12
Belton
RED
SPHERE
SC
6/30/1939 20:00
13
Keokuk
NaN
OVAL
IA
7/7/1939 2:00
14
Ludington
NaN
DISK
MI
6/1/1941 13:00
15
Forest Home
NaN
CIRCLE
CA
7/2/1941 11:30
16
Los Angeles
NaN
NaN
CA
2/25/1942 0:00
17
Hapeville
NaN
NaN
GA
6/1/1942 22:30
18
Oneida
NaN
RECTANGLE
TN
7/15/1942 1:00
19
Bering Sea
RED
OTHER
AK
4/30/1943 23:00
20
Nebraska
NaN
DISK
NE
6/1/1943 15:00
21
NaN
NaN
NaN
LA
8/15/1943 0:00
22
NaN
NaN
LIGHT
LA
8/15/1943 0:00
23
Owensboro
NaN
RECTANGLE
KY
10/15/1943 11:00
24
Wilderness
NaN
DISK
WV
1/1/1944 10:00
25
San Diego
NaN
CIGAR
CA
1/1/1944 12:00
26
Wilderness
NaN
DISK
WV
1/1/1944 12:00
27
Clovis
NaN
DISK
NM
4/2/1944 11:00
28
Los Alamos
NaN
DISK
NM
6/1/1944 12:00
29
Ft. Duschene
NaN
DISK
UT
6/30/1944 10:00
...
...
...
...
...
...
80512
Malta
NaN
LIGHT
ID
9/4/2014 5:41
80513
Manahawkin
NaN
CIRCLE
NJ
9/4/2014 8:30
80514
New Bedford
NaN
LIGHT
MA
9/4/2014 9:47
80515
Woodbridge
NaN
TRIANGLE
VA
9/4/2014 16:30
80516
Glendale
NaN
CYLINDER
CA
9/4/2014 17:38
80517
Laconia
NaN
CIGAR
NH
9/4/2014 19:02
80518
Langhorne
NaN
OTHER
PA
9/4/2014 19:30
80519
Glen Ellyn
RED
CIRCLE
IL
9/4/2014 20:20
80520
Waynesburg
NaN
FLASH
PA
9/4/2014 20:25
80521
Canal Winchester
NaN
TRIANGLE
OH
9/4/2014 20:30
80522
Lawrence
ORANGE
LIGHT
MA
9/4/2014 20:30
80523
Ellicott City
NaN
LIGHT
MD
9/4/2014 20:45
80524
Olympia
RED
LIGHT
WA
9/4/2014 21:10
80525
Iowa City
BLUE
LIGHT
IA
9/4/2014 21:11
80526
Marshfield
NaN
LIGHT
MA
9/4/2014 21:15
80527
Seattle
NaN
DIAMOND
WA
9/4/2014 21:15
80528
North Royalton
RED
TRIANGLE
OH
9/4/2014 21:30
80529
Deer Park
NaN
LIGHT
WA
9/4/2014 22:00
80530
Glen St. Mary
NaN
DISK
FL
9/4/2014 22:00
80531
Abingdon
NaN
CIRCLE
VA
9/4/2014 22:05
80532
Lawrence
NaN
LIGHT
MA
9/4/2014 22:15
80533
Melbourne
NaN
OVAL
IA
9/4/2014 22:15
80535
Elizabethtown
NaN
CIRCLE
KY
9/4/2014 22:30
80536
Wyoming
RED
DISK
PA
9/4/2014 23:00
80537
Neligh
NaN
CIRCLE
NE
9/4/2014 23:20
80538
Neligh
NaN
CIRCLE
NE
9/4/2014 23:20
80539
Uhrichsville
NaN
LIGHT
OH
9/5/2014 1:14
80540
Tucson
RED BLUE
NaN
AZ
9/5/2014 2:40
80541
Orland park
RED
LIGHT
IL
9/5/2014 3:43
80542
Loughman
NaN
LIGHT
FL
9/5/2014 5:30
76357 rows × 5 columns
In [69]:
ufo.City[ufo.State == 'TX'] # Select only city columm where State == 'TX'
Out[69]:
37 Dallas
43 Alice
49 Conroe
92 Borger
114 Post
115 San Marcos
116 Lackland AFB
132 Fort Worth
134 Galveston
174 Greenville
189 Pecos
201 Arlington
213 Smithson's Valley
223 Oklahoma
235 Fort Worth
...
79965 Austin
79969 Austin
79986 Houston
80038 Borger
80075 Round Rock
80113 Menard
80134 Morton
80141 Port Neches
80240 New Braunfels
80241 Atascocita
80287 Lago Vista
80350 Houston
80414 Eustace
80476 Fort Worth
80534 Burleson
Name: City, Length: 4186, dtype: object
In [70]:
ufo[ufo.State == 'TX'].City # Same thing as before
Out[70]:
37 Dallas
43 Alice
49 Conroe
92 Borger
114 Post
115 San Marcos
116 Lackland AFB
132 Fort Worth
134 Galveston
174 Greenville
189 Pecos
201 Arlington
213 Smithson's Valley
223 Oklahoma
235 Fort Worth
...
79965 Austin
79969 Austin
79986 Houston
80038 Borger
80075 Round Rock
80113 Menard
80134 Morton
80141 Port Neches
80240 New Braunfels
80241 Atascocita
80287 Lago Vista
80350 Houston
80414 Eustace
80476 Fort Worth
80534 Burleson
Name: City, Length: 4186, dtype: object
In [71]:
ufo[(ufo.State == 'CA') | (ufo.State =='TX')] # Select only records where State is 'CA' or State is 'TX'
Out[71]:
City
Colors Reported
Shape Reported
State
Time
6
Crater Lake
NaN
CIRCLE
CA
6/15/1935 0:00
10
Fontana
NaN
LIGHT
CA
8/15/1937 21:00
15
Forest Home
NaN
CIRCLE
CA
7/2/1941 11:30
16
Los Angeles
NaN
NaN
CA
2/25/1942 0:00
25
San Diego
NaN
CIGAR
CA
1/1/1944 12:00
33
Salinas AFB
NaN
DISK
CA
6/1/1945 12:00
37
Dallas
NaN
SPHERE
TX
7/15/1945 14:00
38
Huntington Beach
NaN
DISK
CA
8/8/1945 12:00
42
Merced
NaN
SPHERE
CA
2/1/1946 17:00
43
Alice
NaN
DISK
TX
3/15/1946 15:30
44
Blairsden
GREEN
SPHERE
CA
6/30/1946 19:00
49
Conroe
NaN
OTHER
TX
1/10/1947 20:00
52
San Deigo
NaN
TRIANGLE
CA
4/15/1947 23:00
56
Oroville
NaN
NaN
CA
6/1/1947 18:00
61
Hayward
NaN
CIRCLE
CA
6/20/1947 12:00
65
Maywood
NaN
DISK
CA
7/1/1947 20:00
75
Richmond
NaN
CIRCLE
CA
7/14/1947 11:00
82
San Jose
BLUE
CHEVRON
CA
7/15/1947 21:00
84
Modesto
BLUE
DISK
CA
8/8/1947 22:00
90
Kelseyville
NaN
DISK
CA
4/30/1948 12:00
92
Borger
NaN
DISK
TX
6/15/1948 16:00
95
Ojai
NaN
LIGHT
CA
7/15/1948 22:00
110
Mill Valley
NaN
LIGHT
CA
8/15/1949 21:00
114
Post
NaN
DISK
TX
9/15/1949 21:00
115
San Marcos
NaN
CYLINDER
TX
10/10/1949 20:30
116
Lackland AFB
NaN
LIGHT
TX
10/10/1949 21:00
129
Napa
GREEN
DISK
CA
6/10/1950 0:00
132
Fort Worth
NaN
DISK
TX
6/15/1950 20:00
134
Galveston
NaN
SPHERE
TX
6/20/1950 17:00
142
Los Padres National Forest
NaN
FORMATION
CA
7/15/1950 23:00
...
...
...
...
...
...
80270
Oceanside
NaN
VARIOUS
CA
8/25/2014 21:38
80272
Murrieta
NaN
LIGHT
CA
8/25/2014 22:00
80275
Escondido
NaN
SPHERE
CA
8/25/2014 22:09
80276
Aliso Viejo
ORANGE
FIREBALL
CA
8/25/2014 22:11
80278
Carlsbad
ORANGE
DISK
CA
8/25/2014 22:20
80279
San Marcos
YELLOW
LIGHT
CA
8/25/2014 22:28
80287
Lago Vista
NaN
TRIANGLE
TX
8/26/2014 3:35
80296
Ventura
RED
LIGHT
CA
8/26/2014 21:30
80311
Woodland Hills
NaN
CIGAR
CA
8/27/2014 16:00
80312
Menifee
NaN
NaN
CA
8/27/2014 16:30
80328
Santa Ysabel
NaN
OTHER
CA
8/28/2014 4:02
80349
San Francisco
RED
RECTANGLE
CA
8/28/2014 19:45
80350
Houston
NaN
TRIANGLE
TX
8/28/2014 19:55
80351
Rancho Cucamonga
NaN
LIGHT
CA
8/28/2014 20:00
80360
Fontana
RED BLUE
CIGAR
CA
8/28/2014 22:00
80364
Big Pine
NaN
FLASH
CA
8/28/2014 22:28
80365
West Hills
NaN
TEARDROP
CA
8/28/2014 22:28
80366
Folsom
ORANGE BLUE
TRIANGLE
CA
8/28/2014 22:30
80373
Ocean Beach
NaN
CYLINDER
CA
8/29/2014 8:12
80379
Apple Valley
ORANGE
OVAL
CA
8/29/2014 21:30
80394
San Francisco
NaN
RECTANGLE
CA
8/30/2014 4:15
80396
El Cajon
NaN
DIAMOND
CA
8/30/2014 8:20
80405
San Diego
NaN
LIGHT
CA
8/30/2014 21:00
80408
Fallbrook
ORANGE
FIREBALL
CA
8/30/2014 21:20
80414
Eustace
NaN
LIGHT
TX
8/30/2014 23:00
80421
Manteca
NaN
SPHERE
CA
8/31/2014 2:20
80442
El Cajon
NaN
TRIANGLE
CA
8/31/2014 22:35
80476
Fort Worth
NaN
OTHER
TX
9/2/2014 16:05
80516
Glendale
NaN
CYLINDER
CA
9/4/2014 17:38
80534
Burleson
NaN
LIGHT
TX
9/4/2014 22:30
14929 rows × 5 columns
In [72]:
ufo_dallas = ufo[(ufo.City == 'Dallas') & (ufo.State =='TX')] # Select only Dallas, TX records
In [77]:
ufo[ufo.City.isin(['Austin','Dallas', 'Houston'])] # Select only Austin, Dallas, or Houston records
Out[77]:
City
Colors Reported
Shape Reported
State
Time
37
Dallas
NaN
SPHERE
TX
7/15/1945 14:00
267
Houston
NaN
DISK
TX
6/1/1954 0:00
373
Houston
NaN
TRIANGLE
TX
6/30/1956 18:30
392
Dallas
NaN
NaN
TX
1/1/1957 21:00
414
Dallas
NaN
OTHER
TX
6/30/1957 0:00
428
Houston
NaN
DIAMOND
TX
7/15/1957 13:00
463
Houston
NaN
LIGHT
TX
10/15/1957 20:30
490
Houston
NaN
DISK
TX
6/15/1958 20:00
584
Houston
NaN
TRIANGLE
TX
3/12/1960 23:00
585
Houston
NaN
OTHER
TX
4/1/1960 21:00
666
Dallas
NaN
DISK
GA
6/22/1961 0:00
787
Houston
NaN
NaN
TX
6/1/1963 0:00
793
Houston
NaN
CIGAR
TX
6/1/1963 12:00
964
Austin
NaN
NaN
TX
4/1/1965 5:30
1147
Houston
NaN
FORMATION
TX
12/22/1965 9:00
1604
Houston
NaN
TRIANGLE
TX
6/1/1968 22:00
1661
Dallas
NaN
OTHER
TX
7/1/1968 3:30
1672
Houston
NaN
DISK
TX
7/7/1968 12:00
1712
Austin
NaN
DISK
TX
8/15/1968 5:00
1792
Dallas
NaN
LIGHT
TX
5/15/1969 20:00
1889
Austin
NaN
SPHERE
TX
9/9/1969 21:10
1945
Houston
NaN
DISK
TX
5/1/1970 3:00
1974
Houston
NaN
DISK
TX
6/15/1970 3:00
2155
Houston
NaN
CIRCLE
TX
9/15/1971 18:00
2205
Austin
NaN
LIGHT
TX
6/1/1972 18:00
2248
Dallas
GREEN
NaN
TX
6/30/1972 12:00
2278
Houston
NaN
LIGHT
TX
7/15/1972 20:00
2348
Houston
NaN
RECTANGLE
TX
2/10/1973 16:00
2359
Houston
NaN
DISK
TX
4/5/1973 21:30
2387
Austin
NaN
SPHERE
TX
6/1/1973 21:00
...
...
...
...
...
...
76081
Dallas
NaN
TRIANGLE
TX
2/17/2014 22:00
76093
Houston
NaN
DISK
TX
2/18/2014 18:00
76121
Houston
NaN
NaN
TX
2/19/2014 21:20
76548
Houston
NaN
FLASH
TX
3/18/2014 20:45
76712
Houston
NaN
TRIANGLE
TX
3/29/2014 1:00
76855
Austin
RED
LIGHT
TX
4/5/2014 22:15
76905
Austin
NaN
NaN
TX
4/8/2014 21:15
76966
Houston
RED BLUE
SPHERE
TX
4/11/2014 0:30
77527
Dallas
NaN
OTHER
TX
5/9/2014 8:00
78353
Austin
NaN
CYLINDER
TX
6/19/2014 23:00
78354
Austin
GREEN BLUE
CYLINDER
TX
6/19/2014 23:00
78420
Dallas
GREEN BLUE
LIGHT
TX
6/21/2014 22:00
78925
Houston
ORANGE
FIREBALL
TX
7/4/2014 22:30
78983
Austin
NaN
LIGHT
TX
7/4/2014 23:30
79128
Austin
NaN
SPHERE
TX
7/7/2014 23:00
79212
Austin
NaN
LIGHT
TX
7/12/2014 21:00
79223
Austin
YELLOW
LIGHT
TX
7/12/2014 21:05
79227
Austin
NaN
LIGHT
TX
7/12/2014 21:08
79236
Austin
NaN
LIGHT
TX
7/12/2014 21:12
79238
Austin
NaN
LIGHT
TX
7/12/2014 21:15
79302
Austin
ORANGE
CIRCLE
TX
7/14/2014 5:10
79390
Houston
NaN
LIGHT
TX
7/18/2014 22:00
79572
Austin
GREEN
NaN
TX
7/27/2014 0:30
79628
Houston
NaN
OTHER
TX
7/30/2014 12:00
79676
Austin
RED
NaN
TX
8/1/2014 20:55
79767
Dallas
NaN
TRIANGLE
GA
8/4/2014 21:00
79965
Austin
NaN
CIRCLE
TX
8/15/2014 6:45
79969
Austin
RED
CIRCLE
TX
8/15/2014 18:45
79986
Houston
YELLOW
CIRCLE
TX
8/15/2014 23:30
80350
Houston
NaN
TRIANGLE
TX
8/28/2014 19:55
789 rows × 5 columns
In [78]:
ufo.State.order() # only works for a Series
Out[78]:
76702 AK
74536 AK
28987 AK
8551 AK
76703 AK
31387 AK
2374 AK
44188 AK
18424 AK
61798 AK
1891 AK
33161 AK
73924 AK
48085 AK
67403 AK
...
14586 WY
1177 WY
77831 WY
80082 WY
188 WY
26882 WY
41926 WY
54342 WY
79897 WY
55682 WY
52869 WY
49543 WY
7183 WY
58080 WY
21766 WY
Name: State, Length: 80543, dtype: object
In [79]:
ufo.sort_index(inplace=True) # sort rows by label
In [80]:
ufo.sort_index(ascending=False, inplace=False)
Out[80]:
City
Colors Reported
Shape Reported
State
Time
80542
Loughman
NaN
LIGHT
FL
9/5/2014 5:30
80541
Orland park
RED
LIGHT
IL
9/5/2014 3:43
80540
Tucson
RED BLUE
NaN
AZ
9/5/2014 2:40
80539
Uhrichsville
NaN
LIGHT
OH
9/5/2014 1:14
80538
Neligh
NaN
CIRCLE
NE
9/4/2014 23:20
80537
Neligh
NaN
CIRCLE
NE
9/4/2014 23:20
80536
Wyoming
RED
DISK
PA
9/4/2014 23:00
80535
Elizabethtown
NaN
CIRCLE
KY
9/4/2014 22:30
80534
Burleson
NaN
LIGHT
TX
9/4/2014 22:30
80533
Melbourne
NaN
OVAL
IA
9/4/2014 22:15
80532
Lawrence
NaN
LIGHT
MA
9/4/2014 22:15
80531
Abingdon
NaN
CIRCLE
VA
9/4/2014 22:05
80530
Glen St. Mary
NaN
DISK
FL
9/4/2014 22:00
80529
Deer Park
NaN
LIGHT
WA
9/4/2014 22:00
80528
North Royalton
RED
TRIANGLE
OH
9/4/2014 21:30
80527
Seattle
NaN
DIAMOND
WA
9/4/2014 21:15
80526
Marshfield
NaN
LIGHT
MA
9/4/2014 21:15
80525
Iowa City
BLUE
LIGHT
IA
9/4/2014 21:11
80524
Olympia
RED
LIGHT
WA
9/4/2014 21:10
80523
Ellicott City
NaN
LIGHT
MD
9/4/2014 20:45
80522
Lawrence
ORANGE
LIGHT
MA
9/4/2014 20:30
80521
Canal Winchester
NaN
TRIANGLE
OH
9/4/2014 20:30
80520
Waynesburg
NaN
FLASH
PA
9/4/2014 20:25
80519
Glen Ellyn
RED
CIRCLE
IL
9/4/2014 20:20
80518
Langhorne
NaN
OTHER
PA
9/4/2014 19:30
80517
Laconia
NaN
CIGAR
NH
9/4/2014 19:02
80516
Glendale
NaN
CYLINDER
CA
9/4/2014 17:38
80515
Woodbridge
NaN
TRIANGLE
VA
9/4/2014 16:30
80514
New Bedford
NaN
LIGHT
MA
9/4/2014 9:47
80513
Manahawkin
NaN
CIRCLE
NJ
9/4/2014 8:30
...
...
...
...
...
...
29
Ft. Duschene
NaN
DISK
UT
6/30/1944 10:00
28
Los Alamos
NaN
DISK
NM
6/1/1944 12:00
27
Clovis
NaN
DISK
NM
4/2/1944 11:00
26
Wilderness
NaN
DISK
WV
1/1/1944 12:00
25
San Diego
NaN
CIGAR
CA
1/1/1944 12:00
24
Wilderness
NaN
DISK
WV
1/1/1944 10:00
23
Owensboro
NaN
RECTANGLE
KY
10/15/1943 11:00
22
NaN
NaN
LIGHT
LA
8/15/1943 0:00
21
NaN
NaN
NaN
LA
8/15/1943 0:00
20
Nebraska
NaN
DISK
NE
6/1/1943 15:00
19
Bering Sea
RED
OTHER
AK
4/30/1943 23:00
18
Oneida
NaN
RECTANGLE
TN
7/15/1942 1:00
17
Hapeville
NaN
NaN
GA
6/1/1942 22:30
16
Los Angeles
NaN
NaN
CA
2/25/1942 0:00
15
Forest Home
NaN
CIRCLE
CA
7/2/1941 11:30
14
Ludington
NaN
DISK
MI
6/1/1941 13:00
13
Keokuk
NaN
OVAL
IA
7/7/1939 2:00
12
Belton
RED
SPHERE
SC
6/30/1939 20:00
11
Waterloo
NaN
FIREBALL
AL
6/1/1939 20:00
10
Fontana
NaN
LIGHT
CA
8/15/1937 21:00
9
Hubbard
NaN
CYLINDER
OR
6/15/1937 0:00
8
Eklutna
NaN
CIGAR
AK
10/15/1936 17:00
7
Alma
NaN
DISK
MI
7/15/1936 0:00
6
Crater Lake
NaN
CIRCLE
CA
6/15/1935 0:00
5
Valley City
NaN
DISK
ND
9/15/1934 15:30
4
New York Worlds Fair
NaN
LIGHT
NY
4/18/1933 19:00
3
Abilene
NaN
DISK
KS
6/1/1931 13:00
2
Holyoke
NaN
OVAL
CO
2/15/1931 14:00
1
Willingboro
NaN
OTHER
NJ
6/30/1930 20:00
0
Ithaca
NaN
TRIANGLE
NY
6/1/1930 22:00
80543 rows × 5 columns
In [81]:
ufo.sort_index(by='State') # sort rows by specific column
Out[81]:
City
Colors Reported
Shape Reported
State
Time
76702
Wasilla
NaN
VARIOUS
AK
3/28/2014 11:17
74536
Ketchikan
NaN
LIGHT
AK
12/5/2013 17:04
28987
Anchorage
NaN
SPHERE
AK
5/28/2004 8:35
8551
Cantwell
NaN
CIRCLE
AK
8/20/1995 11:30
76703
Chugiak
NaN
LIGHT
AK
3/28/2014 11:30
31387
Fairbanks
NaN
NaN
AK
12/20/2004 20:00
2374
Chugiak
NaN
RECTANGLE
AK
5/15/1973 22:00
44188
Alaska
NaN
CIGAR
AK
4/29/2008 0:00
18424
North Pole
NaN
DISK
AK
1/24/2001 0:00
61798
Anchorage
RED
LIGHT
AK
3/2/2012 19:30
1891
Fairbanks
NaN
CIGAR
AK
9/20/1969 23:00
33161
Susitna Lodge
NaN
OTHER
AK
7/7/2005 22:00
73924
Kuparuk Oilfield
NaN
LIGHT
AK
11/9/2013 17:30
48085
Fairbanks
GREEN
LIGHT
AK
2/20/2009 22:00
67403
Fairbanks
ORANGE
LIGHT
AK
11/22/2012 19:05
35828
White Mountain
NaN
SPHERE
AK
3/21/2006 0:00
61819
Bethel
NaN
LIGHT
AK
3/3/2012 22:15
5426
Adak
NaN
LIGHT
AK
6/1/1986 0:00
68716
Palmer
ORANGE
LIGHT
AK
2/23/2013 21:30
32284
Fairbanks
NaN
CIRCLE
AK
4/11/2005 0:00
76317
Anchorage
RED
LIGHT
AK
3/1/2014 0:00
39041
Anchorage
RED
DISK
AK
1/27/2007 23:07
22150
Bethel
NaN
OTHER
AK
6/5/2002 23:45
17589
Fairbanks
NaN
NaN
AK
10/2/2000 23:00
7714
Anchorage
NaN
LIGHT
AK
1/17/1995 22:45
28204
Fairbanks
YELLOW
LIGHT
AK
3/11/2004 21:30
76721
Anchorage
NaN
CIGAR
AK
3/29/2014 20:45
467
St Lawrence Island
NaN
OTHER
AK
11/10/1957 20:00
33158
Susitna Lodge
BLUE
RECTANGLE
AK
7/7/2005 18:45
76106
Anchorage
NaN
FLASH
AK
2/19/2014 3:00
...
...
...
...
...
...
2976
Fort Washakie
NaN
LIGHT
WY
7/7/1975 19:00
79995
Cheyenne
NaN
TRIANGLE
WY
8/16/2014 17:05
33421
Clark
NaN
LIGHT
WY
7/31/2005 4:20
23548
Bosler
NaN
NaN
WY
10/15/2002 18:00
38054
Laramie
NaN
RECTANGLE
WY
11/7/2006 21:00
9277
Cody
NaN
OTHER
WY
5/1/1996 22:00
20848
Clark
NaN
LIGHT
WY
11/15/2001 21:10
13716
On highway; location unknown
NaN
OTHER
WY
6/15/1999 22:00
12215
Sheridan
NaN
OTHER
WY
9/24/1998 0:15
74651
Powell
NaN
CHEVRON
WY
12/11/2013 18:25
45615
Gillette
NaN
EGG
WY
8/11/2008 20:15
7064
Mammoth Hot Springs
YELLOW
CHEVRON
WY
6/4/1993 21:55
37171
Riverton
NaN
FORMATION
WY
8/13/2006 22:20
47842
Casper
NaN
NaN
WY
2/2/2009 6:00
35111
Evanston
NaN
SPHERE
WY
12/12/2005 23:00
14586
Egg Harbor
NaN
NaN
WY
9/15/1999 0:00
1177
Francis E Warren AFB
NaN
NaN
WY
4/15/1966 23:00
77831
Rock Springs
NaN
LIGHT
WY
5/25/2014 20:00
80082
Yellowstone National Park
YELLOW
TEARDROP
WY
8/19/2014 7:40
188
Cheyenne
NaN
DISK
WY
6/30/1952 0:00
26882
Cheyenne
NaN
FIREBALL
WY
11/1/2003 0:00
41926
Opal
NaN
NaN
WY
10/20/2007 1:00
54342
Cokeville
NaN
FORMATION
WY
8/29/2010 21:20
79897
Evanston
NaN
CONE
WY
8/11/2014 19:45
55682
Buffalo
NaN
DISK
WY
12/17/2010 18:15
52869
Gillette
NaN
LIGHT
WY
6/6/2010 2:45
49543
Powell
NaN
CIRCLE
WY
7/8/2009 4:25
7183
Laramie
NaN
VARIOUS
WY
8/23/1993 19:56
58080
Casper
NaN
LIGHT
WY
7/20/2011 22:13
21766
Laramie
NaN
VARIOUS
WY
4/4/2002 19:45
80543 rows × 5 columns
In [82]:
ufo.sort_index(by=['State', 'Shape Reported']) # sort by multiple columns
Out[82]:
City
Colors Reported
Shape Reported
State
Time
8
Eklutna
NaN
CIGAR
AK
10/15/1936 17:00
413
Arctic
NaN
CIGAR
AK
6/30/1957 0:00
1306
Delta Junction
ORANGE
CIGAR
AK
9/15/1966 18:00
1891
Fairbanks
NaN
CIGAR
AK
9/20/1969 23:00
7233
Fairbanks
BLUE
CIGAR
AK
11/23/1993 11:24
10682
Kenai
NaN
CIGAR
AK
8/8/1997 23:00
44188
Alaska
NaN
CIGAR
AK
4/29/2008 0:00
76721
Anchorage
NaN
CIGAR
AK
3/29/2014 20:45
1494
Gula
NaN
CIRCLE
AK
8/15/1967 23:00
2304
Anchorage
BLUE
CIRCLE
AK
8/30/1972 1:30
3511
Kodiak Island
NaN
CIRCLE
AK
6/15/1977 2:00
5333
Juneau
NaN
CIRCLE
AK
9/5/1985 1:30
8551
Cantwell
NaN
CIRCLE
AK
8/20/1995 11:30
12470
Anchorage
NaN
CIRCLE
AK
10/31/1998 10:00
14154
Ketchikan
NaN
CIRCLE
AK
8/11/1999 23:07
15376
Seward
ORANGE
CIRCLE
AK
11/27/1999 11:00
20182
Fairbanks
NaN
CIRCLE
AK
8/27/2001 5:30
24310
Anchorage
NaN
CIRCLE
AK
1/24/2003 8:30
25700
Fairbanks
NaN
CIRCLE
AK
8/7/2003 1:30
29737
Wrangell
NaN
CIRCLE
AK
8/2/2004 0:00
29914
Anchorage
NaN
CIRCLE
AK
8/16/2004 12:08
32284
Fairbanks
NaN
CIRCLE
AK
4/11/2005 0:00
37906
Anchorage
NaN
CIRCLE
AK
10/23/2006 21:00
39338
Eagle River
BLUE
CIRCLE
AK
3/4/2007 19:45
41774
Willow
GREEN
CIRCLE
AK
10/6/2007 22:00
47434
Venetie
RED GREEN
CIRCLE
AK
1/2/2009 21:00
53969
Anchorage
NaN
CIRCLE
AK
8/6/2010 18:00
54283
Craig
GREEN
CIRCLE
AK
8/27/2010 0:00
55673
Anchorage
NaN
CIRCLE
AK
12/16/2010 12:00
56673
Wasilla
NaN
CIRCLE
AK
4/4/2011 4:00
...
...
...
...
...
...
18314
Gillette
NaN
VARIOUS
WY
1/8/2001 18:30
21766
Laramie
NaN
VARIOUS
WY
4/4/2002 19:45
59219
Yellowstone National Park
NaN
VARIOUS
WY
9/20/2011 19:55
60572
Laramie
ORANGE
VARIOUS
WY
12/15/2011 17:30
1177
Francis E Warren AFB
NaN
NaN
WY
4/15/1966 23:00
2437
Powell
NaN
NaN
WY
7/5/1973 1:00
5065
Wheeling
NaN
NaN
WY
6/15/1984 20:30
5232
Old Faithful
NaN
NaN
WY
6/2/1985 23:30
7093
Laramie
NaN
NaN
WY
6/17/1993 15:00
7684
Glendo
NaN
NaN
WY
1/6/1995 17:30
9264
Mile 69
NaN
NaN
WY
4/28/1996 1:00
11490
Cheyenne
NaN
NaN
WY
5/2/1998 20:30
13756
Laramie
BLUE
NaN
WY
6/20/1999 23:00
14586
Egg Harbor
NaN
NaN
WY
9/15/1999 0:00
14618
Cheyenne
NaN
NaN
WY
9/17/1999 21:55
15667
Rock Spings, 2miles west of
NaN
NaN
WY
1/6/2000 5:45
20080
Rawlins
RED ORANGE
NaN
WY
8/17/2001 22:20
23548
Bosler
NaN
NaN
WY
10/15/2002 18:00
35060
Cheyenne
ORANGE
NaN
WY
12/5/2005 21:00
36535
Seminoe Lake
NaN
NaN
WY
6/14/2006 23:00
38849
Shawnee
NaN
NaN
WY
1/10/2007 19:00
40260
Thermopolis
NaN
NaN
WY
6/14/2007 23:00
41926
Opal
NaN
NaN
WY
10/20/2007 1:00
46209
Glenrock
RED GREEN
NaN
WY
10/1/2008 20:00
46430
Colony
NaN
NaN
WY
10/17/2008 19:20
47842
Casper
NaN
NaN
WY
2/2/2009 6:00
52361
Cheyenne
NaN
NaN
WY
4/10/2010 21:00
53722
Lander
NaN
NaN
WY
7/18/2010 23:00
73095
Clark
NaN
NaN
WY
10/5/2013 12:15
79629
Hulett
NaN
NaN
WY
7/30/2014 13:00
80543 rows × 5 columns
In [83]:
ufo.sort_index(by=['State', 'Shape Reported'], ascending=[False, True], inplace=True) # specify sort order
In [84]:
# Add a new column as a function of existing columns
ufo['Location'] = ufo['City'] + ', ' + ufo['State']
ufo.head()
Out[84]:
City
Colors Reported
Shape Reported
State
Time
Location
7064
Mammoth Hot Springs
YELLOW
CHEVRON
WY
6/4/1993 21:55
Mammoth Hot Springs, WY
24658
Jackson Hole
NaN
CHEVRON
WY
3/16/2003 11:00
Jackson Hole, WY
61928
Rock Springs
NaN
CHEVRON
WY
3/12/2012 1:02
Rock Springs, WY
73885
Casper
NaN
CHEVRON
WY
11/7/2013 1:45
Casper, WY
74651
Powell
NaN
CHEVRON
WY
12/11/2013 18:25
Powell, WY
In [86]:
# Rename columns
ufo.rename(columns={'Colors Reported':'Colors', 'Shape Reported':'Shape'}, inplace=True)
ufo.head()
Out[86]:
City
Colors
Shape
State
Time
Location
7064
Mammoth Hot Springs
YELLOW
CHEVRON
WY
6/4/1993 21:55
Mammoth Hot Springs, WY
24658
Jackson Hole
NaN
CHEVRON
WY
3/16/2003 11:00
Jackson Hole, WY
61928
Rock Springs
NaN
CHEVRON
WY
3/12/2012 1:02
Rock Springs, WY
73885
Casper
NaN
CHEVRON
WY
11/7/2013 1:45
Casper, WY
74651
Powell
NaN
CHEVRON
WY
12/11/2013 18:25
Powell, WY
In [87]:
# Hide a column (temporarily)
ufo.drop(['Location'], axis=1)
Out[87]:
City
Colors
Shape
State
Time
7064
Mammoth Hot Springs
YELLOW
CHEVRON
WY
6/4/1993 21:55
24658
Jackson Hole
NaN
CHEVRON
WY
3/16/2003 11:00
61928
Rock Springs
NaN
CHEVRON
WY
3/12/2012 1:02
73885
Casper
NaN
CHEVRON
WY
11/7/2013 1:45
74651
Powell
NaN
CHEVRON
WY
12/11/2013 18:25
1461
Wyoming
NaN
CIGAR
WY
7/12/1967 2:00
12072
Cheyenne
NaN
CIGAR
WY
9/4/1998 15:00
60064
Torrington
NaN
CIGAR
WY
11/5/2011 21:30
72264
Wyoming
NaN
CIGAR
WY
9/1/2013 0:00
1985
Cheyenne
NaN
CIRCLE
WY
6/20/1970 2:00
3326
Casper
NaN
CIRCLE
WY
7/25/1976 18:00
9399
Gillette
NaN
CIRCLE
WY
6/15/1996 2:30
23452
Lander
GREEN
CIRCLE
WY
10/6/2002 19:25
29083
Hanna
NaN
CIRCLE
WY
6/5/2004 16:00
49543
Powell
NaN
CIRCLE
WY
7/8/2009 4:25
49739
Laramie
NaN
CIRCLE
WY
7/21/2009 22:15
52524
Evansville
NaN
CIRCLE
WY
4/30/2010 21:30
57240
Mills
NaN
CIRCLE
WY
6/3/2011 15:00
58355
Cheyenne
NaN
CIRCLE
WY
8/5/2011 18:13
59084
Casper
RED GREEN BLUE
CIRCLE
WY
9/12/2011 20:30
59194
La Barge
NaN
CIRCLE
WY
9/18/2011 21:00
62612
Rock Springs
NaN
CIRCLE
WY
4/21/2012 23:05
63577
Laramie
NaN
CIRCLE
WY
6/20/2012 16:00
68629
Point of Rocks
GREEN
CIRCLE
WY
2/14/2013 6:00
72481
Gillette
NaN
CIRCLE
WY
9/11/2013 9:00
76250
Casper
NaN
CIRCLE
WY
2/25/2014 2:20
77097
Yellowstone North Entrance
RED YELLOW
CIRCLE
WY
4/17/2014 18:00
79897
Evanston
NaN
CONE
WY
8/11/2014 19:45
3328
Cowley
NaN
CYLINDER
WY
7/27/1976 1:00
13823
Buffalo
NaN
CYLINDER
WY
7/1/1999 22:30
...
...
...
...
...
...
8821
Prince of Wales Island
NaN
NaN
AK
10/27/1995 20:28
9129
Mountain Village
NaN
NaN
AK
3/2/1996 0:30
9212
Nikishi
RED
NaN
AK
4/12/1996 2:00
10262
North Pole
NaN
NaN
AK
5/20/1997 15:00
13552
Wacker City
NaN
NaN
AK
5/26/1999 18:00
15741
Juneau
NaN
NaN
AK
1/18/2000 7:45
17589
Fairbanks
NaN
NaN
AK
10/2/2000 23:00
18116
North Pole
NaN
NaN
AK
12/12/2000 22:00
21296
Anchorage
RED
NaN
AK
1/20/2002 20:00
23064
St. George Island
NaN
NaN
AK
8/30/2002 23:30
31387
Fairbanks
NaN
NaN
AK
12/20/2004 20:00
33534
Suriname
NaN
NaN
AK
8/11/2005 6:11
34623
Glennallen
NaN
NaN
AK
11/1/2005 0:30
44957
Fairbanks
BLUE
NaN
AK
7/3/2008 16:40
47362
Fairbanks
NaN
NaN
AK
12/29/2008 15:00
48773
Eagle River
NaN
NaN
AK
5/9/2009 2:30
52781
Prince William Sound
NaN
NaN
AK
5/29/2010 9:45
53051
Elk
NaN
NaN
AK
6/25/2010 10:22
56186
Fairbanks
NaN
NaN
AK
2/11/2011 8:00
56499
Hooper Bay
NaN
NaN
AK
3/15/2011 0:03
61311
Anchorage
ORANGE
NaN
AK
1/22/2012 20:15
65642
Fairbanks
NaN
NaN
AK
9/3/2012 21:57
68292
Anchorage
NaN
NaN
AK
1/7/2013 1:50
71252
Sterling
NaN
NaN
AK
7/22/2013 3:13
73090
Anchorage
NaN
NaN
AK
10/5/2013 2:55
73916
Fairbanks
ORANGE BLUE
NaN
AK
11/8/2013 23:00
74639
Fairbanks
ORANGE
NaN
AK
12/10/2013 23:15
74969
Anchorage
NaN
NaN
AK
12/26/2013 18:21
75471
Talkitna
NaN
NaN
AK
1/14/2014 23:00
77458
Anchorage
NaN
NaN
AK
5/5/2014 16:00
80543 rows × 5 columns
In [88]:
# Delete a column (permanently)
del ufo['Location']
In [89]:
# Missing values are often just excluded
ufo.describe() # Excludes missing values
Out[89]:
City
Colors
Shape
State
Time
count
80496
17034
72141
80543
80543
unique
13504
31
27
52
68901
top
Seattle
ORANGE
LIGHT
CA
7/4/2014 22:00
freq
646
5216
16332
10743
45
In [90]:
ufo.Shape.value_counts() # Excludes missing values
Out[90]:
LIGHT 16332
TRIANGLE 7816
CIRCLE 7725
FIREBALL 6249
OTHER 5506
SPHERE 5231
DISK 5226
OVAL 3721
FORMATION 2405
CIGAR 1983
VARIOUS 1957
FLASH 1329
RECTANGLE 1295
CYLINDER 1252
DIAMOND 1152
CHEVRON 940
EGG 733
TEARDROP 723
CONE 310
CROSS 241
DELTA 7
CRESCENT 2
ROUND 2
HEXAGON 1
FLARE 1
PYRAMID 1
DOME 1
dtype: int64
In [91]:
ufo.Shape.value_counts(dropna=False) # Includes missing values
Out[91]:
LIGHT 16332
NaN 8402
TRIANGLE 7816
CIRCLE 7725
FIREBALL 6249
OTHER 5506
SPHERE 5231
DISK 5226
OVAL 3721
FORMATION 2405
CIGAR 1983
VARIOUS 1957
FLASH 1329
RECTANGLE 1295
CYLINDER 1252
DIAMOND 1152
CHEVRON 940
EGG 733
TEARDROP 723
CONE 310
CROSS 241
DELTA 7
CRESCENT 2
ROUND 2
HEXAGON 1
PYRAMID 1
DOME 1
FLARE 1
dtype: int64
In [92]:
# Find missing values in a Series
ufo.Shape.isnull() # True if NaN, False otherwise
Out[92]:
7064 False
24658 False
61928 False
73885 False
74651 False
1461 False
12072 False
60064 False
72264 False
1985 False
3326 False
9399 False
23452 False
29083 False
49543 False
...
48773 True
52781 True
53051 True
56186 True
56499 True
61311 True
65642 True
68292 True
71252 True
73090 True
73916 True
74639 True
74969 True
75471 True
77458 True
Name: Shape, Length: 80543, dtype: bool
In [93]:
ufo.Shape.notnull() # False if NaN, True otherwise
Out[93]:
7064 True
24658 True
61928 True
73885 True
74651 True
1461 True
12072 True
60064 True
72264 True
1985 True
3326 True
9399 True
23452 True
29083 True
49543 True
...
48773 False
52781 False
53051 False
56186 False
56499 False
61311 False
65642 False
68292 False
71252 False
73090 False
73916 False
74639 False
74969 False
75471 False
77458 False
Name: Shape, Length: 80543, dtype: bool
In [94]:
ufo.Shape.isnull().sum() # Count the missing values
Out[94]:
8402
In [95]:
# Find missing values in a DataFrame
ufo.isnull()
Out[95]:
City
Colors
Shape
State
Time
7064
False
False
False
False
False
24658
False
True
False
False
False
61928
False
True
False
False
False
73885
False
True
False
False
False
74651
False
True
False
False
False
1461
False
True
False
False
False
12072
False
True
False
False
False
60064
False
True
False
False
False
72264
False
True
False
False
False
1985
False
True
False
False
False
3326
False
True
False
False
False
9399
False
True
False
False
False
23452
False
False
False
False
False
29083
False
True
False
False
False
49543
False
True
False
False
False
49739
False
True
False
False
False
52524
False
True
False
False
False
57240
False
True
False
False
False
58355
False
True
False
False
False
59084
False
False
False
False
False
59194
False
True
False
False
False
62612
False
True
False
False
False
63577
False
True
False
False
False
68629
False
False
False
False
False
72481
False
True
False
False
False
76250
False
True
False
False
False
77097
False
False
False
False
False
79897
False
True
False
False
False
3328
False
True
False
False
False
13823
False
True
False
False
False
...
...
...
...
...
...
8821
False
True
True
False
False
9129
False
True
True
False
False
9212
False
False
True
False
False
10262
False
True
True
False
False
13552
False
True
True
False
False
15741
False
True
True
False
False
17589
False
True
True
False
False
18116
False
True
True
False
False
21296
False
False
True
False
False
23064
False
True
True
False
False
31387
False
True
True
False
False
33534
False
True
True
False
False
34623
False
True
True
False
False
44957
False
False
True
False
False
47362
False
True
True
False
False
48773
False
True
True
False
False
52781
False
True
True
False
False
53051
False
True
True
False
False
56186
False
True
True
False
False
56499
False
True
True
False
False
61311
False
False
True
False
False
65642
False
True
True
False
False
68292
False
True
True
False
False
71252
False
True
True
False
False
73090
False
True
True
False
False
73916
False
False
True
False
False
74639
False
False
True
False
False
74969
False
True
True
False
False
75471
False
True
True
False
False
77458
False
True
True
False
False
80543 rows × 5 columns
In [96]:
# Count the missing values in a DataFrame
ufo.isnull().sum()
Out[96]:
City 47
Colors 63509
Shape 8402
State 0
Time 0
dtype: int64
In [97]:
# Exclude rows with missing values in a dataframe
ufo[(ufo.Shape.notnull()) & (ufo.Colors.notnull())]
Out[97]:
City
Colors
Shape
State
Time
7064
Mammoth Hot Springs
YELLOW
CHEVRON
WY
6/4/1993 21:55
23452
Lander
GREEN
CIRCLE
WY
10/6/2002 19:25
59084
Casper
RED GREEN BLUE
CIRCLE
WY
9/12/2011 20:30
68629
Point of Rocks
GREEN
CIRCLE
WY
2/14/2013 6:00
77097
Yellowstone North Entrance
RED YELLOW
CIRCLE
WY
4/17/2014 18:00
23356
Fort Bridger
GREEN
DISK
WY
9/28/2002 18:30
67644
Rock Springs
RED
DISK
WY
12/7/2012 6:30
12241
Rock Springs
GREEN
FIREBALL
WY
9/26/1998 3:05
23132
Gillette
GREEN
FIREBALL
WY
9/6/2002 20:30
10729
Atlantic City
RED
LIGHT
WY
8/22/1997 23:30
24173
Rock Springs
RED GREEN
LIGHT
WY
1/8/2003 20:15
38591
Sheridan
RED GREEN
LIGHT
WY
12/19/2006 21:00
40417
Douglas
ORANGE
LIGHT
WY
6/24/2007 0:30
61636
Alpine
RED
LIGHT
WY
2/15/2012 21:05
63575
Bill
GREEN
LIGHT
WY
6/20/2012 12:42
71785
Devils Tower
GREEN
LIGHT
WY
8/12/2013 21:30
75664
Fort Washakie
RED
LIGHT
WY
1/23/2014 23:00
76253
Evanston
ORANGE
LIGHT
WY
2/25/2014 18:35
77802
Rock Springs
ORANGE
LIGHT
WY
5/24/2014 22:00
46556
Rock Springs
ORANGE
OTHER
WY
10/25/2008 21:00
53963
Cheyenne
YELLOW
OTHER
WY
8/6/2010 0:30
20679
Evanston
GREEN
OVAL
WY
10/24/2001 23:00
23134
Baggs
RED
OVAL
WY
9/6/2002 21:00
45717
Laramie
RED
OVAL
WY
8/20/2008 15:30
163
Green River
GREEN
SPHERE
WY
7/3/1951 12:00
35099
Fox Park
RED GREEN
SPHERE
WY
12/11/2005 21:15
57383
Green River
RED GREEN
SPHERE
WY
6/16/2011 23:30
66994
Casper
GREEN
SPHERE
WY
11/7/2012 19:30
74569
Laramie
ORANGE
SPHERE
WY
12/7/2013 1:00
80082
Yellowstone National Park
YELLOW
TEARDROP
WY
8/19/2014 7:40
...
...
...
...
...
...
20579
Fairbanks
RED
OTHER
AK
10/11/2001 2:15
25724
Anchorage
RED
OTHER
AK
8/10/2003 3:00
26770
Anchorage
RED
OTHER
AK
10/22/2003 7:00
38099
North Pole
GREEN
OTHER
AK
11/10/2006 23:15
3149
Ft. Yukon
BLUE
OVAL
AK
2/13/1976 13:30
5459
Teller
RED GREEN
OVAL
AK
6/20/1986 23:30
11112
Fairbanks
GREEN
OVAL
AK
1/8/1998 22:38
67995
Fairbanks
RED YELLOW
OVAL
AK
12/25/2012 16:30
72717
Anchorage
RED ORANGE
OVAL
AK
9/21/2013 21:20
76790
Kenai
RED
OVAL
AK
4/2/2014 11:00
33158
Susitna Lodge
BLUE
RECTANGLE
AK
7/7/2005 18:45
3395
Soldotna
RED
SPHERE
AK
10/20/1976 22:00
4716
Juneau
ORANGE
SPHERE
AK
6/15/1982 14:00
5554
Girdwood
ORANGE
SPHERE
AK
11/15/1986 14:30
11313
North Pole
BLUE
SPHERE
AK
3/15/1998 20:00
12222
North Pole
ORANGE
SPHERE
AK
9/24/1998 22:35
18127
North Pole
ORANGE
SPHERE
AK
12/15/2000 16:36
27518
Anchorage
ORANGE
SPHERE
AK
12/25/2003 0:00
51682
Willow
RED
SPHERE
AK
1/1/2010 12:01
55791
Wasilla
RED
SPHERE
AK
12/31/2010 21:00
60708
Fairbanks
RED BLUE
SPHERE
AK
12/24/2011 2:00
61938
Palmer
ORANGE
SPHERE
AK
3/12/2012 21:45
65707
Wasilla
ORANGE
SPHERE
AK
9/8/2012 19:00
49308
Juneau
RED
TRIANGLE
AK
6/27/2009 2:30
59064
Wasilla
YELLOW
TRIANGLE
AK
9/11/2011 9:00
60830
North Pole
ORANGE YELLOW
TRIANGLE
AK
12/30/2011 4:47
68192
Delta Junction
ORANGE
TRIANGLE
AK
1/1/2013 1:00
76701
Bethel
ORANGE
TRIANGLE
AK
3/28/2014 11:03
77045
Wasilla
RED
TRIANGLE
AK
4/14/2014 22:45
18233
Anchorage
RED
VARIOUS
AK
12/31/2000 21:00
15516 rows × 5 columns
In [98]:
# Drop missing values
ufo.dropna() # Drop a row if ANY values are missing
Out[98]:
City
Colors
Shape
State
Time
7064
Mammoth Hot Springs
YELLOW
CHEVRON
WY
6/4/1993 21:55
23452
Lander
GREEN
CIRCLE
WY
10/6/2002 19:25
59084
Casper
RED GREEN BLUE
CIRCLE
WY
9/12/2011 20:30
68629
Point of Rocks
GREEN
CIRCLE
WY
2/14/2013 6:00
77097
Yellowstone North Entrance
RED YELLOW
CIRCLE
WY
4/17/2014 18:00
23356
Fort Bridger
GREEN
DISK
WY
9/28/2002 18:30
67644
Rock Springs
RED
DISK
WY
12/7/2012 6:30
12241
Rock Springs
GREEN
FIREBALL
WY
9/26/1998 3:05
23132
Gillette
GREEN
FIREBALL
WY
9/6/2002 20:30
10729
Atlantic City
RED
LIGHT
WY
8/22/1997 23:30
24173
Rock Springs
RED GREEN
LIGHT
WY
1/8/2003 20:15
38591
Sheridan
RED GREEN
LIGHT
WY
12/19/2006 21:00
40417
Douglas
ORANGE
LIGHT
WY
6/24/2007 0:30
61636
Alpine
RED
LIGHT
WY
2/15/2012 21:05
63575
Bill
GREEN
LIGHT
WY
6/20/2012 12:42
71785
Devils Tower
GREEN
LIGHT
WY
8/12/2013 21:30
75664
Fort Washakie
RED
LIGHT
WY
1/23/2014 23:00
76253
Evanston
ORANGE
LIGHT
WY
2/25/2014 18:35
77802
Rock Springs
ORANGE
LIGHT
WY
5/24/2014 22:00
46556
Rock Springs
ORANGE
OTHER
WY
10/25/2008 21:00
53963
Cheyenne
YELLOW
OTHER
WY
8/6/2010 0:30
20679
Evanston
GREEN
OVAL
WY
10/24/2001 23:00
23134
Baggs
RED
OVAL
WY
9/6/2002 21:00
45717
Laramie
RED
OVAL
WY
8/20/2008 15:30
163
Green River
GREEN
SPHERE
WY
7/3/1951 12:00
35099
Fox Park
RED GREEN
SPHERE
WY
12/11/2005 21:15
57383
Green River
RED GREEN
SPHERE
WY
6/16/2011 23:30
66994
Casper
GREEN
SPHERE
WY
11/7/2012 19:30
74569
Laramie
ORANGE
SPHERE
WY
12/7/2013 1:00
80082
Yellowstone National Park
YELLOW
TEARDROP
WY
8/19/2014 7:40
...
...
...
...
...
...
20579
Fairbanks
RED
OTHER
AK
10/11/2001 2:15
25724
Anchorage
RED
OTHER
AK
8/10/2003 3:00
26770
Anchorage
RED
OTHER
AK
10/22/2003 7:00
38099
North Pole
GREEN
OTHER
AK
11/10/2006 23:15
3149
Ft. Yukon
BLUE
OVAL
AK
2/13/1976 13:30
5459
Teller
RED GREEN
OVAL
AK
6/20/1986 23:30
11112
Fairbanks
GREEN
OVAL
AK
1/8/1998 22:38
67995
Fairbanks
RED YELLOW
OVAL
AK
12/25/2012 16:30
72717
Anchorage
RED ORANGE
OVAL
AK
9/21/2013 21:20
76790
Kenai
RED
OVAL
AK
4/2/2014 11:00
33158
Susitna Lodge
BLUE
RECTANGLE
AK
7/7/2005 18:45
3395
Soldotna
RED
SPHERE
AK
10/20/1976 22:00
4716
Juneau
ORANGE
SPHERE
AK
6/15/1982 14:00
5554
Girdwood
ORANGE
SPHERE
AK
11/15/1986 14:30
11313
North Pole
BLUE
SPHERE
AK
3/15/1998 20:00
12222
North Pole
ORANGE
SPHERE
AK
9/24/1998 22:35
18127
North Pole
ORANGE
SPHERE
AK
12/15/2000 16:36
27518
Anchorage
ORANGE
SPHERE
AK
12/25/2003 0:00
51682
Willow
RED
SPHERE
AK
1/1/2010 12:01
55791
Wasilla
RED
SPHERE
AK
12/31/2010 21:00
60708
Fairbanks
RED BLUE
SPHERE
AK
12/24/2011 2:00
61938
Palmer
ORANGE
SPHERE
AK
3/12/2012 21:45
65707
Wasilla
ORANGE
SPHERE
AK
9/8/2012 19:00
49308
Juneau
RED
TRIANGLE
AK
6/27/2009 2:30
59064
Wasilla
YELLOW
TRIANGLE
AK
9/11/2011 9:00
60830
North Pole
ORANGE YELLOW
TRIANGLE
AK
12/30/2011 4:47
68192
Delta Junction
ORANGE
TRIANGLE
AK
1/1/2013 1:00
76701
Bethel
ORANGE
TRIANGLE
AK
3/28/2014 11:03
77045
Wasilla
RED
TRIANGLE
AK
4/14/2014 22:45
18233
Anchorage
RED
VARIOUS
AK
12/31/2000 21:00
15510 rows × 5 columns
In [99]:
ufo.dropna(how='all') # Drop a row only if ALL values are missing
Out[99]:
City
Colors
Shape
State
Time
7064
Mammoth Hot Springs
YELLOW
CHEVRON
WY
6/4/1993 21:55
24658
Jackson Hole
NaN
CHEVRON
WY
3/16/2003 11:00
61928
Rock Springs
NaN
CHEVRON
WY
3/12/2012 1:02
73885
Casper
NaN
CHEVRON
WY
11/7/2013 1:45
74651
Powell
NaN
CHEVRON
WY
12/11/2013 18:25
1461
Wyoming
NaN
CIGAR
WY
7/12/1967 2:00
12072
Cheyenne
NaN
CIGAR
WY
9/4/1998 15:00
60064
Torrington
NaN
CIGAR
WY
11/5/2011 21:30
72264
Wyoming
NaN
CIGAR
WY
9/1/2013 0:00
1985
Cheyenne
NaN
CIRCLE
WY
6/20/1970 2:00
3326
Casper
NaN
CIRCLE
WY
7/25/1976 18:00
9399
Gillette
NaN
CIRCLE
WY
6/15/1996 2:30
23452
Lander
GREEN
CIRCLE
WY
10/6/2002 19:25
29083
Hanna
NaN
CIRCLE
WY
6/5/2004 16:00
49543
Powell
NaN
CIRCLE
WY
7/8/2009 4:25
49739
Laramie
NaN
CIRCLE
WY
7/21/2009 22:15
52524
Evansville
NaN
CIRCLE
WY
4/30/2010 21:30
57240
Mills
NaN
CIRCLE
WY
6/3/2011 15:00
58355
Cheyenne
NaN
CIRCLE
WY
8/5/2011 18:13
59084
Casper
RED GREEN BLUE
CIRCLE
WY
9/12/2011 20:30
59194
La Barge
NaN
CIRCLE
WY
9/18/2011 21:00
62612
Rock Springs
NaN
CIRCLE
WY
4/21/2012 23:05
63577
Laramie
NaN
CIRCLE
WY
6/20/2012 16:00
68629
Point of Rocks
GREEN
CIRCLE
WY
2/14/2013 6:00
72481
Gillette
NaN
CIRCLE
WY
9/11/2013 9:00
76250
Casper
NaN
CIRCLE
WY
2/25/2014 2:20
77097
Yellowstone North Entrance
RED YELLOW
CIRCLE
WY
4/17/2014 18:00
79897
Evanston
NaN
CONE
WY
8/11/2014 19:45
3328
Cowley
NaN
CYLINDER
WY
7/27/1976 1:00
13823
Buffalo
NaN
CYLINDER
WY
7/1/1999 22:30
...
...
...
...
...
...
8821
Prince of Wales Island
NaN
NaN
AK
10/27/1995 20:28
9129
Mountain Village
NaN
NaN
AK
3/2/1996 0:30
9212
Nikishi
RED
NaN
AK
4/12/1996 2:00
10262
North Pole
NaN
NaN
AK
5/20/1997 15:00
13552
Wacker City
NaN
NaN
AK
5/26/1999 18:00
15741
Juneau
NaN
NaN
AK
1/18/2000 7:45
17589
Fairbanks
NaN
NaN
AK
10/2/2000 23:00
18116
North Pole
NaN
NaN
AK
12/12/2000 22:00
21296
Anchorage
RED
NaN
AK
1/20/2002 20:00
23064
St. George Island
NaN
NaN
AK
8/30/2002 23:30
31387
Fairbanks
NaN
NaN
AK
12/20/2004 20:00
33534
Suriname
NaN
NaN
AK
8/11/2005 6:11
34623
Glennallen
NaN
NaN
AK
11/1/2005 0:30
44957
Fairbanks
BLUE
NaN
AK
7/3/2008 16:40
47362
Fairbanks
NaN
NaN
AK
12/29/2008 15:00
48773
Eagle River
NaN
NaN
AK
5/9/2009 2:30
52781
Prince William Sound
NaN
NaN
AK
5/29/2010 9:45
53051
Elk
NaN
NaN
AK
6/25/2010 10:22
56186
Fairbanks
NaN
NaN
AK
2/11/2011 8:00
56499
Hooper Bay
NaN
NaN
AK
3/15/2011 0:03
61311
Anchorage
ORANGE
NaN
AK
1/22/2012 20:15
65642
Fairbanks
NaN
NaN
AK
9/3/2012 21:57
68292
Anchorage
NaN
NaN
AK
1/7/2013 1:50
71252
Sterling
NaN
NaN
AK
7/22/2013 3:13
73090
Anchorage
NaN
NaN
AK
10/5/2013 2:55
73916
Fairbanks
ORANGE BLUE
NaN
AK
11/8/2013 23:00
74639
Fairbanks
ORANGE
NaN
AK
12/10/2013 23:15
74969
Anchorage
NaN
NaN
AK
12/26/2013 18:21
75471
Talkitna
NaN
NaN
AK
1/14/2014 23:00
77458
Anchorage
NaN
NaN
AK
5/5/2014 16:00
80543 rows × 5 columns
In [100]:
# Fill in missing values for a series
ufo.Colors.fillna(value='Unknown', inplace=True)
In [101]:
# Fill in missing values for the DataFrame
ufo.fillna(value='Unknown', inplace=True)
In [31]:
# Read drinks.csv (in the 'drinks_data' folder) into a DataFrame called 'drinks'
In [32]:
# Print the first 10 rows
In [33]:
# Examine the data types of all columns
In [34]:
# Print the 'beer_servings' Series
In [35]:
# Calculate the average 'beer_servings' for the entire dataset
In [36]:
# Print all columns, but only show rows where the country is in Europe
In [37]:
# Calculate the average 'beer_servings' for all of Europe
In [38]:
# Only show European countries with 'wine_servings' greater than 300
In [39]:
# Determine which 10 countries have the highest 'total_litres_of_pure_alcohol'
In [40]:
# Determine which country has the highest value for 'beer_servings'
In [41]:
# Count the number of occurrences of each 'continent' value and see if it looks correct
In [42]:
# Determine which countries do not have continent designations
In [43]:
# Determine the number of countries per continent. Does it look right?
In [102]:
# Read drinks.csv (in the drinks_data folder) into a DataFrame called 'drinks'
drinks = pd.read_csv('drinks_data/drinks.csv')
In [103]:
# Print the first 10 rows
drinks.head(10)
Out[103]:
country
beer_servings
spirit_servings
wine_servings
total_litres_of_pure_alcohol
continent
0
Afghanistan
0
0
0
0.0
AS
1
Albania
89
132
54
4.9
EU
2
Algeria
25
0
14
0.7
AF
3
Andorra
245
138
312
12.4
EU
4
Angola
217
57
45
5.9
AF
5
Antigua & Barbuda
102
128
45
4.9
NaN
6
Argentina
193
25
221
8.3
SA
7
Armenia
21
179
11
3.8
EU
8
Australia
261
72
212
10.4
OC
9
Austria
279
75
191
9.7
EU
In [104]:
# Examine the data types of all columns
drinks.dtypes
drinks.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 193 entries, 0 to 192
Data columns (total 6 columns):
country 193 non-null object
beer_servings 193 non-null int64
spirit_servings 193 non-null int64
wine_servings 193 non-null int64
total_litres_of_pure_alcohol 193 non-null float64
continent 170 non-null object
dtypes: float64(1), int64(3), object(2)
memory usage: 10.6+ KB
In [ ]:
# Print the 'beer_servings' Series
drinks.beer_servings
drinks['beer_servings']
In [107]:
# Calculate the average 'beer_servings' for the entire dataset
drinks.describe() # Mean is provided in the summary from describe()
drinks.beer_servings.mean() # Alternatively, calculate the mean directly
Out[107]:
beer_servings
spirit_servings
wine_servings
total_litres_of_pure_alcohol
count
193.000000
193.000000
193.000000
193.000000
mean
106.160622
80.994819
49.450777
4.717098
std
101.143103
88.284312
79.697598
3.773298
min
0.000000
0.000000
0.000000
0.000000
25%
20.000000
4.000000
1.000000
1.300000
50%
76.000000
56.000000
8.000000
4.200000
75%
188.000000
128.000000
59.000000
7.200000
max
376.000000
438.000000
370.000000
14.400000
In [108]:
# Print all columns, but only show rows where the country is in Europe
drinks[drinks.continent=='EU']
Out[108]:
country
beer_servings
spirit_servings
wine_servings
total_litres_of_pure_alcohol
continent
1
Albania
89
132
54
4.9
EU
3
Andorra
245
138
312
12.4
EU
7
Armenia
21
179
11
3.8
EU
9
Austria
279
75
191
9.7
EU
10
Azerbaijan
21
46
5
1.3
EU
15
Belarus
142
373
42
14.4
EU
16
Belgium
295
84
212
10.5
EU
21
Bosnia-Herzegovina
76
173
8
4.6
EU
25
Bulgaria
231
252
94
10.3
EU
42
Croatia
230
87
254
10.2
EU
44
Cyprus
192
154
113
8.2
EU
45
Czech Republic
361
170
134
11.8
EU
48
Denmark
224
81
278
10.4
EU
57
Estonia
224
194
59
9.5
EU
60
Finland
263
133
97
10.0
EU
61
France
127
151
370
11.8
EU
64
Georgia
52
100
149
5.4
EU
65
Germany
346
117
175
11.3
EU
67
Greece
133
112
218
8.3
EU
75
Hungary
234
215
185
11.3
EU
76
Iceland
233
61
78
6.6
EU
81
Ireland
313
118
165
11.4
EU
83
Italy
85
42
237
6.5
EU
93
Latvia
281
216
62
10.5
EU
98
Lithuania
343
244
56
12.9
EU
99
Luxembourg
236
133
271
11.4
EU
105
Malta
149
100
120
6.6
EU
111
Monaco
0
0
0
0.0
EU
113
Montenegro
31
114
128
4.9
EU
120
Netherlands
251
88
190
9.4
EU
126
Norway
169
71
129
6.7
EU
135
Poland
343
215
56
10.9
EU
136
Portugal
194
67
339
11.0
EU
139
Moldova
109
226
18
6.3
EU
140
Romania
297
122
167
10.4
EU
147
San Marino
0
0
0
0.0
EU
151
Serbia
283
131
127
9.6
EU
155
Slovakia
196
293
116
11.4
EU
156
Slovenia
270
51
276
10.6
EU
160
Spain
284
157
112
10.0
EU
165
Sweden
152
60
186
7.2
EU
166
Switzerland
185
100
280
10.2
EU
170
Macedonia
106
27
86
3.9
EU
180
Ukraine
206
237
45
8.9
EU
182
United Kingdom
219
126
195
10.4
EU
In [109]:
# Calculate the average 'beer_servings' for all of Europe (hint: use the .mean() function)
drinks[drinks.continent=='EU'].beer_servings.mean()
Out[109]:
193.77777777777777
In [110]:
# Only show European countries with 'wine_servings' greater than 300
drinks[(drinks.continent=='EU') & (drinks.wine_servings > 300)]
Out[110]:
country
beer_servings
spirit_servings
wine_servings
total_litres_of_pure_alcohol
continent
3
Andorra
245
138
312
12.4
EU
61
France
127
151
370
11.8
EU
136
Portugal
194
67
339
11.0
EU
In [111]:
# Determine which 10 countries have the highest 'total_litres_of_pure_alcohol'
drinks.sort_index(by='total_litres_of_pure_alcohol').tail(10)
Out[111]:
country
beer_servings
spirit_servings
wine_servings
total_litres_of_pure_alcohol
continent
99
Luxembourg
236
133
271
11.4
EU
155
Slovakia
196
293
116
11.4
EU
81
Ireland
313
118
165
11.4
EU
141
Russian Federation
247
326
73
11.5
AS
61
France
127
151
370
11.8
EU
45
Czech Republic
361
170
134
11.8
EU
68
Grenada
199
438
28
11.9
NaN
3
Andorra
245
138
312
12.4
EU
98
Lithuania
343
244
56
12.9
EU
15
Belarus
142
373
42
14.4
EU
In [112]:
# Determine which country has the highest value for 'beer_servings' (hint: use the .max() function)
drinks[drinks.beer_servings==drinks.beer_servings.max()].country
Out[112]:
117 Namibia
Name: country, dtype: object
In [113]:
drinks[['country', 'beer_servings']].sort_index(by='beer_servings', ascending=False).head(1) # This is equivalent
Out[113]:
country
beer_servings
117
Namibia
376
In [114]:
# Count the number of occurrences of each 'continent' value and see if it looks correct
drinks.continent.value_counts()
Out[114]:
AF 53
EU 45
AS 44
OC 16
SA 12
dtype: int64
In [115]:
# Determine which countries do not have continent designations
drinks[drinks.continent.isnull()].country
Out[115]:
5 Antigua & Barbuda
11 Bahamas
14 Barbados
17 Belize
32 Canada
41 Costa Rica
43 Cuba
50 Dominica
51 Dominican Republic
54 El Salvador
68 Grenada
69 Guatemala
73 Haiti
74 Honduras
84 Jamaica
109 Mexico
122 Nicaragua
130 Panama
143 St. Kitts & Nevis
144 St. Lucia
145 St. Vincent & the Grenadines
174 Trinidad & Tobago
184 USA
Name: country, dtype: object
In [116]:
# Due to "na_filter = True" default within pd.read_csv()
help(pd.read_csv)
Help on function read_csv in module pandas.io.parsers:
read_csv(filepath_or_buffer, sep=',', dialect=None, compression=None, doublequote=True, escapechar=None, quotechar='"', quoting=0, skipinitialspace=False, lineterminator=None, header='infer', index_col=None, names=None, prefix=None, skiprows=None, skipfooter=None, skip_footer=0, na_values=None, na_fvalues=None, true_values=None, false_values=None, delimiter=None, converters=None, dtype=None, usecols=None, engine=None, delim_whitespace=False, as_recarray=False, na_filter=True, compact_ints=False, use_unsigned=False, low_memory=True, buffer_lines=None, warn_bad_lines=True, error_bad_lines=True, keep_default_na=True, thousands=None, comment=None, decimal='.', parse_dates=False, keep_date_col=False, dayfirst=False, date_parser=None, memory_map=False, float_precision=None, nrows=None, iterator=False, chunksize=None, verbose=False, encoding=None, squeeze=False, mangle_dupe_cols=True, tupleize_cols=False, infer_datetime_format=False, skip_blank_lines=True)
Read CSV (comma-separated) file into DataFrame
Also supports optionally iterating or breaking of the file
into chunks.
Parameters
----------
filepath_or_buffer : string or file handle / StringIO
The string could be a URL. Valid URL schemes include
http, ftp, s3, and file. For file URLs, a
host is expected. For instance, a local file could be
file ://localhost/path/to/table.csv
sep : string, default ','
Delimiter to use. If sep is None, will try to automatically determine
this. Regular expressions are accepted.
engine : {'c', 'python'}
Parser engine to use. The C engine is faster while the python engine is
currently more feature-complete.
lineterminator : string (length 1), default None
Character to break file into lines. Only valid with C parser
quotechar : string (length 1)
The character used to denote the start and end of a quoted item. Quoted
items can include the delimiter and it will be ignored.
quoting : int or csv.QUOTE_* instance, default None
Control field quoting behavior per ``csv.QUOTE_*`` constants. Use one of
QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).
Default (None) results in QUOTE_MINIMAL behavior.
skipinitialspace : boolean, default False
Skip spaces after delimiter
escapechar : string (length 1), default None
One-character string used to escape delimiter when quoting is QUOTE_NONE.
dtype : Type name or dict of column -> type
Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32}
(Unsupported with engine='python')
compression : {'gzip', 'bz2', None}, default None
For on-the-fly decompression of on-disk data
dialect : string or csv.Dialect instance, default None
If None defaults to Excel dialect. Ignored if sep longer than 1 char
See csv.Dialect documentation for more details
header : int, list of ints
Row number(s) to use as the column names, and the start of the
data. Defaults to 0 if no ``names`` passed, otherwise ``None``. Explicitly
pass ``header=0`` to be able to replace existing names. The header can be
a list of integers that specify row locations for a multi-index on the
columns E.g. [0,1,3]. Intervening rows that are not specified will be
skipped (e.g. 2 in this example are skipped). Note that this parameter
ignores commented lines and empty lines if ``skip_blank_lines=True``, so header=0
denotes the first line of data rather than the first line of the file.
skiprows : list-like or integer
Line numbers to skip (0-indexed) or number of lines to skip (int)
at the start of the file
index_col : int or sequence or False, default None
Column to use as the row labels of the DataFrame. If a sequence is given, a
MultiIndex is used. If you have a malformed file with delimiters at the end
of each line, you might consider index_col=False to force pandas to _not_
use the first column as the index (row names)
names : array-like
List of column names to use. If file contains no header row, then you
should explicitly pass header=None
prefix : string, default None
Prefix to add to column numbers when no header, e.g 'X' for X0, X1, ...
na_values : list-like or dict, default None
Additional strings to recognize as NA/NaN. If dict passed, specific
per-column NA values
true_values : list
Values to consider as True
false_values : list
Values to consider as False
keep_default_na : bool, default True
If na_values are specified and keep_default_na is False the default NaN
values are overridden, otherwise they're appended to
parse_dates : boolean, list of ints or names, list of lists, or dict
If True -> try parsing the index.
If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column.
If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.
{'foo' : [1, 3]} -> parse columns 1, 3 as date and call result 'foo'
A fast-path exists for iso8601-formatted dates.
keep_date_col : boolean, default False
If True and parse_dates specifies combining multiple columns then
keep the original columns.
date_parser : function
Function to use for converting a sequence of string columns to an
array of datetime instances. The default uses dateutil.parser.parser
to do the conversion.
dayfirst : boolean, default False
DD/MM format dates, international and European format
thousands : str, default None
Thousands separator
comment : str, default None
Indicates remainder of line should not be parsed. If found at the
beginning of a line, the line will be ignored altogether. This parameter
must be a single character. Like empty lines (as long as ``skip_blank_lines=True``),
fully commented lines are ignored by the parameter `header`
but not by `skiprows`. For example, if comment='#', parsing
'#empty\na,b,c\n1,2,3' with `header=0` will result in 'a,b,c' being
treated as the header.
decimal : str, default '.'
Character to recognize as decimal point. E.g. use ',' for European data
nrows : int, default None
Number of rows of file to read. Useful for reading pieces of large files
iterator : boolean, default False
Return TextFileReader object
chunksize : int, default None
Return TextFileReader object for iteration
skipfooter : int, default 0
Number of lines at bottom of file to skip (Unsupported with engine='c')
converters : dict, default None
Dict of functions for converting values in certain columns. Keys can either
be integers or column labels
verbose : boolean, default False
Indicate number of NA values placed in non-numeric columns
delimiter : string, default None
Alternative argument name for sep. Regular expressions are accepted.
encoding : string, default None
Encoding to use for UTF when reading/writing (ex. 'utf-8'). `List of Python
standard encodings
<https://docs.python.org/3/library/codecs.html#standard-encodings>`_
squeeze : boolean, default False
If the parsed data only contains one column then return a Series
na_filter : boolean, default True
Detect missing value markers (empty strings and the value of na_values). In
data without any NAs, passing na_filter=False can improve the performance
of reading a large file
usecols : array-like
Return a subset of the columns.
Results in much faster parsing time and lower memory usage.
mangle_dupe_cols : boolean, default True
Duplicate columns will be specified as 'X.0'...'X.N', rather than 'X'...'X'
tupleize_cols : boolean, default False
Leave a list of tuples on columns as is (default is to convert to
a Multi Index on the columns)
error_bad_lines : boolean, default True
Lines with too many fields (e.g. a csv line with too many commas) will by
default cause an exception to be raised, and no DataFrame will be returned.
If False, then these "bad lines" will dropped from the DataFrame that is
returned. (Only valid with C parser)
warn_bad_lines : boolean, default True
If error_bad_lines is False, and warn_bad_lines is True, a warning for each
"bad line" will be output. (Only valid with C parser).
infer_datetime_format : boolean, default False
If True and parse_dates is enabled for a column, attempt to infer
the datetime format to speed up the processing
skip_blank_lines : boolean, default True
If True, skip over blank lines rather than interpreting as NaN values
Returns
-------
result : DataFrame or TextParser
In [117]:
ufo.set_index('State', inplace=True)
ufo.index
Out[117]:
Index([u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', ...], dtype='object')
In [118]:
ufo.index.is_unique
Out[118]:
False
In [119]:
ufo.sort_index(inplace=True)
ufo.head(25)
Out[119]:
City
Colors
Shape
Time
State
AK
Wainwright
Unknown
LIGHT
1/2/2002 21:32
AK
Mountain Village
Unknown
Unknown
3/2/1996 0:30
AK
Anchorage
Unknown
Unknown
5/5/2014 16:00
AK
Talkitna
Unknown
Unknown
1/14/2014 23:00
AK
Anchorage
Unknown
Unknown
12/26/2013 18:21
AK
Fairbanks
ORANGE
Unknown
12/10/2013 23:15
AK
Fairbanks
ORANGE BLUE
Unknown
11/8/2013 23:00
AK
Anchorage
Unknown
Unknown
10/5/2013 2:55
AK
Sterling
Unknown
Unknown
7/22/2013 3:13
AK
Anchorage
Unknown
Unknown
1/7/2013 1:50
AK
Fairbanks
Unknown
Unknown
9/3/2012 21:57
AK
Anchorage
ORANGE
Unknown
1/22/2012 20:15
AK
Hooper Bay
Unknown
Unknown
3/15/2011 0:03
AK
Fairbanks
Unknown
Unknown
2/11/2011 8:00
AK
Elk
Unknown
Unknown
6/25/2010 10:22
AK
Prince William Sound
Unknown
Unknown
5/29/2010 9:45
AK
Eagle River
Unknown
Unknown
5/9/2009 2:30
AK
Fairbanks
Unknown
Unknown
12/29/2008 15:00
AK
Fairbanks
BLUE
Unknown
7/3/2008 16:40
AK
Glennallen
Unknown
Unknown
11/1/2005 0:30
AK
Suriname
Unknown
Unknown
8/11/2005 6:11
AK
Fairbanks
Unknown
Unknown
12/20/2004 20:00
AK
St. George Island
Unknown
Unknown
8/30/2002 23:30
AK
Anchorage
RED
Unknown
1/20/2002 20:00
AK
North Pole
Unknown
Unknown
12/12/2000 22:00
In [120]:
ufo.loc['FL',:] # row with label FL`
Out[120]:
City
Colors
Shape
Time
State
FL
Plantation
Unknown
TEARDROP
11/5/2000 5:12
FL
Wildwood
RED GREEN
Unknown
9/2/2014 5:00
FL
Kissimmee
Unknown
Unknown
8/22/2014 21:10
FL
Cape Canaveral
RED GREEN
Unknown
8/6/2014 22:00
FL
Port Saint Lucie
Unknown
Unknown
8/5/2014 4:00
FL
South Daytona
RED
Unknown
7/30/2014 23:15
FL
Sarasota
Unknown
Unknown
7/26/2014 21:00
FL
Jacksonville
Unknown
Unknown
7/22/2014 0:05
FL
Ruskin
Unknown
Unknown
7/14/2014 21:15
FL
Delray Beach
ORANGE
Unknown
7/4/2014 22:00
FL
Clermont
RED
Unknown
7/4/2014 19:00
FL
Nokomis
Unknown
Unknown
6/24/2014 12:15
FL
Orlando
Unknown
Unknown
6/8/2014 10:15
FL
Palm City
Unknown
Unknown
6/5/2014 21:30
FL
Port Richey
Unknown
Unknown
5/7/2014 9:30
FL
Clermont
Unknown
Unknown
5/4/2014 4:00
FL
Orange Park
Unknown
Unknown
5/2/2014 2:37
FL
Tarpon Springs
Unknown
Unknown
4/26/2014 21:45
FL
Orlando
RED
Unknown
4/19/2014 21:30
FL
New Port Richey
Unknown
Unknown
4/13/2014 22:30
FL
Panama City Beach
ORANGE
Unknown
4/12/2014 20:00
FL
Parrish
ORANGE
Unknown
4/11/2014 23:00
FL
Lehigh Acres
Unknown
Unknown
4/2/2014 22:00
FL
Port St. Lucie
Unknown
Unknown
4/1/2014 23:00
FL
Celebration
RED ORANGE
Unknown
3/31/2014 21:00
FL
Lehigh Acres
Unknown
Unknown
3/27/2014 23:30
FL
Kissimmee
Unknown
Unknown
3/20/2014 22:00
FL
Davie
Unknown
Unknown
3/19/2014 16:00
FL
St. Cloud
Unknown
Unknown
3/9/2014 22:10
FL
Tampa
ORANGE
Unknown
3/8/2014 21:25
...
...
...
...
...
FL
St. George Island
RED
CHEVRON
5/23/2005 21:30
FL
Miami Gardens
Unknown
CHEVRON
10/27/2004 22:00
FL
Jacksonville
Unknown
CHEVRON
4/30/2004 16:00
FL
Orlando
Unknown
CHEVRON
3/29/2004 5:15
FL
Pensacola
Unknown
CHEVRON
11/8/2003 19:00
FL
New Port Richey
Unknown
CHEVRON
8/14/2003 18:00
FL
Pensacola
Unknown
CHEVRON
4/10/2003 23:15
FL
Panama City
Unknown
CHEVRON
3/13/2003 23:00
FL
Merritt Island
Unknown
CHEVRON
9/19/2002 21:00
FL
St. James City
ORANGE
CHEVRON
5/10/2002 20:00
FL
Hollywood
Unknown
CHEVRON
11/21/2001 23:45
FL
Perry
Unknown
CHEVRON
11/20/2001 4:05
FL
Panama City Beach
ORANGE
CHEVRON
11/20/2001 2:55
FL
Melbourne
Unknown
CHEVRON
7/7/2001 21:45
FL
Miami
Unknown
CHEVRON
10/13/2000 20:00
FL
Warrington
Unknown
CHEVRON
8/18/2000 20:15
FL
Delray Beach
Unknown
CHEVRON
4/19/2000 23:09
FL
Jacksonville
Unknown
CHEVRON
10/17/1999 15:00
FL
Jacksonville
Unknown
CHEVRON
10/17/1999 15:00
FL
Winter Park
Unknown
CHEVRON
4/11/1999 23:49
FL
St. Petersburg
Unknown
CHEVRON
3/23/1999 21:15
FL
Ormond Beach
Unknown
CHEVRON
2/14/1997 20:00
FL
Brooksville
Unknown
CHEVRON
4/15/1994 22:00
FL
Cedar Key
Unknown
CHEVRON
1/12/1994 18:00
FL
Miami
Unknown
CHEVRON
9/29/1989 6:15
FL
Tallahassee
Unknown
CHEVRON
8/20/1987 21:00
FL
Fort Lauderdale
Unknown
CHEVRON
12/3/1977 0:50
FL
Opa Locka
Unknown
CHEVRON
1/1/1967 13:00
FL
Jacksonville
Unknown
CHEVRON
11/12/1966 0:00
FL
Opa Locka
Unknown
CHEVRON
11/1/1966 12:00
4804 rows × 4 columns
In [ ]:
ufo.loc[:'FL',:] # rows with labels through'FL'
In [ ]:
ufo.loc['FL':'HI', 'City':'Shape'] # rows FL, columns 'City' through 'Shape Reported'
In [ ]:
ufo.loc[:, 'City':'Shape'] # all rows, columns 'City' through 'Shape Reported'
In [ ]:
ufo.loc[['FL', 'TX'], ['City','Shape']] # rows FL and TX, columns 'City' and 'Shape Reported'
In [ ]:
ufo.iloc[0,:] # row with 0th position (first row)
In [ ]:
ufo.iloc[0:3,:] # rows with positions 0 through 2 (not 3)
In [ ]:
ufo.iloc[0:3, 0:3] # rows and columns with positions 0 through 2
In [ ]:
ufo.iloc[:, 0:3] # all rows, columns with positions 0 through 2
In [ ]:
ufo.iloc[[0,2], [0,1]] # 1st and 3rd row, 1st and 2nd column
In [121]:
ufo.set_index('City', inplace=True, append=True) # Adds to existing index
ufo.sort_index(inplace=True)
ufo.head(25)
Out[121]:
Colors
Shape
Time
State
City
AK
Adak
Unknown
LIGHT
6/1/1986 0:00
Alaska
Unknown
CIGAR
4/29/2008 0:00
Alaska
Unknown
DISK
1/15/1998 13:00
Alaska
Unknown
LIGHT
9/1/1999 23:00
Alaska cruise
Unknown
LIGHT
5/11/2011 21:00
Anchor Point
Unknown
DISK
5/24/2005 18:30
Anchorage
Unknown
Unknown
5/5/2014 16:00
Anchorage
Unknown
Unknown
12/26/2013 18:21
Anchorage
Unknown
Unknown
10/5/2013 2:55
Anchorage
Unknown
Unknown
1/7/2013 1:50
Anchorage
ORANGE
Unknown
1/22/2012 20:15
Anchorage
RED
Unknown
1/20/2002 20:00
Anchorage
Unknown
CIGAR
3/29/2014 20:45
Anchorage
BLUE
CIRCLE
8/30/1972 1:30
Anchorage
Unknown
CIRCLE
10/31/1998 10:00
Anchorage
Unknown
CIRCLE
1/24/2003 8:30
Anchorage
Unknown
CIRCLE
8/16/2004 12:08
Anchorage
Unknown
CIRCLE
10/23/2006 21:00
Anchorage
Unknown
CIRCLE
8/6/2010 18:00
Anchorage
Unknown
CIRCLE
12/16/2010 12:00
Anchorage
Unknown
CIRCLE
10/21/2011 21:00
Anchorage
Unknown
CIRCLE
11/23/2011 11:00
Anchorage
ORANGE
CIRCLE
10/23/2013 20:00
Anchorage
RED
CIRCLE
12/19/2013 16:00
Anchorage
RED
CIRCLE
12/23/2013 22:22
In [122]:
ufo.loc[['ND', 'WY'],:] # Select all records from ND AND WY
Out[122]:
Colors
Shape
Time
State
City
ND
Amenia
Unknown
DISK
7/15/1978 18:00
Aneta
Unknown
CIRCLE
12/20/2001 19:30
Arthur
Unknown
Unknown
9/7/1958 19:00
Beach
Unknown
DISK
7/3/1996 11:30
Belcourt
Unknown
OTHER
6/18/2012 7:00
Belcourt
Unknown
CIRCLE
4/14/2012 6:00
Belfield
Unknown
LIGHT
7/17/1997 23:00
Belfield
Unknown
DISK
9/15/2000 19:00
Berthold
Unknown
CIGAR
7/15/1976 15:00
Bismarck
ORANGE
Unknown
11/24/1996 19:30
Bismarck
Unknown
VARIOUS
9/10/2008 22:30
Bismarck
ORANGE
TRIANGLE
4/25/2012 23:30
Bismarck
Unknown
TRIANGLE
9/29/2011 22:30
Bismarck
RED
RECTANGLE
6/30/2008 23:00
Bismarck
Unknown
OTHER
11/28/2003 23:45
Bismarck
Unknown
LIGHT
8/5/2013 23:15
Bismarck
Unknown
LIGHT
6/4/2010 22:00
Bismarck
Unknown
LIGHT
7/1/2000 2:30
Bismarck
Unknown
LIGHT
11/24/1996 18:07
Bismarck
Unknown
FORMATION
10/11/1996 16:00
Bismarck
Unknown
DISK
4/11/2009 11:20
Bismarck
GREEN BLUE
CIRCLE
8/14/2004 4:15
Brunswick County
Unknown
Unknown
10/25/1995 16:55
Burlington
Unknown
FIREBALL
8/14/1998 16:00
Cartwright
Unknown
RECTANGLE
5/20/1999 14:00
Casselton
Unknown
TRIANGLE
4/11/2012 21:58
Casselton
Unknown
TRIANGLE
4/11/2012 21:48
Casselton
RED BLUE
Unknown
8/24/2008 1:00
Central U.S.
Unknown
DIAMOND
4/2/1995 13:00
Colfax
Unknown
DISK
1/15/2008 21:00
...
...
...
...
...
WY
Sheridan
Unknown
OVAL
9/6/2002 21:00
Sheridan
Unknown
OTHER
9/24/1998 0:15
Sheridan
Unknown
LIGHT
8/7/2012 22:20
Sheridan
RED GREEN
LIGHT
12/19/2006 21:00
Sheridan
Unknown
LIGHT
10/28/2004 2:00
Shirley Basin Road
Unknown
LIGHT
3/15/2006 23:30
Ten Sleep
Unknown
DISK
3/21/1998 20:00
Teton National Park
Unknown
OVAL
7/19/2008 12:00
Teton National Park
Unknown
OTHER
9/26/2005 11:00
Thermopolis
Unknown
Unknown
6/14/2007 23:00
Torrington
Unknown
CIGAR
11/5/2011 21:30
Wheeling
Unknown
Unknown
6/15/1984 20:30
Worland
Unknown
OVAL
2/15/2008 5:00
Worland
Unknown
LIGHT
6/17/2003 22:42
Wyoming
Unknown
OTHER
9/28/2005 16:30
Wyoming
Unknown
OTHER
4/20/1983 0:00
Wyoming
Unknown
CIGAR
9/1/2013 0:00
Wyoming
Unknown
CIGAR
7/12/1967 2:00
Yellowstone
Unknown
LIGHT
8/4/2013 22:00
Yellowstone National Park
Unknown
VARIOUS
9/20/2011 19:55
Yellowstone National Park
YELLOW
TEARDROP
8/19/2014 7:40
Yellowstone National Park
Unknown
SPHERE
9/29/2003 20:35
Yellowstone National Park
Unknown
LIGHT
2/15/2011 20:00
Yellowstone National Park
Unknown
FORMATION
6/19/2007 23:00
Yellowstone National Park
Unknown
FLASH
6/25/2011 23:30
Yellowstone National Park
Unknown
EGG
6/30/1971 14:00
Yellowstone North Entrance
RED YELLOW
CIRCLE
4/17/2014 18:00
Yellowstone Park
Unknown
DISK
7/15/1956 0:00
na
Unknown
TEARDROP
9/19/2003 14:00
remote
Unknown
FLASH
10/23/2008 4:45
394 rows × 3 columns
In [123]:
ufo.loc['ND':'WY',:] # Select all records from ND THROUGH WY
Out[123]:
Colors
Shape
Time
State
City
ND
Amenia
Unknown
DISK
7/15/1978 18:00
Aneta
Unknown
CIRCLE
12/20/2001 19:30
Arthur
Unknown
Unknown
9/7/1958 19:00
Beach
Unknown
DISK
7/3/1996 11:30
Belcourt
Unknown
OTHER
6/18/2012 7:00
Belcourt
Unknown
CIRCLE
4/14/2012 6:00
Belfield
Unknown
LIGHT
7/17/1997 23:00
Belfield
Unknown
DISK
9/15/2000 19:00
Berthold
Unknown
CIGAR
7/15/1976 15:00
Bismarck
ORANGE
Unknown
11/24/1996 19:30
Bismarck
Unknown
VARIOUS
9/10/2008 22:30
Bismarck
ORANGE
TRIANGLE
4/25/2012 23:30
Bismarck
Unknown
TRIANGLE
9/29/2011 22:30
Bismarck
RED
RECTANGLE
6/30/2008 23:00
Bismarck
Unknown
OTHER
11/28/2003 23:45
Bismarck
Unknown
LIGHT
8/5/2013 23:15
Bismarck
Unknown
LIGHT
6/4/2010 22:00
Bismarck
Unknown
LIGHT
7/1/2000 2:30
Bismarck
Unknown
LIGHT
11/24/1996 18:07
Bismarck
Unknown
FORMATION
10/11/1996 16:00
Bismarck
Unknown
DISK
4/11/2009 11:20
Bismarck
GREEN BLUE
CIRCLE
8/14/2004 4:15
Brunswick County
Unknown
Unknown
10/25/1995 16:55
Burlington
Unknown
FIREBALL
8/14/1998 16:00
Cartwright
Unknown
RECTANGLE
5/20/1999 14:00
Casselton
Unknown
TRIANGLE
4/11/2012 21:58
Casselton
Unknown
TRIANGLE
4/11/2012 21:48
Casselton
RED BLUE
Unknown
8/24/2008 1:00
Central U.S.
Unknown
DIAMOND
4/2/1995 13:00
Colfax
Unknown
DISK
1/15/2008 21:00
...
...
...
...
...
WY
Sheridan
Unknown
OVAL
9/6/2002 21:00
Sheridan
Unknown
OTHER
9/24/1998 0:15
Sheridan
Unknown
LIGHT
8/7/2012 22:20
Sheridan
RED GREEN
LIGHT
12/19/2006 21:00
Sheridan
Unknown
LIGHT
10/28/2004 2:00
Shirley Basin Road
Unknown
LIGHT
3/15/2006 23:30
Ten Sleep
Unknown
DISK
3/21/1998 20:00
Teton National Park
Unknown
OVAL
7/19/2008 12:00
Teton National Park
Unknown
OTHER
9/26/2005 11:00
Thermopolis
Unknown
Unknown
6/14/2007 23:00
Torrington
Unknown
CIGAR
11/5/2011 21:30
Wheeling
Unknown
Unknown
6/15/1984 20:30
Worland
Unknown
OVAL
2/15/2008 5:00
Worland
Unknown
LIGHT
6/17/2003 22:42
Wyoming
Unknown
OTHER
9/28/2005 16:30
Wyoming
Unknown
OTHER
4/20/1983 0:00
Wyoming
Unknown
CIGAR
9/1/2013 0:00
Wyoming
Unknown
CIGAR
7/12/1967 2:00
Yellowstone
Unknown
LIGHT
8/4/2013 22:00
Yellowstone National Park
Unknown
VARIOUS
9/20/2011 19:55
Yellowstone National Park
YELLOW
TEARDROP
8/19/2014 7:40
Yellowstone National Park
Unknown
SPHERE
9/29/2003 20:35
Yellowstone National Park
Unknown
LIGHT
2/15/2011 20:00
Yellowstone National Park
Unknown
FORMATION
6/19/2007 23:00
Yellowstone National Park
Unknown
FLASH
6/25/2011 23:30
Yellowstone National Park
Unknown
EGG
6/30/1971 14:00
Yellowstone North Entrance
RED YELLOW
CIRCLE
4/17/2014 18:00
Yellowstone Park
Unknown
DISK
7/15/1956 0:00
na
Unknown
TEARDROP
9/19/2003 14:00
remote
Unknown
FLASH
10/23/2008 4:45
34607 rows × 3 columns
In [124]:
ufo.loc[('ND', 'Bismarck'),:] # Select all records from Bismark, ND
Out[124]:
Colors
Shape
Time
State
City
ND
Bismarck
ORANGE
Unknown
11/24/1996 19:30
Bismarck
Unknown
VARIOUS
9/10/2008 22:30
Bismarck
ORANGE
TRIANGLE
4/25/2012 23:30
Bismarck
Unknown
TRIANGLE
9/29/2011 22:30
Bismarck
RED
RECTANGLE
6/30/2008 23:00
Bismarck
Unknown
OTHER
11/28/2003 23:45
Bismarck
Unknown
LIGHT
8/5/2013 23:15
Bismarck
Unknown
LIGHT
6/4/2010 22:00
Bismarck
Unknown
LIGHT
7/1/2000 2:30
Bismarck
Unknown
LIGHT
11/24/1996 18:07
Bismarck
Unknown
FORMATION
10/11/1996 16:00
Bismarck
Unknown
DISK
4/11/2009 11:20
Bismarck
GREEN BLUE
CIRCLE
8/14/2004 4:15
In [125]:
ufo.loc[('ND', 'Bismarck'):('ND','Casselton'),:] # Select all records from Bismark, ND through Casselton, ND
Out[125]:
Colors
Shape
Time
State
City
ND
Bismarck
ORANGE
Unknown
11/24/1996 19:30
Bismarck
Unknown
VARIOUS
9/10/2008 22:30
Bismarck
ORANGE
TRIANGLE
4/25/2012 23:30
Bismarck
Unknown
TRIANGLE
9/29/2011 22:30
Bismarck
RED
RECTANGLE
6/30/2008 23:00
Bismarck
Unknown
OTHER
11/28/2003 23:45
Bismarck
Unknown
LIGHT
8/5/2013 23:15
Bismarck
Unknown
LIGHT
6/4/2010 22:00
Bismarck
Unknown
LIGHT
7/1/2000 2:30
Bismarck
Unknown
LIGHT
11/24/1996 18:07
Bismarck
Unknown
FORMATION
10/11/1996 16:00
Bismarck
Unknown
DISK
4/11/2009 11:20
Bismarck
GREEN BLUE
CIRCLE
8/14/2004 4:15
Brunswick County
Unknown
Unknown
10/25/1995 16:55
Burlington
Unknown
FIREBALL
8/14/1998 16:00
Cartwright
Unknown
RECTANGLE
5/20/1999 14:00
Casselton
Unknown
TRIANGLE
4/11/2012 21:58
Casselton
Unknown
TRIANGLE
4/11/2012 21:48
Casselton
RED BLUE
Unknown
8/24/2008 1:00
In [126]:
ufo.reset_index(level='City', inplace=True) # Remove the City from the index
ufo.head()
Out[126]:
City
Colors
Shape
Time
State
AK
Adak
Unknown
LIGHT
6/1/1986 0:00
AK
Alaska
Unknown
CIGAR
4/29/2008 0:00
AK
Alaska
Unknown
DISK
1/15/1998 13:00
AK
Alaska
Unknown
LIGHT
9/1/1999 23:00
AK
Alaska cruise
Unknown
LIGHT
5/11/2011 21:00
In [127]:
ufo.reset_index(inplace=True) # Remove all columns from the index
In [130]:
ufo.head()
Out[130]:
State
City
Colors
Shape
Time
0
AK
Adak
Unknown
LIGHT
6/1/1986 0:00
1
AK
Alaska
Unknown
CIGAR
4/29/2008 0:00
2
AK
Alaska
Unknown
DISK
1/15/1998 13:00
3
AK
Alaska
Unknown
LIGHT
9/1/1999 23:00
4
AK
Alaska cruise
Unknown
LIGHT
5/11/2011 21:00
In [129]:
# Reset the index
ufo.dtypes
Out[129]:
State object
City object
Colors object
Shape object
Time object
dtype: object
In [131]:
# Convert Time column to date-time format (defined in Pandas)
# Reference: https://docs.python.org/2/library/time.html#time.strftime
ufo['Time'] = pd.to_datetime(ufo['Time'], format="%m/%d/%Y %H:%M")
ufo.dtypes
Out[131]:
State object
City object
Colors object
Shape object
Time datetime64[ns]
dtype: object
In [132]:
# Compute date range
ufo.Time.min()
Out[132]:
Timestamp('2014-09-05 05:30:00')
In [133]:
ufo.Time.max()
Out[133]:
Timestamp('2014-09-05 05:30:00')
In [ ]:
# Slice using time
ufo[ufo.Time > pd.datetime(1995, 1, 1)] # Slice using the time
In [ ]:
ufo[(ufo.Time > pd.datetime(1995, 1, 1)) & (ufo.State =='TX')] # Works with other logical conditions, as expected
In [134]:
# Set the index to time
ufo.set_index('Time', inplace=True)
ufo.sort_index(inplace=True)
ufo.head()
Out[134]:
State
City
Colors
Shape
Time
1930-06-01 22:00:00
NY
Ithaca
Unknown
TRIANGLE
1930-06-30 20:00:00
NJ
Willingboro
Unknown
OTHER
1931-02-15 14:00:00
CO
Holyoke
Unknown
OVAL
1931-06-01 13:00:00
KS
Abilene
Unknown
DISK
1933-04-18 19:00:00
NY
New York Worlds Fair
Unknown
LIGHT
In [135]:
# Access particular times/ranges
ufo.loc['1995',:]
ufo.loc['1995-01',:]
ufo.loc['1995-01-01',:]
Out[135]:
State
City
Colors
Shape
Time
1995-01-01 00:02:00
CA
Pacific Grove
Unknown
Unknown
1995-01-01 01:00:00
NV
Mt. Charleston
Unknown
OVAL
1995-01-01 02:00:00
CA
El Granada
Unknown
FIREBALL
1995-01-01 03:00:00
KS
Kansas
Unknown
LIGHT
1995-01-01 06:00:00
SC
Travelers Rest
Unknown
TRIANGLE
1995-01-01 10:00:00
CA
Oxnard
Unknown
VARIOUS
1995-01-01 15:00:00
SC
Greenwood
RED
SPHERE
1995-01-01 15:00:00
FL
Cedar Key
Unknown
TRIANGLE
1995-01-01 21:00:00
AZ
Tucson
Unknown
TRIANGLE
1995-01-01 22:00:00
WA
Shelton
Unknown
Unknown
1995-01-01 22:45:00
CA
Anaheim
Unknown
Unknown
1995-01-01 23:00:00
NY
Southampton
Unknown
LIGHT
In [136]:
# Access range of times/ranges
ufo.loc['1995':,:]
ufo.loc['1995':'1996',:]
ufo.loc['1995-12-01':'1996-01',:]
Out[136]:
State
City
Colors
Shape
Time
1995-12-01 17:20:00
WI
Waukesha
Unknown
Unknown
1995-12-01 21:00:00
CA
Mount Shasta
Unknown
Unknown
1995-12-01 22:55:00
MI
Owosso
Unknown
LIGHT
1995-12-01 23:40:00
IL
Orland Park
Unknown
Unknown
1995-12-02 22:00:00
CA
Sebastopal
Unknown
Unknown
1995-12-03 11:20:00
AZ
Chandler
Unknown
Unknown
1995-12-03 21:00:00
OR
Boring
Unknown
Unknown
1995-12-03 22:00:00
WA
Nisqually Delta
Unknown
Unknown
1995-12-04 02:00:00
NM
Deming
Unknown
VARIOUS
1995-12-04 02:14:00
WA
Seattle
Unknown
LIGHT
1995-12-04 22:16:00
NY
Geneva
Unknown
Unknown
1995-12-05 20:30:00
WA
North Bend
Unknown
CHEVRON
1995-12-05 20:30:00
WA
North Bend
Unknown
Unknown
1995-12-06 17:00:00
WA
Bothell
Unknown
Unknown
1995-12-06 22:45:00
WA
Lynnwood
Unknown
Unknown
1995-12-07 20:40:00
FL
Oldsmar
ORANGE BLUE
LIGHT
1995-12-08 18:00:00
CA
Unknown
Unknown
VARIOUS
1995-12-09 18:30:00
IL
Springfield
Unknown
Unknown
1995-12-10 05:00:00
SD
Rapid City
BLUE
CIRCLE
1995-12-10 06:00:00
NM
Clayton
Unknown
Unknown
1995-12-10 18:12:00
AZ
Phoenix
Unknown
Unknown
1995-12-10 22:45:00
TX
Houston
Unknown
Unknown
1995-12-10 23:00:00
IL
Jolliet
RED
Unknown
1995-12-12 22:30:00
NY
Glen
Unknown
Unknown
1995-12-13 18:45:00
PA
Upper Makefield
Unknown
Unknown
1995-12-14 04:24:00
ID
Boise
Unknown
Unknown
1995-12-14 14:45:00
CA
Pomona
Unknown
ROUND
1995-12-14 21:40:00
NV
Las Vegas
BLUE
Unknown
1995-12-15 19:30:00
MD
Elkridge
GREEN
FIREBALL
1995-12-15 20:50:00
KY
Hawesville
ORANGE
VARIOUS
...
...
...
...
...
1996-01-13 05:45:00
MO
Lancaster
Unknown
Unknown
1996-01-13 20:22:00
MO
Osage Beach
Unknown
Unknown
1996-01-13 21:00:00
OR
Selma
Unknown
TRIANGLE
1996-01-14 06:05:00
SC
Charleston
Unknown
Unknown
1996-01-14 17:00:00
VA
Chesterfield
Unknown
Unknown
1996-01-14 21:00:00
IL
Collinsville
Unknown
Unknown
1996-01-15 00:00:00
FL
Panama City Beach
Unknown
CIRCLE
1996-01-15 02:30:00
MN
Hill City
Unknown
OVAL
1996-01-15 05:00:00
WA
Spanaway
Unknown
Unknown
1996-01-15 18:00:00
TX
Loredo
RED
FORMATION
1996-01-15 22:00:00
AL
Birmingham
Unknown
TEARDROP
1996-01-16 07:40:00
WA
Seattle
Unknown
Unknown
1996-01-16 22:32:00
WA
Monitor
ORANGE
Unknown
1996-01-16 22:32:00
WA
Monitor
Unknown
Unknown
1996-01-18 03:15:00
CA
San Diego
Unknown
Unknown
1996-01-18 06:30:00
AR
Fayetteville
Unknown
Unknown
1996-01-20 21:44:00
NJ
Stockton
Unknown
LIGHT
1996-01-20 22:00:00
WI
Madison
Unknown
Unknown
1996-01-22 11:00:00
CA
Fullerton
Unknown
TRIANGLE
1996-01-22 21:00:00
CA
Redding
Unknown
Unknown
1996-01-22 21:30:00
MO
Clark
Unknown
Unknown
1996-01-23 20:15:00
ND
Minot
Unknown
Unknown
1996-01-25 15:04:00
TN
Bartlett
Unknown
Unknown
1996-01-25 21:15:00
CA
Redding
RED
Unknown
1996-01-25 21:30:00
AZ
Phoenix
Unknown
Unknown
1996-01-26 23:59:00
ME
Sedgewick
Unknown
Unknown
1996-01-28 00:00:00
AZ
Chandler
Unknown
Unknown
1996-01-30 00:00:00
OR
Oregon City
Unknown
Unknown
1996-01-31 02:00:00
AL
Heflin
Unknown
LIGHT
1996-01-31 20:15:00
FL
Buckhead Ridge
Unknown
LIGHT
103 rows × 4 columns
In [137]:
# Access elements of the timestamp
# Reference: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-date-components
ufo.index.year
ufo.index.month
ufo.index.weekday
ufo.index.day
ufo.index.time
ufo.index.hour
Out[137]:
array([22, 20, 14, ..., 2, 3, 5], dtype=int32)
In [138]:
# Create a new variable with time element
ufo['Year'] = ufo.index.year
ufo['Month'] = ufo.index.month
ufo['Day'] = ufo.index.day
ufo['Weekday'] = ufo.index.weekday
ufo['Hour'] = ufo.index.hour
In [141]:
# For each year, calculate the count of sightings
ufo.groupby('Year').City.count()
Out[141]:
2012 7263
2013 7003
2014 5382
2011 5089
2008 4655
2009 4251
2010 4154
2007 4058
2004 3850
2005 3787
2003 3507
2006 3445
2002 2933
2001 2925
1999 2774
...
1948 9
1945 9
1946 8
1944 8
1943 5
1939 3
1942 3
1931 2
1937 2
1941 2
1936 2
1930 2
1935 1
1933 1
1934 1
Length: 82, dtype: int64
In [ ]:
# For each Shape, calculate the first sighting, last sighting, and range of sightings.
ufo.groupby('Shape').Year.min()
ufo.groupby('Shape').Year.max()
In [ ]:
# Specify the variable outside of the apply statement
ufo.groupby('Shape').Year.apply(lambda x: x.max())
In [ ]:
# Specifiy the variable within the apply statement
ufo.groupby('Shape').apply(lambda x: x.Year.max() - x.Year.min())
In [ ]:
# Specify a custom function to use in the apply statement
def get_max_year(df):
try:
return df.Year.max()
except:
return ''
ufo.groupby('Shape').apply(lambda x: get_max_year(x))
In [ ]:
# Split/combine can occur on multiple columns at the same time
ufo.groupby(['Weekday','Hour']).City.count()
In [ ]:
# Read in population data
pop = pd.read_csv('population.csv')
pop.head()
In [ ]:
ufo.head()
In [ ]:
# Merge the data together
ufo = pd.merge(ufo, pop, on='State', how = 'left')
In [ ]:
# Specify keys if columns have different names
ufo = pd.merge(ufo, pop, left_on='State', right_on='State', how = 'left')
In [ ]:
# Observe the new Population column
ufo.head()
In [ ]:
# Check for values that didn't make it (length)
ufo.Population.isnull().sum()
In [ ]:
# Check for values that didn't make it (values)
ufo[ufo.Population.isnull()]
In [ ]:
# Change the records that didn't match up using np.where command
ufo['State'] = np.where(ufo['State'] == 'Fl', 'FL', ufo['State'])
In [ ]:
# Alternatively, change the state using native python string functionality
ufo['State'] = ufo['State'].str.upper()
In [ ]:
# Merge again, this time get all of the records
ufo = pd.merge(ufo, pop, on='State', how = 'left')
In [ ]:
ufo.to_csv('ufo_new.csv')
In [ ]:
ufo.to_csv('ufo_new.csv', index=False) # Index is not included in the csv
In [ ]:
ufo.duplicated() # Series of logicals
In [ ]:
ufo.duplicated().sum() # count of duplicates
In [ ]:
ufo[ufo.duplicated(['State','Time'])] # only show duplicates
In [ ]:
ufo[ufo.duplicated()==False] # only show unique rows
In [ ]:
ufo_unique = ufo[~ufo.duplicated()] # only show unique rows
In [ ]:
ufo.duplicated(['State','Time']).sum() # columns for identifying duplicates
In [146]:
ufo['Weekday'] = ufo.Weekday.map({ 0:'Mon', 1:'Tue', 2:'Wed',
3:'Thu', 4:'Fri', 5:'Sat',
6:'Sun'})
In [ ]:
ufo.groupby(['Weekday','Hour']).City.count()
In [ ]:
ufo.groupby(['Weekday','Hour']).City.count().unstack(0) # Make first row level a column
In [ ]:
ufo.groupby(['Weekday','Hour']).City.count().unstack(1) # Make second row level a column
# Note: .stack() transforms columns to rows
In [ ]:
idxs = np.random.rand(len(ufo)) < 0.66 # create a Series of booleans
train = ufo[idxs] # will contain about 66% of the rows
test = ufo[~idxs] # will contain the remaining rows
In [ ]:
ufo.Shape.replace('DELTA', 'TRIANGLE') # replace values in a Series
In [ ]:
ufo.replace('PYRAMID', 'TRIANGLE') # replace values throughout a DataFrame
In [143]:
%matplotlib inline
In [144]:
# Plot the number of sightings over time
ufo.groupby('Year').City.count().plot( kind='line',
color='r',
linewidth=2,
title='UFO Sightings by year')
Out[144]:
<matplotlib.axes._subplots.AxesSubplot at 0x113e6a150>
In [147]:
# Plot the number of sightings over the day of week and time of day
ufo.groupby(['Weekday','Hour']).City.count().unstack(0).plot( kind='line',
linewidth=2,
title='UFO Sightings by Time of Day')
Out[147]:
<matplotlib.axes._subplots.AxesSubplot at 0x11586ef10>
In [148]:
# Plot multiple plots on the same plot (plots neeed to be in column format)
ufo_fourth = ufo[(ufo.Year.isin([2011, 2012, 2013, 2014])) & (ufo.Month == 7)]
ufo_fourth.groupby(['Year', 'Day']).City.count().unstack(0).plot( kind = 'bar',
subplots=True,
figsize=(7,9))
Out[148]:
array([<matplotlib.axes._subplots.AxesSubplot object at 0x113abbcd0>,
<matplotlib.axes._subplots.AxesSubplot object at 0x116da5a50>,
<matplotlib.axes._subplots.AxesSubplot object at 0x116e28750>,
<matplotlib.axes._subplots.AxesSubplot object at 0x116e7dc90>], dtype=object)
UFO data
Drinks data
Content source: josiahdavis/python_data_analysis
Similar notebooks: