Python for Data Analysis

This is a hands on workshop aimed at getting you comfortable with the the syntax of core data analysis concepts in Python. Some background in base Python is useful, but not required to learn from this workshop.

  • Keystrokes for the IPython notebook
  • Reading and Summarizing Data
  • Filtering and Sorting Data
  • Modifying Columns
  • Handling Missing Values
  • EXERCISE: Working with drinks data
  • Indexing and Slicing Data
  • Analyzing across time
  • Split-Apply-Combine
  • Merging Data
  • Writing Data
  • Other Useful Features

Keystrokes for the IPython Notebook

There are two modes: Command (enabled by esc) and Edit (enabled by enter). The table below has a quick reference of the main keystrokes that I will be using in the workshop. To get the full list go to Help -> Keyboard Shortcuts.

Mac PC
Command Mode esc
Delete d, d
Markdown m
Run Cell control, return
Run Cell and Insert Below option, return
Insert Above a
Insert Below b
Edit Mode return Enter
Run Cell control, return
Run Cell and Insert Below option, return

Reading and Summarizing Data

Reading Data


In [1]:
# Import the pandas and numpy libraries
import pandas as pd
import numpy as np

In [2]:
# Read a file with an absolute path
ufo = pd.read_csv('/Users/josiahdavis/Documents/GitHub/python_data_analysis/ufo_sightings.csv')

In [50]:
# Alterntively, read the the file using a relative path
ufo = pd.read_csv('ufo_sightings.csv')

In [51]:
# Alterntively read in the file from the internet
ufo = pd.read_csv('https://raw.githubusercontent.com/josiahdavis/python_data_analysis/master/ufo_sightings.csv')

In [52]:
# Get help on a function
help(pd.read_csv)


Help on function read_csv in module pandas.io.parsers:

read_csv(filepath_or_buffer, sep=',', dialect=None, compression=None, doublequote=True, escapechar=None, quotechar='"', quoting=0, skipinitialspace=False, lineterminator=None, header='infer', index_col=None, names=None, prefix=None, skiprows=None, skipfooter=None, skip_footer=0, na_values=None, na_fvalues=None, true_values=None, false_values=None, delimiter=None, converters=None, dtype=None, usecols=None, engine=None, delim_whitespace=False, as_recarray=False, na_filter=True, compact_ints=False, use_unsigned=False, low_memory=True, buffer_lines=None, warn_bad_lines=True, error_bad_lines=True, keep_default_na=True, thousands=None, comment=None, decimal='.', parse_dates=False, keep_date_col=False, dayfirst=False, date_parser=None, memory_map=False, float_precision=None, nrows=None, iterator=False, chunksize=None, verbose=False, encoding=None, squeeze=False, mangle_dupe_cols=True, tupleize_cols=False, infer_datetime_format=False, skip_blank_lines=True)
    Read CSV (comma-separated) file into DataFrame
    
    Also supports optionally iterating or breaking of the file
    into chunks.
    
    Parameters
    ----------
    filepath_or_buffer : string or file handle / StringIO
        The string could be a URL. Valid URL schemes include
        http, ftp, s3, and file. For file URLs, a
        host is expected. For instance, a local file could be
        file ://localhost/path/to/table.csv
    sep : string, default ','
        Delimiter to use. If sep is None, will try to automatically determine
        this. Regular expressions are accepted.
    engine : {'c', 'python'}
        Parser engine to use. The C engine is faster while the python engine is
        currently more feature-complete.
    lineterminator : string (length 1), default None
        Character to break file into lines. Only valid with C parser
    quotechar : string (length 1)
        The character used to denote the start and end of a quoted item. Quoted
        items can include the delimiter and it will be ignored.
    quoting : int or csv.QUOTE_* instance, default None
        Control field quoting behavior per ``csv.QUOTE_*`` constants. Use one of
        QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).
        Default (None) results in QUOTE_MINIMAL behavior.
    skipinitialspace : boolean, default False
        Skip spaces after delimiter
    escapechar : string (length 1), default None
        One-character string used to escape delimiter when quoting is QUOTE_NONE.
    dtype : Type name or dict of column -> type
        Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32}
        (Unsupported with engine='python')
    compression : {'gzip', 'bz2', None}, default None
        For on-the-fly decompression of on-disk data
    dialect : string or csv.Dialect instance, default None
        If None defaults to Excel dialect. Ignored if sep longer than 1 char
        See csv.Dialect documentation for more details
    header : int, list of ints
        Row number(s) to use as the column names, and the start of the
        data.  Defaults to 0 if no ``names`` passed, otherwise ``None``. Explicitly
        pass ``header=0`` to be able to replace existing names. The header can be
        a list of integers that specify row locations for a multi-index on the
        columns E.g. [0,1,3]. Intervening rows that are not specified will be
        skipped (e.g. 2 in this example are skipped). Note that this parameter
        ignores commented lines and empty lines if ``skip_blank_lines=True``, so header=0
        denotes the first line of data rather than the first line of the file.
    skiprows : list-like or integer
        Line numbers to skip (0-indexed) or number of lines to skip (int)
        at the start of the file
    index_col : int or sequence or False, default None
        Column to use as the row labels of the DataFrame. If a sequence is given, a
        MultiIndex is used. If you have a malformed file with delimiters at the end
        of each line, you might consider index_col=False to force pandas to _not_
        use the first column as the index (row names)
    names : array-like
        List of column names to use. If file contains no header row, then you
        should explicitly pass header=None
    prefix : string, default None
        Prefix to add to column numbers when no header, e.g 'X' for X0, X1, ...
    na_values : list-like or dict, default None
        Additional strings to recognize as NA/NaN. If dict passed, specific
        per-column NA values
    true_values : list
        Values to consider as True
    false_values : list
        Values to consider as False
    keep_default_na : bool, default True
        If na_values are specified and keep_default_na is False the default NaN
        values are overridden, otherwise they're appended to
    parse_dates : boolean, list of ints or names, list of lists, or dict
        If True -> try parsing the index.
        If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column.
        If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.
        {'foo' : [1, 3]} -> parse columns 1, 3 as date and call result 'foo'
        A fast-path exists for iso8601-formatted dates.
    keep_date_col : boolean, default False
        If True and parse_dates specifies combining multiple columns then
        keep the original columns.
    date_parser : function
        Function to use for converting a sequence of string columns to an
        array of datetime instances. The default uses dateutil.parser.parser
        to do the conversion.
    dayfirst : boolean, default False
        DD/MM format dates, international and European format
    thousands : str, default None
        Thousands separator
    comment : str, default None
        Indicates remainder of line should not be parsed. If found at the
        beginning of a line, the line will be ignored altogether. This parameter
        must be a single character. Like empty lines (as long as ``skip_blank_lines=True``),
        fully commented lines are ignored by the parameter `header`
        but not by `skiprows`. For example, if comment='#', parsing
        '#empty\na,b,c\n1,2,3' with `header=0` will result in 'a,b,c' being
        treated as the header.
    decimal : str, default '.'
        Character to recognize as decimal point. E.g. use ',' for European data
    nrows : int, default None
        Number of rows of file to read. Useful for reading pieces of large files
    iterator : boolean, default False
        Return TextFileReader object
    chunksize : int, default None
        Return TextFileReader object for iteration
    skipfooter : int, default 0
        Number of lines at bottom of file to skip (Unsupported with engine='c')
    converters : dict, default None
        Dict of functions for converting values in certain columns. Keys can either
        be integers or column labels
    verbose : boolean, default False
        Indicate number of NA values placed in non-numeric columns
    delimiter : string, default None
        Alternative argument name for sep. Regular expressions are accepted.
    encoding : string, default None
        Encoding to use for UTF when reading/writing (ex. 'utf-8'). `List of Python
        standard encodings
        <https://docs.python.org/3/library/codecs.html#standard-encodings>`_
    squeeze : boolean, default False
        If the parsed data only contains one column then return a Series
    na_filter : boolean, default True
        Detect missing value markers (empty strings and the value of na_values). In
        data without any NAs, passing na_filter=False can improve the performance
        of reading a large file
    usecols : array-like
        Return a subset of the columns.
        Results in much faster parsing time and lower memory usage.
    mangle_dupe_cols : boolean, default True
        Duplicate columns will be specified as 'X.0'...'X.N', rather than 'X'...'X'
    tupleize_cols : boolean, default False
        Leave a list of tuples on columns as is (default is to convert to
        a Multi Index on the columns)
    error_bad_lines : boolean, default True
        Lines with too many fields (e.g. a csv line with too many commas) will by
        default cause an exception to be raised, and no DataFrame will be returned.
        If False, then these "bad lines" will dropped from the DataFrame that is
        returned. (Only valid with C parser)
    warn_bad_lines : boolean, default True
        If error_bad_lines is False, and warn_bad_lines is True, a warning for each
        "bad line" will be output. (Only valid with C parser).
    infer_datetime_format : boolean, default False
        If True and parse_dates is enabled for a column, attempt to infer
        the datetime format to speed up the processing
    skip_blank_lines : boolean, default True
        If True, skip over blank lines rather than interpreting as NaN values
    
    Returns
    -------
    result : DataFrame or TextParser

Summarize the data that was just read in


In [53]:
ufo.head(10)          # Look at the top 10 observations


Out[53]:
City Colors Reported Shape Reported State Time
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00
1 Willingboro NaN OTHER NJ 6/30/1930 20:00
2 Holyoke NaN OVAL CO 2/15/1931 14:00
3 Abilene NaN DISK KS 6/1/1931 13:00
4 New York Worlds Fair NaN LIGHT NY 4/18/1933 19:00
5 Valley City NaN DISK ND 9/15/1934 15:30
6 Crater Lake NaN CIRCLE CA 6/15/1935 0:00
7 Alma NaN DISK MI 7/15/1936 0:00
8 Eklutna NaN CIGAR AK 10/15/1936 17:00
9 Hubbard NaN CYLINDER OR 6/15/1937 0:00

In [55]:
ufo.tail()            # Bottom x observations (defaults to 5)


Out[55]:
City Colors Reported Shape Reported State Time
80538 Neligh NaN CIRCLE NE 9/4/2014 23:20
80539 Uhrichsville NaN LIGHT OH 9/5/2014 1:14
80540 Tucson RED BLUE NaN AZ 9/5/2014 2:40
80541 Orland park RED LIGHT IL 9/5/2014 3:43
80542 Loughman NaN LIGHT FL 9/5/2014 5:30

In [56]:
ufo.describe()        # get summary statistics for columns


Out[56]:
City Colors Reported Shape Reported State Time
count 80496 17034 72141 80543 80543
unique 13504 31 27 52 68901
top Seattle ORANGE LIGHT CA 7/4/2014 22:00
freq 646 5216 16332 10743 45

In [57]:
ufo.index             # "the index" (aka "the labels")


Out[57]:
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...], dtype='int64')

In [58]:
ufo.columns           # column names (which is "an index")


Out[58]:
Index([u'City', u'Colors Reported', u'Shape Reported', u'State', u'Time'], dtype='object')

In [59]:
ufo.dtypes            # data types of each column


Out[59]:
City               object
Colors Reported    object
Shape Reported     object
State              object
Time               object
dtype: object

In [60]:
ufo.values            # underlying numpy array


Out[60]:
array([['Ithaca', nan, 'TRIANGLE', 'NY', '6/1/1930 22:00'],
       ['Willingboro', nan, 'OTHER', 'NJ', '6/30/1930 20:00'],
       ['Holyoke', nan, 'OVAL', 'CO', '2/15/1931 14:00'],
       ..., 
       ['Tucson', 'RED BLUE', nan, 'AZ', '9/5/2014 2:40'],
       ['Orland park', 'RED', 'LIGHT', 'IL', '9/5/2014 3:43'],
       ['Loughman', nan, 'LIGHT', 'FL', '9/5/2014 5:30']], dtype=object)

In [61]:
ufo.info()            # concise summary


<class 'pandas.core.frame.DataFrame'>
Int64Index: 80543 entries, 0 to 80542
Data columns (total 5 columns):
City               80496 non-null object
Colors Reported    17034 non-null object
Shape Reported     72141 non-null object
State              80543 non-null object
Time               80543 non-null object
dtypes: object(5)
memory usage: 3.7+ MB

Filtering and Sorting Data


In [62]:
# Select a single column
ufo['State']


Out[62]:
0     NY
1     NJ
2     CO
3     KS
4     NY
5     ND
6     CA
7     MI
8     AK
9     OR
10    CA
11    AL
12    SC
13    IA
14    MI
...
80528    OH
80529    WA
80530    FL
80531    VA
80532    MA
80533    IA
80534    TX
80535    KY
80536    PA
80537    NE
80538    NE
80539    OH
80540    AZ
80541    IL
80542    FL
Name: State, Length: 80543, dtype: object

In [63]:
ufo.State                       # This is equivalent


Out[63]:
0     NY
1     NJ
2     CO
3     KS
4     NY
5     ND
6     CA
7     MI
8     AK
9     OR
10    CA
11    AL
12    SC
13    IA
14    MI
...
80528    OH
80529    WA
80530    FL
80531    VA
80532    MA
80533    IA
80534    TX
80535    KY
80536    PA
80537    NE
80538    NE
80539    OH
80540    AZ
80541    IL
80542    FL
Name: State, Length: 80543, dtype: object

In [64]:
# Select multiple columns
ufo[['State', 'City','Shape Reported']]


Out[64]:
State City Shape Reported
0 NY Ithaca TRIANGLE
1 NJ Willingboro OTHER
2 CO Holyoke OVAL
3 KS Abilene DISK
4 NY New York Worlds Fair LIGHT
5 ND Valley City DISK
6 CA Crater Lake CIRCLE
7 MI Alma DISK
8 AK Eklutna CIGAR
9 OR Hubbard CYLINDER
10 CA Fontana LIGHT
11 AL Waterloo FIREBALL
12 SC Belton SPHERE
13 IA Keokuk OVAL
14 MI Ludington DISK
15 CA Forest Home CIRCLE
16 CA Los Angeles NaN
17 GA Hapeville NaN
18 TN Oneida RECTANGLE
19 AK Bering Sea OTHER
20 NE Nebraska DISK
21 LA NaN NaN
22 LA NaN LIGHT
23 KY Owensboro RECTANGLE
24 WV Wilderness DISK
25 CA San Diego CIGAR
26 WV Wilderness DISK
27 NM Clovis DISK
28 NM Los Alamos DISK
29 UT Ft. Duschene DISK
... ... ... ...
80513 NJ Manahawkin CIRCLE
80514 MA New Bedford LIGHT
80515 VA Woodbridge TRIANGLE
80516 CA Glendale CYLINDER
80517 NH Laconia CIGAR
80518 PA Langhorne OTHER
80519 IL Glen Ellyn CIRCLE
80520 PA Waynesburg FLASH
80521 OH Canal Winchester TRIANGLE
80522 MA Lawrence LIGHT
80523 MD Ellicott City LIGHT
80524 WA Olympia LIGHT
80525 IA Iowa City LIGHT
80526 MA Marshfield LIGHT
80527 WA Seattle DIAMOND
80528 OH North Royalton TRIANGLE
80529 WA Deer Park LIGHT
80530 FL Glen St. Mary DISK
80531 VA Abingdon CIRCLE
80532 MA Lawrence LIGHT
80533 IA Melbourne OVAL
80534 TX Burleson LIGHT
80535 KY Elizabethtown CIRCLE
80536 PA Wyoming DISK
80537 NE Neligh CIRCLE
80538 NE Neligh CIRCLE
80539 OH Uhrichsville LIGHT
80540 AZ Tucson NaN
80541 IL Orland park LIGHT
80542 FL Loughman LIGHT

80543 rows × 3 columns


In [65]:
my_cols = ['State', 'City', 'Shape Reported']
ufo[my_cols]                    # This is equivalent


Out[65]:
State City Shape Reported
0 NY Ithaca TRIANGLE
1 NJ Willingboro OTHER
2 CO Holyoke OVAL
3 KS Abilene DISK
4 NY New York Worlds Fair LIGHT
5 ND Valley City DISK
6 CA Crater Lake CIRCLE
7 MI Alma DISK
8 AK Eklutna CIGAR
9 OR Hubbard CYLINDER
10 CA Fontana LIGHT
11 AL Waterloo FIREBALL
12 SC Belton SPHERE
13 IA Keokuk OVAL
14 MI Ludington DISK
15 CA Forest Home CIRCLE
16 CA Los Angeles NaN
17 GA Hapeville NaN
18 TN Oneida RECTANGLE
19 AK Bering Sea OTHER
20 NE Nebraska DISK
21 LA NaN NaN
22 LA NaN LIGHT
23 KY Owensboro RECTANGLE
24 WV Wilderness DISK
25 CA San Diego CIGAR
26 WV Wilderness DISK
27 NM Clovis DISK
28 NM Los Alamos DISK
29 UT Ft. Duschene DISK
... ... ... ...
80513 NJ Manahawkin CIRCLE
80514 MA New Bedford LIGHT
80515 VA Woodbridge TRIANGLE
80516 CA Glendale CYLINDER
80517 NH Laconia CIGAR
80518 PA Langhorne OTHER
80519 IL Glen Ellyn CIRCLE
80520 PA Waynesburg FLASH
80521 OH Canal Winchester TRIANGLE
80522 MA Lawrence LIGHT
80523 MD Ellicott City LIGHT
80524 WA Olympia LIGHT
80525 IA Iowa City LIGHT
80526 MA Marshfield LIGHT
80527 WA Seattle DIAMOND
80528 OH North Royalton TRIANGLE
80529 WA Deer Park LIGHT
80530 FL Glen St. Mary DISK
80531 VA Abingdon CIRCLE
80532 MA Lawrence LIGHT
80533 IA Melbourne OVAL
80534 TX Burleson LIGHT
80535 KY Elizabethtown CIRCLE
80536 PA Wyoming DISK
80537 NE Neligh CIRCLE
80538 NE Neligh CIRCLE
80539 OH Uhrichsville LIGHT
80540 AZ Tucson NaN
80541 IL Orland park LIGHT
80542 FL Loughman LIGHT

80543 rows × 3 columns


In [74]:
# Logical filtering
ufo[ufo.State == 'TX']          # Select only rows where State == 'TX'


Out[74]:
City Colors Reported Shape Reported State Time
37 Dallas NaN SPHERE TX 7/15/1945 14:00
43 Alice NaN DISK TX 3/15/1946 15:30
49 Conroe NaN OTHER TX 1/10/1947 20:00
92 Borger NaN DISK TX 6/15/1948 16:00
114 Post NaN DISK TX 9/15/1949 21:00
115 San Marcos NaN CYLINDER TX 10/10/1949 20:30
116 Lackland AFB NaN LIGHT TX 10/10/1949 21:00
132 Fort Worth NaN DISK TX 6/15/1950 20:00
134 Galveston NaN SPHERE TX 6/20/1950 17:00
174 Greenville ORANGE TRIANGLE TX 4/15/1952 16:00
189 Pecos NaN DISK TX 6/30/1952 3:00
201 Arlington NaN DISK TX 7/7/1952 13:00
213 Smithson's Valley NaN CIGAR TX 8/15/1952 17:00
223 Oklahoma NaN CIGAR TX 1/1/1953 14:00
235 Fort Worth NaN SPHERE TX 6/15/1953 13:00
237 Galveston NaN CIGAR TX 6/17/1953 7:30
252 Mauriceville NaN DISK TX 10/15/1953 14:00
257 Lackland Air Force Base NaN DISK TX 11/25/1953 4:25
267 Houston NaN DISK TX 6/1/1954 0:00
280 Ellington AFB NaN DISK TX 6/15/1954 16:30
283 Thrifty NaN CIRCLE TX 6/25/1954 3:30
304 Beaumont RED DISK TX 9/9/1954 12:30
307 Blanco NaN OTHER TX 10/15/1954 16:30
310 Thelma NaN DISK TX 11/14/1954 1:00
330 Lubbock NaN CIGAR TX 6/30/1955 13:00
331 Dumas NaN LIGHT TX 6/30/1955 21:00
340 Eagle Pass NaN CIGAR TX 8/15/1955 0:01
359 Edroy NaN NaN TX 6/15/1956 1:00
373 Houston NaN TRIANGLE TX 6/30/1956 18:30
387 Edna NaN CIRCLE TX 10/10/1956 21:00
... ... ... ... ... ...
79676 Austin RED NaN TX 8/1/2014 20:55
79697 Fort Worth BLUE CIRCLE TX 8/2/2014 5:58
79708 Von Ormy NaN LIGHT TX 8/2/2014 21:20
79746 Cleburne NaN LIGHT TX 8/3/2014 20:15
79763 Cypress NaN FIREBALL TX 8/4/2014 0:00
79774 Port Neches NaN LIGHT TX 8/5/2014 0:00
79790 Amarillo NaN TRIANGLE TX 8/5/2014 23:00
79803 Allen BLUE RECTANGLE TX 8/6/2014 22:30
79848 Allen NaN LIGHT TX 8/9/2014 4:00
79874 San Angelo NaN LIGHT TX 8/9/2014 23:30
79891 Katy NaN FIREBALL TX 8/10/2014 23:00
79919 McKinney RED GREEN RECTANGLE TX 8/13/2014 2:05
79921 Cypress NaN TRIANGLE TX 8/13/2014 8:45
79933 Comfort NaN LIGHT TX 8/13/2014 22:00
79938 Pearland NaN CONE TX 8/14/2014 7:30
79965 Austin NaN CIRCLE TX 8/15/2014 6:45
79969 Austin RED CIRCLE TX 8/15/2014 18:45
79986 Houston YELLOW CIRCLE TX 8/15/2014 23:30
80038 Borger NaN LIGHT TX 8/17/2014 21:15
80075 Round Rock ORANGE LIGHT TX 8/19/2014 1:32
80113 Menard RED LIGHT TX 8/20/2014 21:40
80134 Morton NaN SPHERE TX 8/21/2014 20:15
80141 Port Neches GREEN BLUE LIGHT TX 8/21/2014 23:00
80240 New Braunfels RED LIGHT TX 8/24/2014 21:05
80241 Atascocita NaN CIRCLE TX 8/24/2014 21:15
80287 Lago Vista NaN TRIANGLE TX 8/26/2014 3:35
80350 Houston NaN TRIANGLE TX 8/28/2014 19:55
80414 Eustace NaN LIGHT TX 8/30/2014 23:00
80476 Fort Worth NaN OTHER TX 9/2/2014 16:05
80534 Burleson NaN LIGHT TX 9/4/2014 22:30

4186 rows × 5 columns


In [76]:
ufo[~(ufo.State == 'TX')]       # Select everything where the test fails


Out[76]:
City Colors Reported Shape Reported State Time
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00
1 Willingboro NaN OTHER NJ 6/30/1930 20:00
2 Holyoke NaN OVAL CO 2/15/1931 14:00
3 Abilene NaN DISK KS 6/1/1931 13:00
4 New York Worlds Fair NaN LIGHT NY 4/18/1933 19:00
5 Valley City NaN DISK ND 9/15/1934 15:30
6 Crater Lake NaN CIRCLE CA 6/15/1935 0:00
7 Alma NaN DISK MI 7/15/1936 0:00
8 Eklutna NaN CIGAR AK 10/15/1936 17:00
9 Hubbard NaN CYLINDER OR 6/15/1937 0:00
10 Fontana NaN LIGHT CA 8/15/1937 21:00
11 Waterloo NaN FIREBALL AL 6/1/1939 20:00
12 Belton RED SPHERE SC 6/30/1939 20:00
13 Keokuk NaN OVAL IA 7/7/1939 2:00
14 Ludington NaN DISK MI 6/1/1941 13:00
15 Forest Home NaN CIRCLE CA 7/2/1941 11:30
16 Los Angeles NaN NaN CA 2/25/1942 0:00
17 Hapeville NaN NaN GA 6/1/1942 22:30
18 Oneida NaN RECTANGLE TN 7/15/1942 1:00
19 Bering Sea RED OTHER AK 4/30/1943 23:00
20 Nebraska NaN DISK NE 6/1/1943 15:00
21 NaN NaN NaN LA 8/15/1943 0:00
22 NaN NaN LIGHT LA 8/15/1943 0:00
23 Owensboro NaN RECTANGLE KY 10/15/1943 11:00
24 Wilderness NaN DISK WV 1/1/1944 10:00
25 San Diego NaN CIGAR CA 1/1/1944 12:00
26 Wilderness NaN DISK WV 1/1/1944 12:00
27 Clovis NaN DISK NM 4/2/1944 11:00
28 Los Alamos NaN DISK NM 6/1/1944 12:00
29 Ft. Duschene NaN DISK UT 6/30/1944 10:00
... ... ... ... ... ...
80512 Malta NaN LIGHT ID 9/4/2014 5:41
80513 Manahawkin NaN CIRCLE NJ 9/4/2014 8:30
80514 New Bedford NaN LIGHT MA 9/4/2014 9:47
80515 Woodbridge NaN TRIANGLE VA 9/4/2014 16:30
80516 Glendale NaN CYLINDER CA 9/4/2014 17:38
80517 Laconia NaN CIGAR NH 9/4/2014 19:02
80518 Langhorne NaN OTHER PA 9/4/2014 19:30
80519 Glen Ellyn RED CIRCLE IL 9/4/2014 20:20
80520 Waynesburg NaN FLASH PA 9/4/2014 20:25
80521 Canal Winchester NaN TRIANGLE OH 9/4/2014 20:30
80522 Lawrence ORANGE LIGHT MA 9/4/2014 20:30
80523 Ellicott City NaN LIGHT MD 9/4/2014 20:45
80524 Olympia RED LIGHT WA 9/4/2014 21:10
80525 Iowa City BLUE LIGHT IA 9/4/2014 21:11
80526 Marshfield NaN LIGHT MA 9/4/2014 21:15
80527 Seattle NaN DIAMOND WA 9/4/2014 21:15
80528 North Royalton RED TRIANGLE OH 9/4/2014 21:30
80529 Deer Park NaN LIGHT WA 9/4/2014 22:00
80530 Glen St. Mary NaN DISK FL 9/4/2014 22:00
80531 Abingdon NaN CIRCLE VA 9/4/2014 22:05
80532 Lawrence NaN LIGHT MA 9/4/2014 22:15
80533 Melbourne NaN OVAL IA 9/4/2014 22:15
80535 Elizabethtown NaN CIRCLE KY 9/4/2014 22:30
80536 Wyoming RED DISK PA 9/4/2014 23:00
80537 Neligh NaN CIRCLE NE 9/4/2014 23:20
80538 Neligh NaN CIRCLE NE 9/4/2014 23:20
80539 Uhrichsville NaN LIGHT OH 9/5/2014 1:14
80540 Tucson RED BLUE NaN AZ 9/5/2014 2:40
80541 Orland park RED LIGHT IL 9/5/2014 3:43
80542 Loughman NaN LIGHT FL 9/5/2014 5:30

76357 rows × 5 columns


In [68]:
ufo[ufo.State != 'TX']          # Same thing as before


Out[68]:
City Colors Reported Shape Reported State Time
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00
1 Willingboro NaN OTHER NJ 6/30/1930 20:00
2 Holyoke NaN OVAL CO 2/15/1931 14:00
3 Abilene NaN DISK KS 6/1/1931 13:00
4 New York Worlds Fair NaN LIGHT NY 4/18/1933 19:00
5 Valley City NaN DISK ND 9/15/1934 15:30
6 Crater Lake NaN CIRCLE CA 6/15/1935 0:00
7 Alma NaN DISK MI 7/15/1936 0:00
8 Eklutna NaN CIGAR AK 10/15/1936 17:00
9 Hubbard NaN CYLINDER OR 6/15/1937 0:00
10 Fontana NaN LIGHT CA 8/15/1937 21:00
11 Waterloo NaN FIREBALL AL 6/1/1939 20:00
12 Belton RED SPHERE SC 6/30/1939 20:00
13 Keokuk NaN OVAL IA 7/7/1939 2:00
14 Ludington NaN DISK MI 6/1/1941 13:00
15 Forest Home NaN CIRCLE CA 7/2/1941 11:30
16 Los Angeles NaN NaN CA 2/25/1942 0:00
17 Hapeville NaN NaN GA 6/1/1942 22:30
18 Oneida NaN RECTANGLE TN 7/15/1942 1:00
19 Bering Sea RED OTHER AK 4/30/1943 23:00
20 Nebraska NaN DISK NE 6/1/1943 15:00
21 NaN NaN NaN LA 8/15/1943 0:00
22 NaN NaN LIGHT LA 8/15/1943 0:00
23 Owensboro NaN RECTANGLE KY 10/15/1943 11:00
24 Wilderness NaN DISK WV 1/1/1944 10:00
25 San Diego NaN CIGAR CA 1/1/1944 12:00
26 Wilderness NaN DISK WV 1/1/1944 12:00
27 Clovis NaN DISK NM 4/2/1944 11:00
28 Los Alamos NaN DISK NM 6/1/1944 12:00
29 Ft. Duschene NaN DISK UT 6/30/1944 10:00
... ... ... ... ... ...
80512 Malta NaN LIGHT ID 9/4/2014 5:41
80513 Manahawkin NaN CIRCLE NJ 9/4/2014 8:30
80514 New Bedford NaN LIGHT MA 9/4/2014 9:47
80515 Woodbridge NaN TRIANGLE VA 9/4/2014 16:30
80516 Glendale NaN CYLINDER CA 9/4/2014 17:38
80517 Laconia NaN CIGAR NH 9/4/2014 19:02
80518 Langhorne NaN OTHER PA 9/4/2014 19:30
80519 Glen Ellyn RED CIRCLE IL 9/4/2014 20:20
80520 Waynesburg NaN FLASH PA 9/4/2014 20:25
80521 Canal Winchester NaN TRIANGLE OH 9/4/2014 20:30
80522 Lawrence ORANGE LIGHT MA 9/4/2014 20:30
80523 Ellicott City NaN LIGHT MD 9/4/2014 20:45
80524 Olympia RED LIGHT WA 9/4/2014 21:10
80525 Iowa City BLUE LIGHT IA 9/4/2014 21:11
80526 Marshfield NaN LIGHT MA 9/4/2014 21:15
80527 Seattle NaN DIAMOND WA 9/4/2014 21:15
80528 North Royalton RED TRIANGLE OH 9/4/2014 21:30
80529 Deer Park NaN LIGHT WA 9/4/2014 22:00
80530 Glen St. Mary NaN DISK FL 9/4/2014 22:00
80531 Abingdon NaN CIRCLE VA 9/4/2014 22:05
80532 Lawrence NaN LIGHT MA 9/4/2014 22:15
80533 Melbourne NaN OVAL IA 9/4/2014 22:15
80535 Elizabethtown NaN CIRCLE KY 9/4/2014 22:30
80536 Wyoming RED DISK PA 9/4/2014 23:00
80537 Neligh NaN CIRCLE NE 9/4/2014 23:20
80538 Neligh NaN CIRCLE NE 9/4/2014 23:20
80539 Uhrichsville NaN LIGHT OH 9/5/2014 1:14
80540 Tucson RED BLUE NaN AZ 9/5/2014 2:40
80541 Orland park RED LIGHT IL 9/5/2014 3:43
80542 Loughman NaN LIGHT FL 9/5/2014 5:30

76357 rows × 5 columns


In [69]:
ufo.City[ufo.State == 'TX']     # Select only city columm where State == 'TX'


Out[69]:
37                Dallas
43                 Alice
49                Conroe
92                Borger
114                 Post
115           San Marcos
116         Lackland AFB
132           Fort Worth
134            Galveston
174           Greenville
189                Pecos
201            Arlington
213    Smithson's Valley
223             Oklahoma
235           Fort Worth
...
79965           Austin
79969           Austin
79986          Houston
80038           Borger
80075       Round Rock
80113           Menard
80134           Morton
80141      Port Neches
80240    New Braunfels
80241       Atascocita
80287       Lago Vista
80350          Houston
80414          Eustace
80476       Fort Worth
80534         Burleson
Name: City, Length: 4186, dtype: object

In [70]:
ufo[ufo.State == 'TX'].City     # Same thing as before


Out[70]:
37                Dallas
43                 Alice
49                Conroe
92                Borger
114                 Post
115           San Marcos
116         Lackland AFB
132           Fort Worth
134            Galveston
174           Greenville
189                Pecos
201            Arlington
213    Smithson's Valley
223             Oklahoma
235           Fort Worth
...
79965           Austin
79969           Austin
79986          Houston
80038           Borger
80075       Round Rock
80113           Menard
80134           Morton
80141      Port Neches
80240    New Braunfels
80241       Atascocita
80287       Lago Vista
80350          Houston
80414          Eustace
80476       Fort Worth
80534         Burleson
Name: City, Length: 4186, dtype: object

In [71]:
ufo[(ufo.State == 'CA') | (ufo.State =='TX')] # Select only records where State is 'CA' or State is 'TX'


Out[71]:
City Colors Reported Shape Reported State Time
6 Crater Lake NaN CIRCLE CA 6/15/1935 0:00
10 Fontana NaN LIGHT CA 8/15/1937 21:00
15 Forest Home NaN CIRCLE CA 7/2/1941 11:30
16 Los Angeles NaN NaN CA 2/25/1942 0:00
25 San Diego NaN CIGAR CA 1/1/1944 12:00
33 Salinas AFB NaN DISK CA 6/1/1945 12:00
37 Dallas NaN SPHERE TX 7/15/1945 14:00
38 Huntington Beach NaN DISK CA 8/8/1945 12:00
42 Merced NaN SPHERE CA 2/1/1946 17:00
43 Alice NaN DISK TX 3/15/1946 15:30
44 Blairsden GREEN SPHERE CA 6/30/1946 19:00
49 Conroe NaN OTHER TX 1/10/1947 20:00
52 San Deigo NaN TRIANGLE CA 4/15/1947 23:00
56 Oroville NaN NaN CA 6/1/1947 18:00
61 Hayward NaN CIRCLE CA 6/20/1947 12:00
65 Maywood NaN DISK CA 7/1/1947 20:00
75 Richmond NaN CIRCLE CA 7/14/1947 11:00
82 San Jose BLUE CHEVRON CA 7/15/1947 21:00
84 Modesto BLUE DISK CA 8/8/1947 22:00
90 Kelseyville NaN DISK CA 4/30/1948 12:00
92 Borger NaN DISK TX 6/15/1948 16:00
95 Ojai NaN LIGHT CA 7/15/1948 22:00
110 Mill Valley NaN LIGHT CA 8/15/1949 21:00
114 Post NaN DISK TX 9/15/1949 21:00
115 San Marcos NaN CYLINDER TX 10/10/1949 20:30
116 Lackland AFB NaN LIGHT TX 10/10/1949 21:00
129 Napa GREEN DISK CA 6/10/1950 0:00
132 Fort Worth NaN DISK TX 6/15/1950 20:00
134 Galveston NaN SPHERE TX 6/20/1950 17:00
142 Los Padres National Forest NaN FORMATION CA 7/15/1950 23:00
... ... ... ... ... ...
80270 Oceanside NaN VARIOUS CA 8/25/2014 21:38
80272 Murrieta NaN LIGHT CA 8/25/2014 22:00
80275 Escondido NaN SPHERE CA 8/25/2014 22:09
80276 Aliso Viejo ORANGE FIREBALL CA 8/25/2014 22:11
80278 Carlsbad ORANGE DISK CA 8/25/2014 22:20
80279 San Marcos YELLOW LIGHT CA 8/25/2014 22:28
80287 Lago Vista NaN TRIANGLE TX 8/26/2014 3:35
80296 Ventura RED LIGHT CA 8/26/2014 21:30
80311 Woodland Hills NaN CIGAR CA 8/27/2014 16:00
80312 Menifee NaN NaN CA 8/27/2014 16:30
80328 Santa Ysabel NaN OTHER CA 8/28/2014 4:02
80349 San Francisco RED RECTANGLE CA 8/28/2014 19:45
80350 Houston NaN TRIANGLE TX 8/28/2014 19:55
80351 Rancho Cucamonga NaN LIGHT CA 8/28/2014 20:00
80360 Fontana RED BLUE CIGAR CA 8/28/2014 22:00
80364 Big Pine NaN FLASH CA 8/28/2014 22:28
80365 West Hills NaN TEARDROP CA 8/28/2014 22:28
80366 Folsom ORANGE BLUE TRIANGLE CA 8/28/2014 22:30
80373 Ocean Beach NaN CYLINDER CA 8/29/2014 8:12
80379 Apple Valley ORANGE OVAL CA 8/29/2014 21:30
80394 San Francisco NaN RECTANGLE CA 8/30/2014 4:15
80396 El Cajon NaN DIAMOND CA 8/30/2014 8:20
80405 San Diego NaN LIGHT CA 8/30/2014 21:00
80408 Fallbrook ORANGE FIREBALL CA 8/30/2014 21:20
80414 Eustace NaN LIGHT TX 8/30/2014 23:00
80421 Manteca NaN SPHERE CA 8/31/2014 2:20
80442 El Cajon NaN TRIANGLE CA 8/31/2014 22:35
80476 Fort Worth NaN OTHER TX 9/2/2014 16:05
80516 Glendale NaN CYLINDER CA 9/4/2014 17:38
80534 Burleson NaN LIGHT TX 9/4/2014 22:30

14929 rows × 5 columns


In [72]:
ufo_dallas = ufo[(ufo.City == 'Dallas') & (ufo.State =='TX')]  # Select only Dallas, TX records

In [77]:
ufo[ufo.City.isin(['Austin','Dallas', 'Houston'])] # Select only Austin, Dallas, or Houston records


Out[77]:
City Colors Reported Shape Reported State Time
37 Dallas NaN SPHERE TX 7/15/1945 14:00
267 Houston NaN DISK TX 6/1/1954 0:00
373 Houston NaN TRIANGLE TX 6/30/1956 18:30
392 Dallas NaN NaN TX 1/1/1957 21:00
414 Dallas NaN OTHER TX 6/30/1957 0:00
428 Houston NaN DIAMOND TX 7/15/1957 13:00
463 Houston NaN LIGHT TX 10/15/1957 20:30
490 Houston NaN DISK TX 6/15/1958 20:00
584 Houston NaN TRIANGLE TX 3/12/1960 23:00
585 Houston NaN OTHER TX 4/1/1960 21:00
666 Dallas NaN DISK GA 6/22/1961 0:00
787 Houston NaN NaN TX 6/1/1963 0:00
793 Houston NaN CIGAR TX 6/1/1963 12:00
964 Austin NaN NaN TX 4/1/1965 5:30
1147 Houston NaN FORMATION TX 12/22/1965 9:00
1604 Houston NaN TRIANGLE TX 6/1/1968 22:00
1661 Dallas NaN OTHER TX 7/1/1968 3:30
1672 Houston NaN DISK TX 7/7/1968 12:00
1712 Austin NaN DISK TX 8/15/1968 5:00
1792 Dallas NaN LIGHT TX 5/15/1969 20:00
1889 Austin NaN SPHERE TX 9/9/1969 21:10
1945 Houston NaN DISK TX 5/1/1970 3:00
1974 Houston NaN DISK TX 6/15/1970 3:00
2155 Houston NaN CIRCLE TX 9/15/1971 18:00
2205 Austin NaN LIGHT TX 6/1/1972 18:00
2248 Dallas GREEN NaN TX 6/30/1972 12:00
2278 Houston NaN LIGHT TX 7/15/1972 20:00
2348 Houston NaN RECTANGLE TX 2/10/1973 16:00
2359 Houston NaN DISK TX 4/5/1973 21:30
2387 Austin NaN SPHERE TX 6/1/1973 21:00
... ... ... ... ... ...
76081 Dallas NaN TRIANGLE TX 2/17/2014 22:00
76093 Houston NaN DISK TX 2/18/2014 18:00
76121 Houston NaN NaN TX 2/19/2014 21:20
76548 Houston NaN FLASH TX 3/18/2014 20:45
76712 Houston NaN TRIANGLE TX 3/29/2014 1:00
76855 Austin RED LIGHT TX 4/5/2014 22:15
76905 Austin NaN NaN TX 4/8/2014 21:15
76966 Houston RED BLUE SPHERE TX 4/11/2014 0:30
77527 Dallas NaN OTHER TX 5/9/2014 8:00
78353 Austin NaN CYLINDER TX 6/19/2014 23:00
78354 Austin GREEN BLUE CYLINDER TX 6/19/2014 23:00
78420 Dallas GREEN BLUE LIGHT TX 6/21/2014 22:00
78925 Houston ORANGE FIREBALL TX 7/4/2014 22:30
78983 Austin NaN LIGHT TX 7/4/2014 23:30
79128 Austin NaN SPHERE TX 7/7/2014 23:00
79212 Austin NaN LIGHT TX 7/12/2014 21:00
79223 Austin YELLOW LIGHT TX 7/12/2014 21:05
79227 Austin NaN LIGHT TX 7/12/2014 21:08
79236 Austin NaN LIGHT TX 7/12/2014 21:12
79238 Austin NaN LIGHT TX 7/12/2014 21:15
79302 Austin ORANGE CIRCLE TX 7/14/2014 5:10
79390 Houston NaN LIGHT TX 7/18/2014 22:00
79572 Austin GREEN NaN TX 7/27/2014 0:30
79628 Houston NaN OTHER TX 7/30/2014 12:00
79676 Austin RED NaN TX 8/1/2014 20:55
79767 Dallas NaN TRIANGLE GA 8/4/2014 21:00
79965 Austin NaN CIRCLE TX 8/15/2014 6:45
79969 Austin RED CIRCLE TX 8/15/2014 18:45
79986 Houston YELLOW CIRCLE TX 8/15/2014 23:30
80350 Houston NaN TRIANGLE TX 8/28/2014 19:55

789 rows × 5 columns

Sorting


In [78]:
ufo.State.order()                               # only works for a Series


Out[78]:
76702    AK
74536    AK
28987    AK
8551     AK
76703    AK
31387    AK
2374     AK
44188    AK
18424    AK
61798    AK
1891     AK
33161    AK
73924    AK
48085    AK
67403    AK
...
14586    WY
1177     WY
77831    WY
80082    WY
188      WY
26882    WY
41926    WY
54342    WY
79897    WY
55682    WY
52869    WY
49543    WY
7183     WY
58080    WY
21766    WY
Name: State, Length: 80543, dtype: object

In [79]:
ufo.sort_index(inplace=True)                    # sort rows by label

In [80]:
ufo.sort_index(ascending=False, inplace=False)


Out[80]:
City Colors Reported Shape Reported State Time
80542 Loughman NaN LIGHT FL 9/5/2014 5:30
80541 Orland park RED LIGHT IL 9/5/2014 3:43
80540 Tucson RED BLUE NaN AZ 9/5/2014 2:40
80539 Uhrichsville NaN LIGHT OH 9/5/2014 1:14
80538 Neligh NaN CIRCLE NE 9/4/2014 23:20
80537 Neligh NaN CIRCLE NE 9/4/2014 23:20
80536 Wyoming RED DISK PA 9/4/2014 23:00
80535 Elizabethtown NaN CIRCLE KY 9/4/2014 22:30
80534 Burleson NaN LIGHT TX 9/4/2014 22:30
80533 Melbourne NaN OVAL IA 9/4/2014 22:15
80532 Lawrence NaN LIGHT MA 9/4/2014 22:15
80531 Abingdon NaN CIRCLE VA 9/4/2014 22:05
80530 Glen St. Mary NaN DISK FL 9/4/2014 22:00
80529 Deer Park NaN LIGHT WA 9/4/2014 22:00
80528 North Royalton RED TRIANGLE OH 9/4/2014 21:30
80527 Seattle NaN DIAMOND WA 9/4/2014 21:15
80526 Marshfield NaN LIGHT MA 9/4/2014 21:15
80525 Iowa City BLUE LIGHT IA 9/4/2014 21:11
80524 Olympia RED LIGHT WA 9/4/2014 21:10
80523 Ellicott City NaN LIGHT MD 9/4/2014 20:45
80522 Lawrence ORANGE LIGHT MA 9/4/2014 20:30
80521 Canal Winchester NaN TRIANGLE OH 9/4/2014 20:30
80520 Waynesburg NaN FLASH PA 9/4/2014 20:25
80519 Glen Ellyn RED CIRCLE IL 9/4/2014 20:20
80518 Langhorne NaN OTHER PA 9/4/2014 19:30
80517 Laconia NaN CIGAR NH 9/4/2014 19:02
80516 Glendale NaN CYLINDER CA 9/4/2014 17:38
80515 Woodbridge NaN TRIANGLE VA 9/4/2014 16:30
80514 New Bedford NaN LIGHT MA 9/4/2014 9:47
80513 Manahawkin NaN CIRCLE NJ 9/4/2014 8:30
... ... ... ... ... ...
29 Ft. Duschene NaN DISK UT 6/30/1944 10:00
28 Los Alamos NaN DISK NM 6/1/1944 12:00
27 Clovis NaN DISK NM 4/2/1944 11:00
26 Wilderness NaN DISK WV 1/1/1944 12:00
25 San Diego NaN CIGAR CA 1/1/1944 12:00
24 Wilderness NaN DISK WV 1/1/1944 10:00
23 Owensboro NaN RECTANGLE KY 10/15/1943 11:00
22 NaN NaN LIGHT LA 8/15/1943 0:00
21 NaN NaN NaN LA 8/15/1943 0:00
20 Nebraska NaN DISK NE 6/1/1943 15:00
19 Bering Sea RED OTHER AK 4/30/1943 23:00
18 Oneida NaN RECTANGLE TN 7/15/1942 1:00
17 Hapeville NaN NaN GA 6/1/1942 22:30
16 Los Angeles NaN NaN CA 2/25/1942 0:00
15 Forest Home NaN CIRCLE CA 7/2/1941 11:30
14 Ludington NaN DISK MI 6/1/1941 13:00
13 Keokuk NaN OVAL IA 7/7/1939 2:00
12 Belton RED SPHERE SC 6/30/1939 20:00
11 Waterloo NaN FIREBALL AL 6/1/1939 20:00
10 Fontana NaN LIGHT CA 8/15/1937 21:00
9 Hubbard NaN CYLINDER OR 6/15/1937 0:00
8 Eklutna NaN CIGAR AK 10/15/1936 17:00
7 Alma NaN DISK MI 7/15/1936 0:00
6 Crater Lake NaN CIRCLE CA 6/15/1935 0:00
5 Valley City NaN DISK ND 9/15/1934 15:30
4 New York Worlds Fair NaN LIGHT NY 4/18/1933 19:00
3 Abilene NaN DISK KS 6/1/1931 13:00
2 Holyoke NaN OVAL CO 2/15/1931 14:00
1 Willingboro NaN OTHER NJ 6/30/1930 20:00
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00

80543 rows × 5 columns


In [81]:
ufo.sort_index(by='State')                      # sort rows by specific column


Out[81]:
City Colors Reported Shape Reported State Time
76702 Wasilla NaN VARIOUS AK 3/28/2014 11:17
74536 Ketchikan NaN LIGHT AK 12/5/2013 17:04
28987 Anchorage NaN SPHERE AK 5/28/2004 8:35
8551 Cantwell NaN CIRCLE AK 8/20/1995 11:30
76703 Chugiak NaN LIGHT AK 3/28/2014 11:30
31387 Fairbanks NaN NaN AK 12/20/2004 20:00
2374 Chugiak NaN RECTANGLE AK 5/15/1973 22:00
44188 Alaska NaN CIGAR AK 4/29/2008 0:00
18424 North Pole NaN DISK AK 1/24/2001 0:00
61798 Anchorage RED LIGHT AK 3/2/2012 19:30
1891 Fairbanks NaN CIGAR AK 9/20/1969 23:00
33161 Susitna Lodge NaN OTHER AK 7/7/2005 22:00
73924 Kuparuk Oilfield NaN LIGHT AK 11/9/2013 17:30
48085 Fairbanks GREEN LIGHT AK 2/20/2009 22:00
67403 Fairbanks ORANGE LIGHT AK 11/22/2012 19:05
35828 White Mountain NaN SPHERE AK 3/21/2006 0:00
61819 Bethel NaN LIGHT AK 3/3/2012 22:15
5426 Adak NaN LIGHT AK 6/1/1986 0:00
68716 Palmer ORANGE LIGHT AK 2/23/2013 21:30
32284 Fairbanks NaN CIRCLE AK 4/11/2005 0:00
76317 Anchorage RED LIGHT AK 3/1/2014 0:00
39041 Anchorage RED DISK AK 1/27/2007 23:07
22150 Bethel NaN OTHER AK 6/5/2002 23:45
17589 Fairbanks NaN NaN AK 10/2/2000 23:00
7714 Anchorage NaN LIGHT AK 1/17/1995 22:45
28204 Fairbanks YELLOW LIGHT AK 3/11/2004 21:30
76721 Anchorage NaN CIGAR AK 3/29/2014 20:45
467 St Lawrence Island NaN OTHER AK 11/10/1957 20:00
33158 Susitna Lodge BLUE RECTANGLE AK 7/7/2005 18:45
76106 Anchorage NaN FLASH AK 2/19/2014 3:00
... ... ... ... ... ...
2976 Fort Washakie NaN LIGHT WY 7/7/1975 19:00
79995 Cheyenne NaN TRIANGLE WY 8/16/2014 17:05
33421 Clark NaN LIGHT WY 7/31/2005 4:20
23548 Bosler NaN NaN WY 10/15/2002 18:00
38054 Laramie NaN RECTANGLE WY 11/7/2006 21:00
9277 Cody NaN OTHER WY 5/1/1996 22:00
20848 Clark NaN LIGHT WY 11/15/2001 21:10
13716 On highway; location unknown NaN OTHER WY 6/15/1999 22:00
12215 Sheridan NaN OTHER WY 9/24/1998 0:15
74651 Powell NaN CHEVRON WY 12/11/2013 18:25
45615 Gillette NaN EGG WY 8/11/2008 20:15
7064 Mammoth Hot Springs YELLOW CHEVRON WY 6/4/1993 21:55
37171 Riverton NaN FORMATION WY 8/13/2006 22:20
47842 Casper NaN NaN WY 2/2/2009 6:00
35111 Evanston NaN SPHERE WY 12/12/2005 23:00
14586 Egg Harbor NaN NaN WY 9/15/1999 0:00
1177 Francis E Warren AFB NaN NaN WY 4/15/1966 23:00
77831 Rock Springs NaN LIGHT WY 5/25/2014 20:00
80082 Yellowstone National Park YELLOW TEARDROP WY 8/19/2014 7:40
188 Cheyenne NaN DISK WY 6/30/1952 0:00
26882 Cheyenne NaN FIREBALL WY 11/1/2003 0:00
41926 Opal NaN NaN WY 10/20/2007 1:00
54342 Cokeville NaN FORMATION WY 8/29/2010 21:20
79897 Evanston NaN CONE WY 8/11/2014 19:45
55682 Buffalo NaN DISK WY 12/17/2010 18:15
52869 Gillette NaN LIGHT WY 6/6/2010 2:45
49543 Powell NaN CIRCLE WY 7/8/2009 4:25
7183 Laramie NaN VARIOUS WY 8/23/1993 19:56
58080 Casper NaN LIGHT WY 7/20/2011 22:13
21766 Laramie NaN VARIOUS WY 4/4/2002 19:45

80543 rows × 5 columns


In [82]:
ufo.sort_index(by=['State', 'Shape Reported'])  # sort by multiple columns


Out[82]:
City Colors Reported Shape Reported State Time
8 Eklutna NaN CIGAR AK 10/15/1936 17:00
413 Arctic NaN CIGAR AK 6/30/1957 0:00
1306 Delta Junction ORANGE CIGAR AK 9/15/1966 18:00
1891 Fairbanks NaN CIGAR AK 9/20/1969 23:00
7233 Fairbanks BLUE CIGAR AK 11/23/1993 11:24
10682 Kenai NaN CIGAR AK 8/8/1997 23:00
44188 Alaska NaN CIGAR AK 4/29/2008 0:00
76721 Anchorage NaN CIGAR AK 3/29/2014 20:45
1494 Gula NaN CIRCLE AK 8/15/1967 23:00
2304 Anchorage BLUE CIRCLE AK 8/30/1972 1:30
3511 Kodiak Island NaN CIRCLE AK 6/15/1977 2:00
5333 Juneau NaN CIRCLE AK 9/5/1985 1:30
8551 Cantwell NaN CIRCLE AK 8/20/1995 11:30
12470 Anchorage NaN CIRCLE AK 10/31/1998 10:00
14154 Ketchikan NaN CIRCLE AK 8/11/1999 23:07
15376 Seward ORANGE CIRCLE AK 11/27/1999 11:00
20182 Fairbanks NaN CIRCLE AK 8/27/2001 5:30
24310 Anchorage NaN CIRCLE AK 1/24/2003 8:30
25700 Fairbanks NaN CIRCLE AK 8/7/2003 1:30
29737 Wrangell NaN CIRCLE AK 8/2/2004 0:00
29914 Anchorage NaN CIRCLE AK 8/16/2004 12:08
32284 Fairbanks NaN CIRCLE AK 4/11/2005 0:00
37906 Anchorage NaN CIRCLE AK 10/23/2006 21:00
39338 Eagle River BLUE CIRCLE AK 3/4/2007 19:45
41774 Willow GREEN CIRCLE AK 10/6/2007 22:00
47434 Venetie RED GREEN CIRCLE AK 1/2/2009 21:00
53969 Anchorage NaN CIRCLE AK 8/6/2010 18:00
54283 Craig GREEN CIRCLE AK 8/27/2010 0:00
55673 Anchorage NaN CIRCLE AK 12/16/2010 12:00
56673 Wasilla NaN CIRCLE AK 4/4/2011 4:00
... ... ... ... ... ...
18314 Gillette NaN VARIOUS WY 1/8/2001 18:30
21766 Laramie NaN VARIOUS WY 4/4/2002 19:45
59219 Yellowstone National Park NaN VARIOUS WY 9/20/2011 19:55
60572 Laramie ORANGE VARIOUS WY 12/15/2011 17:30
1177 Francis E Warren AFB NaN NaN WY 4/15/1966 23:00
2437 Powell NaN NaN WY 7/5/1973 1:00
5065 Wheeling NaN NaN WY 6/15/1984 20:30
5232 Old Faithful NaN NaN WY 6/2/1985 23:30
7093 Laramie NaN NaN WY 6/17/1993 15:00
7684 Glendo NaN NaN WY 1/6/1995 17:30
9264 Mile 69 NaN NaN WY 4/28/1996 1:00
11490 Cheyenne NaN NaN WY 5/2/1998 20:30
13756 Laramie BLUE NaN WY 6/20/1999 23:00
14586 Egg Harbor NaN NaN WY 9/15/1999 0:00
14618 Cheyenne NaN NaN WY 9/17/1999 21:55
15667 Rock Spings, 2miles west of NaN NaN WY 1/6/2000 5:45
20080 Rawlins RED ORANGE NaN WY 8/17/2001 22:20
23548 Bosler NaN NaN WY 10/15/2002 18:00
35060 Cheyenne ORANGE NaN WY 12/5/2005 21:00
36535 Seminoe Lake NaN NaN WY 6/14/2006 23:00
38849 Shawnee NaN NaN WY 1/10/2007 19:00
40260 Thermopolis NaN NaN WY 6/14/2007 23:00
41926 Opal NaN NaN WY 10/20/2007 1:00
46209 Glenrock RED GREEN NaN WY 10/1/2008 20:00
46430 Colony NaN NaN WY 10/17/2008 19:20
47842 Casper NaN NaN WY 2/2/2009 6:00
52361 Cheyenne NaN NaN WY 4/10/2010 21:00
53722 Lander NaN NaN WY 7/18/2010 23:00
73095 Clark NaN NaN WY 10/5/2013 12:15
79629 Hulett NaN NaN WY 7/30/2014 13:00

80543 rows × 5 columns


In [83]:
ufo.sort_index(by=['State', 'Shape Reported'], ascending=[False, True], inplace=True)  # specify sort order

Modifying Columns


In [84]:
# Add a new column as a function of existing columns
ufo['Location'] = ufo['City'] + ', ' + ufo['State']
ufo.head()


Out[84]:
City Colors Reported Shape Reported State Time Location
7064 Mammoth Hot Springs YELLOW CHEVRON WY 6/4/1993 21:55 Mammoth Hot Springs, WY
24658 Jackson Hole NaN CHEVRON WY 3/16/2003 11:00 Jackson Hole, WY
61928 Rock Springs NaN CHEVRON WY 3/12/2012 1:02 Rock Springs, WY
73885 Casper NaN CHEVRON WY 11/7/2013 1:45 Casper, WY
74651 Powell NaN CHEVRON WY 12/11/2013 18:25 Powell, WY

In [86]:
# Rename columns
ufo.rename(columns={'Colors Reported':'Colors', 'Shape Reported':'Shape'}, inplace=True)
ufo.head()


Out[86]:
City Colors Shape State Time Location
7064 Mammoth Hot Springs YELLOW CHEVRON WY 6/4/1993 21:55 Mammoth Hot Springs, WY
24658 Jackson Hole NaN CHEVRON WY 3/16/2003 11:00 Jackson Hole, WY
61928 Rock Springs NaN CHEVRON WY 3/12/2012 1:02 Rock Springs, WY
73885 Casper NaN CHEVRON WY 11/7/2013 1:45 Casper, WY
74651 Powell NaN CHEVRON WY 12/11/2013 18:25 Powell, WY

In [87]:
# Hide a column (temporarily)
ufo.drop(['Location'], axis=1)


Out[87]:
City Colors Shape State Time
7064 Mammoth Hot Springs YELLOW CHEVRON WY 6/4/1993 21:55
24658 Jackson Hole NaN CHEVRON WY 3/16/2003 11:00
61928 Rock Springs NaN CHEVRON WY 3/12/2012 1:02
73885 Casper NaN CHEVRON WY 11/7/2013 1:45
74651 Powell NaN CHEVRON WY 12/11/2013 18:25
1461 Wyoming NaN CIGAR WY 7/12/1967 2:00
12072 Cheyenne NaN CIGAR WY 9/4/1998 15:00
60064 Torrington NaN CIGAR WY 11/5/2011 21:30
72264 Wyoming NaN CIGAR WY 9/1/2013 0:00
1985 Cheyenne NaN CIRCLE WY 6/20/1970 2:00
3326 Casper NaN CIRCLE WY 7/25/1976 18:00
9399 Gillette NaN CIRCLE WY 6/15/1996 2:30
23452 Lander GREEN CIRCLE WY 10/6/2002 19:25
29083 Hanna NaN CIRCLE WY 6/5/2004 16:00
49543 Powell NaN CIRCLE WY 7/8/2009 4:25
49739 Laramie NaN CIRCLE WY 7/21/2009 22:15
52524 Evansville NaN CIRCLE WY 4/30/2010 21:30
57240 Mills NaN CIRCLE WY 6/3/2011 15:00
58355 Cheyenne NaN CIRCLE WY 8/5/2011 18:13
59084 Casper RED GREEN BLUE CIRCLE WY 9/12/2011 20:30
59194 La Barge NaN CIRCLE WY 9/18/2011 21:00
62612 Rock Springs NaN CIRCLE WY 4/21/2012 23:05
63577 Laramie NaN CIRCLE WY 6/20/2012 16:00
68629 Point of Rocks GREEN CIRCLE WY 2/14/2013 6:00
72481 Gillette NaN CIRCLE WY 9/11/2013 9:00
76250 Casper NaN CIRCLE WY 2/25/2014 2:20
77097 Yellowstone North Entrance RED YELLOW CIRCLE WY 4/17/2014 18:00
79897 Evanston NaN CONE WY 8/11/2014 19:45
3328 Cowley NaN CYLINDER WY 7/27/1976 1:00
13823 Buffalo NaN CYLINDER WY 7/1/1999 22:30
... ... ... ... ... ...
8821 Prince of Wales Island NaN NaN AK 10/27/1995 20:28
9129 Mountain Village NaN NaN AK 3/2/1996 0:30
9212 Nikishi RED NaN AK 4/12/1996 2:00
10262 North Pole NaN NaN AK 5/20/1997 15:00
13552 Wacker City NaN NaN AK 5/26/1999 18:00
15741 Juneau NaN NaN AK 1/18/2000 7:45
17589 Fairbanks NaN NaN AK 10/2/2000 23:00
18116 North Pole NaN NaN AK 12/12/2000 22:00
21296 Anchorage RED NaN AK 1/20/2002 20:00
23064 St. George Island NaN NaN AK 8/30/2002 23:30
31387 Fairbanks NaN NaN AK 12/20/2004 20:00
33534 Suriname NaN NaN AK 8/11/2005 6:11
34623 Glennallen NaN NaN AK 11/1/2005 0:30
44957 Fairbanks BLUE NaN AK 7/3/2008 16:40
47362 Fairbanks NaN NaN AK 12/29/2008 15:00
48773 Eagle River NaN NaN AK 5/9/2009 2:30
52781 Prince William Sound NaN NaN AK 5/29/2010 9:45
53051 Elk NaN NaN AK 6/25/2010 10:22
56186 Fairbanks NaN NaN AK 2/11/2011 8:00
56499 Hooper Bay NaN NaN AK 3/15/2011 0:03
61311 Anchorage ORANGE NaN AK 1/22/2012 20:15
65642 Fairbanks NaN NaN AK 9/3/2012 21:57
68292 Anchorage NaN NaN AK 1/7/2013 1:50
71252 Sterling NaN NaN AK 7/22/2013 3:13
73090 Anchorage NaN NaN AK 10/5/2013 2:55
73916 Fairbanks ORANGE BLUE NaN AK 11/8/2013 23:00
74639 Fairbanks ORANGE NaN AK 12/10/2013 23:15
74969 Anchorage NaN NaN AK 12/26/2013 18:21
75471 Talkitna NaN NaN AK 1/14/2014 23:00
77458 Anchorage NaN NaN AK 5/5/2014 16:00

80543 rows × 5 columns


In [88]:
# Delete a column (permanently)
del ufo['Location']

Handling Missing Values


In [89]:
# Missing values are often just excluded
ufo.describe()                          # Excludes missing values


Out[89]:
City Colors Shape State Time
count 80496 17034 72141 80543 80543
unique 13504 31 27 52 68901
top Seattle ORANGE LIGHT CA 7/4/2014 22:00
freq 646 5216 16332 10743 45

In [90]:
ufo.Shape.value_counts()                # Excludes missing values


Out[90]:
LIGHT        16332
TRIANGLE      7816
CIRCLE        7725
FIREBALL      6249
OTHER         5506
SPHERE        5231
DISK          5226
OVAL          3721
FORMATION     2405
CIGAR         1983
VARIOUS       1957
FLASH         1329
RECTANGLE     1295
CYLINDER      1252
DIAMOND       1152
CHEVRON        940
EGG            733
TEARDROP       723
CONE           310
CROSS          241
DELTA            7
CRESCENT         2
ROUND            2
HEXAGON          1
FLARE            1
PYRAMID          1
DOME             1
dtype: int64

In [91]:
ufo.Shape.value_counts(dropna=False)    # Includes missing values


Out[91]:
LIGHT        16332
NaN           8402
TRIANGLE      7816
CIRCLE        7725
FIREBALL      6249
OTHER         5506
SPHERE        5231
DISK          5226
OVAL          3721
FORMATION     2405
CIGAR         1983
VARIOUS       1957
FLASH         1329
RECTANGLE     1295
CYLINDER      1252
DIAMOND       1152
CHEVRON        940
EGG            733
TEARDROP       723
CONE           310
CROSS          241
DELTA            7
CRESCENT         2
ROUND            2
HEXAGON          1
PYRAMID          1
DOME             1
FLARE            1
dtype: int64

In [92]:
# Find missing values in a Series
ufo.Shape.isnull()       # True if NaN, False otherwise


Out[92]:
7064     False
24658    False
61928    False
73885    False
74651    False
1461     False
12072    False
60064    False
72264    False
1985     False
3326     False
9399     False
23452    False
29083    False
49543    False
...
48773    True
52781    True
53051    True
56186    True
56499    True
61311    True
65642    True
68292    True
71252    True
73090    True
73916    True
74639    True
74969    True
75471    True
77458    True
Name: Shape, Length: 80543, dtype: bool

In [93]:
ufo.Shape.notnull()      # False if NaN, True otherwise


Out[93]:
7064     True
24658    True
61928    True
73885    True
74651    True
1461     True
12072    True
60064    True
72264    True
1985     True
3326     True
9399     True
23452    True
29083    True
49543    True
...
48773    False
52781    False
53051    False
56186    False
56499    False
61311    False
65642    False
68292    False
71252    False
73090    False
73916    False
74639    False
74969    False
75471    False
77458    False
Name: Shape, Length: 80543, dtype: bool

In [94]:
ufo.Shape.isnull().sum() # Count the missing values


Out[94]:
8402

In [95]:
# Find missing values in a DataFrame
ufo.isnull()


Out[95]:
City Colors Shape State Time
7064 False False False False False
24658 False True False False False
61928 False True False False False
73885 False True False False False
74651 False True False False False
1461 False True False False False
12072 False True False False False
60064 False True False False False
72264 False True False False False
1985 False True False False False
3326 False True False False False
9399 False True False False False
23452 False False False False False
29083 False True False False False
49543 False True False False False
49739 False True False False False
52524 False True False False False
57240 False True False False False
58355 False True False False False
59084 False False False False False
59194 False True False False False
62612 False True False False False
63577 False True False False False
68629 False False False False False
72481 False True False False False
76250 False True False False False
77097 False False False False False
79897 False True False False False
3328 False True False False False
13823 False True False False False
... ... ... ... ... ...
8821 False True True False False
9129 False True True False False
9212 False False True False False
10262 False True True False False
13552 False True True False False
15741 False True True False False
17589 False True True False False
18116 False True True False False
21296 False False True False False
23064 False True True False False
31387 False True True False False
33534 False True True False False
34623 False True True False False
44957 False False True False False
47362 False True True False False
48773 False True True False False
52781 False True True False False
53051 False True True False False
56186 False True True False False
56499 False True True False False
61311 False False True False False
65642 False True True False False
68292 False True True False False
71252 False True True False False
73090 False True True False False
73916 False False True False False
74639 False False True False False
74969 False True True False False
75471 False True True False False
77458 False True True False False

80543 rows × 5 columns


In [96]:
# Count the missing values in a DataFrame
ufo.isnull().sum()


Out[96]:
City         47
Colors    63509
Shape      8402
State         0
Time          0
dtype: int64

In [97]:
# Exclude rows with missing values in a dataframe
ufo[(ufo.Shape.notnull()) & (ufo.Colors.notnull())]


Out[97]:
City Colors Shape State Time
7064 Mammoth Hot Springs YELLOW CHEVRON WY 6/4/1993 21:55
23452 Lander GREEN CIRCLE WY 10/6/2002 19:25
59084 Casper RED GREEN BLUE CIRCLE WY 9/12/2011 20:30
68629 Point of Rocks GREEN CIRCLE WY 2/14/2013 6:00
77097 Yellowstone North Entrance RED YELLOW CIRCLE WY 4/17/2014 18:00
23356 Fort Bridger GREEN DISK WY 9/28/2002 18:30
67644 Rock Springs RED DISK WY 12/7/2012 6:30
12241 Rock Springs GREEN FIREBALL WY 9/26/1998 3:05
23132 Gillette GREEN FIREBALL WY 9/6/2002 20:30
10729 Atlantic City RED LIGHT WY 8/22/1997 23:30
24173 Rock Springs RED GREEN LIGHT WY 1/8/2003 20:15
38591 Sheridan RED GREEN LIGHT WY 12/19/2006 21:00
40417 Douglas ORANGE LIGHT WY 6/24/2007 0:30
61636 Alpine RED LIGHT WY 2/15/2012 21:05
63575 Bill GREEN LIGHT WY 6/20/2012 12:42
71785 Devils Tower GREEN LIGHT WY 8/12/2013 21:30
75664 Fort Washakie RED LIGHT WY 1/23/2014 23:00
76253 Evanston ORANGE LIGHT WY 2/25/2014 18:35
77802 Rock Springs ORANGE LIGHT WY 5/24/2014 22:00
46556 Rock Springs ORANGE OTHER WY 10/25/2008 21:00
53963 Cheyenne YELLOW OTHER WY 8/6/2010 0:30
20679 Evanston GREEN OVAL WY 10/24/2001 23:00
23134 Baggs RED OVAL WY 9/6/2002 21:00
45717 Laramie RED OVAL WY 8/20/2008 15:30
163 Green River GREEN SPHERE WY 7/3/1951 12:00
35099 Fox Park RED GREEN SPHERE WY 12/11/2005 21:15
57383 Green River RED GREEN SPHERE WY 6/16/2011 23:30
66994 Casper GREEN SPHERE WY 11/7/2012 19:30
74569 Laramie ORANGE SPHERE WY 12/7/2013 1:00
80082 Yellowstone National Park YELLOW TEARDROP WY 8/19/2014 7:40
... ... ... ... ... ...
20579 Fairbanks RED OTHER AK 10/11/2001 2:15
25724 Anchorage RED OTHER AK 8/10/2003 3:00
26770 Anchorage RED OTHER AK 10/22/2003 7:00
38099 North Pole GREEN OTHER AK 11/10/2006 23:15
3149 Ft. Yukon BLUE OVAL AK 2/13/1976 13:30
5459 Teller RED GREEN OVAL AK 6/20/1986 23:30
11112 Fairbanks GREEN OVAL AK 1/8/1998 22:38
67995 Fairbanks RED YELLOW OVAL AK 12/25/2012 16:30
72717 Anchorage RED ORANGE OVAL AK 9/21/2013 21:20
76790 Kenai RED OVAL AK 4/2/2014 11:00
33158 Susitna Lodge BLUE RECTANGLE AK 7/7/2005 18:45
3395 Soldotna RED SPHERE AK 10/20/1976 22:00
4716 Juneau ORANGE SPHERE AK 6/15/1982 14:00
5554 Girdwood ORANGE SPHERE AK 11/15/1986 14:30
11313 North Pole BLUE SPHERE AK 3/15/1998 20:00
12222 North Pole ORANGE SPHERE AK 9/24/1998 22:35
18127 North Pole ORANGE SPHERE AK 12/15/2000 16:36
27518 Anchorage ORANGE SPHERE AK 12/25/2003 0:00
51682 Willow RED SPHERE AK 1/1/2010 12:01
55791 Wasilla RED SPHERE AK 12/31/2010 21:00
60708 Fairbanks RED BLUE SPHERE AK 12/24/2011 2:00
61938 Palmer ORANGE SPHERE AK 3/12/2012 21:45
65707 Wasilla ORANGE SPHERE AK 9/8/2012 19:00
49308 Juneau RED TRIANGLE AK 6/27/2009 2:30
59064 Wasilla YELLOW TRIANGLE AK 9/11/2011 9:00
60830 North Pole ORANGE YELLOW TRIANGLE AK 12/30/2011 4:47
68192 Delta Junction ORANGE TRIANGLE AK 1/1/2013 1:00
76701 Bethel ORANGE TRIANGLE AK 3/28/2014 11:03
77045 Wasilla RED TRIANGLE AK 4/14/2014 22:45
18233 Anchorage RED VARIOUS AK 12/31/2000 21:00

15516 rows × 5 columns


In [98]:
# Drop missing values
ufo.dropna()             # Drop a row if ANY values are missing


Out[98]:
City Colors Shape State Time
7064 Mammoth Hot Springs YELLOW CHEVRON WY 6/4/1993 21:55
23452 Lander GREEN CIRCLE WY 10/6/2002 19:25
59084 Casper RED GREEN BLUE CIRCLE WY 9/12/2011 20:30
68629 Point of Rocks GREEN CIRCLE WY 2/14/2013 6:00
77097 Yellowstone North Entrance RED YELLOW CIRCLE WY 4/17/2014 18:00
23356 Fort Bridger GREEN DISK WY 9/28/2002 18:30
67644 Rock Springs RED DISK WY 12/7/2012 6:30
12241 Rock Springs GREEN FIREBALL WY 9/26/1998 3:05
23132 Gillette GREEN FIREBALL WY 9/6/2002 20:30
10729 Atlantic City RED LIGHT WY 8/22/1997 23:30
24173 Rock Springs RED GREEN LIGHT WY 1/8/2003 20:15
38591 Sheridan RED GREEN LIGHT WY 12/19/2006 21:00
40417 Douglas ORANGE LIGHT WY 6/24/2007 0:30
61636 Alpine RED LIGHT WY 2/15/2012 21:05
63575 Bill GREEN LIGHT WY 6/20/2012 12:42
71785 Devils Tower GREEN LIGHT WY 8/12/2013 21:30
75664 Fort Washakie RED LIGHT WY 1/23/2014 23:00
76253 Evanston ORANGE LIGHT WY 2/25/2014 18:35
77802 Rock Springs ORANGE LIGHT WY 5/24/2014 22:00
46556 Rock Springs ORANGE OTHER WY 10/25/2008 21:00
53963 Cheyenne YELLOW OTHER WY 8/6/2010 0:30
20679 Evanston GREEN OVAL WY 10/24/2001 23:00
23134 Baggs RED OVAL WY 9/6/2002 21:00
45717 Laramie RED OVAL WY 8/20/2008 15:30
163 Green River GREEN SPHERE WY 7/3/1951 12:00
35099 Fox Park RED GREEN SPHERE WY 12/11/2005 21:15
57383 Green River RED GREEN SPHERE WY 6/16/2011 23:30
66994 Casper GREEN SPHERE WY 11/7/2012 19:30
74569 Laramie ORANGE SPHERE WY 12/7/2013 1:00
80082 Yellowstone National Park YELLOW TEARDROP WY 8/19/2014 7:40
... ... ... ... ... ...
20579 Fairbanks RED OTHER AK 10/11/2001 2:15
25724 Anchorage RED OTHER AK 8/10/2003 3:00
26770 Anchorage RED OTHER AK 10/22/2003 7:00
38099 North Pole GREEN OTHER AK 11/10/2006 23:15
3149 Ft. Yukon BLUE OVAL AK 2/13/1976 13:30
5459 Teller RED GREEN OVAL AK 6/20/1986 23:30
11112 Fairbanks GREEN OVAL AK 1/8/1998 22:38
67995 Fairbanks RED YELLOW OVAL AK 12/25/2012 16:30
72717 Anchorage RED ORANGE OVAL AK 9/21/2013 21:20
76790 Kenai RED OVAL AK 4/2/2014 11:00
33158 Susitna Lodge BLUE RECTANGLE AK 7/7/2005 18:45
3395 Soldotna RED SPHERE AK 10/20/1976 22:00
4716 Juneau ORANGE SPHERE AK 6/15/1982 14:00
5554 Girdwood ORANGE SPHERE AK 11/15/1986 14:30
11313 North Pole BLUE SPHERE AK 3/15/1998 20:00
12222 North Pole ORANGE SPHERE AK 9/24/1998 22:35
18127 North Pole ORANGE SPHERE AK 12/15/2000 16:36
27518 Anchorage ORANGE SPHERE AK 12/25/2003 0:00
51682 Willow RED SPHERE AK 1/1/2010 12:01
55791 Wasilla RED SPHERE AK 12/31/2010 21:00
60708 Fairbanks RED BLUE SPHERE AK 12/24/2011 2:00
61938 Palmer ORANGE SPHERE AK 3/12/2012 21:45
65707 Wasilla ORANGE SPHERE AK 9/8/2012 19:00
49308 Juneau RED TRIANGLE AK 6/27/2009 2:30
59064 Wasilla YELLOW TRIANGLE AK 9/11/2011 9:00
60830 North Pole ORANGE YELLOW TRIANGLE AK 12/30/2011 4:47
68192 Delta Junction ORANGE TRIANGLE AK 1/1/2013 1:00
76701 Bethel ORANGE TRIANGLE AK 3/28/2014 11:03
77045 Wasilla RED TRIANGLE AK 4/14/2014 22:45
18233 Anchorage RED VARIOUS AK 12/31/2000 21:00

15510 rows × 5 columns


In [99]:
ufo.dropna(how='all')    # Drop a row only if ALL values are missing


Out[99]:
City Colors Shape State Time
7064 Mammoth Hot Springs YELLOW CHEVRON WY 6/4/1993 21:55
24658 Jackson Hole NaN CHEVRON WY 3/16/2003 11:00
61928 Rock Springs NaN CHEVRON WY 3/12/2012 1:02
73885 Casper NaN CHEVRON WY 11/7/2013 1:45
74651 Powell NaN CHEVRON WY 12/11/2013 18:25
1461 Wyoming NaN CIGAR WY 7/12/1967 2:00
12072 Cheyenne NaN CIGAR WY 9/4/1998 15:00
60064 Torrington NaN CIGAR WY 11/5/2011 21:30
72264 Wyoming NaN CIGAR WY 9/1/2013 0:00
1985 Cheyenne NaN CIRCLE WY 6/20/1970 2:00
3326 Casper NaN CIRCLE WY 7/25/1976 18:00
9399 Gillette NaN CIRCLE WY 6/15/1996 2:30
23452 Lander GREEN CIRCLE WY 10/6/2002 19:25
29083 Hanna NaN CIRCLE WY 6/5/2004 16:00
49543 Powell NaN CIRCLE WY 7/8/2009 4:25
49739 Laramie NaN CIRCLE WY 7/21/2009 22:15
52524 Evansville NaN CIRCLE WY 4/30/2010 21:30
57240 Mills NaN CIRCLE WY 6/3/2011 15:00
58355 Cheyenne NaN CIRCLE WY 8/5/2011 18:13
59084 Casper RED GREEN BLUE CIRCLE WY 9/12/2011 20:30
59194 La Barge NaN CIRCLE WY 9/18/2011 21:00
62612 Rock Springs NaN CIRCLE WY 4/21/2012 23:05
63577 Laramie NaN CIRCLE WY 6/20/2012 16:00
68629 Point of Rocks GREEN CIRCLE WY 2/14/2013 6:00
72481 Gillette NaN CIRCLE WY 9/11/2013 9:00
76250 Casper NaN CIRCLE WY 2/25/2014 2:20
77097 Yellowstone North Entrance RED YELLOW CIRCLE WY 4/17/2014 18:00
79897 Evanston NaN CONE WY 8/11/2014 19:45
3328 Cowley NaN CYLINDER WY 7/27/1976 1:00
13823 Buffalo NaN CYLINDER WY 7/1/1999 22:30
... ... ... ... ... ...
8821 Prince of Wales Island NaN NaN AK 10/27/1995 20:28
9129 Mountain Village NaN NaN AK 3/2/1996 0:30
9212 Nikishi RED NaN AK 4/12/1996 2:00
10262 North Pole NaN NaN AK 5/20/1997 15:00
13552 Wacker City NaN NaN AK 5/26/1999 18:00
15741 Juneau NaN NaN AK 1/18/2000 7:45
17589 Fairbanks NaN NaN AK 10/2/2000 23:00
18116 North Pole NaN NaN AK 12/12/2000 22:00
21296 Anchorage RED NaN AK 1/20/2002 20:00
23064 St. George Island NaN NaN AK 8/30/2002 23:30
31387 Fairbanks NaN NaN AK 12/20/2004 20:00
33534 Suriname NaN NaN AK 8/11/2005 6:11
34623 Glennallen NaN NaN AK 11/1/2005 0:30
44957 Fairbanks BLUE NaN AK 7/3/2008 16:40
47362 Fairbanks NaN NaN AK 12/29/2008 15:00
48773 Eagle River NaN NaN AK 5/9/2009 2:30
52781 Prince William Sound NaN NaN AK 5/29/2010 9:45
53051 Elk NaN NaN AK 6/25/2010 10:22
56186 Fairbanks NaN NaN AK 2/11/2011 8:00
56499 Hooper Bay NaN NaN AK 3/15/2011 0:03
61311 Anchorage ORANGE NaN AK 1/22/2012 20:15
65642 Fairbanks NaN NaN AK 9/3/2012 21:57
68292 Anchorage NaN NaN AK 1/7/2013 1:50
71252 Sterling NaN NaN AK 7/22/2013 3:13
73090 Anchorage NaN NaN AK 10/5/2013 2:55
73916 Fairbanks ORANGE BLUE NaN AK 11/8/2013 23:00
74639 Fairbanks ORANGE NaN AK 12/10/2013 23:15
74969 Anchorage NaN NaN AK 12/26/2013 18:21
75471 Talkitna NaN NaN AK 1/14/2014 23:00
77458 Anchorage NaN NaN AK 5/5/2014 16:00

80543 rows × 5 columns


In [100]:
# Fill in missing values for a series
ufo.Colors.fillna(value='Unknown', inplace=True)

In [101]:
# Fill in missing values for the DataFrame
ufo.fillna(value='Unknown', inplace=True)

Exercise: Working with the Drinks Data

(Be on the lookout for a curveball question)


In [31]:
# Read drinks.csv (in the 'drinks_data' folder) into a DataFrame called 'drinks'

In [32]:
# Print the first 10 rows

In [33]:
# Examine the data types of all columns

In [34]:
# Print the 'beer_servings' Series

In [35]:
# Calculate the average 'beer_servings' for the entire dataset

In [36]:
# Print all columns, but only show rows where the country is in Europe

In [37]:
# Calculate the average 'beer_servings' for all of Europe

In [38]:
# Only show European countries with 'wine_servings' greater than 300

In [39]:
# Determine which 10 countries have the highest 'total_litres_of_pure_alcohol'

In [40]:
# Determine which country has the highest value for 'beer_servings'

In [41]:
# Count the number of occurrences of each 'continent' value and see if it looks correct

In [42]:
# Determine which countries do not have continent designations

In [43]:
# Determine the number of countries per continent. Does it look right?

Solutions


In [102]:
# Read drinks.csv (in the drinks_data folder) into a DataFrame called 'drinks'
drinks = pd.read_csv('drinks_data/drinks.csv')

In [103]:
# Print the first 10 rows
drinks.head(10)


Out[103]:
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
0 Afghanistan 0 0 0 0.0 AS
1 Albania 89 132 54 4.9 EU
2 Algeria 25 0 14 0.7 AF
3 Andorra 245 138 312 12.4 EU
4 Angola 217 57 45 5.9 AF
5 Antigua & Barbuda 102 128 45 4.9 NaN
6 Argentina 193 25 221 8.3 SA
7 Armenia 21 179 11 3.8 EU
8 Australia 261 72 212 10.4 OC
9 Austria 279 75 191 9.7 EU

In [104]:
# Examine the data types of all columns
drinks.dtypes
drinks.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 193 entries, 0 to 192
Data columns (total 6 columns):
country                         193 non-null object
beer_servings                   193 non-null int64
spirit_servings                 193 non-null int64
wine_servings                   193 non-null int64
total_litres_of_pure_alcohol    193 non-null float64
continent                       170 non-null object
dtypes: float64(1), int64(3), object(2)
memory usage: 10.6+ KB

In [ ]:
# Print the 'beer_servings' Series
drinks.beer_servings
drinks['beer_servings']

In [107]:
# Calculate the average 'beer_servings' for the entire dataset
drinks.describe()                   # Mean is provided in the summary from describe()
drinks.beer_servings.mean()         # Alternatively, calculate the mean directly


Out[107]:
beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol
count 193.000000 193.000000 193.000000 193.000000
mean 106.160622 80.994819 49.450777 4.717098
std 101.143103 88.284312 79.697598 3.773298
min 0.000000 0.000000 0.000000 0.000000
25% 20.000000 4.000000 1.000000 1.300000
50% 76.000000 56.000000 8.000000 4.200000
75% 188.000000 128.000000 59.000000 7.200000
max 376.000000 438.000000 370.000000 14.400000

In [108]:
# Print all columns, but only show rows where the country is in Europe
drinks[drinks.continent=='EU']


Out[108]:
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
1 Albania 89 132 54 4.9 EU
3 Andorra 245 138 312 12.4 EU
7 Armenia 21 179 11 3.8 EU
9 Austria 279 75 191 9.7 EU
10 Azerbaijan 21 46 5 1.3 EU
15 Belarus 142 373 42 14.4 EU
16 Belgium 295 84 212 10.5 EU
21 Bosnia-Herzegovina 76 173 8 4.6 EU
25 Bulgaria 231 252 94 10.3 EU
42 Croatia 230 87 254 10.2 EU
44 Cyprus 192 154 113 8.2 EU
45 Czech Republic 361 170 134 11.8 EU
48 Denmark 224 81 278 10.4 EU
57 Estonia 224 194 59 9.5 EU
60 Finland 263 133 97 10.0 EU
61 France 127 151 370 11.8 EU
64 Georgia 52 100 149 5.4 EU
65 Germany 346 117 175 11.3 EU
67 Greece 133 112 218 8.3 EU
75 Hungary 234 215 185 11.3 EU
76 Iceland 233 61 78 6.6 EU
81 Ireland 313 118 165 11.4 EU
83 Italy 85 42 237 6.5 EU
93 Latvia 281 216 62 10.5 EU
98 Lithuania 343 244 56 12.9 EU
99 Luxembourg 236 133 271 11.4 EU
105 Malta 149 100 120 6.6 EU
111 Monaco 0 0 0 0.0 EU
113 Montenegro 31 114 128 4.9 EU
120 Netherlands 251 88 190 9.4 EU
126 Norway 169 71 129 6.7 EU
135 Poland 343 215 56 10.9 EU
136 Portugal 194 67 339 11.0 EU
139 Moldova 109 226 18 6.3 EU
140 Romania 297 122 167 10.4 EU
147 San Marino 0 0 0 0.0 EU
151 Serbia 283 131 127 9.6 EU
155 Slovakia 196 293 116 11.4 EU
156 Slovenia 270 51 276 10.6 EU
160 Spain 284 157 112 10.0 EU
165 Sweden 152 60 186 7.2 EU
166 Switzerland 185 100 280 10.2 EU
170 Macedonia 106 27 86 3.9 EU
180 Ukraine 206 237 45 8.9 EU
182 United Kingdom 219 126 195 10.4 EU

In [109]:
# Calculate the average 'beer_servings' for all of Europe (hint: use the .mean() function)
drinks[drinks.continent=='EU'].beer_servings.mean()


Out[109]:
193.77777777777777

In [110]:
# Only show European countries with 'wine_servings' greater than 300
drinks[(drinks.continent=='EU') & (drinks.wine_servings > 300)]


Out[110]:
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
3 Andorra 245 138 312 12.4 EU
61 France 127 151 370 11.8 EU
136 Portugal 194 67 339 11.0 EU

In [111]:
# Determine which 10 countries have the highest 'total_litres_of_pure_alcohol'
drinks.sort_index(by='total_litres_of_pure_alcohol').tail(10)


Out[111]:
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
99 Luxembourg 236 133 271 11.4 EU
155 Slovakia 196 293 116 11.4 EU
81 Ireland 313 118 165 11.4 EU
141 Russian Federation 247 326 73 11.5 AS
61 France 127 151 370 11.8 EU
45 Czech Republic 361 170 134 11.8 EU
68 Grenada 199 438 28 11.9 NaN
3 Andorra 245 138 312 12.4 EU
98 Lithuania 343 244 56 12.9 EU
15 Belarus 142 373 42 14.4 EU

In [112]:
# Determine which country has the highest value for 'beer_servings' (hint: use the .max() function)
drinks[drinks.beer_servings==drinks.beer_servings.max()].country


Out[112]:
117    Namibia
Name: country, dtype: object

In [113]:
drinks[['country', 'beer_servings']].sort_index(by='beer_servings', ascending=False).head(1) # This is equivalent


Out[113]:
country beer_servings
117 Namibia 376

In [114]:
# Count the number of occurrences of each 'continent' value and see if it looks correct
drinks.continent.value_counts()


Out[114]:
AF    53
EU    45
AS    44
OC    16
SA    12
dtype: int64

In [115]:
# Determine which countries do not have continent designations
drinks[drinks.continent.isnull()].country


Out[115]:
5                 Antigua & Barbuda
11                          Bahamas
14                         Barbados
17                           Belize
32                           Canada
41                       Costa Rica
43                             Cuba
50                         Dominica
51               Dominican Republic
54                      El Salvador
68                          Grenada
69                        Guatemala
73                            Haiti
74                         Honduras
84                          Jamaica
109                          Mexico
122                       Nicaragua
130                          Panama
143               St. Kitts & Nevis
144                       St. Lucia
145    St. Vincent & the Grenadines
174               Trinidad & Tobago
184                             USA
Name: country, dtype: object

In [116]:
# Due to "na_filter = True" default within pd.read_csv()
help(pd.read_csv)


Help on function read_csv in module pandas.io.parsers:

read_csv(filepath_or_buffer, sep=',', dialect=None, compression=None, doublequote=True, escapechar=None, quotechar='"', quoting=0, skipinitialspace=False, lineterminator=None, header='infer', index_col=None, names=None, prefix=None, skiprows=None, skipfooter=None, skip_footer=0, na_values=None, na_fvalues=None, true_values=None, false_values=None, delimiter=None, converters=None, dtype=None, usecols=None, engine=None, delim_whitespace=False, as_recarray=False, na_filter=True, compact_ints=False, use_unsigned=False, low_memory=True, buffer_lines=None, warn_bad_lines=True, error_bad_lines=True, keep_default_na=True, thousands=None, comment=None, decimal='.', parse_dates=False, keep_date_col=False, dayfirst=False, date_parser=None, memory_map=False, float_precision=None, nrows=None, iterator=False, chunksize=None, verbose=False, encoding=None, squeeze=False, mangle_dupe_cols=True, tupleize_cols=False, infer_datetime_format=False, skip_blank_lines=True)
    Read CSV (comma-separated) file into DataFrame
    
    Also supports optionally iterating or breaking of the file
    into chunks.
    
    Parameters
    ----------
    filepath_or_buffer : string or file handle / StringIO
        The string could be a URL. Valid URL schemes include
        http, ftp, s3, and file. For file URLs, a
        host is expected. For instance, a local file could be
        file ://localhost/path/to/table.csv
    sep : string, default ','
        Delimiter to use. If sep is None, will try to automatically determine
        this. Regular expressions are accepted.
    engine : {'c', 'python'}
        Parser engine to use. The C engine is faster while the python engine is
        currently more feature-complete.
    lineterminator : string (length 1), default None
        Character to break file into lines. Only valid with C parser
    quotechar : string (length 1)
        The character used to denote the start and end of a quoted item. Quoted
        items can include the delimiter and it will be ignored.
    quoting : int or csv.QUOTE_* instance, default None
        Control field quoting behavior per ``csv.QUOTE_*`` constants. Use one of
        QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).
        Default (None) results in QUOTE_MINIMAL behavior.
    skipinitialspace : boolean, default False
        Skip spaces after delimiter
    escapechar : string (length 1), default None
        One-character string used to escape delimiter when quoting is QUOTE_NONE.
    dtype : Type name or dict of column -> type
        Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32}
        (Unsupported with engine='python')
    compression : {'gzip', 'bz2', None}, default None
        For on-the-fly decompression of on-disk data
    dialect : string or csv.Dialect instance, default None
        If None defaults to Excel dialect. Ignored if sep longer than 1 char
        See csv.Dialect documentation for more details
    header : int, list of ints
        Row number(s) to use as the column names, and the start of the
        data.  Defaults to 0 if no ``names`` passed, otherwise ``None``. Explicitly
        pass ``header=0`` to be able to replace existing names. The header can be
        a list of integers that specify row locations for a multi-index on the
        columns E.g. [0,1,3]. Intervening rows that are not specified will be
        skipped (e.g. 2 in this example are skipped). Note that this parameter
        ignores commented lines and empty lines if ``skip_blank_lines=True``, so header=0
        denotes the first line of data rather than the first line of the file.
    skiprows : list-like or integer
        Line numbers to skip (0-indexed) or number of lines to skip (int)
        at the start of the file
    index_col : int or sequence or False, default None
        Column to use as the row labels of the DataFrame. If a sequence is given, a
        MultiIndex is used. If you have a malformed file with delimiters at the end
        of each line, you might consider index_col=False to force pandas to _not_
        use the first column as the index (row names)
    names : array-like
        List of column names to use. If file contains no header row, then you
        should explicitly pass header=None
    prefix : string, default None
        Prefix to add to column numbers when no header, e.g 'X' for X0, X1, ...
    na_values : list-like or dict, default None
        Additional strings to recognize as NA/NaN. If dict passed, specific
        per-column NA values
    true_values : list
        Values to consider as True
    false_values : list
        Values to consider as False
    keep_default_na : bool, default True
        If na_values are specified and keep_default_na is False the default NaN
        values are overridden, otherwise they're appended to
    parse_dates : boolean, list of ints or names, list of lists, or dict
        If True -> try parsing the index.
        If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column.
        If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.
        {'foo' : [1, 3]} -> parse columns 1, 3 as date and call result 'foo'
        A fast-path exists for iso8601-formatted dates.
    keep_date_col : boolean, default False
        If True and parse_dates specifies combining multiple columns then
        keep the original columns.
    date_parser : function
        Function to use for converting a sequence of string columns to an
        array of datetime instances. The default uses dateutil.parser.parser
        to do the conversion.
    dayfirst : boolean, default False
        DD/MM format dates, international and European format
    thousands : str, default None
        Thousands separator
    comment : str, default None
        Indicates remainder of line should not be parsed. If found at the
        beginning of a line, the line will be ignored altogether. This parameter
        must be a single character. Like empty lines (as long as ``skip_blank_lines=True``),
        fully commented lines are ignored by the parameter `header`
        but not by `skiprows`. For example, if comment='#', parsing
        '#empty\na,b,c\n1,2,3' with `header=0` will result in 'a,b,c' being
        treated as the header.
    decimal : str, default '.'
        Character to recognize as decimal point. E.g. use ',' for European data
    nrows : int, default None
        Number of rows of file to read. Useful for reading pieces of large files
    iterator : boolean, default False
        Return TextFileReader object
    chunksize : int, default None
        Return TextFileReader object for iteration
    skipfooter : int, default 0
        Number of lines at bottom of file to skip (Unsupported with engine='c')
    converters : dict, default None
        Dict of functions for converting values in certain columns. Keys can either
        be integers or column labels
    verbose : boolean, default False
        Indicate number of NA values placed in non-numeric columns
    delimiter : string, default None
        Alternative argument name for sep. Regular expressions are accepted.
    encoding : string, default None
        Encoding to use for UTF when reading/writing (ex. 'utf-8'). `List of Python
        standard encodings
        <https://docs.python.org/3/library/codecs.html#standard-encodings>`_
    squeeze : boolean, default False
        If the parsed data only contains one column then return a Series
    na_filter : boolean, default True
        Detect missing value markers (empty strings and the value of na_values). In
        data without any NAs, passing na_filter=False can improve the performance
        of reading a large file
    usecols : array-like
        Return a subset of the columns.
        Results in much faster parsing time and lower memory usage.
    mangle_dupe_cols : boolean, default True
        Duplicate columns will be specified as 'X.0'...'X.N', rather than 'X'...'X'
    tupleize_cols : boolean, default False
        Leave a list of tuples on columns as is (default is to convert to
        a Multi Index on the columns)
    error_bad_lines : boolean, default True
        Lines with too many fields (e.g. a csv line with too many commas) will by
        default cause an exception to be raised, and no DataFrame will be returned.
        If False, then these "bad lines" will dropped from the DataFrame that is
        returned. (Only valid with C parser)
    warn_bad_lines : boolean, default True
        If error_bad_lines is False, and warn_bad_lines is True, a warning for each
        "bad line" will be output. (Only valid with C parser).
    infer_datetime_format : boolean, default False
        If True and parse_dates is enabled for a column, attempt to infer
        the datetime format to speed up the processing
    skip_blank_lines : boolean, default True
        If True, skip over blank lines rather than interpreting as NaN values
    
    Returns
    -------
    result : DataFrame or TextParser

Indexing and Slicing Data

Create a new index


In [117]:
ufo.set_index('State', inplace=True)
ufo.index


Out[117]:
Index([u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', u'WY', ...], dtype='object')

In [118]:
ufo.index.is_unique


Out[118]:
False

In [119]:
ufo.sort_index(inplace=True)
ufo.head(25)


Out[119]:
City Colors Shape Time
State
AK Wainwright Unknown LIGHT 1/2/2002 21:32
AK Mountain Village Unknown Unknown 3/2/1996 0:30
AK Anchorage Unknown Unknown 5/5/2014 16:00
AK Talkitna Unknown Unknown 1/14/2014 23:00
AK Anchorage Unknown Unknown 12/26/2013 18:21
AK Fairbanks ORANGE Unknown 12/10/2013 23:15
AK Fairbanks ORANGE BLUE Unknown 11/8/2013 23:00
AK Anchorage Unknown Unknown 10/5/2013 2:55
AK Sterling Unknown Unknown 7/22/2013 3:13
AK Anchorage Unknown Unknown 1/7/2013 1:50
AK Fairbanks Unknown Unknown 9/3/2012 21:57
AK Anchorage ORANGE Unknown 1/22/2012 20:15
AK Hooper Bay Unknown Unknown 3/15/2011 0:03
AK Fairbanks Unknown Unknown 2/11/2011 8:00
AK Elk Unknown Unknown 6/25/2010 10:22
AK Prince William Sound Unknown Unknown 5/29/2010 9:45
AK Eagle River Unknown Unknown 5/9/2009 2:30
AK Fairbanks Unknown Unknown 12/29/2008 15:00
AK Fairbanks BLUE Unknown 7/3/2008 16:40
AK Glennallen Unknown Unknown 11/1/2005 0:30
AK Suriname Unknown Unknown 8/11/2005 6:11
AK Fairbanks Unknown Unknown 12/20/2004 20:00
AK St. George Island Unknown Unknown 8/30/2002 23:30
AK Anchorage RED Unknown 1/20/2002 20:00
AK North Pole Unknown Unknown 12/12/2000 22:00

loc: filter rows by LABEL, and select columns by LABEL


In [120]:
ufo.loc['FL',:]                                     # row with label FL`


Out[120]:
City Colors Shape Time
State
FL Plantation Unknown TEARDROP 11/5/2000 5:12
FL Wildwood RED GREEN Unknown 9/2/2014 5:00
FL Kissimmee Unknown Unknown 8/22/2014 21:10
FL Cape Canaveral RED GREEN Unknown 8/6/2014 22:00
FL Port Saint Lucie Unknown Unknown 8/5/2014 4:00
FL South Daytona RED Unknown 7/30/2014 23:15
FL Sarasota Unknown Unknown 7/26/2014 21:00
FL Jacksonville Unknown Unknown 7/22/2014 0:05
FL Ruskin Unknown Unknown 7/14/2014 21:15
FL Delray Beach ORANGE Unknown 7/4/2014 22:00
FL Clermont RED Unknown 7/4/2014 19:00
FL Nokomis Unknown Unknown 6/24/2014 12:15
FL Orlando Unknown Unknown 6/8/2014 10:15
FL Palm City Unknown Unknown 6/5/2014 21:30
FL Port Richey Unknown Unknown 5/7/2014 9:30
FL Clermont Unknown Unknown 5/4/2014 4:00
FL Orange Park Unknown Unknown 5/2/2014 2:37
FL Tarpon Springs Unknown Unknown 4/26/2014 21:45
FL Orlando RED Unknown 4/19/2014 21:30
FL New Port Richey Unknown Unknown 4/13/2014 22:30
FL Panama City Beach ORANGE Unknown 4/12/2014 20:00
FL Parrish ORANGE Unknown 4/11/2014 23:00
FL Lehigh Acres Unknown Unknown 4/2/2014 22:00
FL Port St. Lucie Unknown Unknown 4/1/2014 23:00
FL Celebration RED ORANGE Unknown 3/31/2014 21:00
FL Lehigh Acres Unknown Unknown 3/27/2014 23:30
FL Kissimmee Unknown Unknown 3/20/2014 22:00
FL Davie Unknown Unknown 3/19/2014 16:00
FL St. Cloud Unknown Unknown 3/9/2014 22:10
FL Tampa ORANGE Unknown 3/8/2014 21:25
... ... ... ... ...
FL St. George Island RED CHEVRON 5/23/2005 21:30
FL Miami Gardens Unknown CHEVRON 10/27/2004 22:00
FL Jacksonville Unknown CHEVRON 4/30/2004 16:00
FL Orlando Unknown CHEVRON 3/29/2004 5:15
FL Pensacola Unknown CHEVRON 11/8/2003 19:00
FL New Port Richey Unknown CHEVRON 8/14/2003 18:00
FL Pensacola Unknown CHEVRON 4/10/2003 23:15
FL Panama City Unknown CHEVRON 3/13/2003 23:00
FL Merritt Island Unknown CHEVRON 9/19/2002 21:00
FL St. James City ORANGE CHEVRON 5/10/2002 20:00
FL Hollywood Unknown CHEVRON 11/21/2001 23:45
FL Perry Unknown CHEVRON 11/20/2001 4:05
FL Panama City Beach ORANGE CHEVRON 11/20/2001 2:55
FL Melbourne Unknown CHEVRON 7/7/2001 21:45
FL Miami Unknown CHEVRON 10/13/2000 20:00
FL Warrington Unknown CHEVRON 8/18/2000 20:15
FL Delray Beach Unknown CHEVRON 4/19/2000 23:09
FL Jacksonville Unknown CHEVRON 10/17/1999 15:00
FL Jacksonville Unknown CHEVRON 10/17/1999 15:00
FL Winter Park Unknown CHEVRON 4/11/1999 23:49
FL St. Petersburg Unknown CHEVRON 3/23/1999 21:15
FL Ormond Beach Unknown CHEVRON 2/14/1997 20:00
FL Brooksville Unknown CHEVRON 4/15/1994 22:00
FL Cedar Key Unknown CHEVRON 1/12/1994 18:00
FL Miami Unknown CHEVRON 9/29/1989 6:15
FL Tallahassee Unknown CHEVRON 8/20/1987 21:00
FL Fort Lauderdale Unknown CHEVRON 12/3/1977 0:50
FL Opa Locka Unknown CHEVRON 1/1/1967 13:00
FL Jacksonville Unknown CHEVRON 11/12/1966 0:00
FL Opa Locka Unknown CHEVRON 11/1/1966 12:00

4804 rows × 4 columns


In [ ]:
ufo.loc[:'FL',:]                                    # rows with labels  through'FL'

In [ ]:
ufo.loc['FL':'HI', 'City':'Shape']         # rows FL, columns 'City' through 'Shape Reported'

In [ ]:
ufo.loc[:, 'City':'Shape']                 # all rows, columns 'City' through 'Shape Reported'

In [ ]:
ufo.loc[['FL', 'TX'], ['City','Shape']]    # rows FL and TX, columns 'City' and 'Shape Reported'

iloc: filter rows by POSITION, and select columns by POSITION


In [ ]:
ufo.iloc[0,:]                       # row with 0th position (first row)

In [ ]:
ufo.iloc[0:3,:]                     # rows with positions 0 through 2 (not 3)

In [ ]:
ufo.iloc[0:3, 0:3]                  # rows and columns with positions 0 through 2

In [ ]:
ufo.iloc[:, 0:3]                    # all rows, columns with positions 0 through 2

In [ ]:
ufo.iloc[[0,2], [0,1]]              # 1st and 3rd row, 1st and 2nd column

Add another level to the index


In [121]:
ufo.set_index('City', inplace=True, append=True) # Adds to existing index
ufo.sort_index(inplace=True)
ufo.head(25)


Out[121]:
Colors Shape Time
State City
AK Adak Unknown LIGHT 6/1/1986 0:00
Alaska Unknown CIGAR 4/29/2008 0:00
Alaska Unknown DISK 1/15/1998 13:00
Alaska Unknown LIGHT 9/1/1999 23:00
Alaska cruise Unknown LIGHT 5/11/2011 21:00
Anchor Point Unknown DISK 5/24/2005 18:30
Anchorage Unknown Unknown 5/5/2014 16:00
Anchorage Unknown Unknown 12/26/2013 18:21
Anchorage Unknown Unknown 10/5/2013 2:55
Anchorage Unknown Unknown 1/7/2013 1:50
Anchorage ORANGE Unknown 1/22/2012 20:15
Anchorage RED Unknown 1/20/2002 20:00
Anchorage Unknown CIGAR 3/29/2014 20:45
Anchorage BLUE CIRCLE 8/30/1972 1:30
Anchorage Unknown CIRCLE 10/31/1998 10:00
Anchorage Unknown CIRCLE 1/24/2003 8:30
Anchorage Unknown CIRCLE 8/16/2004 12:08
Anchorage Unknown CIRCLE 10/23/2006 21:00
Anchorage Unknown CIRCLE 8/6/2010 18:00
Anchorage Unknown CIRCLE 12/16/2010 12:00
Anchorage Unknown CIRCLE 10/21/2011 21:00
Anchorage Unknown CIRCLE 11/23/2011 11:00
Anchorage ORANGE CIRCLE 10/23/2013 20:00
Anchorage RED CIRCLE 12/19/2013 16:00
Anchorage RED CIRCLE 12/23/2013 22:22

In [122]:
ufo.loc[['ND', 'WY'],:] # Select all records from ND AND WY


Out[122]:
Colors Shape Time
State City
ND Amenia Unknown DISK 7/15/1978 18:00
Aneta Unknown CIRCLE 12/20/2001 19:30
Arthur Unknown Unknown 9/7/1958 19:00
Beach Unknown DISK 7/3/1996 11:30
Belcourt Unknown OTHER 6/18/2012 7:00
Belcourt Unknown CIRCLE 4/14/2012 6:00
Belfield Unknown LIGHT 7/17/1997 23:00
Belfield Unknown DISK 9/15/2000 19:00
Berthold Unknown CIGAR 7/15/1976 15:00
Bismarck ORANGE Unknown 11/24/1996 19:30
Bismarck Unknown VARIOUS 9/10/2008 22:30
Bismarck ORANGE TRIANGLE 4/25/2012 23:30
Bismarck Unknown TRIANGLE 9/29/2011 22:30
Bismarck RED RECTANGLE 6/30/2008 23:00
Bismarck Unknown OTHER 11/28/2003 23:45
Bismarck Unknown LIGHT 8/5/2013 23:15
Bismarck Unknown LIGHT 6/4/2010 22:00
Bismarck Unknown LIGHT 7/1/2000 2:30
Bismarck Unknown LIGHT 11/24/1996 18:07
Bismarck Unknown FORMATION 10/11/1996 16:00
Bismarck Unknown DISK 4/11/2009 11:20
Bismarck GREEN BLUE CIRCLE 8/14/2004 4:15
Brunswick County Unknown Unknown 10/25/1995 16:55
Burlington Unknown FIREBALL 8/14/1998 16:00
Cartwright Unknown RECTANGLE 5/20/1999 14:00
Casselton Unknown TRIANGLE 4/11/2012 21:58
Casselton Unknown TRIANGLE 4/11/2012 21:48
Casselton RED BLUE Unknown 8/24/2008 1:00
Central U.S. Unknown DIAMOND 4/2/1995 13:00
Colfax Unknown DISK 1/15/2008 21:00
... ... ... ... ...
WY Sheridan Unknown OVAL 9/6/2002 21:00
Sheridan Unknown OTHER 9/24/1998 0:15
Sheridan Unknown LIGHT 8/7/2012 22:20
Sheridan RED GREEN LIGHT 12/19/2006 21:00
Sheridan Unknown LIGHT 10/28/2004 2:00
Shirley Basin Road Unknown LIGHT 3/15/2006 23:30
Ten Sleep Unknown DISK 3/21/1998 20:00
Teton National Park Unknown OVAL 7/19/2008 12:00
Teton National Park Unknown OTHER 9/26/2005 11:00
Thermopolis Unknown Unknown 6/14/2007 23:00
Torrington Unknown CIGAR 11/5/2011 21:30
Wheeling Unknown Unknown 6/15/1984 20:30
Worland Unknown OVAL 2/15/2008 5:00
Worland Unknown LIGHT 6/17/2003 22:42
Wyoming Unknown OTHER 9/28/2005 16:30
Wyoming Unknown OTHER 4/20/1983 0:00
Wyoming Unknown CIGAR 9/1/2013 0:00
Wyoming Unknown CIGAR 7/12/1967 2:00
Yellowstone Unknown LIGHT 8/4/2013 22:00
Yellowstone National Park Unknown VARIOUS 9/20/2011 19:55
Yellowstone National Park YELLOW TEARDROP 8/19/2014 7:40
Yellowstone National Park Unknown SPHERE 9/29/2003 20:35
Yellowstone National Park Unknown LIGHT 2/15/2011 20:00
Yellowstone National Park Unknown FORMATION 6/19/2007 23:00
Yellowstone National Park Unknown FLASH 6/25/2011 23:30
Yellowstone National Park Unknown EGG 6/30/1971 14:00
Yellowstone North Entrance RED YELLOW CIRCLE 4/17/2014 18:00
Yellowstone Park Unknown DISK 7/15/1956 0:00
na Unknown TEARDROP 9/19/2003 14:00
remote Unknown FLASH 10/23/2008 4:45

394 rows × 3 columns


In [123]:
ufo.loc['ND':'WY',:]    # Select all records from ND THROUGH WY


Out[123]:
Colors Shape Time
State City
ND Amenia Unknown DISK 7/15/1978 18:00
Aneta Unknown CIRCLE 12/20/2001 19:30
Arthur Unknown Unknown 9/7/1958 19:00
Beach Unknown DISK 7/3/1996 11:30
Belcourt Unknown OTHER 6/18/2012 7:00
Belcourt Unknown CIRCLE 4/14/2012 6:00
Belfield Unknown LIGHT 7/17/1997 23:00
Belfield Unknown DISK 9/15/2000 19:00
Berthold Unknown CIGAR 7/15/1976 15:00
Bismarck ORANGE Unknown 11/24/1996 19:30
Bismarck Unknown VARIOUS 9/10/2008 22:30
Bismarck ORANGE TRIANGLE 4/25/2012 23:30
Bismarck Unknown TRIANGLE 9/29/2011 22:30
Bismarck RED RECTANGLE 6/30/2008 23:00
Bismarck Unknown OTHER 11/28/2003 23:45
Bismarck Unknown LIGHT 8/5/2013 23:15
Bismarck Unknown LIGHT 6/4/2010 22:00
Bismarck Unknown LIGHT 7/1/2000 2:30
Bismarck Unknown LIGHT 11/24/1996 18:07
Bismarck Unknown FORMATION 10/11/1996 16:00
Bismarck Unknown DISK 4/11/2009 11:20
Bismarck GREEN BLUE CIRCLE 8/14/2004 4:15
Brunswick County Unknown Unknown 10/25/1995 16:55
Burlington Unknown FIREBALL 8/14/1998 16:00
Cartwright Unknown RECTANGLE 5/20/1999 14:00
Casselton Unknown TRIANGLE 4/11/2012 21:58
Casselton Unknown TRIANGLE 4/11/2012 21:48
Casselton RED BLUE Unknown 8/24/2008 1:00
Central U.S. Unknown DIAMOND 4/2/1995 13:00
Colfax Unknown DISK 1/15/2008 21:00
... ... ... ... ...
WY Sheridan Unknown OVAL 9/6/2002 21:00
Sheridan Unknown OTHER 9/24/1998 0:15
Sheridan Unknown LIGHT 8/7/2012 22:20
Sheridan RED GREEN LIGHT 12/19/2006 21:00
Sheridan Unknown LIGHT 10/28/2004 2:00
Shirley Basin Road Unknown LIGHT 3/15/2006 23:30
Ten Sleep Unknown DISK 3/21/1998 20:00
Teton National Park Unknown OVAL 7/19/2008 12:00
Teton National Park Unknown OTHER 9/26/2005 11:00
Thermopolis Unknown Unknown 6/14/2007 23:00
Torrington Unknown CIGAR 11/5/2011 21:30
Wheeling Unknown Unknown 6/15/1984 20:30
Worland Unknown OVAL 2/15/2008 5:00
Worland Unknown LIGHT 6/17/2003 22:42
Wyoming Unknown OTHER 9/28/2005 16:30
Wyoming Unknown OTHER 4/20/1983 0:00
Wyoming Unknown CIGAR 9/1/2013 0:00
Wyoming Unknown CIGAR 7/12/1967 2:00
Yellowstone Unknown LIGHT 8/4/2013 22:00
Yellowstone National Park Unknown VARIOUS 9/20/2011 19:55
Yellowstone National Park YELLOW TEARDROP 8/19/2014 7:40
Yellowstone National Park Unknown SPHERE 9/29/2003 20:35
Yellowstone National Park Unknown LIGHT 2/15/2011 20:00
Yellowstone National Park Unknown FORMATION 6/19/2007 23:00
Yellowstone National Park Unknown FLASH 6/25/2011 23:30
Yellowstone National Park Unknown EGG 6/30/1971 14:00
Yellowstone North Entrance RED YELLOW CIRCLE 4/17/2014 18:00
Yellowstone Park Unknown DISK 7/15/1956 0:00
na Unknown TEARDROP 9/19/2003 14:00
remote Unknown FLASH 10/23/2008 4:45

34607 rows × 3 columns


In [124]:
ufo.loc[('ND', 'Bismarck'),:]  # Select all records from Bismark, ND


Out[124]:
Colors Shape Time
State City
ND Bismarck ORANGE Unknown 11/24/1996 19:30
Bismarck Unknown VARIOUS 9/10/2008 22:30
Bismarck ORANGE TRIANGLE 4/25/2012 23:30
Bismarck Unknown TRIANGLE 9/29/2011 22:30
Bismarck RED RECTANGLE 6/30/2008 23:00
Bismarck Unknown OTHER 11/28/2003 23:45
Bismarck Unknown LIGHT 8/5/2013 23:15
Bismarck Unknown LIGHT 6/4/2010 22:00
Bismarck Unknown LIGHT 7/1/2000 2:30
Bismarck Unknown LIGHT 11/24/1996 18:07
Bismarck Unknown FORMATION 10/11/1996 16:00
Bismarck Unknown DISK 4/11/2009 11:20
Bismarck GREEN BLUE CIRCLE 8/14/2004 4:15

In [125]:
ufo.loc[('ND', 'Bismarck'):('ND','Casselton'),:] # Select all records from Bismark, ND through Casselton, ND


Out[125]:
Colors Shape Time
State City
ND Bismarck ORANGE Unknown 11/24/1996 19:30
Bismarck Unknown VARIOUS 9/10/2008 22:30
Bismarck ORANGE TRIANGLE 4/25/2012 23:30
Bismarck Unknown TRIANGLE 9/29/2011 22:30
Bismarck RED RECTANGLE 6/30/2008 23:00
Bismarck Unknown OTHER 11/28/2003 23:45
Bismarck Unknown LIGHT 8/5/2013 23:15
Bismarck Unknown LIGHT 6/4/2010 22:00
Bismarck Unknown LIGHT 7/1/2000 2:30
Bismarck Unknown LIGHT 11/24/1996 18:07
Bismarck Unknown FORMATION 10/11/1996 16:00
Bismarck Unknown DISK 4/11/2009 11:20
Bismarck GREEN BLUE CIRCLE 8/14/2004 4:15
Brunswick County Unknown Unknown 10/25/1995 16:55
Burlington Unknown FIREBALL 8/14/1998 16:00
Cartwright Unknown RECTANGLE 5/20/1999 14:00
Casselton Unknown TRIANGLE 4/11/2012 21:58
Casselton Unknown TRIANGLE 4/11/2012 21:48
Casselton RED BLUE Unknown 8/24/2008 1:00

In [126]:
ufo.reset_index(level='City', inplace=True) # Remove the City from the index
ufo.head()


Out[126]:
City Colors Shape Time
State
AK Adak Unknown LIGHT 6/1/1986 0:00
AK Alaska Unknown CIGAR 4/29/2008 0:00
AK Alaska Unknown DISK 1/15/1998 13:00
AK Alaska Unknown LIGHT 9/1/1999 23:00
AK Alaska cruise Unknown LIGHT 5/11/2011 21:00

In [127]:
ufo.reset_index(inplace=True)               # Remove all columns from the index

In [130]:
ufo.head()


Out[130]:
State City Colors Shape Time
0 AK Adak Unknown LIGHT 6/1/1986 0:00
1 AK Alaska Unknown CIGAR 4/29/2008 0:00
2 AK Alaska Unknown DISK 1/15/1998 13:00
3 AK Alaska Unknown LIGHT 9/1/1999 23:00
4 AK Alaska cruise Unknown LIGHT 5/11/2011 21:00

Analyzing Across Time


In [129]:
# Reset the index
ufo.dtypes


Out[129]:
State     object
City      object
Colors    object
Shape     object
Time      object
dtype: object

In [131]:
# Convert Time column to date-time format (defined in Pandas)
# Reference: https://docs.python.org/2/library/time.html#time.strftime
ufo['Time'] = pd.to_datetime(ufo['Time'], format="%m/%d/%Y %H:%M")
ufo.dtypes


Out[131]:
State             object
City              object
Colors            object
Shape             object
Time      datetime64[ns]
dtype: object

In [132]:
# Compute date range
ufo.Time.min()


Out[132]:
Timestamp('2014-09-05 05:30:00')

In [133]:
ufo.Time.max()


Out[133]:
Timestamp('2014-09-05 05:30:00')

In [ ]:
# Slice using time
ufo[ufo.Time > pd.datetime(1995, 1, 1)]  # Slice using the time

In [ ]:
ufo[(ufo.Time > pd.datetime(1995, 1, 1)) & (ufo.State =='TX')]  # Works with other logical conditions, as expected

In [134]:
# Set the index to time
ufo.set_index('Time', inplace=True)
ufo.sort_index(inplace=True)
ufo.head()


Out[134]:
State City Colors Shape
Time
1930-06-01 22:00:00 NY Ithaca Unknown TRIANGLE
1930-06-30 20:00:00 NJ Willingboro Unknown OTHER
1931-02-15 14:00:00 CO Holyoke Unknown OVAL
1931-06-01 13:00:00 KS Abilene Unknown DISK
1933-04-18 19:00:00 NY New York Worlds Fair Unknown LIGHT

In [135]:
# Access particular times/ranges
ufo.loc['1995',:]
ufo.loc['1995-01',:]
ufo.loc['1995-01-01',:]


Out[135]:
State City Colors Shape
Time
1995-01-01 00:02:00 CA Pacific Grove Unknown Unknown
1995-01-01 01:00:00 NV Mt. Charleston Unknown OVAL
1995-01-01 02:00:00 CA El Granada Unknown FIREBALL
1995-01-01 03:00:00 KS Kansas Unknown LIGHT
1995-01-01 06:00:00 SC Travelers Rest Unknown TRIANGLE
1995-01-01 10:00:00 CA Oxnard Unknown VARIOUS
1995-01-01 15:00:00 SC Greenwood RED SPHERE
1995-01-01 15:00:00 FL Cedar Key Unknown TRIANGLE
1995-01-01 21:00:00 AZ Tucson Unknown TRIANGLE
1995-01-01 22:00:00 WA Shelton Unknown Unknown
1995-01-01 22:45:00 CA Anaheim Unknown Unknown
1995-01-01 23:00:00 NY Southampton Unknown LIGHT

In [136]:
# Access range of times/ranges
ufo.loc['1995':,:]
ufo.loc['1995':'1996',:]
ufo.loc['1995-12-01':'1996-01',:]


Out[136]:
State City Colors Shape
Time
1995-12-01 17:20:00 WI Waukesha Unknown Unknown
1995-12-01 21:00:00 CA Mount Shasta Unknown Unknown
1995-12-01 22:55:00 MI Owosso Unknown LIGHT
1995-12-01 23:40:00 IL Orland Park Unknown Unknown
1995-12-02 22:00:00 CA Sebastopal Unknown Unknown
1995-12-03 11:20:00 AZ Chandler Unknown Unknown
1995-12-03 21:00:00 OR Boring Unknown Unknown
1995-12-03 22:00:00 WA Nisqually Delta Unknown Unknown
1995-12-04 02:00:00 NM Deming Unknown VARIOUS
1995-12-04 02:14:00 WA Seattle Unknown LIGHT
1995-12-04 22:16:00 NY Geneva Unknown Unknown
1995-12-05 20:30:00 WA North Bend Unknown CHEVRON
1995-12-05 20:30:00 WA North Bend Unknown Unknown
1995-12-06 17:00:00 WA Bothell Unknown Unknown
1995-12-06 22:45:00 WA Lynnwood Unknown Unknown
1995-12-07 20:40:00 FL Oldsmar ORANGE BLUE LIGHT
1995-12-08 18:00:00 CA Unknown Unknown VARIOUS
1995-12-09 18:30:00 IL Springfield Unknown Unknown
1995-12-10 05:00:00 SD Rapid City BLUE CIRCLE
1995-12-10 06:00:00 NM Clayton Unknown Unknown
1995-12-10 18:12:00 AZ Phoenix Unknown Unknown
1995-12-10 22:45:00 TX Houston Unknown Unknown
1995-12-10 23:00:00 IL Jolliet RED Unknown
1995-12-12 22:30:00 NY Glen Unknown Unknown
1995-12-13 18:45:00 PA Upper Makefield Unknown Unknown
1995-12-14 04:24:00 ID Boise Unknown Unknown
1995-12-14 14:45:00 CA Pomona Unknown ROUND
1995-12-14 21:40:00 NV Las Vegas BLUE Unknown
1995-12-15 19:30:00 MD Elkridge GREEN FIREBALL
1995-12-15 20:50:00 KY Hawesville ORANGE VARIOUS
... ... ... ... ...
1996-01-13 05:45:00 MO Lancaster Unknown Unknown
1996-01-13 20:22:00 MO Osage Beach Unknown Unknown
1996-01-13 21:00:00 OR Selma Unknown TRIANGLE
1996-01-14 06:05:00 SC Charleston Unknown Unknown
1996-01-14 17:00:00 VA Chesterfield Unknown Unknown
1996-01-14 21:00:00 IL Collinsville Unknown Unknown
1996-01-15 00:00:00 FL Panama City Beach Unknown CIRCLE
1996-01-15 02:30:00 MN Hill City Unknown OVAL
1996-01-15 05:00:00 WA Spanaway Unknown Unknown
1996-01-15 18:00:00 TX Loredo RED FORMATION
1996-01-15 22:00:00 AL Birmingham Unknown TEARDROP
1996-01-16 07:40:00 WA Seattle Unknown Unknown
1996-01-16 22:32:00 WA Monitor ORANGE Unknown
1996-01-16 22:32:00 WA Monitor Unknown Unknown
1996-01-18 03:15:00 CA San Diego Unknown Unknown
1996-01-18 06:30:00 AR Fayetteville Unknown Unknown
1996-01-20 21:44:00 NJ Stockton Unknown LIGHT
1996-01-20 22:00:00 WI Madison Unknown Unknown
1996-01-22 11:00:00 CA Fullerton Unknown TRIANGLE
1996-01-22 21:00:00 CA Redding Unknown Unknown
1996-01-22 21:30:00 MO Clark Unknown Unknown
1996-01-23 20:15:00 ND Minot Unknown Unknown
1996-01-25 15:04:00 TN Bartlett Unknown Unknown
1996-01-25 21:15:00 CA Redding RED Unknown
1996-01-25 21:30:00 AZ Phoenix Unknown Unknown
1996-01-26 23:59:00 ME Sedgewick Unknown Unknown
1996-01-28 00:00:00 AZ Chandler Unknown Unknown
1996-01-30 00:00:00 OR Oregon City Unknown Unknown
1996-01-31 02:00:00 AL Heflin Unknown LIGHT
1996-01-31 20:15:00 FL Buckhead Ridge Unknown LIGHT

103 rows × 4 columns


In [137]:
# Access elements of the timestamp
# Reference: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-date-components
ufo.index.year
ufo.index.month
ufo.index.weekday
ufo.index.day
ufo.index.time
ufo.index.hour


Out[137]:
array([22, 20, 14, ...,  2,  3,  5], dtype=int32)

In [138]:
# Create a new variable with time element
ufo['Year'] = ufo.index.year
ufo['Month'] = ufo.index.month
ufo['Day'] = ufo.index.day
ufo['Weekday'] = ufo.index.weekday
ufo['Hour'] = ufo.index.hour

Split-Apply-Combine

Drawing by Hadley Wickham


In [141]:
# For each year, calculate the count of sightings
ufo.groupby('Year').City.count()


Out[141]:
2012    7263
2013    7003
2014    5382
2011    5089
2008    4655
2009    4251
2010    4154
2007    4058
2004    3850
2005    3787
2003    3507
2006    3445
2002    2933
2001    2925
1999    2774
...
1948    9
1945    9
1946    8
1944    8
1943    5
1939    3
1942    3
1931    2
1937    2
1941    2
1936    2
1930    2
1935    1
1933    1
1934    1
Length: 82, dtype: int64

In [ ]:
# For each Shape, calculate the first sighting, last sighting, and range of sightings. 
ufo.groupby('Shape').Year.min()
ufo.groupby('Shape').Year.max()

In [ ]:
# Specify the variable outside of the apply statement
ufo.groupby('Shape').Year.apply(lambda x: x.max())

In [ ]:
# Specifiy the variable within the apply statement
ufo.groupby('Shape').apply(lambda x: x.Year.max() - x.Year.min())

In [ ]:
# Specify a custom function to use in the apply statement
def get_max_year(df):
    try:
        return df.Year.max()
    except:
        return ''
ufo.groupby('Shape').apply(lambda x: get_max_year(x))

In [ ]:
# Split/combine can occur on multiple columns at the same time
ufo.groupby(['Weekday','Hour']).City.count()

Merging Data


In [ ]:
# Read in population data
pop = pd.read_csv('population.csv')
pop.head()

In [ ]:
ufo.head()

In [ ]:
# Merge the data together
ufo = pd.merge(ufo, pop, on='State', how = 'left')

In [ ]:
# Specify keys if columns have different names
ufo = pd.merge(ufo, pop, left_on='State', right_on='State', how = 'left')

In [ ]:
# Observe the new Population column
ufo.head()

In [ ]:
# Check for values that didn't make it (length)
ufo.Population.isnull().sum()

In [ ]:
# Check for values that didn't make it (values)
ufo[ufo.Population.isnull()]

In [ ]:
# Change the records that didn't match up using np.where command
ufo['State'] = np.where(ufo['State'] == 'Fl', 'FL', ufo['State'])

In [ ]:
# Alternatively, change the state using native python string functionality
ufo['State'] = ufo['State'].str.upper()

In [ ]:
# Merge again, this time get all of the records
ufo = pd.merge(ufo, pop, on='State', how = 'left')

Writing Data


In [ ]:
ufo.to_csv('ufo_new.csv')

In [ ]:
ufo.to_csv('ufo_new.csv', index=False)  # Index is not included in the csv

Other Useful Features

Detect duplicate rows


In [ ]:
ufo.duplicated()                                # Series of logicals

In [ ]:
ufo.duplicated().sum()                          # count of duplicates

In [ ]:
ufo[ufo.duplicated(['State','Time'])]           # only show duplicates

In [ ]:
ufo[ufo.duplicated()==False]                    # only show unique rows

In [ ]:
ufo_unique = ufo[~ufo.duplicated()]             # only show unique rows

In [ ]:
ufo.duplicated(['State','Time']).sum()          # columns for identifying duplicates

Map existing values to other values


In [146]:
ufo['Weekday'] = ufo.Weekday.map({  0:'Mon', 1:'Tue', 2:'Wed', 
                                    3:'Thu', 4:'Fri', 5:'Sat', 
                                    6:'Sun'})

Pivot rows to columns


In [ ]:
ufo.groupby(['Weekday','Hour']).City.count()

In [ ]:
ufo.groupby(['Weekday','Hour']).City.count().unstack(0) # Make first row level a column

In [ ]:
ufo.groupby(['Weekday','Hour']).City.count().unstack(1) # Make second row level a column
# Note: .stack() transforms columns to rows

Randomly sample a DataFrame


In [ ]:
idxs = np.random.rand(len(ufo)) < 0.66   # create a Series of booleans
train = ufo[idxs]                        # will contain about 66% of the rows
test = ufo[~idxs]                        # will contain the remaining rows

Replace all instances of a value


In [ ]:
ufo.Shape.replace('DELTA', 'TRIANGLE')   # replace values in a Series

In [ ]:
ufo.replace('PYRAMID', 'TRIANGLE')       # replace values throughout a DataFrame

One more thing...


In [143]:
%matplotlib inline

In [144]:
# Plot the number of sightings over time
ufo.groupby('Year').City.count().plot(  kind='line', 
                                        color='r', 
                                        linewidth=2, 
                                        title='UFO Sightings by year')


Out[144]:
<matplotlib.axes._subplots.AxesSubplot at 0x113e6a150>

In [147]:
# Plot the number of sightings over the day of week and time of day
ufo.groupby(['Weekday','Hour']).City.count().unstack(0).plot(   kind='line', 
                                                                linewidth=2,
                                                                title='UFO Sightings by Time of Day')


Out[147]:
<matplotlib.axes._subplots.AxesSubplot at 0x11586ef10>

In [148]:
# Plot multiple plots on the same plot (plots neeed to be in column format)              
ufo_fourth = ufo[(ufo.Year.isin([2011, 2012, 2013, 2014])) & (ufo.Month == 7)]
ufo_fourth.groupby(['Year', 'Day']).City.count().unstack(0).plot(   kind = 'bar',
                                                                    subplots=True,
                                                                    figsize=(7,9))


Out[148]:
array([<matplotlib.axes._subplots.AxesSubplot object at 0x113abbcd0>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x116da5a50>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x116e28750>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x116e7dc90>], dtype=object)