In [1]:

    
%xmode plain









    



Exception reporting mode: Plain

Pandas: Basics, Reading Data

Pandas documentation

Pandas is a Python library for convenient data analysis, visualization, and formatting.

Pandas is kind of like R combined with SQL, but in Python.

Imports

To display plots in the Jupyter/IPython notebook, we must execute the following line first.



In [2]:

    
%matplotlib inline

To use Pandas, we must import the library:



In [3]:

    
import pandas

Creating a DataFrame

Pandas documentation: DataFrame

The primary data structure in Pandas is the DataFrame. It contains data similar to a SQL table, a CSV file, or a spreadsheet.

There are a few methods in which we can create a DataFrame.

List of Dictionaries

From list_of_dicts, each dictionary is one row. The dictionary key represents the column name, and the value represents the value for the row and column.



In [4]:

    
list_of_dicts = [{'a': 1, 'b': 2, 'c': 3}, {'a': 4, 'b': 5, 'c': 6, 'd': 7}]
df = pandas.DataFrame(list_of_dicts)



In [5]:

    
df

DataFrames consist of rows and columns, just like a SQL table or spreadsheet. The columns are fields, and the rows are records or observations. In the above example, the data frame df has two rows and four columns.

If a particular dictionary in that list doesn't have a value for that key, then it assigns NaN ("not a number") for that index-column pair. As we see above, column d was absent in row 0. Therefore, it assigned that cell NaN.

Another example:



In [6]:

    
list_of_dicts = [{'a': 1}, {'b': 5}]
df = pandas.DataFrame(list_of_dicts)



In [7]:

    
df

List of Lists

From list_of_lists, each inner list represents one row. column_names represents the column names.



In [8]:

    
list_of_lists = [[1, 2, 3], [4, 5, 6]]
column_names = ['a', 'b', 'c']
df = pandas.DataFrame(data=list_of_lists, columns=column_names)



In [9]:

    
df

If the number of columns exceeds the number of values with an inner list, Pandas will fill empty values with NaN:



In [10]:

    
list_of_lists = [[1, 2], [4, 5, 6]]
column_names = ['a', 'b', 'c']
df = pandas.DataFrame(data=list_of_lists, columns=column_names)



In [11]:

    
df

External Data: CSV

Pandas documentation: Input/Output

To load data from a CSV file directly to a DataFrame, where the first row of the CSV is a header row containing column names:

df = pandas.read_csv('my_data.csv')

If there is no header row, you must name each column manually:

df = pandas.read_csv('my_data.csv', header=None, names=['col1', 'col2', ...., 'col12'])

In the following example, the CSV file uses a semicolon as a field delimiter:



In [12]:

    
df = pandas.read_csv('data/red_wine.csv', delimiter=';')

External Data: SQL

We will use SQLAlchemy, a Python library that makes it convenient to read SQL databases (SQLite, MySQL, PostgreSQL) in Python.

To install SQLAlchemy, at the prompt (Mac users: the Terminal prompt; Windows users: the Anaconda prompt):

pip install sqlalchemy

If Mac users get a permissions error, then try:

sudo pip install sqlalchemy

Import:



In [13]:

    
import sqlalchemy

External Data: SQLite

Let's begin with a simple example using SQLite.

engine is an object that makes connections to the database. We must specify the location of the SQLite database. (A SQLite database is a single file.)

The engine string takes the following format:

engine = create_engine('sqlite:///path_to_database')



In [14]:

    
engine = sqlalchemy.create_engine('sqlite:///data/red_wine.db')

read_sql allows us to use a SQL query string to query to database.



In [15]:

    
df = pandas.read_sql('select * from Wine limit 3', engine)



In [16]:

    
df









    Out[16]:






  
    
      
      fixed acidity
      volatile acidity
      citric acid
      residual sugar
      chlorides
      free sulfur dioxide
      total sulfur dioxide
      density
      pH
      sulphates
      alcohol
      quality
    
  
  
    
      0
      7.4
      0.70
      0.00
      1.9
      0.076
      11.0
      34.0
      0.9978
      3.51
      0.56
      9.4
      5
    
    
      1
      7.8
      0.88
      0.00
      2.6
      0.098
      25.0
      67.0
      0.9968
      3.20
      0.68
      9.8
      5
    
    
      2
      7.8
      0.76
      0.04
      2.3
      0.092
      15.0
      54.0
      0.9970
      3.26
      0.65
      9.8
      5



In [17]:

    
df = pandas.read_sql('select chlorides, density, alcohol from Wine limit 3', engine)



In [18]:

    
df

External Data: PostgreSQL

The engine string takes the following format:

engine = create_engine('postgresql://username:password@host:port/database')



In [19]:

    
engine = sqlalchemy.create_engine('postgresql://steve@localhost:5432/winequalityred')
df = pandas.read_sql('select * from "Wine" limit 3', engine)



In [20]:

    
df









    Out[20]:






  
    
      
      fixed acidity
      volatile acidity
      citric acid
      residual sugar
      chlorides
      free sulfur dioxide
      total sulfur dioxide
      density
      pH
      sulphates
      alcohol
      quality
    
  
  
    
      0
      7.4
      0.70
      0.00
      1.9
      0.076
      11.0
      34.0
      0.9978
      3.51
      0.56
      9.4
      5
    
    
      1
      7.8
      0.88
      0.00
      2.6
      0.098
      25.0
      67.0
      0.9968
      3.20
      0.68
      9.8
      5
    
    
      2
      7.8
      0.76
      0.04
      2.3
      0.092
      15.0
      54.0
      0.9970
      3.26
      0.65
      9.8
      5



In [21]:

    
engine = sqlalchemy.create_engine('postgresql://steve@localhost:5432/winequalitywhite')
df = pandas.read_sql('select * from "Wine" limit 3', engine)



In [22]:

    
df









    Out[22]:






  
    
      
      fixed acidity
      volatile acidity
      citric acid
      residual sugar
      chlorides
      free sulfur dioxide
      total sulfur dioxide
      density
      pH
      sulphates
      alcohol
      quality
    
  
  
    
      0
      7.0
      0.27
      0.36
      20.7
      0.045
      45.0
      170.0
      1.0010
      3.00
      0.45
      8.8
      6
    
    
      1
      6.3
      0.30
      0.34
      1.6
      0.049
      14.0
      132.0
      0.9940
      3.30
      0.49
      9.5
      6
    
    
      2
      8.1
      0.28
      0.40
      6.9
      0.050
      30.0
      97.0
      0.9951
      3.26
      0.44
      10.1
      6

DataFrame Properties

Load a data frame:



In [23]:

    
df = pandas.read_csv('data/red_wine.csv', delimiter=';')

The shape attribute returns a tuple containing the number of rows and columns:



In [24]:

    
df.shape









    Out[24]:





(1599, 12)

The columns attribute returns a list of column names:



In [25]:

    
df.columns









    Out[25]:





Index([u'fixed acidity', u'volatile acidity', u'citric acid',
       u'residual sugar', u'chlorides', u'free sulfur dioxide',
       u'total sulfur dioxide', u'density', u'pH', u'sulphates', u'alcohol',
       u'quality'],
      dtype='object')

DataFrame Methods

The describe method provides a statistical summary of the data frame:



In [26]:

    
df.describe()









    Out[26]:






  
    
      
      fixed acidity
      volatile acidity
      citric acid
      residual sugar
      chlorides
      free sulfur dioxide
      total sulfur dioxide
      density
      pH
      sulphates
      alcohol
      quality
    
  
  
    
      count
      1599.000000
      1599.000000
      1599.000000
      1599.000000
      1599.000000
      1599.000000
      1599.000000
      1599.000000
      1599.000000
      1599.000000
      1599.000000
      1599.000000
    
    
      mean
      8.319637
      0.527821
      0.270976
      2.538806
      0.087467
      15.874922
      46.467792
      0.996747
      3.311113
      0.658149
      10.422983
      5.636023
    
    
      std
      1.741096
      0.179060
      0.194801
      1.409928
      0.047065
      10.460157
      32.895324
      0.001887
      0.154386
      0.169507
      1.065668
      0.807569
    
    
      min
      4.600000
      0.120000
      0.000000
      0.900000
      0.012000
      1.000000
      6.000000
      0.990070
      2.740000
      0.330000
      8.400000
      3.000000
    
    
      25%
      7.100000
      0.390000
      0.090000
      1.900000
      0.070000
      7.000000
      22.000000
      0.995600
      3.210000
      0.550000
      9.500000
      5.000000
    
    
      50%
      7.900000
      0.520000
      0.260000
      2.200000
      0.079000
      14.000000
      38.000000
      0.996750
      3.310000
      0.620000
      10.200000
      6.000000
    
    
      75%
      9.200000
      0.640000
      0.420000
      2.600000
      0.090000
      21.000000
      62.000000
      0.997835
      3.400000
      0.730000
      11.100000
      6.000000
    
    
      max
      15.900000
      1.580000
      1.000000
      15.500000
      0.611000
      72.000000
      289.000000
      1.003690
      4.010000
      2.000000
      14.900000
      8.000000

The head method shows the first n rows (by default, n is 5). This method is good to get an immediate feel for the contents of the data frame:



In [27]:

    
df.head(2)









    Out[27]:






  
    
      
      fixed acidity
      volatile acidity
      citric acid
      residual sugar
      chlorides
      free sulfur dioxide
      total sulfur dioxide
      density
      pH
      sulphates
      alcohol
      quality
    
  
  
    
      0
      7.4
      0.70
      0.0
      1.9
      0.076
      11.0
      34.0
      0.9978
      3.51
      0.56
      9.4
      5
    
    
      1
      7.8
      0.88
      0.0
      2.6
      0.098
      25.0
      67.0
      0.9968
      3.20
      0.68
      9.8
      5



In [28]:

    
df.head()









    Out[28]:






  
    
      
      fixed acidity
      volatile acidity
      citric acid
      residual sugar
      chlorides
      free sulfur dioxide
      total sulfur dioxide
      density
      pH
      sulphates
      alcohol
      quality
    
  
  
    
      0
      7.4
      0.70
      0.00
      1.9
      0.076
      11.0
      34.0
      0.9978
      3.51
      0.56
      9.4
      5
    
    
      1
      7.8
      0.88
      0.00
      2.6
      0.098
      25.0
      67.0
      0.9968
      3.20
      0.68
      9.8
      5
    
    
      2
      7.8
      0.76
      0.04
      2.3
      0.092
      15.0
      54.0
      0.9970
      3.26
      0.65
      9.8
      5
    
    
      3
      11.2
      0.28
      0.56
      1.9
      0.075
      17.0
      60.0
      0.9980
      3.16
      0.58
      9.8
      6
    
    
      4
      7.4
      0.70
      0.00
      1.9
      0.076
      11.0
      34.0
      0.9978
      3.51
      0.56
      9.4
      5

The tail method shows the last n rows (by default, n is 5).



In [29]:

    
df.tail()









    Out[29]:






  
    
      
      fixed acidity
      volatile acidity
      citric acid
      residual sugar
      chlorides
      free sulfur dioxide
      total sulfur dioxide
      density
      pH
      sulphates
      alcohol
      quality
    
  
  
    
      1594
      6.2
      0.600
      0.08
      2.0
      0.090
      32.0
      44.0
      0.99490
      3.45
      0.58
      10.5
      5
    
    
      1595
      5.9
      0.550
      0.10
      2.2
      0.062
      39.0
      51.0
      0.99512
      3.52
      0.76
      11.2
      6
    
    
      1596
      6.3
      0.510
      0.13
      2.3
      0.076
      29.0
      40.0
      0.99574
      3.42
      0.75
      11.0
      6
    
    
      1597
      5.9
      0.645
      0.12
      2.0
      0.075
      32.0
      44.0
      0.99547
      3.57
      0.71
      10.2
      5
    
    
      1598
      6.0
      0.310
      0.47
      3.6
      0.067
      18.0
      42.0
      0.99549
      3.39
      0.66
      11.0
      6

Back to Index

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality
0	7.4	0.70	0.00	1.9	0.076	11.0	34.0	0.9978	3.51	0.56	9.4	5
1	7.8	0.88	0.00	2.6	0.098	25.0	67.0	0.9968	3.20	0.68	9.8	5
2	7.8	0.76	0.04	2.3	0.092	15.0	54.0	0.9970	3.26	0.65	9.8	5

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality
0	7.0	0.27	0.36	20.7	0.045	45.0	170.0	1.0010	3.00	0.45	8.8	6
1	6.3	0.30	0.34	1.6	0.049	14.0	132.0	0.9940	3.30	0.49	9.5	6
2	8.1	0.28	0.40	6.9	0.050	30.0	97.0	0.9951	3.26	0.44	10.1	6

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality
count	1599.000000	1599.000000	1599.000000	1599.000000	1599.000000	1599.000000	1599.000000	1599.000000	1599.000000	1599.000000	1599.000000	1599.000000
mean	8.319637	0.527821	0.270976	2.538806	0.087467	15.874922	46.467792	0.996747	3.311113	0.658149	10.422983	5.636023
std	1.741096	0.179060	0.194801	1.409928	0.047065	10.460157	32.895324	0.001887	0.154386	0.169507	1.065668	0.807569
min	4.600000	0.120000	0.000000	0.900000	0.012000	1.000000	6.000000	0.990070	2.740000	0.330000	8.400000	3.000000
25%	7.100000	0.390000	0.090000	1.900000	0.070000	7.000000	22.000000	0.995600	3.210000	0.550000	9.500000	5.000000
50%	7.900000	0.520000	0.260000	2.200000	0.079000	14.000000	38.000000	0.996750	3.310000	0.620000	10.200000	6.000000
75%	9.200000	0.640000	0.420000	2.600000	0.090000	21.000000	62.000000	0.997835	3.400000	0.730000	11.100000	6.000000
max	15.900000	1.580000	1.000000	15.500000	0.611000	72.000000	289.000000	1.003690	4.010000	2.000000	14.900000	8.000000

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality
1594	6.2	0.600	0.08	2.0	0.090	32.0	44.0	0.99490	3.45	0.58	10.5	5
1595	5.9	0.550	0.10	2.2	0.062	39.0	51.0	0.99512	3.52	0.76	11.2	6
1596	6.3	0.510	0.13	2.3	0.076	29.0	40.0	0.99574	3.42	0.75	11.0	6
1597	5.9	0.645	0.12	2.0	0.075	32.0	44.0	0.99547	3.57	0.71	10.2	5
1598	6.0	0.310	0.47	3.6	0.067	18.0	42.0	0.99549	3.39	0.66	11.0	6