In [1]:
%xmode plain
%matplotlib inline
import pandas, sqlalchemy


Exception reporting mode: Plain

Pandas: Analysis and Visualization

Load a data frame:


In [2]:
df = pandas.read_csv('data/red_wine.csv', delimiter=';')

Indexing

To retrieve a column, we use square brackets, similar to indexing a dictionary.


In [3]:
df['chlorides']


Out[3]:
0       0.076
1       0.098
2       0.092
3       0.075
4       0.076
5       0.075
6       0.069
7       0.065
8       0.073
9       0.071
10      0.097
11      0.071
12      0.089
13      0.114
14      0.176
15      0.170
16      0.092
17      0.368
18      0.086
19      0.341
20      0.077
21      0.082
22      0.106
23      0.084
24      0.085
25      0.080
26      0.080
27      0.106
28      0.080
29      0.082
        ...  
1569    0.056
1570    0.230
1571    0.038
1572    0.069
1573    0.075
1574    0.074
1575    0.060
1576    0.081
1577    0.076
1578    0.118
1579    0.053
1580    0.068
1581    0.053
1582    0.053
1583    0.074
1584    0.061
1585    0.066
1586    0.065
1587    0.066
1588    0.068
1589    0.073
1590    0.077
1591    0.089
1592    0.076
1593    0.068
1594    0.090
1595    0.062
1596    0.076
1597    0.075
1598    0.067
Name: chlorides, dtype: float64

The output value is itself a DataFrame.

We can retrieve multiple columns as well by nesting the column names inside a list:


In [4]:
df[['chlorides', 'volatile acidity']]


Out[4]:
chlorides volatile acidity
0 0.076 0.700
1 0.098 0.880
2 0.092 0.760
3 0.075 0.280
4 0.076 0.700
5 0.075 0.660
6 0.069 0.600
7 0.065 0.650
8 0.073 0.580
9 0.071 0.500
10 0.097 0.580
11 0.071 0.500
12 0.089 0.615
13 0.114 0.610
14 0.176 0.620
15 0.170 0.620
16 0.092 0.280
17 0.368 0.560
18 0.086 0.590
19 0.341 0.320
20 0.077 0.220
21 0.082 0.390
22 0.106 0.430
23 0.084 0.490
24 0.085 0.400
25 0.080 0.390
26 0.080 0.410
27 0.106 0.430
28 0.080 0.710
29 0.082 0.645
... ... ...
1569 0.056 0.510
1570 0.230 0.360
1571 0.038 0.380
1572 0.069 0.690
1573 0.075 0.580
1574 0.074 0.310
1575 0.060 0.520
1576 0.081 0.300
1577 0.076 0.700
1578 0.118 0.670
1579 0.053 0.560
1580 0.068 0.350
1581 0.053 0.560
1582 0.053 0.715
1583 0.074 0.460
1584 0.061 0.320
1585 0.066 0.390
1586 0.065 0.310
1587 0.066 0.610
1588 0.068 0.660
1589 0.073 0.725
1590 0.077 0.550
1591 0.089 0.740
1592 0.076 0.510
1593 0.068 0.620
1594 0.090 0.600
1595 0.062 0.550
1596 0.076 0.510
1597 0.075 0.645
1598 0.067 0.310

1599 rows × 2 columns

To retrieve selected rows, we can use a slice, just like slicing a list or a string:


In [5]:
df[:3]


Out[5]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5

However, in Pandas, you cannot use a single integer to retrieve a single row (which I find confusing):


In [6]:
# error:
# df[0]

Instead, I would do the following:


In [7]:
df[0:1]


Out[7]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.7 0.0 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5

You can filter by columns and rows simultaneously:


In [8]:
df[['chlorides', 'volatile acidity']][:3]


Out[8]:
chlorides volatile acidity
0 0.076 0.70
1 0.098 0.88
2 0.092 0.76

Filter by Binary Masking

When making a logical comparison between a DataFrame and a single number, the result is a binary mask.

A binary mask is itself a DataFrame containing Boolean values (True or False) comparing each element of the data frame with the other number:


In [9]:
df['chlorides'] <= 0.08


Out[9]:
0        True
1       False
2       False
3        True
4        True
5        True
6        True
7        True
8        True
9        True
10      False
11       True
12      False
13      False
14      False
15      False
16      False
17      False
18      False
19      False
20       True
21      False
22      False
23      False
24      False
25       True
26       True
27      False
28       True
29      False
        ...  
1569     True
1570    False
1571     True
1572     True
1573     True
1574     True
1575     True
1576    False
1577     True
1578    False
1579     True
1580     True
1581     True
1582     True
1583     True
1584     True
1585     True
1586     True
1587     True
1588     True
1589     True
1590     True
1591    False
1592     True
1593     True
1594    False
1595     True
1596     True
1597     True
1598     True
Name: chlorides, dtype: bool

The binary mask can be used to index the data frame. The following returns all rows whose chlorides values is less than 0.08.


In [10]:
df[df['chlorides'] <= 0.08]


Out[10]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.700 0.00 1.9 0.076 11.0 34.0 0.99780 3.51 0.56 9.4 5
3 11.2 0.280 0.56 1.9 0.075 17.0 60.0 0.99800 3.16 0.58 9.8 6
4 7.4 0.700 0.00 1.9 0.076 11.0 34.0 0.99780 3.51 0.56 9.4 5
5 7.4 0.660 0.00 1.8 0.075 13.0 40.0 0.99780 3.51 0.56 9.4 5
6 7.9 0.600 0.06 1.6 0.069 15.0 59.0 0.99640 3.30 0.46 9.4 5
7 7.3 0.650 0.00 1.2 0.065 15.0 21.0 0.99460 3.39 0.47 10.0 7
8 7.8 0.580 0.02 2.0 0.073 9.0 18.0 0.99680 3.36 0.57 9.5 7
9 7.5 0.500 0.36 6.1 0.071 17.0 102.0 0.99780 3.35 0.80 10.5 5
11 7.5 0.500 0.36 6.1 0.071 17.0 102.0 0.99780 3.35 0.80 10.5 5
20 8.9 0.220 0.48 1.8 0.077 29.0 60.0 0.99680 3.39 0.53 9.4 6
25 6.3 0.390 0.16 1.4 0.080 11.0 23.0 0.99550 3.34 0.56 9.3 5
26 7.6 0.410 0.24 1.8 0.080 4.0 11.0 0.99620 3.28 0.59 9.5 5
28 7.1 0.710 0.00 1.9 0.080 14.0 35.0 0.99720 3.47 0.55 9.4 5
33 6.9 0.605 0.12 10.7 0.073 40.0 83.0 0.99930 3.45 0.52 9.4 6
37 8.1 0.380 0.28 2.1 0.066 13.0 30.0 0.99680 3.23 0.73 9.7 7
39 7.3 0.450 0.36 5.9 0.074 12.0 87.0 0.99780 3.33 0.83 10.5 5
40 7.3 0.450 0.36 5.9 0.074 12.0 87.0 0.99780 3.33 0.83 10.5 5
43 8.1 0.660 0.22 2.2 0.069 9.0 23.0 0.99680 3.30 1.20 10.3 5
44 6.8 0.670 0.02 1.8 0.050 5.0 11.0 0.99620 3.48 0.52 9.5 5
45 4.6 0.520 0.15 2.1 0.054 8.0 65.0 0.99340 3.90 0.56 13.1 4
48 6.4 0.400 0.23 1.6 0.066 5.0 12.0 0.99580 3.34 0.56 9.2 5
49 5.6 0.310 0.37 1.4 0.074 12.0 96.0 0.99540 3.32 0.58 9.2 5
50 8.8 0.660 0.26 1.7 0.074 4.0 23.0 0.99710 3.15 0.74 9.2 5
51 6.6 0.520 0.04 2.2 0.069 8.0 15.0 0.99560 3.40 0.63 9.4 6
52 6.6 0.500 0.04 2.1 0.068 6.0 14.0 0.99550 3.39 0.64 9.4 6
56 10.2 0.420 0.57 3.4 0.070 4.0 10.0 0.99710 3.04 0.63 9.6 5
58 7.8 0.590 0.18 2.3 0.076 17.0 54.0 0.99750 3.43 0.59 10.0 5
59 7.3 0.390 0.31 2.4 0.074 9.0 46.0 0.99620 3.41 0.54 9.4 6
60 8.8 0.400 0.40 2.2 0.079 19.0 52.0 0.99800 3.44 0.64 9.2 5
66 7.5 0.520 0.11 1.5 0.079 11.0 39.0 0.99680 3.42 0.58 9.6 5
... ... ... ... ... ... ... ... ... ... ... ... ...
1564 7.2 0.695 0.13 2.0 0.076 12.0 20.0 0.99546 3.29 0.54 10.1 5
1565 6.7 0.670 0.02 1.9 0.061 26.0 42.0 0.99489 3.39 0.82 10.9 6
1566 6.7 0.160 0.64 2.1 0.059 24.0 52.0 0.99494 3.34 0.71 11.2 6
1567 7.2 0.695 0.13 2.0 0.076 12.0 20.0 0.99546 3.29 0.54 10.1 5
1568 7.0 0.560 0.13 1.6 0.077 25.0 42.0 0.99629 3.34 0.59 9.2 5
1569 6.2 0.510 0.14 1.9 0.056 15.0 34.0 0.99396 3.48 0.57 11.5 6
1571 6.4 0.380 0.14 2.2 0.038 15.0 25.0 0.99514 3.44 0.65 11.1 6
1572 7.3 0.690 0.32 2.2 0.069 35.0 104.0 0.99632 3.33 0.51 9.5 5
1573 6.0 0.580 0.20 2.4 0.075 15.0 50.0 0.99467 3.58 0.67 12.5 6
1574 5.6 0.310 0.78 13.9 0.074 23.0 92.0 0.99677 3.39 0.48 10.5 6
1575 7.5 0.520 0.40 2.2 0.060 12.0 20.0 0.99474 3.26 0.64 11.8 6
1577 6.2 0.700 0.15 5.1 0.076 13.0 27.0 0.99622 3.54 0.60 11.9 6
1579 6.2 0.560 0.09 1.7 0.053 24.0 32.0 0.99402 3.54 0.60 11.3 5
1580 7.4 0.350 0.33 2.4 0.068 9.0 26.0 0.99470 3.36 0.60 11.9 6
1581 6.2 0.560 0.09 1.7 0.053 24.0 32.0 0.99402 3.54 0.60 11.3 5
1582 6.1 0.715 0.10 2.6 0.053 13.0 27.0 0.99362 3.57 0.50 11.9 5
1583 6.2 0.460 0.29 2.1 0.074 32.0 98.0 0.99578 3.33 0.62 9.8 5
1584 6.7 0.320 0.44 2.4 0.061 24.0 34.0 0.99484 3.29 0.80 11.6 7
1585 7.2 0.390 0.44 2.6 0.066 22.0 48.0 0.99494 3.30 0.84 11.5 6
1586 7.5 0.310 0.41 2.4 0.065 34.0 60.0 0.99492 3.34 0.85 11.4 6
1587 5.8 0.610 0.11 1.8 0.066 18.0 28.0 0.99483 3.55 0.66 10.9 6
1588 7.2 0.660 0.33 2.5 0.068 34.0 102.0 0.99414 3.27 0.78 12.8 6
1589 6.6 0.725 0.20 7.8 0.073 29.0 79.0 0.99770 3.29 0.54 9.2 5
1590 6.3 0.550 0.15 1.8 0.077 26.0 35.0 0.99314 3.32 0.82 11.6 6
1592 6.3 0.510 0.13 2.3 0.076 29.0 40.0 0.99574 3.42 0.75 11.0 6
1593 6.8 0.620 0.08 1.9 0.068 28.0 38.0 0.99651 3.42 0.82 9.5 6
1595 5.9 0.550 0.10 2.2 0.062 39.0 51.0 0.99512 3.52 0.76 11.2 6
1596 6.3 0.510 0.13 2.3 0.076 29.0 40.0 0.99574 3.42 0.75 11.0 6
1597 5.9 0.645 0.12 2.0 0.075 32.0 44.0 0.99547 3.57 0.71 10.2 5
1598 6.0 0.310 0.47 3.6 0.067 18.0 42.0 0.99549 3.39 0.66 11.0 6

874 rows × 12 columns

The ampersand, &, performs a logical-AND, i.e. an intersection, between each condition:


In [11]:
df[(df['chlorides'] >= 0.075) & (df['chlorides'] <= 0.080)]


Out[11]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.700 0.00 1.9 0.076 11.0 34.0 0.99780 3.51 0.56 9.40 5
3 11.2 0.280 0.56 1.9 0.075 17.0 60.0 0.99800 3.16 0.58 9.80 6
4 7.4 0.700 0.00 1.9 0.076 11.0 34.0 0.99780 3.51 0.56 9.40 5
5 7.4 0.660 0.00 1.8 0.075 13.0 40.0 0.99780 3.51 0.56 9.40 5
20 8.9 0.220 0.48 1.8 0.077 29.0 60.0 0.99680 3.39 0.53 9.40 6
25 6.3 0.390 0.16 1.4 0.080 11.0 23.0 0.99550 3.34 0.56 9.30 5
26 7.6 0.410 0.24 1.8 0.080 4.0 11.0 0.99620 3.28 0.59 9.50 5
28 7.1 0.710 0.00 1.9 0.080 14.0 35.0 0.99720 3.47 0.55 9.40 5
58 7.8 0.590 0.18 2.3 0.076 17.0 54.0 0.99750 3.43 0.59 10.00 5
60 8.8 0.400 0.40 2.2 0.079 19.0 52.0 0.99800 3.44 0.64 9.20 5
66 7.5 0.520 0.11 1.5 0.079 11.0 39.0 0.99680 3.42 0.58 9.60 5
67 6.6 0.705 0.07 1.6 0.076 6.0 15.0 0.99620 3.44 0.58 10.70 5
70 7.7 0.630 0.08 1.9 0.076 15.0 27.0 0.99670 3.32 0.54 9.50 6
79 8.3 0.625 0.20 1.5 0.080 27.0 119.0 0.99720 3.16 1.12 9.10 4
85 6.9 0.550 0.15 2.2 0.076 19.0 40.0 0.99610 3.41 0.59 10.10 5
89 7.0 0.620 0.08 1.8 0.076 8.0 24.0 0.99780 3.48 0.53 9.00 5
90 7.9 0.520 0.26 1.9 0.079 42.0 140.0 0.99640 3.23 0.54 9.50 5
98 7.6 0.900 0.06 2.5 0.079 5.0 10.0 0.99670 3.39 0.56 9.80 5
99 8.1 0.545 0.18 1.9 0.080 13.0 35.0 0.99720 3.30 0.59 9.00 6
101 7.8 0.500 0.30 1.9 0.075 8.0 22.0 0.99590 3.31 0.56 10.40 6
102 8.1 0.545 0.18 1.9 0.080 13.0 35.0 0.99720 3.30 0.59 9.00 6
103 8.1 0.575 0.22 2.1 0.077 12.0 65.0 0.99670 3.29 0.51 9.20 5
105 8.1 0.575 0.22 2.1 0.077 12.0 65.0 0.99670 3.29 0.51 9.20 5
113 10.1 0.310 0.44 2.3 0.080 22.0 46.0 0.99880 3.32 0.67 9.70 6
116 8.3 0.540 0.28 1.9 0.077 11.0 40.0 0.99780 3.39 0.61 10.00 6
122 7.3 0.695 0.00 2.5 0.075 3.0 13.0 0.99800 3.49 0.52 9.20 5
123 8.0 0.710 0.00 2.6 0.080 11.0 34.0 0.99760 3.44 0.53 9.50 5
150 7.3 0.330 0.47 2.1 0.077 5.0 11.0 0.99580 3.33 0.53 10.30 6
159 6.8 0.600 0.18 1.9 0.079 18.0 86.0 0.99680 3.59 0.57 9.30 6
162 7.8 0.530 0.04 1.7 0.076 17.0 31.0 0.99640 3.33 0.56 10.00 6
... ... ... ... ... ... ... ... ... ... ... ... ...
1447 6.8 0.670 0.00 1.9 0.080 22.0 39.0 0.99701 3.40 0.74 9.70 5
1448 6.9 0.580 0.01 1.9 0.080 40.0 54.0 0.99683 3.40 0.73 9.70 5
1462 6.8 0.640 0.03 2.3 0.075 14.0 31.0 0.99545 3.36 0.58 10.40 6
1463 6.9 0.630 0.01 2.4 0.076 14.0 39.0 0.99522 3.34 0.53 10.80 6
1480 5.6 0.620 0.03 1.5 0.080 6.0 13.0 0.99498 3.66 0.62 10.10 4
1495 7.0 0.430 0.02 1.9 0.080 15.0 28.0 0.99492 3.35 0.81 10.60 6
1500 7.5 0.725 0.04 1.5 0.076 8.0 15.0 0.99508 3.26 0.53 9.60 5
1502 7.3 0.585 0.18 2.4 0.078 15.0 60.0 0.99638 3.31 0.54 9.80 5
1503 6.2 0.440 0.39 2.5 0.077 6.0 14.0 0.99555 3.51 0.69 11.00 6
1505 6.7 0.760 0.02 1.8 0.078 6.0 12.0 0.99600 3.55 0.63 9.95 3
1513 6.4 0.560 0.15 1.8 0.078 17.0 65.0 0.99294 3.33 0.60 10.50 6
1523 6.8 0.480 0.25 2.0 0.076 29.0 61.0 0.99530 3.34 0.60 10.40 5
1524 6.0 0.420 0.19 2.0 0.075 22.0 47.0 0.99522 3.39 0.78 10.00 6
1535 7.0 0.550 0.13 2.2 0.075 15.0 35.0 0.99590 3.36 0.59 9.70 6
1536 6.1 0.530 0.08 1.9 0.077 24.0 45.0 0.99528 3.60 0.68 10.30 6
1553 7.3 0.735 0.00 2.2 0.080 18.0 28.0 0.99765 3.41 0.60 9.40 5
1559 7.8 0.600 0.26 2.0 0.080 31.0 131.0 0.99622 3.21 0.52 9.90 5
1560 7.8 0.600 0.26 2.0 0.080 31.0 131.0 0.99622 3.21 0.52 9.90 5
1561 7.8 0.600 0.26 2.0 0.080 31.0 131.0 0.99622 3.21 0.52 9.90 5
1562 7.2 0.695 0.13 2.0 0.076 12.0 20.0 0.99546 3.29 0.54 10.10 5
1563 7.2 0.695 0.13 2.0 0.076 12.0 20.0 0.99546 3.29 0.54 10.10 5
1564 7.2 0.695 0.13 2.0 0.076 12.0 20.0 0.99546 3.29 0.54 10.10 5
1567 7.2 0.695 0.13 2.0 0.076 12.0 20.0 0.99546 3.29 0.54 10.10 5
1568 7.0 0.560 0.13 1.6 0.077 25.0 42.0 0.99629 3.34 0.59 9.20 5
1573 6.0 0.580 0.20 2.4 0.075 15.0 50.0 0.99467 3.58 0.67 12.50 6
1577 6.2 0.700 0.15 5.1 0.076 13.0 27.0 0.99622 3.54 0.60 11.90 6
1590 6.3 0.550 0.15 1.8 0.077 26.0 35.0 0.99314 3.32 0.82 11.60 6
1592 6.3 0.510 0.13 2.3 0.076 29.0 40.0 0.99574 3.42 0.75 11.00 6
1596 6.3 0.510 0.13 2.3 0.076 29.0 40.0 0.99574 3.42 0.75 11.00 6
1597 5.9 0.645 0.12 2.0 0.075 32.0 44.0 0.99547 3.57 0.71 10.20 5

303 rows × 12 columns

The pipe, |, performs a logical-OR, i.e. a union, between each condition:


In [12]:
df[(df['chlorides'] < 0.075) | (df['chlorides'] > 0.080)]


Out[12]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
1 7.8 0.880 0.00 2.6 0.098 25.0 67.0 0.99680 3.20 0.68 9.80 5
2 7.8 0.760 0.04 2.3 0.092 15.0 54.0 0.99700 3.26 0.65 9.80 5
6 7.9 0.600 0.06 1.6 0.069 15.0 59.0 0.99640 3.30 0.46 9.40 5
7 7.3 0.650 0.00 1.2 0.065 15.0 21.0 0.99460 3.39 0.47 10.00 7
8 7.8 0.580 0.02 2.0 0.073 9.0 18.0 0.99680 3.36 0.57 9.50 7
9 7.5 0.500 0.36 6.1 0.071 17.0 102.0 0.99780 3.35 0.80 10.50 5
10 6.7 0.580 0.08 1.8 0.097 15.0 65.0 0.99590 3.28 0.54 9.20 5
11 7.5 0.500 0.36 6.1 0.071 17.0 102.0 0.99780 3.35 0.80 10.50 5
12 5.6 0.615 0.00 1.6 0.089 16.0 59.0 0.99430 3.58 0.52 9.90 5
13 7.8 0.610 0.29 1.6 0.114 9.0 29.0 0.99740 3.26 1.56 9.10 5
14 8.9 0.620 0.18 3.8 0.176 52.0 145.0 0.99860 3.16 0.88 9.20 5
15 8.9 0.620 0.19 3.9 0.170 51.0 148.0 0.99860 3.17 0.93 9.20 5
16 8.5 0.280 0.56 1.8 0.092 35.0 103.0 0.99690 3.30 0.75 10.50 7
17 8.1 0.560 0.28 1.7 0.368 16.0 56.0 0.99680 3.11 1.28 9.30 5
18 7.4 0.590 0.08 4.4 0.086 6.0 29.0 0.99740 3.38 0.50 9.00 4
19 7.9 0.320 0.51 1.8 0.341 17.0 56.0 0.99690 3.04 1.08 9.20 6
21 7.6 0.390 0.31 2.3 0.082 23.0 71.0 0.99820 3.52 0.65 9.70 5
22 7.9 0.430 0.21 1.6 0.106 10.0 37.0 0.99660 3.17 0.91 9.50 5
23 8.5 0.490 0.11 2.3 0.084 9.0 67.0 0.99680 3.17 0.53 9.40 5
24 6.9 0.400 0.14 2.4 0.085 21.0 40.0 0.99680 3.43 0.63 9.70 6
27 7.9 0.430 0.21 1.6 0.106 10.0 37.0 0.99660 3.17 0.91 9.50 5
29 7.8 0.645 0.00 2.0 0.082 8.0 16.0 0.99640 3.38 0.59 9.80 6
30 6.7 0.675 0.07 2.4 0.089 17.0 82.0 0.99580 3.35 0.54 10.10 5
31 6.9 0.685 0.00 2.5 0.105 22.0 37.0 0.99660 3.46 0.57 10.60 6
32 8.3 0.655 0.12 2.3 0.083 15.0 113.0 0.99660 3.17 0.66 9.80 5
33 6.9 0.605 0.12 10.7 0.073 40.0 83.0 0.99930 3.45 0.52 9.40 6
34 5.2 0.320 0.25 1.8 0.103 13.0 50.0 0.99570 3.38 0.55 9.20 5
35 7.8 0.645 0.00 5.5 0.086 5.0 18.0 0.99860 3.40 0.55 9.60 6
36 7.8 0.600 0.14 2.4 0.086 3.0 15.0 0.99750 3.42 0.60 10.80 6
37 8.1 0.380 0.28 2.1 0.066 13.0 30.0 0.99680 3.23 0.73 9.70 7
... ... ... ... ... ... ... ... ... ... ... ... ...
1555 7.0 0.560 0.17 1.7 0.065 15.0 24.0 0.99514 3.44 0.68 10.55 7
1556 6.6 0.880 0.04 2.2 0.066 12.0 20.0 0.99636 3.53 0.56 9.90 5
1557 6.6 0.855 0.02 2.4 0.062 15.0 23.0 0.99627 3.54 0.60 11.00 6
1558 6.9 0.630 0.33 6.7 0.235 66.0 115.0 0.99787 3.22 0.56 9.50 5
1565 6.7 0.670 0.02 1.9 0.061 26.0 42.0 0.99489 3.39 0.82 10.90 6
1566 6.7 0.160 0.64 2.1 0.059 24.0 52.0 0.99494 3.34 0.71 11.20 6
1569 6.2 0.510 0.14 1.9 0.056 15.0 34.0 0.99396 3.48 0.57 11.50 6
1570 6.4 0.360 0.53 2.2 0.230 19.0 35.0 0.99340 3.37 0.93 12.40 6
1571 6.4 0.380 0.14 2.2 0.038 15.0 25.0 0.99514 3.44 0.65 11.10 6
1572 7.3 0.690 0.32 2.2 0.069 35.0 104.0 0.99632 3.33 0.51 9.50 5
1574 5.6 0.310 0.78 13.9 0.074 23.0 92.0 0.99677 3.39 0.48 10.50 6
1575 7.5 0.520 0.40 2.2 0.060 12.0 20.0 0.99474 3.26 0.64 11.80 6
1576 8.0 0.300 0.63 1.6 0.081 16.0 29.0 0.99588 3.30 0.78 10.80 6
1578 6.8 0.670 0.15 1.8 0.118 13.0 20.0 0.99540 3.42 0.67 11.30 6
1579 6.2 0.560 0.09 1.7 0.053 24.0 32.0 0.99402 3.54 0.60 11.30 5
1580 7.4 0.350 0.33 2.4 0.068 9.0 26.0 0.99470 3.36 0.60 11.90 6
1581 6.2 0.560 0.09 1.7 0.053 24.0 32.0 0.99402 3.54 0.60 11.30 5
1582 6.1 0.715 0.10 2.6 0.053 13.0 27.0 0.99362 3.57 0.50 11.90 5
1583 6.2 0.460 0.29 2.1 0.074 32.0 98.0 0.99578 3.33 0.62 9.80 5
1584 6.7 0.320 0.44 2.4 0.061 24.0 34.0 0.99484 3.29 0.80 11.60 7
1585 7.2 0.390 0.44 2.6 0.066 22.0 48.0 0.99494 3.30 0.84 11.50 6
1586 7.5 0.310 0.41 2.4 0.065 34.0 60.0 0.99492 3.34 0.85 11.40 6
1587 5.8 0.610 0.11 1.8 0.066 18.0 28.0 0.99483 3.55 0.66 10.90 6
1588 7.2 0.660 0.33 2.5 0.068 34.0 102.0 0.99414 3.27 0.78 12.80 6
1589 6.6 0.725 0.20 7.8 0.073 29.0 79.0 0.99770 3.29 0.54 9.20 5
1591 5.4 0.740 0.09 1.7 0.089 16.0 26.0 0.99402 3.67 0.56 11.60 6
1593 6.8 0.620 0.08 1.9 0.068 28.0 38.0 0.99651 3.42 0.82 9.50 6
1594 6.2 0.600 0.08 2.0 0.090 32.0 44.0 0.99490 3.45 0.58 10.50 5
1595 5.9 0.550 0.10 2.2 0.062 39.0 51.0 0.99512 3.52 0.76 11.20 6
1598 6.0 0.310 0.47 3.6 0.067 18.0 42.0 0.99549 3.39 0.66 11.00 6

1296 rows × 12 columns

You could include conditions with multiple fields:


In [13]:
df[(df['chlorides'] > 0.100) & (df['pH'] < 3.20)]


Out[13]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
14 8.9 0.620 0.18 3.8 0.176 52.0 145.0 0.99860 3.16 0.88 9.2 5
15 8.9 0.620 0.19 3.9 0.170 51.0 148.0 0.99860 3.17 0.93 9.2 5
17 8.1 0.560 0.28 1.7 0.368 16.0 56.0 0.99680 3.11 1.28 9.3 5
19 7.9 0.320 0.51 1.8 0.341 17.0 56.0 0.99690 3.04 1.08 9.2 6
22 7.9 0.430 0.21 1.6 0.106 10.0 37.0 0.99660 3.17 0.91 9.5 5
27 7.9 0.430 0.21 1.6 0.106 10.0 37.0 0.99660 3.17 0.91 9.5 5
54 7.6 0.510 0.15 2.8 0.110 33.0 73.0 0.99550 3.17 0.63 10.2 6
81 7.8 0.430 0.70 1.9 0.464 22.0 67.0 0.99740 3.13 1.28 9.4 5
83 7.3 0.670 0.26 1.8 0.401 16.0 51.0 0.99690 3.16 1.14 9.4 5
86 8.6 0.490 0.28 1.9 0.110 20.0 136.0 0.99720 2.93 1.95 9.9 6
88 9.3 0.390 0.44 2.1 0.107 34.0 125.0 0.99780 3.14 1.22 9.5 5
91 8.6 0.490 0.28 1.9 0.110 20.0 136.0 0.99720 2.93 1.95 9.9 6
92 8.6 0.490 0.29 2.0 0.110 19.0 133.0 0.99720 2.93 1.98 9.8 5
106 7.8 0.410 0.68 1.7 0.467 18.0 69.0 0.99730 3.08 1.31 9.3 5
110 7.8 0.560 0.19 1.8 0.104 12.0 47.0 0.99640 3.19 0.93 9.5 5
114 7.8 0.560 0.19 1.8 0.104 12.0 47.0 0.99640 3.19 0.93 9.5 5
125 9.0 0.620 0.04 1.9 0.146 27.0 90.0 0.99840 3.16 0.70 9.4 5
145 8.1 0.670 0.55 1.8 0.117 32.0 141.0 0.99680 3.17 0.62 9.4 5
147 7.6 0.490 0.26 1.6 0.236 10.0 88.0 0.99680 3.11 0.80 9.3 5
151 9.2 0.520 1.00 3.4 0.610 32.0 69.0 0.99960 2.74 2.00 9.4 4
169 7.5 0.705 0.24 1.8 0.360 15.0 63.0 0.99640 3.00 1.59 9.5 5
181 8.9 0.610 0.49 2.0 0.270 23.0 110.0 0.99720 3.12 1.02 9.3 5
216 8.7 0.625 0.16 2.0 0.101 13.0 49.0 0.99620 3.14 0.57 11.0 5
221 7.4 0.530 0.26 2.0 0.101 16.0 72.0 0.99570 3.15 0.57 9.4 5
226 8.9 0.590 0.50 2.0 0.337 27.0 81.0 0.99640 3.04 1.61 9.5 6
240 8.9 0.635 0.37 1.7 0.263 5.0 62.0 0.99710 3.00 1.09 9.3 5
242 7.7 0.580 0.10 1.8 0.102 28.0 109.0 0.99565 3.08 0.49 9.8 6
258 7.7 0.410 0.76 1.8 0.611 8.0 45.0 0.99680 3.06 1.26 9.4 5
307 10.3 0.410 0.42 2.4 0.213 6.0 14.0 0.99940 3.19 0.62 9.5 6
308 10.3 0.430 0.44 2.4 0.214 5.0 12.0 0.99940 3.19 0.63 9.5 6
... ... ... ... ... ... ... ... ... ... ... ... ...
795 10.8 0.890 0.30 2.6 0.132 7.0 60.0 0.99786 2.99 1.18 10.2 5
796 8.7 0.460 0.31 2.5 0.126 24.0 64.0 0.99746 3.10 0.74 9.6 5
874 10.4 0.380 0.46 2.1 0.104 6.0 10.0 0.99664 3.12 0.65 11.8 7
911 9.1 0.280 0.46 9.0 0.114 3.0 9.0 0.99901 3.18 0.60 10.9 6
966 9.0 0.380 0.41 2.4 0.103 6.0 10.0 0.99604 3.13 0.58 11.9 7
977 8.4 0.590 0.29 2.6 0.109 31.0 119.0 0.99801 3.15 0.50 9.1 5
1009 9.6 0.500 0.36 2.8 0.116 26.0 55.0 0.99722 3.18 0.68 10.9 5
1051 8.5 0.460 0.59 1.4 0.414 16.0 45.0 0.99702 3.03 1.34 9.2 5
1057 7.6 0.420 0.25 3.9 0.104 28.0 90.0 0.99784 3.15 0.57 9.1 5
1098 8.0 0.310 0.45 2.1 0.216 5.0 16.0 0.99358 3.15 0.81 12.5 7
1109 10.8 0.470 0.43 2.1 0.171 27.0 66.0 0.99820 3.17 0.76 10.8 6
1138 7.5 0.410 0.15 3.7 0.104 29.0 94.0 0.99786 3.14 0.58 9.1 5
1152 8.3 0.600 0.25 2.2 0.118 9.0 38.0 0.99616 3.15 0.53 9.8 5
1155 8.3 0.600 0.25 2.2 0.118 9.0 38.0 0.99616 3.15 0.53 9.8 5
1159 10.2 0.410 0.43 2.2 0.110 11.0 37.0 0.99728 3.16 0.67 10.8 5
1165 8.5 0.440 0.50 1.9 0.369 15.0 38.0 0.99634 3.01 1.10 9.4 5
1166 9.9 0.540 0.26 2.0 0.111 7.0 60.0 0.99709 2.94 0.98 10.2 5
1203 9.9 0.740 0.19 5.8 0.111 33.0 76.0 0.99878 3.14 0.55 9.4 5
1260 8.6 0.635 0.68 1.8 0.403 19.0 56.0 0.99632 3.02 1.15 9.3 5
1319 9.1 0.760 0.68 1.7 0.414 18.0 64.0 0.99652 2.90 1.33 9.1 6
1365 7.8 0.500 0.09 2.2 0.115 10.0 42.0 0.99710 3.18 0.62 9.5 5
1370 8.7 0.780 0.51 1.7 0.415 12.0 66.0 0.99623 3.00 1.17 9.2 5
1372 8.7 0.780 0.51 1.7 0.415 12.0 66.0 0.99623 3.00 1.17 9.2 5
1384 7.1 0.755 0.15 1.8 0.107 20.0 84.0 0.99593 3.19 0.50 9.5 5
1413 9.9 0.570 0.25 2.0 0.104 12.0 89.0 0.99630 3.04 0.90 10.1 5
1434 10.2 0.540 0.37 15.4 0.214 55.0 95.0 1.00369 3.18 0.77 9.0 6
1435 10.2 0.540 0.37 15.4 0.214 55.0 95.0 1.00369 3.18 0.77 9.0 6
1436 10.0 0.380 0.38 1.6 0.169 27.0 90.0 0.99914 3.15 0.65 8.5 5
1474 9.9 0.500 0.50 13.8 0.205 48.0 82.0 1.00242 3.16 0.75 8.8 5
1476 9.9 0.500 0.50 13.8 0.205 48.0 82.0 1.00242 3.16 0.75 8.8 5

89 rows × 12 columns

Filter by query method

The query method provides another way to filter a DataFrame similar to the binary mask above.


In [14]:
df.query('chlorides > 0.100 and pH < 3.20')


Out[14]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
14 8.9 0.620 0.18 3.8 0.176 52.0 145.0 0.99860 3.16 0.88 9.2 5
15 8.9 0.620 0.19 3.9 0.170 51.0 148.0 0.99860 3.17 0.93 9.2 5
17 8.1 0.560 0.28 1.7 0.368 16.0 56.0 0.99680 3.11 1.28 9.3 5
19 7.9 0.320 0.51 1.8 0.341 17.0 56.0 0.99690 3.04 1.08 9.2 6
22 7.9 0.430 0.21 1.6 0.106 10.0 37.0 0.99660 3.17 0.91 9.5 5
27 7.9 0.430 0.21 1.6 0.106 10.0 37.0 0.99660 3.17 0.91 9.5 5
54 7.6 0.510 0.15 2.8 0.110 33.0 73.0 0.99550 3.17 0.63 10.2 6
81 7.8 0.430 0.70 1.9 0.464 22.0 67.0 0.99740 3.13 1.28 9.4 5
83 7.3 0.670 0.26 1.8 0.401 16.0 51.0 0.99690 3.16 1.14 9.4 5
86 8.6 0.490 0.28 1.9 0.110 20.0 136.0 0.99720 2.93 1.95 9.9 6
88 9.3 0.390 0.44 2.1 0.107 34.0 125.0 0.99780 3.14 1.22 9.5 5
91 8.6 0.490 0.28 1.9 0.110 20.0 136.0 0.99720 2.93 1.95 9.9 6
92 8.6 0.490 0.29 2.0 0.110 19.0 133.0 0.99720 2.93 1.98 9.8 5
106 7.8 0.410 0.68 1.7 0.467 18.0 69.0 0.99730 3.08 1.31 9.3 5
110 7.8 0.560 0.19 1.8 0.104 12.0 47.0 0.99640 3.19 0.93 9.5 5
114 7.8 0.560 0.19 1.8 0.104 12.0 47.0 0.99640 3.19 0.93 9.5 5
125 9.0 0.620 0.04 1.9 0.146 27.0 90.0 0.99840 3.16 0.70 9.4 5
145 8.1 0.670 0.55 1.8 0.117 32.0 141.0 0.99680 3.17 0.62 9.4 5
147 7.6 0.490 0.26 1.6 0.236 10.0 88.0 0.99680 3.11 0.80 9.3 5
151 9.2 0.520 1.00 3.4 0.610 32.0 69.0 0.99960 2.74 2.00 9.4 4
169 7.5 0.705 0.24 1.8 0.360 15.0 63.0 0.99640 3.00 1.59 9.5 5
181 8.9 0.610 0.49 2.0 0.270 23.0 110.0 0.99720 3.12 1.02 9.3 5
216 8.7 0.625 0.16 2.0 0.101 13.0 49.0 0.99620 3.14 0.57 11.0 5
221 7.4 0.530 0.26 2.0 0.101 16.0 72.0 0.99570 3.15 0.57 9.4 5
226 8.9 0.590 0.50 2.0 0.337 27.0 81.0 0.99640 3.04 1.61 9.5 6
240 8.9 0.635 0.37 1.7 0.263 5.0 62.0 0.99710 3.00 1.09 9.3 5
242 7.7 0.580 0.10 1.8 0.102 28.0 109.0 0.99565 3.08 0.49 9.8 6
258 7.7 0.410 0.76 1.8 0.611 8.0 45.0 0.99680 3.06 1.26 9.4 5
307 10.3 0.410 0.42 2.4 0.213 6.0 14.0 0.99940 3.19 0.62 9.5 6
308 10.3 0.430 0.44 2.4 0.214 5.0 12.0 0.99940 3.19 0.63 9.5 6
... ... ... ... ... ... ... ... ... ... ... ... ...
795 10.8 0.890 0.30 2.6 0.132 7.0 60.0 0.99786 2.99 1.18 10.2 5
796 8.7 0.460 0.31 2.5 0.126 24.0 64.0 0.99746 3.10 0.74 9.6 5
874 10.4 0.380 0.46 2.1 0.104 6.0 10.0 0.99664 3.12 0.65 11.8 7
911 9.1 0.280 0.46 9.0 0.114 3.0 9.0 0.99901 3.18 0.60 10.9 6
966 9.0 0.380 0.41 2.4 0.103 6.0 10.0 0.99604 3.13 0.58 11.9 7
977 8.4 0.590 0.29 2.6 0.109 31.0 119.0 0.99801 3.15 0.50 9.1 5
1009 9.6 0.500 0.36 2.8 0.116 26.0 55.0 0.99722 3.18 0.68 10.9 5
1051 8.5 0.460 0.59 1.4 0.414 16.0 45.0 0.99702 3.03 1.34 9.2 5
1057 7.6 0.420 0.25 3.9 0.104 28.0 90.0 0.99784 3.15 0.57 9.1 5
1098 8.0 0.310 0.45 2.1 0.216 5.0 16.0 0.99358 3.15 0.81 12.5 7
1109 10.8 0.470 0.43 2.1 0.171 27.0 66.0 0.99820 3.17 0.76 10.8 6
1138 7.5 0.410 0.15 3.7 0.104 29.0 94.0 0.99786 3.14 0.58 9.1 5
1152 8.3 0.600 0.25 2.2 0.118 9.0 38.0 0.99616 3.15 0.53 9.8 5
1155 8.3 0.600 0.25 2.2 0.118 9.0 38.0 0.99616 3.15 0.53 9.8 5
1159 10.2 0.410 0.43 2.2 0.110 11.0 37.0 0.99728 3.16 0.67 10.8 5
1165 8.5 0.440 0.50 1.9 0.369 15.0 38.0 0.99634 3.01 1.10 9.4 5
1166 9.9 0.540 0.26 2.0 0.111 7.0 60.0 0.99709 2.94 0.98 10.2 5
1203 9.9 0.740 0.19 5.8 0.111 33.0 76.0 0.99878 3.14 0.55 9.4 5
1260 8.6 0.635 0.68 1.8 0.403 19.0 56.0 0.99632 3.02 1.15 9.3 5
1319 9.1 0.760 0.68 1.7 0.414 18.0 64.0 0.99652 2.90 1.33 9.1 6
1365 7.8 0.500 0.09 2.2 0.115 10.0 42.0 0.99710 3.18 0.62 9.5 5
1370 8.7 0.780 0.51 1.7 0.415 12.0 66.0 0.99623 3.00 1.17 9.2 5
1372 8.7 0.780 0.51 1.7 0.415 12.0 66.0 0.99623 3.00 1.17 9.2 5
1384 7.1 0.755 0.15 1.8 0.107 20.0 84.0 0.99593 3.19 0.50 9.5 5
1413 9.9 0.570 0.25 2.0 0.104 12.0 89.0 0.99630 3.04 0.90 10.1 5
1434 10.2 0.540 0.37 15.4 0.214 55.0 95.0 1.00369 3.18 0.77 9.0 6
1435 10.2 0.540 0.37 15.4 0.214 55.0 95.0 1.00369 3.18 0.77 9.0 6
1436 10.0 0.380 0.38 1.6 0.169 27.0 90.0 0.99914 3.15 0.65 8.5 5
1474 9.9 0.500 0.50 13.8 0.205 48.0 82.0 1.00242 3.16 0.75 8.8 5
1476 9.9 0.500 0.50 13.8 0.205 48.0 82.0 1.00242 3.16 0.75 8.8 5

89 rows × 12 columns

Groupby

groupby allows us to perform aggregations such as sums, means, or counts.

Let's find the unique values in the quality column throughout the data frame.


In [15]:
df['quality'].unique()


Out[15]:
array([5, 6, 7, 4, 8, 3])

Create a GroupBy object:


In [16]:
gb = df.groupby('quality')

In [17]:
gb


Out[17]:
<pandas.core.groupby.DataFrameGroupBy object at 0x10ec6f590>

For each quality level, find the mean, max, or count of the other columns:


In [18]:
gb.mean()


Out[18]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol
quality
3 8.360000 0.884500 0.171000 2.635000 0.122500 11.000000 24.900000 0.997464 3.398000 0.570000 9.955000
4 7.779245 0.693962 0.174151 2.694340 0.090679 12.264151 36.245283 0.996542 3.381509 0.596415 10.265094
5 8.167254 0.577041 0.243686 2.528855 0.092736 16.983847 56.513950 0.997104 3.304949 0.620969 9.899706
6 8.347179 0.497484 0.273824 2.477194 0.084956 15.711599 40.869906 0.996615 3.318072 0.675329 10.629519
7 8.872362 0.403920 0.375176 2.720603 0.076588 14.045226 35.020101 0.996104 3.290754 0.741256 11.465913
8 8.566667 0.423333 0.391111 2.577778 0.068444 13.277778 33.444444 0.995212 3.267222 0.767778 12.094444

In [19]:
gb.max()


Out[19]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol
quality
3 11.6 1.580 0.66 5.7 0.267 34.0 49.0 1.00080 3.63 0.86 11.0
4 12.5 1.130 1.00 12.9 0.610 41.0 119.0 1.00100 3.90 2.00 13.1
5 15.9 1.330 0.79 15.5 0.611 68.0 155.0 1.00315 3.74 1.98 14.9
6 14.3 1.040 0.78 15.4 0.415 72.0 165.0 1.00369 4.01 1.95 14.0
7 15.6 0.915 0.76 8.9 0.358 54.0 289.0 1.00320 3.78 1.36 14.0
8 12.6 0.850 0.72 6.4 0.086 42.0 88.0 0.99880 3.72 1.10 14.0

In [20]:
gb.count()


Out[20]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol
quality
3 10 10 10 10 10 10 10 10 10 10 10
4 53 53 53 53 53 53 53 53 53 53 53
5 681 681 681 681 681 681 681 681 681 681 681
6 638 638 638 638 638 638 638 638 638 638 638
7 199 199 199 199 199 199 199 199 199 199 199
8 18 18 18 18 18 18 18 18 18 18 18

The result of the above methods are DataFrame objects. That means that you can index the result like any other DataFrame:


In [21]:
gb.count()['fixed acidity']


Out[21]:
quality
3     10
4     53
5    681
6    638
7    199
8     18
Name: fixed acidity, dtype: int64

To group by combinations of multiple column values, enclose the column names in a list:


In [22]:
df.groupby(['pH', 'quality']).count()


Out[22]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density sulphates alcohol
pH quality
2.74 4 1 1 1 1 1 1 1 1 1 1
2.86 6 1 1 1 1 1 1 1 1 1 1
2.87 6 1 1 1 1 1 1 1 1 1 1
2.88 5 1 1 1 1 1 1 1 1 1 1
8 1 1 1 1 1 1 1 1 1 1
2.89 5 2 2 2 2 2 2 2 2 2 2
6 2 2 2 2 2 2 2 2 2 2
2.90 6 1 1 1 1 1 1 1 1 1 1
2.92 5 3 3 3 3 3 3 3 3 3 3
7 1 1 1 1 1 1 1 1 1 1
2.93 5 1 1 1 1 1 1 1 1 1 1
6 2 2 2 2 2 2 2 2 2 2
2.94 5 2 2 2 2 2 2 2 2 2 2
6 2 2 2 2 2 2 2 2 2 2
2.95 7 1 1 1 1 1 1 1 1 1 1
2.98 5 2 2 2 2 2 2 2 2 2 2
6 1 1 1 1 1 1 1 1 1 1
7 1 1 1 1 1 1 1 1 1 1
8 1 1 1 1 1 1 1 1 1 1
2.99 5 2 2 2 2 2 2 2 2 2 2
3.00 5 4 4 4 4 4 4 4 4 4 4
6 2 2 2 2 2 2 2 2 2 2
3.01 5 1 1 1 1 1 1 1 1 1 1
6 1 1 1 1 1 1 1 1 1 1
7 3 3 3 3 3 3 3 3 3 3
3.02 5 3 3 3 3 3 3 3 3 3 3
6 1 1 1 1 1 1 1 1 1 1
7 4 4 4 4 4 4 4 4 4 4
3.03 5 4 4 4 4 4 4 4 4 4 4
6 2 2 2 2 2 2 2 2 2 2
... ... ... ... ... ... ... ... ... ... ... ...
3.61 5 3 3 3 3 3 3 3 3 3 3
6 5 5 5 5 5 5 5 5 5 5
3.62 4 1 1 1 1 1 1 1 1 1 1
5 2 2 2 2 2 2 2 2 2 2
6 1 1 1 1 1 1 1 1 1 1
3.63 3 1 1 1 1 1 1 1 1 1 1
5 2 2 2 2 2 2 2 2 2 2
3.66 4 1 1 1 1 1 1 1 1 1 1
5 2 2 2 2 2 2 2 2 2 2
6 1 1 1 1 1 1 1 1 1 1
3.67 5 2 2 2 2 2 2 2 2 2 2
6 1 1 1 1 1 1 1 1 1 1
3.68 5 2 2 2 2 2 2 2 2 2 2
6 2 2 2 2 2 2 2 2 2 2
7 1 1 1 1 1 1 1 1 1 1
3.69 5 2 2 2 2 2 2 2 2 2 2
6 2 2 2 2 2 2 2 2 2 2
3.70 6 1 1 1 1 1 1 1 1 1 1
3.71 6 1 1 1 1 1 1 1 1 1 1
7 3 3 3 3 3 3 3 3 3 3
3.72 5 2 2 2 2 2 2 2 2 2 2
8 1 1 1 1 1 1 1 1 1 1
3.74 5 1 1 1 1 1 1 1 1 1 1
3.75 4 1 1 1 1 1 1 1 1 1 1
3.78 6 1 1 1 1 1 1 1 1 1 1
7 1 1 1 1 1 1 1 1 1 1
3.85 6 1 1 1 1 1 1 1 1 1 1
3.90 4 1 1 1 1 1 1 1 1 1 1
6 1 1 1 1 1 1 1 1 1 1
4.01 6 2 2 2 2 2 2 2 2 2 2

270 rows × 10 columns

Sorting

Use sort_values to sort a DataFrame:


In [23]:
df.sort_values('fixed acidity')[:5]


Out[23]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
45 4.6 0.52 0.15 2.1 0.054 8.0 65.0 0.99340 3.90 0.56 13.1 4
95 4.7 0.60 0.17 2.3 0.058 17.0 106.0 0.99320 3.85 0.60 12.9 6
821 4.9 0.42 0.00 2.1 0.048 16.0 42.0 0.99154 3.71 0.74 14.0 7
588 5.0 0.42 0.24 2.0 0.060 19.0 50.0 0.99170 3.72 0.74 14.0 8
94 5.0 1.02 0.04 1.4 0.045 41.0 85.0 0.99380 3.75 0.48 10.5 4

In [24]:
df.sort_values('fixed acidity', ascending=False)[:5]


Out[24]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
652 15.9 0.360 0.65 7.5 0.096 22.0 71.0 0.99760 2.98 0.84 14.9 5
557 15.6 0.645 0.49 4.2 0.095 10.0 23.0 1.00315 2.92 0.74 11.1 5
442 15.6 0.685 0.76 3.7 0.100 6.0 43.0 1.00320 2.95 0.68 11.2 7
555 15.5 0.645 0.49 4.2 0.095 10.0 23.0 1.00315 2.92 0.74 11.1 5
554 15.5 0.645 0.49 4.2 0.095 10.0 23.0 1.00315 2.92 0.74 11.1 5

Sort by multiple columns:


In [25]:
df.sort_values(['fixed acidity', 'volatile acidity'], ascending=[True, True])


Out[25]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
45 4.6 0.520 0.15 2.10 0.054 8.0 65.0 0.99340 3.90 0.56 13.100000 4
95 4.7 0.600 0.17 2.30 0.058 17.0 106.0 0.99320 3.85 0.60 12.900000 6
821 4.9 0.420 0.00 2.10 0.048 16.0 42.0 0.99154 3.71 0.74 14.000000 7
1270 5.0 0.380 0.01 1.60 0.048 26.0 60.0 0.99084 3.70 0.75 14.000000 6
1114 5.0 0.400 0.50 4.30 0.046 29.0 80.0 0.99020 3.49 0.66 13.600000 6
588 5.0 0.420 0.24 2.00 0.060 19.0 50.0 0.99170 3.72 0.74 14.000000 8
1321 5.0 0.740 0.00 1.20 0.041 16.0 46.0 0.99258 4.01 0.59 12.500000 6
94 5.0 1.020 0.04 1.40 0.045 41.0 85.0 0.99380 3.75 0.48 10.500000 4
553 5.0 1.040 0.24 1.60 0.050 32.0 96.0 0.99340 3.74 0.62 11.500000 5
1228 5.1 0.420 0.00 1.80 0.044 18.0 88.0 0.99157 3.68 0.73 13.600000 7
695 5.1 0.470 0.02 1.30 0.034 18.0 44.0 0.99210 3.90 0.62 12.800000 6
1157 5.1 0.510 0.18 2.10 0.042 16.0 101.0 0.99240 3.46 0.87 12.900000 7
802 5.1 0.585 0.00 1.70 0.044 14.0 86.0 0.99264 3.56 0.94 12.900000 7
34 5.2 0.320 0.25 1.80 0.103 13.0 50.0 0.99570 3.38 0.55 9.200000 5
142 5.2 0.340 0.00 1.80 0.050 27.0 63.0 0.99160 3.68 0.79 14.000000 6
144 5.2 0.340 0.00 1.80 0.050 27.0 63.0 0.99160 3.68 0.79 14.000000 6
230 5.2 0.480 0.04 1.60 0.054 19.0 106.0 0.99270 3.54 0.62 12.200000 7
1377 5.2 0.490 0.26 2.30 0.090 23.0 74.0 0.99530 3.71 0.62 12.200000 6
1300 5.2 0.645 0.00 2.15 0.080 15.0 28.0 0.99444 3.78 0.61 12.500000 6
1475 5.3 0.470 0.11 2.20 0.048 16.0 89.0 0.99182 3.54 0.88 13.566667 7
1477 5.3 0.470 0.11 2.20 0.048 16.0 89.0 0.99182 3.54 0.88 13.600000 7
444 5.3 0.570 0.01 1.70 0.054 5.0 27.0 0.99340 3.57 0.84 12.500000 7
916 5.3 0.715 0.19 1.50 0.161 7.0 62.0 0.99395 3.62 0.61 11.000000 5
1111 5.4 0.420 0.27 2.00 0.092 23.0 55.0 0.99471 3.78 0.64 12.300000 7
1537 5.4 0.580 0.08 1.90 0.059 20.0 31.0 0.99484 3.50 0.64 10.200000 6
1316 5.4 0.740 0.00 1.20 0.041 16.0 46.0 0.99258 4.01 0.59 12.500000 6
1591 5.4 0.740 0.09 1.70 0.089 16.0 26.0 0.99402 3.67 0.56 11.600000 6
198 5.4 0.835 0.08 1.20 0.046 13.0 93.0 0.99240 3.57 0.85 13.000000 7
1269 5.5 0.490 0.03 1.80 0.044 28.0 87.0 0.99080 3.50 0.82 14.000000 8
49 5.6 0.310 0.37 1.40 0.074 12.0 96.0 0.99540 3.32 0.58 9.200000 5
... ... ... ... ... ... ... ... ... ... ... ... ...
205 12.8 0.300 0.74 2.60 0.095 9.0 28.0 0.99940 3.20 0.77 10.800000 7
206 12.8 0.300 0.74 2.60 0.095 9.0 28.0 0.99940 3.20 0.77 10.800000 7
364 12.8 0.615 0.66 5.80 0.083 7.0 42.0 1.00220 3.07 0.73 10.000000 7
366 12.8 0.615 0.66 5.80 0.083 7.0 42.0 1.00220 3.07 0.73 10.000000 7
429 12.8 0.840 0.63 2.40 0.088 13.0 35.0 0.99970 3.10 0.60 10.400000 6
538 12.9 0.350 0.49 5.80 0.066 5.0 35.0 1.00140 3.20 0.66 12.000000 7
811 12.9 0.500 0.55 2.80 0.072 7.0 24.0 1.00012 3.09 0.68 10.900000 6
470 13.0 0.320 0.65 2.60 0.093 15.0 47.0 0.99960 3.05 0.61 10.600000 5
559 13.0 0.470 0.49 4.30 0.085 6.0 47.0 1.00210 3.30 0.68 12.700000 6
564 13.0 0.470 0.49 4.30 0.085 6.0 47.0 1.00210 3.30 0.68 12.700000 6
611 13.2 0.380 0.55 2.70 0.081 5.0 16.0 1.00060 2.98 0.54 9.400000 5
601 13.2 0.460 0.52 2.20 0.071 12.0 35.0 1.00060 3.10 0.56 9.000000 6
603 13.2 0.460 0.52 2.20 0.071 12.0 35.0 1.00060 3.10 0.56 9.000000 6
509 13.3 0.290 0.75 2.80 0.084 23.0 43.0 0.99860 3.04 0.68 11.400000 7
294 13.3 0.340 0.52 3.20 0.094 17.0 53.0 1.00140 3.05 0.81 9.500000 6
680 13.3 0.430 0.58 1.90 0.070 15.0 40.0 1.00040 3.06 0.49 9.000000 5
328 13.4 0.270 0.62 2.60 0.082 6.0 21.0 1.00020 3.16 0.67 9.700000 6
353 13.5 0.530 0.79 4.80 0.120 23.0 77.0 1.00180 3.18 0.77 13.000000 5
381 13.7 0.415 0.68 2.90 0.085 17.0 43.0 1.00140 3.06 0.80 10.000000 6
391 13.7 0.415 0.68 2.90 0.085 17.0 43.0 1.00140 3.06 0.80 10.000000 6
347 13.8 0.490 0.67 3.00 0.093 6.0 15.0 0.99860 3.02 0.93 12.000000 6
374 14.0 0.410 0.63 3.80 0.089 6.0 47.0 1.00140 3.01 0.81 10.800000 6
544 14.3 0.310 0.74 1.80 0.075 6.0 15.0 1.00080 2.86 0.79 8.400000 6
243 15.0 0.210 0.44 2.20 0.075 10.0 24.0 1.00005 3.07 0.84 9.200000 7
244 15.0 0.210 0.44 2.20 0.075 10.0 24.0 1.00005 3.07 0.84 9.200000 7
554 15.5 0.645 0.49 4.20 0.095 10.0 23.0 1.00315 2.92 0.74 11.100000 5
555 15.5 0.645 0.49 4.20 0.095 10.0 23.0 1.00315 2.92 0.74 11.100000 5
557 15.6 0.645 0.49 4.20 0.095 10.0 23.0 1.00315 2.92 0.74 11.100000 5
442 15.6 0.685 0.76 3.70 0.100 6.0 43.0 1.00320 2.95 0.68 11.200000 7
652 15.9 0.360 0.65 7.50 0.096 22.0 71.0 0.99760 2.98 0.84 14.900000 5

1599 rows × 12 columns

Note that this does not change the DataFrame itself.


In [26]:
df.head()


Out[26]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8 6
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5

Add Column

To add a column you can use bracket notation in a manner similar to extending a dictionary.

Create a `DataFrame`:

In [27]:
x = pandas.DataFrame([{'a': 1, 'b': 2}, {'a': 3, 'b': 4}])

In [28]:
x


Out[28]:
a b
0 1 2
1 3 4

For a new key, z, define the values for each row:


In [29]:
x['z'] = [55, 66]

In [30]:
x


Out[30]:
a b z
0 1 2 55
1 3 4 66

You could use this manner to define a new column as a function of existing columns:


In [31]:
x['b_per_a'] = x['b']/x['a']

In [32]:
x


Out[32]:
a b z b_per_a
0 1 2 55 2.000000
1 3 4 66 1.333333

Rename Column

By default,


In [33]:
x.rename(columns={'a': 'gallons', 'b': 'miles', 'b_per_a': 'mpg'}, inplace=True)

In [34]:
x


Out[34]:
gallons miles z mpg
0 1 2 55 2.000000
1 3 4 66 1.333333

By default, the rename method returns a renamed copy of the original data frame. To modify the original data frame, i.e. in place, we set inplace=True.

Drop Rows or Columns

The drop method is used to drop rows or columns from a DataFrame.

To drop a column:


In [35]:
x.drop('z', axis=1)


Out[35]:
gallons miles mpg
0 1 2 2.000000
1 3 4 1.333333

By default, the drop method returns a modified copy of the original data frame, i.e. the original data frame is unmodified:


In [36]:
x


Out[36]:
gallons miles z mpg
0 1 2 55 2.000000
1 3 4 66 1.333333

To modify the original data frame, i.e. in place, we set inplace=True.


In [37]:
x.drop('z', axis=1, inplace=True)

In [38]:
x


Out[38]:
gallons miles mpg
0 1 2 2.000000
1 3 4 1.333333

To drop a row:


In [39]:
x.drop(0)


Out[39]:
gallons miles mpg
1 3 4 1.333333

Drop Missing Values

For illustration, add a row/column containing a missing value:


In [40]:
x['w'] = [1, None]

In [41]:
x


Out[41]:
gallons miles mpg w
0 1 2 2.000000 1.0
1 3 4 1.333333 NaN

Drop all columns containing at least one missing value:


In [42]:
x.dropna(axis=1)


Out[42]:
gallons miles mpg
0 1 2 2.000000
1 3 4 1.333333

Drop all rows containing at least one missing value:


In [43]:
x.dropna(axis=0)


Out[43]:
gallons miles mpg w
0 1 2 2.0 1.0

As always, unless we set inplace=True, the original data frame is left unmodified:


In [44]:
x


Out[44]:
gallons miles mpg w
0 1 2 2.000000 1.0
1 3 4 1.333333 NaN

Fill Missing Values

To fill all missing data cells with a specified value, e.g. -1:


In [45]:
x.fillna(-1)


Out[45]:
gallons miles mpg w
0 1 2 2.000000 1.0
1 3 4 1.333333 -1.0

Pairwise Correlation


In [46]:
df.corr()


Out[46]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
fixed acidity 1.000000 -0.256131 0.671703 0.114777 0.093705 -0.153794 -0.113181 0.668047 -0.682978 0.183006 -0.061668 0.124052
volatile acidity -0.256131 1.000000 -0.552496 0.001918 0.061298 -0.010504 0.076470 0.022026 0.234937 -0.260987 -0.202288 -0.390558
citric acid 0.671703 -0.552496 1.000000 0.143577 0.203823 -0.060978 0.035533 0.364947 -0.541904 0.312770 0.109903 0.226373
residual sugar 0.114777 0.001918 0.143577 1.000000 0.055610 0.187049 0.203028 0.355283 -0.085652 0.005527 0.042075 0.013732
chlorides 0.093705 0.061298 0.203823 0.055610 1.000000 0.005562 0.047400 0.200632 -0.265026 0.371260 -0.221141 -0.128907
free sulfur dioxide -0.153794 -0.010504 -0.060978 0.187049 0.005562 1.000000 0.667666 -0.021946 0.070377 0.051658 -0.069408 -0.050656
total sulfur dioxide -0.113181 0.076470 0.035533 0.203028 0.047400 0.667666 1.000000 0.071269 -0.066495 0.042947 -0.205654 -0.185100
density 0.668047 0.022026 0.364947 0.355283 0.200632 -0.021946 0.071269 1.000000 -0.341699 0.148506 -0.496180 -0.174919
pH -0.682978 0.234937 -0.541904 -0.085652 -0.265026 0.070377 -0.066495 -0.341699 1.000000 -0.196648 0.205633 -0.057731
sulphates 0.183006 -0.260987 0.312770 0.005527 0.371260 0.051658 0.042947 0.148506 -0.196648 1.000000 0.093595 0.251397
alcohol -0.061668 -0.202288 0.109903 0.042075 -0.221141 -0.069408 -0.205654 -0.496180 0.205633 0.093595 1.000000 0.476166
quality 0.124052 -0.390558 0.226373 0.013732 -0.128907 -0.050656 -0.185100 -0.174919 -0.057731 0.251397 0.476166 1.000000

In [47]:
df[['density', 'pH', 'alcohol']].corr()


Out[47]:
density pH alcohol
density 1.000000 -0.341699 -0.496180
pH -0.341699 1.000000 0.205633
alcohol -0.496180 0.205633 1.000000

Visualization

Matplotlib is the most popular and compatible plotting library in Python. Much of its functionality is integrated into Pandas.

In the Jupyter/IPython notebook, we must execute the following line in order to display plots in the notebook:


In [2]:
%matplotlib inline

The following import is a common convention when creating plots in Python:


In [5]:
import seaborn
import matplotlib.pyplot as plt

In [6]:
plt.plot([3,1,4,1,5])


Out[6]:
[<matplotlib.lines.Line2D at 0x10c8c6150>]

We can set the visual style to something a bit more pleasing than the default:


In [7]:
import matplotlib
matplotlib.style.use('ggplot')

In [8]:
plt.plot([4,7,4,2,6,])


Out[8]:
[<matplotlib.lines.Line2D at 0x10cc86b90>]

Plot a histogram of the fixed acidity values:


In [9]:
df['fixed acidity'].plot.hist()


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-9-2521b0caa77f> in <module>()
----> 1 df['fixed acidity'].plot.hist()

NameError: name 'df' is not defined

Scatter plot the fixed acidity vs. volatile acidity:


In [52]:
df.plot.scatter(x='fixed acidity', y='volatile acidity')


Out[52]:
<matplotlib.axes._subplots.AxesSubplot at 0x10ee9f290>

In [53]:
df.plot.hexbin(x='fixed acidity', y='volatile acidity', gridsize=30)


Out[53]:
<matplotlib.axes._subplots.AxesSubplot at 0x10f137790>

Create a box-and-whisker plot of multiple columns:


In [55]:
df[['fixed acidity', 'volatile acidity']].plot.box()


Out[55]:
<matplotlib.axes._subplots.AxesSubplot at 0x10f14acd0>