NumPy and Pandas

NumPy

NumPy is the fundamental package for scientific computing with Python.

http://www.numpy.org/

It contains, among other things:

  • a powerful N-dimensional array object
  • useful linear algebra, Fourier transform, and random number capabilities
  • tools for integrating C/C++ and Fortran code
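
As a rough illustration of the capabilities listed above, here is a minimal sketch that touches linear algebra, the Fourier transform, and random numbers. The matrix, vector, and seed are made-up values chosen only for the example.

```python
import numpy

# Linear algebra: solve the system A @ x = b (illustrative values).
A = numpy.array([[3.0, 1.0], [1.0, 2.0]])
b = numpy.array([9.0, 8.0])
x = numpy.linalg.solve(A, b)

# Fourier transform: a constant signal has all of its energy in bin 0.
spectrum = numpy.fft.fft(numpy.ones(4))

# Random numbers: a seeded generator gives reproducible draws in [0, 1).
rng = numpy.random.default_rng(0)
samples = rng.random(3)

print(x)
print(spectrum)
print(samples)
```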

In [ ]:
!conda install --yes -c conda-forge ipywidgets numpy pandas nomkl seaborn jupyter


Using Anaconda Cloud api site https://api.anaconda.org
Fetching package metadata ...

In [1]:
import numpy

In [2]:
numpy.ones((2, 3))


Out[2]:
array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])

In [3]:
a = numpy.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
a


Out[3]:
array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])

In [4]:
a.shape


Out[4]:
(4, 3)

In [5]:
a.ndim


Out[5]:
2

In [6]:
a.size


Out[6]:
12

In [7]:
a - numpy.random.random(a.shape)


Out[7]:
array([[  0.37772541,   1.4749529 ,   2.40223093],
       [  3.69957528,   4.84057082,   5.01177475],
       [  6.20454538,   7.2169713 ,   8.82870779],
       [  9.38959571,  10.0096205 ,  11.93880162]])
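
The subtraction in the cell above is elementwise because both operands have the same shape. The sketch below repeats it and then shows broadcasting, where a shorter operand is stretched to match; the variable names `noise`, `column_means`, and `centered` are introduced here only for illustration.

```python
import numpy

a = numpy.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])

# Same-shape operands: each element of noise is subtracted from
# the matching element of a. The result is a float array.
noise = numpy.random.random(a.shape)  # floats in [0, 1)
diff = a - noise

# Broadcasting: column_means has shape (3,), so it is applied to
# every one of the 4 rows of a.
column_means = a.mean(axis=0)
centered = a - column_means

print(diff.dtype)
print(centered)
```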

In [8]:
a.ravel()


Out[8]:
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12])
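
One detail worth knowing about `ravel()` is that it returns a view of the original data when it can, whereas `flatten()` always copies. A small sketch of the difference, using `numpy.shares_memory` to check:

```python
import numpy

a = numpy.arange(1, 13).reshape(4, 3)

flat_view = a.ravel()    # a view for a C-contiguous array like this one
flat_copy = a.flatten()  # always an independent copy

shares_view = numpy.shares_memory(a, flat_view)
shares_copy = numpy.shares_memory(a, flat_copy)

print(shares_view, shares_copy)
```

Because `flat_view` shares memory with `a`, writing to it would also change `a`; use `flatten()` or `.copy()` when an independent array is needed.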

In [9]:
a


Out[9]:
array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])

In [10]:
a[1:-1]


Out[10]:
array([[4, 5, 6],
       [7, 8, 9]])

In [11]:
'>qwweqwe<'[1:-1]


Out[11]:
'qwweqwe'

In [12]:
a


Out[12]:
array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])

In [13]:
a[:,1]


Out[13]:
array([ 2,  5,  8, 11])
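
Row and column slices can be combined in a single index expression, with one slice per axis separated by a comma. A short sketch on the same array (the names `inner`, `every_other`, and `last_row` are illustrative):

```python
import numpy

a = numpy.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])

# Rows 1 up to (not including) the last, first two columns.
inner = a[1:-1, :2]

# Every second row, column 1 only.
every_other = a[::2, 1]

# Negative indices count from the end: the last row.
last_row = a[-1]

print(inner)
print(every_other)
print(last_row)
```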

In [14]:
a[a % 2 == 0] = -1

In [15]:
a


Out[15]:
array([[ 1, -1,  3],
       [-1,  5, -1],
       [ 7, -1,  9],
       [-1, 11, -1]])
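
Assigning through a boolean mask, as above, modifies `a` in place. If the original array should stay untouched, one option is `numpy.where`, which builds a new array instead. A minimal sketch:

```python
import numpy

a = numpy.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])

# True where the element is even.
mask = a % 2 == 0

# Take -1 where the mask is True, the original value elsewhere.
replaced = numpy.where(mask, -1, a)

print(replaced)
print(a)  # unchanged
```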

Pandas

Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

http://pandas.pydata.org/

It features:

  • A fast and efficient DataFrame object for data manipulation with integrated indexing;
  • Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
  • Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets.
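
To make the split-apply-combine idea concrete, here is a minimal sketch on a tiny made-up table. The `city` and `price` columns and their values are hypothetical and are not part of the Boston data used below.

```python
import pandas

# Hypothetical data, for illustration only.
frame = pandas.DataFrame({
    "city": ["A", "A", "B", "B", "B"],
    "price": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Split rows by city, apply mean() to each group's prices,
# and combine the results into a Series indexed by city.
mean_price = frame.groupby("city")["price"].mean()

print(mean_price)
```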

In [16]:
import pandas

In [17]:
boston_dataset = pandas.read_csv("../static/Boston.csv")

In [18]:
boston_dataset[:10]


Out[18]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PT B LSTAT MV
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2
5 0.02985 0.0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21 28.7
6 0.08829 12.5 7.87 0 0.524 6.012 66.6 5.5605 5 311 15.2 395.60 12.43 22.9
7 0.14455 12.5 7.87 0 0.524 6.172 96.1 5.9505 5 311 15.2 396.90 19.15 27.1
8 0.21124 12.5 7.87 0 0.524 5.631 100.0 6.0821 5 311 15.2 386.63 29.93 16.5
9 0.17004 12.5 7.87 0 0.524 6.004 85.9 6.5921 5 311 15.2 386.71 17.10 18.9

In [19]:
boston_dataset[boston_dataset['MV'] < 7]


Out[19]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PT B LSTAT MV
398 38.35180 0.0 18.1 0 0.693 5.453 100.0 1.4896 24 666 20.2 396.90 30.59 5.0
399 9.91655 0.0 18.1 0 0.693 5.852 77.8 1.5004 24 666 20.2 338.16 29.97 6.3
400 25.04610 0.0 18.1 0 0.693 5.987 100.0 1.5888 24 666 20.2 396.90 26.77 5.6
405 67.92080 0.0 18.1 0 0.693 5.683 100.0 1.4254 24 666 20.2 384.97 22.98 5.0
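
Boolean filters can be combined with `&` (and) and `|` (or), with each condition wrapped in parentheses. The sketch below uses a four-row sample whose `MV` and `LSTAT` values are copied from rows shown in the outputs above; the threshold of 30 on `LSTAT` is an arbitrary choice for the example.

```python
import pandas

# A small sample (rows 0, 398, 399 and 3 of the outputs above).
sample = pandas.DataFrame({
    "MV": [24.0, 5.0, 6.3, 33.4],
    "LSTAT": [4.98, 30.59, 29.97, 2.94],
})

# Single condition, as in the cell above.
low_value = sample[sample["MV"] < 7]

# Two conditions combined with &; .loc also selects columns.
low_value_high_lstat = sample.loc[
    (sample["MV"] < 7) & (sample["LSTAT"] > 30), ["MV"]
]

print(low_value)
print(low_value_high_lstat)
```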

In [20]:
boston_dataset['TARGET'] = boston_dataset['MV'].astype(int)

In [21]:
boston_dataset[:10]


Out[21]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PT B LSTAT MV TARGET
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0 24
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6 21
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7 34
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4 33
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2 36
5 0.02985 0.0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21 28.7 28
6 0.08829 12.5 7.87 0 0.524 6.012 66.6 5.5605 5 311 15.2 395.60 12.43 22.9 22
7 0.14455 12.5 7.87 0 0.524 6.172 96.1 5.9505 5 311 15.2 396.90 19.15 27.1 27
8 0.21124 12.5 7.87 0 0.524 5.631 100.0 6.0821 5 311 15.2 386.63 29.93 16.5 16
9 0.17004 12.5 7.87 0 0.524 6.004 85.9 6.5921 5 311 15.2 386.71 17.10 18.9 18
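
Note that `astype(int)` truncates toward zero rather than rounding, which is why 21.6 becomes 21 in the table above. If rounding to the nearest integer is wanted instead, one possible approach is to call `round()` first, as in this sketch (the values are taken from the `MV` column shown above):

```python
import pandas

mv = pandas.Series([24.0, 21.6, 34.7, 18.9])

# Truncation: the fractional part is dropped.
truncated = mv.astype(int)

# Rounding to the nearest integer, then converting.
rounded = mv.round().astype(int)

print(truncated.tolist())
print(rounded.tolist())
```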

In [22]:
%pylab inline

import seaborn
seaborn.set_context('talk')


Populating the interactive namespace from numpy and matplotlib

Plot a histogram with 50 bins


In [23]:
boston_dataset['MV'].hist(bins=50);
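
To inspect the numbers behind a histogram without drawing it, `numpy.histogram` returns the per-bin counts and the bin edges. The sketch below uses a handful of `MV` values from the outputs above and 4 bins, chosen only to keep the example small.

```python
import numpy

# A few MV values from the tables above.
values = numpy.array([5.0, 6.3, 16.5, 18.9, 21.6, 22.9,
                      24.0, 27.1, 28.7, 33.4, 34.7, 36.2])

# bins=4 splits the range [min, max] into 4 equal-width intervals,
# so there are 5 edges.
counts, edges = numpy.histogram(values, bins=4)

print(counts)
print(edges)
```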


Jupyter Notebook interactive features


In [24]:
def plot_by(dataset, column='MV', bins_count=10):
    # Plot the dataset passed in, not the global boston_dataset.
    dataset[column].hist(bins=bins_count)

    # Plot settings.
    pyplot.title('%s Values' % column)
    pyplot.ylabel('N')

from ipywidgets import interact, fixed
interact(
    plot_by,
    dataset=fixed(boston_dataset),
    column=boston_dataset.columns.tolist(),
    bins_count=(5,50)
);


