In [ ]:
```
from IPython.display import Image
```

- NumPy is the most fundamental third-party package for scientific computing in Python
- It provides multidimensional **array** data structures, along with associated functions and methods to manipulate them
- Other third-party packages, including **pandas**, use NumPy arrays as backends for more specialized data structures

- While Python comes with several container types (`list`, `tuple`, `dict`), NumPy's arrays are implemented closer to the hardware, and are therefore more **efficient** than the built-in types
- This is particularly true for large data, for which NumPy scales much better than Python's built-in data structures
- NumPy arrays also come with a suite of associated functions and methods that allow for efficient *array-oriented* computing

In [ ]:
```
import numpy as np
```

- You can index an array in the same way you index Python lists, using **slice notation**

In [ ]:
```
lst = list(range(1000))
arr = np.arange(1000)
```

Here's what the array looks like:

In [ ]:
```
arr[:10]
```

In [ ]:
```
arr[10:20]
```

In [ ]:
```
arr[10:20:2]
```

In [ ]:
```
type(arr)
```

In [ ]:
```
%timeit [i ** 2 for i in lst]
```

In [ ]:
```
%timeit arr ** 2
```

We can index arrays in the same ways as lists:

In [ ]:
```
arr[5:10]
```

In [ ]:
```
arr[-1]
```

- Arrays are **homogeneously typed**: all elements of an array must be of the same type
- We will see why when we look at the memory layout
- Lists, by contrast, can contain elements of arbitrary type

In [ ]:
```
['a', 2, (1, 3)]
```

In [ ]:
```
lst[0] = 'some other type'
```

In [ ]:
```
lst[:3]
```

- We can't do this with an array

In [ ]:
```
arr[0] = 'some other type'
```

- The data type is stored in the `dtype` attribute

In [ ]:
```
arr.dtype
```

- The dtype is fixed
- Values of other types will be cast to this type

In [ ]:
```
arr[0] = 1.234
```

In [ ]:
```
arr[:10]
```

- Sometimes it's useful to peek under the hood to fix ideas
- An array is a block of memory with some extra information on how to interpret its contents

In [ ]:
```
Image("https://docs.scipy.org/doc/numpy/_images/threefundamental.png")
```
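That "block of memory plus interpretation" picture can be inspected directly from array attributes. A minimal sketch (the attribute names are standard NumPy; the printed values depend on your platform's default integer size, so treat them as illustrative):

```python
import numpy as np

arr = np.arange(10)

# The raw buffer is a flat block of bytes; dtype and strides tell
# NumPy how to interpret it.
print(arr.dtype)      # element type stored in the array header
print(arr.itemsize)   # bytes per element
print(arr.nbytes)     # total bytes = size * itemsize
print(arr.strides)    # bytes to step to reach the next element

# nbytes is always consistent with size and itemsize
assert arr.nbytes == arr.size * arr.itemsize
```

For a contiguous 1-d array, the stride is simply the item size: moving to the next element means jumping one element's worth of bytes.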

In [ ]:
```
np.zeros(5, dtype=float)
```

In [ ]:
```
np.zeros(5, dtype=int)
```

In [ ]:
```
np.zeros(5, dtype=complex)
```

In [ ]:
```
np.ones(5, dtype=float)
```

- We have seen how the `arange` function generates an array for a range of integers
- The `linspace` and `logspace` functions create linearly and logarithmically spaced **grids**, respectively, with a fixed number of points and including both ends of the specified interval:

In [ ]:
```
np.linspace(0, 1, num=5)
```

In [ ]:
```
np.logspace(1, 4, num=4)
```

Finally, it is often useful to create arrays of random numbers that follow a specific **distribution**. The `np.random` module contains a number of functions that can be used to this effect; for example, this will produce an array of 5 random samples taken from a **standard normal** distribution (mean 0 and variance 1), $X \sim N(0, 1)$:

In [ ]:
```
np.random.randn(5)
```

To sample from $X \sim N(9, 3)$ instead:

In [ ]:
```
norm10 = np.random.normal(loc=9, scale=3, size=10)
```

In [ ]:
```
%load solutions/random_number.py
```

- Above we showed how to index with numbers and **slices**
- NumPy indexing is much more powerful than Python indexing
- You can index with other arrays:
  - Boolean arrays
  - Integer arrays

Suppose that in `norm10` we want to replace all values above 9 with the value 0. We can do so by first finding the *mask* that indicates where this condition is `True` or `False`:

In [ ]:
```
mask = norm10 > 9
mask
```

In [ ]:
```
norm10[mask]
```

- Likewise you can index with integer arrays

In [ ]:
```
norm10[[1, 4, 6]]
```

- This form of indexing is known as **fancy indexing**
- You can use fancy indexing for assignment
- This is particularly useful for assignment given some condition

In [ ]:
```
norm10[norm10 > 9] = 0
```

In [ ]:
```
norm10
```

In [ ]:
```
norm10[[1, 4, 7]] = 10
```

In [ ]:
```
norm10
```

- When we use slice notation to look at part of an array, it produces a **view**
- That is, it points to the same memory as the original array

In [ ]:
```
x = np.arange(10)
```

In [ ]:
```
x
```

In [ ]:
```
y = x[::2]
y
```

In [ ]:
```
y[3] = 100
y
```

In [ ]:
```
x
```
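When you need an independent array rather than a view, a slice can be followed by an explicit `copy()`. A minimal sketch of the difference:

```python
import numpy as np

x = np.arange(10)

# A slice is a view: writing through it changes x
view = x[::2]
view[0] = 100
assert x[0] == 100

# An explicit copy owns its own memory, so writes don't propagate back
z = x[::2].copy()
z[0] = -1
assert x[0] == 100  # x is unchanged by writing to the copy
```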

- Fancy indexing, however, produces a **copy**
- Operating on the copy will not affect the original array

In [ ]:
```
a = norm10[[0, 1, 5]]
```

In [ ]:
```
a
```

In [ ]:
```
a[:] = -10
```

In [ ]:
```
a
```

In [ ]:
```
norm10
```

Create the array `[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]` without typing the values by hand. Assign 100 to elements 2 to 5 (zero-indexed). Print the array.

Create the same array as in the step above. Create an array from a slice of elements 2 to 5. Assign 100 to the slice. *Hint*: try `[:]` to address all of the *elements* of an array. Print the original array and the slice.

In [ ]:
```
# [Solution here]
```

In [ ]:
```
%load solutions/copies_vs_views.py
```

- NumPy can create arrays of arbitrary dimensions, and all the methods illustrated in the previous section work with more than one dimension
- For example, a list of lists can be used to initialize a two-dimensional array:

In [ ]:
```
samples_list = [[632, 1638, 569, 115], [433, 1130, 754, 555]]
samples_array = np.array(samples_list)
samples_array.shape
```

In [ ]:
```
print(samples_array)
```

While nested lists must be indexed with repeated `[ ]` operators, multidimensional arrays support a more natural indexing syntax with a single set of brackets and a set of comma-separated indices:

In [ ]:
```
samples_list[0][1]
```

In [ ]:
```
samples_array[0, 1]
```

In [ ]:
```
np.zeros((2, 3))
```

In [ ]:
```
np.random.normal(10, 3, size=(2, 4))
```

Arrays can be **reshaped** at any time, as long as the total number of elements is unchanged. For example, if we want a 2x4 array with numbers increasing from 0, the easiest way to create it is via the array's `reshape` method.

In [ ]:
```
arr = np.arange(8).reshape(2, 4)
arr
```

In [ ]:
```
arr[1, 2:4]
```

In [ ]:
```
arr[:, 2]
```

If you only provide one index, you will get the corresponding row.

In [ ]:
```
arr[1]
```

There are many **properties and methods** that arrays have. The following provide basic information about the size, shape and data in the array:

In [ ]:
```
print('Data type :', samples_array.dtype)
print('Total number of elements :', samples_array.size)
print('Number of dimensions :', samples_array.ndim)
print('Shape (dimensionality) :', samples_array.shape)
print('Memory used (in bytes) :', samples_array.nbytes)
```

Arrays also have many useful methods; some especially useful ones are:

In [ ]:
```
print('Minimum and maximum :', samples_array.min(), samples_array.max())
print('Sum, mean and standard deviation:', samples_array.sum(), samples_array.mean(), samples_array.std())
```

These methods can also operate along a single dimension via the `axis` parameter; for example:

In [ ]:
```
samples_array.sum(axis=0)
```

In [ ]:
```
samples_array.sum(axis=1)
```

- Notice that summing over the rows returned a 1-d array above
- If you want to preserve the dimensions, use the `keepdims` keyword

In [ ]:
```
samples_array.sum(axis=1, keepdims=True)
```

Arrays also have a `.T` attribute, which allows you to access the transpose of the array:

In [ ]:
```
samples_array.T
```

There is a wide variety of methods and properties of arrays:

In [ ]:
```
[attr for attr in dir(samples_array) if not attr.startswith('__')]
```

- Memory is a linear address space
- By adding information on `shape` and `strides`, we can interpret bytes laid out linearly in memory as a multidimensional object

In [ ]:
```
Image('https://ipython-books.github.io/images/layout.png')
```
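The `shape`/`strides` mechanism can be checked on a small array. A sketch (the byte counts below assume an explicit 8-byte integer dtype and the default C-contiguous, row-major layout):

```python
import numpy as np

m = np.arange(6, dtype=np.int64).reshape(2, 3)

# shape counts steps along each axis; strides says how many bytes to
# jump in the flat buffer for one step along that axis.
print(m.shape)    # (2, 3)
print(m.strides)  # (24, 8): one row = 3 * 8 bytes, one column = 8 bytes

# Transposing swaps shape and strides without copying any data
assert m.T.shape == (3, 2)
assert m.T.strides == (8, 24)
```

This is why the transpose is essentially free: NumPy only rewrites the interpretation of the same linear buffer.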

In [ ]:
```
%load solutions/matrix_creation.py
```

In [ ]:
```
sample1 = np.array([632, 1638, 569, 115])
sample2 = np.array([433, 1130, 754, 555])
sample_sum = sample1 + sample2
```

In [ ]:
```
np.array([632, 1638, 569, 115])
```

In [ ]:
```
print('{0} X {1} = {2}'.format(sample1, sample2, sample1 * sample2))
```

Since Python 3.5, you can use the `@` operator to compute the inner product (or matrix multiplication):

In [ ]:
```
print('{0} . {1} = {2}'.format(sample1, sample2, sample1 @ sample2))
```

- Element-wise operations imply that the dimensions of the arrays in each operation must match in size
- NumPy will **broadcast** dimensions when possible
- For example, suppose that you want to add the number 1.5 to each element of `sample1`
- We achieve this by **broadcasting**

In [ ]:
```
sample1 + 1.5
```

In this case, NumPy looked at both operands and saw that the first was a one-dimensional array of length 4 and the second was a scalar, considered a zero-dimensional object. The broadcasting rules allow NumPy to:

- create a new dimension of length 1
- stretch that dimension to match the size of the corresponding dimension in the other operand

So in the above example, the scalar 1.5 is effectively cast to a one-dimensional array of length 1, then stretched to length 4 to match the dimension of `sample1`. After this, element-wise addition can proceed, as both operands are now one-dimensional arrays of length 4.

This broadcasting behavior is powerful, especially because when NumPy broadcasts to create new dimensions or to stretch existing ones, it doesn't actually replicate the data. In the example above, the operation is carried out as if 1.5 were a 1-d array with 1.5 in all of its entries, but no actual array is ever created. This saves memory and improves the performance of operations.

When broadcasting, NumPy compares the sizes of each dimension in each operand. It starts with the trailing dimensions and works backward, creating dimensions as needed to accommodate the operation. Two dimensions are considered compatible when:

- they are equal in size, or
- one is a scalar (or size 1)

If these conditions are not met, an exception is raised, indicating that the arrays have incompatible shapes.
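The compatibility rules above can be sketched directly on a small 2-d array (the shapes here are arbitrary choices for illustration):

```python
import numpy as np

a = np.ones((5, 4))

# Trailing dimensions equal (4 vs 4) -> compatible
assert (a + np.ones(4)).shape == (5, 4)

# Size-1 trailing dimension is stretched to 4 -> compatible
assert (a + np.ones((5, 1))).shape == (5, 4)

# Trailing dimensions 4 vs 3 -> incompatible, NumPy raises ValueError
try:
    a + np.ones(3)
except ValueError as err:
    print('incompatible shapes:', err)
```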

In [ ]:
```
sample1 + np.array([7, 8])
```

In [ ]:
```
b = np.array([10, 20, 30, 40])
bcast_sum = sample1 + b
```

In [ ]:
```
print('{0}\n\n+ {1}\n{2}\n{3}'.format(sample1, b, '-'*21, bcast_sum))
```

In [ ]:
```
c = np.array([-100, 100])
sample1 + c
```

Remember that matching begins at the **trailing** dimensions. Here, `c` would need to have a trailing dimension of 1 for the broadcasting to work. We can augment arrays with dimensions on the fly by indexing with the `np.newaxis` object, which adds an "empty" dimension:

In [ ]:
```
cplus = c[:, np.newaxis]
cplus
```

In [ ]:
```
cplus.shape
```

In [ ]:
```
sample1 + cplus
```

In [ ]:
```
sample1[:, np.newaxis] + c
```

Divide each column of the array:

```
a = np.arange(25).reshape(5, 5)
```

elementwise with the array:

```
b = np.array([1., 5, 10, 15, 20])
```

In [ ]:
```
# [Solution here]
```

In [ ]:
```
%load solutions/broadcasting.py
```

**pandas** is a Python package providing fast, flexible, and expressive data structures designed to work with *relational* or *labeled* data. It is a fundamental high-level building block for doing practical, real-world data analysis in Python.

pandas is well suited for:

- **Tabular** data with heterogeneously-typed columns, as you might find in an SQL table or Excel spreadsheet
- Ordered and unordered (not necessarily fixed-frequency) **time series** data
- Arbitrary **matrix** data with row and column labels

Virtually any statistical dataset, labeled or unlabeled, can be converted to a pandas data structure for cleaning, transformation, and analysis.

Its key features include:

- Easy handling of **missing data**
- **Size mutability**: columns can be inserted and deleted from DataFrame and higher-dimensional objects
- Automatic and explicit **data alignment**: objects can be explicitly aligned to a set of labels, or the data can be aligned automatically
- Powerful, flexible **group by** functionality to perform split-apply-combine operations on data sets
- Intelligent label-based **slicing, fancy indexing, and subsetting** of large data sets
- Intuitive **merging and joining** of data sets
- Flexible **reshaping and pivoting** of data sets
- **Hierarchical labeling** of axes
- Robust **IO tools** for loading data from flat files, Excel files, databases, and HDF5
- **Time series functionality**: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.

In [ ]:
```
import pandas as pd
```

- A pandas `Series` is a generalization of a 1-d NumPy array
- A Series has an *index* that labels each element in the vector
- A `Series` can be thought of as an ordered key-value store

In [ ]:
```
counts = pd.Series([632, 1638, 569, 115])
counts
```

The values are stored as a NumPy array within the `Series`, while the index is a pandas `Index` object.

In [ ]:
```
counts.values
```

pandas provides a labeled index to access the rows:

In [ ]:
```
counts.index
```

We can assign meaningful labels to the index, if they are available:

In [ ]:
```
bacteria = pd.Series([632, 1638, 569, 115],
                     index=['Firmicutes', 'Proteobacteria',
                            'Actinobacteria', 'Bacteroidetes'])
bacteria
```

In [ ]:
```
np.log(bacteria)
```
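Since a `Series` behaves like an ordered key-value store, individual elements can be pulled out by label. A small sketch using the same bacteria counts (redefined here so the snippet is self-contained):

```python
import pandas as pd

bacteria = pd.Series([632, 1638, 569, 115],
                     index=['Firmicutes', 'Proteobacteria',
                            'Actinobacteria', 'Bacteroidetes'])

# Look up a single value by its index label, like a dict
assert bacteria['Proteobacteria'] == 1638

# Or select a subset of labels at once, which returns another Series
subset = bacteria[['Firmicutes', 'Bacteroidetes']]
assert list(subset.values) == [632, 115]
```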

- A Series can also be created from a dict
- The values are returned in key-sorted order

In [ ]:
```
bacteria_dict = {
    'Firmicutes': 632,
    'Proteobacteria': 1638,
    'Actinobacteria': 569,
    'Bacteroidetes': 115
}
pd.Series(bacteria_dict)
```

Inevitably, we want to be able to store, view and manipulate data that is *multivariate*, where for every index there are multiple fields or columns of data, often of varying data type.

A `DataFrame` is a tabular data structure, encapsulating multiple Series, like columns in a spreadsheet.

In [ ]:
```
data = pd.DataFrame({'value': [632, 1638, 569, 115, 433, 1130, 754, 555],
                     'patient': [1, 1, 1, 1, 2, 2, 2, 2],
                     'phylum': ['Firmicutes', 'Proteobacteria', 'Actinobacteria',
                                'Bacteroidetes', 'Firmicutes', 'Proteobacteria',
                                'Actinobacteria', 'Bacteroidetes']})
data
```

- We often want to peek at the first few rows of a DataFrame
- You can use `head` to do this

In [ ]:
```
data.head()
```

A DataFrame also has a second index that represents the labeled columns:

In [ ]:
```
data.columns
```
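Because a DataFrame encapsulates one Series per column, a single column can be pulled out with dict-style indexing. A minimal sketch on a smaller frame than the one above:

```python
import pandas as pd

data = pd.DataFrame({'value': [632, 1638, 569, 115],
                     'patient': [1, 1, 1, 1]})

# Selecting one column by name returns a Series that shares the row index
values = data['value']
assert isinstance(values, pd.Series)
assert values.tolist() == [632, 1638, 569, 115]
```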

- pandas provides sophisticated I/O functionality
- `read_csv` is a highly optimized CSV reader

In [ ]:
```
vessels = pd.read_csv("../data/AIS/vessel_information.csv")
vessels.head()
```

Read a single file, `../data/NationalFoodSurvey/NFS_1974.csv`:

In [ ]:
```
%load solutions/read_nfs_1974.py
```