Handling large data is easy in Python. The simplest way is to use plain Python arrays, but they are pretty slow. NumPy and Pandas are two great libraries for dealing with datasets. NumPy is used for homogeneous n-dimensional data (matrices). Pandas is used for heterogeneous tables (CSV, MS Excel tables) and is internally based on NumPy. See http://scipy-lectures.github.io/ for a more detailed lesson.
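As a minimal sketch of this distinction (the values below are illustrative only): a NumPy array holds a single dtype for all elements, while a Pandas DataFrame can hold a different dtype per column.
In [ ]:
import numpy as np
import pandas as pd

# homogeneous: every element shares one dtype
a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
print(a.dtype)       # float64

# heterogeneous: each column has its own dtype
df = pd.DataFrame({'model': ['A', 'B'],    # strings
                   'mpg': [21.0, 22.8]})   # floats
print(df.dtypes)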
In [16]:
import numpy as np
In [17]:
# Generating a random array
X = np.random.random((3, 5)) # a 3 x 5 array
print(X)
In [18]:
# get a single element
X[0, 0]
Out[18]:
In [19]:
# get a row
X[1]
Out[19]:
In [20]:
# get a column
X[:, 1]
Out[20]:
In [21]:
# Transposing an array
X.T
Out[21]:
In [22]:
print(X.shape)
print(X.reshape(5, 3)) # returns a reshaped 5 x 3 array; X itself is unchanged
In [23]:
# indexing by an array of integers (fancy indexing)
indices = np.array([3, 1, 0])
print(indices)
X[:, indices]
Out[23]:
In [24]:
X
Out[24]:
In [25]:
X.shape
Out[25]:
In [26]:
np.sum(X, axis=1) # axis=1 sums across the columns, giving one value per row
Out[26]:
In [27]:
np.max(X, axis=0) # axis=0 takes the maximum across the rows, giving one value per column
Out[27]:
based on http://pandas.pydata.org/pandas-docs/stable/10min.html
In [28]:
import numpy as np
import pandas as pd
In [29]:
#use a standard dataset of heterogeneous data
cars = pd.read_csv('data/mtcars.csv')
cars.head()
Out[29]:
In [30]:
#list all columns
cars.columns
Out[30]:
In [31]:
#we want to use the car name as the "primary key" (index) of each row
cars.index = cars.pop('car')
cars.head()
Out[31]:
In [32]:
#describe our dataset
cars.describe()
Out[32]:
In [33]:
cars.sort_index(inplace=True)
cars.head()
Out[33]:
In [34]:
cars.sort_values('mpg').head(15)
Out[34]:
In [35]:
cars.sort_values('hp', ascending=False).head()
Out[35]:
Note: While many of the NumPy access methods work on DataFrames, prefer the pandas-specific data access methods .at, .iat, .loc and .iloc (the older .ix accessor is deprecated).
See the pandas Indexing documentation and the examples below.
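The cells below use .loc and .iloc; as a small additional sketch (not one of the original cells), .at and .iat provide the same label-based and position-based access for single scalar values:
In [ ]:
# scalar access: .at by labels, .iat by integer positions
print(cars.at['Fiat 128', 'mpg'])   # row label + column label
print(cars.iat[3, 0])               # row 3, column 0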
In [36]:
#single column
cars['mpg']
#if the column name is a valid Python identifier, cars.mpg works as well
Out[36]:
In [37]:
#or a slice of rows
cars[2:5]
Out[37]:
In [38]:
#by label = primary key
cars.loc['Fiat 128':'Lotus Europa']
Out[38]:
In [39]:
#selection by position
cars.iloc[3]
Out[39]:
In [40]:
cars.iloc[3:5, 0:2]
Out[40]:
In [41]:
cars[cars.cyl > 6] # more than 6 cylinders
Out[41]:
In [42]:
cars_na = pd.read_csv('data/mtcars_with_nas.csv')
In [43]:
cars_na.isnull().head(4)
Out[43]:
In [44]:
#fill with a default value
cars_na.fillna(0).head(4)
Out[44]:
In [45]:
#or drop the rows
print(cars_na.shape)
#drop rows with na values
print(cars_na.dropna().shape)
#drop columns with na values
print(cars_na.dropna(axis=1).shape)
#see also http://pandas.pydata.org/pandas-docs/stable/missing_data.html
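As a further sketch (not part of the original notebook), fillna also accepts per-column fill values, for example each column's mean:
In [ ]:
# fill NaNs in each numeric column with that column's mean
cars_na.fillna(cars_na.mean(numeric_only=True)).head(4)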
In [46]:
#stats
cars.mean()
Out[46]:
In [47]:
cars.mean(axis=1)
Out[47]:
In [48]:
#grouping
cars.groupby('cyl').mean()
Out[48]:
In [49]:
#grouping with a different aggregation method per column
cars.groupby('cyl').agg({ 'mpg': 'mean', 'qsec': 'min'})
Out[49]:
In [10]:
#loading gapminder data (taken from https://github.com/jennybc/gapminder)
# the file is located at 'data/gapminder-unfiltered.tsv' and uses the tab character as separator
# use the first column as index
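A possible solution sketch; it assumes the path given above, a tab separator, and the standard gapminder columns (country, continent, year, lifeExp, pop, gdpPercap). The variable name gapminder is chosen here and reused in the sketches below.
In [ ]:
gapminder = pd.read_csv('data/gapminder-unfiltered.tsv', sep='\t', index_col=0)
gapminder.head()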
In [9]:
#what are the columns of this dataset?
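Sketch, continuing with the gapminder frame from above:
In [ ]:
gapminder.columns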
In [8]:
#what is the maximal year contained?
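Sketch, using the assumed 'year' column:
In [ ]:
gapminder['year'].max()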
In [4]:
#just select all data of the year 2007
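Sketch, filtering with a boolean condition just like cars[cars.cyl > 6] above:
In [ ]:
gapminder[gapminder.year == 2007]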
In [7]:
#locate Austria and print it
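Sketch, assuming the first column (and therefore the index) holds the country names:
In [ ]:
print(gapminder.loc['Austria'])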
In [6]:
#list the top 10 countries by life expectancy (lifeExp)
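Sketch, mirroring the sort_values calls used for the cars data; note that in the unfiltered data each row is a country/year combination, so this lists the ten rows with the highest lifeExp:
In [ ]:
gapminder.sort_values('lifeExp', ascending=False).head(10)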
In [1]:
#what is the total population (pop) per continent
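Sketch, grouping as in the cars examples; this sums over every year present in the data, so filter to a single year first if a snapshot is wanted:
In [ ]:
gapminder.groupby('continent')['pop'].sum()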