Open Machine Learning Course

Authors: Yury Kashnitsky, research programmer at Mail.ru Group and senior lecturer at the HSE Faculty of Computer Science, and Ekaterina Demidova, Data Scientist at Segmento. This material is distributed under the Creative Commons CC BY-NC-SA 4.0 license. It may be used for any purpose (edited, corrected, or taken as a basis for other work) except commercial ones, provided that the author of the material is credited.

Topic 1. Exploratory Data Analysis with Pandas

Part 0. Working with Vectors in the NumPy Library

NumPy is a Python library for computationally efficient operations on multidimensional arrays, intended mainly for scientific computing.



In [1]:

    
# Python 2 and 3 compatibility
from __future__ import (absolute_import, division,
                        print_function, unicode_literals)
# suppress Anaconda warnings
import warnings
warnings.simplefilter('ignore')
import numpy as np



In [2]:

    
a = np.array([0, 1, 2, 3])
a









    Out[2]:





array([0, 1, 2, 3])

Such an array can hold:

values of physical quantities at different moments in time during a simulation
values of a signal measured by an instrument
pixel intensities
3-D coordinates of objects obtained, for example, from an MRI scan
...

Why NumPy: efficiency of basic operations



In [3]:

    
L = range(1000)



In [4]:

    
%timeit [i**2 for i in L]









    



1.02 ms ± 24.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)



In [5]:

    
a = np.arange(1000)



In [6]:

    
%timeit a**2









    



8.65 µs ± 379 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
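
Outside IPython, roughly the same comparison can be sketched with the standard timeit module. The helper names python_squares and numpy_squares are hypothetical, and the absolute timings depend on the machine, so no numbers are claimed here.

```python
import timeit

import numpy as np

L = range(1000)
a = np.arange(1000)


def python_squares():
    # pure-Python loop: one interpreter-level operation per element
    return [i**2 for i in L]


def numpy_squares():
    # vectorized: the loop runs inside NumPy's compiled code
    return a**2


# both approaches compute the same values
assert python_squares() == numpy_squares().tolist()

t_list = timeit.timeit(python_squares, number=1000)
t_numpy = timeit.timeit(numpy_squares, number=1000)
print("list comprehension: %.4f s per 1000 runs" % t_list)
print("NumPy array:        %.4f s per 1000 runs" % t_numpy)
```

On typical hardware the NumPy version should come out considerably faster, in line with the %timeit results above.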

Interactive help



In [7]:

    
?np.array

Searching the documentation



In [8]:

    
np.lookfor('create array')









    



Search results for 'create array'
---------------------------------
numpy.array
    Create an array.
numpy.memmap
    Create a memory-map to an array stored in a *binary* file on disk.
numpy.diagflat
    Create a two-dimensional array with the flattened input as a diagonal.
numpy.fromiter
    Create a new 1-dimensional array from an iterable object.
numpy.partition
    Return a partitioned copy of an array.
numpy.ctypeslib.as_array
    Create a numpy array from a ctypes array or a ctypes POINTER.
numpy.ma.diagflat
    Create a two-dimensional array with the flattened input as a diagonal.
numpy.ma.make_mask
    Create a boolean mask from an array.
numpy.ctypeslib.as_ctypes
    Create and return a ctypes object from a numpy array.  Actually
numpy.ma.mrecords.fromarrays
    Creates a mrecarray from a (flat) list of masked arrays.
numpy.ma.mvoid.__new__
    Create a new masked array from scratch.
numpy.lib.format.open_memmap
    Open a .npy file as a memory-mapped array.
numpy.ma.MaskedArray.__new__
    Create a new masked array from scratch.
numpy.lib.arrayterator.Arrayterator
    Buffered iterator for big arrays.
numpy.ma.mrecords.fromtextfile
    Creates a mrecarray from data stored in the file `filename`.
numpy.asarray
    Convert the input to an array.
numpy.ndarray
    ndarray(shape, dtype=float, buffer=None, offset=0,
numpy.recarray
    Construct an ndarray that allows field access using attributes.
numpy.chararray
    chararray(shape, itemsize=1, unicode=False, buffer=None, offset=0,
numpy.pad
    Pads an array.
numpy.asanyarray
    Convert the input to an ndarray, but pass ndarray subclasses through.
numpy.copy
    Return an array copy of the given object.
numpy.diag
    Extract a diagonal or construct a diagonal array.
numpy.load
    Load arrays or pickled objects from ``.npy``, ``.npz`` or pickled files.
numpy.sort
    Return a sorted copy of an array.
numpy.array_equiv
    Returns True if input arrays are shape consistent and all elements equal.
numpy.dtype
    Create a data type object.
numpy.choose
    Construct an array from an index array and a set of arrays to choose from.
numpy.nditer
    Efficient multi-dimensional iterator object to iterate over arrays.
numpy.swapaxes
    Interchange two axes of an array.
numpy.full_like
    Return a full array with the same shape and type as a given array.
numpy.ones_like
    Return an array of ones with the same shape and type as a given array.
numpy.ma.mrecords.MaskedRecords.__new__
    Create a new masked array from scratch.
numpy.empty_like
    Return a new array with the same shape and type as a given array.
numpy.nan_to_num
    Replace nan with zero and inf with finite numbers.
numpy.zeros_like
    Return an array of zeros with the same shape and type as a given array.
numpy.asarray_chkfinite
    Convert the input to an array, checking for NaNs or Infs.
numpy.diag_indices
    Return the indices to access the main diagonal of an array.
numpy.chararray.tolist
    a.tolist()
numpy.ma.choose
    Use an index array to construct a new array from a set of choices.
numpy.savez_compressed
    Save several arrays into a single file in compressed ``.npz`` format.
numpy.matlib.rand
    Return a matrix of random values with given shape.
numpy.ma.empty_like
    Return a new array with the same shape and type as a given array.
numpy.ma.make_mask_none
    Return a boolean mask of the given shape, filled with False.
numpy.ma.mrecords.fromrecords
    Creates a MaskedRecords from a list of records.
numpy.around
    Evenly round to the given number of decimals.
numpy.source
    Print or write to a file the source code for a NumPy object.
numpy.diagonal
    Return specified diagonals.
numpy.einsum_path
    Evaluates the lowest cost contraction order for an einsum expression by
numpy.histogram2d
    Compute the bi-dimensional histogram of two data samples.
numpy.fft.ifft
    Compute the one-dimensional inverse discrete Fourier Transform.
numpy.fft.ifftn
    Compute the N-dimensional inverse discrete Fourier Transform.
numpy.busdaycalendar
    A business day calendar object that efficiently stores information



In [9]:

    
np.con*?

The library is conventionally imported like this:



In [10]:

    
import numpy as np

Creating arrays

1-D:



In [11]:

    
a = np.array([0, 1, 2, 3])
a









    Out[11]:





array([0, 1, 2, 3])



In [12]:

    
a.ndim









    Out[12]:





1



In [13]:

    
a.shape









    Out[13]:





(4,)



In [14]:

    
len(a)









    Out[14]:





4

2-D, 3-D, ...:



In [15]:

    
b = np.array([[0, 1, 2], [3, 4, 5]])    # 2 x 3 array
b









    Out[15]:





array([[0, 1, 2],
       [3, 4, 5]])



In [16]:

    
b.ndim









    Out[16]:





2



In [17]:

    
b.shape









    Out[17]:





(2, 3)



In [18]:

    
len(b)     # returns the size of the first dimension









    Out[18]:





2



In [19]:

    
c = np.array([[[1], [2]], [[3], [4]]])
c









    Out[19]:





array([[[1],
        [2]],

       [[3],
        [4]]])



In [20]:

    
c.shape









    Out[20]:





(2, 2, 1)
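
The attributes shown above are related to each other. The following is a minimal sketch of that relationship, reusing the three-dimensional array c from the cell above; the size attribute is not used elsewhere in this notebook and is added here only for comparison with len().

```python
import numpy as np

c = np.array([[[1], [2]], [[3], [4]]])

# ndim is the number of axes, i.e. the length of the shape tuple
assert c.ndim == len(c.shape) == 3

# len() reports only the first axis; size counts every element
print(len(c), c.size)   # 2 4
```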

Functions for creating arrays

In practice, we rarely enter elements one at a time.

Evenly spaced elements:



In [21]:

    
a = np.arange(10) # 0 .. n-1  (!)
a









    Out[21]:





array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])



In [22]:

    
b = np.arange(1, 9, 2) # start, end (exclusive), step
b









    Out[22]:





array([1, 3, 5, 7])

By number of points:



In [23]:

    
c = np.linspace(0, 1, 6)   # start, end, num-points
c









    Out[23]:





array([ 0. ,  0.2,  0.4,  0.6,  0.8,  1. ])



In [24]:

    
d = np.linspace(0, 1, 5, endpoint=False)
d









    Out[24]:





array([ 0. ,  0.2,  0.4,  0.6,  0.8])
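
To see how the number of points and the endpoint flag determine the spacing, np.linspace can also return the step it used. This is a small sketch under the same arguments as the two cells above; the variable names are chosen here for illustration only.

```python
import numpy as np

# retstep=True additionally returns the spacing between points
points, step = np.linspace(0, 1, 6, retstep=True)
# 6 points including both ends -> 5 intervals

half_open, step_open = np.linspace(0, 1, 5, endpoint=False, retstep=True)
# 5 points with the right end excluded -> also 5 intervals

print(step, step_open)
```

Both calls divide the interval [0, 1] into five equal parts; they differ only in whether the right end is included.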

Commonly used arrays:



In [25]:

    
a = np.ones((3, 3))  # reminder: (3, 3) is a tuple
a









    Out[25]:





array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])



In [26]:

    
b = np.zeros((2, 2))
b









    Out[26]:





array([[ 0.,  0.],
       [ 0.,  0.]])



In [27]:

    
c = np.eye(3)
c









    Out[27]:





array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])



In [28]:

    
d = np.diag(np.array([1, 2, 3, 4]))
d









    Out[28]:





array([[1, 0, 0, 0],
       [0, 2, 0, 0],
       [0, 0, 3, 0],
       [0, 0, 0, 4]])

np.random: random number generation (Mersenne Twister PRNG):



In [29]:

    
a = np.random.rand(4)       # uniform in [0, 1)
a









    Out[29]:





array([ 0.83865024,  0.10513654,  0.75886643,  0.63323799])



In [30]:

    
b = np.random.randn(4)      # Gaussian
b









    Out[30]:





array([-2.24426932,  2.2036515 ,  0.01805773,  0.30863579])



In [31]:

    
np.random.seed(1234)        # Setting the random seed
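
A minimal sketch of why the seed is set: resetting the generator to the same seed reproduces the same sequence. The variable names first and second are illustrative.

```python
import numpy as np

np.random.seed(1234)
first = np.random.rand(3)

np.random.seed(1234)        # reset the generator to the same state
second = np.random.rand(3)

# identical seeds give identical sequences
assert np.array_equal(first, second)
```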

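Basic NumPy data types

A trailing dot after a number (as in `1.`) marks it as a floating-point value, which gives the float64 data type; integers written without the dot produce an integer data type.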



In [32]:

    
a = np.array([1, 2, 3])
a.dtype









    Out[32]:





dtype('int64')



In [33]:

    
b = np.array([1., 2., 3.])
b.dtype









    Out[33]:





dtype('float64')

The data type can be set explicitly. Array-creation functions such as np.ones produce float64 by default.



In [34]:

    
c = np.array([1, 2, 3], dtype=float)
c.dtype









    Out[34]:





dtype('float64')



In [35]:

    
a = np.ones((3, 3))
a.dtype









    Out[35]:





dtype('float64')
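
Because every array has a single fixed data type, values assigned into it are converted to that type. The sketch below illustrates this with standard NumPy calls (astype and the dtype argument); it is an illustration, not part of the original notebook.

```python
import numpy as np

a = np.array([1, 2, 3])          # integer dtype inferred from the input
a[0] = 1.9                       # the float is truncated to fit the integer array
print(a)                         # [1 2 3]

b = a.astype(float)              # astype returns a converted copy
print(b.dtype)                   # float64

c = np.zeros(3, dtype=int)       # override the float64 default explicitly
```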

Other data types:

Complex numbers



In [36]:

    
d = np.array([1+2j, 3+4j, 5+6*1j])
d.dtype









    Out[36]:





dtype('complex128')

Bool



In [37]:

    
e = np.array([True, False, False, True])
e.dtype









    Out[37]:





dtype('bool')

Strings

Memory for strings is allocated "greedily", according to the largest number of characters in any string. In this example 7 characters are allocated for each string, and the data type is '<U7' (Unicode strings of at most 7 characters).



In [38]:

    
f = np.array(['Bonjour', 'Hello', 'Hallo',])
f.dtype     # <--- strings containing max. 7 letters









    Out[38]:





dtype('<U7')
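
One practical consequence of the fixed width is that a longer string assigned into such an array is silently cut off. A minimal sketch, reusing the array f from the cell above:

```python
import numpy as np

f = np.array(['Bonjour', 'Hello', 'Hallo'])
f[1] = 'Guten Morgen'     # 12 characters, but each slot holds only 7
print(f[1])               # Guten M
```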

Visualization basics

$ ipython notebook --pylab=inline

Or from within the notebook:



In [39]:

    
%pylab inline









    



Populating the interactive namespace from numpy and matplotlib

The inline parameter tells the IPython server that results should be displayed in the notebook itself rather than in a new window.

Import Matplotlib



In [40]:

    
import matplotlib.pyplot as plt  # the tidy way



In [41]:

    
x = np.linspace(0, 3, 20)
y = np.linspace(0, 9, 20)
plt.plot(x, y)       # line plot    
plt.show()           # <-- shows the plot (not needed with pylab)

Or using pylab:



In [42]:

    
plot(x, y)       # line plot









    Out[42]:





[<matplotlib.lines.Line2D at 0x7fd283ae14e0>]

Using import matplotlib.pyplot as plt is recommended for scripts, and pylab for IPython notebooks. (The current Matplotlib documentation discourages pylab and recommends the explicit import everywhere.)

Plotting one-dimensional arrays:



In [43]:

    
x = np.linspace(0, 3, 20)
y = np.linspace(0, 9, 20)
plt.plot(x, y)       # line plot









    Out[43]:





[<matplotlib.lines.Line2D at 0x7fd283a4c438>]



In [44]:

    
plt.plot(x, y, 'o')  # dot plot









    Out[44]:





[<matplotlib.lines.Line2D at 0x7fd283a2cf60>]

Plotting two-dimensional arrays (for example, images):



In [45]:

    
image = np.random.rand(30, 30)
plt.imshow(image, cmap=plt.cm.hot)    
plt.colorbar()









    Out[45]:





<matplotlib.colorbar.Colorbar at 0x7fd283963da0>

Indexing and slicing arrays

On the whole, this works the same way as with built-in Python sequences (lists, for example).



In [46]:

    
a = np.arange(10)
a









    Out[46]:





array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])



In [47]:

    
a[0], a[2], a[-1]









    Out[47]:





(0, 2, 9)

The popular Python idiom for reversing a sequence works as well:



In [48]:

    
a[::-1]









    Out[48]:





array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])

For multidimensional arrays, indices are tuples of integers:



In [49]:

    
a = np.diag(np.arange(3))
a









    Out[49]:





array([[0, 0, 0],
       [0, 1, 0],
       [0, 0, 2]])



In [50]:

    
a[1, 1]









    Out[50]:





1



In [51]:

    
a[2, 1] = 10 # third row, second column
a









    Out[51]:





array([[ 0,  0,  0],
       [ 0,  1,  0],
       [ 0, 10,  2]])



In [52]:

    
a[1]









    Out[52]:





array([0, 1, 0])
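
The tuple index can mix integers with a colon to pick a whole row or column. The following sketch rebuilds the array from the cells above; the names row and column are illustrative.

```python
import numpy as np

a = np.diag(np.arange(3))
a[2, 1] = 10

row = a[2]          # third row: [0, 10, 2]
column = a[:, 1]    # second column: [0, 1, 10]

# a[2, 1] and a[2][1] address the same element
assert a[2, 1] == a[2][1] == 10
```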

Slices



In [53]:

    
a = np.arange(10)
a









    Out[53]:





array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])



In [54]:

    
a[2:9:3] # [start:end:step]









    Out[54]:





array([2, 5, 8])

The end index is not included:



In [55]:

    
a[:4]









    Out[55]:





array([0, 1, 2, 3])

By default, `start` is 0, `end` is the length of the array (one past the index of the last element), and `step` is 1:



In [56]:

    
a[1:3]









    Out[56]:





array([1, 2])



In [57]:

    
a[::2]









    Out[57]:





array([0, 2, 4, 6, 8])



In [58]:

    
a[3:]









    Out[58]:





array([3, 4, 5, 6, 7, 8, 9])
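
The slice defaults can be checked by writing the omitted parts out explicitly. This is a small sketch using the same array; the built-in slice object is used only to spell out the start, end, step triple.

```python
import numpy as np

a = np.arange(10)
n = len(a)

assert np.array_equal(a[:4], a[0:4:1])      # omitted start -> 0, omitted step -> 1
assert np.array_equal(a[3:], a[3:n])        # omitted end -> len(a)
assert np.array_equal(a[::2], a[0:n:2])     # only the step given

# the built-in slice object spells out the same triple
s = slice(2, 9, 3)
print(a[s])                                  # [2 5 8]
```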

Assignment and slicing can be combined:



In [59]:

    
a = np.arange(10)
a[5:] = 10
a









    Out[59]:





array([ 0,  1,  2,  3,  4, 10, 10, 10, 10, 10])



In [60]:

    
b = np.arange(5)
a[5:] = b[::-1]
a









    Out[60]:





array([0, 1, 2, 3, 4, 4, 3, 2, 1, 0])
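
In both cells above the right-hand side either is a single value or has exactly as many elements as the slice. A sketch of what happens otherwise, with error_message as an illustrative variable name:

```python
import numpy as np

a = np.arange(10)
a[5:] = 10                    # a scalar fills all five positions

try:
    a[5:] = np.arange(3)      # 3 values cannot fill 5 positions
except ValueError as err:
    error_message = str(err)
    print("assignment failed:", error_message)
```

The failed assignment leaves the array unchanged.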

Example. Divisor matrix

Display a matrix in which the cell (x, y) is crossed out if y is divisible by x.



In [61]:

    
from IPython.display import Image
Image(filename='../img/prime-sieve.png')









    Out[61]:

Create an array is_prime filled with True values:



In [62]:

    
is_prime = np.ones((100,), dtype=bool)

Cross out 0 and 1, which are not prime:



In [63]:

    
is_prime[:2] = 0

For each natural number j starting from 2, "cross out" its multiples:



In [64]:

    
# the largest divisor to try is floor(sqrt(largest index)), inclusive
N_max = int(np.sqrt(len(is_prime) - 1))
for j in range(2, N_max + 1):
    is_prime[2*j::j] = False
    
is_prime









    Out[64]:





array([False, False,  True,  True, False,  True, False,  True, False,
       False, False,  True, False,  True, False, False, False,  True,
       False,  True, False, False, False,  True, False, False, False,
       False, False,  True, False,  True, False, False, False, False,
       False,  True, False, False, False,  True, False,  True, False,
       False, False,  True, False, False, False, False, False,  True,
       False, False, False, False, False,  True, False,  True, False,
       False, False, False, False,  True, False, False, False,  True,
       False,  True, False, False, False, False, False,  True, False,
       False, False,  True, False, False, False, False, False,  True,
       False, False, False, False, False, False, False,  True, False, False], dtype=bool)
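
The three steps above can be gathered into one function that returns the primes themselves rather than the Boolean mask. This is a sketch: primes_below is a hypothetical helper name, and np.nonzero is used to turn the mask into indices.

```python
import numpy as np


def primes_below(n):
    """Return all primes smaller than n (hypothetical helper, n >= 2)."""
    is_prime = np.ones((n,), dtype=bool)
    is_prime[:2] = False                      # 0 and 1 are not prime
    # candidate divisors up to floor(sqrt(n - 1)), inclusive
    for j in range(2, int(np.sqrt(n - 1)) + 1):
        is_prime[2*j::j] = False              # cross out 2j, 3j, 4j, ...
    return np.nonzero(is_prime)[0]            # indices that are still True


print(primes_below(30))   # [ 2  3  5  7 11 13 17 19 23 29]
```

The upper bound of the loop includes the square root itself, so perfect squares such as 121 are crossed out as well.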

Indexing with masks



In [65]:

    
np.random.seed(3)
a = np.random.randint(0, 21, 15)  # integers from 0 to 20 inclusive (random_integers is deprecated)
a









    Out[65]:





array([10,  3,  8,  0, 19, 10, 11,  9, 10,  6,  0, 20, 12,  7, 14])



In [66]:

    
(a % 3 == 0)









    Out[66]:





array([False,  True, False,  True, False, False, False,  True, False,
        True,  True, False,  True, False, False], dtype=bool)



In [67]:

    
mask = (a % 3 == 0)
extract_from_a = a[mask] # or,  a[a%3==0]
extract_from_a           # extract a sub-array with the mask









    Out[67]:





array([ 3,  0,  9,  6,  0, 12])

Indexing with a mask can be very useful for assigning values to a subset of an array's elements:



In [68]:

    
a[a % 3 == 0] = -1
a









    Out[68]:





array([10, -1,  8, -1, 19, 10, 11, -1, 10, -1, -1, 20, -1,  7, 14])
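
Several conditions can be combined into one mask with the element-wise operators & and |. The sketch below starts from the values generated above, written out literally so that it does not depend on the random generator.

```python
import numpy as np

a = np.array([10, 3, 8, 0, 19, 10, 11, 9, 10, 6, 0, 20, 12, 7, 14])

# element-wise AND / OR; the parentheses around each condition are required
both = a[(a % 3 == 0) & (a > 5)]
either = a[(a < 3) | (a > 15)]
print(both)      # [ 9  6 12]
print(either)    # [ 0 19  0 20]
```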

Indexing with an array of integers



In [69]:

    
a = np.arange(0, 100, 10)
a









    Out[69]:





array([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90])



In [70]:

    
a[[2, 3, 2, 4, 2]]  # note: [2, 3, 2, 4, 2] is a Python list









    Out[70]:





array([20, 30, 20, 40, 20])



In [71]:

    
a[[9, 7]] = -100
a









    Out[71]:





array([   0,   10,   20,   30,   40,   50,   60, -100,   80, -100])



In [72]:

    
a = np.arange(10)
idx = np.array([[3, 4], [9, 7]])
idx.shape









    Out[72]:





(2, 2)



In [73]:

    
a[idx]









    Out[73]:





array([[3, 4],
       [9, 7]])
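
As the last cells suggest, the result of indexing with an integer array takes the shape of the index array. A minimal sketch with a different source array, so that values and positions are easy to tell apart; picked is an illustrative name.

```python
import numpy as np

a = np.arange(0, 100, 10)
idx = np.array([[3, 4], [9, 7]])

picked = a[idx]
# the result takes the shape of the index array, not of a
assert picked.shape == idx.shape
print(picked)
# [[30 40]
#  [90 70]]
```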