Recommenders have been around since at least 1992. Today we see different flavours of recommenders, deployed across different verticals:
What exactly do they do?
In a typical recommender system people provide recommendations as inputs, which the system then aggregates and directs to appropriate recipients. -- Resnick and Varian, 1997
Collaborative filtering simply means that people collaborate to help one another perform filtering by recording their reactions to documents they read. -- Goldberg et al, 1992
In its most common formulation, the recommendation problem is reduced to the problem of estimating ratings for the items that have not been seen by a user. Intuitively, this estimation is usually based on the ratings given by this user to other items and on some other information [...] Once we can estimate ratings for the yet unrated items, we can recommend to the user the item(s) with the highest estimated rating(s). -- Adomavicius and Tuzhilin, 2005
Driven by computer algorithms, recommenders help consumers by selecting products they will probably like and might buy based on their browsing, searches, purchases, and preferences. -- Konstan and Riedl, 2012
The recommendation problem in its most basic form is quite simple to define:
| user_id \ movie_id | m_1 | m_2 | m_3 | m_4 | m_5 |
|--------------------+-----+-----+-----+-----+-----|
| u_1                |  ?  |  ?  |  4  |  ?  |  1  |
| u_2                |  3  |  ?  |  ?  |  2  |  2  |
| u_3                |  3  |  ?  |  ?  |  ?  |  ?  |
| u_4                |  ?  |  1  |  2  |  1  |  1  |
| u_5                |  ?  |  ?  |  ?  |  ?  |  ?  |
| u_6                |  2  |  ?  |  2  |  ?  |  ?  |
| u_7                |  ?  |  ?  |  ?  |  ?  |  ?  |
| u_8                |  3  |  1  |  5  |  ?  |  ?  |
| u_9                |  ?  |  ?  |  ?  |  ?  |  2  |
Given a partially filled $|U| \times |I|$ matrix of ratings, estimate the missing values.
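As a minimal sketch of the same setup in code (a toy array mirroring the table above, not one of the tutorial's numbered cells), the matrix can be represented with np.nan marking the unobserved ratings:

import numpy as np
# the toy |U| x |I| matrix from the table above; np.nan marks missing ratings
ratings = np.array([[np.nan, np.nan,      4, np.nan,      1],
                    [     3, np.nan, np.nan,      2,      2],
                    [     3, np.nan, np.nan, np.nan, np.nan],
                    [np.nan,      1,      2,      1,      1],
                    [np.nan, np.nan, np.nan, np.nan, np.nan],
                    [     2, np.nan,      2, np.nan, np.nan],
                    [np.nan, np.nan, np.nan, np.nan, np.nan],
                    [     3,      1,      5, np.nan, np.nan],
                    [np.nan, np.nan, np.nan, np.nan,      2]])
# a recommender fills in the NaNs and then suggests, per user,
# the items with the highest estimated ratings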
The literature has lots of examples of systems that try to combine the strengths of the two main approaches. This can be done in a number of ways:
Content-based techniques are limited by the amount of metadata that is available to describe an item. There are domains in which feature extraction methods are expensive or time consuming, e.g., processing multimedia data such as graphics, audio/video streams. In the context of grocery items for example, it's often the case that item information is only partial or completely missing. Examples include:
A user has to have rated a sufficient number of items before a recommender system can have a good idea of what their preferences are. In a content-based system, the aggregation function needs ratings to aggregate.
Collaborative filters rely on an item being rated by many users to compute aggregates of those ratings. Think of this as the exact counterpart of the new user problem for content-based systems.
When looking at the more general versions of content-based and collaborative systems, the success of the recommender depends on the availability of a critical mass of user/item interactions. We get a first glance at the data sparsity problem by quantifying the ratio of existing ratings to $|U| \times |I|$. A highly sparse matrix of interactions makes it difficult to compute similarities between users and items. As an example, for a user whose tastes are unusual compared to the rest of the population, there will not be any other users who are particularly similar, leading to poor recommendations.
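As a back-of-the-envelope sketch (the `ratings` DataFrame named here is hypothetical; the actual MovieLens frame is only built later in the tutorial), the density and sparsity of the interaction matrix can be quantified like so:

# ratings: one row per observed (user_id, movie_id, rating) triple -- hypothetical frame
n_users = ratings.user_id.nunique()
n_items = ratings.movie_id.nunique()
density = float(len(ratings)) / (n_users * n_items)
print 'density: %.4f  sparsity: %.4f' % (density, 1 - density)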
We've put this together from our experience and a number of sources; please check the references at the bottom of this document.
The goal of this tutorial is to provide you with a hands-on overview of two of the main libraries from the scientific and data analysis communities. We're going to use:
MovieLens from GroupLens Research: grouplens.org
The MovieLens 1M data set contains 1 million ratings collected from 6000 users on 4000 movies.
In [1]:
from IPython.core.display import Image
Image(filename='./pycon_reco_flow.png')
Out[1]:
NumPy is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.
In [2]:
import numpy as np
# set some print options
np.set_printoptions(precision=4)
np.set_printoptions(threshold=5)
np.set_printoptions(suppress=True)
# init random gen
np.random.seed(2)
Think of ndarrays as the building blocks for pydata: a multidimensional array object that acts as a container for data to be passed between algorithms. Libraries written in a lower-level language, such as C or Fortran, can also operate on the data stored in a NumPy array without copying any of it.
In [3]:
import numpy as np
# build an array using the array function
arr = np.array([0, 9, 5, 4, 3])
arr
Out[3]:
In [4]:
np.zeros(4)
Out[4]:
In [5]:
np.ones(4)
Out[5]:
In [6]:
np.empty(4)
Out[6]:
In [7]:
np.arange(4)
Out[7]:
In [8]:
arr = np.random.randn(5)
arr
Out[8]:
In [9]:
arr.dtype
Out[9]:
In [10]:
arr.shape
Out[10]:
In [11]:
# you can be explicit about the data type that you want
np.empty(4, dtype=np.int32)
Out[11]:
In [12]:
np.array(['numpy','pandas','pytables'], dtype=np.string_)
Out[12]:
In [13]:
float_arr = np.array([4.4, 5.52425, -0.1234, 98.1], dtype=np.float64)
# truncate the decimal part
float_arr.astype(np.int32)
Out[13]:
In [113]:
arr = np.array([0, 9, 1, 4, 64])
arr[3]
Out[113]:
In [114]:
arr[1:3]
Out[114]:
In [116]:
arr[:2]
Out[116]:
In [115]:
# set the last two elements to 55
arr[-2:] = 55
arr
Out[115]:
A good way to think about indexing in multidimensional arrays is that you are moving along the values of the shape property. So, a 4-D array arr_4d with a shape of (w,x,y,z) will result in indexed views such that:
arr_4d[i].shape == (x,y,z)
arr_4d[i,j].shape == (y,z)
arr_4d[i,j,k].shape == (z,)
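A quick throwaway check of this (not one of the tutorial's numbered cells):

arr_4d = np.zeros((2, 3, 4, 5))   # shape (w, x, y, z)
arr_4d[0].shape                   # (3, 4, 5)
arr_4d[0, 1].shape                # (4, 5)
arr_4d[0, 1, 2].shape             # (5,)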
For the case of slices, what you are doing is selecting a range of elements along a particular axis:
In [17]:
arr_2d = np.array([[5,3,4],[0,1,2],[1,1,10],[0,0,0.1]])
arr_2d
Out[17]:
In [18]:
# get the first row
arr_2d[0]
Out[18]:
In [19]:
# get the first column
arr_2d[:,0]
Out[19]:
In [20]:
# get the first two rows
arr_2d[:2]
Out[20]:
In [117]:
arr = np.array([0, 3, 1, 4, 64])
arr
Out[117]:
In [118]:
subarr = arr[2:4]
subarr[1] = 99
arr
Out[118]:
In [23]:
arr = np.array([10, 20])
idx = np.array([True, False])
arr[idx]
Out[23]:
In [125]:
arr_2d = np.random.randn(5)
arr_2d
Out[125]:
In [126]:
arr_2d < 0
Out[126]:
In [127]:
arr_2d[arr_2d < 0]
Out[127]:
In [129]:
arr_2d[(arr_2d > -0.5) & (arr_2d < 0)]
Out[129]:
In [130]:
arr_2d[arr_2d < 0] = 0
arr_2d
Out[130]:
In [29]:
arr = np.arange(18).reshape(6,3)
arr
Out[29]:
In [30]:
# fancy selection of rows in a particular order
arr[[0,4,4]]
Out[30]:
In [31]:
# index into individual elements and flatten
arr[[5,3,1],[2,1,0]]
Out[31]:
In [32]:
# select a submatrix
arr[np.ix_([5,3,1],[2,1])]
Out[32]:
--> Go to question set
In [33]:
arr = np.array([0, 9, 1.02, 4, 32])
arr - arr
Out[33]:
In [34]:
arr * arr
Out[34]:
In [35]:
arr = np.array([0, 9, 1.02, 4, 64])
5 * arr
Out[35]:
In [36]:
10 + arr
Out[36]:
In [37]:
arr ** .5
Out[37]:
The case of arrays of different shapes is slightly more complicated. The gist of it is that the shapes of the operands need to conform to a certain specification. Don't worry if this does not make sense right away.
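Concretely, NumPy compares shapes starting from the trailing (rightmost) dimension: each pair of dimensions must either be equal, or one of them must be 1 (or missing). A small illustration, not one of the numbered cells:

a = np.zeros((4, 2))
a + np.zeros(2)        # works: (4, 2) with (2,)   -> the 1-D array is repeated along the rows
a + np.zeros((4, 1))   # works: (4, 2) with (4, 1) -> the column is repeated along axis 1
# a + np.zeros(4)      # fails: trailing dimensions 2 and 4 do not match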
In [132]:
arr = np.random.randn(4,2)
arr
Out[132]:
In [133]:
mean_row = np.mean(arr, axis=0)
mean_row
Out[133]:
In [134]:
centered_rows = arr - mean_row
centered_rows
Out[134]:
In [135]:
np.mean(centered_rows, axis=0)
Out[135]:
In [136]:
mean_col = np.mean(arr, axis=1)
mean_col
Out[136]:
In [42]:
# this fails: arr has shape (4,2) and mean_col has shape (4,),
# which cannot be broadcast together
centered_cols = arr - mean_col
In [137]:
# make the 1-D array a column vector
mean_col.reshape((4,1))
Out[137]:
In [138]:
centered_cols = arr - mean_col.reshape((4,1))
centered_cols
Out[138]:
In [139]:
centered_cols.mean(axis=1)
Out[139]:
In [140]:
np.nan != np.nan
Out[140]:
In [142]:
np.array([10,5,4,np.nan,1,np.nan]) == np.nan
Out[142]:
In [144]:
np.isnan(np.array([10,5,4,np.nan,1,np.nan]))
Out[144]:
--> Go to question set
Python has long been great for data munging and preparation, but less so for data analysis and modeling. pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain specific language like R.
The heart of pandas is the DataFrame object for data manipulation. It features:
In [43]:
import pandas as pd
pd.set_printoptions(precision=3, notebook_repr_html=True)
In [44]:
import pandas as pd
values = np.array([2.0, 1.0, 5.0, 0.97, 3.0, 10.0, 0.0599, 8.0])
ser = pd.Series(values)
print ser
In [45]:
values = np.array([2.0, 1.0, 5.0, 0.97, 3.0, 10.0, 0.0599, 8.0])
labels = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
ser = pd.Series(data=values, index=labels)
print ser
In [46]:
movie_rating = {
'age': 1,
'gender': 'F',
'genres': 'Drama',
'movie_id': 1193,
'occupation': 10,
'rating': 5,
'timestamp': 978300760,
'title': "One Flew Over the Cuckoo's Nest (1975)",
'user_id': 1,
'zip': '48067'
}
ser = pd.Series(movie_rating)
print ser
In [47]:
ser.index
Out[47]:
In [48]:
ser.values
Out[48]:
In [49]:
ser[0]
Out[49]:
In [50]:
ser['gender']
Out[50]:
In [51]:
ser.get_value('gender')
Out[51]:
In [52]:
ser_1 = pd.Series(data=[1,3,4], index=['A', 'B', 'C'])
ser_2 = pd.Series(data=[5,5,5], index=['A', 'G', 'C'])
print ser_1 + ser_2
In [53]:
# build from a dict of equal-length lists or ndarrays
pd.DataFrame({'col_1': [0.12, 7, 45, 10], 'col_2': [0.9, 9, 34, 11]})
Out[53]:
You can explicitly set the column names and index values as well.
In [54]:
pd.DataFrame(data={'col_1': [0.12, 7, 45, 10], 'col_2': [0.9, 9, 34, 11]},
columns=['col_1', 'col_2', 'col_3'])
Out[54]:
In [55]:
pd.DataFrame(data={'col_1': [0.12, 7, 45, 10], 'col_2': [0.9, 9, 34, 11]},
columns=['col_1', 'col_2', 'col_3'],
index=['obs1', 'obs2', 'obs3', 'obs4'])
Out[55]:
You can also think of it as a dictionary of Series objects.
In [145]:
movie_rating = {
'gender': 'F',
'genres': 'Drama',
'movie_id': 1193,
'rating': 5,
'timestamp': 978300760,
'user_id': 1,
}
ser_1 = pd.Series(movie_rating)
ser_2 = pd.Series(movie_rating)
df = pd.DataFrame({'r_1': ser_1, 'r_2': ser_2})
df.columns.name = 'rating_events'
df.index.name = 'rating_data'
df
Out[145]:
In [146]:
df = df.T
df
Out[146]:
In [147]:
df.columns
Out[147]:
In [148]:
df.index
Out[148]:
In [149]:
df.values
Out[149]:
In [60]:
df = pd.DataFrame({'r_1': ser_1, 'r_2': ser_2})
df.drop('genres', axis=0)
Out[60]:
In [61]:
df.drop('r_1', axis=1)
Out[61]:
In [62]:
# careful: the values must be listed in the same order as the DataFrame's index
df['r_3'] = ['F', 'Drama', 1193, 5, 978300760, 1]
df
Out[62]:
--> Go to question set
You can index into a column using its label, or with dot notation:
In [150]:
df = pd.DataFrame(data={'col_1': [0.12, 7, 45, 10], 'col_2': [0.9, 9, 34, 11]},
columns=['col_1', 'col_2', 'col_3'],
index=['obs1', 'obs2', 'obs3', 'obs4'])
df['col_1']
Out[150]:
In [151]:
df.col_1
Out[151]:
You can also use multiple columns to select a subset of them:
In [152]:
df[['col_2', 'col_1']]
Out[152]:
The .ix indexer gives you the most flexibility, letting you index into specific rows, or even rows and columns:
In [153]:
df.ix['obs3']
Out[153]:
In [154]:
df.ix[0]
Out[154]:
In [155]:
df.ix[:2]
Out[155]:
In [156]:
df.ix[:2, 'col_2']
Out[156]:
In [157]:
df.ix[:2, ['col_1', 'col_2']]
Out[157]:
--> Go to question set
Break!!
In [71]:
import pandas as pd
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('data/ml-1m/users.dat',
sep='::', header=None, names=unames)
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('data/ml-1m/ratings.dat',
sep='::', header=None, names=rnames)
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('data/ml-1m/movies.dat',
sep='::', header=None, names=mnames)
# show how one of them looks
ratings.head(5)
Out[71]:
Using pd.merge we get it all into one big DataFrame.
In [72]:
movielens = pd.merge(pd.merge(ratings, users), movies)
movielens
Out[72]:
This subsection will generate training and testing sets for evaluation. You do not need to understand every single line of code, just the general gist:
In [73]:
# let's work with a smaller subset for speed reasons
movielens = movielens.ix[np.random.choice(movielens.index, size=10000, replace=False)]
print movielens.shape
print movielens.user_id.nunique()
print movielens.movie_id.nunique()
In [74]:
user_ids_larger_1 = pd.value_counts(movielens.user_id, sort=False) > 1
movielens = movielens[user_ids_larger_1[movielens.user_id]]
print movielens.shape
np.all(movielens.user_id.value_counts() > 1)
Out[74]:
We now generate train and test subsets using groupby and apply.
In [75]:
def assign_to_set(df):
    sampled_ids = np.random.choice(df.index,
                                   size=np.int64(np.ceil(df.index.size * 0.2)),
                                   replace=False)
    df.ix[sampled_ids, 'for_testing'] = True
    return df
movielens['for_testing'] = False
grouped = movielens.groupby('user_id', group_keys=False).apply(assign_to_set)
movielens_train = movielens[grouped.for_testing == False]
movielens_test = movielens[grouped.for_testing == True]
print movielens_train.shape
print movielens_test.shape
print movielens_train.index & movielens_test.index
Store these two sets in text files:
In [76]:
movielens_train.to_csv('data/movielens_train.csv')
movielens_test.to_csv('data/movielens_test.csv')
In [77]:
def compute_rmse(y_pred, y_true):
    """ Compute Root Mean Squared Error. """
    return np.sqrt(np.mean(np.power(y_pred - y_true, 2)))
In [78]:
def evaluate(estimate_f):
    """ RMSE-based predictive performance evaluation with pandas. """
    ids_to_estimate = zip(movielens_test.user_id, movielens_test.movie_id)
    estimated = np.array([estimate_f(u, i) for (u, i) in ids_to_estimate])
    real = movielens_test.rating.values
    return compute_rmse(estimated, real)
In [79]:
def estimate1(user_id, item_id):
    """ Simple content-filtering based on mean ratings. """
    return movielens_train.ix[movielens_train.user_id == user_id, 'rating'].mean()
print 'RMSE for estimate1: %s' % evaluate(estimate1)
In [80]:
def estimate2(user_id, movie_id):
    """ Simple collaborative filter based on mean ratings. """
    ratings_by_others = movielens_train[movielens_train.movie_id == movie_id]
    if ratings_by_others.empty: return 3.0
    return ratings_by_others.rating.mean()
print 'RMSE for estimate2: %s' % evaluate(estimate2)
--> Go to question set
Possibly incorporating metadata about items, which is what makes the term 'content' fitting.
$$ r_{u,i} = k \sum_{i' \in I(u)} sim(i, i') \; r_{u,i'} $$
$$ r_{u,i} = \bar r_u + k \sum_{i' \in I(u)} sim(i, i') \; (r_{u,i'} - \bar r_u) $$
Here $k$ is a normalizing factor,
$$ k = \frac{1}{\sum_{i' \in I(u)} |sim(i, i')|} $$
and $\bar r_u$ is the average rating of user $u$:
$$ \bar r_u = \frac{\sum_{i \in I(u)} r_{u,i}}{|I(u)|} $$
Possibly incorporating metadata about users.
$$ r_{u,i} = k \sum_{u' \in U(i)} sim(u, u') \; r_{u',i} $$
$$ r_{u,i} = \bar r_u + k \sum_{u' \in U(i)} sim(u, u') \; (r_{u',i} - \bar r_{u'}) $$
Here $k$ is a normalizing factor,
$$ k = \frac{1}{\sum_{u' \in U(i)} |sim(u, u')|} $$
$\bar r_u$ is, as before, the average rating of user $u$, and $\bar r_{u'}$ is defined analogously for each neighbour $u'$:
$$ \bar r_u = \frac{\sum_{i \in I(u)} r_{u,i}}{|I(u)|} $$
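As a rough sketch of how the mean-centered, user-based aggregation above translates into plain Python (the ratings dict, sim function and user_means dict here are hypothetical stand-ins, not objects defined elsewhere in this tutorial):

def predict_user_based(u, i, ratings, sim, user_means):
    """Estimate r_{u,i} as user u's mean plus a similarity-weighted
    average of the neighbours' mean-centered ratings of item i.
    ratings: dict mapping (user, item) -> observed rating
    sim: function sim(u, u') -> similarity score
    user_means: dict mapping user -> that user's average rating
    """
    num, denom = 0.0, 0.0
    for (u_other, item), r in ratings.items():
        if item != i or u_other == u:
            continue
        s = sim(u, u_other)
        num += s * (r - user_means[u_other])
        denom += abs(s)
    if denom == 0:
        return user_means[u]  # nobody else rated i: fall back to u's mean
    return user_means[u] + num / denom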
In [158]:
print movielens_train.groupby('gender')['rating'].mean()
In [159]:
print movielens_train.groupby(['gender', 'age'])['rating'].mean()
In [160]:
# transform the ratings frame into a ratings matrix
ratings_mtx_df = movielens_train.pivot_table(values='rating',
rows='user_id',
cols='movie_id')
ratings_mtx_df
Out[160]:
In [161]:
# with an integer axis index only label-based indexing is possible
ratings_mtx_df.ix[ratings_mtx_df.index[-15:],ratings_mtx_df.columns[:15]]
Out[161]:
The more interesting use of pivot_table is as an interface to groupby:
In [162]:
by_gender_title = movielens_train.groupby(['gender', 'title'])['rating'].mean()
print by_gender_title
In [163]:
by_gender_title = movielens_train.groupby(['gender', 'title'])['rating'].mean().unstack('gender')
by_gender_title.head(10)
Out[163]:
In [165]:
by_gender_title = movielens_train.pivot_table(values='rating', rows='title', cols='gender')
by_gender_title.head(10)
Out[165]:
We're going to need a user index from the users portion of the dataset. This will allow us to retrieve information given a specific user_id in a more convenient way:
In [87]:
user_info = users.set_index('user_id')
user_info.head(5)
Out[87]:
With this in hand, we can now look up the gender of a particular user_id like so:
In [166]:
user_id = 3
user_info.ix[user_id, 'gender']
Out[166]:
In [89]:
def estimate3(user_id, movie_id):
    """ Collaborative filtering using an implicit sim(u,u'). """
    ratings_by_others = movielens_train[movielens_train.movie_id == movie_id]
    if ratings_by_others.empty: return 3.0
    means_by_gender = ratings_by_others.pivot_table('rating', rows='movie_id', cols='gender')
    user_gender = user_info.ix[user_id, 'gender']
    if user_gender in means_by_gender.columns:
        return means_by_gender.ix[movie_id, user_gender]
    else:
        return means_by_gender.ix[movie_id].mean()
print 'RMSE for reco3: %s' % evaluate(estimate3)
At this point it seems worthwhile to write a learn method that pre-computes whatever data structures we need at estimation time.
In [90]:
class Reco3:
    """ Collaborative filtering using an implicit sim(u,u'). """
    def learn(self):
        """ Prepare datastructures for estimation. """
        self.means_by_gender = movielens_train.pivot_table('rating', rows='movie_id', cols='gender')
    def estimate(self, user_id, movie_id):
        """ Mean ratings by other users of the same gender. """
        if movie_id not in self.means_by_gender.index: return 3.0
        user_gender = user_info.ix[user_id, 'gender']
        if ~np.isnan(self.means_by_gender.ix[movie_id, user_gender]):
            return self.means_by_gender.ix[movie_id, user_gender]
        else:
            return self.means_by_gender.ix[movie_id].mean()
reco = Reco3()
reco.learn()
print 'RMSE for reco3: %s' % evaluate(reco.estimate)
In [91]:
class Reco4:
    """ Collaborative filtering using an implicit sim(u,u'). """
    def learn(self):
        """ Prepare datastructures for estimation. """
        self.means_by_age = movielens_train.pivot_table('rating', rows='movie_id', cols='age')
    def estimate(self, user_id, movie_id):
        """ Mean ratings by other users of the same age. """
        if movie_id not in self.means_by_age.index: return 3.0
        user_age = user_info.ix[user_id, 'age']
        if ~np.isnan(self.means_by_age.ix[movie_id, user_age]):
            return self.means_by_age.ix[movie_id, user_age]
        else:
            return self.means_by_age.ix[movie_id].mean()
reco = Reco4()
reco.learn()
print 'RMSE for reco4: %s' % evaluate(reco.estimate)
The following similarity functions are all written to operate on two pandas Series, each one representing the rating history of a different user. You can also apply them to any two feature vectors that describe users or items. In all cases, the higher the return value, the more similar the two Series are. You might need to add checks for edge cases, such as division by zero.
In [92]:
def euclidean(s1, s2):
    """Take two pd.Series objects and return their euclidean 'similarity'."""
    diff = s1 - s2
    return 1 / (1 + np.sqrt(np.sum(diff ** 2)))
In [93]:
def cosine(s1, s2):
    """Take two pd.Series objects and return their cosine similarity."""
    return np.sum(s1 * s2) / np.sqrt(np.sum(s1 ** 2) * np.sum(s2 ** 2))
In [94]:
def pearson(s1, s2):
    """Take two pd.Series objects and return a pearson correlation."""
    s1_c = s1 - s1.mean()
    s2_c = s2 - s2.mean()
    return np.sum(s1_c * s2_c) / np.sqrt(np.sum(s1_c ** 2) * np.sum(s2_c ** 2))
In [95]:
def jaccard(s1, s2):
    """Take two pd.Series objects and return their extended Jaccard (Tanimoto) similarity."""
    dotp = np.sum(s1 * s2)
    return dotp / (np.sum(s1 ** 2) + np.sum(s2 ** 2) - dotp)
def binjaccard(s1, s2):
    """Take two binary (0/1) pd.Series objects and return their Jaccard similarity."""
    # cast to float so the division is not truncated under Python 2
    dotp = float((s1.index & s2.index).size)
    return dotp / (s1.sum() + s2.sum() - dotp)
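A quick usage sketch with a made-up pair of rating vectors, just to show the calling convention:

s1 = pd.Series([4.0, 5.0, 1.0], index=['m_1', 'm_2', 'm_3'])
s2 = pd.Series([5.0, 4.0, 2.0], index=['m_1', 'm_2', 'm_3'])
print euclidean(s1, s2), cosine(s1, s2), pearson(s1, s2)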
In [96]:
class Reco5:
    """ Collaborative filtering using a custom sim(u,u'). """
    def learn(self):
        """ Prepare datastructures for estimation. """
        self.all_user_profiles = movielens.pivot_table('rating', rows='movie_id', cols='user_id')
    def estimate(self, user_id, movie_id):
        """ Ratings weighted by correlation similarity. """
        ratings_by_others = movielens_train[movielens_train.movie_id == movie_id]
        if ratings_by_others.empty: return 3.0
        ratings_by_others.set_index('user_id', inplace=True)
        their_ids = ratings_by_others.index
        their_ratings = ratings_by_others.rating
        their_profiles = self.all_user_profiles[their_ids]
        user_profile = self.all_user_profiles[user_id]
        sims = their_profiles.apply(lambda profile: pearson(profile, user_profile), axis=0)
        ratings_sims = pd.DataFrame({'sim': sims, 'rating': their_ratings})
        ratings_sims = ratings_sims[ratings_sims.sim > 0]
        if ratings_sims.empty:
            return their_ratings.mean()
        else:
            return np.average(ratings_sims.rating, weights=ratings_sims.sim)
reco = Reco5()
reco.learn()
print 'RMSE for reco5: %s' % evaluate(reco.estimate)
PyTables is a package for managing hierarchical datasets, designed to efficiently and easily cope with extremely large amounts of data.
From hdfgroup.org: HDF5 is a Hierarchical Data Format consisting of a data format specification and a supporting library implementation.
HDF5 files are organized in a hierarchical structure, with two primary structures: groups and datasets.
In [97]:
import tables as tb
h5file = tb.openFile('data/tutorial.h5', mode='w', title='Test file')
h5file
Out[97]:
In [98]:
group_1 = h5file.createGroup(h5file.root, 'group_1', 'Group One')
group_2 = h5file.createGroup('/', 'group_2', 'Group Two')
h5file
Out[98]:
In [99]:
h5file.createArray(group_1, 'random_arr_1', np.random.randn(30),
"Just a bunch of random numbers")
Out[99]:
In [100]:
h5file
Out[100]:
In [101]:
h5file.root.group_1.random_arr_1
Out[101]:
In [102]:
h5file.root.group_1.random_arr_1[:5]
Out[102]:
In [103]:
from datetime import datetime
h5file.setNodeAttr(group_1, 'last_modified', datetime.utcnow())
group_1._v_attrs
Out[103]:
In [104]:
h5file.getNodeAttr(group_1,'last_modified')
Out[104]:
In [105]:
group_3 = h5file.createGroup(h5file.root, 'group_3', 'Group Three')
ndim = 6000000
# a 6,000,000 x 6,000,000 float64 array would need far more memory than is
# available, so this call is expected to fail
h5file.createArray(group_3, 'random_group_3',
                   np.zeros((ndim, ndim)), "A very very large array")
In [111]:
rows = 10
cols = 10
earr = h5file.createEArray(group_3, 'EArray', tb.Int8Atom(),
                           (0, cols), "A very very large array, second try.")
for i in range(rows):
    earr.append(np.zeros((1, cols)))
In [112]:
earr
Out[112]: