In [1]:
import sys
sys.path.append('../..')
In [2]:
from bestPy.datastructures import Transactions
By transaction data, we simply mean a list of sales from your store, with each entry consisting of a timestamp, a unique customer ID, and a unique article ID. Saved, for example, to a comma-separated values (CSV) file, such a transaction list could look something like this:
Timestamp|Customer ID|Article
=============================
1331072795;4;AC016EL50CPHALID-1749
1331074425;1;AC016EL67BJWALID-932
1331306282;12;SA848EL83DOYALID-2416
1331306282;12;BL152EL82CRXALID-1817
1331306313;11;CA189EL29AGOALID-170
1331306332;11;LE629EL54ANHALID-345
1331306341;10;OL756EL65HDYALID-4834
1331306414;7;OL756EL55HAMALID-4744
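Independently of bestPy, a file in this format can be parsed with Python's standard csv module. The following sketch feeds a few of the sample lines above through the reader inline; it only illustrates the format and is not what bestPy does internally.

```python
import csv
from io import StringIO

# A few sample lines in the format shown above (semicolon-separated).
raw = """1331072795;4;AC016EL50CPHALID-1749
1331074425;1;AC016EL67BJWALID-932
1331306282;12;SA848EL83DOYALID-2416"""

transactions = []
for row in csv.reader(StringIO(raw), delimiter=';'):
    timestamp, customer, article = row
    transactions.append((int(timestamp), customer, article))

print(transactions[0])  # (1331072795, '4', 'AC016EL50CPHALID-1749')
```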
Since CSV is indeed a common way of storing such information, it is also one of the currently supported data sources for bestPy.
To read from a CSV file, we presuppose that the columns are ordered as timestamp; customer ID; article ID and that there are no columns other than these three. Data can then be read in by instantiating a new Transactions object via its class method from_csv(), which takes the path to and name of the CSV file as its argument, like so:
In [3]:
file = 'examples_data.csv'
data = Transactions.from_csv(file)
Ooops! What happened here? No worries! The reader just detected that the first two lines in the file were the column headers and the separator line (as in the example above), neither of which is interpreted as a valid transaction. More about that shortly.
First, we note that Tab completion on the method from_csv() reveals a second argument, separator (defaulting to ';'), which allows specifying the character(s) delimiting the columns in the CSV file. If, for example, an actual comma were used instead of the semicolon in the example above, one would simply write:
file_with_commas = '/path/to/file_with_commas.csv'
data = Transactions.from_csv(file_with_commas, ',')
PostgreSQL database
Another common data source for sales transactions is some sort of database. bestPy therefore provides a reader for the popular, open-source PostgreSQL database. Specifically, we assume that your database has a table that conceptually mirrors the structure of the CSV file, that is, a table with three columns (or fields) containing the respective information.
To access the database containing that table, quite a number of parameters need to be specified. To facilitate the process, bestPy requires you to fill out a form, which we import like so:
In [4]:
from bestPy.datastructures import PostgreSQLparams
You can get a fresh copy of the form by creating an instance of PostgreSQLparams.
In [5]:
database = PostgreSQLparams()
Again, Tab completion is your friend. You might want to start by filling in the fields that begin with "login_". The login attribute itself allows you to monitor progress: it represents the current login string that will eventually be sent to the database.
In [6]:
database.login
Out[6]:
These are obviously just placeholders. Let's see what happens if we start setting attributes.
In [7]:
database.login_db_name = 'my_database'
database.login_host = 'my_host'
database.login
Out[7]:
Of course, you can also query the individual attributes. Note how they are prefixed so that the database will understand your login credentials.
In [8]:
print(database.login_db_name)
print(database.login_host)
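To make that prefixing concrete, here is a plain-Python stand-in (not bestPy's actual implementation) illustrating how individual login_* fields might be combined into one prefixed login string. The class name LoginForm, the placeholder values, and the exact prefixes are assumptions made purely for illustration.

```python
class LoginForm:
    """Simplified, hypothetical stand-in for a login form like PostgreSQLparams."""

    def __init__(self):
        # Placeholders until the user fills in real values.
        self.login_db_name = '<dbname>'
        self.login_host = '<host>'
        self.login_user = '<user>'
        self.login_password = '<password>'

    @property
    def login(self):
        # Each field is prefixed with the keyword the database expects.
        return ' '.join([f'dbname={self.login_db_name}',
                         f'host={self.login_host}',
                         f'user={self.login_user}',
                         f'password={self.login_password}'])

form = LoginForm()
form.login_db_name = 'my_database'
form.login_host = 'my_host'
print(form.login)  # dbname=my_database host=my_host user=<user> password=<password>
```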
Complete the login fields and then move on to provide the name of the table ...
In [9]:
database.login_user = 'my_user'
database.login_password = 'my_pwd'
print(database.table)
database.table = 'my_table'
database.table
Out[9]:
... as well as the timestamp, userID, and itemID column names.
In [10]:
database.userID = 'my_customer_column'
database.itemID = 'my_article_column'
database.timestamp = 'my_time_column'
Finally, specify the maximum number of transactions you want to retrieve from the database. If you simply want all there are, then say so by specifying:
In [11]:
database.limit = 'all'
Now that you have completed the form, you can use the from_postgreSQL() class method of Transactions to read the transaction data.
data = Transactions.from_postgreSQL(database)
Either way, we now have our data, and it is time to inspect it.
Transactions data objects
Let's go back to the data object we created from the CSV file. Remember that we got two warnings because the two header lines were not interpreted as actual transactions? Well, this fact is reflected in the aptly named attribute number_of_corrupted_records.
In [12]:
data.number_of_corrupted_records
Out[12]:
More importantly, the number of valid transactions can be retrieved through:
In [13]:
data.number_of_transactions
Out[13]:
Information on customers and articles is held in the user and item attributes, respectively. How many there are of each sort can be determined, for example, by inspecting the respective count attributes.
In [14]:
print(data.user.count)
print(data.item.count)
Whatever the actual customer and article IDs are in the data, bestPy internally assigns a unique integer index to each customer and each article. To find out which integer index corresponds to which (potentially alphanumeric) ID and vice versa, use the respective id_of and index_of dictionaries.
In [15]:
print(data.user.id_of[0]) # find ID from index
print(data.item.index_of['NI743EL91KBWALID-6808']) # find index from ID
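Conceptually, such an index assignment just maps each unique ID to a consecutive integer, with a reverse mapping to recover the ID from the index. A plain-Python sketch of the idea (not bestPy's implementation; the IDs are taken from the sample file above):

```python
# Article IDs as they might appear in the raw data; duplicates
# (repeat purchases) must map to the same index.
ids = ['AC016EL50CPHALID-1749',
       'SA848EL83DOYALID-2416',
       'AC016EL50CPHALID-1749']

index_of = {}
for identifier in ids:
    if identifier not in index_of:
        index_of[identifier] = len(index_of)  # next free integer index

# Reverse mapping: integer index back to the original ID.
id_of = {index: identifier for identifier, index in index_of.items()}

print(index_of['SA848EL83DOYALID-2416'])  # 1
print(id_of[0])                           # AC016EL50CPHALID-1749
```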
The data itself is stored in the matrix attribute, where rows are customers, columns are articles, and entries are how often a given customer bought a given article. Feel free to explore the attributes of the matrix object by yourself! Here, we just mention that the number of non-zero entries in that matrix, that is, the number of unique customer-article pairings in the data, can be retrieved as:
In [16]:
data.number_of_userItem_pairs
Out[16]:
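Conceptually, those matrix entries are nothing but counts of (customer, article) pairs, and the number of non-zero entries is the number of distinct pairs. A plain-Python sketch with collections.Counter, using a few hypothetical pairs (not bestPy's internal representation):

```python
from collections import Counter

# Hypothetical (customer, article) pairs as they might occur in the data.
pairs = [('4', 'AC016EL50CPHALID-1749'),
         ('12', 'SA848EL83DOYALID-2416'),
         ('12', 'SA848EL83DOYALID-2416'),  # repeat purchase
         ('11', 'CA189EL29AGOALID-170')]

# Each entry counts how often a given customer bought a given article.
counts = Counter(pairs)

print(len(counts))                              # 3 distinct customer-article pairs
print(counts[('12', 'SA848EL83DOYALID-2416')])  # 2
```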
That's it for the data structures. Next, we will return to those ominous warnings flashing across the screen when importing lines from CSV files that aren't transactions.