CHAPTER 2

Transaction Data

Having gotten our feet wet, we now turn to the question of how data is fed into bestPy, how it is stored there, and how to access it.

Preliminaries

We only need the following addition to the search path because the examples folder is a subdirectory of the bestPy package.


In [1]:
import sys
sys.path.append('../..')

Import

Since we are not going to actually recommend anything in the present chapter, we are just going to import bestPy's primary data structure Transactions from the aptly named subpackage bestPy.datastructures.


In [2]:
from bestPy.datastructures import Transactions

By transaction data, we simply mean a list of sales from your store, with each entry consisting of a timestamp, a unique customer ID, and a unique article ID. Saved, for example, to a comma-separated values (CSV) file, such a transaction list could look something like this:

Timestamp|Customer ID|Article
=============================
1331072795;4;AC016EL50CPHALID-1749
1331074425;1;AC016EL67BJWALID-932
1331306282;12;SA848EL83DOYALID-2416
1331306282;12;BL152EL82CRXALID-1817
1331306313;11;CA189EL29AGOALID-170
1331306332;11;LE629EL54ANHALID-345
1331306341;10;OL756EL65HDYALID-4834
1331306414;7;OL756EL55HAMALID-4744
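
In case you would like to follow along without the examples file at hand, a few lines of plain Python suffice to write a toy version of the snippet above to disk. The file name my_transactions.csv is made up for this sketch and is not the examples_data.csv file used below.

header = 'Timestamp|Customer ID|Article'
rows = ['1331072795;4;AC016EL50CPHALID-1749',
        '1331074425;1;AC016EL67BJWALID-932',
        '1331306282;12;SA848EL83DOYALID-2416']

with open('my_transactions.csv', 'w') as csv_file:    # toy file for illustration only
    csv_file.write(header + '\n')
    csv_file.write('=' * len(header) + '\n')           # separator line, as in the example above
    for row in rows:
        csv_file.write(row + '\n')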

Since CSV is indeed a common way of storing such information, it is also one of the currently supported data sources for bestPy.

From CSV file

To read from a CSV file, we presuppose that the ordering of the columns is timestamp; customer ID; article ID and that there are no columns other than these three. Data can then be read in by instantiating a new Transactions object via its class method from_csv(), which takes the path to and name of the CSV file as argument, like so:


In [3]:
file = 'examples_data.csv'
data = Transactions.from_csv(file)


WARNING:root:Could not interpret transaction on line 1. Skipping.
WARNING:root:Could not interpret transaction on line 2. Skipping.

Oops! What happened here? No worries! The reader just detected that the first two lines in the file are the column header and the separator line (as in the example above), neither of which can be interpreted as a valid transaction. More about that shortly.

First, we note that Tab completion on the method from_csv() reveals a second argument separator (defaulting to ';'), which allows specifying the character(s) delimiting the columns in the CSV file. If, for example, an actual comma were used instead of the semicolon in the example above, one would simply write

file_with_commas = '/path/to/file_with_commas.csv'
data = Transactions.from_csv(file_with_commas, ',')

From postgreSQL database

Another common data source for sales transactions is some sort of database. bestPy therefore provides a reader for the popular, open-source PostgreSQL database. Specifically, we assume that your database contains a table that conceptually mirrors the structure of the CSV file above, that is, a table with three columns (or fields) holding the respective information.

To access the database containing that table, quite a number of parameters need to be specified. To facilitate the process, bestPy requires you to fill out a form, which we import like so:


In [4]:
from bestPy.datastructures import PostgreSQLparams

You can get a fresh copy of the form by creating an instance of PostgreSQLparams.


In [5]:
database = PostgreSQLparams()

Again, Tab completion is your friend. You might want to start by filling in the fields that start with "login_". The login attribute itself allows you to monitor your progress: it represents the current login string that will eventually be sent to the database.


In [6]:
database.login


Out[6]:
'<dbname> <host> <password> <user>'

These are obviously just placeholders. Let's see what happens if we start setting attributes.


In [7]:
database.login_db_name = 'my_database'
database.login_host = 'my_host'
database.login


Out[7]:
"dbname='my_database' host='my_host' <password> <user>"

Of course, you can also query the individual attributes. Note how they are prefixed with their respective keywords so that the database will understand your login credentials.


In [8]:
print(database.login_db_name)
print(database.login_host)


dbname='my_database'
host='my_host'

Complete the login fields and then move on to provide the name of the table ...


In [9]:
database.login_user = 'my_user'
database.login_password = 'my_pwd'

print(database.table)
database.table = 'my_table'
database.table


<table>
Out[9]:
'my_table'

... as well as the timestamp, userID, and itemID column names.


In [10]:
database.userID = 'my_customer_column'
database.itemID = 'my_article_column'
database.timestamp = 'my_time_column'

Finally, specify the maximum number of transactions you want to retrieve from the database. If you simply want all there are, then say so by specifying:


In [11]:
database.limit = 'all'
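
The attribute name suggests that an integer should work just as well if you only want a capped number of transactions. We have not relied on that here, so take the following line as an assumption rather than as documented behaviour:

database.limit = 100  # assumption: cap the read at 100 transactions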

Now that you have completed the form, you can use the from_postgreSQL() class method of Transactions to read in the transaction data.

data = Transactions.from_postgreSQL(database)
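
For reference, here is the entire form-filling sequence from above condensed into a single snippet, with all values being placeholders for your own credentials:

from bestPy.datastructures import PostgreSQLparams, Transactions

database = PostgreSQLparams()
database.login_db_name = 'my_database'
database.login_host = 'my_host'
database.login_user = 'my_user'
database.login_password = 'my_pwd'
database.table = 'my_table'
database.userID = 'my_customer_column'
database.itemID = 'my_article_column'
database.timestamp = 'my_time_column'
database.limit = 'all'

data = Transactions.from_postgreSQL(database)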

Either way, we now have our data and it is time to inspect the ...

Attributes of Transactions data objects

Let's go back to the data object we created from the CSV file. Remember that we got two warnings because the two header lines were not interpreted as actual transactions? Well, this fact is reflected in the aptly named attribute number_of_corrupted_records.


In [12]:
data.number_of_corrupted_records


Out[12]:
2

More importantly, the number of valid transactions can be retrieved through:


In [13]:
data.number_of_transactions


Out[13]:
100000

Information on customers and articles is held in the user and item attributes, respectively. How many there are of each, for example, can be determined by inspecting the respective count attributes.


In [14]:
print(data.user.count)
print(data.item.count)


38747
8255

Whatever the actual customer and article IDs are in the data, bestPy internally assigns a unique integer index to each customer and each article. To find out which integer index corresponds to which (potentially alphanumeric) ID and vice versa, use the respective id_of and index_of dictionaries.


In [15]:
print(data.user.id_of[0])  # find ID from index
print(data.item.index_of['NI743EL91KBWALID-6808'])  # find index from ID


4
356
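
Since the two dictionaries are inverses of each other, chaining them should return whatever you started with. Assuming the item attribute exposes an id_of dictionary just like user does, a quick sanity check reads:

index = data.item.index_of['NI743EL91KBWALID-6808']  # 356, as above
print(data.item.id_of[index])                        # should print the original article ID again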

The data itself is stored in the matrix attribute, where rows are customers, columns are articles, and entries are how often a given customer bought a given article. Feel free to explore the attributes of the matrix object by yourself (one quick way to do so is shown after the next cell)! Here, we are just going to mention that the number of non-zero entries in that matrix, that is, the number of unique customer-article pairings in the data, can be retrieved as:


In [16]:
data.number_of_userItem_pairs


Out[16]:
66947
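
A quick way to see what there is to explore on the matrix object, without relying on Tab completion, is Python's built-in dir():

print([name for name in dir(data.matrix) if not name.startswith('_')])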

That's it for the data structures. Next, we will return to those ominous warnings flashing across the screen when lines read from a CSV file cannot be interpreted as transactions.
