In this tutorial, we will use two Dataset types: the Sampled Dataset and the Transposed Dataset. This tutorial assumes familiarity with Procedures; we suggest going through the Procedures and Functions Tutorial beforehand.
To run the examples below, we created a toy dataset from the Wikipedia descriptions of machine learning concepts. The dataset has two columns: the first contains the name of a machine learning concept, and the second contains the corresponding description.
The notebook cells below use pymldb's Connection class to make REST API calls. You can check out the Using pymldb Tutorial for more details.
In [8]:
from pymldb import Connection
mldb = Connection("http://localhost")
A powerful feature of MLDB is the ability to execute SQL code while loading data. Here, the tokenize function splits the text column into (lowercased) individual words, removing words shorter than 4 characters. Notice that we have chosen a dataset of type sparse.mutable since we do not know ahead of time how many columns each row will have.
In [9]:
print(mldb.put('/v1/procedures/import_ML_concepts', {
    "type": "import.text",
    "params": {
        "dataFileUrl": "http://public.mldb.ai/datasets/MachineLearningConcepts.csv",
        "outputDataset": {
            "id": "ml_concepts",      # can be accessed later using this id
            "type": "sparse.mutable"  # sparse mutable dataset is needed since we tokenize words below
        },
        "named": "Concepts",          # row name expression for the output dataset
        "select":
            """
            tokenize(
                lower(Text),
                {splitChars: ' -''"?!;:/[]*,().',
                 minTokenLength: 4}) AS *
            """,                      # within the tokenize function:
                                      #   lower-case the text
                                      #   leave out punctuation and spaces
                                      #   keep only words of 4 or more characters
        "runOnCreation": True
    }
}))
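As a rough sketch (plain Python, not MLDB's actual implementation), the tokenize call above lowercases the text, splits it on the listed separator characters, drops tokens shorter than 4 characters, and counts the remaining words:

```python
import re
from collections import Counter

# Same separator characters as the splitChars argument above.
SPLIT_CHARS = " -'\"?!;:/[]*,()."

def tokenize(text, min_token_length=4):
    # Lowercase, split on any run of separators, keep long-enough tokens.
    pattern = "[" + re.escape(SPLIT_CHARS) + "]+"
    tokens = re.split(pattern, text.lower())
    return Counter(t for t in tokens if len(t) >= min_token_length)

print(tokenize("Machine learning: the study of algorithms!"))
```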
We can use the Query API to get the data into a Pandas DataFrame and take a quick look at it. You will notice that certain cells have a 'NaN' value. This is because the dataset is sparse: not every word appears in the description of every concept, and those missing values are represented as NaNs in a Pandas DataFrame.
In [10]:
mldb.query("SELECT * FROM ml_concepts LIMIT 5")
Out[10]:
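As a small standalone illustration (plain pandas with hypothetical token counts, not the actual dataset), building a DataFrame from rows that contain different words reproduces this NaN pattern:

```python
import pandas as pd

# Hypothetical per-concept token counts; each row contains different
# words, so the combined DataFrame is sparse.
rows = {
    "Naive Bayes": {"bayes": 2, "classifier": 1},
    "Clustering":  {"cluster": 3, "distance": 1},
}
df = pd.DataFrame.from_dict(rows, orient="index")
print(df)  # words absent from a row show up as NaN
```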
In [11]:
mldb.query(
    """
    SELECT horizontal_sum({*}) AS count
    FROM ml_concepts
    """
)
Out[11]:
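The horizontal_sum function above adds up each row's values across all of its columns, giving the total word count per concept. On a pandas DataFrame, the equivalent is a row-wise sum that skips NaNs; a minimal sketch with made-up counts:

```python
import pandas as pd

# Sparse token counts: NaN where a word is absent from a description.
df = pd.DataFrame(
    {"bayes": [2, None], "cluster": [None, 3], "distance": [None, 1]},
    index=["Naive Bayes", "Clustering"],
)
counts = df.sum(axis=1)  # row-wise sum; NaNs are skipped by default
print(counts)
```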
In [12]:
print(mldb.put("/v1/datasets/transposed_concepts", {
    "type": "transposed",
    "params": {
        "dataset": {"id": "ml_concepts"}
    }
}))
The code above transposed the data and created the transposed_concepts dataset. Now that we have transposed our data, we can easily see which words are used most frequently in the machine learning concept descriptions. Note that the query below joins the results from the transposed_concepts dataset with those of the inline transpose function so that we can compare their outputs.
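Conceptually, transposing swaps rows and columns: words become rows and concepts become columns. On a pandas DataFrame the same operation is `.T`; a minimal sketch with made-up data (plain pandas, not MLDB's transposed dataset type):

```python
import pandas as pd

df = pd.DataFrame(
    {"bayes": [2, None], "cluster": [None, 3]},
    index=["Naive Bayes", "Clustering"],
)
transposed = df.T  # words are now rows; concepts are now columns
print(transposed)
```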
In [13]:
mldb.query(
    """
    SELECT
        horizontal_count({a.*}) AS top_words_transp_dataset,
        horizontal_count({b.*}) AS top_words_transp_function
    NAMED a.rowName()
    FROM transposed_concepts AS a
    JOIN transpose(ml_concepts) AS b
        ON a.rowName() = b.rowName()
    ORDER BY top_words_transp_dataset DESC
    LIMIT 10
    """
)
Out[13]:
While the two methods yield the same results, the transposed dataset type keeps the output in memory as a named dataset. This is useful if you want to reuse the transpose in multiple steps of your data analysis. In cases where you don't need the transpose more than once, you can simply use the inline syntax.
In [14]:
print(mldb.put("/v1/datasets/sampled_tokens", {
    "type": "sampled",
    "params": {
        "rows": 10,
        "withReplacement": False,
        "dataset": {"id": "transposed_concepts"},
        "seed": 0
    }
}))
In [15]:
mldb.query("SELECT * FROM sampled_tokens")
Out[15]:
The above code is similar to using the sample function inline in a query:
mldb.query(
    """
    SELECT *
    FROM sample(
        transposed_concepts,
        {rows: 10, withReplacement: False}
    )
    """
)
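The sampling itself (n rows drawn without replacement, reproducible under a fixed seed) behaves much like pandas' DataFrame.sample; a rough standalone sketch with made-up rows:

```python
import pandas as pd

# 20 hypothetical token rows; draw 5 of them without replacement.
df = pd.DataFrame({"count": range(20)},
                  index=["word%d" % i for i in range(20)])
sampled = df.sample(n=5, replace=False, random_state=0)
print(sampled)
```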
Again, the two methods produce the same outcome, allowing you to choose how best to manipulate your data.
As seen above, the two different ways to either transpose or sample data are equivalent. It is recommended to use Dataset types instead of SQL functions when the created table will be reused or called later in the program.
As seen in this tutorial, MLDB allows you to manipulate data virtually in multiple ways. This grants greater flexibility when constructing models and analyzing complex data.
Check out the other Tutorials and Demos.