In this tutorial, we will use two Dataset types: the Sampled Dataset and the Transposed Dataset. This tutorial assumes familiarity with Procedures; we suggest going through the Procedures and Functions Tutorial beforehand.
To run the examples below, we created a toy dataset from the Wikipedia descriptions of machine learning concepts. The dataset has two columns: the first contains the name of a machine learning concept, and the second contains the corresponding description.
The notebook cells below use pymldb's Connection class to make REST API calls. You can check out the Using pymldb Tutorial for more details.
In [8]:
from pymldb import Connection
mldb = Connection("http://localhost")
A powerful feature of MLDB is the ability to execute SQL code while loading data. Here, the tokenize function splits the text column into (lowercased) individual words, removing words shorter than 4 characters. Notice that we have chosen a dataset of type sparse.mutable since we do not know ahead of time how many columns each row will have.
In [9]:
print(mldb.put('/v1/procedures/import_ML_concepts', {
    "type": "import.text",
    "params": {
        "dataFileUrl": "http://public.mldb.ai/datasets/MachineLearningConcepts.csv",
        "outputDataset": {
            "id": "ml_concepts",      # can be accessed later using this id
            "type": "sparse.mutable"  # sparse mutable dataset is needed since we tokenize words below
        },
        "named": "Concepts",          # row name expression for the output dataset
        "select":
            """
            tokenize(
                lower(Text),
                {splitChars: ' -''"?!;:/[]*,().',
                 minTokenLength: 4}) AS *
            """,                      # within the tokenize function:
                                      #   lower-case the text
                                      #   leave out punctuation and spaces
                                      #   keep only words of 4 or more characters
        "runOnCreation": True
    }
}))
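As a rough sketch (plain Python, not MLDB's actual implementation), the tokenize call above lowercases the text, splits it on the listed separator characters, drops tokens shorter than 4 characters, and counts the remaining words:

```python
import re
from collections import Counter

# Same separator characters as the splitChars argument above.
SPLIT_CHARS = " -'\"?!;:/[]*,()."

def tokenize(text, min_token_length=4):
    # Lowercase, split on any run of separators, keep long-enough tokens.
    pattern = "[" + re.escape(SPLIT_CHARS) + "]+"
    tokens = re.split(pattern, text.lower())
    return Counter(t for t in tokens if len(t) >= min_token_length)

print(tokenize("Machine learning: the study of algorithms!"))
```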
We can use the Query API to get the data into a Pandas DataFrame and take a quick look at it. You will notice that certain cells have a 'NaN' value. This is because the dataset is sparse: not every word appears in the description of every concept, and those missing values are represented as NaNs in a Pandas DataFrame.
In [10]:
mldb.query("SELECT * FROM ml_concepts LIMIT 5")
Out[10]:
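As a small standalone illustration (plain pandas with hypothetical token counts, not the actual dataset), building a DataFrame from rows that contain different words reproduces this NaN pattern:

```python
import pandas as pd

# Hypothetical per-concept token counts; each row contains different
# words, so the combined DataFrame is sparse.
rows = {
    "Naive Bayes": {"bayes": 2, "classifier": 1},
    "Clustering":  {"cluster": 3, "distance": 1},
}
df = pd.DataFrame.from_dict(rows, orient="index")
print(df)  # words absent from a row show up as NaN
```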
In [11]:
mldb.query(
    """
    SELECT horizontal_sum({*}) AS count
    FROM ml_concepts
    """
)
Out[11]:
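The horizontal_sum function above adds up each row's values across all of its columns, giving the total word count per concept. On a pandas DataFrame, the equivalent is a row-wise sum that skips NaNs; a minimal sketch with made-up counts:

```python
import pandas as pd

# Sparse token counts: NaN where a word is absent from a description.
df = pd.DataFrame(
    {"bayes": [2, None], "cluster": [None, 3], "distance": [None, 1]},
    index=["Naive Bayes", "Clustering"],
)
counts = df.sum(axis=1)  # row-wise sum; NaNs are skipped by default
print(counts)
```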
In [12]:
print(mldb.put("/v1/datasets/transposed_concepts", {
    "type": "transposed",
    "params": {
        "dataset": {"id": "ml_concepts"}
    }
}))
The code above transposed the data and created the transposed_concepts dataset. Now that we have transposed our data, we can easily see which words are used most frequently in the machine learning concept descriptions. Note that the query below joins the results from the transposed_concepts dataset with those of the inline transpose function so that we can compare their outputs.
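Conceptually, transposing swaps rows and columns: words become rows and concepts become columns. On a pandas DataFrame the same operation is `.T`; a minimal sketch with made-up data (plain pandas, not MLDB's transposed dataset type):

```python
import pandas as pd

df = pd.DataFrame(
    {"bayes": [2, None], "cluster": [None, 3]},
    index=["Naive Bayes", "Clustering"],
)
transposed = df.T  # words are now rows; concepts are now columns
print(transposed)
```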
In [13]:
mldb.query(
    """
    SELECT
        horizontal_count({a.*}) AS top_words_transp_dataset,
        horizontal_count({b.*}) AS top_words_transp_function
    NAMED a.rowName()
    FROM transposed_concepts AS a
    JOIN transpose(ml_concepts) AS b
        ON a.rowName() = b.rowName()
    ORDER BY top_words_transp_dataset DESC
    LIMIT 10
    """
)
Out[13]:
While the two methods yield the same results, the transposed dataset type keeps the output in memory as a named dataset. This is useful if you want to reuse the transpose in multiple steps of your data analysis. In cases where you don't need the transpose more than once, you can simply use the inline syntax.
In [14]:
print(mldb.put("/v1/datasets/sampled_tokens", {
    "type": "sampled",
    "params": {
        "rows": 10,
        "withReplacement": False,
        "dataset": {"id": "transposed_concepts"},
        "seed": 0
    }
}))
In [15]:
mldb.query("SELECT * FROM sampled_tokens")
Out[15]:
The above code is similar to using the sample function inline in a query:
mldb.query(
    """
    SELECT *
    FROM sample(
        transposed_concepts,
        {rows: 10, withReplacement: False}
    )
    """
)
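The sampling itself (n rows drawn without replacement, reproducible under a fixed seed) behaves much like pandas' DataFrame.sample; a rough standalone sketch with made-up rows:

```python
import pandas as pd

# 20 hypothetical token rows; draw 5 of them without replacement.
df = pd.DataFrame({"count": range(20)},
                  index=["word%d" % i for i in range(20)])
sampled = df.sample(n=5, replace=False, random_state=0)
print(sampled)
```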
Again, the two methods produce the same outcome, allowing you to choose how best to manipulate your data.
As seen above, the two different ways to either transpose or sample data are equivalent. It is recommended to use Dataset types instead of SQL functions when the created table will be reused or called later in the program.
As seen in this tutorial, MLDB allows you to manipulate data virtually in multiple ways. This grants greater flexibility when constructing models and analyzing complex data.
Check out the other Tutorials and Demos.