Introduction to Data Science

“The scientific way of analysing data, or extracting knowledge out of data, is called data science.”

                                                    OR

“Data science is all about making sense of data, or extracting knowledge from data, using data science techniques.”

                                                    OR

“Data science, also known as data-driven science, is an interdisciplinary field that uses scientific methods, processes, and systems to extract knowledge or insights from data in its various forms, whether structured, semi-structured, or unstructured.”

Introduction to Natural Language Processing (NLP)

  • What is Natural Language?
  • What is Natural Language Processing (NLP)?

    • Natural language processing is the ability of computational technologies and/or computational linguistics to process human natural language.

    • Natural language processing is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages.

    • Natural language processing can be defined as the automatic (or semi-automatic) processing of human natural language (a tiny sketch follows below).
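
As a tiny illustration of what "automatic processing of human natural language" can mean in practice, here is a minimal sketch added to these notes (not from the original slides; the sentence and printed counts are just an example): lowercase a sentence, split it into word tokens, and count token frequencies.

In [ ]:
from collections import Counter

# A very naive but fully automatic way of processing natural language text:
# normalize case, tokenize on whitespace, and count how often each token occurs.
sentence = "Natural language processing helps computers process natural language"
tokens = sentence.lower().split()      # naive whitespace tokenization
print(Counter(tokens).most_common(3))  # e.g. [('natural', 2), ('language', 2), ('processing', 1)]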

Applications of NLP

Components of NLP

Development cycle of NLP applications

Introduction to Machine Learning (ML)

  • In 1959, Arthur Samuel described machine learning as the field of study that gives computers the ability to learn without being explicitly programmed. He evolved this concept of ML from the study of pattern recognition and computational learning theory in AI.
  • In 1997, Tom Mitchell gave us a more formal definition that is useful to anyone comfortable with basic math. The definition of ML as per Tom Mitchell is:
    • A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
  • Let's link the preceding definition with our previous example.
    • Identifying a license plate is the task T. You run an ML program on examples of license plates, which constitute the experience E, and if it learns successfully, then how accurately it identifies the next, unseen license plate is the performance measure P. A minimal sketch of this idea in code follows below.
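
The sketch below is an illustration added to these notes, not part of the original slides. It maps T, E, and P onto working code, assuming scikit-learn and its bundled handwritten-digits dataset as a stand-in, since reading a license plate largely comes down to recognizing individual digits:

In [ ]:
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()  # 8x8 images of handwritten digits (stand-in for plate characters)
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=0)

# Growing experience E: train on more and more examples of task T (classifying digits).
for n in (50, 500, len(X_train)):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[:n], y_train[:n])
    # Performance measure P: accuracy on unseen digits should improve with experience E.
    print(n, "examples -> accuracy (P):", round(model.score(X_test, y_test), 3))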

Types of Machine Learning

Machine Learning Application

It's time for some basic hands-on work

  • I'm using a very small dataset of student test scores and the number of hours the students studied.

  • Intuitively, we know that there must be a relationship, right? The more you study, the better your test scores should be.
  • We're going to use linear regression to model this relationship.

Steps we are going to follow

  1. Load Data
  2. Initialization of the parameters
  3. Define Linear Equation
  4. Define and understand Sum of Squared Error value and equation
  5. Run gradient descent to obtain the line of best fit

Linear Regression using Gradient Descent based optimization

Linear Equation

Sum of Squared Error Value and Equation

Sum of squared distances equation in statistics

Sum of squared distances formula (to calculate our error) in linear regression

Partial derivative with respect to b and m (to perform gradient descent)
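
The equation images from the original slides are not reproduced here. As a reconstruction (a sketch, assuming the standard simple linear regression setup that the code below implements), the line, the error, and the partial derivatives are:

$$y = mx + b$$

$$E(m, b) = \frac{1}{N}\sum_{i=1}^{N}\bigl(y_i - (mx_i + b)\bigr)^2$$

$$\frac{\partial E}{\partial m} = \frac{2}{N}\sum_{i=1}^{N} -x_i\bigl(y_i - (mx_i + b)\bigr), \qquad \frac{\partial E}{\partial b} = \frac{2}{N}\sum_{i=1}^{N} -\bigl(y_i - (mx_i + b)\bigr)$$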


In [ ]:
from numpy import genfromtxt, array

# y = mx + b
# m is the slope, b is the y-intercept
# Here we compute the mean of the squared errors using the equation we have seen.
def compute_error_for_line_given_points(b, m, points):
    totalError = 0
    for i in range(len(points)):
        x = points[i, 0]
        y = points[i, 1]
        totalError += (y - (m * x + b)) ** 2
    return totalError / float(len(points))

def step_gradient(b_current, m_current, points, learningRate):
    b_gradient = 0
    m_gradient = 0
    N = float(len(points))
    for i in range(len(points)):
        x = points[i, 0]
        y = points[i, 1]
        # Here we code up our partial-derivative equations and accumulate
        # the gradients that move m and b toward the local minimum.
        b_gradient += -(2 / N) * (y - ((m_current * x) + b_current))
        m_gradient += -(2 / N) * x * (y - ((m_current * x) + b_current))
    # We multiply b_gradient and m_gradient by the learning rate, so choosing a
    # sensible learning rate is important: if it is too high, the updates overshoot
    # and the model learns nothing; if it is too small, training is very slow and may
    # not reach the line of best fit within the iteration budget.
    # The learning rate is therefore an important hyperparameter.
    new_b = b_current - (learningRate * b_gradient)
    new_m = m_current - (learningRate * m_gradient)
    return [new_b, new_m]

def gradient_descent_runner(points, starting_b, starting_m, learning_rate, num_iterations):
    b = starting_b
    m = starting_m
    for i in range(num_iterations):
        # step_gradient() applies one update using the partial derivatives of the error function
        b, m = step_gradient(b, m, array(points), learning_rate)
    return [b, m]

def run():
    # Step 1 : Read data

    # genfromtxt is used to read the data from the data.csv file.
    points = genfromtxt("./data/data.csv", delimiter=",")

    # Step 2 : Define certain hyperparameters

    # The learning rate controls how fast our model converges,
    # i.e., how fast we arrive at the line of best fit.
    learning_rate = 0.0001
    # Here we need to draw the line which best fits our data,
    # so we use y = mx + b (x and y are points; m is the slope; b is the y-intercept).
    # Initial y-intercept guess
    initial_b = 0
    # Initial slope guess
    initial_m = 0
    # How long do we want to train the model?
    # The dataset is small, so we iterate 1000 times.
    num_iterations = 1000
    
    # Step 3 : Print the initial values of b, m, and the error.
    # compute_error_for_line_given_points() computes the error for the given points.
    print("Starting gradient descent at b = {0}, m = {1}, error = {2}".format(
        initial_b, initial_m, compute_error_for_line_given_points(initial_b, initial_m, points)))
    print("Running...")

    # gradient_descent_runner() actually performs the gradient descent updates.
    [b, m] = gradient_descent_runner(points, initial_b, initial_m, learning_rate, num_iterations)

    # Here we print the values of b, m, and the error after fitting the line of best fit to the given dataset.
    print("After {0} iterations b = {1}, m = {2}, error = {3}".format(
        num_iterations, b, m, compute_error_for_line_given_points(b, m, points)))

if __name__ == '__main__':
    run()
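
The data.csv file itself is not included with these notes. To run the cell end to end, a hypothetical stand-in with the same two-column layout (hours studied, test score) can be generated first; the value ranges and noise level below are assumptions, not the original dataset:

In [ ]:
import os
import numpy as np

# Hypothetical stand-in for ./data/data.csv (hours studied, test score); NOT the original data.
os.makedirs("./data", exist_ok=True)
hours = np.random.uniform(1, 10, size=100)                      # hours studied
scores = np.clip(5 * hours + 40 + np.random.normal(0, 5, 100),  # roughly linear + noise
                 0, 100)
np.savetxt("./data/data.csv", np.column_stack((hours, scores)), delimiter=",")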