Introduction to Statistics

Types of Data

Continuous

  • Data that can take on any value in an interval.

Discrete

  • Data that can take on only integer values, such as counts.

Categorical

  • Data that can take on only a specific set of values representing a set of possible categories.

Binary

  • A special case of categorical data with just two categories of values (0/1, true/false).

Ordinal

  • Categorical data that has an explicit ordering.

Rectangular Data

Data frame

  • Rectangular data (like a spreadsheet) is the basic data structure for statistical and machine learning models.

Feature

  • A column in the table is commonly referred to as a feature. (attribute, input, predictor, variable)

Outcome

  • Many data science projects involve predicting an outcome—often a yes/no outcome. The features are sometimes used to predict the outcome in an experiment or study. (dependent variable, response, target, output)

Records

  • A row in the table is commonly referred to as a record. (case, example, instance, observation, pattern, sample)
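As a rough illustration of the terms above, a rectangular table can be sketched in plain Python as a list of records. The column names and values here (`age`, `plan`, `churned`) are hypothetical and serve only to show which part is a feature, an outcome, or a record.

```python
# Hypothetical table: each dict is one record (row), each key is one column.
records = [
    {"age": 34, "plan": "basic", "churned": False},
    {"age": 51, "plan": "premium", "churned": True},
    {"age": 27, "plan": "basic", "churned": False},
]

# Assumed split: two feature columns and one yes/no outcome column.
feature_names = ["age", "plan"]
outcome_name = "churned"

# Feature rows (one list per record) and the matching outcome values.
X = [[record[name] for name in feature_names] for record in records]
y = [record[outcome_name] for record in records]

print(len(records), "records,", len(feature_names), "features")
```

In practice a data frame library would hold this table; the list of dicts is only the smallest structure that shows rows and columns.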

Nonrectangular Data Structures

There are other data structures besides rectangular data.

  • Time series data records successive measurements of the same variable.

  • Spatial data structures, which are used in mapping and location analytics, are more complex and varied than rectangular data structures.

  • Graph (or network) data structures are used to represent physical, social, and abstract relationships.

Estimates of Location

  • Variables with measured or count data might have thousands of distinct values.
  • A basic step in exploring data is getting a “typical value” for each feature (variable): an estimate of where most of the data is located (i.e., its central tendency).

Mean

  • The sum of all values divided by the number of values. (average)
  • $\bar{x}$ denotes the mean of a sample drawn from a population.

Weighted mean

  • The sum of each value multiplied by its weight, divided by the sum of the weights. (weighted average)

Trimmed mean

  • The average of all values after dropping a fixed number of extreme values. (truncated mean)

Median

  • The value such that one-half of the data lies above it and one-half lies below it.

Weighted median

  • The value in the sorted data such that one-half of the sum of the weights lies above it and one-half lies below it.

Robust

  • Not sensitive to extreme values.
  • The median is referred to as a robust estimate of location since it is largely unaffected by outliers (extreme cases).

Outlier

  • A data value that is very different from most of the data.
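The estimates of location above can be compared on a small sample. This is a minimal sketch using only the standard library; the data, the weights, and the helper names `weighted_mean` and `trimmed_mean` are made up for illustration.

```python
import statistics


def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, k):
    """Mean after dropping the k smallest and the k largest values."""
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Hypothetical sample with one outlier (100).
data = [2, 3, 3, 4, 5, 6, 100]
# Hypothetical weights: a weight of 0 ignores the outlier entirely.
weights = [1, 1, 1, 1, 1, 1, 0]

mean = statistics.mean(data)             # pulled upward by the outlier
median = statistics.median(data)         # middle value of the sorted data
trimmed = trimmed_mean(data, 1)          # drops 2 and 100
weighted = weighted_mean(data, weights)

print(mean, median, trimmed, weighted)
```

The mean lands well above most of the data, while the median and the trimmed mean stay close to the bulk of the values, which is what "robust" means here.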

Estimates of Variability

  • Location is just one dimension in summarizing a feature.

  • A second dimension, variability, also referred to as dispersion, measures whether the data values are tightly clustered or spread out.

Deviations

  • The difference between the observed values and the estimate of location. (errors, residuals)

Variance

  • The sum of squared deviations from the mean divided by n – 1 where n is the number of data values. (mean-squared-error)

Standard deviation

  • The square root of the variance.

Mean absolute deviation

  • The mean of the absolute value of the deviations from the mean.

Median absolute deviation from the median

  • The median of the absolute value of the deviations from the median.

Range

  • The difference between the largest and the smallest value in a data set.

Order statistics

  • Metrics based on the data values sorted from smallest to biggest. (rank)

Percentile

  • The value such that P percent of the values take on this value or less and (100–P) percent take on this value or more.

Interquartile range

  • The difference between the 75th percentile and the 25th percentile. (IQR)
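The variability estimates above can be computed step by step from the deviations. This is a sketch on a made-up sample; the percentile method (`method="inclusive"`) is an assumption, and other interpolation rules give slightly different quartiles.

```python
import math
import statistics

# Hypothetical sample.
data = [1, 4, 4, 5, 6, 10]
n = len(data)

mean = statistics.mean(data)
median = statistics.median(data)

# Deviations from the mean (the estimate of location).
deviations = [x - mean for x in data]

# Variance: sum of squared deviations divided by n - 1.
variance = sum(d ** 2 for d in deviations) / (n - 1)   # 8.8
std_dev = math.sqrt(variance)

# Mean absolute deviation (around the mean).
mean_abs_dev = sum(abs(d) for d in deviations) / n     # 2.0

# Median absolute deviation from the median.
median_abs_dev = statistics.median([abs(x - median) for x in data])  # 1.0

# Range and interquartile range.
value_range = max(data) - min(data)
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(variance, std_dev, mean_abs_dev, median_abs_dev, value_range, iqr)
```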

Exploring the Distribution

Boxplot

  • A plot introduced by Tukey as a quick way to visualize the distribution of data.

Frequency table

  • A tally of the count of numeric data values that fall into a set of intervals (bins).

Histogram

  • A plot of the frequency table with the bins on the x-axis and the count (or proportion) on the y-axis.

Density plot

  • A smoothed version of the histogram, often based on a kernel density estimate.
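A frequency table can be built by hand with equal-width bins, and printing one row of marks per bin gives a crude text histogram. The sketch below assumes equal-width bins with the maximum value placed in the last bin; the function name `frequency_table` and the data are hypothetical.

```python
def frequency_table(values, n_bins):
    """Count values in n_bins equal-width intervals spanning min..max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value would fall just past the last bin, so clamp it.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, counts[i])
            for i in range(n_bins)]


# Hypothetical sample.
data = [1, 2, 2, 3, 5, 7, 8, 9, 9, 10]
table = frequency_table(data, 3)
counts = [count for _, _, count in table]

# Text histogram: one row per bin, one '#' per observation.
for start, end, count in table:
    print(f"{start:5.1f} to {end:5.1f} | {'#' * count}")
```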

Exploring Categorical Data

Mode

  • The most commonly occurring category or value in a data set.

Expected value

  • When the categories can be associated with a numeric value, this gives an average value based on a category’s probability of occurrence.

  • A marketer for a new cloud technology, for example, offers two levels of service, one priced at \$300 per month and another at \$50 per month. The marketer offers free webinars to generate leads, and the firm figures that 5% of the attendees will sign up for the \$300 service, 15% for the \$50 service, and 80% will not sign up for anything.
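The webinar example can be finished by multiplying each outcome by its probability and summing: 0.05 × 300 + 0.15 × 50 + 0.80 × 0 = 22.5, so the expected value of an attendee is about \$22.50 per month. The same arithmetic as a short sketch (the function name `expected_value` is made up):

```python
def expected_value(outcomes):
    """Sum of value * probability over (value, probability) pairs."""
    total_probability = sum(p for _, p in outcomes)
    # The probabilities of all categories should add up to 1.
    assert abs(total_probability - 1.0) < 1e-9
    return sum(value * p for value, p in outcomes)


# (monthly revenue in dollars, probability) from the webinar example.
webinar_outcomes = [(300, 0.05), (50, 0.15), (0, 0.80)]

ev = expected_value(webinar_outcomes)
print(f"Expected monthly revenue per attendee: ${ev:.2f}")
```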

Bar charts

  • The frequency or proportion for each category plotted as bars.

Correlation

Correlation coefficient

  • A metric that measures the extent to which numeric variables are associated with one another (ranges from –1 to +1).

Correlation matrix

  • A table where the variables are shown on both rows and columns, and the cell values are the correlations between the variables.

Scatterplot

  • A plot in which the x-axis is the value of one variable, and the y-axis the value of another.
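One common correlation coefficient is Pearson's, which divides the sum of the products of the deviations by the product of the square roots of the summed squared deviations. The sketch below assumes that this is the coefficient meant here and that no variable is constant (a constant column would cause a division by zero); the variable names and data are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / (math.sqrt(ss_x) * math.sqrt(ss_y))


def correlation_matrix(columns):
    """Nested dict of pairwise correlations; variables on rows and columns."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names}
            for a in names}


# Hypothetical variables: y rises with x, z falls with x.
columns = {
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [5, 4, 3, 2, 1],
}
matrix = correlation_matrix(columns)

for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

A perfectly increasing relationship gives a value at +1 and a perfectly decreasing one gives –1, the two ends of the range.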

Normal Distribution

  • The bell-shaped normal distribution is iconic in traditional statistics.

  • The fact that distributions of sample statistics are often normally shaped has made it a powerful tool in the development of mathematical formulas that approximate those distributions.

Error

  • The difference between a data point and a predicted or average value.

Standardize

  • Subtract the mean and divide by the standard deviation.

z-score

  • The result of standardizing an individual data point.
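Standardizing can be sketched in a few lines. This example assumes the sample standard deviation (n – 1 in the denominator) is used; the data are hypothetical.

```python
import statistics

# Hypothetical sample.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)
std_dev = statistics.stdev(data)  # sample standard deviation (n - 1)

# z-score of each point: subtract the mean, divide by the standard deviation.
z_scores = [(x - mean) / std_dev for x in data]

# After standardizing, the values have mean 0 and standard deviation 1
# (up to floating-point rounding).
print(statistics.mean(z_scores), statistics.stdev(z_scores))
```

A z-score says how many standard deviations a point lies above or below the mean, which puts variables measured in different units on a common scale.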

Standard normal

  • A normal distribution with mean = 0 and standard deviation = 1.

QQ-Plot

  • A plot to visualize how close a sample distribution is to a normal distribution.

In a normal distribution

  • About 68% of the data lies within one standard deviation of the mean.
  • About 95% lies within two standard deviations.
  • About 99.7% lies within three standard deviations.
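These percentages can be checked against the cumulative distribution function of the standard normal. The sketch uses `statistics.NormalDist` from the standard library (Python 3.8 or later); the helper name `prob_within` is made up.

```python
from statistics import NormalDist

# Standard normal: mean = 0, standard deviation = 1.
std_normal = NormalDist(mu=0, sigma=1)


def prob_within(k):
    """Probability that a normal value lies within k standard deviations."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {prob_within(k):.4f}")
```

The exact figures are slightly different from the rounded rule of thumb; two standard deviations cover a little more than 95%.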

Basic Probability

  • Probability is concerned with the outcome of trials.
  • Trials are also called experiments or observations.
  • A trial is an event whose outcome is unknown.

Sample Space (S) - Set of all possible elementary outcomes of a trial

Events (E) - An event is the specification of the outcome of a trial.

  • The probability of an event is always between 0 and 1.
  • The probabilities of an event and its complement always sum to 1.
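A small worked case: one roll of a six-sided die. The sketch assumes every elementary outcome is equally likely, so the probability of an event is the number of outcomes in it divided by the size of the sample space. The event chosen (an even roll) is only an example.

```python
from fractions import Fraction

# Sample space S: all elementary outcomes of one roll of a fair die.
sample_space = {1, 2, 3, 4, 5, 6}

# Event E: the roll is even. Its complement: the roll is odd.
event = {outcome for outcome in sample_space if outcome % 2 == 0}
complement = sample_space - event


def probability(event, sample_space):
    """P(E) under the assumption of equally likely outcomes."""
    return Fraction(len(event & sample_space), len(sample_space))


p_event = probability(event, sample_space)
p_complement = probability(complement, sample_space)

print(p_event, p_complement, p_event + p_complement)
```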
