Introduction to Statistics

Types of Data

Continuous

Data that can take on any value in an interval.

Discrete

Data that can take on only integer values, such as counts.

Categorical

Data that can take on only a specific set of values representing a set of possible categories.

Binary

A special case of categorical data with just two categories of values (0/1, true/false).

Ordinal

Categorical data that has an explicit ordering.

Rectangular Data

Data frame

Rectangular data (like a spreadsheet) is the basic data structure for statistical and machine learning models.

Feature

A column in the table is commonly referred to as a feature. (attribute, input, predictor, variable)

Outcome

Many data science projects involve predicting an outcome—often a yes/no outcome. The features are sometimes used to predict the outcome in an experiment or study. (dependent variable, response, target, output)

Records

A row in the table is commonly referred to as a record. (case, example, instance, observation, pattern, sample)

Nonrectangular Data Structures

There are other data structures besides rectangular data.

Time series data records successive measurements of the same variable.
Spatial data structures, which are used in mapping and location analytics, are more complex and varied than rectangular data structures.
Graph (or network) data structures are used to represent physical, social, and abstract relationships.

Estimates of Location

Variables with measured or count data might have thousands of distinct values.
A basic step in exploring data is getting a “typical value” for each feature (variable): an estimate of where most of the data is located (i.e., its central tendency).

Mean

The sum of all values divided by the number of values. (average)
The symbol $\bar{x}$ denotes the mean of a sample drawn from a population.

Weighted mean

The sum of each value times its weight, divided by the sum of the weights. (weighted average)

Trimmed mean

The average of all values after dropping a fixed number of extreme values. (truncated mean)
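
The three means defined so far can be compared on one small data set. This is a minimal sketch using only the Python standard library; the data, the weights, and the helper names `weighted_mean` and `trimmed_mean` are made up for illustration, and the trimmed mean here is assumed to drop the same number of values from each end.

```python
from statistics import mean


def weighted_mean(values, weights):
    """Sum of each value times its weight, divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, p):
    """Mean after dropping the p smallest and the p largest values."""
    ordered = sorted(values)
    if 2 * p >= len(ordered):
        raise ValueError("p is too large: no values would remain")
    return mean(ordered[p:len(ordered) - p])


# Made-up data; 100 plays the role of an extreme value.
values = [2, 4, 4, 5, 7, 9, 100]
# Hypothetical weights; a weight of 0 removes the last value's influence.
weights = [1, 1, 1, 1, 1, 1, 0]

plain = mean(values)                       # pulled upward by the value 100
weighted = weighted_mean(values, weights)  # ignores the zero-weight value
trimmed = trimmed_mean(values, 1)          # averages 4, 4, 5, 7, 9

print(plain, weighted, trimmed)
```

Both the weighted and the trimmed mean land near the bulk of the data, while the plain mean is dragged toward the extreme value.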

Median

The value such that one-half of the data lies above it and one-half lies below it.

Weighted median

The value in the sorted data such that one-half of the sum of the weights lies above it and one-half lies below it.

Robust

Not sensitive to extreme values.
The median is referred to as a robust estimate of location since it is not influenced by outliers (extreme cases).

Outlier

A data value that is very different from most of the data.
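
Robustness can be checked directly by adding one outlier to a small data set and seeing how far the mean and the median move. The sketch below uses made-up data. The `weighted_median` helper is a hypothetical illustration that returns the smallest sorted value at which the running weight reaches half the total, which is one of several conventions.

```python
from statistics import mean, median


def weighted_median(values, weights):
    """Smallest value at which the running weight reaches half the total."""
    if not values or sum(weights) <= 0:
        raise ValueError("need data and a positive total weight")
    half = sum(weights) / 2
    running = 0
    for value, weight in sorted(zip(values, weights)):
        running += weight
        if running >= half:
            return value
    return max(values)  # defensive fallback


clean = [3, 4, 5, 6, 7]          # made-up data
with_outlier = clean + [1000]    # the same data plus one outlier

mean_shift = mean(with_outlier) - mean(clean)        # large
median_shift = median(with_outlier) - median(clean)  # small

# A heavy weight on the largest value pulls the weighted median up to it.
wm = weighted_median([1, 2, 3, 4], [1, 1, 1, 5])

print(mean_shift, median_shift, wm)
```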

Estimates of Variability

Location is just one dimension in summarizing a feature.
A second dimension, variability, also referred to as dispersion, measures whether the data values are tightly clustered or spread out.

Deviations

The difference between the observed values and the estimate of location. (errors, residuals)

Variance

The sum of squared deviations from the mean, divided by n – 1, where n is the number of data values. (mean-squared-error)

Standard deviation

The square root of the variance.

Mean absolute deviation

The mean of the absolute value of the deviations from the mean.

Median absolute deviation from the median

The median of the absolute value of the deviations from the median.
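
The variability estimates above can be computed side by side. This is a minimal sketch with made-up data, relying on the standard library's `statistics.variance` and `statistics.stdev`, both of which divide by n - 1 as in the definition of variance given here.

```python
from statistics import mean, median, stdev, variance

data = [1, 4, 4, 7, 9]  # made-up data with mean 5 and median 4

center = mean(data)
deviations = [x - center for x in data]

sample_variance = variance(data)   # sum of squared deviations / (n - 1)
sample_std = stdev(data)           # square root of the variance

# Mean absolute deviation: average size of the deviations from the mean.
mean_abs_dev = mean([abs(d) for d in deviations])

# Median absolute deviation from the median (MAD).
med = median(data)
mad = median([abs(x - med) for x in data])

print(sample_variance, sample_std, mean_abs_dev, mad)
```

Because the MAD is built from medians only, it is robust to extreme values in the same way the median is.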

Range

The difference between the largest and the smallest value in a data set.

Order statistics

Metrics based on the data values sorted from smallest to biggest. (rank)

Percentile

The value such that P percent of the values take on this value or less and (100–P) percent take on this value or more.

Interquartile range

The difference between the 75th percentile and the 25th percentile. (IQR)
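
The range and the interquartile range both come from the sorted data. The sketch below uses made-up data and `statistics.quantiles` from the standard library (Python 3.8 or later). Percentile conventions differ between tools, so the `method="inclusive"` setting is an assumption chosen for illustration, not the only valid choice.

```python
from statistics import quantiles

data = [7, 1, 9, 3, 5, 2, 8, 4, 6]  # made-up data: the integers 1 to 9, unsorted

ordered = sorted(data)                  # order statistics start from sorted data
data_range = ordered[-1] - ordered[0]   # largest minus smallest

# quantiles(..., n=4) returns the 25th, 50th, and 75th percentiles.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(data_range, q1, q2, q3, iqr)
```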

Exploring the Distribution

Boxplot

A plot introduced by Tukey as a quick way to visualize the distribution of data.

Frequency table

A tally of the count of numeric data values that fall into a set of intervals (bins).

Histogram

A plot of the frequency table with the bins on the x-axis and the count (or proportion) on the y-axis.
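
A frequency table and a rough histogram can be built without a plotting library. This is a minimal sketch assuming equal-width bins; the data and the `frequency_table` helper are made up for illustration. The text histogram is rotated, with the bins running down the page rather than along the x-axis.

```python
def frequency_table(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; cannot form bins")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value goes into the last bin instead of opening a new one.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    return [(lo + i * width, lo + (i + 1) * width, c)
            for i, c in enumerate(counts)]


data = [1, 2, 2, 3, 5, 6, 8, 9, 9, 10]  # made-up data
table = frequency_table(data, n_bins=3)

# A text histogram: one row per bin, one '#' per counted value.
for left, right, count in table:
    print(f"{left:5.1f} to {right:5.1f} | {'#' * count}")
```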

Density plot

A smoothed version of the histogram, often based on a kernel density estimate.

Exploring Categorical Data

Mode

The most commonly occurring category or value in a data set.

Expected value

When the categories can be associated with a numeric value, this gives an average value based on a category’s probability of occurrence.
A marketer for a new cloud technology, for example, offers two levels of service, one priced at \$300 per month and another at \$50 per month. The marketer offers free webinars to generate leads, and the firm figures that 5% of the attendees will sign up for the \$300 service, 15% for the \$50 service, and 80% will not sign up for anything.
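
For the webinar example, the expected value is each outcome's value multiplied by its probability, summed over all outcomes: 0.05 × 300 + 0.15 × 50 + 0.80 × 0 = 22.5 dollars per attendee per month. The short sketch below repeats that arithmetic; the variable names are illustrative only.

```python
# (monthly price in dollars, probability that an attendee ends up there)
outcomes = [
    (300, 0.05),  # signs up for the $300 service
    (50, 0.15),   # signs up for the $50 service
    (0, 0.80),    # does not sign up
]

# The probabilities cover every outcome, so they should sum to 1.
total_probability = sum(p for _, p in outcomes)

# Expected value: each outcome's value times its probability, summed.
expected_value = sum(value * p for value, p in outcomes)
print(f"Expected value per attendee: ${expected_value:.2f} per month")
```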

Bar charts

The frequency or proportion for each category plotted as bars.
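
The mode and the proportions behind a bar chart both come from a simple tally of the categories. The sketch below uses made-up categorical data and `collections.Counter`; the text bars only stand in for a real chart.

```python
from collections import Counter

# Made-up categorical data.
colors = ["red", "blue", "red", "green", "red", "blue"]

counts = Counter(colors)
mode, mode_count = counts.most_common(1)[0]   # the most frequent category

# Proportion of each category, the quantity a bar chart would display.
proportions = {category: n / len(colors) for category, n in counts.items()}

for category, share in sorted(proportions.items()):
    print(f"{category:<6} {'#' * round(share * 20)} {share:.0%}")
```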

Correlation

Correlation coefficient

A metric that measures the extent to which numeric variables are associated with one another (ranges from –1 to +1).

Correlation matrix

A table where the variables are shown on both rows and columns, and the cell values are the correlations between the variables.
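
A correlation matrix can be sketched with a hand-written coefficient. The text does not say which coefficient is meant, so this example assumes Pearson's correlation coefficient. The variable names and data are made up, and `pearson` is an illustrative helper rather than a library function.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dev_x = [a - mean_x for a in x]
    dev_y = [b - mean_y for b in y]
    covariation = sum(a * b for a, b in zip(dev_x, dev_y))
    return covariation / sqrt(sum(a * a for a in dev_x) * sum(b * b for b in dev_y))


# Made-up variables (hypothetical names).
variables = {
    "hours": [1, 2, 3, 4, 5],
    "score": [2, 4, 5, 4, 5],
    "errors": [9, 7, 6, 4, 1],
}

# Correlation matrix: variables on both rows and columns.
matrix = {
    row: {col: pearson(variables[row], variables[col]) for col in variables}
    for row in variables
}

for row, cells in matrix.items():
    print(row.ljust(7), "  ".join(f"{value:+.2f}" for value in cells.values()))
```

The diagonal is 1 because every variable is perfectly correlated with itself, and the matrix is symmetric because the coefficient does not depend on the order of the two variables.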

Scatterplot

A plot in which the x-axis is the value of one variable, and the y-axis the value of another.

Normal Distribution

The bell-shaped normal distribution is iconic in traditional statistics.
The fact that distributions of sample statistics are often normally shaped has made it a powerful tool in the development of mathematical formulas that approximate those distributions.

Error

The difference between a data point and a predicted or average value.

Standardize

Subtract the mean and divide by the standard deviation.

z-score

The result of standardizing an individual data point.
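
Standardizing a whole data set yields one z-score per data point. This is a minimal sketch with made-up data; it assumes the sample standard deviation (dividing by n - 1), which is what `statistics.stdev` computes.

```python
from statistics import mean, stdev

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data

center = mean(data)
spread = stdev(data)   # sample standard deviation (divides by n - 1)

# Subtract the mean, then divide by the standard deviation.
z_scores = [(x - center) / spread for x in data]

print([round(z, 2) for z in z_scores])
```

After standardizing, the z-scores have mean 0 and standard deviation 1, so values from different features can be compared on a common scale.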

Standard normal

A normal distribution with mean = 0 and standard deviation = 1.

QQ-Plot

A plot to visualize how close a sample distribution is to a normal distribution.

In a normal distribution

About 68% of the data lies within one standard deviation of the mean.
About 95% lies within two standard deviations.
About 99.7% lies within three standard deviations.
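
These percentages are rounded values of the normal distribution's probabilities. For a normal variable, the probability of falling within k standard deviations of the mean is erf(k / √2), which the sketch below evaluates with `math.erf`; the helper name `share_within` is illustrative only.

```python
from math import erf, sqrt


def share_within(k):
    """Probability that a normal value lies within k standard deviations of its mean."""
    return erf(k / sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.1%}")
```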

Basic Probability

Probability is concerned with the outcomes of trials.
A trial is an event whose outcome is unknown.
Trials are also called experiments or, when there are several, observations.

Sample Space (S) - The set of all possible elementary outcomes of a trial.

Events (E) - An event is the specification of the outcome of a trial.

The probability of an event is always between 0 and 1.
The probabilities of an event and its complement always sum to 1.
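
A single roll of a die makes these definitions concrete. The sketch below assumes a fair six-sided die, so every elementary outcome is equally likely and the probability of an event is simply its share of the sample space. The example and the `probability` helper are illustrative, not part of the original notes.

```python
from fractions import Fraction

# Sample space S for one roll of a six-sided die (assumed fair).
sample_space = {1, 2, 3, 4, 5, 6}

# Event E: the roll is even. Its complement: the roll is odd.
event = {outcome for outcome in sample_space if outcome % 2 == 0}
complement = sample_space - event


def probability(subset):
    """Probability of an event when every elementary outcome is equally likely."""
    return Fraction(len(subset), len(sample_space))


p_event = probability(event)
p_complement = probability(complement)

print(p_event, p_complement, p_event + p_complement)
```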


