Content under Creative Commons Attribution license CC-BY 4.0, code under MIT license (c)2014 Sergio Rojas (srojas@usb.ve) and Erik A Christensen (erikcny@aol.com).
In this chapter we'll learn the benefits of using the combination of Python, NumPy, SciPy, and Matplotlib as a programming environment for any scientific endeavor that requires mathematics; in particular, anything related to numerical computations. We'll explore the environment, learn how to download and install the required libraries, use them for some quick computations, and figure out a few good ways to search for help.
A few links to documentation that can help enhance the discussion presented in this section of the book are as follows:
In [6]:
import numpy
import scipy
scores = numpy.array([114, 100, 104, 89, 102, 91, 114, 114, 103, 105, 108,
                      130, 120, 132, 111, 128, 118, 119, 86, 72, 111, 103,
                      74, 112, 107, 103, 98, 96, 112, 112, 93])
The following generates the stem plot mentioned in the text.
In [7]:
nbins = 6
sortedscores = numpy.sort(scores)
# nbins + 1 equally spaced bin edges running from the minimum to the maximum score
intervals = numpy.linspace(sortedscores[0], sortedscores[-1], nbins + 1)
leftEdges = intervals[0:-1]
rightEdges = intervals[1:]
middleEdges = (leftEdges + rightEdges) / 2.0
In [8]:
j = 0
temp = sortedscores[j]
count = numpy.zeros([nbins, 1])
i = 0
istop = len(leftEdges) - 1
while j < len(sortedscores):
    # count each score falling in the half-open bin [leftEdges[i], rightEdges[i])
    while (leftEdges[i] <= temp) & (temp < rightEdges[i]):
        count[i] = count[i] + 1
        j = j + 1
        temp = sortedscores[j]
    if i < istop:
        i = i + 1
    else:
        j = j + 1
# the maximum score coincides with the last right edge, so count it separately
if temp == rightEdges[i]:
    count[i] = count[i] + 1
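As an aside (not part of the book's code), the same bin counts can be obtained in a single call to numpy.histogram: given an integer bins argument it builds equally spaced bins between the minimum and the maximum, half-open except for the last bin, which is closed, exactly the convention the loop above implements together with its final edge check. A minimal cross-check, assuming the scores, nbins, and count variables defined above:
# numpy.histogram returns the per-bin counts and the bin edges
hist_counts, bin_edges = numpy.histogram(scores, bins=nbins)
print(hist_counts)                                   # should print [ 2  3  4 10  9  3] for this data
print(numpy.allclose(hist_counts, count.flatten()))  # should print True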
In [9]:
%matplotlib inline
import matplotlib.pyplot as plt
# stem plot of the bin counts at the bin midpoints
plt.stem(middleEdges, count.flatten())
plt.show()
NOTE: to shorten the output of the next command, click with the left mouse button in the margin to the left of the output to activate a scrolling window. Do the same for any other long help output that follows.
In [10]:
dir(scores)
Out[10]:
['T',
'__abs__',
'__add__',
'__and__',
'__array__',
'__array_finalize__',
'__array_interface__',
'__array_prepare__',
'__array_priority__',
'__array_struct__',
'__array_wrap__',
'__bool__',
'__class__',
'__contains__',
'__copy__',
'__deepcopy__',
'__delattr__',
'__delitem__',
'__dir__',
'__divmod__',
'__doc__',
'__eq__',
'__float__',
'__floordiv__',
'__format__',
'__ge__',
'__getattribute__',
'__getitem__',
'__gt__',
'__hash__',
'__iadd__',
'__iand__',
'__ifloordiv__',
'__ilshift__',
'__imod__',
'__imul__',
'__index__',
'__init__',
'__int__',
'__invert__',
'__ior__',
'__ipow__',
'__irshift__',
'__isub__',
'__iter__',
'__itruediv__',
'__ixor__',
'__le__',
'__len__',
'__lshift__',
'__lt__',
'__mod__',
'__mul__',
'__ne__',
'__neg__',
'__new__',
'__or__',
'__pos__',
'__pow__',
'__radd__',
'__rand__',
'__rdivmod__',
'__reduce__',
'__reduce_ex__',
'__repr__',
'__rfloordiv__',
'__rlshift__',
'__rmod__',
'__rmul__',
'__ror__',
'__rpow__',
'__rrshift__',
'__rshift__',
'__rsub__',
'__rtruediv__',
'__rxor__',
'__setattr__',
'__setitem__',
'__setstate__',
'__sizeof__',
'__str__',
'__sub__',
'__subclasshook__',
'__truediv__',
'__xor__',
'all',
'any',
'argmax',
'argmin',
'argpartition',
'argsort',
'astype',
'base',
'byteswap',
'choose',
'clip',
'compress',
'conj',
'conjugate',
'copy',
'ctypes',
'cumprod',
'cumsum',
'data',
'diagonal',
'dot',
'dtype',
'dump',
'dumps',
'fill',
'flags',
'flat',
'flatten',
'getfield',
'imag',
'item',
'itemset',
'itemsize',
'max',
'mean',
'min',
'nbytes',
'ndim',
'newbyteorder',
'nonzero',
'partition',
'prod',
'ptp',
'put',
'ravel',
'real',
'repeat',
'reshape',
'resize',
'round',
'searchsorted',
'setfield',
'setflags',
'shape',
'size',
'sort',
'squeeze',
'std',
'strides',
'sum',
'swapaxes',
'take',
'tofile',
'tolist',
'tostring',
'trace',
'transpose',
'var',
'view']
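Most of the double-underscore names in the listing above are Python's special operator hooks; when exploring an object interactively it is usually the public attributes and methods that matter. A small aside (not from the book's code) that filters the listing with a list comprehension:
# keep only the public attributes and methods of the ndarray
public = [name for name in dir(scores) if not name.startswith('_')]
print(public)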
In [11]:
xmean = numpy.mean(scores)   # numpy.mean replaces the long-deprecated scipy.mean alias
sigma = numpy.std(scores)
n = numpy.size(scores)
# 99% confidence interval for the mean (z = 2.576 at the two-sided 99% level)
print("({0:0.14f}, {1:0.15f}, {2:0.14f})".format(xmean, xmean - 2.576*sigma/numpy.sqrt(n),
                                                 xmean + 2.576*sigma/numpy.sqrt(n)))
(105.83870967741936, 99.343223715529746, 112.33419563930897)
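The constant 2.576 used above is the two-sided critical value of the standard normal distribution at the 99% confidence level. Rather than hard-coding it, the value can be recovered from the inverse CDF available in scipy.stats (a short sketch; scipy.stats is formally imported in the next cell):
from scipy import stats
z = stats.norm.ppf(0.995)   # a two-sided 99% interval leaves 0.5% in each tail
print(z)                    # approximately 2.5758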
In [12]:
from scipy import stats
result = stats.bayes_mvs(scores)
In [13]:
help(scipy.stats.bayes_mvs)
Help on function bayes_mvs in module scipy.stats.morestats:
bayes_mvs(data, alpha=0.9)
Bayesian confidence intervals for the mean, var, and std.
Parameters
----------
data : array_like
Input data, if multi-dimensional it is flattened to 1-D by `bayes_mvs`.
Requires 2 or more data points.
alpha : float, optional
Probability that the returned confidence interval contains
the true parameter.
Returns
-------
mean_cntr, var_cntr, std_cntr : tuple
The three results are for the mean, variance and standard deviation,
respectively. Each result is a tuple of the form::
(center, (lower, upper))
with `center` the mean of the conditional pdf of the value given the
data, and `(lower, upper)` a confidence interval, centered on the
median, containing the estimate to a probability `alpha`.
Notes
-----
Each tuple of mean, variance, and standard deviation estimates represent
the (center, (lower, upper)) with center the mean of the conditional pdf
of the value given the data and (lower, upper) is a confidence interval
centered on the median, containing the estimate to a probability
`alpha`.
Converts data to 1-D and assumes all data has the same mean and variance.
Uses Jeffrey's prior for variance and std.
Equivalent to tuple((x.mean(), x.interval(alpha)) for x in mvsdist(dat))
References
----------
T.E. Oliphant, "A Bayesian perspective on estimating mean, variance, and
standard-deviation from data", http://hdl.handle.net/1877/438, 2006.
In [14]:
print(result[0])
(105.83870967741936, (101.48825534263035, 110.18916401220837))
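The variable result actually holds three such tuples, one each for the mean, the variance, and the standard deviation. A brief sketch, reusing the result computed above, that unpacks and prints all three with their default 90% intervals:
mean_cntr, var_cntr, std_cntr = result
for label, (center, (lower, upper)) in zip(('mean', 'variance', 'std dev'),
                                           (mean_cntr, var_cntr, std_cntr)):
    print("{0}: {1:.4f}, 90% interval ({2:.4f}, {3:.4f})".format(label, center, lower, upper))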
In [16]:
numpy.info('random')
*** Found in numpy ***
========================
Random Number Generation
========================
==================== =========================================================
Utility functions
==============================================================================
random Uniformly distributed values of a given shape.
bytes Uniformly distributed random bytes.
random_integers Uniformly distributed integers in a given range.
random_sample Uniformly distributed floats in a given range.
permutation Randomly permute a sequence / generate a random sequence.
shuffle Randomly permute a sequence in place.
seed Seed the random number generator.
==================== =========================================================
==================== =========================================================
Compatibility functions
==============================================================================
rand Uniformly distributed values.
randn Normally distributed values.
ranf Uniformly distributed floating point numbers.
randint Uniformly distributed integers in a given range.
==================== =========================================================
==================== =========================================================
Univariate distributions
==============================================================================
beta Beta distribution over ``[0, 1]``.
binomial Binomial distribution.
chisquare :math:`\chi^2` distribution.
exponential Exponential distribution.
f F (Fisher-Snedecor) distribution.
gamma Gamma distribution.
geometric Geometric distribution.
gumbel Gumbel distribution.
hypergeometric Hypergeometric distribution.
laplace Laplace distribution.
logistic Logistic distribution.
lognormal Log-normal distribution.
logseries Logarithmic series distribution.
negative_binomial Negative binomial distribution.
noncentral_chisquare Non-central chi-square distribution.
noncentral_f Non-central F distribution.
normal Normal / Gaussian distribution.
pareto Pareto distribution.
poisson Poisson distribution.
power Power distribution.
rayleigh Rayleigh distribution.
triangular Triangular distribution.
uniform Uniform distribution.
vonmises Von Mises circular distribution.
wald Wald (inverse Gaussian) distribution.
weibull Weibull distribution.
zipf Zipf's distribution over ranked data.
==================== =========================================================
==================== =========================================================
Multivariate distributions
==============================================================================
dirichlet Multivariate generalization of Beta distribution.
multinomial Multivariate generalization of the binomial distribution.
multivariate_normal Multivariate generalization of the normal distribution.
==================== =========================================================
==================== =========================================================
Standard distributions
==============================================================================
standard_cauchy Standard Cauchy-Lorentz distribution.
standard_exponential Standard exponential distribution.
standard_gamma Standard Gamma distribution.
standard_normal Standard normal distribution.
standard_t Standard Student's t-distribution.
==================== =========================================================
==================== =========================================================
Internal functions
==============================================================================
get_state Get tuple representing internal state of generator.
set_state Set state of generator.
==================== =========================================================
----------------------------------------------------------------------------
*** Found in numpy.random ***
random_sample(size=None)
Return random floats in the half-open interval [0.0, 1.0).
Results are from the "continuous uniform" distribution over the
stated interval. To sample :math:`Unif[a, b), b > a` multiply
the output of `random_sample` by `(b-a)` and add `a`::
(b - a) * random_sample() + a
Parameters
----------
size : int or tuple of ints, optional
Defines the shape of the returned array of random floats. If None
(the default), returns a single float.
Returns
-------
out : float or ndarray of floats
Array of random floats of shape `size` (unless ``size=None``, in which
case a single float is returned).
Examples
--------
>>> np.random.random_sample()
0.47108547995356098
>>> type(np.random.random_sample())
<type 'float'>
>>> np.random.random_sample((5,))
array([ 0.30220482, 0.86820401, 0.1654503 , 0.11659149, 0.54323428])
Three-by-two array of random numbers from [-5, 0):
>>> 5 * np.random.random_sample((3, 2)) - 5
array([[-3.99149989, -0.52338984],
[-2.99091858, -0.79479508],
[-1.23204345, -1.75224494]])
----------------------------------------------------------------------------
*** Total of 2 references found. ***
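The seed utility listed above deserves a special mention: fixing the generator state makes stochastic results reproducible, which matters whenever a numerical experiment must be repeated exactly. A minimal sketch, independent of the book's data:
numpy.random.seed(42)                      # fix the generator state
first = numpy.random.random_sample(3)
numpy.random.seed(42)                      # reset to the same state
second = numpy.random.random_sample(3)
print(numpy.array_equal(first, second))    # prints True: identical draws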
In [17]:
help(numpy.random)
Help on package numpy.random in numpy:
NAME
numpy.random
DESCRIPTION
========================
Random Number Generation
========================
==================== =========================================================
Utility functions
==============================================================================
random Uniformly distributed values of a given shape.
bytes Uniformly distributed random bytes.
random_integers Uniformly distributed integers in a given range.
random_sample Uniformly distributed floats in a given range.
permutation Randomly permute a sequence / generate a random sequence.
shuffle Randomly permute a sequence in place.
seed Seed the random number generator.
==================== =========================================================
==================== =========================================================
Compatibility functions
==============================================================================
rand Uniformly distributed values.
randn Normally distributed values.
ranf Uniformly distributed floating point numbers.
randint Uniformly distributed integers in a given range.
==================== =========================================================
==================== =========================================================
Univariate distributions
==============================================================================
beta Beta distribution over ``[0, 1]``.
binomial Binomial distribution.
chisquare :math:`\chi^2` distribution.
exponential Exponential distribution.
f F (Fisher-Snedecor) distribution.
gamma Gamma distribution.
geometric Geometric distribution.
gumbel Gumbel distribution.
hypergeometric Hypergeometric distribution.
laplace Laplace distribution.
logistic Logistic distribution.
lognormal Log-normal distribution.
logseries Logarithmic series distribution.
negative_binomial Negative binomial distribution.
noncentral_chisquare Non-central chi-square distribution.
noncentral_f Non-central F distribution.
normal Normal / Gaussian distribution.
pareto Pareto distribution.
poisson Poisson distribution.
power Power distribution.
rayleigh Rayleigh distribution.
triangular Triangular distribution.
uniform Uniform distribution.
vonmises Von Mises circular distribution.
wald Wald (inverse Gaussian) distribution.
weibull Weibull distribution.
zipf Zipf's distribution over ranked data.
==================== =========================================================
==================== =========================================================
Multivariate distributions
==============================================================================
dirichlet Multivariate generalization of Beta distribution.
multinomial Multivariate generalization of the binomial distribution.
multivariate_normal Multivariate generalization of the normal distribution.
==================== =========================================================
==================== =========================================================
Standard distributions
==============================================================================
standard_cauchy Standard Cauchy-Lorentz distribution.
standard_exponential Standard exponential distribution.
standard_gamma Standard Gamma distribution.
standard_normal Standard normal distribution.
standard_t Standard Student's t-distribution.
==================== =========================================================
==================== =========================================================
Internal functions
==============================================================================
get_state Get tuple representing internal state of generator.
set_state Set state of generator.
==================== =========================================================
PACKAGE CONTENTS
info
mtrand
setup
FUNCTIONS
beta(...) method of mtrand.RandomState instance
beta(a, b, size=None)
The Beta distribution over ``[0, 1]``.
The Beta distribution is a special case of the Dirichlet distribution,
and is related to the Gamma distribution. It has the probability
distribution function
.. math:: f(x; a,b) = \frac{1}{B(\alpha, \beta)} x^{\alpha - 1}
(1 - x)^{\beta - 1},
where the normalisation, B, is the beta function,
.. math:: B(\alpha, \beta) = \int_0^1 t^{\alpha - 1}
(1 - t)^{\beta - 1} dt.
It is often seen in Bayesian inference and order statistics.
Parameters
----------
a : float
Alpha, non-negative.
b : float
Beta, non-negative.
size : tuple of ints, optional
The number of samples to draw. The output is packed according to
the size given.
Returns
-------
out : ndarray
Array of the given shape, containing values drawn from a
Beta distribution.
binomial(...) method of mtrand.RandomState instance
binomial(n, p, size=None)
Draw samples from a binomial distribution.
Samples are drawn from a Binomial distribution with specified
parameters, n trials and p probability of success where
n an integer >= 0 and p is in the interval [0,1]. (n may be
input as a float, but it is truncated to an integer in use)
Parameters
----------
n : float (but truncated to an integer)
parameter, >= 0.
p : float
parameter, >= 0 and <=1.
size : {tuple, int}
Output shape. If the given shape is, e.g., ``(m, n, k)``, then
``m * n * k`` samples are drawn.
Returns
-------
samples : {ndarray, scalar}
where the values are all integers in [0, n].
See Also
--------
scipy.stats.distributions.binom : probability density function,
distribution or cumulative density function, etc.
Notes
-----
The probability density for the Binomial distribution is
.. math:: P(N) = \binom{n}{N}p^N(1-p)^{n-N},
where :math:`n` is the number of trials, :math:`p` is the probability
of success, and :math:`N` is the number of successes.
When estimating the standard error of a proportion in a population by
using a random sample, the normal distribution works well unless the
product p*n <=5, where p = population proportion estimate, and n =
number of samples, in which case the binomial distribution is used
instead. For example, a sample of 15 people shows 4 who are left
handed, and 11 who are right handed. Then p = 4/15 = 27%. 0.27*15 = 4,
so the binomial distribution should be used in this case.
References
----------
.. [1] Dalgaard, Peter, "Introductory Statistics with R",
Springer-Verlag, 2002.
.. [2] Glantz, Stanton A. "Primer of Biostatistics.", McGraw-Hill,
Fifth Edition, 2002.
.. [3] Lentner, Marvin, "Elementary Applied Statistics", Bogden
and Quigley, 1972.
.. [4] Weisstein, Eric W. "Binomial Distribution." From MathWorld--A
Wolfram Web Resource.
http://mathworld.wolfram.com/BinomialDistribution.html
.. [5] Wikipedia, "Binomial-distribution",
http://en.wikipedia.org/wiki/Binomial_distribution
Examples
--------
Draw samples from the distribution:
>>> n, p = 10, .5 # number of trials, probability of each trial
>>> s = np.random.binomial(n, p, 1000)
# result of flipping a coin 10 times, tested 1000 times.
A real world example. A company drills 9 wild-cat oil exploration
wells, each with an estimated probability of success of 0.1. All nine
wells fail. What is the probability of that happening?
Let's do 20,000 trials of the model, and count the number that
generate zero positive results.
>>> sum(np.random.binomial(9,0.1,20000)==0)/20000.
answer = 0.38885, or 38%.
bytes(...) method of mtrand.RandomState instance
bytes(length)
Return random bytes.
Parameters
----------
length : int
Number of random bytes.
Returns
-------
out : str
String of length `length`.
Examples
--------
>>> np.random.bytes(10)
' eh\x85\x022SZ\xbf\xa4' #random
chisquare(...) method of mtrand.RandomState instance
chisquare(df, size=None)
Draw samples from a chi-square distribution.
When `df` independent random variables, each with standard normal
distributions (mean 0, variance 1), are squared and summed, the
resulting distribution is chi-square (see Notes). This distribution
is often used in hypothesis testing.
Parameters
----------
df : int
Number of degrees of freedom.
size : tuple of ints, int, optional
Size of the returned array. By default, a scalar is
returned.
Returns
-------
output : ndarray
Samples drawn from the distribution, packed in a `size`-shaped
array.
Raises
------
ValueError
When `df` <= 0 or when an inappropriate `size` (e.g. ``size=-1``)
is given.
Notes
-----
The variable obtained by summing the squares of `df` independent,
standard normally distributed random variables:
.. math:: Q = \sum_{i=0}^{\mathtt{df}} X^2_i
is chi-square distributed, denoted
.. math:: Q \sim \chi^2_k.
The probability density function of the chi-squared distribution is
.. math:: p(x) = \frac{(1/2)^{k/2}}{\Gamma(k/2)}
x^{k/2 - 1} e^{-x/2},
where :math:`\Gamma` is the gamma function,
.. math:: \Gamma(x) = \int_0^{\infty} t^{x - 1} e^{-t} dt.
References
----------
`NIST/SEMATECH e-Handbook of Statistical Methods
<http://www.itl.nist.gov/div898/handbook/eda/section3/eda3666.htm>`_
Examples
--------
>>> np.random.chisquare(2,4)
array([ 1.89920014, 9.00867716, 3.13710533, 5.62318272])
exponential(...) method of mtrand.RandomState instance
exponential(scale=1.0, size=None)
Exponential distribution.
Its probability density function is
.. math:: f(x; \frac{1}{\beta}) = \frac{1}{\beta} \exp(-\frac{x}{\beta}),
for ``x > 0`` and 0 elsewhere. :math:`\beta` is the scale parameter,
which is the inverse of the rate parameter :math:`\lambda = 1/\beta`.
The rate parameter is an alternative, widely used parameterization
of the exponential distribution [3]_.
The exponential distribution is a continuous analogue of the
geometric distribution. It describes many common situations, such as
the size of raindrops measured over many rainstorms [1]_, or the time
between page requests to Wikipedia [2]_.
Parameters
----------
scale : float
The scale parameter, :math:`\beta = 1/\lambda`.
size : tuple of ints
Number of samples to draw. The output is shaped
according to `size`.
References
----------
.. [1] Peyton Z. Peebles Jr., "Probability, Random Variables and
Random Signal Principles", 4th ed, 2001, p. 57.
.. [2] "Poisson Process", Wikipedia,
http://en.wikipedia.org/wiki/Poisson_process
.. [3] "Exponential Distribution, Wikipedia,
http://en.wikipedia.org/wiki/Exponential_distribution
f(...) method of mtrand.RandomState instance
f(dfnum, dfden, size=None)
Draw samples from a F distribution.
Samples are drawn from an F distribution with specified parameters,
`dfnum` (degrees of freedom in numerator) and `dfden` (degrees of freedom
in denominator), where both parameters should be greater than zero.
The random variate of the F distribution (also known as the
Fisher distribution) is a continuous probability distribution
that arises in ANOVA tests, and is the ratio of two chi-square
variates.
Parameters
----------
dfnum : float
Degrees of freedom in numerator. Should be greater than zero.
dfden : float
Degrees of freedom in denominator. Should be greater than zero.
size : {tuple, int}, optional
Output shape. If the given shape is, e.g., ``(m, n, k)``,
then ``m * n * k`` samples are drawn. By default only one sample
is returned.
Returns
-------
samples : {ndarray, scalar}
Samples from the Fisher distribution.
See Also
--------
scipy.stats.distributions.f : probability density function,
distribution or cumulative density function, etc.
Notes
-----
The F statistic is used to compare in-group variances to between-group
variances. Calculating the distribution depends on the sampling, and
so it is a function of the respective degrees of freedom in the
problem. The variable `dfnum` is the number of samples minus one, the
between-groups degrees of freedom, while `dfden` is the within-groups
degrees of freedom, the sum of the number of samples in each group
minus the number of groups.
References
----------
.. [1] Glantz, Stanton A. "Primer of Biostatistics.", McGraw-Hill,
Fifth Edition, 2002.
.. [2] Wikipedia, "F-distribution",
http://en.wikipedia.org/wiki/F-distribution
Examples
--------
An example from Glantz[1], pp 47-40.
Two groups, children of diabetics (25 people) and children from people
without diabetes (25 controls). Fasting blood glucose was measured,
case group had a mean value of 86.1, controls had a mean value of
82.2. Standard deviations were 2.09 and 2.49 respectively. Are these
data consistent with the null hypothesis that the parents diabetic
status does not affect their children's blood glucose levels?
Calculating the F statistic from the data gives a value of 36.01.
Draw samples from the distribution:
>>> dfnum = 1. # between group degrees of freedom
>>> dfden = 48. # within groups degrees of freedom
>>> s = np.random.f(dfnum, dfden, 1000)
The lower bound for the top 1% of the samples is :
>>> sort(s)[-10]
7.61988120985
So there is about a 1% chance that the F statistic will exceed 7.62,
the measured value is 36, so the null hypothesis is rejected at the 1%
level.
gamma(...) method of mtrand.RandomState instance
gamma(shape, scale=1.0, size=None)
Draw samples from a Gamma distribution.
Samples are drawn from a Gamma distribution with specified parameters,
`shape` (sometimes designated "k") and `scale` (sometimes designated
"theta"), where both parameters are > 0.
Parameters
----------
shape : scalar > 0
The shape of the gamma distribution.
scale : scalar > 0, optional
The scale of the gamma distribution. Default is equal to 1.
size : shape_tuple, optional
Output shape. If the given shape is, e.g., ``(m, n, k)``, then
``m * n * k`` samples are drawn.
Returns
-------
out : ndarray, float
Returns one sample unless `size` parameter is specified.
See Also
--------
scipy.stats.distributions.gamma : probability density function,
distribution or cumulative density function, etc.
Notes
-----
The probability density for the Gamma distribution is
.. math:: p(x) = x^{k-1}\frac{e^{-x/\theta}}{\theta^k\Gamma(k)},
where :math:`k` is the shape and :math:`\theta` the scale,
and :math:`\Gamma` is the Gamma function.
The Gamma distribution is often used to model the times to failure of
electronic components, and arises naturally in processes for which the
waiting times between Poisson distributed events are relevant.
References
----------
.. [1] Weisstein, Eric W. "Gamma Distribution." From MathWorld--A
Wolfram Web Resource.
http://mathworld.wolfram.com/GammaDistribution.html
.. [2] Wikipedia, "Gamma-distribution",
http://en.wikipedia.org/wiki/Gamma-distribution
Examples
--------
Draw samples from the distribution:
>>> shape, scale = 2., 2. # mean and dispersion
>>> s = np.random.gamma(shape, scale, 1000)
Display the histogram of the samples, along with
the probability density function:
>>> import matplotlib.pyplot as plt
>>> import scipy.special as sps
>>> count, bins, ignored = plt.hist(s, 50, normed=True)
>>> y = bins**(shape-1)*(np.exp(-bins/scale) /
... (sps.gamma(shape)*scale**shape))
>>> plt.plot(bins, y, linewidth=2, color='r')
>>> plt.show()
geometric(...) method of mtrand.RandomState instance
geometric(p, size=None)
Draw samples from the geometric distribution.
Bernoulli trials are experiments with one of two outcomes:
success or failure (an example of such an experiment is flipping
a coin). The geometric distribution models the number of trials
that must be run in order to achieve success. It is therefore
supported on the positive integers, ``k = 1, 2, ...``.
The probability mass function of the geometric distribution is
.. math:: f(k) = (1 - p)^{k - 1} p
where `p` is the probability of success of an individual trial.
Parameters
----------
p : float
The probability of success of an individual trial.
size : tuple of ints
Number of values to draw from the distribution. The output
is shaped according to `size`.
Returns
-------
out : ndarray
Samples from the geometric distribution, shaped according to
`size`.
Examples
--------
Draw ten thousand values from the geometric distribution,
with the probability of an individual success equal to 0.35:
>>> z = np.random.geometric(p=0.35, size=10000)
How many trials succeeded after a single run?
>>> (z == 1).sum() / 10000.
0.34889999999999999 #random
get_state(...) method of mtrand.RandomState instance
get_state()
Return a tuple representing the internal state of the generator.
For more details, see `set_state`.
Returns
-------
out : tuple(str, ndarray of 624 uints, int, int, float)
The returned tuple has the following items:
1. the string 'MT19937'.
2. a 1-D array of 624 unsigned integer keys.
3. an integer ``pos``.
4. an integer ``has_gauss``.
5. a float ``cached_gaussian``.
See Also
--------
set_state
Notes
-----
`set_state` and `get_state` are not needed to work with any of the
random distributions in NumPy. If the internal state is manually altered,
the user should know exactly what he/she is doing.
gumbel(...) method of mtrand.RandomState instance
gumbel(loc=0.0, scale=1.0, size=None)
Gumbel distribution.
Draw samples from a Gumbel distribution with specified location and scale.
For more information on the Gumbel distribution, see Notes and References
below.
Parameters
----------
loc : float
The location of the mode of the distribution.
scale : float
The scale parameter of the distribution.
size : tuple of ints
Output shape. If the given shape is, e.g., ``(m, n, k)``, then
``m * n * k`` samples are drawn.
Returns
-------
out : ndarray
The samples
See Also
--------
scipy.stats.gumbel_l
scipy.stats.gumbel_r
scipy.stats.genextreme
probability density function, distribution, or cumulative density
function, etc. for each of the above
weibull
Notes
-----
The Gumbel (or Smallest Extreme Value (SEV) or the Smallest Extreme Value
Type I) distribution is one of a class of Generalized Extreme Value (GEV)
distributions used in modeling extreme value problems. The Gumbel is a
special case of the Extreme Value Type I distribution for maximums from
distributions with "exponential-like" tails.
The probability density for the Gumbel distribution is
.. math:: p(x) = \frac{e^{-(x - \mu)/ \beta}}{\beta} e^{ -e^{-(x - \mu)/
\beta}},
where :math:`\mu` is the mode, a location parameter, and :math:`\beta` is
the scale parameter.
The Gumbel (named for German mathematician Emil Julius Gumbel) was used
very early in the hydrology literature, for modeling the occurrence of
flood events. It is also used for modeling maximum wind speed and rainfall
rates. It is a "fat-tailed" distribution - the probability of an event in
the tail of the distribution is larger than if one used a Gaussian, hence
the surprisingly frequent occurrence of 100-year floods. Floods were
initially modeled as a Gaussian process, which underestimated the frequency
of extreme events.
It is one of a class of extreme value distributions, the Generalized
Extreme Value (GEV) distributions, which also includes the Weibull and
Frechet.
The function has a mean of :math:`\mu + 0.57721\beta` and a variance of
:math:`\frac{\pi^2}{6}\beta^2`.
References
----------
Gumbel, E. J., *Statistics of Extremes*, New York: Columbia University
Press, 1958.
Reiss, R.-D. and Thomas, M., *Statistical Analysis of Extreme Values from
Insurance, Finance, Hydrology and Other Fields*, Basel: Birkhauser Verlag,
2001.
Examples
--------
Draw samples from the distribution:
>>> mu, beta = 0, 0.1 # location and scale
>>> s = np.random.gumbel(mu, beta, 1000)
Display the histogram of the samples, along with
the probability density function:
>>> import matplotlib.pyplot as plt
>>> count, bins, ignored = plt.hist(s, 30, normed=True)
>>> plt.plot(bins, (1/beta)*np.exp(-(bins - mu)/beta)
... * np.exp( -np.exp( -(bins - mu) /beta) ),
... linewidth=2, color='r')
>>> plt.show()
Show how an extreme value distribution can arise from a Gaussian process
and compare to a Gaussian:
>>> means = []
>>> maxima = []
>>> for i in range(0,1000) :
... a = np.random.normal(mu, beta, 1000)
... means.append(a.mean())
... maxima.append(a.max())
>>> count, bins, ignored = plt.hist(maxima, 30, normed=True)
>>> beta = np.std(maxima)*np.pi/np.sqrt(6)
>>> mu = np.mean(maxima) - 0.57721*beta
>>> plt.plot(bins, (1/beta)*np.exp(-(bins - mu)/beta)
... * np.exp(-np.exp(-(bins - mu)/beta)),
... linewidth=2, color='r')
>>> plt.plot(bins, 1/(beta * np.sqrt(2 * np.pi))
... * np.exp(-(bins - mu)**2 / (2 * beta**2)),
... linewidth=2, color='g')
>>> plt.show()
hypergeometric(...) method of mtrand.RandomState instance
hypergeometric(ngood, nbad, nsample, size=None)
Draw samples from a Hypergeometric distribution.
Samples are drawn from a Hypergeometric distribution with specified
parameters, ngood (ways to make a good selection), nbad (ways to make
a bad selection), and nsample = number of items sampled, which is less
than or equal to the sum ngood + nbad.
Parameters
----------
ngood : int or array_like
Number of ways to make a good selection. Must be nonnegative.
nbad : int or array_like
Number of ways to make a bad selection. Must be nonnegative.
nsample : int or array_like
Number of items sampled. Must be at least 1 and at most
``ngood + nbad``.
size : int or tuple of int
Output shape. If the given shape is, e.g., ``(m, n, k)``, then
``m * n * k`` samples are drawn.
Returns
-------
samples : ndarray or scalar
The values are all integers in [0, n].
See Also
--------
scipy.stats.distributions.hypergeom : probability density function,
distribution or cumulative density function, etc.
Notes
-----
The probability density for the Hypergeometric distribution is
.. math:: P(x) = \frac{\binom{m}{n}\binom{N-m}{n-x}}{\binom{N}{n}},
where :math:`0 \le x \le m` and :math:`n+m-N \le x \le n`
for P(x) the probability of x successes, n = ngood, m = nbad, and
N = number of samples.
Consider an urn with black and white marbles in it, ngood of them
black and nbad are white. If you draw nsample balls without
replacement, then the Hypergeometric distribution describes the
distribution of black balls in the drawn sample.
Note that this distribution is very similar to the Binomial
distribution, except that in this case, samples are drawn without
replacement, whereas in the Binomial case samples are drawn with
replacement (or the sample space is infinite). As the sample space
becomes large, this distribution approaches the Binomial.
References
----------
.. [1] Lentner, Marvin, "Elementary Applied Statistics", Bogden
and Quigley, 1972.
.. [2] Weisstein, Eric W. "Hypergeometric Distribution." From
MathWorld--A Wolfram Web Resource.
http://mathworld.wolfram.com/HypergeometricDistribution.html
.. [3] Wikipedia, "Hypergeometric-distribution",
http://en.wikipedia.org/wiki/Hypergeometric-distribution
Examples
--------
Draw samples from the distribution:
>>> ngood, nbad, nsamp = 100, 2, 10
# number of good, number of bad, and number of samples
>>> s = np.random.hypergeometric(ngood, nbad, nsamp, 1000)
>>> hist(s)
# note that it is very unlikely to grab both bad items
Suppose you have an urn with 15 white and 15 black marbles.
If you pull 15 marbles at random, how likely is it that
12 or more of them are one color?
>>> s = np.random.hypergeometric(15, 15, 15, 100000)
>>> sum(s>=12)/100000. + sum(s<=3)/100000.
# answer = 0.003 ... pretty unlikely!
laplace(...) method of mtrand.RandomState instance
laplace(loc=0.0, scale=1.0, size=None)
Draw samples from the Laplace or double exponential distribution with
specified location (or mean) and scale (decay).
The Laplace distribution is similar to the Gaussian/normal distribution,
but is sharper at the peak and has fatter tails. It represents the
difference between two independent, identically distributed exponential
random variables.
Parameters
----------
loc : float
The position, :math:`\mu`, of the distribution peak.
scale : float
:math:`\lambda`, the exponential decay.
Notes
-----
It has the probability density function
.. math:: f(x; \mu, \lambda) = \frac{1}{2\lambda}
\exp\left(-\frac{|x - \mu|}{\lambda}\right).
The first law of Laplace, from 1774, states that the frequency of an error
can be expressed as an exponential function of the absolute magnitude of
the error, which leads to the Laplace distribution. For many problems in
Economics and Health sciences, this distribution seems to model the data
better than the standard Gaussian distribution
References
----------
.. [1] Abramowitz, M. and Stegun, I. A. (Eds.). Handbook of Mathematical
Functions with Formulas, Graphs, and Mathematical Tables, 9th
printing. New York: Dover, 1972.
.. [2] The Laplace distribution and generalizations
By Samuel Kotz, Tomasz J. Kozubowski, Krzysztof Podgorski,
Birkhauser, 2001.
.. [3] Weisstein, Eric W. "Laplace Distribution."
From MathWorld--A Wolfram Web Resource.
http://mathworld.wolfram.com/LaplaceDistribution.html
.. [4] Wikipedia, "Laplace distribution",
http://en.wikipedia.org/wiki/Laplace_distribution
Examples
--------
Draw samples from the distribution
>>> loc, scale = 0., 1.
>>> s = np.random.laplace(loc, scale, 1000)
Display the histogram of the samples, along with
the probability density function:
>>> import matplotlib.pyplot as plt
>>> count, bins, ignored = plt.hist(s, 30, normed=True)
>>> x = np.arange(-8., 8., .01)
>>> pdf = np.exp(-abs(x-loc/scale))/(2.*scale)
>>> plt.plot(x, pdf)
Plot Gaussian for comparison:
>>> g = (1/(scale * np.sqrt(2 * np.pi)) *
... np.exp( - (x - loc)**2 / (2 * scale**2) ))
>>> plt.plot(x,g)
logistic(...) method of mtrand.RandomState instance
logistic(loc=0.0, scale=1.0, size=None)
Draw samples from a Logistic distribution.
Samples are drawn from a Logistic distribution with specified
parameters, loc (location or mean, also median), and scale (>0).
Parameters
----------
loc : float
scale : float > 0.
size : {tuple, int}
Output shape. If the given shape is, e.g., ``(m, n, k)``, then
``m * n * k`` samples are drawn.
Returns
-------
samples : {ndarray, scalar}
where the values are all integers in [0, n].
See Also
--------
scipy.stats.distributions.logistic : probability density function,
distribution or cumulative density function, etc.
Notes
-----
The probability density for the Logistic distribution is
.. math:: P(x) = \frac{e^{-(x-\mu)/s}}{s(1+e^{-(x-\mu)/s})^2},
where :math:`\mu` = location and :math:`s` = scale.
The Logistic distribution is used in Extreme Value problems where it
can act as a mixture of Gumbel distributions, in Epidemiology, and by
the World Chess Federation (FIDE) where it is used in the Elo ranking
system, assuming the performance of each player is a logistically
distributed random variable.
References
----------
.. [1] Reiss, R.-D. and Thomas M. (2001), Statistical Analysis of Extreme
Values, from Insurance, Finance, Hydrology and Other Fields,
Birkhauser Verlag, Basel, pp 132-133.
.. [2] Weisstein, Eric W. "Logistic Distribution." From
MathWorld--A Wolfram Web Resource.
http://mathworld.wolfram.com/LogisticDistribution.html
.. [3] Wikipedia, "Logistic-distribution",
http://en.wikipedia.org/wiki/Logistic-distribution
Examples
--------
Draw samples from the distribution:
>>> loc, scale = 10, 1
>>> s = np.random.logistic(loc, scale, 10000)
>>> count, bins, ignored = plt.hist(s, bins=50)
# plot against distribution
>>> def logist(x, loc, scale):
... return exp((loc-x)/scale)/(scale*(1+exp((loc-x)/scale))**2)
>>> plt.plot(bins, logist(bins, loc, scale)*count.max()/\
... logist(bins, loc, scale).max())
>>> plt.show()
lognormal(...) method of mtrand.RandomState instance
lognormal(mean=0.0, sigma=1.0, size=None)
Return samples drawn from a log-normal distribution.
Draw samples from a log-normal distribution with specified mean,
standard deviation, and array shape. Note that the mean and standard
deviation are not the values for the distribution itself, but of the
underlying normal distribution it is derived from.
Parameters
----------
mean : float
Mean value of the underlying normal distribution
sigma : float, > 0.
Standard deviation of the underlying normal distribution
size : tuple of ints
Output shape. If the given shape is, e.g., ``(m, n, k)``, then
``m * n * k`` samples are drawn.
Returns
-------
samples : ndarray or float
The desired samples. An array of the same shape as `size` if given,
if `size` is None a float is returned.
See Also
--------
scipy.stats.lognorm : probability density function, distribution,
cumulative density function, etc.
Notes
-----
A variable `x` has a log-normal distribution if `log(x)` is normally
distributed. The probability density function for the log-normal
distribution is:
.. math:: p(x) = \frac{1}{\sigma x \sqrt{2\pi}}
e^{(-\frac{(ln(x)-\mu)^2}{2\sigma^2})}
where :math:`\mu` is the mean and :math:`\sigma` is the standard
deviation of the normally distributed logarithm of the variable.
A log-normal distribution results if a random variable is the *product*
of a large number of independent, identically-distributed variables in
the same way that a normal distribution results if the variable is the
*sum* of a large number of independent, identically-distributed
variables.
References
----------
Limpert, E., Stahel, W. A., and Abbt, M., "Log-normal Distributions
across the Sciences: Keys and Clues," *BioScience*, Vol. 51, No. 5,
May, 2001. http://stat.ethz.ch/~stahel/lognormal/bioscience.pdf
Reiss, R.D. and Thomas, M., *Statistical Analysis of Extreme Values*,
Basel: Birkhauser Verlag, 2001, pp. 31-32.
Examples
--------
Draw samples from the distribution:
>>> mu, sigma = 3., 1. # mean and standard deviation
>>> s = np.random.lognormal(mu, sigma, 1000)
Display the histogram of the samples, along with
the probability density function:
>>> import matplotlib.pyplot as plt
>>> count, bins, ignored = plt.hist(s, 100, normed=True, align='mid')
>>> x = np.linspace(min(bins), max(bins), 10000)
>>> pdf = (np.exp(-(np.log(x) - mu)**2 / (2 * sigma**2))
... / (x * sigma * np.sqrt(2 * np.pi)))
>>> plt.plot(x, pdf, linewidth=2, color='r')
>>> plt.axis('tight')
>>> plt.show()
Demonstrate that taking the products of random samples from a uniform
distribution can be fit well by a log-normal probability density function.
>>> # Generate a thousand samples: each is the product of 100 random
>>> # values, drawn from a normal distribution.
>>> b = []
>>> for i in range(1000):
... a = 10. + np.random.random(100)
... b.append(np.product(a))
>>> b = np.array(b) / np.min(b) # scale values to be positive
>>> count, bins, ignored = plt.hist(b, 100, normed=True, align='center')
>>> sigma = np.std(np.log(b))
>>> mu = np.mean(np.log(b))
>>> x = np.linspace(min(bins), max(bins), 10000)
>>> pdf = (np.exp(-(np.log(x) - mu)**2 / (2 * sigma**2))
... / (x * sigma * np.sqrt(2 * np.pi)))
>>> plt.plot(x, pdf, color='r', linewidth=2)
>>> plt.show()
logseries(...) method of mtrand.RandomState instance
logseries(p, size=None)
Draw samples from a Logarithmic Series distribution.
Samples are drawn from a Log Series distribution with specified
parameter, p (probability, 0 < p < 1).
Parameters
----------
loc : float
scale : float > 0.
size : {tuple, int}
Output shape. If the given shape is, e.g., ``(m, n, k)``, then
``m * n * k`` samples are drawn.
Returns
-------
samples : {ndarray, scalar}
where the values are all integers in [0, n].
See Also
--------
scipy.stats.distributions.logser : probability density function,
distribution or cumulative density function, etc.
Notes
-----
The probability density for the Log Series distribution is
.. math:: P(k) = \frac{-p^k}{k \ln(1-p)},
where p = probability.
The Log Series distribution is frequently used to represent species
richness and occurrence, first proposed by Fisher, Corbet, and
Williams in 1943 [2]. It may also be used to model the numbers of
occupants seen in cars [3].
References
----------
.. [1] Buzas, Martin A.; Culver, Stephen J., Understanding regional
species diversity through the log series distribution of
occurrences: BIODIVERSITY RESEARCH Diversity & Distributions,
Volume 5, Number 5, September 1999 , pp. 187-195(9).
.. [2] Fisher, R.A., A.S. Corbet, and C.B. Williams. 1943. The
relation between the number of species and the number of
individuals in a random sample of an animal population.
Journal of Animal Ecology, 12:42-58.
.. [3] D. J. Hand, F. Daly, D. Lunn, E. Ostrowski, A Handbook of Small
Data Sets, CRC Press, 1994.
.. [4] Wikipedia, "Logarithmic-distribution",
http://en.wikipedia.org/wiki/Logarithmic-distribution
Examples
--------
Draw samples from the distribution:
>>> a = .6
>>> s = np.random.logseries(a, 10000)
>>> count, bins, ignored = plt.hist(s)
# plot against distribution
>>> def logseries(k, p):
... return -p**k/(k*log(1-p))
>>> plt.plot(bins, logseries(bins, a)*count.max()/
logseries(bins, a).max(), 'r')
>>> plt.show()
multinomial(...) method of mtrand.RandomState instance
multinomial(n, pvals, size=None)
Draw samples from a multinomial distribution.
The multinomial distribution is a multivariate generalisation of the
binomial distribution. Take an experiment with one of ``p``
possible outcomes. An example of such an experiment is throwing a dice,
where the outcome can be 1 through 6. Each sample drawn from the
distribution represents `n` such experiments. Its values,
``X_i = [X_0, X_1, ..., X_p]``, represent the number of times the outcome
was ``i``.
Parameters
----------
n : int
Number of experiments.
pvals : sequence of floats, length p
Probabilities of each of the ``p`` different outcomes. These
should sum to 1 (however, the last element is always assumed to
account for the remaining probability, as long as
``sum(pvals[:-1]) <= 1)``.
size : tuple of ints
Given a `size` of ``(M, N, K)``, then ``M*N*K`` samples are drawn,
and the output shape becomes ``(M, N, K, p)``, since each sample
has shape ``(p,)``.
Examples
--------
Throw a dice 20 times:
>>> np.random.multinomial(20, [1/6.]*6, size=1)
array([[4, 1, 7, 5, 2, 1]])
It landed 4 times on 1, once on 2, etc.
Now, throw the dice 20 times, and 20 times again:
>>> np.random.multinomial(20, [1/6.]*6, size=2)
array([[3, 4, 3, 3, 4, 3],
[2, 4, 3, 4, 0, 7]])
For the first run, we threw 3 times 1, 4 times 2, etc. For the second,
we threw 2 times 1, 4 times 2, etc.
A loaded dice is more likely to land on number 6:
>>> np.random.multinomial(100, [1/7.]*5)
array([13, 16, 13, 16, 42])
multivariate_normal(...) method of mtrand.RandomState instance
multivariate_normal(mean, cov[, size])
Draw random samples from a multivariate normal distribution.
The multivariate normal, multinormal or Gaussian distribution is a
generalization of the one-dimensional normal distribution to higher
dimensions. Such a distribution is specified by its mean and
covariance matrix. These parameters are analogous to the mean
(average or "center") and variance (standard deviation, or "width,"
squared) of the one-dimensional normal distribution.
Parameters
----------
mean : 1-D array_like, of length N
Mean of the N-dimensional distribution.
cov : 2-D array_like, of shape (N, N)
Covariance matrix of the distribution. Must be symmetric and
positive semi-definite for "physically meaningful" results.
size : int or tuple of ints, optional
Given a shape of, for example, ``(m,n,k)``, ``m*n*k`` samples are
generated, and packed in an `m`-by-`n`-by-`k` arrangement. Because
each sample is `N`-dimensional, the output shape is ``(m,n,k,N)``.
If no shape is specified, a single (`N`-D) sample is returned.
Returns
-------
out : ndarray
The drawn samples, of shape *size*, if that was provided. If not,
the shape is ``(N,)``.
In other words, each entry ``out[i,j,...,:]`` is an N-dimensional
value drawn from the distribution.
Notes
-----
The mean is a coordinate in N-dimensional space, which represents the
location where samples are most likely to be generated. This is
analogous to the peak of the bell curve for the one-dimensional or
univariate normal distribution.
Covariance indicates the level to which two variables vary together.
From the multivariate normal distribution, we draw N-dimensional
samples, :math:`X = [x_1, x_2, ... x_N]`. The covariance matrix
element :math:`C_{ij}` is the covariance of :math:`x_i` and :math:`x_j`.
The element :math:`C_{ii}` is the variance of :math:`x_i` (i.e. its
"spread").
Instead of specifying the full covariance matrix, popular
approximations include:
- Spherical covariance (*cov* is a multiple of the identity matrix)
- Diagonal covariance (*cov* has non-negative elements, and only on
the diagonal)
This geometrical property can be seen in two dimensions by plotting
generated data-points:
>>> mean = [0,0]
>>> cov = [[1,0],[0,100]] # diagonal covariance, points lie on x or y-axis
>>> import matplotlib.pyplot as plt
>>> x,y = np.random.multivariate_normal(mean,cov,5000).T
>>> plt.plot(x,y,'x'); plt.axis('equal'); plt.show()
Note that the covariance matrix must be non-negative definite.
References
----------
Papoulis, A., *Probability, Random Variables, and Stochastic Processes*,
3rd ed., New York: McGraw-Hill, 1991.
Duda, R. O., Hart, P. E., and Stork, D. G., *Pattern Classification*,
2nd ed., New York: Wiley, 2001.
Examples
--------
>>> mean = (1,2)
>>> cov = [[1,0],[1,0]]
>>> x = np.random.multivariate_normal(mean,cov,(3,3))
>>> x.shape
(3, 3, 2)
The following is probably true, given that 0.6 is roughly twice the
standard deviation:
>>> print list( (x[0,0,:] - mean) < 0.6 )
[True, True]
negative_binomial(...) method of mtrand.RandomState instance
negative_binomial(n, p, size=None)
Draw samples from a negative_binomial distribution.
Samples are drawn from a negative_Binomial distribution with specified
parameters, `n` trials and `p` probability of success where `n` is an
integer > 0 and `p` is in the interval [0, 1].
Parameters
----------
n : int
Parameter, > 0.
p : float
Parameter, >= 0 and <=1.
size : int or tuple of ints
Output shape. If the given shape is, e.g., ``(m, n, k)``, then
``m * n * k`` samples are drawn.
Returns
-------
samples : int or ndarray of ints
Drawn samples.
Notes
-----
The probability density for the Negative Binomial distribution is
.. math:: P(N;n,p) = \binom{N+n-1}{n-1}p^{n}(1-p)^{N},
where :math:`n-1` is the number of successes, :math:`p` is the probability
of success, and :math:`N+n-1` is the number of trials.
The negative binomial distribution gives the probability of n-1 successes
and N failures in N+n-1 trials, and success on the (N+n)th trial.
If one throws a die repeatedly until the third time a "1" appears, then the
probability distribution of the number of non-"1"s that appear before the
third "1" is a negative binomial distribution.
References
----------
.. [1] Weisstein, Eric W. "Negative Binomial Distribution." From
MathWorld--A Wolfram Web Resource.
http://mathworld.wolfram.com/NegativeBinomialDistribution.html
.. [2] Wikipedia, "Negative binomial distribution",
http://en.wikipedia.org/wiki/Negative_binomial_distribution
Examples
--------
Draw samples from the distribution:
A real world example. A company drills wild-cat oil exploration wells, each
with an estimated probability of success of 0.1. What is the probability
of having one success for each successive well, that is what is the
probability of a single success after drilling 5 wells, after 6 wells,
etc.?
>>> s = np.random.negative_binomial(1, 0.1, 100000)
>>> for i in range(1, 11):
... probability = sum(s<i) / 100000.
... print i, "wells drilled, probability of one success =", probability
noncentral_chisquare(...) method of mtrand.RandomState instance
noncentral_chisquare(df, nonc, size=None)
Draw samples from a noncentral chi-square distribution.
The noncentral :math:`\chi^2` distribution is a generalisation of
the :math:`\chi^2` distribution.
Parameters
----------
df : int
Degrees of freedom, should be >= 1.
nonc : float
Non-centrality, should be > 0.
size : int or tuple of ints
Shape of the output.
Notes
-----
The probability density function for the noncentral Chi-square distribution
is
.. math:: P(x;df,nonc) = \sum^{\infty}_{i=0}
\frac{e^{-nonc/2}(nonc/2)^{i}}{i!}P_{Y_{df+2i}}(x),
where :math:`Y_{q}` is the Chi-square with q degrees of freedom.
In Delhi (2007), it is noted that the noncentral chi-square is useful in
bombing and coverage problems, the probability of killing the point target
given by the noncentral chi-squared distribution.
References
----------
.. [1] Delhi, M.S. Holla, "On a noncentral chi-square distribution in the
analysis of weapon systems effectiveness", Metrika, Volume 15,
Number 1 / December, 1970.
.. [2] Wikipedia, "Noncentral chi-square distribution"
http://en.wikipedia.org/wiki/Noncentral_chi-square_distribution
Examples
--------
Draw values from the distribution and plot the histogram
>>> import matplotlib.pyplot as plt
>>> values = plt.hist(np.random.noncentral_chisquare(3, 20, 100000),
... bins=200, normed=True)
>>> plt.show()
Draw values from a noncentral chisquare with very small noncentrality,
and compare to a chisquare.
>>> plt.figure()
>>> values = plt.hist(np.random.noncentral_chisquare(3, .0000001, 100000),
... bins=np.arange(0., 25, .1), normed=True)
>>> values2 = plt.hist(np.random.chisquare(3, 100000),
... bins=np.arange(0., 25, .1), normed=True)
>>> plt.plot(values[1][0:-1], values[0]-values2[0], 'ob')
>>> plt.show()
Demonstrate how large values of non-centrality lead to a more symmetric
distribution.
>>> plt.figure()
>>> values = plt.hist(np.random.noncentral_chisquare(3, 20, 100000),
... bins=200, normed=True)
>>> plt.show()
noncentral_f(...) method of mtrand.RandomState instance
noncentral_f(dfnum, dfden, nonc, size=None)
Draw samples from the noncentral F distribution.
Samples are drawn from an F distribution with specified parameters,
`dfnum` (degrees of freedom in numerator) and `dfden` (degrees of
freedom in denominator), where both parameters > 1.
`nonc` is the non-centrality parameter.
Parameters
----------
dfnum : int
Parameter, should be > 1.
dfden : int
Parameter, should be > 1.
nonc : float
Parameter, should be >= 0.
size : int or tuple of ints
Output shape. If the given shape is, e.g., ``(m, n, k)``, then
``m * n * k`` samples are drawn.
Returns
-------
samples : scalar or ndarray
Drawn samples.
Notes
-----
When calculating the power of an experiment (power = probability of
rejecting the null hypothesis when a specific alternative is true) the
non-central F statistic becomes important. When the null hypothesis is
true, the F statistic follows a central F distribution. When the null
hypothesis is not true, then it follows a non-central F statistic.
References
----------
Weisstein, Eric W. "Noncentral F-Distribution." From MathWorld--A Wolfram
Web Resource. http://mathworld.wolfram.com/NoncentralF-Distribution.html
Wikipedia, "Noncentral F distribution",
http://en.wikipedia.org/wiki/Noncentral_F-distribution
Examples
--------
In a study, testing for a specific alternative to the null hypothesis
requires use of the Noncentral F distribution. We need to calculate the
area in the tail of the distribution that exceeds the value of the F
distribution for the null hypothesis. We'll plot the two probability
distributions for comparison.
>>> dfnum = 3 # between group deg of freedom
>>> dfden = 20 # within groups degrees of freedom
>>> nonc = 3.0
>>> nc_vals = np.random.noncentral_f(dfnum, dfden, nonc, 1000000)
>>> NF = np.histogram(nc_vals, bins=50, normed=True)
>>> c_vals = np.random.f(dfnum, dfden, 1000000)
>>> F = np.histogram(c_vals, bins=50, normed=True)
>>> plt.plot(F[1][1:], F[0])
>>> plt.plot(NF[1][1:], NF[0])
>>> plt.show()
normal(...) method of mtrand.RandomState instance
normal(loc=0.0, scale=1.0, size=None)
Draw random samples from a normal (Gaussian) distribution.
The probability density function of the normal distribution, first
derived by De Moivre and 200 years later by both Gauss and Laplace
independently [2]_, is often called the bell curve because of
its characteristic shape (see the example below).
The normal distributions occurs often in nature. For example, it
describes the commonly occurring distribution of samples influenced
by a large number of tiny, random disturbances, each with its own
unique distribution [2]_.
Parameters
----------
loc : float
Mean ("centre") of the distribution.
scale : float
Standard deviation (spread or "width") of the distribution.
size : tuple of ints
Output shape. If the given shape is, e.g., ``(m, n, k)``, then
``m * n * k`` samples are drawn.
See Also
--------
scipy.stats.distributions.norm : probability density function,
distribution or cumulative density function, etc.
Notes
-----
The probability density for the Gaussian distribution is
.. math:: p(x) = \frac{1}{\sqrt{ 2 \pi \sigma^2 }}
e^{ - \frac{ (x - \mu)^2 } {2 \sigma^2} },
where :math:`\mu` is the mean and :math:`\sigma` the standard deviation.
The square of the standard deviation, :math:`\sigma^2`, is called the
variance.
The function has its peak at the mean, and its "spread" increases with
the standard deviation (the function reaches 0.607 times its maximum at
:math:`x + \sigma` and :math:`x - \sigma` [2]_). This implies that
`numpy.random.normal` is more likely to return samples lying close to the
mean, rather than those far away.
References
----------
.. [1] Wikipedia, "Normal distribution",
http://en.wikipedia.org/wiki/Normal_distribution
.. [2] P. R. Peebles Jr., "Central Limit Theorem" in "Probability, Random
Variables and Random Signal Principles", 4th ed., 2001,
pp. 51, 51, 125.
Examples
--------
Draw samples from the distribution:
>>> mu, sigma = 0, 0.1 # mean and standard deviation
>>> s = np.random.normal(mu, sigma, 1000)
Verify the mean and the variance:
>>> abs(mu - np.mean(s)) < 0.01
True
>>> abs(sigma - np.std(s, ddof=1)) < 0.01
True
Display the histogram of the samples, along with
the probability density function:
>>> import matplotlib.pyplot as plt
>>> count, bins, ignored = plt.hist(s, 30, normed=True)
>>> plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) *
... np.exp( - (bins - mu)**2 / (2 * sigma**2) ),
... linewidth=2, color='r')
>>> plt.show()
pareto(...) method of mtrand.RandomState instance
pareto(a, size=None)
Draw samples from a Pareto II or Lomax distribution with specified shape.
The Lomax or Pareto II distribution is a shifted Pareto distribution. The
classical Pareto distribution can be obtained from the Lomax distribution
by adding the location parameter m, see below. The smallest value of the
Lomax distribution is zero while for the classical Pareto distribution it
is m, where the standard Pareto distribution has location m=1.
Lomax can also be considered as a simplified version of the Generalized
Pareto distribution (available in SciPy), with the scale set to one and
the location set to zero.
Pareto-distributed values are greater than zero and unbounded above. The
distribution is also associated with the "80-20 rule": 80 percent of the
weight lies in the lowest 20 percent of the range, while the remaining 20
percent fills the other 80 percent of the range.
Parameters
----------
a : float, > 0.
Shape of the distribution.
size : tuple of ints
Output shape. If the given shape is, e.g., ``(m, n, k)``, then
``m * n * k`` samples are drawn.
See Also
--------
scipy.stats.distributions.lomax.pdf : probability density function,
distribution or cumulative density function, etc.
scipy.stats.distributions.genpareto.pdf : probability density function,
distribution or cumulative density function, etc.
Notes
-----
The probability density for the Pareto distribution is
.. math:: p(x) = \frac{am^a}{x^{a+1}}
where :math:`a` is the shape and :math:`m` the location
The Pareto distribution, named after the Italian economist Vilfredo Pareto,
is a power law probability distribution useful in many real world problems.
Outside the field of economics it is generally referred to as the Bradford
distribution. Pareto developed the distribution to describe the
distribution of wealth in an economy. It has also found use in insurance,
web page access statistics, oil field sizes, and many other problems,
including the download frequency for projects in Sourceforge [1]. It is
one of the so-called "fat-tailed" distributions.
References
----------
.. [1] Francis Hunt and Paul Johnson, On the Pareto Distribution of
Sourceforge projects.
.. [2] Pareto, V. (1896). Course of Political Economy. Lausanne.
.. [3] Reiss, R.D., Thomas, M.(2001), Statistical Analysis of Extreme
Values, Birkhauser Verlag, Basel, pp 23-30.
.. [4] Wikipedia, "Pareto distribution",
http://en.wikipedia.org/wiki/Pareto_distribution
Examples
--------
Draw samples from the distribution:
>>> a, m = 3., 1. # shape and mode
>>> s = np.random.pareto(a, 1000) + m
Display the histogram of the samples, along with
the probability density function:
>>> import matplotlib.pyplot as plt
>>> count, bins, ignored = plt.hist(s, 100, normed=True, align='center')
>>> fit = a*m**a/bins**(a+1)
>>> plt.plot(bins, max(count)*fit/max(fit),linewidth=2, color='r')
>>> plt.show()
permutation(...) method of mtrand.RandomState instance
permutation(x)
Randomly permute a sequence, or return a permuted range.
If `x` is a multi-dimensional array, it is only shuffled along its
first index.
Parameters
----------
x : int or array_like
If `x` is an integer, randomly permute ``np.arange(x)``.
If `x` is an array, make a copy and shuffle the elements
randomly.
Returns
-------
out : ndarray
Permuted sequence or array range.
Examples
--------
>>> np.random.permutation(10)
array([1, 7, 4, 3, 0, 9, 2, 5, 8, 6])
>>> np.random.permutation([1, 4, 9, 12, 15])
array([15, 1, 9, 4, 12])
>>> arr = np.arange(9).reshape((3, 3))
>>> np.random.permutation(arr)
array([[6, 7, 8],
[0, 1, 2],
[3, 4, 5]])
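An editorial note: a common use of permutation, not shown above, is shuffling two parallel arrays consistently by permuting a shared index array; a minimal sketch:
import numpy as np
x = np.arange(5)
y = x ** 2
idx = np.random.permutation(len(x))      # one random reordering of the indices
x_shuffled, y_shuffled = x[idx], y[idx]  # pairs (x[i], y[i]) stay matched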
poisson(...) method of mtrand.RandomState instance
poisson(lam=1.0, size=None)
Draw samples from a Poisson distribution.
The Poisson distribution is the limit of the Binomial
distribution for large N.
Parameters
----------
lam : float
Expectation of interval, should be >= 0.
size : int or tuple of ints, optional
Output shape. If the given shape is, e.g., ``(m, n, k)``, then
``m * n * k`` samples are drawn.
Notes
-----
The Poisson distribution
.. math:: f(k; \lambda)=\frac{\lambda^k e^{-\lambda}}{k!}
For events with an expected separation :math:`\lambda` the Poisson
distribution :math:`f(k; \lambda)` describes the probability of
:math:`k` events occurring within the observed interval :math:`\lambda`.
Because the output is limited to the range of the C long type, a
ValueError is raised when `lam` is within 10 sigma of the maximum
representable value.
References
----------
.. [1] Weisstein, Eric W. "Poisson Distribution." From MathWorld--A Wolfram
Web Resource. http://mathworld.wolfram.com/PoissonDistribution.html
.. [2] Wikipedia, "Poisson distribution",
http://en.wikipedia.org/wiki/Poisson_distribution
Examples
--------
Draw samples from the distribution:
>>> import numpy as np
>>> s = np.random.poisson(5, 10000)
Display histogram of the sample:
>>> import matplotlib.pyplot as plt
>>> count, bins, ignored = plt.hist(s, 14, normed=True)
>>> plt.show()
power(...) method of mtrand.RandomState instance
power(a, size=None)
Draws samples in [0, 1] from a power distribution with positive
exponent a - 1.
Also known as the power function distribution.
Parameters
----------
a : float
parameter, > 0
size : tuple of ints
Output shape. If the given shape is, e.g., ``(m, n, k)``, then
``m * n * k`` samples are drawn.
Returns
-------
samples : {ndarray, scalar}
The returned samples lie in [0, 1].
Raises
------
ValueError
If a<1.
Notes
-----
The probability density function is
.. math:: P(x; a) = ax^{a-1}, 0 \le x \le 1, a>0.
The power function distribution is just the inverse of the Pareto
distribution. It may also be seen as a special case of the Beta
distribution.
It is used, for example, in modeling the over-reporting of insurance
claims.
References
----------
.. [1] Christian Kleiber, Samuel Kotz, "Statistical size distributions
in economics and actuarial sciences", Wiley, 2003.
.. [2] Heckert, N. A. and Filliben, James J. (2003). NIST Handbook 148:
Dataplot Reference Manual, Volume 2: Let Subcommands and Library
Functions", National Institute of Standards and Technology Handbook
Series, June 2003.
http://www.itl.nist.gov/div898/software/dataplot/refman2/auxillar/powpdf.pdf
Examples
--------
Draw samples from the distribution:
>>> a = 5. # shape
>>> samples = 1000
>>> s = np.random.power(a, samples)
Display the histogram of the samples, along with
the probability density function:
>>> import matplotlib.pyplot as plt
>>> count, bins, ignored = plt.hist(s, bins=30)
>>> x = np.linspace(0, 1, 100)
>>> y = a*x**(a-1.)
>>> normed_y = samples*np.diff(bins)[0]*y
>>> plt.plot(x, normed_y)
>>> plt.show()
Compare the power function distribution to the inverse of the Pareto.
>>> from scipy import stats
>>> rvs = np.random.power(5, 1000000)
>>> rvsp = np.random.pareto(5, 1000000)
>>> xx = np.linspace(0,1,100)
>>> powpdf = stats.powerlaw.pdf(xx,5)
>>> plt.figure()
>>> plt.hist(rvs, bins=50, normed=True)
>>> plt.plot(xx,powpdf,'r-')
>>> plt.title('np.random.power(5)')
>>> plt.figure()
>>> plt.hist(1./(1.+rvsp), bins=50, normed=True)
>>> plt.plot(xx,powpdf,'r-')
>>> plt.title('inverse of 1 + np.random.pareto(5)')
rand(...) method of mtrand.RandomState instance
rand(d0, d1, ..., dn)
Random values in a given shape.
Create an array of the given shape and populate it with
random samples from a uniform distribution
over ``[0, 1)``.
Parameters
----------
d0, d1, ..., dn : int, optional
The dimensions of the returned array, should all be positive.
If no argument is given a single Python float is returned.
Returns
-------
out : ndarray, shape ``(d0, d1, ..., dn)``
Random values.
See Also
--------
random
Notes
-----
This is a convenience function. If you want an interface that
takes a shape-tuple as the first argument, refer to
np.random.random_sample .
Examples
--------
>>> np.random.rand(3,2)
array([[ 0.14022471, 0.96360618], #random
[ 0.37601032, 0.25528411], #random
[ 0.49313049, 0.94909878]]) #random
randint(...) method of mtrand.RandomState instance
randint(low, high=None, size=None)
Return random integers from `low` (inclusive) to `high` (exclusive).
Return random integers from the "discrete uniform" distribution in the
"half-open" interval [`low`, `high`). If `high` is None (the default),
then results are from [0, `low`).
Parameters
----------
low : int
Lowest (signed) integer to be drawn from the distribution (unless
``high=None``, in which case this parameter is the *highest* such
integer).
high : int, optional
If provided, one above the largest (signed) integer to be drawn
from the distribution (see above for behavior if ``high=None``).
size : int or tuple of ints, optional
Output shape. Default is None, in which case a single int is
returned.
Returns
-------
out : int or ndarray of ints
`size`-shaped array of random integers from the appropriate
distribution, or a single such random int if `size` not provided.
See Also
--------
random.random_integers : similar to `randint`, only for the closed
interval [`low`, `high`], and 1 is the lowest value if `high` is
omitted. In particular, `random_integers` is the one to use to generate
uniformly distributed, evenly spaced discrete values that need not be
integers (see its Notes for the scaling recipe).
Examples
--------
>>> np.random.randint(2, size=10)
array([1, 0, 0, 0, 1, 1, 0, 0, 1, 0])
>>> np.random.randint(1, size=10)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
Generate a 2 x 4 array of ints between 0 and 4, inclusive:
>>> np.random.randint(5, size=(2, 4))
array([[4, 0, 2, 1],
[3, 2, 2, 0]])
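As a quick editorial illustration of the half-open interval, drawing die rolls with randint requires high=7 so that 6 is included; a minimal sketch:
import numpy as np
rolls = np.random.randint(1, 7, size=1000)  # integers 1..6; 7 itself is excluded
counts = np.bincount(rolls)[1:]             # occurrences of each face 1..6
print(counts)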
randn(...) method of mtrand.RandomState instance
randn(d0, d1, ..., dn)
Return a sample (or samples) from the "standard normal" distribution.
If positive, int_like or int-convertible arguments are provided,
`randn` generates an array of shape ``(d0, d1, ..., dn)``, filled
with random floats sampled from a univariate "normal" (Gaussian)
distribution of mean 0 and variance 1 (if any of the :math:`d_i` are
floats, they are first converted to integers by truncation). A single
float randomly sampled from the distribution is returned if no
argument is provided.
This is a convenience function. If you want an interface that takes a
tuple as the first argument, use `numpy.random.standard_normal` instead.
Parameters
----------
d0, d1, ..., dn : int, optional
The dimensions of the returned array, should be all positive.
If no argument is given a single Python float is returned.
Returns
-------
Z : ndarray or float
A ``(d0, d1, ..., dn)``-shaped array of floating-point samples from
the standard normal distribution, or a single such float if
no parameters were supplied.
See Also
--------
random.standard_normal : Similar, but takes a tuple as its argument.
Notes
-----
For random samples from :math:`N(\mu, \sigma^2)`, use:
``sigma * np.random.randn(...) + mu``
Examples
--------
>>> np.random.randn()
2.1923875335537315 #random
Two-by-four array of samples from N(3, 6.25):
>>> 2.5 * np.random.randn(2, 4) + 3
array([[-4.49401501, 4.00950034, -1.81814867, 7.29718677], #random
[ 0.39924804, 4.68456316, 4.99394529, 4.84057254]]) #random
random = random_sample(...) method of mtrand.RandomState instance
random_sample(size=None)
Return random floats in the half-open interval [0.0, 1.0).
Results are from the "continuous uniform" distribution over the
stated interval. To sample :math:`Unif[a, b), b > a` multiply
the output of `random_sample` by `(b-a)` and add `a`::
(b - a) * random_sample() + a
Parameters
----------
size : int or tuple of ints, optional
Defines the shape of the returned array of random floats. If None
(the default), returns a single float.
Returns
-------
out : float or ndarray of floats
Array of random floats of shape `size` (unless ``size=None``, in which
case a single float is returned).
Examples
--------
>>> np.random.random_sample()
0.47108547995356098
>>> type(np.random.random_sample())
<class 'float'>
>>> np.random.random_sample((5,))
array([ 0.30220482, 0.86820401, 0.1654503 , 0.11659149, 0.54323428])
Three-by-two array of random numbers from [-5, 0):
>>> 5 * np.random.random_sample((3, 2)) - 5
array([[-3.99149989, -0.52338984],
[-2.99091858, -0.79479508],
[-1.23204345, -1.75224494]])
random_integers(...) method of mtrand.RandomState instance
random_integers(low, high=None, size=None)
Return random integers between `low` and `high`, inclusive.
Return random integers from the "discrete uniform" distribution in the
closed interval [`low`, `high`]. If `high` is None (the default),
then results are from [1, `low`].
Parameters
----------
low : int
Lowest (signed) integer to be drawn from the distribution (unless
``high=None``, in which case this parameter is the *highest* such
integer).
high : int, optional
If provided, the largest (signed) integer to be drawn from the
distribution (see above for behavior if ``high=None``).
size : int or tuple of ints, optional
Output shape. Default is None, in which case a single int is returned.
Returns
-------
out : int or ndarray of ints
`size`-shaped array of random integers from the appropriate
distribution, or a single such random int if `size` not provided.
See Also
--------
random.randint : Similar to `random_integers`, only for the half-open
interval [`low`, `high`), and 0 is the lowest value if `high` is
omitted.
Notes
-----
To sample from N evenly spaced floating-point numbers between a and b,
use::
a + (b - a) * (np.random.random_integers(N) - 1) / (N - 1.)
Examples
--------
>>> np.random.random_integers(5)
4
>>> type(np.random.random_integers(5))
<class 'int'>
>>> np.random.random_integers(5, size=(3, 2))
array([[5, 4],
[3, 3],
[4, 5]])
Choose five random numbers from the set of five evenly-spaced
numbers between 0 and 2.5, inclusive (*i.e.*, from the set
:math:`{0, 5/8, 10/8, 15/8, 20/8}`):
>>> 2.5 * (np.random.random_integers(5, size=(5,)) - 1) / 4.
array([ 0.625, 1.25 , 0.625, 0.625, 2.5 ])
Roll two six sided dice 1000 times and sum the results:
>>> d1 = np.random.random_integers(1, 6, 1000)
>>> d2 = np.random.random_integers(1, 6, 1000)
>>> dsums = d1 + d2
Display results as a histogram:
>>> import matplotlib.pyplot as plt
>>> count, bins, ignored = plt.hist(dsums, 11, normed=True)
>>> plt.show()
random_sample(...) method of mtrand.RandomState instance
random_sample(size=None)
Return random floats in the half-open interval [0.0, 1.0).
Results are from the "continuous uniform" distribution over the
stated interval. To sample :math:`Unif[a, b), b > a` multiply
the output of `random_sample` by `(b-a)` and add `a`::
(b - a) * random_sample() + a
Parameters
----------
size : int or tuple of ints, optional
Defines the shape of the returned array of random floats. If None
(the default), returns a single float.
Returns
-------
out : float or ndarray of floats
Array of random floats of shape `size` (unless ``size=None``, in which
case a single float is returned).
Examples
--------
>>> np.random.random_sample()
0.47108547995356098
>>> type(np.random.random_sample())
<class 'float'>
>>> np.random.random_sample((5,))
array([ 0.30220482, 0.86820401, 0.1654503 , 0.11659149, 0.54323428])
Three-by-two array of random numbers from [-5, 0):
>>> 5 * np.random.random_sample((3, 2)) - 5
array([[-3.99149989, -0.52338984],
[-2.99091858, -0.79479508],
[-1.23204345, -1.75224494]])
ranf = random_sample(...) method of mtrand.RandomState instance
random_sample(size=None)
Return random floats in the half-open interval [0.0, 1.0).
Results are from the "continuous uniform" distribution over the
stated interval. To sample :math:`Unif[a, b), b > a` multiply
the output of `random_sample` by `(b-a)` and add `a`::
(b - a) * random_sample() + a
Parameters
----------
size : int or tuple of ints, optional
Defines the shape of the returned array of random floats. If None
(the default), returns a single float.
Returns
-------
out : float or ndarray of floats
Array of random floats of shape `size` (unless ``size=None``, in which
case a single float is returned).
Examples
--------
>>> np.random.random_sample()
0.47108547995356098
>>> type(np.random.random_sample())
<class 'float'>
>>> np.random.random_sample((5,))
array([ 0.30220482, 0.86820401, 0.1654503 , 0.11659149, 0.54323428])
Three-by-two array of random numbers from [-5, 0):
>>> 5 * np.random.random_sample((3, 2)) - 5
array([[-3.99149989, -0.52338984],
[-2.99091858, -0.79479508],
[-1.23204345, -1.75224494]])
rayleigh(...) method of mtrand.RandomState instance
rayleigh(scale=1.0, size=None)
Draw samples from a Rayleigh distribution.
The :math:`\chi` and Weibull distributions are generalizations of the
Rayleigh.
Parameters
----------
scale : scalar
Scale, also equals the mode. Should be >= 0.
size : int or tuple of ints, optional
Shape of the output. Default is None, in which case a single
value is returned.
Notes
-----
The probability density function for the Rayleigh distribution is
.. math:: P(x;scale) = \frac{x}{scale^2}e^{\frac{-x^2}{2 \cdotp scale^2}}
The Rayleigh distribution arises, for example, when the orthogonal
components of a wind-velocity vector are independent, zero-mean Gaussian
variables; the wind speed (the vector's magnitude) then follows a Rayleigh
distribution. The Rayleigh distribution is used to model the expected
output from wind turbines.
References
----------
.. [1] Brighton Webs Ltd., Rayleigh Distribution,
http://www.brighton-webs.co.uk/distributions/rayleigh.asp
.. [2] Wikipedia, "Rayleigh distribution"
http://en.wikipedia.org/wiki/Rayleigh_distribution
Examples
--------
Draw values from the distribution and plot the histogram
>>> import matplotlib.pyplot as plt
>>> values = plt.hist(np.random.rayleigh(3, 100000), bins=200, normed=True)
Wave heights tend to follow a Rayleigh distribution. If the mean wave
height is 1 meter, what fraction of waves are likely to be larger than 3
meters?
>>> meanvalue = 1
>>> modevalue = np.sqrt(2 / np.pi) * meanvalue
>>> s = np.random.rayleigh(modevalue, 1000000)
The percentage of waves larger than 3 meters is:
>>> 100.*sum(s>3)/1000000.
0.087300000000000003
sample = random_sample(...) method of mtrand.RandomState instance
random_sample(size=None)
Return random floats in the half-open interval [0.0, 1.0).
Results are from the "continuous uniform" distribution over the
stated interval. To sample :math:`Unif[a, b), b > a` multiply
the output of `random_sample` by `(b-a)` and add `a`::
(b - a) * random_sample() + a
Parameters
----------
size : int or tuple of ints, optional
Defines the shape of the returned array of random floats. If None
(the default), returns a single float.
Returns
-------
out : float or ndarray of floats
Array of random floats of shape `size` (unless ``size=None``, in which
case a single float is returned).
Examples
--------
>>> np.random.random_sample()
0.47108547995356098
>>> type(np.random.random_sample())
<class 'float'>
>>> np.random.random_sample((5,))
array([ 0.30220482, 0.86820401, 0.1654503 , 0.11659149, 0.54323428])
Three-by-two array of random numbers from [-5, 0):
>>> 5 * np.random.random_sample((3, 2)) - 5
array([[-3.99149989, -0.52338984],
[-2.99091858, -0.79479508],
[-1.23204345, -1.75224494]])
seed(...) method of mtrand.RandomState instance
seed(seed=None)
Seed the generator.
This method is called when `RandomState` is initialized. It can be
called again to re-seed the generator. For details, see `RandomState`.
Parameters
----------
seed : int or array_like, optional
Seed for `RandomState`.
See Also
--------
RandomState
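An editorial sketch of why seeding matters: re-seeding the module-level generator with the same value replays the same sequence.
import numpy as np
np.random.seed(12345)
a = np.random.rand(3)
np.random.seed(12345)          # same seed again
b = np.random.rand(3)
print(np.array_equal(a, b))    # True: the sequences are identical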
set_state(...) method of mtrand.RandomState instance
set_state(state)
Set the internal state of the generator from a tuple.
For use if one has reason to manually (re-)set the internal state of the
"Mersenne Twister"[1]_ pseudo-random number generating algorithm.
Parameters
----------
state : tuple(str, ndarray of 624 uints, int, int, float)
The `state` tuple has the following items:
1. the string 'MT19937', specifying the Mersenne Twister algorithm.
2. a 1-D array of 624 unsigned integers ``keys``.
3. an integer ``pos``.
4. an integer ``has_gauss``.
5. a float ``cached_gaussian``.
Returns
-------
out : None
Returns 'None' on success.
See Also
--------
get_state
Notes
-----
`set_state` and `get_state` are not needed to work with any of the
random distributions in NumPy. If the internal state is manually altered,
the user should know exactly what he/she is doing.
For backwards compatibility, the form (str, array of 624 uints, int) is
also accepted although it is missing some information about the cached
Gaussian value: ``state = ('MT19937', keys, pos)``.
References
----------
.. [1] M. Matsumoto and T. Nishimura, "Mersenne Twister: A
623-dimensionally equidistributed uniform pseudorandom number
generator," *ACM Trans. on Modeling and Computer Simulation*,
Vol. 8, No. 1, pp. 3-30, Jan. 1998.
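Editorially, get_state and set_state can checkpoint the generator so a run of draws can be replayed exactly; a minimal sketch:
import numpy as np
state = np.random.get_state()   # snapshot of the Mersenne Twister state tuple
a = np.random.rand(4)
np.random.set_state(state)      # rewind to the snapshot
b = np.random.rand(4)
print(np.array_equal(a, b))     # True: same draws after restoring the state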
shuffle(...) method of mtrand.RandomState instance
shuffle(x)
Modify a sequence in-place by shuffling its contents.
Parameters
----------
x : array_like
The array or list to be shuffled.
Returns
-------
None
Examples
--------
>>> arr = np.arange(10)
>>> np.random.shuffle(arr)
>>> arr
[1 7 5 2 9 4 3 6 0 8]
This function only shuffles the array along the first index of a
multi-dimensional array:
>>> arr = np.arange(9).reshape((3, 3))
>>> np.random.shuffle(arr)
>>> arr
array([[3, 4, 5],
[6, 7, 8],
[0, 1, 2]])
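A short editorial contrast between shuffle and permutation: shuffle reorders its argument in place and returns None, while permutation leaves the input untouched and returns a shuffled copy.
import numpy as np
arr = np.arange(5)
copy = np.random.permutation(arr)  # arr is unchanged; copy is shuffled
np.random.shuffle(arr)             # arr itself is now reordered; returns None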
standard_cauchy(...) method of mtrand.RandomState instance
standard_cauchy(size=None)
Standard Cauchy distribution with mode = 0.
Also known as the Lorentz distribution.
Parameters
----------
size : int or tuple of ints
Shape of the output.
Returns
-------
samples : ndarray or scalar
The drawn samples.
Notes
-----
The probability density function for the full Cauchy distribution is
.. math:: P(x; x_0, \gamma) = \frac{1}{\pi \gamma \bigl[ 1+
(\frac{x-x_0}{\gamma})^2 \bigr] }
and the Standard Cauchy distribution just sets :math:`x_0=0` and
:math:`\gamma=1`
The Cauchy distribution arises in the solution to the driven harmonic
oscillator problem, and also describes spectral line broadening. It
also describes the distribution of values at which a line tilted at
a random angle will cut the x axis.
When studying hypothesis tests that assume normality, seeing how the
tests perform on data from a Cauchy distribution is a good indicator of
their sensitivity to a heavy-tailed distribution, since the Cauchy looks
very much like a Gaussian distribution, but with heavier tails.
References
----------
.. [1] NIST/SEMATECH e-Handbook of Statistical Methods, "Cauchy
Distribution",
http://www.itl.nist.gov/div898/handbook/eda/section3/eda3663.htm
.. [2] Weisstein, Eric W. "Cauchy Distribution." From MathWorld--A
Wolfram Web Resource.
http://mathworld.wolfram.com/CauchyDistribution.html
.. [3] Wikipedia, "Cauchy distribution"
http://en.wikipedia.org/wiki/Cauchy_distribution
Examples
--------
Draw samples and plot the distribution:
>>> s = np.random.standard_cauchy(1000000)
>>> s = s[(s>-25) & (s<25)] # truncate distribution so it plots well
>>> plt.hist(s, bins=100)
>>> plt.show()
standard_exponential(...) method of mtrand.RandomState instance
standard_exponential(size=None)
Draw samples from the standard exponential distribution.
`standard_exponential` is identical to the exponential distribution
with a scale parameter of 1.
Parameters
----------
size : int or tuple of ints
Shape of the output.
Returns
-------
out : float or ndarray
Drawn samples.
Examples
--------
Output a 3x8000 array:
>>> n = np.random.standard_exponential((3, 8000))
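Since standard_exponential fixes the scale at 1, a general exponential sample can be obtained by scaling; a minimal editorial sketch:
import numpy as np
scale = 2.5
s = scale * np.random.standard_exponential(10000)
# distributionally equivalent to np.random.exponential(scale, 10000)
print(s.mean())  # should be close to 'scale' (the exponential mean)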
standard_gamma(...) method of mtrand.RandomState instance
standard_gamma(shape, size=None)
Draw samples from a Standard Gamma distribution.
Samples are drawn from a Gamma distribution with specified parameters,
shape (sometimes designated "k") and scale=1.
Parameters
----------
shape : float
Parameter, should be > 0.
size : int or tuple of ints
Output shape. If the given shape is, e.g., ``(m, n, k)``, then
``m * n * k`` samples are drawn.
Returns
-------
samples : ndarray or scalar
The drawn samples.
See Also
--------
scipy.stats.distributions.gamma : probability density function,
distribution or cumulative density function, etc.
Notes
-----
The probability density for the Gamma distribution is
.. math:: p(x) = x^{k-1}\frac{e^{-x/\theta}}{\theta^k\Gamma(k)},
where :math:`k` is the shape and :math:`\theta` the scale,
and :math:`\Gamma` is the Gamma function.
The Gamma distribution is often used to model the times to failure of
electronic components, and arises naturally in processes for which the
waiting times between Poisson distributed events are relevant.
References
----------
.. [1] Weisstein, Eric W. "Gamma Distribution." From MathWorld--A
Wolfram Web Resource.
http://mathworld.wolfram.com/GammaDistribution.html
.. [2] Wikipedia, "Gamma-distribution",
http://en.wikipedia.org/wiki/Gamma-distribution
Examples
--------
Draw samples from the distribution:
>>> shape, scale = 2., 1. # mean and width
>>> s = np.random.standard_gamma(shape, 1000000)
Display the histogram of the samples, along with
the probability density function:
>>> import matplotlib.pyplot as plt
>>> import scipy.special as sps
>>> count, bins, ignored = plt.hist(s, 50, normed=True)
>>> y = bins**(shape-1) * ((np.exp(-bins/scale))/ \
... (sps.gamma(shape) * scale**shape))
>>> plt.plot(bins, y, linewidth=2, color='r')
>>> plt.show()
standard_normal(...) method of mtrand.RandomState instance
standard_normal(size=None)
Returns samples from a Standard Normal distribution (mean=0, stdev=1).
Parameters
----------
size : int or tuple of ints, optional
Output shape. Default is None, in which case a single value is
returned.
Returns
-------
out : float or ndarray
Drawn samples.
Examples
--------
>>> s = np.random.standard_normal(8000)
>>> s
array([ 0.6888893 , 0.78096262, -0.89086505, ..., 0.49876311, #random
-0.38672696, -0.4685006 ]) #random
>>> s.shape
(8000,)
>>> s = np.random.standard_normal(size=(3, 4, 2))
>>> s.shape
(3, 4, 2)
standard_t(...) method of mtrand.RandomState instance
standard_t(df, size=None)
Standard Student's t distribution with df degrees of freedom.
A special case of the hyperbolic distribution.
As `df` gets large, the result resembles that of the standard normal
distribution (`standard_normal`).
Parameters
----------
df : int
Degrees of freedom, should be > 0.
size : int or tuple of ints, optional
Output shape. Default is None, in which case a single value is
returned.
Returns
-------
samples : ndarray or scalar
Drawn samples.
Notes
-----
The probability density function for the t distribution is
.. math:: P(x, df) = \frac{\Gamma(\frac{df+1}{2})}{\sqrt{\pi df}
\Gamma(\frac{df}{2})}\Bigl( 1+\frac{x^2}{df} \Bigr)^{-(df+1)/2}
The t test is based on an assumption that the data come from a Normal
distribution. The t test provides a way to test whether the sample mean
(that is the mean calculated from the data) is a good estimate of the true
mean.
The derivation of the t-distribution was first published in 1908 by William
Gosset while working for the Guinness Brewery in Dublin. Due to proprietary
issues, he had to publish under a pseudonym, and so he used the name
Student.
References
----------
.. [1] Dalgaard, Peter, "Introductory Statistics With R",
Springer, 2002.
.. [2] Wikipedia, "Student's t-distribution"
http://en.wikipedia.org/wiki/Student's_t-distribution
Examples
--------
From Dalgaard page 83 [1]_, suppose the daily energy intake for 11
women in kJ is:
>>> intake = np.array([5260., 5470, 5640, 6180, 6390, 6515, 6805, 7515, \
... 7515, 8230, 8770])
Does their energy intake deviate systematically from the recommended
value of 7725 kJ?
We have 10 degrees of freedom; does the sample mean differ from the
recommended value at the 95% confidence level?
>>> s = np.random.standard_t(10, size=100000)
>>> np.mean(intake)
6753.636363636364
>>> intake.std(ddof=1)
1142.1232221373727
Calculate the t statistic, setting the ddof parameter to the unbiased
value so the divisor in the standard deviation will be degrees of
freedom, N-1.
>>> t = (np.mean(intake)-7725)/(intake.std(ddof=1)/np.sqrt(len(intake)))
>>> import matplotlib.pyplot as plt
>>> h = plt.hist(s, bins=100, normed=True)
For a one-sided t-test, how far out in the distribution does the t
statistic appear?
>>> np.sum(s<t) / float(len(s))
0.0090699999999999999 #random
So the p-value is about 0.009: if the null hypothesis were true, a t
statistic this extreme would occur only about 0.9% of the time, so the
data provide strong evidence against the null hypothesis.
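The simulated one-sided p-value above can be cross-checked against scipy.stats.ttest_1samp, which runs the same one-sample t test analytically (note that it reports a two-sided p-value); a minimal editorial sketch:
import numpy as np
from scipy import stats
intake = np.array([5260., 5470, 5640, 6180, 6390, 6515, 6805, 7515,
                   7515, 8230, 8770])
t, p_two_sided = stats.ttest_1samp(intake, 7725)
print(t, p_two_sided / 2)  # halved for a one-sided test; roughly 0.009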
triangular(...) method of mtrand.RandomState instance
triangular(left, mode, right, size=None)
Draw samples from the triangular distribution.
The triangular distribution is a continuous probability distribution with
lower limit left, peak at mode, and upper limit right. Unlike the other
distributions, these parameters directly define the shape of the pdf.
Parameters
----------
left : scalar
Lower limit.
mode : scalar
The value where the peak of the distribution occurs.
The value should fulfill the condition ``left <= mode <= right``.
right : scalar
Upper limit, should be larger than `left`.
size : int or tuple of ints, optional
Output shape. Default is None, in which case a single value is
returned.
Returns
-------
samples : ndarray or scalar
The returned samples all lie in the interval [left, right].
Notes
-----
The probability density function for the Triangular distribution is
.. math:: P(x;l, m, r) = \begin{cases}
\frac{2(x-l)}{(r-l)(m-l)}& \text{for $l \leq x \leq m$},\\
\frac{2(m-x)}{(r-l)(r-m)}& \text{for $m \leq x \leq r$},\\
0& \text{otherwise}.
\end{cases}
The triangular distribution is often used in ill-defined problems where the
underlying distribution is not known, but some knowledge of the limits and
mode exists. Often it is used in simulations.
References
----------
.. [1] Wikipedia, "Triangular distribution"
http://en.wikipedia.org/wiki/Triangular_distribution
Examples
--------
Draw values from the distribution and plot the histogram:
>>> import matplotlib.pyplot as plt
>>> h = plt.hist(np.random.triangular(-3, 0, 8, 100000), bins=200,
... normed=True)
>>> plt.show()
uniform(...) method of mtrand.RandomState instance
uniform(low=0.0, high=1.0, size=1)
Draw samples from a uniform distribution.
Samples are uniformly distributed over the half-open interval
``[low, high)`` (includes low, but excludes high). In other words,
any value within the given interval is equally likely to be drawn
by `uniform`.
Parameters
----------
low : float, optional
Lower boundary of the output interval. All values generated will be
greater than or equal to low. The default value is 0.
high : float
Upper boundary of the output interval. All values generated will be
less than high. The default value is 1.0.
size : int or tuple of ints, optional
Shape of output. If the given size is, for example, (m,n,k),
m*n*k samples are generated. If no shape is specified, a single sample
is returned.
Returns
-------
out : ndarray
Drawn samples, with shape `size`.
See Also
--------
randint : Discrete uniform distribution, yielding integers.
random_integers : Discrete uniform distribution over the closed
interval ``[low, high]``.
random_sample : Floats uniformly distributed over ``[0, 1)``.
random : Alias for `random_sample`.
rand : Convenience function that accepts dimensions as input, e.g.,
``rand(2,2)`` would generate a 2-by-2 array of floats,
uniformly distributed over ``[0, 1)``.
Notes
-----
The probability density function of the uniform distribution is
.. math:: p(x) = \frac{1}{b - a}
anywhere within the interval ``[a, b)``, and zero elsewhere.
Examples
--------
Draw samples from the distribution:
>>> s = np.random.uniform(-1,0,1000)
All values are within the given interval:
>>> np.all(s >= -1)
True
>>> np.all(s < 0)
True
Display the histogram of the samples, along with the
probability density function:
>>> import matplotlib.pyplot as plt
>>> count, bins, ignored = plt.hist(s, 15, normed=True)
>>> plt.plot(bins, np.ones_like(bins), linewidth=2, color='r')
>>> plt.show()
vonmises(...) method of mtrand.RandomState instance
vonmises(mu, kappa, size=None)
Draw samples from a von Mises distribution.
Samples are drawn from a von Mises distribution with specified mode
(mu) and dispersion (kappa), on the interval [-pi, pi].
The von Mises distribution (also known as the circular normal
distribution) is a continuous probability distribution on the unit
circle. It may be thought of as the circular analogue of the normal
distribution.
Parameters
----------
mu : float
Mode ("center") of the distribution.
kappa : float
Dispersion of the distribution, has to be >=0.
size : int or tuple of int
Output shape. If the given shape is, e.g., ``(m, n, k)``, then
``m * n * k`` samples are drawn.
Returns
-------
samples : scalar or ndarray
The returned samples, which are in the interval [-pi, pi].
See Also
--------
scipy.stats.distributions.vonmises : probability density function,
distribution, or cumulative density function, etc.
Notes
-----
The probability density for the von Mises distribution is
.. math:: p(x) = \frac{e^{\kappa \cos(x-\mu)}}{2\pi I_0(\kappa)},
where :math:`\mu` is the mode and :math:`\kappa` the dispersion,
and :math:`I_0(\kappa)` is the modified Bessel function of order 0.
The von Mises is named for Richard Edler von Mises, who was born in
Austria-Hungary, in what is now Ukraine. He fled to the United
States in 1939 and became a professor at Harvard. He worked in
probability theory, aerodynamics, fluid mechanics, and philosophy of
science.
References
----------
Abramowitz, M. and Stegun, I. A. (ed.), *Handbook of Mathematical
Functions*, New York: Dover, 1965.
von Mises, R., *Mathematical Theory of Probability and Statistics*,
New York: Academic Press, 1964.
Examples
--------
Draw samples from the distribution:
>>> mu, kappa = 0.0, 4.0 # mean and dispersion
>>> s = np.random.vonmises(mu, kappa, 1000)
Display the histogram of the samples, along with
the probability density function:
>>> import matplotlib.pyplot as plt
>>> import scipy.special as sps
>>> count, bins, ignored = plt.hist(s, 50, normed=True)
>>> x = np.arange(-np.pi, np.pi, 2*np.pi/50.)
>>> y = np.exp(kappa*np.cos(x-mu))/(2*np.pi*sps.jn(0,kappa))
>>> plt.plot(x, y/max(y), linewidth=2, color='r')
>>> plt.show()
wald(...) method of mtrand.RandomState instance
wald(mean, scale, size=None)
Draw samples from a Wald, or Inverse Gaussian, distribution.
As the scale approaches infinity, the distribution becomes more like a
Gaussian.
Some references claim that the Wald is an Inverse Gaussian with mean=1, but
this is by no means universal.
The Inverse Gaussian distribution was first studied in relationship to
Brownian motion. In 1956 M.C.K. Tweedie used the name Inverse Gaussian
because there is an inverse relationship between the time to cover a unit
distance and distance covered in unit time.
Parameters
----------
mean : scalar
Distribution mean, should be > 0.
scale : scalar
Scale parameter, should be >= 0.
size : int or tuple of ints, optional
Output shape. Default is None, in which case a single value is
returned.
Returns
-------
samples : ndarray or scalar
Drawn sample, all greater than zero.
Notes
-----
The probability density function for the Wald distribution is
.. math:: P(x;mean,scale) = \sqrt{\frac{scale}{2\pi x^3}}e^{\frac{-scale(x-mean)^2}{2\cdot mean^2 x}}
As noted above, the Inverse Gaussian distribution first arose from attempts
to model Brownian Motion. It is also a competitor to the Weibull for use in
reliability modeling and modeling stock returns and interest rate
processes.
References
----------
.. [1] Brighton Webs Ltd., Wald Distribution,
http://www.brighton-webs.co.uk/distributions/wald.asp
.. [2] Chhikara, Raj S., and Folks, J. Leroy, "The Inverse Gaussian
Distribution: Theory : Methodology, and Applications", CRC Press,
1988.
.. [3] Wikipedia, "Wald distribution"
http://en.wikipedia.org/wiki/Wald_distribution
Examples
--------
Draw values from the distribution and plot the histogram:
>>> import matplotlib.pyplot as plt
>>> h = plt.hist(np.random.wald(3, 2, 100000), bins=200, normed=True)
>>> plt.show()
weibull(...) method of mtrand.RandomState instance
weibull(a, size=None)
Weibull distribution.
Draw samples from a 1-parameter Weibull distribution with the given
shape parameter `a`.
.. math:: X = (-ln(U))^{1/a}
Here, U is drawn from the uniform distribution over (0,1].
The more common 2-parameter Weibull, including a scale parameter
:math:`\lambda` is just :math:`X = \lambda(-ln(U))^{1/a}`.
Parameters
----------
a : float
Shape of the distribution.
size : tuple of ints
Output shape. If the given shape is, e.g., ``(m, n, k)``, then
``m * n * k`` samples are drawn.
See Also
--------
scipy.stats.distributions.weibull_max
scipy.stats.distributions.weibull_min
scipy.stats.distributions.genextreme
gumbel
Notes
-----
The Weibull (or Type III asymptotic extreme value distribution for smallest
values, SEV Type III, or Rosin-Rammler distribution) is one of a class of
Generalized Extreme Value (GEV) distributions used in modeling extreme
value problems. This class includes the Gumbel and Frechet distributions.
The probability density for the Weibull distribution is
.. math:: p(x) = \frac{a}
{\lambda}(\frac{x}{\lambda})^{a-1}e^{-(x/\lambda)^a},
where :math:`a` is the shape and :math:`\lambda` the scale.
The function has its peak (the mode) at
:math:`\lambda(\frac{a-1}{a})^{1/a}`.
When ``a = 1``, the Weibull distribution reduces to the exponential
distribution.
References
----------
.. [1] Waloddi Weibull, Professor, Royal Technical University, Stockholm,
1939 "A Statistical Theory Of The Strength Of Materials",
Ingeniorsvetenskapsakademiens Handlingar Nr 151, 1939,
Generalstabens Litografiska Anstalts Forlag, Stockholm.
.. [2] Waloddi Weibull, 1951 "A Statistical Distribution Function of Wide
Applicability", Journal Of Applied Mechanics ASME Paper.
.. [3] Wikipedia, "Weibull distribution",
http://en.wikipedia.org/wiki/Weibull_distribution
Examples
--------
Draw samples from the distribution:
>>> a = 5. # shape
>>> s = np.random.weibull(a, 1000)
Display the histogram of the samples, along with
the probability density function:
>>> import matplotlib.pyplot as plt
>>> x = np.arange(1,100.)/50.
>>> def weib(x,n,a):
... return (a / n) * (x / n)**(a - 1) * np.exp(-(x / n)**a)
>>> count, bins, ignored = plt.hist(np.random.weibull(5.,1000))
>>> x = np.arange(1,100.)/50.
>>> scale = count.max()/weib(x, 1., 5.).max()
>>> plt.plot(x, weib(x, 1., 5.)*scale)
>>> plt.show()
zipf(...) method of mtrand.RandomState instance
zipf(a, size=None)
Draw samples from a Zipf distribution.
Samples are drawn from a Zipf distribution with specified parameter
`a` > 1.
The Zipf distribution (also known as the zeta distribution) is a
discrete probability distribution that satisfies Zipf's law: the
frequency of an item is inversely proportional to its rank in a
frequency table.
Parameters
----------
a : float > 1
Distribution parameter.
size : int or tuple of int, optional
Output shape. If the given shape is, e.g., ``(m, n, k)``, then
``m * n * k`` samples are drawn; a single integer is equivalent to a
one-element tuple, i.e., a 1-D array of length *size* is returned. The
default is None, in which case a single scalar is returned.
Returns
-------
samples : scalar or ndarray
The returned samples are greater than or equal to one.
See Also
--------
scipy.stats.distributions.zipf : probability density function,
distribution, or cumulative density function, etc.
Notes
-----
The probability density for the Zipf distribution is
.. math:: p(x) = \frac{x^{-a}}{\zeta(a)},
where :math:`\zeta` is the Riemann Zeta function.
It is named for the American linguist George Kingsley Zipf, who noted
that the frequency of any word in a sample of a language is inversely
proportional to its rank in the frequency table.
References
----------
Zipf, G. K., *Selected Studies of the Principle of Relative Frequency
in Language*, Cambridge, MA: Harvard Univ. Press, 1932.
Examples
--------
Draw samples from the distribution:
>>> a = 2. # parameter
>>> s = np.random.zipf(a, 1000)
Display the histogram of the samples, along with
the probability density function:
>>> import matplotlib.pyplot as plt
>>> import scipy.special as sps
Truncate s values at 50 so the plot is interesting:
>>> count, bins, ignored = plt.hist(s[s<50], 50, normed=True)
>>> x = np.arange(1., 50.)
>>> y = x**(-a)/sps.zetac(a)
>>> plt.plot(x, y/max(y), linewidth=2, color='r')
>>> plt.show()
DATA
__all__ = ['beta', 'binomial', 'bytes', 'chisquare', 'exponential', 'f...
FILE
/usr/local/lib/python3.4/dist-packages/numpy/random/__init__.py
In [18]:
help(scipy.stats)
Help on package scipy.stats in scipy:
NAME
scipy.stats
DESCRIPTION
==========================================
Statistical functions (:mod:`scipy.stats`)
==========================================
.. module:: scipy.stats
This module contains a large number of probability distributions as
well as a growing library of statistical functions.
Each included distribution is an instance of the class rv_continuous:
For each given name the following methods are available:
.. autosummary::
:toctree: generated/
rv_continuous
rv_continuous.pdf
rv_continuous.logpdf
rv_continuous.cdf
rv_continuous.logcdf
rv_continuous.sf
rv_continuous.logsf
rv_continuous.ppf
rv_continuous.isf
rv_continuous.moment
rv_continuous.stats
rv_continuous.entropy
rv_continuous.fit
rv_continuous.expect
Calling the instance as a function returns a frozen pdf whose shape,
location, and scale parameters are fixed.
Similarly, each discrete distribution is an instance of the class
rv_discrete:
.. autosummary::
:toctree: generated/
rv_discrete
rv_discrete.rvs
rv_discrete.pmf
rv_discrete.logpmf
rv_discrete.cdf
rv_discrete.logcdf
rv_discrete.sf
rv_discrete.logsf
rv_discrete.ppf
rv_discrete.isf
rv_discrete.stats
rv_discrete.moment
rv_discrete.entropy
rv_discrete.expect
Continuous distributions
========================
.. autosummary::
:toctree: generated/
alpha -- Alpha
anglit -- Anglit
arcsine -- Arcsine
beta -- Beta
betaprime -- Beta Prime
bradford -- Bradford
burr -- Burr
cauchy -- Cauchy
chi -- Chi
chi2 -- Chi-squared
cosine -- Cosine
dgamma -- Double Gamma
dweibull -- Double Weibull
erlang -- Erlang
expon -- Exponential
exponweib -- Exponentiated Weibull
exponpow -- Exponential Power
f -- F (Snedecor F)
fatiguelife -- Fatigue Life (Birnbaum-Saunders)
fisk -- Fisk
foldcauchy -- Folded Cauchy
foldnorm -- Folded Normal
frechet_r -- Frechet Right Sided, Extreme Value Type II (Extreme LB) or weibull_min
frechet_l -- Frechet Left Sided, Weibull_max
genlogistic -- Generalized Logistic
genpareto -- Generalized Pareto
genexpon -- Generalized Exponential
genextreme -- Generalized Extreme Value
gausshyper -- Gauss Hypergeometric
gamma -- Gamma
gengamma -- Generalized gamma
genhalflogistic -- Generalized Half Logistic
gilbrat -- Gilbrat
gompertz -- Gompertz (Truncated Gumbel)
gumbel_r -- Right Sided Gumbel, Log-Weibull, Fisher-Tippett, Extreme Value Type I
gumbel_l -- Left Sided Gumbel, etc.
halfcauchy -- Half Cauchy
halflogistic -- Half Logistic
halfnorm -- Half Normal
hypsecant -- Hyperbolic Secant
invgamma -- Inverse Gamma
invgauss -- Inverse Gaussian
invweibull -- Inverse Weibull
johnsonsb -- Johnson SB
johnsonsu -- Johnson SU
ksone -- Kolmogorov-Smirnov one-sided (no stats)
kstwobign -- Kolmogorov-Smirnov two-sided test for Large N (no stats)
laplace -- Laplace
logistic -- Logistic
loggamma -- Log-Gamma
loglaplace -- Log-Laplace (Log Double Exponential)
lognorm -- Log-Normal
lomax -- Lomax (Pareto of the second kind)
maxwell -- Maxwell
mielke -- Mielke's Beta-Kappa
nakagami -- Nakagami
ncx2 -- Non-central chi-squared
ncf -- Non-central F
nct -- Non-central Student's T
norm -- Normal (Gaussian)
pareto -- Pareto
pearson3 -- Pearson type III
powerlaw -- Power-function
powerlognorm -- Power log normal
powernorm -- Power normal
rdist -- R-distribution
reciprocal -- Reciprocal
rayleigh -- Rayleigh
rice -- Rice
recipinvgauss -- Reciprocal Inverse Gaussian
semicircular -- Semicircular
t -- Student's T
triang -- Triangular
truncexpon -- Truncated Exponential
truncnorm -- Truncated Normal
tukeylambda -- Tukey-Lambda
uniform -- Uniform
vonmises -- Von-Mises (Circular)
wald -- Wald
weibull_min -- Minimum Weibull (see Frechet)
weibull_max -- Maximum Weibull (see Frechet)
wrapcauchy -- Wrapped Cauchy
Multivariate distributions
==========================
.. autosummary::
:toctree: generated/
multivariate_normal -- Multivariate normal distribution
Discrete distributions
======================
.. autosummary::
:toctree: generated/
bernoulli -- Bernoulli
binom -- Binomial
boltzmann -- Boltzmann (Truncated Discrete Exponential)
dlaplace -- Discrete Laplacian
geom -- Geometric
hypergeom -- Hypergeometric
logser -- Logarithmic (Log-Series, Series)
nbinom -- Negative Binomial
planck -- Planck (Discrete Exponential)
poisson -- Poisson
randint -- Discrete Uniform
skellam -- Skellam
zipf -- Zipf
Statistical functions
=====================
Several of these functions have a similar version in scipy.stats.mstats
which work for masked arrays.
.. autosummary::
:toctree: generated/
describe -- Descriptive statistics
gmean -- Geometric mean
hmean -- Harmonic mean
kurtosis -- Fisher or Pearson kurtosis
kurtosistest --
mode -- Modal value
moment -- Central moment
normaltest --
skew -- Skewness
skewtest --
tmean -- Truncated arithmetic mean
tvar -- Truncated variance
tmin --
tmax --
tstd --
tsem --
nanmean -- Mean, ignoring NaN values
nanstd -- Standard deviation, ignoring NaN values
nanmedian -- Median, ignoring NaN values
variation -- Coefficient of variation
.. autosummary::
:toctree: generated/
cumfreq
histogram2
histogram
itemfreq
percentileofscore
scoreatpercentile
relfreq
.. autosummary::
:toctree: generated/
binned_statistic -- Compute a binned statistic for a set of data.
binned_statistic_2d -- Compute a 2-D binned statistic for a set of data.
binned_statistic_dd -- Compute a d-D binned statistic for a set of data.
.. autosummary::
:toctree: generated/
obrientransform
signaltonoise
bayes_mvs
sem
zmap
zscore
.. autosummary::
:toctree: generated/
sigmaclip
threshold
trimboth
trim1
.. autosummary::
:toctree: generated/
f_oneway
pearsonr
spearmanr
pointbiserialr
kendalltau
linregress
.. autosummary::
:toctree: generated/
ttest_1samp
ttest_ind
ttest_rel
kstest
chisquare
power_divergence
ks_2samp
mannwhitneyu
tiecorrect
rankdata
ranksums
wilcoxon
kruskal
friedmanchisquare
.. autosummary::
:toctree: generated/
ansari
bartlett
levene
shapiro
anderson
anderson_ksamp
binom_test
fligner
mood
.. autosummary::
:toctree: generated/
boxcox
boxcox_normmax
boxcox_llf
entropy
Contingency table functions
===========================
.. autosummary::
:toctree: generated/
chi2_contingency
contingency.expected_freq
contingency.margins
fisher_exact
Plot-tests
==========
.. autosummary::
:toctree: generated/
ppcc_max
ppcc_plot
probplot
boxcox_normplot
Masked statistics functions
===========================
.. toctree::
stats.mstats
Univariate and multivariate kernel density estimation (:mod:`scipy.stats.kde`)
==============================================================================
.. autosummary::
:toctree: generated/
gaussian_kde
For many more stat related functions install the software R and the
interface package rpy.
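As an editorial aside, the frozen-distribution pattern described above (calling a distribution instance to fix its shape, location, and scale) looks like this in practice; a minimal sketch:
from scipy import stats
frozen = stats.norm(loc=3.0, scale=2.0)  # normal with mean 3 and std 2, fixed
print(frozen.pdf(3.0))        # density at the mean
print(frozen.cdf(3.0))        # 0.5 by symmetry
samples = frozen.rvs(size=5)  # five random variates from the frozen distribution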
PACKAGE CONTENTS
_binned_statistic
_constants
_continuous_distns
_discrete_distns
_distn_infrastructure
_distr_params
_multivariate
_rank
_tukeylambda_stats
contingency
distributions
futil
kde
morestats
mstats
mstats_basic
mstats_extras
mvn
rv
setup
statlib
stats
vonmises
vonmises_cython
CLASSES
builtins.object
scipy.stats.kde.gaussian_kde
scipy.stats._distn_infrastructure.rv_generic(builtins.object)
scipy.stats._distn_infrastructure.rv_continuous
scipy.stats._distn_infrastructure.rv_discrete
class gaussian_kde(builtins.object)
| Representation of a kernel-density estimate using Gaussian kernels.
|
| Kernel density estimation is a way to estimate the probability density
| function (PDF) of a random variable in a non-parametric way.
| `gaussian_kde` works for both uni-variate and multi-variate data. It
| includes automatic bandwidth determination. The estimation works best for
| a unimodal distribution; bimodal or multi-modal distributions tend to be
| oversmoothed.
|
| Parameters
| ----------
| dataset : array_like
| Datapoints to estimate from. In case of univariate data this is a 1-D
| array, otherwise a 2-D array with shape (# of dims, # of data).
| bw_method : str, scalar or callable, optional
| The method used to calculate the estimator bandwidth. This can be
| 'scott', 'silverman', a scalar constant or a callable. If a scalar,
| this will be used directly as `kde.factor`. If a callable, it should
| take a `gaussian_kde` instance as only parameter and return a scalar.
| If None (default), 'scott' is used. See Notes for more details.
|
| Attributes
| ----------
| dataset : ndarray
| The dataset with which `gaussian_kde` was initialized.
| d : int
| Number of dimensions.
| n : int
| Number of datapoints.
| factor : float
| The bandwidth factor, obtained from `kde.covariance_factor`, with which
| the covariance matrix is multiplied.
| covariance : ndarray
| The covariance matrix of `dataset`, scaled by the calculated bandwidth
| (`kde.factor`).
| inv_cov : ndarray
| The inverse of `covariance`.
|
| Methods
| -------
| kde.evaluate(points) : ndarray
| Evaluate the estimated pdf on a provided set of points.
| kde(points) : ndarray
| Same as kde.evaluate(points)
| kde.integrate_gaussian(mean, cov) : float
| Multiply pdf with a specified Gaussian and integrate over the whole
| domain.
| kde.integrate_box_1d(low, high) : float
| Integrate pdf (1D only) between two bounds.
| kde.integrate_box(low_bounds, high_bounds) : float
| Integrate pdf over a rectangular space between low_bounds and
| high_bounds.
| kde.integrate_kde(other_kde) : float
| Integrate two kernel density estimates multiplied together.
| kde.resample(size=None) : ndarray
| Randomly sample a dataset from the estimated pdf.
| kde.set_bandwidth(bw_method='scott') : None
| Computes the bandwidth, i.e. the coefficient that multiplies the data
| covariance matrix to obtain the kernel covariance matrix.
| .. versionadded:: 0.11.0
| kde.covariance_factor : float
| Computes the coefficient (`kde.factor`) that multiplies the data
| covariance matrix to obtain the kernel covariance matrix.
| The default is `scotts_factor`. A subclass can overwrite this method
| to provide a different method, or set it through a call to
| `kde.set_bandwidth`.
|
|
| Notes
| -----
| Bandwidth selection strongly influences the estimate obtained from the KDE
| (much more so than the actual shape of the kernel). Bandwidth selection
| can be done by a "rule of thumb", by cross-validation, by "plug-in
| methods" or by other means; see [3]_, [4]_ for reviews. `gaussian_kde`
| uses a rule of thumb, the default is Scott's Rule.
|
| Scott's Rule [1]_, implemented as `scotts_factor`, is::
|
| n**(-1./(d+4)),
|
| with ``n`` the number of data points and ``d`` the number of dimensions.
| Silverman's Rule [2]_, implemented as `silverman_factor`, is::
|
| (n * (d + 2) / 4.)**(-1. / (d + 4)).
|
| Good general descriptions of kernel density estimation can be found in [1]_
| and [2]_, the mathematics for this multi-dimensional implementation can be
| found in [1]_.
|
| References
| ----------
| .. [1] D.W. Scott, "Multivariate Density Estimation: Theory, Practice, and
| Visualization", John Wiley & Sons, New York, Chicester, 1992.
| .. [2] B.W. Silverman, "Density Estimation for Statistics and Data
| Analysis", Vol. 26, Monographs on Statistics and Applied Probability,
| Chapman and Hall, London, 1986.
| .. [3] B.A. Turlach, "Bandwidth Selection in Kernel Density Estimation: A
| Review", CORE and Institut de Statistique, Vol. 19, pp. 1-33, 1993.
| .. [4] D.M. Bashtannyk and R.J. Hyndman, "Bandwidth selection for kernel
| conditional density estimation", Computational Statistics & Data
| Analysis, Vol. 36, pp. 279-298, 2001.
|
| Examples
| --------
| Generate some random two-dimensional data:
|
| >>> from scipy import stats
| >>> def measure(n):
| >>> "Measurement model, return two coupled measurements."
| >>> m1 = np.random.normal(size=n)
| >>> m2 = np.random.normal(scale=0.5, size=n)
| >>> return m1+m2, m1-m2
|
| >>> m1, m2 = measure(2000)
| >>> xmin = m1.min()
| >>> xmax = m1.max()
| >>> ymin = m2.min()
| >>> ymax = m2.max()
|
| Perform a kernel density estimate on the data:
|
| >>> X, Y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
| >>> positions = np.vstack([X.ravel(), Y.ravel()])
| >>> values = np.vstack([m1, m2])
| >>> kernel = stats.gaussian_kde(values)
| >>> Z = np.reshape(kernel(positions).T, X.shape)
|
| Plot the results:
|
| >>> import matplotlib.pyplot as plt
| >>> fig = plt.figure()
| >>> ax = fig.add_subplot(111)
| >>> ax.imshow(np.rot90(Z), cmap=plt.cm.gist_earth_r,
| ... extent=[xmin, xmax, ymin, ymax])
| >>> ax.plot(m1, m2, 'k.', markersize=2)
| >>> ax.set_xlim([xmin, xmax])
| >>> ax.set_ylim([ymin, ymax])
| >>> plt.show()
|
| Methods defined here:
|
| __call__ = evaluate(self, points)
|
| __init__(self, dataset, bw_method=None)
|
| covariance_factor = scotts_factor(self)
|
| evaluate(self, points)
| Evaluate the estimated pdf on a set of points.
|
| Parameters
| ----------
| points : (# of dimensions, # of points)-array
| Alternatively, a (# of dimensions,) vector can be passed in and
| treated as a single point.
|
| Returns
| -------
| values : (# of points,)-array
| The values at each point.
|
| Raises
| ------
| ValueError : if the dimensionality of the input points is different than
| the dimensionality of the KDE.
|
| integrate_box(self, low_bounds, high_bounds, maxpts=None)
| Computes the integral of a pdf over a rectangular interval.
|
| Parameters
| ----------
| low_bounds : array_like
| A 1-D array containing the lower bounds of integration.
| high_bounds : array_like
| A 1-D array containing the upper bounds of integration.
| maxpts : int, optional
| The maximum number of points to use for integration.
|
| Returns
| -------
| value : scalar
| The result of the integral.
|
| integrate_box_1d(self, low, high)
| Computes the integral of a 1D pdf between two bounds.
|
| Parameters
| ----------
| low : scalar
| Lower bound of integration.
| high : scalar
| Upper bound of integration.
|
| Returns
| -------
| value : scalar
| The result of the integral.
|
| Raises
| ------
| ValueError
| If the KDE is over more than one dimension.
|
| integrate_gaussian(self, mean, cov)
| Multiply estimated density by a multivariate Gaussian and integrate
| over the whole space.
|
| Parameters
| ----------
| mean : array_like
| A 1-D array, specifying the mean of the Gaussian.
| cov : array_like
| A 2-D array, specifying the covariance matrix of the Gaussian.
|
| Returns
| -------
| result : scalar
| The value of the integral.
|
| Raises
| ------
| ValueError :
| If the mean or covariance of the input Gaussian differs from
| the KDE's dimensionality.
|
| integrate_kde(self, other)
| Computes the integral of the product of this kernel density estimate
| with another.
|
| Parameters
| ----------
| other : gaussian_kde instance
| The other kde.
|
| Returns
| -------
| value : scalar
| The result of the integral.
|
| Raises
| ------
| ValueError
| If the KDEs have different dimensionality.
|
| resample(self, size=None)
| Randomly sample a dataset from the estimated pdf.
|
| Parameters
| ----------
| size : int, optional
| The number of samples to draw. If not provided, then the size is
| the same as the underlying dataset.
|
| Returns
| -------
| resample : (self.d, `size`) ndarray
| The sampled dataset.
|
| scotts_factor(self)
|
| set_bandwidth(self, bw_method=None)
| Compute the estimator bandwidth with given method.
|
| The new bandwidth calculated after a call to `set_bandwidth` is used
| for subsequent evaluations of the estimated density.
|
| Parameters
| ----------
| bw_method : str, scalar or callable, optional
| The method used to calculate the estimator bandwidth. This can be
| 'scott', 'silverman', a scalar constant or a callable. If a
| scalar, this will be used directly as `kde.factor`. If a callable,
| it should take a `gaussian_kde` instance as only parameter and
| return a scalar. If None (default), nothing happens; the current
| `kde.covariance_factor` method is kept.
|
| Notes
| -----
| .. versionadded:: 0.11
|
| Examples
| --------
| >>> x1 = np.array([-7, -5, 1, 4, 5.])
| >>> kde = stats.gaussian_kde(x1)
| >>> xs = np.linspace(-10, 10, num=50)
| >>> y1 = kde(xs)
| >>> kde.set_bandwidth(bw_method='silverman')
| >>> y2 = kde(xs)
| >>> kde.set_bandwidth(bw_method=kde.factor / 3.)
| >>> y3 = kde(xs)
|
| >>> fig = plt.figure()
| >>> ax = fig.add_subplot(111)
| >>> ax.plot(x1, np.ones(x1.shape) / (4. * x1.size), 'bo',
| ... label='Data points (rescaled)')
| >>> ax.plot(xs, y1, label='Scott (default)')
| >>> ax.plot(xs, y2, label='Silverman')
| >>> ax.plot(xs, y3, label='Const (1/3 * Silverman)')
| >>> ax.legend()
| >>> plt.show()
|
| silverman_factor(self)
|
| ----------------------------------------------------------------------
| Data descriptors defined here:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
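As a quick, self-contained check of the gaussian_kde methods listed above (our own sketch, not part of the help text), the following builds a one-dimensional KDE, evaluates it, integrates it over an interval, and resamples from it:

>>> import numpy as np
>>> from scipy import stats
>>> np.random.seed(0)
>>> data = np.random.normal(size=200)     # 200 draws from a standard normal
>>> kde = stats.gaussian_kde(data)
>>> kde.evaluate([0.0])                   # estimated density near the mode
>>> kde.integrate_box_1d(-1.0, 1.0)       # probability mass in [-1, 1]
>>> newdata = kde.resample(size=50)       # 50 new draws from the estimate
>>> newdata.shape
(1, 50)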
class rv_continuous(rv_generic)
| A generic continuous random variable class meant for subclassing.
|
| `rv_continuous` is a base class to construct specific distribution classes
| and instances from for continuous random variables. It cannot be used
| directly as a distribution.
|
| Parameters
| ----------
| momtype : int, optional
| The type of generic moment calculation to use: 0 for pdf, 1 (default)
| for ppf.
| a : float, optional
| Lower bound of the support of the distribution, default is minus
| infinity.
| b : float, optional
| Upper bound of the support of the distribution, default is plus
| infinity.
| xtol : float, optional
| The tolerance for fixed point calculation for generic ppf.
| badvalue : object, optional
 | The value in result arrays that indicates a value for which
 | some argument restriction is violated, default is np.nan.
| name : str, optional
| The name of the instance. This string is used to construct the default
| example for distributions.
| longname : str, optional
| This string is used as part of the first line of the docstring returned
| when a subclass has no docstring of its own. Note: `longname` exists
| for backwards compatibility, do not use for new subclasses.
| shapes : str, optional
| The shape of the distribution. For example ``"m, n"`` for a
| distribution that takes two integers as the two shape arguments for all
| its methods.
| extradoc : str, optional, deprecated
| This string is used as the last part of the docstring returned when a
| subclass has no docstring of its own. Note: `extradoc` exists for
| backwards compatibility, do not use for new subclasses.
|
| Methods
| -------
| rvs(<shape(s)>, loc=0, scale=1, size=1)
| random variates
|
| pdf(x, <shape(s)>, loc=0, scale=1)
| probability density function
|
| logpdf(x, <shape(s)>, loc=0, scale=1)
| log of the probability density function
|
| cdf(x, <shape(s)>, loc=0, scale=1)
 | cumulative distribution function
 |
 | logcdf(x, <shape(s)>, loc=0, scale=1)
 | log of the cumulative distribution function
|
| sf(x, <shape(s)>, loc=0, scale=1)
| survival function (1-cdf --- sometimes more accurate)
|
| logsf(x, <shape(s)>, loc=0, scale=1)
| log of the survival function
|
| ppf(q, <shape(s)>, loc=0, scale=1)
| percent point function (inverse of cdf --- quantiles)
|
| isf(q, <shape(s)>, loc=0, scale=1)
| inverse survival function (inverse of sf)
|
| moment(n, <shape(s)>, loc=0, scale=1)
| non-central n-th moment of the distribution. May not work for array
| arguments.
|
| stats(<shape(s)>, loc=0, scale=1, moments='mv')
| mean('m'), variance('v'), skew('s'), and/or kurtosis('k')
|
| entropy(<shape(s)>, loc=0, scale=1)
| (differential) entropy of the RV.
|
| fit(data, <shape(s)>, loc=0, scale=1)
| Parameter estimates for generic data
|
| expect(func=None, args=(), loc=0, scale=1, lb=None, ub=None,
| conditional=False, **kwds)
| Expected value of a function with respect to the distribution.
| Additional kwd arguments passed to integrate.quad
|
| median(<shape(s)>, loc=0, scale=1)
| Median of the distribution.
|
| mean(<shape(s)>, loc=0, scale=1)
| Mean of the distribution.
|
| std(<shape(s)>, loc=0, scale=1)
| Standard deviation of the distribution.
|
| var(<shape(s)>, loc=0, scale=1)
| Variance of the distribution.
|
 | interval(alpha, <shape(s)>, loc=0, scale=1)
 | Interval that contains a random realization of this distribution
 | with probability `alpha`.
|
| __call__(<shape(s)>, loc=0, scale=1)
| Calling a distribution instance creates a frozen RV object with the
| same methods but holding the given shape, location, and scale fixed.
| See Notes section.
|
| **Parameters for Methods**
|
| x : array_like
| quantiles
| q : array_like
| lower or upper tail probability
| <shape(s)> : array_like
| shape parameters
| loc : array_like, optional
| location parameter (default=0)
| scale : array_like, optional
| scale parameter (default=1)
| size : int or tuple of ints, optional
| shape of random variates (default computed from input arguments )
| moments : string, optional
| composed of letters ['mvsk'] specifying which moments to compute where
| 'm' = mean, 'v' = variance, 's' = (Fisher's) skew and
| 'k' = (Fisher's) kurtosis. (default='mv')
| n : int
| order of moment to calculate in method moments
|
| Notes
| -----
|
| **Methods that can be overwritten by subclasses**
| ::
|
| _rvs
| _pdf
| _cdf
| _sf
| _ppf
| _isf
| _stats
| _munp
| _entropy
| _argcheck
|
 | There are additional (internal and private) generic methods that can
 | be useful for cross-checking and for debugging, but might not work in
 | all cases when directly called.
|
| **Frozen Distribution**
|
| Alternatively, the object may be called (as a function) to fix the shape,
| location, and scale parameters returning a "frozen" continuous RV object:
|
| rv = generic(<shape(s)>, loc=0, scale=1)
| frozen RV object with the same methods but holding the given shape,
| location, and scale fixed
|
| **Subclassing**
|
| New random variables can be defined by subclassing rv_continuous class
| and re-defining at least the ``_pdf`` or the ``_cdf`` method (normalized
| to location 0 and scale 1) which will be given clean arguments (in between
| a and b) and passing the argument check method.
|
| If positive argument checking is not correct for your RV
| then you will also need to re-define the ``_argcheck`` method.
|
 | Correct, but potentially slow defaults exist for the remaining
 | methods but for speed and/or accuracy you can override::
|
| _logpdf, _cdf, _logcdf, _ppf, _rvs, _isf, _sf, _logsf
|
| Rarely would you override ``_isf``, ``_sf`` or ``_logsf``, but you could.
|
| Statistics are computed using numerical integration by default.
| For speed you can redefine this using ``_stats``:
|
| - take shape parameters and return mu, mu2, g1, g2
| - If you can't compute one of these, return it as None
| - Can also be defined with a keyword argument ``moments=<str>``,
| where <str> is a string composed of 'm', 'v', 's',
| and/or 'k'. Only the components appearing in string
| should be computed and returned in the order 'm', 'v',
| 's', or 'k' with missing values returned as None.
|
| Alternatively, you can override ``_munp``, which takes n and shape
| parameters and returns the nth non-central moment of the distribution.
|
| A note on ``shapes``: subclasses need not specify them explicitly. In this
| case, the `shapes` will be automatically deduced from the signatures of the
| overridden methods.
| If, for some reason, you prefer to avoid relying on introspection, you can
| specify ``shapes`` explicitly as an argument to the instance constructor.
|
| Examples
| --------
| To create a new Gaussian distribution, we would do the following::
|
| class gaussian_gen(rv_continuous):
| "Gaussian distribution"
| def _pdf(self, x):
| ...
| ...
|
| Method resolution order:
| rv_continuous
| rv_generic
| builtins.object
|
| Methods defined here:
|
| __init__(self, momtype=1, a=None, b=None, xtol=1e-14, badvalue=None, name=None, longname=None, shapes=None, extradoc=None)
|
| cdf(self, x, *args, **kwds)
| Cumulative distribution function of the given RV.
|
| Parameters
| ----------
| x : array_like
| quantiles
| arg1, arg2, arg3,... : array_like
| The shape parameter(s) for the distribution (see docstring of the
| instance object for more information)
| loc : array_like, optional
| location parameter (default=0)
| scale : array_like, optional
| scale parameter (default=1)
|
| Returns
| -------
| cdf : ndarray
| Cumulative distribution function evaluated at `x`
|
| entropy(self, *args, **kwds)
| Differential entropy of the RV.
|
| Parameters
| ----------
| arg1, arg2, arg3,... : array_like
| The shape parameter(s) for the distribution (see docstring of the
| instance object for more information).
| loc : array_like, optional
| Location parameter (default=0).
| scale : array_like, optional
| Scale parameter (default=1).
|
| est_loc_scale(*args, **kwds)
| `est_loc_scale` is deprecated!
|
| This function is deprecated, use self.fit_loc_scale(data) instead.
|
| expect(self, func=None, args=(), loc=0, scale=1, lb=None, ub=None, conditional=False, **kwds)
| Calculate expected value of a function with respect to the
| distribution.
|
| The expected value of a function ``f(x)`` with respect to a
| distribution ``dist`` is defined as::
|
 | E[f(x)] = Integral(f(x) * dist.pdf(x)) from lbound to ubound
|
| Parameters
| ----------
| func : callable, optional
| Function for which integral is calculated. Takes only one argument.
| The default is the identity mapping f(x) = x.
| args : tuple, optional
| Argument (parameters) of the distribution.
| lb, ub : scalar, optional
| Lower and upper bound for integration. default is set to the
| support of the distribution.
| conditional : bool, optional
| If True, the integral is corrected by the conditional probability
| of the integration interval. The return value is the expectation
| of the function, conditional on being in the given interval.
| Default is False.
|
| Additional keyword arguments are passed to the integration routine.
|
| Returns
| -------
| expect : float
| The calculated expected value.
|
| Notes
| -----
| The integration behavior of this function is inherited from
| `integrate.quad`.
|
| fit(self, data, *args, **kwds)
| Return MLEs for shape, location, and scale parameters from data.
|
| MLE stands for Maximum Likelihood Estimate. Starting estimates for
| the fit are given by input arguments; for any arguments not provided
| with starting estimates, ``self._fitstart(data)`` is called to generate
| such.
|
| One can hold some parameters fixed to specific values by passing in
| keyword arguments ``f0``, ``f1``, ..., ``fn`` (for shape parameters)
| and ``floc`` and ``fscale`` (for location and scale parameters,
| respectively).
|
| Parameters
| ----------
| data : array_like
| Data to use in calculating the MLEs.
| args : floats, optional
| Starting value(s) for any shape-characterizing arguments (those not
| provided will be determined by a call to ``_fitstart(data)``).
| No default value.
| kwds : floats, optional
| Starting values for the location and scale parameters; no default.
| Special keyword arguments are recognized as holding certain
| parameters fixed:
|
| f0...fn : hold respective shape parameters fixed.
|
| floc : hold location parameter fixed to specified value.
|
| fscale : hold scale parameter fixed to specified value.
|
| optimizer : The optimizer to use. The optimizer must take func,
| and starting position as the first two arguments,
| plus args (for extra arguments to pass to the
| function to be optimized) and disp=0 to suppress
| output as keyword arguments.
|
| Returns
| -------
| shape, loc, scale : tuple of floats
| MLEs for any shape statistics, followed by those for location and
| scale.
|
| Notes
| -----
| This fit is computed by maximizing a log-likelihood function, with
| penalty applied for samples outside of range of the distribution. The
| returned answer is not guaranteed to be the globally optimal MLE, it
| may only be locally optimal, or the optimization may fail altogether.
|
| fit_loc_scale(self, data, *args)
| Estimate loc and scale parameters from data using 1st and 2nd moments.
|
| Parameters
| ----------
| data : array_like
| Data to fit.
| arg1, arg2, arg3,... : array_like
| The shape parameter(s) for the distribution (see docstring of the
| instance object for more information).
|
| Returns
| -------
| Lhat : float
| Estimated location parameter for the data.
| Shat : float
| Estimated scale parameter for the data.
|
| isf(self, q, *args, **kwds)
| Inverse survival function at q of the given RV.
|
| Parameters
| ----------
| q : array_like
| upper tail probability
| arg1, arg2, arg3,... : array_like
| The shape parameter(s) for the distribution (see docstring of the
| instance object for more information)
| loc : array_like, optional
| location parameter (default=0)
| scale : array_like, optional
| scale parameter (default=1)
|
| Returns
| -------
| x : ndarray or scalar
| Quantile corresponding to the upper tail probability q.
|
| logcdf(self, x, *args, **kwds)
| Log of the cumulative distribution function at x of the given RV.
|
| Parameters
| ----------
| x : array_like
| quantiles
| arg1, arg2, arg3,... : array_like
| The shape parameter(s) for the distribution (see docstring of the
| instance object for more information)
| loc : array_like, optional
| location parameter (default=0)
| scale : array_like, optional
| scale parameter (default=1)
|
| Returns
| -------
| logcdf : array_like
| Log of the cumulative distribution function evaluated at x
|
| logpdf(self, x, *args, **kwds)
| Log of the probability density function at x of the given RV.
|
| This uses a more numerically accurate calculation if available.
|
| Parameters
| ----------
| x : array_like
| quantiles
| arg1, arg2, arg3,... : array_like
| The shape parameter(s) for the distribution (see docstring of the
| instance object for more information)
| loc : array_like, optional
| location parameter (default=0)
| scale : array_like, optional
| scale parameter (default=1)
|
| Returns
| -------
| logpdf : array_like
| Log of the probability density function evaluated at x
|
| logsf(self, x, *args, **kwds)
| Log of the survival function of the given RV.
|
| Returns the log of the "survival function," defined as (1 - `cdf`),
| evaluated at `x`.
|
| Parameters
| ----------
| x : array_like
| quantiles
| arg1, arg2, arg3,... : array_like
| The shape parameter(s) for the distribution (see docstring of the
| instance object for more information)
| loc : array_like, optional
| location parameter (default=0)
| scale : array_like, optional
| scale parameter (default=1)
|
| Returns
| -------
| logsf : ndarray
| Log of the survival function evaluated at `x`.
|
| nnlf(self, theta, x)
| Return negative loglikelihood function
|
| Notes
| -----
| This is ``-sum(log pdf(x, theta), axis=0)`` where theta are the
| parameters (including loc and scale).
|
| pdf(self, x, *args, **kwds)
| Probability density function at x of the given RV.
|
| Parameters
| ----------
| x : array_like
| quantiles
| arg1, arg2, arg3,... : array_like
| The shape parameter(s) for the distribution (see docstring of the
| instance object for more information)
| loc : array_like, optional
| location parameter (default=0)
| scale : array_like, optional
| scale parameter (default=1)
|
| Returns
| -------
| pdf : ndarray
| Probability density function evaluated at x
|
| ppf(self, q, *args, **kwds)
| Percent point function (inverse of cdf) at q of the given RV.
|
| Parameters
| ----------
| q : array_like
| lower tail probability
| arg1, arg2, arg3,... : array_like
| The shape parameter(s) for the distribution (see docstring of the
| instance object for more information)
| loc : array_like, optional
| location parameter (default=0)
| scale : array_like, optional
| scale parameter (default=1)
|
| Returns
| -------
| x : array_like
| quantile corresponding to the lower tail probability q.
|
| sf(self, x, *args, **kwds)
| Survival function (1-cdf) at x of the given RV.
|
| Parameters
| ----------
| x : array_like
| quantiles
| arg1, arg2, arg3,... : array_like
| The shape parameter(s) for the distribution (see docstring of the
| instance object for more information)
| loc : array_like, optional
| location parameter (default=0)
| scale : array_like, optional
| scale parameter (default=1)
|
| Returns
| -------
| sf : array_like
| Survival function evaluated at x
|
| ----------------------------------------------------------------------
| Methods inherited from rv_generic:
|
| __call__(self, *args, **kwds)
|
| freeze(self, *args, **kwds)
| Freeze the distribution for the given arguments.
|
| Parameters
| ----------
| arg1, arg2, arg3,... : array_like
| The shape parameter(s) for the distribution. Should include all
| the non-optional arguments, may include ``loc`` and ``scale``.
|
| Returns
| -------
| rv_frozen : rv_frozen instance
| The frozen distribution.
|
| interval(self, alpha, *args, **kwds)
| Confidence interval with equal areas around the median.
|
| Parameters
| ----------
| alpha : array_like of float
| Probability that an rv will be drawn from the returned range.
| Each value should be in the range [0, 1].
| arg1, arg2, ... : array_like
| The shape parameter(s) for the distribution (see docstring of the
| instance object for more information).
| loc : array_like, optional
| location parameter, Default is 0.
| scale : array_like, optional
| scale parameter, Default is 1.
|
| Returns
| -------
| a, b : ndarray of float
| end-points of range that contain ``100 * alpha %`` of the rv's
| possible values.
|
| mean(self, *args, **kwds)
| Mean of the distribution
|
| Parameters
| ----------
| arg1, arg2, arg3,... : array_like
| The shape parameter(s) for the distribution (see docstring of the
| instance object for more information)
| loc : array_like, optional
| location parameter (default=0)
| scale : array_like, optional
| scale parameter (default=1)
|
| Returns
| -------
| mean : float
| the mean of the distribution
|
| median(self, *args, **kwds)
| Median of the distribution.
|
| Parameters
| ----------
| arg1, arg2, arg3,... : array_like
| The shape parameter(s) for the distribution (see docstring of the
| instance object for more information)
| loc : array_like, optional
| Location parameter, Default is 0.
| scale : array_like, optional
| Scale parameter, Default is 1.
|
| Returns
| -------
| median : float
| The median of the distribution.
|
| See Also
| --------
| stats.distributions.rv_discrete.ppf
| Inverse of the CDF
|
| moment(self, n, *args, **kwds)
| n'th order non-central moment of distribution.
|
| Parameters
| ----------
| n : int, n>=1
| Order of moment.
| arg1, arg2, arg3,... : float
| The shape parameter(s) for the distribution (see docstring of the
| instance object for more information).
| kwds : keyword arguments, optional
| These can include "loc" and "scale", as well as other keyword
| arguments relevant for a given distribution.
|
| rvs(self, *args, **kwds)
| Random variates of given type.
|
| Parameters
| ----------
| arg1, arg2, arg3,... : array_like
| The shape parameter(s) for the distribution (see docstring of the
| instance object for more information).
| loc : array_like, optional
| Location parameter (default=0).
| scale : array_like, optional
| Scale parameter (default=1).
| size : int or tuple of ints, optional
| Defining number of random variates (default=1).
|
| Returns
| -------
| rvs : ndarray or scalar
| Random variates of given `size`.
|
| stats(self, *args, **kwds)
| Some statistics of the given RV
|
| Parameters
| ----------
| arg1, arg2, arg3,... : array_like
| The shape parameter(s) for the distribution (see docstring of the
| instance object for more information)
| loc : array_like, optional
| location parameter (default=0)
 | scale : array_like, optional (continuous RVs only)
| scale parameter (default=1)
| moments : str, optional
| composed of letters ['mvsk'] defining which moments to compute:
| 'm' = mean,
| 'v' = variance,
| 's' = (Fisher's) skew,
| 'k' = (Fisher's) kurtosis.
| (default='mv')
|
| Returns
| -------
| stats : sequence
| of requested moments.
|
| std(self, *args, **kwds)
| Standard deviation of the distribution.
|
| Parameters
| ----------
| arg1, arg2, arg3,... : array_like
| The shape parameter(s) for the distribution (see docstring of the
| instance object for more information)
| loc : array_like, optional
| location parameter (default=0)
| scale : array_like, optional
| scale parameter (default=1)
|
| Returns
| -------
| std : float
| standard deviation of the distribution
|
| var(self, *args, **kwds)
| Variance of the distribution
|
| Parameters
| ----------
| arg1, arg2, arg3,... : array_like
| The shape parameter(s) for the distribution (see docstring of the
| instance object for more information)
| loc : array_like, optional
| location parameter (default=0)
| scale : array_like, optional
| scale parameter (default=1)
|
| Returns
| -------
| var : float
| the variance of the distribution
|
| ----------------------------------------------------------------------
| Data descriptors inherited from rv_generic:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
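To make the subclassing recipe in the rv_continuous docstring concrete, here is a minimal runnable sketch (our own example, following the pattern above) that defines a Gaussian by overriding only ``_pdf``; every other method falls back to the generic numerical defaults:

>>> import numpy as np
>>> from scipy.stats import rv_continuous
>>> class gaussian_gen(rv_continuous):
...     "Gaussian distribution"
...     def _pdf(self, x):
...         # standard normal pdf; loc and scale are handled generically
...         return np.exp(-x**2 / 2.0) / np.sqrt(2.0 * np.pi)
>>> gaussian = gaussian_gen(name='gaussian')
>>> gaussian.pdf(0.0)     # about 0.3989
>>> gaussian.cdf(0.0)     # about 0.5, via generic numerical integration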
class rv_discrete(rv_generic)
| A generic discrete random variable class meant for subclassing.
|
| `rv_discrete` is a base class to construct specific distribution classes
 | and instances from for discrete random variables. rv_discrete can be used
 | to construct an arbitrary distribution defined by a list of support
 | points and the corresponding probabilities.
|
| Parameters
| ----------
| a : float, optional
| Lower bound of the support of the distribution, default: 0
| b : float, optional
| Upper bound of the support of the distribution, default: plus infinity
| moment_tol : float, optional
| The tolerance for the generic calculation of moments
| values : tuple of two array_like
| (xk, pk) where xk are points (integers) with positive probability pk
| with sum(pk) = 1
| inc : integer
| increment for the support of the distribution, default: 1
| other values have not been tested
| badvalue : object, optional
| The value in (masked) arrays that indicates a value that should be
| ignored.
| name : str, optional
| The name of the instance. This string is used to construct the default
| example for distributions.
| longname : str, optional
| This string is used as part of the first line of the docstring returned
| when a subclass has no docstring of its own. Note: `longname` exists
| for backwards compatibility, do not use for new subclasses.
| shapes : str, optional
| The shape of the distribution. For example ``"m, n"`` for a
| distribution that takes two integers as the first two arguments for all
| its methods.
| extradoc : str, optional
| This string is used as the last part of the docstring returned when a
| subclass has no docstring of its own. Note: `extradoc` exists for
| backwards compatibility, do not use for new subclasses.
|
| Methods
| -------
| generic.rvs(<shape(s)>, loc=0, size=1)
| random variates
|
| generic.pmf(x, <shape(s)>, loc=0)
| probability mass function
|
 | generic.logpmf(x, <shape(s)>, loc=0)
 | log of the probability mass function
 |
 | generic.cdf(x, <shape(s)>, loc=0)
 | cumulative distribution function
 |
 | generic.logcdf(x, <shape(s)>, loc=0)
 | log of the cumulative distribution function
|
| generic.sf(x, <shape(s)>, loc=0)
| survival function (1-cdf --- sometimes more accurate)
|
 | generic.logsf(x, <shape(s)>, loc=0)
| log of the survival function
|
| generic.ppf(q, <shape(s)>, loc=0)
| percent point function (inverse of cdf --- percentiles)
|
| generic.isf(q, <shape(s)>, loc=0)
| inverse survival function (inverse of sf)
|
| generic.moment(n, <shape(s)>, loc=0)
| non-central n-th moment of the distribution. May not work for array
| arguments.
|
| generic.stats(<shape(s)>, loc=0, moments='mv')
| mean('m', axis=0), variance('v'), skew('s'), and/or kurtosis('k')
|
| generic.entropy(<shape(s)>, loc=0)
| entropy of the RV
|
| generic.expect(func=None, args=(), loc=0, lb=None, ub=None,
| conditional=False)
| Expected value of a function with respect to the distribution.
| Additional kwd arguments passed to integrate.quad
|
| generic.median(<shape(s)>, loc=0)
| Median of the distribution.
|
| generic.mean(<shape(s)>, loc=0)
| Mean of the distribution.
|
| generic.std(<shape(s)>, loc=0)
| Standard deviation of the distribution.
|
| generic.var(<shape(s)>, loc=0)
| Variance of the distribution.
|
 | generic.interval(alpha, <shape(s)>, loc=0)
 | Interval that contains a random realization of this distribution
 | with probability `alpha`.
|
| generic(<shape(s)>, loc=0)
| calling a distribution instance returns a frozen distribution
|
| Notes
| -----
|
| You can construct an arbitrary discrete rv where ``P{X=xk} = pk``
 | by passing to the rv_discrete initialization method (through the
 | ``values=`` keyword) a tuple of sequences (xk, pk) which describes only those
| values of X (xk) that occur with nonzero probability (pk).
|
| To create a new discrete distribution, we would do the following::
|
| class poisson_gen(rv_discrete):
 | "Poisson distribution"
| def _pmf(self, k, mu):
| ...
|
| and create an instance::
|
| poisson = poisson_gen(name="poisson",
| longname='A Poisson')
|
| The docstring can be created from a template.
|
| Alternatively, the object may be called (as a function) to fix the shape
| and location parameters returning a "frozen" discrete RV object::
|
| myrv = generic(<shape(s)>, loc=0)
| - frozen RV object with the same methods but holding the given
| shape and location fixed.
|
| A note on ``shapes``: subclasses need not specify them explicitly. In this
| case, the `shapes` will be automatically deduced from the signatures of the
| overridden methods.
| If, for some reason, you prefer to avoid relying on introspection, you can
| specify ``shapes`` explicitly as an argument to the instance constructor.
|
|
| Examples
| --------
|
| Custom made discrete distribution:
|
| >>> import matplotlib.pyplot as plt
| >>> from scipy import stats
| >>> xk = np.arange(7)
| >>> pk = (0.1, 0.2, 0.3, 0.1, 0.1, 0.1, 0.1)
| >>> custm = stats.rv_discrete(name='custm', values=(xk, pk))
| >>> h = plt.plot(xk, custm.pmf(xk))
|
| Random number generation:
|
| >>> R = custm.rvs(size=100)
|
| Display frozen pmf:
|
 | >>> numargs = generic.numargs
 | >>> [ <shape(s)> ] = ['Replace with reasonable value', ]*numargs
 | >>> rv = generic(<shape(s)>)
 | >>> x = np.arange(0, min(rv.dist.b, 3) + 1)
| >>> h = plt.plot(x, rv.pmf(x))
|
| Here, ``rv.dist.b`` is the right endpoint of the support of ``rv.dist``.
|
| Check accuracy of cdf and ppf:
|
| >>> prb = generic.cdf(x, <shape(s)>)
| >>> h = plt.semilogy(np.abs(x-generic.ppf(prb, <shape(s)>))+1e-20)
|
| Method resolution order:
| rv_discrete
| rv_generic
| builtins.object
|
| Methods defined here:
|
| __init__(self, a=0, b=inf, name=None, badvalue=None, moment_tol=1e-08, values=None, inc=1, longname=None, shapes=None, extradoc=None)
|
| cdf(self, k, *args, **kwds)
| Cumulative distribution function of the given RV.
|
| Parameters
| ----------
| k : array_like, int
| Quantiles.
| arg1, arg2, arg3,... : array_like
| The shape parameter(s) for the distribution (see docstring of the
| instance object for more information).
| loc : array_like, optional
| Location parameter (default=0).
|
| Returns
| -------
| cdf : ndarray
| Cumulative distribution function evaluated at `k`.
|
| expect(self, func=None, args=(), loc=0, lb=None, ub=None, conditional=False)
| Calculate expected value of a function with respect to the distribution
| for discrete distribution
|
| Parameters
| ----------
 | func : callable, optional
 | Function for which the sum is calculated. Takes only one argument.
 | The default is the identity mapping f(x) = x.
| args : tuple
| argument (parameters) of the distribution
| lb, ub : numbers, optional
| lower and upper bound for integration, default is set to the
 | support of the distribution; lb and ub are inclusive (lb <= k <= ub)
| conditional : bool, optional
| Default is False.
| If true then the expectation is corrected by the conditional
| probability of the integration interval. The return value is the
| expectation of the function, conditional on being in the given
 | interval (k such that lb <= k <= ub).
|
| Returns
| -------
| expect : float
| Expected value.
|
| Notes
| -----
| * function is not vectorized
 | * accuracy: uses self.moment_tol as stopping criterion
| for heavy tailed distribution e.g. zipf(4), accuracy for
| mean, variance in example is only 1e-5,
| increasing precision (moment_tol) makes zipf very slow
| * suppnmin=100 internal parameter for minimum number of points to
| evaluate could be added as keyword parameter, to evaluate functions
| with non-monotonic shapes, points include integers in (-suppnmin,
| suppnmin)
| * uses maxcount=1000 limits the number of points that are evaluated
| to break loop for infinite sums
| (a maximum of suppnmin+1000 positive plus suppnmin+1000 negative
| integers are evaluated)
|
| isf(self, q, *args, **kwds)
| Inverse survival function (1-sf) at q of the given RV.
|
| Parameters
| ----------
| q : array_like
| Upper tail probability.
| arg1, arg2, arg3,... : array_like
| The shape parameter(s) for the distribution (see docstring of the
| instance object for more information).
| loc : array_like, optional
| Location parameter (default=0).
|
| Returns
| -------
| k : ndarray or scalar
| Quantile corresponding to the upper tail probability, q.
|
| logcdf(self, k, *args, **kwds)
| Log of the cumulative distribution function at k of the given RV
|
| Parameters
| ----------
| k : array_like, int
| Quantiles.
| arg1, arg2, arg3,... : array_like
| The shape parameter(s) for the distribution (see docstring of the
| instance object for more information).
| loc : array_like, optional
| Location parameter (default=0).
|
| Returns
| -------
| logcdf : array_like
| Log of the cumulative distribution function evaluated at k.
|
| logpmf(self, k, *args, **kwds)
| Log of the probability mass function at k of the given RV.
|
| Parameters
| ----------
| k : array_like
| Quantiles.
| arg1, arg2, arg3,... : array_like
| The shape parameter(s) for the distribution (see docstring of the
| instance object for more information).
| loc : array_like, optional
| Location parameter. Default is 0.
|
| Returns
| -------
| logpmf : array_like
| Log of the probability mass function evaluated at k.
|
| logsf(self, k, *args, **kwds)
| Log of the survival function of the given RV.
|
| Returns the log of the "survival function," defined as ``1 - cdf``,
| evaluated at `k`.
|
| Parameters
| ----------
| k : array_like
| Quantiles.
| arg1, arg2, arg3,... : array_like
| The shape parameter(s) for the distribution (see docstring of the
| instance object for more information).
| loc : array_like, optional
| Location parameter (default=0).
|
| Returns
| -------
| logsf : ndarray
| Log of the survival function evaluated at `k`.
|
| pmf(self, k, *args, **kwds)
| Probability mass function at k of the given RV.
|
| Parameters
| ----------
| k : array_like
| quantiles
| arg1, arg2, arg3,... : array_like
| The shape parameter(s) for the distribution (see docstring of the
| instance object for more information)
| loc : array_like, optional
| Location parameter (default=0).
|
| Returns
| -------
| pmf : array_like
| Probability mass function evaluated at k
|
| ppf(self, q, *args, **kwds)
| Percent point function (inverse of cdf) at q of the given RV
|
| Parameters
| ----------
| q : array_like
| Lower tail probability.
| arg1, arg2, arg3,... : array_like
| The shape parameter(s) for the distribution (see docstring of the
| instance object for more information).
| loc : array_like, optional
| Location parameter (default=0).
| scale : array_like, optional
| Scale parameter (default=1).
|
| Returns
| -------
| k : array_like
| Quantile corresponding to the lower tail probability, q.
|
| rvs(self, *args, **kwargs)
| Random variates of given type.
|
| Parameters
| ----------
| arg1, arg2, arg3,... : array_like
| The shape parameter(s) for the distribution (see docstring of the
| instance object for more information).
| loc : array_like, optional
| Location parameter (default=0).
| size : int or tuple of ints, optional
| Defining number of random variates (default=1). Note that `size`
| has to be given as keyword, not as positional argument.
|
| Returns
| -------
| rvs : ndarray or scalar
| Random variates of given `size`.
|
| sf(self, k, *args, **kwds)
| Survival function (1-cdf) at k of the given RV.
|
| Parameters
| ----------
| k : array_like
| Quantiles.
| arg1, arg2, arg3,... : array_like
| The shape parameter(s) for the distribution (see docstring of the
| instance object for more information).
| loc : array_like, optional
| Location parameter (default=0).
|
| Returns
| -------
| sf : array_like
| Survival function evaluated at k.
|
| ----------------------------------------------------------------------
| Methods inherited from rv_generic:
|
| __call__(self, *args, **kwds)
|
| entropy(self, *args, **kwds)
| Differential entropy of the RV.
|
| Parameters
| ----------
| arg1, arg2, arg3,... : array_like
| The shape parameter(s) for the distribution (see docstring of the
| instance object for more information).
| loc : array_like, optional
| Location parameter (default=0).
| scale : array_like, optional (continuous distributions only).
| Scale parameter (default=1).
|
| Notes
| -----
| Entropy is defined base `e`:
|
| >>> drv = rv_discrete(values=((0, 1), (0.5, 0.5)))
| >>> np.allclose(drv.entropy(), np.log(2.0))
| True
|
| freeze(self, *args, **kwds)
| Freeze the distribution for the given arguments.
|
| Parameters
| ----------
| arg1, arg2, arg3,... : array_like
| The shape parameter(s) for the distribution. Should include all
| the non-optional arguments, may include ``loc`` and ``scale``.
|
| Returns
| -------
| rv_frozen : rv_frozen instance
| The frozen distribution.
|
| interval(self, alpha, *args, **kwds)
| Confidence interval with equal areas around the median.
|
| Parameters
| ----------
| alpha : array_like of float
| Probability that an rv will be drawn from the returned range.
| Each value should be in the range [0, 1].
| arg1, arg2, ... : array_like
| The shape parameter(s) for the distribution (see docstring of the
| instance object for more information).
| loc : array_like, optional
| location parameter, Default is 0.
| scale : array_like, optional
| scale parameter, Default is 1.
|
| Returns
| -------
| a, b : ndarray of float
| end-points of range that contain ``100 * alpha %`` of the rv's
| possible values.
|
| mean(self, *args, **kwds)
| Mean of the distribution
|
| Parameters
| ----------
| arg1, arg2, arg3,... : array_like
| The shape parameter(s) for the distribution (see docstring of the
| instance object for more information)
| loc : array_like, optional
| location parameter (default=0)
| scale : array_like, optional
| scale parameter (default=1)
|
| Returns
| -------
| mean : float
| the mean of the distribution
|
| median(self, *args, **kwds)
| Median of the distribution.
|
| Parameters
| ----------
| arg1, arg2, arg3,... : array_like
| The shape parameter(s) for the distribution (see docstring of the
| instance object for more information)
| loc : array_like, optional
| Location parameter, Default is 0.
| scale : array_like, optional
| Scale parameter, Default is 1.
|
| Returns
| -------
| median : float
| The median of the distribution.
|
| See Also
| --------
| stats.distributions.rv_discrete.ppf
| Inverse of the CDF
|
| moment(self, n, *args, **kwds)
| n'th order non-central moment of distribution.
|
| Parameters
| ----------
| n : int, n>=1
| Order of moment.
| arg1, arg2, arg3,... : float
| The shape parameter(s) for the distribution (see docstring of the
| instance object for more information).
| kwds : keyword arguments, optional
| These can include "loc" and "scale", as well as other keyword
| arguments relevant for a given distribution.
|
| stats(self, *args, **kwds)
| Some statistics of the given RV
|
| Parameters
| ----------
| arg1, arg2, arg3,... : array_like
| The shape parameter(s) for the distribution (see docstring of the
| instance object for more information)
| loc : array_like, optional
| location parameter (default=0)
 | scale : array_like, optional (continuous RVs only)
| scale parameter (default=1)
| moments : str, optional
| composed of letters ['mvsk'] defining which moments to compute:
| 'm' = mean,
| 'v' = variance,
| 's' = (Fisher's) skew,
| 'k' = (Fisher's) kurtosis.
| (default='mv')
|
| Returns
| -------
| stats : sequence
| of requested moments.
|
| std(self, *args, **kwds)
| Standard deviation of the distribution.
|
| Parameters
| ----------
| arg1, arg2, arg3,... : array_like
| The shape parameter(s) for the distribution (see docstring of the
| instance object for more information)
| loc : array_like, optional
| location parameter (default=0)
| scale : array_like, optional
| scale parameter (default=1)
|
| Returns
| -------
| std : float
| standard deviation of the distribution
|
| var(self, *args, **kwds)
| Variance of the distribution
|
| Parameters
| ----------
| arg1, arg2, arg3,... : array_like
| The shape parameter(s) for the distribution (see docstring of the
| instance object for more information)
| loc : array_like, optional
| location parameter (default=0)
| scale : array_like, optional
| scale parameter (default=1)
|
| Returns
| -------
| var : float
| the variance of the distribution
|
| ----------------------------------------------------------------------
| Data descriptors inherited from rv_generic:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
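The ``generic``/``<shape(s)>`` lines in the rv_discrete Examples section above are documentation-template placeholders. A fully runnable version of the same idea (our own sketch, reusing the ``custm`` distribution) is:

>>> import numpy as np
>>> from scipy import stats
>>> xk = np.arange(7)
>>> pk = (0.1, 0.2, 0.3, 0.1, 0.1, 0.1, 0.1)
>>> custm = stats.rv_discrete(name='custm', values=(xk, pk))
>>> custm.pmf(2)                          # P(X = 2), here 0.3
>>> custm.cdf(2)                          # P(X <= 2), here 0.6
>>> prb = custm.cdf(xk)
>>> np.allclose(custm.ppf(prb), xk)       # ppf inverts cdf on the support
True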
FUNCTIONS
anderson(x, dist='norm')
Anderson-Darling test for data coming from a particular distribution
The Anderson-Darling test is a modification of the Kolmogorov-
Smirnov test `kstest` for the null hypothesis that a sample is
drawn from a population that follows a particular distribution.
For the Anderson-Darling test, the critical values depend on
which distribution is being tested against. This function works
for normal, exponential, logistic, or Gumbel (Extreme Value
Type I) distributions.
Parameters
----------
x : array_like
array of sample data
dist : {'norm','expon','logistic','gumbel','extreme1'}, optional
the type of distribution to test against. The default is 'norm'
and 'extreme1' is a synonym for 'gumbel'
Returns
-------
A2 : float
The Anderson-Darling test statistic
critical : list
The critical values for this distribution
sig : list
The significance levels for the corresponding critical values
in percents. The function returns critical values for a
differing set of significance levels depending on the
distribution that is being tested against.
Notes
-----
Critical values provided are for the following significance levels:
normal/exponential
15%, 10%, 5%, 2.5%, 1%
logistic
25%, 10%, 5%, 2.5%, 1%, 0.5%
Gumbel
25%, 10%, 5%, 2.5%, 1%
If A2 is larger than these critical values then for the corresponding
significance level, the null hypothesis that the data come from the
chosen distribution can be rejected.
References
----------
.. [1] http://www.itl.nist.gov/div898/handbook/prc/section2/prc213.htm
.. [2] Stephens, M. A. (1974). EDF Statistics for Goodness of Fit and
Some Comparisons, Journal of the American Statistical Association,
Vol. 69, pp. 730-737.
.. [3] Stephens, M. A. (1976). Asymptotic Results for Goodness-of-Fit
Statistics with Unknown Parameters, Annals of Statistics, Vol. 4,
pp. 357-369.
.. [4] Stephens, M. A. (1977). Goodness of Fit for the Extreme Value
Distribution, Biometrika, Vol. 64, pp. 583-588.
.. [5] Stephens, M. A. (1977). Goodness of Fit with Special Reference
to Tests for Exponentiality , Technical Report No. 262,
Department of Statistics, Stanford University, Stanford, CA.
.. [6] Stephens, M. A. (1979). Tests of Fit for the Logistic Distribution
Based on the Empirical Distribution Function, Biometrika, Vol. 66,
pp. 591-595.
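A minimal usage sketch for `anderson` (our own example, with an assumed seed):

>>> import numpy as np
>>> from scipy import stats
>>> np.random.seed(12345)
>>> x = np.random.normal(loc=5.0, scale=2.0, size=100)
>>> A2, crit, sig = stats.anderson(x, dist='norm')
>>> # reject normality at the significance level sig[i] (in percent)
>>> # whenever A2 exceeds the matching critical value crit[i]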
anderson_ksamp(samples, midrank=True)
The Anderson-Darling test for k-samples.
The k-sample Anderson-Darling test is a modification of the
one-sample Anderson-Darling test. It tests the null hypothesis
that k-samples are drawn from the same population without having
to specify the distribution function of that population. The
critical values depend on the number of samples.
Parameters
----------
samples : sequence of 1-D array_like
Array of sample data in arrays.
midrank : bool, optional
Type of Anderson-Darling test which is computed. Default
(True) is the midrank test applicable to continuous and
discrete populations. If False, the right side empirical
distribution is used.
Returns
-------
A2 : float
Normalized k-sample Anderson-Darling test statistic.
critical : array
The critical values for significance levels 25%, 10%, 5%, 2.5%, 1%.
p : float
An approximate significance level at which the null hypothesis for the
provided samples can be rejected.
Raises
------
ValueError
If less than 2 samples are provided, a sample is empty, or no
distinct observations are in the samples.
See Also
--------
ks_2samp : 2 sample Kolmogorov-Smirnov test
anderson : 1 sample Anderson-Darling test
Notes
-----
[1]_ Defines three versions of the k-sample Anderson-Darling test:
one for continuous distributions and two for discrete
distributions, in which ties between samples may occur. The
default of this routine is to compute the version based on the
midrank empirical distribution function. This test is applicable
to continuous and discrete data. If midrank is set to False, the
right side empirical distribution is used for a test for discrete
data. According to [1]_, the two discrete test statistics differ
only slightly if a few collisions due to round-off errors occur in
the test not adjusted for ties between samples.
.. versionadded:: 0.14.0
References
----------
.. [1] Scholz, F. W and Stephens, M. A. (1987), K-Sample
Anderson-Darling Tests, Journal of the American Statistical
Association, Vol. 82, pp. 918-924.
Examples
--------
>>> from scipy import stats
>>> np.random.seed(314159)
The null hypothesis that the two random samples come from the same
distribution can be rejected at the 5% level because the returned
test value is greater than the critical value for 5% (1.961) but
not at the 2.5% level. The interpolation gives an approximate
significance level of 3.1%:
>>> stats.anderson_ksamp([np.random.normal(size=50),
... np.random.normal(loc=0.5, size=30)])
(2.4615796189876105,
array([ 0.325, 1.226, 1.961, 2.718, 3.752]),
0.03134990135800783)
The null hypothesis cannot be rejected for three samples from an
identical distribution. The approximate p-value (87%) has to be
computed by extrapolation and may not be very accurate:
>>> stats.anderson_ksamp([np.random.normal(size=50),
... np.random.normal(size=30), np.random.normal(size=20)])
(-0.73091722665244196,
array([ 0.44925884, 1.3052767 , 1.9434184 , 2.57696569, 3.41634856]),
0.8789283903979661)
ansari(x, y)
Perform the Ansari-Bradley test for equal scale parameters
The Ansari-Bradley test is a non-parametric test for the equality
of the scale parameter of the distributions from which two
samples were drawn.
Parameters
----------
x, y : array_like
arrays of sample data
Returns
-------
AB : float
The Ansari-Bradley test statistic
p-value : float
The p-value of the hypothesis test
See Also
--------
fligner : A non-parametric test for the equality of k variances
mood : A non-parametric test for the equality of two scale parameters
Notes
-----
The p-value given is exact when the sample sizes are both less than
55 and there are no ties, otherwise a normal approximation for the
p-value is used.
References
----------
.. [1] Sprent, Peter and N.C. Smeeton. Applied nonparametric statistical
methods. 3rd ed. Chapman and Hall/CRC. 2001. Section 5.8.2.
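A minimal usage sketch for `ansari` (our own example): two samples that differ only in scale should yield a small p-value.

>>> import numpy as np
>>> from scipy import stats
>>> np.random.seed(1)
>>> x = np.random.normal(scale=1.0, size=40)
>>> y = np.random.normal(scale=3.0, size=40)
>>> AB, p = stats.ansari(x, y)
>>> # a small p suggests the two scale parameters differ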
bartlett(*args)
Perform Bartlett's test for equal variances
Bartlett's test tests the null hypothesis that all input samples
are from populations with equal variances. For samples
from significantly non-normal populations, Levene's test
`levene`_ is more robust.
Parameters
----------
sample1, sample2,... : array_like
arrays of sample data. May be different lengths.
Returns
-------
T : float
The test statistic.
p-value : float
The p-value of the test.
References
----------
.. [1] http://www.itl.nist.gov/div898/handbook/eda/section3/eda357.htm
.. [2] Snedecor, George W. and Cochran, William G. (1989), Statistical
Methods, Eighth Edition, Iowa State University Press.
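A minimal usage sketch for `bartlett` (our own example, with made-up measurements):

>>> from scipy import stats
>>> a = [8.88, 9.12, 9.04, 8.98, 9.00, 9.08]
>>> b = [8.88, 8.95, 9.29, 9.44, 9.15, 9.58]
>>> T, p = stats.bartlett(a, b)
>>> # a small p rejects the null hypothesis of equal variances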
bayes_mvs(data, alpha=0.9)
Bayesian confidence intervals for the mean, var, and std.
Parameters
----------
data : array_like
Input data, if multi-dimensional it is flattened to 1-D by `bayes_mvs`.
Requires 2 or more data points.
alpha : float, optional
Probability that the returned confidence interval contains
the true parameter.
Returns
-------
mean_cntr, var_cntr, std_cntr : tuple
The three results are for the mean, variance and standard deviation,
respectively. Each result is a tuple of the form::
(center, (lower, upper))
with `center` the mean of the conditional pdf of the value given the
data, and `(lower, upper)` a confidence interval, centered on the
median, containing the estimate to a probability `alpha`.
Notes
-----
Each tuple of mean, variance, and standard deviation estimates represent
the (center, (lower, upper)) with center the mean of the conditional pdf
of the value given the data and (lower, upper) is a confidence interval
centered on the median, containing the estimate to a probability
`alpha`.
Converts data to 1-D and assumes all data has the same mean and variance.
Uses Jeffrey's prior for variance and std.
Equivalent to tuple((x.mean(), x.interval(alpha)) for x in mvsdist(dat))
References
----------
T.E. Oliphant, "A Bayesian perspective on estimating mean, variance, and
standard-deviation from data", http://hdl.handle.net/1877/438, 2006.
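A minimal usage sketch for `bayes_mvs` (our own example): each returned result unpacks as ``(center, (lower, upper))``.

>>> import numpy as np
>>> from scipy import stats
>>> np.random.seed(0)
>>> data = np.random.normal(loc=10.0, scale=2.0, size=100)
>>> mean_res, var_res, std_res = stats.bayes_mvs(data, alpha=0.95)
>>> center, (lower, upper) = mean_res    # 95% interval for the mean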
betai(a, b, x)
Returns the incomplete beta function.
I_x(a,b) = 1/B(a,b)*(Integral(0,x) of t^(a-1)(1-t)^(b-1) dt)
where a,b>0 and B(a,b) = G(a)*G(b)/(G(a+b)) where G(a) is the gamma
function of a.
The standard broadcasting rules apply to a, b, and x.
Parameters
----------
a : array_like or float > 0
b : array_like or float > 0
x : array_like or float
x will be clipped to be no greater than 1.0.
Returns
-------
betai : ndarray
Incomplete beta function.
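A minimal sketch (our own example): the same regularized quantity is also available as `scipy.special.betainc`, which can serve as a cross-check.

>>> from scipy import stats, special
>>> stats.betai(2.0, 3.0, 0.5)
>>> special.betainc(2.0, 3.0, 0.5)     # should agree with the value above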
binned_statistic(x, values, statistic='mean', bins=10, range=None)
Compute a binned statistic for a set of data.
This is a generalization of a histogram function. A histogram divides
the space into bins, and returns the count of the number of points in
each bin. This function allows the computation of the sum, mean, median,
or other statistic of the values within each bin.
.. versionadded:: 0.11.0
Parameters
----------
x : array_like
A sequence of values to be binned.
values : array_like
The values on which the statistic will be computed. This must be
the same shape as `x`.
statistic : string or callable, optional
The statistic to compute (default is 'mean').
The following statistics are available:
* 'mean' : compute the mean of values for points within each bin.
Empty bins will be represented by NaN.
* 'median' : compute the median of values for points within each
bin. Empty bins will be represented by NaN.
* 'count' : compute the count of points within each bin. This is
identical to an unweighted histogram. `values` array is not
referenced.
* 'sum' : compute the sum of values for points within each bin.
This is identical to a weighted histogram.
* function : a user-defined function which takes a 1D array of
values, and outputs a single numerical statistic. This function
will be called on the values in each bin. Empty bins will be
represented by function([]), or NaN if this returns an error.
bins : int or sequence of scalars, optional
If `bins` is an int, it defines the number of equal-width
bins in the given range (10, by default). If `bins` is a sequence,
it defines the bin edges, including the rightmost edge, allowing
for non-uniform bin widths.
range : (float, float) or [(float, float)], optional
The lower and upper range of the bins. If not provided, range
is simply ``(x.min(), x.max())``. Values outside the range are
ignored.
Returns
-------
statistic : array
The values of the selected statistic in each bin.
bin_edges : array of dtype float
Return the bin edges ``(length(statistic)+1)``.
binnumber : 1-D ndarray of ints
This assigns to each observation an integer that represents the bin
in which this observation falls. Array has the same length as values.
See Also
--------
numpy.histogram, binned_statistic_2d, binned_statistic_dd
Notes
-----
All but the last (righthand-most) bin is half-open. In other words, if
`bins` is::
[1, 2, 3, 4]
then the first bin is ``[1, 2)`` (including 1, but excluding 2) and the
second ``[2, 3)``. The last bin, however, is ``[3, 4]``, which *includes*
4.
Examples
--------
>>> stats.binned_statistic([1, 2, 1, 2, 4], np.arange(5), statistic='mean',
... bins=3)
(array([ 1., 2., 4.]), array([ 1., 2., 3., 4.]), array([1, 2, 1, 2, 3]))
binned_statistic_2d(x, y, values, statistic='mean', bins=10, range=None)
Compute a bidimensional binned statistic for a set of data.
This is a generalization of a histogram2d function. A histogram divides
the space into bins, and returns the count of the number of points in
each bin. This function allows the computation of the sum, mean, median,
or other statistic of the values within each bin.
.. versionadded:: 0.11.0
Parameters
----------
x : (N,) array_like
A sequence of values to be binned along the first dimension.
y : (M,) array_like
A sequence of values to be binned along the second dimension.
values : (N,) array_like
The values on which the statistic will be computed. This must be
the same shape as `x`.
statistic : string or callable, optional
The statistic to compute (default is 'mean').
The following statistics are available:
* 'mean' : compute the mean of values for points within each bin.
Empty bins will be represented by NaN.
* 'median' : compute the median of values for points within each
bin. Empty bins will be represented by NaN.
* 'count' : compute the count of points within each bin. This is
identical to an unweighted histogram. `values` array is not
referenced.
* 'sum' : compute the sum of values for points within each bin.
This is identical to a weighted histogram.
* function : a user-defined function which takes a 1D array of
values, and outputs a single numerical statistic. This function
will be called on the values in each bin. Empty bins will be
represented by function([]), or NaN if this returns an error.
bins : int or [int, int] or array-like or [array, array], optional
The bin specification:
* the number of bins for the two dimensions (nx=ny=bins),
* the number of bins in each dimension (nx, ny = bins),
* the bin edges for the two dimensions (x_edges = y_edges = bins),
* the bin edges in each dimension (x_edges, y_edges = bins).
range : (2,2) array_like, optional
The leftmost and rightmost edges of the bins along each dimension
(if not specified explicitly in the `bins` parameters):
[[xmin, xmax], [ymin, ymax]]. All values outside of this range will be
considered outliers and not tallied in the histogram.
Returns
-------
statistic : (nx, ny) ndarray
The values of the selected statistic in each two-dimensional bin
xedges : (nx + 1) ndarray
The bin edges along the first dimension.
yedges : (ny + 1) ndarray
The bin edges along the second dimension.
binnumber : 1-D ndarray of ints
This assigns to each observation an integer that represents the bin
in which this observation falls. Array has the same length as `values`.
See Also
--------
numpy.histogram2d, binned_statistic, binned_statistic_dd
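A minimal usage sketch for `binned_statistic_2d` (our own example): four points binned on a 2x2 grid, averaging `values` per bin.

>>> import numpy as np
>>> from scipy import stats
>>> x = np.array([0.1, 0.1, 0.9, 0.9])
>>> y = np.array([0.2, 0.3, 0.5, 0.7])
>>> values = np.array([10., 20., 30., 40.])
>>> stat, xedges, yedges, binnum = stats.binned_statistic_2d(
...     x, y, values, statistic='mean', bins=2)
>>> stat.shape        # one mean per (x, y) bin
(2, 2)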
binned_statistic_dd(sample, values, statistic='mean', bins=10, range=None)
Compute a multidimensional binned statistic for a set of data.
This is a generalization of a histogramdd function. A histogram divides
the space into bins, and returns the count of the number of points in
each bin. This function allows the computation of the sum, mean, median,
or other statistic of the values within each bin.
.. versionadded:: 0.11.0
Parameters
----------
sample : array_like
Data to histogram passed as a sequence of D arrays of length N, or
as an (N,D) array.
values : array_like
The values on which the statistic will be computed. This must be
the same shape as x.
statistic : string or callable, optional
The statistic to compute (default is 'mean').
The following statistics are available:
* 'mean' : compute the mean of values for points within each bin.
Empty bins will be represented by NaN.
* 'median' : compute the median of values for points within each
bin. Empty bins will be represented by NaN.
* 'count' : compute the count of points within each bin. This is
identical to an unweighted histogram. `values` array is not
referenced.
* 'sum' : compute the sum of values for points within each bin.
This is identical to a weighted histogram.
* function : a user-defined function which takes a 1D array of
values, and outputs a single numerical statistic. This function
will be called on the values in each bin. Empty bins will be
represented by function([]), or NaN if this returns an error.
bins : sequence or int, optional
The bin specification:
* A sequence of arrays describing the bin edges along each dimension.
* The number of bins for each dimension (nx, ny, ... =bins)
* The number of bins for all dimensions (nx=ny=...=bins).
range : sequence, optional
A sequence of lower and upper bin edges to be used if the edges are
not given explicitly in `bins`. Defaults to the minimum and maximum
values along each dimension.
Returns
-------
statistic : ndarray, shape(nx1, nx2, nx3,...)
The values of the selected statistic in each two-dimensional bin
edges : list of ndarrays
A list of D arrays describing the (nxi + 1) bin edges for each
dimension
binnumber : 1-D ndarray of ints
This assigns to each observation an integer that represents the bin
in which this observation falls. Array has the same length as values.
See Also
--------
np.histogramdd, binned_statistic, binned_statistic_2d
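A minimal usage sketch for `binned_statistic_dd` (our own example) with three-dimensional sample points:

>>> import numpy as np
>>> from scipy import stats
>>> np.random.seed(0)
>>> sample = np.random.rand(100, 3)       # 100 points in 3 dimensions
>>> values = np.random.rand(100)
>>> stat, edges, binnum = stats.binned_statistic_dd(
...     sample, values, statistic='sum', bins=4)
>>> stat.shape                            # 4 bins per dimension
(4, 4, 4)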
binom_test(x, n=None, p=0.5)
Perform a test that the probability of success is p.
This is an exact, two-sided test of the null hypothesis
that the probability of success in a Bernoulli experiment
is `p`.
Parameters
----------
x : integer or array_like
the number of successes, or if x has length 2, it is the
number of successes and the number of failures.
n : integer
the number of trials. This is ignored if x gives both the
number of successes and failures
p : float, optional
The hypothesized probability of success. 0 <= p <= 1. The
default value is p = 0.5
Returns
-------
p-value : float
The p-value of the hypothesis test
References
----------
.. [1] http://en.wikipedia.org/wiki/Binomial_test
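A minimal usage sketch for `binom_test` (our own example): the two-sided p-value for 8 successes in 20 fair-coin trials.

>>> from scipy import stats
>>> p = stats.binom_test(8, n=20, p=0.5)
>>> # p is the probability of a result at least this far from 10/20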
boxcox(x, lmbda=None, alpha=None)
Return a positive dataset transformed by a Box-Cox power transformation.
Parameters
----------
x : ndarray
Input array. Should be 1-dimensional.
lmbda : {None, scalar}, optional
If `lmbda` is not None, do the transformation for that value.
If `lmbda` is None, find the lambda that maximizes the log-likelihood
function and return it as the second output argument.
alpha : {None, float}, optional
If `alpha` is not None, return the ``100 * (1-alpha)%`` confidence
interval for `lmbda` as the third output argument.
Must be between 0.0 and 1.0.
Returns
-------
boxcox : ndarray
Box-Cox power transformed array.
maxlog : float, optional
If the `lmbda` parameter is None, the second returned argument is
the lambda that maximizes the log-likelihood function.
(min_ci, max_ci) : tuple of float, optional
If `lmbda` parameter is None and `alpha` is not None, this returned
tuple of floats represents the minimum and maximum confidence limits
given `alpha`.
See Also
--------
probplot, boxcox_normplot, boxcox_normmax, boxcox_llf
Notes
-----
The Box-Cox transform is given by::
y = (x**lmbda - 1) / lmbda,  for lmbda != 0
log(x),                      for lmbda = 0
`boxcox` requires the input data to be positive. Sometimes a Box-Cox
transformation provides a shift parameter to achieve this; `boxcox` does
not. Such a shift parameter is equivalent to adding a positive constant to
`x` before calling `boxcox`.
The confidence limits returned when `alpha` is provided give the interval
where:
.. math::
llf(\hat{\lambda}) - llf(\lambda) < \frac{1}{2}\chi^2(1 - \alpha, 1),
with ``llf`` the log-likelihood function and :math:`\chi^2` the chi-squared
function.
References
----------
G.E.P. Box and D.R. Cox, "An Analysis of Transformations", Journal of the
Royal Statistical Society B, 26, 211-252 (1964).
Examples
--------
>>> from scipy import stats
>>> import matplotlib.pyplot as plt
We generate some random variates from a non-normal distribution and make a
probability plot for it, to show it is non-normal in the tails:
>>> fig = plt.figure()
>>> ax1 = fig.add_subplot(211)
>>> x = stats.loggamma.rvs(5, size=500) + 5
>>> stats.probplot(x, dist=stats.norm, plot=ax1)
>>> ax1.set_xlabel('')
>>> ax1.set_title('Probplot against normal distribution')
We now use `boxcox` to transform the data so it's closest to normal:
>>> ax2 = fig.add_subplot(212)
>>> xt, _ = stats.boxcox(x)
>>> stats.probplot(xt, dist=stats.norm, plot=ax2)
>>> ax2.set_title('Probplot after Box-Cox transformation')
>>> plt.show()
boxcox_llf(lmb, data)
The boxcox log-likelihood function.
Parameters
----------
lmb : scalar
Parameter for Box-Cox transformation. See `boxcox` for details.
data : array_like
Data to calculate Box-Cox log-likelihood for. If `data` is
multi-dimensional, the log-likelihood is calculated along the first
axis.
Returns
-------
llf : float or ndarray
Box-Cox log-likelihood of `data` given `lmb`. A float for 1-D `data`,
an array otherwise.
See Also
--------
boxcox, probplot, boxcox_normplot, boxcox_normmax
Notes
-----
The Box-Cox log-likelihood function is defined here as
.. math::
llf = (\lambda - 1) \sum_i(\log(x_i)) -
N/2 \log(\sum_i (y_i - \bar{y})^2 / N),
where ``y`` is the Box-Cox transformed input data ``x``.
Examples
--------
>>> from scipy import stats
>>> import matplotlib.pyplot as plt
>>> from mpl_toolkits.axes_grid1.inset_locator import inset_axes
>>> np.random.seed(1245)
Generate some random variates and calculate Box-Cox log-likelihood values
for them for a range of ``lmbda`` values:
>>> x = stats.loggamma.rvs(5, loc=10, size=1000)
>>> lmbdas = np.linspace(-2, 10)
>>> llf = np.zeros(lmbdas.shape, dtype=float)
>>> for ii, lmbda in enumerate(lmbdas):
... llf[ii] = stats.boxcox_llf(lmbda, x)
Also find the optimal lmbda value with `boxcox`:
>>> x_most_normal, lmbda_optimal = stats.boxcox(x)
Plot the log-likelihood as function of lmbda. Add the optimal lmbda as a
horizontal line to check that that's really the optimum:
>>> fig = plt.figure()
>>> ax = fig.add_subplot(111)
>>> ax.plot(lmbdas, llf, 'b.-')
>>> ax.axhline(stats.boxcox_llf(lmbda_optimal, x), color='r')
>>> ax.set_xlabel('lmbda parameter')
>>> ax.set_ylabel('Box-Cox log-likelihood')
Now add some probability plots to show that where the log-likelihood is
maximized the data transformed with `boxcox` looks closest to normal:
>>> locs = [3, 10, 4] # 'lower left', 'center', 'lower right'
>>> for lmbda, loc in zip([-1, lmbda_optimal, 9], locs):
... xt = stats.boxcox(x, lmbda=lmbda)
... (osm, osr), (slope, intercept, r_sq) = stats.probplot(xt)
... ax_inset = inset_axes(ax, width="20%", height="20%", loc=loc)
... ax_inset.plot(osm, osr, 'c.', osm, slope*osm + intercept, 'k-')
... ax_inset.set_xticklabels([])
... ax_inset.set_yticklabels([])
... ax_inset.set_title('$\lambda=%1.2f$' % lmbda)
>>> plt.show()
boxcox_normmax(x, brack=(-2.0, 2.0), method='pearsonr')
Compute optimal Box-Cox transform parameter for input data.
Parameters
----------
x : array_like
Input array.
brack : 2-tuple, optional
The starting interval for a downhill bracket search with
`optimize.brent`. Note that this is in most cases not critical; the
final result is allowed to be outside this bracket.
method : str, optional
The method to determine the optimal transform parameter (`boxcox`
``lmbda`` parameter). Options are:
'pearsonr' (default)
Maximizes the Pearson correlation coefficient between
``y = boxcox(x)`` and the expected values for ``y`` if `x` would be
normally-distributed.
'mle'
Maximizes the log-likelihood `boxcox_llf`. This is the method used
in `boxcox`.
'all'
Use all optimization methods available, and return all results.
Useful to compare different methods.
Returns
-------
maxlog : float or ndarray
The optimal transform parameter found. An array instead of a scalar
for ``method='all'``.
See Also
--------
boxcox, boxcox_llf, boxcox_normplot
Examples
--------
>>> from scipy import stats
>>> import matplotlib.pyplot as plt
>>> np.random.seed(1234) # make this example reproducible
Generate some data and determine optimal ``lmbda`` in various ways:
>>> x = stats.loggamma.rvs(5, size=30) + 5
>>> y, lmax_mle = stats.boxcox(x)
>>> lmax_pearsonr = stats.boxcox_normmax(x)
>>> lmax_mle
7.177...
>>> lmax_pearsonr
7.916...
>>> stats.boxcox_normmax(x, method='all')
array([ 7.91667384, 7.17718692])
>>> fig = plt.figure()
>>> ax = fig.add_subplot(111)
>>> stats.boxcox_normplot(x, -10, 10, plot=ax)
>>> ax.axvline(lmax_mle, color='r')
>>> ax.axvline(lmax_pearsonr, color='g', ls='--')
>>> plt.show()
boxcox_normplot(x, la, lb, plot=None, N=80)
Compute parameters for a Box-Cox normality plot, optionally show it.
A Box-Cox normality plot shows graphically what the best transformation
parameter is to use in `boxcox` to obtain a distribution that is close
to normal.
Parameters
----------
x : array_like
Input array.
la, lb : scalar
The lower and upper bounds for the ``lmbda`` values to pass to `boxcox`
for Box-Cox transformations. These are also the limits of the
horizontal axis of the plot if that is generated.
plot : object, optional
If given, plots the quantiles and least squares fit.
`plot` is an object that has to have methods "plot" and "text".
The `matplotlib.pyplot` module or a Matplotlib Axes object can be used,
or a custom object with the same methods.
Default is None, which means that no plot is created.
N : int, optional
Number of points on the horizontal axis (equally distributed from
`la` to `lb`).
Returns
-------
lmbdas : ndarray
The ``lmbda`` values for which a Box-Cox transform was done.
ppcc : ndarray
Probability Plot Correlation Coefficient, as obtained from `probplot`
when fitting the Box-Cox transformed input `x` against a normal
distribution.
See Also
--------
probplot, boxcox, boxcox_normmax, boxcox_llf, ppcc_max
Notes
-----
Even if `plot` is given, the figure is not shown or saved by
`boxcox_normplot`; ``plt.show()`` or ``plt.savefig('figname.png')``
should be used after calling `boxcox_normplot`.
Examples
--------
>>> from scipy import stats
>>> import matplotlib.pyplot as plt
Generate some non-normally distributed data, and create a Box-Cox plot:
>>> x = stats.loggamma.rvs(5, size=500) + 5
>>> fig = plt.figure()
>>> ax = fig.add_subplot(111)
>>> stats.boxcox_normplot(x, -20, 20, plot=ax)
Determine and plot the optimal ``lmbda`` to transform ``x`` and plot it in
the same plot:
>>> _, maxlog = stats.boxcox(x)
>>> ax.axvline(maxlog, color='r')
>>> plt.show()
callable(obj)
chi2_contingency(observed, correction=True, lambda_=None)
Chi-square test of independence of variables in a contingency table.
This function computes the chi-square statistic and p-value for the
hypothesis test of independence of the observed frequencies in the
contingency table [1]_ `observed`. The expected frequencies are computed
based on the marginal sums under the assumption of independence; see
`scipy.stats.contingency.expected_freq`. The number of degrees of
freedom is (expressed using numpy functions and attributes)::
dof = observed.size - sum(observed.shape) + observed.ndim - 1
Parameters
----------
observed : array_like
The contingency table. The table contains the observed frequencies
(i.e. number of occurrences) in each category. In the two-dimensional
case, the table is often described as an "R x C table".
correction : bool, optional
If True, *and* the degrees of freedom is 1, apply Yates' correction
for continuity. The effect of the correction is to adjust each
observed value by 0.5 towards the corresponding expected value.
lambda_ : float or str, optional.
By default, the statistic computed in this test is Pearson's
chi-squared statistic [2]_. `lambda_` allows a statistic from the
Cressie-Read power divergence family [3]_ to be used instead. See
`power_divergence` for details.
Returns
-------
chi2 : float
The test statistic.
p : float
The p-value of the test
dof : int
Degrees of freedom
expected : ndarray, same shape as `observed`
The expected frequencies, based on the marginal sums of the table.
See Also
--------
contingency.expected_freq
fisher_exact
chisquare
power_divergence
Notes
-----
An often quoted guideline for the validity of this calculation is that
the test should be used only if the observed and expected frequency in
each cell is at least 5.
This is a test for the independence of different categories of a
population. The test is only meaningful when the dimension of
`observed` is two or more. Applying the test to a one-dimensional
table will always result in `expected` equal to `observed` and a
chi-square statistic equal to 0.
This function does not handle masked arrays, because the calculation
does not make sense with missing values.
Like stats.chisquare, this function computes a chi-square statistic;
the convenience this function provides is to figure out the expected
frequencies and degrees of freedom from the given contingency table.
If these were already known, and if the Yates' correction was not
required, one could use stats.chisquare. That is, if one calls::
chi2, p, dof, ex = chi2_contingency(obs, correction=False)
then the following is true::
(chi2, p) == stats.chisquare(obs.ravel(), f_exp=ex.ravel(),
ddof=obs.size - 1 - dof)
The `lambda_` argument was added in version 0.13.0 of scipy.
References
----------
.. [1] "Contingency table", http://en.wikipedia.org/wiki/Contingency_table
.. [2] "Pearson's chi-squared test",
http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test
.. [3] Cressie, N. and Read, T. R. C., "Multinomial Goodness-of-Fit
Tests", J. Royal Stat. Soc. Series B, Vol. 46, No. 3 (1984),
pp. 440-464.
Examples
--------
A two-way example (2 x 3):
>>> obs = np.array([[10, 10, 20], [20, 20, 20]])
>>> chi2_contingency(obs)
(2.7777777777777777,
0.24935220877729619,
2,
array([[ 12., 12., 16.],
[ 18., 18., 24.]]))
Perform the test using the log-likelihood ratio (i.e. the "G-test")
instead of Pearson's chi-squared statistic.
>>> g, p, dof, expctd = chi2_contingency(obs, lambda_="log-likelihood")
>>> g, p
(2.7688587616781319, 0.25046668010954165)
A four-way example (2 x 2 x 2 x 2):
>>> obs = np.array(
... [[[[12, 17],
... [11, 16]],
... [[11, 12],
... [15, 16]]],
... [[[23, 15],
... [30, 22]],
... [[14, 17],
... [15, 16]]]])
>>> chi2_contingency(obs)
(8.7584514426741897,
0.64417725029295503,
11,
array([[[[ 14.15462386, 14.15462386],
[ 16.49423111, 16.49423111]],
[[ 11.2461395 , 11.2461395 ],
[ 13.10500554, 13.10500554]]],
[[[ 19.5591166 , 19.5591166 ],
[ 22.79202844, 22.79202844]],
[[ 15.54012004, 15.54012004],
[ 18.10873492, 18.10873492]]]]))
chisqprob(chisq, df)
Probability value (1-tail) for the Chi^2 probability distribution.
Broadcasting rules apply.
Parameters
----------
chisq : array_like or float > 0
df : array_like or float, probably int >= 1
Returns
-------
chisqprob : ndarray
The area from `chisq` to infinity under the Chi^2 probability
distribution with degrees of freedom `df`.
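For example (an editorial sketch): the classic 5% critical value of the chi-square distribution with one degree of freedom is about 3.841.
>>> from scipy import stats
>>> round(stats.chisqprob(3.841, 1), 3)
0.05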
chisquare(f_obs, f_exp=None, ddof=0, axis=0)
Calculates a one-way chi square test.
The chi square test tests the null hypothesis that the categorical data
has the given frequencies.
Parameters
----------
f_obs : array_like
Observed frequencies in each category.
f_exp : array_like, optional
Expected frequencies in each category. By default the categories are
assumed to be equally likely.
ddof : int, optional
"Delta degrees of freedom": adjustment to the degrees of freedom
for the p-value. The p-value is computed using a chi-squared
distribution with ``k - 1 - ddof`` degrees of freedom, where `k`
is the number of observed frequencies. The default value of `ddof`
is 0.
axis : int or None, optional
The axis of the broadcast result of `f_obs` and `f_exp` along which to
apply the test. If axis is None, all values in `f_obs` are treated
as a single data set. Default is 0.
Returns
-------
chisq : float or ndarray
The chi-squared test statistic. The value is a float if `axis` is
None or `f_obs` and `f_exp` are 1-D.
p : float or ndarray
The p-value of the test. The value is a float if `ddof` and the
return value `chisq` are scalars.
See Also
--------
power_divergence
mstats.chisquare
Notes
-----
This test is invalid when the observed or expected frequencies in each
category are too small. A typical rule is that all of the observed
and expected frequencies should be at least 5.
The default degrees of freedom, k-1, are for the case when no parameters
of the distribution are estimated. If p parameters are estimated by
efficient maximum likelihood then the correct degrees of freedom are
k-1-p. If the parameters are estimated in a different way, then the
dof can be between k-1-p and k-1. However, it is also possible that
the asymptotic distribution is not a chisquare, in which case this
test is not appropriate.
References
----------
.. [1] Lowry, Richard. "Concepts and Applications of Inferential
Statistics". Chapter 8. http://faculty.vassar.edu/lowry/ch8pt1.html
.. [2] "Chi-squared test", http://en.wikipedia.org/wiki/Chi-squared_test
Examples
--------
When just `f_obs` is given, it is assumed that the expected frequencies
are uniform and given by the mean of the observed frequencies.
>>> chisquare([16, 18, 16, 14, 12, 12])
(2.0, 0.84914503608460956)
With `f_exp` the expected frequencies can be given.
>>> chisquare([16, 18, 16, 14, 12, 12], f_exp=[16, 16, 16, 16, 16, 8])
(3.5, 0.62338762774958223)
When `f_obs` is 2-D, by default the test is applied to each column.
>>> obs = np.array([[16, 18, 16, 14, 12, 12], [32, 24, 16, 28, 20, 24]]).T
>>> obs.shape
(6, 2)
>>> chisquare(obs)
(array([ 2. , 6.66666667]), array([ 0.84914504, 0.24663415]))
By setting ``axis=None``, the test is applied to all data in the array,
which is equivalent to applying the test to the flattened array.
>>> chisquare(obs, axis=None)
(23.31034482758621, 0.015975692534127565)
>>> chisquare(obs.ravel())
(23.31034482758621, 0.015975692534127565)
`ddof` is the change to make to the default degrees of freedom.
>>> chisquare([16, 18, 16, 14, 12, 12], ddof=1)
(2.0, 0.73575888234288467)
The calculation of the p-values is done by broadcasting the
chi-squared statistic with `ddof`.
>>> chisquare([16, 18, 16, 14, 12, 12], ddof=[0,1,2])
(2.0, array([ 0.84914504, 0.73575888, 0.5724067 ]))
`f_obs` and `f_exp` are also broadcast. In the following, `f_obs` has
shape (6,) and `f_exp` has shape (2, 6), so the result of broadcasting
`f_obs` and `f_exp` has shape (2, 6). To compute the desired chi-squared
statistics, we use ``axis=1``:
>>> chisquare([16, 18, 16, 14, 12, 12],
... f_exp=[[16, 16, 16, 16, 16, 8], [8, 20, 20, 16, 12, 12]],
... axis=1)
(array([ 3.5 , 9.25]), array([ 0.62338763, 0.09949846]))
circmean(samples, high=6.283185307179586, low=0, axis=None)
Compute the circular mean for samples in a range.
Parameters
----------
samples : array_like
Input array.
high : float or int, optional
High boundary for circular mean range. Default is ``2*pi``.
low : float or int, optional
Low boundary for circular mean range. Default is 0.
axis : int, optional
Axis along which means are computed. The default is to compute
the mean of the flattened array.
Returns
-------
circmean : float
Circular mean.
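A short editorial sketch (the angles, in degrees, are invented): values that wrap around 0/360 are averaged sensibly.
>>> from scipy import stats
>>> angles = [355., 5., 15.]              # clustered around north
>>> m = stats.circmean(angles, high=360)  # about 5 degrees; np.mean gives 125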
circstd(samples, high=6.283185307179586, low=0, axis=None)
Compute the circular standard deviation for samples assumed to be in the
range [low to high].
Parameters
----------
samples : array_like
Input array.
low : float or int, optional
Low boundary for circular standard deviation range. Default is 0.
high : float or int, optional
High boundary for circular standard deviation range.
Default is ``2*pi``.
axis : int, optional
Axis along which standard deviations are computed. The default is
to compute the standard deviation of the flattened array.
Returns
-------
circstd : float
Circular standard deviation.
Notes
-----
This uses a definition of circular standard deviation that in the limit of
small angles returns a number close to the 'linear' standard deviation.
circvar(samples, high=6.283185307179586, low=0, axis=None)
Compute the circular variance for samples assumed to be in a range
Parameters
----------
samples : array_like
Input array.
low : float or int, optional
Low boundary for circular variance range. Default is 0.
high : float or int, optional
High boundary for circular variance range. Default is ``2*pi``.
axis : int, optional
Axis along which variances are computed. The default is to compute
the variance of the flattened array.
Returns
-------
circvar : float
Circular variance.
Notes
-----
This uses a definition of circular variance that in the limit of small
angles returns a number close to the 'linear' variance.
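The companion spread measures work the same way; a sketch (editorial, with the same kind of invented angles):
>>> from scipy import stats
>>> angles = [355., 5., 15.]             # degrees, tightly clustered
>>> s = stats.circstd(angles, high=360)  # small, despite the 355 -> 5 wrap
>>> v = stats.circvar(angles, high=360)  # likewise small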
cumfreq(a, numbins=10, defaultreallimits=None, weights=None)
Returns a cumulative frequency histogram, using the histogram function.
Parameters
----------
a : array_like
Input array.
numbins : int, optional
The number of bins to use for the histogram. Default is 10.
defaultreallimits : tuple (lower, upper), optional
The lower and upper values for the range of the histogram.
If no value is given, a range slightly larger than the range of the
values in `a` is used. Specifically ``(a.min() - s, a.max() + s)``,
where ``s = (1/2)(a.max() - a.min()) / (numbins - 1)``.
weights : array_like, optional
The weights for each value in `a`. Default is None, which gives each
value a weight of 1.0
Returns
-------
cumfreq : ndarray
Binned values of cumulative frequency.
lowerreallimit : float
Lower real limit
binsize : float
Width of each bin.
extrapoints : int
Extra points.
Examples
--------
>>> x = [1, 4, 2, 1, 3, 1]
>>> cumfreqs, lowlim, binsize, extrapoints = sp.stats.cumfreq(x, numbins=4)
>>> cumfreqs
array([ 3., 4., 5., 6.])
>>> cumfreqs, lowlim, binsize, extrapoints = \
...     sp.stats.cumfreq(x, numbins=4, defaultreallimits=(1.5, 5))
>>> cumfreqs
array([ 1., 2., 3., 3.])
>>> extrapoints
3
describe(a, axis=0)
Computes several descriptive statistics of the passed array.
Parameters
----------
a : array_like
data
axis : int or None
axis along which statistics are calculated. If axis is None, then data
array is raveled. The default axis is zero.
Returns
-------
size of the data : int
length of data along axis
(min, max): tuple of ndarrays or floats
minimum and maximum value of data array
arithmetic mean : ndarray or float
mean of data along axis
unbiased variance : ndarray or float
variance of the data along axis, denominator is number of observations
minus one.
biased skewness : ndarray or float
skewness, based on moment calculations with denominator equal to the
number of observations, i.e. no degrees of freedom correction
biased kurtosis : ndarray or float
kurtosis (Fisher), the kurtosis is normalized so that it is zero for the
normal distribution. No degrees of freedom or bias correction is used.
See Also
--------
skew
kurtosis
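A sketch of how the return values unpack (an editorial addition; the data are made up):
>>> from scipy import stats
>>> n, (smin, smax), mean, var, skew, kurt = stats.describe([2, 4, 4, 4, 5, 5, 7, 9])
>>> n, mean
(8, 5.0)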
entropy(pk, qk=None, base=None)
Calculate the entropy of a distribution for given probability values.
If only probabilities `pk` are given, the entropy is calculated as
``S = -sum(pk * log(pk), axis=0)``.
If `qk` is not None, then compute a relative entropy (also known as
Kullback-Leibler divergence or Kullback-Leibler distance)
``S = sum(pk * log(pk / qk), axis=0)``.
This routine will normalize `pk` and `qk` if they don't sum to 1.
Parameters
----------
pk : sequence
Defines the (discrete) distribution. ``pk[i]`` is the (possibly
unnormalized) probability of event ``i``.
qk : sequence, optional
Sequence against which the relative entropy is computed. Should be in
the same format as `pk`.
base : float, optional
The logarithmic base to use, defaults to ``e`` (natural logarithm).
Returns
-------
S : float
The calculated entropy.
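For instance (an editorial sketch): a fair coin carries exactly one bit of entropy, and passing `qk` turns the result into a KL divergence.
>>> from scipy import stats
>>> stats.entropy([0.5, 0.5], base=2)
1.0
>>> kl = stats.entropy([0.5, 0.5], qk=[0.9, 0.1], base=2)  # relative entropy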
f_oneway(*args)
Performs a 1-way ANOVA.
The one-way ANOVA tests the null hypothesis that two or more groups have
the same population mean. The test is applied to samples from two or
more groups, possibly with differing sizes.
Parameters
----------
sample1, sample2, ... : array_like
The sample measurements for each group.
Returns
-------
F-value : float
The computed F-value of the test.
p-value : float
The associated p-value from the F-distribution.
Notes
-----
The ANOVA test has important assumptions that must be satisfied in order
for the associated p-value to be valid.
1. The samples are independent.
2. Each sample is from a normally distributed population.
3. The population standard deviations of the groups are all equal. This
property is known as homoscedasticity.
If these assumptions are not true for a given set of data, it may still be
possible to use the Kruskal-Wallis H-test (`scipy.stats.kruskal`) although
with some loss of power.
The algorithm is from Heiman [2], pp. 394-397.
References
----------
.. [1] Lowry, Richard. "Concepts and Applications of Inferential
Statistics". Chapter 14.
http://faculty.vassar.edu/lowry/ch14pt1.html
.. [2] Heiman, G.W. Research Methods in Statistics. 2002.
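A minimal sketch (editorial; the three groups are invented scores):
>>> from scipy import stats
>>> g1 = [87, 90, 78, 85, 92]
>>> g2 = [81, 74, 79, 83, 80]
>>> g3 = [95, 89, 91, 88, 94]
>>> F, p = stats.f_oneway(g1, g2, g3)  # a small p would suggest unequal means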
f_value(ER, EF, dfR, dfF)
Returns an F-statistic for a restricted vs. unrestricted model.
Parameters
----------
ER : float
`ER` is the sum of squared residuals for the restricted model
or null hypothesis
EF : float
`EF` is the sum of squared residuals for the unrestricted model
or alternate hypothesis
dfR : int
`dfR` is the degrees of freedom in the restricted model
dfF : int
`dfF` is the degrees of freedom in the unrestricted model
Returns
-------
F-statistic : float
f_value_multivariate(ER, EF, dfnum, dfden)
Returns a multivariate F-statistic.
Parameters
----------
ER : ndarray
Error associated with the null hypothesis (the Restricted model).
From a multivariate F calculation.
EF : ndarray
Error associated with the alternate hypothesis (the Full model)
From a multivariate F calculation.
dfnum : int
Degrees of freedom of the Restricted model.
dfden : int
Degrees of freedom associated with the Full model.
Returns
-------
fstat : float
The computed F-statistic.
f_value_wilks_lambda(ER, EF, dfnum, dfden, a, b)
Calculation of Wilks' lambda F-statistic for multivariate data, per
Maxwell & Delaney p. 657.
fastsort(a)
Sort an array and provide the argsort.
Parameters
----------
a : array_like
Input array.
Returns
-------
fastsort : ndarray of type int
sorted indices into the original array
find_repeats(arr)
Find repeats and repeat counts.
Parameters
----------
arr : array_like
Input array
Returns
-------
find_repeats : tuple
Returns a tuple of two 1-D ndarrays. The first ndarray are the repeats
as sorted, unique values that are repeated in `arr`. The second
ndarray are the counts mapped one-to-one of the repeated values
in the first ndarray.
Examples
--------
>>> sp.stats.find_repeats([2, 1, 2, 3, 2, 2, 5])
(array([ 2.]), array([4], dtype=int32))
>>> sp.stats.find_repeats([[10, 20, 1, 2], [5, 5, 4, 4]])
(array([ 4., 5.]), array([2, 2], dtype=int32))
fisher_exact(table, alternative='two-sided')
Performs a Fisher exact test on a 2x2 contingency table.
Parameters
----------
table : array_like of ints
A 2x2 contingency table. Elements should be non-negative integers.
alternative : {'two-sided', 'less', 'greater'}, optional
The alternative hypothesis to test against the null hypothesis.
Default is 'two-sided'.
Returns
-------
oddsratio : float
This is the prior odds ratio, not a posterior estimate.
p_value : float
P-value, the probability of obtaining a distribution at least as
extreme as the one that was actually observed, assuming that the
null hypothesis is true.
See Also
--------
chi2_contingency : Chi-square test of independence of variables in a
contingency table.
Notes
-----
The calculated odds ratio is different from the one R uses. In R language,
this implementation returns the (more common) "unconditional Maximum
Likelihood Estimate", while R uses the "conditional Maximum Likelihood
Estimate".
For tables with large numbers the (inexact) chi-square test implemented
in the function `chi2_contingency` can also be used.
Examples
--------
Say we spend a few days counting whales and sharks in the Atlantic and
Indian oceans. In the Atlantic ocean we find 8 whales and 1 shark, in the
Indian ocean 2 whales and 5 sharks. Then our contingency table is::
Atlantic Indian
whales 8 2
sharks 1 5
We use this table to find the p-value:
>>> oddsratio, pvalue = stats.fisher_exact([[8, 2], [1, 5]])
>>> pvalue
0.0349...
The probability that we would observe this or an even more imbalanced ratio
by chance is about 3.5%. A commonly used significance level is 5%; if we
adopt that, we can conclude that our observed imbalance is
statistically significant; whales prefer the Atlantic while sharks prefer
the Indian ocean.
fligner(*args, **kwds)
Perform Fligner's test for equal variances.
Fligner's test tests the null hypothesis that all input samples
are from populations with equal variances. Fligner's test is
non-parametric in contrast to Bartlett's test `bartlett` and
Levene's test `levene`.
Parameters
----------
sample1, sample2, ... : array_like
arrays of sample data. Need not be the same length
center : {'mean', 'median', 'trimmed'}, optional
keyword argument controlling which function of the data
is used in computing the test statistic. The default
is 'median'.
proportiontocut : float, optional
When `center` is 'trimmed', this gives the proportion of data points
to cut from each end. (See `scipy.stats.trim_mean`.)
Default is 0.05.
Returns
-------
Xsq : float
the test statistic
p-value : float
the p-value for the hypothesis test
Notes
-----
As with Levene's test there are three variants
of Fligner's test that differ by the measure of central
tendency used in the test. See `levene` for more information.
References
----------
.. [1] http://www.stat.psu.edu/~bgl/center/tr/TR993.ps
.. [2] Fligner, M.A. and Killeen, T.J. (1976). Distribution-free two-sample
tests for scale. 'Journal of the American Statistical Association.'
71(353), 210-213.
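A minimal sketch (editorial; two invented samples with visibly different spread):
>>> from scipy import stats
>>> a = [8.9, 9.1, 9.0, 8.8, 9.2]
>>> b = [5.0, 13.0, 9.0, 7.0, 11.0]
>>> Xsq, p = stats.fligner(a, b)  # a small p would suggest unequal variances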
fprob(*args, **kwds)
`fprob` is deprecated!
fprob is deprecated in scipy 0.14, use stats.f.sf or special.fdtrc instead
fdtrc(x1, x2, x3[, out])
fdtrc(dfn, dfd, x)
F survival function
Returns the complemented F distribution function.
friedmanchisquare(*args)
Computes the Friedman test for repeated measurements
The Friedman test tests the null hypothesis that repeated measurements of
the same individuals have the same distribution. It is often used
to test for consistency among measurements obtained in different ways.
For example, if two measurement techniques are used on the same set of
individuals, the Friedman test can be used to determine if the two
measurement techniques are consistent.
Parameters
----------
measurements1, measurements2, measurements3... : array_like
Arrays of measurements. All of the arrays must have the same number
of elements. At least 3 sets of measurements must be given.
Returns
-------
friedman chi-square statistic : float
the test statistic, correcting for ties
p-value : float
the associated p-value assuming that the test statistic has a chi
squared distribution
Notes
-----
Due to the assumption that the test statistic has a chi squared
distribution, the p-value is only reliable for n > 10 and more than
6 repeated measurements.
References
----------
.. [1] http://en.wikipedia.org/wiki/Friedman_test
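A sketch of the calling convention (editorial; three invented measurement series of equal length, though note that the n > 10 caveat above applies):
>>> from scipy import stats
>>> m1 = [8.5, 7.2, 6.9, 9.1, 7.7, 8.0, 7.4]
>>> m2 = [8.4, 7.0, 7.1, 9.0, 7.8, 7.9, 7.2]
>>> m3 = [8.9, 7.5, 7.3, 9.4, 8.1, 8.3, 7.6]
>>> stat, p = stats.friedmanchisquare(m1, m2, m3)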
gmean(a, axis=0, dtype=None)
Compute the geometric mean along the specified axis.
Returns the geometric average of the array elements.
That is: n-th root of (x1 * x2 * ... * xn)
Parameters
----------
a : array_like
Input array or object that can be converted to an array.
axis : int, optional, default axis=0
Axis along which the geometric mean is computed.
dtype : dtype, optional
Type of the returned array and of the accumulator in which the
elements are summed. If dtype is not specified, it defaults to the
dtype of a, unless a has an integer dtype with a precision less than
that of the default platform integer. In that case, the default
platform integer is used.
Returns
-------
gmean : ndarray
see dtype parameter above
See Also
--------
numpy.mean : Arithmetic average
numpy.average : Weighted average
hmean : Harmonic mean
Notes
-----
The geometric average is computed over a single dimension of the input
array, axis=0 by default, or all values in the array if axis=None.
float64 intermediate and return values are used for integer inputs.
Use masked arrays to ignore any non-finite values in the input or that
arise in the calculations such as Not a Number and infinity because masked
arrays automatically mask any non-finite values.
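For example (an editorial sketch):
>>> from scipy import stats
>>> g = stats.gmean([1, 2, 4])  # cube root of 1*2*4 = 8, i.e. 2 up to floating point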
histogram(a, numbins=10, defaultlimits=None, weights=None, printextras=False)
Separates the range into several bins and returns the number of instances
in each bin.
Parameters
----------
a : array_like
Array of scores which will be put into bins.
numbins : int, optional
The number of bins to use for the histogram. Default is 10.
defaultlimits : tuple (lower, upper), optional
The lower and upper values for the range of the histogram.
If no value is given, a range slightly larger than the range of the
values in `a` is used. Specifically ``(a.min() - s, a.max() + s)``,
where ``s = (1/2)(a.max() - a.min()) / (numbins - 1)``.
weights : array_like, optional
The weights for each value in `a`. Default is None, which gives each
value a weight of 1.0
printextras : bool, optional
If True, if there are extra points (i.e. the points that fall outside
the bin limits) a warning is raised saying how many of those points
there are. Default is False.
Returns
-------
histogram : ndarray
Number of points (or sum of weights) in each bin.
low_range : float
Lowest value of histogram, the lower limit of the first bin.
binsize : float
The size of the bins (all bins have the same size).
extrapoints : int
The number of points outside the range of the histogram.
See Also
--------
numpy.histogram
Notes
-----
This histogram is based on numpy's histogram but has a larger range by
default if default limits is not set.
histogram2(a, bins)
Compute histogram using divisions in bins.
Count the number of times values from array `a` fall into
numerical ranges defined by `bins`. Range x is given by
bins[x] <= range_x < bins[x+1] where x = 0, ..., N-2 and N is the
length of the `bins` array. The last range is given by
bins[N-1] <= range_N < infinity. Values less than bins[0] are
not included in the histogram.
Parameters
----------
a : array_like of rank 1
The array of values to be assigned into bins
bins : array_like of rank 1
Defines the ranges of values to use during histogramming.
Returns
-------
histogram2 : ndarray of rank 1
Each value represents the occurrences for a given bin (range) of
values.
hmean(a, axis=0, dtype=None)
Calculates the harmonic mean along the specified axis.
That is: n / (1/x1 + 1/x2 + ... + 1/xn)
Parameters
----------
a : array_like
Input array, masked array or object that can be converted to an array.
axis : int, optional, default axis=0
Axis along which the harmonic mean is computed.
dtype : dtype, optional
Type of the returned array and of the accumulator in which the
elements are summed. If `dtype` is not specified, it defaults to the
dtype of `a`, unless `a` has an integer `dtype` with a precision less
than that of the default platform integer. In that case, the default
platform integer is used.
Returns
-------
hmean : ndarray
see `dtype` parameter above
See Also
--------
numpy.mean : Arithmetic average
numpy.average : Weighted average
gmean : Geometric mean
Notes
-----
The harmonic mean is computed over a single dimension of the input
array, axis=0 by default, or all values in the array if axis=None.
float64 intermediate and return values are used for integer inputs.
Use masked arrays to ignore any non-finite values in the input or that
arise in the calculations such as Not a Number and infinity.
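For example (an editorial sketch):
>>> from scipy import stats
>>> h = stats.hmean([1, 2, 4])  # 3 / (1/1 + 1/2 + 1/4) = 12/7, about 1.714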
itemfreq(a)
Returns a 2-D array of item frequencies.
Parameters
----------
a : (N,) array_like
Input array.
Returns
-------
itemfreq : (K, 2) ndarray
A 2-D frequency table. Column 1 contains sorted, unique values from
`a`, column 2 contains their respective counts.
Examples
--------
>>> a = np.array([1, 1, 5, 0, 1, 2, 2, 0, 1, 4])
>>> stats.itemfreq(a)
array([[ 0., 2.],
[ 1., 4.],
[ 2., 2.],
[ 4., 1.],
[ 5., 1.]])
>>> np.bincount(a)
array([2, 4, 2, 0, 1, 1])
>>> stats.itemfreq(a/10.)
array([[ 0. , 2. ],
[ 0.1, 4. ],
[ 0.2, 2. ],
[ 0.4, 1. ],
[ 0.5, 1. ]])
jarque_bera(x)
Perform the Jarque-Bera goodness of fit test on sample data.
The Jarque-Bera test tests whether the sample data has the skewness and
kurtosis matching a normal distribution.
Note that this test only works for a large enough number of data samples
(>2000) as the test statistic asymptotically has a Chi-squared distribution
with 2 degrees of freedom.
Parameters
----------
x : array_like
Observations of a random variable.
Returns
-------
jb_value : float
The test statistic.
p : float
The p-value for the hypothesis test.
References
----------
.. [1] Jarque, C. and Bera, A. (1980) "Efficient tests for normality,
homoscedasticity and serial independence of regression residuals",
Economics Letters 6, 255-259.
Examples
--------
>>> from scipy import stats
>>> np.random.seed(987654321)
>>> x = np.random.normal(0, 1, 100000)
>>> y = np.random.rayleigh(1, 100000)
>>> stats.jarque_bera(x)
(4.7165707989581342, 0.09458225503041906)
>>> stats.jarque_bera(y)
(6713.7098548143422, 0.0)
kendalltau(x, y, initial_lexsort=True)
Calculates Kendall's tau, a correlation measure for ordinal data.
Kendall's tau is a measure of the correspondence between two rankings.
Values close to 1 indicate strong agreement, values close to -1 indicate
strong disagreement. This is the tau-b version of Kendall's tau which
accounts for ties.
Parameters
----------
x, y : array_like
Arrays of rankings, of the same shape. If arrays are not 1-D, they will
be flattened to 1-D.
initial_lexsort : bool, optional
Whether to use lexsort or quicksort as the sorting method for the
initial sort of the inputs. Default is lexsort (True), for which
`kendalltau` is of complexity O(n log(n)). If False, the complexity is
O(n^2), but with a smaller pre-factor (so quicksort may be faster for
small arrays).
Returns
-------
Kendall's tau : float
The tau statistic.
p-value : float
The two-sided p-value for a hypothesis test whose null hypothesis is
an absence of association, tau = 0.
Notes
-----
The definition of Kendall's tau that is used is::
tau = (P - Q) / sqrt((P + Q + T) * (P + Q + U))
where P is the number of concordant pairs, Q the number of discordant
pairs, T the number of ties only in `x`, and U the number of ties only in
`y`. If a tie occurs for the same pair in both `x` and `y`, it is not
added to either T or U.
References
----------
W.R. Knight, "A Computer Method for Calculating Kendall's Tau with
Ungrouped Data", Journal of the American Statistical Association, Vol. 61,
No. 314, Part 1, pp. 436-439, 1966.
Examples
--------
>>> x1 = [12, 2, 1, 12, 2]
>>> x2 = [1, 4, 7, 1, 0]
>>> tau, p_value = sp.stats.kendalltau(x1, x2)
>>> tau
-0.47140452079103173
>>> p_value
0.24821309157521476
kruskal(*args)
Compute the Kruskal-Wallis H-test for independent samples
The Kruskal-Wallis H-test tests the null hypothesis that the population
medians of all of the groups are equal. It is a non-parametric version of
ANOVA. The test works on 2 or more independent samples, which may have
different sizes. Note that rejecting the null hypothesis does not
indicate which of the groups differs. Post-hoc comparisons between
groups are required to determine which groups are different.
Parameters
----------
sample1, sample2, ... : array_like
Two or more arrays with the sample measurements can be given as
arguments.
Returns
-------
H-statistic : float
The Kruskal-Wallis H statistic, corrected for ties
p-value : float
The p-value for the test using the assumption that H has a chi
square distribution
Notes
-----
Due to the assumption that H has a chi square distribution, the number
of samples in each group must not be too small. A typical rule is
that each sample must have at least 5 measurements.
References
----------
.. [1] http://en.wikipedia.org/wiki/Kruskal-Wallis_one-way_analysis_of_variance
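A minimal sketch (editorial; two invented samples, each just meeting the five-measurement rule of thumb):
>>> from scipy import stats
>>> x = [1.1, 3.4, 2.0, 4.5, 2.8]
>>> y = [5.1, 6.0, 4.9, 7.2, 5.5]
>>> H, p = stats.kruskal(x, y)  # a small p would suggest different medians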
ks_2samp(data1, data2)
Computes the Kolmogorov-Smirnov statistic on 2 samples.
This is a two-sided test for the null hypothesis that 2 independent samples
are drawn from the same continuous distribution.
Parameters
----------
data1, data2 : sequence of 1-D ndarrays
two arrays of sample observations assumed to be drawn from a continuous
distribution, sample sizes can be different
Returns
-------
D : float
KS statistic
p-value : float
two-tailed p-value
Notes
-----
This tests whether 2 samples are drawn from the same distribution. Note
that, like in the case of the one-sample K-S test, the distribution is
assumed to be continuous.
This is the two-sided test, one-sided tests are not implemented.
The test uses the two-sided asymptotic Kolmogorov-Smirnov distribution.
If the K-S statistic is small or the p-value is high, then we cannot
reject the hypothesis that the distributions of the two samples
are the same.
Examples
--------
>>> from scipy import stats
>>> np.random.seed(12345678) #fix random seed to get the same result
>>> n1 = 200 # size of first sample
>>> n2 = 300 # size of second sample
For a different distribution, we can reject the null hypothesis since the
pvalue is below 1%:
>>> rvs1 = stats.norm.rvs(size=n1, loc=0., scale=1)
>>> rvs2 = stats.norm.rvs(size=n2, loc=0.5, scale=1.5)
>>> stats.ks_2samp(rvs1, rvs2)
(0.20833333333333337, 4.6674975515806989e-005)
For a slightly different distribution, we cannot reject the null hypothesis
at a 10% or lower alpha since the p-value at 0.144 is higher than 10%
>>> rvs3 = stats.norm.rvs(size=n2, loc=0.01, scale=1.0)
>>> stats.ks_2samp(rvs1, rvs3)
(0.10333333333333333, 0.14498781825751686)
For an identical distribution, we cannot reject the null hypothesis since
the p-value is high, 41%:
>>> rvs4 = stats.norm.rvs(size=n2, loc=0.0, scale=1.0)
>>> stats.ks_2samp(rvs1, rvs4)
(0.07999999999999996, 0.41126949729859719)
ksprob(*args, **kwds)
`ksprob` is deprecated!
ksprob is deprecated in scipy 0.14, use stats.kstwobign.sf or special.kolmogorov instead
kolmogorov(x[, out])
kolmogorov(y)
Complementary cumulative distribution function of Kolmogorov distribution
Returns the complementary cumulative distribution function of
Kolmogorov's limiting distribution (Kn* for large n) of a
two-sided test for equality between an empirical and a theoretical
distribution. It is equal to the (limit as n->infinity of the)
probability that sqrt(n) * max absolute deviation > y.
kstat(data, n=2)
Return the nth k-statistic (1<=n<=4 so far).
The nth k-statistic is the unique symmetric unbiased estimator of the nth
cumulant kappa_n.
Parameters
----------
data : array_like
Input array.
n : int, {1, 2, 3, 4}, optional
Default is equal to 2.
Returns
-------
kstat : float
The nth k-statistic.
See Also
--------
kstatvar: Returns an unbiased estimator of the variance of the k-statistic.
Notes
-----
The cumulants are related to central moments but are specifically defined
using a power series expansion of the logarithm of the characteristic
function (which is the Fourier transform of the PDF).
In particular let phi(t) be the characteristic function, then::
ln phi(t) = sum_{n=0}^{inf} kappa_n (it)^n / n!
The first few cumulants (kappa_n) in terms of central moments (mu_n) are::
kappa_1 = mu_1
kappa_2 = mu_2
kappa_3 = mu_3
kappa_4 = mu_4 - 3*mu_2**2
kappa_5 = mu_5 - 10*mu_2 * mu_3
References
----------
http://mathworld.wolfram.com/k-Statistic.html
http://mathworld.wolfram.com/Cumulant.html
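For example (an editorial sketch): the second k-statistic reproduces the unbiased sample variance.
>>> from scipy import stats
>>> stats.kstat([1, 2, 3, 4, 5], n=2)
2.5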
kstatvar(data, n=2)
Returns an unbiased estimator of the variance of the k-statistic.
See `kstat` for more details of the k-statistic.
Parameters
----------
data : array_like
Input array.
n : int, {1, 2}, optional
Default is equal to 2.
Returns
-------
kstatvar : float
The nth k-statistic variance.
See Also
--------
kstat
kstest(rvs, cdf, args=(), N=20, alternative='two-sided', mode='approx')
Perform the Kolmogorov-Smirnov test for goodness of fit.
This performs a test of the distribution G(x) of an observed
random variable against a given distribution F(x). Under the null
hypothesis the two distributions are identical, G(x)=F(x). The
alternative hypothesis can be either 'two-sided' (default), 'less'
or 'greater'. The KS test is only valid for continuous distributions.
Parameters
----------
rvs : str, array or callable
If a string, it should be the name of a distribution in `scipy.stats`.
If an array, it should be a 1-D array of observations of random
variables.
If a callable, it should be a function to generate random variables;
it is required to have a keyword argument `size`.
cdf : str or callable
If a string, it should be the name of a distribution in `scipy.stats`.
If `rvs` is a string then `cdf` can be False or the same as `rvs`.
If a callable, that callable is used to calculate the cdf.
args : tuple, sequence, optional
Distribution parameters, used if `rvs` or `cdf` are strings.
N : int, optional
Sample size if `rvs` is string or callable. Default is 20.
alternative : {'two-sided', 'less','greater'}, optional
Defines the alternative hypothesis (see explanation above).
Default is 'two-sided'.
mode : 'approx' (default) or 'asymp', optional
Defines the distribution used for calculating the p-value.
- 'approx' : use approximation to exact distribution of test statistic
- 'asymp' : use asymptotic distribution of test statistic
Returns
-------
D : float
KS test statistic, either D, D+ or D-.
p-value : float
One-tailed or two-tailed p-value.
Notes
-----
In the one-sided test, the alternative is that the empirical
cumulative distribution function of the random variable is "less"
or "greater" than the cumulative distribution function F(x) of the
hypothesis, ``G(x)<=F(x)``, resp. ``G(x)>=F(x)``.
Examples
--------
>>> from scipy import stats
>>> x = np.linspace(-15, 15, 9)
>>> stats.kstest(x, 'norm')
(0.44435602715924361, 0.038850142705171065)
>>> np.random.seed(987654321) # set random seed to get the same result
>>> stats.kstest('norm', False, N=100)
(0.058352892479417884, 0.88531190944151261)
The above lines are equivalent to:
>>> np.random.seed(987654321)
>>> stats.kstest(stats.norm.rvs(size=100), 'norm')
(0.058352892479417884, 0.88531190944151261)
*Test against one-sided alternative hypothesis*
Shift distribution to larger values, so that ``cdf_dgp(x) < norm.cdf(x)``:
>>> np.random.seed(987654321)
>>> x = stats.norm.rvs(loc=0.2, size=100)
>>> stats.kstest(x,'norm', alternative = 'less')
(0.12464329735846891, 0.040989164077641749)
Reject equal distribution against alternative hypothesis: less
>>> stats.kstest(x,'norm', alternative = 'greater')
(0.0072115233216311081, 0.98531158590396395)
Don't reject equal distribution against alternative hypothesis: greater
>>> stats.kstest(x,'norm', mode='asymp')
(0.12464329735846891, 0.08944488871182088)
*Testing t distributed random variables against normal distribution*
With 100 degrees of freedom the t distribution looks close to the normal
distribution, and the K-S test does not reject the hypothesis that the
sample came from the normal distribution:
>>> np.random.seed(987654321)
>>> stats.kstest(stats.t.rvs(100,size=100),'norm')
(0.072018929165471257, 0.67630062862479168)
With 3 degrees of freedom the t distribution looks sufficiently different
from the normal distribution, that we can reject the hypothesis that the
sample came from the normal distribution at the 10% level:
>>> np.random.seed(987654321)
>>> stats.kstest(stats.t.rvs(3,size=100),'norm')
(0.131016895759829, 0.058826222555312224)
kurtosis(a, axis=0, fisher=True, bias=True)
Computes the kurtosis (Fisher or Pearson) of a dataset.
Kurtosis is the fourth central moment divided by the square of the
variance. If Fisher's definition is used, then 3.0 is subtracted from
the result to give 0.0 for a normal distribution.
If bias is False then the kurtosis is calculated using k statistics to
eliminate bias coming from biased moment estimators
Use `kurtosistest` to see if result is close enough to normal.
Parameters
----------
a : array
data for which the kurtosis is calculated
axis : int or None
Axis along which the kurtosis is calculated
fisher : bool
If True, Fisher's definition is used (normal ==> 0.0). If False,
Pearson's definition is used (normal ==> 3.0).
bias : bool
If False, then the calculations are corrected for statistical bias.
Returns
-------
kurtosis : array
The kurtosis of values along an axis. If all values are equal,
return -3 for Fisher's definition and 0 for Pearson's definition.
References
----------
.. [1] Zwillinger, D. and Kokoska, S. (2000). CRC Standard
Probability and Statistics Tables and Formulae. Chapman & Hall: New
York. 2000.
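For example (an editorial sketch): a short, flat-ish sample has negative Fisher kurtosis.
>>> from scipy import stats
>>> round(stats.kurtosis([1, 2, 3, 4, 5]), 2)  # m4/m2**2 - 3 = 6.8/4 - 3
-1.3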
kurtosistest(a, axis=0)
Tests whether a dataset has normal kurtosis
This function tests the null hypothesis that the kurtosis
of the population from which the sample was drawn is that
of the normal distribution: ``kurtosis = 3(n-1)/(n+1)``.
Parameters
----------
a : array
array of the sample data
axis : int or None
the axis to operate along, or None to work on the whole array.
The default is the first axis.
Returns
-------
z-score : float
The computed z-score for this test.
p-value : float
The 2-sided p-value for the hypothesis test
Notes
-----
Valid only for n>20. The Z-score is set to 0 for bad entries.
levene(*args, **kwds)
Perform Levene test for equal variances.
The Levene test tests the null hypothesis that all input samples
are from populations with equal variances. Levene's test is an
alternative to Bartlett's test `bartlett` in the case where
there are significant deviations from normality.
Parameters
----------
sample1, sample2, ... : array_like
The sample data, possibly with different lengths
center : {'mean', 'median', 'trimmed'}, optional
Which function of the data to use in the test. The default
is 'median'.
proportiontocut : float, optional
When `center` is 'trimmed', this gives the proportion of data points
to cut from each end. (See `scipy.stats.trim_mean`.)
Default is 0.05.
Returns
-------
W : float
The test statistic.
p-value : float
The p-value for the test.
Notes
-----
Three variations of Levene's test are possible. The possibilities
and their recommended usages are:
* 'median' : Recommended for skewed (non-normal) distributions.
* 'mean' : Recommended for symmetric, moderate-tailed distributions.
* 'trimmed' : Recommended for heavy-tailed distributions.
References
----------
.. [1] http://www.itl.nist.gov/div898/handbook/eda/section3/eda35a.htm
.. [2] Levene, H. (1960). In Contributions to Probability and Statistics:
Essays in Honor of Harold Hotelling, I. Olkin et al. eds.,
Stanford University Press, pp. 278-292.
.. [3] Brown, M. B. and Forsythe, A. B. (1974), Journal of the American
Statistical Association, 69, 364-367
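A minimal sketch (editorial; two invented samples with different spread):
>>> from scipy import stats
>>> a = [25.1, 24.8, 25.0, 25.3, 24.9]
>>> b = [22.0, 28.5, 25.0, 20.7, 29.8]
>>> W, p = stats.levene(a, b, center='median')  # 'median' is the default choice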
linregress(x, y=None)
Calculate a regression line
This computes a least-squares regression for two sets of measurements.
Parameters
----------
x, y : array_like
two sets of measurements. Both arrays should have the same length.
If only x is given (and y=None), then it must be a two-dimensional
array where one dimension has length 2. The two sets of measurements
are then found by splitting the array along the length-2 dimension.
Returns
-------
slope : float
slope of the regression line
intercept : float
intercept of the regression line
r-value : float
correlation coefficient
p-value : float
two-sided p-value for a hypothesis test whose null hypothesis is
that the slope is zero.
stderr : float
Standard error of the estimate
Examples
--------
>>> from scipy import stats
>>> import numpy as np
>>> x = np.random.random(10)
>>> y = np.random.random(10)
>>> slope, intercept, r_value, p_value, std_err = stats.linregress(x,y)
# To get coefficient of determination (r_squared)
>>> print("r-squared:", r_value**2)
r-squared: 0.15286643777
mannwhitneyu(x, y, use_continuity=True)
Computes the Mann-Whitney rank test on samples x and y.
Parameters
----------
x, y : array_like
Array of samples, should be one-dimensional.
use_continuity : bool, optional
Whether a continuity correction (1/2.) should be taken into
account. Default is True.
Returns
-------
u : float
The Mann-Whitney statistic.
prob : float
One-sided p-value assuming an asymptotic normal distribution.
Notes
-----
Use only when the number of observations in each sample is > 20 and
you have 2 independent samples of ranks. Mann-Whitney U is
significant if the obtained u is less than or equal to the critical
value of U.
This test corrects for ties and by default uses a continuity correction.
The reported p-value is for a one-sided hypothesis, to get the two-sided
p-value multiply the returned p-value by 2.
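A sketch that also applies the doubling rule from the note above (editorial; the samples are random draws):
>>> from scipy import stats
>>> import numpy as np
>>> np.random.seed(0)
>>> x = np.random.normal(0.0, 1.0, 50)
>>> y = np.random.normal(0.5, 1.0, 50)
>>> u, p_one = stats.mannwhitneyu(x, y)
>>> p_two = 2 * p_one  # two-sided p-value, per the note above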
mode(a, axis=0)
Returns an array of the modal (most common) value in the passed array.
If there is more than one such value, only the first is returned.
The bin-count for the modal bins is also returned.
Parameters
----------
a : array_like
n-dimensional array of which to find mode(s).
axis : int, optional
Axis along which to operate. Default is 0, i.e. the first axis.
Returns
-------
vals : ndarray
Array of modal values.
counts : ndarray
Array of counts for each mode.
Examples
--------
>>> a = np.array([[6, 8, 3, 0],
[3, 2, 1, 7],
[8, 1, 8, 4],
[5, 3, 0, 5],
[4, 7, 5, 9]])
>>> from scipy import stats
>>> stats.mode(a)
(array([[ 3., 1., 0., 0.]]), array([[ 1., 1., 1., 1.]]))
To get mode of whole array, specify axis=None:
>>> stats.mode(a, axis=None)
(array([ 3.]), array([ 3.]))
moment(a, moment=1, axis=0)
Calculates the nth moment about the mean for a sample.
Generally used to calculate coefficients of skewness and
kurtosis.
Parameters
----------
a : array_like
data
moment : int
order of central moment that is returned
axis : int or None
Axis along which the central moment is computed. If None, then the data
array is raveled. The default axis is zero.
Returns
-------
n-th central moment : ndarray or float
The appropriate moment along the given axis or over all values if axis
is None. The denominator for the moment calculation is the number of
observations, no degrees of freedom correction is done.
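For example (an editorial sketch): the second central moment is the biased variance.
>>> from scipy import stats
>>> stats.moment([1, 2, 3, 4, 5], moment=2)
2.0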
mood(x, y, axis=0)
Perform Mood's test for equal scale parameters.
Mood's two-sample test for scale parameters is a non-parametric
test for the null hypothesis that two samples are drawn from the
same distribution with the same scale parameter.
Parameters
----------
x, y : array_like
Arrays of sample data.
axis : int, optional
The axis along which the samples are tested. `x` and `y` can be of
different length along `axis`.
If `axis` is None, `x` and `y` are flattened and the test is done on
all values in the flattened arrays.
Returns
-------
z : scalar or ndarray
The z-score for the hypothesis test. For 1-D inputs a scalar is
returned;
p-value : scalar ndarray
The p-value for the hypothesis test.
See Also
--------
fligner : A non-parametric test for the equality of k variances
ansari : A non-parametric test for the equality of 2 variances
bartlett : A parametric test for equality of k variances in normal samples
levene : A parametric test for equality of k variances
Notes
-----
The data are assumed to be drawn from probability distributions ``f(x)``
and ``f(x/s) / s`` respectively, for some probability density function f.
The null hypothesis is that ``s == 1``.
For multi-dimensional arrays, if the inputs are of shapes
``(n0, n1, n2, n3)`` and ``(n0, m1, n2, n3)``, then if ``axis=1``, the
resulting z and p values will have shape ``(n0, n2, n3)``. Note that
``n1`` and ``m1`` don't have to be equal, but the other dimensions do.
Examples
--------
>>> from scipy import stats
>>> x2 = np.random.randn(2, 45, 6, 7)
>>> x1 = np.random.randn(2, 30, 6, 7)
>>> z, p = stats.mood(x1, x2, axis=1)
>>> p.shape
(2, 6, 7)
Find the number of points where the difference in scale is not significant:
>>> (p > 0.1).sum()
74
Perform the test with different scales:
>>> x1 = np.random.randn(2, 30)
>>> x2 = np.random.randn(2, 35) * 10.0
>>> stats.mood(x1, x2, axis=1)
(array([-5.84332354, -5.6840814 ]), array([5.11694980e-09, 1.31517628e-08]))
mvsdist(data)
'Frozen' distributions for mean, variance, and standard deviation of data.
Parameters
----------
data : array_like
Input array. Converted to 1-D using ravel.
Requires 2 or more data-points.
Returns
-------
mdist : "frozen" distribution object
Distribution object representing the mean of the data
vdist : "frozen" distribution object
Distribution object representing the variance of the data
sdist : "frozen" distribution object
Distribution object representing the standard deviation of the data
Notes
-----
The return values from bayes_mvs(data) are equivalent to
``tuple((x.mean(), x.interval(0.90)) for x in mvsdist(data))``.
In other words, calling ``<dist>.mean()`` and ``<dist>.interval(0.90)``
on the three distribution objects returned from this function will give
the same results that are returned from `bayes_mvs`.
Examples
--------
>>> from scipy.stats import mvsdist
>>> data = [6, 9, 12, 7, 8, 8, 13]
>>> mean, var, std = mvsdist(data)
We now have frozen distribution objects "mean", "var" and "std" that we can
examine:
>>> mean.mean()
9.0
>>> mean.interval(0.95)
(6.6120585482655692, 11.387941451734431)
>>> mean.std()
1.1952286093343936
nanmean(x, axis=0)
Compute the mean over the given axis ignoring nans.
Parameters
----------
x : ndarray
Input array.
axis : int, optional
Axis along which the mean is computed. Default is 0, i.e. the
first axis.
Returns
-------
m : float
The mean of `x`, ignoring nans.
See Also
--------
nanstd, nanmedian
Examples
--------
>>> from scipy import stats
>>> a = np.linspace(0, 4, 3)
>>> a
array([ 0., 2., 4.])
>>> a[-1] = np.nan
>>> stats.nanmean(a)
1.0
nanmedian(x, axis=0)
Compute the median along the given axis ignoring nan values.
Parameters
----------
x : array_like
Input array.
axis : int, optional
Axis along which the median is computed. Default is 0, i.e. the
first axis.
Returns
-------
m : float
The median of `x` along `axis`.
See Also
--------
nanstd, nanmean
Examples
--------
>>> from scipy import stats
>>> a = np.array([0, 3, 1, 5, 5, np.nan])
>>> stats.nanmedian(a)
array(3.0)
>>> b = np.array([0, 3, 1, 5, 5, np.nan, 5])
>>> stats.nanmedian(b)
array(4.0)
Example with axis:
>>> c = np.arange(30.).reshape(5,6)
>>> idx = np.array([False, False, False, True, False] * 6).reshape(5,6)
>>> c[idx] = np.nan
>>> c
array([[ 0., 1., 2., nan, 4., 5.],
[ 6., 7., nan, 9., 10., 11.],
[ 12., nan, 14., 15., 16., 17.],
[ nan, 19., 20., 21., 22., nan],
[ 24., 25., 26., 27., nan, 29.]])
>>> stats.nanmedian(c, axis=1)
array([ 2. , 9. , 15. , 20.5, 26. ])
nanstd(x, axis=0, bias=False)
Compute the standard deviation over the given axis, ignoring nans.
Parameters
----------
x : array_like
Input array.
axis : int or None, optional
Axis along which the standard deviation is computed. Default is 0.
If None, compute over the whole array `x`.
bias : bool, optional
If True, the biased (normalized by N) definition is used. If False
(default), the unbiased definition is used.
Returns
-------
s : float
The standard deviation.
See Also
--------
nanmean, nanmedian
Examples
--------
>>> from scipy import stats
>>> a = np.arange(10, dtype=float)
>>> a[1:3] = np.nan
>>> np.std(a)
nan
>>> stats.nanstd(a)
2.9154759474226504
>>> stats.nanstd(a.reshape(2, 5), axis=1)
array([ 2.0817, 1.5811])
>>> stats.nanstd(a.reshape(2, 5), axis=None)
2.9154759474226504
normaltest(a, axis=0)
Tests whether a sample differs from a normal distribution.
This function tests the null hypothesis that a sample comes
from a normal distribution. It is based on D'Agostino and
Pearson's [1]_, [2]_ test that combines skew and kurtosis to
produce an omnibus test of normality.
Parameters
----------
a : array_like
The array containing the data to be tested.
axis : int or None
If None, the array is treated as a single data set, regardless of
its shape. Otherwise, each 1-d array along axis `axis` is tested.
Returns
-------
k2 : float or array
`s^2 + k^2`, where `s` is the z-score returned by `skewtest` and
`k` is the z-score returned by `kurtosistest`.
p-value : float or array
A 2-sided chi squared probability for the hypothesis test.
References
----------
.. [1] D'Agostino, R. B. (1971), "An omnibus test of normality for
moderate and large sample size," Biometrika, 58, 341-348
.. [2] D'Agostino, R. and Pearson, E. S. (1973), "Testing for
departures from normality," Biometrika, 60, 613-622
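A minimal sketch (editorial; both samples are random draws):
>>> from scipy import stats
>>> import numpy as np
>>> np.random.seed(123)
>>> k2, p = stats.normaltest(np.random.normal(size=1000))    # p large: no evidence against normality
>>> k2u, pu = stats.normaltest(np.random.uniform(size=1000)) # p tiny: clearly non-normal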
obrientransform(*args)
Computes the O'Brien transform on input data (any number of arrays).
Used to test for homogeneity of variance prior to running one-way stats.
Each array in ``*args`` is one level of a factor.
If `f_oneway` is run on the transformed data and found significant,
the variances are unequal. From Maxwell and Delaney [1]_, p.112.
Parameters
----------
args : tuple of array_like
Any number of arrays.
Returns
-------
obrientransform : ndarray
Transformed data for use in an ANOVA. The first dimension
of the result corresponds to the sequence of transformed
arrays. If the arrays given are all 1-D of the same length,
the return value is a 2-D array; otherwise it is a 1-D array
of type object, with each element being an ndarray.
References
----------
.. [1] S. E. Maxwell and H. D. Delaney, "Designing Experiments and
Analyzing Data: A Model Comparison Perspective", Wadsworth, 1990.
Examples
--------
We'll test the following data sets for differences in their variance.
>>> x = [10, 11, 13, 9, 7, 12, 12, 9, 10]
>>> y = [13, 21, 5, 10, 8, 14, 10, 12, 7, 15]
Apply the O'Brien transform to the data.
>>> from scipy.stats import obrientransform
>>> tx, ty = obrientransform(x, y)
Use `scipy.stats.f_oneway` to apply a one-way ANOVA test to the
transformed data.
>>> from scipy.stats import f_oneway
>>> F, p = f_oneway(tx, ty)
>>> p
0.1314139477040335
If we require that ``p < 0.05`` for significance, we cannot conclude
that the variances are different.
pdf_fromgamma(g1, g2, g3=0.0, g4=None)
pearsonr(x, y)
Calculates a Pearson correlation coefficient and the p-value for testing
non-correlation.
The Pearson correlation coefficient measures the linear relationship
between two datasets. Strictly speaking, Pearson's correlation requires
that each dataset be normally distributed. Like other correlation
coefficients, this one varies between -1 and +1 with 0 implying no
correlation. Correlations of -1 or +1 imply an exact linear
relationship. Positive correlations imply that as x increases, so does
y. Negative correlations imply that as x increases, y decreases.
The p-value roughly indicates the probability of an uncorrelated system
producing datasets that have a Pearson correlation at least as extreme
as the one computed from these datasets. The p-values are not entirely
reliable but are probably reasonable for datasets larger than 500 or so.
Parameters
----------
x : (N,) array_like
Input
y : (N,) array_like
Input
Returns
-------
(Pearson's correlation coefficient,
2-tailed p-value)
References
----------
http://www.statsoft.com/textbook/glosp.html#Pearson%20Correlation
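For instance, a perfectly linear relation gives a coefficient of exactly 1 (a quick sketch of ours, appended to the docstring above):
>>> from scipy import stats
>>> x = [1, 2, 3, 4, 5]
>>> y = [2, 4, 6, 8, 10]  # y = 2*x, an exact linear relationship
>>> r, p = stats.pearsonr(x, y)
>>> r
1.0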
percentileofscore(a, score, kind='rank')
The percentile rank of a score relative to a list of scores.
A `percentileofscore` of, for example, 80% means that 80% of the
scores in `a` are below the given score. In the case of gaps or
ties, the exact definition depends on the optional keyword, `kind`.
Parameters
----------
a : array_like
Array of scores to which `score` is compared.
score : int or float
Score that is compared to the elements in `a`.
kind : {'rank', 'weak', 'strict', 'mean'}, optional
This optional parameter specifies the interpretation of the
resulting score:
- "rank": Average percentage ranking of score. In case of
multiple matches, average the percentage rankings of
all matching scores.
- "weak": This kind corresponds to the definition of a cumulative
distribution function. A percentileofscore of 80%
means that 80% of values are less than or equal
to the provided score.
- "strict": Similar to "weak", except that only values that are
strictly less than the given score are counted.
- "mean": The average of the "weak" and "strict" scores, often used in
testing. See
http://en.wikipedia.org/wiki/Percentile_rank
Returns
-------
pcos : float
Percentile-position of score (0-100) relative to `a`.
Examples
--------
Three-quarters of the given values lie below a given score:
>>> percentileofscore([1, 2, 3, 4], 3)
75.0
With multiple matches, note how the scores of the two matches, 0.6
and 0.8 respectively, are averaged:
>>> percentileofscore([1, 2, 3, 3, 4], 3)
70.0
Only 2/5 values are strictly less than 3:
>>> percentileofscore([1, 2, 3, 3, 4], 3, kind='strict')
40.0
But 4/5 values are less than or equal to 3:
>>> percentileofscore([1, 2, 3, 3, 4], 3, kind='weak')
80.0
The average between the weak and the strict scores is
>>> percentileofscore([1, 2, 3, 3, 4], 3, kind='mean')
60.0
pointbiserialr(x, y)
Calculates a point biserial correlation coefficient and the associated
p-value.
The point biserial correlation is used to measure the relationship
between a binary variable, x, and a continuous variable, y. Like other
correlation coefficients, this one varies between -1 and +1 with 0
implying no correlation. Correlations of -1 or +1 imply a determinative
relationship.
This function uses a shortcut formula but produces the same result as
`pearsonr`.
Parameters
----------
x : array_like of bools
Input array.
y : array_like
Input array.
Returns
-------
r : float
R value
p-value : float
2-tailed p-value
References
----------
http://en.wikipedia.org/wiki/Point-biserial_correlation_coefficient
Examples
--------
>>> from scipy import stats
>>> a = np.array([0, 0, 0, 1, 1, 1, 1])
>>> b = np.arange(7)
>>> stats.pointbiserialr(a, b)
(0.8660254037844386, 0.011724811003954652)
>>> stats.pearsonr(a, b)
(0.86602540378443871, 0.011724811003954626)
>>> np.corrcoef(a, b)
array([[ 1. , 0.8660254],
[ 0.8660254, 1. ]])
power_divergence(f_obs, f_exp=None, ddof=0, axis=0, lambda_=None)
Cressie-Read power divergence statistic and goodness of fit test.
This function tests the null hypothesis that the categorical data
has the given frequencies, using the Cressie-Read power divergence
statistic.
Parameters
----------
f_obs : array_like
Observed frequencies in each category.
f_exp : array_like, optional
Expected frequencies in each category. By default the categories are
assumed to be equally likely.
ddof : int, optional
"Delta degrees of freedom": adjustment to the degrees of freedom
for the p-value. The p-value is computed using a chi-squared
distribution with ``k - 1 - ddof`` degrees of freedom, where `k`
is the number of observed frequencies. The default value of `ddof`
is 0.
axis : int or None, optional
The axis of the broadcast result of `f_obs` and `f_exp` along which to
apply the test. If axis is None, all values in `f_obs` are treated
as a single data set. Default is 0.
lambda_ : float or str, optional
`lambda_` gives the power in the Cressie-Read power divergence
statistic. The default is 1. For convenience, `lambda_` may be
assigned one of the following strings, in which case the
corresponding numerical value is used::
String Value Description
"pearson" 1 Pearson's chi-squared statistic.
In this case, the function is
equivalent to `stats.chisquare`.
"log-likelihood" 0 Log-likelihood ratio. Also known as
the G-test [3]_.
"freeman-tukey" -1/2 Freeman-Tukey statistic.
"mod-log-likelihood" -1 Modified log-likelihood ratio.
"neyman" -2 Neyman's statistic.
"cressie-read" 2/3 The power recommended in [5]_.
Returns
-------
stat : float or ndarray
The Cressie-Read power divergence test statistic. The value is
a float if `axis` is None or if `f_obs` and `f_exp` are 1-D.
p : float or ndarray
The p-value of the test. The value is a float if `ddof` and the
return value `stat` are scalars.
See Also
--------
chisquare
Notes
-----
This test is invalid when the observed or expected frequencies in each
category are too small. A typical rule is that all of the observed
and expected frequencies should be at least 5.
When `lambda_` is less than zero, the formula for the statistic involves
dividing by `f_obs`, so a warning or error may be generated if any value
in `f_obs` is 0.
Similarly, a warning or error may be generated if any value in `f_exp` is
zero when `lambda_` >= 0.
The default degrees of freedom, k-1, are for the case when no parameters
of the distribution are estimated. If p parameters are estimated by
efficient maximum likelihood then the correct degrees of freedom are
k-1-p. If the parameters are estimated in a different way, then the
dof can be between k-1-p and k-1. However, it is also possible that
the asymptotic distribution is not a chisquare, in which case this
test is not appropriate.
This function handles masked arrays. If an element of `f_obs` or `f_exp`
is masked, then data at that position is ignored, and does not count
towards the size of the data set.
.. versionadded:: 0.13.0
References
----------
.. [1] Lowry, Richard. "Concepts and Applications of Inferential
Statistics". Chapter 8. http://faculty.vassar.edu/lowry/ch8pt1.html
.. [2] "Chi-squared test", http://en.wikipedia.org/wiki/Chi-squared_test
.. [3] "G-test", http://en.wikipedia.org/wiki/G-test
.. [4] Sokal, R. R. and Rohlf, F. J. "Biometry: the principles and
practice of statistics in biological research", New York: Freeman
(1981)
.. [5] Cressie, N. and Read, T. R. C., "Multinomial Goodness-of-Fit
Tests", J. Royal Stat. Soc. Series B, Vol. 46, No. 3 (1984),
pp. 440-464.
Examples
--------
(See `chisquare` for more examples.)
When just `f_obs` is given, it is assumed that the expected frequencies
are uniform and given by the mean of the observed frequencies. Here we
perform a G-test (i.e. use the log-likelihood ratio statistic):
>>> power_divergence([16, 18, 16, 14, 12, 12], lambda_='log-likelihood')
(2.006573162632538, 0.84823476779463769)
The expected frequencies can be given with the `f_exp` argument:
>>> power_divergence([16, 18, 16, 14, 12, 12],
... f_exp=[16, 16, 16, 16, 16, 8],
... lambda_='log-likelihood')
(3.5, 0.62338762774958223)
When `f_obs` is 2-D, by default the test is applied to each column.
>>> obs = np.array([[16, 18, 16, 14, 12, 12], [32, 24, 16, 28, 20, 24]]).T
>>> obs.shape
(6, 2)
>>> power_divergence(obs, lambda_="log-likelihood")
(array([ 2.00657316, 6.77634498]), array([ 0.84823477, 0.23781225]))
By setting ``axis=None``, the test is applied to all data in the array,
which is equivalent to applying the test to the flattened array.
>>> power_divergence(obs, axis=None)
(23.31034482758621, 0.015975692534127565)
>>> power_divergence(obs.ravel())
(23.31034482758621, 0.015975692534127565)
`ddof` is the change to make to the default degrees of freedom.
>>> power_divergence([16, 18, 16, 14, 12, 12], ddof=1)
(2.0, 0.73575888234288467)
The calculation of the p-values is done by broadcasting the
test statistic with `ddof`.
>>> power_divergence([16, 18, 16, 14, 12, 12], ddof=[0,1,2])
(2.0, array([ 0.84914504, 0.73575888, 0.5724067 ]))
`f_obs` and `f_exp` are also broadcast. In the following, `f_obs` has
shape (6,) and `f_exp` has shape (2, 6), so the result of broadcasting
`f_obs` and `f_exp` has shape (2, 6). To compute the desired chi-squared
statistics, we must use ``axis=1``:
>>> power_divergence([16, 18, 16, 14, 12, 12],
... f_exp=[[16, 16, 16, 16, 16, 8],
... [8, 20, 20, 16, 12, 12]],
... axis=1)
(array([ 3.5 , 9.25]), array([ 0.62338763, 0.09949846]))
ppcc_max(x, brack=(0.0, 1.0), dist='tukeylambda')
Returns the shape parameter that maximizes the probability plot
correlation coefficient for the given data to a one-parameter
family of distributions.
See also ppcc_plot
ppcc_plot(x, a, b, dist='tukeylambda', plot=None, N=80)
Returns (shape, ppcc), and optionally plots shape vs. ppcc
(probability plot correlation coefficient) as a function of shape
parameter for a one-parameter family of distributions from shape
value a to b.
See also ppcc_max
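A brief sketch of recovering a known shape parameter with `ppcc_max` (illustrative only; the estimate should be close to, not equal to, the true value):
>>> import numpy as np
>>> from scipy import stats
>>> np.random.seed(1234)
>>> x = stats.tukeylambda.rvs(0.7, size=1000)
>>> shape = stats.ppcc_max(x)  # should land near the true value 0.7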
probplot(x, sparams=(), dist='norm', fit=True, plot=None)
Calculate quantiles for a probability plot, and optionally show the plot.
Generates a probability plot of sample data against the quantiles of a
specified theoretical distribution (the normal distribution by default).
`probplot` optionally calculates a best-fit line for the data and plots the
results using Matplotlib or a given plot function.
Parameters
----------
x : array_like
Sample/response data from which `probplot` creates the plot.
sparams : tuple, optional
Distribution-specific shape parameters (shape parameters plus location
and scale).
dist : str or stats.distributions instance, optional
Distribution or distribution function name. The default is 'norm' for a
normal probability plot. Objects that look enough like a
stats.distributions instance (i.e. they have a ``ppf`` method) are also
accepted.
fit : bool, optional
Fit a least-squares regression (best-fit) line to the sample data if
True (default).
plot : object, optional
If given, plots the quantiles and least squares fit.
`plot` is an object that has to have methods "plot" and "text".
The `matplotlib.pyplot` module or a Matplotlib Axes object can be used,
or a custom object with the same methods.
Default is None, which means that no plot is created.
Returns
-------
(osm, osr) : tuple of ndarrays
Tuple of theoretical quantiles (osm, or order statistic medians) and
ordered responses (osr). `osr` is simply sorted input `x`.
For details on how `osm` is calculated see the Notes section.
(slope, intercept, r) : tuple of floats, optional
Tuple containing the result of the least-squares fit, if that is
performed by `probplot`. `r` is the square root of the coefficient of
determination. If ``fit=False`` and ``plot=None``, this tuple is not
returned.
Notes
-----
Even if `plot` is given, the figure is not shown or saved by `probplot`;
``plt.show()`` or ``plt.savefig('figname.png')`` should be used after
calling `probplot`.
`probplot` generates a probability plot, which should not be confused with
a Q-Q or a P-P plot. Statsmodels has more extensive functionality of this
type, see ``statsmodels.api.ProbPlot``.
The formula used for the theoretical quantiles (horizontal axis of the
probability plot) is Filliben's estimate::
quantiles = dist.ppf(val), where
val = 1 - 0.5**(1/n),               for i = 1
val = (i - 0.3175) / (n + 0.365),   for i = 2, ..., n-1
val = 0.5**(1/n),                   for i = n
where ``i`` indicates the i-th ordered value and ``n`` is the total number
of values.
Examples
--------
>>> from scipy import stats
>>> import matplotlib.pyplot as plt
>>> nsample = 100
>>> np.random.seed(7654321)
A t distribution with small degrees of freedom:
>>> ax1 = plt.subplot(221)
>>> x = stats.t.rvs(3, size=nsample)
>>> res = stats.probplot(x, plot=plt)
A t distribution with larger degrees of freedom:
>>> ax2 = plt.subplot(222)
>>> x = stats.t.rvs(25, size=nsample)
>>> res = stats.probplot(x, plot=plt)
A mixture of two normal distributions with broadcasting:
>>> ax3 = plt.subplot(223)
>>> x = stats.norm.rvs(loc=[0,5], scale=[1,1.5],
...                     size=(nsample//2, 2)).ravel()
>>> res = stats.probplot(x, plot=plt)
A standard normal distribution:
>>> ax4 = plt.subplot(224)
>>> x = stats.norm.rvs(loc=0, scale=1, size=nsample)
>>> res = stats.probplot(x, plot=plt)
Produce a new figure with a loggamma distribution, using the ``dist`` and
``sparams`` keywords:
>>> fig = plt.figure()
>>> ax = fig.add_subplot(111)
>>> x = stats.loggamma.rvs(c=2.5, size=500)
>>> stats.probplot(x, dist=stats.loggamma, sparams=(2.5,), plot=ax)
>>> ax.set_title("Probplot for loggamma dist with shape parameter 2.5")
Show the results with Matplotlib:
>>> plt.show()
randwcdf(*args, **kwds)
`randwcdf` is deprecated!
Deprecated in scipy 0.14.0, use distribution-specific rvs() method instead
Returns an array of randomly distributed integers given a CDF.
Given a cumulative distribution function (CDF) returns an array of
randomly distributed integers that would satisfy the CDF.
Parameters
----------
cdf : function
CDF function that accepts a single value and `args`, and returns
a single value.
mean : float, optional
The mean of the distribution which helps the solver. Defaults
to 1.0.
args : tuple, optional
Extra arguments to the cdf function (i.e. shape, location, scale)
size : {int, None}, optional
Is the size of the output. If None, only 1 value will be returned.
Returns
-------
randwcdf : ndarray
Array of random numbers.
Notes
-----
Can use the ``scipy.stats.distributions.*.cdf`` functions for the
`cdf` parameter.
randwppf(*args, **kwds)
`randwppf` is deprecated!
Deprecated in scipy 0.14.0, use distribution-specific rvs() method instead
Returns an array of randomly distributed integers of a distribution
whose percent point function (inverse of the CDF or quantile function)
is given.
args is a tuple of extra arguments to the ppf function (i.e. shape,
location, scale), and size is the size of the output. Note the ppf
function must accept an array of q values to compute over.
rankdata(...)
rankdata(a, method='average')
Assign ranks to data, dealing with ties appropriately.
Ranks begin at 1. The `method` argument controls how ranks are assigned
to equal values. See [1]_ for further discussion of ranking methods.
Parameters
----------
a : array_like
The array of values to be ranked. The array is first flattened.
method : str, optional
The method used to assign ranks to tied elements.
The options are 'average', 'min', 'max', 'dense' and 'ordinal'.
'average':
The average of the ranks that would have been assigned to
all the tied values is assigned to each value.
'min':
The minimum of the ranks that would have been assigned to all
the tied values is assigned to each value. (This is also
referred to as "competition" ranking.)
'max':
The maximum of the ranks that would have been assigned to all
the tied values is assigned to each value.
'dense':
Like 'min', but the rank of the next highest element is assigned
the rank immediately after those assigned to the tied elements.
'ordinal':
All values are given a distinct rank, corresponding to the order
that the values occur in `a`.
The default is 'average'.
Returns
-------
ranks : ndarray
An array of length equal to the size of `a`, containing rank
scores.
Notes
-----
All floating point types are converted to numpy.float64 before ranking.
This may result in spurious ties if an input array of floats has a wider
data type than numpy.float64 (e.g. numpy.float128).
References
----------
.. [1] "Ranking", http://en.wikipedia.org/wiki/Ranking
Examples
--------
>>> rankdata([0, 2, 3, 2])
array([ 1. , 2.5, 4. , 2.5])
>>> rankdata([0, 2, 3, 2], method='min')
array([ 1., 2., 4., 2.])
>>> rankdata([0, 2, 3, 2], method='max')
array([ 1., 3., 4., 3.])
>>> rankdata([0, 2, 3, 2], method='dense')
array([ 1., 2., 3., 2.])
>>> rankdata([0, 2, 3, 2], method='ordinal')
array([ 1., 2., 4., 3.])
ranksums(x, y)
Compute the Wilcoxon rank-sum statistic for two samples.
The Wilcoxon rank-sum test tests the null hypothesis that two sets
of measurements are drawn from the same distribution. The alternative
hypothesis is that values in one sample are more likely to be
larger than the values in the other sample.
This test should be used to compare two samples from continuous
distributions. It does not handle ties between measurements
in x and y. For tie-handling and an optional continuity correction
see `scipy.stats.mannwhitneyu`.
Parameters
----------
x,y : array_like
The data from the two samples
Returns
-------
z-statistic : float
The test statistic under the large-sample approximation that the
rank sum statistic is normally distributed
p-value : float
The two-sided p-value of the test
References
----------
.. [1] http://en.wikipedia.org/wiki/Wilcoxon_rank-sum_test
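A sketch of its use on two clearly shifted samples (seed and sizes are arbitrary; outputs omitted):
>>> import numpy as np
>>> from scipy import stats
>>> np.random.seed(12345678)
>>> x = stats.norm.rvs(loc=0, size=200)
>>> y = stats.norm.rvs(loc=1, size=200)
>>> z, p = stats.ranksums(x, y)  # the shift should yield a small p-value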
relfreq(a, numbins=10, defaultreallimits=None, weights=None)
Returns a relative frequency histogram, using the histogram function.
Parameters
----------
a : array_like
Input array.
numbins : int, optional
The number of bins to use for the histogram. Default is 10.
defaultreallimits : tuple (lower, upper), optional
The lower and upper values for the range of the histogram.
If no value is given, a range slightly larger than the range of the
values in `a` is used. Specifically ``(a.min() - s, a.max() + s)``,
where ``s = (1/2)(a.max() - a.min()) / (numbins - 1)``.
weights : array_like, optional
The weights for each value in `a`. Default is None, which gives each
value a weight of 1.0
Returns
-------
relfreq : ndarray
Binned values of relative frequency.
lowerreallimit : float
Lower real limit
binsize : float
Width of each bin.
extrapoints : int
Extra points.
Examples
--------
>>> from scipy import stats
>>> a = np.array([1, 4, 2, 1, 3, 1])
>>> relfreqs, lowlim, binsize, extrapoints = stats.relfreq(a, numbins=4)
>>> relfreqs
array([ 0.5 , 0.16666667, 0.16666667, 0.16666667])
>>> np.sum(relfreqs) # relative frequencies should add up to 1
0.99999999999999989
scoreatpercentile(a, per, limit=(), interpolation_method='fraction', axis=None)
Calculate the score at a given percentile of the input sequence.
For example, the score at `per=50` is the median. If the desired quantile
lies between two data points, we interpolate between them, according to
the value of `interpolation_method`. If the parameter `limit` is provided, it
should be a tuple (lower, upper) of two values.
Parameters
----------
a : array_like
A 1-D array of values from which to extract score.
per : array_like
Percentile(s) at which to extract score. Values should be in range
[0,100].
limit : tuple, optional
Tuple of two scalars, the lower and upper limits within which to
compute the percentile. Values of `a` outside
this (closed) interval will be ignored.
interpolation_method : {'fraction', 'lower', 'higher'}, optional
This optional parameter specifies the interpolation method to use,
when the desired quantile lies between two data points `i` and `j`
- fraction: ``i + (j - i) * fraction`` where ``fraction`` is the
fractional part of the index surrounded by ``i`` and ``j``.
- lower: ``i``.
- higher: ``j``.
axis : int, optional
Axis along which the percentiles are computed. The default (None)
is to compute the median along a flattened version of the array.
Returns
-------
score : float or ndarray
Score at percentile(s).
See Also
--------
percentileofscore, numpy.percentile
Notes
-----
This function will become obsolete in the future.
For Numpy 1.9 and higher, `numpy.percentile` provides all the functionality
that `scoreatpercentile` offers, and it is significantly faster. Users with
numpy >= 1.9 are therefore recommended to use `numpy.percentile`.
Examples
--------
>>> from scipy import stats
>>> a = np.arange(100)
>>> stats.scoreatpercentile(a, 50)
49.5
sem(a, axis=0, ddof=1)
Calculates the standard error of the mean (or standard error of
measurement) of the values in the input array.
Parameters
----------
a : array_like
An array containing the values for which the standard error is
returned.
axis : int or None, optional.
If axis is None, ravel `a` first. If axis is an integer, this will be
the axis over which to operate. Defaults to 0.
ddof : int, optional
Delta degrees-of-freedom. How many degrees of freedom to adjust
for bias in limited samples relative to the population estimate
of variance. Defaults to 1.
Returns
-------
s : ndarray or float
The standard error of the mean in the sample(s), along the input axis.
Notes
-----
The default value for `ddof` is different from the default (0) used by other
ddof-containing routines, such as np.std and stats.nanstd.
Examples
--------
Find standard error along the first axis:
>>> from scipy import stats
>>> a = np.arange(20).reshape(5,4)
>>> stats.sem(a)
array([ 2.8284, 2.8284, 2.8284, 2.8284])
Find standard error across the whole array, using n degrees of freedom:
>>> stats.sem(a, axis=None, ddof=0)
1.2893796958227628
shapiro(x, a=None, reta=False)
Perform the Shapiro-Wilk test for normality.
The Shapiro-Wilk test tests the null hypothesis that the
data was drawn from a normal distribution.
Parameters
----------
x : array_like
Array of sample data.
a : array_like, optional
Array of internal parameters used in the calculation. If these
are not given, they will be computed internally. If x has length
n, then a must have length n/2.
reta : bool, optional
Whether or not to return the internally computed a values. The
default is False.
Returns
-------
W : float
The test statistic.
p-value : float
The p-value for the hypothesis test.
a : array_like, optional
If `reta` is True, then these are the internally computed "a"
values that may be passed into this function on future calls.
See Also
--------
anderson : The Anderson-Darling test for normality
References
----------
.. [1] http://www.itl.nist.gov/div898/handbook/prc/section2/prc213.htm
sigmaclip(a, low=4.0, high=4.0)
Iterative sigma-clipping of array elements.
The output array contains only those elements of the input array `a`
that satisfy the conditions ::
mean(c) - std(c)*low < c < mean(c) + std(c)*high
Starting from the full sample, all elements outside the critical range are
removed. The iteration continues with a new critical range until no
elements are outside the range.
Parameters
----------
a : array_like
Data array, will be raveled if not 1-D.
low : float, optional
Lower bound factor of sigma clipping. Default is 4.
high : float, optional
Upper bound factor of sigma clipping. Default is 4.
Returns
-------
c : ndarray
Input array with clipped elements removed.
critlower : float
Lower threshold value used for clipping.
critupper : float
Upper threshold value used for clipping.
Examples
--------
>>> a = np.concatenate((np.linspace(9.5,10.5,31), np.linspace(0,20,5)))
>>> fact = 1.5
>>> c, low, upp = sigmaclip(a, fact, fact)
>>> c
array([ 9.96666667, 10. , 10.03333333, 10. ])
>>> c.var(), c.std()
(0.00055555555555555165, 0.023570226039551501)
>>> low, c.mean() - fact*c.std(), c.min()
(9.9646446609406727, 9.9646446609406727, 9.9666666666666668)
>>> upp, c.mean() + fact*c.std(), c.max()
(10.035355339059327, 10.035355339059327, 10.033333333333333)
>>> a = np.concatenate((np.linspace(9.5,10.5,11),
...                      np.linspace(-100,-50,3)))
>>> c, low, upp = sigmaclip(a, 1.8, 1.8)
>>> (c == np.linspace(9.5,10.5,11)).all()
True
signaltonoise(a, axis=0, ddof=0)
The signal-to-noise ratio of the input data.
Returns the signal-to-noise ratio of `a`, here defined as the mean
divided by the standard deviation.
Parameters
----------
a : array_like
An array_like object containing the sample data.
axis : int or None, optional
If axis is equal to None, the array is first ravel'd. If axis is an
integer, this is the axis over which to operate. Default is 0.
ddof : int, optional
Degrees of freedom correction for standard deviation. Default is 0.
Returns
-------
s2n : ndarray
The mean to standard deviation ratio(s) along `axis`, or 0 where the
standard deviation is 0.
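For example (a small sketch; the exact printed value is omitted):
>>> import numpy as np
>>> from scipy import stats
>>> a = np.array([1., 2., 3.])
>>> stats.signaltonoise(a)  # mean 2.0 over std ~0.816, about 2.449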
skew(a, axis=0, bias=True)
Computes the skewness of a data set.
For normally distributed data, the skewness should be about 0. A skewness
value > 0 means that there is more weight in the right tail of the
distribution. The function `skewtest` can be used to determine if the
skewness value is close enough to 0, statistically speaking.
Parameters
----------
a : ndarray
data
axis : int or None
axis along which skewness is calculated
bias : bool
If False, then the calculations are corrected for statistical bias.
Returns
-------
skewness : ndarray
The skewness of values along an axis, returning 0 where all values are
equal.
References
----------
[CRCProbStat2000]_ Section 2.2.24.1
.. [CRCProbStat2000] Zwillinger, D. and Kokoska, S. (2000). CRC Standard
Probability and Statistics Tables and Formulae. Chapman & Hall: New
York. 2000.
skewtest(a, axis=0)
Tests whether the skew is different from the normal distribution.
This function tests the null hypothesis that the skewness of
the population that the sample was drawn from is the same
as that of a corresponding normal distribution.
Parameters
----------
a : array
axis : int or None
Returns
-------
z-score : float
The computed z-score for this test.
p-value : float
a 2-sided p-value for the hypothesis test
Notes
-----
The sample size must be at least 8.
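To illustrate both `skew` and `skewtest` on simulated normal data (our sketch; outputs depend on the draw):
>>> import numpy as np
>>> from scipy import stats
>>> np.random.seed(12345678)
>>> x = stats.norm.rvs(size=1000)
>>> stats.skew(x)             # should be near 0 for normal data
>>> z, p = stats.skewtest(x)  # a large p-value -> no evidence of skew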
spearmanr(a, b=None, axis=0)
Calculates a Spearman rank-order correlation coefficient and the p-value
to test for non-correlation.
The Spearman correlation is a nonparametric measure of the monotonicity
of the relationship between two datasets. Unlike the Pearson correlation,
the Spearman correlation does not assume that both datasets are normally
distributed. Like other correlation coefficients, this one varies
between -1 and +1 with 0 implying no correlation. Correlations of -1 or
+1 imply an exact monotonic relationship. Positive correlations imply that
as x increases, so does y. Negative correlations imply that as x
increases, y decreases.
The p-value roughly indicates the probability of an uncorrelated system
producing datasets that have a Spearman correlation at least as extreme
as the one computed from these datasets. The p-values are not entirely
reliable but are probably reasonable for datasets larger than 500 or so.
Parameters
----------
a, b : 1D or 2D array_like, b is optional
One or two 1-D or 2-D arrays containing multiple variables and
observations. Each column of `a` and `b` represents a variable, and
each row entry a single observation of those variables. See also
`axis`. Both arrays need to have the same length in the `axis`
dimension.
axis : int or None, optional
If axis=0 (default), then each column represents a variable, with
observations in the rows. If axis=1, the relationship is transposed:
each row represents a variable, while the columns contain observations.
If axis=None, then both arrays will be raveled.
Returns
-------
rho : float or ndarray (2-D square)
Spearman correlation matrix or correlation coefficient (if only 2
variables are given as parameters). The correlation matrix is square with
length equal to total number of variables (columns or rows) in a and b
combined.
p-value : float
The two-sided p-value for a hypothesis test whose null hypothesis is
that two sets of data are uncorrelated, has same dimension as rho.
Notes
-----
Changes in scipy 0.8.0: rewrite to add tie-handling, and axis.
References
----------
[CRCProbStat2000]_ Section 14.7
.. [CRCProbStat2000] Zwillinger, D. and Kokoska, S. (2000). CRC Standard
Probability and Statistics Tables and Formulae. Chapman & Hall: New
York. 2000.
Examples
--------
>>> spearmanr([1,2,3,4,5],[5,6,7,8,7])
(0.82078268166812329, 0.088587005313543798)
>>> np.random.seed(1234321)
>>> x2n=np.random.randn(100,2)
>>> y2n=np.random.randn(100,2)
>>> spearmanr(x2n)
(0.059969996999699973, 0.55338590803773591)
>>> spearmanr(x2n[:,0], x2n[:,1])
(0.059969996999699973, 0.55338590803773591)
>>> rho, pval = spearmanr(x2n,y2n)
>>> rho
array([[ 1. , 0.05997 , 0.18569457, 0.06258626],
[ 0.05997 , 1. , 0.110003 , 0.02534653],
[ 0.18569457, 0.110003 , 1. , 0.03488749],
[ 0.06258626, 0.02534653, 0.03488749, 1. ]])
>>> pval
array([[ 0. , 0.55338591, 0.06435364, 0.53617935],
[ 0.55338591, 0. , 0.27592895, 0.80234077],
[ 0.06435364, 0.27592895, 0. , 0.73039992],
[ 0.53617935, 0.80234077, 0.73039992, 0. ]])
>>> rho, pval = spearmanr(x2n.T, y2n.T, axis=1)
>>> rho
array([[ 1. , 0.05997 , 0.18569457, 0.06258626],
[ 0.05997 , 1. , 0.110003 , 0.02534653],
[ 0.18569457, 0.110003 , 1. , 0.03488749],
[ 0.06258626, 0.02534653, 0.03488749, 1. ]])
>>> spearmanr(x2n, y2n, axis=None)
(0.10816770419260482, 0.1273562188027364)
>>> spearmanr(x2n.ravel(), y2n.ravel())
(0.10816770419260482, 0.1273562188027364)
>>> xint = np.random.randint(10,size=(100,2))
>>> spearmanr(xint)
(0.052760927029710199, 0.60213045837062351)
square_of_sums(a, axis=0)
Sums elements of the input array, and returns the square(s) of that sum.
Parameters
----------
a : array_like
Input array.
axis : int or None, optional
If axis is None, ravel `a` first. If `axis` is an integer, this will
be the axis over which to operate. Defaults to 0.
Returns
-------
square_of_sums : float or ndarray
The square of the sum over `axis`.
See also
--------
ss : The sum of squares (the opposite of `square_of_sums`).
Examples
--------
>>> from scipy import stats
>>> a = np.arange(20).reshape(5,4)
>>> stats.square_of_sums(a)
array([ 1600., 2025., 2500., 3025.])
>>> stats.square_of_sums(a, axis=None)
36100.0
ss(a, axis=0)
Squares each element of the input array, and returns the sum(s) of that.
Parameters
----------
a : array_like
Input array.
axis : int or None, optional
The axis along which to calculate. If None, use whole array.
Default is 0, i.e. along the first axis.
Returns
-------
ss : ndarray
The sum along the given axis for (a**2).
See also
--------
square_of_sums : The square(s) of the sum(s) (the opposite of `ss`).
Examples
--------
>>> from scipy import stats
>>> a = np.array([1., 2., 5.])
>>> stats.ss(a)
30.0
And calculating along an axis:
>>> b = np.array([[1., 2., 5.], [2., 5., 6.]])
>>> stats.ss(b, axis=1)
array([ 30., 65.])
threshold(a, threshmin=None, threshmax=None, newval=0)
Clip array to a given value.
Similar to numpy.clip(), except that values less than `threshmin` or
greater than `threshmax` are replaced by `newval`, instead of by
`threshmin` and `threshmax` respectively.
Parameters
----------
a : array_like
Data to threshold.
threshmin : float, int or None, optional
Minimum threshold, defaults to None.
threshmax : float, int or None, optional
Maximum threshold, defaults to None.
newval : float or int, optional
Value to put in place of values in `a` outside of bounds.
Defaults to 0.
Returns
-------
out : ndarray
The clipped input array, with values less than `threshmin` or
greater than `threshmax` replaced with `newval`.
Examples
--------
>>> a = np.array([9, 9, 6, 3, 1, 6, 1, 0, 0, 8])
>>> from scipy import stats
>>> stats.threshold(a, threshmin=2, threshmax=8, newval=-1)
array([-1, -1, 6, 3, -1, 6, -1, -1, -1, 8])
tiecorrect(...)
tiecorrect(rankvals)
Tie correction factor for ties in the Mann-Whitney U and
Kruskal-Wallis H tests.
Parameters
----------
rankvals : array_like
A 1-D sequence of ranks. Typically this will be the array
returned by `stats.rankdata`.
Returns
-------
factor : float
Correction factor for U or H.
See Also
--------
rankdata : Assign ranks to the data
mannwhitneyu : Mann-Whitney rank test
kruskal : Kruskal-Wallis H test
References
----------
.. [1] Siegel, S. (1956) Nonparametric Statistics for the Behavioral
Sciences. New York: McGraw-Hill.
Examples
--------
>>> tiecorrect([1, 2.5, 2.5, 4])
0.9
>>> ranks = rankdata([1, 3, 2, 4, 5, 7, 2, 8, 4])
>>> ranks
array([ 1. , 4. , 2.5, 5.5, 7. , 8. , 2.5, 9. , 5.5])
>>> tiecorrect(ranks)
0.9833333333333333
tmax(a, upperlimit=None, axis=0, inclusive=True)
Compute the trimmed maximum
This function computes the maximum value of an array along a given axis,
while ignoring values larger than a specified upper limit.
Parameters
----------
a : array_like
array of values
upperlimit : None or float, optional
Values in the input array greater than the given limit will be ignored.
When upperlimit is None, then all values are used. The default value
is None.
axis : None or int, optional
Operate along this axis. None means to use the flattened array and
the default is zero.
inclusive : {True, False}, optional
This flag determines whether values exactly equal to the upper limit
are included. The default value is True.
Returns
-------
tmax : float
tmean(a, limits=None, inclusive=(True, True))
Compute the trimmed mean.
This function finds the arithmetic mean of given values, ignoring values
outside the given `limits`.
Parameters
----------
a : array_like
Array of values.
limits : None or (lower limit, upper limit), optional
Values in the input array less than the lower limit or greater than the
upper limit will be ignored. When limits is None (default), then all
values are used. Either of the limit values in the tuple can also be
None representing a half-open interval.
inclusive : (bool, bool), optional
A tuple consisting of the (lower flag, upper flag). These flags
determine whether values exactly equal to the lower or upper limits
are included. The default value is (True, True).
Returns
-------
tmean : float
tmin(a, lowerlimit=None, axis=0, inclusive=True)
Compute the trimmed minimum
This function finds the minimum value of an array `a` along the
specified axis, but only considering values greater than a specified
lower limit.
Parameters
----------
a : array_like
array of values
lowerlimit : None or float, optional
Values in the input array less than the given limit will be ignored.
When lowerlimit is None, then all values are used. The default value
is None.
axis : None or int, optional
Operate along this axis. None means to use the flattened array and
the default is zero
inclusive : {True, False}, optional
This flag determines whether values exactly equal to the lower limit
are included. The default value is True.
Returns
-------
tmin : float
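The three trimmed statistics above can be sketched on a single array (our example; the limits are chosen arbitrarily):
>>> import numpy as np
>>> from scipy import stats
>>> x = np.arange(20)
>>> stats.tmean(x, (3, 17))       # mean of the values from 3 to 17
10.0
>>> stats.tmin(x, lowerlimit=3)   # smallest value not below 3
3
>>> stats.tmax(x, upperlimit=17)  # largest value not above 17
17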
trim1(a, proportiontocut, tail='right')
Slices off a proportion of items from ONE end of the passed array
distribution.
If `proportiontocut` = 0.1, slices off 'leftmost' or 'rightmost'
10% of scores. Slices off LESS if proportion results in a non-integer
slice index (i.e., conservatively slices off `proportiontocut`).
Parameters
----------
a : array_like
Input array
proportiontocut : float
Fraction to cut off of 'left' or 'right' of distribution
tail : {'left', 'right'}, optional
Defaults to 'right'.
Returns
-------
trim1 : ndarray
Trimmed version of array `a`
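A quick sketch (cutting 10% from the right of 20 values removes two of them):
>>> import numpy as np
>>> from scipy import stats
>>> a = np.arange(20)
>>> b = stats.trim1(a, 0.1, tail='right')
>>> b.shape
(18,)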
trim_mean(a, proportiontocut, axis=0)
Return mean of array after trimming distribution from both lower and upper
tails.
If `proportiontocut` = 0.1, slices off 'leftmost' and 'rightmost' 10% of
scores. Slices off LESS if proportion results in a non-integer slice
index (i.e., conservatively slices off `proportiontocut`).
Parameters
----------
a : array_like
Input array
proportiontocut : float
Fraction to cut off of both tails of the distribution
axis : int or None, optional
Axis along which the trimmed means are computed. The default is axis=0.
If axis is None then the trimmed mean will be computed for the
flattened array.
Returns
-------
trim_mean : ndarray
Mean of trimmed array.
See Also
--------
trimboth
Examples
--------
>>> from scipy import stats
>>> x = np.arange(20)
>>> stats.trim_mean(x, 0.1)
9.5
>>> x2 = x.reshape(5, 4)
>>> x2
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15],
[16, 17, 18, 19]])
>>> stats.trim_mean(x2, 0.25)
array([ 8., 9., 10., 11.])
>>> stats.trim_mean(x2, 0.25, axis=1)
array([ 1.5, 5.5, 9.5, 13.5, 17.5])
trimboth(a, proportiontocut, axis=0)
Slices off a proportion of items from both ends of an array.
Slices off the passed proportion of items from both ends of the passed
array (i.e., with `proportiontocut` = 0.1, slices leftmost 10% **and**
rightmost 10% of scores). You must pre-sort the array if you want
'proper' trimming. Slices off less if proportion results in a
non-integer slice index (i.e., conservatively slices off
`proportiontocut`).
Parameters
----------
a : array_like
Data to trim.
proportiontocut : float
Proportion (in range 0-1) of total data set to trim of each end.
axis : int or None, optional
Axis along which the observations are trimmed. The default is to trim
along axis=0. If axis is None then the array will be flattened before
trimming.
Returns
-------
out : ndarray
Trimmed version of array `a`.
See Also
--------
trim_mean
Examples
--------
>>> from scipy import stats
>>> a = np.arange(20)
>>> b = stats.trimboth(a, 0.1)
>>> b.shape
(16,)
tsem(a, limits=None, inclusive=(True, True))
Compute the trimmed standard error of the mean.
This function finds the standard error of the mean for given
values, ignoring values outside the given `limits`.
Parameters
----------
a : array_like
array of values
limits : None or (lower limit, upper limit), optional
Values in the input array less than the lower limit or greater than the
upper limit will be ignored. When limits is None, then all values are
used. Either of the limit values in the tuple can also be None
representing a half-open interval. The default value is None.
inclusive : (bool, bool), optional
A tuple consisting of the (lower flag, upper flag). These flags
determine whether values exactly equal to the lower or upper limits
are included. The default value is (True, True).
Returns
-------
tsem : float
Notes
-----
`tsem` uses unbiased sample standard deviation, i.e. it uses a
correction factor ``n / (n - 1)``.
tstd(a, limits=None, inclusive=(True, True))
Compute the trimmed sample standard deviation
This function finds the sample standard deviation of given values,
ignoring values outside the given `limits`.
Parameters
----------
a : array_like
array of values
limits : None or (lower limit, upper limit), optional
Values in the input array less than the lower limit or greater than the
upper limit will be ignored. When limits is None, then all values are
used. Either of the limit values in the tuple can also be None
representing a half-open interval. The default value is None.
inclusive : (bool, bool), optional
A tuple consisting of the (lower flag, upper flag). These flags
determine whether values exactly equal to the lower or upper limits
are included. The default value is (True, True).
Returns
-------
tstd : float
Notes
-----
`tstd` computes the unbiased sample standard deviation, i.e. it uses a
correction factor ``n / (n - 1)``.
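A sketch combining the two trimmed spread measures (limits arbitrary; approximate values in the comments):
>>> import numpy as np
>>> from scipy import stats
>>> x = np.arange(20)
>>> stats.tstd(x, (3, 17))  # ~4.472, sample std of the values 3..17
>>> stats.tsem(x, (3, 17))  # ~1.155, i.e. that std over sqrt(15)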
ttest_1samp(a, popmean, axis=0)
Calculates the T-test for the mean of ONE group of scores.
This is a two-sided test for the null hypothesis that the expected value
(mean) of a sample of independent observations `a` is equal to the given
population mean, `popmean`.
Parameters
----------
a : array_like
sample observation
popmean : float or array_like
expected value in null hypothesis; if array_like, then it must have the
same shape as `a` excluding the axis dimension
axis : int, optional, (default axis=0)
Axis can equal None (ravel array first), or an integer (the axis
over which to operate on a).
Returns
-------
t : float or array
t-statistic
prob : float or array
two-tailed p-value
Examples
--------
>>> from scipy import stats
>>> np.random.seed(7654567) # fix seed to get the same result
>>> rvs = stats.norm.rvs(loc=5, scale=10, size=(50,2))
Test if mean of random sample is equal to true mean, and different mean.
We reject the null hypothesis in the second case and don't reject it in
the first case.
>>> stats.ttest_1samp(rvs,5.0)
(array([-0.68014479, -0.04323899]), array([ 0.49961383, 0.96568674]))
>>> stats.ttest_1samp(rvs,0.0)
(array([ 2.77025808, 4.11038784]), array([ 0.00789095, 0.00014999]))
Examples using axis and non-scalar dimension for population mean.
>>> stats.ttest_1samp(rvs,[5.0,0.0])
(array([-0.68014479, 4.11038784]), array([ 4.99613833e-01, 1.49986458e-04]))
>>> stats.ttest_1samp(rvs.T,[5.0,0.0],axis=1)
(array([-0.68014479, 4.11038784]), array([ 4.99613833e-01, 1.49986458e-04]))
>>> stats.ttest_1samp(rvs,[[5.0],[0.0]])
(array([[-0.68014479, -0.04323899],
[ 2.77025808, 4.11038784]]), array([[ 4.99613833e-01, 9.65686743e-01],
[ 7.89094663e-03, 1.49986458e-04]]))
ttest_ind(a, b, axis=0, equal_var=True)
Calculates the T-test for the means of TWO INDEPENDENT samples of scores.
This is a two-sided test for the null hypothesis that 2 independent samples
have identical average (expected) values. This test assumes that the
populations have identical variances.
Parameters
----------
a, b : array_like
The arrays must have the same shape, except in the dimension
corresponding to `axis` (the first, by default).
axis : int, optional
Axis can equal None (ravel array first), or an integer (the axis
over which to operate on a and b).
equal_var : bool, optional
If True (default), perform a standard independent 2 sample test
that assumes equal population variances [1]_.
If False, perform Welch's t-test, which does not assume equal
population variance [2]_.
.. versionadded:: 0.11.0
Returns
-------
t : float or array
The calculated t-statistic.
prob : float or array
The two-tailed p-value.
Notes
-----
We can use this test, if we observe two independent samples from
the same or different population, e.g. exam scores of boys and
girls or of two ethnic groups. The test measures whether the
average (expected) value differs significantly across samples. If
we observe a large p-value, for example larger than 0.05 or 0.1,
then we cannot reject the null hypothesis of identical average scores.
If the p-value is smaller than the threshold, e.g. 1%, 5% or 10%,
then we reject the null hypothesis of equal averages.
References
----------
.. [1] http://en.wikipedia.org/wiki/T-test#Independent_two-sample_t-test
.. [2] http://en.wikipedia.org/wiki/Welch%27s_t_test
Examples
--------
>>> from scipy import stats
>>> np.random.seed(12345678)
Test with sample with identical means:
>>> rvs1 = stats.norm.rvs(loc=5,scale=10,size=500)
>>> rvs2 = stats.norm.rvs(loc=5,scale=10,size=500)
>>> stats.ttest_ind(rvs1,rvs2)
(0.26833823296239279, 0.78849443369564776)
>>> stats.ttest_ind(rvs1,rvs2, equal_var = False)
(0.26833823296239279, 0.78849452749500748)
`ttest_ind` underestimates p for unequal variances:
>>> rvs3 = stats.norm.rvs(loc=5, scale=20, size=500)
>>> stats.ttest_ind(rvs1, rvs3)
(-0.46580283298287162, 0.64145827413436174)
>>> stats.ttest_ind(rvs1, rvs3, equal_var = False)
(-0.46580283298287162, 0.64149646246569292)
When n1 != n2, the equal variance t-statistic is no longer equal to the
unequal variance t-statistic:
>>> rvs4 = stats.norm.rvs(loc=5, scale=20, size=100)
>>> stats.ttest_ind(rvs1, rvs4)
(-0.99882539442782481, 0.3182832709103896)
>>> stats.ttest_ind(rvs1, rvs4, equal_var = False)
(-0.69712570584654099, 0.48716927725402048)
T-test with different means, variance, and n:
>>> rvs5 = stats.norm.rvs(loc=8, scale=20, size=100)
>>> stats.ttest_ind(rvs1, rvs5)
(-1.4679669854490653, 0.14263895620529152)
>>> stats.ttest_ind(rvs1, rvs5, equal_var = False)
(-0.94365973617132992, 0.34744170334794122)
ttest_rel(a, b, axis=0)
Calculates the T-test on TWO RELATED samples of scores, a and b.
This is a two-sided test for the null hypothesis that 2 related or
repeated samples have identical average (expected) values.
Parameters
----------
a, b : array_like
The arrays must have the same shape.
axis : int, optional, (default axis=0)
Axis can equal None (ravel array first), or an integer (the axis
over which to operate on a and b).
Returns
-------
t : float or array
t-statistic
prob : float or array
two-tailed p-value
Notes
-----
Examples for the use are scores of the same set of students in
different exams, or repeated sampling from the same units. The
test measures whether the average score differs significantly
across samples (e.g. exams). If we observe a large p-value, for
example greater than 0.05 or 0.1 then we cannot reject the null
hypothesis of identical average scores. If the p-value is smaller
than the threshold, e.g. 1%, 5% or 10%, then we reject the null
hypothesis of equal averages. Small p-values are associated with
large t-statistics.
References
----------
http://en.wikipedia.org/wiki/T-test#Dependent_t-test
Examples
--------
>>> from scipy import stats
>>> np.random.seed(12345678) # fix random seed to get same numbers
>>> rvs1 = stats.norm.rvs(loc=5,scale=10,size=500)
>>> rvs2 = (stats.norm.rvs(loc=5,scale=10,size=500) +
... stats.norm.rvs(scale=0.2,size=500))
>>> stats.ttest_rel(rvs1,rvs2)
(0.24101764965300962, 0.80964043445811562)
>>> rvs3 = (stats.norm.rvs(loc=8,scale=10,size=500) +
... stats.norm.rvs(scale=0.2,size=500))
>>> stats.ttest_rel(rvs1,rvs3)
(-3.9995108708727933, 7.3082402191726459e-005)
tvar(a, limits=None, inclusive=(True, True))
Compute the trimmed variance
This function computes the sample variance of an array of values,
while ignoring values which are outside of given `limits`.
Parameters
----------
a : array_like
Array of values.
limits : None or (lower limit, upper limit), optional
Values in the input array less than the lower limit or greater than the
upper limit will be ignored. When limits is None, then all values are
used. Either of the limit values in the tuple can also be None
representing a half-open interval. The default value is None.
inclusive : (bool, bool), optional
A tuple consisting of the (lower flag, upper flag). These flags
determine whether values exactly equal to the lower or upper limits
are included. The default value is (True, True).
Returns
-------
tvar : float
Trimmed variance.
Notes
-----
`tvar` computes the unbiased sample variance, i.e. it uses a correction
factor ``n / (n - 1)``.
variation(a, axis=0)
Computes the coefficient of variation, the ratio of the biased standard
deviation to the mean.
Parameters
----------
a : array_like
Input array.
axis : int or None
Axis along which to calculate the coefficient of variation.
References
----------
.. [1] Zwillinger, D. and Kokoska, S. (2000). CRC Standard
Probability and Statistics Tables and Formulae. Chapman & Hall: New
York. 2000.
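Sketching the last two functions on small example values of ours:
>>> import numpy as np
>>> from scipy import stats
>>> x = np.arange(20)
>>> stats.tvar(x, (3, 17))                   # unbiased variance of 3..17
20.0
>>> stats.variation(np.array([1., 2., 3.]))  # std/mean ~ 0.408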
wilcoxon(x, y=None, zero_method='wilcox', correction=False)
Calculate the Wilcoxon signed-rank test.
The Wilcoxon signed-rank test tests the null hypothesis that two
related paired samples come from the same distribution. In particular,
it tests whether the distribution of the differences x - y is symmetric
about zero. It is a non-parametric version of the paired T-test.
Parameters
----------
x : array_like
The first set of measurements.
y : array_like, optional
The second set of measurements. If `y` is not given, then the `x`
array is considered to be the differences between the two sets of
measurements.
zero_method : string, {"pratt", "wilcox", "zsplit"}, optional
"pratt":
Pratt treatment: includes zero-differences in the ranking process
(more conservative)
"wilcox":
Wilcox treatment: discards all zero-differences
"zsplit":
Zero rank split: just like Pratt, but splitting the zero rank
between positive and negative ones
correction : bool, optional
If True, apply continuity correction by adjusting the Wilcoxon rank
statistic by 0.5 towards the mean value when computing the
z-statistic. Default is False.
Returns
-------
T : float
The sum of the ranks of the differences above or below zero, whichever
is smaller.
p-value : float
The two-sided p-value for the test.
Notes
-----
Because the normal approximation is used for the calculations, the
samples used should be large. A typical rule is to require that
n > 20.
References
----------
.. [1] http://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test
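A minimal sketch with paired measurements (seed and sizes arbitrary; output omitted):
>>> import numpy as np
>>> from scipy import stats
>>> np.random.seed(12345678)
>>> before = stats.norm.rvs(loc=5, size=50)
>>> after = before + stats.norm.rvs(scale=0.2, size=50)
>>> T, p = stats.wilcoxon(before, after)  # same distribution -> large p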
zmap(scores, compare, axis=0, ddof=0)
Calculates the relative z-scores.
Returns an array of z-scores, i.e., scores that are standardized to zero
mean and unit variance, where mean and variance are calculated from the
comparison array.
Parameters
----------
scores : array_like
The input for which z-scores are calculated.
compare : array_like
The input from which the mean and standard deviation of the
normalization are taken; assumed to have the same dimension as
`scores`.
axis : int or None, optional
Axis over which mean and variance of `compare` are calculated.
Default is 0.
ddof : int, optional
Degrees of freedom correction in the calculation of the
standard deviation. Default is 0.
Returns
-------
zscore : array_like
Z-scores, in the same shape as `scores`.
Notes
-----
This function preserves ndarray subclasses, and works also with
matrices and masked arrays (it uses `asanyarray` instead of `asarray`
for parameters).
Examples
--------
>>> a = [0.5, 2.0, 2.5, 3]
>>> b = [0, 1, 2, 3, 4]
>>> zmap(a, b)
array([-1.06066017, 0. , 0.35355339, 0.70710678])
zprob(*args, **kwds)
`zprob` is deprecated!
zprob is deprecated in scipy 0.14, use norm.cdf or special.ndtr instead
ndtr(x[, out])
ndtr(x)
Gaussian cumulative distribution function
Returns the area under the standard Gaussian probability
density function, integrated from minus infinity to x::
1/sqrt(2*pi) * integral(exp(-t**2 / 2),t=-inf..x)
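For example:
>>> from scipy import special
>>> special.ndtr(0.0)  # half of the standard normal mass lies below 0
0.5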
zscore(a, axis=0, ddof=0)
Calculates the z score of each value in the sample, relative to the sample
mean and standard deviation.
Parameters
----------
a : array_like
An array like object containing the sample data.
axis : int or None, optional
If `axis` is equal to None, the array is first raveled. If `axis` is
an integer, this is the axis over which to operate. Default is 0.
ddof : int, optional
Degrees of freedom correction in the calculation of the
standard deviation. Default is 0.
Returns
-------
zscore : array_like
The z-scores, standardized by mean and standard deviation of input
array `a`.
Notes
-----
This function preserves ndarray subclasses, and works also with
matrices and masked arrays (it uses `asanyarray` instead of `asarray`
for parameters).
Examples
--------
>>> a = np.array([ 0.7972, 0.0767, 0.4383, 0.7866, 0.8091, 0.1954,
0.6307, 0.6599, 0.1065, 0.0508])
>>> from scipy import stats
>>> stats.zscore(a)
array([ 1.1273, -1.247 , -0.0552, 1.0923, 1.1664, -0.8559, 0.5786,
0.6748, -1.1488, -1.3324])
Computing along a specified axis, using n-1 degrees of freedom (``ddof=1``)
to calculate the standard deviation:
>>> b = np.array([[ 0.3148, 0.0478, 0.6243, 0.4608],
[ 0.7149, 0.0775, 0.6072, 0.9656],
[ 0.6341, 0.1403, 0.9759, 0.4064],
[ 0.5918, 0.6948, 0.904 , 0.3721],
[ 0.0921, 0.2481, 0.1188, 0.1366]])
>>> stats.zscore(b, axis=1, ddof=1)
array([[-0.19264823, -1.28415119, 1.07259584, 0.40420358],
[ 0.33048416, -1.37380874, 0.04251374, 1.00081084],
[ 0.26796377, -1.12598418, 1.23283094, -0.37481053],
[-0.22095197, 0.24468594, 1.19042819, -1.21416216],
[-0.82780366, 1.4457416 , -0.43867764, -0.1792603 ]])
DATA
__all__ = ['absolute_import', 'alpha', 'anderson', 'anderson_ksamp', '...
absolute_import = _Feature((2, 5, 0, 'alpha', 1), (3, 0, 0, 'alpha', 0...
alpha = <scipy.stats._continuous_distns.alpha_gen object>
anglit = <scipy.stats._continuous_distns.anglit_gen object>
arcsine = <scipy.stats._continuous_distns.arcsine_gen object>
bernoulli = <scipy.stats._discrete_distns.bernoulli_gen object>
beta = <scipy.stats._continuous_distns.beta_gen object>
betaprime = <scipy.stats._continuous_distns.betaprime_gen object>
binom = <scipy.stats._discrete_distns.binom_gen object>
boltzmann = <scipy.stats._discrete_distns.boltzmann_gen object>
bradford = <scipy.stats._continuous_distns.bradford_gen object>
burr = <scipy.stats._continuous_distns.burr_gen object>
cauchy = <scipy.stats._continuous_distns.cauchy_gen object>
chi = <scipy.stats._continuous_distns.chi_gen object>
chi2 = <scipy.stats._continuous_distns.chi2_gen object>
cosine = <scipy.stats._continuous_distns.cosine_gen object>
dgamma = <scipy.stats._continuous_distns.dgamma_gen object>
division = _Feature((2, 2, 0, 'alpha', 2), (3, 0, 0, 'alpha', 0), 8192...
dlaplace = <scipy.stats._discrete_distns.dlaplace_gen object>
dweibull = <scipy.stats._continuous_distns.dweibull_gen object>
erlang = <scipy.stats._continuous_distns.erlang_gen object>
expon = <scipy.stats._continuous_distns.expon_gen object>
exponpow = <scipy.stats._continuous_distns.exponpow_gen object>
exponweib = <scipy.stats._continuous_distns.exponweib_gen object>
f = <scipy.stats._continuous_distns.f_gen object>
fatiguelife = <scipy.stats._continuous_distns.fatiguelife_gen object>
fisk = <scipy.stats._continuous_distns.fisk_gen object>
foldcauchy = <scipy.stats._continuous_distns.foldcauchy_gen object>
foldnorm = <scipy.stats._continuous_distns.foldnorm_gen object>
frechet_l = <scipy.stats._continuous_distns.frechet_l_gen object>
frechet_r = <scipy.stats._continuous_distns.frechet_r_gen object>
gamma = <scipy.stats._continuous_distns.gamma_gen object>
gausshyper = <scipy.stats._continuous_distns.gausshyper_gen object>
genexpon = <scipy.stats._continuous_distns.genexpon_gen object>
genextreme = <scipy.stats._continuous_distns.genextreme_gen object>
gengamma = <scipy.stats._continuous_distns.gengamma_gen object>
genhalflogistic = <scipy.stats._continuous_distns.genhalflogistic_gen ...
genlogistic = <scipy.stats._continuous_distns.genlogistic_gen object>
genpareto = <scipy.stats._continuous_distns.genpareto_gen object>
geom = <scipy.stats._discrete_distns.geom_gen object>
gilbrat = <scipy.stats._continuous_distns.gilbrat_gen object>
gompertz = <scipy.stats._continuous_distns.gompertz_gen object>
gumbel_l = <scipy.stats._continuous_distns.gumbel_l_gen object>
gumbel_r = <scipy.stats._continuous_distns.gumbel_r_gen object>
halfcauchy = <scipy.stats._continuous_distns.halfcauchy_gen object>
halflogistic = <scipy.stats._continuous_distns.halflogistic_gen object...
halfnorm = <scipy.stats._continuous_distns.halfnorm_gen object>
hypergeom = <scipy.stats._discrete_distns.hypergeom_gen object>
hypsecant = <scipy.stats._continuous_distns.hypsecant_gen object>
invgamma = <scipy.stats._continuous_distns.invgamma_gen object>
invgauss = <scipy.stats._continuous_distns.invgauss_gen object>
invweibull = <scipy.stats._continuous_distns.invweibull_gen object>
johnsonsb = <scipy.stats._continuous_distns.johnsonsb_gen object>
johnsonsu = <scipy.stats._continuous_distns.johnsonsu_gen object>
ksone = <scipy.stats._continuous_distns.ksone_gen object>
kstwobign = <scipy.stats._continuous_distns.kstwobign_gen object>
laplace = <scipy.stats._continuous_distns.laplace_gen object>
levy = <scipy.stats._continuous_distns.levy_gen object>
levy_l = <scipy.stats._continuous_distns.levy_l_gen object>
levy_stable = <scipy.stats._continuous_distns.levy_stable_gen object>
loggamma = <scipy.stats._continuous_distns.loggamma_gen object>
logistic = <scipy.stats._continuous_distns.logistic_gen object>
loglaplace = <scipy.stats._continuous_distns.loglaplace_gen object>
lognorm = <scipy.stats._continuous_distns.lognorm_gen object>
logser = <scipy.stats._discrete_distns.logser_gen object>
lomax = <scipy.stats._continuous_distns.lomax_gen object>
maxwell = <scipy.stats._continuous_distns.maxwell_gen object>
mielke = <scipy.stats._continuous_distns.mielke_gen object>
multivariate_normal = <scipy.stats._multivariate.multivariate_normal_g...
nakagami = <scipy.stats._continuous_distns.nakagami_gen object>
nbinom = <scipy.stats._discrete_distns.nbinom_gen object>
ncf = <scipy.stats._continuous_distns.ncf_gen object>
nct = <scipy.stats._continuous_distns.nct_gen object>
ncx2 = <scipy.stats._continuous_distns.ncx2_gen object>
norm = <scipy.stats._continuous_distns.norm_gen object>
pareto = <scipy.stats._continuous_distns.pareto_gen object>
pearson3 = <scipy.stats._continuous_distns.pearson3_gen object>
planck = <scipy.stats._discrete_distns.planck_gen object>
poisson = <scipy.stats._discrete_distns.poisson_gen object>
powerlaw = <scipy.stats._continuous_distns.powerlaw_gen object>
powerlognorm = <scipy.stats._continuous_distns.powerlognorm_gen object...
powernorm = <scipy.stats._continuous_distns.powernorm_gen object>
print_function = _Feature((2, 6, 0, 'alpha', 2), (3, 0, 0, 'alpha', 0)...
randint = <scipy.stats._discrete_distns.randint_gen object>
rayleigh = <scipy.stats._continuous_distns.rayleigh_gen object>
rdist = <scipy.stats._continuous_distns.rdist_gen object>
recipinvgauss = <scipy.stats._continuous_distns.recipinvgauss_gen obje...
reciprocal = <scipy.stats._continuous_distns.reciprocal_gen object>
rice = <scipy.stats._continuous_distns.rice_gen object>
semicircular = <scipy.stats._continuous_distns.semicircular_gen object...
skellam = <scipy.stats._discrete_distns.skellam_gen object>
t = <scipy.stats._continuous_distns.t_gen object>
triang = <scipy.stats._continuous_distns.triang_gen object>
truncexpon = <scipy.stats._continuous_distns.truncexpon_gen object>
truncnorm = <scipy.stats._continuous_distns.truncnorm_gen object>
tukeylambda = <scipy.stats._continuous_distns.tukeylambda_gen object>
uniform = <scipy.stats._continuous_distns.uniform_gen object>
vonmises = <scipy.stats._continuous_distns.vonmises_gen object>
vonmises_line = <scipy.stats._continuous_distns.vonmises_gen object>
wald = <scipy.stats._continuous_distns.wald_gen object>
weibull_max = <scipy.stats._continuous_distns.frechet_l_gen object>
weibull_min = <scipy.stats._continuous_distns.frechet_r_gen object>
wrapcauchy = <scipy.stats._continuous_distns.wrapcauchy_gen object>
zipf = <scipy.stats._discrete_distns.zipf_gen object>
FILE
/usr/local/lib/python3.4/dist-packages/scipy/stats/__init__.py
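Every name in this listing is a ready-to-use distribution object that shares a common interface (`pdf`/`pmf`, `cdf`, `rvs`, `fit`, and so on). As a minimal sketch, assuming the `scores` array defined earlier in this chapter, the `norm` object can be used as follows:
In [ ]:
import scipy.stats
# Maximum-likelihood fit of a normal distribution to the scores;
# for norm, fit returns the estimated mean and standard deviation
mu, sigma = scipy.stats.norm.fit(scores)
# Evaluate the fitted density at the estimated mean
scipy.stats.norm.pdf(mu, loc=mu, scale=sigma)
# Draw five random variates from the fitted distribution
scipy.stats.norm.rvs(loc=mu, scale=sigma, size=5)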
In [19]:
help(scipy.stats.kurtosis)
Help on function kurtosis in module scipy.stats.stats:
kurtosis(a, axis=0, fisher=True, bias=True)
Computes the kurtosis (Fisher or Pearson) of a dataset.
Kurtosis is the fourth central moment divided by the square of the
variance. If Fisher's definition is used, then 3.0 is subtracted from
the result to give 0.0 for a normal distribution.
If bias is False then the kurtosis is calculated using k statistics to
eliminate bias coming from biased moment estimators.
Use `kurtosistest` to see if result is close enough to normal.
Parameters
----------
a : array
data for which the kurtosis is calculated
axis : int or None
Axis along which the kurtosis is calculated
fisher : bool
If True, Fisher's definition is used (normal ==> 0.0). If False,
Pearson's definition is used (normal ==> 3.0).
bias : bool
If False, then the calculations are corrected for statistical bias.
Returns
-------
kurtosis : array
The kurtosis of values along an axis. If all values are equal,
return -3 for Fisher's definition and 0 for Pearson's definition.
References
----------
.. [1] Zwillinger, D. and Kokoska, S. (2000). CRC Standard
Probability and Statistics Tables and Formulae. Chapman & Hall: New
York. 2000.
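For instance, a quick check on the `scores` data (a minimal sketch; `kurtosistest` can then tell whether the sample is close enough to normal):
In [ ]:
import scipy.stats
# Fisher's definition: 0.0 for a normal distribution
scipy.stats.kurtosis(scores)
# Pearson's definition: 3.0 for a normal distribution
scipy.stats.kurtosis(scores, fisher=False)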
In [20]:
%matplotlib inline
import numpy
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (12.0, 8.0)  # larger default size for inline figures
x = numpy.linspace(0, 2*numpy.pi, 32)  # one full period of the sine function, 32 points
fig = plt.figure()
plt.plot(x, numpy.sin(x))
fig.savefig('sine.png')  # save the figure to disk before displaying it
plt.show()