Statistics Corner
Overview
The Statistics Corner highlights libraries, components, and techniques for performing statistical analysis.
The following sections provide resources for the various types of analysis.
Foundations
Probability and statistics is a branch of measure theory used to model uncertainty; in particular, it provides a framework for understanding choice under uncertainty. It is also widely used in machine learning.
- Probability : describes the basic framework of probability, including its basic assumptions.
- Random Variable : any real-valued function whose domain is the set of outcomes.
- Distributions : refers to statistical distributions, that is, functions used to model the probability that a random variable takes certain values. The Distributions corner contains information on commonly used distributions, including tools for calculations such as computing a cumulative distribution value or the inverse (quantile) of a distribution; see the sketch after this list.
- Expectation : integrals of random variables with respect to the underlying probability measure.
- Inequalities : lists a number of useful statistical inequalities.
- Moments : summary quantities of a distribution, such as the mean and variance, obtained as expectations of powers of a random variable.
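A minimal sketch of these distribution calculations, assuming scipy.stats is available; the standard normal distribution and the 1.96/0.975 values are purely illustrative choices.

```python
# Sketch: common distribution calculations with scipy.stats.
from scipy import stats

dist = stats.norm(loc=0.0, scale=1.0)  # standard normal, chosen for illustration

# Cumulative distribution value: P(X <= 1.96)
print(f"P(X <= 1.96) = {dist.cdf(1.96):.4f}")     # ~0.9750

# Inverse distribution (quantile function): x such that P(X <= x) = 0.975
print(f"0.975 quantile = {dist.ppf(0.975):.4f}")  # ~1.9600

# Expectation as an integral of a function of the random variable
# (ties to the Expectation entry above): E[X^2] = 1 for the standard normal.
print(f"E[X^2] = {dist.expect(lambda x: x**2):.4f}")
```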
Exploratory Data Analysis
Exploratory data analysis :
is the process of examining data prior to any formal analysis in order to understand its basic characteristics and any trends visible to the naked eye. Typically, this means charting the data along different sets of dimensions. The Exploring Data corner collects resources for doing exploratory analysis of data.
Outliers : one of the easiest ways to detect outliers (see below) is simply to chart the data and look for stray points. Of course, this is generally only practical for datasets of limited size and dimension.
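As a sketch of the chart-and-look approach, the example below plants a few stray points in synthetic data and plots them two ways; it assumes numpy and matplotlib, and all values are invented for illustration.

```python
# Sketch: spotting outliers by eye with two simple charts.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(50, 5, size=200)
x[:3] = [95, 2, 110]  # plant a few stray points

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(x, "o", markersize=3)  # raw values: the strays stand out visually
ax1.set_title("Raw values")
ax2.boxplot(x)                  # a box plot flags the same points as outliers
ax2.set_title("Box plot")
plt.tight_layout()
plt.show()
```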
Data Preparation
Outliers and Data Cleansing:
some analyses can be improved by handling outliers and faulty records prior to the analysis.
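One common cleansing rule is Tukey's IQR fence; the sketch below is an illustrative implementation using numpy only, and the factor 1.5 is the conventional default rather than a universal choice.

```python
# Sketch: dropping outliers with Tukey's interquartile-range rule.
import numpy as np

def drop_outliers_iqr(x, k=1.5):
    """Keep only points inside [Q1 - k*IQR, Q3 + k*IQR]."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return x[(x >= q1 - k * iqr) & (x <= q3 + k * iqr)]

data = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 55.0])  # 55.0 is a faulty record
print(drop_outliers_iqr(data))  # the stray value is removed
```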
Dimensionality Reduction:
some datasets have a large number of properties, which can lead to overfitting or other anomalies.
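A sketch of one widely used reduction technique, principal component analysis, assuming scikit-learn is available; the synthetic dataset and the 95% variance threshold are illustrative choices.

```python
# Sketch: reducing redundant features with PCA.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 100 samples, 20 features, but only 5 underlying independent directions:
# the other 15 columns are noisy linear combinations of the first 5.
base = rng.normal(size=(100, 5))
X = np.hstack([base,
               base @ rng.normal(size=(5, 15)) + 0.01 * rng.normal(size=(100, 15))])

# Keep just enough components to explain 95% of the variance.
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)  # far fewer columns than 20
```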
Dependence Modeling
Statistical Dependence -
the statistical relationship between variables in a dataset. In the simplest terms, this may show up as a correlation between the variables; however, dependence may also be causal in nature.
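A minimal sketch of the correlation view of dependence, assuming numpy; the variables and their relationship are synthetic.

```python
# Sketch: measuring (linear) dependence with the correlation coefficient.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(scale=0.5, size=500)  # y depends on x, plus noise
z = rng.normal(size=500)                        # generated independently of x

print(np.corrcoef(x, y)[0, 1])  # close to 1: strong linear dependence
print(np.corrcoef(x, z)[0, 1])  # close to 0: no linear dependence
# Note: correlation only captures linear dependence; zero correlation does
# not imply independence, and correlation alone does not establish causation.
```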
Statistical Inference
Statistical Inference :
refers to the process of inferring the properties of a data-generating process from a set of samples of that process. The properties are typically estimates of the moments of the distribution, but can be other statistics as well.
Statistical inference is used in hypothesis testing, which gives confidence intervals for variables of interest, often to determine whether a variable is likely to be nonzero.
Random Sampling, Polling, Studies and Experiments - the typical application of statistical inference is to infer properties of a given population from a sample.
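A minimal sketch of such an inference, assuming a recent SciPy (the confidence_interval method on the t-test result requires SciPy 1.10 or later); the sample is synthetic, drawn with a true mean of 0.3 that the test does not know.

```python
# Sketch: testing whether a population mean is nonzero from a sample.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.3, scale=1.0, size=100)  # true mean 0.3, unknown to us

# One-sample t-test against the null hypothesis that the mean is 0.
result = stats.ttest_1samp(sample, popmean=0.0)
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.4f}")

ci = result.confidence_interval(confidence_level=0.95)
print(f"95% CI for the mean: ({ci.low:.3f}, {ci.high:.3f})")
# A small p-value (and a CI excluding 0) suggests the mean is nonzero.
```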
Regression
Regression :
is a general term used to describe a range of techniques for estimating a value drawn from a continuous range of values.
Regularization:
is a technique that penalizes a regression for large coefficients, thereby pulling the coefficients back closer to zero than what the standard regression would produce.
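A sketch contrasting ordinary least squares with ridge regularization, assuming scikit-learn; the synthetic data and the alpha=10.0 penalty are illustrative choices.

```python
# Sketch: ridge regularization shrinks coefficients relative to plain OLS.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X[:, 0] * 3.0 + rng.normal(scale=0.5, size=50)  # only feature 0 matters

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)  # the penalty pulls coefficients toward zero

print("OLS coefficient norm:  ", np.linalg.norm(ols.coef_))
print("Ridge coefficient norm:", np.linalg.norm(ridge.coef_))  # smaller
```

Larger alpha shrinks the coefficients more aggressively; in practice it is usually chosen by cross-validation.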
Time Series
Time series are datasets where each point represents a particular point in time. The nature of time series creates issues not present in other statistical datasets; in particular, dependence between points of the series causes new difficulties for modeling.
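A sketch of measuring that dependence via the sample autocorrelation, assuming numpy; the AR(1) series and its 0.8 coefficient are illustrative.

```python
# Sketch: points of a time series are often correlated with earlier points.
import numpy as np

def autocorr(x, lag):
    """Sample autocorrelation of x at a given positive lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

rng = np.random.default_rng(0)
# AR(1) process: each point depends directly on the previous one.
x = np.zeros(1000)
for t in range(1, 1000):
    x[t] = 0.8 * x[t - 1] + rng.normal()

for lag in (1, 2, 5):
    print(f"lag {lag}: {autocorr(x, lag):.3f}")  # decays roughly like 0.8**lag
```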
Simulations
- Simulations are a common method for calculating various statistics of a distribution when an analytical answer is not readily at hand. The Simulations corner contains resources related to solving problems using simulation methods.
- Resampling is a method for sampling from a given dataset in order to estimate statistics about the distribution that generated the dataset in the first place; it includes methods such as the jackknife and the bootstrap (see the sketch after this list).
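A sketch of the bootstrap, assuming numpy only: it estimates the standard error of the median, a statistic with no simple closed form, by repeatedly resampling the observed data with replacement. The exponential data and the 5000 replicates are illustrative choices.

```python
# Sketch: bootstrap estimate of the standard error of the median.
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=200)  # the "observed" dataset

n_boot = 5000
medians = np.empty(n_boot)
for i in range(n_boot):
    # Resample with replacement from the dataset itself.
    resample = rng.choice(data, size=data.size, replace=True)
    medians[i] = np.median(resample)

print(f"median of the data:       {np.median(data):.3f}")
print(f"bootstrap standard error: {medians.std(ddof=1):.3f}")
print(f"2.5%-97.5% bootstrap interval: "
      f"({np.percentile(medians, 2.5):.3f}, {np.percentile(medians, 97.5):.3f})")
```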
Models
Robust Statistics
Information Theory
Information Theory:
is the theory developed by Claude Shannon to quantify the amount of information in a message. Originally developed for message compression and reliable transmission, it has since been extended for use in various machine learning algorithms.
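A minimal sketch of Shannon entropy, H = -sum p*log2(p), the central quantity of the theory, assuming numpy; the coin probabilities are illustrative.

```python
# Sketch: Shannon entropy of a discrete distribution, in bits.
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a discrete probability vector."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # 0 * log(0) is taken to be 0 by convention
    return -np.sum(p * np.log2(p))

print(entropy_bits([0.5, 0.5]))  # fair coin: 1 bit per outcome
print(entropy_bits([0.9, 0.1]))  # biased coin: ~0.469 bits
print(entropy_bits([1.0]))       # a certain outcome carries no information
```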
Philosophy
Philosophy of Probability: