Statistics

Overview


The statistics corner provides resources on the theory of probability and statistics, including libraries, components and techniques for performing statistical analysis.

Foundations


Probability and statistics build on measure theory to model uncertainty; in particular, they provide a framework for understanding choice under uncertainty. They are also foundational to machine learning.

  • Probability : describes the basic framework of probability, including its basic assumptions.
  • Random Variable : any real-valued (measurable) function whose domain is the set of outcomes.
  • Distributions : functions used to model the probabilities of the values a random variable can take. The distribution corner contains information on commonly used distributions, including tools for calculations such as computing the cumulative distribution value or the inverse (quantile) of a distribution; a short sketch follows this list.
  • Expectation : integrals of random variables.
  • Inequalities : lists a number of useful statistical inequalities.
  • Combinatorics
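
As an example of a distribution calculation, the following is a minimal sketch assuming SciPy and a standard normal distribution (both assumptions for illustration, not part of the corner's own tooling): it computes a cumulative distribution value and its inverse.

    from scipy.stats import norm

    # Cumulative distribution value: P(X <= 1.96) for a standard normal X
    p = norm.cdf(1.96)       # approximately 0.975

    # Inverse distribution (quantile): the x with P(X <= x) = 0.975
    x = norm.ppf(0.975)      # approximately 1.96

    print(p, x)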

Moments


  • Moments (expected value, standard deviation, ...) : the expected value of a random variable, as well as the expectation of that variable raised to various powers. The moments corner contains resources for doing moment calculations, including related quantities such as the variance and covariance (central moments).
  • Conditional Probability, Expectation and Bayes : describes how new information changes probabilities.
  • Method of Moments : a technique for fitting a distribution to a set of observations by matching sample moments to the distribution's theoretical moments; see the sketch after this list.
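
A minimal method-of-moments sketch, assuming NumPy and a gamma distribution as the model: the sample mean and variance are matched to the gamma's theoretical moments to recover its parameters.

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.gamma(shape=2.0, scale=3.0, size=10_000)   # synthetic observations

    # Sample moments
    m = data.mean()        # first moment (expected value)
    v = data.var()         # second central moment (variance)

    # Gamma distribution: mean = k * theta, variance = k * theta**2
    theta_hat = v / m
    k_hat = m ** 2 / v
    print(k_hat, theta_hat)    # roughly 2.0 and 3.0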

Exploratory Data Analysis


  • Exploratory data analysis : the process of examining data prior to any formal analysis in order to understand its basic characteristics and any trends visible to the naked eye. Typically this means charting the data along different sets of dimensions; a short sketch follows this list. The exploring data corner collects resources for doing exploratory analysis of data.
  • Outliers : one of the easiest ways to detect outliers (see below) is simply to chart the data and look for stray points. Of course, this is generally only practical for datasets of limited size and dimensionality.
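
A minimal exploratory sketch, assuming pandas and Matplotlib; the file name and column names are hypothetical placeholders for whatever dataset is at hand.

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("measurements.csv")   # hypothetical dataset

    print(df.describe())     # per-column summary statistics
    print(df.isna().sum())   # missing values per column

    # Chart the data to look for trends and stray points (potential outliers)
    df.plot(kind="scatter", x="x", y="y")
    plt.show()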

Data Preparation


  • Outliers and Data Cleansing: some analyses can be improved by handling outliers and faulty records prior to the analysis; a sketch of a simple outlier filter follows this list.
  • Dimensionality Reduction: some datasets have a large number of features, which can lead to overfitting or other problems such as unstable estimates.
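
One simple cleansing rule, sketched below with pandas, flags values more than 1.5 interquartile ranges outside the quartiles; both the rule and the toy data are illustrative assumptions, not a prescribed procedure.

    import pandas as pd

    df = pd.DataFrame({"value": [9.8, 10.1, 10.0, 9.9, 42.0, 10.2]})

    q1, q3 = df["value"].quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = df["value"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

    clean = df[mask]      # 42.0 is dropped as an outlier
    print(clean)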

Dependence Modeling


  • Statistical Dependence - reflects the statistical relationships between the variables in a dataset. In the simplest terms this shows up as correlation between variables; however, dependence can be deeper than correlation, including nonlinear and causal relationships. A short correlation sketch follows.
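
A minimal correlation sketch, assuming NumPy and synthetic data, contrasting a pair of dependent variables with an independent one.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(size=1000)
    y = 2.0 * x + rng.normal(size=1000)   # y depends on x
    z = rng.normal(size=1000)             # z is independent of x

    print(np.corrcoef(x, y)[0, 1])   # close to 1: strong linear dependence
    print(np.corrcoef(x, z)[0, 1])   # close to 0: no linear dependence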

Statistical Inference


  • Statistical Inference : refers to the process of inferring the properties of a data-generating process from a set of samples of that process. The properties are typically estimates of the moments of the distribution, but can be other statistics as well.

    Statistical inference underlies hypothesis testing and confidence intervals for quantities of interest, often used to determine whether a quantity is likely to be nonzero; a short sketch follows this list.
  • Random Sampling, Polling, Studies and Experiments - the typical application of statistical inference is to infer properties of a given population from a sample.
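
A minimal inference sketch, assuming SciPy and a synthetic sample: a one-sample t-test of whether the mean is zero, plus a 95% confidence interval for the mean.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    sample = rng.normal(loc=0.3, scale=1.0, size=50)   # samples from the process

    # Test the hypothesis that the mean is zero
    t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)

    # 95% confidence interval for the mean
    sem = stats.sem(sample)
    ci = stats.t.interval(0.95, len(sample) - 1, loc=sample.mean(), scale=sem)

    print(t_stat, p_value, ci)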

Regression


Regression is a general term for a range of techniques for estimating a value drawn from a continuous range of values.
  • Ordinary Least Squares Regression:
  • Logistic Regression: a regression technique that fits a logistic function to a set of data. It is used primarily to model dichotomous data, that is, data where the response variable can take only one of two values; see the sketch after this list.
  • Multinomial (Softmax) Regression:
  • Poisson Regression:
  • Generalized Linear Models:
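
A minimal logistic regression sketch, assuming scikit-learn and a synthetic dichotomous response.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(3)
    X = rng.normal(size=(200, 2))
    # Dichotomous response: 1 when a noisy linear score is positive, else 0
    y = (X @ np.array([2.0, -1.0]) + rng.normal(size=200) > 0).astype(int)

    model = LogisticRegression().fit(X, y)
    print(model.coef_, model.intercept_)
    print(model.predict_proba(X[:3]))   # fitted probabilities for the two classes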


Regularization: a technique that penalizes a regression for large coefficients, thereby pulling the estimated coefficients closer to zero than the standard regression would produce.
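
A minimal regularization sketch, assuming scikit-learn: ridge regression adds an L2 penalty, so its coefficients are pulled toward zero relative to ordinary least squares on the same data.

    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge

    rng = np.random.default_rng(4)
    X = rng.normal(size=(200, 5))
    y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=200)

    ols = LinearRegression().fit(X, y)
    ridge = Ridge(alpha=10.0).fit(X, y)   # L2 penalty on the coefficients

    print(ols.coef_)
    print(ridge.coef_)   # same signs, smaller magnitudes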

Time Series


  • Time Series are datasets where each point corresponds to a particular point in time. The nature of time series raises issues not present in other statistical datasets; in particular, observations are generally not independent of one another, which creates new difficulties for modeling. A short sketch follows.
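
A minimal sketch of the dependence issue, assuming NumPy: an AR(1) series where each point depends on the previous one, with the lag-1 autocorrelation making the dependence visible.

    import numpy as np

    rng = np.random.default_rng(5)

    # AR(1) series: each point depends on the previous point
    n, phi = 500, 0.8
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.normal()

    # Lag-1 autocorrelation, roughly phi for an AR(1) process
    print(np.corrcoef(x[:-1], x[1:])[0, 1])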

Simulations


  • Simulations are a common method for calculating various statistics of a distribution when an analytical answer is not readily at hand. The simulations corner contains resources related to solving problems using simulation methods.
  • Resampling is a method for sampling from a given dataset in order to estimate statistics of the distribution that generated the dataset in the first place; it includes methods such as the jackknife and the bootstrap (see the sketch after this list).
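
A minimal bootstrap sketch, assuming NumPy and a synthetic sample: resampling with replacement to estimate an interval for the median when no analytical answer is at hand.

    import numpy as np

    rng = np.random.default_rng(6)
    data = rng.exponential(scale=2.0, size=100)   # observed sample

    # Resample the data with replacement to approximate the sampling
    # distribution of the median
    boot_medians = np.array([
        np.median(rng.choice(data, size=len(data), replace=True))
        for _ in range(2000)
    ])

    print(np.median(data))                            # point estimate
    print(np.percentile(boot_medians, [2.5, 97.5]))   # bootstrap 95% interval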

Models


  • Count Processes
  • Survival and Event Analysis
  • Queueing
  • Mixture Models
  • Principal Components
  • Conditional Probability Models
    • Markov Models
    • Probabilistic Graphical Models are used to represent a multivariate probability distribution as a graph. The graph is a compact representation that can improve both computation and visualization.
    • Latent Variable and Factor Models
  • Gaussian Models are models where all of the distributions underlying the model are Gaussian (normal). This is usually a simplification, but it is often good enough and makes the models tractable and computationally feasible; Gaussian models are often a good starting point (see the Gaussian mixture sketch after this list).
  • Random Matrices
  • Sparse Models
  • Robust Statistics
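
A minimal sketch combining the mixture and Gaussian model ideas above, assuming scikit-learn: fitting a two-component Gaussian mixture to synthetic one-dimensional data.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(7)
    # Two Gaussian clusters with different means, scales and weights
    data = np.concatenate([
        rng.normal(loc=-2.0, scale=0.5, size=(300, 1)),
        rng.normal(loc=3.0, scale=1.0, size=(200, 1)),
    ])

    gm = GaussianMixture(n_components=2, random_state=0).fit(data)
    print(gm.means_.ravel())   # roughly -2 and 3
    print(gm.weights_)         # roughly 0.6 and 0.4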

Information Theory


  • Information Theory: the theory developed by Claude Shannon describing the amount of information in a message. Originally developed for message compression and reliable transmission, it has since been applied in various machine learning algorithms; a short entropy sketch follows.
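
A minimal entropy sketch, assuming NumPy: Shannon entropy measures the average information content of a discrete distribution, in bits.

    import numpy as np

    def entropy_bits(p):
        """Shannon entropy of a discrete distribution, in bits."""
        p = np.asarray(p, dtype=float)
        p = p[p > 0]                  # terms with zero probability contribute nothing
        return -np.sum(p * np.log2(p))

    print(entropy_bits([0.5, 0.5]))   # 1.0 bit: a fair coin flip
    print(entropy_bits([0.9, 0.1]))   # about 0.47 bits: a biased coin is less informative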

Philosophy


  • Philosophy of Probability

Community