Statistics

Overview


The statistics corner provides resources on the theory of probability and statistics, including libraries, components and techniques for performing statistical analysis.

Foundations


Probability and statistics build on measure theory to model uncertainty; in particular, they provide a framework for understanding choice under uncertainty. They are also foundational to machine learning.

  • Probability : describes the basic framework of probability, including its basic assumptions.
  • Random Variable : any real-valued (measurable) function whose domain is the set of outcomes.
  • Distributions : functions used to model the probabilities of the values a random variable can take. The distribution corner contains information on commonly used distributions, including tools for calculations such as computing the cumulative distribution value or the inverse (quantile) of a distribution; a short sketch follows this list.
  • Expectation : integrals of random variables.
  • Inequalities : lists a number of useful statistical inequalities.
  • Combinatorics
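
As an example of a distribution calculation, the following is a minimal sketch assuming SciPy and a standard normal distribution (both assumptions for illustration, not part of the corner's own tooling): it computes a cumulative distribution value and its inverse.

    from scipy.stats import norm

    # Cumulative distribution value: P(X <= 1.96) for a standard normal X
    p = norm.cdf(1.96)       # approximately 0.975

    # Inverse distribution (quantile): the x with P(X <= x) = 0.975
    x = norm.ppf(0.975)      # approximately 1.96

    print(p, x)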

Moments


  • Moments (expected value, standard deviation, ...) : the expected value of a random variable, as well as the expectation of that variable raised to various powers. The moments corner contains resources for doing moment calculations, including related quantities such as the variance and covariance (central moments).
  • Conditional Probability, Expectation and Bayes : describes how new information changes probabilities.
  • Method of Moments : a technique for fitting a distribution to a set of observations by matching sample moments to the distribution's theoretical moments; see the sketch after this list.
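
A minimal method-of-moments sketch, assuming NumPy and a gamma distribution as the model: the sample mean and variance are matched to the gamma's theoretical moments to recover its parameters.

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.gamma(shape=2.0, scale=3.0, size=10_000)   # synthetic observations

    # Sample moments
    m = data.mean()        # first moment (expected value)
    v = data.var()         # second central moment (variance)

    # Gamma distribution: mean = k * theta, variance = k * theta**2
    theta_hat = v / m
    k_hat = m ** 2 / v
    print(k_hat, theta_hat)    # roughly 2.0 and 3.0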

Exploratory Data Analysis


  • Exploratory data analysis : the process of examining data prior to any formal analysis in order to understand its basic characteristics and any trends visible to the naked eye. Typically this means charting the data along different sets of dimensions; a short sketch follows this list. The exploring data corner collects resources for doing exploratory analysis of data.
  • Outliers : one of the easiest ways to detect outliers (see below) is simply to chart the data and look for stray points. Of course, this is generally only practical for datasets of limited size and dimensionality.
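
A minimal exploratory sketch, assuming pandas and Matplotlib; the file name and column names are hypothetical placeholders for whatever dataset is at hand.

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("measurements.csv")   # hypothetical dataset

    print(df.describe())     # per-column summary statistics
    print(df.isna().sum())   # missing values per column

    # Chart the data to look for trends and stray points (potential outliers)
    df.plot(kind="scatter", x="x", y="y")
    plt.show()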

Data Preparation


  • Outliers and Data Cleansing: some analyses can be improved by handling outliers and faulty records prior to the analysis; a sketch of a simple outlier filter follows this list.
  • Dimensionality Reduction: some datasets have a large number of features, which can lead to overfitting or other problems such as unstable estimates.
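
One simple cleansing rule, sketched below with pandas, flags values more than 1.5 interquartile ranges outside the quartiles; both the rule and the toy data are illustrative assumptions, not a prescribed procedure.

    import pandas as pd

    df = pd.DataFrame({"value": [9.8, 10.1, 10.0, 9.9, 42.0, 10.2]})

    q1, q3 = df["value"].quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = df["value"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

    clean = df[mask]      # 42.0 is dropped as an outlier
    print(clean)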

Dependence Modeling


  • Statistical Dependence - reflects the statistical relationships between the variables in a dataset. In the simplest terms this shows up as correlation between variables; however, dependence can be deeper than correlation, including nonlinear and causal relationships. A short correlation sketch follows.
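
A minimal correlation sketch, assuming NumPy and synthetic data, contrasting a pair of dependent variables with an independent one.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(size=1000)
    y = 2.0 * x + rng.normal(size=1000)   # y depends on x
    z = rng.normal(size=1000)             # z is independent of x

    print(np.corrcoef(x, y)[0, 1])   # close to 1: strong linear dependence
    print(np.corrcoef(x, z)[0, 1])   # close to 0: no linear dependence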

Statistical Inference


  • Statistical Inference : refers to the process of inferring the properties of a data-generating process from a set of samples of that process. The properties are typically estimates of the moments of the distribution, but can be other statistics as well.

    Statistical inference underlies hypothesis testing and confidence intervals for quantities of interest, often used to determine whether a quantity is likely to be nonzero; a short sketch follows this list.
  • Random Sampling, Polling, Studies and Experiments - the typical application of statistical inference is to infer properties of a given population from a sample.
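
A minimal inference sketch, assuming SciPy and a synthetic sample: a one-sample t-test of whether the mean is zero, plus a 95% confidence interval for the mean.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    sample = rng.normal(loc=0.3, scale=1.0, size=50)   # samples from the process

    # Test the hypothesis that the mean is zero
    t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)

    # 95% confidence interval for the mean
    sem = stats.sem(sample)
    ci = stats.t.interval(0.95, len(sample) - 1, loc=sample.mean(), scale=sem)

    print(t_stat, p_value, ci)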

Regression


Regression is a general term for a range of techniques for estimating a value drawn from a continuous range of values.
  • Ordinary Least Squares Regression:
  • Logistic Regression: a regression technique that fits a logistic function to a set of data. It is used primarily to model dichotomous data, that is, data where the response variable can take only one of two values; see the sketch after this list.
  • Multinomial (Softmax) Regression:
  • Poisson Regression:
  • Generalized Linear Models:
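
A minimal logistic regression sketch, assuming scikit-learn and a synthetic dichotomous response.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(3)
    X = rng.normal(size=(200, 2))
    # Dichotomous response: 1 when a noisy linear score is positive, else 0
    y = (X @ np.array([2.0, -1.0]) + rng.normal(size=200) > 0).astype(int)

    model = LogisticRegression().fit(X, y)
    print(model.coef_, model.intercept_)
    print(model.predict_proba(X[:3]))   # fitted probabilities for the two classes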


Regularization: a technique that penalizes a regression for large coefficients, thereby pulling the estimated coefficients closer to zero than the standard regression would produce.
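
A minimal regularization sketch, assuming scikit-learn: ridge regression adds an L2 penalty, so its coefficients are pulled toward zero relative to ordinary least squares on the same data.

    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge

    rng = np.random.default_rng(4)
    X = rng.normal(size=(200, 5))
    y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=200)

    ols = LinearRegression().fit(X, y)
    ridge = Ridge(alpha=10.0).fit(X, y)   # L2 penalty on the coefficients

    print(ols.coef_)
    print(ridge.coef_)   # same signs, smaller magnitudes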

Time Series


  • Time Series are datasets where each point corresponds to a particular point in time. The nature of time series raises issues not present in other statistical datasets; in particular, observations are generally not independent of one another, which creates new difficulties for modeling. A short sketch follows.
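
A minimal sketch of the dependence issue, assuming NumPy: an AR(1) series where each point depends on the previous one, with the lag-1 autocorrelation making the dependence visible.

    import numpy as np

    rng = np.random.default_rng(5)

    # AR(1) series: each point depends on the previous point
    n, phi = 500, 0.8
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.normal()

    # Lag-1 autocorrelation, roughly phi for an AR(1) process
    print(np.corrcoef(x[:-1], x[1:])[0, 1])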

Simulations


  • Simulations are a common method for calculating various statistics of a distribution when an analytical answer is not readily at hand. The simulations corner contains resources related to solving problems using simulation methods.
  • Resampling is a method for sampling from a given dataset in order to estimate statistics of the distribution that generated the dataset in the first place; it includes methods such as the jackknife and the bootstrap (see the sketch after this list).
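
A minimal bootstrap sketch, assuming NumPy and a synthetic sample: resampling with replacement to estimate an interval for the median when no analytical answer is at hand.

    import numpy as np

    rng = np.random.default_rng(6)
    data = rng.exponential(scale=2.0, size=100)   # observed sample

    # Resample the data with replacement to approximate the sampling
    # distribution of the median
    boot_medians = np.array([
        np.median(rng.choice(data, size=len(data), replace=True))
        for _ in range(2000)
    ])

    print(np.median(data))                            # point estimate
    print(np.percentile(boot_medians, [2.5, 97.5]))   # bootstrap 95% interval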

Models


  • Count Processes
  • Survival and Event Analysis
  • Queueing
  • Mixture Models
  • Principal Components
  • Conditional Probability Models
    • Markov Models
    • Probabilistic Graphical Models are used to represent a multivariate probability distribution as a graph. The graph is a compact representation that can improve both computation and visualization.
    • Latent Variable and Factor Models
  • Gaussian Models are models where all of the distributions underlying the model are Gaussian (normal). This is usually a simplification, but it is often good enough and makes the models tractable and computationally feasible; Gaussian models are often a good starting point (see the Gaussian mixture sketch after this list).
  • Random Matrices
  • Sparse Models
  • Robust Statistics
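
A minimal sketch combining the mixture and Gaussian model ideas above, assuming scikit-learn: fitting a two-component Gaussian mixture to synthetic one-dimensional data.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(7)
    # Two Gaussian clusters with different means, scales and weights
    data = np.concatenate([
        rng.normal(loc=-2.0, scale=0.5, size=(300, 1)),
        rng.normal(loc=3.0, scale=1.0, size=(200, 1)),
    ])

    gm = GaussianMixture(n_components=2, random_state=0).fit(data)
    print(gm.means_.ravel())   # roughly -2 and 3
    print(gm.weights_)         # roughly 0.6 and 0.4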

Information Theory


  • Information Theory: the theory developed by Claude Shannon describing the amount of information in a message. Originally developed for message compression and reliable transmission, it has since been applied in various machine learning algorithms; a short entropy sketch follows.
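
A minimal entropy sketch, assuming NumPy: Shannon entropy measures the average information content of a discrete distribution, in bits.

    import numpy as np

    def entropy_bits(p):
        """Shannon entropy of a discrete distribution, in bits."""
        p = np.asarray(p, dtype=float)
        p = p[p > 0]                  # terms with zero probability contribute nothing
        return -np.sum(p * np.log2(p))

    print(entropy_bits([0.5, 0.5]))   # 1.0 bit: a fair coin flip
    print(entropy_bits([0.9, 0.1]))   # about 0.47 bits: a biased coin is less informative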

Philosophy


  • Philosophy of Probability

Community