Statistics

Overview

The statistics corner collects resources on the theory of probability and statistics, including libraries, components, and techniques for statistical analysis.

Foundations

Probability and statistics is a branch of measure theory used to model uncertainty; in particular, it provides a framework for understanding choice under uncertainty. It is also widely used in machine learning.

  • Probability : describes the basic framework of probability, including its basic assumptions.
  • Random Variable : any real-valued function whose domain is the set of outcomes.
  • Distributions : functions used to model the probabilities of the values a random variable takes. The distribution corner contains information on commonly used distributions, including tools for related calculations, such as computing the cumulative distribution value or the inverse distribution.
  • Expectation : integrals of random variables.
  • Inequalities : lists a number of useful statistical inequalities.
  • Combinatorics
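The distribution calculations mentioned above can be sketched in plain Python for the normal distribution; this is a minimal stdlib-only illustration (the function names are ours, and the inverse is found by simple bisection rather than a library quantile routine):

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Cumulative distribution function of the normal distribution, via erf."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def normal_inverse_cdf(p, mu=0.0, sigma=1.0, tol=1e-9):
    """Invert the normal CDF (quantile function) by bisection.

    Works because the CDF is strictly increasing; the search bracket of
    ten standard deviations covers any p away from 0 and 1.
    """
    lo, hi = mu - 10.0 * sigma, mu + 10.0 * sigma
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if normal_cdf(mid, mu, sigma) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

For example, `normal_inverse_cdf(0.975)` recovers the familiar two-sided 95% critical value of about 1.96.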

Moments

Exploratory Data Analysis

  • Exploratory data analysis : the process of examining data prior to any formal analysis in order to understand its basic characteristics and any trends visible to the naked eye. Typically, this means charting the data along different sets of dimensions. The exploring data corner collects resources for doing exploratory analysis of data.
  • Outliers : one of the easiest ways to detect outliers is to simply chart the data and look for stray points. Of course, this is generally only practical for datasets of limited size and dimensionality.
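Beyond eyeballing a chart, a simple numeric check is Tukey's IQR rule; a minimal sketch using only the standard library (the 1.5 multiplier is the conventional default, an assumption on our part rather than anything prescribed above):

```python
import statistics

def iqr_outliers(data, k=1.5):
    """Flag points lying more than k * IQR beyond the quartiles (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(data, n=4)  # quartiles of the sample
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lo or x > hi]
```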

Data Preparation

Dependence Modeling

  • Statistical Dependence : reflects the statistical relationships between variables in a dataset. In the simplest case this shows up as correlation between the variables; however, dependence may also be causal in nature.
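In that simplest case, dependence can be measured with the Pearson correlation coefficient; a self-contained sketch, not tied to any particular library:

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation: covariance scaled by the two standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

The result lies in [-1, 1]; note that a correlation near zero rules out only linear dependence, not dependence in general.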

Statistical Inference

  • Statistical Inference : refers to the process of inferring the properties of a data-generating process from a set of samples of that process. The properties are typically estimates of the moments of the distribution, but can be other statistics as well.

    Statistical inference underlies hypothesis testing, which produces confidence intervals for quantities of interest, often to determine whether a parameter is likely to be non-zero.
  • Random Sampling, Polling, Studies and Experiments : the typical application of statistical inference is to infer properties of a given population from a sample.
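As an illustration of a confidence interval, the sketch below computes an approximate 95% interval for a population mean. The normal critical value 1.96 is an assumption that holds for reasonably large samples; for small samples a t critical value would be more precise:

```python
import math
import statistics

def mean_confidence_interval(sample, z=1.96):
    """Approximate 95% confidence interval for the mean.

    Uses the sample standard deviation and the normal critical value z,
    so it is a large-sample approximation.
    """
    n = len(sample)
    mean = statistics.mean(sample)
    sem = statistics.stdev(sample) / math.sqrt(n)  # standard error of the mean
    return mean - z * sem, mean + z * sem
```

If the resulting interval excludes zero, that is informal evidence the underlying mean is non-zero.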

Regression

Regression : a general term for a range of techniques that estimate a value drawn from a continuous range of values.

Regularization : a technique that penalizes a regression for large coefficients, thereby pulling the coefficients back closer to zero than what the standard regression would produce.
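The shrinkage effect is easiest to see in the one-coefficient, no-intercept model y ≈ beta·x, where both the least-squares and the ridge (L2-penalized) estimates have closed forms. This is a hypothetical illustration; the penalty weight `lam` is chosen arbitrarily:

```python
def ols_slope(xs, ys):
    """Least-squares slope for the no-intercept model y ~ beta * x."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def ridge_slope(xs, ys, lam):
    """Ridge estimate: adding lam to the denominator shrinks beta toward zero."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)
```

For any positive `lam`, the ridge slope is strictly smaller in magnitude than the least-squares slope, which is exactly the pull toward zero described above.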

Time Series

  • Time Series : datasets in which each observation corresponds to a particular point in time. The nature of time series raises issues not present in other statistical datasets; in particular, dependence between points of the series creates new difficulties for modeling.
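One standard way to quantify the dependence between points of a series is the sample autocorrelation at a given lag; a minimal sketch (the name and normalization here are illustrative):

```python
def autocorrelation(series, lag):
    """Sample autocorrelation at the given lag, relative to the series mean.

    Values near +1 or -1 at a lag indicate strong linear dependence between
    observations that far apart; a lag of 0 always gives exactly 1.
    """
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[t] - mean) * (series[t + lag] - mean)
              for t in range(n - lag))
    return cov / var
```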

Simulations

  • Simulations : a common method for calculating various statistics of a distribution when an analytical answer is not readily at hand. The simulations corner contains resources related to solving problems using simulation methods.
  • Resampling : a method for sampling from a given dataset in order to estimate statistics about the distribution that generated the dataset in the first place; it includes methods such as the jackknife and the bootstrap.
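The bootstrap mentioned above can be sketched in a few lines: resample the dataset with replacement many times, recompute the statistic of interest each time, and read the spread of those recomputed values as its standard error. The resample count and seed below are arbitrary choices:

```python
import random
import statistics

def bootstrap_se_of_mean(data, n_resamples=2000, seed=0):
    """Bootstrap standard error of the mean.

    Each resample draws len(data) points with replacement; the standard
    deviation of the resampled means estimates the standard error.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    means = [
        statistics.mean(rng.choices(data, k=len(data)))
        for _ in range(n_resamples)
    ]
    return statistics.stdev(means)
```

Swapping `statistics.mean` for another statistic (a median, a trimmed mean) gives its standard error with no extra theory, which is the main appeal of the method.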

Models

Information Theory

  • Information Theory : the theory developed by Claude Shannon describing the amount of information in a message. Originally developed for message compression and reliable transmission, it has since been applied in various machine learning algorithms.
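The central quantity of the theory is Shannon entropy, the expected information content of a message; a minimal sketch in bits:

```python
import math

def shannon_entropy(probabilities):
    """Shannon entropy in bits: H = -sum(p * log2(p)).

    Zero-probability outcomes are skipped, following the convention
    that 0 * log2(0) = 0.
    """
    return -sum(p * math.log2(p) for p in probabilities if p > 0)
```

A fair coin has entropy 1 bit, while a certain outcome has entropy 0, matching the intuition that only uncertain messages carry information.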

Community