Statistics Corner

Overview


The statistics corner highlights libraries, components, and techniques for performing statistical analysis. The following sections provide resources for the various types of analysis.

Foundations


Probability and Statistics is a branch of measure theory used to model uncertainty; in particular, it provides a framework for understanding choice under uncertainty. It is also widely used in machine learning.

  • Probability : describes the basic framework of probability, including its basic assumptions.
  • Random Variable : any real-valued function whose domain is the set of outcomes.
  • Distributions : statistical distributions, that is, functions used to model the probabilities of the values a random variable can take. The distribution corner contains information on commonly used distributions, including tools for calculations such as evaluating the cumulative distribution function or its inverse (see the sketch after this list).
  • Expectation : integrals of random variables with respect to the underlying probability measure.
  • Inequalities : lists a number of useful statistical inequalities.
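
For example, the distribution calculations mentioned above (evaluating a cumulative distribution function and its inverse) can be sketched in Python with scipy.stats; this assumes SciPy is available and is only an illustration, not a reference to a specific tool in the distribution corner.

    from scipy import stats

    # Standard normal distribution N(0, 1)
    z = stats.norm(loc=0, scale=1)

    # Cumulative distribution value: P(Z <= 1.96)
    print(z.cdf(1.96))   # ~0.975

    # Inverse distribution (quantile function): the value z with P(Z <= z) = 0.975
    print(z.ppf(0.975))  # ~1.96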

Moments


Exploratory Data Analysis


Exploratory data analysis : the process of examining data prior to any formal analysis in order to understand its basic characteristics and any trends visible to the naked eye. Typically this means charting the data along different sets of dimensions. The exploring data corner collects resources for doing exploratory analysis of data.
Outliers : one of the easiest ways to detect outliers (see below) is simply to chart the data and look for stray points, as in the sketch that follows. Of course, this is generally only practical for datasets of limited size and dimensionality.
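
A minimal sketch of eyeballing outliers by charting the data, assuming NumPy and matplotlib are available; the stray values are injected purely for illustration.

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    data = rng.normal(loc=10.0, scale=2.0, size=200)
    data[:3] = [30.0, -5.0, 28.0]   # a few injected stray points

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    ax1.plot(data, linestyle="none", marker=".")  # scatter of raw values: outliers stand out
    ax1.set_title("raw values")
    ax2.boxplot(data)                             # boxplot flags points beyond the whiskers
    ax2.set_title("boxplot")
    plt.show()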

Data Preparation


Outliers and Data Cleansing: some analyses can be improved by handling outliers and faulty records prior to the analysis.
Dimensionality Reduction: some datasets have a large number of features, which can lead to overfitting or other anomalies.
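
As a sketch of both steps (assuming NumPy and scikit-learn; the data is synthetic and purely illustrative): faulty records can be dropped with a simple interquartile-range rule, and the feature count can be reduced with PCA.

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)

    # Data cleansing: drop records falling outside 1.5x the interquartile range
    col = rng.normal(size=500)
    col[:5] = 50.0                         # a few faulty records
    q1, q3 = np.percentile(col, [25, 75])
    iqr = q3 - q1
    cleaned = col[(col >= q1 - 1.5 * iqr) & (col <= q3 + 1.5 * iqr)]
    print(col.size, "->", cleaned.size)

    # Dimensionality reduction: keep enough principal components for 95% of the variance
    latent = rng.normal(size=(200, 5))
    X = latent @ rng.normal(size=(5, 50)) + 0.1 * rng.normal(size=(200, 50))
    X_reduced = PCA(n_components=0.95).fit_transform(X)
    print(X.shape, "->", X_reduced.shape)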

Dependence Modeling


Statistical Dependence - the statistical relationship between variables in a dataset. In the simplest case this shows up as a correlation between the variables; however, dependence may also be causal in nature.
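
A small sketch of measuring dependence as correlation (assuming NumPy and SciPy; the data is synthetic), keeping in mind that correlation alone does not establish a causal relationship.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    x = rng.normal(size=500)
    y = 2.0 * x + rng.normal(scale=0.5, size=500)  # y constructed to depend on x

    print(np.corrcoef(x, y)[0, 1])   # Pearson (linear) correlation, close to 1
    rho, p = stats.spearmanr(x, y)   # rank correlation, picks up monotone dependence
    print(rho)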

Statistical Inference


Statistical Inference : refers to the process of inferring the properties of a data-generating process based on a set of samples from that process. The properties are typically estimates of the moments of the distribution, but can be other statistics as well.
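
For instance, the first two moments of a data-generating process can be estimated from samples alone; a minimal sketch assuming NumPy, with a synthetic generating process whose true mean is 3 and standard deviation is 2.

    import numpy as np

    rng = np.random.default_rng(0)
    samples = rng.normal(loc=3.0, scale=2.0, size=1000)  # draws from the generating process

    print(samples.mean())       # estimate of the first moment (mean), close to 3
    print(samples.var(ddof=1))  # unbiased estimate of the variance, close to 4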

Statistical inference is used in hypothesis testing, which gives confidence intervals for variables of interest, often to determine whether it is likely that a variable is nonzero.
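
A minimal sketch of such a test (assuming SciPy; the sample is synthetic): a one-sample t-test of whether the underlying mean is zero, together with a 95% confidence interval for the mean.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    sample = rng.normal(loc=0.3, scale=1.0, size=50)  # draws from an unknown process

    # Test H0: the mean of the generating process is zero
    t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)
    print(t_stat, p_value)

    # 95% confidence interval for the mean
    sem = stats.sem(sample)
    lo, hi = stats.t.interval(0.95, df=len(sample) - 1, loc=sample.mean(), scale=sem)
    print(lo, hi)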

Random Sampling, Polling, Studies and Experiments - the typical application of statistical inference is to infer properties of a given population from a sample.
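
As a polling-style illustration (the counts are made up): a population proportion estimated from a random sample, with a normal-approximation margin of error.

    import math

    # Suppose a poll of n = 1000 randomly sampled voters finds 540 in favour
    n, k = 1000, 540
    p_hat = k / n
    margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)  # 95% margin of error
    print(p_hat, "+/-", round(margin, 3))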

Regression


Regression : a general term for a range of techniques for estimating a value drawn from a continuous range of values.
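
A minimal sketch of the simplest case, ordinary least squares on one predictor (assuming NumPy; the data is synthetic, with true slope 3 and intercept 1).

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, size=100)
    y = 3.0 * x + 1.0 + rng.normal(scale=2.0, size=100)  # noisy linear relationship

    # Fit y ~ slope * x + intercept by ordinary least squares
    A = np.column_stack([x, np.ones_like(x)])
    (slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)
    print(slope, intercept)  # close to 3 and 1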


Regularization: a technique that penalizes a regression for large coefficients, thereby pulling the coefficients back closer to zero than what the standard regression would produce.
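
A sketch of the shrinkage effect using ridge regression in closed form (assuming NumPy; the design matrix is synthetic and the penalty strength is chosen arbitrarily).

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(30, 10))            # few samples, many features
    beta_true = np.zeros(10)
    beta_true[:2] = [5.0, -3.0]
    y = X @ beta_true + rng.normal(size=30)

    lam = 1.0  # regularization strength; larger values shrink coefficients harder
    # Ridge estimate: (X'X + lam*I)^-1 X'y penalizes large coefficients
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(10), X.T @ y)
    beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    print(np.linalg.norm(beta_ols), ">", np.linalg.norm(beta_ridge))  # ridge pulls coefficients toward zero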

Time Series


Time Series are datasets where each point represents an observation at a particular point in time. The nature of time series creates issues not present in other statistical datasets; in particular, points in a series are generally not independent of one another, which creates new difficulties for modeling.
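
A small sketch of the independence issue (assuming NumPy): a simulated AR(1) series where each point depends on the previous one, so the lag-1 autocorrelation is far from zero.

    import numpy as np

    rng = np.random.default_rng(0)
    n, phi = 500, 0.8
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.normal()  # each point depends on its predecessor

    lag1 = np.corrcoef(x[:-1], x[1:])[0, 1]   # lag-1 autocorrelation
    print(lag1)  # close to phi = 0.8, so the points are clearly not independent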

Simulations


  • Simulations are a common method for calculating various statistics of a distribution when an analytical answer is not readily at hand. The simulations corner contains resources related to solving problems using simulation methods.
  • Resampling is a method of sampling from a given dataset in order to estimate statistics of the distribution that generated the dataset in the first place; it includes methods such as the jackknife and the bootstrap (see the sketch after this list).
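
A minimal bootstrap sketch (assuming NumPy; the dataset is synthetic): resample the data with replacement many times, recompute the statistic each time, and read a confidence interval off the resulting distribution.

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.exponential(scale=2.0, size=200)  # the observed dataset

    # Bootstrap: resample with replacement and recompute the median
    medians = np.array([
        np.median(rng.choice(data, size=data.size, replace=True))
        for _ in range(5000)
    ])
    lo, hi = np.percentile(medians, [2.5, 97.5])
    print(np.median(data), (lo, hi))  # point estimate and 95% bootstrap interval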

Models


Robust Statistics


Robust Statistics : statistical methods that remain reliable when the data contain outliers or depart from model assumptions.

Information Theory


Information Theory: the theory developed by Claude Shannon that quantifies the amount of information in a message. It was originally developed for message compression and reliable transmission, but has since been extended for use in various machine learning algorithms.
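
A small sketch of the central quantity, Shannon entropy, assuming NumPy; the distributions are toy examples.

    import numpy as np

    def entropy_bits(p):
        """Shannon entropy of a discrete distribution, in bits."""
        p = np.asarray(p, dtype=float)
        p = p[p > 0]                      # terms with zero probability contribute nothing
        return -(p * np.log2(p)).sum()

    print(entropy_bits([0.5, 0.5]))   # fair coin: 1 bit per toss
    print(entropy_bits([0.9, 0.1]))   # biased coin carries less information (~0.47 bits)
    print(entropy_bits([0.25] * 4))   # uniform over 4 symbols: 2 bits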

Philosophy


Philosophy of Probability: interpretations of what probability statements mean, for example the frequentist and Bayesian (subjective) views.
