bank deposits - fitting a hazard function to age

Fitting a Hazard Function to Age

Overview

Often, it is believed that the hazard function should

Fitting to Age

When the hazard function is assumed to be constant, fitting it is mathematically simple. Split your data into time intervals, for instance months. Then, for each month, count the number of active accounts, labeled {% N_i %}, and the number of account closings in that month, labeled {% C_i %}.

Then the measured probability of closure (the hazard rate) in any period is

{% p_i = C_i / N_i %}

Because we are hypothesizing that the rate is constant and unaffected by any factors, you can view each account in each period as a separate independent observation. In that case, the total number of observations is

{% N = \sum N_i %}

and the number of closures is

{% C = \sum C_i %}

and the calculated probability is

{% \sum C_i / \sum N_i %}
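
As a minimal sketch of the pooled calculation above, assuming the monthly counts are already available as two arrays. The counts and the names `active`, `closed`, and `pooledHazard` are hypothetical and used only for illustration.

```javascript
// Hypothetical monthly counts: active accounts N_i and closings C_i.
const active = [1000, 980, 965, 950];
const closed = [20, 15, 15, 10];

// Pooled estimate: treat every account-month as one observation,
// so the estimate is (sum of C_i) / (sum of N_i).
function pooledHazard(activeCounts, closureCounts) {
  if (activeCounts.length !== closureCounts.length) {
    throw new Error('count arrays must have the same length');
  }
  const N = activeCounts.reduce((sum, n) => sum + n, 0);
  const C = closureCounts.reduce((sum, c) => sum + c, 0);
  return C / N;
}

// Per-period rates p_i = C_i / N_i, shown for comparison.
const perPeriod = closed.map((c, i) => c / active[i]);

// Pooled rate across all periods.
const h = pooledHazard(active, closed);
console.log(perPeriod, h);
```

The pooled value is not the simple average of the per-period rates. It weights each period by its number of active accounts, which follows from treating each account-month as an independent observation.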

The above calculated probability is for the time interval that the data set is segmented into. You could set your hazard function equal to the calculated value:

{% h(t) = \sum C_i / \sum N_i %}

This assumes that time is measured in units given by the segmentation of your dataset. If the segmented interval represents 1 month, then {% \Delta t=5 %} represents a change of 5 months. If you wish to use a different interval as your unit of time, you will need to appropriately scale your function value.
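
One possible way to rescale, sketched under the assumption that the fitted value is a constant per-period closure probability: an account survives {% k %} periods with probability {% (1-p)^k %}, so the probability of closing within {% k %} periods is {% 1-(1-p)^k %}. The function name `rescaleHazard` and the monthly value used here are hypothetical.

```javascript
// Convert a constant per-period closure probability into the closure
// probability over a horizon of `periods` consecutive periods.
// Assumes the hazard is constant and periods are independent.
function rescaleHazard(p, periods) {
  return 1 - Math.pow(1 - p, periods);
}

const monthly = 0.015;                     // hypothetical fitted monthly value
const annual = rescaleHazard(monthly, 12); // probability of closing within 12 months
console.log(annual);
```

For small values the result is close to, but somewhat less than, simply multiplying the monthly value by 12.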

Data Work

The challenge in implementation is getting the counts needed for the above calculation. Typically, organizations capture monthly data and assign a status to each account for that month. The following represents a common structure for account records.


{
  account_id:'1',
  date:'2010-03-31',
  status:'OPEN'
},
{
  account_id:'1',
  date:'2010-04-30',
  status:'CLOSED'
}

These two records refer to the same account at different dates. Data analysts must understand how to measure an account closure. In this case, given that the accounts are sampled at the end of each month, it is fair to assume that the account closed in April, but you would not know this without being able to look at the record in March.

Grouping the Data

Assuming that the data is loaded into a variable named data, the first step is to group the records by date and then by account number. This can be done using the group API.


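let gp = await import('/lib/utilities/v1.0.0/group.js');

// Nest the records as groups[date][account_id] -> array of records.
let groups = gp.group(data, p=>[p.date, p.account_id]).toObj();

// ISO-formatted dates (YYYY-MM-DD) sort chronologically as strings.
let dates = Object.keys(groups);
dates.sort();

let total = 0;    // account-periods at risk of closing
let closures = 0; // closures observed
let lastDate;
for(let date of dates){
  // Skip the first date: there is no prior period to compare against.
  if(lastDate !== undefined){
    for(let id in groups[date]){
      let acct = groups[date][id][0];
      // An open account counts as one observation for this period.
      if(acct.status === 'OPEN') total += 1;
      // An account that is closed now but was open in the prior period
      // closed during this period: count it as an observation and a closure.
      if(acct.status === 'CLOSED' && (id in groups[lastDate]) && groups[lastDate][id][0].status === 'OPEN'){
        closures += 1;
        total += 1;
      }
    }
  }
  lastDate = date;
}

// Pooled hazard estimate per period.
let prob = closures / total;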


In this code, we group by date and then by account number. We then iterate through the dates, skipping the first one, because a prior date is needed to determine whether an account with status CLOSED was open in the prior month. (This isn't necessary if closed accounts drop out of your dataset after the month in which they close; however, most companies continue to keep a record of the account.)

Each account that is open on the given date, or that is closed but was open in the prior period, adds 1 to the total number of accounts for that period. Each account that is closed in the given period but was open in the prior period also adds 1 to the number of closures.
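
The same counting rule can be sketched without the group utility, using only built-in `Map` objects. This is an illustrative version under stated assumptions, not the library's implementation: the function name `countClosures` and the sample records are hypothetical, and each account is assumed to have at most one record per date.

```javascript
// Count account-periods at risk and closures from monthly status records.
function countClosures(records) {
  // Build date -> (account_id -> status).
  const byDate = new Map();
  for (const r of records) {
    if (!byDate.has(r.date)) byDate.set(r.date, new Map());
    byDate.get(r.date).set(r.account_id, r.status);
  }

  // ISO-formatted dates sort chronologically as strings.
  const dates = [...byDate.keys()].sort();

  let total = 0;
  let closures = 0;
  // Start at index 1: the first date has no prior period to compare against.
  for (let i = 1; i < dates.length; i++) {
    const prev = byDate.get(dates[i - 1]);
    for (const [id, status] of byDate.get(dates[i])) {
      if (status === 'OPEN') {
        total += 1;
      } else if (status === 'CLOSED' && prev.get(id) === 'OPEN') {
        // Closed now, open last period: the closure happened this period.
        closures += 1;
        total += 1;
      }
    }
  }
  return { total, closures, prob: total > 0 ? closures / total : NaN };
}

// Hypothetical sample: account 1 closes in April, account 3 closes in May.
const sample = [
  { account_id: '1', date: '2010-03-31', status: 'OPEN' },
  { account_id: '2', date: '2010-03-31', status: 'OPEN' },
  { account_id: '1', date: '2010-04-30', status: 'CLOSED' },
  { account_id: '2', date: '2010-04-30', status: 'OPEN' },
  { account_id: '3', date: '2010-04-30', status: 'OPEN' },
  { account_id: '1', date: '2010-05-31', status: 'CLOSED' },
  { account_id: '2', date: '2010-05-31', status: 'OPEN' },
  { account_id: '3', date: '2010-05-31', status: 'CLOSED' },
];

const result = countClosures(sample);
console.log(result);
```

Account 1 is still recorded as CLOSED in May, but it is not counted again because it was not open in April.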
