Fitting a Constant Hazard Function

Overview


A common simplifying assumption when modeling account closures is that the probability of an account closing in any given time period is constant.

Fitting a Constant


When you assume a constant hazard function, fitting it is mathematically simple. Split your data into time intervals, for instance months. Then, for each interval {% i %}, count the number of active observations, labeled {% N_i %}, and the number of events, labeled {% C_i %}.

For instance, if you were building a survival function to predict when accounts at a bank close, {% N_i %} would be the number of open accounts in the period, and {% C_i %} would be the number of account closures in that period.

Then the measured probability of closure (the hazard rate) in any period is
{% p_i = C_i / N_i %}
Because we are hypothesizing that the rate is constant and unaffected by any other factors, each account in each period can be viewed as a separate, independent observation. In that case, the total number of observations is
{% N = \sum N_i %}
and the number of closures is
{% C = \sum C_i %}
and the calculated probability is
{% C / N = \sum C_i / \sum N_i %}
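
As a small illustration of the pooling step, the sketch below computes the per-period rates {% p_i %} and the pooled estimate from monthly counts. The counts are hypothetical values chosen for the example, not real data, and the variable names are our own.

```javascript
// Hypothetical monthly counts (illustrative values only).
// N[i] = active accounts in month i, C[i] = closures in month i.
const N = [1000, 980, 965, 950];
const C = [20, 15, 15, 10];

// Per-period hazard estimates p_i = C_i / N_i.
const perPeriod = N.map((n, i) => C[i] / n);

// Pooled estimate: every account-month counts as one observation,
// so we divide total closures by total account-months.
const sum = xs => xs.reduce((a, b) => a + b, 0);
const pooled = sum(C) / sum(N);

console.log(perPeriod, pooled);
```

With these numbers the pooled value is roughly 0.0154. The pooled estimate is the {% N_i %}-weighted average of the per-period rates, so months with more active accounts carry more weight.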


The probability calculated above applies to the time interval into which the data set is segmented. You could set your hazard function to the calculated value:
{% h(t) = \sum C_i / \sum N_i %}
This assumes that time is measured in units given by the segmentation of your dataset. If the segmented interval represents 1 month, then {% \Delta t=5 %} represents a change of 5 months. If you wish to use a different interval as your unit of time, you will need to appropriately scale your function value.
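
One possible reading of "scaling", sketched below, treats the fitted value as a per-period probability {% p %}. Under a constant hazard, the chance of surviving {% k %} periods is {% (1-p)^k %}, so the chance of closing within {% k %} periods is {% 1-(1-p)^k %}. The function names and the monthly value are hypothetical, chosen only for illustration.

```javascript
// Probability that an account survives k periods
// under a constant per-period closure probability p.
function survival(p, k) {
  return Math.pow(1 - p, k);
}

// Closure probability over an interval that is k periods long.
function rescaleHazard(p, k) {
  return 1 - survival(p, k);
}

const monthly = 0.015;                      // hypothetical fitted monthly value
const annual = rescaleHazard(monthly, 12);  // probability of closing within 12 months
console.log(annual);
```

For small {% p %} the result is close to the simple linear scaling {% k \cdot p %}, but always slightly below it, because an account that has already closed cannot close again.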

Data Work


The challenge in implementation is getting the counts needed for the above calculation. Typically, organizations capture monthly data and assign a status to each account for that month. The following represents a common structure for account records.


{
  account_id:'1',
  date:'2010-03-31',
  status:'OPEN'
},
{
  account_id:'1',
  date:'2010-04-30',
  status:'CLOSED'
}
					


These two records refer to the same account at different dates. Data analysts must understand how to measure an account closure. In this case, given that the accounts are sampled at the end of each month, it is fair to assume that the account closed in April, but you would not know this without being able to look at the record in March.

Grouping the Data

Assuming that the records are loaded into a variable named data, the first step is to group them by date and then by account number. This can be done using the group API.

let gp = await import('/lib/utilities/v1.0.0/group.js');

let groups = gp.group(data, p=>[p.date, p.account_id]).toObj();
let dates = Object.keys(groups);
dates.sort();
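let total = 0;     // account-periods observed (the sum of N_i)
let closures = 0;  // closures observed (the sum of C_i)
let lastDate;
for(let date of dates){
  // Skip the first date: a closure can only be detected against a prior period.
  if(lastDate !== undefined){
    for(let id in groups[date]){
      let acct = groups[date][id][0];
      // Open at this date: counts as an observation with no closure.
      if(acct.status === 'OPEN') total += 1;
      // Closed now but open in the prior period: the account closed during this period.
      if(acct.status === 'CLOSED' && (id in groups[lastDate]) && groups[lastDate][id][0].status === 'OPEN'){
        closures += 1;
        total += 1;
      }
    }
  }
  lastDate = date;
}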

let prob = closures / total;
					


In this code, we group by date and then by account number. We then iterate through the dates, skipping the first one, because we need a prior date to determine whether an account with status CLOSED was open in the prior month. (This isn't necessary if closed accounts drop out of your dataset after the month in which they close; however, most companies keep a record of the account.)

For each account that is open on the given date, or that is closed but was open in the prior period, we increase the total number of accounts by 1. For any account that is closed in the given period but was open in the prior period, we also increase the number of closures by 1.
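
To make the counting rule concrete, here is a minimal, dependency-free sketch of the same logic using a plain Map in place of the group utility. The sample records and the helper names indexByDate and countClosures are hypothetical, introduced only for this illustration.

```javascript
// Hypothetical month-end records for three accounts (illustrative only).
const records = [
  { account_id: '1', date: '2010-03-31', status: 'OPEN' },
  { account_id: '2', date: '2010-03-31', status: 'OPEN' },
  { account_id: '3', date: '2010-03-31', status: 'OPEN' },
  { account_id: '1', date: '2010-04-30', status: 'CLOSED' },
  { account_id: '2', date: '2010-04-30', status: 'OPEN' },
  { account_id: '3', date: '2010-04-30', status: 'OPEN' },
  { account_id: '1', date: '2010-05-31', status: 'CLOSED' },
  { account_id: '2', date: '2010-05-31', status: 'OPEN' },
  { account_id: '3', date: '2010-05-31', status: 'CLOSED' },
];

// Build an index: byDate.get(date).get(account_id) -> status.
function indexByDate(rows) {
  const byDate = new Map();
  for (const r of rows) {
    if (!byDate.has(r.date)) byDate.set(r.date, new Map());
    byDate.get(r.date).set(r.account_id, r.status);
  }
  return byDate;
}

// Count observations and closures across all periods after the first.
function countClosures(rows) {
  const byDate = indexByDate(rows);
  const dates = [...byDate.keys()].sort(); // ISO dates sort chronologically as strings
  let total = 0;
  let closures = 0;
  for (let i = 1; i < dates.length; i++) {
    const prev = byDate.get(dates[i - 1]);
    for (const [id, status] of byDate.get(dates[i])) {
      // Closed now, open in the prior period: closed during this period.
      const newlyClosed = status === 'CLOSED' && prev.get(id) === 'OPEN';
      if (status === 'OPEN' || newlyClosed) total += 1;
      if (newlyClosed) closures += 1;
    }
  }
  return { total, closures, prob: total > 0 ? closures / total : NaN };
}

const result = countClosures(records);
console.log(result);
```

In this sample, account 1 closes in April and account 3 closes in May. Account 1's CLOSED record in May is ignored because it was already closed in April, so the sketch counts 2 closures out of 5 observations.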
