Fitting a Hazard Function to Age
Overview
Often, it is believed that the hazard function should
Fitting to Age
When assuming a constant hazard function, fitting the function is
mathematically fairly simple. The process involves splitting
your data into time intervals, for instance months, and then
calculating the number of active accounts in each interval, labeled
{% N_i %}, and the number
of account closings in that interval, labeled {% C_i %}.
The measured probability of closure (the hazard rate)
in any period is
{% p_i = C_i / N_i %}
Because we are hypothesizing that the rate is constant and
unaffected by any factors, you can view each account in each period
as a separate independent observation. In that case, the total
number of observations is
{% N = \sum N_i %}
and the number of closures is
{% C = \sum C_i %}
and the calculated probability, pooled over all periods, is
{% C / N = \sum C_i / \sum N_i %}
The above calculated probability is for the time interval that
the data set is segmented into. You could set your hazard function
equal to the calculated value:
{% h(t) = \sum C_i / \sum N_i %}
This assumes that time is measured in units given by the segmentation of
your dataset. If the segmented interval represents 1 month, then
{% \Delta t=5 %} represents a change of 5 months. If you wish
to use a different interval as your unit of time, you will need
to appropriately scale your function value.
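The calculation above can be sketched in a few lines of JavaScript. This is a minimal illustration, not part of the original analysis: the `periods` array and its counts are made-up values, and `pooledHazard` and `probabilityOver` are hypothetical helper names. The last helper shows one way to rescale to a longer interval, assuming the per-period closure probability is constant and periods are independent.

```javascript
// Hypothetical monthly counts: active accounts N_i and closures C_i.
const periods = [
  { active: 1000, closed: 20 },
  { active: 980, closed: 25 },
  { active: 955, closed: 15 },
];

// Per-period estimates p_i = C_i / N_i.
const perPeriod = periods.map(p => p.closed / p.active);

// Pooled estimate: sum of C_i divided by sum of N_i.
function pooledHazard(periods) {
  const totalActive = periods.reduce((sum, p) => sum + p.active, 0);
  const totalClosed = periods.reduce((sum, p) => sum + p.closed, 0);
  return totalClosed / totalActive;
}

// Probability of closing within k periods, assuming the same
// independent closure probability p applies in every period.
function probabilityOver(p, k) {
  return 1 - Math.pow(1 - p, k);
}

const monthly = pooledHazard(periods);
console.log('monthly hazard:', monthly);
console.log('closure probability over 12 months:', probabilityOver(monthly, 12));
```

For small monthly values, the 12-month figure is close to 12 times the monthly value, which is the simple linear scaling of a rate.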
Data Work
The challenge for implementation is getting the counts needed
for the above calculation. Typically, organizations capture
monthly data and assign a status to each account for that month. The
following represents a common structure for account records.
{
account_id:'1',
date:'2010-03-31',
status:'OPEN'
},
{
account_id:'1',
date:'2010-04-30',
status:'CLOSED'
}
These two records refer to the same account at different dates.
Data analysts must understand how to measure an account closure.
In this case, given that the accounts are sampled at the end of
each month, it is fair to assume that the account closed in April,
but you would not know this without being able to look at the record
in March.
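As a rough sketch of this idea, the closure month for a single account can be found by looking for the first record whose status is CLOSED immediately after a record whose status is OPEN. The helper name `findClosureDate` is hypothetical, and the sketch assumes dates are strings in YYYY-MM-DD form, as in the records above, so that they sort chronologically as plain strings.

```javascript
// Returns the date of the first record that is CLOSED directly after an
// OPEN record, or null if no such transition is observed.
function findClosureDate(records) {
  // Copy before sorting so the caller's array is not modified.
  const sorted = [...records].sort((a, b) =>
    a.date < b.date ? -1 : a.date > b.date ? 1 : 0
  );
  for (let i = 1; i < sorted.length; i++) {
    if (sorted[i - 1].status === 'OPEN' && sorted[i].status === 'CLOSED') {
      return sorted[i].date;
    }
  }
  return null;
}

const accountRecords = [
  { account_id: '1', date: '2010-04-30', status: 'CLOSED' },
  { account_id: '1', date: '2010-03-31', status: 'OPEN' },
];
console.log(findClosureDate(accountRecords));
```

With only the April record available, the function returns null, which reflects the point above: without the March record, the closure month cannot be determined.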
Grouping the Data
Assuming that the data is loaded into a variable, the first step
is to group the data by date and account number. This
can easily be done using the
group api.
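let gp = await import('/lib/utilities/v1.0.0/group.js');
// Nest the records by date, then by account id.
let groups = gp.group(data, p=>[p.date, p.account_id]).toObj();
// Dates in YYYY-MM-DD form sort chronologically as strings.
let dates = Object.keys(groups);
dates.sort();
let total = 0;     // account-periods counted as active
let closures = 0;  // closures observed
let lastDate;
for(let date of dates){
  // Skip the first date: it has no prior period to compare against.
  if(lastDate !== undefined){
    for(let id in groups[date]){
      let acct = groups[date][id][0];
      // Was this account open at the end of the prior period?
      let wasOpen = id in groups[lastDate] && groups[lastDate][id][0].status === 'OPEN';
      if(acct.status === 'OPEN') total += 1;
      if(acct.status === 'CLOSED' && wasOpen){
        closures += 1;
        total += 1;
      }
    }
  }
  lastDate = date;
}
let prob = closures / total;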
In this code, we group by date and then account number. Then we
iterate through the dates, skipping the first date. We do this because
we need a prior date in order to determine whether an account with status
CLOSED was open in the prior month. (This isn't necessary if closed
accounts drop out of your dataset after the month they are closed; however,
most companies keep a record of closed accounts.)
For each account that is open on the given date, or that is closed but
was open in the prior period, we increase the total number of accounts
by 1. For any account that is closed in the given period but was
open in the prior period, we also increase the number of closures by 1.
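For readers without access to the group api, the same counting logic can be sketched in plain JavaScript using nested `Map` objects. This is an illustrative alternative rather than the method used above: the function name `estimateHazard` and the sample records are hypothetical, and the sketch assumes one record per account per date with dates in YYYY-MM-DD form.

```javascript
function estimateHazard(data) {
  // Index statuses by date, then by account id.
  const byDate = new Map();
  for (const rec of data) {
    if (!byDate.has(rec.date)) byDate.set(rec.date, new Map());
    byDate.get(rec.date).set(rec.account_id, rec.status);
  }

  // YYYY-MM-DD strings sort chronologically with the default sort.
  const dates = [...byDate.keys()].sort();

  let total = 0;     // account-periods counted as active
  let closures = 0;  // closures observed

  // Start at the second date: each date is compared with the one before it.
  for (let i = 1; i < dates.length; i++) {
    const prev = byDate.get(dates[i - 1]);
    const curr = byDate.get(dates[i]);
    for (const [id, status] of curr) {
      const closedNow = status === 'CLOSED' && prev.get(id) === 'OPEN';
      if (status === 'OPEN' || closedNow) total += 1;
      if (closedNow) closures += 1;
    }
  }
  return { total, closures, prob: total > 0 ? closures / total : NaN };
}

// Made-up sample: account 1 closes in April, account 2 stays open.
const sample = [
  { account_id: '1', date: '2010-03-31', status: 'OPEN' },
  { account_id: '2', date: '2010-03-31', status: 'OPEN' },
  { account_id: '1', date: '2010-04-30', status: 'CLOSED' },
  { account_id: '2', date: '2010-04-30', status: 'OPEN' },
  { account_id: '1', date: '2010-05-31', status: 'CLOSED' },
  { account_id: '2', date: '2010-05-31', status: 'OPEN' },
];
const result = estimateHazard(sample);
console.log(result);
```

In the sample, account 1 contributes one active month and one closure in April and nothing in May, while account 2 contributes one active month in each of April and May.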