etl and batch processing

ETL and Batch processing

Overview

data mesh

Extracting File Data

One of the simplest and most common sources of data come from reading data from a file. If the file is of limited size (typically less than 500MB), you can use the file.mjs module to just read the data.


const fl = require('./file.mjs');

let data = await fl.load('/directory/file.csv');

The 500MB size limit comes from the fact that the file contents are cast as a Javascript string when it is loaded. Most browsers have a limit on string size, although it is quite large. This means that if you data cannot fit within a single file, you have two options. You can split your file into multiple files, or you can use the utilities for reading files line by line. Reading a file line by line splits the file into lines as it is being read, with each line being a separate string and thereby bypassing the string lize limit.

Splitting your data into separate files is a very useful thing to do, especially if you can group the data by column values by file. That is, separate your data into separate files by date, for instance. This makes it rather easy to filter your data by date without having to load all the data files into memory first.

If your files are still too large, you can read the file line by line. The file.mjs contains a method for reading files line by line. If your data is structured (that is, its a json file, or a csv file), you can use the file-data.mjs module to read the file, record by record.

Extracting Database Data


const fd = require('./file-data.js');

Transforming Data

You can apply transformations to your data after it has loaded. In fact you can apply those transformations in the same line of code that loads the file.


const fl = require('./file.mjs');

let data = (await fl.load('/directory/file.csv'))
               .map(p=>{
                 return {
                   //add props here
                 };
               });

Dimensional Data

Often a dataset is structured in a dimensional fashion. That is, the core set of data is a set of (usually) timestamped facts. For example, a set of bank transactions (deposits and withdrawals), each with a time stamp. The data is usually stored with a set of properties (or columns) that provide information about the fact. (in this case, a transcation equals a fact) A bank transaction would likely have a bank account number attached to it.

These additional properties, such as the bank account number, are called the dimensions. Each property represents a separate dimension. When processing the transaction data, you may need the dimension data in order to appropriately filter and transform the data. For example, you may to count the number of transactions for accounts with a certain average balance.

When saving the fact data, it is typical to only save the account number with the fact, not the other information about the account (such as account status or average balance) This is because there may be several fact records tied to the same account. If the data is structured like a table, and the account information is included, this would mean that the account status is duplicated across all transaction records that are tied to the same account.

The solution to this problem is to save the account information in a separate file, and then to join the data.

A simple way to join the data is the use the grouping feature of the $list api.


const fl = require('./file.mjs');

let data = await fl.load('transactions.csv');
let accounts = await fl.load('accounts.csv');

let groups = $list(accounts).group(p=>p['account-number']).object;
data.forEach(p=>{
  p.account = groups[p['account-number']];
})

Sometimes the fact dimensions are in a more complex data structure. For example, the chart of accounts for most accounting systems is structured like a tree.

Saving Data

Once you have loaded a dataset, then transformed it according to your needs, you will likely want to persist the resulting data to a file or database. If the data is small (again, less than 500MB) you can just write it to a file.

Automated Documents

pdflatex is part of MikTex distribution.