ETL and Batch Processing
Overview
Extracting File Data
One of the simplest and most common sources of data is a file. If the file is of limited
size (typically less than 500MB), you can use the
file.mjs module to read the data in one call.
import * as fl from './file.mjs';
let data = await fl.load('/directory/file.csv');
The 500MB size limit comes from the fact that the file contents are cast as a single JavaScript string when the
file is loaded. JavaScript engines place a limit on string length; it is quite large, but a sufficiently large
file will exceed it. This means that if your data does not fit within that limit, you have two options. You can
split your file into multiple files, or you can use the utilities for reading files line by line. Reading a file
line by line splits the file into lines as it is being read, with each line being a separate string, thereby
bypassing the string size limit.
Splitting your data into separate files is often useful, especially if you can partition the data by a
column value, one value per file. For instance, separate your data into one file per date. This
makes it easy to filter the data by date without having to load every file into memory first.
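Bucketing rows by a column value before writing them out can be sketched as follows (partitionBy is a hypothetical helper, not part of file.mjs):

```javascript
// partitionBy: bucket rows by the value of one column, so that each bucket
// can then be written to its own file (e.g. one file per date).
function partitionBy(rows, column) {
  const buckets = {};
  for (const row of rows) {
    const key = row[column];
    (buckets[key] ??= []).push(row);
  }
  return buckets;
}
```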
If your files are still too large, you can read them line by line;
file.mjs contains a method for reading a file line by line.
If your data is structured (that is, it's a JSON file or a CSV file), you can use the
file-data.mjs module to read the file record by record.
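The parser inside file-data.mjs is not shown here; a minimal sketch of turning CSV text into records (one object per row, keyed by the header line, with no handling of quoted fields) looks like this:

```javascript
// csvToRecords: split CSV text into an array of objects, using the first
// line as the column names. Naive: does not handle quotes or embedded commas.
function csvToRecords(text) {
  const [header, ...rows] = text.trim().split('\n');
  const cols = header.split(',');
  return rows.map(row => {
    const values = row.split(',');
    return Object.fromEntries(cols.map((c, i) => [c, values[i]]));
  });
}
```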
import * as fd from './file-data.mjs';
Extracting Database Data
Transforming Data
You can apply transformations to your data after it has been loaded.
In fact, you can apply those transformations in the same statement that loads the file.
import * as fl from './file.mjs';
let data = (await fl.load('/directory/file.csv'))
  .map(p => {
    return {
      // add props here
    };
  });
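A concrete version of such a transformation might cast string columns and derive new properties. The field names below are hypothetical; adjust them to your own columns:

```javascript
// addDerivedProps: cast a numeric column (CSV values arrive as strings)
// and derive a year property for later grouping.
function addDerivedProps(rows) {
  return rows.map(p => ({
    ...p,
    amount: Number(p.amount),                  // string -> number
    year: p.date ? p.date.slice(0, 4) : null,  // derived grouping key
  }));
}
```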
Dimensional Data
Often a dataset is structured in a
dimensional fashion. That is, the core of the data is a set of
(usually) timestamped facts. For example, a set of bank transactions (deposits and withdrawals), each with a timestamp.
The data is usually stored with a set of properties (or columns) that provide information about the fact (in this case,
each transaction is a fact). A bank transaction would likely have a bank account number attached to it.
These additional properties, such as the bank account number, are called the dimensions. Each property represents a
separate dimension. When processing the transaction data, you may need the dimension data in order to appropriately
filter and transform the facts. For example, you may want to count the number of transactions for accounts with a certain
average balance.
When saving the fact data, it is typical to save only the account number with the fact, not the other information about
the account (such as account status or average balance). This is because several fact records may be tied to the same
account. If the data is structured like a table and the account information is included, the account
status would be duplicated across all transaction records tied to the same account.
The solution to this problem is to save the account information in a separate file, and then
to join the data.
A simple way to join the data is to use the grouping feature of the $list API.
import * as fl from './file.mjs';
let data = await fl.load('transactions.csv');
let accounts = await fl.load('accounts.csv');
let groups = $list(accounts).group(p => p['account-number']).object;
data.forEach(p => {
  p.account = groups[p['account-number']];
});
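$list is a custom API; the same join can be sketched in plain JavaScript with a Map as the lookup index (joinOnKey is a hypothetical name):

```javascript
// joinOnKey: attach the matching dimension record to each fact record,
// indexing the dimension rows by the shared key.
function joinOnKey(facts, dims, key, prop) {
  const index = new Map(dims.map(d => [d[key], d]));
  return facts.map(f => ({ ...f, [prop]: index.get(f[key]) }));
}
```

Building the index once and then looking up each fact keeps the join linear in the number of records, rather than scanning the dimension list per fact.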
Sometimes the fact dimensions form a more complex data structure. For example, the chart of accounts
for most accounting systems is structured like a tree.
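One way to make such a tree joinable like a flat table is to walk it depth-first, collecting every node (flattenTree is a hypothetical helper, and the children property name is an assumption):

```javascript
// flattenTree: depth-first traversal of a dimension tree, such as a chart
// of accounts, returning every node in a flat array.
function flattenTree(node, out = []) {
  out.push(node);
  for (const child of node.children || []) flattenTree(child, out);
  return out;
}
```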
Saving Data
Once you have loaded a dataset and transformed it according to your needs, you will likely want to persist the resulting data
to a file or database.
If the data is small (again, less than 500MB), you can simply write it to a file.
Automated Documents
pdflatex is part of the MiKTeX distribution.