Large Data Set Chunks

Overview


Large datasets can pose problems for running analytics that are not present when analyzing a smaller dataset. Typically these problems arise from the resource limits of the machine on which the analytics are run.

Benefits of Chunked Data


Splitting the data into chunks has several benefits.

  • Makes development easier. For analysts building an analysis of the dataset, it is often easier to build the analysis on a subset of the data before running it on the entire dataset. If the data is chunked, a single chunk can be pulled during development, and the production code can then be run on the entire dataset.
  • Can utilize the browser's cache. Because the data is downloaded in pieces, each piece can use the browser's cache mechanism, which means that accessing the data after the first access may be highly optimized.

Merging Data


Merging data refers to the situation where you have two arrays and wish to create a single array that contains the items of both, concatenated together. This is simple using the JavaScript spread operator. The spread operator is written as three dots, i.e. "...". When the three dots precede an array, they expand to the items of that array, so the following sample code creates an array with the elements of both arrays.


let data = [...data1, ...data2];
						 


When dealing with a large number of datasets, it is faster to add each record to the final array directly. The following code iterates over an array of arrays and pushes each record onto a single array.


let data = [];
for(let set of datasets){
  set.forEach(p=>{
    data.push(p);
  });
}
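An equivalent merge can be written with push and the spread operator. This is a sketch of an alternative, not part of any API here, and it comes with a caveat worth knowing about:

```javascript
// Same merge using push with the spread operator. Spreading passes
// every record of `set` as a separate function argument, so for very
// large chunks this can exceed the engine's argument-count limit;
// the per-record loop above avoids that.
let datasets = [[1, 2], [3], [4, 5]];  // stand-in chunks for illustration
let data = [];
for(let set of datasets){
  data.push(...set);
}
```

For modest chunk sizes the spread form is terser; for very large chunks, the per-record loop is the safer choice.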
					

Using the Server API


The server API encapsulates the functionality of querying a server for the data files in a particular directory and then retrieving the data. It assumes that the data in each file is structured as an array.


let sv = await import('/lib/server/v1.0.0/server.mjs');

let server = sv.server($url);
let data = await server('https://server.com/data.json');
					

The options parameter lets you process the data that is returned before appending it to the result array. You can also specify how many files to retrieve concurrently.


let data = await server('https://server.com/data.json', {
    process:function(data){
        if(typeof data === 'string') return JSON.parse(data);
        return data;
    },
    concurrent:3
});
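The library's internals are not shown here, but the `concurrent` option can be pictured as a small pool of workers, each pulling the next URL until none remain. A minimal sketch under that assumption, where `retrieveAll` and `fetchFile` are illustrative names standing in for the real implementation and network request:

```javascript
// Retrieve all URLs with at most `concurrent` requests in flight,
// preserving the order of the input URLs in the results.
async function retrieveAll(urls, fetchFile, concurrent){
  const results = new Array(urls.length);
  let next = 0;
  // Each worker claims the next unclaimed index, fetches it, and
  // repeats until every URL has been claimed.
  async function worker(){
    while(next < urls.length){
      const i = next++;
      results[i] = await fetchFile(urls[i]);
    }
  }
  const workers = [];
  for(let w = 0; w < Math.min(concurrent, urls.length); w++){
    workers.push(worker());
  }
  await Promise.all(workers);
  return results;
}
```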
					

Using the File Server


Many browsers have an effective file size limit of 500MB. This means that datasets larger than that need to be split into multiple files. The simple way to query this type of data is to place it in a directory, query the directory for the files that exist there, and then retrieve each file and concatenate the results as above.
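Producing the pieces is straightforward. A hypothetical helper (not part of any API described here) that splits one large array into chunks of at most `size` records, each of which would then be written to its own file:

```javascript
// Split `records` into pieces of at most `size` records each.
function splitIntoChunks(records, size){
  const chunks = [];
  for(let i = 0; i < records.length; i += size){
    chunks.push(records.slice(i, i + size));
  }
  return chunks;
}
```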

A standard way to do this is to encode the directory structure in the URL being queried. That is, suppose you have a dataset named "big-data" and your company domain name is "acme.com". The URL "https://acme.com/big-data/" should return a list of files representing the pieces of the dataset.


let files = await $url("https://acme.com/big-data/");
					

Then, each file can be retrieved from the web server and added to the dataset.


let data = [];
for(let file of files){
    let data2 = await $url("https://acme.com/big-data/" + file);
    data2.forEach(p=>data.push(p));
}
					


The file server is a server that hosts files from the local hard drive as a web server. It follows the protocol above for querying the files in a directory and retrieving each file.
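The protocol can be sketched in miniature. In this sketch an in-memory object stands in for a directory on disk; the real file server reads from the hard drive, but the shape of the responses is assumed to be the same: a URL ending in "/" returns the list of file names, and any other URL returns that file's contents.

```javascript
// In-memory sketch of the file server protocol; `files` stands in
// for a directory on the local hard drive.
function createFileServer(files){
  return function respond(url){
    if(url.endsWith('/')){
      // Directory query: return the list of file names as JSON.
      return JSON.stringify(Object.keys(files));
    }
    // File query: return the named file's contents.
    let name = url.slice(url.lastIndexOf('/') + 1);
    return files[name];
  };
}

let serve = createFileServer({
  'part1.json': '[1,2]',
  'part2.json': '[3,4]'
});
// serve('/big-data/')           → '["part1.json","part2.json"]'
// serve('/big-data/part1.json') → '[1,2]'
```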