Ramin Sadre – Analyzing Large Data Sets From the Command Line
When doing research on traffic modeling, you may find yourself facing a large data set with hundreds of millions of entries, wondering how to perform some ad-hoc data processing on it: nothing sophisticated, just enough to filter out the uninteresting data.
The problem: although there are some really nice tools for statistical data processing, they are usually not made for data sets with more than, say, a hundred thousand entries. Even the good old SQL database (you can really do some interesting things with SQL) would probably take longer to set up and populate than the data analysis itself.
In the following, I want to give some examples of how you can process large data sets with ordinary command line tools (Cygwin if you are using Windows). The basic idea is to store your data in text files. Of course, there are more sophisticated approaches, for example map/reduce, but for ad-hoc data processing the command line can often be the fastest and easiest way.
I will try to regularly extend the list of examples.
Store compressed data
When doing simple data manipulations (such as filtering numerical entries outside a given range), the disk is generally the bottleneck. The CPU is usually fast enough to compress and decompress your data on the fly, especially if it is a multi-core model. You can expect numerical data stored in text files to shrink by at least 50% when compressed.
To compress data:
...other commands... | gzip - > file.gz
To decompress data:
zcat file.gz | ...other commands...
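Putting both directions together, the following minimal sketch filters a compressed file and writes the result in compressed form again. The file names, the range 100 to 199 and the seq-generated input are assumptions made for illustration. It uses gzip -dc, which is equivalent to zcat but also works on systems where zcat only accepts .Z files:

```shell
# Hypothetical sample data: the integers 1..1000, one per line, stored compressed.
tmpdir=$(mktemp -d)
seq 1 1000 | gzip -c > "$tmpdir/values.gz"

# Keep only values in the range [100, 199]; the output is compressed again.
gzip -dc "$tmpdir/values.gz" | awk '$1 >= 100 && $1 <= 199' | gzip -c > "$tmpdir/filtered.gz"

# Count the lines that survived the filter.
kept=$(gzip -dc "$tmpdir/filtered.gz" | wc -l | tr -d ' ')
echo "kept $kept lines"

rm -rf "$tmpdir"
```

The uncompressed data never touches the disk; only the compressed input and output files do.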
An interesting, although rather obvious, observation is that sorted data can often be compressed better. One of my data sets contained around 200 million timestamps with microsecond resolution. The unsorted file compressed to around 900 MiB, the sorted file to only 700 MiB.
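You can check this effect on your own data before committing to a storage layout. The sketch below compares the compressed size of unsorted and sorted input. The input here is synthetic (20000 pseudo-random integers from awk with a fixed seed), so the sizes it prints say nothing about real measurement data:

```shell
tmpdir=$(mktemp -d)

# Hypothetical data: 20000 pseudo-random integers below 10^9 (fixed seed).
awk 'BEGIN { srand(1); for (i = 0; i < 20000; i++) printf "%d\n", rand() * 1000000000 }' > "$tmpdir/unsorted.txt"

# Compressed size in bytes, without and with sorting first.
unsorted_size=$(gzip -c "$tmpdir/unsorted.txt" | wc -c | tr -d ' ')
sorted_size=$(sort -n "$tmpdir/unsorted.txt" | gzip -c | wc -c | tr -d ' ')
echo "unsorted: $unsorted_size bytes, sorted: $sorted_size bytes"

rm -rf "$tmpdir"
```

Neighboring lines in sorted numerical data tend to share long prefixes, which gives the compressor more repetition to exploit.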
Select specific columns
The command cut can be used to select one or more columns from a table. The following example selects the first and third columns from a compressed text file with space-separated columns:
zcat file.gz | cut -d' ' -f1,3 | ...other commands...
Check its man page for more options (such as byte-based selection with the -b option). For more complex formats, cut may not be powerful enough. Use perl or awk instead.
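Two cases where cut falls short: with -d' ' it treats every single space as a delimiter, so runs of spaces produce empty fields, and it always prints the selected columns in input order. awk splits on runs of whitespace by default and can reorder columns. A minimal sketch with made-up input:

```shell
# Hypothetical input with irregular spacing between the columns.
out=$(printf 'a  1   x\nb 2 y\n' | awk '{ print $3, $1 }')
echo "$out"
# prints:
# x a
# y b
```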
Simple random sampling
To reduce the size of the data set, sampling may be appropriate. The following example samples a data file. Each line in the data file is selected with probability 0.01:
... | awk 'BEGIN{ srand() } { if(rand()<.01) print $0 }' | ...
Depending on your usage scenario, the quality of the RNG used by awk may not be high enough. Check the manual of your awk implementation.
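Called without an argument, srand() seeds the generator from the time of day, so each run selects different lines. For repeatable experiments you can pass a fixed seed instead. A minimal sketch, where the helper name sample_lines, the seed 42, the probability 0.1 and the seq-generated input are all chosen for illustration:

```shell
# sample_lines is a hypothetical helper: keep each input line with probability p,
# using a fixed seed so that repeated runs select the same lines.
sample_lines() {
  awk -v seed="$1" -v p="$2" 'BEGIN { srand(seed) } rand() < p'
}

first=$(seq 1 10000 | sample_lines 42 0.1)
second=$(seq 1 10000 | sample_lines 42 0.1)
count=$(printf '%s\n' "$first" | wc -l | tr -d ' ')
echo "sampled $count of 10000 lines"
```

The same seed only reproduces the same sample with the same awk implementation; different implementations use different generators.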
Quick and dirty histogram creation of discrete data
The following example only works well for small data sets. It creates a histogram of the (numerical) discrete data in column 3:
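... | cut -d' ' -f3 | sort -n | uniq -c | ...
The column is extracted first because uniq -c counts identical whole lines, not identical values in a single column. Check the manuals of sort and uniq for other useful options (for example, the -t option of sort to specify the field separator when sorting by key).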
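The sort-based pipeline has to sort the full input, which is why it only suits small data sets. An alternative that I would sketch as follows counts the values in an awk associative array and sorts only the resulting histogram. Memory use then grows with the number of distinct values rather than the number of lines. The five input lines are made up for illustration:

```shell
# Count occurrences of each value in column 3, then sort the (small) histogram.
hist=$(printf '%s\n' 'a b 3' 'c d 1' 'e f 3' 'g h 2' 'i j 3' |
  awk '{ count[$3]++ } END { for (v in count) print v, count[v] }' |
  sort -n)
echo "$hist"
# prints (value, count):
# 1 1
# 2 1
# 3 3
```

Note that the output order is value first, count second, the reverse of what uniq -c prints.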
To be extended...