Nicholas M Hancock

Hacking with Hancock



Wednesday, 25 May 2016 -- A short note on Column vs Row storage

I’m writing my own small storage engine for a financial-analysis application. I want to run quantitative-finance-style analysis on the market in Eve-Online, so as to hone my skills at such analysis cheaply. Eve-Central has a historic dataset of Eve-Online market data which, while not comprehensive, still measures in at 2.7 terabytes in decompressed CSV form. The data are typical star-schema, fact-table-style data: each tuple consists of a primary key, 4 ‘foreign keys’, 2 timestamps, and 8 miscellaneous fact fields, for a total of 15 fields. One day’s worth of data is 1.5 GB in the aforementioned decompressed CSV format. I’m comparing row and column storage formats on two criteria: space consumed and time to write. Note that both output formats are binary, and all times are the minimum of four consecutive runs.
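For concreteness, here is a rough sketch of the two layouts in Python. The field widths and names below are placeholders for illustration only; the real tuples just have the field categories listed above.

import struct

# Placeholder record layout: primary key, 4 foreign keys,
# 2 timestamps, 8 fact fields (15 fields total). The exact
# types are assumptions made for this sketch.
ROW_FMT = "<q" + "i" * 4 + "q" * 2 + "d" * 8

def write_rows(records, path):
    # Row storage: all fields of a tuple are packed contiguously.
    pack = struct.Struct(ROW_FMT).pack
    with open(path, "wb") as f:
        for rec in records:
            f.write(pack(*rec))

def write_columns(records, path_prefix):
    # Column storage: one file per field, with all values of the
    # same field stored contiguously (better for compression).
    col_fmts = ["<q"] + ["<i"] * 4 + ["<q"] * 2 + ["<d"] * 8
    cols = list(zip(*records))  # transpose rows into columns
    for i, (fmt, col) in enumerate(zip(col_fmts, cols)):
        pack = struct.Struct(fmt).pack
        with open(f"{path_prefix}.col{i}", "wb") as f:
            for v in col:
                f.write(pack(v))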

Format             Size     Time to write (s)
CSV (input)        1.5 GB   0
Row                600 MB   8.60
Row (lz4 -9)       209 MB   8.60 + 20.59  = 29.19
Row (gzip -9)      230 MB   8.60 + 196.81 = 205.41
Column             600 MB   13.94
Column (lz4 -9)    174 MB   13.94 + 33.23  = 47.17
Column (gzip -9)   146 MB   13.94 + 171.00 = 184.94

Note that the row versus column compression times aren’t perfectly comparable: I wrote all the data to row or column storage first and then compressed it in one pass, whereas in a real environment the compression would happen piecewise as the data are written.
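The compression step was timed separately with the external tools, taking the minimum of four runs. A sketch of that driver looks like the following; the file names and exact flags are illustrative rather than the exact commands used.

import subprocess, time

def min_of_runs(cmd, runs=4):
    # Report the minimum of four consecutive runs, as with the
    # numbers in the table above.
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)
        best = min(best, time.perf_counter() - start)
    return best

# Compress the already-written row file in one pass.
t_lz4 = min_of_runs(["lz4", "-9", "-f", "rows.bin", "rows.bin.lz4"])
t_gz = min_of_runs(["gzip", "-9", "-k", "-f", "rows.bin"])
print(f"lz4 -9: {t_lz4:.2f} s, gzip -9: {t_gz:.2f} s")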

For lz4, breaking these data into columns decreases space utilization by 16.75% while increasing the bulk-loading time by 61.60%, relative to row storage. For gzip the numbers are 36.52% and -9.97% respectively (the bulk-loading time actually went down). It remains to be seen why the two algorithms behave so differently with respect to loading time; at a cursory glance, the difference in CPU utilization doesn’t explain the discrepancy.
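The percentages are computed relative to the corresponding row-storage figures in the table:

def pct_change(row, col):
    # Change of the column-storage figure relative to row storage.
    return (col - row) / row * 100

print(f"lz4 size:   {pct_change(209, 174):+.2f} %")       # -16.75 %
print(f"lz4 time:   {pct_change(29.19, 47.17):+.2f} %")    # +61.60 %
print(f"gzip size:  {pct_change(230, 146):+.2f} %")        # -36.52 %
print(f"gzip time:  {pct_change(205.41, 184.94):+.2f} %")  #  -9.97 %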

[Photo: Newport Beach. Photo credit: Austin Neill]