Description
Problem
Profiling a recent release build of ERIN suggests that its time is spent roughly as follows:
- 1/3 in reading (mostly CSV)
- 1/3 in writing (mostly string->double conversion)
- 1/3 in the simulation itself
Although these ratios may change on larger problems, the time spent on reading and writing is substantial. The objectives of this task are to:
- investigate if there are more performant CSV reading techniques
- investigate if there are more performant string->double conversion routines
- investigate how much performance gain could be had by adding a flag to remove rounding from double outputs
- investigate appending all csvs into one file (i.e., more columns) and check time of reading that vs individually reading each
- investigate what it would take to ADD another input format using Apache Parquet
- investigate what it would take to ADD another output format for events file and stats file using Apache Parquet
A study of the speed of reading .csv files, performed using the hyperfine benchmark utility, indicates marked speed improvements when a given amount of data is packed into a single file rather than distributed across multiple files. Here we define a single "entry" as two columns. For comparison, two files were used, both with 8760 rows. In "repeat" mode, a file with one entry was opened, read, and then closed 1024 times. In "mixed" mode, a file with 1024 entries was opened, read, and then closed only once. For comparison, a third mode, "multi", which reads from a list of 128 files, each with only one entry, gave comparable results to "repeat" mode, as expected. (The same single-entry file was used for the "repeat" and "multi" modes.)
p: # of files to read
q: # of entries (8760 rows each)
r: # of trials
# of entries to read = p x q x r (= 1024)
The hyperfine results are below:
- "repeat": p = 1, q = 1, r = 1024
  Benchmark 1: ../../build/bin/erin read test_files.toml repeat -v
    Time (mean ± σ): 4.412 s ± 0.238 s [User: 4.331 s, System: 0.067 s]
    Range (min … max): 4.040 s … 4.887 s (10 runs)
- "mixed": p = 1, q = 1024, r = 1
  Benchmark 1: ../../build/bin/erin read test_files.toml mixed -v
    Time (mean ± σ): 1.993 s ± 0.078 s [User: 1.925 s, System: 0.061 s]
    Range (min … max): 1.833 s … 2.101 s (10 runs)
- "multi": p = 128, q = 1, r = 8
  Benchmark 1: ../../build/bin/erin read test_files.toml multi -v
    Time (mean ± σ): 4.397 s ± 0.201 s [User: 4.303 s, System: 0.070 s]
    Range (min … max): 4.049 s … 4.734 s (10 runs)
These tests indicate that reductions in read times are possible using packed .csv data formats.