Description
Problem
Profiling a recent release build of ERIN suggests that its time is spent roughly as follows:
- 1/3 in reading (mostly CSV)
- 1/3 in writing (mostly string->double conversion)
- 1/3 in the simulation itself
Although these ratios may change on larger problems, the time spent on reading and writing is substantial. The objectives of this task are to:
- investigate if there are more performant CSV reading techniques
- investigate if there are more performant string->double conversion routines
- investigate how much performance gain could be had by adding a flag to remove rounding from double outputs
- investigate appending all csvs into one file (i.e., more columns) and check time of reading that vs individually reading each
- investigate what it would take to ADD another input format using Apache Parquet
- investigate what it would take to ADD another output format for events file and stats file using Apache Parquet
A study of the speed of reading .csv files, performed using the hyperfine benchmark utility, indicates marked speed improvements when a given amount of data is packed into a single file rather than distributed across multiple files. Here we define a single "entry" as two columns. For comparison, two files were used, both with 8760 rows. In "repeat" mode, a file with one entry was opened, read, and then closed 1024 times. In "mixed" mode, a file with 1024 entries was opened, read, and then closed only once. For comparison, a third mode, "multi", which reads from a list of 128 files, each with only one entry, gave comparable results to "repeat" mode, as expected. (The same single-entry file was used for the "repeat" and "multi" modes.)
p: # of files to read
q: # of entries (8760 rows each)
r: # of trials
# of entries to read = p x q x r (= 1024)
The hyperfine results are below:
- "repeat": p = 1, q = 1, r = 1024
  Benchmark 1: ../../build/bin/erin read test_files.toml repeat -v
    Time (mean ± σ): 4.412 s ± 0.238 s [User: 4.331 s, System: 0.067 s]
    Range (min … max): 4.040 s … 4.887 s (10 runs)
- "mixed": p = 1, q = 1024, r = 1
  Benchmark 1: ../../build/bin/erin read test_files.toml mixed -v
    Time (mean ± σ): 1.993 s ± 0.078 s [User: 1.925 s, System: 0.061 s]
    Range (min … max): 1.833 s … 2.101 s (10 runs)
- "multi": p = 128, q = 1, r = 8
  Benchmark 1: ../../build/bin/erin read test_files.toml multi -v
    Time (mean ± σ): 4.397 s ± 0.201 s [User: 4.303 s, System: 0.070 s]
    Range (min … max): 4.049 s … 4.734 s (10 runs)
These tests indicate that reductions in read times are possible using packed .csv data formats.