Description
I would like to read only a subset of the data from a file, based on a criterion. Is there an optimal way to do this?
attempt 1
model
public struct CandleInDbNew
{
    public uint OpenTs;
    public decimal OpenPrice;
    public decimal HighPrice;
    public decimal LowPrice;
    public decimal ClosePrice;
    public uint TradeCount;
    public decimal Volume;
    public decimal QuoteAssetVolume;
    public decimal TakerBuyBaseAssetVolume;
    public decimal TakerBuyQuoteAssetVolume;
}
method
public List<CandleInDbNew> GetCandlesInRange(
    string fileFullPath,
    uint from)
{
    var result = new List<CandleInDbNew>();
    if (!File.Exists(fileFullPath))
    {
        return result;
    }

    using (var tf = TeaFile<CandleInDbNew>.OpenRead(fileFullPath,
        ItemDescriptionElements.FieldNames |
        ItemDescriptionElements.FieldTypes |
        ItemDescriptionElements.FieldOffsets |
        ItemDescriptionElements.ItemSize))
    {
        foreach (var item in tf.Items)
        {
            if (item.OpenTs >= from)
                result.Add(item);
        }
    }

    return result;
}
Given that the data in my file is sorted by OpenTs, I would like to filter out the values that are not within a specific range, as in the example above.
issue
This approach is really inefficient because the whole Item is read and mapped right away. It is slow and does not solve the problem.
attempt 2
I have also tried the unmapped approach, but an exception is thrown on read:
System.IO.IOException: 'Decimal constructor requires an array or span of four valid decimal bytes.'
I have managed to extract the part of the data that causes the issue: https://github.com/pavlexander/testfile/blob/main/ETHBTC_big.7z
There were no issues with 10k, 50k, or 100k records, but at 1 million records I started getting the error. Please download and unpack the file, then use the following code to reproduce:
var result = new List<CandleInDbNew>();
using (var tf = TeaFile.OpenRead("ETHBTC_big.tea")) // exception here
{
    var openTsColumn = tf.Description.ItemDescription.GetFieldByName("OpenTs");
    foreach (Item item in tf.Items)
    {
        var openTs = (uint)openTsColumn.GetValue(item);
        if (openTs >= 1692190740)
            result.Add(default); // temporary
    }
}
issue
Even if this solution worked, there is no guarantee that it would be faster than approach 1. In fact, on a smaller dataset where no exception is thrown, approach 1 runs many times faster than approach 2 on my machine. Setting the error aside, I would also like to know how to map an item to a struct.
conclusion
The original question still stands: how can I filter the data based on a criterion without reading the whole file?
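Because the file is sorted by OpenTs, one possible direction is a binary search for the first index whose OpenTs is at or above the threshold, followed by a sequential read from that index. The sketch below is only an illustration of that idea and is deliberately independent of TeaFiles: `SortedRangeSearch`, `LowerBound`, `ReadFrom`, and the `itemAt`/`keyOf` delegates are hypothetical names, and it assumes the storage can report an item count and return the item at a given index.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of TeaFiles.
public static class SortedRangeSearch
{
    // Returns the smallest index i in [0, count) with keyAt(i) >= from,
    // or count if no such index exists. Keys must be in ascending order.
    public static long LowerBound(long count, Func<long, uint> keyAt, uint from)
    {
        long lo = 0, hi = count;
        while (lo < hi)
        {
            long mid = lo + (hi - lo) / 2;
            if (keyAt(mid) < from)
                lo = mid + 1; // first match lies to the right of mid
            else
                hi = mid;     // mid matches; keep looking to the left
        }
        return lo;
    }

    // Locates the first matching index, then reads every item from there to the end.
    // itemAt is an assumed random-access reader supplied by the caller.
    public static List<T> ReadFrom<T>(
        long count, Func<long, T> itemAt, Func<T, uint> keyOf, uint from)
    {
        long start = LowerBound(count, i => keyOf(itemAt(i)), from);
        var result = new List<T>();
        for (long i = start; i < count; i++)
            result.Add(itemAt(i));
        return result;
    }
}
```

With 1 million records the search step touches roughly 20 items instead of all of them. If the typed `tf.Items` collection exposes a `Count` and an index accessor (an assumption I have not verified against the library), the call might look like `SortedRangeSearch.ReadFrom(tf.Items.Count, i => tf.Items[i], c => c.OpenTs, from)`.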