extracting only a part of the tea file #28

Open
@pavlexander

I would like to select/read only a subset of the data from a file based on some criteria. Is there an optimal approach for doing this?

attempt 1

model

    public struct CandleInDbNew
    {
        public uint OpenTs;

        public decimal OpenPrice;
        public decimal HighPrice;
        public decimal LowPrice;
        public decimal ClosePrice;

        public uint TradeCount;

        public decimal Volume;
        public decimal QuoteAssetVolume;
        public decimal TakerBuyBaseAssetVolume;
        public decimal TakerBuyQuoteAssetVolume;
    }

method

        public List<CandleInDbNew> GetCandlesInRange(
            string fileFullPath,
            uint from)
        {
            var result = new List<CandleInDbNew>();

            if (!File.Exists(fileFullPath))
            {
                return result;
            }

            using (var tf = TeaFile<CandleInDbNew>.OpenRead(fileFullPath,
                    ItemDescriptionElements.FieldNames |
                    ItemDescriptionElements.FieldTypes |
                    ItemDescriptionElements.FieldOffsets |
                    ItemDescriptionElements.ItemSize))
            {
                foreach (var item in tf.Items)
                {
                    if (item.OpenTs >= from)
                        result.Add(item);
                }
            }

            return result;
        }

Given that the data in my file is sorted by OpenTs, I would like to skip the values that fall outside a specific range, as in the example above.
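Because the items are sorted by OpenTs, the first matching item could in principle be located with a binary search instead of a full scan. The following is only a minimal sketch of that idea, not something taken from the TeaFiles API: `SortedRangeSearch`, `LowerBound` and the `keyAt` delegate are hypothetical names, and the sketch assumes that a single item can be fetched by its index without reading the items before it. Whether the library offers such index-based access needs to be checked against its documentation.

```csharp
using System;

public static class SortedRangeSearch
{
    // Returns the index of the first item whose key is >= from,
    // or count when no such item exists.
    // Assumption: keys are sorted in ascending order.
    // keyAt is a hypothetical accessor that fetches the key of one item by index.
    public static long LowerBound(long count, Func<long, uint> keyAt, uint from)
    {
        long lo = 0;
        long hi = count; // the search window is [lo, hi)

        while (lo < hi)
        {
            long mid = lo + (hi - lo) / 2;

            if (keyAt(mid) < from)
                lo = mid + 1; // mid and everything before it is too old
            else
                hi = mid;     // mid may be the first match, keep it in the window
        }

        return lo;
    }
}
```

With an in-memory array `uint[] openTs` standing in for the file, the call would be `SortedRangeSearch.LowerBound(openTs.Length, i => openTs[i], from)`. Only the items from the returned index onwards would then have to be read and mapped, and the search itself touches roughly log2(count) items.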

issue

This approach is really inefficient because every item in the file is read and mapped to the struct before the filter is applied. It is slow and does not solve the problem.

attempt 2

I have also tried the untyped (unmapped) approach, but an exception is thrown on read:

System.IO.IOException: 'Decimal constructor requires an array or span of four valid decimal bytes.'

I have managed to extract part of the data that causes the issue. https://github.com/pavlexander/testfile/blob/main/ETHBTC_big.7z

There were no issues with 10k, 50k, or 100k records, but at 1 million records I started getting the error. Please download and unpack the file, then use the following code to reproduce:

            var result = new List<CandleInDbNew>();

            using (var tf = TeaFile.OpenRead("ETHBTC_big.tea")) // exception here
            {
                var openTsColumn = tf.Description.ItemDescription.GetFieldByName("OpenTs");

                foreach (Item item in tf.Items)
                {
                    var openTs = (uint)openTsColumn.GetValue(item);

                    if (openTs >= 1692190740)
                        result.Add(default); // temporary
                }
            }

issue

Even if this solution worked, there is no guarantee that it would be faster than approach 1. In fact, on a smaller dataset where no exception is thrown, approach 1 performs many times faster than approach 2 on my machine. Setting the error aside, I would also like to know how to map an Item to my struct.

conclusion

The original question still stands: how can I filter the data based on criteria without reading the whole file?
