Skip to content

Improved delimiter sniffing for .csv files #210

Open
@JensWendt

Description

@JensWendt

Hello,

Adding to a previous discussion we had #195, I would like to propose an improvement to the delimiter sniffing technique we use.
Why do I come up with this now? Because @SchmChris found a .csv which could not be properly resolved with the current code.
the .csv in question: test.csv

the current code reads:

with open(temp_name, 'rt', encoding='utf-8-sig') as file_handle:
            try:
                delimiter = csv.Sniffer().sniff(
                    file_handle.read(500), ",;\t").delimiter
                print("Using delimiter: ", delimiter,
                      " after reading 500 characters")
            except Exception:
                file_handle.seek(0)
                try:
                    delimiter = csv.Sniffer().sniff(
                        file_handle.read(1000), ",;\t").delimiter
                    print("Using delimiter: ", delimiter,
                          " after reading 1000 characters")
                except Exception:
                    file_handle.seek(0)
                    try:
                        delimiter = csv.Sniffer().sniff(
                            file_handle.read(2000), ";,\t").delimiter
                        print("Using delimiter: ", delimiter,
                              " after reading 2000 characters")
                    except Exception:
                        print("Failed to sniff delimiter, using ','")
                        delimiter = ","
            # reset to start and read whole file...
            file_handle.seek(0)
            data = list(csv.reader(file_handle, delimiter=delimiter))

I propose ditching the fixed values of characters we feed into the sniffer and go for a dynamic approach where we read the first quarter, half , three quarters or the whole file.

from math import floor
with open("temp_name", 'rt', encoding='utf-8-sig') as file_handle:
    file_length = len(file_handle.read(-1))
    try:
        delimiter = csv.Sniffer().sniff(
            file_handle.read(floor(file_length/4)), ",;\t").delimiter
        print("Using delimiter: ", delimiter,
              f" after reading {floor(file_length/4)} characters")
    except Exception:
        file_handle.seek(0)
        try:
            delimiter = csv.Sniffer().sniff(
                file_handle.read(floor(file_length/2)), ",;\t").delimiter
            print("Using delimiter: ", delimiter,
                  f" after reading {floor(file_length/2)} characters")
        except Exception:
            file_handle.seek(0)
            try:
                delimiter = csv.Sniffer().sniff(
                    file_handle.read(floor(file_length*0.75)), ",;\t").delimiter
                print("Using delimiter: ", delimiter,
                      f" after reading {floor(file_length*0.75)} characters")
            except Exception:
                file_handle.seek(0)
                try:
                    delimiter = csv.Sniffer().sniff(
                        file_handle.read(file_length), ",;\t").delimiter
                    print("Using delimiter: ", delimiter,
                          " after reading all characters")
                except Exception:
                    print("Failed to sniff delimiter, using ','")
                    delimiter = ","
# reset to start and read whole file...
	file_handle.seek(0)
	data = list(csv.reader(file_handle, delimiter=delimiter))

This will cost maybe a bit more time, but will more reliably lead to a successful "sniffing" of the delimiter.

Does this sound like a good idea, am I missing some issue or is there maybe a more elegant approach to implement this?

When this more robust method is implemented (in whatever fashion) I would also like to adapt the code for populate_metadata.py and OMERO.parade to include this, as the delimiter-issue will come up for any german localized Excel version and possibly prevents new OMERO users from utilizing this part of the OMERO functionality.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions