Description
Hello,
Adding to a previous discussion we had #195, I would like to propose an improvement to the delimiter sniffing technique we use.
Why do I come up with this now? Because @SchmChris found a .csv which could not be properly resolved with the current code.
the .csv in question: test.csv
the current code reads:
with open(temp_name, 'rt', encoding='utf-8-sig') as file_handle:
try:
delimiter = csv.Sniffer().sniff(
file_handle.read(500), ",;\t").delimiter
print("Using delimiter: ", delimiter,
" after reading 500 characters")
except Exception:
file_handle.seek(0)
try:
delimiter = csv.Sniffer().sniff(
file_handle.read(1000), ",;\t").delimiter
print("Using delimiter: ", delimiter,
" after reading 1000 characters")
except Exception:
file_handle.seek(0)
try:
delimiter = csv.Sniffer().sniff(
file_handle.read(2000), ";,\t").delimiter
print("Using delimiter: ", delimiter,
" after reading 2000 characters")
except Exception:
print("Failed to sniff delimiter, using ','")
delimiter = ","
# reset to start and read whole file...
file_handle.seek(0)
data = list(csv.reader(file_handle, delimiter=delimiter))
I propose ditching the fixed values of characters we feed into the sniffer and go for a dynamic approach where we read the first quarter, half , three quarters or the whole file.
from math import floor
with open("temp_name", 'rt', encoding='utf-8-sig') as file_handle:
file_length = len(file_handle.read(-1))
try:
delimiter = csv.Sniffer().sniff(
file_handle.read(floor(file_length/4)), ",;\t").delimiter
print("Using delimiter: ", delimiter,
f" after reading {floor(file_length/4)} characters")
except Exception:
file_handle.seek(0)
try:
delimiter = csv.Sniffer().sniff(
file_handle.read(floor(file_length/2)), ",;\t").delimiter
print("Using delimiter: ", delimiter,
f" after reading {floor(file_length/2)} characters")
except Exception:
file_handle.seek(0)
try:
delimiter = csv.Sniffer().sniff(
file_handle.read(floor(file_length*0.75)), ",;\t").delimiter
print("Using delimiter: ", delimiter,
f" after reading {floor(file_length*0.75)} characters")
except Exception:
file_handle.seek(0)
try:
delimiter = csv.Sniffer().sniff(
file_handle.read(file_length), ",;\t").delimiter
print("Using delimiter: ", delimiter,
" after reading all characters")
except Exception:
print("Failed to sniff delimiter, using ','")
delimiter = ","
# reset to start and read whole file...
file_handle.seek(0)
data = list(csv.reader(file_handle, delimiter=delimiter))
This will cost maybe a bit more time, but will more reliably lead to a successful "sniffing" of the delimiter.
Does this sound like a good idea, am I missing some issue or is there maybe a more elegant approach to implement this?
When this more robust method is implemented (in whatever fashion) I would also like to adapt the code for populate_metadata.py
and OMERO.parade to include this, as the delimiter-issue will come up for any german localized Excel version and possibly prevents new OMERO users from utilizing this part of the OMERO functionality.