Support to ignore specific storage options when tokenizing #1933
|
fails for "zip://afile::simplecache::ftp://user:pass@localhost:2121/archive.zip", need to revisit implementation... |
|
Looks like this is cross-pollution via the instance cache between tests using simplecache: Running And passes in the opposite order: |
|
I don't know why that's happening :| For the sake of tests, it would make sense to both clear the cache between tests, and also to have some tests that specifically examine the state of the cache after particular operations. |
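A minimal sketch of what "clear the cache between tests" could look like, using toy classes rather than fsspec itself. `CachedMeta`, `BaseFS`, `ChildFS` and `clear_all_instance_caches` are hypothetical names made up for this illustration; only the idea of a per-class instance cache with a `clear_instance_cache()` method is taken from the discussion.

```python
class CachedMeta(type):
    """Toy metaclass: caches instances per class, keyed by constructor arguments."""

    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        cls._cache = {}  # every class gets its own cache dict

    def __call__(cls, *args, **kwargs):
        # Assumes hashable argument values, which is enough for this sketch.
        key = (args, tuple(sorted(kwargs.items())))
        if key not in cls._cache:
            cls._cache[key] = super().__call__(*args, **kwargs)
        return cls._cache[key]

    def clear_instance_cache(cls):
        cls._cache.clear()


class BaseFS(metaclass=CachedMeta):
    def __init__(self, **options):
        self.options = options


class ChildFS(BaseFS):
    pass


def clear_all_instance_caches(root):
    """Clear the cache of `root` and of every direct or indirect subclass."""
    pending = [root]
    while pending:
        cls = pending.pop()
        pending.extend(cls.__subclasses__())
        cls.clear_instance_cache()
```

In a pytest suite, a helper like this could be called from an autouse fixture so that every test starts with empty caches, and a test can then assert on the cache contents after a particular operation.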
|
I worked out that it's a stale socket that survives due to the instance cache not being cleared. Below is a reproduction:

def test_url_to_fs_chained_url():
    # Reproduce the test cross-contamination issue:
    # When test_url_to_fs is executed before test_chained_url the latter fails.
    # Switching the test execution order makes tests succeed
    # Cleaning all instance caches between tests makes tests succeed
    import subprocess
    import sys
    import tempfile
    import time

    import fsspec
    from fsspec.implementations.ftp import FTPFileSystem
    from fsspec.implementations.cached import CachingFileSystem
    from fsspec.implementations.cached import SimpleCacheFileSystem

    # clear instance caches of classes and subclasses
    classes = [FTPFileSystem, CachingFileSystem]
    while classes:
        cls = classes.pop()
        classes.extend(cls.__subclasses__())
        cls.clear_instance_cache()

    # start the ftpserver process
    host, port, username, password = "localhost", 2121, "user", "pass"
    P = subprocess.Popen(
        [
            sys.executable,
            "-m", "pyftpdlib",
            "-d", str(tempfile.mkdtemp()),
            "-u", username,
            "-P", password,
            "-w"
        ]
    )
    time.sleep(1)

    # The following has to happen in test_url_to_fs
    try:
        fs, url = fsspec.core.url_to_fs(
            f"simplecache::ftp://{username}:{password}@{host}:{port}/afile"
        )
        # The bug will only trigger if we use a method that
        # internally calls a method on FTPFileSystem().ftp.<some-method>
        fs.exists(url)
    finally:
        # We now kill the ftp server (the ftp_writable fixture does this)
        P.terminate()
        P.wait()

    # if we now run any method on this filesystem, we reproduce the error
    # fs.open("???") <-- would raise the EOFError
    # (because ftplib.FTP keeps a reference to a socket)

    # In the test cross contamination case, the filesystem
    # survives via the instance cache and gets reused in the following test
    assert len(SimpleCacheFileSystem._cache) == 1
    cached_fs = next(iter(SimpleCacheFileSystem._cache.values()))
    assert cached_fs.storage_options == {
        "target_options": {
            "host": host,
            "password": password,
            "port": port,
            "username": username,
        },
        "target_protocol": "ftp",
    }

    # when test_chained_url
    fs, _ = fsspec.core.url_to_fs(f"simplecache::ftp://{username}:{password}@{host}:{port}/xxx")
    assert fs is cached_fs  # we retrieved the broken fs from the previous section
    fs.open("???")  # raises EOFError
|
I'll add more tests for this. |
454bf92 to
48d5fa8
|
Note to self: https://github.com/fsspec/filesystem_spec/actions/runs/18803845822/job/53654986219#step:4:2311 The reference filesystem key errors must have been masked somehow by the instance caches not being cleared between tests... The errors occur because the It's also strange because I can't reproduce this locally on my mac... |
… actual protocol" This reverts commit 753cf5c.
|
This is ready for review.
|
|
This looks perfect, thank you - I have no changes at all. |
This PR closes #1930.
It provides a way to exclude the "fo" keyword argument from being considered for the filesystem instance cache. Instead of specializing just on "fo", it adds support for configuring which storage options to skip via a class attribute, similar to how _extra_tokenize_attributes lets users add class attributes to the tokenization.

Note: implementing this revealed some cross-contamination of the tests in test_core.py via the simplecache instance cache. I fixed this by adding a fixture that clears the caches for every test.
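To illustrate the class-attribute approach in isolation, here is a rough sketch with toy classes that does not use fsspec. The names `_skip_token_options`, `make_token`, `CachedMeta` and `ToyReferenceFS` are hypothetical and are not the names used in this PR; the sketch only shows how options listed on a class attribute can be left out of the cache key while still being passed to the constructor.

```python
import hashlib


def make_token(cls, args, kwargs):
    """Build a deterministic cache key from the class and its arguments."""
    raw = repr((cls.__qualname__, args, sorted(kwargs.items())))
    return hashlib.sha256(raw.encode()).hexdigest()


class CachedMeta(type):
    """Toy instance cache that can ignore selected keyword arguments."""

    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        cls._cache = {}

    def __call__(cls, *args, **kwargs):
        # Hypothetical attribute: options named here do not affect the key.
        skip = getattr(cls, "_skip_token_options", ())
        token_kwargs = {k: v for k, v in kwargs.items() if k not in skip}
        token = make_token(cls, args, token_kwargs)
        if token not in cls._cache:
            # The instance still receives every option, skipped ones included.
            cls._cache[token] = super().__call__(*args, **kwargs)
        return cls._cache[token]


class ToyReferenceFS(metaclass=CachedMeta):
    _skip_token_options = ("fo",)

    def __init__(self, fo=None, remote_protocol=None):
        self.fo = fo
        self.remote_protocol = remote_protocol


first = ToyReferenceFS(fo={"key": 1}, remote_protocol="s3")
second = ToyReferenceFS(fo={"key": 2}, remote_protocol="s3")
other = ToyReferenceFS(fo={"key": 1}, remote_protocol="file")
```

In this toy version, `first` and `second` are the same object because "fo" is ignored when building the key, so `second.fo` is the value passed in the first call. `other` is a separate instance because `remote_protocol` still takes part in the key.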