DoctorPresto
Contributor

dataKrash is a convenience function built on os.scandir that produces the same results as glob.glob, but about 25x faster and with roughly 3x lower peak memory. Rather than recursively searching the entire directory once per pattern, it indexes the tree once and builds a dict that holds a list of paths for each requested file type.
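
For readers who want to see the single-pass idea in isolation, here is a minimal sketch. It is not the code in this PR: `index_by_suffix` and its parameters are hypothetical names, and the matching rule is an assumption (a file is added to every suffix list it matches, which mirrors what separate `*.glb` and `*.anims.glb` glob patterns would return).

```python
import os


def index_by_suffix(root, suffixes):
    """Walk `root` once and group file paths by matching suffix.

    Hypothetical helper for illustration; not the dataKrash implementation.
    """
    found = {suffix: [] for suffix in suffixes}
    stack = [root]  # explicit stack instead of recursion
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    # Descend later; symlinked directories are skipped.
                    stack.append(entry.path)
                elif entry.is_file():
                    # Assumption: case-sensitive match on the file name,
                    # and a file may land in more than one list.
                    for suffix in suffixes:
                        if entry.name.endswith(suffix):
                            found[suffix].append(entry.path)
    return found
```

Each directory is listed exactly once no matter how many suffixes are requested, so the cost of adding another file type is one extra string comparison per file rather than another walk of the tree.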

Results from benchmarking on a small folder of sectors against the current implementation with glob.glob:

Benchmarking: dataKrash(x5)
  Run 1: 0.2427s, Peak Memory: 0.06 MB
  Run 2: 0.2404s, Peak Memory: 0.06 MB
  Run 3: 0.2372s, Peak Memory: 0.06 MB
  Run 4: 0.2423s, Peak Memory: 0.06 MB
  Run 5: 0.2374s, Peak Memory: 0.06 MB

📊 Average Time: 0.2400s
📦 Average Peak Memory: 0.06 MB
  .glb: 177 files
  .mesh.json: 0 files
  .app.json: 27 files
  .anims.json: 0 files
  .ent.json: 89 files
  .anims.glb: 0 files
  .streamingsector.json: 8 files
  .rig.json: 0 files
  .phys.json: 0 files

Benchmarking: glob.glob (x5)
  Run 1: 6.0052s, Peak Memory: 0.17 MB
  Run 2: 5.9445s, Peak Memory: 0.16 MB
  Run 3: 5.8777s, Peak Memory: 0.16 MB
  Run 4: 5.8439s, Peak Memory: 0.16 MB
  Run 5: 5.9775s, Peak Memory: 0.16 MB

📊 Average Time: 5.9298s
📦 Average Peak Memory: 0.16 MB
  .glb: 177 files
  .mesh.json: 0 files
  .app.json: 27 files
  .anims.json: 0 files
  .ent.json: 89 files
  .anims.glb: 0 files
  .streamingsector.json: 8 files
  .rig.json: 0 files
  .phys.json: 0 files
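
The harness behind these numbers is not shown in the thread. A sketch of one way to produce output in this shape follows, assuming `time.perf_counter` for timing and `tracemalloc` for peak memory; `benchmark` is a hypothetical name, and the real script may measure differently.

```python
import time
import tracemalloc


def benchmark(func, runs=5):
    """Call func() `runs` times; return (average seconds, average peak MB)."""
    times, peaks = [], []
    for i in range(runs):
        tracemalloc.start()
        start = time.perf_counter()
        func()
        elapsed = time.perf_counter() - start
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        times.append(elapsed)
        peaks.append(peak / (1024 * 1024))
        print(f"  Run {i + 1}: {elapsed:.4f}s, Peak Memory: {peaks[-1]:.2f} MB")
    return sum(times) / runs, sum(peaks) / runs
```

Two caveats apply to any harness of this kind: `tracemalloc` only counts allocations made through Python's allocator and slows the measured code down, and the operating system's file cache is usually warm after the first run, so absolute timings can differ a lot between machines and folders.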

DoctorPresto added the enhancement (New feature or request) label on Aug 4, 2025
@Simarilius-uk
Contributor

I don't understand how glob sucks so badly on your machine. I'm testing it on a project here and it's parsing more than 4x as many files in 1.1 s. Your code is still faster, but not by the margin you're seeing.
Is this safe against the file-name issues I was using the escaped path for?

@DoctorPresto
Contributor Author

> I don't understand how glob sucks so badly on your machine. I'm testing it on a project here and it's parsing more than 4x as many files in 1.1 s. Your code is still faster, but not by the margin you're seeing.

Yeah, I cherry-picked where I'd benchmark a bit here, but glob's performance hit comes from re-scanning the whole tree for every pattern. This version walks the directory only once and filters per file, so it scales much better on larger structures, where glob falls down.
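
For comparison, the per-pattern approach described above looks roughly like the sketch below. `find_with_glob` is a hypothetical name and the pattern layout (`root/**/*<suffix>`) is an assumption about the current implementation, which is not quoted in this thread.

```python
import glob
import os


def find_with_glob(root, suffixes):
    """Run one recursive glob per suffix; the tree is walked len(suffixes) times."""
    found = {}
    # glob.escape keeps characters such as "[" in `root` from being read
    # as pattern syntax (assumed to be what "escaped path" refers to).
    escaped_root = glob.escape(root)
    for suffix in suffixes:
        pattern = os.path.join(escaped_root, "**", "*" + suffix)
        found[suffix] = glob.glob(pattern, recursive=True)
    return found
```

With the nine file types in the benchmark above, this walks the whole tree nine times, which is why the gap widens as the directory structure grows.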

> Is this safe against the file-name issues I was using the escaped path for?

I tried to make sure it covers everything we were relying on: proper normalization for cross-platform and Unicode safety (unicodedata.normalize), support for non-ASCII characters, emojis in paths (for some reason), filenames with spaces, etc. The only edge case that might still need caution is escaped paths, but as long as we're passing them in as raw strings (r""), we should be good.
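
As an illustration of the normalization step, here is a small sketch. The choice of NFC and the helper name `normalize_path` are assumptions; the comment above only says that `unicodedata.normalize` is used, not which form or where.

```python
import os
import unicodedata


def normalize_path(path):
    """Return `path` with separators cleaned up and text in NFC form.

    Hypothetical helper; the PR may normalize differently.
    """
    return unicodedata.normalize("NFC", os.path.normpath(path))


# The same visible name can be stored as one code point or as a base
# letter plus a combining accent, depending on the platform.
composed = "caf\u00e9.glb"
decomposed = "cafe\u0301.glb"

assert composed != decomposed
assert normalize_path(composed) == normalize_path(decomposed)
```

A normalized string is safest as a comparison or dictionary key. On filesystems that store names byte-for-byte, opening the normalized form of a path may fail if the file on disk uses the other form, so it can be worth keeping the original `entry.path` alongside it.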

@Simarilius-uk
Contributor

Can you check it against the issue that #147 showed?
