Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix line ending handling in reverse_readfile/readline across OS, and not skipping empty lines #712

Open
wants to merge 100 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 27 commits
Commits
Show all changes
100 commits
Select commit Hold shift + click to select a range
e2952c4
add l_end arg to reverse_readfile
DanielYang59 Sep 4, 2024
619a38d
rstrip remove hard coded trailing white space
DanielYang59 Sep 4, 2024
6cbea60
add default value for l_end
DanielYang59 Sep 4, 2024
0de9696
add l_end to reverse_readline
DanielYang59 Sep 4, 2024
f114afe
continue CI test jobs upon failure
DanielYang59 Sep 4, 2024
deb1ad7
bump codecov to v4
DanielYang59 Sep 4, 2024
c4a845c
tweak docstring
DanielYang59 Sep 4, 2024
afbe573
var name and docstring tweak
DanielYang59 Sep 4, 2024
a4c4fe3
add helper function to get line ending and unit test
DanielYang59 Sep 5, 2024
86ea01b
add bzip2 and gzip file support and test
DanielYang59 Sep 5, 2024
c7ec2de
sort import and tweak docstring
DanielYang59 Sep 5, 2024
ae125c3
use unix line ending \n as default if empty
DanielYang59 Sep 5, 2024
343e0db
fix docstring
DanielYang59 Sep 5, 2024
d288e0d
update unit test
DanielYang59 Sep 5, 2024
064c064
remove accidental comment sign
DanielYang59 Sep 5, 2024
30624aa
use if after return
DanielYang59 Sep 5, 2024
3084123
tweak docstring
DanielYang59 Sep 5, 2024
1639725
reset pointer to fix test in windows
DanielYang59 Sep 6, 2024
dfe553d
encode in linux
DanielYang59 Sep 6, 2024
3b7a811
reset pointer in win only
DanielYang59 Sep 6, 2024
b291707
yield str in windows
DanielYang59 Sep 6, 2024
aeba5a7
use Iterator as return type
DanielYang59 Sep 6, 2024
1d7adda
assert iterator return str
DanielYang59 Sep 6, 2024
e6312a6
strict usage of rstrp(line_ending)
DanielYang59 Sep 6, 2024
7911e0a
better assert no error
DanielYang59 Sep 6, 2024
6cb5349
decode before strip, but tests are failing
DanielYang59 Sep 6, 2024
79f07f1
add test_file_with_empty_lines for readfile, Win still not working
DanielYang59 Sep 6, 2024
d2ea989
allow zipped file directly into get_line_end, and add test
DanielYang59 Sep 9, 2024
6803c7b
remove seemingly unused file
DanielYang59 Sep 9, 2024
351452a
update test file to include line end for the last line
DanielYang59 Sep 9, 2024
6f8baf9
skip the first match of line ending char
DanielYang59 Sep 9, 2024
74a451f
fix test for zipped format
DanielYang59 Sep 9, 2024
ef454b3
drop test for legacy MacOS \r
DanielYang59 Sep 9, 2024
3516c27
make capture of mmap valueerror more narrow
DanielYang59 Sep 9, 2024
5e1344c
use platform.system for windows check
DanielYang59 Sep 9, 2024
283a383
WIP: test for \r\n is failing for some reason
DanielYang59 Sep 9, 2024
ace9264
fix \r\n location, but it looks silly, need opt
DanielYang59 Sep 9, 2024
9e92bb9
add TODO tag
DanielYang59 Sep 9, 2024
6a8f8b5
finally fixed, avoid newline otherwise \n get overwrite by \r\n causi…
DanielYang59 Sep 9, 2024
f69c205
fix line ending for text mode
DanielYang59 Sep 9, 2024
28e6cc4
drop test of \r altogether
DanielYang59 Sep 9, 2024
bf3b90a
remove manual counter
DanielYang59 Sep 9, 2024
a03b301
copy unit test, to be fixed
DanielYang59 Sep 9, 2024
138b756
update warn msg upon empty file
DanielYang59 Sep 10, 2024
29dc50d
tweak docstring
DanielYang59 Sep 10, 2024
8f94964
add comments
DanielYang59 Sep 10, 2024
3804727
Fix line ending handling in reverse readline
DanielYang59 Sep 13, 2024
76c0243
significantly increase test
DanielYang59 Sep 13, 2024
3b68d14
add warning for overly small mem size
DanielYang59 Sep 13, 2024
cd3bbd1
update unit test
DanielYang59 Sep 13, 2024
36a02f4
add test for illegal usage
DanielYang59 Sep 13, 2024
6457922
use tuple for instance check for now
DanielYang59 Sep 13, 2024
14dad79
avoid repeated len(l_end) call
DanielYang59 Sep 13, 2024
306a8f1
tweak unit test names
DanielYang59 Sep 13, 2024
168bc42
supress newline convert
DanielYang59 Sep 13, 2024
dcdfde5
pre-commit auto-fixes
pre-commit-ci[bot] Sep 13, 2024
f255121
support bufferedreader and clarify m_file type
DanielYang59 Sep 14, 2024
57157ef
clarify and test missing last l_end char
DanielYang59 Sep 14, 2024
4276032
clean up docstring
DanielYang59 Sep 14, 2024
33257c5
clarify myth around blk size and max mem
DanielYang59 Sep 14, 2024
f75abd3
fix skip last l_end (temporarily)
DanielYang59 Sep 14, 2024
709ed17
fix skip first l_end match
DanielYang59 Sep 14, 2024
4799c24
update test big file, but it's failing and need to fix
DanielYang59 Sep 14, 2024
702e6ca
tweak comment
DanielYang59 Sep 14, 2024
d0a0fda
update TODO list
DanielYang59 Sep 15, 2024
5af7145
fix incorrect buffer refill position
DanielYang59 Sep 15, 2024
384c231
suppress OS l_end translate
DanielYang59 Sep 15, 2024
6102263
suppress unrelated warnings
DanielYang59 Sep 15, 2024
e9c22ed
check file pointer reset
DanielYang59 Sep 15, 2024
0be5a85
track benchmark script in case we need it someday
DanielYang59 Sep 15, 2024
edfa0cb
track benchmark script in case we need it someday
DanielYang59 Sep 15, 2024
bdda9f8
Merge branch 'readline-line-ending' of github.com:DanielYang59/monty …
DanielYang59 Sep 15, 2024
8714cb4
simplify test script, make env install manual
DanielYang59 Sep 15, 2024
c236434
Merge branch 'readline-line-ending' of github.com:DanielYang59/monty …
DanielYang59 Sep 15, 2024
174d940
save test log on wsl2
DanielYang59 Sep 15, 2024
1f7476e
pre-commit auto-fixes
pre-commit-ci[bot] Sep 15, 2024
0d2af60
also track test file create time
DanielYang59 Sep 16, 2024
7cddd29
get obj size as getsize is slow in win
DanielYang59 Sep 16, 2024
0b478dd
pre-commit auto-fixes
pre-commit-ci[bot] Sep 16, 2024
3053e87
update builtin readline test not to read entire file
DanielYang59 Sep 16, 2024
097ae65
remove outdated test log
DanielYang59 Sep 16, 2024
cf542a8
update test log on windows
DanielYang59 Sep 16, 2024
8dc894e
pre-commit auto-fixes
pre-commit-ci[bot] Sep 16, 2024
e90da64
test on Ubuntu 22.04 WSL2
DanielYang59 Sep 18, 2024
df0288b
pre-commit auto-fixes
pre-commit-ci[bot] Sep 18, 2024
0ea6828
remove dup test script
DanielYang59 Sep 18, 2024
8db6bb4
Merge branch 'readline-line-ending' of github.com:DanielYang59/monty …
DanielYang59 Sep 18, 2024
f354756
clear finished TODO tag
DanielYang59 Sep 19, 2024
96c9108
tweak var name
DanielYang59 Sep 19, 2024
e1786ef
fix missing newline char in comment
DanielYang59 Sep 19, 2024
f90074c
guard warning filter with context manager
DanielYang59 Sep 19, 2024
27b28a6
add type annotation
DanielYang59 Sep 19, 2024
1cdc75b
put tag into condition branch
DanielYang59 Sep 21, 2024
e4940e0
untrack benchmark script and results
DanielYang59 Sep 22, 2024
279cd4c
Merge branch 'master' into readline-line-ending
shyuep Oct 21, 2024
6613d33
fix typo in test var name
DanielYang59 Oct 22, 2024
3baea3a
remove merge issue
DanielYang59 Oct 22, 2024
61f4aff
Merge branch 'master' into readline-line-ending
DanielYang59 Oct 22, 2024
f1fd669
revise comment
DanielYang59 Oct 22, 2024
d70e80e
suppress mypy errors
DanielYang59 Oct 22, 2024
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion .github/workflows/test.yml
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@ jobs:
build:
strategy:
max-parallel: 20
fail-fast: false
matrix:
os: [ubuntu-latest, macos-14, windows-latest]
python-version: ["3.9", "3.x"]
Expand All @@ -29,6 +30,6 @@ jobs:
run: pytest --cov=monty --cov-report html:coverage_reports tests

- name: Upload coverage reports to Codecov
uses: codecov/codecov-action@v3
uses: codecov/codecov-action@v4
env:
CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}
132 changes: 96 additions & 36 deletions src/monty/io.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,20 +9,21 @@
import errno
import gzip
import io

try:
import lzma
except ImportError:
lzma = None # type: ignore[assignment]
import mmap
import os
import subprocess
import time
import warnings
from pathlib import Path
from typing import TYPE_CHECKING

try:
import lzma
except ImportError:
lzma = None # type: ignore[assignment]

if TYPE_CHECKING:
from typing import IO, Generator, Union
from typing import IO, Iterator, Literal, Union


def zopen(filename: Union[str, Path], *args, **kwargs) -> IO:
Expand Down Expand Up @@ -54,7 +55,53 @@ def zopen(filename: Union[str, Path], *args, **kwargs) -> IO:
return open(filename, *args, **kwargs)


def reverse_readfile(filename: Union[str, Path]) -> Generator[str, str, None]:
def _get_line_ending(
DanielYang59 marked this conversation as resolved.
Show resolved Hide resolved
DanielYang59 marked this conversation as resolved.
Show resolved Hide resolved
file: str | Path | io.TextIOWrapper,
) -> Literal["\r\n", "\n", "\r"]:
"""Helper function to get line ending of a file.

This function assumes the file has a single consistent line ending.

Returns:
"\n": Unix line ending.
"\r\n": Windows line ending.
"\r": Classic MacOS line ending.

Raises:
ValueError: If line ending is unknown.

Warnings:
If file is empty, "\n" would be used as default.
"""
if isinstance(file, (str, Path)):
with open(file, "rb") as f:
first_line = f.readline()
DanielYang59 marked this conversation as resolved.
Show resolved Hide resolved
elif isinstance(file, io.TextIOWrapper):
first_line = file.buffer.readline()
elif isinstance(file, (gzip.GzipFile, bz2.BZ2File)):
first_line = file.readline()
file.seek(0) # reset pointer
else:
raise TypeError(f"Unknown file type {type(file).__name__}")

# Return Unix "\n" line ending as default if file is empty
if not first_line:
warnings.warn("File empty, use default line ending \n.", stacklevel=2)
return "\n"

if first_line.endswith(b"\r\n"):
return "\r\n"
if first_line.endswith(b"\n"):
return "\n"
if first_line.endswith(b"\r"):
return "\r"

raise ValueError(f"Unknown line ending in line {repr(first_line)}.")


def reverse_readfile(
filename: Union[str, Path],
) -> Iterator[str]:
"""
A much faster reverse read of file by using Python's mmap to generate a
memory-mapped file. It is slower for very small files than
Expand All @@ -67,26 +114,32 @@ def reverse_readfile(filename: Union[str, Path]) -> Generator[str, str, None]:
Yields:
Lines from the file in reverse order.
"""
# Generate line ending
l_end = _get_line_ending(filename)

try:
with zopen(filename, "rb") as file:
if isinstance(file, (gzip.GzipFile, bz2.BZ2File)):
for line in reversed(file.readlines()):
yield line.decode("utf-8").rstrip(os.linesep)
yield line.decode("utf-8").rstrip(l_end)

else:
filemap = mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ)
n = len(filemap)
while n > 0:
i = filemap.rfind(os.linesep.encode(), 0, n)
yield filemap[i + 1 : n].decode("utf-8").rstrip(os.linesep)
i = filemap.rfind(l_end.encode(), 0, n)
yield filemap[i + 1 : n].decode("utf-8")
DanielYang59 marked this conversation as resolved.
Show resolved Hide resolved
n = i

except ValueError:
DanielYang59 marked this conversation as resolved.
Show resolved Hide resolved
return


def reverse_readline(
m_file, blk_size: int = 4096, max_mem: int = 4000000
) -> Generator[str, str, None]:
m_file,
DanielYang59 marked this conversation as resolved.
Show resolved Hide resolved
blk_size: int = 4096,
max_mem: int = 4000000,
) -> Iterator[str]:
"""
Generator function to read a file line-by-line, but backwards.
This allows one to efficiently get data at the end of a file.
Expand All @@ -104,68 +157,75 @@ def reverse_readline(

Args:
m_file (File): File stream to read (backwards)
blk_size (int): The buffer size. Defaults to 4096.
blk_size (int): The buffer size in bytes. Defaults to 4096.
max_mem (int): The maximum amount of memory to involve in this
operation. This is used to determine when to reverse a file
in-memory versus seeking portions of a file. For bz2 files,
this sets the maximum block size.

Returns:
Generator that yields lines from the file. Behave similarly to the
file.readline() function, except the lines are returned from the back
of the file.
Yields:
Lines from the file. Behave similarly to the file.readline function,
except the lines are returned from the back of the file.
"""
# Check if the file stream is a bit stream or not
# Generate line ending
l_end = _get_line_ending(m_file)

# Check if the file stream is a buffered text stream
is_text = isinstance(m_file, io.TextIOWrapper)

try:
file_size = os.path.getsize(m_file.name)
except AttributeError:
# Bz2 files do not have name attribute. Just set file_size to above
# max_mem for now.
# Bz2 files do not have "name" attribute.
# Just set file_size to max_mem for now.
file_size = max_mem + 1

# If the file size is within our desired RAM use, just reverse it in memory
# GZip files must use this method because there is no way to negative seek
# If the file size is within desired RAM limit, just reverse it in memory.
# GZip files must use this method because there is no way to negative seek.
# For windows, we also read the whole file.
if file_size < max_mem or isinstance(m_file, gzip.GzipFile) or os.name == "nt":
if os.name == "nt" or file_size < max_mem or isinstance(m_file, gzip.GzipFile):
for line in reversed(m_file.readlines()):
yield line.rstrip()
DanielYang59 marked this conversation as resolved.
Show resolved Hide resolved
yield (
line.rstrip(l_end)
if isinstance(line, str)
else line.decode().rstrip(l_end)
)

else:
if isinstance(m_file, bz2.BZ2File):
# for bz2 files, seeks are expensive. It is therefore in our best
# For bz2 files, seeks are expensive. It is therefore in our best
# interest to maximize the blk_size within limits of desired RAM
# use.
blk_size = min(max_mem, file_size)

buf = ""
m_file.seek(0, 2)
lastchar = m_file.read(1) if is_text else m_file.read(1).decode("utf-8")
last_char = m_file.read(1) if is_text else m_file.read(1).decode("utf-8")

trailing_newline = lastchar == os.linesep
DanielYang59 marked this conversation as resolved.
Show resolved Hide resolved
trailing_newline = last_char == l_end

while True:
newline_pos = buf.rfind(os.linesep)
DanielYang59 marked this conversation as resolved.
Show resolved Hide resolved
newline_pos = buf.rfind(l_end)
pos = m_file.tell()
if newline_pos != -1:
# Found a newline
line = buf[newline_pos + 1 :]
buf = buf[:newline_pos]
if pos or newline_pos or trailing_newline:
line += os.linesep
line += l_end
yield line

elif pos:
# Need to fill buffer
toread = min(blk_size, pos)
m_file.seek(pos - toread, 0)
to_read = min(blk_size, pos)
m_file.seek(pos - to_read, 0)
if is_text:
buf = m_file.read(toread) + buf
buf = m_file.read(to_read) + buf
else:
buf = m_file.read(toread).decode("utf-8") + buf
m_file.seek(pos - toread, 0)
if pos == toread:
buf = os.linesep + buf
buf = m_file.read(to_read).decode("utf-8") + buf
m_file.seek(pos - to_read, 0)
if pos == to_read:
buf = l_end + buf

else:
# Start-of-file
Expand Down
Loading
Loading