Replace Antiword with a Python alternative #468

SMillerDev · 2023-06-12T08:50:57Z

Is your feature request related to a problem? Please describe.
Antiword hasn't been updated for a while, and now the source has completely disappeared. It would be good to use an alternative way to parse Word files.

Which filetype should textract support?
docx

Which external software (python or command line tool), can parse the requested file type
https://pypi.org/project/docx-parser/

Describe alternatives you've considered
Nothing is done, and package managers drop antiword and everything that depends on it, including textract.

Additional context
Relates to Homebrew/homebrew-core#131387

michelemaroni · 2023-06-14T09:03:10Z

According to the documentation, antiword is used for parsing old MS Word binary doc files (Word 97-2003), while newer MS Word docx files are parsed with python-docx2txt. It is not clear how docx-parser would help with the older Word 97-2003 files.

One issue to consider is that a file with the doc extension can be either a Word 97-2003 file or a newer Word file.
Maybe abiword could be a better alternative in this regard.
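
To illustrate the extension ambiguity: the two formats can usually be told apart by their first bytes rather than by file name. Word 97-2003 binaries are stored in an OLE2 compound file, which starts with the signature `D0 CF 11 E0 A1 B1 1A E1`, while docx files are ZIP archives, which start with `PK\x03\x04`. The sketch below is only an assumption about how a check could look; `classify_header` and `sniff_file` are hypothetical names, not part of textract. Note that a match only identifies the container: other OLE2 files (xls, ppt) and other ZIP archives carry the same signatures.

```python
# Hypothetical helper names; not part of textract's API.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE2 compound file (Word 97-2003 .doc)
ZIP_MAGIC = b"PK\x03\x04"                        # ZIP local file header (.docx container)


def classify_header(header: bytes) -> str:
    """Guess the container format from the first bytes of a file."""
    if header.startswith(OLE2_MAGIC):
        return "ole2"     # likely a Word 97-2003 binary document
    if header.startswith(ZIP_MAGIC):
        return "zip"      # likely an Office Open XML (docx) document
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first 8 bytes of a file and classify them."""
    with open(path, "rb") as fh:
        return classify_header(fh.read(8))
```

A parser could call `sniff_file` on a `.doc` path and route `"zip"` results to the docx code path instead of the legacy one.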

SMillerDev · 2023-06-14T10:49:51Z

Thanks for pointing that out, I must have misread what antiword was actually used for. I don't use textract myself, so unfortunately I can't help much with evaluating Abiword. I just wanted to make sure that the team here was aware of the disappearance of Antiword.
