Commit 4cc6b9b

new tika flask process
1 parent bf06602 commit 4cc6b9b

14 files changed (+1171, -418 lines)

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -107,5 +107,7 @@ venv.bak/
107107
*.zip
108108
*.exe
109109
*.txt
110+
*.jar
110111
tmp/
111112
exe-win7-tmp/
113+
README1.md

README.md

Lines changed: 132 additions & 65 deletions
@@ -1,74 +1,141 @@
1-
# file-batch-renamer: Python batch file renaming script
1+
## Python batch file renamer
22

3-
> a file batch renamer based on python (include Chinese)
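3+
* The ultimate Python-based renaming machine
4+
* A batch file renamer based on Python (supports Chinese)
5+
* Automatically analyzes most file types in a folder and renames them in batch
6+
* Renaming files has always been tedious; try it once and you will see the difference
7+
* Handy for tidying IT/office documents and a cluttered downloads folder
8+
* A simple practice project that exercises third-party packages and touches many aspects of coding; good for Python beginners
9+
* Runs with a cloud backend plus a local client, or entirely locally
10+
* An exe is provided for non-programmers; a temporary server is provided for the cloud side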
411

5-
- Updated 2019.1.2:
12+
[![](https://img.shields.io/badge/github-source-orange.svg?style=popout&logo=github)](https://github.com/autolordz/file-batch-renamer)
13+
[![](https://img.shields.io/github/license/autolordz/file-batch-renamer.svg?style=popout&logo=github)](https://github.com/autolordz/file-batch-renamer/blob/master/LICENSE)
14+
15+
## Tika version architecture
16+
17+
![](img_tmp/flow.jpg)
18+
(Everything can run locally if a cloud server is not an option)
19+
20+
## Updated
21+
22+
- Updated 2019.8.10:
23+
- Improved **Apache Tika** version, cloud-based and local; the ultimate automatic renaming machine
24+
25+
- Updated 2019.1.2:
626
- New version: parses all file types with **Apache Tika**
727
- Old version: parses files with **Python 3rd party** packages
828

29+
<!--more-->
930
----------------
1031

11-
## Tutorial
12-
13-
### 1. Tika | Tesseract OCR
14-
15-
- Files
16-
- batch-renamer-tika.py
17-
18-
- Requirements
19-
- [zhon](https://pypi.org/project/zhon/) zhon to deal with Chinese
20-
- [tika](https://pypi.org/project/tika/) tika for python
21-
- [Java Jre jre-8u91-windows-x64](https://www.oracle.com/technetwork/java/javase/downloads/java-archive-javase8-2177648.html) JRE 8 is the minimum required version
22-
- [Tesseract v4.0.0.20181030](https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w32-setup-v4.0.0.20181030.exe) Tesseract for Image OCR
23-
24-
- Supported Platform:
25-
- [x] win7 32bit, win10 64bit; other platforms untested
26-
27-
- Supported Files:
28-
- [x] docx,pptx,xlsx
29-
- [x] doc,ppt,xls
30-
- [x] epub,rar,zip,tar,html,pdf
31-
- [x] png,jpg,jpeg,bmp,tif
32-
- [x] others(follows [tika](http://tika.apache.org/1.20/formats.html))
33-
34-
- Usage:
35-
- Install prerequisites
36-
- installplug.bat
37-
- setenv.bat
38-
- Put the files to rename in the current directory
39-
- Run batch-renamer-tika.(py|exe)
40-
41-
#### 2. Python 3rd party | Tesseract OCR
42-
43-
- Files
44-
- batch-renamer.py
45-
- extrectImage.py (Author: BJ Jang (jangbi882 at gmail.com))
46-
47-
- Requirements
48-
- [python-pptx](https://pypi.org/project/python-pptx/) PowerPoint format
49-
- [python-docx](https://pypi.org/project/python-docx/) Word format
50-
- [xlrd](https://pypi.org/project/xlrd/) Excel format
51-
- [zhon](https://pypi.org/project/zhon/) extracts Chinese text
52-
- [PyPDF2](https://github.com/mstamy2/PyPDF2) extracts PDF content
53-
- [PDFMiner](https://github.com/euske/pdfminer/) extracts PDF content
54-
- [pytesseract](https://pypi.org/project/pytesseract/) image recognition (OCR)
55-
56-
- Supported Platform:
57-
- [x] win7 32bit, win10 64bit; other platforms untested
58-
59-
- Supported Files:
60-
- [x] docx,pptx,xlsx
61-
- [x] doc,ppt,xls
62-
- [x] pdf
63-
- [x] png,jpg,jpeg,bmp,tif
64-
65-
- Usage:
66-
- Install prerequisites, or install the packages manually
67-
- installplug.bat
68-
- setenv.bat
69-
- Put the files to rename in the current directory
70-
- Run batch-renamer.(py|exe)
71-
72-
[![ForTheBadge built-with-science](http://ForTheBadge.com/images/badges/built-with-science.svg)](https://github.com/autolordz/docx-content-modify/blob/master/LICENSE)
32+
## Environment
33+
34+
* conda : 4.6.14
35+
* python : 3.7.3
36+
* Win10 + Spyder3.3.4 (open the script and run it top to bottom, or add your own main to run it with python)
37+
38+
* Components: tika version
39+
- [zhon](https://pypi.org/project/zhon/) provides Chinese character sets
40+
- [opencv](https://pypi.org/project/opencv-python/) image processing, threshold filters, etc.
41+
- [PIL](https://pypi.org/project/Pillow/) image processing
42+
- [fitz](https://pypi.org/project/PyMuPDF/) extracts images from PDFs
43+
- [jieba](https://github.com/fxsjy/jieba) word segmentation
44+
- [numpy,requests,string,json,glob,time,os,re,string,subprocess,configparser,BeautifulSoup4]
45+
- [Java jre-8u91-windows-x64](https://www.oracle.com/technetwork/java/javase/downloads/java-archive-javase8-2177648.html) JRE 8 is the minimum required version
46+
- [tika server](https://www.apache.org/dyn/closer.cgi/tika/tika-server-1.22.jar) not bundled with the project; it must be downloaded separately
47+
- **Tesseract (cloud)**: see the cloud [Tesseract] installation section
48+
49+
* Components: standard version
50+
- [Tesseract v4.0](https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w32-setup-v4.0.0.20181030.exe) Tesseract for Image OCR
51+
- [PyPDF2,pdfminer,pytesseract,docx,pptx,xlrd,PIL,extrectImage]
52+
53+
* Packaging: pyinstaller
54+
55+
- **From here on, the Tika version is the focus of updates and maintenance; the standard-version code is kept as-is**
56+
57+
## Features
58+
59+
- [x] Renames files of the following formats
60+
- [x] ['.txt','.html','.epub','.chm','.wps','.md',
61+
'.doc','.odt','.docx','.xlsx','.csv','.xls','.rtf',
62+
'.rar','.zip','.tar','.tgz','.7z',
63+
'.mp4','.gif','.flv','.mkv','.swf','.psd',
64+
'.mp3','.m4a','.flac',
65+
'.pdf',]
66+
- [x] ['.ppt','.pptx','.pptm']
67+
- [x] ['.png','.jpg','.jpeg','.bmp','.tif']
68+
- [x] others (rules follow [tika](http://tika.apache.org/1.20/formats.html))
69+
70+
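- [x] Skips (does not rename) the following formats
71+
- [x] ['.bat','.jar','.exe','.py','.ini']
72+
73+
- [x] Supported platforms
74+
- [x] win7 32bit, win10 64bit; on other platforms, adapt the code according to the errors you get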
75+
76+
## Usage
77+
78+
The related files are in the flask_app directory
79+
80+
- Cloud [tika] deployment
81+
82+
```shell
83+
# Start tika on CentOS
84+
nohup java -Djava.awt.headless=true -jar tika-server.jar --host=yourhost --port=3232 >/dev/null &
85+
86+
# Stop tika on CentOS
87+
ps -ef | grep tika-server | grep -v grep | awk '{print $2}' | xargs kill -9
88+
```
89+
90+
- Local [tika] deployment
91+
92+
```shell
93+
94+
# Start tika on Windows
95+
96+
start /b java -Djava.awt.headless=true -jar tika-server.jar --config=tika-config.xml --host=127.0.0.1 --port=3232
97+
98+
# [tika-config.xml makes tika skip the local Tesseract, which speeds up reading of non-image files]
99+
100+
# Stop tika on Windows
101+
102+
taskkill /F /FI "IMAGENAME eq java.exe"
103+
```
104+
- Cloud [flask] deployment
105+
106+
```shell
107+
# Start
108+
nohup python3 /pyweb/app.py >/dev/null &
109+
110+
# Stop
111+
ps -ef | grep pyweb | grep -v grep | awk '{print $2}' | xargs kill -9
112+
```
113+
114+
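- Cloud [Tesseract] installation
115+
116+
- Installing Tesseract 4+ on CentOS 6.5
117+
- Follow https://www.jianshu.com/p/bf8521703143 with these differences:
118+
- autoconf-2.63-5.1.el6.noarch works; 2.69 is not required, so the existing package can stay
119+
- What was actually installed: autoconf-archive-2015.02.24-1.sdl6.noarch.rpm
120+
121+
- Client installation
122+
- installplug.bat -> installs the Java environment
123+
- Put the files to process in the target directory
124+
- Double-click -> batch-renamer-tika.exe -> processes the target directory
125+
- cmd -> batch-renamer-tika.py 'yourfile' -> processes yourfile (file or directory)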
126+
127+
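## Roadmap
128+
129+
- [x] Name files after the beginning of their content
130+
- [x] Name files after recognized image content
131+
- [ ] Name files after keywords extracted from the text (jieba)
132+
- [ ] Name files after a summary of the text (NLP)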
133+
134+
## Licence
135+
136+
[See Licence](#file-batch-renamer)
73137

74138
That's it,enjoy.
139+
140+
141+
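
The README above describes a client that sends each file to a Tika server listening on port 3232 and renames the file from the text that comes back. A minimal sketch of that flow follows; it is not the code from this commit. It assumes the standard Tika server REST endpoint `PUT /tika` with `Accept: text/plain`. The helper names `should_rename`, `extract_text`, and `make_name`, and the naming rule itself (first non-empty line, truncated to 30 characters), are hypothetical illustrations.

```python
import re
import urllib.request
from pathlib import Path

# Extensions the README lists as skipped rather than renamed.
SKIPPED = {'.bat', '.jar', '.exe', '.py', '.ini'}


def should_rename(path):
    """Return True unless the file has one of the skipped extensions."""
    return Path(path).suffix.lower() not in SKIPPED


def extract_text(path, host='127.0.0.1', port=3232):
    """PUT the raw file to the Tika server's /tika endpoint, return plain text.

    Assumes a server started as in the README (host/port are its defaults).
    """
    req = urllib.request.Request(
        f'http://{host}:{port}/tika',
        data=Path(path).read_bytes(),
        method='PUT',
        headers={'Accept': 'text/plain'},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return resp.read().decode('utf-8', errors='replace')


def make_name(text, max_len=30):
    """Hypothetical naming rule: first non-empty line of the text, with
    whitespace and characters that Windows forbids in file names collapsed
    to single spaces, truncated to max_len characters."""
    for line in text.splitlines():
        line = re.sub(r'[\\/:*?"<>|\s]+', ' ', line).strip()
        if line:
            return line[:max_len].rstrip()
    return ''
```

With a server running, `make_name(extract_text('target/scan.pdf'))` would yield a candidate name; an empty result means no usable text was found and the file is best left untouched.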

batch-renamer.py renamed to batch-renamer-old/batch-renamer.py

Lines changed: 16 additions & 21 deletions
@@ -26,21 +26,16 @@
2626
"""
2727

2828
#%%
29-
3029
import zhon.hanzi,zhon.cedict
3130
import os,re,io,glob,shutil,string,platform
3231
import itertools as it
33-
34-
import extrectImage
3532
from pdfminer.high_level import extract_text_to_fp
36-
3733
import pytesseract
38-
from PIL import Image
39-
34+
from PIL import Image
4035
from docx import Document
4136
from pptx import Presentation
4237
from xlrd import open_workbook
43-
from win32com.client import Dispatch # for office 97-2003
38+
from win32com.client import Dispatch # for office 97-2003
4439

4540
#%%
4641
def parse_subpath(path,file):
@@ -83,12 +78,12 @@ def clean_txt_func(x,**kwargs):
8378
return xx
8479

8580

86-
#%% rename office,officex
87-
81+
#%% rename office,officex
82+
8883
def rename_officex(file,**kwargs):
8984
'''rename only judgment doc files'''
9085
suffix = os.path.splitext(file)[1]
91-
86+
9287
if suffix == '.docx':
9388
try:
9489
doc = Document(file)
@@ -99,7 +94,7 @@ def rename_officex(file,**kwargs):
9994
return x
10095
except Exception as e:
10196
print('>>> 读取 %s 失败,可能格式不正确 => %s'%(file,e))
102-
97+
10398
if suffix == '.pptx':
10499
try:
105100
prs = Presentation(file)
@@ -117,7 +112,7 @@ def rename_officex(file,**kwargs):
117112
return x
118113
except Exception as e:
119114
print('>>> 读取 %s 失败,可能格式不正确 => %s'%(file,e))
120-
115+
121116
if suffix in ['.xlsx','.xls']:
122117
try:
123118
exl = open_workbook(file)
@@ -149,12 +144,12 @@ def get_txt_text(file,**kwargs):
149144
def rename_office(file,**kwargs):
150145
name = os.path.splitext(file)[0]
151146
suffix = os.path.splitext(file)[1]
152-
147+
153148
if suffix == '.txt':
154149
x = get_txt_text(file,**kwargs)
155150
print('>>> 找到 %s 内容: %s'%(file,x))
156151
os_rename(file,x)
157-
152+
158153
if suffix == '.doc':
159154
file_txt = name + '_doc.txt'
160155
word = Dispatch("Word.Application")
@@ -165,7 +160,7 @@ def rename_office(file,**kwargs):
165160
print('>>> 找到 %s 内容: %s'%(file,x))
166161
os_rename(file,x)
167162
os.remove(file_txt)
168-
163+
169164
if suffix == '.ppt':
170165
txt = []
171166
try:
@@ -186,7 +181,7 @@ def rename_office(file,**kwargs):
186181
x = clean_txt_func(','.join(txt),**kwargs)
187182
print('>>> 找到 %s 内容: %s'%(file,x))
188183
os_rename(file,x)
189-
184+
190185
if suffix == '.xls':
191186
try:
192187
app = Dispatch("Excel.Application")
@@ -206,7 +201,7 @@ def rename_office(file,**kwargs):
206201
x = clean_txt_func(','.join(txt),**kwargs)
207202
print('>>> 找到 %s 内容: %s'%(file,x))
208203
os_rename(file,x)
209-
204+
210205
return True
211206

212207
#%% rename image
@@ -220,10 +215,10 @@ def get_image_txt(file,**kwargs):
220215
print('image size :',img.size)
221216
img = img.crop((0,0,img.width,img.height/img_h))
222217
print('image size 2:',img.size)
223-
218+
224219
pytesseract.pytesseract.tesseract_cmd = 'c:\\Program Files (x86)\\Tesseract-OCR\\tesseract.exe' \
225-
if '64bit' in platform.architecture() else 'c:\\Program Files\\Tesseract-OCR\\tesseract.exe'
226-
220+
if '64bit' in platform.architecture() else 'c:\\Program Files\\Tesseract-OCR\\tesseract.exe'
221+
227222
x = pytesseract.image_to_string(img,lang='chi_sim') # eng
228223
x = re.sub(r'\s+',',',x)
229224
print('>>> 解析 %s \n 内容: %s'%(file,x))
@@ -256,7 +251,7 @@ def get_pdf_txt(ifile,**kwargs):
256251
if len(txt) < 10:
257252
print('====decode images===')
258253
extrectImage.main(sourceName=ifile,outputFolder=odir,**kwargs)
259-
subext = [parse_subpath(odir,x) for x in
254+
subext = [parse_subpath(odir,x) for x in
260255
['*.png','*.jpg','*.jpeg','*.bmp','*.tif']]
261256
images = list(it.chain(*(glob.iglob(e) for e in subext)))
262257
print(images)

extrectImage.py renamed to batch-renamer-old/extractImage.py

Lines changed: 3 additions & 4 deletions
@@ -171,7 +171,6 @@ def get_pdfObj_contents(pdfObj,**kwargs):
171171
img = Image.open(jpgData)
172172
if mode == "CMYK":
173173
# case of CMYK invert all channel
174-
175174
# imgData = list(img.tobytes())
176175
# invData = [(255 - val) & 0xff for val in imgData]
177176
# data = struct.pack("{}B".format(len(invData)), *invData)
@@ -190,7 +189,7 @@ def get_pdfObj_contents(pdfObj,**kwargs):
190189
img.write(data)
191190
img.close()
192191
print('save to:',outFileName + ".jp2")
193-
192+
194193
# case of JBIG2
195194
elif len(leftFilters) == 1 and leftFilters[0] == '/JBIG2Decode':
196195
img = open(outFileName + ".jbig2", "wb")
@@ -222,11 +221,11 @@ def main(sourceName,**kwargs):
222221
outputFolder = kwargs.get('outputFolder',None)
223222
os.makedirs(outputFolder,exist_ok=True)
224223
fileBase = os.path.splitext(os.path.basename(sourceName))[0]
225-
224+
226225
with open(sourceName, "rb") as fp:
227226
pdfObj = PyPDF2.PdfFileReader(fp,strict=False)
228227
get_pdfObj_contents(pdfObj,fileBase=fileBase,**kwargs)
229-
228+
230229
print("Completed.")
231230

232231
# main(sourceName = 'aa.pdf', outputFolder = ".\\Temp",num_pages = 1,targetPage = None)
