|
1 | | -# file-batch-renamer: Python batch file-renaming script |
| 1 | +## Python Batch File Renamer |
2 | 2 |
|
3 | | -> a file batch renamer based on python (include Chinese) |
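| 3 | +* An ultimate renaming machine built on Python
| 4 | +* A batch file renamer based on Python (handles Chinese text)
| 5 | +* Automatically analyzes most file types in a folder and renames them in batch
| 6 | +* Renaming files has always been a chore; try it and you will see
| 7 | +* Handy for tidying IT and office documents and a cluttered downloads folder
| 8 | +* A simple practice project: it exercises third-party packages and touches many aspects of writing a script, so it suits Python beginners
| 9 | +* Works with a cloud backend plus a local client, or entirely locally
| 10 | +* An exe is provided for non-programmers; a temporary server is provided in the cloud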
4 | 11 |
|
5 | | -- Updated 2019.1.2: |
| 12 | +[](https://github.com/autolordz/file-batch-renamer) |
| 13 | +[](https://github.com/autolordz/file-batch-renamer/blob/master/LICENSE) |
| 14 | + |
| 15 | +## Tika Version Architecture
| 16 | + |
| 17 | + |
| 18 | +(If a cloud setup is not possible, everything can run locally)
| 19 | + |
| 20 | +## Updated |
| 21 | + |
| 22 | +- Updated 2019.8.10: |
| 23 | + - Improved **Apache Tika** version, cloud plus local: the ultimate automatic renaming machine
| 24 | + |
| 25 | +- Updated 2019.1.2: |
6 | 26 | - New version: parses all files with **Apache Tika**
7 | 27 | - Old version: parses files with **Python 3rd party** packages
8 | 28 |
|
| 29 | +<!--more--> |
9 | 30 | ---------------- |
10 | 31 |
|
11 | | -## Tutorial |
12 | | - |
13 | | -### 1. Tika | Tesseract OCR |
14 | | - |
15 | | -- Files |
16 | | - - batch-renamer-tika.py |
17 | | - |
18 | | -- Requirements |
19 | | - - [zhon](https://pypi.org/project/zhon/) zhon to deal with Chinese |
20 | | - - [tika](https://pypi.org/project/tika/) tika for python |
21 | | - - [Java Jre jre-8u91-windows-x64](https://www.oracle.com/technetwork/java/javase/downloads/java-archive-javase8-2177648.html) JRE 8 is the minimum version and a suitable package
22 | | - - [Tesseract v4.0.0.20181030](https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w32-setup-v4.0.0.20181030.exe) Tesseract for Image OCR |
23 | | - |
24 | | -- Supported Platform: |
25 | | - - [x] Win7 32-bit, Win10 64-bit; other platforms untested
26 | | - |
27 | | -- Supported Files: |
28 | | - - [x] docx,pptx,xlsx |
29 | | - - [x] doc,ppt,xls |
30 | | - - [x] epub,rar,zip,tar,html,pdf |
31 | | - - [x] png,jpg,jpeg,bmp,tif |
32 | | - - [x] others(follows [tika](http://tika.apache.org/1.20/formats.html)) |
33 | | - |
34 | | -- Usage: |
35 | | - - Install the prerequisites
36 | | - - installplug.bat
37 | | - - setenv.bat
38 | | - - Put the files to rename in the current directory
39 | | - - Run batch-renamer-tika.(py|exe)
40 | | - |
41 | | -#### 2. Python 3rd party | Tesseract OCR |
42 | | - |
43 | | -- Files |
44 | | - - batch-renamer.py |
45 | | - - extrectImage.py (Author: BJ Jang (jangbi882 at gmail.com)) |
46 | | - |
47 | | -- Requirements |
48 | | - - [python-pptx](https://pypi.org/project/python-pptx/) PowerPoint format
49 | | - - [python-docx](https://pypi.org/project/python-docx/) Word format
50 | | - - [xlrd](https://pypi.org/project/xlrd/) Excel format
51 | | - - [zhon](https://pypi.org/project/zhon/) Chinese character extraction
52 | | - - [PyPDF2](https://github.com/mstamy2/PyPDF2) PDF extraction
53 | | - - [PDFMiner](https://github.com/euske/pdfminer/) PDF extraction
54 | | - - [pytesseract](https://pypi.org/project/pytesseract/) image recognition
55 | | - |
56 | | -- Supported Platform: |
57 | | - - [x] Win7 32-bit, Win10 64-bit; other platforms untested
58 | | - |
59 | | -- Supported Files: |
60 | | - - [x] docx,pptx,xlsx |
61 | | - - [x] doc,ppt,xls |
62 | | - - [x] pdf |
63 | | - - [x] png,jpg,jpeg,bmp,tif |
64 | | - |
65 | | -- Usage: |
66 | | - - Install the prerequisites, or install the packages manually
67 | | - - installplug.bat
68 | | - - setenv.bat
69 | | - - Put the files to rename in the current directory
70 | | - - Run batch-renamer.(py|exe)
71 | | - |
72 | | -[](https://github.com/autolordz/docx-content-modify/blob/master/LICENSE) |
| 32 | +## Environment
| 33 | + |
| 34 | +* conda : 4.6.14 |
| 35 | +* python : 3.7.3 |
| 36 | +* Win10 + Spyder 3.3.4 (open the script and run it top to bottom, or add your own main to run it as a .py file)
| 37 | + |
| 38 | +* Components: Tika version
| 39 | + - [zhon](https://pypi.org/project/zhon/) provides Chinese character sets
| 40 | + - [opencv](https://pypi.org/project/opencv-python/) image processing, thresholding, filters, etc.
| 41 | + - [PIL](https://pypi.org/project/Pillow/) image processing
| 42 | + - [fitz](https://pypi.org/project/PyMuPDF/) extracts images from PDFs
| 43 | + - [jieba](https://github.com/fxsjy/jieba) Chinese word segmentation
| 44 | + - [numpy,requests,string,json,glob,time,os,re,subprocess,configparser,BeautifulSoup4]
| 45 | + - [Java jre-8u91-windows-x64](https://www.oracle.com/technetwork/java/javase/downloads/java-archive-javase8-2177648.html) JRE 8 is the minimum version and a suitable package
| 46 | + - [tika server](https://www.apache.org/dyn/closer.cgi/tika/tika-server-1.22.jar) not bundled with the project; you must download it
| 47 | + - **Tesseract (cloud)**: see the cloud [Tesseract] installation notes
| 48 | + |
| 49 | +* Components: standard version
| 50 | + - [Tesseract v4.0](https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w32-setup-v4.0.0.20181030.exe) Tesseract for Image OCR |
| 51 | + - [PyPDF2,pdfminer,pytesseract,docx,pptx,xlrd,PIL,extrectImage] |
| 52 | + |
| 53 | +* Packaging tool: pyinstaller
| 54 | + |
| 55 | +- **From here on, updates and maintenance focus on the Tika version; the standard version's code is kept as-is**
| 56 | + |
| 57 | +## Contents
| 58 | + |
| 59 | +- [x] Renames files in the following formats
| 60 | + - [x] ['.txt','.html','.epub','.chm','.wps','.md', |
| 61 | + '.doc','.odt','.docx','.xlsx','.csv','.xls','.rtf', |
| 62 | + '.rar','.zip','.tar','.tgz','.7z', |
| 63 | + '.mp4','.gif','.flv','.mkv','.swf','.psd', |
| 64 | + '.mp3','.m4a','.flac', |
| 65 | + '.pdf',] |
| 66 | + - [x] ['.ppt','.pptx','.pptm'] |
| 67 | + - [x] ['.png','.jpg','.jpeg','.bmp','.tif'] |
| 68 | + - [x] others (rules follow [tika](http://tika.apache.org/1.20/formats.html)) |
| 69 | + |
| 70 | +- [x] Skips files in the following formats (they are not renamed)
| 71 | + - [x] ['.bat','.jar','.exe','.py','.ini'] |
| 72 | + |
| 73 | +- [x] Supported platforms
| 74 | + - [x] Win7 32-bit, Win10 64-bit; on other platforms, adjust the code according to the errors you get
| 75 | + |
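The format lists above suggest a simple extension-based dispatch. The following is a minimal sketch of that idea, not the script's actual code: the extension sets are copied from the README, while the group names and the `classify` helper are hypothetical.

```python
from pathlib import Path

# Extension lists copied from the README. The constant names and the
# category labels are assumptions made for this illustration.
DOC_EXTS = {
    '.txt', '.html', '.epub', '.chm', '.wps', '.md',
    '.doc', '.odt', '.docx', '.xlsx', '.csv', '.xls', '.rtf',
    '.rar', '.zip', '.tar', '.tgz', '.7z',
    '.mp4', '.gif', '.flv', '.mkv', '.swf', '.psd',
    '.mp3', '.m4a', '.flac',
    '.pdf',
}
SLIDE_EXTS = {'.ppt', '.pptx', '.pptm'}
IMAGE_EXTS = {'.png', '.jpg', '.jpeg', '.bmp', '.tif'}
SKIP_EXTS = {'.bat', '.jar', '.exe', '.py', '.ini'}


def classify(filename: str) -> str:
    """Return a coarse category for a file, based only on its extension."""
    ext = Path(filename).suffix.lower()
    if ext in SKIP_EXTS:
        return "skip"        # never renamed
    if ext in IMAGE_EXTS:
        return "image"       # candidate for OCR
    if ext in SLIDE_EXTS:
        return "slides"
    if ext in DOC_EXTS:
        return "document"
    return "other"           # left to Tika's own format detection
```

Checking the skip list first keeps helper files such as `installplug.bat` or the script itself from being renamed when they sit in the same folder as the targets.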
| 76 | +## Usage
| 77 | + |
| 78 | +The related files are in the flask_app directory
| 79 | + |
| 80 | +- Cloud [tika] deployment
| 81 | + |
| 82 | +```shell |
| 83 | +# Start tika on CentOS
| 84 | +nohup java -Djava.awt.headless=true -jar tika-server.jar --host=yourhost --port=3232 >/dev/null & |
| 85 | + |
| 86 | +# Stop tika on CentOS
| 87 | +ps -ef | grep tika-server | grep -v grep | awk '{print $2}' | xargs kill -9 |
| 88 | +``` |
| 89 | + |
| 90 | +- Local [tika] deployment
| 91 | + |
| 92 | +```shell |
| 93 | + |
| 94 | +# Start tika on Windows
| 95 | + |
| 96 | +start /b java -Djava.awt.headless=true -jar tika-server.jar --config=tika-config.xml --host=127.0.0.1 --port=3232 |
| 97 | + |
| 98 | +# [tika-config.xml makes tika skip the local Tesseract, which speeds up reading non-image files]
| 99 | + |
| 100 | +# Stop tika on Windows
| 101 | + |
| 102 | +taskkill /F /FI "IMAGENAME eq java.exe" |
| 103 | +``` |
| 104 | +- Cloud [flask] deployment
| 105 | + |
| 106 | +```shell |
| 107 | +# Start
| 108 | +nohup python3 /pyweb/app.py >/dev/null & |
| 109 | + |
| 110 | +# Stop
| 111 | +ps -ef | grep pyweb | grep -v grep | awk '{print $2}' | xargs kill -9 |
| 112 | +``` |
| 113 | + |
| 114 | +- Cloud [Tesseract] installation
| 115 | + |
| 116 | + - Installing Tesseract 4+ on CentOS 6.5
| 117 | + - Follow https://www.jianshu.com/p/bf8521703143 with these differences:
| 118 | + - autoconf-2.63-5.1.el6.noarch works, so it was kept; 2.69 is not required
| 119 | + - The package actually installed was autoconf-archive-2015.02.24-1.sdl6.noarch.rpm
| 120 | + |
| 121 | +- Client installation
| 122 | + - installplug.bat -> installs the Java environment
| 123 | + - Put the files to process in the target directory
| 124 | + - Double-click -> batch-renamer-tika.exe -> processes the target directory
| 125 | + - cmd -> batch-renamer-tika.py 'yourfile' -> processes yourfile (a file or a directory)
| 126 | + |
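Once a tika server is running on the host and port shown above, a client can send it a file and get plain text back. The README does not show the script's own client code, so the following is only a sketch under assumptions: it uses the standard library's `urllib` rather than `requests`, it relies on Tika server's `/tika` endpoint (an HTTP PUT with an `Accept: text/plain` header), and the helper names `build_tika_request` and `extract_text` are hypothetical.

```python
import urllib.request


def build_tika_request(data: bytes, host: str = "127.0.0.1",
                       port: int = 3232) -> urllib.request.Request:
    """Build a PUT request asking the tika server for plain-text output.

    The host and port defaults mirror the local start command in this README.
    """
    url = f"http://{host}:{port}/tika"
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(path: str, host: str = "127.0.0.1", port: int = 3232,
                 timeout: float = 60.0) -> str:
    """Send one file to a running tika server and return the extracted text."""
    with open(path, "rb") as fh:
        request = build_tika_request(fh.read(), host, port)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

For a cloud deployment, pass the server's address instead of the defaults, for example `extract_text("target/report.pdf", host="yourhost")`.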
| 127 | +## Roadmap
| 128 | + |
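| 129 | +- [x] Name files after their opening content
| 130 | +- [x] Name files after recognized image content
| 131 | +- [ ] Name files after keywords extracted from the text (jieba)
| 132 | +- [ ] Name files after an extracted summary of the text (NLP)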
| 133 | + |
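The first item, naming a file after its opening content, could look roughly like the sketch below. This is an illustration rather than the script's actual logic: the `name_from_text` helper, the 30-character limit, and the `untitled` fallback are all assumptions. It takes the first non-empty line of the extracted text and removes characters that Windows does not allow in file names.

```python
import re

# Characters that are invalid in Windows file names, plus stray whitespace
# control characters.
ILLEGAL_CHARS = re.compile(r'[\\/:*?"<>|\r\n\t]')


def name_from_text(text: str, ext: str, max_len: int = 30) -> str:
    """Derive a file name from the first non-empty line of extracted text."""
    for line in text.splitlines():
        cleaned = ILLEGAL_CHARS.sub("", line).strip(" .")
        if cleaned:
            # Truncate, then strip again so the stem never ends in a
            # space or dot (Windows rejects such names).
            return cleaned[:max_len].rstrip(" .") + ext
    # No usable text was found, so fall back to a fixed placeholder.
    return "untitled" + ext
```

A real run would also need to handle name collisions, for instance by appending a counter when two files share the same opening line.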
| 134 | +## Licence |
| 135 | + |
| 136 | +[See Licence](#file-batch-renamer) |
73 | 137 |
|
74 | 138 | That's it, enjoy.
| 139 | + |
| 140 | + |
| 141 | + |