Commit 987f26b

Merge remote-tracking branch 'remotes/origin/master' into dev
# Conflicts:
#   requirements.txt
2 parents cd54eeb + d967cdc commit 987f26b

8 files changed: +124 -21 lines changed

.github/workflows/python-ci.yml

Lines changed: 2 additions & 8 deletions
@@ -5,7 +5,7 @@ name: CI

 on:
   push:
-    branches: [ dev ]
+    branches: [ master, dev ]
   pull_request:
     branches: [ master ]

@@ -15,7 +15,7 @@ jobs:
     strategy:
       matrix:
         os: [ubuntu-latest, macos-latest, windows-latest]
-        python-version: ["3.6", "3.7", "3.8", "3.9", "3.10"]
+        python-version: ["3.7", "3.8", "3.9", "3.10"]
     steps:
     - uses: actions/checkout@v2
     - name: Set up Python ${{ matrix.python-version }}
@@ -31,9 +31,3 @@ jobs:
       run: |
         python setup.py install sdist bdist_wheel
         pytest --disable-warnings
-    - name: Publish
-      uses: pypa/gh-action-pypi-publish@release/v1
-      if: github.event_name == 'release' && github.event.action == 'created'
-      with:
-        user: __token__
-        password: ${{ secrets.PYPI_API_TOKEN }}
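
GitHub Actions runs one job per combination of matrix values, so dropping `"3.6"` from `python-version` also shrinks each CI run. A minimal sketch of that arithmetic, using the `os` and `python-version` lists from the hunk above (the helper name `matrix_jobs` is made up for illustration):

```python
from itertools import product


def matrix_jobs(os_list, python_versions):
    """Return every (os, python-version) pair the matrix expands to."""
    return list(product(os_list, python_versions))


OS_LIST = ["ubuntu-latest", "macos-latest", "windows-latest"]
OLD_VERSIONS = ["3.6", "3.7", "3.8", "3.9", "3.10"]
NEW_VERSIONS = ["3.7", "3.8", "3.9", "3.10"]

old_jobs = matrix_jobs(OS_LIST, OLD_VERSIONS)
new_jobs = matrix_jobs(OS_LIST, NEW_VERSIONS)

# 3 operating systems x 5 versions before, 3 x 4 after
print(len(old_jobs), "->", len(new_jobs))  # 15 -> 12
```

The versions are kept as strings, as in the workflow file, because an unquoted `3.10` would be read by YAML as the number `3.1`.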

.github/workflows/python-publish.yml

Lines changed: 5 additions & 3 deletions
@@ -1,4 +1,4 @@
-name: CI
+name: Publish

 on:
   release:
@@ -15,7 +15,7 @@ jobs:
     strategy:
       matrix:
         os: [ubuntu-latest]
-        python-version: ["3.6"]
+        python-version: ["3.7"]
     steps:
     - uses: actions/checkout@v2
     - name: Set up Python ${{ matrix.python-version }}
@@ -37,10 +37,12 @@ jobs:
       with:
         user: __token__
         password: ${{ secrets.PYPI_API_TOKEN }}
-        repository_url: https://test.pypi.org/legacy/
+        repository-url: https://test.pypi.org/legacy/
+        skip-existing: true
     - name: Publish
       uses: pypa/gh-action-pypi-publish@release/v1
       if: github.event_name == 'create' && github.event.ref_type == 'tag'
       with:
         user: __token__
         password: ${{ secrets.PYPI_API_TOKEN }}
+        skip-existing: true

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -13,3 +13,4 @@ env/
 htmlcov/
 .mypy_cache/
 *.pypirc
+.eggs/

README.md

Lines changed: 4 additions & 4 deletions
@@ -1,20 +1,20 @@
 # article-parser

 ![GitHub Repo stars](https://img.shields.io/github/stars/myifeng/article-parser)
-![GitHub Workflow Status](https://img.shields.io/github/workflow/status/myifeng/article-parser/CI)
 [![python](https://img.shields.io/pypi/pyversions/article-parser)](https://pypi.org/project/article-parser/)
 [![pypi](https://img.shields.io/pypi/v/article-parser)](https://pypi.org/project/article-parser/)
 [![wheel](https://img.shields.io/pypi/wheel/article-parser)](https://pypi.org/project/article-parser/)
 [![license](https://img.shields.io/github/license/myifeng/article-parser)](https://pypi.org/project/article-parser/)
 ![PyPI - Downloads](https://img.shields.io/pypi/dd/article-parser)


-Extract article or news by url or html, parse the title and content, output in markdown format.
+**Extract article or news by url or html, parse the title and content.**

+*[English](https://github.com/myifeng/article-parser/blob/master/README.md)[简体中文](https://github.com/myifeng/article-parser/blob/master/README.zh-CN.md)*

 ## How to install

-`article-parser` is available on pypi
+[`article-parser`](https://pypi.org/project/article-parser/) is available on pypi
 https://pypi.org/project/article-parser/

 ```
@@ -105,4 +105,4 @@ Djokovic wins record 36th Masters title in Rome
 ```
 ## Contributors

-[![All contributions](https://contrib.rocks/image?repo=myifeng/article-parser)](https://github.com/myifeng/mybatis-operation-log/graphs/contributors)
+[![All contributions](https://contrib.rocks/image?repo=myifeng/article-parser)](https://github.com/myifeng/article-parser/graphs/contributors)

README.zh-CN.md

Lines changed: 107 additions & 0 deletions
@@ -0,0 +1,107 @@
+# article-parser
+
+![GitHub Repo stars](https://img.shields.io/github/stars/myifeng/article-parser)
+[![python](https://img.shields.io/pypi/pyversions/article-parser)](https://pypi.org/project/article-parser/)
+[![pypi](https://img.shields.io/pypi/v/article-parser)](https://pypi.org/project/article-parser/)
+[![wheel](https://img.shields.io/pypi/wheel/article-parser)](https://pypi.org/project/article-parser/)
+[![license](https://img.shields.io/github/license/myifeng/article-parser)](https://pypi.org/project/article-parser/)
+![PyPI - Downloads](https://img.shields.io/pypi/dd/article-parser)
+
+
+**A general-purpose library that parses the title and main content of a web page from any URL or HTML file**
+
+*[English](https://github.com/myifeng/article-parser/blob/master/README.md)[简体中文](https://github.com/myifeng/article-parser/blob/master/README.zh-CN.md)*
+
+## Installation
+
+[`article-parser`](https://pypi.org/project/article-parser/) can be downloaded from [`PyPI`](https://pypi.org/project/article-parser/)
+
+```
+$ pip install article-parser
+```
+
+## Usage
+
+```python
+>>> import article_parser
+
+article_parser.parse(
+    url='', # URL of the web page
+    html='', # HTML file content
+    threshold=0.9, # threshold, default 0.9
+    output='html', # output format, supports markdown and html, default html
+    **kwargs # optional parameters
+),
+
+
+## Output in markdown format
+>>> title, content = article_parser.parse(url="http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html", output='markdown', timeout=5)
+
+## Output in html format
+>>> title, content = article_parser.parse(url="http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html", timeout=5)
+
+```
+
+## Example
+[Djokovic wins record 36th Masters title in Rome - Chinadaily.com.cn](http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html)
+
+
+* Markdown
+
+```python
+>>> import article_parser
+>>> title, content = article_parser.parse(url="http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html", output='markdown', timeout=5)
+>>> print(title)
+>>> print('----------------')
+>>> print(content)
+
+Djokovic wins record 36th Masters title in Rome
+----------------
+![](http://img2.chinadaily.com.cn/images/202009/22/5f6962b2a31024adbd959228.jpeg)
+Serbia's Novak Djokovic kisses the trophy after winning the final against
+Argentina's Diego Schwartzman at Italian Open, Foro Italico, Rome, Italy, Sept
+21, 2020. [Photo/Agencies]
+
+ROME - Novak Djokovic won a record 36th Masters crown as he beat Diego
+Schwartzman in the men's final of the ATP Italian Open on Monday.
+
+Djokovic, the world number one and the top seed at the tournament, won 7-5,
+6-3 against Argentine Schwartzman to lift his 36th Masters title, one more
+than Rafael Nadal.
+
+The Serb said he did not play his best tennis this time in Rome, but could
+find it when needed.
+
+Simona Halep, top seed of the women's draw, won her first title in Rome after
+defending champion Karolina Pliskova of the Czech Republic retired while
+trailing 6-0, 2-1 in the final.
+```
+
+
+* HTML
+```python
+>>> import article_parser
+>>> title, content = article_parser.parse(url="http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html", timeout=5)
+>>> print(title)
+>>> print('----------------')
+>>> print(content)
+
+Djokovic wins record 36th Masters title in Rome
+----------------
+<div id="Content">
+
+<figure class="image" style="display: table;">
+<img data-from="newsroom" id="img-5f6962b2a31024adbd959228" src="//img2.chinadaily.com.cn/images/202009/22/5f6962b2a31024adbd959228.jpeg"/>
+<figcaption style="font-size: 14px; display: table-caption; caption-side: bottom;">
+Serbia's Novak Djokovic kisses the trophy after winning the final against Argentina's Diego Schwartzman at Italian Open, Foro Italico, Rome, Italy, Sept 21, 2020. [Photo/Agencies]
+</figcaption>
+</figure>
+<p dir="ltr">ROME - Novak Djokovic won a record 36th Masters crown as he beat Diego Schwartzman in the men's final of the ATP Italian Open on Monday.</p>
+<p dir="ltr">Djokovic, the world number one and the top seed at the tournament, won 7-5, 6-3 against Argentine Schwartzman to lift his 36th Masters title, one more than Rafael Nadal.</p>
+<p dir="ltr">The Serb said he did not play his best tennis this time in Rome, but could find it when needed.</p>
+<p dir="ltr">Simona Halep, top seed of the women's draw, won her first title in Rome after defending champion Karolina Pliskova of the Czech Republic retired while trailing 6-0, 2-1 in the final.</p>
+</div>
+```
+## Contributors
+
+[![All contributions](https://contrib.rocks/image?repo=myifeng/article-parser)](https://github.com/myifeng/article-parser/graphs/contributors)

article_parser/Extractor.py

Lines changed: 2 additions & 2 deletions
@@ -97,9 +97,9 @@ def __download(self) -> str:
         html = response.text
         return html

+
     def parse(self) -> tuple:
-        soup = BeautifulSoup(self.html, 'lxml')
-        soup = soup.find('body')
+        soup = BeautifulSoup(self.html, 'lxml').find('body')
         if soup:
             for tag in soup.find_all(style=re.compile('display:\s?none')):
                 tag.extract()
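
The context lines of this hunk drop elements whose inline `style` matches `display:\s?none`. Beautiful Soup applies a compiled pattern with its `search()` method, so the behaviour can be sketched with the standard `re` module alone. The sample style strings below are invented for illustration, and the pattern is written as a raw string, which avoids the invalid-escape warning that newer Python versions emit for `\s` in a plain string literal:

```python
import re

# Same pattern as in Extractor.parse, written as a raw string.
HIDDEN_STYLE = re.compile(r'display:\s?none')

# Hypothetical style attribute values mapped to the expected match result.
samples = {
    "display:none": True,
    "display: none;": True,
    "color: red; display:none": True,
    "display:  none": False,  # two spaces: \s? allows at most one
    "display:block": False,
    "DISPLAY:NONE": False,    # the pattern is case-sensitive
}

for style, expected in samples.items():
    matched = HIDDEN_STYLE.search(style) is not None
    print(f"{style!r}: matched={matched}, expected={expected}")
```

The last two `False` rows show styles that hide an element in a browser but would not be removed by this pattern.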

requirements.txt

Lines changed: 2 additions & 2 deletions
@@ -1,5 +1,5 @@
-beautifulsoup4==4.11.1
+beautifulsoup4==4.12.2
 html2text==2020.1.16
-requests==2.27.1
+requests==2.31.0
 setuptools==49.2.1
 lxml==4.9.0
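
To double-check which pins this hunk moves, the old and new sides can be compared mechanically. A rough sketch, assuming every line has the simple `name==version` form seen above; the helper names are hypothetical, and real requirements files can contain markers, extras, and options that this does not handle:

```python
def parse_pins(text):
    """Map package name -> pinned version for lines of the form name==version."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        name, _, version = line.partition("==")
        pins[name] = version
    return pins


def changed_pins(old_text, new_text):
    """Return {name: (old_version, new_version)} for pins that differ."""
    old, new = parse_pins(old_text), parse_pins(new_text)
    return {
        name: (old[name], new[name])
        for name in old
        if name in new and old[name] != new[name]
    }


OLD = """beautifulsoup4==4.11.1
html2text==2020.1.16
requests==2.27.1
setuptools==49.2.1
lxml==4.9.0
"""

NEW = """beautifulsoup4==4.12.2
html2text==2020.1.16
requests==2.31.0
setuptools==49.2.1
lxml==4.9.0
"""

bumps = changed_pins(OLD, NEW)
print(bumps)
# {'beautifulsoup4': ('4.11.1', '4.12.2'), 'requests': ('2.27.1', '2.31.0')}
```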

setup.py

Lines changed: 1 addition & 2 deletions
@@ -28,15 +28,14 @@ def read_file(filename):
         "Operating System :: OS Independent",
         "Programming Language :: Python",
         "Programming Language :: Python :: 3",
-        "Programming Language :: Python :: 3.6",
         "Programming Language :: Python :: 3.7",
         "Programming Language :: Python :: 3.8",
         "Programming Language :: Python :: 3.9",
         "Programming Language :: Python :: 3.10",
         "Programming Language :: Python :: 3 :: Only",
         "Programming Language :: Python :: Implementation",
     ],
-    python_requires='>=3.6',
+    python_requires='>=3.7',
     version_config={
         "template": "{tag}",
         "dirty_template": "{tag}",

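`python_requires='>=3.7'` is recorded in the package metadata, and pip uses it to reject installation on older interpreters. The same floor can be expressed as a plain tuple comparison. The sketch below only illustrates the `>=3.7` rule (the function name is made up); it is not how pip evaluates version specifiers internally:

```python
import sys

MIN_PYTHON = (3, 7)  # mirrors python_requires='>=3.7'


def meets_floor(version_info, minimum=MIN_PYTHON):
    """True when the (major, minor, ...) tuple is at or above the minimum."""
    return tuple(version_info[:2]) >= minimum


print(meets_floor((3, 6, 15)))  # False
print(meets_floor((3, 7, 0)))   # True
print(meets_floor((3, 10, 4)))  # True
print(meets_floor(sys.version_info))
```

Comparing integer tuples rather than strings matters here: as strings, `"3.10"` sorts before `"3.7"`, which would wrongly reject Python 3.10.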