Merge remote-tracking branch 'remotes/origin/master' into dev
# Conflicts:
#	requirements.txt
myifeng committed Feb 20, 2024
2 parents cd54eeb + d967cdc commit 987f26b
Showing 8 changed files with 124 additions and 21 deletions.
10 changes: 2 additions & 8 deletions .github/workflows/python-ci.yml
@@ -5,7 +5,7 @@ name: CI

on:
push:
- branches: [ dev ]
+ branches: [ master, dev ]
pull_request:
branches: [ master ]

@@ -15,7 +15,7 @@ jobs:
strategy:
matrix:
os: [ubuntu-latest, macos-latest, windows-latest]
- python-version: ["3.6", "3.7", "3.8", "3.9", "3.10"]
+ python-version: ["3.7", "3.8", "3.9", "3.10"]
steps:
- uses: actions/checkout@v2
- name: Set up Python ${{ matrix.python-version }}
@@ -31,9 +31,3 @@ jobs:
run: |
python setup.py install sdist bdist_wheel
pytest --disable-warnings
- - name: Publish
-   uses: pypa/gh-action-pypi-publish@release/v1
-   if: github.event_name == 'release' && github.event.action == 'created'
-   with:
-     user: __token__
-     password: ${{ secrets.PYPI_API_TOKEN }}
8 changes: 5 additions & 3 deletions .github/workflows/python-publish.yml
@@ -1,4 +1,4 @@
- name: CI
+ name: Publish

on:
release:
@@ -15,7 +15,7 @@ jobs:
strategy:
matrix:
os: [ubuntu-latest]
- python-version: ["3.6"]
+ python-version: ["3.7"]
steps:
- uses: actions/checkout@v2
- name: Set up Python ${{ matrix.python-version }}
@@ -37,10 +37,12 @@ jobs:
with:
user: __token__
password: ${{ secrets.PYPI_API_TOKEN }}
- repository_url: https://test.pypi.org/legacy/
+ repository-url: https://test.pypi.org/legacy/
+ skip-existing: true
- name: Publish
uses: pypa/gh-action-pypi-publish@release/v1
if: github.event_name == 'create' && github.event.ref_type == 'tag'
with:
user: __token__
password: ${{ secrets.PYPI_API_TOKEN }}
+ skip-existing: true
1 change: 1 addition & 0 deletions .gitignore
@@ -13,3 +13,4 @@ env/
htmlcov/
.mypy_cache/
*.pypirc
+ .eggs/
8 changes: 4 additions & 4 deletions README.md
@@ -1,20 +1,20 @@
# article-parser

![GitHub Repo stars](https://img.shields.io/github/stars/myifeng/article-parser)
- ![GitHub Workflow Status](https://img.shields.io/github/workflow/status/myifeng/article-parser/CI)
[![python](https://img.shields.io/pypi/pyversions/article-parser)](https://pypi.org/project/article-parser/)
[![pypi](https://img.shields.io/pypi/v/article-parser)](https://pypi.org/project/article-parser/)
[![wheel](https://img.shields.io/pypi/wheel/article-parser)](https://pypi.org/project/article-parser/)
[![license](https://img.shields.io/github/license/myifeng/article-parser)](https://pypi.org/project/article-parser/)
![PyPI - Downloads](https://img.shields.io/pypi/dd/article-parser)


- Extract article or news by url or html, parse the title and content, output in markdown format.
+ **Extract article or news by url or html, parse the title and content.**

*[English](https://github.com/myifeng/article-parser/blob/master/README.md)[简体中文](https://github.com/myifeng/article-parser/blob/master/README.zh-CN.md)*

## How to install

- `article-parser` is available on pypi
+ [`article-parser`](https://pypi.org/project/article-parser/) is available on pypi
https://pypi.org/project/article-parser/

```
@@ -105,4 +105,4 @@ Djokovic wins record 36th Masters title in Rome
```
## Contributors

- [![All contributions](https://contrib.rocks/image?repo=myifeng/article-parser)](https://github.com/myifeng/mybatis-operation-log/graphs/contributors)
+ [![All contributions](https://contrib.rocks/image?repo=myifeng/article-parser)](https://github.com/myifeng/article-parser/graphs/contributors)
107 changes: 107 additions & 0 deletions README.zh-CN.md
@@ -0,0 +1,107 @@
# article-parser

![GitHub Repo stars](https://img.shields.io/github/stars/myifeng/article-parser)
[![python](https://img.shields.io/pypi/pyversions/article-parser)](https://pypi.org/project/article-parser/)
[![pypi](https://img.shields.io/pypi/v/article-parser)](https://pypi.org/project/article-parser/)
[![wheel](https://img.shields.io/pypi/wheel/article-parser)](https://pypi.org/project/article-parser/)
[![license](https://img.shields.io/github/license/myifeng/article-parser)](https://pypi.org/project/article-parser/)
![PyPI - Downloads](https://img.shields.io/pypi/dd/article-parser)


**A general-purpose library that parses the title and body text of a web page from any URL or HTML file**

*[English](https://github.com/myifeng/article-parser/blob/master/README.md)[简体中文](https://github.com/myifeng/article-parser/blob/master/README.zh-CN.md)*

## Installation

[`article-parser`](https://pypi.org/project/article-parser/) is available for download from [`PyPI`](https://pypi.org/project/article-parser/)

```
$ pip install article-parser
```

## Usage

```python
>>> import article_parser

article_parser.parse(
url='', # URL of the web page
html='', # HTML content
threshold=0.9, # Threshold, default 0.9
output='html', # Output format: markdown or html, default html
**kwargs # Optional arguments
)


## Output in markdown format
>>> title, content = article_parser.parse(url="http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html", output='markdown', timeout=5)

## Output in HTML format
>>> title, content = article_parser.parse(url="http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html", timeout=5)

```

## Example
[Djokovic wins record 36th Masters title in Rome - Chinadaily.com.cn](http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html)


* Markdown

```python
>>> import article_parser
>>> title, content = article_parser.parse(url="http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html", output='markdown', timeout=5)
>>> print(title)
>>> print('----------------')
>>> print(content)

Djokovic wins record 36th Masters title in Rome
----------------
![](http://img2.chinadaily.com.cn/images/202009/22/5f6962b2a31024adbd959228.jpeg)
Serbia's Novak Djokovic kisses the trophy after winning the final against
Argentina's Diego Schwartzman at Italian Open, Foro Italico, Rome, Italy, Sept
21, 2020. [Photo/Agencies]

ROME - Novak Djokovic won a record 36th Masters crown as he beat Diego
Schwartzman in the men's final of the ATP Italian Open on Monday.

Djokovic, the world number one and the top seed at the tournament, won 7-5,
6-3 against Argentine Schwartzman to lift his 36th Masters title, one more
than Rafael Nadal.

The Serb said he did not play his best tennis this time in Rome, but could
find it when needed.

Simona Halep, top seed of the women's draw, won her first title in Rome after
defending champion Karolina Pliskova of the Czech Republic retired while
trailing 6-0, 2-1 in the final.
```


* HTML
```python
>>> import article_parser
>>> title, content = article_parser.parse(url="http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html", timeout=5)
>>> print(title)
>>> print('----------------')
>>> print(content)

Djokovic wins record 36th Masters title in Rome
----------------
<div id="Content">

<figure class="image" style="display: table;">
<img data-from="newsroom" id="img-5f6962b2a31024adbd959228" src="//img2.chinadaily.com.cn/images/202009/22/5f6962b2a31024adbd959228.jpeg"/>
<figcaption style="font-size: 14px; display: table-caption; caption-side: bottom;">
Serbia's Novak Djokovic kisses the trophy after winning the final against Argentina's Diego Schwartzman at Italian Open, Foro Italico, Rome, Italy, Sept 21, 2020. [Photo/Agencies]
</figcaption>
</figure>
<p dir="ltr">ROME - Novak Djokovic won a record 36th Masters crown as he beat Diego Schwartzman in the men's final of the ATP Italian Open on Monday.</p>
<p dir="ltr">Djokovic, the world number one and the top seed at the tournament, won 7-5, 6-3 against Argentine Schwartzman to lift his 36th Masters title, one more than Rafael Nadal.</p>
<p dir="ltr">The Serb said he did not play his best tennis this time in Rome, but could find it when needed.</p>
<p dir="ltr">Simona Halep, top seed of the women's draw, won her first title in Rome after defending champion Karolina Pliskova of the Czech Republic retired while trailing 6-0, 2-1 in the final.</p>
</div>
```
## Contributors

[![All contributions](https://contrib.rocks/image?repo=myifeng/article-parser)](https://github.com/myifeng/article-parser/graphs/contributors)
4 changes: 2 additions & 2 deletions article_parser/Extractor.py
@@ -97,9 +97,9 @@ def __download(self) -> str:
html = response.text
return html


def parse(self) -> tuple:
- soup = BeautifulSoup(self.html, 'lxml')
- soup = soup.find('body')
+ soup = BeautifulSoup(self.html, 'lxml').find('body')
if soup:
for tag in soup.find_all(style=re.compile('display:\s?none')):
tag.extract()
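
The `style=re.compile(...)` filter in this hunk decides which tags are dropped as hidden, so it may help to see what the pattern does and does not match. The following is a minimal stdlib-only sketch, assuming Beautiful Soup applies a compiled pattern to the attribute value with `search()` semantics, as its documentation describes. `is_hidden` is a hypothetical helper for illustration, not part of `article_parser`, and the pattern is written as a raw string here (unlike the hunk) so that `\s` does not trigger an invalid-escape warning on newer Python versions.

```python
import re

# Same pattern as in the hunk, written as a raw string (assumption: behavior is identical).
HIDDEN_STYLE = re.compile(r'display:\s?none')


def is_hidden(style: str) -> bool:
    """Return True if an inline style string matches the hidden-element pattern."""
    return HIDDEN_STYLE.search(style) is not None


if __name__ == "__main__":
    for style in (
        "display:none",
        "color: red; display: none;",
        "display:  none",   # two spaces: \s? allows at most one whitespace character
        "DISPLAY: NONE",    # the pattern is case-sensitive
    ):
        print(repr(style), is_hidden(style))
```

Under these assumptions, inline styles with more than one whitespace character after the colon, or with uppercase property names, would pass through the filter; whether that matters depends on the pages being parsed.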
4 changes: 2 additions & 2 deletions requirements.txt
@@ -1,5 +1,5 @@
- beautifulsoup4==4.11.1
+ beautifulsoup4==4.12.2
html2text==2020.1.16
- requests==2.27.1
+ requests==2.31.0
setuptools==49.2.1
lxml==4.9.0
3 changes: 1 addition & 2 deletions setup.py
@@ -28,15 +28,14 @@ def read_file(filename):
"Operating System :: OS Independent",
"Programming Language :: Python",
"Programming Language :: Python :: 3",
- "Programming Language :: Python :: 3.6",
"Programming Language :: Python :: 3.7",
"Programming Language :: Python :: 3.8",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3 :: Only",
"Programming Language :: Python :: Implementation",
],
- python_requires='>=3.6',
+ python_requires='>=3.7',
version_config={
"template": "{tag}",
"dirty_template": "{tag}",