# article-parser

**A general-purpose library for extracting the title and body text of a web page from any URL or HTML file**

*[English](https://github.com/myifeng/article-parser/blob/master/README.md) ∙ [简体中文](https://github.com/myifeng/article-parser/blob/master/README.zh-CN.md)*

## Installation

[`article-parser`](https://pypi.org/project/article-parser/) is available on [`PyPI`](https://pypi.org/project/article-parser/):

```
$ pip install article-parser
```

## Usage

```python
import article_parser

# Parameters accepted by article_parser.parse():
#   url=''          URL of the web page
#   html=''         HTML content of the page
#   threshold=0.9   threshold, default 0.9
#   output='html'   output format, 'markdown' or 'html'; default 'html'
#   **kwargs        optional keyword arguments (the calls below pass timeout=5)

# Markdown output
title, content = article_parser.parse(url="http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html", output='markdown', timeout=5)

# HTML output (the default)
title, content = article_parser.parse(url="http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html", timeout=5)
```

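The calls above all fetch from a URL. For the `html` parameter, a minimal sketch of parsing a page that was saved to disk might look like the following. `parse_html_file` and its `parse` argument are hypothetical helpers written for this illustration, not part of `article-parser`, and the sketch assumes that passing only `html=` (without `url=`) is enough.

```python
from pathlib import Path


def parse_html_file(path, parse=None, output="markdown"):
    """Read a saved HTML file and hand its content to a parse callable."""
    if parse is None:
        # Imported lazily so the helper can be tried without the package installed.
        import article_parser
        parse = article_parser.parse
    html = Path(path).read_text(encoding="utf-8")
    # Assumption: parse() accepts html= on its own, as the parameter list suggests.
    return parse(html=html, output=output)
```

With the package installed, `title, content = parse_html_file("saved_page.html")` would be the expected call, where `saved_page.html` is a placeholder file name.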
## Example
[Djokovic wins record 36th Masters title in Rome - Chinadaily.com.cn](http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html)


* Markdown

```python
>>> import article_parser
>>> title, content = article_parser.parse(url="http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html", output='markdown', timeout=5)
>>> print(title)
>>> print('----------------')
>>> print(content)

Djokovic wins record 36th Masters title in Rome
----------------

Serbia's Novak Djokovic kisses the trophy after winning the final against
Argentina's Diego Schwartzman at Italian Open, Foro Italico, Rome, Italy, Sept
21, 2020. [Photo/Agencies]

ROME - Novak Djokovic won a record 36th Masters crown as he beat Diego
Schwartzman in the men's final of the ATP Italian Open on Monday.

Djokovic, the world number one and the top seed at the tournament, won 7-5,
6-3 against Argentine Schwartzman to lift his 36th Masters title, one more
than Rafael Nadal.

The Serb said he did not play his best tennis this time in Rome, but could
find it when needed.

Simona Halep, top seed of the women's draw, won her first title in Rome after
defending champion Karolina Pliskova of the Czech Republic retired while
trailing 6-0, 2-1 in the final.
```


* HTML
```python
>>> import article_parser
>>> title, content = article_parser.parse(url="http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html", timeout=5)
>>> print(title)
>>> print('----------------')
>>> print(content)

Djokovic wins record 36th Masters title in Rome
----------------
<div id="Content">

<figure class="image" style="display: table;">
<img data-from="newsroom" id="img-5f6962b2a31024adbd959228" src="//img2.chinadaily.com.cn/images/202009/22/5f6962b2a31024adbd959228.jpeg"/>
<figcaption style="font-size: 14px; display: table-caption; caption-side: bottom;">
 Serbia's Novak Djokovic kisses the trophy after winning the final against Argentina's Diego Schwartzman at Italian Open, Foro Italico, Rome, Italy, Sept 21, 2020. [Photo/Agencies]
 </figcaption>
</figure>
<p dir="ltr">ROME - Novak Djokovic won a record 36th Masters crown as he beat Diego Schwartzman in the men's final of the ATP Italian Open on Monday.</p>
<p dir="ltr">Djokovic, the world number one and the top seed at the tournament, won 7-5, 6-3 against Argentine Schwartzman to lift his 36th Masters title, one more than Rafael Nadal.</p>
<p dir="ltr">The Serb said he did not play his best tennis this time in Rome, but could find it when needed.</p>
<p dir="ltr">Simona Halep, top seed of the women's draw, won her first title in Rome after defending champion Karolina Pliskova of the Czech Republic retired while trailing 6-0, 2-1 in the final.</p>
</div>
```
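The HTML output above keeps the page's protocol-relative image URL (`src="//img2.chinadaily.com.cn/..."`). If the content is stored or rendered somewhere other than the original site, one possible post-processing step is to resolve such URLs against the page address. The sketch below uses only the standard library; `absolutize_src` is a hypothetical helper, not part of `article-parser`, and the regular expression is a deliberate simplification that handles only double-quoted `src` attributes.

```python
import re
from urllib.parse import urljoin

# Matches src="..." but not attributes such as data-src="...".
_SRC_ATTR = re.compile(r'(?<![\w-])(src=")([^"]*)(")')


def absolutize_src(html, base_url):
    """Resolve every double-quoted src attribute against base_url."""
    def fix(match):
        # urljoin fills in the scheme (and host, for relative paths) from base_url.
        return match.group(1) + urljoin(base_url, match.group(2)) + match.group(3)

    return _SRC_ATTR.sub(fix, html)
```

For the example page, calling `absolutize_src(content, "http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html")` would be the intended use.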
## Contributors

[Contributors](https://github.com/myifeng/article-parser/graphs/contributors)