From b705d84ee7a4bcfe92e8babfbf3fc17776fc7ff9 Mon Sep 17 00:00:00 2001
From: myifeng
Date: Sun, 3 Jul 2022 16:00:57 +0800
Subject: [PATCH 01/17] Modify method parameters (#34)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* Update dev (#31)

* Dev (#18)

* Update action; Update README.md

* Dev (#19)

* Update python-publish.yml

* Add setuptools-git-versioning

* move timeout to options

* Update action; Update README.md

* Update setup.py

* Bump requests from 2.26.0 to 2.27.0 (#23)

Bumps [requests](https://github.com/psf/requests) from 2.26.0 to 2.27.0.
- [Release notes](https://github.com/psf/requests/releases)
- [Changelog](https://github.com/psf/requests/blob/main/HISTORY.md)
- [Commits](https://github.com/psf/requests/compare/v2.26.0...v2.27.0)

---
updated-dependencies:
- dependency-name: requests
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot]

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump requests from 2.27.0 to 2.27.1 (#24)

Bumps [requests](https://github.com/psf/requests) from 2.27.0 to 2.27.1.
- [Release notes](https://github.com/psf/requests/releases)
- [Changelog](https://github.com/psf/requests/blob/main/HISTORY.md)
- [Commits](https://github.com/psf/requests/compare/v2.27.0...v2.27.1)

---
updated-dependencies:
- dependency-name: requests
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot]

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump lxml from 4.7.1 to 4.8.0 (#25)

Bumps [lxml](https://github.com/lxml/lxml) from 4.7.1 to 4.8.0.
- [Release notes](https://github.com/lxml/lxml/releases)
- [Changelog](https://github.com/lxml/lxml/blob/master/CHANGES.txt)
- [Commits](https://github.com/lxml/lxml/compare/lxml-4.7.1...lxml-4.8.0)

---
updated-dependencies:
- dependency-name: lxml
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot]

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* proxies integration (#28)

* Bump beautifulsoup4 from 4.10.0 to 4.11.1 (#27)

Bumps [beautifulsoup4](https://www.crummy.com/software/BeautifulSoup/bs4/) from 4.10.0 to 4.11.1.

---
updated-dependencies:
- dependency-name: beautifulsoup4
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot]

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Squashed commit of the following:

commit daceae0e6cd5ff6ca935f4b46ce0c9694f2942d0
Author: myifeng
Date:   Wed Apr 27 10:48:14 2022 +0800

    Update CI

commit 7e778f002b5c751c7fb520c791389e6ed00ebebf
Merge: 1eaccce 1945d7f
Author: myifeng
Date:   Wed Apr 27 10:23:08 2022 +0800

    Merge branch 'master' into dev

    # Conflicts:
    #	README.md

commit 1eaccce88ef2b4fc46511a0da969dce49a70a426
Author: myifengs
Date:   Wed Dec 22 21:03:15 2021 +0800

    Update CI name

commit f8c1dd9433472f749d079bf725d70d36fa52923d
Merge: 694dbbd c126e5b
Author: myifengs
Date:   Wed Dec 22 20:58:04 2021 +0800

    Merge branch 'dev' of https://github.com/myifeng/article-parser into dev

commit 694dbbd1dfb186c4f76f409521b658a92a69cb71
Author: myifengs
Date:   Wed Dec 22 20:57:56 2021 +0800

    Update workflows

commit c126e5bc1f3be77f60bcec34fc08396e09109de8
Merge: 03143d8 a66518c
Author: liqf7
Date:   Wed Dec 22 15:01:00 2021 +0800

    Merge branch 'dev' of https://github.com/myifeng/article-parser into dev

commit 03143d81212c665a655223318809bd9b8cf6eae0
Author: liqf7
Date:   Wed Dec 22 14:59:59 2021 +0800

    Update setup.py

commit a66518caddb88827d13f4cd55d1f8aad16ba0950
Merge: d6f9708 f11baa6
Author: myifeng
Date:   Wed Dec 22 14:45:19 2021 +0800

    Merge branch 'master' into dev

commit d6f970842b1bfd43a391b5e614f1e521bc4aec41
Author: liqf7
Date:   Wed Dec 22 14:41:11 2021 +0800

    Update action; Update README.md

commit 4d31bb570e80c6a0bb210007d7afc1037784985c
Merge: 9a9930f c83911e
Author: myifeng
Date:   Tue Dec 21 15:53:25 2021 +0800

    Merge branch 'master' into dev

commit 9a9930f5551565a1ab657950f8201edcf19d2275
Author: liqf7
Date:   Tue Dec 21 15:02:21 2021 +0800

    move timeout to options

commit 164ec23d34fd68834b422e3422303a74203ccd6e
Author: liqf7
Date:   Tue Dec 21 14:45:58 2021 +0800

    Add setuptools-git-versioning

commit 42c499c74454472b249f02ee6e416d460f273d0d
Author: myifeng
Date:   Mon Dec 20 22:25:15 2021 +0800

    Update python-publish.yml

commit edce23571d36fd06c2b538a054869ee561af8629
Author: myifengs
Date:   Mon Dec 20 22:11:20 2021 +0800

    FIX

commit 9c4553047eb41952c2d46f84d8167a3f93f42b16
Author: myifengs
Date:   Mon Dec 20 21:55:19 2021 +0800

    0.1.11

commit c4a2cea1e01a93133630af9cef520d6e708ac60e
Author: myifengs
Date:   Mon Dec 20 21:49:36 2021 +0800

    FIX title; Add User-Agent

commit 525ac5ebac4fbefad4ef79a3502022be543ad980
Author: myifengs
Date:   Mon Dec 20 21:27:42 2021 +0800

    FIX test

commit 3c89180b59793673da258389a0e874ab44a38d8c
Author: myifengs
Date:   Sun Dec 19 20:44:49 2021 +0800

    Update python-publish.yml

commit 11cb9ebf22ec555cfc9536838f334c60f2b73eef
Author: myifengs
Date:   Sat Dec 18 12:12:32 2021 +0800

    version 1.0.0

commit e503e4075c6c410443e5eb7a65f5e2e459ca28da
Author: myifengs
Date:   Sat Dec 18 11:47:46 2021 +0800

    FIX

commit bef4bf8b124f8ddef0503cd1833d9468ac43e6e1
Author: myifengs
Date:   Sat Dec 18 11:46:27 2021 +0800

    FIX

commit 127084b576c25561a92cf83c7c7c2769f51d0eb4
Author: myifengs
Date:   Sat Dec 18 11:36:22 2021 +0800

    FIX

commit 5164720983c0fe6ed9acd8aac6a36ddef19f44a9
Merge: 889c6d2 1de2c6f
Author: myifeng
Date:   Sat Dec 18 11:34:41 2021 +0800

    Merge branch 'master' into dev

commit 889c6d2cdd1321807e00ba60acfc59b707a2fd4a
Author: myifengs
Date:   Sat Dec 18 11:31:42 2021 +0800

    Update

commit 1d0852f7c97dbd86da989506dcb095f182c6271f
Author: myifengs
Date:   Sat Dec 18 11:08:21 2021 +0800

    warring

commit ca65ea3d6b4bca184fe14d5407b6a4f439e47aef
Author: myifengs
Date:   Sat Dec 18 10:49:06 2021 +0800

    FIX

commit abe9366c63a0689fb54c63b90b06279b35446ccb
Author: myifengs
Date:   Sat Dec 18 10:07:22 2021 +0800

    Remove version 3.5

commit 2807056e2d1c69679df6227b957ff33c4878ac64
Author: myifengs
Date:   Sat Dec 18 09:58:29 2021 +0800

    FIX

commit f70fed629a62071373a1cc9f5927dde0fe609423
Author: myifengs
Date:   Sat Dec 18 09:57:20 2021 +0800

    Update workflows

* Bump lxml from 4.8.0 to 4.9.0 (#29)

Bumps [lxml](https://github.com/lxml/lxml) from 4.8.0 to 4.9.0.
- [Release notes](https://github.com/lxml/lxml/releases)
- [Changelog](https://github.com/lxml/lxml/blob/master/CHANGES.txt)
- [Commits](https://github.com/lxml/lxml/compare/lxml-4.8.0...lxml-4.9.0)

---
updated-dependencies:
- dependency-name: lxml
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot]

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Sohail Ahmed

* FIX: README.md

* Add request_kwargs optional arguments that `request` takes.
* API

* FIX: Test Domo timeout

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Sohail Ahmed
---
 README.md                   | 24 ++++++++++++------------
 article_parser/Extractor.py | 24 +++++++++---------------
 article_parser/__init__.py  | 18 +++++++++++++++---
 requirements.txt            |  1 -
 setup.py                    | 10 ++++++----
 test/test_parser.py         |  2 +-
 6 files changed, 43 insertions(+), 36 deletions(-)

diff --git a/README.md b/README.md
index 6b858f0..d395c01 100644
--- a/README.md
+++ b/README.md
@@ -27,20 +27,20 @@ $ pip install article-parser
 >>> import article_parser
 
 article_parser.parse(
-    url='',  # The URL of the article. optional
-    html='',  # The HTML of the article. optional
-    proxies={},  # The Proxies to bypass anonymity, security and prevent IP blocking.
-    options={
-        'markdown': True,  # Output in markdown format. defult True. optional
-        'threshold': 0.9,  # Content ratio threshold. defult 0.9. optional
-        'timeout': 5  # Request webpage timeout time, in seconds, default 5. optional
-  })
+    url='',  # The URL of the article.
+    html='',  # The HTML of the article.
+    threshold=0.9,  # The ratio of text to the entire document, default 0.9.
+    output='html',  # Result output format, support ``markdown`` and ``html``, default ``html``.
+    **kwargs  # Optional arguments that `request` takes. optional
+  ),
+
 
 ## ouput markdown
->>> title, content = article_parser.parse(url="http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html")
+>>> title, content = article_parser.parse(url="http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html", output='markdown', timeout=5)
 
 ## output html
->>> title, content = article_parser.parse(url="http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html", options={'markdown': False})
+>>> title, content = article_parser.parse(url="http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html", timeout=5)
+
 ```
 
 ## Example
@@ -51,7 +51,7 @@ article_parser.parse(
 
 ```python
 >>> import article_parser
->>> title, content = article_parser.parse(url="http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html")
+>>> title, content = article_parser.parse(url="http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html", output='markdown', timeout=5)
 >>> print(title)
 >>> print('----------------')
 >>> print(content)
@@ -82,7 +82,7 @@ trailing 6-0, 2-1 in the final.
 * HTML
 ```python
 >>> import article_parser
->>> title, content = article_parser.parse(url="http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html", options={'markdown': False})
+>>> title, content = article_parser.parse(url="http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html", timeout=5)
 >>> print(title)
 >>> print('----------------')
 >>> print(content)
diff --git a/article_parser/Extractor.py b/article_parser/Extractor.py
index 812a5c4..c5f8f8c 100644
--- a/article_parser/Extractor.py
+++ b/article_parser/Extractor.py
@@ -5,22 +5,16 @@
 import requests
 import html2text
 from bs4 import BeautifulSoup, Comment, NavigableString
-from fake_useragent import UserAgent
+
 
 class Extractor():
-    def __init__(self, url='', html='', proxies={}, options={}):
-        default_options = {
-            'markdown': True,
-            'threshold': 0.9,
-            'timeout': 5
-        }
+    def __init__(self, url, html, threshold, output, **kwargs):
         self.url = url
         self.title = ''
-        self.date = ''
         self.html = html
-        self.proxies = proxies
-        default_options.update(options)
-        self.options = default_options.copy()
+        self.output = output
+        self.threshold = threshold
+        self.kwargs = kwargs
         if not self.html:
             self.html = self.__download()
 
@@ -64,7 +58,7 @@ def __find_article_html(self, soup) -> BeautifulSoup:
                 return soup
         if tmp_radio >= parent_radio and tmp_tag.name != 'p':
             # article radio
-            if soup.find_all(re.compile("h[1-6]")) or tmp_radio < self.options['threshold']:
+            if soup.find_all(re.compile("h[1-6]")) or tmp_radio < self.threshold:
                 return self.__find_article_html(tmp_tag)
             return tmp_tag
         else:
@@ -81,13 +75,13 @@ def __get_title(self, soup) -> str:
         if not title:
             html = BeautifulSoup(self.html, 'lxml')
             if html.title:
-                title = html.title.text.split('_')[0].split('|')[0]
+                title = html.title.text.split('_')[0].split('|')[0]
 
         self.title = re.sub(r'<[\s\S]*?>|[\t\r\f\v]|^\s+|\s+$', "", title)
         return self.title
 
     def __download(self) -> str:
-        response = requests.get(self.url, timeout=self.options['timeout'], headers={'User-Agent': UserAgent().random}, proxies=self.proxies)
+        response = requests.get(self.url, **self.kwargs)
         response.raise_for_status()
         html = ''
         if response.encoding != 'ISO-8859-1':
@@ -112,7 +106,7 @@ def parse(self) -> tuple:
             for comment in soup.find_all(text=lambda text: isinstance(text, Comment)):
                 comment.extract()
             article_html = self.__find_article_html(soup)
-            if self.options['markdown']:
+            if self.output == 'markdown':
                 return self.__get_title(article_html), self.__html_to_md(article_html)
             else:
                 return self.__get_title(article_html), article_html
diff --git a/article_parser/__init__.py b/article_parser/__init__.py
index b9d127e..fdc003e 100644
--- a/article_parser/__init__.py
+++ b/article_parser/__init__.py
@@ -1,5 +1,17 @@
 from .Extractor import *
 
-def parse(url='', html='', proxies={}, options={}):
-    ext = Extractor(url=url, html=html, proxies=proxies, options=options)
-    return ext.parse()
\ No newline at end of file
+
+def parse(url='', html='', threshold=0.9, output='html', **kwargs):
+    r"""Extract article by url or html.
+
+    :param url: URL for the article.
+    :param html: Html for the article.
+    :param threshold: The ratio of text to the entire document, default 0.9.
+    :param output: Result output format, support ``markdown`` and ``html``, default ``html``.
+    :param \*\*kwargs: Optional arguments that ``request`` takes.
+    :return: :class:`tuple` object
+    """
+
+    ext = Extractor(url=url, html=html, threshold=threshold,
+                    output=output, **kwargs)
+    return ext.parse()
diff --git a/requirements.txt b/requirements.txt
index 34dc40b..63f3414 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -3,4 +3,3 @@ html2text==2020.1.16
 requests==2.27.1
 setuptools==49.2.1
 lxml==4.9.0
-fake_useragent==0.1.11
\ No newline at end of file
diff --git a/setup.py b/setup.py
index 155be43..598d8c1 100644
--- a/setup.py
+++ b/setup.py
@@ -3,18 +3,20 @@
 
 this_directory = os_path.abspath(os_path.dirname(__file__))
 
+
 def read_file(filename):
     with open(os_path.join(this_directory, filename), encoding='utf-8') as f:
         long_description = f.read()
     return long_description
 
+
 setup(
     name="article-parser",
     author="myifeng",
     author_email="myifengs@gmail.com",
-    maintainer ="myifeng",
-    maintainer_email ="myifengs@gmail.com",
-    keywords ="article news html parser extractor",
+    maintainer="myifeng",
+    maintainer_email="myifengs@gmail.com",
+    keywords="article news html parser extractor",
     description="A parser to parse article from url or html",
     long_description=read_file('README.md'),
     long_description_content_type="text/markdown",
@@ -41,4 +43,4 @@ def read_file(filename):
         "dev_template": "{tag}"
     },
     setup_requires=["setuptools-git-versioning"]
-)
\ No newline at end of file
+)
diff --git a/test/test_parser.py b/test/test_parser.py
index 4304b24..70cb1f6 100644
--- a/test/test_parser.py
+++ b/test/test_parser.py
@@ -4,7 +4,7 @@
 import article_parser
 
 def test_markdown():
-    title, content = article_parser.parse(url="http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html")
+    title, content = article_parser.parse(url="http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html", output='markdown', timeout=5)
     assert title == 'Djokovic wins record 36th Masters title in Rome'
     assert content == '''![](http://img2.chinadaily.com.cn/images/202009/22/5f6962b2a31024adbd959228.jpeg)
 

From 5eaba39a3511e09192dbd1580bfedfbc56912436 Mon Sep 17 00:00:00 2001
From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com>
Date: Sun, 10 Jul 2022 18:16:16 +0800
Subject: [PATCH 02/17] Bump lxml from 4.9.0 to 4.9.1 (#35)

Bumps [lxml](https://github.com/lxml/lxml) from 4.9.0 to 4.9.1.
- [Release notes](https://github.com/lxml/lxml/releases)
- [Changelog](https://github.com/lxml/lxml/blob/master/CHANGES.txt)
- [Commits](https://github.com/lxml/lxml/compare/lxml-4.9.0...lxml-4.9.1)

---
updated-dependencies:
- dependency-name: lxml
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot]
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
---
 requirements.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/requirements.txt b/requirements.txt
index 63f3414..b2e8630 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -2,4 +2,4 @@ beautifulsoup4==4.11.1
 html2text==2020.1.16
 requests==2.27.1
 setuptools==49.2.1
-lxml==4.9.0
+lxml==4.9.1

From 397fb5f2669ea42988c16822e164fa0acee88417 Mon Sep 17 00:00:00 2001
From: myifeng
Date: Wed, 21 Dec 2022 10:52:58 +0800
Subject: [PATCH 03/17] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index d395c01..55e46ee 100644
--- a/README.md
+++ b/README.md
@@ -105,4 +105,4 @@ Djokovic wins record 36th Masters title in Rome
 ```
 ## Contributors
 
-[![All contributions](https://contrib.rocks/image?repo=myifeng/article-parser)](https://github.com/myifeng/mybatis-operation-log/graphs/contributors)
\ No newline at end of file
+[![All contributions](https://contrib.rocks/image?repo=myifeng/article-parser)](https://github.com/myifeng/article-parser/graphs/contributors)

From 72e48aefc47f2d0df4fbac6c0b9022dff2ca8285 Mon Sep 17 00:00:00 2001
From: myifengs
Date: Mon, 20 Feb 2023 10:59:42 +0800
Subject: [PATCH 04/17] Upgrade Python Version

---
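With 3.6 dropped from the CI matrix and the classifiers in this patch, a runtime guard is one way to fail early on older interpreters. The sketch below is not part of the patch: `MIN_PYTHON` and `is_supported` are hypothetical names, and the 3.7 floor is an assumption taken from the lowest version left in the CI matrix.

```python
import sys

# Hypothetical constant: mirrors the lowest version left in the CI matrix.
MIN_PYTHON = (3, 7)


def is_supported(version_info=None):
    """Return True when the given (or running) interpreter meets MIN_PYTHON."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only (major, minor); micro and release level are irrelevant here.
    return tuple(version_info[:2]) >= MIN_PYTHON


if __name__ == "__main__":
    if not is_supported():
        raise SystemExit("article-parser needs Python %d.%d or newer" % MIN_PYTHON)
```

A package would more commonly rely on `python_requires` in `setup.py` alone; the guard only helps when the code is run from a source checkout without installation.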
 .github/workflows/python-ci.yml      | 2 +-
 .github/workflows/python-publish.yml | 2 +-
 setup.py                             | 1 -
 3 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/.github/workflows/python-ci.yml b/.github/workflows/python-ci.yml
index 98fcd5d..063400e 100644
--- a/.github/workflows/python-ci.yml
+++ b/.github/workflows/python-ci.yml
@@ -15,7 +15,7 @@ jobs:
     strategy:
       matrix:
         os: [ubuntu-latest, macos-latest, windows-latest]
-        python-version: ["3.6", "3.7", "3.8", "3.9", "3.10"]
+        python-version: ["3.7", "3.8", "3.9", "3.10"]
     steps:
     - uses: actions/checkout@v2
     - name: Set up Python ${{ matrix.python-version }}
diff --git a/.github/workflows/python-publish.yml b/.github/workflows/python-publish.yml
index 844de57..eb62612 100644
--- a/.github/workflows/python-publish.yml
+++ b/.github/workflows/python-publish.yml
@@ -15,7 +15,7 @@ jobs:
     strategy:
       matrix:
         os: [ubuntu-latest]
-        python-version: ["3.6"]
+        python-version: ["3.7"]
     steps:
     - uses: actions/checkout@v2
     - name: Set up Python ${{ matrix.python-version }}
diff --git a/setup.py b/setup.py
index 598d8c1..9814f29 100644
--- a/setup.py
+++ b/setup.py
@@ -28,7 +28,6 @@ def read_file(filename):
         "Operating System :: OS Independent",
         "Programming Language :: Python",
         "Programming Language :: Python :: 3",
-        "Programming Language :: Python :: 3.6",
         "Programming Language :: Python :: 3.7",
         "Programming Language :: Python :: 3.8",
         "Programming Language :: Python :: 3.9",

From 708cfe1921af2dd93b33443388e9693a46450597 Mon Sep 17 00:00:00 2001
From: myifengs
Date: Mon, 20 Feb 2023 11:00:01 +0800
Subject: [PATCH 05/17] add

---
 setup.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/setup.py b/setup.py
index 9814f29..4ae2224 100644
--- a/setup.py
+++ b/setup.py
@@ -35,7 +35,7 @@ def read_file(filename):
         "Programming Language :: Python :: 3 :: Only",
         "Programming Language :: Python :: Implementation",
     ],
-    python_requires='>=3.6',
+    python_requires='>=3.7',
     version_config={
         "template": "{tag}",
         "dirty_template": "{tag}",

From a2c13bde8579a1ebef34277dd74b638004b63c86 Mon Sep 17 00:00:00 2001
From: myifengs
Date: Mon, 20 Mar 2023 12:29:56 +0800
Subject: [PATCH 06/17] Optimize code and improve efficiency.

---
 article_parser/Extractor.py | 41 +++++++++----------------------------
 1 file changed, 10 insertions(+), 31 deletions(-)

diff --git a/article_parser/Extractor.py b/article_parser/Extractor.py
index c5f8f8c..9528e64 100644
--- a/article_parser/Extractor.py
+++ b/article_parser/Extractor.py
@@ -18,22 +18,12 @@ def __init__(self, url, html, threshold, output, **kwargs):
         if not self.html:
             self.html = self.__download()
 
-    def __process_text_ratio(self, soup) -> tuple:
-        soup = copy.copy(soup)
-        if soup:
-            if type(soup) is NavigableString:
-                return 1
-            for t in soup.find_all(['script', 'style', 'noscript', 'a', 'img']):
-                t.extract()
-            soup_str = re.sub(
-                r'\s*[^=\s+]+\s*=\s*([^=>]+)?(?=(\s+|>))', "", str(soup))
-            total_len = len(soup_str)
-            if total_len:
-                tag_len = 0.0
-                for tag in re.compile(r']*>|[\s]', re.S).findall(soup_str):
-                    tag_len += len(tag)
-                return (total_len-tag_len)/total_len, total_len
-        return 0, 0
+    def __process_text_ratio(self, tag) -> tuple:
+        text_len = len(tag.text.strip())
+        tag_len = len(str(tag))
+        if tag_len == 0:
+            return 0, 0
+        return text_len / tag_len, tag_len
 
     def __find_article_html(self, soup) -> BeautifulSoup:
         tmp_len = 0
@@ -72,7 +62,7 @@ def __get_title(self, soup) -> str:
                 title = t.text
                 break
 
-        if not title:
+        if not title and self.html:
             html = BeautifulSoup(self.html, 'lxml')
             if html.title:
                 title = html.title.text.split('_')[0].split('|')[0]
@@ -83,23 +73,12 @@ def __get_title(self, soup) -> str:
     def __download(self) -> str:
         response = requests.get(self.url, **self.kwargs)
         response.raise_for_status()
-        html = ''
-        if response.encoding != 'ISO-8859-1':
-            # return response as a unicode string
-            html = response.text
-        else:
-            html = response.content
-            if 'charset' not in response.headers.get('content-type'):
-                encodings = requests.utils.get_encodings_from_content(
-                    response.text)
-                if len(encodings) > 0:
-                    response.encoding = encodings[0]
-                    html = response.text
+        html = response.content.decode(response.encoding)
         return html
 
+
     def parse(self) -> tuple:
-        soup = BeautifulSoup(self.html, 'lxml')
-        soup = soup.find('body')
+        soup = BeautifulSoup(self.html, 'lxml').find('body')
         if soup:
             for tag in soup.find_all(style=re.compile('display:\s?none')):
                 tag.extract()

From a0e9f95b36b68c9d0f55259724af8cb367b411d1 Mon Sep 17 00:00:00 2001
From: myifengs
Date: Mon, 20 Mar 2023 13:55:48 +0800
Subject: [PATCH 07/17] FIX encoding

---
 article_parser/Extractor.py | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/article_parser/Extractor.py b/article_parser/Extractor.py
index 9528e64..3582e4a 100644
--- a/article_parser/Extractor.py
+++ b/article_parser/Extractor.py
@@ -73,7 +73,18 @@ def __get_title(self, soup) -> str:
     def __download(self) -> str:
         response = requests.get(self.url, **self.kwargs)
         response.raise_for_status()
-        html = response.content.decode(response.encoding)
+        html = ''
+        if response.encoding != 'ISO-8859-1':
+            # return response as a unicode string
+            html = response.text
+        else:
+            html = response.content
+            if 'charset' not in response.headers.get('content-type'):
+                encodings = requests.utils.get_encodings_from_content(
+                    response.text)
+                if len(encodings) > 0:
+                    response.encoding = encodings[0]
+                    html = response.text
         return html
 
 

From d91d5dbb68a0c0e65a330d5e6bf842589caf59d6 Mon Sep 17 00:00:00 2001
From: myifengs
Date: Mon, 20 Mar 2023 14:03:28 +0800
Subject: [PATCH 08/17] revert

---
 article_parser/Extractor.py | 24 +++++++++++++++++-------
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/article_parser/Extractor.py b/article_parser/Extractor.py
index 3582e4a..5dbe068 100644
--- a/article_parser/Extractor.py
+++ b/article_parser/Extractor.py
@@ -18,12 +18,22 @@ def __init__(self, url, html, threshold, output, **kwargs):
         if not self.html:
             self.html = self.__download()
 
-    def __process_text_ratio(self, tag) -> tuple:
-        text_len = len(tag.text.strip())
-        tag_len = len(str(tag))
-        if tag_len == 0:
-            return 0, 0
-        return text_len / tag_len, tag_len
+    def __process_text_ratio(self, soup) -> tuple:
+        soup = copy.copy(soup)
+        if soup:
+            if type(soup) is NavigableString:
+                return 1
+            for t in soup.find_all(['script', 'style', 'noscript', 'a', 'img']):
+                t.extract()
+            soup_str = re.sub(
+                r'\s*[^=\s+]+\s*=\s*([^=>]+)?(?=(\s+|>))', "", str(soup))
+            total_len = len(soup_str)
+            if total_len:
+                tag_len = 0.0
+                for tag in re.compile(r']*>|[\s]', re.S).findall(soup_str):
+                    tag_len += len(tag)
+                return (total_len-tag_len)/total_len, total_len
+        return 0, 0
 
     def __find_article_html(self, soup) -> BeautifulSoup:
         tmp_len = 0
@@ -62,7 +72,7 @@ def __get_title(self, soup) -> str:
                 title = t.text
                 break
 
-        if not title and self.html:
+        if not title:
             html = BeautifulSoup(self.html, 'lxml')
             if html.title:
                 title = html.title.text.split('_')[0].split('|')[0]

From 9ca2f03fcf0310867e3af9c377e78a44d3c2a95c Mon Sep 17 00:00:00 2001
From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com>
Date: Thu, 6 Apr 2023 17:10:44 +0800
Subject: [PATCH 09/17] Bump beautifulsoup4 from 4.11.1 to 4.12.1 (#41)

Bumps [beautifulsoup4](https://www.crummy.com/software/BeautifulSoup/bs4/) from 4.11.1 to 4.12.1.

---
updated-dependencies:
- dependency-name: beautifulsoup4
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot]
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
---
 requirements.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/requirements.txt b/requirements.txt
index b2e8630..8b59344 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,4 +1,4 @@
-beautifulsoup4==4.11.1
+beautifulsoup4==4.12.1
 html2text==2020.1.16
 requests==2.27.1
 setuptools==49.2.1

From 8a852e4c748ae708387a1a1b8cc30410aaa9f49b Mon Sep 17 00:00:00 2001
From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com>
Date: Mon, 18 Sep 2023 10:14:24 +0800
Subject: [PATCH 10/17] Bump beautifulsoup4 from 4.12.1 to 4.12.2 (#42)

Bumps [beautifulsoup4](https://www.crummy.com/software/BeautifulSoup/bs4/) from 4.12.1 to 4.12.2.

---
updated-dependencies:
- dependency-name: beautifulsoup4
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot]
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
---
 requirements.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/requirements.txt b/requirements.txt
index 8b59344..0d14cd5 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,4 +1,4 @@
-beautifulsoup4==4.12.1
+beautifulsoup4==4.12.2
 html2text==2020.1.16
 requests==2.27.1
 setuptools==49.2.1

From 7f2fe9cd2cb3a25832f96cdb0a756254054481e9 Mon Sep 17 00:00:00 2001
From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com>
Date: Mon, 18 Sep 2023 10:15:57 +0800
Subject: [PATCH 11/17] Bump lxml from 4.9.1 to 4.9.3 (#46)

Bumps [lxml](https://github.com/lxml/lxml) from 4.9.1 to 4.9.3.
- [Release notes](https://github.com/lxml/lxml/releases)
- [Changelog](https://github.com/lxml/lxml/blob/master/CHANGES.txt)
- [Commits](https://github.com/lxml/lxml/compare/lxml-4.9.1...lxml-4.9.3)

---
updated-dependencies:
- dependency-name: lxml
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot]
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
---
 requirements.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/requirements.txt b/requirements.txt
index 0d14cd5..c964942 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -2,4 +2,4 @@ beautifulsoup4==4.12.2
 html2text==2020.1.16
 requests==2.27.1
 setuptools==49.2.1
-lxml==4.9.1
+lxml==4.9.3

From 43450361568477871437debf1a2eda4852179ca2 Mon Sep 17 00:00:00 2001
From: myifengs
Date: Mon, 18 Sep 2023 10:40:05 +0800
Subject: [PATCH 12/17] Update README

---
 .github/workflows/python-publish.yml |   2 +-
 README.md                            |   6 +-
 README.zh-CN.md                      | 107 +++++++++++++++++++++++++++
 3 files changed, 111 insertions(+), 4 deletions(-)
 create mode 100644 README.zh-CN.md

diff --git a/.github/workflows/python-publish.yml b/.github/workflows/python-publish.yml
index eb62612..7f12330 100644
--- a/.github/workflows/python-publish.yml
+++ b/.github/workflows/python-publish.yml
@@ -1,4 +1,4 @@
-name: CI
+name: Publish
 
 on:
   release:
diff --git a/README.md b/README.md
index 55e46ee..6e408d2 100644
--- a/README.md
+++ b/README.md
@@ -1,7 +1,6 @@
 # article-parser
 
 ![GitHub Repo stars](https://img.shields.io/github/stars/myifeng/article-parser)
-![GitHub Workflow Status](https://img.shields.io/github/workflow/status/myifeng/article-parser/CI)
 [![python](https://img.shields.io/pypi/pyversions/article-parser)](https://pypi.org/project/article-parser/)
 [![pypi](https://img.shields.io/pypi/v/article-parser)](https://pypi.org/project/article-parser/)
 [![wheel](https://img.shields.io/pypi/wheel/article-parser)](https://pypi.org/project/article-parser/)
@@ -9,12 +8,13 @@
 ![PyPI - Downloads](https://img.shields.io/pypi/dd/article-parser)
 
 
-Extract article or news by url or html, parse the title and content, output in markdown format.
+**Extract article or news by url or html, parse the title and content, output in markdown format.**
 
+*[English](README.md) ∙ [简体中文](README.zh-CN.md)*
 
 ## How to install
 
-`article-parser` is available on pypi
+[`article-parser`](https://pypi.org/project/article-parser/) is available on pypi
 https://pypi.org/project/article-parser/
 
 ```
diff --git a/README.zh-CN.md b/README.zh-CN.md
new file mode 100644
index 0000000..b39a727
--- /dev/null
+++ b/README.zh-CN.md
@@ -0,0 +1,107 @@
+# article-parser
+
+![GitHub Repo stars](https://img.shields.io/github/stars/myifeng/article-parser)
+[![python](https://img.shields.io/pypi/pyversions/article-parser)](https://pypi.org/project/article-parser/)
+[![pypi](https://img.shields.io/pypi/v/article-parser)](https://pypi.org/project/article-parser/)
+[![wheel](https://img.shields.io/pypi/wheel/article-parser)](https://pypi.org/project/article-parser/)
+[![license](https://img.shields.io/github/license/myifeng/article-parser)](https://pypi.org/project/article-parser/)
+![PyPI - Downloads](https://img.shields.io/pypi/dd/article-parser)
+
+
+**一种通过任意URL或者html文件解析网页的标题和正文,可以输出为markdown格式的通用库**
+
+*[English](README.md) ∙ [简体中文](README.zh-CN.md)*
+
+## 安装
+
+[`article-parser`](https://pypi.org/project/article-parser/) 可以在[`Pipy`](https://pypi.org/project/article-parser/)中下载使用
+
+```
+$ pip install article-parser
+```
+
+## 使用
+
+```python
+>>> import article_parser
+
+article_parser.parse(
+    url='',  # 网页的地址.
+    html='',  # Html文件内容
+    threshold=0.9,  # 阈值,默认0.9
+    output='html',  # 输出格式,支持 markdown 和 html, 默认html
+    **kwargs  # 可选参数
+  ),
+
+
+## 输出markdown格式
+>>> title, content = article_parser.parse(url="http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html", output='markdown', timeout=5)
+
+## 输出html格式
+>>> title, content = article_parser.parse(url="http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html", timeout=5)
+
+```
+
+## 示例
+[Djokovic wins record 36th Masters title in Rome - Chinadaily.com.cn](http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html)
+
+
+* Markdown
+
+```python
+>>> import article_parser
+>>> title, content = article_parser.parse(url="http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html", output='markdown', timeout=5)
+>>> print(title)
+>>> print('----------------')
+>>> print(content)
+
+Djokovic wins record 36th Masters title in Rome
+----------------
+![](http://img2.chinadaily.com.cn/images/202009/22/5f6962b2a31024adbd959228.jpeg)
+Serbia's Novak Djokovic kisses the trophy after winning the final against
+Argentina's Diego Schwartzman at Italian Open, Foro Italico, Rome, Italy, Sept
+21, 2020. [Photo/Agencies]
+
+ROME - Novak Djokovic won a record 36th Masters crown as he beat Diego
+Schwartzman in the men's final of the ATP Italian Open on Monday.
+
+Djokovic, the world number one and the top seed at the tournament, won 7-5,
+6-3 against Argentine Schwartzman to lift his 36th Masters title, one more
+than Rafael Nadal.
+
+The Serb said he did not play his best tennis this time in Rome, but could
+find it when needed.
+
+Simona Halep, top seed of the women's draw, won her first title in Rome after
+defending champion Karolina Pliskova of the Czech Republic retired while
+trailing 6-0, 2-1 in the final.
+```
+
+
+* HTML
+```python
+>>> import article_parser
+>>> title, content = article_parser.parse(url="http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html", timeout=5)
+>>> print(title)
+>>> print('----------------')
+>>> print(content)
+
+Djokovic wins record 36th Masters title in Rome
+----------------
+
+ Serbia's Novak Djokovic kisses the trophy after winning the final against Argentina's Diego Schwartzman at Italian Open, Foro Italico, Rome, Italy, Sept 21, 2020. [Photo/Agencies]
+ ROME - Novak Djokovic won a record 36th Masters crown as he beat Diego Schwartzman in the men's final of the ATP Italian Open on Monday.
+ Djokovic, the world number one and the top seed at the tournament, won 7-5, 6-3 against Argentine Schwartzman to lift his 36th Masters title, one more than Rafael Nadal.
+ The Serb said he did not play his best tennis this time in Rome, but could find it when needed.
+ Simona Halep, top seed of the women's draw, won her first title in Rome after defending champion Karolina Pliskova of the Czech Republic retired while trailing 6-0, 2-1 in the final.
+```
+## Contributors
+
+[![All contributions](https://contrib.rocks/image?repo=myifeng/article-parser)](https://github.com/myifeng/article-parser/graphs/contributors)

From 8d73c1cc86779ddc4011732a6c309a2df97d1ee2 Mon Sep 17 00:00:00 2001
From: myifengs
Date: Mon, 18 Sep 2023 10:43:30 +0800
Subject: [PATCH 13/17] Update README.md

---
 .github/workflows/python-ci.yml | 2 +-
 README.md                       | 2 +-
 README.zh-CN.md                 | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/.github/workflows/python-ci.yml b/.github/workflows/python-ci.yml
index 063400e..5eb4f45 100644
--- a/.github/workflows/python-ci.yml
+++ b/.github/workflows/python-ci.yml
@@ -5,7 +5,7 @@ name: CI
 
 on:
   push:
-    branches: [ dev ]
+    branches: [ master, dev ]
   pull_request:
     branches: [ master ]
 
diff --git a/README.md b/README.md
index 6e408d2..97734b7 100644
--- a/README.md
+++ b/README.md
@@ -8,7 +8,7 @@
 ![PyPI - Downloads](https://img.shields.io/pypi/dd/article-parser)
 
 
-**Extract article or news by url or html, parse the title and content, output in markdown format.**
+**Extract article or news by url or html, parse the title and content.**
 
 *[English](README.md) ∙ [简体中文](README.zh-CN.md)*
 
diff --git a/README.zh-CN.md b/README.zh-CN.md
index b39a727..73159e3 100644
--- a/README.zh-CN.md
+++ b/README.zh-CN.md
@@ -8,7 +8,7 @@
 ![PyPI - Downloads](https://img.shields.io/pypi/dd/article-parser)
 
 
-**A general-purpose library that parses the title and body of a web page from any URL or HTML file and can output markdown**
+**A general-purpose library that parses the title and body of a web page from any URL or HTML file**
 
 *[English](README.md) ∙ [简体中文](README.zh-CN.md)*
 

From 98e64b19e7def20a6913980cca479ec32c0e92eb Mon Sep 17 00:00:00 2001
From: myifengs
Date: Mon, 18 Sep 2023 10:47:28 +0800
Subject: [PATCH 14/17] Update README.md

---
 README.md       | 2 +-
 README.zh-CN.md | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 97734b7..38edf14 100644
--- a/README.md
+++ b/README.md
@@ -10,7 +10,7 @@
 
 **Extract article or news by url or html, parse the title and content.**
 
-*[English](README.md) ∙ [简体中文](README.zh-CN.md)*
+*[English]([README.md](https://github.com/myifeng/article-parser/blob/master/README.md)) ∙ [简体中文]([README.zh-CN.md](https://github.com/myifeng/article-parser/blob/master/README.zh-CN.md))*
 
 ## How to install
 
diff --git a/README.zh-CN.md b/README.zh-CN.md
index 73159e3..078d8a9 100644
--- a/README.zh-CN.md
+++ b/README.zh-CN.md
@@ -10,7 +10,7 @@
 
 **A general-purpose library that parses the title and body of a web page from any URL or HTML file**
 
-*[English](README.md) ∙ [简体中文](README.zh-CN.md)*
+*[English]([README.md](https://github.com/myifeng/article-parser/blob/master/README.md)) ∙ [简体中文]([README.zh-CN.md](https://github.com/myifeng/article-parser/blob/master/README.zh-CN.md))*
 
 ## Installation
 

From 307e7bca387d8d777242deb41b0f4d44545b9ad0 Mon Sep 17 00:00:00 2001
From: myifengs
Date: Mon, 18 Sep 2023 12:26:48 +0800
Subject: [PATCH 15/17] Tolerating release package file duplicates

---
 .github/workflows/python-ci.yml      | 6 ------
 .github/workflows/python-publish.yml | 4 +++-
 2 files changed, 3 insertions(+), 7 deletions(-)

diff --git a/.github/workflows/python-ci.yml b/.github/workflows/python-ci.yml
index 5eb4f45..55cae98 100644
--- a/.github/workflows/python-ci.yml
+++ b/.github/workflows/python-ci.yml
@@ -31,9 +31,3 @@ jobs:
       run: |
         python setup.py install sdist bdist_wheel
         pytest --disable-warnings
-    - name: Publish
-      uses: pypa/gh-action-pypi-publish@release/v1
-      if: github.event_name == 'release' && github.event.action == 'created'
-      with:
-        user: __token__
-        password: ${{ secrets.PYPI_API_TOKEN }}
diff --git a/.github/workflows/python-publish.yml b/.github/workflows/python-publish.yml
index 7f12330..c842211 100644
--- a/.github/workflows/python-publish.yml
+++ b/.github/workflows/python-publish.yml
@@ -37,10 +37,12 @@ jobs:
       with:
         user: __token__
         password: ${{ secrets.PYPI_API_TOKEN }}
-        repository_url: https://test.pypi.org/legacy/
+        repository-url: https://test.pypi.org/legacy/
+        skip-existing: true
     - name: Publish
       uses: pypa/gh-action-pypi-publish@release/v1
       if: github.event_name == 'create' && github.event.ref_type == 'tag'
       with:
         user: __token__
         password: ${{ secrets.PYPI_API_TOKEN }}
+        skip-existing: true

From 7c79c43dd0ae20c4b32b759bcacce686f4166d0f Mon Sep 17 00:00:00 2001
From: myifengs
Date: Mon, 18 Sep 2023 12:31:44 +0800
Subject: [PATCH 16/17] Update README.md

---
 README.md       | 2 +-
 README.zh-CN.md | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 38edf14..ae5b336 100644
--- a/README.md
+++ b/README.md
@@ -10,7 +10,7 @@
 
 **Extract article or news by url or html, parse the title and content.**
 
-*[English]([README.md](https://github.com/myifeng/article-parser/blob/master/README.md)) ∙ [简体中文]([README.zh-CN.md](https://github.com/myifeng/article-parser/blob/master/README.zh-CN.md))*
+*[English](https://github.com/myifeng/article-parser/blob/master/README.md) ∙ [简体中文](https://github.com/myifeng/article-parser/blob/master/README.zh-CN.md)*
 
 ## How to install
 
diff --git a/README.zh-CN.md b/README.zh-CN.md
index 078d8a9..d92e615 100644
--- a/README.zh-CN.md
+++ b/README.zh-CN.md
@@ -10,7 +10,7 @@
 
 **A general-purpose library that parses the title and body of a web page from any URL or HTML file**
 
-*[English]([README.md](https://github.com/myifeng/article-parser/blob/master/README.md)) ∙ [简体中文]([README.zh-CN.md](https://github.com/myifeng/article-parser/blob/master/README.zh-CN.md))*
+*[English](https://github.com/myifeng/article-parser/blob/master/README.md) ∙ [简体中文](https://github.com/myifeng/article-parser/blob/master/README.zh-CN.md)*
 
 ## Installation
 

From d967cdc9473e7ea103468020c3f04ce8b0143d20 Mon Sep 17 00:00:00 2001
From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com>
Date: Sun, 8 Oct 2023 16:06:32 +0800
Subject: [PATCH 17/17] Bump requests from 2.27.1 to 2.31.0 (#45)

Bumps [requests](https://github.com/psf/requests) from 2.27.1 to 2.31.0.
- [Release notes](https://github.com/psf/requests/releases)
- [Changelog](https://github.com/psf/requests/blob/main/HISTORY.md)
- [Commits](https://github.com/psf/requests/compare/v2.27.1...v2.31.0)

---
updated-dependencies:
- dependency-name: requests
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot]
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
---
 requirements.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/requirements.txt b/requirements.txt
index c964942..9f1b261 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,5 +1,5 @@
 beautifulsoup4==4.12.2
 html2text==2020.1.16
-requests==2.27.1
+requests==2.31.0
 setuptools==49.2.1
 lxml==4.9.3