
Commit f2153d3

pypi 0.2.1 release, added url to WorkGroup parameter
1 parent 46017c7 commit f2153d3

File tree

13 files changed: +60 -36 lines changed

CHANGES

Lines changed: 11 additions & 1 deletion

@@ -3,9 +3,19 @@
 ========================

 11/29/18
-- pypi 0.2.0 release
+- pypi 0.2.1 release
+
+- added `url` parameter to the WorkGroup which is a bit more attractive
+  API, instead of including the url in a kwarg. The reason why the url was
+  originally included as a kwarg is because depending on how the custom
+  Spider is setup, the url may already be specified, and it is redundant to
+  specify it again. But for API clarity sake, now we just insist the url is
+  specified in the WorkGroup. At least, it is easier to read at a quick glance.

 11/28/18
+
+- pypi 0.2.0 release
+
 More API breaking changes:

 - previously, the Worker needed to be explicitly defined in the
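For context, this entry moves the target url out of the kwargs dict and into a named parameter on the WorkGroup itself. A rough before-and-after sketch, borrowing the argument names from examples/books_to_scrape/main.py further down in this commit (imports omitted, so treat it as illustrative rather than a standalone script):

    # 0.2.0 style: the url rode along inside the kwargs dict
    group = WorkGroup(
        name='books.toscrape.com',
        spider=BooksToScrapeScraper,
        worker=BooksWorker,
        items=BookItems,
        loader=BookItemsLoader,
        exporters=exporters,
        workers=20,
        kwargs={'url': 'http://books.toscrape.com/', 'timeout': (3.0, 20.0)})

    # 0.2.1 style: the url is a first-class WorkGroup parameter
    group = WorkGroup(
        name='books.toscrape.com',
        url='http://books.toscrape.com/',
        spider=BooksToScrapeScraper,
        worker=BooksWorker,
        items=BookItems,
        loader=BookItemsLoader,
        exporters=exporters,
        workers=20,
        kwargs={'timeout': (3.0, 20.0)})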

README.rst

Lines changed: 13 additions & 12 deletions

@@ -5,8 +5,8 @@

 .. image:: https://img.shields.io/badge/Python-3.6%20%7C%203.7-blue.svg
     :target: https://github.com/bomquote/transistor
-.. image:: https://img.shields.io/badge/pypi%20package-0.2.0-blue.svg
-    :target: https://pypi.org/project/transistor/0.2.0/
+.. image:: https://img.shields.io/badge/pypi%20package-0.2.1-blue.svg
+    :target: https://pypi.org/project/transistor/0.2.1/
 .. image:: https://img.shields.io/badge/Status-Beta-blue.svg
     :target: https://github.com/bomquote/transistor
 .. image:: https://img.shields.io/badge/license-MIT-lightgrey.svg
@@ -167,13 +167,13 @@ Quickstart

 First, install ``Transistor`` from pypi:

-.. code-block:: python
+.. code-block:: rest

     pip install transistor

 If you have previously installed ``Transistor``, please ensure you are using the latest version:

-.. code-block:: python
+.. code-block:: rest

     pip-install --upgrade transistor

@@ -404,7 +404,7 @@ Specifically, we are interested in the `book_title`, `stock` and `price` attributes

         return self.items

-Finally, to run the scrape, we will need to create a main.py file. This is all we need for the minimal example to scrape and export targeted data to cvs.
+Finally, to run the scrape, we will need to create a main.py file. This is all we need for the minimal example to scrape and export targeted data to csv.

 So, at this point, we've:

@@ -455,12 +455,13 @@ Third, setup the ``WorkGroup`` in a list we'll call *groups*. We use a list here
     groups = [
         WorkGroup(
             name='books.toscrape.com',
+            url='http://books.toscrape.com/',
             spider=BooksToScrapeScraper,
             items=BookItems,
             loader=BookItemsLoader,
             exporters=exporters,
             workers=20, # this creates 20 scrapers and assigns each a book as a task
-            kwargs={'url': 'http://books.toscrape.com/', 'timeout': (3.0, 20.0)})
+            kwargs={'timeout': (3.0, 20.0)})
     ]

 Last, setup the ``WorkGroupManager`` and prepare the file to call the ``manager.main()`` method to start the scrape job:
@@ -506,7 +507,7 @@ Directly Using A SplashScraper

 Perhaps you just want to do a quick one-off scrape?

-It is possible to just use your custom scraper sublcassed from ``SplashScraper`` directly, without going through all the work to setup a ``StatefulBook``, ``BaseWorker``, ``BaseGroup``, ``WorkGroup``, and ``WorkGroupManager``.
+It is possible to just use your custom scraper subclassed from ``SplashScraper`` directly, without going through all the work to setup a ``StatefulBook``, ``BaseWorker``, ``BaseGroup``, ``WorkGroup``, and ``WorkGroupManager``.

 Just fire it up in a python repl like below and ensure the ``start_http_session`` method is run, which can generally be done by setting ``autorun=True``.

@@ -566,25 +567,25 @@ Next, we need to store our first two python objects in newt.db, which are:

 .. code-block:: python

-    from transistor.persistence.newt_db.collections import ScrapeList, ScrapeLists
+    from transistor.persistence.newt_db.collections import SpiderList, SpiderLists

 Now, from your python repl:

 .. code-block:: python

     from transistor.newt_db import ndb

-    >>> ndb.root.scrapes = ScrapeLists() # Assigning ScrapeLists() is only required during initial seup. Or else, when/if you change the ScrapeLists() object, for example, to provide more functionality to the class.
-    >>> ndb.root.scrapes.add('first-scrape', ScrapeList()) # You will add a new ScrapeList() anytime you need a new list container. Like, every single scrape you save. See ``process_exports`` method in ``examples/books_to_scrape/workgroup.py``.
+    >>> ndb.root.spiders = SpiderLists() # Assigning SpiderLists() is only required during initial setup. Or else, when/if you change the SpiderLists() object, for example, to provide more functionality to the class.
+    >>> ndb.root.spiders.add('first-scrape', SpiderList()) # You will add a new SpiderList() anytime you need a new list container. Like, every single scrape you save. See ``process_exports`` method in ``examples/books_to_scrape/workgroup.py``.
     >>> ndb.commit() # you must explicitly commit() after each change to newt.db.

 At this point, you are ready-to-go with newt.db and PostgreSQL.

-Later, when you have a scraper object instance, such as ``BooksToScrapeScraper()`` which has finished it's web scrape cycle, it will be stored in the ``ScrapeList()`` named ``first-scrape`` like such:
+Later, when you have a scraper object instance, such as ``BooksToScrapeScraper()`` which has finished it's web scrape cycle, it will be stored in the ``SpiderList()`` named ``first-scrape`` like such:

 .. code-block:: python

-    >>> ndb.root.scrapes['first-scrape'].add(BooksToScrapeScraper(name="books.toscrape.com", book_title="Soumission"))
+    >>> ndb.root.spiders['first-scrape'].add(BooksToScrapeScraper(name="books.toscrape.com", book_title="Soumission"))


 More on StatefulBook
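Reading a saved spider back out of newt.db is not shown in the diff above, but the subscript access in the last snippet suggests something along these lines; iteration over a SpiderList is an assumption here, not a documented guarantee:

    from transistor.newt_db import ndb

    saved = ndb.root.spiders['first-scrape']   # the SpiderList created above
    for spider in saved:                       # assumes SpiderList is iterable
        print(spider.name, spider.book_title)  # attrs passed to the constructor above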

dist/transistor-0.2.0.tar.gz

-114 KB
Binary file not shown.

dist/transistor-0.2.1.tar.gz

114 KB
Binary file not shown.

examples/books_to_scrape/main.py

Lines changed: 27 additions & 15 deletions

@@ -53,14 +53,16 @@
 # finally, the core of what we need to launch the scrape job
 from transistor import WorkGroup, StatefulBook
 from transistor.persistence.exporters import CsvItemExporter
+from transistor.persistence.exporters.json import JsonLinesItemExporter
 from examples.books_to_scrape.workgroup import BooksWorker
 from examples.books_to_scrape.scraper import BooksToScrapeScraper
 from examples.books_to_scrape.manager import BooksWorkGroupManager
 from examples.books_to_scrape.persistence.serialization import (
     BookItems, BookItemsLoader)


-# 1) get the excel file path which has the book_titles we are interested to scrape
+# 1) Get the excel file path which has the book_titles we are interested to scrape.
+
 def get_file_path(filename):
     """
     Find the book_title excel file path.
@@ -70,42 +72,52 @@ def get_file_path(filename):
     filepath = root / 'books_to_scrape' / filename
     return r'{}'.format(filepath)

-
 # 2) Create a StatefulBook instance to read the excel file and load the work queue.
 # Set a list of tracker names, with one tracker name for each WorkGroup you create
-# in step three. Ensure the tracker name matches the WorkGroup.name in step three.
+# in step four. Ensure the tracker name matches the WorkGroup.name in step four.
+
 file = get_file_path('book_titles.xlsx')
 trackers = ['books.toscrape.com']
-stateful_book = StatefulBook(file, trackers, keywords='titles', autorun=True)
+tasks = StatefulBook(file, trackers, keywords='titles', autorun=True)
+
+# 3) Setup a list of exporters which than then be passed to whichever WorkGroup
+# objects you want to use them with. In this case, we are just going to use the
+# built-in CsvItemExporter but we could also use additional exporters to do
+# multiple exports at the same time, if desired.

+exporters = [CsvItemExporter(
+    fields_to_export=['book_title', 'stock', 'price'],
+    file=open('c:/tmp/book_data.csv', 'a+b')),
+    JsonLinesItemExporter(
+        fields_to_export=['book_title', 'stock', 'price'],
+        file=open('c:/tmp/book_data.json', 'a+b'),
+        encoding='utf_8_sig')]

-# 3) Setup the WorkGroups. You can create an arbitrary number of WorkGroups in a list.
+# 4) Setup the WorkGroups. You can create an arbitrary number of WorkGroups in a list.
 # For example, if there are three different domains which you want to search for
-# the book titles from the excel file. To, scrape the price and stock data on
-# each of the three different websites for each book title. You could setup three
+# the book titles from the excel file. If you wanted to scrape the price and stock data
+# on each of the three different websites for each book title. You could setup three
 # different WorkGroups here. Last, the WorkGroup.name should match the tracker name.
+
 groups = [
     WorkGroup(
         name='books.toscrape.com',
+        url='http://books.toscrape.com/',
         spider=BooksToScrapeScraper,
         worker=BooksWorker,
         items=BookItems,
         loader=BookItemsLoader,
-        exporters=[
-            CsvItemExporter(
-                fields_to_export=['book_title', 'stock', 'price'],
-                file=open('c:/tmp/book_data.csv', 'a+b'))
-        ],
+        exporters=exporters,
         workers=20, # this creates 20 scrapers and assigns each a book as a task
-        kwargs={'url': 'http://books.toscrape.com/', 'timeout': (3.0, 20.0)})
+        kwargs={'timeout': (3.0, 20.0)})
 ]

-# 4) Last, setup the Manager. You can constrain the number of workers actually
+# 5) Last, setup the Manager. You can constrain the number of workers actually
 # deployed, through the `pool` parameter. For example, this is useful
 # when using a Crawlera 'C10' plan which limits concurrency to 10. To deploy all
 # the workers concurrently, set the pool to be marginally larger than the number
 # of total workers assigned in groups in step #3 above.
-manager = BooksWorkGroupManager('books_scrape', stateful_book, groups=groups, pool=25)
+manager = BooksWorkGroupManager('books_scrape', tasks, groups=groups, pool=25)


 if __name__ == "__main__":
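The new step 3 comment notes that a single exporters list can be handed to any number of WorkGroup objects, and the step 4 comment describes searching the same titles across several domains. A sketch of that multi-site setup, where the second domain, its tracker name, and OtherSiteScraper are hypothetical placeholders rather than anything in this repository:

    trackers = ['books.toscrape.com', 'books.example.com']  # one tracker per WorkGroup
    tasks = StatefulBook(file, trackers, keywords='titles', autorun=True)

    groups = [
        WorkGroup(
            name='books.toscrape.com',
            url='http://books.toscrape.com/',
            spider=BooksToScrapeScraper,
            worker=BooksWorker,
            items=BookItems,
            loader=BookItemsLoader,
            exporters=exporters,           # the shared exporters list
            workers=20,
            kwargs={'timeout': (3.0, 20.0)}),
        WorkGroup(
            name='books.example.com',      # must match the second tracker name
            url='http://books.example.com/',
            spider=OtherSiteScraper,       # hypothetical spider for the second site
            worker=BooksWorker,
            items=BookItems,
            loader=BookItemsLoader,
            exporters=exporters,           # reused as-is
            workers=20,
            kwargs={'timeout': (3.0, 20.0)}),
    ]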

setup.py

Lines changed: 1 addition & 1 deletion

@@ -123,7 +123,7 @@ def run(self):
     author_email=EMAIL,
     python_requires=REQUIRES_PYTHON,
     url=URL,
-    download_url='https://github.com/bomquote/transistor/archive/v0.2.0.tar.gz',
+    download_url='https://github.com/bomquote/transistor/archive/v0.2.1.tar.gz',
     keywords=['scraping', 'crawling', 'spiders', 'requests', 'beautifulsoup4',
               'mechanicalsoup', 'framework', 'headless-browser'],
     packages=find_packages(exclude=('tests',)),

tests/books_toscrape/test_books_toscrape.py

Lines changed: 2 additions & 1 deletion

@@ -99,13 +99,14 @@ def bts_manager(_BooksToScrapeGroup, _BooksWorker):
     groups = [
         WorkGroup(
             name='books.toscrape.com',
+            url='http://books.toscrape.com/',
             spider=BooksToScrapeScraper,
             worker=_BooksWorker,
             items=BookItems,
             loader=BookItemsLoader,
             exporters=exporters,
             workers=3, # this creates 3 scrapers and assigns each a book as a task
-            kwargs={'url': 'http://books.toscrape.com/', 'timeout': (3.0, 20.0)})
+            kwargs={'timeout': (3.0, 20.0)})
     ]
     manager = BooksWorkGroupManager('books_scrape', tasks, groups=groups, pool=5)


transistor/__version__.py

Lines changed: 1 addition & 1 deletion

@@ -12,7 +12,7 @@
 __title__ = 'transistor'
 __description__ = 'A web scraping framework for intelligent use cases.'
 __url__ = 'https://github.com/bomquote/transistor'
-__version__ = '0.2.0'
+__version__ = '0.2.1'
 __author__ = 'Bob Jordan'
 __author_email__ = 'bmjjr@bomquote.com'
 __license__ = 'MIT'

transistor/managers/base_manager.py

Lines changed: 1 addition & 0 deletions

@@ -99,6 +99,7 @@ def _init_workers(self):
             # add the name to group.kwargs dict so it can be passed down
             # to the group/worker/spider and assigned as an attr
             group.kwargs['name'] = name
+            group.kwargs['url'] = group.url
             group.kwargs['spider'] = group.spider
             group.kwargs['worker'] = group.worker
             group.kwargs['items'] = group.items
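The practical effect is that the manager folds the WorkGroup's url back into the kwargs dict before the workers and spiders are built, so a custom spider that previously read url from kwargs keeps working without changes. A simplified sketch of that hand-off; propagate_url is a made-up helper that only mirrors the excerpt above:

    def propagate_url(group):
        # copy the new WorkGroup.url attribute into the kwargs that are
        # passed down to the group, worker, and spider
        group.kwargs['url'] = group.url
        return group.kwargs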
