Scrapyd-client is a client for Scrapyd. It provides:

Command line tools:

- scrapyd-deploy, to deploy your project to a Scrapyd server
- scrapyd-client, to interact with your project once deployed

Python client:

- ScrapydClient, to interact with Scrapyd within your Python code
It is configured using the Scrapy configuration file.
Deploying your project to a Scrapyd server involves:
- Eggifying your project.
- Uploading the egg to the Scrapyd server through the addversion.json webservice.
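For illustration, the upload step amounts to a single POST to the addversion.json webservice. Here is a minimal sketch using the requests package, assuming a Scrapyd server at http://localhost:6800 and an already-built egg (the file, project, and version names are hypothetical):

import requests

# POST the egg to Scrapyd's addversion.json webservice; this is the request
# that scrapyd-deploy makes for you after building the egg.
with open("myproject-1.0-py3.egg", "rb") as egg:
    response = requests.post(
        "http://localhost:6800/addversion.json",
        data={"project": "myproject", "version": "1.0"},
        files={"egg": egg},
    )
print(response.json())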
The scrapyd-deploy tool automates the process of building the egg and pushing it to the target Scrapyd server.
To deploy your project:

1. Change (cd) to the root of your project (the directory containing the scrapy.cfg file).
2. Eggify your project and upload it to the target:

   scrapyd-deploy <target> -p <project>
If you don't have a setup.py file in the root of your project, one will be created. If you have one, it must set the entry_points keyword argument in the setup() function call, for example:

from setuptools import setup, find_packages

setup(
    name='project',
    version='1.0',
    packages=find_packages(),
    entry_points={'scrapy': ['settings = projectname.settings']},
)
If the command is successful, you should see a JSON response, like:
Deploying myproject-1287453519 to http://localhost:6800/addversion.json
Server response (200):
{"status": "ok", "spiders": ["spider1", "spider2"]}
To save yourself from having to specify the target and project, you can configure your defaults in the Scrapy configuration file.
By default, scrapyd-deploy uses the current timestamp for generating the project version. You can pass a custom version using --version:

scrapyd-deploy <target> -p <project> --version <version>
See Scrapyd's documentation on how it determines the latest version.
If you use Mercurial or Git, you can use HG or GIT respectively as the argument supplied to --version to use the current revision as the version. You can save yourself having to specify the version parameter by adding it to your target's entry in scrapy.cfg:
[deploy]
...
version = HG
Note: The version keyword argument in the setup() function call in the setup.py file has no meaning to Scrapyd.
To include your project's dependencies in the egg:

1. Create a requirements.txt file at the root of your project, alongside the scrapy.cfg file.
2. Use the --include-dependencies option when building or deploying your project:

   scrapyd-deploy --include-dependencies
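For example, a requirements.txt might look like this (the packages are hypothetical; list whatever your spiders import):

beautifulsoup4>=4.12
python-dateutil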
Alternatively, you can install the dependencies directly on the Scrapyd server.
To include static files in the egg:

1. Create a setup.py file at the root of your project, alongside the scrapy.cfg file, if you don't have one:

   scrapyd-deploy --build-egg=/dev/null

2. Set the package_data and include_package_data keyword arguments in the setup() function call in the setup.py file. For example:

   from setuptools import setup, find_packages

   setup(
       name='project',
       version='1.0',
       packages=find_packages(),
       entry_points={'scrapy': ['settings = projectname.settings']},
       package_data={'projectname': ['path/to/*.json']},
       include_package_data=True,
   )
You may want to keep certain settings local and not have them deployed to Scrapyd.
1. Create a local_settings.py file at the root of your project, alongside the scrapy.cfg file.
2. Add the following to your project's settings file:

   try:
       from local_settings import *
   except ImportError:
       pass
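A local_settings.py might contain overrides like these (hypothetical values, intended for local development only):

# Settings that should apply only on your development machine.
HTTPCACHE_ENABLED = True
LOG_LEVEL = 'DEBUG'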
scrapyd-deploy doesn't deploy anything outside of the project module, so the local_settings.py file won't be deployed.
Problem: A settings file for local development is being included in the egg.
Solution: See Local settings. Or, exclude the module from the egg. If using scrapyd-client's default setup.py file, change the find_packages() call:

setup(
    name='project',
    version='1.0',
    packages=find_packages(),
    entry_points={'scrapy': ['settings = projectname.settings']},
)
to:
setup(
    name='project',
    version='1.0',
    packages=find_packages(exclude=["myproject.devsettings"]),
    entry_points={'scrapy': ['settings = projectname.settings']},
)
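To check what actually ends up in the egg, you can build it locally and list its contents (a sketch; myproject.egg is an arbitrary output path, and eggs are plain zip archives):

# build the egg without deploying, then list the files it contains
scrapyd-deploy --build-egg=myproject.egg
unzip -l myproject.egg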
Problem: Code using __file__ breaks when run in Scrapyd.

Solution: Use pkgutil.get_data instead. For example, change:
import os

path = os.path.dirname(os.path.realpath(__file__))  # BAD
open(os.path.join(path, "tools", "json", "test.json"), "rb").read()
to:
import pkgutil

pkgutil.get_data("projectname", "tools/json/test.json")
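Note that pkgutil.get_data() returns the file contents as bytes, so decode or parse the result as needed. For example, to load the packaged JSON file from the snippet above:

import json
import pkgutil

# get_data() reads the resource from the installed egg and returns bytes,
# which json.loads() accepts directly.
data = json.loads(pkgutil.get_data("projectname", "tools/json/test.json"))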
Be careful when writing to disk in your project, as Scrapyd will most likely be running under a different user which may not have write access to certain directories. If you can, avoid writing to disk and always use tempfile for temporary files.
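For example, a minimal sketch of writing scratch output with tempfile instead of a hard-coded path:

import tempfile

# Let the OS pick a location the current user can write to, instead of
# assuming a fixed directory exists and is writable.
with tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False) as f:
    f.write('{"status": "ok"}')
    print(f.name)  # the generated temporary file path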
If you use a proxy, use the HTTP_PROXY, HTTPS_PROXY, NO_PROXY and/or ALL_PROXY environment variables, as documented by the requests package.
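For example (proxy.example.com is a hypothetical host):

export HTTPS_PROXY=http://proxy.example.com:8080
export NO_PROXY=localhost,127.0.0.1
scrapyd-client projects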
For a reference on each subcommand, invoke scrapyd-client <subcommand> --help.
Where filtering with wildcards is possible, it is facilitated with fnmatch.
The --project option can be omitted if one is found in a scrapy.cfg.
The deploy subcommand is a wrapper around scrapyd-deploy.
Lists all targets:
scrapyd-client targets
Lists all projects of a Scrapyd instance:
# lists all projects on the default target
scrapyd-client projects

# lists all projects from a custom URL
scrapyd-client -t http://scrapyd.example.net projects
Schedules one or more spiders to be executed:
# schedules any spider
scrapyd-client schedule

# schedules all spiders from the 'knowledge' project
scrapyd-client schedule -p knowledge \*

# schedules any spider from any project whose name ends with '_daily'
scrapyd-client schedule -p \* \*_daily

# schedules spider1 in project1, specifying settings
scrapyd-client schedule -p project1 spider1 --arg 'setting=DOWNLOADER_MIDDLEWARES={"my.middleware.MyDownloader": 610}'
Lists spiders of one or more projects:
# lists all spiders
scrapyd-client spiders

# lists all spiders from the 'knowledge' project
scrapyd-client spiders -p knowledge
Interact with Scrapyd within your Python code:
from scrapyd_client import ScrapydClient
client = ScrapydClient()
for project in client.projects():
print(client.jobs(project=project))
You can define a Scrapyd target in your project's scrapy.cfg file. Example:
[deploy]
url = http://scrapyd.example.com/api/scrapyd
username = scrapy
password = secret
project = projectname
You can now deploy your project without the <target> argument or -p <project> option:
scrapyd-deploy
If you have multiple targets, add the target name in the section name. Example:
[deploy:targetname]
url = http://scrapyd.example.com/api/scrapyd
[deploy:another]
url = http://other.example.com/api/scrapyd
If you are working with continuous deployment (CD) frameworks, you do not need to commit your secrets to your repository. You can use environment variable expansion like so:
[deploy]
url = $SCRAPYD_URL
username = $SCRAPYD_USERNAME
password = $SCRAPYD_PASSWORD
or using this syntax:
[deploy]
url = ${SCRAPYD_URL}
username = ${SCRAPYD_USERNAME}
password = ${SCRAPYD_PASSWORD}
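The variables can then be supplied by the environment in which scrapyd-deploy runs, for example in a CI job (the values shown are placeholders):

export SCRAPYD_URL=https://scrapyd.example.com
export SCRAPYD_USERNAME=scrapy
export SCRAPYD_PASSWORD=secret
scrapyd-deploy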
To deploy to one target, run:
scrapyd-deploy targetname -p <project>
To deploy to all targets, use the -a option:
scrapyd-deploy -a -p <project>
While your target needs to be defined with its URL in scrapy.cfg, you can use netrc for username and password, like so:
machine scrapyd.example.com login scrapy password secret