Updates From The Active Fork: Ruby 3 Support, Bug Fixes, New Drivers (Cuprite, Apparition), Test Updates #70

Open: wants to merge 31 commits into base: master
5f0b62d  Add additional command line arguments config option (dtengeri, Sep 10, 2019)
f9ba987  Merge branch '1-additional-command-line-params' (dtengeri, Sep 10, 2019)
43d6273  Rename the additional_arguments config option to browser_cmd_line_arg… (dtengeri, Sep 10, 2019)
32816eb  Merge branch '1-additional-command-line-params' (dtengeri, Sep 10, 2019)
80ff4cf  Remove built gems (dtengeri, Sep 10, 2019)
6b37bfc  Merge branch '1-additional-command-line-params' (dtengeri, Sep 10, 2019)
de6d560  add data attribute to crawl! method (tilhoft, Sep 30, 2019)
42124f6  Merge pull request #3 from tilhoft/add_data_to_crawl (dtengeri, Oct 1, 2019)
780e5a1  Add debugger address config option (dtengeri, Jun 23, 2020)
b80004f  Update version.rb (dtengeri, Jun 23, 2020)
569d2fd  Fix debugger address configuration (dtengeri, Jun 24, 2020)
390efe9  Update README (vifreefly, Jul 1, 2020)
542fe43  Update README (vifreefly, Jul 9, 2020)
d3d1064  fix: Ruby 3 kwargs (n-studio, Mar 3, 2022)
0d35578  Fix of sample code. (utakaha, Nov 1, 2019)
30d54e2  Use config argument on parse! to set config (duleorlovic, Nov 14, 2019)
6cec224  Switch to Addressable.URI.escape away from obsolete URI.escape; updat… (johnphamvan, May 13, 2020)
8a66648  fix: ruby 3 keyword arguments (n-studio, Mar 26, 2022)
129ac83  Merge branch 'ruby-3-and-crawl-params' (jkeen, May 30, 2022)
7de27fb  Fix: double splat keyword arguments for Saver.new (andrewperis, Jan 22, 2023)
c13096d  Missing double splat inside of parse (andrewperis, Jan 24, 2023)
66a7eb6  Missed a double splat for Saver.new (andrewperis, Jan 24, 2023)
cf4271b  Merge pull request #1 from andrewperis/fix/Saver.new-keyword-arguments (jkeen, Aug 27, 2023)
257bf01  Check for scheme using addressable instead of comparing to URI:HTTP (jkeen, Aug 27, 2023)
b279c6c  add apparition driver (glaucocustodio, Nov 28, 2020)
be7aa7c  add cuprite driver (glaucocustodio, Sep 17, 2021)
ad1dc20  update readme, changelog and version (glaucocustodio, Sep 17, 2021)
8114ca9  add response_type to in_parallel (glaucocustodio, Aug 13, 2022)
4349f17  add support to ruby 3 (glaucocustodio, Jan 20, 2023)
3d1cf32  write first tests with rspec (glaucocustodio, Jan 6, 2023)
f3983e7  Update readme with some lofty goals (jkeen, Aug 27, 2023)
2 changes: 2 additions & 0 deletions .gitignore
@@ -9,3 +9,5 @@
Gemfile.lock

*.retry
+.tags*
+*.gem
3 changes: 3 additions & 0 deletions .rspec
@@ -0,0 +1,3 @@
+--format documentation
+--color
+--require spec_helper
22 changes: 22 additions & 0 deletions CHANGELOG.md
@@ -1,4 +1,26 @@
# CHANGELOG
+## 1.6.0
+* Rename `additional_arguments` config option to `browser_cmd_line_arguments` (see [All available config options](https://github.com/vifreefly/kimuraframework#all-available-config-options))
+
+## 1.6.0
+### New
+* Add support to Ruby 3
+
+## 1.5.1
+### New
+* Add `response_type` to `in_parallel`
+
+## 1.5.0
+### New
+* Add support to [Apparition](https://github.com/twalpole/apparition)
+* Add support to [Cuprite](https://github.com/rubycdp/cuprite)
+
+## 1.4.1
+### New
+* Updated for Ruby 2.7+ support
+* Switched to Addressable.URI.escape from obsolete URI.escape
+
+
## 1.4.0
### New
* Add `encoding` config option (see [All available config options](https://github.com/vifreefly/kimuraframework#all-available-config-options))
191 changes: 106 additions & 85 deletions README.md

Large diffs are not rendered by default.

13 changes: 8 additions & 5 deletions kimurai.gemspec
@@ -33,16 +33,19 @@ Gem::Specification.new do |spec|
spec.add_dependency "capybara-mechanize"
spec.add_dependency "poltergeist"
spec.add_dependency "selenium-webdriver"
+spec.add_dependency "apparition"
+spec.add_dependency "cuprite"

spec.add_dependency "headless"
spec.add_dependency "pmap"

+spec.add_dependency "addressable"
spec.add_dependency "whenever"

-spec.add_dependency "rbcat", "~> 0.2"
-spec.add_dependency "pry"
+spec.add_dependency "rbcat", ">= 0.2.2", "< 0.3"
+spec.add_dependency "pry-nav"

-spec.add_development_dependency "bundler", "~> 1.16"
-spec.add_development_dependency "rake", "~> 10.0"
-spec.add_development_dependency "minitest", "~> 5.0"
+spec.add_development_dependency "bundler", "~> 2.1.4"
+spec.add_development_dependency "rake", "~> 13.0"
+spec.add_development_dependency "rspec", "~> 3"
end
35 changes: 21 additions & 14 deletions lib/kimurai/base.rb
@@ -1,5 +1,6 @@
require_relative 'base/saver'
require_relative 'base/storage'
+require 'addressable/uri'

module Kimurai
class Base
@@ -99,7 +100,7 @@ def self.logger
end
end

-def self.crawl!(exception_on_fail: true)
+def self.crawl!(exception_on_fail: true, data: {})
logger.error "Spider: already running: #{name}" and return false if running?

@storage = Storage.new
@@ -123,13 +124,13 @@ def self.crawl!(exception_on_fail: true)
if start_urls
start_urls.each do |start_url|
if start_url.class == Hash
-spider.request_to(:parse, start_url)
+spider.request_to(:parse, url: start_url[:url], data: data)
else
-spider.request_to(:parse, url: start_url)
+spider.request_to(:parse, url: start_url, data: data)
end
end
else
-spider.parse
+spider.parse(data: data)
end
rescue StandardError, SignalException, SystemExit => e
@run_info.merge!(status: :failed, error: e.inspect)
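Review note on the `data:` forwarding in this hunk: when a start URL is given as a Hash, the new call passes only `start_url[:url]` plus the crawl-wide `data`, so a `:data` key carried by the Hash itself no longer reaches `request_to`. Below is a minimal sketch of one way to keep both, assuming per-URL data should win on key conflicts. `request_args_for` is a hypothetical helper, not part of Kimurai.

```ruby
# Hypothetical helper (not part of Kimurai): builds the keyword arguments
# that crawl! could pass to request_to for a single start URL.
def request_args_for(start_url, data: {})
  if start_url.is_a?(Hash)
    # Assumption: per-URL data overrides crawl-wide data on key conflicts.
    { url: start_url[:url], data: data.merge(start_url[:data] || {}) }
  else
    { url: start_url, data: data }
  end
end

p request_args_for("https://example.com/", data: { run: 1 })
p request_args_for({ url: "https://example.com/a", data: { page: 2 } }, data: { run: 1 })
```

Inside `crawl!` this could be used as `spider.request_to(:parse, **request_args_for(start_url, data: data))`.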
@@ -154,12 +155,18 @@ def self.crawl!(exception_on_fail: true)
end

def self.parse!(handler, *args, **request)
-spider = self.new
+if request.has_key? :config
+config = request[:config]
+request.delete :config
+else
+config = {}
+end
+spider = self.new config: config

if args.present?
spider.public_send(handler, *args)
elsif request.present?
-spider.request_to(handler, request)
+spider.request_to(handler, **request)
else
spider.public_send(handler)
end
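Review note: the `config` extraction in `parse!` could be written more compactly, since `Hash#delete` returns the removed value, or `nil` when the key is absent. A sketch under that assumption follows. `split_config` is a hypothetical name used only for illustration.

```ruby
# Hypothetical helper: separates the :config entry from the remaining
# request keywords, mirroring what parse! does before building the spider.
def split_config(**request)
  # Hash#delete returns the value stored under :config, or nil if absent.
  config = request.delete(:config) || {}
  [config, request]
end

config, request = split_config(url: "https://example.com/", config: { user_agent: "Demo" })
p config
p request
```

One behavioral difference to be aware of: an explicit `config: nil` becomes `{}` here, whereas the `has_key?` version in the diff would pass `nil` through.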
@@ -191,7 +198,7 @@ def browser
end

def request_to(handler, delay = nil, url:, data: {}, response_type: :html)
-raise InvalidUrlError, "Requested url is invalid: #{url}" unless URI.parse(url).kind_of?(URI::HTTP)
+raise InvalidUrlError, "Requested url is invalid: #{url}" unless URI.parse(url).scheme =~ /http(s)?/

if @config[:skip_duplicate_requests] && !unique_request?(url)
add_event(:duplicate_requests) if self.with_info
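Review note: the pattern `/http(s)?/` in the new check is unanchored, so a scheme such as `xhttp` would also pass, and `URI.parse` can still raise `URI::InvalidURIError` before the comparison runs. Below is a minimal sketch of a stricter predicate using only the standard library. `http_url?` is a hypothetical helper, and whether the fork intends to route this check through Addressable instead is not visible in this hunk.

```ruby
require 'uri'

# Hypothetical helper: true only when the URL parses and its scheme is
# exactly "http" or "https".
def http_url?(url)
  scheme = URI.parse(url).scheme
  # to_s turns a missing scheme (nil) into "", which does not match.
  scheme.to_s.match?(/\Ahttps?\z/i)
rescue URI::InvalidURIError
  false
end

%w[https://example.com/ ftp://example.com/ xhttp://example.com/ /relative/path].each do |candidate|
  puts "#{candidate} => #{http_url?(candidate)}"
end
```

The guard in `request_to` could then read `raise InvalidUrlError, "..." unless http_url?(url)`.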
@@ -201,7 +208,7 @@ def request_to(handler, delay = nil, url:, data: {}, response_type: :html)
visited = delay ? browser.visit(url, delay: delay) : browser.visit(url)
return unless visited

-public_send(handler, browser.current_response(response_type), { url: url, data: data })
+public_send(handler, browser.current_response(response_type), **{ url: url, data: data })
end

def console(response = nil, url: nil, data: {})
@@ -224,9 +231,9 @@ def save_to(path, item, format:, position: true, append: false)
@savers[path] ||= begin
options = { format: format, position: position, append: append }
if self.with_info
-self.class.savers[path] ||= Saver.new(path, options)
+self.class.savers[path] ||= Saver.new(path, **options)
else
-Saver.new(path, options)
+Saver.new(path, **options)
end
end
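Background for the `**options` changes: Ruby 3 no longer converts a trailing positional Hash into keyword arguments, so a method that declares keywords has to be called with an explicit double splat. Here is a small self-contained illustration. `DemoSaver` is a stand-in with an assumed signature, not Kimurai's actual `Saver`.

```ruby
# Stand-in class with keyword parameters (hypothetical, for illustration).
class DemoSaver
  attr_reader :path, :format, :position, :append

  def initialize(path, format:, position: true, append: false)
    @path = path
    @format = format
    @position = position
    @append = append
  end
end

options = { format: :json, position: true, append: true }

# Works on Ruby 2.x and 3.x: the Hash is expanded into keywords.
saver = DemoSaver.new("results.json", **options)
puts saver.format

# On Ruby 3 the Hash is treated as a second positional argument and this
# raises ArgumentError; on Ruby 2.7 it still works but emits a warning.
begin
  DemoSaver.new("results.json", options)
rescue ArgumentError => e
  puts "positional Hash rejected: #{e.message}"
end
```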

@@ -286,7 +293,7 @@ def send_item(item, options = {})
end
end

-def in_parallel(handler, urls, threads:, data: {}, delay: nil, engine: @engine, config: {})
+def in_parallel(handler, urls, threads:, data: {}, delay: nil, engine: @engine, config: {}, response_type: :html)
parts = urls.in_sorted_groups(threads, false)
urls_count = urls.size

@@ -304,12 +311,12 @@ def in_parallel(handler, urls, threads:, data: {}, delay: nil, engine: @engine,
part.each do |url_data|
if url_data.class == Hash
if url_data[:url].present? && url_data[:data].present?
-spider.request_to(handler, delay, url_data)
+spider.request_to(handler, delay, **{ **url_data, response_type: response_type })
else
-spider.public_send(handler, url_data)
+spider.public_send(handler, **url_data)
end
else
-spider.request_to(handler, delay, url: url_data, data: data)
+spider.request_to(handler, delay, url: url_data, data: data, response_type: response_type)
end
end
ensure
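Review note on the `in_parallel` hunk: `**{ **url_data, response_type: response_type }` places the method-level `response_type` last, so it replaces any `:response_type` key already present in a URL Hash. If per-URL values are meant to take precedence, the merge order can be flipped. A sketch of that alternative follows, with `merged_request` as a hypothetical helper.

```ruby
# Hypothetical helper: combines a per-URL Hash with the method-level
# default so that keys in url_data win over the default.
def merged_request(url_data, response_type: :html)
  { response_type: response_type }.merge(url_data)
end

p merged_request({ url: "https://example.com/a", data: { id: 1 } }, response_type: :json)
p merged_request({ url: "https://example.com/b", data: { id: 2 }, response_type: :html }, response_type: :json)
```

Which precedence is intended is a design decision for the fork. The sketch only makes the difference visible.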
8 changes: 5 additions & 3 deletions lib/kimurai/base_helper.rb
@@ -1,16 +1,18 @@
+require 'addressable/uri'
+
module Kimurai
module BaseHelper
private

def absolute_url(url, base:)
return unless url
-URI.join(base, URI.escape(url)).to_s
+Addressable::URI.join(base, Addressable::URI.escape(url)).to_s
end

def escape_url(url)
-uri = URI.parse(url)
+uri = Addressable::URI.parse(url)
rescue URI::InvalidURIError => e
-URI.parse(URI.escape url).to_s rescue url
+Addressable::URI.parse(Addressable::URI.escape url).to_s rescue url
else
url
end
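Review note: `escape_url` now parses with `Addressable::URI.parse`, but the unchanged `rescue` clause still names `URI::InvalidURIError`. Addressable raises its own `Addressable::URI::InvalidURIError`, which is not a subclass of the standard library error, so the fallback branch would likely never run. The sketch below shows the effect with a stand-in module so that it runs without the gem. `FakeAddressable` and both method names are hypothetical, and the stand-in rejects any input containing a space purely for demonstration.

```ruby
require 'uri'

# Stand-in for Addressable::URI (hypothetical): it has its own error class
# that is unrelated to URI::InvalidURIError.
module FakeAddressable
  class InvalidURIError < StandardError; end

  def self.parse(url)
    raise InvalidURIError, "bad uri: #{url}" if url.include?(" ")
    url
  end
end

# Mirrors the diff: parser changed, rescue clause left on the stdlib class.
def escape_url_old_rescue(url)
  FakeAddressable.parse(url)
rescue URI::InvalidURIError
  :fallback
else
  url
end

# Rescue clause updated to the parser's own error class.
def escape_url_new_rescue(url)
  FakeAddressable.parse(url)
rescue FakeAddressable::InvalidURIError
  :fallback
else
  url
end

p escape_url_new_rescue("a b")
begin
  escape_url_old_rescue("a b")
rescue FakeAddressable::InvalidURIError => e
  puts "old rescue clause did not catch: #{e.message}"
end
```

In the real method the clause would presumably read `rescue Addressable::URI::InvalidURIError`.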
58 changes: 58 additions & 0 deletions lib/kimurai/browser_builder/apparition_builder.rb
@@ -0,0 +1,58 @@
+require 'capybara/apparition'
+require_relative '../capybara_configuration'
+require_relative '../capybara_ext/session'
+require_relative '../capybara_ext/apparition/driver'
+
+module Kimurai::BrowserBuilder
+class ApparitionBuilder
+attr_reader :logger, :spider
+
+def initialize(config, spider:)
+@config = config
+@spider = spider
+@logger = spider.logger
+end
+
+def build
+# Register driver
+Capybara.register_driver :apparition do |app|
+timeout = ENV.fetch('TIMEOUT', 30).to_i
+driver_options = { js_errors: false, timeout: timeout, debug: ENV['DEBUG'] }
+
+driver_options[:headless] = ENV.fetch("HEADLESS", "true") == "true"
+logger.debug "BrowserBuilder (apparition): enabled extensions"
+
+Capybara::Apparition::Driver.new(app, driver_options)
+end
+
+# Create browser instance (Capybara session)
+@browser = Capybara::Session.new(:apparition)
+@browser.spider = spider
+logger.debug "BrowserBuilder (apparition): created browser instance"
+
+# Headers
+if headers = @config[:headers].presence
+@browser.driver.headers = headers
+logger.debug "BrowserBuilder (apparition): enabled custom headers"
+end
+
+if user_agent = @config[:user_agent].presence
+user_agent_string = (user_agent.class == Proc ? user_agent.call : user_agent).strip
+
+@browser.driver.add_header("User-Agent", user_agent_string)
+logger.debug "BrowserBuilder (apparition): enabled custom user_agent"
+end
+
+# Cookies
+if cookies = @config[:cookies].presence
+cookies.each do |cookie|
+@browser.driver.set_cookie(cookie[:name], cookie[:value], cookie)
+end
+
+logger.debug "BrowserBuilder (apparition): enabled custom cookies"
+end
+
+@browser
+end
+end
+end
54 changes: 54 additions & 0 deletions lib/kimurai/browser_builder/cuprite_builder.rb
@@ -0,0 +1,54 @@
+require 'capybara/cuprite'
+require_relative '../capybara_configuration'
+require_relative '../capybara_ext/session'
+require_relative '../capybara_ext/cuprite/driver'
+
+module Kimurai::BrowserBuilder
+class CupriteBuilder
+attr_reader :logger, :spider
+
+def initialize(config, spider:)
+@config = config
+@spider = spider
+@logger = spider.logger
+end
+
+def build
+# Register driver
+Capybara.register_driver :cuprite do |app|
+driver_options = { headless: ENV.fetch("HEADLESS", "true") == "true" }
+logger.debug "BrowserBuilder (cuprite): enabled extensions"
+
+Capybara::Cuprite::Driver.new(app, driver_options)
+end
+
+# Create browser instance (Capybara session)
+@browser = Capybara::Session.new(:cuprite)
+@browser.spider = spider
+logger.debug "BrowserBuilder (cuprite): created browser instance"
+
+# Headers
+if headers = @config[:headers].presence
+@browser.driver.headers = headers
+logger.debug "BrowserBuilder (cuprite): enabled custom headers"
+end
+
+if user_agent = @config[:user_agent].presence
+user_agent_string = (user_agent.class == Proc ? user_agent.call : user_agent).strip
+@browser.driver.headers = {"User-Agent" => user_agent_string}
+logger.debug "BrowserBuilder (cuprite): enabled custom user_agent"
+end
+
+# Cookies
+if cookies = @config[:cookies].presence
+cookies.each do |cookie|
+@browser.driver.set_cookie(cookie[:name], cookie[:value], cookie)
+end
+
+logger.debug "BrowserBuilder (cuprite): enabled custom cookies"
+end
+
+@browser
+end
+end
+end
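Review note on the Cuprite builder: the `user_agent` branch assigns `@browser.driver.headers = {"User-Agent" => ...}` after the custom `headers` were already assigned. If `headers=` replaces the whole set rather than merging (this appears to be how Cuprite behaves, but it is an assumption here), the custom headers are lost whenever both options are configured. Below is a sketch of building one combined Hash first. `build_headers` is a hypothetical helper.

```ruby
# Hypothetical helper: merges custom headers with an optional user agent
# so that a single assignment to the driver carries both.
def build_headers(headers, user_agent)
  combined = (headers || {}).dup
  # Accept either a String or a callable (Proc/lambda) that returns one.
  resolved = user_agent.respond_to?(:call) ? user_agent.call : user_agent
  combined["User-Agent"] = resolved.strip if resolved
  combined
end

p build_headers({ "Accept-Language" => "en" }, -> { " DemoAgent/1.0 " })
p build_headers(nil, "DemoAgent/2.0")
```

The builder could then assign the result once, for example `@browser.driver.headers = build_headers(@config[:headers], @config[:user_agent])`.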
6 changes: 6 additions & 0 deletions lib/kimurai/browser_builder/poltergeist_phantomjs_builder.rb
@@ -50,6 +50,12 @@ def build
logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled disable_images"
end

+# Additional arguments
+if @config[:browser_cmd_line_arguments].present?
+driver_options[:phantomjs_options] += @config[:browser_cmd_line_arguments]
+logger.debug "BrowserBuilder (poltergeist_phantomjs): additional browser command line arguments have been added"
+end
+
Capybara::Poltergeist::Driver.new(app, driver_options)
end

13 changes: 11 additions & 2 deletions lib/kimurai/browser_builder/selenium_chrome_builder.rb
@@ -28,9 +28,12 @@ def build
if chrome_path = Kimurai.configuration.selenium_chrome_path
opts.merge!(binary: chrome_path)
end

# See all options here: https://seleniumhq.github.io/selenium/docs/api/rb/Selenium/WebDriver/Chrome/Options.html
-driver_options = Selenium::WebDriver::Chrome::Options.new(opts)
+driver_options = Selenium::WebDriver::Chrome::Options.new(**opts)

+if @config[:debugger_address]
+driver_options.add_option(:debuggerAddress, @config[:debugger_address])
+end

# Window size
if size = @config[:window_size].presence
@@ -109,6 +112,12 @@ def build
end
end

+# Additional arguments
+if @config[:browser_cmd_line_arguments].present?
+driver_options.args << @config[:browser_cmd_line_arguments].join(' ')
+logger.debug "BrowserBuilder (selenium_chrome): additional browser command line arguments have been added"
+end
+
chromedriver_path = Kimurai.configuration.chromedriver_path || "/usr/local/bin/chromedriver"
service = Selenium::WebDriver::Service.chrome(path: chromedriver_path)
Capybara::Selenium::Driver.new(app, browser: :chrome, options: driver_options, service: service)
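Review note: the Selenium builders append `@config[:browser_cmd_line_arguments].join(' ')` to `driver_options.args`, which adds one combined string, while the Poltergeist builder uses `+=` and keeps each argument separate. Browsers generally expect one list element per flag, so the joined form may not be parsed as intended. Here is a plain-Ruby illustration of the difference. The flag values are examples only.

```ruby
extra = ["--disable-gpu", "--lang=en-US"]

# What `args << extra.join(' ')` produces: a single element holding both flags.
joined = ["--headless"]
joined << extra.join(" ")

# Keeping one element per flag instead.
separate = ["--headless"]
separate.concat(extra)

p joined
p separate
```

With selenium-webdriver, `extra.each { |arg| driver_options.add_argument(arg) }` would be the per-argument equivalent.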
6 changes: 6 additions & 0 deletions lib/kimurai/browser_builder/selenium_firefox_builder.rb
@@ -110,6 +110,12 @@ def build
end
end

+# Additional arguments
+if @config[:browser_cmd_line_arguments].present?
+driver_options.args << @config[:browser_cmd_line_arguments].join(' ')
+logger.debug "BrowserBuilder (selenium_firefox): additional browser command line arguments have been added"
+end
+
Capybara::Selenium::Driver.new(app, browser: :firefox, options: driver_options)
end

13 changes: 13 additions & 0 deletions lib/kimurai/capybara_ext/apparition/driver.rb
@@ -0,0 +1,13 @@
+require_relative '../driver/base'
+
+module Capybara::Apparition
+class Driver
+def pid
+@pid ||= `lsof -i tcp:#{port} -t`.strip.to_i
+end
+
+def port
+@port ||= browser.client.instance_variable_get("@ws").instance_variable_get("@driver").instance_variable_get("@socket").instance_variable_get("@io").remote_address.inspect_sockaddr.split(':').last
+end
+end
+end
13 changes: 13 additions & 0 deletions lib/kimurai/capybara_ext/cuprite/driver.rb
@@ -0,0 +1,13 @@
+require_relative '../driver/base'
+
+module Capybara::Cuprite
+class Driver
+def pid
+@pid ||= `lsof -i tcp:#{port} -t`.strip.to_i
+end
+
+def port
+@port ||= browser.client.instance_variable_get("@ws").instance_variable_get("@driver").instance_variable_get("@socket").instance_variable_get("@sock").remote_address.inspect_sockaddr.split(':').last
+end
+end
+end
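Review note: both driver extensions reach the DevTools port through a chain of `instance_variable_get` calls that differ only in the last name (`@io` for Apparition, `@sock` for Cuprite). Because this depends on private internals of each gem, it may break on upgrades. Below is a sketch of a small helper that makes the chain easier to read and fails with a clearer message. `dig_ivars` and the demo objects are hypothetical.

```ruby
# Hypothetical helper: follows a chain of instance variables and raises a
# descriptive error as soon as one link is missing.
def dig_ivars(object, *names)
  names.reduce(object) do |current, name|
    unless current.instance_variable_defined?(name)
      raise ArgumentError, "#{current.class} has no #{name}"
    end
    current.instance_variable_get(name)
  end
end

# Demo objects standing in for client -> websocket -> socket.
socket = Object.new
socket.instance_variable_set(:@port, "9222")
ws = Object.new
ws.instance_variable_set(:@socket, socket)
client = Object.new
client.instance_variable_set(:@ws, ws)

puts dig_ivars(client, :@ws, :@socket, :@port)
```

In the drivers this would read along the lines of `dig_ivars(browser.client, :@ws, :@driver, :@socket, :@sock)`, with the last name chosen per gem.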
2 changes: 1 addition & 1 deletion lib/kimurai/capybara_ext/mechanize/driver.rb
@@ -34,7 +34,7 @@ def set_cookie(name, value, options = {})
options[:name] ||= name
options[:value] ||= value

-cookie = Mechanize::Cookie.new(options.merge path: "/")
+cookie = Mechanize::Cookie.new(**options.merge(path: "/"))
browser.agent.cookie_jar << cookie
end
