Skip to content

Conversation

@fkztw
Copy link

@fkztw fkztw commented Jun 28, 2017

When arg is not a ascii char,
len(arg) may be differrent from len(arg.encode("utf-8")).
For example, a Chinese character '哈',
len('哈') is 1, but len('哈'.encode("utf-8")) is 3,
which will raise an InvalidResponseError because of the wrong length.

This PR should fix this kind of problem.

@fkztw
Copy link
Author

fkztw commented Jun 28, 2017

hmm... looks like #28 and #31 are also dealing with this problem.
I should add a patch which can solve it once for all and this is related to the annoyed problem of unicode and bytes of Python 2 and Python 3.

+ Determine the Python version of current environment
+ Deal with the args according to py2 or py3
@fkztw fkztw force-pushed the fix-issue-command branch from e4f3c9c to e0d9072 Compare June 28, 2017 13:58
@fkztw
Copy link
Author

fkztw commented Jun 29, 2017

@quique0194, @pjrobertson:
Could both of you help me testing if this patch works for both of you in your scenario?
I've tested this patch in both Python 2 and Python 3 with this simple script and it worked for me,
I would be appreciate if you can help testing this patch,
since one of you is using Python 2 and another is using Python 3.
Thank you.
(Click Details to see the script)

# -*- coding: utf-8 -*-

import sys
import dryscrape


dryscrape.start_xvfb()
br = dryscrape.Session()
br.visit("https://google.com")


if sys.version_info[0] == 2:
    PY2 = True
    PY3 = False
if sys.version_info[0] == 3:
    PY2 = False
    PY3 = True


def test_py2():
    q = br.at_xpath('//*[@name="q"]')
    q.set(u"canción")
    q.form().submit()
    br.render("py2-unicode.png")

    q = br.at_xpath('//*[@name="q"]')
    q.set("北京")
    q.form().submit()
    br.render("py2-str.png")

    q = br.at_xpath('//*[@name="q"]')
    q.set(b"123")
    q.form().submit()
    br.render("py2-bytes.png")


def test_py3():
    q = br.at_xpath('//*[@name="q"]')
    q.set("北京")
    q.form().submit()
    br.render("py3-unicode.png")

    q = br.at_xpath('//*[@name="q"]')
    q.set("canción".encode("utf-8"))
    q.form().submit()
    br.render("py3-bytes.png")


if __name__ == "__main__":
    if PY2:
        test_py2()
    if PY3:
        test_py3()

    print("Done")

""" Deal with the unicode and bytes problem in Python 2 and Python 3 for socket.sendall """

if PY2:
if type(arg) != unicode:
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if this raise an NameError in Python 3 since there's no unicode type in Python 3.
But I've tested it in Python 3 and it worked.
Would be appreciate if anyone has some advices on this.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this would be a problem in other languages such as C++, where identifiers need to be defined at compilation time even if they are never used at runtime, this should not be an issue for Python, since here definitions of identifiers are only checked when encountered, and not during the syntax check.

In Python, it is also common to import packages only when they are required, and therefore import statements can be nested into if statements, indicating that an identifier does not have to be defined until it is encountered by the interpreter.

Therefore, I do not expect to find any problems with this line of code. In addition, I have successfully tested the suggested changes while observing no error messages or warnings.

webkit_server.py Outdated
arg = str(arg)
self._writeline(str(len(arg)))
self._sock.sendall(arg.encode("utf-8"))
self.send_arg(arg)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file seems to be using an indentation of 2 spaces rather than 4 consistently throughout the file. I suggest correcting the indentation of line 523.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I've fixed it in this commit 62ad2e9.

@PhilippVerpoort
Copy link

I came across the same problem, and found a very similar solution. I can confirm that the fundamental problem is the computed length of each argument as len(arg) in line 517 of the changed file.

I have reviewed the solution presented in this pull request. I find the updated code well written, and have successfully tested the updated version with both Python2 and Python3 using the provided example script from Dryscrape, as well as a script that I had written myself for a different purpose. In response to @M157q's comment on the use of the unicode identifier after checking for Python2, I believe that this should not be a problem (see my reply to the comment). I have added one comment to the changed file requesting a fix for the indentation.

I suggest to the maintainers of this repository to merge this pull request #33, to reject pull requests #28 and #31, and to mark issues #27 and #30 as fixed (or duplicates where appropriate).

To fit the coding style of this project.

ref: niklasb#33 (comment)
@niklasb
Copy link
Owner

niklasb commented Jul 26, 2017

@M157q thank you for the patch and @PhilippVeerport thank you for your detailed analysis. I will try to look at this ASAP.

@niklasb
Copy link
Owner

niklasb commented Aug 22, 2017

@M157q @PhilippVerpoort Thank you for your contribution. Unfortunately I have decided to no longer maintain dryscrape or webkit-server, since properly doing so would require rewriting webkit-server using QtWebEngine. Consider using a different package for your scraping if you do not want to be vulnerable to old WebKit bugs.

@PhilippVerpoort
Copy link

@niklasb Thanks for your reply. This is completely fair, but would you consider implementing this change and filing a new release with this bug fix anyway? At least people who do consider for some reason to use this piece of software will not have to go through the same process of manually applying this patch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants