Selector initialization ignores user choice of type #321

Open
@marcoaaguiar

Selector initialization ignores the user's choice of type if the text/body is JSON-serializable. This seems to have been introduced in 1.8.0.
This is also a documentation problem: from 1.8.0 onward, Parsel seems to guess the type of the data instead of defaulting to "html".

The problem is that this makes parsing unknown text unreliable: the text might be interpreted as something other than what was expected (as in the examples below), and things like .xpath(...) may break.

For instance in 1.7.0:

>>> Selector("2000")  # loaded as html 
<Selector xpath=None data='<html><body><p>2000</p></body></html>'>
>>> Selector("foo")  # loaded as html
<Selector xpath=None data='<html><body><p>foo</p></body></html>'>

In 1.8.1:

>>> Selector("foo")  # loaded as html
<Selector query=None data='<html><body><p>foo</p></body></html>'>
>>> Selector("2000")  # loaded as json
<parsel.selector.Selector object at 0x1247f8e20>
>>> Selector("200", type="html")  # loaded as json, even if html is requested
<parsel.selector.Selector object at 0x104e72a40>

The root cause is that, when identifying the data type, the logic does not check which type the user passed:

parsel/parsel/selector.py

Lines 332 to 333 in 4966533

if data is not _NOT_SET:
    return data, "json"

There are two possible solutions:

  1. Check if the user passed type=="json" (Defaults to "html")
  2. Check if the user passed type in (None, "json") (Auto-detect type)
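
The two options above could be sketched roughly as follows. This is a standalone illustration, not Parsel's actual code: `resolve_type`, `_try_json`, and the `autodetect` flag are hypothetical names, and only the `_NOT_SET` sentinel and the `return data, "json"` shape are taken from the snippet quoted above. The idea is that the JSON guess is only attempted when the requested type permits it.

```python
import json
from typing import Any, Optional, Tuple

_NOT_SET = object()  # sentinel, named after the one in the quoted snippet


def _try_json(text: str) -> Any:
    """Hypothetical helper: return the parsed JSON value, or _NOT_SET on failure."""
    try:
        return json.loads(text)
    except ValueError:
        return _NOT_SET


def resolve_type(
    text: str, requested: Optional[str] = None, *, autodetect: bool = True
) -> Tuple[Any, str]:
    """Hypothetical stand-in for the type-detection step.

    autodetect=False models option 1: JSON only when type == "json".
    autodetect=True models option 2: JSON when type in (None, "json").
    """
    allowed = (None, "json") if autodetect else ("json",)
    if requested in allowed:
        data = _try_json(text)
        if data is not _NOT_SET:
            return data, "json"
    # An explicit non-JSON type is always honoured; None falls back to "html".
    return text, requested or "html"


print(resolve_type("2000", "html"))            # explicit html is respected
print(resolve_type("2000"))                    # option 2: guessed as json
print(resolve_type("2000", autodetect=False))  # option 1: defaults to html
```

With either option, an explicit type="html" is never overridden; the options differ only in what happens when no type is given.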

I could open a PR with either approach.
