The library should validate the document before processing it #34

Open
@sneko

Hi @samclarke,

I have a script that watches the robots.txt files of multiple websites, but in some cases a site has none and serves fallback content instead. The issue is that your library reports isAllowed() -> true even when HTML is passed as the robots.txt body.

  const robotsParser = require('robots-parser');

  // Example URLs for the test (placeholders).
  const robotsUrl = 'https://example.com/robots.txt';
  const rootUrl = 'https://example.com/';

  it('should not confirm it can be indexed', () => {
    // An HTML fallback page instead of a real robots.txt.
    const body = `<html></html>`;

    const robots = robotsParser(robotsUrl, body);
    const canBeIndexed = robots.isAllowed(rootUrl);

    expect(canBeIndexed).toBeFalsy();
  });

(This test fails when it should pass. Or better, the call should throw: since both isAllowed and isDisallowed exist, neither can give a meaningful answer for an invalid file.)

Did I miss a built-in way to validate the robots.txt format?

Does it make sense to throw an error instead of allowing/disallowing something based on nothing?

Thank you,

EDIT: a workaround could be to check for HTML inside the file... hoping the website does not return some other format (JSON, raw text...). But that's a bit hacky, no?
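For illustration, a minimal sketch of that workaround; the looksLikeRobotsTxt helper and its heuristics are my own invention, not part of robots-parser:

  const robotsParser = require('robots-parser');

  // Hypothetical helper (name and heuristics are mine, not the library's):
  // accept a body only if it resembles a robots.txt file rather than an
  // HTML (or other) fallback page.
  function looksLikeRobotsTxt(body) {
    const trimmed = body.trim();

    // An empty robots.txt is valid: it allows everything.
    if (trimmed === '') return true;

    // Reject obvious HTML fallback pages.
    if (/^(<!doctype html|<html)/i.test(trimmed)) return false;

    // Require at least one line that looks like a comment or a
    // "Field: value" directive (User-agent, Disallow, Sitemap, ...).
    return trimmed.split('\n').some((line) => {
      const l = line.trim();
      return l.startsWith('#') || /^[A-Za-z-]+\s*:/.test(l);
    });
  }

  function parseRobotsStrict(robotsUrl, body) {
    if (!looksLikeRobotsTxt(body)) {
      throw new Error(`${robotsUrl} does not look like a robots.txt file`);
    }
    return robotsParser(robotsUrl, body);
  }

With a guard like this, the <html></html> body from the test above would throw instead of silently allowing everything.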

EDIT 2: a related point of view: https://stackoverflow.com/a/31598530/3608410
