tv_grab_uk_freeview produces bad XML for some channels #244

Open
nhathaway opened this issue Sep 10, 2024 · 11 comments

@nhathaway

XMLTV Version?

(Please specify release version or git commit ID)
f84e2eb

XMLTV Component?

(Grabber name or utility)
tv_grab_uk_freeview

Perl Version

5.38.2

Operating System

Ubuntu 24.04 - note: only the grabber is from github. The rest is from the Ubuntu distro.

What happened?

Aborted and produced invalid file(s)

What did you expect to happen?

Run to completion and produce valid file(s)

Did you see any warnings/errors?

(Please paste any warnings/errors, if available)
Code point \u0018 is not a valid character in XML at /usr/share/perl5/XMLTV.pm line 2197.
Code point \u0018 is not a valid character in XML at /usr/share/perl5/XMLTV.pm line 2197.
Code point \u0018 is not a valid character in XML at /usr/share/perl5/XMLTV.pm line 2197.
malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 (before "(end of string)") at /usr/share/perl5/XMLTV/Get_nice.pm line 136.
malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 (before "(end of string)") at /usr/share/perl5/XMLTV/Get_nice.pm line 136.
Code point \u001C is not a valid character in XML at /usr/share/perl5/XMLTV.pm line 2197.
no programmes found
no programmes found

What steps are needed to reproduce this issue?

(Please provide the full commands you are running)

  1. Generate the config for my postcode
  2. Split the config down into multiple files, one channel per config file
  3. Run the grabber for each channel, one at a time
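
For anyone wanting to repeat the three steps above, here is a rough Python sketch (the grabber itself is Perl; this is only a driver script). It assumes the generated config lists enabled channels as `channel=<id>` lines and that everything else is shared settings; the per-channel file names are hypothetical. `--config-file` and `--output` are the usual XMLTV grabber options.

```python
import subprocess
from pathlib import Path


def split_config(text):
    """Return {channel_id: config_text}, one enabled channel per config.

    Assumption: enabled channels appear as `channel=<id>` lines. Every other
    line (settings, disabled channels) is copied into each per-channel config.
    """
    lines = text.splitlines()
    shared = [line for line in lines if not line.startswith("channel=")]
    channels = [line for line in lines if line.startswith("channel=")]
    return {
        line.split("=", 1)[1]: "\n".join(shared + [line]) + "\n"
        for line in channels
    }


def grab_each(config_path, work_dir, grabber="tv_grab_uk_freeview"):
    """Write one config per channel, run the grabber for each, list failures."""
    work = Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)
    failed = []
    for channel_id, conf in split_config(Path(config_path).read_text()).items():
        conf_file = work / f"{channel_id}.conf"  # hypothetical naming scheme
        xml_file = work / f"{channel_id}.xml"
        conf_file.write_text(conf)
        result = subprocess.run(
            [grabber, "--config-file", str(conf_file), "--output", str(xml_file)]
        )
        if result.returncode != 0:
            failed.append(channel_id)
    return failed
```

Running one channel at a time makes it easy to see which channel a given warning belongs to, at the cost of a much slower run.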

Please attach your config file below:

(Remember to remove any usernames/passwords)
I have attached the entire output as well as the main config file, and the resulting per-channel xml files. I ran tv_validate_file on each and marked the bad ones as bad:
grab267.xml is bad
grab269.xml is bad
grab272.xml is bad
grab273.xml is bad
grab43.xml is bad
grab707.xml is bad
grab790.xml is bad

Any other information?

(For example, is this a new or intermittent issue?)
This gives more in-depth information on problems that others have reported.

Maybe Unicode::Escape could be used to convert \uNNNN to UTF-8?
https://manpages.ubuntu.com/manpages/mantic/man3/Unicode::Escape.3pm.html

I'm not sure what is being received for the bad JSON string. I have 6 errors and 7 bad files, so it's difficult to tell which error corresponds to which file, but the errors and the bad files are likely to be in the same order (in the two lists above). In any case, there aren't many to check to find out what is going wrong.

tv_grab_uk_freeview.zip

@honir
Contributor

honir commented Sep 11, 2024

I only see the .conf file, not the others.

What are you running: --days 1 --offset 0 ?

@nhathaway
Author

tv_grab_uk_freeview.tar.gz
Sorry, bad zip file. The new tarball also has all the cache files.

It was all channels, all days

honir added a commit that referenced this issue Sep 12, 2024
edge case when no programmes written in output xml file
@honir
Contributor

honir commented Sep 12, 2024

The 'code point' and 'no programmes' XML problems are fixed.
I can't do anything about the 'malformed JSON' error unless you know the specific channel and day on which it occurred.

@nhathaway
Author

I sent a full set of cache files. Will one of these contain the offending JSON? If so, is it possible to run a batch file to read them all and see which ones fail?

@nhathaway
Author

nhathaway commented Sep 12, 2024

Maybe this?

  <programme start="20240913024000 +0000" stop="20240913030000 +0000" channel="707.freeview.co.uk">
    <title lang="en">The Rise and Fall of Oasis</title>
    <desc lang="en">

@honir
Contributor

honir commented Sep 12, 2024

is it possible to run a batch file to...

Possibly, but I don't have time to do that. (I don't get paid for this :) )

Maybe this?...

I think that was a control code problem.

@nhathaway
Author

nhathaway commented Sep 12, 2024

How about this:

xmltv@ubuntu:~/.xmltv/cache$ for FILE in *; do if ! tail -n +7 "$FILE" | jq -e . >/dev/null 2>&1; then echo "$FILE failed"; fi; done
0a52542532cac77375c4ea0776f8eb85 failed
8211b15a956759e5600eb82ed82418fc failed
xmltv@ubuntu:~/.xmltv/cache$

Both those have no json in.
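
A rough Python equivalent of the loop above, in case jq isn't available. It also distinguishes an empty body from malformed JSON. Like `tail -n +7`, it assumes the first six lines of each cache file are header lines and the rest is the response body; the function names are made up for this sketch.

```python
import json
from pathlib import Path

HEADER_LINES = 6  # assumption: mirrors `tail -n +7`, which starts at line 7


def classify(raw):
    """Return 'ok', 'empty' or 'invalid' for one cache file's contents."""
    body = "".join(raw.splitlines(keepends=True)[HEADER_LINES:])
    if not body.strip():
        return "empty"
    try:
        json.loads(body)
    except json.JSONDecodeError:
        return "invalid"
    return "ok"


def scan(cache_dir):
    """Map file name to status for every cache file that is not 'ok'."""
    problems = {}
    for path in sorted(Path(cache_dir).iterdir()):
        if path.is_file():
            status = classify(path.read_text(encoding="utf-8", errors="replace"))
            if status != "ok":
                problems[path.name] = status
    return problems
```

For example, `scan(Path.home() / ".xmltv" / "cache")` would report the two files listed above as `empty`, if they really contain nothing after the header.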

@honir
Contributor

honir commented Sep 12, 2024

Nice! Good idea.

Both those have no json in.

That seems to be it. Neither of the main Perl JSON packages seems to handle an empty string without croaking.
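
The same failure is easy to show outside Perl: Python's `json.loads("")` also raises rather than returning an empty value. A minimal sketch of the kind of guard that avoids it, assuming an empty response simply means "no data for this request" (the function name is hypothetical, and this is not the change committed to the grabber):

```python
import json


def parse_json_or_none(body):
    """Decode a JSON response body, treating an empty body as 'no data'.

    Returns None for None, '' or whitespace-only input instead of raising.
    Genuinely malformed JSON still raises json.JSONDecodeError.
    """
    if body is None or not body.strip():
        return None
    return json.loads(body)
```

The caller can then skip a `None` result and carry on with the next request instead of aborting the run.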

@honir
Contributor

honir commented Sep 12, 2024

I've made a change to fix the missing JSON. Please give it a try.

@nhathaway
Author

Output from the cron job, which ran in the early hours of this morning:

could not fetch https://www.freeview.co.uk/api/program?sid=10&nid=64321&pid=crid://csi.enh.digitaluk.co.uk/af452102-916c-42da-b7e8-26b2d66a093c&start=2024-09-14T01:00:00+0000&duration=PT30M, error: 502 Bad Gateway, aborting
Code point \u001D is not a valid character in XML at /usr/share/perl5/XMLTV.pm line 2197.
no programmes found
no programmes found
grab10.xml is bad
grab707.xml is bad
grab790.xml is bad

New Unicode escape sequences seem to appear at any time. It might be better to use Unicode::Escape rather than keep adding new exceptions.

5xx errors seem to be a regular feature of the Freeview website. Most runs I have done have hit at least one of them. The script currently seems to abort the first time it encounters one of these errors. The documentation for HTTP::Cache::Transparent describes an "approve" interface which can be implemented to say "use the cached data on error". But then the cache timeout would probably need to be governed by a parameter.
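
To make the "use the cached data on error" idea concrete, here is a small Python sketch of the logic only. It does not use or imitate the HTTP::Cache::Transparent API. `FetchError`, `fetch_with_fallback` and `max_stale_seconds` are all names invented for this example, and the cache is just a dict.

```python
import time


class FetchError(Exception):
    """Raised by the fetch callable; carries the HTTP status code."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def fetch_with_fallback(url, fetch, cache, max_stale_seconds, now=time.time):
    """Fetch url; on a 5xx error, fall back to a recent enough cached copy.

    cache maps url -> (stored_at, body). Other errors, or a cached copy
    older than max_stale_seconds, re-raise the original error.
    """
    try:
        body = fetch(url)
    except FetchError as err:
        entry = cache.get(url)
        if 500 <= err.status < 600 and entry is not None:
            stored_at, cached_body = entry
            if now() - stored_at <= max_stale_seconds:
                return cached_body
        raise
    cache[url] = (now(), body)
    return body
```

Here `max_stale_seconds` plays the role of the parameter mentioned above: it bounds how old a cached response may be before a 5xx is treated as fatal again.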

honir added a commit that referenced this issue Oct 14, 2024
Incoming data contains non-printable ascii characters.
@honir
Contributor

honir commented Oct 14, 2024

Unicode::Escape only fixes non-ASCII characters.
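
For context: decoding a `\u0018` escape just yields the control character U+0018 itself, and XML 1.0 does not allow that code point however it is encoded. The only characters XML 1.0 permits are tab, LF, CR, U+0020 to U+D7FF, U+E000 to U+FFFD and U+10000 to U+10FFFF. A Python sketch of a general filter based on that rule, rather than a list of individual exceptions (illustrative only, not the committed Perl change; the function name is made up):

```python
import re

# Matches anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text):
    """Remove every character that XML 1.0 does not allow in a document."""
    return _INVALID_XML_CHARS.sub("", text)
```

With this, a decoded string containing U+0018, U+001C or U+001D comes out with those characters removed, while tabs, newlines and non-ASCII text pass through unchanged.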
