Description
Basic information
- Program version: 0.2.8
- Python version: 3.11.9
- Operating system: Linux
Describe the bug
If a chat message in a VOD contains a non-ASCII character (for example, any 2-byte UTF-8 symbol), the emotes[].name field of the message JSON produced by the library is parsed incorrectly.
Command/Code used
- The command used (including the verbose tag, -v):
chat_downloader --start_time 05:58:28 --end_time 05:58:30 --output test.jsonl --testing 'https://www.twitch.tv/videos/2184933543'
- Output from the above command:
(I've temporarily patched the library with debug prints to show the raw GQL content passed to the message mapper, chat_downloader.sites.twitch.TwitchChatDownloader._parse_message_info())
[DEBUG] Python version: 3.11.9 (main, Jul 3 2024, 00:12:48) [GCC 12.2.0]
[DEBUG] Program version: 0.2.8
[DEBUG] Initialisation parameters: {'headers': None, 'cookies': None, 'proxy': None}
[DEBUG] Created TwitchChatDownloader session.
[INFO] Site: twitch.tv
[DEBUG] Program parameters: {'url': 'https://www.twitch.tv/videos/2184933543', 'start_time': '05:58:28', 'end_time': '05:58:30', 'max_attempts': 15, 'retry_timeout': None, 'interruptible_retry': True, 'timeout': None, 'inactivity_timeout': None, 'max_messages': None, 'message_groups': ['messages'], 'message_types': None, 'output': 'test.jsonl', 'overwrite': True, 'sort_keys': True, 'indent': 4, 'format': 'twitch', 'format_file': None, 'chat_type': 'live', 'ignore': None, 'message_receive_timeout': 0.1, 'buffer_size': 4096}
[DEBUG] Starting new HTTPS connection (1): gql.twitch.tv:443
[DEBUG] https://gql.twitch.tv:443 "POST /gql HTTP/11" 200 880
[DEBUG] https://gql.twitch.tv:443 "POST /gql HTTP/11" 200 None
[DEBUG] Match found: "<re.Match object; span=(0, 39), match='https://www.twitch.tv/videos/2184933543'>". Running "_get_chat_by_vod_id" function in "TwitchChatDownloader".
[DEBUG] Chat information: {'chat': <generator object TwitchChatDownloader._get_chat_messages_by_vod_id at 0x7f8ce80acf40>, 'title': 'DLC НА КАЗУАЛЫЧАХ | Прохождение #2 | ELDEN RING Shadow of the Erdtree | стрим 9', 'duration': 23578, 'status': 'past', 'video_type': 'video', 'start_time': None, 'id': '2184933543', '_output_writer': <chat_downloader.output.continuous_write.ContinuousWriter object at 0x7f8ce7fc0250>, '_output_callback': None, 'format': <function ChatDownloader.get_chat.<locals>.<lambda> at 0x7f8ce7f83880>, 'site': <chat_downloader.sites.twitch.TwitchChatDownloader object at 0x7f8ce8c21e50>}
[INFO] Retrieving chat for "DLC НА КАЗУАЛЫЧАХ | Прохождение #2 | ELDEN RING Shadow of the Erdtree | стрим 9".
[DEBUG] https://gql.twitch.tv:443 "POST /gql HTTP/11" 200 None
...
message={'fragments': [{'emote': None, 'text': 'Спасибо за стрим ', '__typename': 'VideoCommentMessageFragment'}, {'emote': {'id': '196892;31;41', 'emoteID': '196892', 'from': 31, '__typename': 'EmbeddedEmote'}, 'text': 'TwitchUnity', '__typename': 'VideoCommentMessageFragment'}, {'emote': None, 'text': ' Удовольствия от игры', '__typename': 'VideoCommentMessageFragment'}], 'userBadges': [], 'userColor': '#FF69B4', '__typename': 'VideoCommentMessage'}
fragment={'emote': None, 'text': 'Спасибо за стрим ', '__typename': 'VideoCommentMessageFragment'}
fragment={'emote': {'id': '196892;31;41', 'emoteID': '196892', 'from': 31, '__typename': 'EmbeddedEmote'}, 'text': 'TwitchUnity', '__typename': 'VideoCommentMessageFragment'}
fragment={'emote': None, 'text': ' Удовольствия от игры', '__typename': 'VideoCommentMessageFragment'}
[DEBUG] Writing to file: test.jsonl
5:58:29 | NIKI_ORNIS: Спасибо за стрим TwitchUnity Удовольствия от игры
...
[INFO] Finished retrieving chat messages.
[DEBUG] Session closed.
Actual content of test.jsonl (prettified)
{
"author": {
"colour": "#FF69B4",
"display_name": "NIKI_ORNIS",
"id": "458636669",
"name": "niki_ornis"
},
"emotes": [
{
"id": "196892",
"images": [
{
"height": 28,
"id": "28x28-light",
"url": "https://static-cdn.jtvnw.net/emoticons/v2/196892/default/light/1.0",
"width": 28
},
{
"height": 56,
"id": "56x56-light",
"url": "https://static-cdn.jtvnw.net/emoticons/v2/196892/default/light/2.0",
"width": 56
},
{
"height": 112,
"id": "112x112-light",
"url": "https://static-cdn.jtvnw.net/emoticons/v2/196892/default/light/3.0",
"width": 112
},
{
"height": 28,
"id": "28x28-dark",
"url": "https://static-cdn.jtvnw.net/emoticons/v2/196892/default/dark/1.0",
"width": 28
},
{
"height": 56,
"id": "56x56-dark",
"url": "https://static-cdn.jtvnw.net/emoticons/v2/196892/default/dark/2.0",
"width": 56
},
{
"height": 112,
"id": "112x112-dark",
"url": "https://static-cdn.jtvnw.net/emoticons/v2/196892/default/dark/3.0",
"width": 112
}
],
"locations": "31-41",
"name": ""
}
],
"message": "\u0421\u043f\u0430\u0441\u0438\u0431\u043e \u0437\u0430 \u0441\u0442\u0440\u0438\u043c TwitchUnity \u0423\u0434\u043e\u0432\u043e\u043b\u044c\u0441\u0442\u0432\u0438\u044f \u043e\u0442 \u0438\u0433\u0440\u044b",
"message_id": "5bc4d778-e3fa-45da-bdb4-0206dd035902",
"message_type": "text_message",
"time_in_seconds": 21509,
"time_text": "5:58:29",
"timestamp": 1719705721803000
}
Expected content of test.jsonl (prettified)
The name field of the emote should be populated:
{
"author": {
"colour": "#FF69B4",
"display_name": "NIKI_ORNIS",
"id": "458636669",
"name": "niki_ornis"
},
"emotes": [
{
"id": "196892",
"images": [
{
"height": 28,
"id": "28x28-light",
"url": "https://static-cdn.jtvnw.net/emoticons/v2/196892/default/light/1.0",
"width": 28
},
{
"height": 56,
"id": "56x56-light",
"url": "https://static-cdn.jtvnw.net/emoticons/v2/196892/default/light/2.0",
"width": 56
},
{
"height": 112,
"id": "112x112-light",
"url": "https://static-cdn.jtvnw.net/emoticons/v2/196892/default/light/3.0",
"width": 112
},
{
"height": 28,
"id": "28x28-dark",
"url": "https://static-cdn.jtvnw.net/emoticons/v2/196892/default/dark/1.0",
"width": 28
},
{
"height": 56,
"id": "56x56-dark",
"url": "https://static-cdn.jtvnw.net/emoticons/v2/196892/default/dark/2.0",
"width": 56
},
{
"height": 112,
"id": "112x112-dark",
"url": "https://static-cdn.jtvnw.net/emoticons/v2/196892/default/dark/3.0",
"width": 112
}
],
"locations": "31-41",
"name": "TwitchUnity"
}
],
"message": "\u0421\u043f\u0430\u0441\u0438\u0431\u043e \u0437\u0430 \u0441\u0442\u0440\u0438\u043c TwitchUnity \u0423\u0434\u043e\u0432\u043e\u043b\u044c\u0441\u0442\u0432\u0438\u044f \u043e\u0442 \u0438\u0433\u0440\u044b",
"message_id": "5bc4d778-e3fa-45da-bdb4-0206dd035902",
"message_type": "text_message",
"time_in_seconds": 21509,
"time_text": "5:58:29",
"timestamp": 1719705721803000
}
Additional context/information
Twitch GQL gives the start and end of an emote code within the chat text as byte offsets, so for messages containing non-ASCII characters the locations must be applied to the UTF-8 encoded bytes of the Python string, not to the string itself.
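As a quick check of that claim, the fragment texts from the debug output above can be measured both ways. This is only an illustrative sketch: the variable names are mine, and the UTF-8 encoding is the assumption being tested against the reported 'from': 31 and the "31-41" locations value.

```python
# Fragment texts copied from the debug output above.
prefix = "Спасибо за стрим "   # text of the fragment preceding the emote
emote_text = "TwitchUnity"     # text of the emote fragment

char_offset = len(prefix)                  # offset counted in code points
byte_offset = len(prefix.encode("utf-8"))  # offset counted in UTF-8 bytes
byte_end = byte_offset + len(emote_text.encode("utf-8")) - 1  # inclusive end

print(char_offset, byte_offset, byte_end)  # 17 31 41
```

Only the byte-based numbers agree with the 'from': 31 and "locations": "31-41" values returned by GQL.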
The fix is straightforward:
'name': message_text.encode("utf-8")[begin:end + 1].decode("utf-8")
instead of
'name': message_text[begin:end + 1]
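To see the difference in isolation, here is a minimal standalone sketch that applies both slicing strategies to the message from this report. The helper names (emote_name_by_bytes, emote_name_by_chars) are hypothetical and not part of chat_downloader. The value the character-based slice returns inside the library may differ from this demo, depending on what message_text holds at that point.

```python
def emote_name_by_bytes(message_text: str, begin: int, end: int) -> str:
    """Proposed fix: treat begin/end as inclusive UTF-8 byte offsets."""
    return message_text.encode("utf-8")[begin:end + 1].decode("utf-8")


def emote_name_by_chars(message_text: str, begin: int, end: int) -> str:
    """Current behaviour: treat begin/end as inclusive character offsets."""
    return message_text[begin:end + 1]


# Message text and emote location taken from the output in this report.
message_text = "Спасибо за стрим TwitchUnity Удовольствия от игры"
begin, end = 31, 41

print(repr(emote_name_by_bytes(message_text, begin, end)))  # 'TwitchUnity'
print(repr(emote_name_by_chars(message_text, begin, end)))  # not the emote name
```

For pure-ASCII messages both helpers return the same result, which would explain why the problem only shows up when non-ASCII text precedes the emote.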