
Map item catch #365

Merged: 18 commits into master, Oct 12, 2023

Conversation

dale-wahl (Member)

This works... but it is logging to the original dataset, not the new one being created (plus I'm warning via the database, since datasets do not have the 4CAT log). I think it may make more sense to wrap map_item at the processor level. But map_item is a static method, and thus get_mapped_item (or whatever I call the new method) ought to be as well... but then I don't have a log or dataset or anything useful, really. Also, iterate_items is the main consumer of map_item and lives on the dataset. So perhaps I need a processor method to check whether map_item works, and then use get_mapped_item.

dale-wahl (Member Author) commented May 24, 2023

Created BasicProcessor class methods: get_mapped_item, which checks whether map_item returns actual data (and raises a new exception if not), and map_item_method_available, which checks whether a processor and dataset can use map_item.

I did not combine these methods, as it seems unnecessary to check the compatibility of processor and dataset every time get_mapped_item is used.

Note: get_mapped_item could also be used to catch things like the KeyErrors that are sometimes seen from map_item. This would allow a processor to skip an item missing some particular key. However, I did not do that, as I felt it may be better to crash the processor and force us to contend with the malformed item. In essence, get_mapped_item only catches purposefully skipped map_item results, e.g. when map_item returns {} (or False, None, some other falsy value, etc.).
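
For reference, a minimal sketch of the shape these two class methods might take, based only on the description above; the exception name, the `extension` attribute, and the exact bodies are assumptions rather than the actual 4CAT implementation:

```python
class MappedItemException(Exception):
    """Raised when map_item returns no usable data for an item."""


class BasicProcessor:
    # file extension this processor's map_item expects, e.g. "ndjson"
    # (attribute name is an assumption)
    extension = "ndjson"

    @classmethod
    def map_item_method_available(cls, dataset):
        """
        Check whether this processor and the given dataset can use map_item.

        Checked once per dataset, which is why it is kept separate from
        get_mapped_item.
        """
        dataset_extension = dataset.get_results_path().suffix.lstrip(".")
        return hasattr(cls, "map_item") and dataset_extension == cls.extension

    @classmethod
    def get_mapped_item(cls, item):
        """
        Run map_item and raise if it returns no actual data.

        Only purposefully skipped results (map_item returning {}, None,
        False or another falsy value) raise here; genuine KeyErrors from
        malformed items are left uncaught so the processor crashes.
        """
        mapped_item = cls.map_item(item)
        if not mapped_item:
            raise MappedItemException("map_item returned no data for item")
        return mapped_item
```

A caller such as iterate_items would then check map_item_method_available once per dataset and wrap each item in get_mapped_item, catching the exception to skip purposefully empty results.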

dale-wahl requested a review from stijn-uva on May 24, 2023 at 14:40
dale-wahl marked this pull request as ready for review on May 24, 2023 at 14:41
dale-wahl (Member Author)

Tested it on all the ZeeSchuimer datasources. Apparently there is some other oddity in LinkedIn, but this update works as intended and notifies both us and the user.

`get_item_keys` would also raise an error on a non-CSV/NDJSON file. It will return the first item's keys even when `map_item` does not exist... but that is how it always functioned, so I am leaving it the same.

Quoted from the diff:

```python
# only run item mapper if extension of processor == extension of
# data file, for the scenario where a csv file was uploaded and
# converted to an ndjson-based data source, for example
```

Member
@dale-wahl is there any datasource that does this at the moment?

dale-wahl (Member Author)

That's some old voodoo (I think it is yours, to be honest!)... I extracted that code and comment into a new method to use as a check for whether map_item should be run, but I did not modify the code. I believe it has to do with custom imports of, say, a Twitter dataset. You could convert the dataset type from custom to a Twitter datasource type, but it would not be an NDJSON as expected. That would also presumably apply to any ZeeSchuimer datasource (an uploaded custom Douyin/Instagram CSV would not be able to use map_item). In practice, I am not exactly sure how often we ran into this problem, since users cannot change data types, only admins can (am I wrong about that?).

Two outdated review threads on backend/lib/search.py were resolved.

Quoted from the diff:

```python
# not a valid NDJSON file?
return []

if (self.get_results_path().suffix.lower() == ".csv") or (
        self.get_results_path().suffix.lower() == ".ndjson"
        and self.get_own_processor() is not None
        and self.get_own_processor().map_item_method_available(dataset=self)):
```
Member

Are the extension checks necessary here? I think if there is a map_item there will be columns and if not then not, regardless of extension


(I know they were already there, but I think they might be a left-over from when we were less extension-agnostic)

dale-wahl (Member Author)

I vaguely remember an issue here with something in the frontend: get_item_keys would fail because get_own_processor would return None. But perhaps the extension checks are redundant. I can test this.

dale-wahl (Member Author)

CSV datasets do not have map_item, but do have columns we wish to return. NDJSON datasets should only return columns if they can be mapped (otherwise the keys are not consistent enough to be used as "columns"). Some datasets do not have processors (rare; from what I can tell, only deprecated datasets), and a check here avoids an error being raised by iterate_items.

Potentially we could combine the two Dataset methods, get_item_keys and get_columns; they are almost the same thing. get_item_keys is used in various processors, while get_columns is used by the frontend, most often via get_options. Currently, get_item_keys takes advantage of iterate_items and grabs the first item's keys; get_columns was a copy of the code from iterate_items, which is the code I'm modifying in this PR.

When attempting to consolidate the code, I realized that get_columns could not be completely replaced by get_item_keys because of the previously mentioned instances where we do not wish to return columns. Possibly we do not wish to return item keys in these instances either via get_item_keys, but I did not explore all usages of that function, as no errors were occurring. Most likely, get_columns returns False via the frontend, and so the backend never runs into a misuse of get_item_keys.
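
To make the relationship concrete, here is a hypothetical sketch of how get_item_keys and get_columns relate under the behavior described in this thread; the method names come from the discussion, but the bodies, the iterate_items signature, and the class skeleton are assumptions, not the actual 4CAT code:

```python
class DataSet:
    # the real class also defines iterate_items, get_results_path and
    # get_own_processor; they are referenced but not sketched here

    def get_item_keys(self, processor=None):
        """Return the keys of the first item the dataset yields."""
        for item in self.iterate_items(processor):
            # iterate_items applies map_item where possible, so these
            # keys reflect the mapped item
            return list(item.keys())
        return []

    def get_columns(self):
        """Return the dataset's columns, or [] when none can be determined."""
        extension = self.get_results_path().suffix.lower()
        own_processor = self.get_own_processor()

        if extension == ".csv":
            # CSV datasets have no map_item, but the header row provides
            # usable columns
            return self.get_item_keys()
        if (extension == ".ndjson" and own_processor is not None
                and own_processor.map_item_method_available(dataset=self)):
            # NDJSON keys are only consistent enough to act as columns
            # when the items can be mapped
            return self.get_item_keys()

        # unmappable NDJSON, or a deprecated dataset without a processor
        return []
```

Consolidating the two would mean get_item_keys also returning an empty result in the unmappable cases, which is the open question raised above.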

stijn-uva merged commit 1101a0a into master on Oct 12, 2023
1 check passed