Add ability to find events by slide text & captions in search #1189

LukasKalbertodt · 2024-06-24T17:02:05Z

Fixes #677

For testers

Test

This PR does not contain any changes to the search page except adding this timeline. Further changes are planned for later; this PR is already big enough.

Also note that the usefulness and the UX of this feature depend a lot on the available data! On our test instance, roughly 2500 events have OCR'ed slide texts, while only very few have subtitles. I will try to upload more videos soon to simply have more videos with subtitles available. Subtitle timespans are usually short (on the order of seconds or tens of seconds), while the timespans associated with slide text can have durations of many minutes.

Questions/discussions

Unlike the old video portal, this shows the actual text that was matched (with a small context). I find this cool, but it of course somewhat exposes how bad the OCR slide text and automatic subtitles sometimes are.
What do you think about the timeline design and how the matched text is highlighted?
Report any query that leads to "internal server error" please. That should obviously never happen.

Search terms to get started

While testing myself I found a few good queries to get started. Of course, do try your own ones and also try prefixes of these to see how well it works. Also try multiple query words.

open: big mixed bag
meilisearch: finds two Tobira videos talking about Meilisearch (never mentioned in metadata)
tycho: finds the "Tycho crater" in the NASA moon video subtitles AND text detection
crater: lots of usages in the subtitles of the NASA moon video
pyroxene: finds the mineral in NASA moon subtitles
elasticsearch: lots of matches in Opencast-related videos
postgres: showing some videos with "postgres" in the title first (makes sense) and only then ones that only mention postgres
videoportal: obvious Tobira videos, but also one unrelated video screen-sharing the old ETH video portal briefly and one mentioning "videoportal" in its slides
feynman: further down lots of videos just mentioning feynman

Technical info

This PR has these main parts:

Add DB table event_texts (for storing all texts belonging to an event)
Add event_texts_queue and process to automatically download text assets from Opencast (this is run as part of the worker)
- This was actually the trickiest part: making it robust against random errors, from Opencast or otherwise, so that it works well enough in most cases without ever running into a super busy loop or something like that.
Various helper sub commands to manage fetching assets
VTT and MPEG7 parsers to parse the text assets
Add texts to MeiliSearch in a special encoded form to optimize for Meili-search-performance while still allowing us to figure out the timespans of a match
Make frontend use this data and show a timeline with matches for events
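
As a rough illustration of the VTT side of the parsing step listed above, the following TypeScript sketch extracts cue timespans and text from a WebVTT string. It is not the parser from this PR (which lives in the Rust backend); the names `Cue`, `parseTimestamp` and `parseVtt` are made up for this example, and it ignores cue settings, styling blocks and most error handling.

```typescript
// A cue: a timespan in milliseconds plus the caption text shown during it.
interface Cue {
    startMs: number;
    endMs: number;
    text: string;
}

// Parses a WebVTT timestamp of the form "hh:mm:ss.ttt" or "mm:ss.ttt".
// Returns null if the input does not have that shape.
function parseTimestamp(raw: string): number | null {
    const m = /^(?:(\d{2,}):)?(\d{2}):(\d{2})\.(\d{3})$/.exec(raw.trim());
    if (m === null) {
        return null;
    }
    const hours = m[1] === undefined ? 0 : Number(m[1]);
    return ((hours * 60 + Number(m[2])) * 60 + Number(m[3])) * 1000 + Number(m[4]);
}

// Splits the file into blank-line separated blocks and keeps every block
// that contains a "start --> end" timing line.
function parseVtt(input: string): Cue[] {
    const cues: Cue[] = [];
    const blocks = input.replace(/\r\n?/g, "\n").split(/\n{2,}/);
    for (const block of blocks) {
        const lines = block.split("\n");
        const timingIdx = lines.findIndex(line => line.includes("-->"));
        if (timingIdx === -1) {
            continue; // Header, NOTE or STYLE block: no timing line.
        }
        const [startRaw, rest] = lines[timingIdx].split("-->");
        // Anything after the end timestamp would be cue settings; drop it.
        const endRaw = rest.trim().split(/\s+/)[0];
        const startMs = parseTimestamp(startRaw);
        const endMs = parseTimestamp(endRaw);
        if (startMs === null || endMs === null) {
            continue; // Malformed timing line: skip the cue.
        }
        const text = lines.slice(timingIdx + 1).join(" ").trim();
        cues.push({ startMs, endMs, text });
    }
    return cues;
}
```

Each resulting cue carries exactly the information the timeline needs: where the text appears and for how long.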

This can be mostly reviewed commit by commit. There are two times where I move a big chunk of code around that was added in a previous commit, but it should be fairly clear what and where.

Performance is kind of important for this one, since we are dealing with potentially lots of data. So far it seems like Meili responds within 25ms in all cases I tested. That's fine, but still a big increase from before. We should make sure that we don't accidentally introduce some slowness. Though right now I also have no idea how we would optimize further...

Something I want to improve in a follow-up PR: replace the busy polling in the "download assets" and "update search index" workers with LISTEN/NOTIFY events from Postgres. Right now, both default to 30s or so, which means that adding an event has quite a round trip (sync + 30s + 30s) before its text assets are searchable. That can be vastly reduced. But again, this PR is already big enough.

backend/src/args.rs

owi92

Looked through the code and tested a bunch but didn't find any obvious issues.
I'll do a final round of testing today and then this can be merged. My comments are of no real concern and definitely no blockers.

backend/src/sync/text/mpeg7.rs

backend/src/search/event.rs

frontend/src/routes/Search.tsx

LukasKalbertodt · 2024-07-01T11:26:00Z

@oas777 @dagraf Maybe the two of you want to take a short look at this before our Wednesday meeting. See the top comment for more information. But if you don't have the time till then, no worries, there will be plenty more time to discuss search-related stuff before this is released.

oas777 · 2024-07-03T12:20:59Z

First of all it's good to see this in action, thanks. Some initial observations:

https://pr1189.tobira.opencast.org/~search?q=tyco tells me we're currently not distinguishing where the results come from, right?
https://pr1189.tobira.opencast.org/~search?q=opencast tells me we're looking for pages, series, and videos, right?
In conjunction, I think we might have to make these distinctions clearer for users to understand the search results they are looking at (and their order in the sense of importance also).
Also, with the number of results for https://pr1189.tobira.opencast.org/~search?q=internet we probably have to think about filters.
https://pr1189.tobira.opencast.org/~search?q=schulte (blush) providing results for "schule" to me indicates search terms are too open.
I prefer a preview of the slides to a preview of the text extracted. The image also is an additional help to remember a certain part of the lecture you are looking for.
Design: The timeline looks odd, mainly because highlighted segments hover over the timeline. I prefer having them "stringed" on the actual timeline.
Font: The line with "Part of series" looks odd, even though it's probably supposed to be related to the line with the creator's name.

LukasKalbertodt · 2024-07-03T13:46:17Z

https://pr1189.tobira.opencast.org/~search?q=tyco tells me we're currently not distinguishing where the results come from, right?

Do you mean "slide text" vs "captions"? Yes, currently both are treated as one thing. Is that different in your current portal?

https://pr1189.tobira.opencast.org/~search?q=opencast tells me we're looking for pages, series, and videos, right?

Correct, which was like that already before. There will be some improvements there in an upcoming PR, like combining a series with the page listing only that series, as having these as two separate results is fairly useless.

In conjunction, I think we might have to make these distinctions clearer for users to understand the search results they are looking at (and their order in the sense of importance also).

That is also something I'm planning to do in the upcoming PR. I am not sure if I will succeed with that, as it requires clever design, but yeah: my goal is that it's clear at a glance whether I'm looking at a video (which should be most results), a series or something else.

Also, with the number of results for https://pr1189.tobira.opencast.org/~search?q=internet we probably have to think about filters.

Filters are of course planned already, and in fact some basic ones are already implemented. That feature is still hidden though, and will be reenabled with, you guessed it, my upcoming PR.

Apart from that, I would expect most users to just specify more query terms. I can't imagine a scenario where someone wants to find a video that they just remember had "internet" in it. And thanks to the clever ranking, users can just add a bunch of query words that they think are relevant, and the result containing most of these words will be shown first. Not to say we don't want filters -- we do -- but these search engines make filters less necessary as just adding more search terms usually works out.

https://pr1189.tobira.opencast.org/~search?q=schulte (blush) providing results for "schule" to me indicates search terms are too open.

Mh I'm not sure I agree. That's typo tolerance in action. All videos by you (with an exact "schulte" match) are sorted before all other videos. So in my book that's exactly as it should be. And as a last resort, you can always search by "schulte" (with quotes), which works exactly like in Google or most other search engines: looking for that term exactly.
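
The ordering described above (exact matches first, typo-tolerant matches after, quotes forcing an exact match) can be sketched in a few lines of TypeScript. This is only a toy model and not how Meilisearch actually ranks results; the `editDistance` and `rankByTypos` helpers and the one-typo limit are assumptions made for illustration.

```typescript
// Classic Levenshtein distance (insertions, deletions, substitutions),
// computed row by row.
function editDistance(a: string, b: string): number {
    let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
    for (let i = 1; i <= a.length; i++) {
        const curr = [i];
        for (let j = 1; j <= b.length; j++) {
            const cost = a[i - 1] === b[j - 1] ? 0 : 1;
            curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
        }
        prev = curr;
    }
    return prev[b.length];
}

// Keeps the words within `maxTypos` edits of the query, fewest typos first.
// With `exact` set (comparable to quoting the term), only distance 0 passes.
function rankByTypos(query: string, words: string[], maxTypos = 1, exact = false): string[] {
    const q = query.toLowerCase();
    const limit = exact ? 0 : maxTypos;
    return words
        .map(word => ({ word, typos: editDistance(q, word.toLowerCase()) }))
        .filter(entry => entry.typos <= limit)
        .sort((x, y) => x.typos - y.typos)
        .map(entry => entry.word);
}
```

In this model, a query for `schulte` over the words "Schule", "Schulte" and "Internet" keeps "Schulte" ahead of "Schule" and drops "Internet"; with `exact` set, only "Schulte" remains.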

I prefer a preview of the slides to a preview of the text extracted. The image also is an additional help to remember a certain part of the lecture you are looking for.

Mh fair, the image seems useful. We don't always have an image though, especially for search results in captions, it might not be clear what to show. And: would you not show the extracted text at all then? I think it's useful.

Design: The timeline looks odd, mainly because highlighted segments hover over the timeline. I prefer having them "stringed" on the actual timeline.

So more like the design in your current video portal?

Font: The line with "Part of series" looks odd, even though it's probably supposed to be related to the line with the creator's name.

Not sure I understand.

oas777 · 2024-07-03T13:59:31Z

So more like the design in your current video portal?

Font: The line with "Part of series" looks odd, even though it's probably supposed to be related to the line with the creator's name.

Not sure I understand.

Weird coincidence, but it's similar to what I just said for Paella:

looks like five different fonts for five different text elements.

oas777 · 2024-07-06T10:36:05Z

For reference, here's how Kaltura organises search results in the UZH video portal.

Their search seems "too open" as well, providing results for "Stein" might be ok, but "Einstellungen" and "Kleinstaaten" blurs results; not sure why "Die" is also listed.
I like the fact that you can filter this a) by source and b) date - they call it relevance, but I think "date ascending/descending" and "semester" would be perfect.
Clear distinction between "Channels" and videos, which we also need, especially if we add "pages".

dagraf · 2024-07-09T20:04:07Z

For reference, here's how Kaltura organises search results in the UZH video portal.
* Their search seems "too open" as well, providing results for "Stein" might be ok, but "Einstellungen" and "Kleinstaaten" blurs results; not sure why "Die" is also listed.

I agree that the UZH results are too open. Olaf's example where he was looking for "Schulte" and "Schule" also showed up in a video further down does not bother me.

* I like the fact that you can filter this a) by source and b) date - they call it relevance, but I think "date ascending/descending" and "semester" would be perfect.

Me too.

* Clear distinction between "Channels" and videos, which we also need, especially if we add "pages".

I agree.

Additionally:

I like the fact that the expressions or words that are responsible for the video showing up in the search results are highlighted.
Timeline: I would prefer the highlighted blocks not to hover over the timeline but to look more like in the current video portal of ETH. And for me, an icon just before the timeline (e.g., a "Play" icon) would help to understand immediately what this strange line with the blocks is all about. But maybe if we someday have our thumbnails to the left, this will not be necessary anymore.

LukasKalbertodt · 2024-09-04T08:18:21Z

The size of the white canvas in which slide and transcript results are presented seems to depend on the length of the text found, which makes it uneasy to look at (first result for "video", Matthias' talk).

Yeah true, fair point, will try to fix that.

I spoke too quickly: how would you even fix this? The problematic case is when the matched text is long. I think there are only these options, none of which is optimal:

Cut off the matched text, only showing a very tiny section around the match: I think the length of the shown text is already very short and reducing the context around the match isn't great.
Let the text wrap into more than two lines: also not sure how useful that is.
Make the image larger: blowing up these low resolution images isn't great, and having such a big tooltip isn't great either.

I guess I can somewhat improve the situation by making the image a bit larger still and reducing the max-width of the tooltip a bit (i.e. cutting more text off). But I don't think I can come up with a perfect solution.

dagraf · 2024-09-04T12:52:53Z

For slides results, skip the text extract, avoiding gibberish like "Schuleetal 13032024 7" and distinguishing results more clearly from those in audio/transcript.

It's usually not possible to see the actual matched text in the slide due to the low resolution. And the matched part wouldn't be highlighted as it is now. I think the actual matched text can help a lot with giving context. Also: the slide preview image is of no use to a blind person, so accessibility-wise it is useful to show the matched text. But of course, showing gibberish is not useful either. So I will make it configurable at least.

I'm strongly in favour of keeping the text extract also when there is a slide. Accessibility-wise and also for all other users the shown text can help with giving context.

For transcript results, maybe add quotation marks? "…Entwicklung scheint das Zauberwort der Zukun 211 sein und… "

That's a good idea. Unfortunately, there is hidden complexity. There are many different kinds of quotation marks and using the right one depends on the language of the text. „German”, “English”, »German alternative and Danish«, « French », «Swiss French and German?» ... just to name a few of languages/cultures familiar to us. There are way more. And the problem is knowing the language of the search result. We can check the language of the video (which is already annoying but ok), but that might not be set or not match the text in question. It might be possible to only use one style of quotation mark by having them far away from the text and stylized somehow, but still... I would probably just leave as is.
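
To make the difficulty concrete, here is a small TypeScript sketch of what language-dependent quoting would involve. The table is deliberately tiny and the `quoteFor` helper is hypothetical; nothing like it is part of this PR. The interesting case is the fallback, since the language of a matched text is often unknown.

```typescript
// Opening and closing quotation marks for a few languages. Real-world
// conventions have more variants than this (e.g. regional differences).
const QUOTES = new Map<string, [string, string]>([
    ["de", ["\u201E", "\u201C"]],       // „German“
    ["en", ["\u201C", "\u201D"]],       // “English”
    ["da", ["\u00BB", "\u00AB"]],       // »Danish«
    ["fr", ["\u00AB\u00A0", "\u00A0\u00BB"]], // « French », with no-break spaces
]);

// Wraps `text` in the quotation marks for `lang` (a tag like "de" or
// "de-CH"). Returns the text unchanged if the language is unknown or
// missing, which mirrors the "leave as is" fallback.
function quoteFor(text: string, lang?: string): string {
    const base = lang?.toLowerCase().split("-")[0];
    const marks = base === undefined ? undefined : QUOTES.get(base);
    return marks === undefined ? text : marks[0] + text + marks[1];
}
```

Note that this sketch would quote a `de-CH` text in the German style even though Swiss texts typically use «guillemets», which is exactly the kind of detail that makes a correct solution hard.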

I'm fine to leave it as is.

Maybe distinguish the two in color in the timeline?

Or are you doing this already? Search "video" and see David's talk:

Yes, this is already done. It's just a different brightness of the color, not a different hue. But we don't really have a different-hue-color we could use (except the "danger color", but that's not fitting).

I don't get this point. Let's discuss it in today's meeting.

dagraf · 2024-09-04T12:56:54Z

The size of the white canvas in which slide and transcript results are presented seems to depend on the length of the text found, which makes it uneasy to look at (first result for "video", Matthias' talk).

Yeah true, fair point, will try to fix that.

I spoke too quickly: how would you even fix this? The problematic case is when the matched text is long. I think there are only these options, none of which is optimal:
* Cut off the matched text, only showing a very tiny section around the match: I think the length of the shown text is already very short and reducing the context around the match isn't great.
* Let the text wrap into more than two lines: also not sure how useful that is.
* Make the image larger: blowing up these low resolution images isn't great, and having such a big tooltip isn't great either.

I guess I can somewhat improve the situation by making the image a bit larger still and reducing the max-width of the tooltip a bit (i.e. cutting more text off). But I don't think I can come up with a perfect solution.

I would suggest fixing the width of the shown white canvas (see screenshot for a possible size) and cutting off the matched text so it always fits into this canvas and does NOT wrap into more than one line.
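
A fixed-width, single-line tooltip means the text has to be cropped around the match. A minimal TypeScript sketch of such cropping follows; `cropAroundMatch`, its parameters and the character-based length limit are assumptions for illustration only (real widths depend on the font, and this is not code from the PR).

```typescript
// Crops `text` to at most `maxLen` characters (assuming `maxLen >= 3`)
// while keeping the start of the match (at `matchStart`, `matchLen`
// characters long) visible and roughly centered. An ellipsis marks each
// side where text was cut off.
function cropAroundMatch(text: string, matchStart: number, matchLen: number, maxLen: number): string {
    if (text.length <= maxLen) {
        return text;
    }
    const ellipsis = "\u2026";
    // Characters available for actual text once both ellipses are reserved.
    const inner = Math.max(1, maxLen - 2 * ellipsis.length);
    // Context left over after the match itself, split between both sides.
    const budget = Math.max(0, inner - matchLen);
    let start = Math.max(0, matchStart - Math.floor(budget / 2));
    const end = Math.min(text.length, start + inner);
    // If the window hit the end of the text, shift it left to use all space.
    start = Math.max(0, end - inner);
    const prefix = start > 0 ? ellipsis : "";
    const suffix = end < text.length ? ellipsis : "";
    return prefix + text.slice(start, end).trim() + suffix;
}
```

Because the result never exceeds `maxLen` characters, the tooltip can keep a constant width and stay on one line, at the cost of less context around long matches.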

oas777 · 2024-09-04T13:48:54Z

PS: No one mentioned the color of the underscore, right?

Where does that come from in a blueish color scheme?

LukasKalbertodt · 2024-09-05T12:17:39Z

Okay, I tried to change what we discussed. However, I decided to have different "tooltip size" behavior depending on whether a preview image is available:

No image: the tooltip is sized dynamically according to the amount of text.
With image: the tooltip has a fixed height, the width can vary just a tiny bit, if there is a lot of text.

The fixed height and some other changes make hovering over parts of the timeline a lot smoother I think, i.e. less jumping around of the tooltip. Let me know what you think about the remaining width-flexibility for large text.

As a second step, I added a faint icon to the bottom corner of the tooltip to show what kind of result we are talking about. Please also let me know what you think.

oas777 · 2024-09-05T12:27:42Z

Thanks, Lukas. The icons are very helpful, though the one for slide content could be more along the lines of [image].
"Tooltip size" is better/more coherent also, though there is a (different) problem with tooltips below the timeline:

LukasKalbertodt · 2024-09-06T07:31:47Z

there is a (different) problem with tooltips below the timeline

Good catch, should be fixed now.

the one for slide content could be more along the lines of

Is that a Lucide icon you found there? I could not find it anywhere. I checked Lucide and found these that somewhat match your suggestion: scan-text, letter-text and file-text. Oh and there is presentation?

oas777 · 2024-09-06T07:34:58Z

No, that was me searching Google. I like "letter-text".

oas777 · 2024-09-09T11:07:05Z

Looks good, thanks. Allow me to mention that two tooltips are visible at the same time sometimes when you hover the timeline (slowly and) horizontally:

Couldn't detect a logic for this happening and maybe it's only due to the number of results.

This fix was easier than anticipated (once I figured this out anyway). The problem is: giving the `WithTooltip` a `z-index` means that the parent element (containing both the trigger element and the tooltip) creates a new stacking context. And a stacking context and its children behave as one unit with regard to other stacking contexts. So even when setting the z-index of the tooltip itself to 100, two timelines are two different stacking contexts, so the 100 of the one tooltip is not rendered in front of the 4 of the trigger elements of another timeline. So there were two options:

- Either introduce another div in the floating components that would have the trigger element as a child, but not the tooltip, and that could get a z-index then. But that would have required changing appkit.
- Or remove the z-index from `WithTooltip` to avoid creating stacking contexts.

First I thought the only way to do the latter is to remove the big clickable link area of the search results. And we might still do that, as it often brings annoying problems. But for this particular problem, the solution turned out easier than I thought.
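
The stacking behavior described above can be modelled in a few lines. The following TypeScript sketch is a simplification, not browser code: it describes every element by its chain of enclosing stacking contexts and compares chains from the outside in, with document order breaking ties. The `Level` type, the `paintsAbove` function and the wrapper `z-index` of 4 are made up for illustration; only the values 100 (tooltip) and 4 (trigger) come from the description above.

```typescript
// One step in an element's chain of stacking contexts: the z-index at that
// level and the position in document order among its siblings.
interface Level {
    z: number;
    order: number;
}

// Returns true if an element with chain `a` is painted above one with chain
// `b`. Chains are compared from the outermost context inwards and the first
// differing level decides, so a nested z-index can never "escape" its parent
// stacking context. (Simplified: assumes non-negative z-index values.)
function paintsAbove(a: Level[], b: Level[]): boolean {
    for (let i = 0; i < Math.min(a.length, b.length); i++) {
        if (a[i].z !== b[i].z) {
            return a[i].z > b[i].z;
        }
        if (a[i].order !== b[i].order) {
            return a[i].order > b[i].order;
        }
    }
    // One chain is a prefix of the other: the descendant paints above.
    return a.length > b.length;
}

// Before: each wrapper has a z-index and so forms its own stacking context.
// The tooltip of the first timeline loses against the second timeline.
const tooltipInContext: Level[] = [{ z: 4, order: 0 }, { z: 100, order: 0 }];
const triggerInContext: Level[] = [{ z: 4, order: 1 }, { z: 0, order: 0 }];
const before = paintsAbove(tooltipInContext, triggerInContext); // false

// After: no z-index on the wrappers, so tooltip and triggers are compared
// directly within the same stacking context.
const tooltipFlat: Level[] = [{ z: 100, order: 0 }];
const triggerFlat: Level[] = [{ z: 4, order: 1 }];
const after = paintsAbove(tooltipFlat, triggerFlat); // true
```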

Unfortunately, this required adding a new package. `react-icons` has a super old version of lucide icons, even in its newest version. Since almost all of our icons are from Lucide anyway, it makes sense to just use their packages directly.

LukasKalbertodt · 2024-09-09T11:25:55Z

two tooltips are visible at the same time sometimes

Yes I noticed this too :/ I think it's a glitch in some other package and we can't do a whole lot about it. But we can still look into it again.

We are currently still trying to fix some weird technical bug. But once that's done, this can be merged I think.

Luckily, Ole noticed this weird error that only occurred sometimes. I was able to find a repro: on search page, click one video, go back (via browser), then hover over some text timeline and then click that item. I am not exactly sure what relay does wrong here, but also requesting the ID on the search page makes relay understand stuff, so that the video page is not rendered with the incomplete query from the search page.

owi92 · 2024-09-09T12:34:43Z

Nice, so apart from the glitch Lukas and Olaf mentioned above everybody's happy and the technical bug also appears to be fixed. I think this is a great improvement. I'll merge this now 👍

LukasKalbertodt added the changelog:user User facing changes label Jun 24, 2024

github-actions bot temporarily deployed to test-deployment-pr1189 June 24, 2024 17:08 Destroyed

LukasKalbertodt force-pushed the searchable-text branch from a2adb5a to 1c23f3c Compare June 24, 2024 17:14

github-actions bot temporarily deployed to test-deployment-pr1189 June 24, 2024 17:17 Destroyed

LukasKalbertodt force-pushed the searchable-text branch from 1c23f3c to 163a682 Compare June 25, 2024 10:24

github-actions bot temporarily deployed to test-deployment-pr1189 June 25, 2024 10:29 Destroyed

LukasKalbertodt force-pushed the searchable-text branch 2 times, most recently from 1c7e401 to 0661fd5 Compare June 26, 2024 15:28

github-actions bot temporarily deployed to test-deployment-pr1189 June 26, 2024 15:30 Destroyed

github-actions bot temporarily deployed to test-deployment-pr1189 June 27, 2024 12:06 Destroyed

owi92 reviewed Jun 27, 2024

backend/src/args.rs

owi92 approved these changes Jul 1, 2024

backend/src/sync/text/mpeg7.rs Outdated

backend/src/search/event.rs Outdated

frontend/src/routes/Search.tsx Outdated

frontend/src/routes/Search.tsx

github-actions bot temporarily deployed to test-deployment-pr1189 July 1, 2024 10:03 Destroyed

owi92 approved these changes Jul 1, 2024

github-actions bot added the status:conflicts This PR has conflicts that need to be resolved label Jul 1, 2024

LukasKalbertodt force-pushed the searchable-text branch from e15a9cc to 203793b Compare July 1, 2024 12:42

github-actions bot removed the status:conflicts This PR has conflicts that need to be resolved label Jul 1, 2024

github-actions bot temporarily deployed to test-deployment-pr1189 July 1, 2024 12:44 Destroyed

LukasKalbertodt added this to the 2.11 milestone Jul 1, 2024

github-actions bot added the status:conflicts This PR has conflicts that need to be resolved label Jul 2, 2024

LukasKalbertodt modified the milestones: 2.11, 2.12 Jul 22, 2024

LukasKalbertodt marked this pull request as ready for review September 4, 2024 11:46

Adjust text match tooltip sizing

4ae7c7d

Since the area for showing the text is made a bit smaller, we also use less context in the backend.

github-actions bot temporarily deployed to test-deployment-pr1189 September 5, 2024 12:11 Destroyed

github-actions bot temporarily deployed to test-deployment-pr1189 September 6, 2024 07:31 Destroyed

github-actions bot temporarily deployed to test-deployment-pr1189 September 6, 2024 08:23 Destroyed

Add dim icon to text match tooltip to show kind of match

3f0eacf

LukasKalbertodt force-pushed the searchable-text branch 2 times, most recently from 30b3363 to 18ae926 Compare September 9, 2024 10:30

github-actions bot temporarily deployed to test-deployment-pr1189 September 9, 2024 10:32 Destroyed

LukasKalbertodt force-pushed the searchable-text branch from 18ae926 to 823f971 Compare September 9, 2024 10:40

github-actions bot temporarily deployed to test-deployment-pr1189 September 9, 2024 10:43 Destroyed

LukasKalbertodt added 2 commits September 9, 2024 13:16

LukasKalbertodt force-pushed the searchable-text branch from 823f971 to cb4721e Compare September 9, 2024 11:17

github-actions bot temporarily deployed to test-deployment-pr1189 September 9, 2024 11:19 Destroyed

github-actions bot temporarily deployed to test-deployment-pr1189 September 9, 2024 11:37 Destroyed

owi92 merged commit 5c69148 into elan-ev:master Sep 9, 2024
5 checks passed

LukasKalbertodt deleted the searchable-text branch September 9, 2024 12:36
