Add ability to find events by slide text & captions in search #1189

Merged
merged 38 commits into from
Sep 9, 2024

@LukasKalbertodt LukasKalbertodt commented Jun 24, 2024

Fixes #677

For testers

Test

This PR does not contain any changes to the search page except adding this timeline. This is planned for later. This PR is already big enough.

Also note that the usefulness and the UX of this feature depend a lot on the available data! On our test instance, roughly 2500 events have OCR'ed slide texts, while only very few have subtitles. I will try to upload more videos soon to simply have more videos with subtitles available. Subtitle timespans are usually short (on the order of seconds or tens of seconds), while the timespans associated with slide text can have durations of many minutes.

Questions/discussions

  • Unlike the old video portal, this shows the actual text that was matched (with a small context). I find this cool, but it of course somewhat exposes how bad the OCR slide text and automatic subtitles sometimes are.
  • What do you think about the timeline design and how the matched text is highlighted?
  • Report any query that leads to "internal server error" please. That should obviously never happen.

Search terms to get started

While testing myself I found a few good queries to get started. Of course, do try your own ones and also try prefixes of these to see how well it works. Also try multiple query words.

  • open: big mixed bag
  • meilisearch: finds two Tobira videos talking about Meilisearch (never mentioned in metadata)
  • tycho: finds the "Tycho crater" in the NASA moon video subtitles AND text detection
  • crater: lots of usages in the subtitles of the NASA moon video
  • pyroxene: finds the mineral in NASA moon subtitles
  • elasticsearch: lots of matches in Opencast-related videos
  • postgres: shows some videos with "postgres" in the title first (makes sense) and only then ones that merely mention postgres
  • videoportal: obvious Tobira videos, but also one unrelated video screen-sharing the old ETH video portal briefly and one mentioning "videoportal" in its slides
  • feynman: further down lots of videos just mentioning feynman

Technical info

This PR has these main parts:

  • Add DB table event_texts (for storing all texts belonging to an event)
  • Add event_texts_queue and a process to automatically download text assets from Opencast (this is run as part of the worker)
    • This was actually the trickiest part: it has to be robust against random errors, from OC or otherwise, and work well enough in most cases without ever running into a super busy loop or something like that.
  • Various helper sub commands to manage fetching assets
  • VTT and MPEG7 parsers to parse the text assets
  • Add texts to MeiliSearch in a special encoded form to optimize for Meili-search-performance while still allowing us to figure out the timespans of a match
  • Make frontend use this data and show a timeline with matches for events
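
To illustrate the idea behind the parser and "figure out the timespans of a match" items above, here is a minimal sketch. It is not Tobira's actual parser or index encoding (the PR does not spell those out); the names `parse_vtt`, `build_index` and `find_timespans` are hypothetical, and the approach shown (one joined string plus a table of start offsets) is just one possible way to map a match position back to a timespan.

```python
import re
from bisect import bisect_right

# Hypothetical helpers for illustration only; not Tobira's real code.
CUE_TIMING = re.compile(
    r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})\s+-->\s+(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})"
)

def _to_ms(h, m, s, ms):
    return ((int(h or 0) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)

def parse_vtt(source):
    """Return a list of (start_ms, end_ms, text) cues from a WebVTT string."""
    cues = []
    for block in re.split(r"\n\s*\n", source.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = CUE_TIMING.search(line)
            if match:
                g = match.groups()
                text = " ".join(l.strip() for l in lines[i + 1:] if l.strip())
                if text:
                    cues.append((_to_ms(*g[:4]), _to_ms(*g[4:]), text))
                break  # blocks without a timing line (header, notes) are skipped
    return cues

def build_index(cues):
    """Join all lowercased cue texts into one string and record where each cue starts."""
    parts, starts, pos = [], [], 0
    for _, _, text in cues:
        lowered = text.lower()
        starts.append(pos)
        parts.append(lowered)
        pos += len(lowered) + 1  # +1 for the newline separator
    return "\n".join(parts), starts

def find_timespans(query, cues):
    """Return the (start_ms, end_ms) of every cue whose text contains `query`."""
    haystack, starts = build_index(cues)
    needle = query.lower()
    spans, pos = [], haystack.find(needle)
    while pos != -1:
        # The last cue starting at or before `pos` is the one containing the match.
        cue = cues[bisect_right(starts, pos) - 1]
        if (cue[0], cue[1]) not in spans:
            spans.append((cue[0], cue[1]))
        pos = haystack.find(needle, pos + 1)
    return spans

SAMPLE = """WEBVTT

00:00:01.000 --> 00:00:04.000
The Tycho crater is one of the

00:00:04.000 --> 00:00:07.500
most prominent craters on the Moon.
"""

cues = parse_vtt(SAMPLE)
tycho_spans = find_timespans("tycho", cues)
crater_spans = find_timespans("crater", cues)
```

The point of keeping one joined string is that the search side only has to deal with plain text, while the offset table is enough to recover which timespan a hit belongs to.
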

This can be mostly reviewed commit by commit. There are two times where I move a big chunk of code around that was added in a previous commit, but it should be fairly clear what and where.

Performance is kind of important for this one, since we are dealing with potentially lots of data. So far it seems like Meili responds within 25ms in all cases I tested. That's fine, but still a big increase from before. We should make sure that we don't accidentally introduce some slowness. Though right now I also have no idea how we would optimize further....

Something I want to improve in a follow-up PR: replace the busy polling in the "download assets" and "update search index" workers by LISTEN/NOTIFY events from Postgres. Right now, both default to 30s or so, which means that adding an event has quite a round trip (sync + 30s + 30s) before its text assets are searchable. That can be vastly reduced. But again, this PR is already big enough.
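
A rough sketch of why notifications shorten that round trip, under the assumption that the worker keeps a timeout as a fallback. This is not Tobira's worker code: an in-process `threading.Event` stands in for a Postgres LISTEN/NOTIFY channel, and `run_worker` and `demo` are hypothetical names.

```python
import threading
import time

def run_worker(wakeup, stop, process, poll_interval=30.0):
    """Run `process` whenever `wakeup` is set, or at the latest every `poll_interval` seconds."""
    while not stop.is_set():
        # Returns early when someone calls wakeup.set(); otherwise this is the fallback poll.
        wakeup.wait(timeout=poll_interval)
        wakeup.clear()
        if not stop.is_set():
            process()

def demo():
    """Send one 'notification' and measure how long the worker takes to react."""
    processed, done = [], threading.Event()
    wakeup, stop = threading.Event(), threading.Event()

    def process():
        processed.append(time.monotonic())
        done.set()

    worker = threading.Thread(target=run_worker, args=(wakeup, stop, process))
    worker.start()
    sent = time.monotonic()
    wakeup.set()           # stand-in for a NOTIFY arriving
    done.wait(timeout=5)
    stop.set()
    wakeup.set()           # wake the worker once more so it can see `stop` and exit
    worker.join(timeout=5)
    return len(processed), processed[0] - sent

runs, latency = demo()
```

With pure polling the worker would react only at the next 30s tick; with a wakeup signal it reacts as soon as the signal arrives, and the timeout only matters if a signal is missed.
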

@LukasKalbertodt LukasKalbertodt added the changelog:user User facing changes label Jun 24, 2024
@LukasKalbertodt LukasKalbertodt force-pushed the searchable-text branch 2 times, most recently from 1c7e401 to 0661fd5 Compare June 26, 2024 15:28
Member

@owi92 owi92 left a comment

Looked through the code and tested a bunch but didn't find any obvious issues.
I'll do a final round of testing today and then this can be merged. My comments are of no real concern and definitely no blockers.

@LukasKalbertodt
Member Author

@oas777 @dagraf Maybe the two of you want to take a short look at this before our Wednesday meeting. See the top comment for more information. But if you don't have the time till then, no worries, there will be plenty more time to discuss search-related stuff before this is released.

@LukasKalbertodt LukasKalbertodt added this to the 2.11 milestone Jul 1, 2024

@oas777
Collaborator

oas777 commented Jul 3, 2024

First of all it's good to see this in action, thanks. Some initial observations:

  • https://pr1189.tobira.opencast.org/~search?q=tyco tells me we're currently not distinguishing where the results come from, right?
  • https://pr1189.tobira.opencast.org/~search?q=opencast tells me we're looking for pages, series, and videos, right?
  • In conjunction, I think we might have to make these distinctions clearer for users to understand the search results they are looking at (and their order in the sense of importance also).
  • Also, with the number of results for https://pr1189.tobira.opencast.org/~search?q=internet we probably have to think about filters.
  • https://pr1189.tobira.opencast.org/~search?q=schulte (blush) providing results for "schule" to me indicates search terms are too open.
  • I prefer a preview of the slides to a preview of the text extracted. The image also is an additional help to remember a certain part of the lecture you are looking for.
  • Design: The timeline looks odd, mainly because highlighted segments hover over the timeline. I prefer having them "stringed" on the actual timeline.
  • Font: The line with "Part of series" looks odd, even though it's probably supposed to be related to the line with the creator's name.

@LukasKalbertodt
Member Author

https://pr1189.tobira.opencast.org/~search?q=tyco tells me we're currently not distinguishing where the results come from, right?

Do you mean "slide text" vs "captions"? Yes, currently both are treated as one thing. Is that different in your current portal?

https://pr1189.tobira.opencast.org/~search?q=opencast tells me we're looking for pages, series, and videos, right?

Correct, which was like that already before. There will be some improvements there in an upcoming PR, like combining a series with the page listing only that series, as having these as two separate results is fairly useless.

In conjunction, I think we might have to make these distinctions clearer for users to understand the search results they are looking at (and their order in the sense of importance also).

That is also something I'm planning to do in the upcoming PR. I am not sure if I will succeed with that, as it requires clever design, but yeah: my goal is that it's clear at a glance whether I'm looking at a video (which should be most results), a series or something else.

Also, with the number of results for https://pr1189.tobira.opencast.org/~search?q=internet we probably have to think about filters.

Filters are of course planned already, and in fact some basic ones are already implemented. That feature is still hidden though, and will be reenabled with, you guessed it, my upcoming PR.

Apart from that, I would expect most users to just specify more query terms. I can't imagine a scenario where someone wants to find a video that they just remember had "internet" in it. And thanks to the clever ranking, users can just add a bunch of query words that they think are relevant, and the result containing most of these words will be shown first. Not to say we don't want filters -- we do -- but these search engines make filters less necessary as just adding more search terms usually works out.

https://pr1189.tobira.opencast.org/~search?q=schulte (blush) providing results for "schule" to me indicates search terms are too open.

Mh, I'm not sure I agree. That's typo tolerance in action. All videos by you (with an exact "schulte" match) are sorted before all other videos. So in my book that's exactly as it should be. And as a last resort, you can always search by "schulte" (with quotes), which works exactly like in Google or most other search engines: looking for that term exactly.
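
As a toy illustration of that behaviour (not Meilisearch's actual algorithm or ranking rules, which are more involved), the sketch below uses plain Levenshtein distance: "schule" is one edit away from "schulte", so it is still returned, but the exact match sorts first. The function names and the `max_typos` parameter are made up for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def rank(query, words, max_typos=1):
    """Keep words within `max_typos` edits of the query, fewest typos first."""
    scored = sorted((edit_distance(query, w), w) for w in words)
    return [w for distance, w in scored if distance <= max_typos]

results = rank("schulte", ["schule", "internet", "schulte"])
exact_only = rank("schulte", ["schule", "internet", "schulte"], max_typos=0)
```

Setting `max_typos=0` here plays the role of the quoted search: only the exact term survives.
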

I prefer a preview of the slides to a preview of the text extracted. The image also is an additional help to remember a certain part of the lecture you are looking for.

Mh fair, the image seems useful. We don't always have an image though, especially for search results in captions, it might not be clear what to show. And: would you not show the extracted text at all then? I think it's useful.

Design: The timeline looks odd, mainly because highlighted segments hover over the timeline. I prefer having them "stringed" on the actual timeline.

So more like the design in your current video portal?

Font: The line with "Part of series" looks odd, even though it's probably supposed to be related to the line with the creator's name.

Not sure I understand.

@oas777
Collaborator

oas777 commented Jul 3, 2024

So more like the design in your current video portal?

Font: The line with "Part of series" looks odd, even though it's probably supposed to be related to the line with the creator's name.

Not sure I understand.

Weird coincidence, but it's similar to what I just said for Paella:

[Screenshot]

looks like five different fonts for five different text elements.

@oas777
Collaborator

oas777 commented Jul 6, 2024

For reference, here's how Kaltura organises search results in the UZH video portal.

  • Their search seems "too open" as well, providing results for "Stein" might be ok, but "Einstellungen" and "Kleinstaaten" blurs results; not sure why "Die" is also listed.
  • I like the fact that you can filter this a) by source and b) date - they call it relevance, but I think "date ascending/descending" and "semester" would be perfect.
  • Clear distinction between "Channels" and videos, which we also need, especially if we add "pages".

@dagraf
Collaborator

dagraf commented Jul 9, 2024

For reference, here's how Kaltura organises search results in the UZH video portal.

* Their search seems "too open" as well, providing results for "Stein" might be ok, but "Einstellungen" and "Kleinstaaten" blurs results; not sure why "Die" is also listed.

I agree that the UZH results are too open. Olaf's example where he was looking for "Schulte" and "Schule" also showed up in a video further down does not bother me.

* I like the fact that you can filter this a) by source and b) date - they call it relevance, but I think "date ascending/descending" and "semester" would be perfect.

Me too.

* Clear distinction between "Channels" and videos, which we also need, especially if we add "pages".

I agree.

Additionally:

  • I like the fact that the expressions or words that are responsible for the video showing up in the search results are highlighted.
  • Timeline: I would prefer the highlighted blocks not to hover over the timeline, but to look more like in the current video portal of ETH. And for me, an icon just before the timeline (e.g., a "Play" icon) would help to understand immediately what this strange line with the blocks is all about. But maybe if we someday have our thumbnails to the left, this will not be necessary anymore.

@LukasKalbertodt LukasKalbertodt modified the milestones: 2.11, 2.12 Jul 22, 2024
@LukasKalbertodt
Member Author

LukasKalbertodt commented Sep 4, 2024

  • The size of the white canvas in which slide and transcript results are presented seems to depend on the length of the text found, which makes it uncomfortable to look at (first result for "video", Matthias' talk).

Yeah true, fair point, will try to fix that.

I spoke too quickly: how would you even fix this? The problematic case is when the matched text is long. I think there are only these options, none of which is optimal:

  • Cut off the matched text, only showing a very tiny section around the match: I think the length of the shown text is already very short and reducing the context around the match isn't great.
  • Let the text wrap into more than two lines: also not sure how useful that is.
  • Make the image larger: blowing up these low resolution images isn't great, and having such a big tooltip isn't great either.

I guess I can somewhat improve the situation by making the image a bit larger still and reducing the max-width of the tooltip a bit (i.e. cutting more text off). But I don't think I can come up with a perfect solution.

@LukasKalbertodt LukasKalbertodt marked this pull request as ready for review September 4, 2024 11:46
@dagraf
Collaborator

dagraf commented Sep 4, 2024

  • For slides results, skip the text extract, avoiding gibberish like "Schuleetal 13032024 7" and distinguishing results more clearly from those in audio/transcript.

It's usually not possible to see the actual matched text in the slide due to the low resolution. And the matched part wouldn't be highlighted as it is now. I think the actual matched text can help a lot with giving context. Also: the slide preview image is of no use to a blind person, so accessibility-wise it is useful to show the matched text. But of course, showing gibberish is not useful either. So I will make it configurable at least.

I'm strongly in favour of keeping the text extract also when there is a slide. Accessibility-wise and also for all other users the shown text can help with giving context.

  • For transcript results, maybe add quotation marks? "…Entwicklung scheint das Zauberwort der Zukun 211 sein und… "

That's a good idea. Unfortunately, there is hidden complexity. There are many different kinds of quotation marks and using the right one depends on the language of the text. „German”, “English”, »German alternative and Danish«, « French », «Swiss French and German?» ... just to name a few of languages/cultures familiar to us. There are way more. And the problem is knowing the language of the search result. We can check the language of the video (which is already annoying but ok), but that might not be set or not match the text in question. It might be possible to only use one style of quotation mark by having them far away from the text and stylized somehow, but still... I would probably just leave as is.

I'm fine to leave it as is.

  • Maybe distinguish the two in color in the timeline?
  • Or are you doing this already? Search "video" and see David's talk:

Yes, this is already done. It's just a different brightness of the color, not a different hue. But we don't really have a different-hue-color we could use (except the "danger color", but that's not fitting).

I don't get this point. Let's discuss it in today's meeting.

@dagraf
Collaborator

dagraf commented Sep 4, 2024

  • The size of the white canvas in which slide and transcript results are presented seems to depend on the length of the text found, which makes it uncomfortable to look at (first result for "video", Matthias' talk).

Yeah true, fair point, will try to fix that.

I spoke too quickly: how would you even fix this? The problematic case is when the matched text is long. I think there are only these options, none of which is optimal:

* Cut off the matched text, only showing a very tiny section around the match: I think the length of the shown text is already very short and reducing the context around the match isn't great.

* Let the text wrap into more than two lines: also not sure how useful that is.

* Make the image larger: blowing up these low resolution images isn't great, and having such a big tooltip isn't great either.

I guess I can somewhat improve the situation by making the image a bit larger still and reducing the max-width of the tooltip a bit (i.e. cutting more text off). But I don't think I can come up with a perfect solution.

I would suggest fixing the width of the white canvas (see screenshot for a possible size) and cutting off the matched text so it always fits into this canvas and does NOT wrap into more than one line.
[Screenshot 2024-09-04 at 14.54.40]

@oas777
Collaborator

oas777 commented Sep 4, 2024

PS: No one mentioned the color of the underscore, right?
[Screenshot]
Where does that come from in a blueish color scheme?

Since the area for showing the text is made a bit smaller, we also use
less context in the backend.
@LukasKalbertodt
Member Author

Okay, I tried to change what we discussed. However, I decided to have different "tooltip size" behavior depending on whether a preview image is available:

  • No image: the tooltip is sized dynamically according to the amount of text.
  • With image: the tooltip has a fixed height, the width can vary just a tiny bit, if there is a lot of text.

The fixed height and some other changes make hovering over parts of the timeline a lot smoother I think, i.e. less jumping around of the tooltip. Let me know what you think about the remaining width-flexibility for large text.


As a second step, I added a faint icon to the bottom corner of the tooltip to show what kind of result we are talking about. Please also let me know what you think.

@oas777
Collaborator

oas777 commented Sep 5, 2024

Thanks, Lukas. The icons are very helpful, though the one for slide content could be more along the lines of
[Image: 2911230]
"Tooltip size" is better/more coherent also, though there is a (different) problem with tooltips below the timeline:
[Screenshot]

@LukasKalbertodt
Member Author

LukasKalbertodt commented Sep 6, 2024

there is a (different) problem with tooltips below the timeline

Good catch, should be fixed now.

the one for slide content could be more along the lines of

Is that a Lucide icon you found there? I could not find it anywhere. I checked Lucide and found these that somewhat match your suggestion: scan-text, letter-text and file-text. Oh and there is presentation?

@oas777
Collaborator

oas777 commented Sep 6, 2024

No, that was me searching Google. I like "letter-text".

@LukasKalbertodt LukasKalbertodt force-pushed the searchable-text branch 2 times, most recently from 30b3363 to 18ae926 Compare September 9, 2024 10:30
@oas777
Collaborator

oas777 commented Sep 9, 2024

Looks good, thanks. Allow me to mention that two tooltips are visible at the same time sometimes when you hover the timeline (slowly and) horizontally:
[Screenshot]
Couldn't detect a logic for this happening and maybe it's only due to the number of results.

This fix was easier than anticipated (once I figured it out, anyway).
The problem is: giving the `WithTooltip` a `z-index` means that the
parent element (containing both the trigger element and the tooltip)
creates a new stacking context. And a stacking context and its children
behave as one unit with regard to other stacking contexts. Two timelines
are two different stacking contexts, so even when the z-index of the
tooltip itself is set to 100, the 100 of one tooltip is not rendered in
front of the 4 of the trigger elements of another timeline. So there
were two options:
- Either introduce another div in the floating components that would
  have the trigger element as children, but not the tooltip, and that
  could get a z-index then. But that would have required changing
  appkit.
- Remove the z-index from `WithTooltip` to avoid creating stacking
  contexts. First I thought the only way to do this is to remove the
  big clickable link area of the search results. And we might still do
  that, as it often brings annoying problems. But for this particular
  problem, the solution turned out easier than I thought.
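
The stacking-context behaviour described above can be modelled roughly as follows. This is a deliberately simplified sketch, not how browsers implement painting: each element is described by a hypothetical path of `(z_index, dom_order)` pairs, one per enclosing stacking context (outermost first) and ending with the element itself, and paths are compared lexicographically. The concrete numbers (wrapper z-index 4, tooltip z-index 100) mirror the commit message; the DOM order values are assumptions.

```python
def paints_above(a, b):
    """True if the element with path `a` is painted in front of the one with path `b`.

    Tuples compare lexicographically, so an outer stacking context always
    decides before any z-index nested inside it is even looked at.
    """
    return tuple(a) > tuple(b)

# With a z-index on each wrapper, every timeline is its own stacking context.
tooltip_of_first = [(4, 0), (100, 0)]   # wrapper (z=4, first in DOM), tooltip (z=100)
trigger_of_second = [(4, 1), (4, 0)]    # wrapper (z=4, second in DOM), trigger (z=4)
tooltip_is_covered = paints_above(trigger_of_second, tooltip_of_first)

# Without a z-index on the wrappers, both elements share one stacking context.
tooltip_flat = [(100, 0)]
trigger_flat = [(4, 1)]
tooltip_wins = paints_above(tooltip_flat, trigger_flat)
```

In the first case the two wrappers tie on z-index, DOM order breaks the tie, and the tooltip's 100 never gets compared against the other timeline's trigger. In the second case the 100 and the 4 are compared directly.
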
Unfortunately, this required adding a new package. `react-icons` has
a super old version of lucide icons, even in its newest version. Since
almost all of our icons are from Lucide anyway, it makes sense to just
use their packages directly.
@LukasKalbertodt
Member Author

two tooltips are visible at the same time sometimes

Yes I noticed this too :/ I think it's a glitch in some other package and we can't do a whole lot about it. But we can still look into it again.

We are currently still trying to fix some weird technical bug. But once that's done, this can be merged I think.

Luckily, Ole noticed this weird error that only occurred sometimes. I
was able to find a repro: on search page, click one video, go back (via
browser), then hover over some text timeline and then click that item.

I am not exactly sure what relay does wrong here, but also requesting
the ID on the search page makes relay understand stuff, so that the
video page is not rendered with the incomplete query from the search
page.
@owi92
Member

owi92 commented Sep 9, 2024

Nice, so apart from the glitch Lukas and Olaf mentioned above everybody's happy and the technical bug also appears to be fixed. I think this is a great improvement. I'll merge this now 👍

@owi92 owi92 merged commit 5c69148 into elan-ev:master Sep 9, 2024
5 checks passed
@LukasKalbertodt LukasKalbertodt deleted the searchable-text branch September 9, 2024 12:36
Labels
changelog:user User facing changes
Development

Successfully merging this pull request may close these issues.

Make events findable by transcript (ideally jump to timestamp)
4 participants