
Recommended Content Block powered by AI embeddings #881


Merged
dkotter merged 41 commits into develop from feature/query-variation-block
Jun 11, 2025

Conversation

Collaborator

@dkotter dkotter commented Apr 4, 2025

Description of the Change

We currently have a Recommendation Service section with a single Feature within it, Recommended Content. At the moment, the only Provider for this Feature is Azure AI Personalizer, which has been deprecated by Microsoft. We have kept the code in place (for now), but we don't recommend that anyone use it.

We've looked into various alternatives (see #392) but haven't found a service similar to Azure AI Personalizer that we've liked. We have been using AI embeddings in other places to find similar content (term cleanup, smart 404, classification), and I've had the idea for a while to use that same concept to power content recommendations in a Recommended Content block. That's what this PR introduces.

Worth noting this is different from the recommendations Azure AI Personalizer provides, which are supposed to be more personalized content recommendations rather than just similar content recommendations.

This new Provider does the following:

  • When configured, embeddings are generated for all existing Posts that don't already have embedding data
  • In addition, anytime a Post is published (either initially or a published Post is saved) we generate embedding data for that item
  • We've introduced a new Content Recommendation block that is a block variation of the core Query Loop block. I've left the settings here pretty simple: you can change the number of posts that show and the sort order
  • This variation block uses a default template of the featured image, title, date and excerpt, though this can be modified after the block is inserted
  • When this block renders, we modify the query that is run: we first get the embedding data for the current item, then run a similarity calculation against the embeddings of the 5000 most recent posts (this cap is to try and keep performance decent), and then return the posts that meet our threshold (defaults to 75% but can be changed in the settings for the Feature). These results are cached to help with performance, with that cache expiring every hour. We then display however many items have been set (default is 3). Worth noting that if fewer items meet the threshold, we only display those. So if someone chooses to display 5 items but only 2 meet the threshold, we only display 2
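
The render-time flow in the last bullet can be sketched roughly as follows. This is an illustration only, written in JavaScript rather than the plugin's PHP, and it rests on assumptions not stated in this PR: that the similarity measure is cosine similarity, that the 75% threshold maps to a score of 0.75, and that results are ordered by score. The names `cosineSimilarity` and `recommendPosts` are hypothetical, and chunking and caching are left out.

```javascript
// Hypothetical helpers for illustration; not the plugin's actual code.

// Cosine similarity between two equal-length numeric vectors (assumed measure).
function cosineSimilarity(a, b) {
  if (a.length !== b.length || a.length === 0) {
    throw new Error('Vectors must be non-empty and the same length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    return 0; // Treat a zero vector as having no similarity to anything.
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return up to `limit` post IDs whose similarity to the current post
// meets the threshold, most similar first. The current post is excluded.
function recommendPosts(current, candidates, { threshold = 0.75, limit = 3 } = {}) {
  return candidates
    .filter((post) => post.id !== current.id)
    .map((post) => ({
      id: post.id,
      score: cosineSimilarity(current.embedding, post.embedding),
    }))
    .filter((result) => result.score >= threshold)
    .sort((x, y) => y.score - x.score)
    .slice(0, limit)
    .map((result) => result.id);
}
```

Here `current` and each entry of `candidates` are assumed to be objects of the form `{ id, embedding }`, with `candidates` standing in for the 5000 most recent posts.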

While I see this as being a complete feature, there are additional improvements we could look to make, potentially in this PR but ideally in follow up PRs so we can ship this sooner:

  • Right now we hardcode to Posts only. Ideal would be to allow someone to choose the post types they want, both at the settings level for the Feature but also in the block itself. Becomes a little tricky to sync those but is solvable
  • We could look to backfill the results if fewer items meet the threshold than we want. This could either be the default behavior or be controlled by a new setting
  • The Query Loop block has additional settings we can look to support, like filtering by taxonomy, filtering by author, etc
  • As mentioned above, we default to a template that uses the featured image, title, date and excerpt. This can be changed after the block is added but we may want to add global settings to set the default here (or just change the default if we think there's a better standard template)
  • Right now I only added support for OpenAI. Ideally we add our other embedding Providers (Azure OpenAI, Ollama)
  • As I was working through our embedding code, there's quite a bit here that's duplicated across other embedding Providers. Potential here to do some code cleanup and either move shared code into the Feature or move shared code into some other shared class
  • Probably worth discussion around removing Azure AI Personalizer now

Screenshots

[Screenshot: Settings for the Recommended Content Feature]
[Screenshot: Admin rendering of the Recommended Content block]

How to test the Change

  1. Check out this PR and then go to Tools > ClassifAI > Recommendation Service > Recommended Content
  2. Turn the Feature on and select and configure the OpenAI Embeddings Provider
  3. After saving, you should see a message that embeddings are being generated. This may take some time depending on how much content you have but eventually this process should finish and a success message will show
  4. Go to an existing Post and add the Recommended Content block
  5. Ensure this is rendering some content, both in the editor and the front-end

Changelog Entry

Added - New Recommended Content block powered by the OpenAI Embeddings Provider

Credits

Props @dkotter, @fabiankaegy

Checklist:

@dkotter dkotter added this to the 3.4.0 milestone Apr 4, 2025
@dkotter dkotter self-assigned this Apr 4, 2025
@dkotter dkotter requested review from jeffpaul and a team as code owners April 4, 2025 14:16
@github-actions github-actions bot added the needs:code-review This requires code review. label Apr 4, 2025
@dkotter dkotter force-pushed the feature/query-variation-block branch from e23ce08 to df41475 Compare April 4, 2025 15:32
cy.saveFeatureSettings();
} );

it.skip( 'Can add the Recommended Content block in a post', () => {
Collaborator Author


Note I'm skipping this test for now as it fails in the pipeline. It works fine for me locally but fails consistently when run here. From some debugging, it seems the block can't be found, but I couldn't figure out why.

@jeffpaul jeffpaul moved this to Code Review in Open Source Practice Apr 8, 2025
@jeffpaul jeffpaul requested review from Sidsector9 and removed request for a team April 8, 2025 13:33
@jeffpaul
Member

jeffpaul commented Apr 8, 2025

Making a note here @dkotter @vikrampm1 to remind us once this is released that we'll likely want to adjust our microsite and ClassifAI overview deck with how this feature now works. Probably also worth a quick scan through the dev docs site to see if there are any updates there that are worth including in this PR.

@jeffpaul
Member

jeffpaul commented Apr 8, 2025

@vikrampm1 please open individual follow-up issues for the items that Darin called out:

  • Right now we hardcode to Posts only. Ideal would be to allow someone to choose the post types they want, both at the settings level for the Feature but also in the block itself. Becomes a little tricky to sync those but is solvable
  • We could look to backfill the results if less items meet the threshold than what we want. This could either be the default or introduce a new setting for this
  • The Query Loop block has additional settings we can look to support, like filtering by taxonomy, filtering by author, etc
  • As mentioned above, we default to a template that uses the featured image, title, date and excerpt. This can be changed after the block is added but we may want to add global settings to set the default here (or just change the default if we think there's a better standard template)
  • Right now I only added support for OpenAI. Ideally we add our other embedding Providers (Azure OpenAI, Ollama)
  • As I was working through our embedding code, there's quite a bit here that's duplicated across other embedding Providers. Potential here to do some code cleanup and either move shared code into the Feature or move shared code into some other shared class
  • Probably worth discussion around removing Azure AI Personalizer now

@Sidsector9
Member

@dkotter I'm not sure if I'm testing this incorrectly, but I'm unable to make this work.

This is the ZIP of posts I'm testing with. classifai-posts.xml.zip

The Recommended Block is added to the post: Senate Republicans vote to revoke California’s right to set its own tailpipe pollution rules. This is the post which is unique and unrelated to other posts, but still the block renders posts which are unrelated.

I have tested with thresholds of both 10% and 100%, and also confirmed that post embedding meta exists for all the posts.

@fabiankaegy
Member

Just adding my two cents here.

I am thinking of content recommendation in a query loop more like a sorting than a filtering.

I always want to display some posts at the bottom of an article to keep visitors in my site. Ideally those are as relevant as possible. But even when we don't find anything relevant I still want there to be some posts.

@dkotter
Collaborator Author

dkotter commented Jun 5, 2025

This is the post which is unique and unrelated to other posts, but still the block renders posts which are unrelated.

@Sidsector9 I tested this out and did find an issue, though I think unrelated to your results. I think what you're running into is by design (though we can change this) where if no related results are found we just run a normal query. This is to ensure we always show something instead of just having this section be blank.

This also matches @fabiankaegy's point:

I always want to display some posts at the bottom of an article to keep visitors in my site. Ideally those are as relevant as possible. But even when we don't find anything relevant I still want there to be some posts.

I did find an issue, though: if a post has this query block (or any Query block) added, the content rendered by that block is used when we generate embeddings. This can lead to unexpected matches, because partial content from one or more other posts ends up in the embeddings for a different post, producing high match scores on those chunks. This has been resolved in 082f370 by stripping any Query blocks out before we render content.
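
For illustration, one way to strip Query blocks from serialized post content is sketched below. How 082f370 actually does this is not shown in this conversation, so treat this as an assumption-laden example: it works on raw block markup with a regular expression, assumes Query blocks are not nested inside one another, and uses the hypothetical name `stripQueryBlocks`. It is written in JavaScript, while the plugin code is PHP.

```javascript
// Hypothetical helper: remove serialized Query Loop blocks, including
// everything nested inside them, from raw post content before the
// content is used to generate embeddings.
// Assumes Query blocks are not nested inside other Query blocks.
function stripQueryBlocks(content) {
  // Opening comment: "<!-- wp:query " followed by optional attributes.
  // Closing comment: "<!-- /wp:query -->".
  // Requiring whitespace after "wp:query" avoids matching sibling block
  // names such as "wp:query-pagination" as the opening tag.
  const queryBlock = /<!--\s*wp:query\s[\s\S]*?<!--\s*\/wp:query\s*-->/g;
  return content.replace(queryBlock, '').trim();
}
```

Content without a Query block passes through unchanged apart from trimming of leading and trailing whitespace.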

@dkotter
Collaborator Author

dkotter commented Jun 5, 2025

I always want to display some posts at the bottom of an article to keep visitors in my site. Ideally those are as relevant as possible. But even when we don't find anything relevant I still want there to be some posts.

As mentioned above, it does work this way right now, where it will run the default query if no relevant results are found. One thing it doesn't do (though I have a TODO statement in the code to consider this) is backfill results if we don't have enough relevant results. Meaning if someone wants to show 3 items but we only have 1 that matches our threshold, right now that is the only item that shows. In those situations we could run an extra query to get more results, ensuring we always have 3 items to show.

@github-actions github-actions bot added the needs:refresh This requires a refreshed PR to resolve. label Jun 9, 2025
@github-actions github-actions bot removed the needs:refresh This requires a refreshed PR to resolve. label Jun 9, 2025
dkotter added 3 commits June 10, 2025 13:39
feat/886: Add Default Template global settings for Recommended Content Block
@dkotter dkotter linked an issue Jun 10, 2025 that may be closed by this pull request
…th the default query so we always have the number of items we want
@dkotter
Collaborator Author

dkotter commented Jun 10, 2025

One thing it doesn't do (though I have a TODO statement in the code to consider this) is backfill results if we don't have enough relevant results. Meaning if someone wants to show 3 items but we only have 1 that matches our threshold, right now that is the only item that shows. In those situations we could run an extra query to get more results, ensuring we always have 3 items to show

This has been added now, where we backfill the results using the default query if needed. Could look at making this a setting in the future but seems like good default behavior
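
As a rough illustration of the backfill idea (not the code in this PR, which is PHP and builds on `WP_Query`), the merge step might look like the sketch below. The function name `backfillPostIds` and its parameters are hypothetical, and skipping IDs that already matched is an assumption about the desired behavior.

```javascript
// Hypothetical helper: top up the similarity matches with IDs from the
// default query until `desiredCount` items are available.
function backfillPostIds(matchedIds, defaultQueryIds, desiredCount, currentPostId) {
  const result = matchedIds.slice(0, desiredCount);
  // Never recommend the current post, and never list a post twice.
  const seen = new Set([currentPostId, ...result]);
  for (const id of defaultQueryIds) {
    if (result.length >= desiredCount) {
      break;
    }
    if (seen.has(id)) {
      continue;
    }
    seen.add(id);
    result.push(id);
  }
  return result;
}
```

With no matches at all, this reduces to the existing fallback of showing the default query results; with enough matches, the default query results are not used.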

…ent with multiple chunks, we compare each chunk against each chunk of all other posts. This means we run the query to get posts multiple times and determine our threshold multiple times. Made these static variables so they only run once, regardless of how many chunks the original content has
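
The "compute once, reuse for every chunk" idea in the commit message above can be sketched as follows. The PR uses PHP static variables; this JavaScript version uses a closure instead, and the names `once`, `scoreChunks`, `loadCandidates`, and `scoreChunk` are all hypothetical.

```javascript
// Wrap an expensive zero-argument function so it runs at most once;
// later calls reuse the stored result (similar in spirit to a PHP
// static variable inside a function).
function once(compute) {
  let done = false;
  let value;
  return () => {
    if (!done) {
      value = compute();
      done = true;
    }
    return value;
  };
}

// Score every chunk of the source post against the same candidate set.
// `loadCandidates` stands in for the query that fetches posts; it is
// called a single time no matter how many chunks there are.
function scoreChunks(sourceChunks, loadCandidates, scoreChunk) {
  const getCandidates = once(loadCandidates);
  return sourceChunks.map((chunk) => scoreChunk(chunk, getCandidates()));
}
```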
Comment on lines +252 to +264
$backfill_query = new \WP_Query(
    array_merge(
        $new_query_vars,
        [
            'posts_per_page' => (int) $query_vars['posts_per_page'] - $post_count,
            'post__not_in'   => [ $post_id ],
            'fields'         => 'ids',
        ]
    )
);

// Add the backfilled posts to the post__in array.
$post__in = array_merge( $post__in, $backfill_query->posts );
Member


OMG why did I never think of doing it this way.... 🤦 I spent so much time hacking the internals of how queries are done to add backfilling support. This is so nice!

Member

@Sidsector9 Sidsector9 left a comment


Tested, and works as per description and discussion 👍

@github-project-automation github-project-automation bot moved this from Code Review to QA Testing in Open Source Practice Jun 11, 2025
@dkotter dkotter merged commit 5863e6d into develop Jun 11, 2025
19 of 22 checks passed
@dkotter dkotter deleted the feature/query-variation-block branch June 11, 2025 16:46
@github-project-automation github-project-automation bot moved this from QA Testing to Done in Open Source Practice Jun 11, 2025
Labels
needs:code-review This requires code review.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Add Option for Backfilling Results When Threshold Isn't Met
4 participants