
Music Insights Provider#2153

Draft
ztripez wants to merge 9 commits into music-assistant:dev from ztripez:clap-plugin

Conversation

Contributor

@ztripez ztripez commented Apr 27, 2025

Description

This PR introduces a new provider, Music Insights, designed to enhance Music Assistant with features based on audio embeddings and user interaction analysis. It leverages ChromaDB for vector storage and CLAP models (via the transformers library) for generating embeddings.

Current Features (Work-in-Progress):

  • Provider Setup: Basic configuration flow with presets for different hardware capabilities (CPU/GPU).
  • ChromaDB Integration: Sets up a persistent ChromaDB client within the MA data directory.
  • Text Embeddings: Generates text embeddings for tracks based on metadata (genre, artist, title, album, mood).
  • Semantic Search: Allows searching for tracks using natural language queries.
  • Similar Tracks: Finds tracks similar to a given track based on text embedding similarity.
  • User Interaction Tracking: Records basic track playback events (start, progress, scrobble) using a dedicated InsightScrobbler. Data is stored in a separate ChromaDB collection.
  • Library Sync: Automatically updates embeddings when tracks are added, updated, or deleted from the library.
  • Configuration Handling: Rebuilds embeddings if relevant configuration (model name, window size) changes.
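To illustrate the Semantic Search and Similar Tracks bullets, here is a minimal, dependency-free sketch of the nearest-neighbour ranking that the provider delegates to ChromaDB in practice. The function names and the plain-list embedding format are illustrative, not the provider's actual API:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def similar_tracks(query_emb: list[float],
                   library: dict[str, list[float]],
                   limit: int = 5) -> list[str]:
    """Rank track ids by embedding similarity to the query embedding."""
    return sorted(library,
                  key=lambda tid: cosine_similarity(query_emb, library[tid]),
                  reverse=True)[:limit]
```

In the real provider, ChromaDB performs this ranking over persisted CLAP embeddings; the sketch only shows the underlying cosine-similarity idea.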

TODOs:

  • [ ] Audio Embeddings: Currently only text embeddings are generated and used.
  • [ ] Recommendations: The core logic to analyze user interactions and generate personalized recommendations based on embeddings still needs to be implemented.

How to Test:

  1. Enable the music_insights provider in the MA settings.
  2. Choose a preset (or configure manually). Note that the first startup might take time to download the embedding model.
  3. Allow the initial embedding process to run (check logs for progress - currently only logs start/finish/errors).
  4. Use the search function with descriptive terms (e.g., "upbeat electronic music", "sad acoustic song").
  5. View a track and check the "Similar Tracks" section.
  6. Play some tracks and observe logs for interaction recording messages (debug level).

This provider is still under active development, but this initial version lays the foundation for music discovery and recommendation features within Music Assistant.

async def async_init(self) -> None:
"""Asynchronously initialize the embedding models."""
# Run blocking model setup in a background task using a thread
self.mass.create_task(asyncio.to_thread(self._setup_models))
Member

store the task in a variable if you want to cancel it on unload
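A sketch of that suggestion, using a plain asyncio class as a stand-in for the provider. The `_setup_task` attribute and `unload` hook are assumptions for illustration, not MA's actual interface:

```python
import asyncio

class InsightsProvider:
    """Minimal stand-in showing task bookkeeping for a clean unload."""

    def __init__(self) -> None:
        self._setup_task: asyncio.Task | None = None

    async def async_init(self) -> None:
        # Keep a handle on the background task so unload can cancel it.
        self._setup_task = asyncio.create_task(
            asyncio.to_thread(self._setup_models)
        )

    async def unload(self) -> None:
        # Cancel the model setup if it is still running when we unload.
        if self._setup_task is not None and not self._setup_task.done():
            self._setup_task.cancel()

    def _setup_models(self) -> None:
        # Blocking model download/initialisation would happen here.
        pass
```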

Comment on lines +155 to +156
# waveform = None
# sample_rate = None
Member

If you need help with this part, ping me on Discord. It's relatively easy to get the audio stream in PCM.

Contributor

OzGav commented Sep 9, 2025

@ztripez any more progress on this one?

Contributor Author

ztripez commented Dec 23, 2025

> @ztripez any more progress on this one?

Hey @OzGav - I have some bandwidth to pick this back up, but I need to address a fundamental architectural issue before moving forward.

The Problem:
The current approach of embedding CLAP/transformers directly into MA creates a dependency nightmare:

  • PyTorch + CUDA support = 2-3GB+ download
  • CUDA version compatibility hell (PyTorch vs system CUDA vs transformers)
  • Forces GPU dependencies on ALL MA users, even those who never use this feature
  • Different hardware needs different builds (CPU/CUDA/ROCm)
  • Makes MA installation significantly heavier

Proposed Solution:
I'm thinking this should be a separate sidecar service that MA communicates with via HTTP/gRPC. Benefits:

  • Optional - only users who want AI features install it
  • Clean separation of GPU/ML dependencies
  • Can be containerized independently with proper CUDA base images
  • Easier to iterate on models without touching MA core
  • Users can run it on different hardware (GPU server separate from MA)
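For a sense of what the HTTP boundary could look like, a hypothetical request/response shape. The endpoint path, field names, and default model id are illustrative only, not a settled API:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class EmbedRequest:
    """Hypothetical payload MA would POST to the sidecar, e.g. /v1/embed/text."""
    track_id: str
    text: str  # flattened metadata: genre / artist / title / album / mood
    model: str = "laion/clap-htsat-unfused"  # illustrative default

@dataclass
class EmbedResponse:
    """Hypothetical sidecar reply: the embedding for one track."""
    track_id: str
    embedding: list[float]

def encode_request(req: EmbedRequest) -> bytes:
    """Serialize the request body for an HTTP POST."""
    return json.dumps(asdict(req)).encode()
```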

Questions:

  1. Is there precedent for optional sidecar services in the MA ecosystem?
  2. Would you accept this as a separate service that MA integrates with, rather than a built-in provider?
  3. Any preferences on communication protocol (REST/gRPC)?

If the sidecar approach is acceptable, I can get moving on this. If you strongly prefer it as an integrated provider, we need to discuss how to handle the dependency bloat - maybe optional extras in requirements with clear documentation about the 3GB+ install size?


My personal take: the sidecar is objectively the right architecture here. Bolting ML/GPU workloads onto a music server screams "separate service."

@MarvinSchenkel
Contributor

I think it makes sense to have the analysis part as a separate sidecar. We can then implement a thin MetadataPlugin that obtains the results and stores them in MA alongside the MediaItems. We could eventually host that Dockerfile alongside our other MA addons and containers.

The only thing I am unsure about is that we will need to stream raw PCM to that sidecar, which might be a bit bandwidth-heavy.

Looping in @marcelveldt as he will definitely have some ideas for this.

Question back: I did some work on audio analysis for smart fades already (simple beat/downbeat analysis). Could your libraries possibly enhance this information as well? (think phrase detection, key detection etc.)

Contributor Author

ztripez commented Dec 28, 2025

Thanks @MarvinSchenkel! Glad the sidecar approach makes sense.
I've started work on the sidecar over here (don't read too much into the README.md; it has drifted a lot from what the API actually looks like): https://github.com/ztripez/music-assistant-insights

On streaming bandwidth:

The current implementation streams raw PCM frames over HTTP/msgpack, but the sidecar is designed to handle this efficiently:

  • Audio frames are buffered on MA's side (~1 second chunks, ~384KB at 48kHz stereo f32) to reduce HTTP overhead
  • The sidecar converts to mono, resamples to 48kHz if needed, and computes mel spectrograms incrementally during streaming (cheap STFT + filterbank operations)
  • Expensive model inference is deferred to session end, so streaming itself is lightweight
  • Sessions are keyed by track, with automatic cleanup of stale sessions

That said, I'm open to moving mel spectrogram computation to MA's side if bandwidth becomes an issue - mel features are ~64x256 floats (~65KB) for a 10-second window vs ~1.8MB of raw PCM. That's a significant reduction.
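The bandwidth numbers above can be sanity-checked with a couple of one-liners, assuming 48 kHz f32 samples and a 64×256 mel window as stated:

```python
def pcm_bytes(seconds: float, sample_rate: int = 48_000,
              channels: int = 2, bytes_per_sample: int = 4) -> int:
    """Raw PCM buffer size for f32 samples of the given duration."""
    return int(seconds * sample_rate * channels * bytes_per_sample)

def mel_bytes(n_mels: int = 64, frames: int = 256,
              bytes_per_value: int = 4) -> int:
    """Size of one mel-spectrogram window of float32 values."""
    return n_mels * frames * bytes_per_value

print(pcm_bytes(1.0))               # 1 s stereo chunk -> 384000 bytes (~384 KB)
print(pcm_bytes(10.0, channels=1))  # 10 s mono window -> 1920000 bytes (~1.8 MB)
print(mel_bytes())                  # 64x256 floats    -> 65536 bytes (~65 KB)
```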

On audio analysis expansion:

The sidecar is designed to be modular. I'm already doing zero-shot mood classification using CLAP's joint embedding space (energetic, melancholic, aggressive, etc.) during ingestion.
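Conceptually, the zero-shot mood classification reduces to comparing the track embedding against text-prompt embeddings in the joint space and normalising the similarities. A dependency-free sketch with made-up 2-D vectors (real CLAP embeddings are high-dimensional and come from the model):

```python
import math

def mood_scores(track_emb: list[float],
                label_embs: dict[str, list[float]]) -> dict[str, float]:
    """Softmax over cosine similarities between a track embedding and
    each mood-prompt embedding (zero-shot classification)."""
    def cos(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    sims = {label: cos(track_emb, emb) for label, emb in label_embs.items()}
    z = sum(math.exp(s) for s in sims.values())
    return {label: math.exp(s) / z for label, s in sims.items()}
```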

For your use cases:

  • Beat/downbeat detection: Could integrate with existing beat detection models or use onset detection algorithms (librosa-style)
  • Key detection: Models like https://github.com/spotify/audio-features or lighter neural approaches exist
  • Phrase detection: More experimental, but possible with structural segmentation models

The architecture already supports this - the watcher module can decode full audio files using symphonia (mp3, flac, ogg, m4a, etc.), resample, and run multiple analysis passes. Adding new feature extractors would be straightforward.

Bonus: Local file scanning

I'm also building a folder watcher module that runs alongside the sidecar (sidecar² lol) - it monitors local music directories, decodes files directly, extracts ID3/Vorbis metadata, and generates embeddings. This could be useful for:

  • Users who want embeddings without MA integration
  • Pre-populating the vector DB before MA syncs
  • Analyzing tracks that MA doesn't have metadata for
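The watcher idea can be sketched as a snapshot-and-diff loop over the music directory. A real implementation would likely use inotify or the `watchdog` package instead of polling; the extension list and function names here are illustrative:

```python
from pathlib import Path

# Illustrative set; the real watcher decodes via symphonia.
AUDIO_EXTS = {".mp3", ".flac", ".ogg", ".m4a"}

def scan_library(root: Path) -> dict[Path, float]:
    """Snapshot of audio files under root, mapped to their mtimes."""
    return {p: p.stat().st_mtime for p in root.rglob("*")
            if p.suffix.lower() in AUDIO_EXTS}

def diff_snapshots(old: dict[Path, float], new: dict[Path, float]):
    """Files added, changed, or removed between two snapshots."""
    added = [p for p in new if p not in old]
    changed = [p for p in new if p in old and new[p] != old[p]]
    removed = [p for p in old if p not in new]
    return added, changed, removed
```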

Happy to collaborate on the audio analysis expansion if there's interest.

Contributor Author

ztripez commented Dec 28, 2025

Another thought: move to a SQLite fork with vector capabilities, like Turso. Then the sidecar can be a stateless processor that just returns embeddings, and all queries can run inside MA.
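A rough sketch of that split using stock `sqlite3`: cosine distance is registered as a Python UDF here, standing in for the built-in vector functions a vector-capable fork would provide (the schema and names are illustrative):

```python
import json
import math
import sqlite3

def cosine_dist(a_json: str, b_json: str) -> float:
    """Cosine distance between two JSON-encoded embedding vectors."""
    a, b = json.loads(a_json), json.loads(b_json)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

conn = sqlite3.connect(":memory:")
# In a vector-native fork this UDF would be a built-in function.
conn.create_function("cosine_dist", 2, cosine_dist)
conn.execute("CREATE TABLE embeddings (track_id TEXT PRIMARY KEY, vec TEXT)")
conn.executemany("INSERT INTO embeddings VALUES (?, ?)", [
    ("track_a", json.dumps([1.0, 0.0])),
    ("track_b", json.dumps([0.0, 1.0])),
])
# MA-side query: nearest track to a query embedding, entirely in SQL.
query = json.dumps([0.9, 0.1])
rows = conn.execute(
    "SELECT track_id FROM embeddings ORDER BY cosine_dist(vec, ?) LIMIT 1",
    (query,),
).fetchall()
```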

4 participants