Multithreading for embeddings extraction #81

@AFAgarap

Hello. May I ask if there is a way to extract word embeddings using multiple cores?
Right now, I'm computing word-embedding representations for the 20 Newsgroups dataset, and processing the whole dataset takes a while. Thank you.

For reference, this is my current function:

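from typing import List, Union

import numpy as np
import pymagnitude


def extract_sentence_embeddings(
    texts: Union[str, List[str]], batch_size: int = 2048
) -> np.ndarray:
    """
    Returns the sentence embeddings for the input texts.

    Parameters
    ----------
    texts: str or List[str]
        The input text to vectorize.
    batch_size: int
        The mini-batch size to use for computation.

    Returns
    -------
    vectors: np.ndarray
        The sentence embeddings representation for the input texts.
    """
    vectorizer = pymagnitude.Magnitude("data/glove.840B.300d.magnitude")
    if isinstance(texts, str):
        # Single text: average the word vectors of its tokens.
        vectors = vectorizer.query(texts.split())
        vectors = np.mean(vectors, axis=0)
        return vectors
    elif isinstance(texts, list):
        vectors = []
        # Step through the list one batch at a time. The last batch may be
        # shorter than batch_size, so no trailing texts are dropped.
        for offset in range(0, len(texts), batch_size):
            # Texts with no tokens are replaced by placeholder tokens so that
            # every entry in the batch has something to query.
            batch = [
                text.split() if text.split() else ["", ""]
                for text in texts[offset : offset + batch_size]
            ]
            vector = vectorizer.query(batch)
            # Average over the token axis to get one vector per text.
            vector = np.mean(vector, axis=1)
            vectors.append(vector)
        # Join the per-batch arrays into one array with a row per text
        # (assumes texts is non-empty).
        return np.concatenate(vectors, axis=0)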

Since I'm using 300D vectors, memory can easily be exhausted, which is why I batch the text data.

Looking forward to your response! Thank you!
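
For illustration, here is a minimal sketch of how the batches described above could be handed to a pool of workers. It is not a confirmed Magnitude feature. The names `make_batches`, `map_batches`, and `toy_embed_batch` are hypothetical, and `toy_embed_batch` is only a stand-in for the per-batch query and mean step. Whether a single `Magnitude` object can be shared safely across threads or processes is not established here, so the sketch assumes each worker would open its own handle inside the real `embed_batch`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def make_batches(texts: Sequence[str], batch_size: int) -> List[Sequence[str]]:
    """Split texts into consecutive batches; the last one may be shorter."""
    return [texts[i : i + batch_size] for i in range(0, len(texts), batch_size)]


def map_batches(
    embed_batch: Callable[[Sequence[str]], list],
    texts: Sequence[str],
    batch_size: int = 2048,
    max_workers: int = 4,
    executor_cls=ThreadPoolExecutor,
) -> list:
    """Apply embed_batch to every batch using a pool of workers.

    Executor.map yields results in input order, so the returned list lines
    up with the original order of the batches.
    """
    batches = make_batches(texts, batch_size)
    with executor_cls(max_workers=max_workers) as pool:
        return list(pool.map(embed_batch, batches))


def toy_embed_batch(batch: Sequence[str]) -> list:
    """Hypothetical stand-in for a real per-batch embedding function.

    Returns a 2-D "vector" per text: [token count, character count].
    A real version would query the embedding model and average over tokens.
    """
    return [[float(len(text.split())), float(len(text))] for text in batch]


if __name__ == "__main__":
    per_batch = map_batches(
        toy_embed_batch, ["a b c", "hello", "", "x y"], batch_size=3, max_workers=2
    )
    print(per_batch)
```

With real embeddings, each element of the returned list would be one batch's array, and `np.concatenate(per_batch, axis=0)` would merge them. Threads only help if the underlying query releases the GIL or is I/O-bound. For CPU-bound work, passing `ProcessPoolExecutor` as `executor_cls` is the usual alternative, provided `embed_batch` is defined at module level so it can be pickled.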
