error: input contains invalid UTF-8 around byte XXXX #42

@Sripaad

I am trying to extract keywords from the amazon_reviews dataset. When I run it on the Spanish reviews, I encounter this error, which I am unable to resolve.

STACK TRACE
/python3.8/site-packages/multi_rake/algorithm.py in apply(self, text, text_for_stopwords)
     60 
     61         else:
---> 62             language_code = detect_language(text, self.lang_detect_threshold)
     63 
     64             if language_code is not None and language_code in STOPWORDS:

/opt/conda/lib/python3.8/site-packages/multi_rake/utils.py in detect_language(text, proba_threshold)
     12 
     13 def detect_language(text, proba_threshold):
---> 14     _, _, details = pycld2.detect(text)
     15 
     16     language_code = details[0][1]

error: input contains invalid UTF-8 around byte 2094 (of 5341)

Is there a workaround, such as specifying the language code manually?
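
A possible workaround, sketched under assumptions rather than confirmed against this dataset: the traceback shows the failure inside `pycld2.detect`, which is only reached when language detection runs. Passing `language_code="es"` to the `Rake` constructor should skip that branch, assuming the installed multi_rake version accepts that parameter and ships Spanish stopwords. If detection is still needed, stripping control, surrogate, private-use, and unassigned characters before calling `apply` is a commonly reported way to avoid this pycld2 error. The helper names `clean_text` and `extract_keywords_es` below are hypothetical.

```python
import unicodedata

# Unicode general categories to drop (an assumption about what pycld2 rejects):
# Cc = control, Cs = surrogate, Co = private use, Cn = unassigned.
_DROP_CATEGORIES = {"Cc", "Cs", "Co", "Cn"}

# Ordinary whitespace is kept even though it belongs to category Cc.
_KEEP_WHITESPACE = {"\n", "\t"}


def clean_text(text: str) -> str:
    """Return text without characters in the dropped Unicode categories."""
    return "".join(
        ch
        for ch in text
        if ch in _KEEP_WHITESPACE
        or unicodedata.category(ch) not in _DROP_CATEGORIES
    )


def extract_keywords_es(text: str):
    """Hypothetical helper: run RAKE on Spanish text without language detection."""
    # Imported lazily so clean_text stays usable without multi_rake installed.
    from multi_rake import Rake

    # Assumption: fixing the language code bypasses the pycld2-based detection.
    rake = Rake(language_code="es")
    return rake.apply(clean_text(text))
```

Cleaning the text is not strictly required when the language code is fixed, but it keeps the input safe if detection is re-enabled later.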
