Analyze the use of English in Spanish-language women's magazines (and the use of Spanish in English-language women's magazines) in the US.
- Get some data: scrape magazine articles from
- siempremujer.com
- latina.com
- Text pre-processing
- convert to plain text
- Dictionary pre-processing
- we want to annotate tokens in articles according to their language
- use the OpenOffice dictionaries
- we need simple word lists, so strip the annotations from the dictionary entries and convert to UTF-8!
- make a common Spanish dictionary (intersection of all Spanish dictionaries)
- and region-specific Spanish dictionaries for the different countries
- Annotate text
- look up all the words in Spanish/English dictionaries
- annotate language (multiple labels possible)
- Get n-grams with alternating language use
- Try to generalize some contexts where such n-grams appear
- web scraping: requests and pattern
- natural language processing: nltk
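
For the "convert to plain text" step, here is a minimal sketch using only the Python standard library. The names `TextExtractor` and `html_to_text` are made up for illustration, and the sketch assumes that dropping `<script>` and `<style>` contents is enough cleanup; real article pages will probably also need the navigation and ad markup removed.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with spaces, then collapse whitespace runs.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the visible text of an HTML string as one line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Chunks are joined with a space so that words in adjacent block elements do not run together; the price is a stray space before punctuation that follows an inline tag, which should not matter once the text is tokenized.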
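
For the dictionary pre-processing step, a rough sketch of how the word lists could be built. It assumes the OpenOffice dictionaries are Hunspell-style `.dic` files (an entry count on the first line, then one `word/FLAGS` entry per line) and that the source encoding is ISO-8859-1; the actual encoding should be checked per dictionary. All function names are hypothetical, and "region-specific" is interpreted here as "words of a region that are not in the common dictionary".

```python
def parse_dic_lines(lines):
    """Return the set of bare, lower-cased words from .dic-style lines."""
    words = set()
    for i, line in enumerate(lines):
        line = line.strip()
        if not line:
            continue
        if i == 0 and line.isdigit():
            continue  # assumed: the first line holds the entry count
        # Drop affix flags after "/" and any tab-separated extra fields.
        word = line.split("/", 1)[0].split("\t", 1)[0].strip()
        if word:
            words.add(word.lower())
    return words


def convert_dic_to_word_list(src_path, dst_path, src_encoding="iso-8859-1"):
    """Write a plain UTF-8 word list; src_encoding is an assumption."""
    with open(src_path, encoding=src_encoding) as src:
        words = parse_dic_lines(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write("\n".join(sorted(words)) + "\n")
    return words


def split_common_and_regional(regional):
    """Split {region: words} into the shared core and per-region extras."""
    common = set.intersection(*regional.values()) if regional else set()
    specific = {region: words - common for region, words in regional.items()}
    return common, specific
```

Lower-casing everything loses the distinction between proper nouns and common words, which may or may not be acceptable for the later annotation.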
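
For the annotation step, a small sketch of the dictionary lookup with multiple labels per token. It uses a plain regular expression as the tokenizer to stay dependency-free; an nltk tokenizer could be swapped in. `annotate` and the language codes are illustrative names, and the word sets in the usage example are toy data, not real dictionaries.

```python
import re

# Runs of letters (Unicode-aware), optionally joined by apostrophes.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")


def annotate(text, dictionaries):
    """Return (token, labels) pairs; labels holds every matching language."""
    annotated = []
    for token in TOKEN_RE.findall(text):
        key = token.lower()
        labels = frozenset(
            lang for lang, words in dictionaries.items() if key in words
        )
        annotated.append((token, labels))
    return annotated


if __name__ == "__main__":
    toy = {"es": {"me", "encanta", "el", "no"}, "en": {"look", "no", "me"}}
    for token, labels in annotate("Me encanta el look, no?", toy):
        print(token, sorted(labels))
```

Tokens found in no dictionary get an empty label set, which makes names, typos, and other out-of-vocabulary words easy to inspect separately.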
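
For the n-gram step, one possible reading of "alternating language use" is sketched below: an n-gram counts as mixed if it contains at least one token that is only in the first language's dictionary and at least one that is only in the second's. This definition is an assumption, as are the function names; tokens with both labels or none are treated as neutral.

```python
def ngrams(items, n):
    """Return all contiguous n-item windows as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def mixed_language_ngrams(annotated, n=3, langs=("es", "en")):
    """Return token n-grams containing unambiguous words of both languages."""
    first, second = langs
    hits = []
    for gram in ngrams(annotated, n):
        only_first = any(labels == {first} for _, labels in gram)
        only_second = any(labels == {second} for _, labels in gram)
        if only_first and only_second:
            hits.append(tuple(token for token, _ in gram))
    return hits
```

The input is a list of `(token, labels)` pairs where `labels` is a set of language codes. Counting how often the same mixed n-grams, or the words around them, recur across articles would be a starting point for generalizing the contexts.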