An exploration of the use of APIs, natural language, and text processing to study online interactions via social media.
Designed for self-study, this module will take you through the process of retrieving, storing, and analyzing tweets. Topics include qualitative annotation, natural language processing, and classification using basic machine-learning methods.
Content is provided in the form of Jupyter notebooks. If you need an introduction to Jupyter, see the official documentation or this Medium article.
You can run these notebooks on a JupyterHub server - potentially one provided by your course - or on your own computer, appropriately configured with Python and the necessary libraries. See Part 0 for information on the Python modules that you will need.
Data science modules developed by the University of Pittsburgh Biomedical Informatics Training Program with the support of the National Library of Medicine data science supplement to the University of Pittsburgh (Grant # T15LM007059-30S1).
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
The following steps describe how to use these modules in a Jupyter environment.
These exercises will be completed using Jupyter computational notebooks. There are two ways that you might do this:
- In a server environment provided by your instructor and/or institution
- On your own computer.
Furthermore, there are two interfaces that you can use:
- Using traditional Jupyter notebooks
- Using JupyterLab, an alternative interface with features similar to those of an integrated development environment.
In either case, it is important to get the right version of Python. As of October 2018, Python 3.6 is necessary, along with several libraries (listed below in Section 1.3). The need for the appropriate Python version will come up again below when we discuss the use of the notebooks.
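If you're not sure which Python a given notebook is using, a quick check like the following (our suggestion, not part of the module's notebooks) can be run in any notebook cell:

```python
import sys

# Confirm that the notebook kernel is running a suitable Python.
print(sys.version)
assert sys.version_info >= (3, 6), "These modules expect Python 3.6"
```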
To use the server environment, you will need to follow directions from your instructor or mentor as to how to access appropriate local computing facilities. Reiterating the point made above, please be sure that your instructor and computing staff configure the servers for Python 3.6.
Installation on your own computer is not difficult for those who are comfortable with basic command-line operation and software configuration.
This installation generally requires three components:
- Python
- Jupyter tools
- Additional Python libraries (described below in 1.3)
There are a variety of ways to accomplish these tasks. Here, we will focus on one approach that has proven relatively straightforward - the miniconda tool. Go to the miniconda site and follow the installation instructions.
Once you have miniconda installed, you will need to install Python 3.6 and the Jupyter tools. Gergely Szerovay's article Why you need Python environments and how to manage them with Conda provides a good explanation of how this can be done. Once you complete the installation of the basic notebooks, you can install JupyterLab if desired.
There are several add-on Python libraries that are used in these modules:
- NumPy - for preparing data for plotting
- Matplotlib - for plots and graphs
- jsonpickle - for storing tweets
- spaCy - an NLP toolkit
- scikit-learn - for machine learning
- tweepy - for retrieving tweets via the Twitter API
If you are using an installation provided by your instructor, please work with your instructor to install these libraries correctly. If you are using your own equipment, you will need to follow the instructions given with your tools, such as miniconda, to complete the installation.
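However you install them, a quick sanity check along these lines (a suggestion, not part of the module's code) will confirm that all of the libraries can be imported:

```python
# Confirm that each required library is installed and importable.
import numpy
import matplotlib
import jsonpickle
import spacy
import sklearn
import tweepy

for module in (numpy, matplotlib, jsonpickle, spacy, sklearn, tweepy):
    print(module.__name__, getattr(module, "__version__", "version unknown"))
```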
Completing these modules requires a Twitter account with developer privileges. To get one, you can either create an individual Twitter developer account or join an organizational account created by an instructor. As of October 2018, individual developer accounts require an approval process that has taken two weeks or longer, making educational group accounts an attractive option.
Instructions for both approaches are given below, based on Twitter facilities as of October 2018. Although every attempt will be made to keep these materials up to date, please note that details may change. If any aspect of this document seems out of date, please file a GitHub issue.
To create an individual developer account:
- Go to Twitter's developer site
- Click on the "Apply" link. You wil be required to login to Twitter if you haven't already done so
- Complete the application. When asked for Account details, select "I am requesting access for my own personal use".
- Fill out the form and indicate as best you can what you are building - say you are exploring the application of natural language processing and machine learning to tweets.
- Submit the forms and wait for approval.
Please note that approval might take two weeks or more.
If you are an instructor, please visit the Twitter for education playbook and follow the instructions. As part of this process, you will collect Twitter IDs from your students and provide them to Twitter for access.
Note that you will be asked to verify that you are a legitimate human educator.
If you are a student, work with your instructor to ensure that they create an appropriate account and invite you to the resulting organization.
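Whichever route you take, you will eventually have four credentials from Twitter: a consumer key and secret, and an access token and secret. As a rough sketch of how the modules will use them with tweepy - the placeholder strings below are yours to replace, and the module's actual code may differ in detail - an authentication check looks like this:

```python
import tweepy

# Placeholders: substitute the credentials from your Twitter developer app.
consumer_key = "YOUR_CONSUMER_KEY"
consumer_secret = "YOUR_CONSUMER_SECRET"
access_token = "YOUR_ACCESS_TOKEN"
access_token_secret = "YOUR_ACCESS_TOKEN_SECRET"

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

# verify_credentials() raises an error if the keys are wrong.
print(api.verify_credentials().screen_name)
```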
GitHub and GitLab are two popular community sites based on the Git source-code control system. We're going to use Git to create your own local copy of these modules, and to store any changes. We'll tell you a bit about it here, but there's much more to learn - for more information on Git, see git-scm.com.
For now, go to either GitHub or GitLab and create an account. Remember your account name.
The Jupyter notebook exercises for these modules are contained in a GitHub repository - a collection of related files managed using the Git source code control system. To do your work for these modules, you will need your own personal copy of this repository, stored in your GitHub or GitLab account. Below, we give separate descriptions for GitHub and GitLab.
To do this in GitHub, you will need two browser windows:
- The first should point to the GitHub.com repository that this file is in. You're probably on that page right now as you read this file.
- The second should show your Jupyter notebook or JupyterLab home page.
Once this is set up, we can go to work.
- In the window on your GitHub repository page, press the "fork" button, found just below your avatar and user name in the title bar. It will ask you "Where should we fork this repository?". Choose the selection corresponding to your GitHub user name. The system will ask you to wait, and it will point you to a new page with a copy of this repository under your account. This is now your repository to use as you will, without any fear of damaging the main repository.
- On this page, there is a "Clone or download" button. Click on this button and copy the link that shows up. You will need this URL in step 5.
To do this in GitLab:
- Go to your GitLab home page.
- Click "+" in the top menu bar.
- Select "New Project"
- Click "Import Project" and then "Git Repo by URL"
- At the top of the page containing this document, click on "Clone or download" and copy the URL.
- Paste this URL into the "Git repository URL" text box on the GitLab import page.
- Click "create project". You can keep the repository to start.
- You'll see a message indicating that something is happening.
- When it is done, go to your home page. You should see a new repository. Go to the home page of that repository. There will be a box under the repository title that says "SSH" with a URL next to it. Click on the "SSH" button to change it to "HTTPS".
- Copy the URL next to the "HTTPS" button. You will need this URL in step 5.
At this point, you should have the URL for your own personal copy of this repository. You will now need to clone it into your Jupyter environment.
- Go to your Jupyter home page. Click "New" on the top right, and select "Terminal". This will create a Linux command-line terminal window in the browser. Alternatively, if you are using JupyterLab, press the "Terminal" button in the Launcher screen.
- run "git clone .." followed by the URL of your repository. You will need to provide your GitHub or GitLab user name and password.
- In a new browser tab (or in an existing browser tab if you already have it open), go to the home page of your Jupyter environment. You will see a new directory, likely entitled "SocialMediaDataScience". This is where you will do your work. If you are using JupyterLab, this directory will appear in the file browser on the left.
Here, you're going to start working in the Notebooks.
- From the Jupyter home page, select the folder "SocialMediaDataScience." If you are using JupyterLab, you'll see this folder listed on the file chooser on the left-hand side.
- You'll see several files listed, including notebooks with names like "SocialMedia - Part 0.ipynb", numbered from Part 0 through Part 5.
- Click on "Part 0" and start the notebook.
- Read through and execute the code in the notebook, going from part 0 to part 5 in order. You can add cells to experiment and run code as you like.
- You will be asked to create API Keys for Twitter and to save them in the notebooks. Please do that where indicated.
- Each part of the module will have exercises, with cells indicating where you should submit answers. Use these parts of the notebook to work on the exercises and to show your work.
- Every once in a while, you will want to hit "Save and Checkpoint" to save your work. This will also happen automatically, but it doesn't hurt to do it manually as well.
- Each of the 5 parts has one or more exercises to go through, with indications of where your work should be added. Add cells and show the proper execution of your code, saving the worksheets when you are done.
If you ever have trouble running the code, it may be because you are running the wrong version of Python. To change this, look under the "Kernel" menu on the notebook page and switch to a kernel that specifies Python 3.6; the version should be clear in the kernel's name. If this doesn't work, you might have to ask your instructor (if you are using an environment provided by the school) or revisit your installation to ensure that you are using Python 3.6.
As you do your work on these modules, you will change the notebooks and add files to the directory. To keep track of these artifacts, you will need to "push" them back to GitHub/GitLab. To do this, follow these instructions:
- From the Jupyter home page, create a terminal window. As described above, click "New" on the top right and select "Terminal".
- In the terminal window, change into the directory for this project - "cd SocialMediaDataScience"
- Run "git status". This will tell you which files have been added or modified.
- For each file listed by "git status", type "git add" followed by the filename and press return. This stages the files to be added or updated in the repository.
- Type "git commit -m 'some message'", replacing 'some message' with an informative description of your changes. For example, you might type git commit -m 'finished part 1'.
- Type "git push origin" and press return. You may see a warning message (which you can ignore). You will then be asked for your GitHub/GitLab user name and password.
- Go to your home page on GitHub/GitLab and look at the repository. You will see the updated changes.
When you are done with all of the exercises, you will proceed as follows:
- Edit all of the files to remove your API keys
- Push your work (as in Step 6, above)
- Make the repository public, following the directions below
- Send the repository link to your instructor.
On GitHub:
- From the repository home page, click on the "Settings" gear icon.
- Scroll to the "Danger Zone" box.
- Click on "Make public" and follow the directions.
On GitLab:
- On the project page, click on "Settings" in the left bar.
- Click on "General"
- Find "Permissions" and click on "Expand"
- Go to "Project visibility" and change it to "Public"
- Click "Save Changes".
Upon completion of this module, students will be able to:
- Understand the use of Application Programming Interfaces (APIs) to retrieve data from sites such as Twitter.
- Understand the structure and content of resulting data
- Use and extend a Python class definition for managing extracted social media data, using Twitter as an example
- Explore resulting social media data for patterns of authorship and other metadata.
- Annotate/classify social media posts for further analysis.
- Identify and discuss basic Natural Language Processing steps, including tokenization, lemmatization, part-of-speech tagging, and named entity recognition.
- Use and extend code for executing key natural language processing pipeline steps.
- Appreciate the relevance of vectorization for machine-learning classification of texts.
- Convert tweets into appropriate vector representations.
- Verify the output of a vectorizer.
- Divide a dataset into test and train sets for machine learning.
- Verify the distribution of classes into test and train sets.
- Train and evaluate an SVM-based classifier (a preview of this pipeline is sketched after this list).
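As a preview of the pipeline these objectives build toward, here is a minimal scikit-learn sketch. The tiny example texts and labels are hypothetical stand-ins for annotated tweets, and the vectorizer and classifier settings are illustrative choices, not the module's exact code:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Hypothetical annotated tweets: short texts with binary labels
# (1 = smoking/vaping related, 0 = not). Real data comes later.
texts = ["trying to quit smoking again", "lovely weather today",
         "vaping at the bus stop", "my cat ignores me"] * 10
labels = [1, 0, 1, 0] * 10

# Convert the tweets into vector representations.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Divide the dataset into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=42)

# Train and evaluate an SVM-based classifier.
classifier = SVC(kernel="linear")
classifier.fit(X_train, y_train)
print(classification_report(y_test, classifier.predict(X_test)))
```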
These modules are intended for upper undergraduate or first-year graduate students.
Students should have some familiarity with Python programming, including at least basic exposure to object-oriented programming.
Social media has become a useful source of information about trends in perceptions and attitudes towards various health questions. This module challenges students to learn how to retrieve social media data and to use natural language processing to extract key trends and to classify messages based on annotated examples.
The module begins with simulated tweets about smoking and vaping, hand-crafted to resemble plausible content. Subsequent data is retrieved from Twitter using the Twitter developer API.
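For the live-data portion, retrieval looks roughly like the following sketch, assuming tweepy 3.x (current as of October 2018) and the placeholder credentials described earlier; the query term is just an example:

```python
import tweepy

# Placeholder credentials, as in the earlier authentication sketch.
auth = tweepy.OAuthHandler("YOUR_CONSUMER_KEY", "YOUR_CONSUMER_SECRET")
auth.set_access_token("YOUR_ACCESS_TOKEN", "YOUR_ACCESS_TOKEN_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

# Search Twitter's standard search API for recent tweets on a topic;
# tweet_mode="extended" avoids truncated text.
for tweet in api.search(q="vaping", count=10, tweet_mode="extended"):
    print(tweet.user.screen_name, "-", tweet.full_text)
```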
- Manipulation of JSON Tweet data structures
- Use of an object class for managing tweets
- Creation of scikit-learn vector representation of documents
- Dividing datasets into train and test subsets.
- Retrieval of Tweets via Twitter API
- Descriptive summaries of Tweet attribute distributions
- Annotation of tweets with free-text codes
- NLP Parsing with spaCy
- Basic SVM Classification with scikit-learn
- Graphs from matplotlib
- Basic descriptive statistics
- Calculation of precision and recall
- Serialization and reuse of tweets in JSON format (see the sketch after this list)
- Jupyter notebooks documenting processing steps.
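The jsonpickle-based serialization mentioned above works roughly as follows; the tweet record here is a hypothetical stand-in for the richer structures used in the module:

```python
import jsonpickle

# A hypothetical tweet-like record; the module uses richer structures.
tweet = {"id": 1, "text": "thinking about quitting smoking", "user": "example"}

# Serialize to a JSON string and write it out for later reuse.
with open("tweets.json", "w") as out:
    out.write(jsonpickle.encode(tweet))

# Later: read the file back and restore the original object.
with open("tweets.json") as source:
    restored = jsonpickle.decode(source.read())
assert restored == tweet
```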
- Exploring distributions of key values in a dataset
- Using REST APIs to retrieve data from web servers
- Qualitative data annotation
- NLP Parsing (see the spaCy sketch after this list)
- Preparation of data for machine learning
- Evaluation of classifiers
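The NLP parsing steps named earlier (tokenization, lemmatization, part-of-speech tagging, and named entity recognition) are handled in the module with spaCy. A minimal sketch, assuming the small English model (en_core_web_sm) has been downloaded:

```python
import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Trying to quit smoking before my trip to Pittsburgh in May.")

# Tokenization, lemmatization, and part-of-speech tags.
for token in doc:
    print(token.text, token.lemma_, token.pos_)

# Named entity recognition.
for ent in doc.ents:
    print(ent.text, ent.label_)
```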
- What are some of the challenges associated with using Twitter data?
- Why is NLP on Twitter data different from NLP on other data sets, such as more familiar English prose or clinical documentation?
- Which parts of the NLP pipeline work well, and which don't? How might the less well-performing components be improved?
- How large of an annotated dataset might be needed to build a basic classifier?
- What additional tools and code infrastructure would be needed to broaden the processes used in this module to other datasets?
These instructions are, we think, accurate at the time of writing. Please submit issues with any difficulties or inaccuracies.