# Build all three self-study PDFs.
ss: self_study/data_handling.pdf \
    self_study/data_retrieval.pdf \
    self_study/machine_learning.pdf

# Pattern rule: render each Markdown source to a PDF with pandoc and xelatex.
self_study/%.pdf: self_study/%.markdown
	pandoc --pdf-engine=xelatex $< -o $@

---
title: 'Self-study track: data handling'
author: 'Jason T. Kiley'
date: 'January 12, 2020'

geometry: margin=1.1in
fontsize: 12pt
mainfont: 'Source Serif Pro'
monofont: 'Source Code Pro'
urlcolor: 'Blue'

---

# Data Handling

This data handling self-study track is designed to help you handle and transform data in ways that are more sophisticated than most commercial stats packages allow and that scale to larger datasets (up to about 100 GB).

In many text analysis projects, one of the key challenges is handling a large amount of text data in ways that allow us to perform content analysis and then merge those results back into our archival data.
Python and pandas make many things easy, and they make some complex things quite straightforward.
One limitation of pandas (and commercial stats packages like Stata) is that the computer needs enough memory to hold an entire dataset at once.
Using a database management system ("DBMS"), even a very lightweight one like SQLite (which is built into Python), allows us to overcome that limitation in a way that works smoothly with pandas.
Structured Query Language ("SQL") is the language that we use to interact with a DBMS.
Combined, these tools allow us to efficiently---both in terms of our time and computational resources---handle nearly any kind of text data up to about 100 GB in size.
For comparison, I have a 2 million article full-text news database that is 11 GB.

My own research has benefitted greatly from adopting Python and pandas for data handling.
I have assembled datasets in a few hundred lines of Python code that are larger and much more complex than datasets that required thousands of lines of Stata code to assemble.
As one small example, Python (and SQL) can do what is called a non-equi-join.
This is a merge where, instead of matching on the equality of two columns, it allows less-than and greater-than comparisons.
These are really helpful when matching on dates such that you want the most recent match.
This is possible in Stata, but you have to write the logic yourself, instead of using the straightforward `pd.merge_asof()` in pandas (see [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge_asof.html)).
Also, we can easily do things like querying the news articles for an acquiring firm for the three months before an acquisition.
That requires a custom query for every row in the dataset; we can automate those queries with Python and pandas, whereas it would be a large amount of work with numerous intermediate steps in Stata.

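To make the non-equi-join idea concrete, here is a minimal sketch of an as-of merge with `pd.merge_asof()`; the firm identifiers, column names, and values are made up for illustration.

```python
import pandas as pd

# Firm-level events (e.g., acquisitions) and an earlier table of articles.
events = pd.DataFrame({
    'gvkey': [1001, 1001, 2002],
    'event_date': pd.to_datetime(['2019-06-30', '2019-12-31', '2019-09-30']),
})
news = pd.DataFrame({
    'gvkey': [1001, 1001, 2002],
    'article_date': pd.to_datetime(['2019-05-01', '2019-11-15', '2019-01-10']),
    'tone': [0.2, -0.1, 0.4],
})

# merge_asof requires sorting on the date keys; 'by' matches within firm,
# and each event picks up the most recent article on or before its date.
merged = pd.merge_asof(
    events.sort_values('event_date'),
    news.sort_values('article_date'),
    left_on='event_date',
    right_on='article_date',
    by='gvkey',
)
print(merged)
```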
One topic is largely absent from this self-study: database design.
The two reasons for that are (a) you will pick up enough of the essentials as you go, and (b) it is not, overall, a high return-on-investment skill for research.
We typically use databases as a workaround for memory limitations, so we are less bothered by "bad" design that would create real issues for typical databases that have many transactions and applications built on top of them.

# Resources

1. Wes McKinney's book, [*Python for Data Analysis*](https://wesmckinney.com/pages/book.html). This book covers a lot of ground in a practical way. Do note that pandas is under very active development, so some things may work slightly differently on the most current version. It helps to use the code samples on the book's [GitHub repository](https://github.com/wesm/pydata-book), as those can be updated by the author.
1. Read [*Teach Yourself SQL in 10 Minutes*](https://forta.com/books/0672336073/), 4th edition, by Ben Forta. The title is perhaps a bit optimistic, though the design of the book (like others in the series) is that it teaches you one specific topic in each of many short chapters. I like this book because it quickly gives you enough of a skillset to handle some common tasks. It's not that deep, but it is very high ROI. Some notes:
    - This is also on the data retrieval self-study, reflecting how useful SQL is in general.
    - SQL has a standard version (called ANSI SQL) and many DBMS-specific customizations. As a result, you will often find yourself searching to find out how to do a particular thing with a particular DBMS (see, e.g., [datetimes in SQLite](https://stackoverflow.com/questions/12406295/how-to-query-in-sqlite-for-different-date-format)).
    - I suggest using the SQLite built into Python (see, e.g., [a tutorial](https://likegeeks.com/python-sqlite3-tutorial/)) for learning and for most project use. For a lot of things we do, it works fine, and it's a large step up in learning and complexity to run a server-based DBMS like PostgreSQL. A minimal sketch of querying SQLite from pandas appears after this list.
1. (optional) Read [*SQL Queries for Mere Mortals*](https://www.pearson.com/us/higher-education/program/Viescas-SQL-Queries-for-Mere-Mortals-A-Hands-On-Guide-to-Data-Manipulation-in-SQL-4th-Edition/PGM1937355.html), 4th edition, by John L. Viescas. This book is also designed for beginners, but it is more comprehensive (if less efficient) than the prior book. As such, it can help you understand why things are designed as they are, and that knowledge is often helpful when solving problems.

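
As a minimal sketch of that SQLite workflow (the database file, table, and column names here are made up for illustration), you can push a table into SQLite once and later pull back only the rows you need with SQL:

```python
import sqlite3
import pandas as pd

# Connect to (or create) a local SQLite database file; ':memory:' also works.
con = sqlite3.connect('articles.db')

# Load a (here, tiny and made-up) table into SQLite once.
articles = pd.DataFrame({
    'firm_id': [1001, 1001, 2002],
    'article_date': ['2019-01-15', '2019-06-01', '2019-02-20'],
    'body': ['...', '...', '...'],
})
articles.to_sql('articles', con, if_exists='replace', index=False)

# Later, pull back only the slice you need instead of the whole table.
query = """
    SELECT firm_id, article_date, body
    FROM articles
    WHERE article_date BETWEEN '2019-01-01' AND '2019-03-31'
"""
subset = pd.read_sql_query(query, con)
con.close()
```
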
---
title: 'Self-study track: data retrieval'
author: 'Jason T. Kiley'
date: 'January 12, 2020'

geometry: margin=1.1in
fontsize: 12pt
mainfont: 'Source Serif Pro'
monofont: 'Source Code Pro'
urlcolor: 'Blue'

---

# Data Retrieval

This data retrieval self-study track is designed to help you programmatically access, retrieve, and structure data of interest in research.
The techniques covered include web scraping, API access, and SQL.

The internet has a wealth of text and data sources---from press releases to Kickstarter to social media---that can allow us to examine novel research questions that are hard to examine otherwise.
In some cases, these sources are accessible on the web, though they often require some transformation to extract the data that we want.
These techniques work best when there is one website (or a small number of them) with a lot of data.
This is true because we generally have to write bespoke code for each website that we want to gather data from (and sometimes for each section of a site).
There are some great Python packages that do most of the heavy lifting, but we have to link them together.
As a result, this web-scraping topic is perhaps the most independent topic across these self-study tracks.

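As a rough illustration of how those pieces link together (the URL and the tags extracted here are hypothetical placeholders, not a specific recipe), a basic scrape with `requests` and BeautifulSoup might look like this:

```python
import time

import requests
from bs4 import BeautifulSoup

# Hypothetical target; substitute the page you want (and are allowed) to scrape.
url = 'https://www.example.com/'

# Retrieve the raw HTML.
response = requests.get(url)
response.raise_for_status()

# Parse it and navigate to the elements we care about.
soup = BeautifulSoup(response.text, 'html.parser')
titles = [h2.get_text(strip=True) for h2 in soup.find_all('h2')]
links = [a['href'] for a in soup.find_all('a', href=True)]

# Be cool: pause between requests when looping over many pages.
time.sleep(10)
```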
In some cases, services and data providers offer a structured way to obtain data.
Some of these use an Application Programming Interface ("API"), and these often have tools designed to interface with them directly.
For example, the `pandas-datareader` Python package provides an interface to [many sources](https://pandas-datareader.readthedocs.io/en/latest/remote_data.html) that provide structured data, including many financial data sources, FRED, Eurostat, and OECD.
In addition, sources like Wharton Research Data Services ("WRDS") have their own packages that allow data access, in this case using SQL to specify queries.

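For instance, a minimal sketch of pulling a FRED series with `pandas-datareader` (the series code and date range are just an example):

```python
import datetime

import pandas_datareader.data as web

start = datetime.datetime(2015, 1, 1)
end = datetime.datetime(2019, 12, 31)

# Quarterly US GDP from FRED, returned as a pandas DataFrame.
gdp = web.DataReader('GDP', 'fred', start, end)
print(gdp.tail())
```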
# Resources

1. Web scraping in its most basic form generally involves retrieving the HTML for a page and navigating the contents using a Python package called [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/).
    - These [video tutorials](https://www.youtube.com/playlist?list=PLQVvvaa0QuDfV1MIRBOcqClP6VZXsvyZS) do a good job of walking you through the basics and some of the common problems. Note, the author is using a text editor and a terminal, but you can do this in Jupyter instead.
    - Some folks I know like to use [scrapy](https://scrapy.org) to automate the retrieval part. I tend to write my own with the [requests](https://2.python-requests.org/en/master/) package, but scrapy may be a higher ROI for you.
    - As I mention in the workshop, I like to start with getting what I want from one target page, then building the automation to find, retrieve, and process a lot of pages. Sometimes, the processing is complex and there are many pages to gather, so you might choose to retrieve and save the pages. While those are being retrieved, you can work on developing the processing using the saved files. Just be sure that you get the processing working and then process all pages with the same code to keep all of your data consistent.
    - Websites don't like it when web scrapers are heavy users of their servers. In general, my advice is to be cool about it. Don't pull more than a page every 10 seconds or so (maybe slower for smaller sites).
    - For some commercial subscription sources, scraping is against their terms of service, so be careful. Often, asking your research librarians about the data is a better idea, because (a) they can clear it with the provider, and (b) sometimes they will query their data and send you structured data (which is much less work).
1. Some web pages send your web browser code that it runs in order to get the data. As a result, our standard techniques of requesting pages and finding the content we want within them will not work. [Selenium](https://www.seleniumhq.org) is a browser-automation tool (with a Python package) that allows us to programmatically control a web browser, have it open pages for us (including executing dynamic code), and then retrieve the data.
    - [This tutorial](https://www.freecodecamp.org/news/better-web-scraping-in-python-with-selenium-beautiful-soup-and-pandas-d6390592e251/) is a good place to get started.
    - Some social media sites load more data when you scroll to the bottom of the page, so scrolling down and pausing while waiting for more data (often 3-5 seconds) can often be enough of a rate limit to be cool, as I mention in the prior resource.
1. As I mentioned above, [pandas-datareader](https://pandas-datareader.readthedocs.io/en/latest/index.html) provides an interface to a number of high-quality data sources. In addition, if you start building your own more advanced tools, the project structure using classes is a good exemplar.
1. Read [*Teach Yourself SQL in 10 Minutes*](https://forta.com/books/0672336073/), 4th edition, by Ben Forta. The title is perhaps a bit optimistic, though the design of the book (like others in the series) is that it teaches you one specific topic in each of many short chapters. I like this book because it quickly gives you enough of a skillset to handle some common tasks. It's not that deep, but it is very high ROI. Some notes:
    - This is also on the data handling self-study, reflecting how useful SQL is in general.
    - SQL has a standard version (called ANSI SQL) and many DBMS-specific customizations. As a result, you will often find yourself searching to find out how to do a particular thing with a particular DBMS (see, e.g., [datetimes in SQLite](https://stackoverflow.com/questions/12406295/how-to-query-in-sqlite-for-different-date-format)).
    - I suggest using the SQLite built into Python (see, e.g., [a tutorial](https://likegeeks.com/python-sqlite3-tutorial/)) for learning and for most project use. For a lot of things we do, it works fine, and it's a large step up in learning and complexity to run a server-based DBMS like PostgreSQL.
    - For retrieval, WRDS has a [Python package](https://github.com/wharton/wrds) that allows you to access WRDS using SQL queries in Python. Do note that, like when developing any query, you should use a `LIMIT` clause with a reasonable limit (I often use 100 or 1000) until you are satisfied that it is doing exactly what you want. A short sketch follows this list.

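
A minimal sketch of that kind of WRDS query, assuming you already have WRDS credentials set up (the library, table, and columns here, drawn from Compustat's `comp.funda`, are just an example):

```python
import wrds

# Opens a connection; prompts for (or reads cached) WRDS credentials.
db = wrds.Connection()

# Keep a LIMIT on the query while you are still checking that it is right.
query = """
    SELECT gvkey, datadate, sale
    FROM comp.funda
    WHERE datadate BETWEEN '2018-01-01' AND '2018-12-31'
    LIMIT 100
"""
firms = db.raw_sql(query)
db.close()
```
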
---
title: 'Self-study track: machine learning'
author: 'Jason T. Kiley'
date: 'January 12, 2020'

geometry: margin=1.1in
fontsize: 12pt
mainfont: 'Source Serif Pro'
monofont: 'Source Code Pro'
urlcolor: 'Blue'

---

# Machine Learning

This machine learning self-study track is designed to start you on the path to a solid practical understanding of machine learning that allows you to train models and get results, followed by an optional deeper dive into the underlying math.

In research, we often use machine learning for certain types of tasks.
First, we use it to do a good-enough job of labeling data for us when that labeling is too time-consuming or poorly suited for human coding.
For example, instead of classifying tens of thousands of press releases, we can classify up to a few thousand, and use those to train a model to classify the others.
Second, we use unsupervised learning techniques, like k-means clustering or topic models, to find groups in observations based on the features that we specify.
Then, we have to interpret the meanings of those groupings, which can sometimes reveal something interesting.
This self-study focuses on the former (i.e., supervised learning), not the latter, though there is interesting work done with both.

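As a small illustration of that supervised workflow (the texts and labels are made up, and scikit-learn is used here as a simple baseline even though the resources below focus on other tools), you train on the hand-coded subset and then classify the rest:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hand-coded press releases (made-up examples) with binary labels.
texts = [
    'We are pleased to announce record quarterly earnings.',
    'The company today announced a workforce reduction.',
    'Our new product launch exceeded expectations.',
    'The firm disclosed an accounting restatement.',
]
labels = [1, 0, 1, 0]  # e.g., 1 = positive framing, 0 = negative framing

# Bag-of-words features plus a simple classifier as a baseline model.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# Classify the (much larger) unlabeled set of press releases.
unlabeled = ['The company announced a major acquisition today.']
print(model.predict(unlabeled))
```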
Because most state-of-the-art results use deep learning---a type of neural network that uses multiple layers to model higher-level features from the input---I have emphasized resources that cover those techniques well.
However, as we discuss in the workshop, start with simpler models and work up to the more complex ones.
Doing so allows you to see what increasingly-complex methods actually gain over simpler methods, and those comparisons are part of what you should report in papers (in part to inform future researchers).

# Resources

1. Andrew Ng's [Machine Learning Coursera course](https://www.coursera.org/learn/machine-learning). This course is really well designed, and it is perhaps the most popular machine learning resource (and for good reason). It teaches you a lot of practical concepts with exposure to the underlying math, but without proofs or assuming multivariable calculus. A few tips:
    - The course uses a language called Octave, which is an open-source clone of Matlab (which would work, too). Even though you will likely do real work with Python, I suggest using Octave here. In practice, you won't be implementing anything this low level, but it's helpful to understand how things work and why.
    - The course time estimate is about 56 hours in total, and that is fairly accurate overall, though my time is distributed differently than the individual item estimates suggest. I usually take about twice the length of each video (pausing and making notes), and the reading goes much faster than estimated.
    - If you are tempted to start running your own data, wait and use `tensorflow`, which is covered in the next item (and does a lot of work for you). Once you're through Week 4 of this course, you could start the next one. However, I suggest working through this course's coverage of a topic before tackling the same topic in the next one.
1. Coursera [TensorFlow in Practice Specialization courses](https://www.coursera.org/specializations/tensorflow-in-practice) (particularly parts I and III, though the other parts cover some methods with applications on text data). These are very practical courses that walk you through actually using tensorflow to specify and train models. Some notes:
    - I find the estimated times to be much higher than the time I need to complete them. Some sections that are estimated at 3-4 hours only took me about an hour.
    - I suggest using tensorflow 2.0, which these courses support. However, they don't mention some code changes that are necessary. In particular, the new version has friendlier syntax for specifying some parameters, so, for example, you would specify the optimization algorithm using `optimizer='adam'`, not `optimizer=tf.train.AdamOptimizer()`. The activation functions are also similarly changed in 2.0 (see [documentation](https://www.tensorflow.org/beta/guide/migration_guide)). A minimal sketch using the 2.0 syntax appears after this list.
1. [Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/), 3rd ed. draft, by Jurafsky and Martin. This book is more directly about text and the kinds of features we extract from text, and it also has some coverage of using these methods to get at psychological constructs.
1. Google's [Text classification](https://developers.google.com/machine-learning/guides/text-classification) guide. In particular, look at their advice on choosing a model based on the number of samples and length of text.
1. (optional) If you want to dive deeper into the math-heavy treatments of these topics and want some additional review of the underlying math, the [Mathematics for Machine Learning Specialization](https://www.coursera.org/specializations/mathematics-machine-learning) sequence may be helpful.
1. (optional) Read either [*An Introduction to Statistical Learning*](http://faculty.marshall.usc.edu/gareth-james/ISL/) ("ISL") by James, Witten, Hastie, and Tibshirani or [*The Elements of Statistical Learning*](https://web.stanford.edu/~hastie/ElemStatLearn/) ("ESL") by Hastie, Tibshirani, and Friedman. The former is less math-heavy (and generally more approachable for non-stats/ML PhDs), though it omits coverage of neural networks. Ideally, you would have access to both, but I would read ISL and then look at ESL if you want to go deeper (and for neural networks).
1. (optional) Read [*The Deep Learning Book*](https://www.deeplearningbook.org) by Goodfellow, Bengio, and Courville. This is a math-heavy book that provides thorough coverage of deep learning specifically. The important part for our purposes is Part II (and, optionally, Part I, which should largely be review). Even if the math is a bit abstract at times, the text does a great job of explaining the concepts.

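
As a small sketch of the TensorFlow 2.0 syntax mentioned above (the data and layer sizes are placeholders, not a recommended architecture):

```python
import numpy as np
import tensorflow as tf

# Placeholder data: 100 examples with 20 features and binary labels.
X = np.random.rand(100, 20).astype('float32')
y = np.random.randint(0, 2, size=100)

# A small feed-forward model using the Keras API.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(20,)),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])

# TF 2.0 accepts friendly string names for optimizers, losses, and metrics.
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X, y, epochs=5, verbose=0)
```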