Description
Proposed itinerary at bottom :)
I realized my last description on the Slack left a bit to be desired, so I wanted to flesh it out:
What I'm proposing is a media citation and reference crawler that can produce reference trees for analysis and for determining the strength of a source (with respect to how well it backs itself up with citations, at least).
Let's say, for instance, that you take a Washington Post article...
You would then grab only the body content of the article itself with a web scraper and pull out every `<a href="...">` tag from it. You could also save the context of each tag--all the paragraph text surrounding it--tagging the words and content used to frame the reference as the source/description/lead-up/etc. This could be done with something like Python's lxml package and a little tree traversal, but let's forget those implementation details for now.
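To make that step concrete, here's a minimal sketch using requests and lxml. The `//article` body selector and the "enclosing element as context" heuristic are assumptions for illustration; real sites will need their own rules.

```python
# Minimal sketch of <a> tag farming with lxml; the body selector and the
# context heuristic (enclosing element's text) are illustrative assumptions.
import requests
from lxml import html

def extract_references(url):
    page = html.fromstring(requests.get(url).content)
    references = []
    for body in page.xpath("//article"):            # assumed body container
        for anchor in body.xpath(".//a[@href]"):
            parent = anchor.getparent()             # crude context: enclosing element
            references.append({
                "href": anchor.get("href"),
                "anchor_text": anchor.text_content().strip(),
                "context": parent.text_content().strip() if parent is not None else "",
            })
    return references
```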
Imagine that this article itself is a node in a larger n-ary tree. Its children and parents are tweets, articles on other sites, government releases, comments and text posts on Reddit, and maybe some other internal articles from Washington Post--all the way down to collected transcripts. Let's call these media nodes for reference. All the articles and their references out there are just hanging out in one big, generally acyclic graph.
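Purely for illustration, a media node could be as simple as a dataclass; the field names here are placeholders, not a settled schema.

```python
# One possible (illustrative) shape for a media node in the reference tree.
from dataclasses import dataclass, field
from typing import List

@dataclass
class MediaNode:
    url: str
    kind: str = "article"            # e.g. "article", "tweet", "reddit_post"
    context: str = ""                # text that framed the reference leading here
    children: List["MediaNode"] = field(default_factory=list)
```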
You could start from any of these potential media nodes and build a tree of sources from a given root media node. You could even allow a user to submit an article, tweet, post, or whatever on a web frontend to generate a tree of a certain (probably limited) depth which they can have visualized.
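A depth-limited tree builder could then look something like this, reusing the illustrative `extract_references()` and `MediaNode` sketches above (cycle handling beyond a simple seen-set, rate limiting, and robots.txt etiquette are omitted):

```python
# Sketch: build a reference tree of limited depth from a root URL.
def build_tree(url, depth, seen=None):
    seen = set() if seen is None else seen
    node = MediaNode(url=url)
    if depth == 0 or url in seen:
        return node
    seen.add(url)
    for ref in extract_references(url):
        child = build_tree(ref["href"], depth - 1, seen)
        child.context = ref["context"]
        node.children.append(child)
    return node

# e.g. tree = build_tree("https://www.washingtonpost.com/some-article", depth=2)
```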
Scaled up, this could also permit analysis of citations from bodies of sources themselves. How often does WaPo (or other media entities) cite external sources? How often does WaPo cite themselves as an entity? What can be said for the authors of articles and comments? What can be said for users on Twitter? Do sources from certain entities tend to fall back on government documents and corroborated sources, tweets from the horse's mouth, or just good-old-fashioned hot air two levels down? How deep does a certain rabbit hole go?
The context saved around each reference can also be used both for pruning and for natural language analysis in a research setting.
You could store trees in a growing document database as well as specialized graph databases like Neo4j and text-search databases like ElasticSearch.
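As a rough idea of the graph side, citation edges could be persisted with the official neo4j Python driver; the `(:Media)-[:CITES]->(:Media)` schema and connection details below are assumptions, not a design decision.

```python
# Sketch of storing one citation edge in Neo4j (official Python driver, 5.x API).
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def store_edge(tx, parent_url, child_url, context):
    tx.run(
        "MERGE (p:Media {url: $parent}) "
        "MERGE (c:Media {url: $child}) "
        "MERGE (p)-[r:CITES]->(c) SET r.context = $context",
        parent=parent_url, child=child_url, context=context,
    )

with driver.session() as session:
    session.execute_write(store_edge, "https://parent.example", "https://child.example", "framing text")
```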
The main catch I see with this is how to work with each specific site. HTML traversal logic can be generified to some extent, but utilities and crawlers for each variety of media node will likely be necessary at some level. Wrappers for Reddit and Twitter could be useful as well. The silver lining with respect to non-API sites is that citations/references seem to just hang out in `a` tags in the middle of `p`, `span`, `em`, and other tags with text content.
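One way to keep the generic traversal separate from per-site quirks would be a small scraper interface that each media node variety implements; the class names below are purely illustrative.

```python
# Illustrative per-site scraper interface plugged into the generic logic.
from abc import ABC, abstractmethod

class Scraper(ABC):
    @abstractmethod
    def can_handle(self, url: str) -> bool: ...

    @abstractmethod
    def references(self, url: str) -> list: ...

class WashingtonPostScraper(Scraper):
    def can_handle(self, url):
        return "washingtonpost.com" in url

    def references(self, url):
        # site-specific body selection would live here; this falls back on
        # the generic <a>-in-text-tags heuristic sketched earlier
        return extract_references(url)
```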
I'd like this to grow to include even media such as Breitbart, Sputnik, and other intensely polarized sites and sources. Scrutiny doesn't need to see party lines or extremes [unless you want to prune or tag those branches (; ].
I imagine this project could also impact existing D4D projects such as assemble and the collaboration with propublica if implemented with a well-documented web API.
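Purely as a sketch of what that web API might look like (the route, parameters, and depth cap below are hypothetical, not a spec):

```python
# Hypothetical Flask endpoint returning a reference tree as JSON.
from dataclasses import asdict
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/tree")
def tree():
    url = request.args.get("url")
    depth = min(int(request.args.get("depth", 2)), 3)   # server-side depth cap
    return jsonify(asdict(build_tree(url, depth)))
```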
Proposed Itinerary for Base API:
- Build base web scraper
- Write logic for generic HTML tree traversal and `a` tag farming (this will likely evolve through each phase with the media node varieties)
- Write logic for scraping mainstream media nodes
- Write logic for working with Reddit, Twitter, and other APIs as media nodes (see the Reddit sketch after this list)
- Write logic for progressively less and less mainstream media nodes--opening the floor to each as an issue and eventually a PR that can be integrated
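For the Reddit item above, a wrapper could lean on PRAW; the credentials are placeholders, and the naive regex over selftext is just for illustration.

```python
# Sketch: treating a Reddit text post as a media node via PRAW.
import re
import praw

reddit = praw.Reddit(client_id="CLIENT_ID", client_secret="CLIENT_SECRET",
                     user_agent="citation-crawler sketch")

def reddit_references(post_url):
    submission = reddit.submission(url=post_url)
    text = submission.selftext or ""
    return [{"href": link, "anchor_text": "", "context": text}
            for link in re.findall(r"https?://\S+", text)]
```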
A frontend and database solution can begin happening once the API and node structure are reasonably solidified, respectively. This will also likely evolve as the project grows.