dodgejoel/mtg_gatherer_database

This bundle of code scrapes Wizards of the Coast's Magic: the Gathering website
at 'http://gatherer.wizards.com' and builds an SQLite database called
mtg_gatherer.db containing all of the card information. First run
'initialize_database' to create the database with the appropriate tables. Next
run 'gather_card_data' to extract all of the data. Finally run
'input_to_database' to load the data into the database. The first and last
programs take very little time to run, but gather_card_data takes about 20
minutes to complete.
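
For orientation, the first step amounts to something like the sketch below.
The table and column names here are hypothetical, not the actual schema
created by initialize_database.

```python
# Rough sketch of step one; the column list is hypothetical and does
# not necessarily match the tables initialize_database really creates.
import sqlite3

conn = sqlite3.connect('mtg_gatherer.db')
conn.execute(
    """CREATE TABLE IF NOT EXISTS cards (
           multiverse_id INTEGER PRIMARY KEY,
           name          TEXT,
           mana_cost     TEXT,
           cmc           TEXT,
           type_line     TEXT,
           card_text     TEXT,
           power         TEXT,
           toughness     TEXT
       )"""
)
conn.commit()
conn.close()
```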

If gather_card_data.py is interrupted and then called again, it will pick up
where it left off; that is, card data that has already been retrieved will not
be fetched again.
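
The resume behaviour boils down to skipping ids whose data has already been
saved. The sketch below only illustrates the idea; the cache file and helper
functions are hypothetical and are not how the script actually stores its
progress.

```python
# Illustration of resume-on-interrupt; the cache file name, helpers and
# id range are hypothetical, not gather_card_data's real mechanism.
import json
import os

CACHE_FILE = 'card_data_cache.json'

def load_cache():
    """Return the card data saved by a previous, interrupted run."""
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE) as f:
            return json.load(f)
    return {}

def save_card(cache, multiverse_id, data):
    """Record one card's data so a rerun can skip it."""
    cache[str(multiverse_id)] = data
    with open(CACHE_FILE, 'w') as f:
        json.dump(cache, f)

cache = load_cache()
candidate_ids = range(1, 101)   # stand-in for the real list of card ids
to_fetch = [m for m in candidate_ids if str(m) not in cache]
```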

This is an ongoing project. The next step is to bundle this into a single
script that, when called once, will execute the three parts of this program
without further user interaction. After this, I will implement a GUI for
interacting with the database. In essence I would like to recreate locally all
of the functionality of the original website.
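
A single-entry-point driver could look roughly like this. It assumes the three
scripts expose importable main() functions, which is an assumption about their
structure rather than how they are currently written.

```python
# run_all.py -- hypothetical one-shot driver for the three stages.
# The main() functions assumed here may not exist in the scripts yet.
import initialize_database
import gather_card_data
import input_to_database

def main():
    initialize_database.main()   # create mtg_gatherer.db and its tables
    gather_card_data.main()      # scrape gatherer.wizards.com (~20 minutes)
    input_to_database.main()     # load the scraped data into the database

if __name__ == '__main__':
    main()
```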

The only required module that is not part of the Python standard library is
bs4 (Beautiful Soup); sqlite3 ships with the standard library.
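
As a quick smoke test that bs4 and the network are working, something like the
following fetches a single Gatherer card page and prints its title. The
multiverseid used is arbitrary and the parsing is generic; the real scraper
extracts far more fields than this.

```python
# Smoke test only: fetch one card-details page and print its <title>.
# The multiverseid is arbitrary; the real scraper parses many fields.
from urllib.request import urlopen

from bs4 import BeautifulSoup

url = ('http://gatherer.wizards.com/Pages/Card/Details.aspx'
       '?multiverseid=1')
with urlopen(url, timeout=30) as response:
    soup = BeautifulSoup(response.read(), 'html.parser')
print(soup.title.get_text(strip=True))
```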

Bugs/Problems: 

1. Fuse/Split/Flip/Werewolf cards are supported but are quirky. Only the name,
mana cost and card text are currently shown for both halves, with each half's
data separated by a '//'. This is a little undesirable as some of these cards'
types/subtypes change upon flipping. The parsing of the type/subtype data is a
little different from the rest of the data, so I am putting this off for now.

Currently, there is a problem with cards that are creatures on one side but
flip to non-creatures. This is sloppy on my part, but I have fixed the problem
of only collecting power and toughness data for one side of the werewolf cards.

Gatherer's treatment of these cards is not uniform, so there are some
oddities. Currently all werewolves have two multiverse_ids, for example,
whereas split cards have only one. This will need to be dealt with in a more
heavy-handed way to avoid duplicate werewolves in the resulting database.
Another small problem with these cards is the computation of the converted
mana cost of split and fuse cards. As it currently stands, the database
entries for these cards have cmcs of the form '3//3', which, being text,
compares as greater than any single number; one possible normalization is
sketched below.
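
A possible fix, not yet implemented, would be to keep the raw '3//3' string
for display but also derive a plain number for sorting and filtering. Summing
the halves is just one convention; taking the maximum or storing both halves
separately would also be reasonable.

```python
# Sketch of a possible cmc normalization, not current behaviour.
def numeric_cmc(raw):
    """Turn a cmc string like '3//3' into a number: '3//3' -> 6, '4' -> 4.
    Summing the halves is one convention; the max would be another."""
    return sum(int(part) for part in raw.split('//'))

assert numeric_cmc('3//3') == 6
assert numeric_cmc('4') == 4
```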

2. There is a difficult-to-reproduce bug involving URL request timeouts: some
of the threads in the gather_card_data script do not finish their tasks and do
not terminate correctly. Perhaps it has something to do with the speed of the
internet connection? It is not a serious problem, as the program can always be
run a second time to clean up the lost jobs, but I still have not understood
it well.
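
One common way to keep a worker from hanging on a slow connection is an
explicit timeout plus a bounded retry loop, as in the sketch below. This is
illustrative only; gather_card_data's threads may be organised quite
differently.

```python
# Illustrative fetch helper with a timeout and retries; not the code
# gather_card_data actually uses in its worker threads.
import time
from urllib.request import urlopen

def fetch_with_retry(url, attempts=3, timeout=30):
    """Return the page body, or None if every attempt fails."""
    for attempt in range(attempts):
        try:
            with urlopen(url, timeout=timeout) as response:
                return response.read()
        except OSError:                # URLError, socket timeouts, etc.
            time.sleep(2 ** attempt)   # back off before retrying
    return None                        # give up; a later run can retry
```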
