Ruby OAI interface for harvesting the arXiv. Can be used to store and update an XML mirror of paper metadata, and parse the XML into Ruby objects to allow conversion into a friendlier format.
gem install arxivsync
Use the included shell command:
arxivsync ARCHIVE_DIR
This stores each XML response as an individual file, each containing up to 1000 records. Following an initial harvest, you can rerun this to add additional files containing all records since the last harvest.
Remember to leave at least a day between syncs: the temporal granularity of a harvest doesn't go any smaller than that!
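If you run the sync from a script or cron job, a small guard can enforce the one-day gap. The following is only a sketch: `sync_due?` is a hypothetical helper, not part of arxivsync, and it assumes the modification time of the newest file in the archive directory is a fair proxy for when the last harvest ran.

```ruby
SECONDS_PER_DAY = 24 * 60 * 60

# Hypothetical helper (not part of arxivsync): returns true when the newest
# file in archive_dir is at least a day old, or when the directory holds no
# files yet (i.e. an initial harvest is needed).
def sync_due?(archive_dir, now: Time.now)
  files = Dir.glob(File.join(archive_dir, "*")).select { |f| File.file?(f) }
  return true if files.empty?

  newest = files.map { |f| File.mtime(f) }.max
  now - newest >= SECONDS_PER_DAY
end
```

A wrapper script could then run the `arxivsync ARCHIVE_DIR` command only when `sync_due?(ARCHIVE_DIR)` returns true.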
archive = ArxivSync::XMLArchive.new("/home/foo/savedir")
archive.read_metadata do |papers|
  # Papers come in blocks of at most 1000 at a time
  papers.each do |paper|
    # Do stuff with papers
  end
end
This parses the XML files with a SAX parser and yields Structs representing the metadata as it goes. The returned structures closely match the arXivRaw format.
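To illustrate the general SAX-to-Struct idea (not arxivsync's actual implementation), here is a minimal sketch using REXML's stream parser, which ships with Ruby. The `Paper` struct, the `PaperListener` class, and the `parse_papers` helper are all hypothetical, and the sketch assumes each record is wrapped in an `arXivRaw` element and picks out only three fields; the real structs carry more.

```ruby
require "rexml/document"
require "rexml/streamlistener"

# Stand-in struct for illustration; not the struct arxivsync yields.
Paper = Struct.new(:id, :title, :categories)

# Collects text for a few fields and hands over a Paper whenever a record closes.
class PaperListener
  include REXML::StreamListener

  FIELDS = %w[id title categories].freeze

  def initialize(&block)
    @on_paper = block
    @paper = nil
    @field = nil
  end

  def tag_start(name, _attrs)
    if name == "arXivRaw"
      @paper = Paper.new
    elsif @paper && FIELDS.include?(name)
      @field = name
      @buffer = +""
    end
  end

  def text(data)
    @buffer << data if @field
  end

  def tag_end(name)
    if name == @field
      value = @buffer.strip
      # Categories are space-separated in this sketch's input.
      @paper[name] = name == "categories" ? value.split : value
      @field = nil
    elsif name == "arXivRaw"
      @on_paper.call(@paper)
      @paper = nil
    end
  end
end

# Parses an XML string and returns the papers found, in document order.
def parse_papers(xml)
  papers = []
  listener = PaperListener.new { |paper| papers << paper }
  REXML::Document.parse_stream(xml, listener)
  papers
end
```

Because the listener emits each paper as soon as its closing tag is seen, the whole document never has to be held in memory as a tree, which is the main reason to prefer SAX-style parsing for large harvest files.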
If you just want arxivsync to do the request-cycle and parsing bits but handle storage yourself:
ArxivSync.get_metadata(oai_params) do |resp, papers|
  papers.each do |paper|
    # Do stuff with paper
  end
end
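As one way of handling storage yourself, the sketch below writes each paper as a line of JSON. `PaperRecord` and `write_jsonl` are hypothetical names introduced for illustration; the struct is a flat stand-in, and the real structs may contain nested values that need their own conversion before `to_h` gives a JSON-friendly hash.

```ruby
require "json"

# Stand-in for the structs arxivsync yields; fields chosen for illustration only.
PaperRecord = Struct.new(:id, :title, :categories)

# Writes one JSON object per line to io and returns the number of papers written.
def write_jsonl(papers, io)
  papers.each { |paper| io.puts(JSON.generate(paper.to_h)) }
  papers.size
end
```

Inside the `get_metadata` block above, something like `write_jsonl(papers, file)` with an open `File` would append each batch as it arrives, assuming the yielded structs convert cleanly.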
- Fork it
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create new Pull Request