Skip to content

improve xml_dump processing. #43

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

groceryheist
Copy link

  1. add lang field to Iterator to store wiki lanaguge.
  2. extract lang from root tag of xml dumps
  3. load siteinfo when it comes in a 'siteinfo' element.

1. add lang field to Iterator to store wiki lanaguge.
2. extract lang from root tag of xml dumps
3. load siteinfo when it comes in a 'siteinfo' element.
@makoshark
Copy link
Contributor

Where are you seeing SHA1s in page elements? Are these in the Wikia dumps?

@groceryheist
Copy link
Author

groceryheist commented May 13, 2017 via email

@halfak
Copy link
Member

halfak commented May 16, 2017

This is soon to be deprecated in favor of https://github.com/mediawiki-utilities/python-mwxml. Transition is slow because I'm the sole maintainer. I wonder if that other library works as expected here.

if tag == "page":
yield Page.from_element(sub_element)
else:
assert MalformedXML("Expected to see 'page'. " +
"Instead saw '{0}'".format(tag))

@classmethod
def from_element(cls, element):

def from_element(cls, element, lang=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not extract the lang inside of this method?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea.

@@ -140,6 +145,9 @@ def load_site_info(cls, element):
namespaces = {}

for sub_element in element:

if sub_element.tag == 'siteinfo':
return(cls.load_site_info(sub_element))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return isn't a function

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, what's going on here? Is there a inside of a tag?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean is language data inside a tag?
Unfortunately not. The language is in the xml header, not a tag.

Copy link
Author

@groceryheist groceryheist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

accepted feedback from aaron halfaker

if tag == "page":
yield Page.from_element(sub_element)
else:
assert MalformedXML("Expected to see 'page'. " +
"Instead saw '{0}'".format(tag))

@classmethod
def from_element(cls, element):

def from_element(cls, element, lang=None):
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea.

@@ -140,6 +145,9 @@ def load_site_info(cls, element):
namespaces = {}

for sub_element in element:

if sub_element.tag == 'siteinfo':
return(cls.load_site_info(sub_element))
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean is language data inside a tag?
Unfortunately not. The language is in the xml header, not a tag.

@@ -140,6 +143,9 @@ def load_site_info(cls, element):
namespaces = {}

for sub_element in element:

if sub_element.tag == 'siteinfo':
return cls.load_site_info(sub_element)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused about this line because it looks like you're expecting to find something like this:

<siteinfo>
  <siteinfo> ... </siteinfo>
</siteinfo>

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK i see now why this looks weird. I'll take another look. It's been a while since I did this so I'll see if it's a mistake or just something strange going on with Wikia dumps.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for taking a look at this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😉 👍

@halfak
Copy link
Member

halfak commented May 18, 2017

I left one more note because I'm still confused about this one. Sorry if it seems nitpicky or if I'm missing something obvious :S

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants