improve xml_dump processing. #43
groceryheist commented on Feb 17, 2017
- add lang field to Iterator to store wiki language.
- extract lang from root tag of xml dumps
- load siteinfo when it comes in a 'siteinfo' element.
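For readers unfamiliar with the dump layout, here is a minimal sketch of reading the language from the root tag. It uses the standard library's `xml.etree.ElementTree` rather than this project's own element iterator, and the sample dump and the `read_root_lang` helper are made up for illustration. The sample omits the default `xmlns` declaration that real exports usually carry.

```python
import io
import xml.etree.ElementTree as ET

# Key ElementTree uses for the xml:lang attribute (namespace fixed by the XML spec).
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Made-up sample shaped like the start of a MediaWiki export (assumption).
SAMPLE_DUMP = b"""<mediawiki xml:lang="en">
  <siteinfo>
    <sitename>Example Wiki</sitename>
  </siteinfo>
  <page>
    <title>Main Page</title>
  </page>
</mediawiki>
"""


def read_root_lang(stream):
    """Return the xml:lang attribute of the root element, or None if absent."""
    # The first "start" event belongs to the root element, and its attributes
    # are already populated, so there is no need to parse any further.
    for _event, element in ET.iterparse(stream, events=("start",)):
        return element.get(XML_LANG)
    return None


lang = read_root_lang(io.BytesIO(SAMPLE_DUMP))
```

Because the attribute sits on the root element, it is available before `siteinfo` or any `page` has been read.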
Where are you seeing SHA1s in page elements? Are these in the Wikia dumps?
Yes.
On Fri, May 12, 2017 at 2:09 PM, Benjamin Mako Hill wrote:
> Where are you seeing SHA1s in page elements? Are these in the Wikia dumps?
--
Nate
This library is soon to be deprecated in favor of https://github.com/mediawiki-utilities/python-mwxml. The transition is slow because I'm the sole maintainer. I wonder if that other library works as expected here.
             if tag == "page":
                 yield Page.from_element(sub_element)
             else:
                 assert MalformedXML("Expected to see 'page'. " +
                                     "Instead saw '{0}'".format(tag))

     @classmethod
-    def from_element(cls, element):
+    def from_element(cls, element, lang=None):
Why not extract the lang inside of this method?
Good idea.
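A rough sketch of what that suggestion could look like: the method reads the language from the element it is given, so callers do not pass `lang` in. `DumpHeader` is a hypothetical stand-in, not the class in this repository, and the standard library's `ElementTree` is assumed in place of the project's element iterator.

```python
import xml.etree.ElementTree as ET

# Key ElementTree uses for the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


class DumpHeader:
    """Hypothetical stand-in for the iterator under review; not the mw API."""

    def __init__(self, lang=None):
        self.lang = lang

    @classmethod
    def from_element(cls, element):
        # The root element is already in hand, so the language is read here
        # instead of being threaded through as an extra argument.
        return cls(lang=element.get(XML_LANG))


root = ET.fromstring('<mediawiki xml:lang="de"></mediawiki>')
header = DumpHeader.from_element(root)
```

This keeps the `from_element(cls, element)` signature unchanged for existing callers.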
mw/xml_dump/iteration/iterator.py
Outdated
@@ -140,6 +145,9 @@ def load_site_info(cls, element):
         namespaces = {}

         for sub_element in element:

+            if sub_element.tag == 'siteinfo':
+                return(cls.load_site_info(sub_element))
return isn't a function
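To illustrate the point: `return` is a statement, so parentheses after it only group the expression. The two function names below are made up for the comparison.

```python
def with_parens(value):
    # The parentheses only group the expression; nothing is being called.
    return(value)


def without_parens(value):
    # Same behavior, written in the more conventional way.
    return value


same = with_parens(42) == without_parens(42)
```

Both forms behave identically, which is why the later revision of the diff drops the parentheses.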
Also, what's going on here? Is there a <siteinfo> inside of a <siteinfo> tag?
Do you mean: is the language data inside a tag?
Unfortunately not. The language is in the XML header, not in a tag.
accepted feedback from aaron halfaker
@@ -140,6 +143,9 @@ def load_site_info(cls, element):
         namespaces = {}

         for sub_element in element:

+            if sub_element.tag == 'siteinfo':
+                return cls.load_site_info(sub_element)
I'm confused about this line because it looks like you're expecting to find something like this:
<siteinfo>
  <siteinfo> ... </siteinfo>
</siteinfo>
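A small sketch of why the check looks odd, assuming `load_site_info` receives the siteinfo element itself and that iteration behaves like the standard library's `ElementTree` (the project's own iterator may differ). The fragment is made up.

```python
import xml.etree.ElementTree as ET

# Made-up siteinfo fragment; real dumps carry more fields.
SITEINFO_XML = """<siteinfo>
  <sitename>Example Wiki</sitename>
  <dbname>examplewiki</dbname>
  <namespaces>
    <namespace key="0" />
    <namespace key="1">Talk</namespace>
  </namespaces>
</siteinfo>"""

siteinfo = ET.fromstring(SITEINFO_XML)

# Iterating an element yields its direct children only.
child_tags = [child.tag for child in siteinfo]

# A check like `sub_element.tag == 'siteinfo'` inside this loop would only
# match if a <siteinfo> were nested directly inside another <siteinfo>.
nested_match = any(tag == "siteinfo" for tag in child_tags)
```

Under those assumptions the branch would only be taken if the method were handed the parent of siteinfo, for example the root element, instead of siteinfo itself.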
OK, I see now why this looks weird. I'll take another look. It's been a while since I did this, so I'll see if it's a mistake or just something strange going on with the Wikia dumps.
Thanks a lot for taking a look at this.
😉 👍
I left one more note because I'm still confused about this one. Sorry if it seems nitpicky or if I'm missing something obvious :S