improve xml_dump processing. #43

groceryheist · 2017-02-17T20:01:09Z

1. Add a lang field to Iterator to store the wiki language.
2. Extract lang from the root tag of XML dumps.
3. Load siteinfo when it comes in a 'siteinfo' element.
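
As a rough illustration of item 2, the sketch below reads the language from the root element of a dump using only the standard library. It assumes the root `<mediawiki>` element carries an `xml:lang` attribute; `read_lang` and the sample XML are hypothetical and are not the code in this PR.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the reserved "xml:" prefix under this namespace URI.
XML_NS = "{http://www.w3.org/XML/1998/namespace}"


def read_lang(source):
    """Return the xml:lang attribute of the root element, or None.

    Hypothetical helper for illustration; not the code in this PR.
    """
    # The first "start" event belongs to the root element, so we can stop
    # after one event instead of parsing the rest of the dump.
    for _event, element in ET.iterparse(source, events=("start",)):
        return element.attrib.get(XML_NS + "lang")
    return None


# Assumed sample: a tiny dump whose root tag declares the language.
sample = io.BytesIO(
    b'<mediawiki xml:lang="en"><siteinfo><sitename>Example</sitename>'
    b"</siteinfo></mediawiki>"
)
lang = read_lang(sample)
```

Reading the attribute on the "start" event means the language is known before any page is parsed, so it could be stored on the iterator up front.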

makoshark · 2017-05-12T21:09:05Z

Where are you seeing SHA1s in page elements? Are these in the Wikia dumps?

groceryheist · 2017-05-13T22:23:15Z

Yes.


-- Nate

halfak · 2017-05-16T16:27:36Z

This is soon to be deprecated in favor of https://github.com/mediawiki-utilities/python-mwxml. Transition is slow because I'm the sole maintainer. I wonder if that other library works as expected here.

halfak · 2017-05-16T16:40:15Z

mw/xml_dump/iteration/iterator.py

            if tag == "page":
                yield Page.from_element(sub_element)
            else:
                assert MalformedXML("Expected to see 'page'.  " +
                                    "Instead saw '{0}'".format(tag))

    @classmethod
-    def from_element(cls, element):
-
+    def from_element(cls, element, lang=None):


Why not extract the lang inside of this method?

halfak · 2017-05-16T16:41:08Z

mw/xml_dump/iteration/iterator.py

@@ -140,6 +145,9 @@ def load_site_info(cls, element):
        namespaces = {}

        for sub_element in element:
+
+            if sub_element.tag == 'siteinfo':
+                return(cls.load_site_info(sub_element))


return isn't a function

Also, what's going on here? Is there a <siteinfo> inside of a <siteinfo> tag?
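
To illustrate the first point: `return` is a statement, so parentheses after it only group the expression. A minimal sketch (both function names are hypothetical) showing that the two spellings behave identically, which is why the unparenthesized form is the idiomatic one:

```python
def with_parens(value):
    # The parentheses only group the expression; nothing is being "called".
    return(value + 1)


def without_parens(value):
    # Idiomatic form: the return statement followed by an expression.
    return value + 1


same_result = with_parens(1) == without_parens(1)
```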


groceryheist

Accepted feedback from Aaron Halfaker.

groceryheist · 2017-05-18T03:16:36Z

mw/xml_dump/iteration/iterator.py

            if tag == "page":
                yield Page.from_element(sub_element)
            else:
                assert MalformedXML("Expected to see 'page'.  " +
                                    "Instead saw '{0}'".format(tag))

    @classmethod
-    def from_element(cls, element):
-
+    def from_element(cls, element, lang=None):


groceryheist · 2017-05-18T03:18:02Z

mw/xml_dump/iteration/iterator.py

@@ -140,6 +145,9 @@ def load_site_info(cls, element):
        namespaces = {}

        for sub_element in element:
+
+            if sub_element.tag == 'siteinfo':
+                return(cls.load_site_info(sub_element))


Do you mean, is the language data inside a tag?
Unfortunately not. The language is in the XML header, not in a tag.

halfak · 2017-05-18T08:51:49Z

mw/xml_dump/iteration/iterator.py

@@ -140,6 +143,9 @@ def load_site_info(cls, element):
        namespaces = {}

        for sub_element in element:
+
+            if sub_element.tag == 'siteinfo':
+                return cls.load_site_info(sub_element)


I'm confused about this line because it looks like you're expecting to find something like this:

    <siteinfo>
        <siteinfo> ... </siteinfo>
    </siteinfo>

OK, I see now why this looks weird. I'll take another look. It's been a while since I did this, so I'll see if it's a mistake or just something strange going on with Wikia dumps.

Thanks a lot for taking a look at this.
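
The pattern under discussion can be sketched as follows. This is a simplified stand-in, not the library's `load_site_info`: it assumes a dump in which the real `<siteinfo>` may be wrapped in an extra `<siteinfo>` element, which the thread does not confirm. The `load_sitename` helper and both sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET


def load_sitename(element):
    """Return the <sitename> text, descending through a nested <siteinfo>.

    Hypothetical, simplified stand-in for load_site_info.
    """
    for sub_element in element:
        if sub_element.tag == "siteinfo":
            # Only reached when a <siteinfo> contains another <siteinfo>.
            return load_sitename(sub_element)
        if sub_element.tag == "sitename":
            return sub_element.text
    return None


# Assumed samples: one flat <siteinfo>, one wrapped in a second <siteinfo>.
flat = ET.fromstring("<siteinfo><sitename>Example</sitename></siteinfo>")
nested = ET.fromstring(
    "<siteinfo><siteinfo><sitename>Example</sitename></siteinfo></siteinfo>"
)
flat_name = load_sitename(flat)
nested_name = load_sitename(nested)
```

With well-formed input the recursive branch is never taken, so the check only matters if some dumps really do nest the element.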

halfak · 2017-05-18T08:55:11Z

I left one more note because I'm still confused about this one. Sorry if it seems nitpicky or if I'm missing something obvious :S

improve xml_dump processing.

d90249a

1. add lang field to Iterator to store wiki language. 2. extract lang from root tag of xml dumps 3. load siteinfo when it comes in a 'siteinfo' element.

groceryheist closed this Feb 17, 2017

groceryheist reopened this Feb 17, 2017

ignore sha1s in page elements

7e2e68b

halfak requested changes May 16, 2017

groceryheist added 2 commits May 17, 2017 20:02

don't use return like a function

da2bfe3

extract lang in from_element

ce22bf9

groceryheist commented May 18, 2017

halfak reviewed May 18, 2017
