plump

Yesterday @Shinmera mentioned Plump in @XH004’s thread about optimizing the performance of their new HTML parser, so I decided to review it.

Plump can parse HTML, lets you modify the resulting document, and can serialize it back to HTML.
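Here is a minimal sketch of that parse, modify, serialize round trip. It assumes Quicklisp is installed, and the HTML snippet and the variable names are made up for illustration:

```lisp
;; Assumption: Quicklisp is available in the running image.
(ql:quickload :plump :silent t)

;; Parse a small, made-up HTML fragment into a Plump document.
(defparameter *doc*
  (plump:parse "<p class=\"greeting\">Hello <b>world</b></p>"))

;; Find the only <p> element in the document.
(defparameter *paragraph*
  (first (plump:get-elements-by-tag-name *doc* "p")))

;; Modify it: PLUMP:ATTRIBUTE is setf-able.
(setf (plump:attribute *paragraph* "class") "farewell")

;; Serialize back; passing NIL as the stream returns a string.
(defparameter *output* (plump:serialize *doc* nil))

(format t "~A~%" *output*)
```

The printed markup should now carry the new class value instead of the old one.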

Let’s write a crawler to grab @shinmera’s posts from Twitter!

POFTHEDAY> (defvar *raw-html*
              (dex:get "https://twitter.com/shinmera"))

POFTHEDAY> (defvar *html* (plump:parse *raw-html*))

;; We need all <p> elements whose class contains "tweet-text"
POFTHEDAY> (defvar *posts*
             (remove-if-not (lambda (node)
                              (str:containsp "tweet-text"
                                             (plump:attribute node "class")))
                            (plump:get-elements-by-tag-name *html* "p")))

POFTHEDAY> (loop for post in (rutils:take 5 *posts*)
                 for full-text = (plump:render-text post)
                 for short-text = (str:shorten 40 full-text)
                 do (format t "- ~A~2%" short-text))
- 1478 Lighting sketch #onesies https:/...

- Trust Level: Swiss A fridge with cool...

- The arch.pic.twitter.com/gMamJfZ1r4

- らくがきばかりアップしていたやつ、今度は動きます。週末にプロクリエイトで描...

- Shit's broken. Will be back in a few ...
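The filter above matches any class attribute that contains the substring "tweet-text", so it would also accept a class such as "tweet-text-extra". If you need an exact match, one option is to compare against the individual class names. This is only a sketch that runs offline on a made-up HTML string; `split-on-spaces` and `has-class-p` are hypothetical helpers of mine, not part of Plump, and Quicklisp is assumed to be installed:

```lisp
;; Assumption: Quicklisp is available in the running image.
(ql:quickload :plump :silent t)

(defun split-on-spaces (string)
  "Hypothetical helper: split STRING on space characters, dropping empty tokens.
Tabs and newlines are not handled in this sketch."
  (loop with start = 0
        for end = (position #\Space string :start start)
        for token = (subseq string start end)
        unless (string= token "") collect token
        while end
        do (setf start (1+ end))))

(defun has-class-p (node class-name)
  "Hypothetical helper: true when CLASS-NAME is one of NODE's classes."
  (let ((classes (plump:attribute node "class")))
    (and classes
         (member class-name (split-on-spaces classes) :test #'string=)
         t)))

;; Made-up sample markup: only the first <p> has the exact class.
(defparameter *sample*
  (plump:parse "<div><p class=\"tweet-text big\">First</p><p class=\"tweet-text-extra\">Second</p><p>Third</p></div>"))

(defparameter *matches*
  (remove-if-not (lambda (node) (has-class-p node "tweet-text"))
                 (plump:get-elements-by-tag-name *sample* "p")))

(dolist (node *matches*)
  (format t "- ~A~%" (plump:render-text node)))
```

For the Twitter page the substring check was good enough, so treat this as an alternative for stricter pages rather than a correction.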

Plump has more utilities for working with HTML than I have shown here. Read its documentation to learn more.

If you are going to write crawlers in Common Lisp, I recommend using Plump together with another of @shinmera’s libraries, clss. But we’ll play with it tomorrow :)