Yesterday @Shinmera mentioned Plump in the @XH004’s thread about performance optimization of it’s new HTML parser. And I decided to review it.
Plump is able to parse, modify and serialize an HTML back.
Let’s write a crawler to grab @shinmera’s posts from Twitter!
POFTHEDAY> (defvar *raw-html*
(dex:get "https://twitter.com/shinmera"))
POFTHEDAY> (defvar *html* (plump:parse *raw-html*))
;; We need all divs with class "tweet-text"
POFTHEDAY> (defvar *posts*
(remove-if-not (lambda (div)
(str:containsp "tweet-text"
(plump:attribute div "class")))
(plump:get-elements-by-tag-name *html* "p")))
POFTHEDAY> (loop for post in (rutils:take 5 *posts*)
for full-text = (plump:render-text post)
for short-text = (str:shorten 40 full-text)
do (format t "- ~A~2%" short-text))
- 1478 Lighting sketch #onesies https:/...
- Trust Level: Swiss A fridge with cool...
- The arch.pic.twitter.com/gMamJfZ1r4
- らくがきばかりアップしていたやつ、今度は動きます。週末にプロクリエイトで描...
- Shit's broken. Will be back in a few ...
This library has more utils for HTML parsing. Read the documentation to learn more.
If you are going to write crawlers in Common lisp, I recommend you to use Plump together with another @shimera’s library - clss but we’ll play with it tomorrow :)