- = Hpricot, Read Any HTML
+ # Hpricot, Read Any HTML

Hpricot is a fast, flexible HTML parser written in C. It's designed to be very
accommodating (like Tanaka Akira's HTree) and to have a very helpful library
@@ -13,21 +13,21 @@ thing.
*Please read this entire document* before making assumptions about how this
software works.

- == An Overview
+ ## An Overview

Let's clear up what Hpricot is.

- # Hpricot is *a standalone library*. It requires no other libraries. Just Ruby!
- # While priding itself on speed, Hpricot *works hard to sort out bad HTML* and
+ * Hpricot is *a standalone library*. It requires no other libraries. Just Ruby!
+ * While priding itself on speed, Hpricot *works hard to sort out bad HTML* and
pays a small penalty in order to get that right. So that's slightly more important
to me than speed.
- # *If you can see it in Firefox, then Hpricot should parse it.* That's
+ * *If you can see it in Firefox, then Hpricot should parse it.* That's
how it should be! Let me know the minute it's otherwise.
- # Primarily, Hpricot is used for reading HTML and tries to sort out troubled
+ * Primarily, Hpricot is used for reading HTML and tries to sort out troubled
HTML by having some idea of what good HTML is. Some people still like to use
Hpricot for XML reading, but *remember to use the Hpricot::XML() method* for that!

- == The Hpricot Kingdom
+ ## The Hpricot Kingdom

First, here are all the links you need to know:

@@ -43,184 +43,185 @@ not going to say "Use at your own risk" because I don't want this library to be
risky. If you trip on something, I'll share the liability by repairing things
as quickly as I can. Your responsibility is to report the inadequacies.

- == Installing Hpricot
+ ## Installing Hpricot

You may get the latest stable version from Rubyforge. Win32 binaries,
Java binaries (for JRuby), and source gems are available.

-   $ gem install hpricot
+     $ gem install hpricot

- == An Hpricot Showcase
+ ## An Hpricot Showcase

We're going to run through a big pile of examples to get you jump-started.
Many of these examples are also found at
http://wiki.github.com/hpricot/hpricot/hpricot-basics, in case you
want to add some of your own.

- === Loading Hpricot Itself
+ ### Loading Hpricot Itself

You have probably got the gem, right? To load Hpricot:

-   require 'rubygems'
-   require 'hpricot'
+     require 'rubygems'
+     require 'hpricot'

If you've installed the plain source distribution, go ahead and just:

-   require 'hpricot'
+     require 'hpricot'

- === Load an HTML Page
+ ### Load an HTML Page

The <tt>Hpricot()</tt> method takes a string or any IO object and loads the
contents into a document object.

-   doc = Hpricot("<p>A simple <b>test</b> string.</p>")
+     doc = Hpricot("<p>A simple <b>test</b> string.</p>")

To load from a file, just get the stream open:

-   doc = open("index.html") { |f| Hpricot(f) }
+     doc = open("index.html") { |f| Hpricot(f) }

To load from a web URL, use <tt>open-uri</tt>, which comes with Ruby:

-   require 'open-uri'
-   doc = open("http://qwantz.com/") { |f| Hpricot(f) }
+     require 'open-uri'
+     doc = open("http://qwantz.com/") { |f| Hpricot(f) }

Hpricot uses an internal buffer to parse the file, so the IO will stream
properly and large documents won't be loaded into memory all at once. However,
the parsed document object will be present in memory, in its entirety.

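Accepting either a string or an IO is a common Ruby duck-typing pattern. As a rough sketch of the buffered-reading idea described above (this is not Hpricot's actual scanner), the hypothetical helper `each_chunk` below reads an IO in fixed-size pieces and wraps a plain string in `StringIO` first. The helper name and the 16-byte buffer size are assumptions made for illustration.

```ruby
require 'stringio'

# Hypothetical helper, for illustration only: yields the input in small chunks
# so the caller never needs the whole input as a single string.
def each_chunk(input, size = 16)
  # Anything that responds to #read is treated as an IO; strings get wrapped.
  io = input.respond_to?(:read) ? input : StringIO.new(input.to_s)
  while (chunk = io.read(size))
    yield chunk
  end
end

chunks = []
each_chunk("<p>A simple <b>test</b> string.</p>") { |c| chunks << c }
puts chunks.length
```

A parser built this way can consume a file or socket piece by piece, although, as the text notes, the resulting document object still lives in memory as a whole.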
- === Search for Elements
+ ### Search for Elements

Use <tt>Doc.search</tt>:

-   doc.search("//p[@class='posted']")
-   #=> #<Hpricot:Elements[{p ...}, {p ...}]>
+     doc.search("//p[@class='posted']")
+     #=> #<Hpricot:Elements[{p ...}, {p ...}]>

<tt>Doc.search</tt> can take an XPath or CSS expression. In the above example,
all paragraph <tt><p></tt> elements are grabbed which have a <tt>class</tt>
attribute of <tt>"posted"</tt>.

A shortcut is to use the divisor:

-   (doc/"p.posted")
-   #=> #<Hpricot:Elements[{p ...}, {p ...}]>
+     (doc/"p.posted")
+     #=> #<Hpricot:Elements[{p ...}, {p ...}]>

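The divisor shortcut is ordinary Ruby operator overloading. The sketch below is a toy and not Hpricot's implementation: `ToyDoc` and its tag-name-only `search` are hypothetical names invented here, and they only show how `doc/"p"` can be made to mean `doc.search("p")`.

```ruby
# Toy illustration only: ToyDoc is a hypothetical class, not part of Hpricot.
class ToyDoc
  def initialize(tags)
    @tags = tags # e.g. ["p", "div", "p"]
  end

  # Stand-in for a real search: keeps the entries equal to the given tag name.
  def search(expr)
    @tags.select { |tag| tag == expr.to_s }
  end

  # Aliasing `/` to `search` lets callers write (doc/"p") or (doc/:p).
  alias_method :/, :search
end

doc = ToyDoc.new(%w[p div p])
results = (doc/"p")
puts results.length
```

Because `/` is just a method, it accepts whatever `search` accepts, which is why strings and symbols can both appear on the right-hand side.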
- === Finding Just One Element
+ ### Finding Just One Element

If you're looking for a single element, the <tt>at</tt> method will return the
first element matched by the expression. In this case, you'll get back the
element itself rather than the <tt>Hpricot::Elements</tt> array.

-   doc.at("body")['onload']
+     doc.at("body")['onload']

The above code will find the body tag and give you back the <tt>onload</tt>
attribute. This is the most common reason to use the element directly: when
reading and writing HTML attributes.

- === Fetching the Contents of an Element
+ ### Fetching the Contents of an Element

Just as with browser scripting, the <tt>inner_html</tt> property can be used to
get the inner contents of an element.

-   (doc/"#elementID").inner_html
-   #=> "..<b>contents</b>.."
+     (doc/"#elementID").inner_html
+     #=> "..<b>contents</b>.."

If your expression matches more than one element, you'll get back the contents
of ''all the matched elements''. So you may want to use <tt>first</tt> to be
sure you get back only one.

-   (doc/"#elementID").first.inner_html
-   #=> "..<b>contents</b>.."
+     (doc/"#elementID").first.inner_html
+     #=> "..<b>contents</b>.."

- === Fetching the HTML for an Element
+ ### Fetching the HTML for an Element

If you want the HTML for the whole element (not just the contents), use
<tt>to_html</tt>:

-   (doc/"#elementID").to_html
-   #=> "<div id='elementID'>...</div>"
+     (doc/"#elementID").to_html
+     #=> "<div id='elementID'>...</div>"

- === Looping
+ ### Looping

All searches return a set of <tt>Hpricot::Elements</tt>. Go ahead and loop
through them like you would an array.

-   (doc/"p/a/img").each do |img|
-     puts img.attributes['class']
-   end
+     (doc/"p/a/img").each do |img|
+       puts img.attributes['class']
+     end

- === Continuing Searches
+ ### Continuing Searches

Searches can be continued from a collection of elements, in order to search deeper.

-   # find all paragraphs.
-   elements = doc.search("/html/body//p")
-   # continue the search by finding any images within those paragraphs.
-   (elements/"img")
-   #=> #<Hpricot::Elements[{img ...}, {img ...}]>
+     # find all paragraphs.
+     elements = doc.search("/html/body//p")
+     # continue the search by finding any images within those paragraphs.
+     (elements/"img")
+     #=> #<Hpricot::Elements[{img ...}, {img ...}]>

Searches can also be continued by searching within container elements.

-   # find all images within paragraphs.
-   doc.search("/html/body//p").each do |para|
-     puts "== Found a paragraph =="
-     pp para
+     # find all images within paragraphs.
+     doc.search("/html/body//p").each do |para|
+       puts "== Found a paragraph =="
+       pp para

-     imgs = para.search("img")
-     if imgs.any?
-       puts "== Found #{imgs.length} images inside =="
-     end
-   end
+       imgs = para.search("img")
+       if imgs.any?
+         puts "== Found #{imgs.length} images inside =="
+       end
+     end

Of course, the most succinct ways to do the above are using CSS or XPath.

-   # the xpath version
-   (doc/"/html/body//p//img")
-   # the css version
-   (doc/"html > body > p img")
-   # ..or symbols work, too!
-   (doc/:html/:body/:p/:img)
+     # the xpath version
+     (doc/"/html/body//p//img")
+     # the css version
+     (doc/"html > body > p img")
+     # ..or symbols work, too!
+     (doc/:html/:body/:p/:img)

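For readers without Hpricot at hand, the "search, then search again inside the results" pattern can be tried with REXML as a stand-in. This is REXML's API, not Hpricot's, it only accepts well-formed markup, and the sample document below is made up for the sketch.

```ruby
require 'rexml/document'

# Stand-in example using REXML (not Hpricot); the markup is invented.
xml = <<~HTML
  <html><body>
    <p><img src="a.png"/></p>
    <p>no image here</p>
    <div><img src="b.png"/></div>
  </body></html>
HTML

doc = REXML::Document.new(xml)

# First search: every paragraph under body.
paras = REXML::XPath.match(doc, "/html/body//p")

# Continued search: look for images only inside the paragraphs found above.
imgs = paras.flat_map { |para| REXML::XPath.match(para, ".//img") }

puts paras.length
puts imgs.map { |img| img.attributes["src"] }.inspect
```

The image inside the `div` is skipped because the second search starts from each paragraph rather than from the document root.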
- === Looping Edits
+ ### Looping Edits

You may certainly edit objects from within your search loops. Then, when you
spit out the HTML, the altered elements will show.

-   (doc/"span.entryPermalink").each do |span|
-     span.attributes['class'] = 'newLinks'
-   end
-   puts doc
+
+     (doc/"span.entryPermalink").each do |span|
+       span.attributes['class'] = 'newLinks'
+     end
+     puts doc

This changes all <tt>span.entryPermalink</tt> elements to
<tt>span.newLinks</tt>. Keep in mind that there are often more convenient ways
of doing this. Such as the <tt>set</tt> method:

-   (doc/"span.entryPermalink").set(:class => 'newLinks')
+     (doc/"span.entryPermalink").set(:class => 'newLinks')

- === Figuring Out Paths
+ ### Figuring Out Paths

Every element can tell you its unique path (either XPath or CSS) to get to the
element from the root tag.

The <tt>css_path</tt> method:

-   doc.at("div > div:nth(1)").css_path
-   #=> "div > div:nth(1)"
-   doc.at("#header").css_path
-   #=> "#header"
+     doc.at("div > div:nth(1)").css_path
+     #=> "div > div:nth(1)"
+     doc.at("#header").css_path
+     #=> "#header"

Or, the <tt>xpath</tt> method:

-   doc.at("div > div:nth(1)").xpath
-   #=> "/div/div:eq(1)"
-   doc.at("#header").xpath
-   #=> "//div[@id='header']"
+     doc.at("div > div:nth(1)").xpath
+     #=> "/div/div:eq(1)"
+     doc.at("#header").xpath
+     #=> "//div[@id='header']"

- == Hpricot Fixups
+ ## Hpricot Fixups

When loading HTML documents, you have a few settings that can make Hpricot more
or less intense about how it gets involved.

- == :fixup_tags
+ ## :fixup_tags

Really, there are so many ways to clean up HTML and your intentions may be to
keep the HTML as-is. So Hpricot's default behavior is to keep things flexible.
@@ -229,7 +230,7 @@ Making sure to open and close all the tags, but ignore any validation problems.
As of Hpricot 0.4, there's a new <tt>:fixup_tags</tt> option which will attempt
to shift the document's tags to meet XHTML 1.0 Strict.

-   doc = open("index.html") { |f| Hpricot f, :fixup_tags => true }
+     doc = open("index.html") { |f| Hpricot f, :fixup_tags => true }

This doesn't quite meet the XHTML 1.0 Strict standard, it just tries to follow
the rules a bit better. Like: say Hpricot finds a paragraph in a link, it's
@@ -238,13 +239,13 @@ where paragraphs don't belong.
If an unknown element is found, it is ignored. Again, <tt>:fixup_tags</tt>.

- == :xhtml_strict
+ ## :xhtml_strict

So, let's go beyond just trying to fix the hierarchy. The
<tt>:xhtml_strict</tt> option really tries to force the document to be an XHTML
1.0 Strict document. Even at the cost of removing elements that get in the way.

-   doc = open("index.html") { |f| Hpricot f, :xhtml_strict => true }
+     doc = open("index.html") { |f| Hpricot f, :xhtml_strict => true }

What measures does <tt>:xhtml_strict</tt> take?

@@ -254,7 +255,7 @@ What measures does <tt>:xhtml_strict</tt> take?
4. Remove illegal content.
5. Alter the doctype to XHTML 1.0 Strict.

- == Hpricot.XML()
+ ## Hpricot.XML()

The last option is the <tt>:xml</tt> option, which makes some slight variations
on the standard mode. The main difference is that :xml mode won't try to output
@@ -266,9 +267,9 @@ to case, friends.
The primary way to use Hpricot's XML mode is to call the Hpricot.XML method:

-   doc = open("http://redhanded.hobix.com/index.xml") do |f|
-     Hpricot.XML(f)
-   end
+     doc = open("http://redhanded.hobix.com/index.xml") do |f|
+       Hpricot.XML(f)
+     end

*Also, :fixup_tags is canceled out by the :xml option.* This is because
:fixup_tags makes assumptions based how HTML is structured. Specifically, how