Commit c078acc

Author: Mike Taylor
finish up markdown formatting for great victory
1 parent d8f22aa, commit c078acc

1 file changed: +84 -83 lines changed

README.md

@@ -1,4 +1,4 @@
-= Hpricot, Read Any HTML
+# Hpricot, Read Any HTML
 
 Hpricot is a fast, flexible HTML parser written in C. It's designed to be very
 accommodating (like Tanaka Akira's HTree) and to have a very helpful library
@@ -13,21 +13,21 @@ thing.
 *Please read this entire document* before making assumptions about how this
 software works.
 
-== An Overview
+## An Overview
 
 Let's clear up what Hpricot is.
 
-# Hpricot is *a standalone library*. It requires no other libraries. Just Ruby!
-# While priding itself on speed, Hpricot *works hard to sort out bad HTML* and
+* Hpricot is *a standalone library*. It requires no other libraries. Just Ruby!
+* While priding itself on speed, Hpricot *works hard to sort out bad HTML* and
 pays a small penalty in order to get that right. So that's slightly more important
 to me than speed.
-# *If you can see it in Firefox, then Hpricot should parse it.* That's
+* *If you can see it in Firefox, then Hpricot should parse it.* That's
 how it should be! Let me know the minute it's otherwise.
-# Primarily, Hpricot is used for reading HTML and tries to sort out troubled
+* Primarily, Hpricot is used for reading HTML and tries to sort out troubled
 HTML by having some idea of what good HTML is. Some people still like to use
 Hpricot for XML reading, but *remember to use the Hpricot::XML() method* for that!
 
-== The Hpricot Kingdom
+## The Hpricot Kingdom
 
 First, here are all the links you need to know:
 
@@ -43,184 +43,185 @@ not going to say "Use at your own risk" because I don't want this library to be
 risky. If you trip on something, I'll share the liability by repairing things
 as quickly as I can. Your responsibility is to report the inadequacies.
 
-== Installing Hpricot
+## Installing Hpricot
 
 You may get the latest stable version from Rubyforge. Win32 binaries,
 Java binaries (for JRuby), and source gems are available.
 
-  $ gem install hpricot
+    $ gem install hpricot
 
-== An Hpricot Showcase
+## An Hpricot Showcase
 
 We're going to run through a big pile of examples to get you jump-started.
 Many of these examples are also found at
 http://wiki.github.com/hpricot/hpricot/hpricot-basics, in case you
 want to add some of your own.
 
-=== Loading Hpricot Itself
+### Loading Hpricot Itself
 
 You have probably got the gem, right? To load Hpricot:
 
-  require 'rubygems'
-  require 'hpricot'
+    require 'rubygems'
+    require 'hpricot'
 
 If you've installed the plain source distribution, go ahead and just:
 
-  require 'hpricot'
+    require 'hpricot'
 
-=== Load an HTML Page
+### Load an HTML Page
 
 The <tt>Hpricot()</tt> method takes a string or any IO object and loads the
 contents into a document object.
 
-  doc = Hpricot("<p>A simple <b>test</b> string.</p>")
+    doc = Hpricot("<p>A simple <b>test</b> string.</p>")
 
 To load from a file, just get the stream open:
 
-  doc = open("index.html") { |f| Hpricot(f) }
+    doc = open("index.html") { |f| Hpricot(f) }
 
 To load from a web URL, use <tt>open-uri</tt>, which comes with Ruby:
 
-  require 'open-uri'
-  doc = open("http://qwantz.com/") { |f| Hpricot(f) }
+    require 'open-uri'
+    doc = open("http://qwantz.com/") { |f| Hpricot(f) }
 
 Hpricot uses an internal buffer to parse the file, so the IO will stream
 properly and large documents won't be loaded into memory all at once. However,
 the parsed document object will be present in memory, in its entirety.
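
Note on the loading examples in this hunk: they call the bare `open(...)` for both files and URLs. On Ruby 3.0 and later, `open-uri` no longer makes `Kernel#open` accept URLs, so `URI.open` is needed for the web case. The sketch below is one way to cover both cases. The helper name `with_html_source` is hypothetical and not part of Hpricot or this README; it only yields an IO, which the caller would then hand to `Hpricot()`.

```ruby
require 'open-uri'

# Hypothetical helper (an assumption for illustration, not Hpricot API):
# opens a local path or an http(s) URL and yields the IO to the block,
# returning whatever the block returns.
def with_html_source(source)
  if source =~ %r{\Ahttps?://}i
    # Ruby 3.0+ requires URI.open here; Kernel#open no longer handles URLs.
    URI.open(source) { |io| yield io }
  else
    # File.open closes the file when the block finishes.
    File.open(source) { |io| yield io }
  end
end

# Usage, assuming the hpricot gem is installed and loaded:
#   doc = with_html_source("index.html") { |f| Hpricot(f) }
#   doc = with_html_source("http://qwantz.com/") { |f| Hpricot(f) }
```

Because the block's return value is passed through, the `doc = ... { |f| Hpricot(f) }` shape from the README stays the same.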
 
-=== Search for Elements
+### Search for Elements
 
 Use <tt>Doc.search</tt>:
 
-  doc.search("//p[@class='posted']")
-  #=> #<Hpricot:Elements[{p ...}, {p ...}]>
+    doc.search("//p[@class='posted']")
+    #=> #<Hpricot:Elements[{p ...}, {p ...}]>
 
 <tt>Doc.search</tt> can take an XPath or CSS expression. In the above example,
 all paragraph <tt><p></tt> elements are grabbed which have a <tt>class</tt>
 attribute of <tt>"posted"</tt>.
 
 A shortcut is to use the divisor:
 
-  (doc/"p.posted")
-  #=> #<Hpricot:Elements[{p ...}, {p ...}]>
+    (doc/"p.posted")
+    #=> #<Hpricot:Elements[{p ...}, {p ...}]>
 
-=== Finding Just One Element
+### Finding Just One Element
 
 If you're looking for a single element, the <tt>at</tt> method will return the
 first element matched by the expression. In this case, you'll get back the
 element itself rather than the <tt>Hpricot::Elements</tt> array.
 
-  doc.at("body")['onload']
+    doc.at("body")['onload']
 
 The above code will find the body tag and give you back the <tt>onload</tt>
 attribute. This is the most common reason to use the element directly: when
 reading and writing HTML attributes.
 
-=== Fetching the Contents of an Element
+### Fetching the Contents of an Element
 
 Just as with browser scripting, the <tt>inner_html</tt> property can be used to
 get the inner contents of an element.
 
-  (doc/"#elementID").inner_html
-  #=> "..<b>contents</b>.."
+    (doc/"#elementID").inner_html
+    #=> "..contents.."
 
 If your expression matches more than one element, you'll get back the contents
 of ''all the matched elements''. So you may want to use <tt>first</tt> to be
 sure you get back only one.
 
-  (doc/"#elementID").first.inner_html
-  #=> "..<b>contents</b>.."
+    (doc/"#elementID").first.inner_html
+    #=> "..contents.."
 
-=== Fetching the HTML for an Element
+### Fetching the HTML for an Element
 
 If you want the HTML for the whole element (not just the contents), use
 <tt>to_html</tt>:
 
-  (doc/"#elementID").to_html
-  #=> "<div id='elementID'>...</div>"
+    (doc/"#elementID").to_html
+    #=> "<div id='elementID'>...</div>"
 
-=== Looping
+### Looping
 
 All searches return a set of <tt>Hpricot::Elements</tt>. Go ahead and loop
 through them like you would an array.
 
-  (doc/"p/a/img").each do |img|
-    puts img.attributes['class']
-  end
+    (doc/"p/a/img").each do |img|
+      puts img.attributes['class']
+    end
 
-=== Continuing Searches
+### Continuing Searches
 
 Searches can be continued from a collection of elements, in order to search deeper.
 
-  # find all paragraphs.
-  elements = doc.search("/html/body//p")
-  # continue the search by finding any images within those paragraphs.
-  (elements/"img")
-  #=> #<Hpricot::Elements[{img ...}, {img ...}]>
+    # find all paragraphs.
+    elements = doc.search("/html/body//p")
+    # continue the search by finding any images within those paragraphs.
+    (elements/"img")
+    #=> #<Hpricot::Elements[{img ...}, {img ...}]>
 
 Searches can also be continued by searching within container elements.
 
-  # find all images within paragraphs.
-  doc.search("/html/body//p").each do |para|
-    puts "== Found a paragraph =="
-    pp para
+    # find all images within paragraphs.
+    doc.search("/html/body//p").each do |para|
+      puts "== Found a paragraph =="
+      pp para
 
-    imgs = para.search("img")
-    if imgs.any?
-      puts "== Found #{imgs.length} images inside =="
-    end
-  end
+      imgs = para.search("img")
+      if imgs.any?
+        puts "== Found #{imgs.length} images inside =="
+      end
+    end
 
 Of course, the most succinct ways to do the above are using CSS or XPath.
 
-  # the xpath version
-  (doc/"/html/body//p//img")
-  # the css version
-  (doc/"html > body > p img")
-  # ..or symbols work, too!
-  (doc/:html/:body/:p/:img)
+    # the xpath version
+    (doc/"/html/body//p//img")
+    # the css version
+    (doc/"html > body > p img")
+    # ..or symbols work, too!
+    (doc/:html/:body/:p/:img)
 
-=== Looping Edits
+### Looping Edits
 
 You may certainly edit objects from within your search loops. Then, when you
 spit out the HTML, the altered elements will show.
 
-  (doc/"span.entryPermalink").each do |span|
-    span.attributes['class'] = 'newLinks'
-  end
-  puts doc
+
+    (doc/"span.entryPermalink").each do |span|
+      span.attributes['class'] = 'newLinks'
+    end
+    puts doc
 
 This changes all <tt>span.entryPermalink</tt> elements to
 <tt>span.newLinks</tt>. Keep in mind that there are often more convenient ways
 of doing this. Such as the <tt>set</tt> method:
 
-  (doc/"span.entryPermalink").set(:class => 'newLinks')
+    (doc/"span.entryPermalink").set(:class => 'newLinks')
 
-=== Figuring Out Paths
+### Figuring Out Paths
 
 Every element can tell you its unique path (either XPath or CSS) to get to the
 element from the root tag.
 
 The <tt>css_path</tt> method:
 
-  doc.at("div > div:nth(1)").css_path
-  #=> "div > div:nth(1)"
-  doc.at("#header").css_path
-  #=> "#header"
+    doc.at("div > div:nth(1)").css_path
+    #=> "div > div:nth(1)"
+    doc.at("#header").css_path
+    #=> "#header"
 
 Or, the <tt>xpath</tt> method:
 
-  doc.at("div > div:nth(1)").xpath
-  #=> "/div/div:eq(1)"
-  doc.at("#header").xpath
-  #=> "//div[@id='header']"
+    doc.at("div > div:nth(1)").xpath
+    #=> "/div/div:eq(1)"
+    doc.at("#header").xpath
+    #=> "//div[@id='header']"
 
-== Hpricot Fixups
+## Hpricot Fixups
 
 When loading HTML documents, you have a few settings that can make Hpricot more
 or less intense about how it gets involved.
 
-== :fixup_tags
+## :fixup_tags
 
 Really, there are so many ways to clean up HTML and your intentions may be to
 keep the HTML as-is. So Hpricot's default behavior is to keep things flexible.
@@ -229,7 +230,7 @@ Making sure to open and close all the tags, but ignore any validation problems.
 As of Hpricot 0.4, there's a new <tt>:fixup_tags</tt> option which will attempt
 to shift the document's tags to meet XHTML 1.0 Strict.
 
-  doc = open("index.html") { |f| Hpricot f, :fixup_tags => true }
+    doc = open("index.html") { |f| Hpricot f, :fixup_tags => true }
 
 This doesn't quite meet the XHTML 1.0 Strict standard, it just tries to follow
 the rules a bit better. Like: say Hpricot finds a paragraph in a link, it's
@@ -238,13 +239,13 @@ where paragraphs don't belong.
 
 If an unknown element is found, it is ignored. Again, <tt>:fixup_tags</tt>.
 
-== :xhtml_strict
+## :xhtml_strict
 
 So, let's go beyond just trying to fix the hierarchy. The
 <tt>:xhtml_strict</tt> option really tries to force the document to be an XHTML
 1.0 Strict document. Even at the cost of removing elements that get in the way.
 
-  doc = open("index.html") { |f| Hpricot f, :xhtml_strict => true }
+    doc = open("index.html") { |f| Hpricot f, :xhtml_strict => true }
 
 What measures does <tt>:xhtml_strict</tt> take?
 
@@ -254,7 +255,7 @@ What measures does <tt>:xhtml_strict</tt> take?
 4. Remove illegal content.
 5. Alter the doctype to XHTML 1.0 Strict.
 
-== Hpricot.XML()
+## Hpricot.XML()
 
 The last option is the <tt>:xml</tt> option, which makes some slight variations
 on the standard mode. The main difference is that :xml mode won't try to output
@@ -266,9 +267,9 @@ to case, friends.
 
 The primary way to use Hpricot's XML mode is to call the Hpricot.XML method:
 
-  doc = open("http://redhanded.hobix.com/index.xml") do |f|
-    Hpricot.XML(f)
-  end
+    doc = open("http://redhanded.hobix.com/index.xml") do |f|
+      Hpricot.XML(f)
+    end
 
 *Also, :fixup_tags is canceled out by the :xml option.* This is because
 :fixup_tags makes assumptions based how HTML is structured. Specifically, how
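
Note on the heading changes in this diff: they follow a mechanical pattern, with `=` becoming `#`, `==` becoming `##`, and `===` becoming `###`. The commit does not say how the conversion was done, so the following is only a minimal sketch of that one pattern. The function name `rdoc_headings_to_markdown` is hypothetical. It deliberately leaves the list-marker change (`#` to `*`) and code-block indentation alone, since those are ambiguous without more context.

```ruby
# Hypothetical helper (an assumption for illustration, not taken from the commit):
# rewrites lines that start with one to six "=" followed by whitespace
# into Markdown headings of the same depth.
def rdoc_headings_to_markdown(text)
  text.gsub(/^(={1,6})[ \t]+(\S.*)$/) do
    # $1 holds the run of "=" characters, $2 the heading text.
    "#{'#' * $1.length} #{$2}"
  end
end

# Usage:
#   puts rdoc_headings_to_markdown(File.read("README"))
```

Only lines that begin with `=` in the first column are rewritten, so an indented code line such as `puts "== Found a paragraph =="` is left untouched.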