Commit aec4754

Refactoring, cleanups and documentation

1 parent e64f757 commit aec4754


4 files changed (+112 -7 lines)


README.md (+69 -5)

@@ -10,11 +10,65 @@
 grobotstxt is a native Go port of [Google's robots.txt parser and matcher C++
 library](https://github.com/google/robotstxt).

-- Direct function-for-function conversion/port.
-- Preserves all behaviour of original library.
-- All 100% of original test suite functionality.
-- The code is not pretty :/
-- But the tests all pass! :)
+- Direct function-for-function conversion/port
+- Preserves all behaviour of original library
+- All 100% of original test suite functionality
+- Minor language-specific cleanups
+
+The original package also includes a standalone binary, but that has not yet been ported as part of this package.
+
+## Installation
+
+```bash
+$ go get github.com/jimsmart/grobotstxt
+```
+
+```go
+import "github.com/jimsmart/grobotstxt"
+```
+
+### Dependencies
+
+- Standard library.
+- [Ginkgo](https://onsi.github.io/ginkgo/) and [Gomega](https://onsi.github.io/gomega/) if you wish to run the tests.
+
+## Examples
+
+```go
+import "github.com/jimsmart/grobotstxt"
+
+// Fetched robots.txt file.
+robotsTxt := `
+# robots.txt with restricted area
+
+User-agent: *
+Disallow: /members/*
+
+Sitemap: http://example.net/sitemap.xml
+`
+
+// User-agent of bot.
+const userAgent = "FooBot/1.0"
+
+// Target URI.
+uri := "http://example.net/members/index.html"
+
+// Is the bot allowed to visit this page?
+ok := grobotstxt.AgentAllowed(robotsTxt, userAgent, uri)
+```
+
+Additionally, one can extract all Sitemap URIs from a given robots.txt file:
+
+```go
+sitemaps := grobotstxt.Sitemaps(robotsTxt)
+```
+
+See the GoDocs for further information.
+
+## Documentation
+
+GoDocs: [https://godoc.org/github.com/jimsmart/grobotstxt](https://godoc.org/github.com/jimsmart/grobotstxt)

 ## Testing

@@ -26,6 +80,16 @@ For a full coverage report, try:
 $ go test -coverprofile=coverage.out && go tool cover -html=coverage.out
 ```

+## Notes
+
+Parsing of robots.txt files themselves is done exactly as in the production
+version of Googlebot, including how percent codes and Unicode characters in
+patterns are handled. However, the user must ensure that the URI passed to the
+AgentAllowed and AgentsAllowed functions, or to the URI parameter of the robots
+tool, follows the format specified by RFC 3986, since this library does not
+perform full normalization of those URI parameters. Only if the URI is in this
+format will the matching be done according to the REP specification.
+
 ## License

 Package grobotstxt is licensed under the terms of the
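
The new Notes section leaves URI formatting to the caller. Below is a minimal caller-side sketch of that idea, using only the standard library's net/url; the robots.txt body and target URI here are illustrative values and are not part of this commit.

```go
package main

import (
	"fmt"
	"net/url"

	"github.com/jimsmart/grobotstxt"
)

func main() {
	robotsTxt := "User-agent: *\nDisallow: /members/*\n"

	// Caller-side sanity check (a suggested practice, not an API of this
	// package): make sure the target URI is a well-formed, absolute URI
	// before handing it to AgentAllowed. This is not full RFC 3986
	// normalization; the library leaves that responsibility with the caller.
	raw := "http://example.net/members/index page.html"
	u, err := url.Parse(raw)
	if err != nil || !u.IsAbs() {
		fmt.Println("refusing to match a malformed or relative URI:", raw)
		return
	}

	// u.String() re-serialises the URI; characters such as the space in
	// this path come back percent-encoded.
	fmt.Println(grobotstxt.AgentAllowed(robotsTxt, "FooBot/1.0", u.String()))
}
```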

doc.go (+1 -1)

@@ -1,4 +1,4 @@
 // Package grobotstxt is a Go port of Google's robots.txt parser and matcher C++ library.
 //
-// See https://github.com/google/robotstxt
+// See: https://github.com/google/robotstxt
 package grobotstxt

examples_test.go (+40)

@@ -0,0 +1,40 @@
+package grobotstxt_test
+
+import (
+	"fmt"
+
+	"github.com/jimsmart/grobotstxt"
+)
+
+func ExampleAgentAllowed() {
+
+	robotsTxt := `
+# robots.txt with restricted area
+
+User-agent: *
+Disallow: /members/*
+`
+	ok := grobotstxt.AgentAllowed(robotsTxt, "FooBot/1.0", "http://example.net/members/index.html")
+	fmt.Println(ok)
+
+	// Output:
+	// false
+}
+
+func ExampleSitemaps() {
+
+	robotsTxt := `
+# robots.txt with sitemaps
+
+User-agent: *
+Disallow: /members/*
+
+Sitemap: http://example.net/sitemap.xml
+Sitemap: http://example.net/sitemap2.xml
+`
+	sitemaps := grobotstxt.Sitemaps(robotsTxt)
+	fmt.Println(sitemaps)
+
+	// Output:
+	// [http://example.net/sitemap.xml http://example.net/sitemap2.xml]
+}
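
The two functions above are standard Go testable examples: `go test` compares what they print against their `// Output:` comments. Assuming a local clone of the package, just these examples can be run with, for instance, `go test -run Example -v`.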

robots_cc.go (+2 -1)

@@ -595,7 +595,8 @@ func NewRobotsMatcher() *RobotsMatcher {
 	return &m
 }

-// init initialises the userAgents and path for this RobotsMatcher.
+// init initialises the next path and user-agents to check. Path must contain only the
+// path, params, and query (if any) of the URL, and must start with a '/'.
 func (m *RobotsMatcher) init(userAgents []string, path string) {
 	// Line :478
 	m.path = path
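
The reworded comment on init says the path argument must contain only the path, params, and query of the URL, and must start with a '/'. A small sketch of deriving such a value from a full URL with the standard library is shown below; extractPath is a hypothetical helper for illustration, not something this commit adds (init itself is unexported).

```go
package main

import (
	"fmt"
	"net/url"
)

// extractPath is a hypothetical helper illustrating the shape of value
// the matcher expects internally: the escaped path (which in net/url
// also carries any ;params) plus the query, always beginning with '/'.
func extractPath(rawURL string) string {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "/"
	}
	path := u.EscapedPath()
	if path == "" || path[0] != '/' {
		path = "/" + path
	}
	if u.RawQuery != "" {
		path += "?" + u.RawQuery
	}
	return path
}

func main() {
	fmt.Println(extractPath("http://example.net/members/index.html?lang=en"))
	// Prints: /members/index.html?lang=en
}
```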
