This repository was archived by the owner on Mar 8, 2021. It is now read-only.

Commit 4fe5e6f: change corpus paths
1 parent d432805

File tree

1 file changed: +7 -10 lines changed

Chapter 9 - Learning from Examples.ipynb (+7 -10)
@@ -1,7 +1,7 @@
 {
 "metadata": {
 "name": "",
-"signature": "sha256:a1ff8ec8ebc0312f22bc9c580940eafd0be4d19b198403b4e1f62ebfc5a9d4cf"
+"signature": "sha256:9c4f189328f7bcdfb6a6712ea0a60f3d59f38125773549703574e6ecc776eb9e"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
@@ -753,7 +753,7 @@
 "\n",
 "What kind of features can we use for authorship attribution? Words, defined as everything surrounded by spaces, are generally considered good features. The same holds for bigrams of words and character $n$-grams: [there ain't no such thing as a free lunch](https://en.wikipedia.org/wiki/No_free_lunch_theorem). Therefore, let's not restrict ourselves to a single feature representation but experiment with a number of different representations and see what works best.\n",
 "\n",
-"In the folder `supervized-learning/data/novels` you will find 26 famous British novels downloaded from [Project Gutenberg](http://www.gutenberg.org/wiki/Main_Page). This is a small toy dataset that we will use in our experiments. First we will create a simple representation of a document. I choose to represent each document as a tuple of an author, a title and the actual text. Instead of ordinary tuples we will use the `namedtuple` from the [collections](https://docs.python.org/3.4/library/collections.html#collections.namedtuple) module in Python's standard library. A namedtuple can be constructed as follows:"
+"In the folder `data/british-novels` you will find 26 famous British novels downloaded from [Project Gutenberg](http://www.gutenberg.org/wiki/Main_Page). This is a small toy dataset that we will use in our experiments. First we will create a simple representation of a document. I choose to represent each document as a tuple of an author, a title and the actual text. Instead of ordinary tuples we will use the `namedtuple` from the [collections](https://docs.python.org/3.4/library/collections.html#collections.namedtuple) module in Python's standard library. A namedtuple can be constructed as follows:"
 ]
 },
 {
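As context for this hunk: the `Document` namedtuple the notebook describes might be built roughly as follows. This is a minimal sketch; the field names `author`, `title`, and `text` are assumptions based on the prose, not taken from the notebook's code.

```python
from collections import namedtuple

# A lightweight, immutable record with named fields.
Document = namedtuple('Document', ['author', 'title', 'text'])

doc = Document(author='Austen', title='Emma',
               text='Emma Woodhouse, handsome, clever, and rich...')

print(doc.author)  # fields are accessible by name
print(doc[1])      # and, as with ordinary tuples, by index
```

Because a namedtuple is still a tuple, an expression like `zip(*documents)` (used later in the diff) unpacks a list of such records into parallel author, title, and text sequences.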
@@ -896,7 +896,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"We will write a function `make_document` that takes as argument a filename and returns an instance of our named tuple `Document`. Each filename in `supervized-learning/data/british-novels` consists of the author and the title separated by an underscore. This allows us to use the filenames to easily extract the title and author. The function `make_document` takes as arguments a filename, an $n$-gram range, an argument that states whether to lowercase the text, the type of $n$-grams (either word or char), and how large the sample of each text should be:"
+"We will write a function `make_document` that takes as argument a filename and returns an instance of our named tuple `Document`. Each filename in `data/british-novels` consists of the author and the title separated by an underscore. This allows us to use the filenames to easily extract the title and author. The function `make_document` takes as arguments a filename, an $n$-gram range, an argument that states whether to lowercase the text, the type of $n$-grams (either word or char), and how large the sample of each text should be:"
 ]
 },
 {
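A rough sketch of how such a `make_document` could parse author and title from a filename like `Austen_Emma.txt`. This is an assumption based on the description above, not the notebook's actual implementation; the n-gram and lowercasing options are omitted, and sampling is reduced to simple truncation.

```python
import os
from collections import namedtuple

Document = namedtuple('Document', ['author', 'title', 'text'])

def make_document(filename, sample=None):
    # Filenames look like 'Austen_Emma.txt': author and title
    # separated by an underscore.
    stem = os.path.splitext(os.path.basename(filename))[0]
    author, title = stem.split('_', 1)
    with open(filename, encoding='utf-8') as infile:
        text = infile.read()
    if sample is not None:
        # Crude sample: keep only the first `sample` characters.
        text = text[:sample]
    return Document(author, title, text)
```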
@@ -1382,7 +1382,7 @@
 "input": [
 "from glob import glob\n",
 "\n",
-"documents = [make_document(f) for f in glob('supervized-learning/data/british-novels/*.txt')]"
+"documents = [make_document(f) for f in glob('data/british-novels/*.txt')]"
 ],
 "language": "python",
 "metadata": {},
@@ -1445,8 +1445,7 @@
 "scores = {}\n",
 "# insert your code here\n",
 "for sample in range(100, 5000, 500):\n",
-"    documents = [make_document(f, sample=sample) for f in glob(\n",
-"        'supervized-learning/data/british-novels/*.txt')]\n",
+"    documents = [make_document(f, sample=sample) for f in glob('data/british-novels/*.txt')]\n",
 "    authors, titles, texts = zip(*documents)\n",
 "    scores[sample] = cross_validate(AuthorshipLearner(), texts, authors, k=None, score_fn=f_score)"
@@ -1488,8 +1487,7 @@
 "scores = {}\n",
 "# insert your code here\n",
 "for n_most_frequent in range(50, 500, 100):\n",
-"    documents = [make_document(f) for f in glob(\n",
-"        'supervized-learning/data/british-novels/*.txt')]\n",
+"    documents = [make_document(f) for f in glob('data/british-novels/*.txt')]\n",
 "    authors, titles, texts = zip(*documents)\n",
 "    scores[n_most_frequent] = cross_validate(AuthorshipLearner(n_most_frequent=n_most_frequent), \n",
 "                                             texts, authors, k=None, score_fn=f_score)"
@@ -1594,8 +1592,7 @@
 "cell_type": "code",
 "collapsed": false,
 "input": [
-"grid_search(AuthorshipLearner(), \n",
-"            'supervized-learning/data/british-novels/', \n",
+"grid_search(AuthorshipLearner(), 'data/british-novels/', \n",
 "            params=params, n_folds=None, score_fn=f_score, verbose=1)"
 ],
 "language": "python",

0 commit comments
