|
1 | 1 | {
|
2 | 2 | "metadata": {
|
3 | 3 | "name": "",
|
4 |
| - "signature": "sha256:a1ff8ec8ebc0312f22bc9c580940eafd0be4d19b198403b4e1f62ebfc5a9d4cf" |
| 4 | + "signature": "sha256:9c4f189328f7bcdfb6a6712ea0a60f3d59f38125773549703574e6ecc776eb9e" |
5 | 5 | },
|
6 | 6 | "nbformat": 3,
|
7 | 7 | "nbformat_minor": 0,
|
|
753 | 753 | "\n",
|
754 | 754 | "What kind of features can we use for authorship attribution? Words, defined as everything surrounded by spaces, are generally considered good features. The same holds for bigrams of words and character $n$-grams: [there ain't no such thing as a free lunch](https://en.wikipedia.org/wiki/No_free_lunch_theorem). Therefore, let's not restrict ourselves to a single feature representation but experiment with a number of different representations and see what works best.\n",
|
755 | 755 | "\n",
|
756 |
| - "In the folder `supervized-learning/data/novels` you will find 26 famous British novels downloaded from [Project Gutenberg](http://www.gutenberg.org/wiki/Main_Page). This is a small toy dataset that we will use in our experiments. First we will create a simple representation of a document. I choose to represent each document as a tuple of an author, a title and the actual text. Instead of ordinary tuples we will use the `namedtuple` from the [collections](https://docs.python.org/3.4/library/collections.html#collections.namedtuple) module in Python's standard library. A namedtuple can be constructed as follows:" |
| 756 | + "In the folder `data/british-novels` you will find 26 famous British novels downloaded from [Project Gutenberg](http://www.gutenberg.org/wiki/Main_Page). This is a small toy dataset that we will use in our experiments. First we will create a simple representation of a document. I choose to represent each document as a tuple of an author, a title and the actual text. Instead of ordinary tuples we will use the `namedtuple` from the [collections](https://docs.python.org/3.4/library/collections.html#collections.namedtuple) module in Python's standard library. A namedtuple can be constructed as follows:" |
757 | 757 | ]
|
758 | 758 | },
|
759 | 759 | {
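To recap how `namedtuple` works before the notebook's own definition, here is a minimal sketch. The `Document` name and its three fields follow the description above; the example novel and field values are purely illustrative:

```python
from collections import namedtuple

# A Document is a tuple of an author, a title and the actual text;
# namedtuple lets us access each field by name as well as by position.
Document = namedtuple('Document', ['author', 'title', 'text'])

doc = Document(author='Jane Austen', title='Emma',
               text='Emma Woodhouse, handsome, clever, and rich...')
print(doc.author)  # access by field name
print(doc[1])      # access by position, like an ordinary tuple
```

A namedtuple behaves exactly like a regular tuple (it unpacks, iterates and compares the same way) but makes the code self-documenting, since `doc.author` is clearer than `doc[0]`.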
|
|
896 | 896 | "cell_type": "markdown",
|
897 | 897 | "metadata": {},
|
898 | 898 | "source": [
|
899 |
| - "We will write a function `make_document` that takes as argument a filename and returns an instance of our named tuple `Document`. Each filename in `supervized-learning/data/british-novels` consist of the author and the title separated by an underscore. This allows us to use the filenames to easily extract the title and author. The function `make_document` takes as argument a filename, an $n$-gram range, an argument that states whether to lowercase the text the type of $n$-grams (either word or char) and how large the sample of each text should be:" |
| 899 | + "We will write a function `make_document` that takes as argument a filename and returns an instance of our named tuple `Document`. Each filename in `data/british-novels` consists of the author and the title separated by an underscore. This allows us to use the filenames to easily extract the title and author. The function `make_document` takes as arguments a filename, an $n$-gram range, an argument that states whether to lowercase the text, the type of $n$-grams (either word or char), and how large the sample of each text should be:" |
900 | 900 | ]
|
901 | 901 | },
|
902 | 902 | {
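A minimal sketch of such a function is shown below. The signature and default values are assumptions (the notebook's actual implementation follows in a later cell), and the $n$-gram extraction itself is omitted here, since `ngram_range` and `ngram_type` would typically be applied during vectorization:

```python
import os
from collections import namedtuple

Document = namedtuple('Document', ['author', 'title', 'text'])

def make_document(filename, ngram_range=(1, 1), lowercase=True,
                  ngram_type='word', sample=5000):
    # The filename stem has the form 'Author_Title', so splitting on
    # the first underscore recovers both fields.
    stem = os.path.splitext(os.path.basename(filename))[0]
    author, title = stem.split('_', 1)
    with open(filename, encoding='utf-8') as infile:
        text = infile.read()
    if lowercase:
        text = text.lower()
    # Keep only a sample from the start of the text; the sampling unit
    # depends on the n-gram type (words versus characters).
    if ngram_type == 'word':
        text = ' '.join(text.split()[:sample])
    else:
        text = text[:sample]
    return Document(author, title, text)
```

Called as `make_document('data/british-novels/Austen_Emma.txt')`, this sketch would return a `Document` whose `author` is `'Austen'` and whose `title` is `'Emma'`.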
|
|
1382 | 1382 | "input": [
|
1383 | 1383 | "from glob import glob\n",
|
1384 | 1384 | "\n",
|
1385 |
| - "documents = [make_document(f) for f in glob('supervized-learning/data/british-novels/*.txt')]" |
| 1385 | + "documents = [make_document(f) for f in glob('data/british-novels/*.txt')]" |
1386 | 1386 | ],
|
1387 | 1387 | "language": "python",
|
1388 | 1388 | "metadata": {},
|
|
1445 | 1445 | "scores = {}\n",
|
1446 | 1446 | "# insert your code here\n",
|
1447 | 1447 | "for sample in range(100, 5000, 500):\n",
|
1448 |
| - " documents = [make_document(f, sample=sample) for f in glob(\n", |
1449 |
| - " 'supervized-learning/data/british-novels/*.txt')]\n", |
| 1448 | + " documents = [make_document(f, sample=sample) for f in glob('data/british-novels/*.txt')]\n", |
1450 | 1449 | " authors, titles, texts = zip(*documents)\n",
|
1451 | 1450 | " scores[sample] = cross_validate(AuthorshipLearner(), texts, authors, k=None, score_fn=f_score)"
|
1452 | 1451 | ],
|
|
1488 | 1487 | "scores = {}\n",
|
1489 | 1488 | "# insert your code here\n",
|
1490 | 1489 | "for n_most_frequent in range(50, 500, 100):\n",
|
1491 |
| - " documents = [make_document(f) for f in glob(\n", |
1492 |
| - " 'supervized-learning/data/british-novels/*.txt')]\n", |
| 1490 | + " documents = [make_document(f) for f in glob('data/british-novels/*.txt')]\n", |
1493 | 1491 | " authors, titles, texts = zip(*documents)\n",
|
1494 | 1492 | " scores[n_most_frequent] = cross_validate(AuthorshipLearner(n_most_frequent=n_most_frequent), \n",
|
1495 | 1493 | " texts, authors, k=None, score_fn=f_score)"
|
|
1594 | 1592 | "cell_type": "code",
|
1595 | 1593 | "collapsed": false,
|
1596 | 1594 | "input": [
|
1597 |
| - "grid_search(AuthorshipLearner(), \n", |
1598 |
| - " 'supervized-learning/data/british-novels/', \n", |
| 1595 | + "grid_search(AuthorshipLearner(), 'data/british-novels/', \n", |
1599 | 1596 | " params=params, n_folds=None, score_fn=f_score, verbose=1)"
|
1600 | 1597 | ],
|
1601 | 1598 | "language": "python",
|
|