|
1 | 1 | {
|
2 | 2 | "metadata": {
|
3 |
| - "name": "Chapter 5 - Building NLP Applications" |
| 3 | + "name": "" |
4 | 4 | },
|
5 | 5 | "nbformat": 3,
|
6 | 6 | "nbformat_minor": 0,
|
|
26 | 26 | "cell_type": "markdown",
|
27 | 27 | "metadata": {},
|
28 | 28 | "source": [
|
29 |
| - "In the last chapter we made some tools to prepare corpora for further processing. To be able to tokenise a text is nice, but from a humanities perspective not very interesting. So, what are we going to do with it? In this chapter, you'll implement two major applications that build upon the tools you developed. The first will be a relatively simple program that scores each text in a corpus according to its Automatic Readability Index. In the second application we will build a system that can predict who wrote a certain text. Again, we'll need to cover a lot of ground and things are becoming increasingly difficult now. So, let's get started!" |
| 29 | + "In the last chapter we made some tools to prepare corpora for further processing. To be able to tokenise a text is nice, but from a humanities perspective not very interesting. So, what are we going to do with it? In this chapter, you'll implement two major applications that build upon the tools you developed. The first will be a relatively simple program that scores each text in a corpus according to its *Automatic Readability Index*. In the second application we will build a system that can predict who wrote a certain text. Again, we'll need to cover a lot of ground and things are becoming increasingly difficult now. So, let's get started!" |
30 | 30 | ]
|
31 | 31 | },
|
32 | 32 | {
|
|
41 | 41 | "cell_type": "markdown",
|
42 | 42 | "metadata": {},
|
43 | 43 | "source": [
|
44 |
| - "The Automatic Readability Index is a readability test designed to gauge the understandability of a text. The formula for calculating the Automated Readability Index is as follows:\n", |
| 44 | + "The *Automatic Readability Index* is a readability test designed to gauge the understandability of a text. The formula for calculating the *Automated Readability Index* is as follows:\n", |
45 | 45 | "\n",
|
46 | 46 | "$$ 4.71 \\cdot \\frac{nchars}{nwords} + 0.5 \\cdot \\frac{nwords}{nsents} - 21.43 $$\n",
|
47 | 47 | "\n",
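Before turning to the exercise below, it may help to see how directly this formula maps onto Python. The sketch uses the hypothetical name `ari` so as not to collide with the function you are asked to write; it is written for Python 3 (under Python 2, as used by this notebook format, you would add `from __future__ import division` to avoid integer division):

```python
def ari(n_chars, n_words, n_sents):
    # direct transcription of the formula above; `ari` is a
    # hypothetical name, not the one the exercise asks for
    return 4.71 * (n_chars / n_words) + 0.5 * (n_words / n_sents) - 21.43

print(ari(300, 40, 10))  # roughly 15.895
```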
|
|
67 | 67 | "cell_type": "markdown",
|
68 | 68 | "metadata": {},
|
69 | 69 | "source": [
|
70 |
| - "Write a function `AutomaticReadabilityIndex` that takes three arguments `n_chars`, `n_words` and `n_sents` and returns the ARI given those arguments." |
| 70 | + "Write a function `automatic_readability_index` that takes three arguments `n_chars`, `n_words` and `n_sents` and returns the ARI given those arguments." |
71 | 71 | ]
|
72 | 72 | },
|
73 | 73 | {
|
74 | 74 | "cell_type": "code",
|
75 | 75 | "collapsed": false,
|
76 | 76 | "input": [
|
77 |
| - "def AutomaticReadabilityIndex(n_chars, n_words, n_sents):\n", |
| 77 | + "def automatic_readability_index(n_chars, n_words, n_sents):\n", |
78 | 78 | " # insert your code here\n",
|
79 | 79 | "\n",
|
80 | 80 | "# do not modify the code below, it is for testing your answer only!\n",
|
81 | 81 | "# it should output True if you did well\n",
|
82 |
| - "print(abs(AutomaticReadabilityIndex(300, 40, 10) - 15.895) < 0.001)" |
| 82 | + "print(abs(automatic_readability_index(300, 40, 10) - 15.895) < 0.001)" |
83 | 83 | ],
|
84 | 84 | "language": "python",
|
85 | 85 | "metadata": {},
|
|
96 | 96 | "cell_type": "markdown",
|
97 | 97 | "metadata": {},
|
98 | 98 | "source": [
|
99 |
| - "Now we need to write some code to obtain the numbers we so wishfully assumed to have. We will use the code we wrote in earlier chapters to read and tokenise texts. We stored all the functions we wrote for our corpus reader in `preprocess.py`. We only need the function `readcorpus` and import it here." |
| 99 | + "Now we need to write some code to obtain the numbers we so wishfully assumed to have. We will use the code we wrote in earlier chapters to read and tokenise texts. In the file `preprocessing.py` we defines a function `read_corpus` which reads all files with the extension `.txt` in the given directory. It tokenizes each text into a list of sentences each of which is represented by a list of words. All words are lowercased and we remove all punctuation. We import the function using the following line of code:" |
100 | 100 | ]
|
101 | 101 | },
|
102 | 102 | {
|
103 | 103 | "cell_type": "code",
|
104 | 104 | "collapsed": false,
|
105 | 105 | "input": [
|
106 |
| - "from preprocess import readcorpus" |
| 106 | + "from pyhum.preprocessing import read_corpus" |
107 | 107 | ],
|
108 | 108 | "language": "python",
|
109 | 109 | "metadata": {},
|
110 |
| - "outputs": [] |
| 110 | + "outputs": [], |
| 111 | + "prompt_number": 2 |
111 | 112 | },
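The actual implementation in `pyhum.preprocessing` is not shown here. As a rough idea of what such a reader does, a minimal sketch (assuming plain-text files, sentence splits on `.`, `!` and `?`, and a simple letters-only word pattern; `read_corpus_sketch` is a hypothetical stand-in, not the real function) could look like:

```python
import os
import re

def read_corpus_sketch(directory):
    # yield (filename, sentences) pairs for every .txt file in directory;
    # sentences are lists of lowercased words with punctuation stripped
    for filename in sorted(os.listdir(directory)):
        if not filename.endswith('.txt'):
            continue
        with open(os.path.join(directory, filename)) as infile:
            text = infile.read().lower()
        sentences = []
        for raw_sentence in re.split(r'[.!?]+', text):
            words = re.findall(r'[a-z]+', raw_sentence)
            if words:  # skip empty splits, e.g. after the final period
                sentences.append(words)
        yield filename, sentences
```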
|
112 | 113 | {
|
113 | 114 | "cell_type": "markdown",
|
114 | 115 | "metadata": {},
|
115 | 116 | "source": [
|
116 |
| - "Remember that the function readcorpus returns a generator of `(filename, sentences)` tuples. Sentences are represented by lists of strings, i.e. a list of tokens. Let's write a function `extract_counts` that takes a list of sentences as input and returns the number of characters, the number of words and the number of sentences as a tuple." |
| 117 | + "Let's write a function `extract_counts` that takes a list of sentences as input and returns the number of characters, the number of words and the number of sentences as a tuple." |
117 | 118 | ]
|
118 | 119 | },
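One possible way to collect these counts — a sketch, not the only correct solution, under the hypothetical name `extract_counts_sketch` — is to sum word lengths for the character count and sentence lengths for the word count:

```python
def extract_counts_sketch(sentences):
    # character count: summed length of every word in every sentence
    n_chars = sum(len(word) for sentence in sentences for word in sentence)
    # word count: summed length of each sentence (a list of words)
    n_words = sum(len(sentence) for sentence in sentences)
    n_sents = len(sentences)
    return n_chars, n_words, n_sents
```

Note that with this definition spaces and sentence boundaries contribute nothing to the character count, which is what the test values in this chapter assume.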
|
119 | 120 | {
|
|
147 | 148 | "\n",
|
148 | 149 | "# do not modify the code below, for testing only!\n",
|
149 | 150 | "print(extract_counts(\n",
|
150 |
| - " [[\"This\", \"was\", \"rather\", \"easy\", \".\"], \n", |
151 |
| - " [\"Please\", \"give\", \"me\", \"something\", \"more\", \"challenging\"]]) == (54, 11, 2))" |
| 151 | + " [[\"this\", \"was\", \"rather\", \"easy\"], \n", |
| 152 | + " [\"please\", \"give\", \"me\", \"something\", \"more\", \"challenging\"]]) == (53, 10, 2))" |
152 | 153 | ],
|
153 | 154 | "language": "python",
|
154 | 155 | "metadata": {},
|
|
172 | 173 | "cell_type": "code",
|
173 | 174 | "collapsed": false,
|
174 | 175 | "input": [
|
175 |
| - "sentences = [[\"This\", \"was\", \"rather\", \"easy\", \".\"], \n", |
| 176 | + "sentences = [[\"this\", \"was\", \"rather\", \"easy\"], \n", |
176 | 177 |     "             [\"please\", \"give\", \"me\", \"something\", \"more\", \"challenging\"]]\n",
|
177 | 178 | "\n",
|
178 | 179 | "n_chars, n_words, n_sents = extract_counts(sentences)\n",
|
179 |
| - "\n", |
180 |
| - "print(abs(AutomaticReadabilityIndex(n_chars, n_words, n_sents) - 4.442) < 0.001)" |
| 180 | + "print(automatic_readability_index(n_chars, n_words, n_sents))" |
181 | 181 | ],
|
182 | 182 | "language": "python",
|
183 | 183 | "metadata": {},
|
|
209 | 209 | "cell_type": "markdown",
|
210 | 210 | "metadata": {},
|
211 | 211 | "source": [
|
212 |
| - "Write the function `compute_ARI` that takes as argument a list of sentences (represented by lists of words) and returns the Automatic Readability Index for that input." |
| 212 | + "Write the function `compute_ARI` that takes as argument a list of sentences (represented by lists of words) and returns the *Automatic Readability Index* for that input." |
213 | 213 | ]
|
214 | 214 | },
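For reference, the counting step and the formula can also be fused into one function. This self-contained sketch, under the hypothetical name `compute_ari_sketch`, shows one possible shape for the `compute_ARI` asked for above:

```python
def compute_ari_sketch(sentences):
    # gather the three counts directly from the sentence lists
    n_chars = sum(len(word) for sentence in sentences for word in sentence)
    n_words = sum(len(sentence) for sentence in sentences)
    n_sents = len(sentences)
    # then plug them into the readability formula
    return 4.71 * (n_chars / n_words) + 0.5 * (n_words / n_sents) - 21.43
```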
|
215 | 215 | {
|
|
274 | 274 | "metadata": {},
|
275 | 275 | "outputs": []
|
276 | 276 | },
|
| 277 | + { |
| 278 | + "cell_type": "markdown", |
| 279 | + "metadata": {}, |
| 280 | + "source": [ |
| 281 | + "Remember that in Chapter 3, we plotted different basic statistics using Python plotting library matplotlib. Can you do the same for all ARIs?" |
| 282 | + ] |
| 283 | + }, |
| 284 | + { |
| 285 | + "cell_type": "code", |
| 286 | + "collapsed": false, |
| 287 | + "input": [ |
| 288 | + "import matplotlib.pyplot as plt\n", |
| 289 | + "\n", |
| 290 | + "# insert your code here" |
| 291 | + ], |
| 292 | + "language": "python", |
| 293 | + "metadata": {}, |
| 294 | + "outputs": [] |
| 295 | + }, |
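One possible shape for such a plot is a bar chart of one score per file. This sketch uses made-up filenames and scores purely for illustration; replace `aris` with the dictionary of scores you computed from your corpus. The `Agg` backend is selected so the sketch also runs without a display, saving the figure to a file instead:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; in the notebook you would not need this
import matplotlib.pyplot as plt

# hypothetical scores for illustration only
aris = {'austen-emma.txt': 13.2,
        'carroll-alice.txt': 8.1,
        'melville-moby_dick.txt': 12.5}

names = sorted(aris)
plt.bar(range(len(names)), [aris[name] for name in names])
plt.xticks(range(len(names)), names, rotation=45, ha='right')
plt.ylabel('Automatic Readability Index')
plt.tight_layout()
plt.savefig('ari-scores.png')
```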
277 | 296 | {
|
278 | 297 | "cell_type": "markdown",
|
279 | 298 | "metadata": {},
|
|
527 | 546 | "collapsed": false,
|
528 | 547 | "input": [
|
529 | 548 | "def add_file_to_database(filename, feature_database):\n",
|
530 |
| - " return update_counts(extract_author(filename), extract_features(filename), feature_database)" |
| 549 | + " return update_counts(extract_author(filename), \n", |
| 550 | + " extract_features(filename), \n", |
| 551 | + " feature_database)" |
531 | 552 | ],
|
532 | 553 | "language": "python",
|
533 | 554 | "metadata": {},
|
|