|
1 | 1 | {
|
2 | 2 | "metadata": {
|
3 |
| - "name": "Chapter 5 - Building NLP Applications" |
| 3 | + "name": "" |
4 | 4 | },
|
5 | 5 | "nbformat": 3,
|
6 | 6 | "nbformat_minor": 0,
|
|
26 | 26 | "cell_type": "markdown",
|
27 | 27 | "metadata": {},
|
28 | 28 | "source": [
|
29 |
| - "In the last chapter we made some tools to prepare corpora for further processing. To be able to tokenise a text is nice, but from a humanities perspective not very interesting. So, what are we going to do with it? In this chapter, you'll implement two major applications that build upon the tools you developed. The first will be a relatively simple program that scores each text in a corpus according to its Automatic Readability Index. In the second application we will build a system that can predict who wrote a certain text. Again, we'll need to cover a lot of ground and things are becoming increasingly difficult now. So, let's get started!" |
| 29 | + "In the last chapter we made some tools to prepare corpora for further processing. To be able to tokenise a text is nice, but from a humanities perspective not very interesting. So, what are we going to do with it? In this chapter, you'll implement two major applications that build upon the tools you developed. The first will be a relatively simple program that scores each text in a corpus according to its *Automatic Readability Index*. In the second application we will build a system that can predict who wrote a certain text. Again, we'll need to cover a lot of ground and things are becoming increasingly difficult now. So, let's get started!" |
30 | 30 | ]
|
31 | 31 | },
|
32 | 32 | {
|
|
41 | 41 | "cell_type": "markdown",
|
42 | 42 | "metadata": {},
|
43 | 43 | "source": [
|
44 |
| - "The Automatic Readability Index is a readability test designed to gauge the understandability of a text. The formula for calculating the Automated Readability Index is as follows:\n", |
| 44 | + "The *Automatic Readability Index* is a readability test designed to gauge the understandability of a text. The formula for calculating the *Automated Readability Index* is as follows:\n", |
45 | 45 | "\n",
|
46 | 46 | "$$ 4.71 \\cdot \\frac{nchars}{nwords} + 0.5 \\cdot \\frac{nwords}{nsents} - 21.43 $$\n",
|
47 | 47 | "\n",
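Before turning to the exercise below, it may help to see how directly this formula maps onto Python. The sketch uses the hypothetical name `ari` so as not to collide with the function you are asked to write; it is written for Python 3 (under Python 2, as used by this notebook format, you would add `from __future__ import division` to avoid integer division):

```python
def ari(n_chars, n_words, n_sents):
    # direct transcription of the formula above; `ari` is a
    # hypothetical name, not the one the exercise asks for
    return 4.71 * (n_chars / n_words) + 0.5 * (n_words / n_sents) - 21.43

print(ari(300, 40, 10))  # roughly 15.895
```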
|
|
67 | 67 | "cell_type": "markdown",
|
68 | 68 | "metadata": {},
|
69 | 69 | "source": [
|
70 |
| - "Write a function `AutomaticReadabilityIndex` that takes three arguments `n_chars`, `n_words` and `n_sents` and returns the ARI given those arguments." |
| 70 | + "Write a function `automatic_readability_index` that takes three arguments `n_chars`, `n_words` and `n_sents` and returns the ARI given those arguments." |
71 | 71 | ]
|
72 | 72 | },
|
73 | 73 | {
|
74 | 74 | "cell_type": "code",
|
75 | 75 | "collapsed": false,
|
76 | 76 | "input": [
|
77 |
| - "def AutomaticReadabilityIndex(n_chars, n_words, n_sents):\n", |
| 77 | + "def automatic_readability_index(n_chars, n_words, n_sents):\n", |
78 | 78 | " # insert your code here\n",
|
79 | 79 | "\n",
|
80 | 80 | "# do not modify the code below, it is for testing your answer only!\n",
|
81 | 81 | "# it should output True if you did well\n",
|
82 |
| - "print(abs(AutomaticReadabilityIndex(300, 40, 10) - 15.895) < 0.001)" |
| 82 | + "print(abs(automatic_readability_index(300, 40, 10) - 15.895) < 0.001)" |
83 | 83 | ],
|
84 | 84 | "language": "python",
|
85 | 85 | "metadata": {},
|
|
96 | 96 | "cell_type": "markdown",
|
97 | 97 | "metadata": {},
|
98 | 98 | "source": [
|
99 |
| - "Now we need to write some code to obtain the numbers we so wishfully assumed to have. We will use the code we wrote in earlier chapters to read and tokenise texts. We stored all the functions we wrote for our corpus reader in `preprocess.py`. We only need the function `readcorpus` and import it here." |
| 99 | + "Now we need to write some code to obtain the numbers we so wishfully assumed to have. We will use the code we wrote in earlier chapters to read and tokenise texts. In the file `preprocessing.py` we defines a function `read_corpus` which reads all files with the extension `.txt` in the given directory. It tokenizes each text into a list of sentences each of which is represented by a list of words. All words are lowercased and we remove all punctuation. We import the function using the following line of code:" |
100 | 100 | ]
|
101 | 101 | },
|
102 | 102 | {
|
103 | 103 | "cell_type": "code",
|
104 | 104 | "collapsed": false,
|
105 | 105 | "input": [
|
106 |
| - "from preprocess import readcorpus" |
| 106 | + "from pyhum.preprocessing import read_corpus" |
107 | 107 | ],
|
108 | 108 | "language": "python",
|
109 | 109 | "metadata": {},
|
110 |
| - "outputs": [] |
| 110 | + "outputs": [], |
| 111 | + "prompt_number": 2 |
111 | 112 | },
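The actual implementation in `pyhum.preprocessing` is not shown here. As a rough idea of what such a reader does, a minimal sketch (assuming plain-text files, sentence splits on `.`, `!` and `?`, and a simple letters-only word pattern; `read_corpus_sketch` is a hypothetical stand-in, not the real function) could look like:

```python
import os
import re

def read_corpus_sketch(directory):
    # yield (filename, sentences) pairs for every .txt file in directory;
    # sentences are lists of lowercased words with punctuation stripped
    for filename in sorted(os.listdir(directory)):
        if not filename.endswith('.txt'):
            continue
        with open(os.path.join(directory, filename)) as infile:
            text = infile.read().lower()
        sentences = []
        for raw_sentence in re.split(r'[.!?]+', text):
            words = re.findall(r'[a-z]+', raw_sentence)
            if words:  # skip empty splits, e.g. after the final period
                sentences.append(words)
        yield filename, sentences
```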
|
112 | 113 | {
|
113 | 114 | "cell_type": "markdown",
|
114 | 115 | "metadata": {},
|
115 | 116 | "source": [
|
116 |
| - "Remember that the function readcorpus returns a generator of `(filename, sentences)` tuples. Sentences are represented by lists of strings, i.e. a list of tokens. Let's write a function `extract_counts` that takes a list of sentences as input and returns the number of characters, the number of words and the number of sentences as a tuple." |
| 117 | + "Let's write a function `extract_counts` that takes a list of sentences as input and returns the number of characters, the number of words and the number of sentences as a tuple." |
117 | 118 | ]
|
118 | 119 | },
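One possible way to collect these counts — a sketch, not the only correct solution, under the hypothetical name `extract_counts_sketch` — is to sum word lengths for the character count and sentence lengths for the word count:

```python
def extract_counts_sketch(sentences):
    # character count: summed length of every word in every sentence
    n_chars = sum(len(word) for sentence in sentences for word in sentence)
    # word count: summed length of each sentence (a list of words)
    n_words = sum(len(sentence) for sentence in sentences)
    n_sents = len(sentences)
    return n_chars, n_words, n_sents
```

Note that with this definition spaces and sentence boundaries contribute nothing to the character count, which is what the test values in this chapter assume.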
|
119 | 120 | {
|
|
147 | 148 | "\n",
|
148 | 149 | "# do not modify the code below, for testing only!\n",
|
149 | 150 | "print(extract_counts(\n",
|
150 |
| - " [[\"This\", \"was\", \"rather\", \"easy\", \".\"], \n", |
151 |
| - " [\"Please\", \"give\", \"me\", \"something\", \"more\", \"challenging\"]]) == (54, 11, 2))" |
| 151 | + " [[\"this\", \"was\", \"rather\", \"easy\"], \n", |
| 152 | + " [\"please\", \"give\", \"me\", \"something\", \"more\", \"challenging\"]]) == (53, 10, 2))" |
152 | 153 | ],
|
153 | 154 | "language": "python",
|
154 | 155 | "metadata": {},
|
|
172 | 173 | "cell_type": "code",
|
173 | 174 | "collapsed": false,
|
174 | 175 | "input": [
|
175 |
| - "sentences = [[\"This\", \"was\", \"rather\", \"easy\", \".\"], \n", |
| 176 | + "sentences = [[\"this\", \"was\", \"rather\", \"easy\"], \n", |
176 | 177 |     "             [\"please\", \"give\", \"me\", \"something\", \"more\", \"challenging\"]]\n",
|
177 | 178 | "\n",
|
178 | 179 | "n_chars, n_words, n_sents = extract_counts(sentences)\n",
|
179 |
| - "\n", |
180 |
| - "print(abs(AutomaticReadabilityIndex(n_chars, n_words, n_sents) - 4.442) < 0.001)" |
| 180 | + "print(automatic_readability_index(n_chars, n_words, n_sents))" |
181 | 181 | ],
|
182 | 182 | "language": "python",
|
183 | 183 | "metadata": {},
|
|
209 | 209 | "cell_type": "markdown",
|
210 | 210 | "metadata": {},
|
211 | 211 | "source": [
|
212 |
| - "Write the function `compute_ARI` that takes as argument a list of sentences (represented by lists of words) and returns the Automatic Readability Index for that input." |
| 212 | + "Write the function `compute_ARI` that takes as argument a list of sentences (represented by lists of words) and returns the *Automatic Readability Index* for that input." |
213 | 213 | ]
|
214 | 214 | },
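For reference, the counting step and the formula can also be fused into one function. This self-contained sketch, under the hypothetical name `compute_ari_sketch`, shows one possible shape for the `compute_ARI` asked for above:

```python
def compute_ari_sketch(sentences):
    # gather the three counts directly from the sentence lists
    n_chars = sum(len(word) for sentence in sentences for word in sentence)
    n_words = sum(len(sentence) for sentence in sentences)
    n_sents = len(sentences)
    # then plug them into the readability formula
    return 4.71 * (n_chars / n_words) + 0.5 * (n_words / n_sents) - 21.43
```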
|
215 | 215 | {
|
|
274 | 274 | "metadata": {},
|
275 | 275 | "outputs": []
|
276 | 276 | },
|
| 277 | + { |
| 278 | + "cell_type": "markdown", |
| 279 | + "metadata": {}, |
| 280 | + "source": [ |
| 281 | + "Remember that in Chapter 3, we plotted different basic statistics using Python plotting library matplotlib. Can you do the same for all ARIs?" |
| 282 | + ] |
| 283 | + }, |
| 284 | + { |
| 285 | + "cell_type": "code", |
| 286 | + "collapsed": false, |
| 287 | + "input": [ |
| 288 | + "import matplotlib.pyplot as plt\n", |
| 289 | + "\n", |
| 290 | + "# insert your code here" |
| 291 | + ], |
| 292 | + "language": "python", |
| 293 | + "metadata": {}, |
| 294 | + "outputs": [] |
| 295 | + }, |
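One possible shape for such a plot is a bar chart of one score per file. This sketch uses made-up filenames and scores purely for illustration; replace `aris` with the dictionary of scores you computed from your corpus. The `Agg` backend is selected so the sketch also runs without a display, saving the figure to a file instead:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; in the notebook you would not need this
import matplotlib.pyplot as plt

# hypothetical scores for illustration only
aris = {'austen-emma.txt': 13.2,
        'carroll-alice.txt': 8.1,
        'melville-moby_dick.txt': 12.5}

names = sorted(aris)
plt.bar(range(len(names)), [aris[name] for name in names])
plt.xticks(range(len(names)), names, rotation=45, ha='right')
plt.ylabel('Automatic Readability Index')
plt.tight_layout()
plt.savefig('ari-scores.png')
```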
277 | 296 | {
|
278 | 297 | "cell_type": "markdown",
|
279 | 298 | "metadata": {},
|
|
527 | 546 | "collapsed": false,
|
528 | 547 | "input": [
|
529 | 548 | "def add_file_to_database(filename, feature_database):\n",
|
530 |
| - " return update_counts(extract_author(filename), extract_features(filename), feature_database)" |
| 549 | + " return update_counts(extract_author(filename), \n", |
| 550 | + " extract_features(filename), \n", |
| 551 | + " feature_database)" |
531 | 552 | ],
|
532 | 553 | "language": "python",
|
533 | 554 | "metadata": {},
|
|