\documentclass[11pt]{article}
%\usepackage[top=20mm,left=20mm,right=20mm,bottom=15mm,a4paper]{geometry} % see geometry.pdf on how to lay out the page. There's lots.
\usepackage[top=20mm,left=20mm,right=20mm,bottom=15mm,headsep=15pt,footskip=15pt,a4paper]{geometry} % see geometry.pdf on how to lay out the page. There's lots.
%\geometry{a4paper} % or letter or a5paper or ... etc
% \geometry{landscape} % rotated page geometry
\usepackage[round]{natbib}
\setlength{\bibsep}{0.0pt}
\usepackage{color}
\usepackage{times}
%\usepackage[T1]{fontenc}
%\usepackage{mathptmx}
\usepackage{tikz-dependency}
\usepackage{enumitem}
%\usepackage{times}
\usepackage[procnames]{listings}
\usepackage{todonotes}
% See the ``Article customise'' template for some common customisations
\newcommand{\refeq}[1]{Equation~\ref{eq:#1}}
\newcommand{\reffig}[1]{Figure~\ref{fig:#1}}
\newcommand{\reftab}[1]{Table~\ref{tab:#1}}
\newcommand{\refsec}[1]{\textsection\ref{sec:#1}}
\newcommand{\newsec}[2]{\section{#1}\label{sec:#2}\noindent}
\newcommand{\newsubsec}[2]{\subsection{#1}\label{sec:#2}\noindent}
\newcommand{\argmax}{\operatornamewithlimits{argmax}}
\newcommand{\argmin}{\operatornamewithlimits{argmin}}
\makeatletter
\def\@maketitle{ % custom maketitle
\begin{center}%
{\bfseries \@title}%
{\bfseries \@author}%
\end{center}%
\smallskip \hrule \bigskip }
% custom section
\renewcommand{\section}{\@startsection
{section}% % the name
{1}% % the level
{0mm}% % the indent
{-0.8\baselineskip}% % the before skip
{0.3\baselineskip}% % the after skip
{\bfseries\large}}% the style
% custom subsection
\renewcommand{\subsection}{\@startsection
{subsection}% % the name
{2}% % the level
{0mm}% % the indent
{-0.8\baselineskip}% % the before skip
{0.3\baselineskip}% % the after skip
{\bfseries\large}}% the style
\renewcommand{\paragraph}{%
\@startsection{paragraph}{4}%
{\z@}{1.5ex \@plus 1ex \@minus .2ex}{-1em}%
{\normalfont\normalsize\bfseries}%
}\makeatother
\title{{\LARGE Natural Language Processing}\\[1.5mm]{\large Lab 2: Tokenization and Sentence Segmentation}}
\author{}
\date{} % delete this line to display the current date
%%% BEGIN DOCUMENT
\begin{document}
\definecolor{keywords}{RGB}{255,0,90}
\definecolor{comments}{RGB}{0,0,113}
\definecolor{red}{RGB}{160,0,0}
\definecolor{green}{RGB}{0,150,0}
\lstset{language=Python,
basicstyle=\ttfamily\small,
keywordstyle=\color{keywords},
commentstyle=\color{comments},
stringstyle=\color{red},
showstringspaces=false,
identifierstyle=\color{green},
procnamekeys={def,class}}
\maketitle
%\tableofcontents
%\vspace{3mm}
\vspace{-2mm} \newsec{Introduction}{intro}%
In this lab, we are going to refine a tokenizer to handle
phenomena such as punctuation, abbreviations, contractions, and words
containing non-alphabetic characters. We will also add a simple
mechanism for sentence segmentation. For developing the tokenizer,
we use the same text as in Lab 1, where we manually screened and
corrected the output of the simple tokenizer. Once we are
satisfied with the tokenizer's performance on this text, we will
evaluate it on a previously unseen (but similar) text. All the
files needed for the lab are in {\tt
/local/kurs/nlp/basic2/}. Again, start by making copies into your home
directory (see Lab 1 for details).
\newsec{The gold standard}{gold}%
The file {\tt dev1-gold.txt} contains the correct tokenization of the
text in {\tt dev1-raw.txt} according to the widely adopted Penn
Treebank standard. Have a look at the tokenized text and make sure
that you understand the principles by which it has been
tokenized. Most of the tokens are either regular words or punctuation
marks, but note the following:
\begin{itemize}[noitemsep,topsep=0.2cm]
\item Numerical expressions like {\tt 3.8} are treated as single
tokens.
\item Logograms like {\tt \%} and {\tt \$} are treated as separate
tokens.
\item Contractions like {\tt it's} and {\tt don't} are split into two
tokens -- but how?
\item Straight quotation marks ({\tt "}) have been changed to opening
({\tt ``}) or closing ({\tt ''}) quotation marks.
% \item An end-of-sentence period ({\tt .}) has been inserted in two
% places -- can you find them?
\end{itemize}
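For instance, given the made-up raw string {\tt up 3.8\% to \$40}
(not taken from the lab text), the corresponding gold-standard
tokens, one per line in the actual files, would be:
\begin{verbatim}
up
3.8
%
to
$
40
\end{verbatim}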
\newsec{A pattern-matching tokenizer}{tokenize}%
The file {\tt tokenizer1.py} contains a very simple tokenizer:
\begin{center}
\fbox{
\lstinputlisting{code/tokenizer1.py}
}
\end{center}
The main difference compared to the whitespace tokenizer used in
Lab 1 is that we are now trying to define what a valid token
looks like, instead of just using whitespace to find token
boundaries. Therefore, we use the Python method {\tt re.findall}
instead of {\tt re.split}. This method takes as arguments a regular
expression and a string, and returns a list of all the matches found
in the string. Matching proceeds from left to right, greedily
finding the longest possible match at each point. Let us consider
the regular expression used in {\tt tokenizer1.py}:
\begin{center}
\fbox{
\lstinputlisting{code/tok1.py}
}
\end{center}
The regular expression, enclosed in double quotes ({\tt "}), is a
simple union expression, where the first part is a character set
containing the most common punctuation marks ({\tt
[,;:.!?$\backslash$"]}), and the second part is an expression ({\tt
$\backslash$w+}) matching any non-empty sequence of alpha-numeric
characters. Note that the double quotes must be escaped in the set of
punctuation marks. Why?
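To get a feeling for how {\tt re.findall} applies this pattern, try
it on a made-up sentence (the snippet below is purely illustrative
and not part of the lab files):
\begin{lstlisting}
import re

# The pattern described above: common punctuation marks, or maximal
# runs of alpha-numeric characters.
text = 'He said: "It is 3.8%."'
for token in re.findall("([,;:.!?\"]|\w+)", text):
    print(token)
\end{lstlisting}
Note how {\tt 3.8} comes out as three tokens and how {\tt \%} is
silently dropped, since it matches neither part of the pattern;
these are exactly the kinds of problems addressed in \refsec{refine}.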
% In addition, there are two pairs of round brackets in the regular
% expression, one enclosing the entire disjunction, and one enclosing
% an empty string at the end of the second part. These brackets are
% completely superfluous as far as the matching is concerned. They are
% only there to guarantee that the behavior of the program will not
% change if you later feel the need to add your own brackets for
% grouping. (Unless you are especially interested in the inner
% workings of Python, you can completely ignore this technicality for
% now as long as you don't remove the brackets.)
\newsec{Refine the tokenizer }{refine}%
Run {\tt tokenizer1.py} on {\tt dev1-raw.txt}, and compare the output
to {\tt dev1-gold.txt} using the Linux command {\tt diff}: \\
\noindent
{\tt python tokenizer1.py < dev1-raw.txt > dev1-tok.txt}\\
\noindent
{\tt diff dev1-gold.txt dev1-tok.txt}\\
\textbf{Note:} For better visualisation, you may want to try using vimdiff:\\
{\tt vimdiff dev1-gold.txt dev1-tok.txt}\\
Note that {\tt vimdiff} opens an editor whose key bindings may differ from
what you are used to. To quit, type {\tt :q} in each window (or {\tt :qa} to
close all windows at once).\\
%\clearpage
\noindent
This gives you a list of the problems you have to fix: lines prefixed with
{\tt <} come from the gold standard and lines prefixed with {\tt >} from your
tokenizer's output (in {\tt vimdiff}, the gold standard appears in the left
window and your output in the right). Your task is to modify the regular expression
used for matching in order to eliminate as many of the problems as possible. A
good strategy is to concentrate on one type of token at a time, and add a new
subexpression to the disjunction, specifically designed to handle that token
type. Note that Python processes the regular expression from left to right,
which means that earlier subexpressions will take priority over later
subexpressions if they have overlapping matches. (The two subexpressions in the
original expression never have overlapping matches. Why?) If you need help,
consult the Python documentation at {\tt
https://docs.python.org/3/howto/regex.html}. If you want to measure the exact
precision and recall on the token level, you can use {\tt score-tokens.py} as
follows:
\begin{verbatim}
python score-tokens.py dev1-gold.txt dev1-tok.txt
\end{verbatim}
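To illustrate the ordering strategy, here is a minimal sketch
(deliberately partial, and not necessarily how your final pattern
should look) that keeps decimal numbers together by giving a number
subexpression priority over the punctuation class:
\begin{lstlisting}
import re

# Sketch only: the decimal-number subexpression comes *before* the
# punctuation class, so "3.8" is matched as a single token instead
# of being split into "3", ".", "8". Contractions, abbreviations,
# etc. still need subexpressions of their own.
pattern = "(\d+\.\d+|[,;:.!?\"]|\w+)"
print(re.findall(pattern, 'growth was 3.8, not 4.'))
# ['growth', 'was', '3.8', ',', 'not', '4', '.']
\end{lstlisting}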
\paragraph{Note on quotation marks:} To match the gold
tokenization exactly, you would not only have to segment the text
correctly but also replace any token of the form ({\tt "}) by
either ({\tt ``}) or ({\tt ''}). For the purpose of this lab,
you can ignore any errors that only concern the form of quotation
marks, which is also what the evaluation script {\tt score-tokens.py}
does.
\paragraph{Note on Python regular expression matching:} The original
regular expression is surrounded by a single pair of round brackets,
and each match returned by {\tt findall} (in the variable {\tt token})
is a string matching the entire regular expression. When you refine
the expression, you may have to add internal brackets to get the right
grouping of subexpressions. This will change the behavior of the
program and each match will now return a tuple of strings matching the
different bracketings, where the first element is always the string
matching the entire regular expression. In this case, you therefore
have to change {\tt print(token)} to {\tt print(token[0])} to get the
right output.
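The following toy example (unrelated to the lab files) shows the
difference:
\begin{lstlisting}
import re

text = "it is 3.8"
# One pair of brackets: findall returns plain strings.
print(re.findall("(\w+)", text))
# ['it', 'is', '3', '8']
# An added internal group: findall now returns one tuple per match,
# whose first element is the match of the outermost brackets.
print(re.findall("((\d+)|\w+)", text))
# [('it', ''), ('is', ''), ('3', '3'), ('8', '8')]
\end{lstlisting}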
\newsec{Add sentence segmentation }{sent}%
The final task is to add sentence segmentation to the tokenization
process. In general, this is a rather hard problem, but for the
sentences in {\tt dev1-raw.txt} it should not be too difficult to come
up with something reasonable. It is customary to mark sentence
boundaries by a blank line, so you just need to come up with rules for
inserting a blank line after each sentence (including the last
one). When you have done this, you can compare your output to {\tt
dev1-gold-sent.txt}, which is identical to {\tt dev1-gold.txt}
except that sentence boundaries have been inserted.
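As a starting point, here is a minimal sketch with a deliberately
naive boundary rule (the input tokens are illustrative; in your
program they come from your refined tokenizer):
\begin{lstlisting}
import re

# Illustrative tokens; use your refined tokenizer in practice.
tokens = re.findall("([,;:.!?\"]|\w+)", "It works. Does it? Yes!")
for token in tokens:
    print(token)
    # Naive rule: every ".", "!" or "?" ends a sentence. This gives
    # spurious boundaries at abbreviation-internal periods, so it
    # needs refining for the lab text.
    if token in (".", "!", "?"):
        print()  # blank line marks a sentence boundary
\end{lstlisting}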
\paragraph{Note on sentence-final abbreviations:} When an abbreviation
such as {\tt e.g.} occurs at the end of a sentence, orthographic
conventions prescribe that there is no final stop ({\tt .}) to end the
sentence. According to the Penn Treebank standard, however, the final
stop should be restored as part of the sentence segmentation process,
which means that an additional token is inserted in this case. For the
purpose of this lab, you can ignore this complication.
\indent Try to reach at least 95\% precision and recall.
\end{document}