################################################################################
#                 README NILCLUSTERING MODULE TAC/KBP 2014                    #
################################################################################

Configuration
-------------

Before installing and running the code it is necessary to specify the paths
to the external data (e.g., ProPPR output). This has to be done in the first
section of the Makefile. The necessary files are:

- PROPPR_TRAIN: 'qid eid'. ProPPR results for the training data with query
  column and eid column, e.g.,

    ------------------------------------
    | edl14_eng_training_0001 nil      |
    | edl14_eng_training_0002 nil      |
    | edl14_eng_training_0006 e0731517 |
    ------------------------------------

- PROPPR_TEST: 'qid eid'. ProPPR results for the test data (if any) with
  query column and eid column, e.g.,

    ------------------------------------
    | edl14_eng_training_0005 nil      |
    | edl14_eng_training_0032 e0330409 |
    | edl14_eng_training_0044 e0805223 |
    ------------------------------------

- SCORE_TRAIN: 'rank score eid'. ProPPR solutions (scores) for the training
  data, e.g.,

    --------------------------------------------------------------------------
    | # proved 1 answerQuery(d1020302,edl14_eng_training_0363,-1) 2107 msec  |
    | 1 0.9039872065556289 -1=c[nil]                                         |
    | 2 0.004607878107344008 -1=c[e0082408]                                  |
    | 3 0.004607878107344008 -1=c[e0491365]                                  |
    --------------------------------------------------------------------------

- SCORE_TEST: 'rank score eid'. ProPPR solutions (scores) for the test data
  (if any), e.g.,

    --------------------------------------------------------------------------
    | # proved 1 answerQuery(d1020302,edl14_eng_training_0956,-1) 3319 msec  |
    | 1 0.9029592137998453 -1=c[nil]                                         |
    | 2 0.003451444907669729 -1=c[e0027798]                                  |
    | 3 0.003451444907669729 -1=c[e0658312]                                  |
    --------------------------------------------------------------------------

- QNAME: 'queryName qid string'. File containing the qid and the query string
  as well as a "queryName" dummy column, e.g.,

    ------------------------------------------------------------
    | queryName edl14_eng_training_0001 xenophon               |
    | queryName edl14_eng_training_0002 richmond               |
    | queryName edl14_eng_training_0003 newark_teachers_union  |
    ------------------------------------------------------------

- TOKEN: 'inDocument did token'. File containing the did, the tokens in that
  document as well as an "inDocument" dummy column, e.g.,

    ----------------------------------
    | inDocument d1020302 abramovich |
    | inDocument d1020302 africa     |
    | inDocument d1020302 agents     |
    ----------------------------------

- QSENT: 'querySentence qid sid'. File containing the qid and the sid as well
  as a "querySentence" dummy column, e.g.,

    ------------------------------------------------------
    | querySentence EDL14_ENG_TRAINING_0002 d484430s1    |
    | querySentence EDL14_ENG_TRAINING_0003 d1886127s12  |
    | querySentence EDL14_ENG_TRAINING_0006 d54485s10    |
    ------------------------------------------------------

- INSENT: 'inSentence sid token'. File containing the sid, the tokens in that
  sentence as well as an "inSentence" dummy column, e.g.,

    -----------------------------------
    | inSentence d1020302s1 ballack   |
    | inSentence d1020302s1 champions |
    | inSentence d1020302s1 chelsea   |
    -----------------------------------

- TACPR: 'did wp14 entype score begin end mention tacid name type gentype'.
  File containing a TAC-aligned PR output file, e.g.,

    ------------------------------------------------------------
    | AFP_ENG_20090802.0401 Tikka (food) prepared food \       |
    |   0.252 66 71 tikka OTHER                                |
    | AFP_ENG_20090802.0401 Spice mix null \                   |
    |   0.132 72 78 masala NULL                                |
    | AFP_ENG_20090802.0401 Agence France-Presse company \     |
    |   0.113 155 158 AFP E0689220 Agence France-Presse \      |
    |   ORG ORGANIZATION                                       |
    ------------------------------------------------------------

It is possible to specify the amount of padding (e.g., the number of "X" in
"NILXXX") by changing the FORMATTING_FLAGS.

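As a point of reference, the SCORE_* files interleave "# proved ..." header
lines with ranked solution lines. The sketch below shows one way such a file
could be read in Python. It is not part of the module; the function name and
the parsing assumptions (whitespace-separated columns, the qid taken from the
answerQuery(...) header, eids wrapped in "-1=c[...]") are inferred only from
the examples above.

    import re
    from collections import defaultdict

    # Hypothetical reader for the SCORE_TRAIN / SCORE_TEST format shown above:
    # a "# proved ... answerQuery(did,qid,-1) ..." header line is followed by
    # "rank score -1=c[eid]" solution lines for that query.
    def read_proppr_scores(path):
        solutions = defaultdict(list)   # qid -> [(rank, score, eid), ...]
        qid = None
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                if line.startswith("#"):
                    # e.g. "# proved 1 answerQuery(d1020302,edl14_eng_training_0363,-1) 2107 msec"
                    m = re.search(r"answerQuery\(([^,]+),([^,]+),", line)
                    qid = m.group(2) if m else None
                elif qid is not None:
                    # e.g. "1 0.9039872065556289 -1=c[nil]"
                    rank, score, answer = line.split()
                    eid = answer[len("-1=c["):-1]   # strip "-1=c[" and "]"
                    solutions[qid].append((int(rank), float(score), eid))
        return solutions
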
Installation
------------

Run "make" inside the main directory. This will set up a virtual environment
with the necessary python dependencies and then run the different clustering
scripts. If you only want to install the virtual environment without running
the clustering scripts, use "make venv" instead.

Running the code
----------------

The makefile provides several targets corresponding to the different steps of
the clustering process:

- "make raw" generates all the input needed for the different clustering
  versions.

- "make baseline" performs baseline clustering. At the moment, there are the
  following baseline versions:

  -- baseline0: String only. All queries with the same query string are
     assigned the same nid.

  -- baseline1: String and did. All queries with the same query string and
     the same did are assigned the same nid.

  -- baseline2: String and document distance. Queries that have the same
     query string are assigned to different clusters based on the distance
     between the documents they appear in. This distance is calculated based
     on the tokens that appear in the documents. Agglomerative clustering is
     used (an illustrative sketch follows the baseline list below). The
     clustering parameters (pairwise distance measure between documents,
     distance measure between clusters, clustering method and threshold for
     flattening) can be specified in the makefile. Detailed information
     regarding the clustering algorithm and its parameters can be found in
     the scipy documentation.

  -- baseline3: String and document distance. Same as baseline2, but
     exploratory clustering is used instead of agglomerative clustering.
     Detailed information regarding the clustering algorithm and its
     parameters can be found in the ExploreEM documentation.

  -- baseline4: String and sentence distance. Queries that have the same
     query string are assigned to different clusters based on the distance
     between the sentences they appear in. This distance is calculated based
     on the tokens that appear in the sentences. Agglomerative clustering is
     used. The clustering parameters (pairwise distance measure between
     sentences, distance measure between clusters, clustering method and
     threshold for flattening) can be specified in the makefile. Detailed
     information regarding the clustering algorithm and its parameters can be
     found in the scipy documentation.

  -- baseline5: String and sentence distance. Same as baseline4, but
     exploratory clustering is used instead of agglomerative clustering.
     Detailed information regarding the clustering algorithm and its
     parameters can be found in the ExploreEM documentation.

  -- baseline6: String and sentence distance with string and document
     distance as a fallback. In a first step, queries for which sentence
     information is available are clustered using baseline4. In a second
     step, the remaining queries are clustered using baseline2.

  -- baseline7: String and sentence distance with string and document
     distance as a fallback. In a first step, queries for which sentence
     information is available are clustered using baseline5. In a second
     step, the remaining queries are clustered using baseline3.

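  As an illustration of the agglomerative baselines, the sketch below clusters
  a few toy documents by token overlap with scipy, in the spirit of baseline2.
  It is not the module's code: the Jaccard distance, the "average" linkage
  method, and the 0.5 flattening threshold are stand-ins for the pairwise
  distance measure, cluster distance measure, clustering method, and threshold
  that are actually configured in the Makefile.

      from itertools import combinations
      from scipy.cluster.hierarchy import linkage, fcluster

      # Toy input: documents (dids) mapped to their token sets, as in the TOKEN file.
      doc_tokens = {
          "d1": {"chelsea", "ballack", "champions"},
          "d2": {"chelsea", "league", "champions"},
          "d3": {"africa", "agents", "abramovich"},
      }

      def jaccard_distance(a, b):
          # Illustrative pairwise document distance; the real measure is a Makefile parameter.
          return 1.0 - len(a & b) / float(len(a | b))

      dids = sorted(doc_tokens)
      # Condensed pairwise distance matrix, in the order expected by scipy's linkage().
      condensed = [jaccard_distance(doc_tokens[a], doc_tokens[b])
                   for a, b in combinations(dids, 2)]

      # Agglomerative clustering: build the dendrogram, then flatten it at a threshold.
      Z = linkage(condensed, method="average")            # cluster distance measure
      labels = fcluster(Z, t=0.5, criterion="distance")   # flattening threshold

      # Queries sharing a query string would then inherit the cluster id of their document.
      print(dict(zip(dids, labels)))
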
- "make explore" performs exploreEM clustering. At the moment, there are the
  following exploreEM versions:

  -- unsupervised0: String, did, document tokens, and score. For each query,
     a feature vector with the query string, did, the tokens of the document
     and the ProPPR scores is generated. This is a global context only
     unsupervised version of exploreEM.

  -- unsupervised1: String, sid, sentence tokens, and score. For each query,
     a feature vector with the query string, sid, the tokens of the sentence
     and the ProPPR scores is generated. This is a local context only
     unsupervised version of exploreEM.

  -- unsupervised2: String, did, document tokens, sid, sentence tokens, and
     score. For each query, a feature vector with the query string, did, the
     tokens of the document, sid, the tokens of the sentence and the ProPPR
     scores is generated. This is a combined local and global context
     unsupervised version of exploreEM.

  -- semi_supervised0: String, did, document tokens, and score. For each
     query, a feature vector with the query string, did, the tokens of the
     document and the ProPPR scores is generated. Additionally, PageReactor
     output is used to generate training data (seeds) for a number of
     queries. This is a global context only semi-supervised version of
     exploreEM.

  -- semi_supervised1: String, sid, sentence tokens, and score. For each
     query, a feature vector with the query string, sid, the tokens of the
     sentence and the ProPPR scores is generated. Additionally, PageReactor
     output is used to generate training data (seeds) for a number of
     queries. This is a local context only semi-supervised version of
     exploreEM.

  -- semi_supervised2: String, did, document tokens, sid, sentence tokens,
     and score. For each query, a feature vector with the query string, did,
     the tokens of the document, sid, the tokens of the sentence and the
     ProPPR scores is generated. Additionally, PageReactor output is used to
     generate training data (seeds) for a number of queries. This is a
     combined global and local context semi-supervised version of exploreEM.

  Detailed information regarding the clustering algorithm and its parameters
  can be found in the ExploreEM documentation.

- "make pagereactor" performs clustering based on the PageReactor output. At
  the moment, there is only one version:

  -- pagereactor0: String only. The assignment is performed as a two step
     process. In a first step, PageReactor entities whose tacid is NULL and
     whose generic type is neither NULL nor OTHER are grouped by their wp14
     name and assigned a nid. In a second step, those PageReactor entities
     that have either an eid or a nid are matched with their qid. The
     matching is performed based on string identity and order of appearance,
     e.g., if there are three tac queries with the string "abraham_lincoln"
     and five PageReactor queries with the string "abraham_lincoln", then the
     first three of those PageReactor queries will be matched with the qids
     of the three tac queries. A sketch of this kind of nid assignment
     follows the target list below.

- "make" or "make all" is shorthand for calling "make raw", "make baseline",
  "make explore", and "make pagereactor".

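As mentioned under pagereactor0, the sketch below illustrates its first step:
grouping unlinked entities by their wp14 name and handing out padded nids
(compare the FORMATTING_FLAGS note in the Configuration section). The record
layout, field names, and padding width are hypothetical stand-ins, not the
module's actual implementation.

    # Assign one padded nid (e.g. "NIL001") per distinct wp14 name, for
    # entities that have no tacid but do have a usable generic type.
    records = [
        {"wp14": "Tikka (food)", "tacid": "NULL", "gentype": "OTHER"},
        {"wp14": "Spice mix", "tacid": "NULL", "gentype": "NULL"},
        {"wp14": "Abraham Lincoln", "tacid": "NULL", "gentype": "PERSON"},
        {"wp14": "Abraham Lincoln", "tacid": "NULL", "gentype": "PERSON"},
    ]

    PADDING = 3  # number of digits after "NIL"; controlled by FORMATTING_FLAGS

    nid_by_name = {}
    for rec in records:
        if rec["tacid"] != "NULL" or rec["gentype"] in ("NULL", "OTHER"):
            continue  # only unlinked entities with a real generic type get a nid
        name = rec["wp14"]
        if name not in nid_by_name:
            nid_by_name[name] = "NIL" + str(len(nid_by_name) + 1).zfill(PADDING)
        rec["nid"] = nid_by_name[name]

    print(nid_by_name)   # {'Abraham Lincoln': 'NIL001'}
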
Cleaning up
-----------

- "make clean" removes all the input generated by "make raw" (i.e., the data
  directory) and the output generated by "make baseline", "make explore", and
  "make pagereactor" (i.e., the output directory).

- "make cleandist" removes the same targets as "make clean" and, in addition
  to that, also removes the virtual environment.

Note that in general "make clean" should suffice.