# This file contains all the configuration for the Elasticsearch instance.
# These settings form the basis of the search engine.
1) Create the index using:
# The request body is passed via a quoted heredoc because the stopword list
# contains apostrophes, which would end a single-quoted shell string early.
curl -s -XPUT 'http://localhost:9200/url-test/' -d @- <<'EOF'
{
  "mappings": {
    "document": {
      "properties": {
        "content": {
          "type": "string",
          "analyzer": "lowercase_with_stopwords"
        }
      }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    },
    "analysis": {
      "filter": {
        "stopwords_filter": {
          "type": "stop",
          "stopwords": ["http", "https", "ftp", "www","a","about","above","after","again","against","all","am","an","and","any","are","aren't","as","at","be","because","been","before","being","below","between","both","but","by","can't","cannot","could","couldn't","did","didn't","do","does","doesn't","doing","don't","down","during","each","few","for","from","further","had","hadn't","has","hasn't","have","haven't","having","he","he'd","he'll","he's","her","here","here's","hers","herself","him","himself","his","how","how's","i","i'd","i'll","i'm","i've","if","in","into","is","isn't","it","it's","its","itself","let's","me","more","most","mustn't","my","myself","no","nor","not","of","off","on","once","only","or","other","ought","our","ours","ourselves","out","over","own","same","shan't","she","she'd","she'll","she's","should","shouldn't","so","some","such","than","that","that's","the","their","theirs","them","themselves","then","there","there's","these","they","they'd","they'll","they're","they've","this","those","through","to","too","under","until","up","very","was","wasn't","we","we'd","we'll","we're","we've","were","weren't","what","what's","when","when's","where","where's","which","while","who"]
        }
      },
      "analyzer": {
        "lowercase_with_stopwords": {
          "type": "custom",
          "tokenizer": "lowercase",
          "filter": ["stopwords_filter", "porter_stem"]
        }
      }
    }
  }
}
EOF
(The above code handles both stopword removal and stemming of the content.)
2) url-test can be replaced with the name of the index that you would like to use.
The list of stopwords can be improved; the rest of the configuration is straightforward.
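To get a feel for what the lowercase_with_stopwords analyzer does to a piece of text, the Python sketch below imitates its first two stages: the lowercase tokenizer (split at every non-letter, lowercase each token) and the stop filter. It is only an approximation and not Elasticsearch's own code: porter_stem is left out, only a short excerpt of the stopword list is used, and the name analyze and the ASCII-only letter matching are assumptions made for illustration.

```python
import re

# Excerpt of the stopword list from the index settings (assumption: shortened
# here for readability; the configured list is much longer).
STOPWORDS = {"http", "https", "ftp", "www", "a", "the", "is", "to", "of", "and"}


def analyze(text, stopwords=STOPWORDS):
    """Approximate the analyzer: lowercase tokenizer, then stop filter.

    Stemming (porter_stem) is intentionally not reproduced here.
    """
    # The lowercase tokenizer breaks text at every non-letter character
    # and lowercases the resulting tokens (ASCII letters only in this sketch).
    tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", text)]
    # The stop filter drops every token found in the stopword list.
    return [t for t in tokens if t not in stopwords]


if __name__ == "__main__":
    print(analyze("Small content with URL https://tusharagey.github.io/Test."))
```

Because the tokenizer splits at apostrophes, an entry such as "don't" reaches the stop filter as the two tokens don and t, so stopword entries containing an apostrophe are unlikely to ever match. This may be one place where the list can be improved.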
3) Use the following code to insert data into the index 'url-test':
Syntax:
PUT /<INDEX_NAME>/<INDEX_TYPE>/<ID>
PUT /url-test/document/1?pretty=true
{
  "title": "myWebsite",
  "link": "https://tusharagey.github.io/Test",
  "content": "Small content with URL https://tusharagey.github.io/Test."
}
4) The ID must be different for each document we index; indexing with an existing ID overwrites that document.
A document here is an HTML page, a PDF, or any other content that can be indexed.
5) The fields:
title   : Title of the search item
link    : Link to the page
content : Data to be indexed
6) After this basic configuration, index.php (our crawler) can be invoked to feed the titles, URLs and content to this index.
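The indexing call from step 3 can be sketched in Python as follows. This is an illustration only; the crawler itself is index.php, and nothing here is taken from it. The function name build_index_request and the constant ES_URL are hypothetical, and the host and port are assumed to match the curl example. The function only builds the request, so it can be inspected without a running Elasticsearch instance.

```python
import json
import urllib.request

# Assumption: Elasticsearch listens on the same host and port as in the curl example.
ES_URL = "http://localhost:9200"


def build_index_request(index, doc_type, doc_id, title, link, content):
    """Build (but do not send) the PUT request that indexes one document."""
    body = json.dumps({"title": title, "link": link, "content": content})
    return urllib.request.Request(
        url=f"{ES_URL}/{index}/{doc_type}/{doc_id}",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


if __name__ == "__main__":
    req = build_index_request(
        "url-test",
        "document",
        1,
        "myWebsite",
        "https://tusharagey.github.io/Test",
        "Small content with URL https://tusharagey.github.io/Test.",
    )
    print(req.get_method(), req.full_url)
    # Sending the request needs a running Elasticsearch instance:
    # with urllib.request.urlopen(req) as resp:
    #     print(resp.read().decode("utf-8"))
```

A crawler loop would call this once per fetched page with a fresh ID each time, since reusing an ID replaces the stored document.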
Next Tasks:
1) Using the sample code from line 76, feed each item fetched by the crawler to Elasticsearch.
2) Using similar code, write the search API.
Search works using this:
GET /url-test/_search?pretty
{
  "query": {
    "query_string": {
      "query": "some query"
    }
  }
}
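A search API wrapper could start from something like the Python sketch below, which builds the same query_string request for an arbitrary user query. The names build_search_request and ES_URL are hypothetical, and the host and port are assumed to match the earlier curl example. POST is used instead of GET because Elasticsearch accepts both for _search and not every HTTP client sends a body with GET.

```python
import json
import urllib.request

# Assumption: Elasticsearch listens on the same host and port as in the curl example.
ES_URL = "http://localhost:9200"


def build_search_request(index, query_text):
    """Build (but do not send) a query_string search request for the index."""
    body = json.dumps({"query": {"query_string": {"query": query_text}}})
    return urllib.request.Request(
        url=f"{ES_URL}/{index}/_search?pretty",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_search_request("url-test", "some query")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # Sending the request needs a running Elasticsearch instance:
    # with urllib.request.urlopen(req) as resp:
    #     print(resp.read().decode("utf-8"))
```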
3) This request returns a JSON response. The task is to display the search results from that JSON in a suitable format.
(Note: create a Drupal plugin for this search UI/API.)
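As a rough idea of what displaying the results involves, the Python sketch below pulls the title and link out of a search response and renders a numbered list. Matching documents sit under hits.hits in the response, each with its stored fields under _source. SAMPLE_RESPONSE is a hand-written, trimmed illustration (not captured output), and the function names extract_results and format_results are hypothetical; the Drupal plugin would do the equivalent in PHP.

```python
# Hand-written, trimmed example of a search response (illustrative values only).
SAMPLE_RESPONSE = {
    "hits": {
        "total": 1,
        "hits": [
            {
                "_index": "url-test",
                "_type": "document",
                "_id": "1",
                "_score": 0.5,
                "_source": {
                    "title": "myWebsite",
                    "link": "https://tusharagey.github.io/Test",
                    "content": "Small content with URL https://tusharagey.github.io/Test.",
                },
            }
        ],
    }
}


def extract_results(response):
    """Return a list of {title, link, score} dicts from a search response."""
    results = []
    for hit in response.get("hits", {}).get("hits", []):
        source = hit.get("_source", {})
        results.append(
            {
                "title": source.get("title", ""),
                "link": source.get("link", ""),
                "score": hit.get("_score"),
            }
        )
    return results


def format_results(results):
    """Render the results as a numbered plain-text list, one result per line."""
    return "\n".join(
        f"{i}. {r['title']} <{r['link']}>" for i, r in enumerate(results, start=1)
    )


if __name__ == "__main__":
    print(format_results(extract_results(SAMPLE_RESPONSE)))
```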
4) The file cron.php is how the crawling task is scheduled in Drupal.
Replacing the default cron.php in the Drupal installation directory with this cron.php will work.
5) To enable suggestions for queries, follow these steps:
i. Install Pspell using the following command:
sudo apt-get install php7.0-pspell
ii. Create a file called dictionary.txt in your home directory. This file contains the keywords related to "coep" that a normal English dictionary doesn't have, so that these words become available in our spell checker application.
iii. Install the text file in aspell:
sudo aspell --lang=en create master /usr/lib/aspell/dictionary.rws < dictionary.txt
iv. Now go to the aspell directory:
cd /usr/lib/aspell/
v. Then edit en.multi (sudo vi en.multi) and add this line: add dictionary.rws
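Once the dictionary is installed, suggestions can be checked from outside PHP as well. The Python sketch below is not the Pspell integration described above; it swaps in aspell's pipe mode (aspell -a) as a quick way to see which suggestions the installed dictionaries produce. In that mode a misspelled word is reported as a line starting with "&" followed by the word, a count, an offset and a comma-separated suggestion list, and a word with no suggestions as a line starting with "#". The function names and the misspelling "colege" are hypothetical.

```python
import shutil
import subprocess


def parse_aspell_line(line):
    """Parse one line of `aspell -a` output.

    Returns (word, suggestions) for a misspelled word, or None for any other
    line (version banner, "*" for a correct word, blank separator lines).
    """
    if line.startswith("& "):
        # Format: & <word> <count> <offset>: <suggestion>, <suggestion>, ...
        head, _, tail = line.partition(": ")
        word = head.split()[1]
        return word, [s.strip() for s in tail.split(",") if s.strip()]
    if line.startswith("# "):
        # Format: # <word> <offset>  (misspelled, no suggestions available)
        return line.split()[1], []
    return None


def suggest(words, lang="en"):
    """Run aspell in pipe mode and map each misspelled word to its suggestions."""
    proc = subprocess.run(
        ["aspell", "-a", f"--lang={lang}"],
        input="\n".join(words) + "\n",
        capture_output=True,
        text=True,
    )
    parsed = (parse_aspell_line(line) for line in proc.stdout.splitlines())
    return dict(item for item in parsed if item is not None)


if __name__ == "__main__":
    # Only call aspell when it is actually installed on this machine.
    if shutil.which("aspell"):
        print(suggest(["colege"]))
```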
*** The crawler and indexer will run on the same machine. The search API will run on Drupal (just calling the Elasticsearch URL and getting the results back). ***