The Web has become the main source of information in the digital world, spanning heterogeneous domains and growing continuously. A web search engine typically searches the Web for information matching a text query, relying on a domain-unaware crawler to keep its index up to date. A semantic focused web crawler (SFWC) instead exploits the semantics of the Web, using ontology-based heuristics to decide which web pages belong to the domain defined by the query. An SFWC is highly dependent on the ontology it uses, which is designed by human domain experts and is therefore limited to their understanding of the domain. Instead of an ontology, as is generally the case, in this work we propose a novel SFWC based on a generic knowledge representation schema (KRS) to build a domain-specific knowledge base, avoiding the complexity and cost of constructing a more formal representation. For the first time, we propose a similarity measure that combines the inverse document frequency (IDF) metric, the standard deviation, and the arithmetic mean to filter web page contents according to a given domain during the crawling task. We ran a set of experiments over the domains of computer science, politics, and diabetes to validate the behavior of the proposed crawler. The experiments consider two sources for selecting the seed links: Google and Wikipedia. The quantitative (harvest ratio) and qualitative (Fleiss' kappa) evaluations demonstrate the feasibility of our SFWC for crawling the Web on a specific topic.
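The description above suggests a threshold-style relevance test: pages are scored with IDF weights and kept when their score is consistent with the mean and standard deviation observed over the domain knowledge base. As a rough sketch only, assuming the knowledge-base document scores define the threshold and using illustrative function names (the actual measure is defined in the accompanying paper and repository code and may differ), it could look like this in Python:

```python
import math

def idf(term, documents):
    """Inverse document frequency of a term over a corpus of tokenized documents."""
    df = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / (1 + df))

def page_score(page_tokens, domain_terms, documents):
    """Average IDF weight of the domain terms that appear in the page."""
    hits = [idf(t, documents) for t in domain_terms if t in page_tokens]
    return sum(hits) / len(hits) if hits else 0.0

def is_relevant(page_tokens, domain_terms, documents, reference_scores):
    """Keep a page whose score is no more than one standard deviation below the
    mean score of the knowledge-base documents (a simple threshold heuristic)."""
    mean = sum(reference_scores) / len(reference_scores)
    std = math.sqrt(sum((s - mean) ** 2 for s in reference_scores) / len(reference_scores))
    return page_score(page_tokens, domain_terms, documents) >= mean - std
```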
About
A new Semantic Focused Web Crawler