SFCrawler

Abstract

The Web has become the main source of information in the digital world, spanning heterogeneous domains and growing continuously. A web search engine typically searches the Web for particular information in response to a text query, relying on a domain-unaware crawler that maintains up-to-date information. A semantic focused web crawler (SFWC) instead exploits the semantics of the Web, using ontology-based heuristics to determine which web pages belong to the domain defined by the query. An SFWC is highly dependent on the ontology it uses, which is designed by human domain experts and therefore limited to their understanding of the domain. Instead of using an ontology, as is generally the case, in this work we propose a novel SFWC based on a generic knowledge representation schema (KRS) to build a domain-specific knowledge base, avoiding the complexity and cost of constructing a more formal representation. We also propose, for the first time, a similarity measure that combines the inverse document frequency (IDF) metric, the standard deviation, and the arithmetic mean to filter web page contents against a given domain during the crawling task. We ran a set of experiments over the domains of computer science, politics, and diabetes to validate the behavior of the proposed crawler. The experiments consider two sources for selecting the seed links: Google and Wikipedia. The quantitative (harvest ratio) and qualitative (Fleiss' kappa) evaluations demonstrate the feasibility of our SFWC for crawling the Web on a specific topic.
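
The abstract does not spell out how IDF, the standard deviation, and the arithmetic mean are combined. The following is a minimal Python sketch of one plausible reading, assuming a page is accepted when the mean IDF of its terms known to the domain knowledge base lies within one standard deviation of the knowledge base's own mean IDF. All function and variable names here are illustrative and are not taken from the repository.

    import math
    import re
    from collections import Counter

    def build_idf(domain_documents):
        # IDF of every term across the domain corpus (the knowledge base).
        n_docs = len(domain_documents)
        doc_freq = Counter()
        for doc in domain_documents:
            doc_freq.update(set(re.findall(r"[a-z]+", doc.lower())))
        return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

    def domain_statistics(idf):
        # Arithmetic mean and standard deviation of the domain IDF values.
        values = list(idf.values())
        mean = sum(values) / len(values)
        std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
        return mean, std

    def page_is_relevant(page_text, idf, mean, std):
        # Keep the page if the mean IDF of its known terms falls in mean +/- std.
        terms = [t for t in re.findall(r"[a-z]+", page_text.lower()) if t in idf]
        if not terms:
            return False
        page_mean = sum(idf[t] for t in terms) / len(terms)
        return (mean - std) <= page_mean <= (mean + std)

    if __name__ == "__main__":
        # Toy knowledge base for the diabetes domain (hypothetical example data).
        idf = build_idf([
            "diabetes insulin glucose blood sugar",
            "insulin therapy regulates glucose levels",
            "type two diabetes diet and exercise",
        ])
        mean, std = domain_statistics(idf)
        print(page_is_relevant("managing glucose with insulin", idf, mean, std))

In the actual crawler this check would run on each fetched page before its outgoing links are added to the frontier; the exact combination of the three statistics may differ from this sketch.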

About

A new Semantic Focused Web Crawler
