SFCrawler

Abstract

The Web has become the main source of information in the digital world, spanning heterogeneous domains and growing continuously. A web search engine typically searches the Web for particular information in response to a text query, relying on a domain-unaware crawler that maintains up-to-date information. A semantic focused web crawler (SFWC) instead exploits the semantics of the Web, using ontology-based heuristics to determine which web pages belong to the domain defined by the query. An SFWC is highly dependent on the ontology it uses, which is designed by human domain experts and therefore limited to their understanding of the domain. Instead of using an ontology, as is generally the case, in this work we propose a novel SFWC based on a generic knowledge representation schema (KRS) to build a domain-specific knowledge base, avoiding the complexity and cost of constructing a more formal representation. We also propose, for the first time, a similarity measure that combines the inverse document frequency (IDF) metric, the standard deviation, and the arithmetic mean to filter web page contents against a given domain during the crawling task. We ran a set of experiments over the domains of computer science, politics, and diabetes to validate the behavior of the proposed crawler. The experiments consider two sources for selecting the seed links: Google and Wikipedia. The quantitative (harvest ratio) and qualitative (Fleiss' kappa) evaluations demonstrate the feasibility of our SFWC for crawling the Web on a specific topic.
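
The abstract does not spell out how IDF, the standard deviation, and the arithmetic mean are combined. The following is a minimal Python sketch of one plausible reading, assuming a page is accepted when the mean IDF of its terms known to the domain knowledge base lies within one standard deviation of the knowledge base's own mean IDF. All function and variable names here are illustrative and are not taken from the repository.

    import math
    import re
    from collections import Counter

    def build_idf(domain_documents):
        # IDF of every term across the domain corpus (the knowledge base).
        n_docs = len(domain_documents)
        doc_freq = Counter()
        for doc in domain_documents:
            doc_freq.update(set(re.findall(r"[a-z]+", doc.lower())))
        return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

    def domain_statistics(idf):
        # Arithmetic mean and standard deviation of the domain IDF values.
        values = list(idf.values())
        mean = sum(values) / len(values)
        std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
        return mean, std

    def page_is_relevant(page_text, idf, mean, std):
        # Keep the page if the mean IDF of its known terms falls in mean +/- std.
        terms = [t for t in re.findall(r"[a-z]+", page_text.lower()) if t in idf]
        if not terms:
            return False
        page_mean = sum(idf[t] for t in terms) / len(terms)
        return (mean - std) <= page_mean <= (mean + std)

    if __name__ == "__main__":
        # Toy knowledge base for the diabetes domain (hypothetical example data).
        idf = build_idf([
            "diabetes insulin glucose blood sugar",
            "insulin therapy regulates glucose levels",
            "type two diabetes diet and exercise",
        ])
        mean, std = domain_statistics(idf)
        print(page_is_relevant("managing glucose with insulin", idf, mean, std))

In the actual crawler this check would run on each fetched page before its outgoing links are added to the frontier; the exact combination of the three statistics may differ from this sketch.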

About

A new Semantic Focused Web Crawler
