
Web Spider

Welcome to Jinesh Parakh's submission for the UBS Avant Garde Engineering Challenge Round 2 (UBS Project X Code Challenge Round II).

Problem Statement Chosen

Web spider – implement a “spider” that starts with one web page and follows links to other websites mentioned on the previous web pages – up to a pre-specified number of levels. The spider collects the top N most frequently used words on the linked sites, along with other statistics (your choice).

Project Overview

1. Description

Web Spider (β) is an N-level, multithreaded web spider that dives N levels deep and extracts data from web pages to provide insights and statistics on the scraped data. It uses a breadth-first-search (BFS) approach to crawl the web: pages are visited level by level, fetched asynchronously via multithreading, and the words and sentences/phrases extracted from them feed the statistical insights described below.
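
As an illustration of the approach (a minimal sketch, not the project's actual code), a BFS crawl over a thread pool might look like this, assuming requests and Beautiful Soup for fetching and parsing; all function names are made up for the example:

    # Minimal BFS web-spider sketch: fetch each level's pages concurrently.
    from concurrent.futures import ThreadPoolExecutor
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    def fetch_links(url):
        """Download one page and return the absolute URLs it links to."""
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            return []
        soup = BeautifulSoup(html, "html.parser")
        return [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]

    def crawl(start_url, depth):
        """Visit pages level by level (BFS), one thread-pool batch per level."""
        visited, frontier = {start_url}, [start_url]
        with ThreadPoolExecutor(max_workers=8) as pool:
            for _ in range(depth):
                results = pool.map(fetch_links, frontier)
                next_level = {u for links in results for u in links} - visited
                visited |= next_level
                frontier = list(next_level)
        return visited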

You can find the deployed project here

2. Statistics and Features

a. NLP-based Word Cloud with a color range, showing frequencies and percentage occurrence of words on the crawled sites
b. NLP-based Bigram Cloud with a color range, showing frequencies and percentage occurrence of bigrams on the crawled sites (see the sketch after this list)
c. Bar graph showing the word count at each level of the search
d. Line chart showing the average word length at each level of the search
e. Search bar to check whether a word exists on the crawled sites and, if it does, its frequency of occurrence
f. Responsive and user-friendly interface for the best user experience
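
As a rough sketch of how the word and bigram statistics can be computed with NLTK (the function name and exact pipeline here are illustrative, not the project's API):

    # Top-N word and bigram frequencies with percentage occurrence.
    from collections import Counter

    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import wordpunct_tokenize
    from nltk.util import bigrams

    nltk.download("stopwords", quiet=True)
    STOPWORDS = set(stopwords.words("english"))

    def word_stats(text, top_n=10):
        # Lowercase, keep alphabetic tokens, and drop English stopwords.
        tokens = [t.lower() for t in wordpunct_tokenize(text)
                  if t.isalpha() and t.lower() not in STOPWORDS]
        total = len(tokens) or 1
        # (word, count, percentage occurrence) triples for the word cloud.
        top_words = [(w, c, round(100 * c / total, 2))
                     for w, c in Counter(tokens).most_common(top_n)]
        # Adjacent word pairs for the bigram cloud.
        top_bigrams = Counter(bigrams(tokens)).most_common(top_n)
        return top_words, top_bigrams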

3. Tech Stack

a. Python for the core spider logic
b. NLTK and Python for cleaning and analyzing the data
c. Flask as the backend of the web app
d. HTML, CSS, Bootstrap 4, and JavaScript for the frontend and for displaying the statistics
e. Heroku as the cloud platform for deploying the web app

4. Instructions for Running and Testing the Project Locally

a. Clone the repository (alternatively, you can download the repository as a zip file)

> git clone https://github.com/jineshparakh/WebCrawler.git

b. Change into the project directory. (You can also open the project in any IDE, preferably VS Code.)

> cd WebCrawler

c. Set up a virtual environment (prerequisite: Python 3 must be installed on your system)

> pip install virtualenv    # Install the virtualenv module
> virtualenv venv           # Create a virtual environment named venv

d. Activate the virtual environment:

On Windows (Tested on Windows 10)

> venv\Scripts\activate    # Activate the virtual environment

On macOS and Linux (Tested on macOS Catalina)

> source venv/bin/activate   # Activate the virtual environment

e. Install the requirements

> pip install -r requirements.txt    # Install all requirements for running the project

f. Once all the requirements are installed, run the Flask app:

> python app.py    # Start the server

Once the server starts, go to http://127.0.0.1:5000/ to explore the analyses based on crawling the URLs.
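
app.py itself is not reproduced in this README; as a rough sketch of its shape only (the route body below is a placeholder, not the project's actual code), a minimal Flask entry point serving on that address looks like:

    from flask import Flask

    app = Flask(__name__)

    @app.route("/")
    def index():
        # Placeholder: the real app renders the landing page with the crawl form.
        return "Landing page"

    if __name__ == "__main__":
        app.run()  # Serves on http://127.0.0.1:5000/ by default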

g. Test the app with the unit test cases

> nose2 testcases    # Run the unit test cases for the Flask app
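
For reference, a minimal sketch of a test module that nose2 would discover, assuming app.py exposes a Flask instance named app (module and test names here are illustrative):

    import unittest

    from app import app

    class LandingPageTest(unittest.TestCase):
        def setUp(self):
            # Flask's built-in test client avoids running a real server.
            self.client = app.test_client()

        def test_landing_page_loads(self):
            # The default landing page should respond with HTTP 200.
            response = self.client.get("/")
            self.assertEqual(response.status_code, 200)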

5. Screenshots

The Default Landing Page

The Waiting Screen After the Crawling Starts

The Word Cloud

The Bigram Cloud

Bar Graph denoting Count of Words Per Level

Line Chart denoting the Average Length of Words Per Level

The Search Query Feature


6. Performance with and without Multithreading


Base URL: https://www.ubs.com/in/en.html
Depth: 2

The first Measure-Command run measures the time taken for the spider to crawl the data using multithreading and async calls.
The second Measure-Command run measures the time taken without multithreading.

Time taken in the first case ≈ 63.42 s
Time taken in the second case ≈ 190.90 s

Multithreading and async calls therefore cut the crawl time by roughly 67% (a speedup of about 3×); equivalently, the single-threaded crawl took about 201% longer.
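
The percentages follow directly from the two timings:

    # Speedup arithmetic from the two Measure-Command timings above.
    threaded, single = 63.42, 190.90              # seconds
    speedup = single / threaded                   # ≈ 3.01x
    saved_pct = (1 - threaded / single) * 100     # ≈ 66.8% less time with threads
    slower_pct = (single / threaded - 1) * 100    # ≈ 201% longer without threads
    print(f"{speedup:.2f}x, {saved_pct:.1f}%, {slower_pct:.1f}%")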

7. Assumptions

  1. The sites being crawled are assumed to contain English words only, since only English-language stopwords are filtered out during cleaning.
  2. In the deployed web app, the crawl depth is currently limited to 2 levels because of the request timeout that Heroku imposes on data extraction.

8. Current Shortcomings

  1. The deployed app has a 30 s response timeout (set by Heroku), which is not enough time to crawl some websites at depth 2 or more.
  2. The spider has difficulty capturing DOM elements generated by asynchronous JavaScript calls (AJAX, etc.), since it relies on Beautiful Soup rather than a browser engine.
  3. The spider has difficulty bypassing CAPTCHAs.

9. Major References

a. https://towardsdatascience.com/in-10-minutes-web-scraping-with-beautiful-soup-and-selenium-for-data-professionals-8de169d36319
b. https://www.patricksoftwareblog.com/unit-testing-a-flask-application/
c. https://towardsdatascience.com/website-data-cleaning-in-python-for-nlp-dda282a7a871
d. https://docs.anychart.com/Quick_Start/Quick_Start
e. https://www.datacamp.com/community/tutorials/making-web-crawlers-scrapy-python
f. https://getbootstrap.com/docs/4.0/
g. https://flask.palletsprojects.com/en/1.1.x/
h. https://stackoverflow.com/
i. https://docs.python.org/3/
j. https://www.youtube.com/watch?v=L2CxFhkZrss
