The task is to use PySpark to solve a big data problem. This project uses Databricks Community Edition, with Amazon Web Services (AWS) as the cloud provider. A notebook is created in the Databricks workspace and run with PySpark on a cluster (Databricks Runtime 12.2 LTS, Scala 2.12, Spark 3.3.2).
- Compute the frequencies with which distinct skills are mentioned in job descriptions (JDs), and present the top 10 skills with their frequencies across the entire dataset; then check how the frequency distribution changes when all skills are lowercased
- Find the 5 most frequent numbers of skills per JD across the dataset
- Join the skills from the JDs with the O*NET dataset to gain more insight
- Find the 10 most frequent “Commodity Title” values across all the job descriptions