```
├── ana_code
│   ├── combined_data_analysis.hql
│   └── screenshots
│       └── **/*.png
├── data
│   └── raw
│       └── **/*.csv
├── data_ingest
│   ├── data_ingestion.sh
│   └── data_ingest_screenshot.png
├── etl_code
│   ├── nicole
│   │   ├── cleaned_children_blood_lead.scala
│   │   └── **/*.png
│   ├── seoeun
│   │   ├── cleaned_housing_con.scala
│   │   └── **/*.png
│   └── nicole&seoeun
│       ├── cleaned_economic.scala
│       └── **/*.png
├── profiling_code
│   ├── nicole
│   │   ├── before_clean
│   │   │   ├── children_blood_lead_profile_before_clean.scala
│   │   │   └── **/*.png
│   │   └── after_clean
│   │       ├── children_blood_lead_profile_clean.hql
│   │       └── **/*.png
│   ├── seoeun
│   │   ├── before_clean
│   │   │   ├── housing_con_data_profile_before_clean.scala
│   │   │   └── **/*.png
│   │   └── after_clean
│   │       ├── housing_con_profile_clean.hql
│   │       └── **/*.png
│   └── nicole&seoeun
│       ├── before_clean
│       │   ├── economics_data_profile_before_clean.scala
│       │   └── **/*.png
│       └── after_clean
│           ├── econ_profile_clean.hql
│           └── **/*.png
└── README.md
```
- Start Dataproc (a sketch of one way to create a cluster with `gcloud` follows).
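  The repository does not pin a cluster configuration, so this is only a minimal sketch of creating a Dataproc cluster; the cluster name, region, and image version below are placeholders, not settings taken from this project:

  ```
  # Placeholder values -- substitute your own cluster name, region, and image.
  gcloud dataproc clusters create my-cluster \
      --region=us-central1 \
      --image-version=2.0-debian10
  ```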
- Upload the original versions of the CSV datasets, found in `data/raw`, to Dataproc (one way to do this is sketched below).
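  A sketch of copying the files to the cluster's master node with `gcloud compute scp`; the cluster and zone names are placeholders, and the same pattern works for the script and query files uploaded in the later steps:

  ```
  # Dataproc names the master node <cluster>-m; adjust cluster and zone to yours.
  gcloud compute scp data/raw/*.csv my-cluster-m:~ --zone=us-central1-a
  ```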
- Upload the script file `data_ingestion.sh`, found in `data_ingest`, to Dataproc.
- Make the script executable and run it with these commands:

  ```
  chmod +x data_ingestion.sh
  ./data_ingestion.sh
  ```

  This sets up a folder called `originalDataSets` on HDFS containing all of your CSV files.
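  For reference, an ingestion script of roughly this shape would produce the layout described above; this is a sketch only, and the shipped `data_ingestion.sh` is authoritative:

  ```
  #!/bin/bash
  # Sketch only -- see data_ingest/data_ingestion.sh for the real script.
  # Create the target folder on HDFS and copy every local CSV into it.
  hdfs dfs -mkdir -p originalDataSets
  hdfs dfs -put *.csv originalDataSets/
  ```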
- Upload `children_blood_lead_profile_before_clean.scala`, located in `profiling_code/nicole/before_clean`, to Dataproc.
- Run that Scala file using this command:

  ```
  spark-shell --deploy-mode client -i children_blood_lead_profile_before_clean.scala
  ```
- After running this command, you will see the results of this profile in the Spark Scala shell.
- Repeat the previous three steps for the Scala files `housing_con_data_profile_before_clean.scala` and `economics_data_profile_before_clean.scala`, located in `profiling_code/seoeun/before_clean` and `profiling_code/nicole&seoeun/before_clean` respectively.
- Upload `cleaned_children_blood_lead.scala`, located in `etl_code/nicole`, to Dataproc.
- Run that Scala file using this command:

  ```
  spark-shell --deploy-mode client -i cleaned_children_blood_lead.scala
  ```
- After running this command, you will find the results at `finalCode/lead` on your HDFS (a quick check is sketched below). Note that the results for the other Scala files will be located in `finalCode/housing` and `finalCode/econ`.
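  A hypothetical way to confirm the cleaned output landed where expected, assuming Spark's usual `part-*` output file naming:

  ```
  # List the output directory, then peek at the first lines of the data files.
  hdfs dfs -ls finalCode/lead
  hdfs dfs -cat finalCode/lead/part-* | head
  ```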
- Repeat the previous three steps for the Scala files `cleaned_housing_con.scala` and `cleaned_economic.scala`, located in `etl_code/seoeun` and `etl_code/nicole&seoeun` respectively.
- Upload `children_blood_lead_profile_clean.hql`, located in `profiling_code/nicole/after_clean`, to Dataproc.
- Run that HiveQL file using this command:

  ```
  beeline -u "jdbc:hive2://localhost:10000" -f children_blood_lead_profile_clean.hql
  ```

- After running this command, you will see the results of this profile in the Hive shell.
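  Optionally, you can first check what Hive can see; this assumes, which this README does not guarantee, that the earlier ETL steps registered the cleaned data as Hive tables:

  ```
  # List the tables visible to HiveServer2 before running the profile.
  beeline -u "jdbc:hive2://localhost:10000" -e "SHOW TABLES;"
  ```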
- Repeat the previous three steps for the HiveQL files `housing_con_profile_clean.hql` and `econ_profile_clean.hql`, located in `profiling_code/seoeun/after_clean` and `profiling_code/nicole&seoeun/after_clean` respectively.
- Upload `combined_data_analysis.hql`, located in `ana_code`, to Dataproc.
- Run that HiveQL file using this command:

  ```
  beeline -u "jdbc:hive2://localhost:10000" -f combined_data_analysis.hql
  ```

- After running this command, you will see the results of the analysis in the Hive shell.
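  If you would rather keep a copy of the output than read it off the shell, redirecting beeline's output works; `analysis_results.txt` is just a name of our choosing:

  ```
  # Save the analysis results to a local file on the master node.
  beeline -u "jdbc:hive2://localhost:10000" -f combined_data_analysis.hql > analysis_results.txt
  ```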
We put our original input data into the `originalDataSets` directory and the cleaned data into the `finalCode` directory.

