
SQL Based + Full 2015 Data Download & Processing

@gumdropsteve gumdropsteve released this 03 Feb 11:48
· 7 commits to master since this release
66756c2

Turbo-telegram v0.0.3-beta

This release brings 2 major updates for the price of 1!

Main PRs: #14, #17

BlazingSQL based NYC Dashboard

Rework of taxi_dashboard.ipynb to utilize SQL queries when producing all DataFrames.

  • BlazingSQL table of query results that's then focused accordingly
  • Apply cuDF's .to_pandas() for HoloViews plots

This eliminates post-query filtering of results, freeing up GPU memory & enabling use of much larger datasets.
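The idea of letting the query do the filtering can be sketched on CPU with Python's built-in sqlite3, used here only as a stand-in for BlazingSQL so the snippet runs without a GPU. The table name, column names, and sample rows are all made up for illustration; in the notebook the query result would be a cuDF DataFrame that is then converted with .to_pandas() for HoloViews.

```python
import sqlite3

# In-memory stand-in table; names and values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE taxi ("
    "passenger_count INTEGER, fare_amount REAL, "
    "dropoff_x REAL, dropoff_y REAL)"
)
conn.executemany(
    "INSERT INTO taxi VALUES (?, ?, ?, ?)",
    [
        (1, 12.5, -8235000.0, 4975000.0),
        (2, 7.0, -8236000.0, 4976000.0),
        (6, 30.0, -8237000.0, 4977000.0),
        (1, 0.0, -8238000.0, 4978000.0),
    ],
)


def dropoffs_for(conn, max_passengers):
    """Return only the columns and rows a plot needs, filtered in SQL."""
    query = """
        SELECT dropoff_x, dropoff_y
        FROM taxi
        WHERE fare_amount > 0 AND passenger_count <= ?
    """
    return conn.execute(query, (max_passengers,)).fetchall()


# No filtering happens after the query; the result is already plot-ready.
points = dropoffs_for(conn, 2)
```

Because the WHERE clause and column list live in the query, the full table never has to be copied and trimmed afterwards, which is where the memory saving comes from.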

2015 Taxi Data Download & Processing

Users can now download & pre-process all 12 months of 2015 NYC yellow cab data. Total download size is ~20.07 GB before processing and ~18.94 GB (CSV) [0] after processing.
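download_data.ipynb handles one month at a time so a single GPU is enough. A minimal CPU-only sketch of that loop shape follows. The file naming pattern, the fare_amount column, the process_month rule, and the fetch_month callable are all assumptions made for illustration, not the notebook's actual code.

```python
import csv
import tempfile
from pathlib import Path

FIELDS = ["fare_amount"]  # hypothetical single column, for illustration


def month_filenames(year=2015):
    # Assumed naming pattern; the real file names may differ.
    return [f"yellow_tripdata_{year}-{m:02d}.csv" for m in range(1, 13)]


def process_month(rows):
    # Placeholder pre-processing: keep rows with a positive fare.
    return [r for r in rows if float(r["fare_amount"]) > 0]


def run(fetch_month, out_dir):
    """Fetch, process, and write each month before starting the next."""
    out_dir = Path(out_dir)
    written = []
    for name in month_filenames():
        rows = fetch_month(name)  # only this month's rows are held
        cleaned = process_month(rows)
        out_path = out_dir / name
        with out_path.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=FIELDS)
            writer.writeheader()
            writer.writerows(cleaned)
        written.append(out_path)
        del rows, cleaned  # drop references before the next month
    return written


def fake_fetch(name):
    # Stand-in for the real download step.
    return [{"fare_amount": "5.0"}, {"fare_amount": "-1"}]


with tempfile.TemporaryDirectory() as tmp:
    written = run(fake_fetch, tmp)
    first_lines = written[0].read_text().splitlines()
```

Peak memory then scales with the largest single month rather than with the full year.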

NOTE: taxi_dashboard.ipynb does NOT yet point to this new data. This will be implemented soon, but issues such as optimizing big data integration for single-GPU users need to be addressed first.

New Files

  • download_data.ipynb

    • based off HoloViz taxi_preprocessing_example.py
    • downloads & processes all 12 months of 2015 NYC taxi data
    • uses BlazingSQL & NumPy [1] to configure data for use with Datashader / HoloViews
      • single node; processes 1 month at a time so that anyone w/ a compatible GPU can run it
      • tested w/ a 16GB Tesla T4 GPU on AWS; runs end-to-end in 7-8 min [2]
      • GPU capacity test via the final visualization under "Extra" (at end) queries January thru August (8 of 12 months) [3][4]
  • sql_check.py

    • based off RAPIDS sql_check.py
    • checks for installation of BlazingSQL & installs via Anaconda if not found
    • called in download_data.ipynb imports section if BSQL not found & user wants to install

Footnotes

[0] 12 files with 18 columns each; 135,216,505 rows combined
[1] elimination of NumPy expected w/ resolution of BlazingDB/blazingsql#334 (UPDATE 4 Feb: BSQL-only version merged to master branch, 95c963c)
[2] last run: 4m 27s download; 3m 25s processing (largely from writing with .to_csv()); 7m 52s total
[3] sticking to consecutive months starting with January, this was the largest table query that processed w/o crashing the kernel: a ~12.6GB CSV, which is ~25GB on GPU, run on an AWS EC2 instance with 1 16GB Tesla T4 GPU
[4] Here's how that plot looked: [image: download (1)]