Release SQL Based + Full 2015 Data Download & Processing · gumdropsteve/turbo-telegram

Turbo-telegram v0.0.3-beta

This release brings 2 major updates for the price of 1!

Main PRs: #14, #17

BlazingSQL based NYC Dashboard

Reworked taxi_dashboard.ipynb so that every DataFrame is produced by a SQL query:

- query results are held as a BlazingSQL table, which is then narrowed with further queries as needed
- cuDF's .to_pandas() is applied to results for HoloViews plots

This eliminates post-query filtering of results, freeing up GPU memory & enabling use of much larger datasets.
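
To illustrate the difference between filtering after a query and filtering inside it, here is a minimal sketch. It uses Python's built-in sqlite3 as a CPU stand-in for BlazingSQL so it runs anywhere; the table name `taxi`, its columns, and the sample rows are hypothetical and are not taken from taxi_dashboard.ipynb.

```python
import sqlite3

# Hypothetical sample rows: (passenger_count, trip_distance, fare_amount)
rows = [
    (1, 0.8, 5.5),
    (2, 3.1, 12.0),
    (1, 9.4, 31.5),
    (4, 1.2, 7.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE taxi (passenger_count INTEGER, trip_distance REAL, fare_amount REAL)"
)
conn.executemany("INSERT INTO taxi VALUES (?, ?, ?)", rows)

# Post-query filtering: materialize every row, then filter in Python.
all_rows = conn.execute("SELECT * FROM taxi").fetchall()
post_filtered = [r for r in all_rows if r[1] > 1.0 and r[2] < 20.0]

# In-query filtering: only matching rows ever leave the SQL engine.
in_query = conn.execute(
    "SELECT * FROM taxi WHERE trip_distance > 1.0 AND fare_amount < 20.0"
).fetchall()

print(len(all_rows), "rows materialized for post-query filtering")
print(len(in_query), "rows materialized for in-query filtering")
```

Both approaches select the same rows, but the second never holds the unfiltered result. In the notebook the equivalent query runs on the GPU through BlazingSQL, and only the already-narrowed result is converted with .to_pandas() for plotting.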

2015 Taxi Data Download & Processing

Users can now download & pre-process all 12 months of 2015 NYC yellow cab data. Total download size is ~20.07 GB before processing and ~18.94 GB (CSV⁰) after processing.

NOTE: taxi_dashboard.ipynb does NOT yet point to this new data. This will be implemented soon, but issues such as optimizing big data integration for single-GPU users need to be addressed first.

New Files

download_data.ipynb
- based off HoloViz taxi_preprocessing_example.py
- downloads & processes all 12 months of 2015 NYC taxi data
- uses BlazingSQL & NumPy¹ to configure data for use with Datashader / HoloViews
  - single node / processes 1 month at a time to ensure anyone w/ compatible GPU can run
  - tested w/ 16GB Tesla T4 GPU on AWS, runs end-to-end in 7-8 min²
  - GPU capacity test via final visualization under "Extra" (at end) calls thru August (8/12 months)³⁴
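
The month-at-a-time approach noted for download_data.ipynb can be sketched roughly as below. This is not the notebook's code: the file-name pattern, the `process_month` filter, and the fake loader are all assumptions made so the example runs without downloading anything or needing a GPU.

```python
import csv
import io

def month_file_names(year=2015):
    """Yield one hypothetical CSV name per month, January through December."""
    for month in range(1, 13):
        yield f"yellow_tripdata_{year}-{month:02d}.csv"

def process_month(raw_csv_text):
    """Stand-in for the real per-month work: keep rows with a positive distance."""
    reader = csv.DictReader(io.StringIO(raw_csv_text))
    return [row for row in reader if float(row["trip_distance"]) > 0]

def run(load_month):
    """Process months one at a time so only one month is held in memory."""
    row_counts = {}
    for name in month_file_names():
        raw = load_month(name)           # load a single month
        cleaned = process_month(raw)     # process it
        row_counts[name] = len(cleaned)  # real code would write a CSV here
        del raw, cleaned                 # drop references before the next month
    return row_counts

def fake_loader(name):
    """Hypothetical loader returning the same tiny CSV for every month."""
    return "trip_distance,fare_amount\n1.5,7.0\n0.0,2.5\n3.2,11.0\n"

counts = run(fake_loader)
print(len(counts), "months processed")
```

Keeping each month's data local to one loop iteration is what bounds peak memory to a single month rather than the full year.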
sql_check.py
- based off RAPIDS sql_check.py
- checks for installation of BlazingSQL & installs via Anaconda if not found
- called in download_data.ipynb imports section if BSQL not found & user wants to install
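
A check-then-install helper of the kind described for sql_check.py might look like the sketch below. It is an illustration only, not the contents of sql_check.py: the function names, the `dry_run` flag, and the conda command shown are placeholders, and the real channels and package spec are not given in these notes.

```python
import importlib.util
import subprocess

def is_installed(module_name):
    """Return True if the module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None

def ensure_installed(module_name, install_cmd, confirm, dry_run=True):
    """Check for a module; optionally run install_cmd if it is missing.

    Returns "present", "installed", "would-install", or "skipped".
    """
    if is_installed(module_name):
        return "present"
    if not confirm():
        # The user declined the install.
        return "skipped"
    if dry_run:
        print("would run:", " ".join(install_cmd))
        return "would-install"
    subprocess.run(install_cmd, check=True)
    return "installed"

# Placeholder command (an assumption, not the project's actual install line).
cmd = ["conda", "install", "-y", "blazingsql"]
status = ensure_installed("blazingsql", cmd, confirm=lambda: True)
print(status)
```

The `confirm` callback mirrors the "user wants to install" condition above, and `dry_run=True` keeps the example from changing the environment when it is run.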

Footnotes

⁰ 12 files, 18 columns (each) * 135,216,505 rows (total/combined)
¹ elimination of NumPy expected w/ resolution of BlazingDB/blazingsql#334 (UPDATE 4 Feb: BSQL only merged to master branch 95c963c)
² last run: 4m 27s download; 3m 25s processing (largely from writing .to_csv()); 7m 52s total
³ sticking to consecutive months starting with January, this was the largest table query to process w/o the kernel crashing: ~12.6GB of CSV, which is ~25GB on GPU, running on an AWS EC2 instance with one 16GB Tesla T4 GPU
⁴ Here's how that plot looked: (image not included)
