Turbo-telegram v0.0.3-beta
This release brings 2 major updates for the price of 1!
BlazingSQL based NYC Dashboard
Rework of taxi_dashboard.ipynb to utilize SQL queries when producing all DataFrames.
- BlazingSQL table of query results that's then focused accordingly
- Apply cuDF's .to_pandas() for HoloViews plots
This eliminates post-query filtering of results, freeing up GPU memory & enabling use of much larger datasets.
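As a rough sketch of the idea (not the notebook's actual code), the filters can be written into the WHERE clause so that only the needed rows come back from the query. The sketch assumes `bc` is a `BlazingContext` with a table named `taxi` already registered through `create_table`. The table name, the column names and both helper functions are hypothetical.

```python
def build_trip_query(table, columns, filters):
    """Return a SELECT statement with every filter pushed into WHERE."""
    query = f"SELECT {', '.join(columns)} FROM {table}"
    if filters:
        query += " WHERE " + " AND ".join(filters)
    return query


def query_to_pandas(bc, query):
    """Run the query on the GPU, then copy the result to host memory.

    `bc` is assumed to be a blazingsql.BlazingContext; its sql() call
    returns a cuDF DataFrame, which .to_pandas() converts for plotting.
    """
    gdf = bc.sql(query)      # only rows matching the WHERE clause are materialized
    return gdf.to_pandas()   # pandas DataFrame handed to HoloViews


# Hypothetical table and column names, for illustration only.
example_query = build_trip_query(
    "taxi",
    ["dropoff_x", "dropoff_y", "passenger_count"],
    ["passenger_count > 0", "passenger_count < 7"],
)
print(example_query)
```

Keeping the filter in the SQL text means no second, unfiltered DataFrame has to sit in GPU memory while the plot data is prepared.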
2015 Taxi Data Download & Processing
Users can now download & pre-process all 12 months of 2015 NYC yellow cab data. Total download size is ~20.07 GB before processing and ~18.94 GB (CSV) after processing [0].
NOTE: taxi_dashboard.ipynb does NOT yet point to this new data. This will be implemented soon, but issues such as optimizing big data integration for single-GPU users need to be addressed first.
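The month-at-a-time approach listed under New Files can be sketched roughly as below. This is a minimal illustration, not the notebook's code: the file name pattern and the `download` / `process` callables are assumptions, and the stand-in lambdas only simulate the real work.

```python
def month_file_names(year=2015, pattern="yellow_tripdata_{year}-{month:02d}.csv"):
    """Build one file name per month; the naming pattern is an assumption."""
    return [pattern.format(year=year, month=m) for m in range(1, 13)]


def process_one_at_a_time(file_names, download, process):
    """Download and process each month in turn.

    Only one month's raw data is held at once, so peak memory is bounded
    by the largest single month instead of the whole year.
    """
    outputs = []
    for name in file_names:
        raw = download(name)          # hypothetical: fetch one month's CSV
        outputs.append(process(raw))  # hypothetical: returns the processed file's path
        del raw                       # release this month before starting the next
    return outputs


names = month_file_names()
# Stand-in callables so the sketch runs without network or GPU access.
results = process_one_at_a_time(
    names,
    download=lambda name: name,
    process=lambda raw: "processed_" + raw,
)
print(len(results), results[0], results[-1])
```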
New Files
-
- based off HoloViz taxi_preprocessing_example.py
- downloads & processes all 12 months of 2015 NYC taxi data
- uses BlazingSQL & NumPy [1] to configure data for use with Datashader / HoloViews
- single node / processes 1 month at a time to ensure anyone w/ compatible GPU can run
- tested w/ 16GB Tesla T4 GPU on AWS, runs end-to-end in 7-8 min [2]
- GPU capacity test via final visualization under "Extra" (at end) calls thru August (8/12 months) [3][4]
-
- based off RAPIDS sql_check.py
- checks for installation of BlazingSQL & installs via Anaconda if not found
- called in download_data.ipynb imports section if BSQL not found & user wants to install
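The check-then-install flow described above might look something like the following sketch. It is not the script's actual contents: `ensure_installed`, `install` and `confirm` are hypothetical names, and the real Anaconda install command is deliberately left to the caller instead of being guessed here.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def ensure_installed(module_name, install, confirm):
    """Install a missing module only when the user agrees.

    `install` and `confirm` are caller-supplied callables (assumptions for
    this sketch): `confirm` asks the user, `install` runs the actual
    package manager command.
    """
    if is_installed(module_name):
        return "already installed"
    if not confirm():
        return "skipped"
    install()
    return "installed"


# Stand-in callables; nothing is really installed by this demo.
status = ensure_installed(
    "json",  # standard-library module, so this reports "already installed"
    install=lambda: None,
    confirm=lambda: False,
)
print(status)
```

Passing the install step in as a callable keeps the check testable without touching the environment.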
Footnotes
[0] 12 files, 18 columns each, 135,216,505 rows total across all files
[1] elimination of NumPy expected w/ resolution of BlazingDB/blazingsql#334 (UPDATE 4 Feb: merged in BSQL, but only on the master branch so far, 95c963c)
[2] last run: 4m 27s download; 3m 25s processing (largely from writing .to_csv()); 7m 52s total
[3] sticking to consecutive months starting with January, this was the largest table query that processed w/o crashing the kernel: ~12.6 GB of CSV, which is ~25 GB on GPU, running on an AWS EC2 instance with a single 16 GB Tesla T4 GPU
[4] plot from the "Extra" capacity test (image not included here)