Merge pull request #1495 from BlazingDB/branch-0.19
 0.19 Release
wmalpica authored Apr 21, 2021
2 parents 44aeef8 + 5964463 commit ff4ece0
Showing 2,671 changed files with 4,492,660 additions and 4,774 deletions.
5 changes: 5 additions & 0 deletions .github/ISSUE_TEMPLATE/bug_report.md
Original file line number Diff line number Diff line change
@@ -23,6 +23,11 @@ A clear and concise description of what you expected to happen.
- Environment location: [Bare-metal, Docker, Cloud(specify cloud provider)]
- Method of BlazingSQL install: [conda, Docker, or from source]
- If method of install is [Docker], provide `docker pull` & `docker run` commands used
- **BlazingSQL Version**, which can be obtained as follows:
```
import blazingsql
print(blazingsql.__info__())
```

**Environment details**
Please run and paste the output of the `print_env.sh` script here, to gather any other relevant environment details
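
For reference, a minimal stdlib sketch of the kind of details a script like `print_env.sh` collects — this is not the actual script, just an illustration of gathering environment info from Python:

```python
import platform
import sys

# Collect a few basic environment details (illustrative stand-in for print_env.sh).
env = {
    "python": sys.version.split()[0],
    "os": platform.system(),
    "machine": platform.machine(),
}
for key, value in env.items():
    print(f"{key}: {value}")
```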
48 changes: 48 additions & 0 deletions .github/workflows/build-docs.yml
@@ -0,0 +1,48 @@
# This is a basic workflow to help you get started with Actions

name: Build docs

# Controls when the action will run.
on:
push:
branches:
- main

# A workflow run is made up of one or more jobs that can run sequentially or in parallel
jobs:
# This workflow contains a single job called "build"
docs:
# The type of runner that the job will run on
runs-on: ubuntu-latest

# Steps represent a sequence of tasks that will be executed as part of the job
steps:
# Checks-out your repository under $GITHUB_WORKSPACE, so your job can access it
- uses: actions/checkout@v2

# Runs a single command using the runner's shell
- uses: mattnotmitt/doxygen-action@v1
with:
working-directory: 'docsrc/'
doxyfile-path: 'source/Doxyfile'
- uses: ammaraskar/sphinx-action@master
with:
build-command: "make html -e"
docs-folder: "docsrc/"
- name: Commit documentation changes
run: |
git clone https://github.com/romulo-auccapuclla/blazingsql.git --branch main --single-branch main
cp -a docsrc/build/html/. docs/
cd docs
touch .nojekyll
git config --local user.email "[email protected]"
git config --local user.name "GitHub Action"
git add .
git commit -m "Update documentation" -a || true
- name: Push changes
uses: ad-m/github-push-action@master
with:
branch: main
directory: docs
force: true
github_token: ${{ secrets.GITHUB_TOKEN }}
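
The workflow's `cp -a docsrc/build/html/. docs/` step merges the Sphinx output into a directory that may already exist. The same operation in Python's stdlib looks like this (the temporary paths here are stand-ins for `docsrc/build/html` and `docs`):

```python
import pathlib
import shutil
import tempfile

# Stand-in source and destination directories.
src = pathlib.Path(tempfile.mkdtemp())
dst = pathlib.Path(tempfile.mkdtemp())
(src / "index.html").write_text("<html></html>")

# Equivalent of `cp -a src/. dst/`: copy into an existing directory,
# merging with whatever is already there (Python 3.8+).
shutil.copytree(src, dst, dirs_exist_ok=True)
print(sorted(p.name for p in dst.iterdir()))  # ['index.html']
```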
10 changes: 10 additions & 0 deletions .gitignore
100644 → 100755
@@ -6,6 +6,7 @@
.idea/

engine/cmake-build-debug/
cmake-*

algebra.log

@@ -92,3 +93,12 @@ powerpc/blazingsql.tar.gz
powerpc/developer/requirements.txt
powerpc/developer/core
powerpc/developer/blazingsql.tar.gz

# mac junk
.DS_Store
._.*

# docs build folders
docsrc/build/
docsrc/source/doxyfiles/
docsrc/source/xml
1 change: 1 addition & 0 deletions .nojekyll
@@ -0,0 +1 @@

69 changes: 67 additions & 2 deletions CHANGELOG.md
100644 → 100755
@@ -1,3 +1,65 @@
# BlazingSQL 0.19.0 (April 21, 2021)

## New Features
- #1367 OverlapAccumulator Kernel
- #1364 Implement the concurrent API (bc.sql with token, bc.status, bc.fetch)
- #1426 Window Functions without partitioning
- #1349 Add e2e test for Hive Partitioned Data
- #1396 Create tables from other RDBMS
- #1427 Support for CONCAT alias operator
- #1424 Add get physical plan with explain
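
The concurrent API above (#1364) follows a submit/poll/fetch shape. A self-contained toy sketch of that pattern — the class, its method bodies, and the `str.upper` stand-in for query execution are illustrative only, not the BlazingSQL implementation:

```python
from concurrent.futures import ThreadPoolExecutor

class MiniContext:
    """Toy token-based API echoing the bc.sql / bc.status / bc.fetch shape."""
    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=2)
        self._jobs = {}
        self._next_token = 0

    def sql(self, query):
        # Submit the "query" asynchronously and hand back a token.
        token = self._next_token
        self._next_token += 1
        self._jobs[token] = self._pool.submit(str.upper, query)
        return token

    def status(self, token):
        # Poll without blocking.
        return self._jobs[token].done()

    def fetch(self, token):
        # Block until the job finishes and return its result.
        return self._jobs[token].result()

bc = MiniContext()
token = bc.sql("select 1")
print(bc.fetch(token))  # SELECT 1
```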

## Improvements
- #1325 Refactored CacheMachine.h and CacheMachine.cpp
- #1322 Updated and enabled several E2E tests
- #1333 Fixing build due to cudf update
- #1344 Removed GPUCacheDataMetadata class
- #1376 Fixing build due to some strings refactor in cudf, undoing the replace workaround
- #1430 Updating GCP to >= version
- #1331 Added flag to enable null e2e testing
- #1418 Adding support for docker image
- #1434 Added documentation for C++ and Python in Sphinx
- #1419 Added concat cache machine timeout
- #1444 Updating GCP to >= version
- #1349 Add e2e test for Hive Partitioned Data
- #1447 Improve getting estimated output num rows
- #1473 Added Warning to Window Functions
- #1480 Improve dependencies script

## Bug Fixes
- #1335 Fixing uninitialized var in orc metadata and handling the parseMetadata exceptions properly
- #1339 Handling properly the nulls in case conditions with strings
- #1346 Delete allocated host chunks
- #1348 Capturing error messages due to exceptions properly
- #1350 Fixed bug where there are no projects in a bindable table scan
- #1359 Avoid cuda issues when free pinned memory
- #1365 Fixed build after sublibs changes on cudf
- #1369 Updated java path for powerpc build
- #1371 Fixed e2e settings
- #1372 Recompute `columns_to_hash` in DistributeAggregationKernel
- #1375 Fix empty row_group_ids for parquet
- #1380 Fixed issue with int64 literal values
- #1379 Remove ProjectRemoveRule
- #1389 Fix issue when CAST a literal
- #1387 Skip getting orc metadata for decimal type
- #1392 Fix substrings with nulls
- #1398 Fix performance regression
- #1401 Fix support for minus unary operation
- #1415 Fixed bug where num_batches was not getting set in BindableTableScan
- #1413 Fix for null tests 13 and 23 of windowFunctionTest
- #1416 Fix full join when both tables contains nulls
- #1423 Fix temporary directory for hive partition test
- #1351 Fixed 'count distinct' related issues
- #1425 Fix for new joins API
- #1400 Fix for Column aliases when exists a Join op
- #1456 Raising exceptions on Python side for RAL
- #1466 SQL providers: update README.md
- #1470 Fix pre compiler flags for sql parsers


## Deprecated Features
- #1394 Disabled support for outer joins with inequalities

# BlazingSQL 0.18.0 (February 24, 2021)

## New Features
@@ -17,6 +79,10 @@
- #1284 Initial support for Windows Function
- #1303 Add support for INITCAP
- #1313 getting and using ORC metadata
- #1347 Fixing issue when reading orc metadata from DATE dtype
- #1338 Window Function support for LEAD and LAG statements
- #1362 give useful message when file extension is not recognized
- #1361 Supporting first_value and last_value for Window Function
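
LEAD/LAG and first_value/last_value are standard SQL window functions, so their semantics can be tried in isolation with stdlib `sqlite3` (SQLite ≥ 3.25) — this illustrates the SQL behavior only, not the BlazingSQL engine:

```python
import sqlite3  # SQLite >= 3.25 ships window-function support

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (k INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])

# LAG looks one row back; FIRST_VALUE returns the first value in the frame.
rows = con.execute(
    "SELECT k, LAG(k) OVER (ORDER BY k), FIRST_VALUE(k) OVER (ORDER BY k) FROM t"
).fetchall()
print(rows)  # [(1, None, 1), (2, 1, 1), (3, 2, 1)]
```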


## Improvements
@@ -38,7 +104,7 @@
- #1314 Added unit tests to verify that OOM error handling works well
- #1320 Revamping cache logger
- #1323 Made progress bar update continuously and stay after query is done

- #1336 Improvements for the cache API

## Bug Fixes
- #1249 Fix compilation with cuda 11
@@ -52,7 +118,6 @@
- #1312 Fix progress bar for jupyterlab
- #1318 Disabled require acknowledge


# BlazingSQL 0.17.0 (December 10, 2020)

## New Features
67 changes: 67 additions & 0 deletions Dockerfile
@@ -0,0 +1,67 @@
ARG CUDA_VER="10.2"
ARG UBUNTU_VERSION="16.04"
FROM nvidia/cuda:${CUDA_VER}-runtime-ubuntu${UBUNTU_VERSION}
LABEL Description="blazingdb/blazingsql is the official BlazingDB environment for BlazingSQL on NVIDIA RAPIDS." Vendor="BlazingSQL" Version="0.4.0"

ARG CUDA_VER=10.2
ARG CONDA_CH="-c blazingsql -c rapidsai -c nvidia"
ARG PYTHON_VERSION="3.7"
ARG RAPIDS_VERSION="0.18"

SHELL ["/bin/bash", "-c"]
ENV PYTHONDONTWRITEBYTECODE=true

RUN apt-get update -qq && \
apt-get install curl git -yqq --no-install-recommends && \
apt-get clean -y && \
curl -s -o /tmp/miniconda.sh https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
bash /tmp/miniconda.sh -bfp /usr/local/ && \
rm -rf /tmp/miniconda.sh && \
conda create --no-default-packages python=${PYTHON_VERSION} -y -n bsql && \
conda install -y --freeze-installed -n bsql \
${CONDA_CH} \
-c conda-forge -c defaults \
cugraph=${RAPIDS_VERSION} cuml=${RAPIDS_VERSION} \
cusignal=${RAPIDS_VERSION} \
cuspatial=${RAPIDS_VERSION} \
cuxfilter clx=${RAPIDS_VERSION} \
python=${PYTHON_VERSION} cudatoolkit=${CUDA_VER} \
blazingsql=${RAPIDS_VERSION} \
jupyterlab \
networkx statsmodels xgboost scikit-learn \
geoviews seaborn matplotlib holoviews colorcet && \
conda clean -afy && \
rm -rf /var/cache/apt /var/lib/apt/lists/* /tmp/miniconda.sh /usr/local/pkgs/* && \
rm -rf /usr/local/envs/bsql/conda-meta && \
rm -rf /usr/local/envs/bsql/include && \
rm /usr/local/envs/bsql/lib/libpython3.7m.so.1.0 && \
find /usr/local/envs/bsql -name '__pycache__' -type d -exec rm -rf '{}' '+' && \
find /usr/local/envs/bsql -follow -type f -name '*.pyc' -delete && \
rm -rf /usr/local/envs/bsql/lib/libasan.so.5.0.0 \
/usr/local/envs/bsql/lib/libtsan.so.0.0.0 \
/usr/local/envs/bsql/lib/liblsan.so.0.0.0 \
/usr/local/envs/bsql/lib/libubsan.so.1.0.0 \
/usr/local/envs/bsql/bin/x86_64-conda-linux-gnu-ld \
/usr/local/envs/bsql/bin/sqlite3 \
/usr/local/envs/bsql/bin/openssl \
/usr/local/envs/bsql/share/terminfo \
/usr/local/envs/bsql/bin/postgres \
/usr/local/envs/bsql/bin/pg_* \
/usr/local/envs/bsql/man \
/usr/local/envs/bsql/qml \
/usr/local/envs/bsql/qsci \
/usr/local/envs/bsql/mkspecs && \
find /usr/local/envs/bsql/lib/python3.7/site-packages -name 'tests' -type d -exec rm -rf '{}' '+' && \
find /usr/local/envs/bsql/lib/python3.7/site-packages -name '*.pyx' -delete && \
find /usr/local/envs/bsql -name '*.c' -delete && \
git clone --branch=master https://github.com/BlazingDB/Welcome_to_BlazingSQL_Notebooks /blazingsql && \
rm -rf /blazingsql/.git && \
mkdir /.local /.jupyter /.cupy && chmod 777 /.local /.jupyter /.cupy

WORKDIR /blazingsql
COPY run_jupyter.sh /blazingsql

# Jupyter
EXPOSE 8888
CMD ["/blazingsql/run_jupyter.sh"]

54 changes: 32 additions & 22 deletions README.md
@@ -99,13 +99,14 @@ conda install -c blazingsql -c rapidsai -c nvidia -c conda-forge -c defaults bla
```

## Nightly Version
For the nightly version, only CUDA 11+ is supported; see https://github.com/rapidsai/cudf#cudagpu-requirements
```bash
conda install -c blazingsql-nightly -c rapidsai-nightly -c nvidia -c conda-forge -c defaults blazingsql python=$PYTHON_VERSION cudatoolkit=$CUDA_VERSION
```
Where $CUDA_VERSION is 10.1, 10.2 or 11.0 and $PYTHON_VERSION is 3.7 or 3.8
*For example for CUDA 10.1 and Python 3.7:*
Where $CUDA_VERSION is 11.0 or 11.2 and $PYTHON_VERSION is 3.7 or 3.8
*For example for CUDA 11.2 and Python 3.8:*
```bash
conda install -c blazingsql-nightly -c rapidsai-nightly -c nvidia -c conda-forge -c defaults blazingsql python=3.7 cudatoolkit=10.1
conda install -c blazingsql-nightly -c rapidsai-nightly -c nvidia -c conda-forge -c defaults blazingsql python=3.8 cudatoolkit=11.2
```

# Build/Install from Source (Conda Environment)
@@ -117,18 +118,14 @@ This is the recommended way of building all of the BlazingSQL components and dep
```bash
conda create -n bsql python=$PYTHON_VERSION
conda activate bsql
conda install --yes -c conda-forge spdlog=1.7.0 google-cloud-cpp=1.16 ninja
conda install --yes -c rapidsai -c nvidia -c conda-forge -c defaults dask-cuda=0.18 dask-cudf=0.18 cudf=0.18 ucx-py=0.18 ucx-proc=*=gpu python=3.7 cudatoolkit=$CUDA_VERSION
conda install --yes -c conda-forge cmake=3.18 gtest gmock cppzmq cython=0.29 openjdk=8.0 maven jpype1 netifaces pyhive tqdm ipywidgets
./dependencies.sh 0.19 $CUDA_VERSION
```
Where $CUDA_VERSION is 10.1, 10.2 or 11.0 and $PYTHON_VERSION is 3.7 or 3.8
*For example for CUDA 10.1 and Python 3.7:*
```bash
conda create -n bsql python=3.7
conda activate bsql
conda install --yes -c conda-forge spdlog=1.7.0 google-cloud-cpp=1.16 ninja
conda install --yes -c rapidsai -c nvidia -c conda-forge -c defaults dask-cuda=0.18 dask-cudf=0.18 cudf=0.18 ucx-py=0.18 ucx-proc=*=gpu python=3.7 cudatoolkit=10.1
conda install --yes -c conda-forge cmake=3.18 gtest gmock cppzmq cython=0.29 openjdk=8.0 maven jpype1 netifaces pyhive tqdm ipywidgets
./dependencies.sh 0.19 10.1
```

### Build
@@ -149,21 +146,18 @@ $CONDA_PREFIX now has a folder for the blazingsql repository.
## Nightly Version

### Install build dependencies
For the nightly version, only CUDA 11+ is supported; see https://github.com/rapidsai/cudf#cudagpu-requirements
```bash
conda create -n bsql python=$PYTHON_VERSION
conda activate bsql
conda install --yes -c conda-forge spdlog=1.7.0 google-cloud-cpp=1.16 ninja
conda install --yes -c rapidsai-nightly -c nvidia -c conda-forge -c defaults dask-cuda=0.19 dask-cudf=0.19 cudf=0.19 ucx-py=0.19 ucx-proc=*=gpu python=3.7 cudatoolkit=$CUDA_VERSION
conda install --yes -c conda-forge cmake=3.18 gtest==1.10.0=h0efe328_4 gmock cppzmq cython=0.29 openjdk=8.0 maven jpype1 netifaces pyhive tqdm ipywidgets
./dependencies.sh 0.20 $CUDA_VERSION nightly
```
Where $CUDA_VERSION is 10.1, 10.2 or 11.0 and $PYTHON_VERSION is 3.7 or 3.8
*For example for CUDA 10.1 and Python 3.7:*
Where $CUDA_VERSION is 11.0 or 11.2 and $PYTHON_VERSION is 3.7 or 3.8
*For example for CUDA 11.2 and Python 3.8:*
```bash
conda create -n bsql python=3.7
conda create -n bsql python=3.8
conda activate bsql
conda install --yes -c conda-forge spdlog=1.7.0 google-cloud-cpp=1.16 ninja
conda install --yes -c rapidsai-nightly -c nvidia -c conda-forge -c defaults dask-cuda=0.19 dask-cudf=0.19 cudf=0.19 ucx-py=0.19 ucx-proc=*=gpu python=3.7 cudatoolkit=10.1
conda install --yes -c conda-forge cmake=3.18 gtest==1.10.0=h0efe328_4 gmock cppzmq cython=0.29 openjdk=8.0 maven jpype1 netifaces pyhive tqdm ipywidgets
./dependencies.sh 0.20 11.2 nightly
```

### Build
@@ -196,18 +190,34 @@ To build without the storage plugins (AWS S3, Google Cloud Storage) use the next
```
NOTE: By disabling the storage plugins you don't need to previously install the AWS SDK for C++ or Google Cloud Storage (nor any of their dependencies).

#### SQL providers
To build without the SQL providers (MySQL, PostgreSQL, SQLite) use the following arguments:
```bash
# Disable all SQL providers
./build.sh disable-mysql disable-sqlite disable-postgresql

# Disable MySQL provider
./build.sh disable-mysql

...
```
NOTES:
- By disabling the SQL providers you don't need to install mysql-connector-cpp=8.0.23, libpq=13, or sqlite=3 (nor any of their dependencies).
- Currently only the MySQL provider is supported, but PostgreSQL and SQLite will be ready in the next version!
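
The source side of what a SQL provider does — pulling rows out of an external RDBMS for the engine to consume — can be sketched with stdlib `sqlite3`. The table and data below are made up, and this is not the BlazingSQL provider API:

```python
import sqlite3

# A tiny external "RDBMS" with one table; a provider's job is to issue
# a plain SELECT like this and hand the rows to the engine as columns.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE inventory (item TEXT, qty INTEGER)")
src.executemany("INSERT INTO inventory VALUES (?, ?)", [("nut", 4), ("bolt", 9)])

rows = src.execute("SELECT item, qty FROM inventory ORDER BY item").fetchall()
print(rows)  # [('bolt', 9), ('nut', 4)]
```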

# Documentation
User guides and public API documentation can be found [here](https://docs.blazingdb.com/docs)

Documentation for our internal code architecture can be built using Sphinx.
```bash
pip install recommonmark exhale
conda install -c conda-forge doxygen
cd $CONDA_PREFIX
cd blazingsql/docs
cd blazingsql/docsrc
pip install -r requirements.txt
make doxygen
make html
```
The generated documentation can be viewed in a browser at `blazingsql/docs/_build/html/index.html`
The generated documentation can be viewed in a browser at `blazingsql/docsrc/build/html/index.html`


# Community
Expand All @@ -230,4 +240,4 @@ The RAPIDS suite of open source software libraries aim to enable execution of en

## Apache Arrow on GPU

The GPU version of [Apache Arrow](https://arrow.apache.org/) is a common API that enables efficient interchange of tabular data between processes running on the GPU. End-to-end computation on the GPU avoids unnecessary copying and converting of data off the GPU, reducing compute time and cost for high-performance analytics common in artificial intelligence workloads. As the name implies, cuDF uses the Apache Arrow columnar data format on the GPU. Currently, a subset of the features in Apache Arrow are supported.
Binary file not shown.
