This repository has been archived by the owner on Dec 21, 2018. It is now read-only.

[WIP] Apache Parquet reader #85

Closed
wants to merge 91 commits into from
Changes from 27 commits
Commits
6cb51df
[parquet-reader] Add parquet reader wrapper
gcca Jul 17, 2018
bbe9467
[parquet-reader] Add column reader
gcca Jul 18, 2018
6ced85b
[parquet-reader] Enable read new page call
gcca Jul 20, 2018
16b40cb
WIP: add custom decoder
aocsa Jul 20, 2018
fc57ccb
[parquet-reader] Update parquet API to v1.3.1
gcca Jul 23, 2018
3000f89
[parquet-reader] Read batch as gdf column
gcca Jul 25, 2018
a6e7d0e
arrow decoder
aocsa Jul 26, 2018
7c24364
merge with parquet-reader
aocsa Jul 26, 2018
3b9af0e
Merge branch 'parquet-reader' into parquet-decoder
aocsa Jul 26, 2018
4593968
[parquet-reader] Add gdf column read test
gcca Jul 26, 2018
abe73d3
[parquet-reader] Add file reader by columns benchmark
gcca Jul 27, 2018
a384b15
decoder using host
aocsa Jul 27, 2018
79470ea
decoder using gpu
aocsa Jul 27, 2018
3ef6ecd
[parquet-reader] Read spaced batches to gdf column
gcca Jul 30, 2018
4282650
Merge branch 'parquet-reader' into parquet-decoder
aocsa Aug 1, 2018
819af4e
use specific gpu-decoder for int32
aocsa Aug 1, 2018
5713017
[parquet-reader] Add API to read a parquet file
gcca Aug 2, 2018
7ad9972
[parquet-reader] Merge from parquet-decoder
gcca Aug 2, 2018
882a296
[parquet-reader] Fix template definitions for readers
gcca Aug 2, 2018
e8068eb
[parquet-reader] Merger from LibGDF/master
gcca Aug 2, 2018
e407912
[parquet-reader] Fix testing files
gcca Aug 2, 2018
9ba5d7e
[parquet-reader] Move tests to src
gcca Aug 2, 2018
6aaaa51
[parquet-reader] Fix access to parquetcpp repository
gcca Aug 2, 2018
13e27c7
[parquet-reader] Fix benchmark test building
gcca Aug 2, 2018
15ff796
[parquet-reader] Fix build moving tests into src
gcca Aug 2, 2018
d7bed6a
[parquet-reader] Update tests building process
gcca Aug 2, 2018
92d89e9
[parquet-reader] Add conda dependencies for Thrift
gcca Aug 3, 2018
f56a978
[parquet-reader] Check gdf dtype from parquet type
gcca Aug 6, 2018
9043c7a
[parquet-reader] Apply batch spaced reading on tests
gcca Aug 6, 2018
9d2275e
[parquet-reader] Add column filter from file
gcca Aug 7, 2018
d0b265c
[parquet-reader] Add read to gdf column method
gcca Aug 7, 2018
3b464bd
[parquet-reader] Remove ReadGdfColumn method
gcca Aug 7, 2018
f92a931
decode bitpacking data using pinned memory
aocsa Aug 7, 2018
d25db66
Merge branch 'parquet-reader' of https://github.com/BlazingDB/libgdf …
aocsa Aug 7, 2018
1716e81
[parquet-reader] Add parquet target for linking
gcca Aug 8, 2018
9e39227
decode bitpacking data using pinned memory: merge
aocsa Aug 8, 2018
ab07b56
bitpacking decoding for all types
aocsa Aug 9, 2018
5ebc08c
start gpu benchmark for parquet reader
aocsa Aug 13, 2018
54a63a1
improve copy scheme from pinned memory to device memory
aocsa Aug 15, 2018
7ee8760
init benchmark for parquet reader
aocsa Aug 16, 2018
2ad9c25
wip: decode using only gpu
aocsa Aug 21, 2018
02c1132
gdf_column in device and benchmark for parquet reader
aocsa Aug 21, 2018
8be8e9e
implemented new expand function. Commented out problematic tests. sta…
Aug 21, 2018
273e17d
benckmark with huge parquet file
aocsa Aug 22, 2018
30c581a
added compact_to_sparse_for_nulls
Aug 23, 2018
c129c94
starting with kernel
Aug 23, 2018
298dc3d
starting with kernel
Aug 23, 2018
7f0f570
[parquet-reader]: ToGdfColumn using gpu using ReadBatch
aocsa Aug 23, 2018
7da1549
reimplemented compact_to_sparse_for_nulls
Aug 23, 2018
6979c33
added includes
Aug 23, 2018
fbae2c8
Merge branch 'willParquetExp' into willParquetKernelExp
Aug 24, 2018
bceb98b
fixed build errors but commented out usage of compact_to_sparse_for_n…
Aug 24, 2018
26a5ce5
Merge branch 'willParquetExp' into willParquetKernelExp
Aug 24, 2018
869d9eb
[parquet-reader] toGdfColumn valid support and expand using ReadBatch
aocsa Aug 24, 2018
55c53ae
kernel compiles
Aug 24, 2018
3c97bb2
improved kernel call
Aug 24, 2018
8f06c8f
improved kernel call
Aug 24, 2018
12f6404
[parquet-reader]: custom gpu kernel for definition levels to valid_bits
aocsa Aug 24, 2018
149f8d3
[parquet-reader] Add test for valid and nulls
gcca Aug 25, 2018
93a0235
[parquet-reader] Merged from branch
gcca Aug 25, 2018
d4f0be9
[parquet-reader] Test nulls with two row groups
gcca Aug 25, 2018
616b303
[parquet-reader] Update conversion to gdf column
gcca Aug 27, 2018
ce430a4
Merge branch 'parquet-reader' into willParquetKernelExp
Aug 27, 2018
67068eb
changed unpack_using_gpu to use new kernel. Changed metadata gatherin…
Aug 27, 2018
98940b8
[parquet-reader]: ReadBatchSpace support on gpu
aocsa Aug 27, 2018
f639c2b
[parquet-reader] Remove unexistent directory
gcca Aug 27, 2018
51f7479
[parquet-reader] check unit test and benchmark
aocsa Aug 28, 2018
4f88e80
changed bitpack remainders implementation
Aug 28, 2018
9f6adb7
[parquet-reader] Read filtering by row_groups and columns indices
gcca Aug 28, 2018
19628d5
Merge branch 'parquet-reader' of github.com:BlazingDB/libgdf into par…
gcca Aug 28, 2018
42bf16d
[parquet-reader] Merged from master
gcca Aug 29, 2018
e6810b5
[parquet-reader] Update to work with arrow 0.9
gcca Aug 29, 2018
81d8cb9
merged in bitpacking kernels
Aug 31, 2018
dbcf578
[parquet-reader] Fix broken ByIdsInOrder unit test
gcca Aug 31, 2018
6d2e4b3
[parquet-reader] update benchmark
aocsa Aug 31, 2018
6646f09
Merge branch 'parquet-reader' of https://github.com/BlazingDB/libgdf …
aocsa Aug 31, 2018
94ea6a4
[parquet-reader] Add read column method
gcca Aug 31, 2018
2950374
fixed an issue with parquet-benchmark test
Sep 5, 2018
fc0a72e
[parquet-reader]: fix parquet reader (tested with mortgage data)
aocsa Sep 7, 2018
fc85c2e
[parquet-reader] fix parquet benchmark
aocsa Sep 11, 2018
b6784de
[parquet-reader] rebase and fix types conversion
aocsa Sep 18, 2018
ea06079
[parquet-reader]: fix warnings
aocsa Sep 18, 2018
31326fa
[parquet-reader] Downgrade bison and flex
gcca Sep 18, 2018
55ab718
[parquet-reader] Add global ParquetCpp include directories
gcca Sep 18, 2018
c3f2552
[parquet-reader] Fix compiling warnings
gcca Sep 18, 2018
dc76e3d
[parquet-reader] fix bitpacking decoder and transform_valid
aocsa Sep 19, 2018
8bf8311
[parquet-reader]: merge with last fixes
aocsa Sep 19, 2018
951cbf9
[parquet-reader]: fix warnings
aocsa Sep 19, 2018
294f345
[parquet-reader]: fix warnings, type convertion
aocsa Sep 20, 2018
fe3def3
[parquet-reader] Merged from remote
gcca Oct 17, 2018
2e77073
[parquet-reader] Add API documentation
gcca Oct 18, 2018
4 changes: 3 additions & 1 deletion CMakeLists.txt
@@ -1,6 +1,7 @@
#=============================================================================
# Copyright 2018 BlazingDB, Inc.
# Copyright 2018 Percy Camilo Triveño Aucahuasi <[email protected]>
+# Copyright 2018 Cristhian Alberto Gonzales Castillo <[email protected]>
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -26,7 +27,7 @@

PROJECT(libgdf)

-cmake_minimum_required(VERSION 2.8) # not sure about version required
+cmake_minimum_required(VERSION 3.3) # not sure about version required

set(CMAKE_CXX_STANDARD 11)
message(STATUS "Using C++ standard: c++${CMAKE_CXX_STANDARD}")
@@ -43,6 +44,7 @@ include(CTest)
# Include custom modules (see cmake directory)
include(ConfigureGoogleTest)
include(ConfigureArrow)
+include(ConfigureParquetCpp)

find_package(CUDA)
set_package_properties(
5 changes: 3 additions & 2 deletions cmake/Modules/ConfigureArrow.cmake
@@ -1,6 +1,7 @@
#=============================================================================
# Copyright 2018 BlazingDB, Inc.
# Copyright 2018 Percy Camilo Triveño Aucahuasi <[email protected]>
+# Copyright 2018 Cristhian Alberto Gonzales Castillo <[email protected]>
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -15,7 +16,7 @@
# limitations under the License.
#=============================================================================

-set(ARROW_DOWNLOAD_BINARY_DIR ${CMAKE_BINARY_DIR}${CMAKE_FILES_DIRECTORY}/thirdparty/arrow-download/)
+set(ARROW_DOWNLOAD_BINARY_DIR ${CMAKE_BINARY_DIR}${CMAKE_FILES_DIRECTORY}/thirdparty/arrow-download)

# Download and unpack arrow at configure time
configure_file(${CMAKE_SOURCE_DIR}/cmake/Templates/Arrow.CMakeLists.txt.cmake ${ARROW_DOWNLOAD_BINARY_DIR}/CMakeLists.txt COPYONLY)
@@ -46,7 +47,7 @@ endif()
set(ARROW_ROOT ${ARROW_DOWNLOAD_BINARY_DIR}/arrow-prefix/src/arrow-install/usr/local/)

# Copy the arrow-format flatbuffer headers to include/ipc using configure_file (will sync if input file changes)
-set(ARROW_GENERATED_IPC_DIR ${ARROW_DOWNLOAD_BINARY_DIR}/arrow-prefix/src/arrow/cpp/src/arrow/ipc/)
+set(ARROW_GENERATED_IPC_DIR ${ARROW_DOWNLOAD_BINARY_DIR}/arrow-prefix/src/arrow/cpp/src/arrow/ipc)

configure_file(${ARROW_GENERATED_IPC_DIR}/File_generated.h ${CMAKE_SOURCE_DIR}/include/gdf/ipc/File_generated.h COPYONLY)
configure_file(${ARROW_GENERATED_IPC_DIR}/Message_generated.h ${CMAKE_SOURCE_DIR}/include/gdf/ipc/Message_generated.h COPYONLY)
85 changes: 85 additions & 0 deletions cmake/Modules/ConfigureParquetCpp.cmake
@@ -0,0 +1,85 @@
#=============================================================================
# Copyright 2018 BlazingDB, Inc.
# Copyright 2018 Cristhian Alberto Gonzales Castillo <[email protected]>
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#=============================================================================

# Download and unpack ParquetCpp at configure time
configure_file(${CMAKE_SOURCE_DIR}/cmake/Templates/ParquetCpp.CMakeLists.txt.cmake ${CMAKE_BINARY_DIR}${CMAKE_FILES_DIRECTORY}/thirdparty/parquetcpp-download/CMakeLists.txt)

execute_process(
COMMAND ${CMAKE_COMMAND} -G "${CMAKE_GENERATOR}" .
RESULT_VARIABLE result
WORKING_DIRECTORY ${CMAKE_BINARY_DIR}${CMAKE_FILES_DIRECTORY}/thirdparty/parquetcpp-download/
)

if(result)
message(FATAL_ERROR "CMake step for ParquetCpp failed: ${result}")
endif()

# Transitive dependencies
set(ARROW_TRANSITIVE_DEPENDENCIES_PREFIX ${ARROW_DOWNLOAD_BINARY_DIR}/arrow-prefix/src/arrow-build)
set(BROTLI_TRANSITIVE_DEPENDENCY_PREFIX ${ARROW_TRANSITIVE_DEPENDENCIES_PREFIX}/brotli_ep/src/brotli_ep-install/lib/x86_64-linux-gnu)
set(BROTLI_STATIC_LIB_ENC ${BROTLI_TRANSITIVE_DEPENDENCY_PREFIX}/libbrotlienc.a)
set(BROTLI_STATIC_LIB_DEC ${BROTLI_TRANSITIVE_DEPENDENCY_PREFIX}/libbrotlidec.a)
set(BROTLI_STATIC_LIB_COMMON ${BROTLI_TRANSITIVE_DEPENDENCY_PREFIX}/libbrotlicommon.a)
set(SNAPPY_STATIC_LIB ${ARROW_TRANSITIVE_DEPENDENCIES_PREFIX}/snappy_ep/src/snappy_ep-install/lib/libsnappy.a)
set(ZLIB_STATIC_LIB ${ARROW_TRANSITIVE_DEPENDENCIES_PREFIX}/zlib_ep/src/zlib_ep-install/lib/libz.a)
set(ARROW_HOME ${ARROW_ROOT})

set(ENV{BROTLI_STATIC_LIB_ENC} ${BROTLI_STATIC_LIB_ENC})
set(ENV{BROTLI_STATIC_LIB_DEC} ${BROTLI_STATIC_LIB_DEC})
set(ENV{BROTLI_STATIC_LIB_COMMON} ${BROTLI_STATIC_LIB_COMMON})
set(ENV{SNAPPY_STATIC_LIB} ${SNAPPY_STATIC_LIB})
set(ENV{ZLIB_STATIC_LIB} ${ZLIB_STATIC_LIB})
set(ENV{ARROW_HOME} ${ARROW_HOME})

execute_process(
COMMAND ${CMAKE_COMMAND} --build .
RESULT_VARIABLE result
WORKING_DIRECTORY ${CMAKE_BINARY_DIR}${CMAKE_FILES_DIRECTORY}/thirdparty/parquetcpp-download)

if(result)
message(FATAL_ERROR "Build step for ParquetCpp failed: ${result}")
endif()

# Add transitive dependency: Thrift
set(THRIFT_ROOT ${CMAKE_BINARY_DIR}${CMAKE_FILES_DIRECTORY}/thirdparty/parquetcpp-build/thrift_ep/src/thrift_ep-install)

# Locate ParquetCpp package
set(PARQUETCPP_ROOT ${CMAKE_BINARY_DIR}${CMAKE_FILES_DIRECTORY}/thirdparty/parquetcpp-install)
set(PARQUETCPP_BINARY_DIR ${CMAKE_BINARY_DIR}${CMAKE_FILES_DIRECTORY}/thirdparty/parquetcpp-build)
set(PARQUETCPP_SOURCE_DIR ${CMAKE_BINARY_DIR}${CMAKE_FILES_DIRECTORY}/thirdparty/parquetcpp-src)

# Dependency interfaces
find_package(Boost REQUIRED COMPONENTS regex)

add_library(Apache::Thrift INTERFACE IMPORTED)
set_target_properties(Apache::Thrift
PROPERTIES INTERFACE_INCLUDE_DIRECTORIES ${THRIFT_ROOT}/include)
set_target_properties(Apache::Thrift
PROPERTIES INTERFACE_LINK_LIBRARIES ${THRIFT_ROOT}/lib/libthrift.a)

add_library(Apache::Arrow INTERFACE IMPORTED)
set_target_properties(Apache::Arrow
PROPERTIES INTERFACE_INCLUDE_DIRECTORIES ${ARROW_ROOT}/include)
set_target_properties(Apache::Arrow
PROPERTIES INTERFACE_LINK_LIBRARIES "${ARROW_ROOT}/lib/libarrow.a;${BROTLI_STATIC_LIB_ENC};${BROTLI_STATIC_LIB_DEC};${BROTLI_STATIC_LIB_COMMON};${SNAPPY_STATIC_LIB};${ZLIB_STATIC_LIB}")

add_library(Apache::ParquetCpp INTERFACE IMPORTED)
set_target_properties(Apache::ParquetCpp
PROPERTIES INTERFACE_INCLUDE_DIRECTORIES
"${PARQUETCPP_ROOT}/include;${PARQUETCPP_BINARY_DIR}/src;${PARQUETCPP_SOURCE_DIR}/src")
set_target_properties(Apache::ParquetCpp
PROPERTIES INTERFACE_LINK_LIBRARIES "${PARQUETCPP_ROOT}/lib/libparquet.a;Apache::Arrow;Apache::Thrift;Boost::regex")
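
The three `INTERFACE IMPORTED` targets defined in this module carry their include directories and static libraries as usage requirements, so a consumer should only need to link `Apache::ParquetCpp`. A minimal sketch of such a consumer, assuming a hypothetical target `gdf_parquet_example` and a hypothetical source file `src/parquet/example.cpp` (neither name appears in this diff):

```cmake
# Hypothetical consumer target: the target name and the source path are
# illustrative assumptions, not part of this pull request.
add_library(gdf_parquet_example STATIC src/parquet/example.cpp)

# Linking the imported interface target propagates its
# INTERFACE_INCLUDE_DIRECTORIES and INTERFACE_LINK_LIBRARIES, which in turn
# pull in Apache::Arrow, Apache::Thrift and Boost::regex.
target_link_libraries(gdf_parquet_example PUBLIC Apache::ParquetCpp)
```

One caveat: FindBoost defines the `Boost::regex` imported target starting with CMake 3.5, so this link line likely needs a newer CMake than the 3.3 minimum declared in CMakeLists.txt. The conda environment in this PR pins cmake 3.6.3, which satisfies that.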
15 changes: 7 additions & 8 deletions cmake/Templates/Arrow.CMakeLists.txt.cmake
@@ -1,6 +1,7 @@
#=============================================================================
# Copyright 2018 BlazingDB, Inc.
# Copyright 2018 Percy Camilo Triveño Aucahuasi <[email protected]>
# Copyright 2018 Cristhian Alberto Gonzales Castillo <[email protected]>
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -34,27 +35,25 @@ message(STATUS "Using Apache Arrow version: ${ARROW_VERSION}")
set(ARROW_URL "https://github.com/apache/arrow/archive/${ARROW_VERSION}.tar.gz")

set(ARROW_CMAKE_ARGS
#Arrow dependencies
-DARROW_WITH_LZ4=OFF
-DARROW_WITH_ZSTD=OFF
-DARROW_WITH_BROTLI=OFF
-DARROW_WITH_SNAPPY=OFF
-DARROW_WITH_ZLIB=OFF

#Build settings
-DARROW_BUILD_STATIC=ON
-DARROW_BUILD_SHARED=OFF
-DARROW_BOOST_USE_SHARED=ON
-DARROW_BUILD_TESTS=OFF
-DARROW_TEST_MEMCHECK=OFF
-DARROW_BUILD_BENCHMARKS=OFF
-DARROW_BUILD_UTILITIES=OFF
-DARROW_JEMALLOC=OFF
-DARROW_WITH_LZ4=OFF
-DARROW_WITH_ZSTD=OFF

#Arrow modules
-DARROW_IPC=ON
-DARROW_COMPUTE=OFF
-DARROW_COMPUTE=ON
-DARROW_GPU=OFF
-DARROW_JEMALLOC=OFF
-DARROW_HDFS=OFF
-DARROW_BOOST_USE_SHARED=ON
-DARROW_BOOST_VENDORED=OFF
-DARROW_PYTHON=OFF
)
44 changes: 44 additions & 0 deletions cmake/Templates/ParquetCpp.CMakeLists.txt.cmake
@@ -0,0 +1,44 @@
#=============================================================================
# Copyright 2018 BlazingDB, Inc.
# Copyright 2018 Cristhian Alberto Gonzales Castillo <[email protected]>
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#=============================================================================

cmake_minimum_required(VERSION 2.8.12)

project(parquetcpp-download NONE)

include(ExternalProject)

set(PARQUET_VERSION apache-parquet-cpp-1.3.1)

if (NOT "$ENV{PARQUET_VERSION}" STREQUAL "")
set(PARQUET_VERSION $ENV{PARQUET_VERSION})
endif()

message(STATUS "Using Apache ParquetCpp version: ${PARQUET_VERSION}")

ExternalProject_Add(parquetcpp
BINARY_DIR "${CMAKE_BINARY_DIR}${CMAKE_FILES_DIRECTORY}/thirdparty/parquetcpp-build"
CMAKE_ARGS
-DCMAKE_BUILD_TYPE=RELEASE
-DCMAKE_INSTALL_PREFIX=${CMAKE_BINARY_DIR}${CMAKE_FILES_DIRECTORY}/thirdparty/parquetcpp-install
-DPARQUET_ARROW_LINKAGE=static
-DPARQUET_BUILD_SHARED=OFF
-DPARQUET_BUILD_TESTS=OFF
GIT_REPOSITORY https://github.com/apache/parquet-cpp.git
GIT_TAG apache-parquet-cpp-1.3.1
INSTALL_DIR "${CMAKE_BINARY_DIR}${CMAKE_FILES_DIRECTORY}/thirdparty/parquetcpp-install"
SOURCE_DIR "${CMAKE_BINARY_DIR}${CMAKE_FILES_DIRECTORY}/thirdparty/parquetcpp-src"
)
2 changes: 2 additions & 0 deletions conda_environments/dev_py35.yml
@@ -26,3 +26,5 @@ dependencies:
- numba=0.34.0.dev=np112py35_316
- cmake=3.6.3=0
- arrow-cpp=0.7.1
- flex=2.6.4
- bison=3.0.5
39 changes: 39 additions & 0 deletions include/gdf/parquet/api.h
@@ -0,0 +1,39 @@
/*
* Copyright 2018 BlazingDB, Inc.
* Copyright 2018 Cristhian Alberto Gonzales Castillo <[email protected]>
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include <gdf/gdf.h>

#ifdef __cplusplus
#define BEGIN_NAMESPACE_GDF_PARQUET \
namespace gdf { \
namespace parquet {
#define END_NAMESPACE_GDF_PARQUET \
} \
}
#else
#define BEGIN_NAMESPACE_GDF_PARQUET
#define END_NAMESPACE_GDF_PARQUET
#endif

BEGIN_NAMESPACE_GDF_PARQUET

extern "C" gdf_error
read_parquet_file(const char *const filename,
gdf_column **const out_gdf_columns,
std::size_t *const out_gdf_columns_length);

END_NAMESPACE_GDF_PARQUET