This repository was archived by the owner on Oct 15, 2025. It is now read-only.

Commit 495ce7d

GitHub Data Source Integration (#1233)
- [x] GitHub Data Source Integration
- [x] Batching support for the native storage engine. We cannot do batching in the storage engine, since it does not work with LIMIT; that change was reverted.
- [x] Full NamedUser table support
- [x] Enable CircleCI local PR cache for testmondata
- [x] Native storage engine `read` refactor
- [x] Test cases
- [x] GitHub data source documentation
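The LIMIT note in the checklist is the motivation for the per-row `read` refactor in this commit: a generator that yields one row at a time lets a LIMIT consumer stop early, whereas pre-accumulated large batches always pay for every row. A minimal sketch, where `read_rows` is a hypothetical stand-in for a storage engine's `read`:

```python
from itertools import islice

def read_rows(source):
    # Hypothetical stand-in for a storage engine's read():
    # yield one row at a time instead of accumulating a batch.
    for row in source:
        yield row

# A LIMIT 2 consumer pulls exactly two rows from the generator;
# the remaining rows are never materialized.
stream = read_rows(iter(range(1_000_000)))
limited = list(islice(stream, 2))
print(limited)  # [0, 1]
```

With batch-at-a-time reads, the same LIMIT would still deserialize the whole batch before discarding most of it.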
1 parent e8a181c commit 495ce7d

17 files changed: +464 −59 lines changed


.circleci/config.yml

Lines changed: 13 additions & 7 deletions
@@ -213,10 +213,11 @@ jobs:
           keys:
             - v1-model_cache-{{ checksum "setup.py" }}
-      # Always restore testmondata from staging, python3.10, ray disabled.
+      # First try restoring the testmondata from the PR, then staging.
       - restore_cache:
           keys:
-            - v1-testmon_cache-staging-python3.10-rayDISABLED-{{ checksum "setup.py" }}
+            - v1-testmon_cache-{{ .Branch }}-python<< parameters.v >>-ray<< parameters.ray >>-{{ checksum "setup.py" }}-
+            - v1-testmon_cache-staging-python3.10-rayDISABLED-{{ checksum "setup.py" }}-

       - run:
           name: Install EvaDB package from GitHub repo with all dependencies
@@ -271,16 +272,21 @@ jobs:
       # Collect the testmondata only for long integration tests
       - when:
           condition:
-            and:
-              - equal: [ LONG INTEGRATION, << parameters.mode >> ]
-              - equal: [ staging, << pipeline.git.branch >> ]
-              - equal: [ "3.10", << parameters.v >> ]
-              - equal: [ DISABLED, << parameters.ray >>]
+            or:
+              - equal: [ LONG INTEGRATION CACHE, << parameters.mode >> ]
+              - and:
+                  - equal: [ LONG INTEGRATION, << parameters.mode >> ]
+                  - equal: [ staging, << pipeline.git.branch >> ]
+                  - equal: [ "3.10", << parameters.v >> ]
+                  - equal: [ DISABLED, << parameters.ray >>]
           steps:
             - save_cache:
                 key: v1-testmon_cache-{{ .Branch }}-python<< parameters.v >>-ray<< parameters.ray >>-{{ checksum "setup.py" }}-{{ epoch }}
                 paths:
                   - .testmondata
+                  - .testmondata-shm
+                  - .testmondata-wal

             - save_cache:
                 key: v1-pip-wheel_cache-python<< parameters.v >>-ray<< parameters.ray >>-{{ checksum "setup.py" }}
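The two `restore_cache` keys above rely on CircleCI's documented fallback behavior: keys are tried in order, each key is matched as a prefix against existing caches, and the most recently generated match is restored. So the PR branch's own testmondata cache wins when it exists, and the staging cache is the fallback. A toy model of that lookup (`pick_cache` is illustrative only; recency is approximated with `max()` over an epoch suffix):

```python
def pick_cache(keys, available):
    # Model of CircleCI restore_cache: try keys in order, treat each
    # as a prefix, and restore the newest cache that matches.
    for key in keys:
        matches = [c for c in available if c.startswith(key)]
        if matches:
            # Real CircleCI picks the most recently generated cache;
            # the trailing {{ epoch }} makes max() a stand-in here.
            return max(matches)
    return None

available = [
    "v1-testmon_cache-staging-python3.10-rayDISABLED-abc-100",
    "v1-testmon_cache-my-branch-python3.10-rayDISABLED-abc-101",
]
keys = [
    "v1-testmon_cache-my-branch-python3.10-rayDISABLED-abc-",   # PR branch first
    "v1-testmon_cache-staging-python3.10-rayDISABLED-abc-",     # staging fallback
]
print(pick_cache(keys, available))
```

The trailing `-` on each key in the config keeps `my-branch` from prefix-matching `my-branch-2`-style caches.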

docs/_toc.yml

Lines changed: 1 addition & 0 deletions
@@ -67,6 +67,7 @@ parts:
   - file: source/reference/databases/sqlite
   - file: source/reference/databases/mysql
   - file: source/reference/databases/mariadb
+  - file: source/reference/databases/github

   - file: source/reference/ai/index
     title: AI Engines
Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
+Github
+==========
+
+The connection to Github is based on the `PyGithub <https://github.com/PyGithub/PyGithub>`_ library.
+
+Dependency
+----------
+
+* PyGithub
+
+
+Parameters
+----------
+
+Required:
+
+* ``owner`` is the owner of the Github repository. For example, georgia-tech-db is the owner of the EvaDB repository.
+* ``repo`` is the name of the Github repository. For example, evadb is the name of this repository.
+
+Optional:
+
+* ``github_token`` is not required for public repositories. However, the rate limit is lower without a valid github_token. Check the `Rate limits page <https://docs.github.com/en/rest/overview/resources-in-the-rest-api?apiVersion=2022-11-28#rate-limits>`_ to learn more about how to check your rate limit status. Check the `Managing your personal access tokens page <https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens>`_ to learn how to create personal access tokens.
+
+Create Connection
+-----------------
+
+.. code-block:: text
+
+   CREATE DATABASE github_data WITH ENGINE = 'github', PARAMETERS = {
+        "owner": "georgia-tech-db",
+        "repo": "evadb"
+   };
+
+Supported Tables
+----------------
+
+* ``stargazers``: Lists the people that have starred the repository. Check `evadb/third_party/databases/github/table_column_info.py` for all the available columns in the table.
+
+.. code-block:: sql
+
+   SELECT * FROM github_data.stargazers;
+
+Here is the query output:
+
+.. code-block::
+
+    +---------------------------------------------------+-----+---------------------------------------------+
+    | stargazers.avatar_url                             | ... | stargazers.url                              |
+    |---------------------------------------------------|-----|---------------------------------------------|
+    | https://avatars.githubusercontent.com/u/105357... | ... | https://api.github.com/users/jaehobang      |
+    | https://avatars.githubusercontent.com/u/436141... | ... | https://api.github.com/users/VineethAljapur |
+    | ...                                               | ... | ...                                         |
+    +---------------------------------------------------+-----+---------------------------------------------+
+
+.. note::
+
+   Looking for another table from Github? You can add a table mapping in `evadb/third_party/databases/github/github_handler.py`, or simply raise a `Feature Request <https://github.com/georgia-tech-db/evadb/issues/new/choose>`_.
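The documentation above says the connection is built on PyGithub, whose stargazer listing returns `NamedUser` objects with attributes like `avatar_url` and `url`. As an offline illustration of how such objects could be flattened into the rows shown in the sample output: the `user_to_row` helper and the `SimpleNamespace` stand-in below are hypothetical, while the commented-out calls are PyGithub's real API.

```python
from types import SimpleNamespace

# Subset of the stargazers columns shown in the docs above.
COLUMNS = ["avatar_url", "url"]

def user_to_row(user, columns=COLUMNS):
    # Flatten one NamedUser-like object into a row dict, the shape a
    # data-source handler would emit per stargazer.
    return {col: getattr(user, col, None) for col in columns}

# Offline stand-in for a PyGithub NamedUser. With PyGithub and network
# access, this would instead be:
#   from github import Github
#   repo = Github().get_repo("georgia-tech-db/evadb")
#   rows = [user_to_row(u) for u in repo.get_stargazers()]
fake_user = SimpleNamespace(
    avatar_url="https://avatars.githubusercontent.com/u/105357",
    url="https://api.github.com/users/jaehobang",
)
row = user_to_row(fake_user)
print(row["url"])  # https://api.github.com/users/jaehobang
```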

evadb/binder/statement_binder_context.py

Lines changed: 4 additions & 1 deletion
@@ -89,7 +89,10 @@ def add_table_alias(self, alias: str, database_name: str, table_name: str):
             db_catalog_entry.engine, **db_catalog_entry.params
         ) as handler:
             # Assemble columns.
-            column_df = handler.get_columns(table_name).data
+            response = handler.get_columns(table_name)
+            if response.error is not None:
+                raise BinderError(response.error)
+            column_df = response.data
         table_obj = create_table_catalog_entry_for_data_source(
             table_name, database_name, column_df
         )

evadb/catalog/catalog_manager.py

Lines changed: 1 addition & 1 deletion
@@ -192,7 +192,7 @@ def check_native_table_exists(self, table_name: str, database_name: str):
         resp = handler.get_tables()

         if resp.error is not None:
-            return False
+            raise Exception(resp.error)

         # Check table existence.
         table_df = resp.data

evadb/storage/native_storage_engine.py

Lines changed: 27 additions & 32 deletions
@@ -158,46 +158,41 @@ def write(self, table: TableCatalogEntry, rows: Batch):
             logger.exception(err_msg)
             raise Exception(err_msg)

-    def read(self, table: TableCatalogEntry) -> Iterator[Batch]:
+    def read(
+        self, table: TableCatalogEntry, batch_mem_size: int = 30000000
+    ) -> Iterator[Batch]:
         try:
             db_catalog_entry = self._get_database_catalog_entry(table.database_name)
             with get_database_handler(
                 db_catalog_entry.engine, **db_catalog_entry.params
             ) as handler:
-                uri = handler.get_sqlalchmey_uri()
-
-                # Create a metadata object
-                engine = create_engine(uri)
-                metadata = MetaData()
-
-                Session = sessionmaker(bind=engine)
-                session = Session()
-                # Retrieve the SQLAlchemy table object for the existing table
-                table_to_read = Table(table.name, metadata, autoload_with=engine)
-                result = session.execute(table_to_read.select()).fetchall()
-                data_batch = []
-
-                # Ensure that the order of columns in the select is same as in table.columns
-                # Also verify if the column names are consistent
-                if result:
-                    cols = result[0]._fields
-                    index_dict = {
-                        element.lower(): index for index, element in enumerate(cols)
-                    }
-                    try:
-                        ordered_columns = sorted(
-                            table.columns, key=lambda x: index_dict[x.name.lower()]
+                handler_response = handler.select(table.name)
+                # we prefer the generator/iterator when available
+                result = []
+                if handler_response.data_generator:
+                    result = handler_response.data_generator
+                elif handler_response.data:
+                    result = handler_response.data
+
+                if handler.is_sqlalchmey_compatible():
+                    # For sql data source, we can deserialize sql rows into numpy array
+                    cols = result[0]._fields
+                    index_dict = {
+                        element.lower(): index for index, element in enumerate(cols)
+                    }
+                    try:
+                        ordered_columns = sorted(
+                            table.columns, key=lambda x: index_dict[x.name.lower()]
+                        )
+                    except KeyError as e:
+                        raise Exception(f"Column mismatch with error {e}")
+                    result = (
+                        _deserialize_sql_row(row, ordered_columns) for row in result
                     )
-                    except KeyError as e:
-                        raise Exception(f"Column mismatch with error {e}")

-                for row in result:
-                    data_batch.append(_deserialize_sql_row(row, ordered_columns))
+                for data_batch in result:
+                    yield Batch(pd.DataFrame([data_batch]))

-                if data_batch:
-                    yield Batch(pd.DataFrame(data_batch))
-
-                session.close()
         except Exception as e:
             err_msg = f"Failed to read the table {table.name} in data source {table.database_name} with exception {str(e)}"
             logger.exception(err_msg)
evadb/storage/sqlite_storage_engine.py

Lines changed: 5 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -29,7 +29,7 @@
2929
from evadb.models.storage.batch import Batch
3030
from evadb.parser.table_ref import TableInfo
3131
from evadb.storage.abstract_storage_engine import AbstractStorageEngine
32-
from evadb.utils.generic_utils import PickleSerializer, get_size
32+
from evadb.utils.generic_utils import PickleSerializer
3333
from evadb.utils.logging_manager import logger
3434

3535
# Leveraging Dynamic schema in SQLAlchemy
@@ -189,23 +189,12 @@ def read(
189189
try:
190190
table_to_read = self._try_loading_table_via_reflection(table.name)
191191
result = self._sql_session.execute(table_to_read.select()).fetchall()
192-
data_batch = []
193-
row_size = None
194192
for row in result:
195-
# For table read, we provide row_id so that user can also retrieve
196-
# row_id from the table.
197-
data_batch.append(
198-
self._deserialize_sql_row(row._asdict(), table.columns)
193+
yield Batch(
194+
pd.DataFrame(
195+
[self._deserialize_sql_row(row._asdict(), table.columns)]
196+
)
199197
)
200-
if row_size is None:
201-
row_size = 0
202-
row_size = get_size(data_batch)
203-
if len(data_batch) * row_size >= batch_mem_size:
204-
yield Batch(pd.DataFrame(data_batch))
205-
data_batch = []
206-
if data_batch:
207-
yield Batch(pd.DataFrame(data_batch))
208-
209198
except Exception as e:
210199
err_msg = f"Failed to read the table {table.name} with exception {str(e)}"
211200
logger.exception(err_msg)
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+# coding=utf-8
+# Copyright 2018-2023 EvaDB
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""github integration"""
