
Commit 4d20267

Rollback to v3.20.0 (#471)
Revert "Bump codecov/codecov-action from v1.4.1 to v1.5.0 (#466)" This reverts commit fdc9779. Revert "fix mistakes in documentation" This reverts commit 4e4b5e0. Revert "Bump pre-commit/action from v2.0.0 to v2.0.3 (#460)" This reverts commit d027ca2. Revert "Bump codecov/codecov-action from v1.4.0 to v1.4.1 (#461)" This reverts commit 97cd553. Revert "Bump codecov/codecov-action from v1.3.1 to v1.4.0 (#458)" This reverts commit e48d67a. Revert "Fix bug when loading few columns of a dataset with many primary indices (#446)" This reverts commit 90ee486. Revert "Prepare release 4.0.1" This reverts commit b278503. Revert "Fix tests for dask dataframe and delayed backends" This reverts commit 5520f74. Revert "Add end-to-end regression test" This reverts commit 8a3e6ae. Revert "Fix dataset corruption after updates (#445)" This reverts commit a26e840. Revert "Set release date for 4.0" This reverts commit 08a8094. Revert "Return dask scalar for store and update from ddf (#437)" This reverts commit 494732d. Revert "Add tests for non-default table (#440)" This reverts commit 3807a02. Revert "Bump codecov/codecov-action from v1.2.2 to v1.3.1 (#441)" This reverts commit f7615ec. Revert "Set default for dates_as_object to True (#436)" This reverts commit 75ffdb5. Revert "Remove inferred indices (#438)" This reverts commit b1e2535. Revert "fix typo: 'KTK_CUBE_UUID_SEPERATOR' -> 'KTK_CUBE_UUID_SEPARATOR' (#422)" This reverts commit b349cee. Revert "Remove all deprecated arguments (#434)" This reverts commit 74f0790. Revert "Remove multi table feature (#431)" This reverts commit 032856a.
1 parent db840b5 commit 4d20267


81 files changed (+4888, -1580 lines)

.github/workflows/ci-pre-commit.yml

Lines changed: 1 addition & 1 deletion

@@ -8,4 +8,4 @@ jobs:
     steps:
       - uses: actions/checkout@v2
       - uses: actions/setup-python@v2
-      - uses: pre-commit/action@v2.0.3
+      - uses: pre-commit/action@v2.0.0

.github/workflows/ci.yml

Lines changed: 1 addition & 1 deletion

@@ -135,7 +135,7 @@ jobs:
         run: python setup.py sdist bdist_wheel

       - name: Codecov
-        uses: codecov/codecov-action@v1.5.0
+        uses: codecov/codecov-action@v1.2.2
         with:
           # NOTE: `token` is not required, because the kartothek repo is public
           file: ./coverage.xml

CHANGES.rst

Lines changed: 12 additions & 0 deletions

@@ -2,11 +2,23 @@
 Changelog
 =========

+Version 5.0.0 (2021-05-xx)
+==========================
+
+This release rolls all the changes introduced with 4.x back to 3.20.0.
+
+As the incompatibility between 4.0 and 5.0 will be an issue for some customers, we encourage you to use the very stable
+kartothek 3.20.0 and not version 4.x.
+
+Please refer the Issue #471 for further information.
+

 Kartothek 4.0.3 (2021-06-10)
 ============================
+
 * Pin dask to not use 2021.5.1 and 2020.6.0 (#475)

+
 Kartothek 4.0.2 (2021-06-07)
 ============================

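The changelog entry above recommends staying on the 3.20.x line rather than 4.x. As a purely illustrative, hypothetical guard that is not part of this commit, a project written against the 3.20.x API could pin the dependency (for example `kartothek==3.20.0` in its requirements) and additionally fail fast at import time:

    # Hypothetical sanity check, not part of the commit: abort early if an
    # incompatible kartothek 4.x is installed in an environment that targets 3.20.x.
    import kartothek

    major_version = int(kartothek.__version__.split(".")[0])
    assert major_version != 4, (
        "kartothek 4.x detected; this project targets the 3.20.x API (see issue #471)"
    )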

asv_bench/benchmarks/index.py

Lines changed: 3 additions & 1 deletion

@@ -131,7 +131,9 @@ def setup(self, cardinality, num_values, partitions_to_merge):
         unique_vals = ["{:010d}".format(n) for n in range(cardinality)]
         array = [unique_vals[x % len(unique_vals)] for x in range(num_values)]
         self.df = pd.DataFrame({self.column: array})
-        self.mp = MetaPartition(label=self.table, data=self.df, metadata_version=4)
+        self.mp = MetaPartition(
+            label=self.table, data={"core": self.df}, metadata_version=4
+        )
         self.mp_indices = self.mp.build_indices([self.column])
         self.merge_indices.append(self.mp_indices)

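For orientation, here is a minimal sketch, not part of the commit, of the multi-table MetaPartition API that the restored benchmark exercises. It assumes kartothek 3.20.x; the label "part_0" and the table name "core" are illustrative only.

    import pandas as pd
    from kartothek.io_components.metapartition import MetaPartition

    # After the rollback, `data` again maps table names to DataFrames.
    df = pd.DataFrame({"col": ["a", "b", "a"]})
    mp = MetaPartition(label="part_0", data={"core": df}, metadata_version=4)

    # build_indices returns a new MetaPartition that carries index data for `col`,
    # which is what the benchmark above collects into `merge_indices`.
    mp_with_indices = mp.build_indices(["col"])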

asv_bench/benchmarks/metapartition.py

Lines changed: 4 additions & 2 deletions

@@ -33,14 +33,15 @@ def setup(self, num_rows, dtype):
         self.mp = MetaPartition(
             label="primary_key={}/base_label".format(dtype[0]),
             metadata_version=4,
-            schema=self.schema,
+            table_meta={"table": self.schema},
         )

     def time_reconstruct_index(self, num_rows, dtype):

         self.mp._reconstruct_index_columns(
             df=self.df,
             key_indices=[("primary_key", str(dtype[1]))],
+            table="table",
             columns=None,
             categories=None,
             date_as_object=False,
@@ -50,7 +51,8 @@ def time_reconstruct_index_categorical(self, num_rows, dtype):
         self.mp._reconstruct_index_columns(
             df=self.df,
             key_indices=[("primary_key", str(dtype[1]))],
+            table="table",
             columns=None,
-            categories="primary_key",
+            categories={"table": ["primary_key"]},
             date_as_object=False,
         )

asv_bench/benchmarks/write.py

Lines changed: 8 additions & 8 deletions

@@ -17,11 +17,12 @@
 from .config import AsvBenchmarkConfig


-def generate_mp():
+def generate_mp(dataset_metadata=None):
     return MetaPartition(
         label=uuid.uuid4().hex,
-        schema=make_meta(get_dataframe_alltypes(), origin="alltypes"),
-        file="fakefile",
+        table_meta={"table": make_meta(get_dataframe_alltypes(), origin="alltypes")},
+        files={"table": "fakefile"},
+        dataset_metadata=dataset_metadata,
     )


@@ -49,7 +50,8 @@ class TimeStoreDataset(AsvBenchmarkConfig):

     def setup(self, num_partitions, max_depth, num_leafs):
         self.store = get_store_from_url("hfs://{}".format(tempfile.mkdtemp()))
-        self.partitions = [generate_mp() for _ in range(num_partitions)]
+        dataset_metadata = generate_metadata(max_depth, num_leafs)
+        self.partitions = [generate_mp(dataset_metadata) for _ in range(num_partitions)]
         self.dataset_uuid = "dataset_uuid"
         self.user_dataset_metadata = {}

@@ -68,10 +70,8 @@ class TimePersistMetadata(AsvBenchmarkConfig):

     def setup(self, num_partitions):
         self.store = get_store_from_url("hfs://{}".format(tempfile.mkdtemp()))
-        self.schemas = [generate_mp().schema for _ in range(num_partitions)]
+        self.partitions = [generate_mp() for _ in range(num_partitions)]
         self.dataset_uuid = "dataset_uuid"

     def time_persist_common_metadata(self, num_partitions):
-        persist_common_metadata(
-            self.schemas, None, self.store, self.dataset_uuid, "name"
-        )
+        persist_common_metadata(self.partitions, None, self.store, self.dataset_uuid)

docs/conf.py

Lines changed: 5 additions & 0 deletions

@@ -129,3 +129,8 @@
     "kartothek.serialization._generic": "kartothek.serialization",
     "kartothek.serialization._parquet": "kartothek.serialization",
 }
+
+# In particular the deprecation warning in DatasetMetadata.table_schema is
+# raising too many warning to handle sensibly using ipython directive pseudo
+# decorators. Remove this with 4.X again
+ipython_warning_is_error = False

docs/guide/examples.rst

Lines changed: 45 additions & 13 deletions

@@ -36,7 +36,7 @@ Setup a store

     # Load your data
     # By default the single dataframe is stored in the 'core' table
-    df_from_store = read_table(store=store_url, dataset_uuid=dataset_uuid)
+    df_from_store = read_table(store=store_url, dataset_uuid=dataset_uuid, table="table")
     df_from_store


@@ -58,8 +58,14 @@ Write

     # We'll define two partitions which both have two tables
     input_list_of_partitions = [
-        pd.DataFrame({"A": range(10)}),
-        pd.DataFrame({"A": range(10, 20)}),
+        {
+            "label": "FirstPartition",
+            "data": [("FirstCategory", pd.DataFrame()), ("SecondCategory", pd.DataFrame())],
+        },
+        {
+            "label": "SecondPartition",
+            "data": [("FirstCategory", pd.DataFrame()), ("SecondCategory", pd.DataFrame())],
+        },
     ]

     # The pipeline will return a :class:`~kartothek.core.dataset.DatasetMetadata` object
@@ -90,10 +96,17 @@ Read
     # In case you were using the dataset created in the Write example
     for d1, d2 in zip(
         list_of_partitions,
-        [pd.DataFrame({"A": range(10)}), pd.DataFrame({"A": range(10, 20)}),],
+        [
+            # FirstPartition
+            {"FirstCategory": pd.DataFrame(), "SecondCategory": pd.DataFrame()},
+            # SecondPartition
+            {"FirstCategory": pd.DataFrame(), "SecondCategory": pd.DataFrame()},
+        ],
     ):
-        for k1, k2 in zip(d1, d2):
-            assert k1 == k2
+        for kv1, kv2 in zip(d1.items(), d2.items()):
+            k1, v1 = kv1
+            k2, v2 = kv2
+            assert k1 == k2 and all(v1 == v2)


 Iter
@@ -107,8 +120,14 @@ Write
     from kartothek.api.dataset import store_dataframes_as_dataset__iter

     input_list_of_partitions = [
-        pd.DataFrame({"A": range(10)}),
-        pd.DataFrame({"A": range(10, 20)}),
+        {
+            "label": "FirstPartition",
+            "data": [("FirstCategory", pd.DataFrame()), ("SecondCategory", pd.DataFrame())],
+        },
+        {
+            "label": "SecondPartition",
+            "data": [("FirstCategory", pd.DataFrame()), ("SecondCategory", pd.DataFrame())],
+        },
     ]

     # The pipeline will return a :class:`~kartothek.core.dataset.DatasetMetadata` object
@@ -141,10 +160,17 @@ Read
     # In case you were using the dataset created in the Write example
     for d1, d2 in zip(
         list_of_partitions,
-        [pd.DataFrame({"A": range(10)}), pd.DataFrame({"A": range(10, 20)}),],
+        [
+            # FirstPartition
+            {"FirstCategory": pd.DataFrame(), "SecondCategory": pd.DataFrame()},
+            # SecondPartition
+            {"FirstCategory": pd.DataFrame(), "SecondCategory": pd.DataFrame()},
+        ],
     ):
-        for k1, k2 in zip(d1, d2):
-            assert k1 == k2
+        for kv1, kv2 in zip(d1.items(), d2.items()):
+            k1, v1 = kv1
+            k2, v2 = kv2
+            assert k1 == k2 and all(v1 == v2)

 Dask
 ````
@@ -158,8 +184,14 @@ Write
     from kartothek.api.dataset import store_delayed_as_dataset

     input_list_of_partitions = [
-        pd.DataFrame({"A": range(10)}),
-        pd.DataFrame({"A": range(10, 20)}),
+        {
+            "label": "FirstPartition",
+            "data": [("FirstCategory", pd.DataFrame()), ("SecondCategory", pd.DataFrame())],
+        },
+        {
+            "label": "SecondPartition",
+            "data": [("FirstCategory", pd.DataFrame()), ("SecondCategory", pd.DataFrame())],
+        },
     ]

     # This will return a :class:`~dask.delayed`. The figure below
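
To see how the restored partition format above fits together end to end, here is a minimal eager sketch that is not taken from the diff. It assumes kartothek 3.20.x with a temporary local store; the dataset name, labels, and table names are illustrative only.

    import tempfile

    import pandas as pd
    from kartothek.io.eager import read_table, store_dataframes_as_dataset

    store_url = "hfs://{}".format(tempfile.mkdtemp())

    # Each partition is a dict with a label and a "data" entry of (table name, DataFrame) pairs.
    input_list_of_partitions = [
        {
            "label": "FirstPartition",
            "data": [("FirstCategory", pd.DataFrame({"A": [1]})), ("SecondCategory", pd.DataFrame({"B": [2]}))],
        },
        {
            "label": "SecondPartition",
            "data": [("FirstCategory", pd.DataFrame({"A": [3]})), ("SecondCategory", pd.DataFrame({"B": [4]}))],
        },
    ]

    dm = store_dataframes_as_dataset(
        store=store_url, dataset_uuid="eager-example", dfs=input_list_of_partitions
    )

    # Read one of the two tables back as a single pandas DataFrame.
    df = read_table(dataset_uuid="eager-example", store=store_url, table="FirstCategory")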

docs/guide/getting_started.rst

Lines changed: 65 additions & 19 deletions

@@ -5,6 +5,10 @@ Getting Started
 ===============


+Kartothek manages datasets that consist of files that contain tables. It does so by offering
+a metadata definition to handle these datasets efficiently.
+
+Datasets in Kartothek are made up of one or more ``tables``, each with a unique schema.
 When working with Kartothek tables as a Python user, we will use :class:`~pandas.DataFrame`
 as the user-facing type.

@@ -127,25 +131,33 @@ This class holds information about the structure and schema of the dataset.

 .. ipython:: python

-    dm.table_name
+    dm.tables
     sorted(dm.partitions.keys())
-    dm.schema.remove_metadata()
+    dm.table_meta["table"].remove_metadata()  # Arrow schema
+

+For this guide, two attributes that are noteworthy are ``tables`` and ``partitions``:

-For this guide we want to take a closer look at the ``partitions`` attribute.
-``partitions`` are the physical "pieces" of data which together constitute the
-contents of a dataset. Data is written to storage on a per-partition basis. See
-the section on partitioning for further details: :ref:`partitioning_section`.
+- Each dataset has one or more ``tables``, where each table is a logical collection of data,
+  bound together by a common schema.
+- ``partitions`` are the physical "pieces" of data which together constitute the
+  contents of a dataset. Data is written to storage on a per-partition basis.
+  See the section on partitioning for further details: :ref:`partitioning_section`.

-The attribute ``schema`` can be accessed to see the underlying schema of the dataset.
+The attribute ``table_meta`` can be accessed to see the underlying schema of the dataset.
 See :ref:`type_system` for more information.

 To store multiple dataframes into a dataset, it is possible to pass a collection of
 dataframes; the exact format will depend on the I/O backend used.

-Kartothek assumes these dataframes are different chunks of the same table and
-will therefore be required to have the same schema. A ``ValueError`` will be
-thrown otherwise.
+Additionally, Kartothek supports several data input formats,
+it does not need to always be a plain ``pd.DataFrame``.
+See :func:`~kartothek.io_components.metapartition.parse_input_to_metapartition` for
+further details.
+
+If table names are not specified when passing an iterator of dataframes,
+Kartothek assumes these dataframes are different chunks of the same table
+and expects their schemas to be identical. A ``ValueError`` will be thrown otherwise.
 For example,

 .. ipython:: python
@@ -182,6 +194,39 @@ For example,
 .. note:: Read these sections for more details: :ref:`type_system`, :ref:`dataset_spec`


+When we do not explicitly define the name of the table and partition, Kartothek uses the
+default table name ``table`` and generates a UUID for the partition name.
+
+.. admonition:: A more complex example: multiple named tables
+
+    Sometimes it may be useful to write multiple dataframes with different schemas into
+    a single dataset. This can be achieved by creating a dataset with multiple tables.
+
+    In this example, we create a dataset with two tables: ``core-table`` and ``aux-table``.
+    The schemas of the tables are identical across partitions (each dictionary in the
+    ``dfs`` list argument represents a partition).
+
+    .. ipython:: python
+
+        dfs = [
+            {
+                "data": {
+                    "core-table": pd.DataFrame({"id": [22, 23], "f": [1.1, 2.4]}),
+                    "aux-table": pd.DataFrame({"id": [22], "col1": ["x"]}),
+                }
+            },
+            {
+                "data": {
+                    "core-table": pd.DataFrame({"id": [29, 31], "f": [3.2, 0.6]}),
+                    "aux-table": pd.DataFrame({"id": [31], "col1": ["y"]}),
+                }
+            },
+        ]
+
+        dm = store_dataframes_as_dataset(store_url, dataset_uuid="two-tables", dfs=dfs)
+        dm.tables
+
+
 Reading data from storage
 =========================

@@ -193,24 +238,24 @@ table of the dataset as a pandas DataFrame.

     from kartothek.api.dataset import read_table

-    read_table("a_unique_dataset_identifier", store_url)
+    read_table("a_unique_dataset_identifier", store_url, table="table")


 We can also read a dataframe iteratively, using
-:func:`~kartothek.io.iter.read_dataset_as_dataframes__iterator`. This will return a generator of :class:`pandas.DataFrame` where every element represents one file. For example,
+:func:`~kartothek.io.iter.read_dataset_as_dataframes__iterator`. This will return a generator
+of dictionaries (one dictionary for each `partition`), where the keys of each dictionary
+represent the `tables` of the dataset. For example,

 .. ipython:: python

     from kartothek.api.dataset import read_dataset_as_dataframes__iterator

-    for partition_index, df in enumerate(
-        read_dataset_as_dataframes__iterator(
-            dataset_uuid="a_unique_dataset_identifier", store=store_url
-        )
+    for partition_index, df_dict in enumerate(
+        read_dataset_as_dataframes__iterator(dataset_uuid="two-tables", store=store_url)
     ):
-        # Note: There is no guarantee on the ordering
         print(f"Partition #{partition_index}")
-        print(f"Data: \n{df}")
+        for table_name, table_df in df_dict.items():
+            print(f"Table: {table_name}. Data: \n{table_df}")

 Respectively, the ``dask.delayed`` back-end provides the function
 :func:`~kartothek.io.dask.delayed.read_dataset_as_delayed`, which has a very similar
@@ -230,7 +275,8 @@ function but returns a collection of ``dask.delayed`` objects.

 .. ipython:: python

-    read_table("a_unique_dataset_identifier", store_url, predicates=[[("A", "<", 2.5)]])
+    # Read only values table `core-table` where `f` < 2.5
+    read_table("two-tables", store_url, table="core-table", predicates=[[("f", "<", 2.5)]])

 .. _storefact: https://github.com/blue-yonder/storefact
 .. _dask: https://docs.dask.org/en/latest/
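
The restored docs above mention :func:`~kartothek.io.dask.delayed.read_dataset_as_delayed` only in passing. The following is a hedged sketch of how it could be used, assuming kartothek 3.20.x and the "two-tables" dataset created in the getting-started example; the shape of the computed results (one dict of table name to DataFrame per partition, mirroring the iterator backend) is an assumption here.

    import dask
    from kartothek.io.dask.delayed import read_dataset_as_delayed

    # Returns one dask.delayed object per partition of the dataset.
    delayed_partitions = read_dataset_as_delayed(dataset_uuid="two-tables", store=store_url)

    # Assumption: each computed partition is a dict mapping table names to pandas DataFrames.
    for partition in dask.compute(*delayed_partitions):
        for table_name, table_df in partition.items():
            print(table_name, len(table_df))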
