Increase speed and parallelism of the limit algorithm and implement descending sorting (#75)
* Increase speed and parallelism of the limit algorithm
* Fixed docs
* Implement descending sorting. Fixes #10
* Replace two delayed usages. Thanks @mrocklin
* Remove the reference to the function - to make it usable without a dask-sql installation on the workers
docs/pages/sql.rst (4 additions & 4 deletions)
@@ -301,15 +301,15 @@ Limitations
 Whenever you find a not already implemented operation, keyword
 or functionality, please raise an issue at our `issue tracker <https://github.com/nils-braun/dask-sql/issues>`_ with your use-case.
 
-Apart from those functional limitations, there are also two operations which need special care: ``ORDER BY`` and ``LIMIT``.
+Apart from those functional limitations, there is one operation which needs special care: ``ORDER BY``.
 Normally, ``dask-sql`` calls create a ``dask`` data frame, which gets only computed when you call the ``.compute()`` member.
-Due to internal constraints, this is currently not the case for ``ORDER BY`` and ``LIMIT``.
-Including one of those operations will trigger a calculation of the full data frame already when calling ``Context.sql()``.
+Due to internal constraints, this is currently not the case for ``ORDER BY``.
+Including this operation will trigger a calculation of the full data frame already when calling ``Context.sql()``.
 
 .. warning::
 
     There is a subtle but important difference between adding ``LIMIT 10`` to your SQL query and calling ``sql(...).head(10)``.
     The data inside ``dask`` is partitioned, to distribute it over the cluster.
     ``head`` will only return the first N elements from the first partition - even if N is larger than the partition size.
     As a benefit, calling ``.head(N)`` is typically faster than calculating the full data sample with ``.compute()``.
-    ``LIMIT`` on the other hand will always return the first N elements - no matter on how many partitions they are scattered - but will also need to compute the full data set for this.
+    ``LIMIT`` on the other hand will always return the first N elements - no matter on how many partitions they are scattered - but will also need to precalculate the first partition to find out whether it needs to look at all of the data or not.
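
The difference between ``.head(N)`` and ``LIMIT N`` described in the warning can be illustrated with a small plain-Python model. This is only a sketch: partitions are modelled as lists, and the helper names ``head`` and ``limit`` as well as the sample data are hypothetical. It is not the dask-sql implementation from this commit and does not use the ``dask`` API.

```python
from typing import List, Sequence


def head(partitions: Sequence[List[int]], n: int) -> List[int]:
    """Model of ``.head(n)``: look at the first partition only."""
    return partitions[0][:n]


def limit(partitions: Sequence[List[int]], n: int) -> List[int]:
    """Model of ``LIMIT n``: start with the first partition and
    continue into later partitions only if fewer than n rows were found."""
    rows = list(partitions[0][:n])
    for partition in partitions[1:]:
        if len(rows) >= n:
            break  # enough rows collected, later partitions are not touched
        rows.extend(partition[: n - len(rows)])
    return rows


# Hypothetical data: 9 rows split into 3 partitions of 3 rows each.
partitions = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]

head_rows = head(partitions, 5)
limit_rows = limit(partitions, 5)
print(head_rows)   # [0, 1, 2]
print(limit_rows)  # [0, 1, 2, 3, 4]
```

In this model, ``head`` returns only three rows because the first partition holds no more, whereas ``limit`` reads into the second partition to reach five rows and never touches the third.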