Print cost metrics as data size #11443

Closed

kokosing
Contributor

@kokosing kokosing commented Sep 7, 2018

Print cost metrics as data size

@kokosing
Contributor Author

kokosing commented Sep 7, 2018

More or less, cost is the amount of data to be processed by the CPU, sent through the network, or stored in memory. Displaying costs in data size units makes them more readable, especially when comparing big numbers.

@sopel39
Contributor

sopel39 commented Sep 7, 2018

More or less cost is the amount of data to be processed by cpu

In the case of CPU, that would rather be something like the number of cycles (e.g. an artificial unit of measure representing how heavy the computation is). That it is currently more or less the amount of data processed is a limitation of the model. We should put different weights on different operations (e.g. hash computation, hash lookup, etc.). @rschlussel

Contributor

@findepi findepi left a comment

I am OK with the change, but mind @sopel39's comment.

}

return "?";

Contributor

blank

@kokosing
Contributor Author

Output example:

     - CrossJoin => [nationkey:bigint, name:varchar(25), regionkey:bigint, comment:varchar(152), orderkey:bigint, partkey:bigint, suppkey:bigint, linenumber:i
             Distribution: REPLICATED
             Cost: {rows: 150030375 (33.53GB), cpu: 34.99G, memory: 747.67MB, network: 747.67MB}
         - TableScan[tpch:tpch:nation:sf1.0, grouped = false] => [nationkey:bigint, name:varchar(25), regionkey:bigint, comment:varchar(152)]
                 Cost: {rows: 25 (2.67kB), cpu: 2.67k, memory: 0B, network: 0B}
                 nationkey := tpch:nationkey
                 regionkey := tpch:regionkey
                 name := tpch:name
                 comment := tpch:comment
         - LocalExchange[SINGLE] () => orderkey:bigint, partkey:bigint, suppkey:bigint, linenumber:integer, quantity:double, extendedprice:double, discount:do
                 Cost: {rows: 6001215 (747.67MB), cpu: 747.67M, memory: 0B, network: 747.67MB}
             - RemoteSource[2] => [orderkey:bigint, partkey:bigint, suppkey:bigint, linenumber:integer, quantity:double, extendedprice:double, discount:double
                     Cost: {rows: 6001215 (747.67MB), cpu: 747.67M, memory: 0B, network: 747.67MB}

@kokosing
Contributor Author

Currently (without using units) the above example looks like:

     - CrossJoin => [nationkey:bigint, name:varchar(25), regionkey:bigint, comment:varchar(152), orderkey:bigint, partkey:bigint, suppkey:bigint, linenumber:i
             Distribution: REPLICATED
             Cost: {rows: 150030375 (33.53GB), cpu: 37575027902.00, memory: 783988912.00, network: 783988912.00}
         - TableScan[tpch:tpch:nation:sf1.0, grouped = false] => [nationkey:bigint, name:varchar(25), regionkey:bigint, comment:varchar(152)]
                 Cost: {rows: 25 (2.67kB), cpu: 2734.00, memory: 0.00, network: 0.00}
                 nationkey := tpch:nationkey
                 regionkey := tpch:regionkey
                 name := tpch:name
                 comment := tpch:comment
         - LocalExchange[SINGLE] () => orderkey:bigint, partkey:bigint, suppkey:bigint, linenumber:integer, quantity:double, extendedprice:double, discount:do
                 Cost: {rows: 6001215 (747.67MB), cpu: 783988912.00, memory: 0.00, network: 783988912.00}
             - RemoteSource[2] => [orderkey:bigint, partkey:bigint, suppkey:bigint, linenumber:integer, quantity:double, extendedprice:double, discount:double
                     Cost: {rows: 6001215 (747.67MB), cpu: 783988912.00, memory: 0.00, network: 783988912.00}

Thanks to this PR the EXPLAIN output is much more readable.
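
The readable figures follow from dividing the raw value by a power of 1024; for example, 783988912 / 1024^2 ≈ 747.67. Below is a minimal sketch of that conversion, assuming binary units and two-decimal formatting as the output above suggests. The class `SuccinctSize` and its method are hypothetical stand-ins for illustration, not the airlift `DataSize` implementation.

```java
import java.util.Locale;

// Hypothetical stand-in for illustration; not the airlift DataSize implementation.
final class SuccinctSize
{
    private static final String[] UNITS = {"B", "kB", "MB", "GB", "TB", "PB"};

    private SuccinctSize() {}

    // Assumes a finite, non-negative byte count.
    static String format(double bytes)
    {
        int unit = 0;
        double value = bytes;
        // Move to the next larger unit while the value is still at least 1024.
        while (value >= 1024 && unit < UNITS.length - 1) {
            value /= 1024;
            unit++;
        }
        // Whole numbers print without a fraction (e.g. "0B"), others with two decimals.
        if (value == Math.floor(value)) {
            return (long) value + UNITS[unit];
        }
        return String.format(Locale.ROOT, "%.2f%s", value, UNITS[unit]);
    }

    public static void main(String[] args)
    {
        System.out.println(format(783988912.0));   // raw network cost of the CrossJoin
        System.out.println(format(37575027902.0)); // raw cpu cost of the CrossJoin
        System.out.println(format(2734.0));        // raw cpu cost of the nation TableScan
        System.out.println(format(0.0));
    }
}
```

With the raw values from the second listing, this reproduces the figures in the first one (before the trailing `B` is stripped from the cpu cost).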

if (value == Double.NEGATIVE_INFINITY) {
return "-INF";
}
else if (value == Double.POSITIVE_INFINITY) {
Contributor

redundant else (here & below)

private static String formatDoubleAsCpuCost(double value)
{
if (value == Double.NEGATIVE_INFINITY) {
return "-INF";
Contributor

why uppercase?

FWIW, Double#toString outputs -Infinity / Infinity

else if (value == Double.POSITIVE_INFINITY) {
return "+INF";
}
else if(!isNaN(value)) {
Contributor

Invert the if condition: NaN is yet another special case, like +inf and -inf, so lay out the conditions for these cases similarly.
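
One way to lay out the three special cases uniformly is a run of early returns ahead of the normal path. This is only a sketch: the class name `SpecialValues`, the method `formatCost`, and the placeholder formatting of finite values are assumptions for illustration, and the returned strings simply mirror the snippet under review.

```java
// Hypothetical names for illustration; strings mirror the snippet under review.
final class SpecialValues
{
    private SpecialValues() {}

    static String formatCost(double value)
    {
        // Each special case is its own guard clause; no else is needed after a return.
        if (Double.isNaN(value)) {
            return "?";
        }
        if (value == Double.NEGATIVE_INFINITY) {
            return "-INF";
        }
        if (value == Double.POSITIVE_INFINITY) {
            return "+INF";
        }
        // Placeholder for the real formatting of finite values.
        return Double.toString(value);
    }
}
```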

else if(!isNaN(value)) {
String formattedValue = DataSize.succinctDataSize(value, BYTE).toString();
// strip last character `B` to not to bound cpu cost with data size
return formattedValue.substring(0, formattedValue.length() - 1);
Contributor

If you want to say "strip trailing B", formattedValue.replaceAll("B$", "") would be a more direct way of saying that.
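
The two ways of dropping the unit letter can be compared side by side. This is a small sketch, assuming the formatted string ends in an uppercase `B` as in the EXPLAIN output earlier in the thread; the class and method names are hypothetical.

```java
// Hypothetical helper for illustration only.
final class StripUnit
{
    private StripUnit() {}

    // Position-based: removes the last character, whatever it is.
    static String bySubstring(String formatted)
    {
        return formatted.substring(0, formatted.length() - 1);
    }

    // Pattern-based: removes a trailing uppercase "B" only if one is present.
    static String byRegex(String formatted)
    {
        return formatted.replaceAll("B$", "");
    }
}
```

Both variants agree on inputs such as `747.67MB`. They differ when the input has no trailing `B`: the pattern-based variant returns the string unchanged, while the position-based one silently drops the last character.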

return "?";
}

private static String formatDoubleAsDataSize(double value)
Contributor

All comments from formatDoubleAsCpuCost apply here as well.

Contributor

@findepi findepi left a comment

  • % prev comments
  • squash commits

@electrum
Contributor

I don’t think using data size is correct for CPU cost. Billions should be “B” not “G”. See how we print row count in the CLI.

@martint
Contributor

martint commented Oct 29, 2018

Why not use a Duration for CPU cost? It’s a measurement of time.

@findepi
Contributor

findepi commented Oct 29, 2018

Estimated CPU cost is not time. Currently, it's (very roughly) the amount of data being processed.

@martint
Contributor

martint commented Oct 29, 2018

That’s very misleading, then. CPU cost should, ideally, be an estimation of the amount of CPU (as measured by CPU timers) the query will use.

If the current metric is an indication of something else, we should come up with a different name for it.

@dain
Contributor

dain commented Oct 29, 2018

@findepi, CPU time is no longer an estimate. With the changes @arhimondr and I made, we get an actual measurement per operator. So, I think time is the right measure.

@findepi
Contributor

findepi commented Oct 29, 2018

@dain I understood this as being about the "CPU cost estimate" computed by the CBO during planning rather than the "CPU time measurement" as measured (or previously approximated) during execution.

@sopel39
Contributor

sopel39 commented Oct 29, 2018

@dain which changes do you refer to?

@dain
Contributor

dain commented Oct 29, 2018

@sopel39 I think this is the core PR #11408, but there were a few more followup ones.

@findepi I see, I thought we were talking about the actual measurements. What is the CBO CPU part actually estimating? Specifically, is it estimating the actual CPU time in the current cluster, or is it more of an estimate in a "model" cluster?

@findepi
Contributor

findepi commented Oct 29, 2018

@dain currently this is an "abstract CPU cost". Hence the units are not time/ticks.
There is a plan to make it more "material" (#11615 > "Adjusting model").
(This is not that trivial, since then it becomes hard (if not impossible) to compare different cost dimensions.)

I am for the change @kokosing is proposing. It's trivial and should help with reading EXPLAIN output today.
As soon as we make CPU cost something different, closer to the actual thing it's estimating, we can very easily take back this change or replace it with something that is appropriate (e.g. Duration).

Can we merge this as is then?

@dain
Contributor

dain commented Oct 29, 2018

@findepi, I'm OK with whatever you all agree on. @martint or @electrum, can you follow up on this one?

@electrum
Contributor

My comment about the prefix has not been addressed. It’s only a number, so we should use “B” for billion, not “G”. Metric prefixes only make sense for a unit of some type.
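
A sketch of what a unit-less formatter with "B" for billion could look like. The class `CountFormat`, the suffix set, and the two-decimal rounding are assumptions for illustration; this is not the CLI's row-count code.

```java
import java.util.Locale;

// Hypothetical formatter for plain numbers; names and rounding are assumptions.
final class CountFormat
{
    // Decimal (power-of-1000) suffixes: thousand, million, billion, trillion.
    private static final String[] SUFFIXES = {"", "K", "M", "B", "T"};

    private CountFormat() {}

    // Assumes a finite, non-negative value.
    static String format(double count)
    {
        int index = 0;
        double value = count;
        // Step up one suffix for every factor of 1000.
        while (value >= 1000 && index < SUFFIXES.length - 1) {
            value /= 1000;
            index++;
        }
        // Whole numbers print without a fraction, others with two decimals.
        if (value == Math.floor(value)) {
            return (long) value + SUFFIXES[index];
        }
        return String.format(Locale.ROOT, "%.2f%s", value, SUFFIXES[index]);
    }
}
```

Because a plain number scales by powers of 1000 rather than 1024, the raw cpu cost 37575027902 would read as `37.58B` here, not `34.99G`.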

@kokosing
Contributor Author

@electrum I am going to address your comment, by fixing airlift/units#7

@kokosing
Contributor Author

kokosing commented Nov 5, 2018

Related (dependency) PR: airlift/units#8

@kokosing
Contributor Author

@nezihyigitbasi @mbasmanova ping

@mbasmanova
Contributor

@kokosing Grzegorz, I can't look right now, because I'm over-booked. I'm on-call and I'm also finishing the FB-specific parts of the 0.216 release. I'll look into this next week.

Contributor

@nezihyigitbasi nezihyigitbasi left a comment

LGTM.

return "?";
}

return DataSize.succinctDataSize(value, BYTE).toString();
Contributor

  • static import succinctDataSize
  • you can also use succinctBytes((long) value)

@@ -476,6 +478,23 @@ private void printWindowOperatorStats(int indent, WindowOperatorStats stats)
output.append('\n');
}

private static String formatDoubleAsCpuCost(double value)
Contributor

I think we can simply rename these methods as formatCpuCost and formatDataSize, because the parameters are double so we don't need to repeat that we are formatting double values.

if (!isFinite(value)) {
return Double.toString(value);
}
else if (isNaN(value)) {
Contributor

unnecessary else
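
Beyond the redundant else, the order of the checks matters. Assuming `isFinite` in the snippet is `Double.isFinite`, it returns false for NaN as well as for both infinities, so a NaN test placed after it can never be reached. Below is a sketch of an ordering that keeps NaN distinct; the names are hypothetical, the `"?"` for NaN is taken from the earlier snippet, and the finite case is a placeholder.

```java
// Hypothetical names for illustration only.
final class NonFinite
{
    private NonFinite() {}

    static String formatCost(double value)
    {
        // NaN must be tested first: Double.isFinite(NaN) is false, so the
        // check below would otherwise swallow it and print "NaN".
        if (Double.isNaN(value)) {
            return "?";
        }
        if (!Double.isFinite(value)) {
            return Double.toString(value); // "Infinity" or "-Infinity"
        }
        // Placeholder for the real formatting of finite values.
        return Double.toString(value);
    }
}
```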

@kokosing kokosing closed this Jul 24, 2019