
Added preliminary support for Tez #278

Open · wants to merge 287 commits into master
Conversation

abhishekdas99 (Contributor)

This is a preliminary version of Tez support. The following changes are required for full support.

Here is the summary:
Done:
Added basic support for Tez jobs, along with a couple of heuristics to make sure they appear in the UI. The current implementation simply reuses the MR heuristic code.

To Be Done:
The main problem with Tez support is that one YARN application can contain multiple DAGs (or jobs). The current implementation assumes that each YARN application has exactly one MR job, so some design changes are needed.
Some UI support is also needed to show multiple DAGs under one YARN application.
We need to come up with a class structure for the heuristics, since some heuristics will be exactly the same for Tez and MR; a class hierarchy will help us avoid writing redundant code.
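The shared class hierarchy mentioned above could be sketched as follows. All class and method names here are illustrative, not Dr. Elephant's actual API: the idea is that the severity logic lives once in a generic base class, while Tez and MR subclasses only supply engine-specific counter extraction.

```java
// Hypothetical sketch of a shared heuristic hierarchy; names are
// illustrative, not Dr. Elephant's actual classes.
abstract class GenericMapperSpeedHeuristic<T> {
    // Subclasses extract the engine-specific values from their job data.
    protected abstract long inputBytes(T jobData);
    protected abstract long runtimeMs(T jobData);

    // The shared severity logic is written once in the base class.
    public String evaluate(T jobData) {
        double mbPerSec = (inputBytes(jobData) / 1024.0 / 1024.0)
                / Math.max(1, runtimeMs(jobData) / 1000.0);
        return mbPerSec < 1.0 ? "SEVERE" : "NONE";
    }
}

// For simplicity both engines use a long[] {inputBytes, runtimeMs} here;
// real implementations would take MR counters and Tez DAG counters.
class MapReduceMapperSpeedHeuristic extends GenericMapperSpeedHeuristic<long[]> {
    protected long inputBytes(long[] d) { return d[0]; }
    protected long runtimeMs(long[] d) { return d[1]; }
}

class TezMapperSpeedHeuristic extends GenericMapperSpeedHeuristic<long[]> {
    protected long inputBytes(long[] d) { return d[0]; }
    protected long runtimeMs(long[] d) { return d[1]; }
}
```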

fli and others added 30 commits December 11, 2014 11:22
…plying it to the query instead of asking the query to perform the case-insensitive comparison.
Changed HadoopJobData to include finishTime since that is needed for
metrics.
Changed the signature of getJobCounter to include jobConf and jobData
so that it can publish metrics
Updated README.md

Tested locally on my box and on spades

RB=406817
BUGS=HADOOP-7814
R=fli,mwagner
A=fli
The Java file DaliMetricsAPI.java has a flavor of the APIs that we will be exposing from the dali library.
We can split these classes into individual files when we move this functionality to the dali library.

Changed start script to look for a config file that configures a publisher. If the file is present,
then dr-elephant is started with an option that has the file name. If the file is not present,
then the behavior is unchanged (i.e. no metrics are published).

If the file is parsed correctly then dr-elephant publishes metrics in HDFS (one avro file per job)
for jobs that are configured to publish the metrics.

The job needs to set something like mapreduce.job.publish-counters='org.apache.hadoop.examples.WordCount$AppCounter:*'
to publish all counters in the given group. The format is 'groupName:counterName', where counterName can be an
asterisk to indicate all counters in the group. See the class DaliMetricsAPI.CountersToPublish.
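The 'groupName:counterName' format described above could be parsed along these lines. This is an illustrative sketch, not the actual DaliMetricsAPI code; splitting on the last colon keeps group names like org.apache.hadoop.examples.WordCount$AppCounter intact.

```java
// Illustrative parser for the 'groupName:counterName' publish spec;
// this is a sketch, not the actual DaliMetricsAPI implementation.
final class CounterSpec {
    final String group;
    final String counter; // "*" means all counters in the group

    CounterSpec(String spec) {
        int colon = spec.lastIndexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException(
                    "expected groupName:counterName, got " + spec);
        }
        this.group = spec.substring(0, colon);
        this.counter = spec.substring(colon + 1);
    }

    // True if the given counter should be published under this spec.
    boolean matches(String groupName, String counterName) {
        return group.equals(groupName)
                && (counter.equals("*") || counter.equals(counterName));
    }
}
```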

The HDFSPublisher is configured with a base path under which metrics are published. The date/hour hierarchy is added
to the base path.
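One plausible shape for that date/hour hierarchy under the base path is sketched below; the exact layout the HDFSPublisher uses is an assumption here, not taken from the commit.

```java
// Sketch of composing the metrics output path as basePath/yyyy/MM/dd/HH.
// The exact layout used by the HDFSPublisher is an assumption.
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

class MetricsPath {
    private static final DateTimeFormatter HOURLY =
            DateTimeFormatter.ofPattern("yyyy/MM/dd/HH");

    // Returns the directory a job's avro file would land in.
    static String forJob(String basePath, LocalDateTime when) {
        return basePath + "/" + when.format(HOURLY);
    }
}
```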

The XML file for configuring dr-elephant is checked in as a template. A config file needs to be added to the
'conf' path of dr-elephant (manually, as per meeting with hadoop-admin) on clusters where we want dr-elephant
to publish metrics.

RB=409443
BUGS=HADOOP-7814
R=fli,csteinba,mwagner,cbotev,ahsu
A=fli,ahsu
hadoop-1 does not have JobStatus.getFinishTime(). This causes dr-elephant to hang.

Set the start time to be the same as the finish time for h1 jobs.

For consistency, reverted to the old method of scraping the job tracker URL so that we get only the
start time, and set the finish time to be equal to the start time for retired jobs as well.

RB=417975
BUGS=HADOOP-8640
R=fli,mwagner
A=fli
RB=417448
BUGS=HADOOP-8648
R=fli
A=fli
…increasing mapred.min.split.size for too many mappers, NOT mapred.max.split.size
…name

RB=468832
BUGS=HADOOP-10405
R=fli
A=fli,ahsu
rajagopr and others added 26 commits February 6, 2017 17:44
Jobs which put large files (> 500MB) in the distributed cache are flagged.
Files from the following settings are considered:
  mapreduce.job.cache.files
  mapreduce.job.cache.archives
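The core of that check could look like the following. This is a sketch with hypothetical names: the real heuristic would obtain the sizes of the files listed in mapreduce.job.cache.files and mapreduce.job.cache.archives from HDFS before applying the threshold.

```java
// Illustrative severity check for the distributed-cache heuristic:
// flag a job if any cached file or archive exceeds 500 MB.
// Class and method names are hypothetical.
class DistributedCacheCheck {
    static final long LIMIT_BYTES = 500L * 1024 * 1024; // 500 MB

    // Sizes of the files referenced by mapreduce.job.cache.files
    // and mapreduce.job.cache.archives, in bytes.
    static boolean flagged(long[] cachedFileSizes) {
        for (long size : cachedFileSizes) {
            if (size > LIMIT_BYTES) {
                return true;
            }
        }
        return false;
    }
}
```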
…p2 (linkedin#203)

(1) Use ArrayList instead
(2) Add unit test for this

This commit allows Dr. Elephant to fetch Spark logs without universal
read access to eventLog.dir on HDFS. SparkFetcher would use SparkRestClient
instead of SparkLogClient if configured as

    <params>
      <use_rest_for_eventlogs>true</use_rest_for_eventlogs>
    </params>

The default behaviour is to fetch the logs via SparkLogClient/WebHDFS.
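In context, that params block sits inside a fetcher entry in Dr. Elephant's fetcher configuration; the surrounding structure and class name below are a sketch and should be checked against the repository's own FetcherConf.xml template.

```xml
<!-- Sketch of a full fetcher entry; verify the classname and file
     layout against the FetcherConf.xml template in the repository. -->
<fetchers>
  <fetcher>
    <applicationtype>spark</applicationtype>
    <classname>com.linkedin.drelephant.spark.fetchers.SparkFetcher</classname>
    <params>
      <use_rest_for_eventlogs>true</use_rest_for_eventlogs>
    </params>
  </fetcher>
</fetchers>
```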
…alone fetcher (linkedin#232)

Remove the backup for the REST fetcher and make the legacy FSFetcher the top-level fetcher. Change the default fetcher in the config.
* Fix SparkMetricsAggregator so it does not produce negative ResourceUsage.
* We had been ignoring failed tasks when calculating resource usage. This handles that.
* Fix the Exception heuristic, which was supposed to give the stack trace.
@fusonghe
Does this support monitoring Spark Streaming jobs?
