
Added preliminary support for Tez #278

Open · wants to merge 287 commits into master
Conversation

abhishekdas99 (Contributor)

This is a preliminary version of Tez support. The following changes are required for full support.

Here is the summary:
Done:
Added basic support for Tez jobs, along with a couple of heuristics to make sure they appear in the UI. The current implementation simply reuses the MR heuristic code.

To Be Done:
The main problem with Tez support is that one YARN application can contain multiple DAGs (or jobs). The current implementation assumes that each YARN application has exactly one MR job, so some design changes are needed.
Some UI support is also needed to show multiple DAGs under one YARN application.
We need to come up with a class structure for the heuristics, since some heuristics will be exactly the same for Tez and MR; a class hierarchy will help us avoid writing redundant code.
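The shared class hierarchy mentioned above could be sketched as follows. All class and method names here are illustrative, not Dr. Elephant's actual API: the idea is that the severity logic lives once in a generic base class, while Tez and MR subclasses only supply engine-specific counter extraction.

```java
// Hypothetical sketch of a shared heuristic hierarchy; names are
// illustrative, not Dr. Elephant's actual classes.
abstract class GenericMapperSpeedHeuristic<T> {
    // Subclasses extract the engine-specific values from their job data.
    protected abstract long inputBytes(T jobData);
    protected abstract long runtimeMs(T jobData);

    // The shared severity logic is written once in the base class.
    public String evaluate(T jobData) {
        double mbPerSec = (inputBytes(jobData) / 1024.0 / 1024.0)
                / Math.max(1, runtimeMs(jobData) / 1000.0);
        return mbPerSec < 1.0 ? "SEVERE" : "NONE";
    }
}

// For simplicity both engines use a long[] {inputBytes, runtimeMs} here;
// real implementations would take MR counters and Tez DAG counters.
class MapReduceMapperSpeedHeuristic extends GenericMapperSpeedHeuristic<long[]> {
    protected long inputBytes(long[] d) { return d[0]; }
    protected long runtimeMs(long[] d) { return d[1]; }
}

class TezMapperSpeedHeuristic extends GenericMapperSpeedHeuristic<long[]> {
    protected long inputBytes(long[] d) { return d[0]; }
    protected long runtimeMs(long[] d) { return d[1]; }
}
```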

fli and others added 30 commits December 11, 2014 11:22
…plying it to the query instead of asking the query to perform the case-insensitive comparison.
Changed HadoopJobData to include finishTime since that is needed for
metrics.
Changed the signature of getJobCounter to include jobConf and jobData
so that it can publish metrics
Updated README.md

Tested locally on my box and on spades

RB=406817
BUGS=HADOOP-7814
R=fli,mwagner
A=fli
The Java file DaliMetricsAPI.java has a flavor of the APIs that we will be exposing from the dali library.
We can split these classes into individual files when we move this functionality to the dali library.

Changed start script to look for a config file that configures a publisher. If the file is present,
then dr-elephant is started with an option that has the file name. If the file is not present,
then the behavior is unchanged (i.e. no metrics are published).

If the file is parsed correctly then dr-elephant publishes metrics in HDFS (one avro file per job)
for jobs that are configured to publish the metrics.

The job needs to set something like mapreduce.job.publish-counters='org.apache.hadoop.examples.WordCount$AppCounter:*'
to publish all counters in the given group. The format is 'groupName:counterName', where counterName can be an
asterisk to indicate all counters in the group. See the class DaliMetricsAPI.CountersToPublish.
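The 'groupName:counterName' format described above could be parsed along these lines. This is an illustrative sketch, not the actual DaliMetricsAPI code; splitting on the last colon keeps group names like org.apache.hadoop.examples.WordCount$AppCounter intact.

```java
// Illustrative parser for the 'groupName:counterName' publish spec;
// this is a sketch, not the actual DaliMetricsAPI implementation.
final class CounterSpec {
    final String group;
    final String counter; // "*" means all counters in the group

    CounterSpec(String spec) {
        int colon = spec.lastIndexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException(
                    "expected groupName:counterName, got " + spec);
        }
        this.group = spec.substring(0, colon);
        this.counter = spec.substring(colon + 1);
    }

    // True if the given counter should be published under this spec.
    boolean matches(String groupName, String counterName) {
        return group.equals(groupName)
                && (counter.equals("*") || counter.equals(counterName));
    }
}
```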

The HDFSPublisher is configured with a base path under which metrics are published. The date/hour hierarchy is added
to the base path.
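One plausible shape for that date/hour hierarchy under the base path is sketched below; the exact layout the HDFSPublisher uses is an assumption here, not taken from the commit.

```java
// Sketch of composing the metrics output path as basePath/yyyy/MM/dd/HH.
// The exact layout used by the HDFSPublisher is an assumption.
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

class MetricsPath {
    private static final DateTimeFormatter HOURLY =
            DateTimeFormatter.ofPattern("yyyy/MM/dd/HH");

    // Returns the directory a job's avro file would land in.
    static String forJob(String basePath, LocalDateTime when) {
        return basePath + "/" + when.format(HOURLY);
    }
}
```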

The XML file for configuring dr-elephant is checked in as a template. A config file needs to be added to the
'conf' path of dr-elephant (manually, as per meeting with hadoop-admin) on clusters where we want dr-elephant
to publish metrics.

RB=409443
BUGS=HADOOP-7814
R=fli,csteinba,mwagner,cbotev,ahsu
A=fli,ahsu
hadoop-1 does not have JobStatus.getFinishTime(). This causes dr-elephant to hang.

Set the start time to be the same as the finish time for h1 jobs.

For consistency, reverted to the old method of scraping the job tracker URL so that we get only the
start time, and set the finish time to be equal to the start time for retired jobs as well.

RB=417975
BUGS=HADOOP-8640
R=fli,mwagner
A=fli
RB=417448
BUGS=HADOOP-8648
R=fli
A=fli
…increasing mapred.min.split.size for too many mappers, NOT mapred.max.split.size
…name

RB=468832
BUGS=HADOOP-10405
R=fli
A=fli,ahsu
rajagopr and others added 26 commits February 6, 2017 17:44
Jobs which put large files (> 500MB) in the distributed cache are flagged.
Files from the following settings are considered:
  mapreduce.job.cache.files
  mapreduce.job.cache.archives
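The core of that check could look like the following. This is a sketch with hypothetical names: the real heuristic would obtain the sizes of the files listed in mapreduce.job.cache.files and mapreduce.job.cache.archives from HDFS before applying the threshold.

```java
// Illustrative severity check for the distributed-cache heuristic:
// flag a job if any cached file or archive exceeds 500 MB.
// Class and method names are hypothetical.
class DistributedCacheCheck {
    static final long LIMIT_BYTES = 500L * 1024 * 1024; // 500 MB

    // Sizes of the files referenced by mapreduce.job.cache.files
    // and mapreduce.job.cache.archives, in bytes.
    static boolean flagged(long[] cachedFileSizes) {
        for (long size : cachedFileSizes) {
            if (size > LIMIT_BYTES) {
                return true;
            }
        }
        return false;
    }
}
```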
…p2 (linkedin#203)

(1) Use ArrayList instead
(2) Add unit test for this

This commit allows Dr. Elephant to fetch Spark logs without universal
read access to eventLog.dir on HDFS. SparkFetcher would use SparkRestClient
instead of SparkLogClient if configured as

    <params>
      <use_rest_for_eventlogs>true</use_rest_for_eventlogs>
    </params>

The default behaviour is to fetch the logs via SparkLogClient/WebHDFS.
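In context, that params block sits inside a fetcher entry in Dr. Elephant's fetcher configuration; the surrounding structure and class name below are a sketch and should be checked against the repository's own FetcherConf.xml template.

```xml
<!-- Sketch of a full fetcher entry; verify the classname and file
     layout against the FetcherConf.xml template in the repository. -->
<fetchers>
  <fetcher>
    <applicationtype>spark</applicationtype>
    <classname>com.linkedin.drelephant.spark.fetchers.SparkFetcher</classname>
    <params>
      <use_rest_for_eventlogs>true</use_rest_for_eventlogs>
    </params>
  </fetcher>
</fetchers>
```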
…alone fetcher (linkedin#232)

Remove the backup for the REST fetcher and make the legacy FSFetcher the top-level fetcher. Change the default fetcher in the config.
* Fix SparkMetricsAggregator so it does not produce negative ResourceUsage.
* We had been ignoring failed tasks when calculating resource usage. This handles that.
* Fix the Exception heuristic, which was supposed to give the stack trace.
@fusonghe
Does this support monitoring Spark Streaming jobs?
