Skip to content

Data Manipulation of Big data using plyrmr function #77

@AVIN8233

Description

@AVIN8233

Hi all,
Lets say, I have a very big data 10-20GB(e.g BIGDATA.csv or .txt) which is in .csv or .txt format.
Obviously, I would not like to read the data in memory. So what I want is that, put that data in hdfs by
hdfs.put() in some HDFS directory. Then if I want to do some row or column operation,how I can use plyrmr function?
For simplicity,let say I have the data as "mtcars.csv". Now I want to put this data in hdfs directory and then calculate carb.per.cycle=carb/cycle. So please suggest me how to perform?
I am using rmr.options(backend="hadoop") #backend is hadoop
What I tried but its throwing error:

hdfs.mkdir("/user/cloudera/data")
hdfs.put("mtcars.csv","/user/cloudera/data")
bind.cols(input("/user/cloudera/data"), cycle=carb/cyl)

Output:
packageJobJar: [] [/usr/jars/hadoop-streaming-2.6.0-cdh5.5.0.jar] /tmp/streamjob4514819939415411484.jar tmpDir=null
16/05/19 07:20:32 INFO client.RMProxy: Connecting to ResourceManager at quickstart.cloudera/127.0.0.1:8032
16/05/19 07:20:33 INFO client.RMProxy: Connecting to ResourceManager at quickstart.cloudera/127.0.0.1:8032
16/05/19 07:20:35 INFO mapred.FileInputFormat: Total input paths to process : 1
16/05/19 07:20:35 INFO mapreduce.JobSubmitter: number of splits:2
16/05/19 07:20:35 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1463635066767_0013
16/05/19 07:20:36 INFO impl.YarnClientImpl: Submitted application application_1463635066767_0013
16/05/19 07:20:36 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1463635066767_0013/
16/05/19 07:20:36 INFO mapreduce.Job: Running job: job_1463635066767_0013
16/05/19 07:20:48 INFO mapreduce.Job: Job job_1463635066767_0013 running in uber mode : false
16/05/19 07:20:48 INFO mapreduce.Job: map 0% reduce 0%
16/05/19 07:21:11 INFO mapreduce.Job: Task Id : attempt_1463635066767_0013_m_000001_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:325)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:538)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

16/05/19 07:21:29 INFO mapreduce.Job: Task Id : attempt_1463635066767_0013_m_000000_1, Status : FAILED

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions