Description
Hi all,
Let's say I have a very large dataset, 10-20 GB, in .csv or .txt format (e.g. BIGDATA.csv or .txt).
Obviously, I don't want to read all of it into memory. What I want is to put the data into an HDFS directory with
hdfs.put(). Then, if I want to perform row or column operations on it, how can I use plyrmr functions?
For simplicity, let's say the data is "mtcars.csv". I want to put this file into an HDFS directory and then calculate carb.per.cycle = carb/cyl. Could you please suggest how to do this?
I am using the Hadoop backend: rmr.options(backend="hadoop")
This is what I tried, but it throws an error:
hdfs.mkdir("/user/cloudera/data")
hdfs.put("mtcars.csv","/user/cloudera/data")
bind.cols(input("/user/cloudera/data"), cycle=carb/cyl)
Output:
packageJobJar: [] [/usr/jars/hadoop-streaming-2.6.0-cdh5.5.0.jar] /tmp/streamjob4514819939415411484.jar tmpDir=null
16/05/19 07:20:32 INFO client.RMProxy: Connecting to ResourceManager at quickstart.cloudera/127.0.0.1:8032
16/05/19 07:20:33 INFO client.RMProxy: Connecting to ResourceManager at quickstart.cloudera/127.0.0.1:8032
16/05/19 07:20:35 INFO mapred.FileInputFormat: Total input paths to process : 1
16/05/19 07:20:35 INFO mapreduce.JobSubmitter: number of splits:2
16/05/19 07:20:35 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1463635066767_0013
16/05/19 07:20:36 INFO impl.YarnClientImpl: Submitted application application_1463635066767_0013
16/05/19 07:20:36 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1463635066767_0013/
16/05/19 07:20:36 INFO mapreduce.Job: Running job: job_1463635066767_0013
16/05/19 07:20:48 INFO mapreduce.Job: Job job_1463635066767_0013 running in uber mode : false
16/05/19 07:20:48 INFO mapreduce.Job: map 0% reduce 0%
16/05/19 07:21:11 INFO mapreduce.Job: Task Id : attempt_1463635066767_0013_m_000001_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:325)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:538)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
16/05/19 07:21:29 INFO mapreduce.Job: Task Id : attempt_1463635066767_0013_m_000000_1, Status : FAILED