
hdfs.read() cannot load all data from huge csv file on hdfs #8

Description (@strategist922)

Hi,
I have many huge CSV files (more than 20 GB each) on my Hortonworks HDP 2.0.6.0 GA cluster,
and I use the following code to read a file from HDFS:


Sys.setenv(HADOOP_CMD="/usr/lib/hadoop/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-101.jar")
Sys.setenv(HADOOP_COMMON_LIB_NATIVE_DIR="/usr/lib/hadoop/lib/native/")
library(rmr2)
library(rhdfs)
library(lubridate)
hdfs.init()
# open the file with a 100 MB buffer and read it in one call
f = hdfs.file("/etl/rawdata/201202.csv", "r", buffersize = 104857600)
m = hdfs.read(f)
c = rawToChar(m)
data = read.table(textConnection(c), sep = ",")


When I use dim(data) to verify, it shows the following:
[1] 1523 7


But the row count should actually be 134279407, not 1523.
I found that the value of m shown in RStudio is "raw [1:131072] 50 72 69 49 ...", and there is
a related thread in the hadoop-hdfs-user mailing list ("why can FSDataInputStream.read() only read 2^17 bytes in hadoop2.0?").
Ref.
http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201403.mbox/%3CCAGkDawm2ivCB+rNaMi1CvqpuWbQ6hWeb06YAkPmnOx=8PqbNGQ@mail.gmail.com%3E
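
Note that 131072 is exactly 2^17 (128 KiB), which matches the limit described in that thread. A quick check along these lines makes the truncation visible (a sketch; I am assuming hdfs.ls() reports file sizes in bytes in a size column):

# continuing from the snippet above, where m is the result of hdfs.read(f)
length(m)                      # 131072, i.e. exactly 2^17 bytes
hdfs.ls("/etl/rawdata")$size   # file size in bytes on HDFS (assumed 'size' column)
# the file is > 20 GB, so almost all of it was never read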

Is this a bug in hdfs.read() in rhdfs 1.0.8?
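
In the meantime, the only workaround I can see is to call hdfs.read() in a loop and stitch the chunks together. Below is a minimal sketch, assuming hdfs.read() returns NULL (or a zero-length raw vector) once the end of the file is reached; the local file path is hypothetical:

f = hdfs.file("/etl/rawdata/201202.csv", "r", buffersize = 104857600)
out = file("201202_local.csv", "wb")   # hypothetical local destination
repeat {
  m = hdfs.read(f)                     # each call returns at most 2^17 bytes
  if (is.null(m) || length(m) == 0) break
  writeBin(m, out)                     # append this chunk to the local copy
}
close(out)
hdfs.close(f)
data = read.table("201202_local.csv", sep = ",")
dim(data)                              # should now report all 134279407 rows

Going through a local file also sidesteps R's limit of 2^31 - 1 bytes on a single character vector, which rawToChar() would hit on a 20 GB file; even then, read.table() on data this size is slow, so pushing the parsing to the cluster with rmr2 is probably the better long-term answer.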

Best Regards,
James Chang
