[Bug] Foreign table (COPY FROM) can't skip lines containing invalid multi-byte encoded text #1425

@gfphoenix78

Apache Cloudberry version

main branch

What happened

create external web table t3(a int, b text)
LOCATION ('http://<ip>:<port>/bad_gb.txt')
FORMAT 'TEXT' (DELIMITER ',' NULL '') ENCODING 'GB18030'
LOG ERRORS SEGMENT REJECT LIMIT 2;
select * from t3;

output:

gpadmin=# select * from t3;
ERROR:  segment reject limit reached, aborting operation  (seg0 slice1 127.0.1.1:7002 pid=2316762)
DETAIL:  Last error was: invalid byte sequence for encoding "GB18030": 0xa3 0x0a
CONTEXT:  External table t3, line 3 of file http://.../bad_gb.txt

Contents of bad_gb.txt (encoded in GB18030):

gpadmin@hashdata:/tmp/www$ hexdump -C bad_gb.txt
00000000  31 2c ca c0 bd e7 0a 32  2c c4 e3 ba c3 c2 f0 a3  |1,.....2,.......|
00000010  0a 33 2c 6e 69 68 61 6f  0a                       |.3,nihao.|
00000019

What you think should happen instead

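Only the second line is invalid: it ends with a lone byte 0xa3 immediately before the newline. The first and third lines should still be returned according to the table definition, with the bad line skipped.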
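
To make the expectation concrete, here is a minimal Python sketch of per-line error handling applied to the exact bytes from the hexdump in this report. It is only a model of the expected behavior, not Cloudberry's implementation: `load_rows` and its `reject_limit` parameter are hypothetical names, and the sketch assumes that the load aborts once the number of rejected lines reaches the limit.

```python
# Bytes copied from the hexdump of bad_gb.txt in this report.
DATA = bytes.fromhex(
    "312ccac0bde70a"        # line 1: 1,<two valid GB18030 characters>
    "322cc4e3bac3c2f0a30a"  # line 2: ends with a lone lead byte 0xa3
    "332c6e6968616f0a"      # line 3: 3,nihao
)


def load_rows(data, encoding="gb18030", reject_limit=2):
    """Hypothetical model: decode line by line and skip undecodable lines."""
    rows, rejected = [], []
    # Splitting on the newline byte before decoding is safe here because
    # 0x0a is never a trailing byte of a GB18030 multi-byte character.
    for lineno, raw in enumerate(data.splitlines(), start=1):
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError as exc:
            rejected.append((lineno, str(exc)))
            # Assumption: abort once the reject count reaches the limit.
            if len(rejected) >= reject_limit:
                raise RuntimeError("segment reject limit reached") from exc
            continue
        a, b = text.split(",", 1)
        rows.append((int(a), b))
    return rows, rejected


rows, rejected = load_rows(DATA)
print(rows)                               # rows from lines 1 and 3
print([lineno for lineno, _ in rejected])  # [2]
```

Under these assumptions only line 2 is rejected, so a reject limit of 2 is not reached and rows 1 and 3 are returned.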

How to reproduce

To reproduce, replace `<ip>` and `<port>` below with the address of an HTTP server that serves bad_gb.txt:

create external web table t3(a int, b text)
LOCATION ('http://<ip>:<port>/bad_gb.txt')
FORMAT 'TEXT' (DELIMITER ',' NULL '') ENCODING 'GB18030'
LOG ERRORS SEGMENT REJECT LIMIT 2;
select * from t3;

-- or, using COPY FROM on a local file
create temp table t0(a int, b text);
-- copy the file bad_gb.txt to /tmp/www
copy t0 from '/tmp/www/bad_gb.txt' with(encoding 'gb18030') log errors segment reject limit 2;

output:

gpadmin=# select * from t3;
ERROR:  segment reject limit reached, aborting operation  (seg0 slice1 127.0.1.1:7002 pid=2316762)
DETAIL:  Last error was: invalid byte sequence for encoding "GB18030": 0xa3 0x0a
CONTEXT:  External table t3, line 3 of file http://.../bad_gb.txt

-- or

gpadmin=# copy t0 from '/tmp/www/bad_gb.txt' with(encoding 'gb18030') log errors segment reject limit 2;
ERROR:  segment reject limit reached, aborting operation
DETAIL:  Last error was: invalid byte sequence for encoding "GB18030": 0xa3 0x0a, column a
CONTEXT:  COPY t0, line 2, column a: "1,世界"

Contents of bad_gb.txt (encoded in GB18030):

gpadmin@hashdata:/tmp/www$ hexdump -C bad_gb.txt
00000000  31 2c ca c0 bd e7 0a 32  2c c4 e3 ba c3 c2 f0 a3  |1,.....2,.......|
00000010  0a 33 2c 6e 69 68 61 6f  0a                       |.3,nihao.|
00000019
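
For anyone who wants to recreate the test file, the following Python sketch writes the 25 bytes shown in the hexdump above. The helper name `write_bad_gb` and the use of the system temporary directory are illustrative choices, not part of the original report; adjust the path to wherever your HTTP server or COPY command reads from.

```python
import os
import tempfile

# Bytes copied from the hexdump above (0x19 = 25 bytes in total).
BAD_GB = bytes.fromhex(
    "312ccac0bde70a"        # line 1: valid GB18030
    "322cc4e3bac3c2f0a30a"  # line 2: lone lead byte 0xa3 before the newline
    "332c6e6968616f0a"      # line 3: plain ASCII
)


def write_bad_gb(path):
    """Write the sample file and return its size in bytes (hypothetical helper)."""
    with open(path, "wb") as f:
        f.write(BAD_GB)
    return os.path.getsize(path)


if __name__ == "__main__":
    target = os.path.join(tempfile.gettempdir(), "bad_gb.txt")
    print(target, write_bad_gb(target))
```

For the external web table case, the directory containing the file can then be served with any static HTTP server, for example `python3 -m http.server <port>` run from that directory.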

Operating System

ubuntu 22.04

Anything else

No response

Are you willing to submit PR?

  • Yes, I am willing to submit a PR!
