Help on package smart_open:
NAME
smart_open
DESCRIPTION
Utilities for streaming to/from several file-like data stores: S3 / HDFS / local
filesystem / compressed files, and many more, using a simple, Pythonic API.
The streaming makes heavy use of generators and pipes to avoid loading
full file contents into memory, allowing work with arbitrarily large files.
The main functions are:
* `open()`, which opens the given file for reading/writing
* `s3_iter_bucket()`, which goes over all keys in an S3 bucket in parallel
* `register_compressor()`, which registers callbacks for transparent compressor handling
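All three are importable directly from the top-level package (see ``__all__`` below):
>>> from smart_open import open, s3_iter_bucket, register_compressor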
PACKAGE CONTENTS
bytebuffer
doctools
hdfs
http
s3
smart_open_lib
ssh
tests (package)
webhdfs
FUNCTIONS
open(uri, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None, ignore_ext=False, transport_params=None)
Open the URI object, returning a file-like object.
The URI is usually a string in a variety of formats:
1. a URI for the local filesystem: `./lines.txt`, `/home/joe/lines.txt.gz`, `file:///home/joe/lines.txt.bz2`
2. a URI for HDFS: `hdfs:///some/path/lines.txt`
3. a URI for Amazon's S3 (can also supply credentials inside the URI):
`s3://my_bucket/lines.txt`, `s3://my_aws_key_id:key_secret@my_bucket/lines.txt`
The URI may also be one of:
- an instance of the pathlib.Path class
- a stream (anything that implements io.IOBase-like functionality)
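For example, a sketch of passing a pathlib.Path and an in-memory stream
(the path and buffer contents are illustrative):
>>> import io, pathlib
>>> with open(pathlib.Path('lines.txt'), 'w') as fout:
...     _ = fout.write('first line\n')
>>> open(io.BytesIO(b'raw bytes'), 'rb').read()
b'raw bytes'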
This function supports transparent compression and decompression using the
following codecs:
- ``.gz``
- ``.bz2``
The function depends on the file extension to determine the appropriate codec.
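For example, a sketch of reading a gzip file's raw, still-compressed bytes by
opting out with the ``ignore_ext`` parameter described below (the path is illustrative):
>>> with open('archive.csv.gz', 'rb', ignore_ext=True) as fin:
...     magic = fin.read(2)  # raw gzip bytes, not decompressed text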
Parameters
----------
uri: str or object
The object to open.
mode: str, optional
Mimics built-in open parameter of the same name.
buffering: int, optional
Mimics built-in open parameter of the same name.
encoding: str, optional
Mimics built-in open parameter of the same name.
errors: str, optional
Mimics built-in open parameter of the same name.
newline: str, optional
Mimics built-in open parameter of the same name.
closefd: boolean, optional
Mimics built-in open parameter of the same name. Ignored.
opener: object, optional
Mimics built-in open parameter of the same name. Ignored.
ignore_ext: boolean, optional
Disable transparent compression/decompression based on the file extension.
transport_params: dict, optional
Additional parameters for the transport layer (see notes below).
Returns
-------
A file-like object.
Notes
-----
smart_open has several implementations for its transport layer (e.g. S3, HTTP).
Each transport layer has a different set of keyword arguments for overriding
default behavior. If you specify a keyword argument that is *not* supported
by the transport layer being used, smart_open will ignore that argument and
log a warning message.
S3 (for details, see :mod:`smart_open.s3` and :func:`smart_open.s3.open`):
buffer_size: int, optional
The buffer size to use when performing I/O.
min_part_size: int, optional
The minimum part size for multipart uploads. For writing only.
session: object, optional
The S3 session to use when working with boto3.
resource_kwargs: dict, optional
Keyword arguments to use when accessing the S3 resource for reading or writing.
multipart_upload_kwargs: dict, optional
Additional parameters to pass to boto3's initiate_multipart_upload function.
For writing only.
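For example, a sketch of passing a pre-configured boto3 session and a custom
endpoint via ``transport_params`` (the profile name and endpoint URL are hypothetical):
>>> import boto3
>>> tp = {
...     'session': boto3.Session(profile_name='my-profile'),  # hypothetical profile
...     'resource_kwargs': {'endpoint_url': 'http://localhost:9000'},  # e.g. a local S3 clone
... }
>>> with open('s3://my_bucket/my_key', 'rb', transport_params=tp) as fin:
...     header = fin.read(64)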
HTTP (for details, see :mod:`smart_open.http` and :func:`smart_open.http.open`):
kerberos: boolean, optional
If True, will attempt to use the local Kerberos credentials.
user: str, optional
The username for authenticating over HTTP.
password: str, optional
The password for authenticating over HTTP.
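For example, a sketch of HTTP basic authentication (the URL and credentials
are hypothetical):
>>> tp = {'user': 'alice', 'password': 'secret'}
>>> with open('https://example.com/protected/data.txt', transport_params=tp) as fin:
...     text = fin.read()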
WebHDFS (for details, see :mod:`smart_open.webhdfs` and :func:`smart_open.webhdfs.open`):
min_part_size: int, optional
For writing only.
SSH (for details, see :mod:`smart_open.ssh` and :func:`smart_open.ssh.open`):
mode: str, optional
The mode to use for opening the file.
host: str, optional
The hostname of the remote machine. Must not be None.
user: str, optional
The username to use to log in to the remote machine.
If None, defaults to the name of the current user.
port: int, optional
The port to connect to.
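For example, a sketch of streaming over SSH from a non-standard port (the
host, user and path are hypothetical):
>>> tp = {'port': 2222}
>>> with open('ssh://ubuntu@my-server.example.com/home/ubuntu/data.txt', transport_params=tp) as fin:
...     first = fin.readline()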
Examples
--------
>>> from smart_open import open
>>>
>>> # stream lines from an S3 object
>>> for line in open('s3://commoncrawl/robots.txt'):
... print(repr(line))
... break
'User-Agent: *\n'
>>> # stream from/to compressed files, with transparent (de)compression:
>>> for line in open('smart_open/tests/test_data/1984.txt.gz', encoding='utf-8'):
... print(repr(line))
'It was a bright cold day in April, and the clocks were striking thirteen.\n'
'Winston Smith, his chin nuzzled into his breast in an effort to escape the vile\n'
'wind, slipped quickly through the glass doors of Victory Mansions, though not\n'
'quickly enough to prevent a swirl of gritty dust from entering along with him.\n'
>>> # can use context managers too:
>>> with open('smart_open/tests/test_data/1984.txt.gz') as fin:
... with open('smart_open/tests/test_data/1984.txt.bz2', 'w') as fout:
... for line in fin:
... fout.write(line)
>>> # can use any IOBase operations, like seek
>>> with open('s3://commoncrawl/robots.txt', 'rb') as fin:
... for line in fin:
... print(repr(line.decode('utf-8')))
... break
... offset = fin.seek(0) # seek to the beginning
... print(fin.read(4))
'User-Agent: *\n'
b'User'
>>> # stream from HTTP
>>> for line in open('http://example.com/index.html'):
... print(repr(line))
... break
'<!doctype html>\n'
Other examples of URLs that ``smart_open`` accepts::
s3://my_bucket/my_key
s3://my_key:my_secret@my_bucket/my_key
s3://my_key:my_secret@my_server:my_port@my_bucket/my_key
hdfs:///path/file
hdfs://path/file
webhdfs://host:port/path/file
./local/path/file
~/local/path/file
local/path/file
./local/path/file.gz
file:///home/user/file
file:///home/user/file.bz2
[ssh|scp|sftp]://username@host//path/file
See Also
--------
- `Standard library reference <https://docs.python.org/3.7/library/functions.html#open>`__
- `smart_open README.rst <https://github.com/RaRe-Technologies/smart_open/blob/master/README.rst>`__
register_compressor(ext, callback)
Register a callback for transparently decompressing files with a specific extension.
Parameters
----------
ext: str
The extension.
callback: callable
The callback. It must accept two positional arguments, file_obj and mode.
Examples
--------
Instruct smart_open to handle files with a .xz extension using the lzma
module (see README.rst for the complete example showing I/O):
>>> def _handle_xz(file_obj, mode):
... import lzma
... return lzma.LZMAFile(filename=file_obj, mode=mode, format=lzma.FORMAT_XZ)
>>>
>>> register_compressor('.xz', _handle_xz)
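Once registered, ``open`` applies the callback transparently; a sketch, assuming
such a file exists (the path is illustrative):
>>> for line in open('example.txt.xz', encoding='utf-8'):
...     pass  # each line is decompressed on the fly by the registered callback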
s3_iter_bucket = iter_bucket(bucket_name, prefix='', accept_key=None, key_limit=None, workers=16, retries=3)
Iterate and download all S3 objects under `s3://bucket_name/prefix`.
Parameters
----------
bucket_name: str
The name of the bucket.
prefix: str, optional
Limits the iteration to keys starting with the prefix.
accept_key: callable, optional
This is a function that accepts a key name (unicode string) and
returns True/False, signalling whether the given key should be downloaded.
The default behavior is to accept all keys.
key_limit: int, optional
If specified, the iterator will stop after yielding this many results.
workers: int, optional
The number of subprocesses to use.
retries: int, optional
The number of times to retry a failed download.
Yields
------
str
The full key name (does not include the bucket name).
bytes
The full contents of the key.
Notes
-----
The keys are processed in parallel, using `workers` processes (default: 16),
to speed up downloads greatly. If multiprocessing is unavailable
(i.e. _MULTIPROCESSING is False), this parameter is ignored.
Examples
--------
>>> # get all JSON files under "mybucket/foo/"
>>> for key, content in iter_bucket(bucket_name, prefix='foo/', accept_key=lambda key: key.endswith('.json')):
...     print(key, len(content))
>>> # limit to 10k files, using 32 parallel workers (default is 16)
>>> for key, content in iter_bucket(bucket_name, key_limit=10000, workers=32):
...     print(key, len(content))
smart_open(uri, mode='rb', **kw)
Deprecated, use smart_open.open instead.
DATA
__all__ = ['open', 'smart_open', 's3_iter_bucket', 'register_compresso...
VERSION
1.8.3
FILE
/home/misha/git/smart_open/smart_open/__init__.py