Commit 5188c97

Author: Alexander Gorban
A script to generate a text file with FSNS URLs.
1 parent 39c59d1 commit 5188c97

3 files changed: +1340 -4 lines changed
street/README.md

+9 -4
@@ -79,6 +79,7 @@ Note that these datasets are very large. The approximate sizes are:
 * Validation: 64 files of 40MB each.
 * Test: 64 files of 50MB each.
 * Testdata: some smaller data files of a few MB for testing.
+* Total: ~158 Gb.
 
 Here is a list of the download paths:
 
@@ -99,9 +100,14 @@ https://download.tensorflow.org/data/fsns-20160927/validation/validation-00000-o
 https://download.tensorflow.org/data/fsns-20160927/validation/validation-00063-of-00064
 ```
 
-The above files need to be downloaded individually, as they are large and
-downloads are more likely to succeed with the individual files than with a
-single archive containing them all.
+All URLs are stored in the text file `python/fsns_urls.txt`, to download them in
+parallel:
+
+```
+aria2c -c -j 20 -i fsns_urls.txt
+```
+If you ctrl+c and re-execute the command it will continue the aborted download.
+
 
 ## Confidence Tests
 
@@ -256,4 +262,3 @@ defines a Tensor Flow graph that can be used to process images of variable sizes
 to output a 1-dimensional sequence, like a transcription/OCR problem, or a
 0-dimensional label, as for image identification problems. For more information
 see [vgslspecs](g3doc/vgslspecs.md)
-

street/python/fsns_urls.py

+49
@@ -0,0 +1,49 @@
+# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+
+"""Creates a text file with URLs to download FSNS dataset using aria2c.
+
+The FSNS dataset has 640 files and takes 158Gb of the disk space. So it is
+highly recommended to use some kind of a download manager to download it.
+
+Aria2c is a powerful download manager which can download multiple files in
+parallel, re-try if encounter an error and continue previously unfinished
+downloads.
+"""
+
+import os
+
+_FSNS_BASE_URL = 'http://download.tensorflow.org/data/fsns-20160927/'
+_SHARDS = {'test': 64, 'train': 512, 'validation':64}
+_OUTPUT_FILE = "fsns_urls.txt"
+_OUTPUT_DIR = "data/fsns"
+
+def fsns_paths():
+  paths = ['charset_size=134.txt']
+  for name, shards in _SHARDS.items():
+    for i in range(shards):
+      paths.append('%s/%s-%05d-of-%05d' % (name, name, i, shards))
+  return paths
+
+
+if __name__ == "__main__":
+  with open(_OUTPUT_FILE, "w") as f:
+    for path in fsns_paths():
+      url = _FSNS_BASE_URL + path
+      dst_path = os.path.join(_OUTPUT_DIR, path)
+      f.write("%s\n out=%s\n" % (url, dst_path))
+  print("To download FSNS dataset execute:")
+  print("aria2c -c -j 20 -i %s" % _OUTPUT_FILE)
+  print("The downloaded FSNS dataset will be stored under %s" % _OUTPUT_DIR)
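
Each entry the script writes is a URL line followed by an indented `out=` option line, which is the layout aria2c reads from a file passed with `-i`. The sketch below reproduces that generation step in memory so the entries can be inspected without writing a file. The helper `aria2_entries` and its parameters are hypothetical and not part of the commit; the base URL, shard counts and path pattern are copied from the diff above.

```python
import posixpath

# Values copied from the commit; the helper names below are illustrative only.
FSNS_BASE_URL = 'http://download.tensorflow.org/data/fsns-20160927/'
SHARDS = {'test': 64, 'train': 512, 'validation': 64}


def fsns_paths(shards=SHARDS):
  """Relative paths of the charset file plus every shard of every split."""
  paths = ['charset_size=134.txt']
  for name, count in shards.items():
    for i in range(count):
      paths.append('%s/%s-%05d-of-%05d' % (name, name, i, count))
  return paths


def aria2_entries(base_url=FSNS_BASE_URL, out_dir='data/fsns'):
  """Hypothetical helper: builds one aria2c input-file entry per path."""
  entries = []
  for path in fsns_paths():
    url = base_url + path
    dst_path = posixpath.join(out_dir, path)
    # URL on the first line, then an indented option line naming the output.
    entries.append('%s\n out=%s\n' % (url, dst_path))
  return entries
```

Under these assumptions the list holds 641 entries: the 640 shards mentioned in the docstring (64 + 512 + 64) plus the charset file.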
