Embulk file input plugin that reads files stored on Microsoft Azure Blob Storage.

embulk-input-azure_blob_storage v0.2.0+ requires Embulk v0.9.12+.
- Plugin type: file input
- Resume supported: no
- Cleanup supported: yes
First, create an Azure Storage Account.
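Then install the plugin with Embulk's standard gem installer; assuming the published gem name matches the plugin name above, that is `embulk gem install embulk-input-azure_blob_storage`.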
- account_name: storage account name (string, required)
- account_key: primary access key (string, required)
- container: name of the container where the data is stored (string, required)
- path_prefix: prefix of target keys (string, required)
- incremental: enables incremental loading (boolean, optional, default: true). If incremental loading is enabled, the config diff for the next execution will include a last_path parameter so that the next execution skips files before that path. Otherwise, last_path will not be included. See the config diff sketch after this list.
- path_match_pattern: regexp to match file paths. If a file path doesn't match this pattern, the file will be skipped (regexp string, optional)
- total_file_count_limit: maximum number of files to read (integer, optional)
- proxy:
  - type: proxy type (string, required, default: null)
    - http: use HTTP proxy
  - host: proxy host (string, required)
  - port: proxy port (int, optional, default: 8080)
  - user: proxy user name (string, optional)
  - password: proxy password (string, optional)
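As a sketch of the incremental loading behavior (the path below is a made-up placeholder), a config diff saved with Embulk's `-c`/`--config-diff` option records the last processed path:

```yaml
# diff.yml, written by e.g.: embulk run config.yml -c diff.yml
in:
  last_path: logs/csv-20150127.csv.gz
out: {}
```

On the next run with this diff, files whose paths sort at or before `last_path` are skipped.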
Minimal example:

```yaml
in:
  type: azure_blob_storage
  account_name: myaccount
  account_key: myaccount_key
  container: my-container
  path_prefix: logs/csv-
```
Example for "sample_01.csv.gz", generated by `embulk example`:
```yaml
in:
  type: azure_blob_storage
  account_name: myaccount
  account_key: myaccount_key
  container: my-container
  path_prefix: logs/csv-
  decoders:
  - {type: gzip}
  parser:
    charset: UTF-8
    newline: CRLF
    type: csv
    delimiter: ','
    quote: '"'
    header_line: true
    columns:
    - {name: id, type: long}
    - {name: account, type: long}
    - {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
    - {name: purchase, type: timestamp, format: '%Y%m%d'}
    - {name: comment, type: string}
out: {type: stdout}
```
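With a config file like the one above (e.g. saved as `config.yml`), the usual Embulk commands apply: `embulk preview config.yml` to check the parsed records and `embulk run config.yml` to run the load.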
To filter files using regexp:
```yaml
in:
  type: azure_blob_storage
  path_prefix: logs/csv-
  ...
  path_match_pattern: \.csv$   # a file will be skipped if its path doesn't match this pattern

  ## some examples of regexp:
  #path_match_pattern: /archive/            # match files in an .../archive/... directory
  #path_match_pattern: /data1/|/data2/      # match files in .../data1/... or .../data2/...
  #path_match_pattern: \.csv$|\.csv\.gz$    # match files whose suffix is .csv or .csv.gz
```
With proxy:

```yaml
in:
  type: azure_blob_storage
  ...
  proxy:
    type: http
    host: proxy_host
    port: 8080
    user: proxy_user
    password: proxy_secret_pass
```
To build the gem and run the tests:

```
$ ./gradlew gem   # -t to watch change of files and rebuild continuously
$ ./gradlew test  # -t to watch change of files and rebuild continuously
```
To run the unit tests, the following environment variables need to be configured, and test fixture files need to be uploaded to an existing Azure Blob Storage container beforehand. When the environment variables are not set, some test cases are skipped.
```
AZURE_ACCOUNT_NAME
AZURE_ACCOUNT_KEY
AZURE_CONTAINER
AZURE_CONTAINER_IMPORT_DIRECTORY (optional, if needed)
```
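In a plain shell session, exporting the variables before running the tests is enough (the values below are placeholders; replace them with your own credentials):

```
$ export AZURE_ACCOUNT_NAME=my-account-name
$ export AZURE_ACCOUNT_KEY=my-account-key
$ export AZURE_CONTAINER=my-container
$ export AZURE_CONTAINER_IMPORT_DIRECTORY=unittests
$ ./gradlew test
```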
If you're using Mac OS X El Capitan and GUI applications (such as an IDE), you can set the variables via a launchd agent, as follows.
```
$ vi ~/Library/LaunchAgents/environment.plist
```

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>my.startup</string>
  <key>ProgramArguments</key>
  <array>
    <string>sh</string>
    <string>-c</string>
    <string>
      launchctl setenv AZURE_ACCOUNT_NAME my-account-name
      launchctl setenv AZURE_ACCOUNT_KEY my-account-key
      launchctl setenv AZURE_CONTAINER my-container
      launchctl setenv AZURE_CONTAINER_IMPORT_DIRECTORY unittests
    </string>
  </array>
  <key>RunAtLoad</key>
  <true/>
</dict>
</plist>
```
```
$ launchctl load ~/Library/LaunchAgents/environment.plist
$ launchctl getenv AZURE_ACCOUNT_NAME   # check that the value has been set
```
Then start your applications.