[#5492] feat(hadoop-catalog): Support Azure blob storage for Gravitino server and GVFS Java client #5508
Conversation
public class ABSFileSystemProvider implements FileSystemProvider {

    private static final String ABS_PROVIDER_SCHEME = "wasbs";
What's the difference between wasb and wasbs?
Paths with the wasbs prefix use SSL in transport, while wasb does not. It's advisable to use wasbs in production environments.
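The difference can be illustrated with a small sketch (the container and account names below are hypothetical, for illustration only):

```java
public class WasbSchemeExample {
    public static void main(String[] args) {
        // Hypothetical container/account names. The two URI forms differ only
        // in the scheme: wasbs negotiates TLS for transport, wasb does not.
        String plain  = "wasb://mycontainer@myaccount.blob.core.windows.net/data/file.txt";
        String secure = "wasbs://mycontainer@myaccount.blob.core.windows.net/data/file.txt";
        System.out.println(plain.substring(0, plain.indexOf("://")));   // wasb
        System.out.println(secure.substring(0, secure.indexOf("://"))); // wasbs
    }
}
```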
Azure provides several storage services, like wasb and ADLS Gen2; can you please investigate further?
In the early days, Azure only supported blob storage, and the corresponding protocol was wasb or wasbs. As time went by, Azure added support for ADLS, which supports directory management like HDFS to serve big-data ecosystems, and its protocol is abfs or abfss (abfss is the secure enhancement of abfs).
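As a sketch of the distinction (names below are hypothetical): wasb/wasbs address the older Blob endpoint, while abfs/abfss address the ADLS Gen2 dfs endpoint.

```java
public class AbfsEndpointExample {
    public static void main(String[] args) {
        // Hypothetical names; abfs/abfss target the ADLS Gen2 "dfs" endpoint,
        // while wasb/wasbs target the older "blob" endpoint.
        String abfssPath = "abfss://mycontainer@myaccount.dfs.core.windows.net/warehouse/table";
        String wasbsPath = "wasbs://mycontainer@myaccount.blob.core.windows.net/warehouse/table";
        System.out.println(abfssPath.contains(".dfs.core.windows.net"));  // true
        System.out.println(wasbsPath.contains(".blob.core.windows.net")); // true
    }
}
```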
Another point is that the Hadoop version we use is 3.1, which only supports wasb; we need to upgrade to 3.3 to support this feature. I have created an issue about the version upgrade.
@jerryshao Please help to verify whether we need to update the Hadoop version to 3.3 or only use 3.3 for hadoop-azure (I'm afraid there will be some problems if we use Hadoop Common 3.1 with hadoop-azure 3.3).
@yuqi1129 As far as I know, a Hadoop 3.1 environment with hadoop-azure 3.2.1 can run normally (the abfs/abfss protocol was first supported in hadoop-azure 3.2.0). However, whether hadoop-azure 3.3 is feasible needs further testing.
I see. I'm not completely confident that Hadoop Common 3.1 is compatible with hadoop-azure 3.2. Compared to using Hadoop 3.2 or above for both, would you suggest trying Hadoop 3.1 with hadoop-azure 3.2?
I have no particular preference for this.
In terms of compatibility, our current production environment is using Hadoop 3.1 and Hadoop-Azure 3.2.1, which have been running for more than a year, and no compatibility issues have been found.
If we consider upgrading the Hadoop version, can you consider upgrading the Hadoop version of the entire project to a relatively stable version during this upgrade process? Currently, Hive still uses Hadoop 2.7. If we upgrade to a higher version, it may cause greater differences in the future.
You'd better also update the doc in this PR; don't separate it into several PRs.
It's blob storage, not "block storage"; please update the title and description.
public class AzureFileSystemProvider implements FileSystemProvider {

    @VisibleForTesting public static final String ABS_PROVIDER_SCHEME = "abfss";
Iceberg uses this protocol; however, wasbs is also used by other software such as Dremio.
Let's support this first, then we can add more support later on.
So if we support wasb later on, can we still use this provider, or shall we create another one?
I plan to introduce a scheme alias method to support multiple protocols in one provider:

Set<String> schemeAlias() {
  return ImmutableSet.of("wasb", "wasbs", "abfs");
}

If the scheme of a path is in this set, the path will also use this provider.
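A minimal, runnable sketch of that alias idea (the class, method names, and alias contents here are hypothetical, not the final API):

```java
import java.util.Set;

public class SchemeAliasSketch {
    // Hypothetical alias set; the real provider might expose this via
    // a schemeAlias() method on FileSystemProvider.
    static final Set<String> SCHEME_ALIASES = Set.of("wasb", "wasbs", "abfs", "abfss");

    // A path is routed to this provider when its scheme is in the alias set.
    static boolean handles(String path) {
        int idx = path.indexOf("://");
        return idx > 0 && SCHEME_ALIASES.contains(path.substring(0, idx));
    }

    public static void main(String[] args) {
        System.out.println(handles("wasbs://c@a.blob.core.windows.net/f")); // true
        System.out.println(handles("s3a://bucket/f"));                      // false
    }
}
```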
public static final String GRAVITINO_ABS_ACCOUNT_NAME = "abs-account-name";

// The account key of the Azure Blob Storage.
public static final String GRAVITINO_ABS_ACCOUNT_KEY = "abs-account-key";
So this is similar to AKSK in Azure Blob Storage?
Yes. Azure Blob Storage also supports two other authentication mechanisms, which are more complex than the current one.
Can you please make sure this auth is a mainstream auth for ABS, and also widely used by other systems?
Yes. Like AKSK, it is the most widely used by users and applications; others like SAS Token and Azure Active Directory (AAD) are quite complicated and hard to configure.
import org.junit.jupiter.api.condition.EnabledIf;
import org.junit.platform.commons.util.StringUtils;

@EnabledIf("absEnabled")
Maybe we should also unify the name for abs here as before.
Besides, can we move this test to the azure-bundle (also for other storages)?
Besides, can we move this test to the azure-bundle (also for other storages)?
Yes, it can be. I have tried it, but it seems to introduce a lot of dependencies to the bundle, so I plan to use a separate issue #5565 to solve it wholly.
...org/apache/gravitino/filesystem/hadoop/integration/test/GravitinoVirtualFileSystemABSIT.java
Please revisit your code and doc carefully to avoid any errors.
@VisibleForTesting public static final String ABS_PROVIDER_SCHEME = "abfss";

@VisibleForTesting public static final String ABS_PROVIDER_NAME = "abfs";
Would it be better to change this to "abs"? abfs is a scheme/protocol name; the service name should be abs, right?
You bring in several concepts, like abfs, abfss, and abs; I think you should clearly separate them and define them correctly.
Would it be better to change this to "abs"? abfs is a scheme/protocol name; the service name should be abs, right?
I took it.
if (config.containsKey(ABSProperties.GRAVITINO_ABS_ACCOUNT_NAME)
    && config.containsKey(ABSProperties.GRAVITINO_ABS_ACCOUNT_KEY)) {
  hadoopConfMap.put(
      String.format(
          "fs.azure.account.key.%s.dfs.core.windows.net",
          config.get(ABSProperties.GRAVITINO_ABS_ACCOUNT_NAME)),
      config.get(ABSProperties.GRAVITINO_ABS_ACCOUNT_KEY));
}
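For reference, a sketch of the hadoop-azure property this mapping produces (the account name and key below are placeholders, not real credentials):

```java
public class AzureAccountKeySketch {
    public static void main(String[] args) {
        // Placeholder values, not real credentials.
        String accountName = "myaccount";
        String accountKey = "base64-encoded-secret";
        // Mirrors the mapping above: the Gravitino abs-account-name property
        // is interpolated into the per-account hadoop-azure key property.
        String hadoopKey =
            String.format("fs.azure.account.key.%s.dfs.core.windows.net", accountName);
        System.out.println(hadoopKey + " = " + accountKey);
    }
}
```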
Can abfs work without these two configurations? If not, I think we should throw an exception here.
If we let users pass other Azure configurations here, I think we should not throw exceptions here. For example, users can set SAS Token and Azure Active Directory (AAD) related configuration through the bypass mechanism, though I'm not 100% sure whether users can successfully do so.
LGTM.
What changes were proposed in this pull request?
Add support for Azure Blob Storage for the Gravitino server and the GVFS Java client.
Why are the changes needed?
It's a big improvement for fileset usage.
Fix: #5492
Does this PR introduce any user-facing change?
N/A
How was this patch tested?
ITs