Limitation of the storage claim #22
Comments
Sorry, I don't understand the problem you are facing from your description. Audience claims Explicit AuthZ statements ( |
Hi @paulmillar !
Sorry, your example is not really concrete, it's actually rather abstract. For me, a concrete use-case is something like:
Each LHCb user is assigned their own (personal) subtree within the LHCb namespace at each SE. This is somewhere a user may store their personal (non-group) files without risk of clashing with similarly named files from other LHCb users.
An LHCb user with username "fbloggs" wishes to upload a file
This personal namespace is a subdirectory within a portion of the LHCb namespace used to store all LHCb users' personal subtrees. For example, the personal subtree of the user "fbloggs" at storage element SE-1 is the directory
In general, the directory for all LHCb users' personal storage (the
For example, at CERN it is
Discussion
If I understood the problem correctly (and the above use-case accurately describes the desired behaviour) then we have a few ways of supporting this.
1. LHCb adopts a common approach to storing user's data.
If LHCb (quite reasonably) would like uniform behaviour then it should lay out files in a uniform way. This would require LHCb to undergo a storage reorganisation campaign. IMHO, this is not unreasonable as other VOs have undertaken similar reorganisation campaigns.
2. Give up on the requirement that a token should work on all SEs.
It may be desirable for a token to be "tied down" so it can only work with a specific storage element. This is purely for security considerations: limiting the damage if the token is "stolen" (leaked). If this is done then the problem goes away: the token can contain the correct authorisation path for the targeted storage element.
3. Storage software support more sophisticated AuthZ mapping.
The mapping between the "authorisation path" (i.e., the path
At least in dCache, this mapping currently assumes a simple prefix: all AuthZ paths are resolved against a common base path to determine the namespace path (e.g.,
So, one solution could be for storage elements to provide a more sophisticated AuthZ-to-namespace mapping. To be more concrete, one way to do this is to support a placeholder initial path-element in the AuthZ path to represent the LHCb user-space base path. For example, the token would have a scope claim that contains
In this way, an AuthZ statement would work in a uniform way across storage elements, despite the transfer namespaces being non-uniform.
Personally, I would prefer 1. or 2. over 3. to avoid unnecessary complexity in the storage (my own self-interest ;-). However, all three options would solve the LHCb user data access without requiring any changes to the JWT profile.
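To make the third option a bit more tangible, here is a minimal Python sketch of a placeholder-aware AuthZ-to-namespace mapping. It is only an illustration under assumptions: the placeholder name `$USERSPACE`, the `SE_USER_BASE` table, the SE names and all directory values are hypothetical and do not come from dCache or from the JWT profile. The sketch is purely lexical and does not handle dot-segments or symbolic links.

```python
import posixpath

# Hypothetical placeholder name; the profile defines no such element.
PLACEHOLDER = "$USERSPACE"

# Hypothetical per-SE configuration: the namespace directory that the
# placeholder stands for at each storage element (made-up values).
SE_USER_BASE = {
    "SE-1": "/data/lhcb/user-space",
    "SE-2": "/storage/lhcb/personal",
}


def resolve_authz_path(se_name: str, authz_path: str, vo_base: str) -> str:
    """Map an AuthZ path taken from a token to a namespace path at one SE."""
    parts = [p for p in authz_path.split("/") if p]
    if parts and parts[0] == PLACEHOLDER:
        # Placeholder form: substitute the SE-specific user-space directory.
        base, rest = SE_USER_BASE[se_name], parts[1:]
    else:
        # Simple-prefix form: resolve against the common VO base path.
        base, rest = vo_base, parts
    return posixpath.join(base, *rest)
```

With this, the same AuthZ path `/$USERSPACE/fbloggs` would resolve to a different directory at each storage element, while paths without the placeholder keep the simple-prefix behaviour.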
Just to add some further thoughts: There may be other solutions to this problem that are supported by the JWT profile. The above three are just the ones I could think of :-) Of course, we can modify the JWT profile to support the LHCb use-case (in line with @chaen 's suggestion). However, I think we should first explore the above options and understand why they are not possible. Finally, the first option ("1. LHCb adopts a common approach to storing user's data") might not be too bad, at least for "regular" (non-object-store) storage. Such a move (to adopt a WLCG-wide canonical convention) could involve renaming a directory and adding a symbolic link for backwards compatibility. This could be done (more or less) atomically. Existing references would continue to work (via the sym-link).
Hi @paulmillar
1. LHCb adopts a common approach to storing user's data.
That seems like a fair suggestion, and I am totally willing to do it (I sort of had it in mind already :-)). I am not entirely sure that it solves everything though. For example, we have storages with and without user space. Also you mention
RAL is already a special case, and that alone is a reason to consider a broader scope from the beginning. In the case of a uniform namespace, the question I raise about the path definition has an answer: what goes in
2. Give up on the requirement that a token should work on all SEs.
That is a matter of trade-off between security and operational aspects. We can do a very detailed risk assessment but a token allowing a user to write to his own user space only (and not to his colleague's) is already a huge improvement compared to proxies. And we consider this a good middle ground. So we are not willing to give up on that.
3. Storage software support more sophisticated AuthZ mapping.
Your proposal of having a placeholder in the path is interesting. It basically boils down to having the logical storage element definition on the storage itself, and not in a centralized dynamic json like I mentioned. In a way, this is closer to what cloud providers (and RAL!) do: you not only write to an endpoint, but you also specify a bucket. I think that would solve a lot of issues in how we express permissions. I see 2 drawbacks (if we may even call it that):
For Third-Party-Copy you can generate two independent tokens for source and destination storage (they could be even cached and reused by FTS). Originally our FTS design used just one token for whole FTS TPC transfer (current implementation), but I always thought it would be better to use two independent tokens, because on one side it is sufficient to use |
but the source will need both tokens: it needs to receive the read token but also be able to send the write token. Only the destination needs just the write token? |
Yes, both tokens are available to the active TPC party, but only one token reaches the passive party. HTTP-TPC pull: the active party is the destination storage, which GETs data from the source (read token). A compromised token then gives write privileges to one storage only, not both as in the case of a token with read+write scope and two audiences:
{
"wlcg.ver": "1.0",
"sub": "84622f21-a31a-46b9-81b4-85e7841e1695",
"aud": "https://active-tpc-party.example.com,https://passive-tpc-party.example.com",
"nbf": 1670576459,
"scope": "storage.modify:/transfer/destination/path storage.read:/transfer/source/path",
"iss": "https://atlas-auth.web.cern.ch/",
"exp": 1670580059,
"iat": 1670576459,
"jti": "129a07b0-d59c-4fb9-bd4d-e8d602d465a0",
"client_id": "e9b44f9d-c745-4860-bedc-dbb2c9a82fba"
}
E.g. if one storage is compromised then HTTP-TPC pull mode still prevents data destruction on the rest of the distributed storage infrastructure.
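As a rough illustration of the two-token approach discussed in this comment, the following Python sketch builds one read-only claim set for the source and one write claim set for the destination, and shows which of the two the passive party would see. The function names are hypothetical, only the `aud` and `scope` claims are shown (a real token carries the other claims from the example above), and the push case is inferred from the preceding exchange rather than spelled out in it.

```python
def tpc_token_claims(source_se: str, source_path: str,
                     dest_se: str, dest_path: str):
    """Return (read_claims, write_claims) for one third-party copy.

    Each claim set targets a single audience, so a leaked token is
    only usable against one storage element.
    """
    read_claims = {"aud": source_se, "scope": f"storage.read:{source_path}"}
    write_claims = {"aud": dest_se, "scope": f"storage.modify:{dest_path}"}
    return read_claims, write_claims


def passive_party_token(mode: str, read_claims: dict, write_claims: dict) -> dict:
    """Return the only claim set that reaches the passive TPC party."""
    if mode == "pull":
        # The destination is active and GETs from the source:
        # the source only ever sees the read token.
        return read_claims
    if mode == "push":
        # The source is active and PUTs to the destination:
        # the destination only ever sees the write token.
        return write_claims
    raise ValueError(f"unknown TPC mode: {mode!r}")
```

In pull mode, a compromised source would therefore hold nothing more than a read token for its own data.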
Hi all, Though historically, the VO probably needed to know and include such prefixes more often than not, with tokens we have an opportunity to do away with such encumbrances. During the transition period, legacy methods (X509 + VOMS) should continue being able to use legacy (i.e. full) paths, while for tokens, each SE should be configured to add VO prefixes implicitly. W.r.t. targeting multiple SEs with a single token, it is true that any damage from an abused token can be limited to the path(s) explicitly listed. However, in the case at hand, a VO probably would not want user directories to be created on a tape SE! The VO data management client SW should not propose such SEs to random users, of course, but we would like it better if such attempted operations would be guaranteed to fail. Unfortunately this clashes with the wish for leading directories to be created automatically as needed. As a compromise, we could forbid such automatic creations in the root directory of the VO, i.e. at least the primary leading directory (e.g. "/users") should already exist... |
Do you expect that all storage endpoints should always provide path starting with VO prefix, e.g. Second issue with tokens that can create e.g. |
Hi Petr, On your second point, we would need to discuss with the IAM (and CILogon) devs to see if support for such restrictions can be implemented without too much hassle. And if so, decide if such usage is going to be a sustainable way for a typical VO to limit what can be done on particular sets of SEs. For example, the VO might need to list many SE names explicitly for various use cases: not nice... It would be easier if there were support for prohibitions as well. For example, |
Let's say there are three storage systems that provide storage capacity to ATLAS. They each provide this storage under different paths: In general, there is a VO-specific prefix (e.g., I'm assuming that if a storage system provides storage capacity for other VOs then they do so under a different path (for example, |
Hi Paul, |
While we may aspire to do better, the fact that VOs already handle per-SE prefixes means this is a solved problem. In other words, this isn't a show-stopper.
Yes. This is simply resolving the AuthZ path relative to per-VO prefix. In other words:
Currently (with VOMS and tokens), the request path reflects the "real" path within the storage system; for the
IIRC, the JWT profile doesn't mention any changes to the request path (the If I've understood your comment, you are suggesting that the request path should avoid the prefix. Following this idea, the above example curl command would be:
I think this only makes sense if it is done for all token-based requests (including requests made with tokens that have no explicit AuthZ statements). Currently, services are free to ignore explicit AuthZ statements (using group-membership instead). Therefore, we would only get consistent behaviour if this mapping were done for all token-based requests.
As it happens, dCache allows the admin to configure a per-user, chroot-like path. When this feature is enabled, all requests from that user are interpreted relative to some prefix (with no possibility to escape). If so configured, an ATLAS user could upload a file targeting the
The dCache instance may be configured so that CMS users experience a different chroot path. Therefore,
So, this is possible with dCache. However, I don't know if this would be supported by all storage systems. Also, it would require some kind of campaign, where this change is enacted SE-per-SE, with corresponding changes in the VO catalogues: a non-trivial effort. Therefore, I think this is something that would require more discussion.
This is also rather tangential. It is (perhaps) a nice-to-have feature, but I don't see this as a show-stopper. We could do this after AuthZ tokens are in use.
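The chroot-like behaviour described above can be sketched in a few lines of Python. This is not dCache's implementation, just an illustration of the idea under assumptions: the `VO_ROOT` table and its paths are hypothetical, and the check is purely lexical (symbolic links are not considered).

```python
import posixpath

# Hypothetical per-VO chroot prefixes at one storage element.
VO_ROOT = {
    "atlas": "/pnfs/example.org/atlas",
    "cms": "/pnfs/example.org/cms",
}


def to_namespace_path(vo: str, request_path: str) -> str:
    """Interpret a request path relative to the VO's chroot-like prefix."""
    root = VO_ROOT[vo]
    # Normalising as an absolute path first means ".." segments are
    # collapsed at "/" and can never climb above the VO prefix.
    relative = posixpath.normpath("/" + request_path).lstrip("/")
    return posixpath.join(root, relative) if relative else root
```

The same request path then lands in a different subtree depending on the VO, and a request such as `/../atlas/secret` made by a CMS user stays inside the CMS prefix.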
Well, technically you're right: this is possible. We can choose any (self-consistent) mapping between AuthN namespace and the filesystem namespace. However, a couple of things you might like to consider.
Expanding on the first point, all VO-related information (group-membership, SE AuthZ, CE AuthZ, etc) would need the token to identify for which VO they apply. Currently the VO is clear: a token issued by the ATLAS OP asserts ATLAS-related information (the asserted groups are ATLAS groups, the asserted AuthZ statements apply to ATLAS resources, etc). If we have multi-VO tokens then a token could include information (group-membership, AuthZ statements) about CMS and about ALICE and DUNE and ... To my mind, this breaks the JWT profile, as it's currently written: the document provides no standard way for a token to describe for which VO a particular group or AuthZ statement applies. Interestingly, the group-membership syntax coming from AARC (AARC-G02, IIRC) includes a group namespace concept. This would allow a single token to assert both CMS and ATLAS group-membership. If we really want to support multi-VO tokens then adopting the AARC group-membership representation would be one possible solution. That still leaves the AuthZ statements, which would need to identify for which VO's resources the AuthZ statement applies. We can do this in an ad-hoc fashion (e.g., by adding the VO in the AuthZ path), but a better approach might be to reconsider the syntax and provide a more general solution: one that could work for computing resource AuthZ statements, too. So, yes, we could add the VO name in the AuthZ path. However, we should think really really hard before doing so.
Coming back to this after a long time, and after discussions with multiple people (e.g. Rucio/IAM devs) and other experiments. Whether we like it or not, namespaces will not be uniform across storages (again confirmed by discussions during the DC workshop):
So the assumption that the namespace is uniform does not hold, and neither DIRAC nor Rucio can rely on it. After discussing with Rucio/IAM people, I would like to propose a much lighter approach to solve that. What DIRAC/Rucio store in the config is
"DESY-SCRATCH"
{
"prefix" : "/pnfs/desy.de/atlas/scratch/"
}
"CERN-SCRATCH"
{
"prefix" : "/dpm/cern.ch/home/atlas/anythinggoes/"
}
Currently, if I want to read an LFN called
For DESY "storage.read:/scratch/mylfn.root"
For CERN "storage.read:/anythinggoes/mylfn.root"
so basically you need to put as path
Not only is
At the storage level, if I understand properly, what is done is just to make sure that the issuer is allowed operation in
From the semantic point of view, for DIRAC/Rucio, prefix or voroot are meaningless. What we want is to act on a given LFN stored in a SCRATCH space, be it specifically at DESY, or specifically at CERN, or in any of the two. You could express it like
and if you specifically want CERN or DESY, you can use the audience to restrict it. And this is the addition I would like to propose. It's not a change, it's an addition. Support these area aliases like
This would make it very easy to support non-uniform namespaces, without breaking the existing functionality. From the storage point of view, the parsing would not be very different from what is already implemented. What is required though is the propagation of the information of what
The best place would be a
"CERN":
{
"SCRATCH" : "/dpm/cern.ch/home/atlas/anythinggoes/",
"MC": "/dpm/cern.ch/home/atlas/mcdata/",
"USER": "/dpm/cern.ch/home/atlas/user/",
},
"DESY":
{
"SCRATCH" : "/pnfs/desy.de/atlas/scratch/",
"MC": "/pnfs/desy.de/atlas/mcdata/"
}
I believe that this is a minor addition, but one that solves a lot of issues. cc: @giacomini, @bari12, @dchristidis
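To illustrate how a storage element might use such an alias table, here is a minimal Python sketch. The table reuses the example values above; the function name, the choice of passing the area name and the relative path as separate arguments, and the `None` return for an unknown area are all assumptions made for the illustration, not part of the proposal or of any existing implementation.

```python
import posixpath
from typing import Optional

# Alias table using the example values from the comment above.
AREAS = {
    "CERN": {
        "SCRATCH": "/dpm/cern.ch/home/atlas/anythinggoes/",
        "MC": "/dpm/cern.ch/home/atlas/mcdata/",
        "USER": "/dpm/cern.ch/home/atlas/user/",
    },
    "DESY": {
        "SCRATCH": "/pnfs/desy.de/atlas/scratch/",
        "MC": "/pnfs/desy.de/atlas/mcdata/",
    },
}


def resolve_area_path(se: str, area: str, relative_path: str) -> Optional[str]:
    """Translate (area alias, relative path) into a physical path at one SE.

    Returns None when the SE does not define the area, so the request
    can be refused instead of silently landing somewhere else.
    """
    prefix = AREAS[se].get(area)
    if prefix is None:
        return None
    return posixpath.join(prefix, relative_path.lstrip("/"))
```

The same (area, path) pair then resolves correctly at both sites, and an area that a site does not offer (here, USER at DESY) simply yields no path.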
I'm probably missing something important but why isn't anything under the VO namespace (so after let's say Trying to impose things like $HOME, $TMP, etc on top of a VO prefix seems like unnecessary complications that shouldn't be in the profile (if a VO wants to do that for its own namespace part that's not up to us of course). |
The token profile comes 15 years after experiments started writing data. As a matter of fact, discrepancies between SEs exist, and they won't be fixed |
We have discussed the matter between IAM/StoRM developers and StoRM deployers and our preference would be that the scope path prefix-matches the path in the URL, without any assumptions about the VO-based mapping, i.e. if you want to GET On the other hand, I'm not sure I understand the proposal of using the aliases.
Depending on the answers, I may have also further doubts about the distribution mechanism of the aliases via the token issuer. |
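The prefix-matching preference stated at the start of this comment could look roughly like the Python sketch below. It is an assumption-laden illustration, not StoRM's code: the function name and the example host are hypothetical, matching is done on whole path components, and requests containing `..` are simply refused rather than normalised.

```python
from urllib.parse import urlsplit


def scope_allows(scope_path: str, url: str) -> bool:
    """Check that the scope path is a component-wise prefix of the URL path."""
    request = [p for p in urlsplit(url).path.split("/") if p]
    allowed = [p for p in scope_path.split("/") if p]
    if ".." in request:
        # Refuse dot-segments instead of guessing how the server resolves them.
        return False
    # Comparing whole components keeps "/atlas/scratch" from matching
    # "/atlas/scratch-private".
    return request[:len(allowed)] == allowed
```

No VO-based mapping is involved: the path in the scope is compared directly with the path in the request URL.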
Can you share your concern with having an extra well-known endpoint in the token issuer ? We could also have this well-known hosted somewhere else (DIRAC or Rucio), providing something in the token-issuer redirects to it. |
let me bring up that issue again, as the problem is still there :-) |
This GitHub issue aims to illustrate the limitations we would face with the current profile when it comes to storage.
Extract from the current profile
These are just snippets of the JWT profile in its current form, on which I base my comments below
Audience statements:
common-jwt-profile/profile.md
Lines 328 to 333 in 84b8e2a
common-jwt-profile/profile.md
Line 339 in 84b8e2a
Path statements:
common-jwt-profile/profile.md
Line 468 in 84b8e2a
Base path definition
common-jwt-profile/profile.md
Lines 776 to 791 in 84b8e2a
Use case
Each LHCb user is given a logical grid directory in which he is allowed to upload files. It is of the form /lhcb/user/c/chaen. This directory exists on multiple sites.
The User StorageElements (SE) are defined as:
A user file uploaded as the LFN /lhcb/user/c/chaen/toto.xml on the various StorageElements would end up in
I want to issue a token that will allow the bearer to write on ALL the USER SEs
Problems
Problem with audience
The audience can in theory be a list. But if we have to list all the endpoints in the audience, the token will be huge, and we will very quickly reach the maximum header size.
Problem with the path definition
What is an absolute path ?
If the path is absolute from the storage point of view, the path should be the full physical path (i.e. /eos/lhcb/prod/userarea/lhcb/user/c/chaen/ or /pnfs/gridka.de/lhcb/userarea/lhcb/user/c/chaen/), but in that case it is valid for one audience only.
The only sensible thing to do is to use the LFN as path: /lhcb/user/c/chaen/
But in that case, we end up with the problem of the base path definition mentioned just below.
Problems with base path definition
From a site point of view, each VO is given a base path in which the experiment can write and do anything.
For example /eos/lhcb/ at CERN, /pnfs/gridka.de/lhcb at GRIDKA. This root directory is reflected in the configuration at the site for token validation.
However, from the experiment point of view, this root path is often further divided.
This division can be related to the activity, for example:
/pnfs/gridka.de/lhcb/userarea/ for user data
/pnfs/gridka.de/lhcb/prodarea/ for centrally produced data
Or it can even be divided by the various instances of the middleware, like the production and the certification instance of DIRAC:
/eos/lhcb/prod/userarea/ for user's data on the DIRAC production instance
/eos/lhcb/certification/userarea/ for the DIRAC test instance.
The second case could be discarded by simply saying that each DIRAC instance should have a separate IAM instance (fair enough).
But the first case is legitimate and poses a problem. If my token contains path:/lhcb/user/c/chaen and the storage knows only /pnfs/gridka.de/lhcb as base path, the permission obviously does not point to the directory I would like it to.
Proposal
Replace the storage.*:/path approach with something along these lines
Contrary to what is written in
common-jwt-profile/profile.md
Line 339 in 84b8e2a
user readable names are good because shorter (solve the header length issue). But indeed, you need a lookup table. You could imagine having this lookup table in a json file in the .well-known directory of the IAM instance in question (e.g. /.well-known/storage_definition.json)
The storage_definition.json could contain entries like:
This allows even more generic tokens like "do anything in your user area and read all the production data"
And of course, the storage can make sure that each path defined in the storage_definition.json file starts with the base_path it has configured.
Note that the SE definitions do not often change, so the storage systems can do aggressive caching of the definitions to validate tokens faster.
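The base-path check mentioned above could be sketched as follows in Python. The layout of `storage_definition.json` is not fixed here, so the sketch assumes (hypothetically) that the storage has already extracted a flat mapping from a definition name to a path for its own endpoint; the function name and the sample names are likewise invented for the illustration.

```python
import posixpath


def outside_base_path(definitions: dict, base_path: str) -> list:
    """Return the names of definitions whose path is not under base_path."""
    base = posixpath.normpath(base_path)
    prefix = base.rstrip("/") + "/"
    rejected = []
    for name, path in definitions.items():
        # Normalise first so "/eos/lhcb/../atlas" is not accepted by a
        # naive string comparison.
        normalised = posixpath.normpath(path)
        if normalised != base and not normalised.startswith(prefix):
            rejected.append(name)
    return rejected
```

A storage element could run such a check each time it refreshes its cached copy of the definitions and ignore (or log) any entry that is reported.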
Extra bonus with this format: a token allowing third party copy between two SE could be directly handed to FTS (so no token exchange needed from its side) and would look like