Closed
Description
The problem is that the default DD configuration for mysql, django (cache), defaultdb, requests, and potentially other libraries sends each library's spans to a shared service of the same name, and that shared service is used across all of our IDAs.
Problems with this default approach:
- It is difficult to determine which span data belongs to which IDA/service.
- There is some metric data that may be impossible to separate between IDAs/services.
Acceptance Criteria
- Decide on and build a Proof of Concept for a service naming scheme that associates other spans (e.g. mysql, django (cache), defaultdb, requests, kafka, etc.) with the edxapp service.
- Ensure spans do not have multiple service tags (may require SRE support)
- Ensure we've addressed the complete list of dependent services.
- Ensure the primary operation of each edxapp service remains as it is (e.g. `django.requests`, etc.).
- Document (possibly in a wiki ADR) why we made the choices we made, and communicate to other service owners. Ideally they can participate in ADR review, so we get a consistent solution across all of our services.
- Implement the decided-upon service naming scheme for edxapp
- Approach: Inferred Services
- Feature switch: feat: Support for Datadog Inferred Services at the library level (configuration#122)
- stage, edge, prod
- Just make it default to true and remove the per-env settings
- devstack (`timmc/datadog-local-testing` is updated)
- Offer to other teams
- Convert ADR into DD docs, explaining how to apply the POC approach to other services
- Make available in Helm Charts
- Configure DD agent in k8s
- Add config option in django-ida helm chart, defaulting false for now: https://github.com/edx/helm-charts/pull/170
- Available now with django-ida helm chart 0.10.0
- Write up a recommendation for other teams
- Test in k8s
- Announce broadly as a recommendation to other teams
- File ticket for cleanup: Cleanup after global service names rollout #973
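A rough sketch of what the Inferred Services switch could look like as settings, assuming the variable names Datadog documents for inferred services and for dropping per-integration service names. These names are not copied from configuration#122 or the helm chart PR, so verify them against those changes before relying on this. The function name and the tracer/agent grouping are hypothetical.

```python
def inferred_services_env(enabled: bool = True) -> dict[str, dict[str, str]]:
    """Return env vars for the tracer (app) and the DD agent.

    Assumption: these four DD_* names are the documented switches; the
    grouping into "tracer" and "agent" is only for illustration.
    """
    flag = "true" if enabled else "false"
    return {
        # Set on the application process (library level).
        "tracer": {
            "DD_TRACE_PEER_SERVICE_DEFAULTS_ENABLED": flag,
            "DD_TRACE_REMOVE_INTEGRATION_SERVICE_NAMES_ENABLED": flag,
        },
        # Set on the Datadog agent (e.g. the agent running in k8s).
        "agent": {
            "DD_APM_COMPUTE_STATS_BY_SPAN_KIND": flag,
            "DD_APM_PEER_TAGS_AGGREGATION": flag,
        },
    }


# Defaulting to true mirrors the "just make it default to true" item above.
settings = inferred_services_env()
```

Keeping the flag as a single parameter makes it easy to default to true everywhere while still allowing a per-chart override.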
Notes:
- DD support ticket https://help.datadoghq.com/hc/en-us/requests/1781912 asks a variety of questions about this topic, specifically targeting django (cache) and requests services. Questions and answers should ultimately be copied out of this ticket once it is complete.
- DD support ticket https://help.datadoghq.com/hc/en-us/requests/1762193 asks similar questions about mysql, but has some additional information about metrics and multiple service tags (which should be avoided). Questions and answers should ultimately be copied out of this ticket once it is complete.
- DD support ticket https://help.datadoghq.com/hc/en-us/requests/1873842 summarizes some of the above as well as other options we've considered, and asks for guidance.
- There is an open (support) question about whether using `DD_SERVICE_MAPPING` is a simpler method of remapping, rather than using a variety of different settings.
  - However, remapping `django` to `cache` might be risky, because `django` could be used for something else in the future, so in this particular case we might want to use `DD_DJANGO_CACHE_SERVICE_NAME` instead.
- For each service (e.g. mysql, django, etc.) we decide to remap, we need to choose between the IDA service `service:edx-edxapp-lms` (spans would still have a different operation_name), or `service:edx-edxapp-lms-cache` (a new DD service catalog service).
  - If everything went to the same service, we may need to adjust the primary operation_name for `edx-edxapp-lms` if it no longer defaults to `operation_name:django.requests`. See these DD docs for configuring the primary operation.
- If we go with separate DD service names:
  - We would probably use the following naming convention:
    - `service:edx-edxapp-lms`
    - `service:edx-edxapp-lms-cache` (was `django`)
    - `service:edx-edxapp-lms-defaultdb`
    - Etc.
  - Unfortunately, I chose to go with `service:edx-edxapp-lms-workers` rather than `service:edx-edxapp-workers-lms`.
    - We might need to change this, because presumably we'd want `service:edx-edxapp-workers-lms-cache`, etc., and if we used a search like `service:edx-edxapp-lms*` to pick up all services related to `service:edx-edxapp-lms`, we would not want that to also pick up the worker service and all its sub-services.
    - Should we do this as a separate, pre-emptive ticket? We could use expand/contract to ensure monitors (listed in DD) and dashboards, etc. are updated before the name is changed. Communications will be required.
- ADR should state that we are rejecting the status quo of a shared DD service across all of our IDAs.
- What is the full list of services to address? DD has a dependency graph somewhere, and other possible services may include `elasticsearch`, `redis`, and `read_replicadb`. Please review in DD to get the full list of affected dependencies. Note that some general shared services (e.g. aws.s3) probably should remain shared, but this could be discussed.
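To make the naming notes above concrete, here is a minimal sketch of the separate-service-names option and of the wildcard concern with the worker service name. The helper names are hypothetical. `fnmatchcase` is only a stand-in for a DD wildcard search, and the exact list of libraries to remap is an assumption, not a decision.

```python
from fnmatch import fnmatchcase

BASE_SERVICE = "edx-edxapp-lms"


def dependency_service(base: str, dependency: str) -> str:
    """Name a dependency's DD service after the owning IDA."""
    return f"{base}-{dependency}"


def remap_env(base: str, generic_services: list[str]) -> dict[str, str]:
    """Build env vars that move shared default services under one IDA."""
    mapping = ",".join(
        f"{name}:{dependency_service(base, name)}" for name in generic_services
    )
    return {
        "DD_SERVICE": base,
        # Dedicated setting for the Django cache, rather than remapping the
        # generic `django` name, which could be used for something else later.
        "DD_DJANGO_CACHE_SERVICE_NAME": dependency_service(base, "cache"),
        # Comma-separated from:to pairs.
        "DD_SERVICE_MAPPING": mapping,
    }


env = remap_env(BASE_SERVICE, ["defaultdb", "requests"])

# Wildcard concern: a prefix search for the LMS also matches the worker
# service when it is named edx-edxapp-lms-workers.
services = [
    "edx-edxapp-lms",
    "edx-edxapp-lms-cache",
    "edx-edxapp-lms-workers",  # current worker name
    "edx-edxapp-workers-lms",  # alternative worker name
]
matches = [s for s in services if fnmatchcase(s, "edx-edxapp-lms*")]
```

With the current worker name, `matches` includes `edx-edxapp-lms-workers`; with the alternative `edx-edxapp-workers-lms` it would not, which is the argument for renaming before adding sub-services.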
ADRish page: https://2u-internal.atlassian.net/wiki/spaces/ENG/pages/1265598591/Datadog+Service+mapping