The OpenTelemetry Collector offers multiple ways to measure its health as well as investigate data ingestion issues.
When submitting a support case, it is helpful to provide diagnostic information about your cluster to Red Hat Support.
You can use the oc adm must-gather tool to gather diagnostic data for resources of various types, such as OpenTelemetryCollector, Instrumentation, and the created resources like Deployment, Pod, or ConfigMap.
The oc adm must-gather tool creates a new pod that collects this data.
From the directory where you want to save the collected data, run the oc adm must-gather command to collect the data:
$ oc adm must-gather --image=ghcr.io/open-telemetry/opentelemetry-operator/must-gather -- \
/usr/bin/must-gather --operator-namespace <operator_namespace> (1)
(1) The default namespace where the Operator is installed is openshift-opentelemetry-operator.
Verify that the new directory is created and contains the collected data.
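For example, assuming the tool used its default naming convention, a new must-gather.local.<random_suffix> directory is created in the current working directory, and you can inspect its contents with:
$ ls -R must-gather.local.*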
You can get the logs for the OpenTelemetry Collector as follows.
Set the relevant log level in the OpenTelemetryCollector custom resource (CR):
config: |
  service:
    telemetry:
      logs:
        level: debug (1)
(1) The Collector’s log level. Supported values include info, warn, error, or debug. Defaults to info.
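For reference, the following minimal sketch shows where these lines sit in a complete OpenTelemetryCollector CR. The resource name is a placeholder, and only the telemetry section is shown; a working config also defines receivers, exporters, and pipelines.
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: <cr_name>
spec:
  config: |
    service:
      telemetry:
        logs:
          level: debug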
Use the oc logs command or the web console to retrieve the logs.
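For example, assuming <collector_pod> and <namespace> are the name and namespace of your Collector pod, you can run the following command:
$ oc logs <collector_pod> -n <namespace>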
The OpenTelemetry Collector exposes metrics about the data volumes it has processed. The following metrics are for spans, although similar metrics are exposed for the metrics and logs signals:
otelcol_receiver_accepted_spans
The number of spans successfully pushed into the pipeline.
otelcol_receiver_refused_spans
The number of spans that could not be pushed into the pipeline.
otelcol_exporter_sent_spans
The number of spans successfully sent to the destination.
otelcol_exporter_enqueue_failed_spans
The number of spans that failed to be added to the sending queue.
The Operator creates a <cr_name>-collector-monitoring telemetry service that you can use to scrape the metrics endpoint.
Enable the telemetry service by adding the following lines in the OpenTelemetryCollector custom resource (CR):
# ...
config: |
  service:
    telemetry:
      metrics:
        address: ":8888" (1)
# ...
(1) The address at which the internal Collector metrics are exposed. Defaults to :8888.
Retrieve the metrics by port-forwarding the Collector pod:
$ oc port-forward <collector_pod> 8888:8888
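With the port forwarded, you can query the metrics endpoint from another terminal. The /metrics path and port 8888 match the default internal telemetry address shown above:
$ curl -s http://localhost:8888/metrics | grep -E 'otelcol_(receiver|exporter)_'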
In the OpenTelemetryCollector CR, set the enableMetrics field to true to scrape internal metrics:
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
spec:
  # ...
  mode: deployment
  observability:
    metrics:
      enableMetrics: true
  # ...
Depending on the deployment mode of the OpenTelemetry Collector, the internal metrics are scraped by using PodMonitors or ServiceMonitors.
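For example, you can verify that the corresponding monitor resource was created in the Collector's namespace; the namespace is a placeholder:
$ oc get servicemonitors,podmonitors -n <namespace>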
On the Observe page in the web console, enable User Workload Monitoring to visualize the scraped metrics.
Not all processors expose the required metrics.
In the web console, go to Observe → Dashboards and select the OpenTelemetry Collector dashboard from the drop-down list to view it.
You can filter the visualized data such as spans or metrics by the Collector instance, namespace, or OpenTelemetry components such as processors, receivers, or exporters.
You can configure the Debug Exporter to export the collected data to the standard output.
Configure the OpenTelemetryCollector custom resource as follows:
config: |
  exporters:
    debug:
      verbosity: detailed
  service:
    pipelines:
      traces:
        exporters: [debug]
      metrics:
        exporters: [debug]
      logs:
        exporters: [debug]
Use the oc logs command or the web console to view the data that the Debug Exporter writes to the standard output.
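For example, assuming the deployment mode and the Operator's usual naming convention of a <cr_name>-collector deployment, you can follow the Collector output with:
$ oc logs deployment/<cr_name>-collector -n <namespace> --follow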
You can debug the traffic between your observability components by visualizing it with the Network Observability Operator.
You have installed the Network Observability Operator as explained in "Installing the Network Observability Operator".
In the OpenShift Container Platform web console, go to Observe → Network Traffic → Topology.
Select Namespace to filter the workloads by the namespace in which your OpenTelemetry Collector is deployed.
Use the network traffic visuals to troubleshoot possible issues. See "Observing the network traffic from the Topology view" for more details.
To troubleshoot the instrumentation, look for any of the following issues:
Issues with instrumentation injection into your workload
Issues with data generation by the instrumentation libraries
To troubleshoot instrumentation injection, you can perform the following activities:
Checking if the Instrumentation object was created
Checking if the init-container started
Checking if the resources were deployed in the correct order
Searching for errors in the Operator logs
Double-checking the pod annotations
Run the following command to verify that the Instrumentation object was successfully created:
$ oc get instrumentation -n <workload_project> (1)
(1) The namespace where the instrumentation was created.
Run the following command to verify that the opentelemetry-auto-instrumentation init-container successfully started, which is a prerequisite for instrumentation injection into workloads:
$ oc get events -n <workload_project> (1)
(1) The namespace where the instrumentation is injected for workloads.
... Created container opentelemetry-auto-instrumentation
... Started container opentelemetry-auto-instrumentation
Verify that the resources were deployed in the correct order for the auto-instrumentation to work correctly. The correct order is to deploy the Instrumentation custom resource (CR) before the application. For information about the Instrumentation CR, see the section "Configuring the instrumentation".
When the pod starts, the Red Hat build of OpenTelemetry Operator checks for the Instrumentation custom resource. If it does not exist at that point, the instrumentation is not injected into the pod.
Fixing the order of deployment requires the following steps:
Update the instrumentation settings.
Delete the instrumentation object.
Redeploy the application.
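For example, assuming the updated Instrumentation settings are saved in instrumentation.yaml and the application runs as a Deployment, these steps can be performed with the oc CLI as follows; the file and resource names are placeholders:
$ oc delete instrumentation <instrumentation_name> -n <workload_project>
$ oc apply -f instrumentation.yaml -n <workload_project>
$ oc rollout restart deployment/<application_deployment> -n <workload_project>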
Run the following command to inspect the Operator logs for instrumentation errors:
$ oc logs -l app.kubernetes.io/name=opentelemetry-operator --container manager -n openshift-opentelemetry-operator --follow
Troubleshoot the pod annotations for the instrumentation of a specific programming language. See the required annotation fields and values in "Configuring the instrumentation".
Verify that the application pods that you are instrumenting carry the correct annotations and that the appropriate auto-instrumentation settings have been applied.
instrumentation.opentelemetry.io/inject-python="true"
$ oc get pods -n <workload_project> -o jsonpath='{range .items[?(@.metadata.annotations["instrumentation.opentelemetry.io/inject-python"]=="true")]}{.metadata.name}{"\n"}{end}'
Verify that the annotation applied to the instrumentation object is correct for the programming language that you are instrumenting.
If there are multiple instrumentations in the same namespace, specify the name of the Instrumentation object in their annotations.
instrumentation.opentelemetry.io/inject-nodejs: "<instrumentation_object>"
If the Instrumentation object is in a different namespace, specify the namespace in the annotation.
instrumentation.opentelemetry.io/inject-nodejs: "<other_namespace>/<instrumentation_object>"
Verify that the OpenTelemetryCollector custom resource specifies the auto-instrumentation annotations under spec.template.metadata.annotations. If the auto-instrumentation annotations are in spec.metadata.annotations instead, move them into spec.template.metadata.annotations.
You can troubleshoot telemetry data generation by the instrumentation libraries by checking the endpoint, looking for errors in your application logs, and verifying that the Collector is receiving the telemetry data.
Verify that the instrumentation is transmitting data to the correct endpoint:
$ oc get instrumentation <instrumentation_name> -n <workload_project> -o jsonpath='{.spec.endpoint}'
The default endpoint http://localhost:4317 for the Instrumentation object is only applicable to a Collector instance that is deployed as a sidecar in your application pod. If you are using an incorrect endpoint, correct it by editing the Instrumentation object and redeploying your application.
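For example, if the Collector runs as a standalone deployment rather than a sidecar, the endpoint typically points at the Collector service. The service name <cr_name>-collector and the OTLP gRPC port 4317 follow the Operator's usual conventions but are assumptions here; adjust them to your environment:
$ oc edit instrumentation <instrumentation_name> -n <workload_project>
Then set the endpoint, for example:
spec:
  endpoint: http://<cr_name>-collector.<collector_namespace>.svc.cluster.local:4317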
Inspect your application logs for error messages that might indicate that the instrumentation is malfunctioning:
$ oc logs <application_pod> -n <workload_project>
If the application logs contain error messages that indicate that the instrumentation might be malfunctioning, install the OpenTelemetry SDK and libraries locally. Then run your application locally and troubleshoot for issues between the instrumentation libraries and your application without OpenShift Container Platform.
Use the Debug Exporter to verify that the telemetry data is reaching the destination OpenTelemetry Collector instance. For more information, see "Debug Exporter".