Introduction to Open Source Observability

Tags: Technologies

We often need to monitor how deployed services behave, especially right after a software update. Simpler applications can get by with basic log monitoring, perhaps using an ad-hoc tool or a dashboard that calculates real-time metrics about what is happening.

 

Services that handle more sophisticated workflows and contain multiple black boxes along their execution path call for tools that measure what is happening with minimal impact on performance. An observability framework or suite addresses this need.

 

Datadog is an excellent solution that offers a wide range of capabilities for monitoring multiple dimensions of our software: traces, logs, metrics, alerts, performance, and profiling, among others. Depending on the actual need and the available budget, however, a solution like Datadog may not be the best fit, especially if its full potential would go unused. This is where open-source tools come in: they provide many of the features of popular observability solutions at a lower cost.

 

Scope

The objective of this article is to present a setup with functional examples in JavaScript/TypeScript or Python at the application layer, along with the proposed technology stack, covering the following capabilities:

 

  • Logs and log search
  • Traces
  • Manual and automatic instrumentation
  • Metrics reporting

 

Open Source Suite

 

Grafana

Grafana is an open-source platform for data visualization and monitoring that allows users to analyze and view time-series data. It supports integrations through data sources, enabling teams to monitor systems, applications, servers, databases, and business metrics using interactive dashboards.

 

Tempo

Grafana Tempo is an open-source backend for distributed tracing. It allows you to send, store, and search traces, that is, the execution records of an instrumented application. Examples of traces include:

 

  • The execution of an API request
  • Calls to multiple backend services
  • Database queries

 

A trace lets you see how long each service took, how many errors occurred, which services were involved, and any logs recorded along the way.
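
To make the idea concrete, here is a minimal sketch of a trace as a tree of spans, using only the Python standard library. The `Span` class, the `summarize` helper, and the sample data are hypothetical illustrations, not Tempo's storage format or the OpenTelemetry data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """One timed operation inside a trace (simplified, hypothetical model)."""
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    service: str
    name: str
    start_ms: int
    end_ms: int
    error: bool = False

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def summarize(spans: list) -> dict:
    """Answer the questions a trace view answers: how long, how many errors, which services."""
    root = next(span for span in spans if span.parent_id is None)
    return {
        "root": root.name,
        "duration_ms": root.duration_ms,
        "errors": sum(1 for span in spans if span.error),
        "services": sorted({span.service for span in spans}),
    }


# Made-up trace: an API request that calls a backend service, which queries a database.
example_trace = [
    Span("a1", None, "api", "GET /orders", 0, 120),
    Span("b2", "a1", "orders", "load_orders", 10, 90),
    Span("c3", "b2", "postgres", "SELECT orders", 20, 70, error=True),
]

print(summarize(example_trace))
```

Each span points to its parent, which is what lets a tracing UI draw the nested timeline of a request.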

 

Loki

Grafana Loki is an open-source log aggregation system. It collects and stores application logs and indexes only their labels rather than the full log text, which makes it cheaper to run and query than traditional full-text logging systems.
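
Logs are easier to search once they are structured. The following is a minimal sketch, using only the Python standard library, of emitting one JSON object per log line. The `JsonFormatter` class, the `build_logger` helper, and the `user_id` field are assumptions made for illustration; how the lines reach Loki (for example through a log-shipping agent) is outside this sketch.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line (illustrative sketch)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Hypothetical extra field, supplied through logging's `extra` argument.
        user_id = getattr(record, "user_id", None)
        if user_id is not None:
            payload["user_id"] = user_id
        return json.dumps(payload)


def build_logger(name: str = "my-service") -> logging.Logger:
    """Create a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger()
logger.info("order created", extra={"user_id": 42})
```

Assuming the log stream carries a `service_name` label, a LogQL query along the lines of `{service_name="my-service"} | json | user_id="42"` would then narrow the logs to a single user's activity.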

 

Prometheus

Prometheus is an open-source toolkit for monitoring and alerting. It collects and stores standard metrics, such as CPU usage, memory consumption, request count, error rate, and latency, as well as custom business metrics, such as frustration measurements, retry counts, or interaction time for a specific service.
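
Prometheus collects metrics by scraping an HTTP endpoint that returns them in a plain-text exposition format. The sketch below renders a custom retry counter in that format using only the standard library. `LabeledCounter` and the metric name `job_retries_total` are hypothetical; in a real service an official Prometheus client library would normally do this work, including the label escaping this sketch omits.

```python
from collections import defaultdict


class LabeledCounter:
    """A tiny in-memory counter with labels (illustration only)."""

    def __init__(self, name: str, help_text: str) -> None:
        self.name = name
        self.help_text = help_text
        self._values = defaultdict(float)

    def inc(self, amount: float = 1.0, **labels: str) -> None:
        # Sort the labels so the same label set always maps to the same series.
        key = tuple(sorted((k, str(v)) for k, v in labels.items()))
        self._values[key] += amount

    def render(self) -> str:
        """Return the counter in the Prometheus text exposition format."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key, value in sorted(self._values.items()):
            label_text = ",".join(f'{k}="{v}"' for k, v in key)
            suffix = f"{{{label_text}}}" if label_text else ""
            lines.append(f"{self.name}{suffix} {value}")
        return "\n".join(lines) + "\n"


retries = LabeledCounter("job_retries_total", "Number of retried jobs.")
retries.inc(service="billing")
retries.inc(service="billing")
retries.inc(service="search")
print(retries.render())
```

Serving this text from a `/metrics` route is the conventional way to let Prometheus scrape it.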

 

Instrumenting Code

Before an application can report data about its execution, telemetry must be implemented. This functionality enables the measurement, collection, and communication of such data. Once it is in place, our code is instrumented.

There are two types of instrumentation: manual and automatic. Automatic instrumentation relies on third-party libraries that report telemetry about their own core functionality; in an MVC framework, for example, this could cover the execution of each endpoint. Manual instrumentation is designed and written deliberately by the team, for example to cover a batch file processing workflow. Below are examples of automatic instrumentation in Python and JavaScript:

 

# manage.py
import os
import sys

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.django import DjangoInstrumentor
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor


def main():
    # Identify this service in every exported span.
    resource = Resource(attributes={SERVICE_NAME: "my-service"})
    # Send spans over OTLP/gRPC to the address in OPEN_TELEMETRY_HOST.
    otlp_exporter = OTLPSpanExporter(
        endpoint=os.environ.get("OPEN_TELEMETRY_HOST"),
    )
    provider = TracerProvider(resource=resource)
    # Export spans in background batches rather than one at a time.
    processor = BatchSpanProcessor(otlp_exporter)
    provider.add_span_processor(processor)
    trace.set_tracer_provider(provider)
    # Patch Django so that each request produces a span automatically.
    DjangoInstrumentor().instrument()

    from django.core.management import execute_from_command_line

    execute_from_command_line(sys.argv)


if __name__ == "__main__":
    main()

# wsgi.py
from django.core.wsgi import get_wsgi_application
from opentelemetry.instrumentation.wsgi import OpenTelemetryMiddleware

application = get_wsgi_application()
# Wrap the WSGI app so that requests served through it are traced.
application = OpenTelemetryMiddleware(application)

// instrumentation.js
import { trace } from '@opentelemetry/api';
import { registerInstrumentations } from '@opentelemetry/instrumentation';
import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node';
import { SimpleSpanProcessor } from '@opentelemetry/sdk-trace-base';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-proto';
import { ATTR_SERVICE_NAME } from '@opentelemetry/semantic-conventions';
import { ExpressInstrumentation } from '@opentelemetry/instrumentation-express';
import { HttpInstrumentation } from '@opentelemetry/instrumentation-http';
import { resourceFromAttributes } from '@opentelemetry/resources';

// Name reported with every span and used to look up the tracer.
const serviceName = 'express';

export const setupTelemetry = () => {
  // With no options, the exporter falls back to its default OTLP settings.
  const exporter = new OTLPTraceExporter({});
  const provider = new NodeTracerProvider({
    resource: resourceFromAttributes({
      [ATTR_SERVICE_NAME]: serviceName,
    }),
    // SimpleSpanProcessor exports each span as soon as it ends.
    spanProcessors: [new SimpleSpanProcessor(exporter)],
  });
  registerInstrumentations({
    tracerProvider: provider,
    instrumentations: [
      new HttpInstrumentation(),
      new ExpressInstrumentation(),
    ],
  });
  provider.register();
  return trace.getTracer(serviceName);
};

// app.js
// Call setupTelemetry() before express is loaded so the instrumentation can patch it.
// This ordering holds when the code is compiled to CommonJS; native ES modules hoist static imports.
import { setupTelemetry } from './instrumentation';
setupTelemetry();
import * as express from 'express';
const app = express();
...

 

How does it work?

After installing and configuring each service and instrumenting the application, the only remaining step is to add Tempo, Loki, and Prometheus as data sources in Grafana. As the application runs, each trace, log, and metric is exported or collected, processed, and then made available in Grafana for observation and analysis.
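
Data sources can be added through the Grafana UI or through its HTTP API. The sketch below builds the API requests with the Python standard library. The host names (`tempo`, `loki`, `prometheus`), the ports (each tool's usual default), and the `build_request` helper are assumptions for illustration; adjust them to the actual deployment.

```python
import json
import urllib.request

# Assumed service addresses; the ports are each tool's usual default.
DATA_SOURCES = [
    {"name": "Tempo", "type": "tempo", "url": "http://tempo:3200", "access": "proxy"},
    {"name": "Loki", "type": "loki", "url": "http://loki:3100", "access": "proxy"},
    {"name": "Prometheus", "type": "prometheus", "url": "http://prometheus:9090", "access": "proxy"},
]


def build_request(grafana_url: str, api_token: str, data_source: dict) -> urllib.request.Request:
    """Build (but do not send) a request that creates one Grafana data source."""
    return urllib.request.Request(
        url=f"{grafana_url.rstrip('/')}/api/datasources",
        data=json.dumps(data_source).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_token}",
        },
    )


# Hypothetical Grafana address and token.
requests_to_send = [
    build_request("http://localhost:3000", "my-api-token", source)
    for source in DATA_SOURCES
]
```

Passing each request to `urllib.request.urlopen` would send it; that call is left out here so the sketch runs without a live Grafana instance.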

 

Result

Once the application has been configured as described, every process we have instrumented is fully traceable. We can connect traces to logs and understand what is happening in each of the workflows being executed.

 

A common use case is to develop a new feature, instrument it, and deploy it to production. Within seconds, Grafana “drilldown” dashboards or custom dashboards show how the new feature is performing.

 

A trace of a service with multiple components is shown.

 

Transparency into the execution of complex applications is useful not only for an IT team: it also lets any interested area see what is happening in real time. The data collected from different sources creates opportunities to iterate on software development with evidence and feedback about what to prioritize and which areas need improvement. Finally, software design that consistently takes instrumentation into account fosters a culture of awareness within the team, making as much information visible as possible by default rather than as an afterthought.

 

Logs filtered by user activity.