Telemetry Configuration

Credentials

Apply the following settings to your fabric so that telemetry is pushed to the specified Grafana instance:

spec:
  config:
    observability:
      targets:
        loki:
          grafana_cloud:
            basicAuth:
              password: token_with_log_write_permission
              username: "1234567"
            labels:
              env: production
            url: https://logs-prod-021.grafana.net/loki/api/v1/push
        prometheus:
          grafana_cloud:
            basicAuth:
              password: token_with_metric_write_permission
              username: "1234567"
            labels:
              env: production
            url: https://prometheus-prod-36-prod-us-west-0.grafana.net/api/prom/push
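To sanity-check a token before applying the settings, it can help to see what the basicAuth username and password turn into on the wire. The following is a minimal sketch of standard HTTP Basic authentication (RFC 7617), not code from the fabric itself; the helper name basic_auth_header is hypothetical, and the values are the placeholders from the YAML above.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Build an HTTP Basic Authorization header value (RFC 7617)."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


# Placeholder values copied from the YAML example; substitute your own
# username and a token that has the required write permission.
header = basic_auth_header("1234567", "token_with_log_write_permission")
print(header.split(" ", 1)[0])  # scheme only, to avoid echoing the secret
```

The resulting value would go in the Authorization header of a request to the push URL; a rejected request usually points to a wrong username or a token without write permission.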

Tokens

Grafana Cloud manages read and write permissions with policies. To send logs and metrics to the data sources, create a policy for your realm. When creating the policy, ensure that at least the logs:write and metrics:write permissions are selected. After the policy is created, add a token to it, and ensure that the token is appropriately named and time-limited. Once the token is created, use it in the credentials YAML file. For additional details, see the documentation.

Configuration

The second YAML section controls what is pushed from the fabric to Prometheus:

spec:
  config:
    fabric:
      observability:
        agent: 
          logs: true
          metrics: true
          metricsInterval: 60
          metricsRelabel: 
          - action: keep
            regex: .*(_resource_|_interface_|_platform_|_bgp_|node_|_heartbeats_|_generation|_status).*
            sourceLabels:
            - __name__
        unix: 
          metrics: true
          metricsCollectors:
          - cpu
          - loadavg
          - meminfo
          - filesystem
          metricsInterval: 60
          metricsRelabel: 
          - action: keep
            regex: .*(_load).*
            sourceLabels:
            - __name__
          syslog: true
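The effect of the two keep rules can be previewed offline. The sketch below assumes that metricsRelabel follows Prometheus relabeling semantics, where the regex must match the whole metric name (emulated here with re.fullmatch) and a keep action drops everything that does not match. The helper name kept and the sample metric names other than fabric_agent_agent_heartbeats_total are illustrative assumptions, not a list of what the fabric actually exports.

```python
import re

# Regexes copied from the metricsRelabel entries in the YAML example.
AGENT_KEEP = r".*(_resource_|_interface_|_platform_|_bgp_|node_|_heartbeats_|_generation|_status).*"
UNIX_KEEP = r".*(_load).*"


def kept(metric_names, pattern):
    """Return the names that a `keep` rule with this regex would retain."""
    compiled = re.compile(pattern)
    # Prometheus-style relabel regexes are anchored at both ends,
    # so fullmatch is used rather than search.
    return [name for name in metric_names if compiled.fullmatch(name)]


# Illustrative metric names (assumed for the example).
agent_names = ["fabric_agent_agent_heartbeats_total", "go_goroutines"]
unix_names = ["node_load1", "node_cpu_seconds_total"]

print(kept(agent_names, AGENT_KEEP))
print(kept(unix_names, UNIX_KEEP))
```

Running a list of real metric names through a check like this before changing a regex shows which series would stop being pushed.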

Alerting

The alert rule queries the increase of the fabric_agent_agent_heartbeats_total metric. In normal operation, the switch agent sends two increments every minute. The Prometheus increase function extrapolates the value to cover the total time range, so the reported number can be higher than what was actually observed; this is not a concern. Select a value for the Alert condition according to your operational needs. The example uses a value of 3, which allows for some delays and drops before the alert fires.
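The extrapolation effect can be illustrated with a simplified model. This sketch is an approximation, not the Prometheus implementation: it scales the observed counter delta from the time covered by the samples up to the full window, and it ignores counter resets and the limits Prometheus places on extrapolation. The 5-minute window and the 60-second sample spacing are assumptions chosen for the example.

```python
def extrapolated_increase(samples, window_seconds):
    """Approximate increase() for a monotonic counter.

    samples: list of (timestamp_seconds, counter_value), oldest first,
    all taken inside the query window.
    """
    if len(samples) < 2:
        return None  # at least two samples are needed to compute a delta
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    observed_delta = v_last - v_first
    covered_seconds = t_last - t_first
    # Scale the observed delta from the covered span to the whole window.
    return observed_delta * window_seconds / covered_seconds


# Assumed: one sample every 60 s, two heartbeat increments per minute,
# queried over a 300 s window. Five samples cover only 240 s of it.
samples = [(t, 2 * (t // 60)) for t in range(0, 300, 60)]
observed = samples[-1][1] - samples[0][1]
reported = extrapolated_increase(samples, 300)
print(observed, reported)  # prints: 8 10.0
```

Under these assumptions, the samples show 8 increments while the extrapolated figure is 10, which is why the threshold should be chosen against the reported values rather than the raw heartbeat count.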

For convenience, here is the JSON used to configure this alert. Values that should be changed to match your environment contain the string "Hedgehog".

Grafana has a learning journey to assist users in creating and configuring alerts.