1 Thanos Sidecar component

1.1 prometheus-rbac

Create the corresponding RBAC permission declarations:

# prometheus-rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: kube-mon
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
  - apiGroups:
      - ""
    resources:
      - nodes
      - services
      - endpoints
      - pods
      - nodes/proxy
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - "extensions"
      - "networking.k8s.io"
    resources:
      - ingresses
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - configmaps
      - nodes/metrics
    verbs:
      - get
  - nonResourceURLs:
      - /metrics
    verbs:
      - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: prometheus
    namespace: kube-mon

Create the kube-mon namespace and apply the RBAC manifest:

root@master:~/thanos# kubectl create ns kube-mon
root@master:~/thanos# kubectl apply -f prometheus-rbac.yaml

1.2 prometheus-config

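Next, deploy the Prometheus configuration. The ConfigMap below is a template for the Prometheus configuration file: the Thanos sidecar reads it and renders the actual configuration file from it, and the Prometheus container in the same Pod then reads that rendered file. Adding external_labels to the configuration is essential, because Querier relies on these labels to deduplicate data: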

# prometheus-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: kube-mon
data:
  prometheus.yaml.tmpl: | # note: the key name here is prometheus.yaml.tmpl
    global:
      scrape_interval: 15s
      scrape_timeout: 15s
      external_labels:
        cluster: ydzs-test
        replica: $(POD_NAME)  # a unique label value for each Prometheus replica

    rule_files:  # alerting rule files
    - /etc/prometheus/rules/*rules.yaml

    alerting:
      alert_relabel_configs:  # drop the replica label so alerts from different replicas are deduplicated too
      - regex: replica
        action: labeldrop
      alertmanagers:
      - scheme: http
        path_prefix: /
        static_configs:
        - targets: ['alertmanager:9093']

    # scrape targets
    scrape_configs:
    - job_name: 'prometheus'
      static_configs:
        - targets: ['localhost:9090']

    - job_name: 'coredns'
      static_configs:
        - targets: ['10.96.0.10:9153']

root@master:~/thanos# kubectl apply -f prometheus-config.yaml
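To make the `$(POD_NAME)` placeholder in the template concrete, here is a minimal Python sketch of the kind of substitution the sidecar performs when it renders `prometheus.yaml.tmpl`. It is a simplified model, assuming only the `$(NAME)` syntax is expanded and that an unset variable is an error; it is not Thanos's actual Go implementation.

```python
import re

# Matches $(NAME) placeholders such as $(POD_NAME); assumed syntax for this sketch.
PLACEHOLDER = re.compile(r"\$\(([A-Za-z_][A-Za-z0-9_]*)\)")


def render_template(text: str, env: dict) -> str:
    """Replace every $(NAME) in text with env[NAME]; raise KeyError if NAME is unset."""
    def substitute(match):
        name = match.group(1)
        if name not in env:
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return PLACEHOLDER.sub(substitute, text)


if __name__ == "__main__":
    template = "external_labels:\n  cluster: ydzs-test\n  replica: $(POD_NAME)\n"
    # In the Pod, POD_NAME comes from the Downward API (metadata.name).
    print(render_template(template, {"POD_NAME": "prometheus-0"}))
```

Each replica renders the same template with its own Pod name, which is what gives every Prometheus instance a distinct `replica` label.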

1.3 prometheus-rules

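The rule file path is configured above. To keep the main configuration from growing too large, the alerting rules are split into a separate ConfigMap; new rules can simply be added to this ConfigMap later. The example below defines one alerting rule: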

# prometheus-rules.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  namespace: kube-mon
data:
  alert-rules.yaml: |-
    groups:
      - name: K8sObjects_Alerts
        rules:
        - alert: Deployment_Replicas_0
          expr: |
            sum(kube_deployment_status_replicas) by (deployment, namespace) < 1
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: Deployment {{$labels.deployment}} of {{$labels.namespace}} is currently having no pods running
            description: Has no pods running in Deployment {{$labels.deployment}} of {{$labels.namespace}}, you can describe to get events, or get replicas status.

root@master:~/thanos# kubectl apply -f prometheus-rules.yaml
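The rule expression `sum(kube_deployment_status_replicas) by (deployment, namespace) < 1` acts as a filter: it keeps only the (deployment, namespace) groups whose summed replica count is 0. The Python sketch below mimics that evaluation on made-up series; it ignores `for: 1m` (which additionally requires the condition to hold for one minute before the alert fires) and is an illustration, not PromQL's engine. Note that `kube_deployment_status_replicas` is exported by kube-state-metrics, which the scrape configuration above does not scrape yet.

```python
from collections import defaultdict


def deployments_without_replicas(samples):
    """Emulate `sum(kube_deployment_status_replicas) by (deployment, namespace) < 1`.

    samples: list of (labels_dict, value) pairs, one per series.
    Returns {(deployment, namespace): summed_value} for groups whose sum is < 1.
    """
    totals = defaultdict(float)
    for labels, value in samples:
        # `by (deployment, namespace)` keeps only these two labels when grouping.
        key = (labels["deployment"], labels["namespace"])
        totals[key] += value
    # The `< 1` comparison filters: only groups that fail the check survive.
    return {key: total for key, total in totals.items() if total < 1}


if __name__ == "__main__":
    # Hypothetical series; the extra "instance" label is dropped by the aggregation.
    samples = [
        ({"deployment": "web", "namespace": "default", "instance": "ksm-0"}, 2),
        ({"deployment": "api", "namespace": "default", "instance": "ksm-0"}, 0),
        ({"deployment": "coredns", "namespace": "kube-system", "instance": "ksm-0"}, 2),
    ]
    print(deployments_without_replicas(samples))
```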

1.4 thanos-sidecar

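Thanos integrates with an existing Prometheus through the Sidecar, which backs up Prometheus data to object storage. So first, Prometheus and the Sidecar must be deployed in the same Pod, and Prometheus must be started with the following two flags: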

--web.enable-admin-api: allows the Thanos Sidecar to fetch metadata from Prometheus.
--web.enable-lifecycle: allows the Thanos Sidecar to reload the Prometheus configuration and rule files, so no manual reload is needed.

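Because Prometheus cuts a TSDB block only every 2 hours by default, Prometheus still cannot be fully stateless: if it crashes and restarts, we could lose about 2 hours of metrics. Persisting Prometheus data is therefore still strongly recommended, so we manage it with a StatefulSet and declare a PVC template for the data via volumeClaimTemplates: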

# thanos-sidecar.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prometheus
  namespace: kube-mon
  labels:
    app: prometheus
spec:
  serviceName: prometheus
  replicas: 2
  selector:
    matchLabels:
      app: prometheus
      thanos-store-api: "true"        # marks this Pod as implementing the Store API, so Querier can query it through that API later
  template:
    metadata:
      labels:
        app: prometheus
        thanos-store-api: "true"
    spec:
      serviceAccountName: prometheus    # use the ServiceAccount created above
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                topologyKey: kubernetes.io/hostname
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - prometheus
      volumes:
        - name: prometheus-config
          configMap:
            name: prometheus-config
        - name: prometheus-rules
          configMap:
            name: prometheus-rules
        - name: prometheus-config-shared
          emptyDir: {}
      # fix ownership of the Prometheus data directory
      initContainers:
        - name: fix-permissions
          image: busybox:stable
          command: [chown, -R, "nobody:nobody", /prometheus]
          volumeMounts:
            - name: data
              mountPath: /prometheus
      containers:
        - name: prometheus
          image: prom/prometheus:v2.34.0
          imagePullPolicy: IfNotPresent
          args:
            - "--config.file=/etc/prometheus-shared/prometheus.yaml"
            - "--storage.tsdb.path=/prometheus"
            - "--storage.tsdb.retention.time=6h"   # keep local data for 6 hours
            - "--storage.tsdb.no-lockfile"
            - "--storage.tsdb.min-block-duration=2h" # equal min/max block durations disable local compaction; Thanos handles compaction
            - "--storage.tsdb.max-block-duration=2h"
            - "--web.enable-admin-api" # enables the admin API for managing TSDB data
            - "--web.enable-lifecycle" # enables hot reload via localhost:9090/-/reload
          ports:
            - name: http
              containerPort: 9090
          resources:
            requests:
              memory: 1Gi
              cpu: 500m
            limits:
              memory: 1Gi
              cpu: 500m
          volumeMounts:
            - name: prometheus-config-shared
              mountPath: /etc/prometheus-shared/
            - name: prometheus-rules
              mountPath: /etc/prometheus/rules
            - name: data
              mountPath: /prometheus
        - name: thanos
          image: thanosio/thanos:v0.25.1
          imagePullPolicy: IfNotPresent
          args:
            - sidecar
            - --log.level=debug
            - --tsdb.path=/prometheus
            - --prometheus.url=http://localhost:9090    # reach Prometheus via localhost: containers in one Pod share the network namespace
            - --reloader.config-file=/etc/prometheus/prometheus.yaml.tmpl   # the Prometheus configuration template
            - --reloader.config-envsubst-file=/etc/prometheus-shared/prometheus.yaml # the rendered config (environment variables substituted) is written here; the emptyDir mounted at /etc/prometheus-shared/ shares it with the prometheus container
            - --reloader.rule-dir=/etc/prometheus/rules/
          ports:
            - name: http-sidecar
              containerPort: 10902
            - name: grpc
              containerPort: 10901
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
          resources:
            requests:
              memory: 1Gi
              cpu: 500m
            limits:
              memory: 1Gi
              cpu: 500m
          volumeMounts:
            - name: prometheus-config-shared
              mountPath: /etc/prometheus-shared/
            - name: prometheus-config
              mountPath: /etc/prometheus 
            - name: prometheus-rules
              mountPath: /etc/prometheus/rules
            - name: data
              mountPath: /prometheus
  volumeClaimTemplates: # Prometheus cuts a TSDB block only every 2h, so local data still needs to be persisted
    - metadata:
        name: data
        labels:
          app: prometheus
      spec:
        storageClassName: longhorn # do not use NFS storage
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 2Gi

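Since Prometheus and the Thanos Sidecar run in the same Pod, the sidecar can reach Prometheus via localhost, and because the data directory is mounted into both containers, they share it as well. Pay close attention to how each configuration file is mounted. In addition, the configuration template attaches the POD_NAME environment variable to the Prometheus instance as an external label; here that variable is set through the Downward API.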
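The reason for persistence given before the manifest (a block only every 2 hours) can be put into numbers with a deliberately simplified Python model. It assumes a block is cut exactly at every 2-hour boundary since startup and shipped right away, and that whatever is not yet in a block is lost if the local data directory is lost; real Prometheus also keeps a WAL and cuts blocks with some delay, so treat the result as a rough illustration only.

```python
BLOCK_HOURS = 2.0  # matches --storage.tsdb.min/max-block-duration=2h


def unshipped_hours(uptime_hours: float, block_hours: float = BLOCK_HOURS) -> float:
    """Hours of samples still only in the in-memory head block (simplified model)."""
    if uptime_hours < 0:
        raise ValueError("uptime must be non-negative")
    # Everything before the last block boundary is assumed to be in a block already.
    return uptime_hours % block_hours


if __name__ == "__main__":
    for uptime in (0.5, 1.9, 2.0, 5.5):
        print(f"uptime={uptime}h -> at risk without a PV: {unshipped_hours(uptime):.1f}h")
```

In this model the exposure never exceeds one block duration, which is where the "about 2 hours" figure comes from.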

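Because we use a StatefulSet controller, we need a Headless Service; the Thanos Querier deployed later will also use this headless service to query data from all Prometheus instances. You can optionally create a separate Service for each Prometheus instance for easier debugging, but this is not required: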

# prometheus-headless.yaml
# This Service provides SRV records that Querier uses to look up Store API endpoints
apiVersion: v1
kind: Service
metadata:
  name: thanos-store-gateway
  namespace: kube-mon
spec:
  type: ClusterIP
  clusterIP: None
  ports:
    - name: grpc
      port: 10901
      targetPort: grpc
  selector:
    # every Pod labeled thanos-store-api: "true" is attached to this gateway,
    # so all Thanos component Pods that carry this label are matched
    thanos-store-api: "true"

Create the headless service:

root@master:~/thanos# kubectl apply -f prometheus-headless.yaml
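The headless Service selects Pods purely by labels, which is also how Store Gateway Pods will join it later. A minimal Python sketch of equality-based selector matching (the semantics of a Service `selector`: every key must be present with the same value) shows which Pods end up behind thanos-store-gateway; the Pod list is illustrative.

```python
def matches(selector: dict, labels: dict) -> bool:
    """Equality-based selector: every selector key must exist with the same value."""
    return all(labels.get(key) == value for key, value in selector.items())


def select_pods(selector: dict, pods: dict) -> list:
    """Return the sorted names of pods whose labels satisfy the selector."""
    return sorted(name for name, labels in pods.items() if matches(selector, labels))


if __name__ == "__main__":
    selector = {"thanos-store-api": "true"}  # from prometheus-headless.yaml
    pods = {
        "prometheus-0": {"app": "prometheus", "thanos-store-api": "true"},
        "prometheus-1": {"app": "prometheus", "thanos-store-api": "true"},
        "thanos-querier-abc": {"app": "thanos-querier"},
    }
    print(select_pods(selector, pods))  # -> ['prometheus-0', 'prometheus-1']
```

Label values are strings, which is why the manifests quote the value as "true"; an unquoted true would be parsed by YAML as a boolean.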

Since the persistent volumes above use Longhorn, we also need to deploy Longhorn as the storage backend.

1.4.1 Deploy Longhorn

Installing Longhorn on an existing Kubernetes cluster takes only two steps: install the Longhorn controller and its components, then create a StorageClass that Pods can use.

Step 1:

root@master:~/thanos# kubectl apply -f https://raw.githubusercontent.com/rancher/longhorn/master/deploy/longhorn.yaml

# Pods after deployment
root@master:~# kubectl get pod -n longhorn-system 
NAME                                                READY   STATUS    RESTARTS      AGE
csi-attacher-868487bdf9-5jgtb                       1/1     Running   0             89m
csi-attacher-868487bdf9-cnn8s                       1/1     Running   0             89m
csi-attacher-868487bdf9-rhh9l                       1/1     Running   0             89m
csi-provisioner-579866cdd8-h5hjn                    1/1     Running   0             89m
csi-provisioner-579866cdd8-qnk4z                    1/1     Running   2 (63m ago)   89m
csi-provisioner-579866cdd8-td44j                    1/1     Running   3 (55m ago)   89m
csi-resizer-cdd748db8-s5b97                         1/1     Running   0             89m
csi-resizer-cdd748db8-tmwqp                         1/1     Running   0             89m
csi-resizer-cdd748db8-tq5b6                         1/1     Running   0             89m
csi-snapshotter-9b68bbfb8-8sgcd                     1/1     Running   0             89m
csi-snapshotter-9b68bbfb8-ch6ms                     1/1     Running   0             89m
csi-snapshotter-9b68bbfb8-vlm49                     1/1     Running   0             89m
engine-image-ei-b907910b-2hq5m                      1/1     Running   0             52m
engine-image-ei-b907910b-2lvsk                      1/1     Running   0             47m
engine-image-ei-b907910b-p8xk6                      1/1     Running   0             47m
instance-manager-3c7e07d1fdd859ff6e9f79e2d0e1a9ec   1/1     Running   0             85m
instance-manager-6f65b506f45b369d57a2cced29b4485a   1/1     Running   0             88m
instance-manager-d5c5cb3127f9497fa976f895c69b7fb9   1/1     Running   0             89m
longhorn-csi-plugin-87v5b                           3/3     Running   6 (49m ago)   89m
longhorn-csi-plugin-jqtzr                           3/3     Running   5 (63m ago)   89m
longhorn-csi-plugin-jx5hg                           3/3     Running   5 (47m ago)   89m
longhorn-driver-deployer-75776cf9b6-qr9n8           1/1     Running   0             93m
longhorn-manager-8nzrn                              1/1     Running   0             88m
longhorn-manager-bsn2z                              1/1     Running   0             86m
longhorn-manager-tgklw                              1/1     Running   1 (89m ago)   89m
longhorn-ui-d996774-kdtsf                           1/1     Running   0             93m
longhorn-ui-d996774-wp4ds                           1/1     Running   0             93m
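Rather than eyeballing the list, Pod health can be checked from the same `kubectl get pod` output. The Python sketch below parses the table text, assuming the default column layout shown above (NAME, READY, STATUS, RESTARTS, AGE), and reports Pods that are not Running or whose READY count is incomplete.

```python
def unhealthy_pods(table: str) -> list:
    """Return names of pods that are not Running or not fully ready.

    Only the first three columns are used, so RESTARTS values such as
    "2 (63m ago)" that contain spaces do not matter.
    """
    problems = []
    lines = [line for line in table.strip().splitlines() if line.strip()]
    for line in lines[1:]:  # skip the header row
        columns = line.split()
        name, ready, status = columns[0], columns[1], columns[2]
        ready_count, total = (int(part) for part in ready.split("/"))
        if status != "Running" or ready_count != total:
            problems.append(name)
    return problems


if __name__ == "__main__":
    sample = """NAME READY STATUS RESTARTS AGE
longhorn-csi-plugin-87v5b 3/3 Running 6 (49m ago) 89m
longhorn-ui-d996774-kdtsf 1/1 Running 0 93m
"""
    print(unhealthy_pods(sample) or "all pods healthy")
```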

Step 2:

Creating the StorageClass takes one more command. As an optional extra step, you can make the new class the default so you don't have to specify it every time:

root@master:~# kubectl apply -f https://raw.githubusercontent.com/rancher/longhorn/master/examples/storageclass.yaml

# The default StorageClass longhorn has been created; it is the one the thanos-sidecar StatefulSet uses
root@master:~# kubectl get storageclasses.storage.k8s.io 
NAME                 PROVISIONER          RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
longhorn (default)   driver.longhorn.io   Delete          Immediate           true                   91m
longhorn-test        driver.longhorn.io   Delete          Immediate           true                   3m6s

1.4.2 Access the Longhorn Dashboard

Longhorn has a clean dashboard that shows used space, available space, the list of volumes, and more. But first we need to create the authentication credentials:

root@master:~# apt install apache2-utils -y

# Set the password (123456 in this example)
root@master:~# htpasswd -c ./ing-auth admin
New password: 
Re-type new password: 

# Create the secret
root@master:~# kubectl create secret generic longhorn-auth \
>   --from-file ing-auth --namespace=longhorn-system
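`htpasswd` writes one `user:hash` line per user. To illustrate that file format only, the Python sketch below produces the `{SHA}` variant (base64 of the SHA-1 digest, the format `htpasswd -s` emits), which nginx basic auth can also read. The command above uses htpasswd's default hash instead, and SHA-1 is weak, so keep using the real tool for anything beyond a lab.

```python
import base64
import hashlib


def htpasswd_sha_line(user: str, password: str) -> str:
    """Build an htpasswd entry using the {SHA} scheme: base64(SHA-1(password))."""
    digest = hashlib.sha1(password.encode("utf-8")).digest()
    return f"{user}:{{SHA}}{base64.b64encode(digest).decode('ascii')}"


if __name__ == "__main__":
    # Same demo credentials as above.
    print(htpasswd_sha_line("admin", "123456"))
```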

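Next, expose the Service. NodePort is used here; in production an Ingress is preferable. Note that the basic-auth secret created above only takes effect through an Ingress controller configured to use it (for example ingress-nginx); a plain NodePort Service does not enforce it.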

root@master:~# kubectl edit svc -n longhorn-system longhorn-frontend 
  type: NodePort    # change this field

Access URL: http://10.0.0.131:9491/#/dashboard

With Longhorn deployed, we can now deploy the sidecar StatefulSet defined above.

1.4.3 Deploy thanos-sidecar

root@master:~/thanos# kubectl apply -f thanos-sidecar.yaml 

# The Prometheus Pods are running
root@master:~/thanos# kubectl get pod -n kube-mon -w
NAME           READY   STATUS    RESTARTS        AGE
prometheus-0   2/2     Running   1 (4m59s ago)   7m11s
prometheus-1   2/2     Running   1 (112s ago)    4m19s

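After creation, each Prometheus Pod contains two containers. The Sidecar container starts with two very important flags, --reloader.config-file and --reloader.config-envsubst-file. The first specifies the Prometheus configuration template. The sidecar renders this template, which here means replacing the value of external_labels.replica: $(POD_NAME) with the POD_NAME environment variable, and writes the result to the path given by config-envsubst-file, i.e. /etc/prometheus-shared/prometheus.yaml. That is why the main Prometheus container points --config.file at this same path. We can verify this in the Sidecar container's logs: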

# View the thanos sidecar logs
root@master:~/thanos# kubectl logs -f -n kube-mon prometheus-0 -c thanos
level=debug ts=2023-05-09T06:34:25.331739073Z caller=main.go:66 msg="maxprocs: Updating GOMAXPROCS=[1]: using minimum allowed GOMAXPROCS"
level=info ts=2023-05-09T06:34:25.33211784Z caller=sidecar.go:123 msg="no supported bucket was configured, uploads will be disabled"
level=info ts=2023-05-09T06:34:25.332178986Z caller=options.go:27 protocol=gRPC msg="disabled TLS, key and cert must be set to enable"
level=info ts=2023-05-09T06:34:25.332503369Z caller=sidecar.go:357 msg="starting sidecar"
level=info ts=2023-05-09T06:34:25.332667115Z caller=intrumentation.go:75 msg="changing probe status" status=healthy
level=info ts=2023-05-09T06:34:25.332683183Z caller=http.go:73 service=http/server component=sidecar msg="listening for requests and metrics" address=0.0.0.0:10902
level=info ts=2023-05-09T06:34:25.332806868Z caller=tls_config.go:195 service=http/server component=sidecar msg="TLS is disabled." http2=false
level=debug ts=2023-05-09T06:34:25.332861342Z caller=promclient.go:623 msg="build version" url=http://localhost:9090/api/v1/status/buildinfo
level=info ts=2023-05-09T06:34:25.333879115Z caller=intrumentation.go:56 msg="changing probe status" status=ready
level=info ts=2023-05-09T06:34:25.333937756Z caller=grpc.go:131 service=gRPC/server component=sidecar msg="listening for serving gRPC" address=0.0.0.0:10901
level=error ts=2023-05-09T06:34:25.342311986Z caller=runutil.go:101 component=reloader msg="function failed. Retrying in next tick" err="trigger reload: reload request failed: Post \"http://localhost:9090/-/reload\": dial tcp 127.0.0.1:9090: connect: connection refused"
level=warn ts=2023-05-09T06:34:25.342461976Z caller=sidecar.go:172 msg="failed to fetch prometheus version. Is Prometheus running? Retrying" err="perform GET request against http://localhost:9090/api/v1/status/buildinfo: Get \"http://localhost:9090/api/v1/status/buildinfo\": dial tcp 127.0.0.1:9090: connect: connection refused"
level=debug ts=2023-05-09T06:34:27.332904766Z caller=promclient.go:623 msg="build version" url=http://localhost:9090/api/v1/status/buildinfo
level=info ts=2023-05-09T06:34:27.334517011Z caller=sidecar.go:179 msg="successfully loaded prometheus version"
level=info ts=2023-05-09T06:34:27.336881616Z caller=sidecar.go:201 msg="successfully loaded prometheus external labels" external_labels="{cluster=\"ydzs-test\", replica=\"prometheus-0\"}"
level=info ts=2023-05-09T06:34:30.337282418Z caller=reloader.go:373 component=reloader msg="Reload triggered" cfg_in=/etc/prometheus/prometheus.yaml.tmpl cfg_out=/etc/prometheus-shared/prometheus.yaml watched_dirs=/etc/prometheus/rules/
level=info ts=2023-05-09T06:34:30.337369219Z caller=reloader.go:235 component=reloader msg="started watching config file and directories for changes" cfg=/etc/prometheus/prometheus.yaml.tmpl out=/etc/prometheus-shared/prometheus.yaml dirs=/etc/prometheus/rules/

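Because no object storage parameters are configured for the Sidecar, the log reports "no supported bucket was configured, uploads will be disabled", meaning metric data is not uploaded yet. At this point the Thanos Sidecar component is deployed successfully.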

Uploading will only work once the remaining components are deployed and configured later.
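The most useful line in the sidecar log above is the one reporting the external labels. The log uses a logfmt-style `key=value` layout; the Python sketch below is a simplified parser built on `shlex`, assuming values are either bare words or double-quoted strings with backslash-escaped quotes (as in the output above), so the `replica` label can be checked automatically.

```python
import shlex


def parse_logfmt(line: str) -> dict:
    """Split a logfmt-style line into a dict; shlex unescapes quoted values."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


if __name__ == "__main__":
    line = (
        'level=info ts=2023-05-09T06:34:27.336881616Z caller=sidecar.go:201 '
        'msg="successfully loaded prometheus external labels" '
        'external_labels="{cluster=\\"ydzs-test\\", replica=\\"prometheus-0\\"}"'
    )
    fields = parse_logfmt(line)
    print(fields["external_labels"])  # -> {cluster="ydzs-test", replica="prometheus-0"}
```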

2 Thanos Query component

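We now have two Prometheus instances running. Rather than putting a load balancer in front of them to query metrics, we use the Thanos Querier component as a global, unified query entry point. The most important Querier setting is the address of the Thanos Sidecars, and here we can simply use the Headless Service created above for automatic discovery.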

Since Querier is stateless, we can run multiple replicas for high availability:

# thanos-querier.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-querier
  namespace: kube-mon
  labels:
    app: thanos-querier
spec:
  # stateless, so multiple replicas provide high availability
  replicas: 2
  selector:
    matchLabels:
      app: thanos-querier
  template:
    metadata:
      labels:
        app: thanos-querier
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                topologyKey: kubernetes.io/hostname
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - thanos-querier
      containers:
        - name: thanos
          image: thanosio/thanos:v0.25.1
          args:
            - query
            - --log.level=debug
            # query.replica-label is crucial: it names the label to deduplicate on. The config above added the unique replica label, so deduplication is based on it
            - --query.replica-label=replica
            # Discover local store APIs using DNS SRV. The Store API address is the name of the headless Service created above (prometheus-headless.yaml): thanos-store-gateway
            - --store=dnssrv+thanos-store-gateway:10901
          ports:
            - name: http
              containerPort: 10902
            - name: grpc
              containerPort: 10901
          resources:
            requests:
              memory: 512Mi
              cpu: 500m
            limits:
              memory: 512Mi
              cpu: 500m
          livenessProbe:
            httpGet:
              path: /-/healthy
              port: http
            initialDelaySeconds: 10
          readinessProbe:
            httpGet:
              path: /-/healthy
              port: http
            initialDelaySeconds: 15
---
# expose Querier via NodePort
apiVersion: v1
kind: Service
metadata:
  name: thanos-querier
  namespace: kube-mon
  labels:
    app: thanos-querier
spec:
  ports:
    - port: 9090
      targetPort: http
      name: http
  selector:
    app: thanos-querier
  type: NodePort

root@master:~/thanos# kubectl apply -f thanos-querier.yaml

# The thanos-querier Pods have been created
root@master:~/thanos# kubectl get pod -n kube-mon 
NAME                              READY   STATUS    RESTARTS      AGE
prometheus-0                      2/2     Running   1 (83m ago)   85m
prometheus-1                      2/2     Running   1 (80m ago)   82m
thanos-querier-854598789d-ngmgl   1/1     Running   0             57s
thanos-querier-854598789d-tr64g   1/1     Running   0             57s

# View the Services
root@master:~/thanos# kubectl get svc -n kube-mon 
NAME                   TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)         AGE
thanos-querier         NodePort    10.96.75.61   <none>        9090:441/TCP   93s
thanos-store-gateway   ClusterIP   None          <none>        10901/TCP       67m
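The `--store=dnssrv+thanos-store-gateway:10901` value in the Querier manifest combines a lookup prefix with an address. Thanos documents the `dns+`, `dnssrv+` and `dnssrvnoa+` prefixes; the hypothetical helper below only splits such a value into its parts to show how it reads. It performs no DNS lookups, does not handle IPv6 literals, and does not model how Thanos combines an explicit port with SRV results.

```python
# Documented Thanos address prefixes mapped to the DNS query they trigger.
PREFIXES = {
    "dns+": "A/AAAA",
    "dnssrv+": "SRV",
    "dnssrvnoa+": "SRV (no A lookup)",
}


def parse_store_flag(value: str):
    """Split a --store value into (lookup type, host, port or None). Hypothetical helper."""
    lookup, address = "static", value
    for prefix, kind in PREFIXES.items():
        if value.startswith(prefix):
            lookup, address = kind, value[len(prefix):]
            break
    host, sep, port = address.rpartition(":")
    if not sep:
        return lookup, address, None
    return lookup, host, int(port)


if __name__ == "__main__":
    print(parse_store_flag("dnssrv+thanos-store-gateway:10901"))
```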

2.1 Access

The Querier UI looks very similar to the Prometheus web UI:

http://10.0.0.131:441

The sidecars have been discovered and are bound to the Prometheus instances.

Metrics from the Prometheus scrape configuration can also be queried through the sidecars.

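If the deduplication checkbox is unchecked, Querier shows every series it discovers, one per replica.

With it checked, the series are merged before being displayed.

When deduplication is selected, results are merged based on the replica label; if both replicas have data, Querier takes the result with the smaller timestamp: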
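Conceptually, deduplication drops the label named by `--query.replica-label`, groups the series that become identical, and emits one sample per timestamp. The Python sketch below is a simplified model of that idea: it keeps the first replica's value (by replica name) at each timestamp. Thanos's real deduplication uses a penalty-based algorithm to choose between replicas and fill gaps, which is not modeled here.

```python
from collections import defaultdict


def deduplicate(series, replica_label="replica"):
    """Merge series that differ only in `replica_label`.

    series: list of (labels_dict, [(timestamp, value), ...]).
    Returns {frozenset(labels without replica): sorted [(timestamp, value), ...]}.
    """
    merged = defaultdict(dict)
    # Sort by replica name so the choice between replicas is deterministic.
    for labels, samples in sorted(series, key=lambda s: s[0].get(replica_label, "")):
        key = frozenset((k, v) for k, v in labels.items() if k != replica_label)
        for ts, value in samples:
            merged[key].setdefault(ts, value)  # first replica wins per timestamp
    return {key: sorted(points.items()) for key, points in merged.items()}


if __name__ == "__main__":
    series = [
        ({"job": "coredns", "replica": "prometheus-1"}, [(15, 2.1), (30, 3.0)]),
        ({"job": "coredns", "replica": "prometheus-0"}, [(0, 1.0), (15, 2.0)]),
    ]
    print(deduplicate(series))
```

The labeldrop rule in alert_relabel_configs applies the same idea to alerts: once `replica` is gone, alerts from both replicas are identical and can be deduplicated downstream.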

2.2 Deploy Grafana with Querier as the data source

1. Extract the archive

root@i-bfotjfux:~# tar xf grafana-enterprise-7.5.11.linux-amd64.tar.gz 
root@i-bfotjfux:~# mkdir /apps
root@i-bfotjfux:~# mv grafana-7.5.11/ /apps/

2. Create a symlink

root@i-bfotjfux:~# ln -vs /apps/grafana-7.5.11/ /apps/grafana

3. Write the systemd service file

# Let systemctl manage Grafana
root@i-bfotjfux:~# vi /etc/systemd/system/grafana.service
[Unit]
Description=Grafana
After=network.target

[Service]
# Working directory (Grafana home path)
WorkingDirectory=/apps/grafana/
# Binary to start, with the configuration file passed explicitly
ExecStart=/apps/grafana/bin/grafana-server -config=/apps/grafana/conf/defaults.ini

[Install]
WantedBy=multi-user.target

4. Start the service and enable it at boot

root@i-bfotjfux:~# systemctl daemon-reload 
root@i-bfotjfux:~# systemctl enable --now grafana.service

5. Check that the port is listening

root@i-bfotjfux:~# ss -ntl| grep 3000
LISTEN   0         128                       *:3000                   *:*

Open Grafana in a browser:

1. Add a data source

2. Fill in the corresponding URL (the Querier address)
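The same data source can also be created through Grafana's HTTP API (`POST /api/datasources`) instead of the UI. The Python sketch below only builds the request with the standard library; the Grafana address, the `admin:admin` credentials, the data source name, and the Querier URL (the NodePort address used earlier) are assumptions to adapt. Querier serves the Prometheus query API, so the data source type is `prometheus`.

```python
import base64
import json
import urllib.request


def build_datasource_request(grafana_url, user, password, querier_url):
    """Build (without sending) a request that registers Thanos Querier in Grafana."""
    payload = {
        "name": "thanos-querier",  # hypothetical display name
        "type": "prometheus",  # Querier speaks the Prometheus HTTP query API
        "url": querier_url,
        "access": "proxy",  # Grafana's backend proxies the queries
        "isDefault": True,
    }
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"{grafana_url.rstrip('/')}/api/datasources",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": f"Basic {token}"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_datasource_request("http://localhost:3000", "admin", "admin", "http://10.0.0.131:441")
    print(req.get_method(), req.full_url)
    # urllib.request.urlopen(req)  # uncomment to actually create the data source
```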

2.2.1 Add the CoreDNS dashboard

The dashboard imported here is ID 11759.

Overview:

All of the data shown comes from thanos-query.

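In this architecture Prometheus writes its data into a local TSDB block every two hours by default, and the sidecar uploads those TSDB blocks to object storage. Only recent data stays on local disk, so querying history requires the data to be in object storage, which in turn means a Store API component has to be connected to that object storage.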
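Every Store API endpoint (a sidecar or a Store Gateway) advertises the minimum and maximum timestamps it can serve, which lets Querier skip endpoints that cannot contribute to a query. The Python sketch below models that selection as a plain interval-overlap test; the time ranges (a sidecar covering the last 6 hours per --storage.tsdb.retention.time=6h, and a Store Gateway covering older uploaded blocks) are illustrative assumptions.

```python
def overlapping_stores(stores, query_start, query_end):
    """Return store names whose [min_time, max_time] overlaps [query_start, query_end]."""
    return [
        name
        for name, (min_time, max_time) in stores.items()
        if min_time <= query_end and max_time >= query_start
    ]


if __name__ == "__main__":
    now = 100.0  # hours on an arbitrary clock
    stores = {
        "sidecar": (now - 6, now),  # local TSDB, 6h retention
        "store-gateway": (0.0, now - 2),  # blocks already uploaded
    }
    print(overlapping_stores(stores, now - 1, now))  # last hour: sidecar only
    print(overlapping_stores(stores, now - 48, now - 24))  # two days ago: store-gateway only
```

Without a Store Gateway, the second query would match no endpoint at all, which is exactly the gap the next section closes.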

3 Thanos Store component

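So far we have installed the Thanos Sidecar and Querier components, which already make Prometheus highly available: Querier provides a single entry point for querying metrics and deduplicates them automatically. One very important piece is still missing, however: no object storage is configured, so historical metrics cannot be viewed. For that we need to configure the Thanos Store component and keep historical metrics in object storage.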

Without the Store component, the sidecar setup can only serve the data Prometheus keeps locally, which covers a fairly short period.

For production it is best to use an object storage provider whose support status is Stable, such as S3 or an S3-compatible service like Ceph or MinIO.

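For users in China, the most convenient option is a public-cloud service such as Alibaba Cloud OSS or Tencent Cloud COS, but our services often don't run on a public cloud, so here we deploy MinIO as an S3-compatible object storage service.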

3.1 Install MinIO

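MinIO is an open-source, high-performance distributed object storage service (licensed under GNU AGPLv3 since 2021; earlier releases used the Apache License v2.0), designed for large-scale private cloud infrastructure. It is compatible with the Amazon S3 cloud storage API and well suited to storing large volumes of unstructured data such as images, videos, log files, backups, and container/VM images; a single object can be anywhere from a few KB up to 5 TB.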

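Installing MinIO is straightforward. Here we also install it into the Kubernetes cluster, following the official documentation for deploying MinIO on Kubernetes. On Kubernetes, MinIO can run in standalone, distributed, or shared mode depending on your needs; since this is only a test, the simplest standalone mode is enough.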

Manage the MinIO service with the following Deployment:

# minio-deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: minio
spec:
  selector:
    matchLabels:
      app: minio
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: minio
    spec:
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: minio-pvc
      containers:
        - name: minio
          volumeMounts:
            - name: data
              mountPath: "/data"
          image: minio/minio:latest
          args: ["server", "--console-address", ":9001", "/data"]
          env:
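            - name: MINIO_ROOT_USER       # replaces the deprecated MINIO_ACCESS_KEY
              value: "minio"
            - name: MINIO_ROOT_PASSWORD   # replaces the deprecated MINIO_SECRET_KEY; must be at least 8 characters
              value: "minio123"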
          ports:
            - containerPort: 9000
            - containerPort: 9001
          readinessProbe:
            httpGet:
              path: /minio/health/ready
              port: 9000
            initialDelaySeconds: 10
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /minio/health/live
              port: 9000
            initialDelaySeconds: 10
            periodSeconds: 10

Newer images serve the Console and the API on separate ports, so the --console-address flag is needed at startup to set the Console port; the API stays on the default port 9000.

Then persist the data with a PVC named minio-pvc:

# minio-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: minio-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10G
  storageClassName: longhorn # a LocalPV is preferable

Finally, expose MinIO to external users through a NodePort Service:

# minio-svc.yaml
apiVersion: v1
kind: Service
metadata:
  name: minio
spec:
  ports:
    - name: console
      port: 9001
      targetPort: 9001
      nodePort: 30091
    - name: api
      port: 9000
      targetPort: 9000
  selector:
    app: minio
  type: NodePort

Apply the manifests above:

root@master:~/thanos# kubectl apply -f minio-pvc.yaml 
root@master:~/thanos# kubectl apply -f minio-svc.yaml 
root@master:~/thanos# kubectl apply -f minio-deploy.yaml 

root@master:~/thanos# kubectl get pod 
NAME                    READY   STATUS    RESTARTS   AGE
minio-875749785-kvftc   1/1     Running   0          94s

# The console is reachable on NodePort 30091 and the API on NodePort 11133
root@master:~/thanos# kubectl get svc
NAME         TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                         AGE
kubernetes   ClusterIP   10.96.0.1       <none>        443/TCP                         42h
minio        NodePort    10.98.205.164   <none>        9001:30091/TCP,9000:11133/TCP   106s

3.1.1 Create a bucket

Open the MinIO console in a browser:

Create a bucket:

A bucket named thanos has been created.

3.2 Install Thanos Store

With object storage ready, we can deploy the Store component, which serves metrics from object storage to Querier.

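Based on the MinIO instance created above, write the following object storage configuration file. It is given to the Store component and tells it how to connect to the object storage: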

# thanos-storage-minio.yaml
type: s3
config:
  bucket: thanos
  # MinIO address; everything runs inside Kubernetes, so the API is reached via its cluster DNS name
  endpoint: minio.default.svc.cluster.local:9000
  access_key: minio
  secret_key: minio123
  insecure: true
  signature_version2: false

Create a Secret object from the configuration file above:

root@master:~/thanos# kubectl create secret generic thanos-objectstorage --from-file=thanos.yaml=thanos-storage-minio.yaml -n kube-mon

# Created successfully
root@master:~/thanos# kubectl get secrets -n kube-mon 
NAME                     TYPE                                  DATA   AGE
thanos-objectstorage     Opaque                                1      16s
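To double-check what was stored, `kubectl get secret thanos-objectstorage -n kube-mon -o jsonpath='{.data.thanos\.yaml}'` prints the value base64-encoded, because Kubernetes keeps every entry under `data` in base64. The Python sketch below decodes such a value; its sample input is encoded locally from part of the file above rather than taken from a real cluster.

```python
import base64


def decode_secret_value(encoded: str) -> str:
    """Decode one entry of a Secret's `data` map from base64 to UTF-8 text."""
    return base64.b64decode(encoded).decode("utf-8")


if __name__ == "__main__":
    original = "type: s3\nconfig:\n  bucket: thanos\n  endpoint: minio.default.svc.cluster.local:9000\n"
    encoded = base64.b64encode(original.encode("utf-8")).decode("ascii")
    print(decode_secret_value(encoded))
```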

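Next, create the manifest for the Store component. Note that it must carry the thanos-store-api: "true" label so that the thanos-store-gateway Headless Service created earlier discovers it automatically; Querier can then fetch data not only from the Sidecars but also from object storage through this Store component. Mount the Secret above into the container at /etc/secret as a volume and point the objstore.config-file flag at it: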

# thanos-store.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: thanos-store-gateway
  namespace: kube-mon
  labels:
    app: thanos-store-gateway
spec:
  # 2 replicas for high availability
  replicas: 2
  selector:
    matchLabels:
      app: thanos-store-gateway
  # matches the headless Service
  serviceName: thanos-store-gateway
  template:
    metadata:
      labels:
        app: thanos-store-gateway
        # thanos-store-api: "true" declares that this component also implements the Store API; Querier connects to every component carrying this label, such as the sidecars and the Store Gateway
        thanos-store-api: "true"
    spec:
      affinity:
        # for high availability, pod anti-affinity spreads the two Pods across different nodes
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                topologyKey: kubernetes.io/hostname
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - thanos-store-gateway
      containers:
        - name: thanos
          image: thanosio/thanos:v0.25.1
          args:
            - "store"
            - "--log.level=debug"
            - "--data-dir=/data"
            # objstore.config-file points to the object storage config; the MinIO connection settings are mounted from the Secret
            - "--objstore.config-file=/etc/secret/thanos.yaml"
            - "--index-cache-size=500MB"
            - "--chunk-pool-size=500MB"
          ports:
            - name: http
              containerPort: 10902
            - name: grpc
              containerPort: 10901
          livenessProbe:
            httpGet:
              port: 10902
              path: /-/healthy
          readinessProbe:
            httpGet:
              port: 10902
              path: /-/ready
          volumeMounts:
            # mount the thanos-objectstorage Secret at /etc/secret for the objstore.config-file flag above
            - name: object-storage-config
              mountPath: /etc/secret
              readOnly: false
            - mountPath: /data
              name: data
      volumes:
        # the thanos-objectstorage Secret to mount into the container
        - name: object-storage-config
          secret:
            secretName: thanos-objectstorage
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes:
          - ReadWriteOnce
        storageClassName: longhorn
        resources:
          requests:
            storage: 1Gi

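The Store Gateway can actually be largely stateless. It needs a little disk space to index the object storage and speed up queries, but that data is not critical and can be deleted: after deletion it re-reads object storage and rebuilds the index. To avoid rebuilding the index on every restart, we deploy the Store Gateway as a StatefulSet with a small PV. Running two replicas makes the Store Gateway highly available.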

Create thanos-store:

root@master:~/thanos# kubectl apply -f thanos-store.yaml 

# List Pods labeled thanos-store-api=true; all of them can be queried by Querier
root@master:~/thanos# kubectl get pod -n  kube-mon -l thanos-store-api=true
NAME                     READY   STATUS    RESTARTS      AGE
prometheus-0             2/2     Running   1 (17h ago)   17h
prometheus-1             2/2     Running   1 (17h ago)   17h
thanos-store-gateway-0   1/1     Running   0             2m18s
thanos-store-gateway-1   1/1     Running   0             105s

3.2.1 Check in the Querier UI that Store is connected

http://10.0.0.131:441/stores

As the screenshot shows, Store is connected.

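This shows that the Store component is configured successfully. But there is still an obvious gap: so far we have only configured reading data from object storage, so what writes data into object storage? The Sidecar component, of course.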

So the objstore.config-file flag and the Secret must also be added to the Sidecar component:

# Add the following configuration to the sidecar manifest
......
volumes:
  - name: object-storage-config
    secret:
      secretName: thanos-objectstorage
......
args:
  - sidecar
  - --log.level=debug
  - --tsdb.path=/prometheus
  - --prometheus.url=http://localhost:9090
  - --reloader.config-file=/etc/prometheus/prometheus.yaml.tmpl
  - --reloader.config-envsubst-file=/etc/prometheus-shared/prometheus.yaml
  - --reloader.rule-dir=/etc/prometheus/rules/
  # with objstore.config-file set, the sidecar uploads each TSDB block to object storage once it appears in the Prometheus data directory, i.e. roughly every two hours
  - --objstore.config-file=/etc/secret/thanos.yaml
......
volumeMounts:
  - name: object-storage-config
    mountPath: /etc/secret
    readOnly: false
......

Prometheus scrapes the data and the sidecar ships it, so the data served by Store ultimately comes from the blocks the sidecar uploads to object storage.

Re-apply the sidecar manifest:

root@master:~/thanos# kubectl apply -f thanos-sidecar.yaml 

root@master:~/thanos# kubectl get pod -n kube-mon 
NAME                              READY   STATUS    RESTARTS      AGE
prometheus-0                      2/2     Running   1 (21s ago)   32s
prometheus-1                      2/2     Running   1 (35s ago)   53s

3.2.2 Check that MinIO contains the uploaded data

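After the configuration is complete, re-apply the Sidecar component. Once the change takes effect, data should start arriving in MinIO (provided there are local TSDB blocks older than two hours), which we can verify in the MinIO web UI: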

The data has been uploaded to the thanos bucket.

Download the data to check that its content is correct.

Download meta.json:

The meta.json file contains the Thanos and Prometheus label information.
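The same check can be scripted. The Python sketch below reads a block's `meta.json` and reports the time span it covers and the external labels Thanos attached under `thanos.labels`. The sample document in the demo is a trimmed, made-up example of that layout (fields `ulid`, `minTime`, `maxTime` in milliseconds, `thanos.labels`, `thanos.source`), not the file downloaded above.

```python
import json
from datetime import datetime, timezone


def summarize_block(meta_text: str) -> dict:
    """Extract the UTC time range and Thanos external labels from a meta.json."""
    meta = json.loads(meta_text)
    start = datetime.fromtimestamp(meta["minTime"] / 1000, tz=timezone.utc)
    end = datetime.fromtimestamp(meta["maxTime"] / 1000, tz=timezone.utc)
    thanos = meta.get("thanos", {})
    return {
        "ulid": meta["ulid"],
        "start": start.isoformat(),
        "end": end.isoformat(),
        "hours": (meta["maxTime"] - meta["minTime"]) / 3_600_000,
        "labels": thanos.get("labels", {}),
        "source": thanos.get("source"),
    }


if __name__ == "__main__":
    sample = json.dumps({
        "ulid": "01H00000000000000000000000",  # placeholder, not a real block ID
        "minTime": 1683612000000,
        "maxTime": 1683619200000,
        "version": 1,
        "thanos": {
            "labels": {"cluster": "ydzs-test", "replica": "prometheus-0"},
            "source": "sidecar",
        },
    })
    print(summarize_block(sample))
```

A block uploaded by the sidecar should span about two hours and carry both the cluster and replica labels.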

3.2.3 Verify in the Thanos Querier UI

Log in to the Querier UI and click Stores: the Store entries now show their data labels.
