Deploying Thanos
Here we need to deploy Prometheus, consul, minio, thanos, and grafana:
- Prometheus: cluster monitoring
- consul: service registry for Prometheus
- minio: S3-protocol object storage
- thanos: management layer across multiple Prometheus instances
- grafana: visualization
What makes Thanos harder to pick up is that, unlike Prometheus, it is not something you start from a single binary and immediately play with. It has a lot of components, and each one is started as a subcommand of the same binary, in the form `thanos <component> --<flag>=<value>`.
Real-world products package all of these pieces into images, so there is no way to peel back the shell layer by layer and see the substance. We will therefore work from the inside out and show how the Thanos components cooperate. We will go through roughly three stages:
- build the simplest possible cluster
- optimize querying
- persistent storage
Along the way we will also set up a few supporting clusters: the service registry (consul), the persistent store (minio), and the database (pgsql).
1 Lab Environment
In a real production environment, some of these components, such as object storage and the database, can simply be consumed as managed SaaS offerings from a public cloud. To make everything as transparent as possible, we will pretend we are in a private cloud with only three primitives: network, servers, and storage. All of it will be simulated on AWS public cloud.
The bare minimum for this lab is two servers, but some components need more: the service registry (consul) cluster needs three nodes and the object-storage (minio) cluster needs four, so I recommend using four nodes. If you really cannot spare that many, you can run multiple instances on a single machine, but in a production system they must be kept separate.
To mimic a real production environment, I use five machines here: one as the jump server and load balancer, and the other four as application nodes.
Node | Role |
---|---|
node1 | jumpserver/nginx |
node2 | Prometheus/grafana/alertmanager/thanos-sidecar/thanos-query/thanos-storage-gateway/thanos-frontend/thanos-rules/thanos-compact/minio/consul/postgresql |
node3 | Prometheus/grafana/alertmanager/thanos-sidecar/thanos-query/thanos-storage-gateway/thanos-frontend/thanos-rules/thanos-compact/minio/consul/postgresql |
node4 | Prometheus/grafana/alertmanager/thanos-sidecar/thanos-query/thanos-storage-gateway/thanos-frontend/thanos-rules/thanos-compact/minio/consul |
node5 | Prometheus/grafana/alertmanager/thanos-sidecar/thanos-query/thanos-storage-gateway/thanos-frontend/thanos-rules/thanos-compact/minio |
In production the overall architecture looks roughly like the figure below:
2 Simple Architecture Deployment
The classic stack: Prometheus + Grafana + Alertmanager + node_exporter. There is no need to belabor it; let's get straight to work.
2.1 Deploying Prometheus
I deploy two Prometheus servers here, one on node1 and one on node2.
2.1.1 Deployment on node1
[10:47:03 root@node1 ~]#mkdir /apps
[10:49:20 root@node1 ~]#cd /apps/
[10:53:07 root@node1 ~]#wget https://github.com/prometheus/prometheus/releases/download/v2.31.1/prometheus-2.31.1.linux-amd64.tar.gz
[11:01:31 root@node1 apps]#tar xf prometheus-2.31.1.linux-amd64.tar.gz
[11:01:40 root@node1 apps]#ln -sv /apps/prometheus-2.31.1.linux-amd64 /apps/prometheus
# Write the service file
[11:02:53 root@node1 apps]#vim /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus Server
Documentation=https://prometheus.io/docs/introduction/overview/
After=network.target
[Service]
Restart=on-failure
WorkingDirectory=/apps/prometheus/
ExecStart=/apps/prometheus/prometheus --config.file=/apps/prometheus/prometheus.yml
ExecReload=/bin/kill -HUP $MAINPID
[Install]
WantedBy=multi-user.target
# Enable at boot and start
[11:02:39 root@node1 apps]#systemctl daemon-reload
[11:02:44 root@node1 apps]#systemctl enable --now prometheus.service
Created symlink /etc/systemd/system/multi-user.target.wants/prometheus.service → /etc/systemd/system/prometheus.service.
[11:02:48 root@node1 apps]#systemctl status prometheus.service
● prometheus.service - Prometheus Server
Loaded: loaded (/etc/systemd/system/prometheus.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2023-04-03 11:02:48 CST; 4s ago
Docs: https://prometheus.io/docs/introduction/overview/
Main PID: 3185 (prometheus)
Tasks: 8 (limit: 2235)
Memory: 17.7M
CGroup: /system.slice/prometheus.service
└─3185 /apps/prometheus/prometheus --config.file=/apps/prometheus/prometheus.yml
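Before moving on, it is worth a quick check that Prometheus is actually answering. It exposes standard health endpoints on its HTTP port (a minimal sketch; the exact message text may differ between versions):
# liveness and readiness probes built into Prometheus
curl -s http://localhost:9090/-/healthy
curl -s http://localhost:9090/-/ready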
The node2 deployment follows the same steps as node1 above.
2.1.2 Modifying the Prometheus Configuration
node1 configuration:
[11:38:12 root@node1 prometheus]#vim prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
  # Add the two lines below. node1 is replica: A. This external label is what lets
  # Thanos tell the replicas' data apart; without it, the two copies of every series
  # would be indistinguishable when viewed in Grafana.
  external_labels:
    replica: A
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]
Reload the configuration:
[11:42:56 root@node1 prometheus]#systemctl reload prometheus.service
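To confirm the external label really took effect after the reload, the running configuration can be read back through the API (a quick sketch; the grep just narrows the output):
# status/config returns the currently loaded configuration file
curl -s http://localhost:9090/api/v1/status/config | grep -o 'replica: A'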
node2 configuration:
[11:38:12 root@node2 prometheus]#vim prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
  # Add the two lines below. node2 is replica: B. This external label is what lets
  # Thanos tell the replicas' data apart; without it, the two copies of every series
  # would be indistinguishable when viewed in Grafana.
  external_labels:
    replica: B
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]
Reload the configuration:
[11:42:09 root@node2 prometheus]#systemctl reload prometheus.service
2.2 Deploying node-exporter
Deploy on the node machines:
[14:16:27 root@node1 ~]#cd /apps/
[14:16:34 root@node1 apps]#wget https://github.com/prometheus/node_exporter/releases/download/v1.2.2/node_exporter-1.2.2.linux-amd64.tar.gz
[14:17:48 root@node1 apps]#tar xf node_exporter-1.2.2.linux-amd64.tar.gz
[14:17:51 root@node1 apps]#ln -sv /apps/node_exporter-1.2.2.linux-amd64 /apps/node_exporter
# Write the service file
[14:17:51 root@node1 apps]# vim /etc/systemd/system/node-exporter.service
[Unit]
Description=Prometheus Node Exporter
After=network.target
[Service]
ExecStart=/apps/node_exporter/node_exporter
[Install]
WantedBy=multi-user.target
# Start and verify
[14:18:50 root@node1 apps]#systemctl daemon-reload
[14:19:08 root@node1 apps]#systemctl enable --now node-exporter.service
Created symlink /etc/systemd/system/multi-user.target.wants/node-exporter.service → /etc/systemd/system/node-exporter.service.
[14:19:12 root@node1 apps]#systemctl status node-exporter.service
● node-exporter.service - Prometheus Node Exporter
Loaded: loaded (/etc/systemd/system/node-exporter.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2023-04-03 14:19:12 CST; 3s ago
Main PID: 1997 (node_exporter)
Tasks: 4 (limit: 2235)
Memory: 2.3M
CGroup: /system.slice/node-exporter.service
└─1997 /apps/node_exporter/node_exporter
The steps on node2 are the same as on node1.
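Before wiring the exporters into Prometheus, a quick check that they are serving metrics (a sketch; node_load1 is just one standard metric to grep for):
# node_exporter serves plain-text metrics on :9100
curl -s http://localhost:9100/metrics | grep '^node_load1'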
2.2.1 Updating the Prometheus Configuration to Scrape node-exporter
Changes on node-1:
# Add the following job under the scrape_configs: section
[14:25:28 root@node1 prometheus]#vim /apps/prometheus/prometheus.yml
  - job_name: "node"
    static_configs:
      - targets: ["10.0.0.140:9100","10.0.0.141:9100"]
# Check the configuration file
[14:25:10 root@node1 prometheus]#./promtool check config prometheus.yml
Checking prometheus.yml
SUCCESS: 0 rule files found
# Reload Prometheus
[14:25:17 root@node1 prometheus]#systemctl reload prometheus.service
The node-2 changes mirror node-1.
Verify in a browser:
node-1 :http://10.0.0.140:9090/targets
node-2 :http://10.0.0.141:9090/targets
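The same check also works from the command line against the targets API (a sketch, assuming jq is installed):
# print each active target and its health
curl -s http://localhost:9090/api/v1/targets | jq -r '.data.activeTargets[] | "\(.labels.instance) \(.health)"'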
At this point the two Prometheus servers scrape identical data. But if one of them goes down and is later restored, users querying that instance will see a gap for the downtime: its data is simply incomplete in the middle. This is the problem that motivates Thanos: even after an instance crashes and recovers, the data it serves can still be made whole.
3 Thanos Basic Components
We first use Thanos to solve this Prometheus query-gap problem. Just two components are enough to form a minimal cluster: Thanos Query and Thanos Sidecar. The sidecar abstracts the data layer, and Query queries across that abstraction, providing the unified query interface.
Download from GitHub:
https://github.com/thanos-io/thanos/releases
3.1 Deploying Thanos-sidecar
As the name suggests, it is a sidecar for Prometheus: on bare metal it runs on the same host as Prometheus, and in Kubernetes it runs in the same pod as Prometheus.
Here we deploy it on the nodes first.
The following steps are again performed on both node-1 and node-2.
[15:06:17 root@node1 apps]#wget https://github.com/thanos-io/thanos/releases/download/v0.26.0/thanos-0.26.0.linux-amd64.tar.gz
[15:06:32 root@node1 apps]#tar xf thanos-0.26.0.linux-amd64.tar.gz
[15:06:53 root@node1 apps]#ln -sv /apps/thanos-0.26.0.linux-amd64 /apps/thanos
# The tarball contains just a single binary
[15:07:38 root@node1 apps]#ls thanos
thanos
[15:08:29 root@node2 apps]#mv thanos/thanos /usr/local/sbin/
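A quick sanity check that the binary is in place and runnable (output will vary with the version you downloaded):
# print version and build information
thanos --version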
3.1.1 Write the service file
[15:21:15 root@node1 ~]#vim /etc/systemd/system/thanos-sidecar.service
[Unit]
Description=thanos-sidecar
Documentation=https://thanos.io/
After=network.target
# --tsdb.path=/apps/prometheus            path to Prometheus's local TSDB
# --prometheus.url=http://localhost:9090  URL of the local Prometheus
[Service]
Type=simple
ExecStart=/usr/local/sbin/thanos sidecar \
  --tsdb.path=/apps/prometheus \
  --prometheus.url=http://localhost:9090 \
  --grpc-address=0.0.0.0:10901 \
  --http-address=0.0.0.0:10902
ExecReload=/bin/kill -HUP $MAINPID
TimeoutStopSec=20s
Restart=always
[Install]
WantedBy=multi-user.target
# Reload systemd and start
[15:22:06 root@node1 ~]#systemctl daemon-reload
[15:22:13 root@node1 ~]#systemctl enable --now thanos-sidecar.service
[15:22:22 root@node1 ~]#systemctl status thanos-sidecar.service
● thanos-sidecar.service - thanos-sidecar
Loaded: loaded (/etc/systemd/system/thanos-sidecar.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2023-04-03 15:22:22 CST; 4s ago
Main PID: 4290 (thanos)
Tasks: 7 (limit: 2235)
Memory: 11.5M
CGroup: /system.slice/thanos-sidecar.service
└─4290 /usr/local/sbin/thanos sidecar
Apr 03 15:22:22 node1 thanos[4290]: level=info ts=2023-04-03T07:22:22.281242832Z caller=sidecar.go:201 msg="successfully loaded prometheus external labels" external_labels="{replica=\"A\"}"
# Note external_labels="{replica=\"A\"}" in the log: this is the label we defined in the
# Prometheus configuration to tell the replicas apart, which is what makes the
# Prometheus pair work as a cluster.
# The ports are listening
[15:22:49 root@node1 ~]#ss -ntl | grep 1090*
LISTEN 0 4096 *:10901 *:*
LISTEN 0 4096 *:10902 *:*
node-1:
{replica=\"A\"}
node-2:
{replica=\"B\"}
3.2 Deploying Thanos-query
Query is the component that queries every available data endpoint, such as sidecars or the storage gateway. We are not remote-writing data anywhere yet, so we only need to query the sidecars.
In other words, Query queries the Prometheus instances on both machines at once, abstracting the individual Prometheus servers away.
The following steps are again performed on both node-1 and node-2.
1. Write the service file
[15:59:29 root@node1 ~]#vim /etc/systemd/system/thanos-query.service
[Unit]
Description=thanos-query
Documentation=https://thanos.io/
After=network.target
[Service]
Type=simple
ExecStart=/usr/local/sbin/thanos query \
  --http-address=0.0.0.0:10903 \
  --grpc-address=0.0.0.0:10904 \
  --store=10.0.0.140:10901 \
  --store=10.0.0.141:10901 \
  --query.timeout=10m \
  --query.max-concurrent=200 \
  --query.max-concurrent-select=40 \
  --query.replica-label=replica
ExecReload=/bin/kill -HUP $MAINPID
TimeoutStopSec=20s
Restart=always
LimitNOFILE=20480000
[Install]
WantedBy=multi-user.target
# Parameter explanation
--http-address=0.0.0.0:10903   # the HTTP port Query itself listens on (UI and API)
--grpc-address=0.0.0.0:10904   # the gRPC endpoint other components call
--store=10.0.0.140:10901       # attach the sidecar gRPC endpoint on 10.0.0.140
--store=10.0.0.141:10901       # attach the sidecar gRPC endpoint on 10.0.0.141
--query.timeout=10m            # query timeout
--query.replica-label=replica  # deduplicate series across the replica label
[16:02:12 root@node1 ~]#systemctl daemon-reload
[16:02:18 root@node1 ~]#systemctl enable --now thanos-query.service
[16:02:28 root@node1 ~]#systemctl status thanos-query.service
Open it in a browser.
The UI looks very much like Prometheus, but what we are actually hitting is the Thanos Query address.
Both sidecars have been added, each tagged with its corresponding label.
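Thanos Query speaks the standard Prometheus HTTP query API, so deduplication can also be verified from the command line (a sketch; dedup is a Thanos-specific parameter, and jq is again assumed):
# with dedup=true the replica label is collapsed and each series shows up once
curl -s 'http://localhost:10903/api/v1/query?query=up&dedup=true' | jq '.data.result | length'
# with dedup=false you should see roughly twice as many series, one per replica
curl -s 'http://localhost:10903/api/v1/query?query=up&dedup=false' | jq '.data.result | length'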
4 Installing nginx for High Availability
The following steps are again performed on both node-1 and node-2.
[16:08:32 root@node1 ~]#apt install nginx
[16:10:40 root@node1 ~]#vim /etc/nginx/conf.d/thanos.conf
server {
    listen 80;
    server_name thanos.com;
    location / {
        # leave basic auth commented out for now
        # auth_basic "Please enter your password!";
        # auth_basic_user_file /etc/nginx/htpasswd;
        proxy_pass http://thanos;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
upstream thanos {
    server 10.0.0.140:10903;
    server 10.0.0.141:10903;
}
[16:15:05 root@node1 ~]#systemctl restart nginx.service
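Since thanos.com has no real DNS record yet, the proxy can be tested locally by forcing the Host header (a sketch; expect a 200, or a redirect into the Query UI):
# hit nginx on port 80 and let it forward to the upstream
curl -sI -H 'Host: thanos.com' http://127.0.0.1/ | head -n1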
5 Installing Grafana
1. Download the package
# Install dependencies
root@server:~# sudo apt-get install -y adduser libfontconfig1
# Download the deb package
root@server:~# wget https://dl.grafana.com/enterprise/release/grafana-enterprise_7.5.11_amd64.deb
# Install
root@server:~# sudo dpkg -i grafana-enterprise_7.5.11_amd64.deb
2. Configuration file
# Default location of the configuration file after installation.
# The most important section is [database]: for production you should switch the
# database to mysql or pgsql. Here I leave it unchanged.
root@server:~# vim /etc/grafana/grafana.ini
[database]
# You can configure the database connection by specifying type, host, name, user and password
# as separate properties or as on string using the url properties.
# Either "mysql", "postgres" or "sqlite3", it's your choice
;type = sqlite3 # sqlite3 is used by default
;host = 127.0.0.1:3306 # database address
;name = grafana # database name
;user = root # database user
# If the password contains # or ; you have to wrap it with triple quotes. Ex """#password;"""
;password = # database password
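For reference, pointing Grafana at PostgreSQL for production would mean uncommenting and filling in the section roughly like this (host, name, user, and password below are placeholder values of my own, not from the original setup):
[database]
# switch from the embedded sqlite3 to postgres
type = postgres
host = 127.0.0.1:5432
name = grafana
user = grafana
password = grafana-db-password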
3. Enable at boot and start the service
root@server:~# systemctl enable --now grafana-server.service
# Listens on port 3000 by default
root@server:~# ss -ntl | grep 3000
LISTEN 0 128 *:3000 *:*
The default username and password are both admin.
You are prompted to change the password on first login.
I set the new password to 666666.
5.1 Adding the thanos Data Source
Here we add the Thanos Query URL as the data source. To reach the nginx HA pair by name, I simply add entries to the hosts file on the grafana node; this is only to simulate it. In production you would still implement round-robin load balancing in DNS or an LB.
root@server:~#vim /etc/hosts
10.0.0.140 thanos.com
10.0.0.141 thanos.com
The data source is added against Query's port 10903.
After importing a dashboard template, we can see that data retrieval is working.
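As a final note, the data source does not have to be added by hand in the UI: Grafana can provision it from a YAML file on disk, which is easier to keep in version control. A minimal sketch (the filename is arbitrary; the directory is Grafana's default provisioning path):
# /etc/grafana/provisioning/datasources/thanos.yml
apiVersion: 1
datasources:
  - name: Thanos
    # Thanos Query is API-compatible with Prometheus, so the type is prometheus
    type: prometheus
    access: proxy
    # resolves through /etc/hosts to the nginx upstream in front of the two Query instances
    url: http://thanos.com
    isDefault: true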