Deploying Thanos
Here we need to deploy Prometheus, consul, minio, thanos, and grafana:
- Prometheus: cluster monitoring
- consul: service registry for Prometheus
- minio: S3-protocol object storage
- thanos: management layer across multiple Prometheus instances
- grafana: visualization
What makes Thanos harder to pick up is that, unlike Prometheus, it is not something you start from a single binary and immediately play with. It has a lot of components, and each one is started as a subcommand of the same binary, in the form `thanos <component> --<flag>=<value>`.
Real-world products package all of these pieces into images, so there is no way to peel back the shell layer by layer and see the substance. We will therefore work from the inside out and show how the Thanos components cooperate. We will go through roughly three stages:
- build the simplest possible cluster
- optimize querying
- persistent storage
Along the way we will also set up a few supporting clusters: the service registry (consul), the persistent store (minio), and the database (pgsql).
1 Lab Environment
In a real production environment, some of these components, such as object storage and the database, can simply be consumed as managed SaaS offerings from a public cloud. To make everything as transparent as possible, we will pretend we are in a private cloud with only three primitives: network, servers, and storage. All of it will be simulated on AWS public cloud.
The bare minimum for this lab is two servers, but some components need more: the service registry (consul) cluster needs three nodes and the object-storage (minio) cluster needs four, so I recommend using four nodes. If you really cannot spare that many, you can run multiple instances on a single machine, but in a production system they must be kept separate.
To mimic a real production environment, I use five machines here: one as the jump server and load balancer, and the other four as application nodes.
Node | Role |
---|---|
node1 | jumpserver/nginx |
node2 | Prometheus/grafana/alertmanager/thanos-sidecar/thanos-query/thanos-storage-gateway/thanos-frontend/thanos-rules/thanos-compact/minio/consul/postgresql |
node3 | Prometheus/grafana/alertmanager/thanos-sidecar/thanos-query/thanos-storage-gateway/thanos-frontend/thanos-rules/thanos-compact/minio/consul/postgresql |
node4 | Prometheus/grafana/alertmanager/thanos-sidecar/thanos-query/thanos-storage-gateway/thanos-frontend/thanos-rules/thanos-compact/minio/consul |
node5 | Prometheus/grafana/alertmanager/thanos-sidecar/thanos-query/thanos-storage-gateway/thanos-frontend/thanos-rules/thanos-compact/minio |
In production the overall architecture looks roughly like the figure below:
2 Simple Architecture Deployment
The classic stack: Prometheus + Grafana + Alertmanager + node_exporter. There is no need to belabor it; let's get straight to work.
2.1 Deploying Prometheus
I deploy two Prometheus servers here, one on node1 and one on node2.
2.1.1 Deployment on node1
[10:47:03 root@node1 ~]#mkdir /apps
[10:49:20 root@node1 ~]#cd /apps/
[10:53:07 root@node1 ~]#wget https://github.com/prometheus/prometheus/releases/download/v2.31.1/prometheus-2.31.1.linux-amd64.tar.gz
[11:01:31 root@node1 apps]#tar xf prometheus-2.31.1.linux-amd64.tar.gz
[11:01:40 root@node1 apps]#ln -sv /apps/prometheus-2.31.1.linux-amd64 /apps/prometheus
# Write the service file
[11:02:53 root@node1 apps]#vim /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus Server
Documentation=https://prometheus.io/docs/introduction/overview/
After=network.target
[Service]
Restart=on-failure
WorkingDirectory=/apps/prometheus/
ExecStart=/apps/prometheus/prometheus --config.file=/apps/prometheus/prometheus.yml
ExecReload=/bin/kill -HUP $MAINPID
[Install]
WantedBy=multi-user.target
# Enable at boot and start
[11:02:39 root@node1 apps]#systemctl daemon-reload
[11:02:44 root@node1 apps]#systemctl enable --now prometheus.service
Created symlink /etc/systemd/system/multi-user.target.wants/prometheus.service → /etc/systemd/system/prometheus.service.
[11:02:48 root@node1 apps]#systemctl status prometheus.service
● prometheus.service - Prometheus Server
Loaded: loaded (/etc/systemd/system/prometheus.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2023-04-03 11:02:48 CST; 4s ago
Docs: https://prometheus.io/docs/introduction/overview/
Main PID: 3185 (prometheus)
Tasks: 8 (limit: 2235)
Memory: 17.7M
CGroup: /system.slice/prometheus.service
└─3185 /apps/prometheus/prometheus --config.file=/apps/prometheus/prometheus.yml
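Before moving on, it is worth a quick check that Prometheus is actually answering. It exposes standard health endpoints on its HTTP port (a minimal sketch; the exact message text may differ between versions):
# liveness and readiness probes built into Prometheus
curl -s http://localhost:9090/-/healthy
curl -s http://localhost:9090/-/ready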
The node2 deployment follows the same steps as node1 above.
2.1.2 Modifying the Prometheus Configuration
node1 configuration:
[11:38:12 root@node1 prometheus]#vim prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
  # Add the two lines below. node1 is replica: A. This external label is what lets
  # Thanos tell the replicas' data apart; without it, the two copies of every series
  # would be indistinguishable when viewed in Grafana.
  external_labels:
    replica: A
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]
Reload the configuration:
[11:42:56 root@node1 prometheus]#systemctl reload prometheus.service
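To confirm the external label really took effect after the reload, the running configuration can be read back through the API (a quick sketch; the grep just narrows the output):
# status/config returns the currently loaded configuration file
curl -s http://localhost:9090/api/v1/status/config | grep -o 'replica: A'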
node2 configuration:
[11:38:12 root@node2 prometheus]#vim prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
  # Add the two lines below. node2 is replica: B. This external label is what lets
  # Thanos tell the replicas' data apart; without it, the two copies of every series
  # would be indistinguishable when viewed in Grafana.
  external_labels:
    replica: B
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]
Reload the configuration:
[11:42:09 root@node2 prometheus]#systemctl reload prometheus.service
2.2 Deploying node-exporter
Deploy on the node machines:
[14:16:27 root@node1 ~]#cd /apps/
[14:16:34 root@node1 apps]#wget https://github.com/prometheus/node_exporter/releases/download/v1.2.2/node_exporter-1.2.2.linux-amd64.tar.gz
[14:17:48 root@node1 apps]#tar xf node_exporter-1.2.2.linux-amd64.tar.gz
[14:17:51 root@node1 apps]#ln -sv /apps/node_exporter-1.2.2.linux-amd64 /apps/node_exporter
# Write the service file
[14:17:51 root@node1 apps]# vim /etc/systemd/system/node-exporter.service
[Unit]
Description=Prometheus Node Exporter
After=network.target
[Service]
ExecStart=/apps/node_exporter/node_exporter
[Install]
WantedBy=multi-user.target
# Start and verify
[14:18:50 root@node1 apps]#systemctl daemon-reload
[14:19:08 root@node1 apps]#systemctl enable --now node-exporter.service
Created symlink /etc/systemd/system/multi-user.target.wants/node-exporter.service → /etc/systemd/system/node-exporter.service.
[14:19:12 root@node1 apps]#systemctl status node-exporter.service
● node-exporter.service - Prometheus Node Exporter
Loaded: loaded (/etc/systemd/system/node-exporter.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2023-04-03 14:19:12 CST; 3s ago
Main PID: 1997 (node_exporter)
Tasks: 4 (limit: 2235)
Memory: 2.3M
CGroup: /system.slice/node-exporter.service
└─1997 /apps/node_exporter/node_exporter
The steps on node2 are the same as on node1.
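Before wiring the exporters into Prometheus, a quick check that they are serving metrics (a sketch; node_load1 is just one standard metric to grep for):
# node_exporter serves plain-text metrics on :9100
curl -s http://localhost:9100/metrics | grep '^node_load1'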
2.2.1 Updating the Prometheus Configuration to Scrape node-exporter
Changes on node-1:
# Add the following job under the scrape_configs: section
[14:25:28 root@node1 prometheus]#vim /apps/prometheus/prometheus.yml
  - job_name: "node"
    static_configs:
      - targets: ["10.0.0.140:9100","10.0.0.141:9100"]
# Check the configuration file
[14:25:10 root@node1 prometheus]#./promtool check config prometheus.yml
Checking prometheus.yml
SUCCESS: 0 rule files found
# Reload Prometheus
[14:25:17 root@node1 prometheus]#systemctl reload prometheus.service
The node-2 changes mirror node-1.
Verify in a browser:
node-1 :http://10.0.0.140:9090/targets
node-2 :http://10.0.0.141:9090/targets
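The same check also works from the command line against the targets API (a sketch, assuming jq is installed):
# print each active target and its health
curl -s http://localhost:9090/api/v1/targets | jq -r '.data.activeTargets[] | "\(.labels.instance) \(.health)"'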
At this point the two Prometheus servers scrape identical data. But if one of them goes down and is later restored, users querying that instance will see a gap for the downtime: its data is simply incomplete in the middle. This is the problem that motivates Thanos: even after an instance crashes and recovers, the data it serves can still be made whole.
3 Thanos Basic Components
We first use Thanos to solve this Prometheus query-gap problem. Just two components are enough to form a minimal cluster: Thanos Query and Thanos Sidecar. The sidecar abstracts the data layer, and Query queries across that abstraction, providing the unified query interface.
Download from GitHub:
https://github.com/thanos-io/thanos/releases
3.1 Deploying Thanos-sidecar
As the name suggests, it is a sidecar for Prometheus: on bare metal it runs on the same host as Prometheus, and in Kubernetes it runs in the same pod as Prometheus.
Here we deploy it on the nodes first.
The following steps are again performed on both node-1 and node-2.
[15:06:17 root@node1 apps]#wget https://github.com/thanos-io/thanos/releases/download/v0.26.0/thanos-0.26.0.linux-amd64.tar.gz
[15:06:32 root@node1 apps]#tar xf thanos-0.26.0.linux-amd64.tar.gz
[15:06:53 root@node1 apps]#ln -sv /apps/thanos-0.26.0.linux-amd64 /apps/thanos
# The tarball contains just a single binary
[15:07:38 root@node1 apps]#ls thanos
thanos
[15:08:29 root@node2 apps]#mv thanos/thanos /usr/local/sbin/
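A quick sanity check that the binary is in place and runnable (output will vary with the version you downloaded):
# print version and build information
thanos --version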
3.1.1 Write the service file
[15:21:15 root@node1 ~]#vim /etc/systemd/system/thanos-sidecar.service
[Unit]
Description=thanos-sidecar
Documentation=https://thanos.io/
After=network.target
# --tsdb.path=/apps/prometheus            path to Prometheus's local TSDB
# --prometheus.url=http://localhost:9090  URL of the local Prometheus
[Service]
Type=simple
ExecStart=/usr/local/sbin/thanos sidecar \
  --tsdb.path=/apps/prometheus \
  --prometheus.url=http://localhost:9090 \
  --grpc-address=0.0.0.0:10901 \
  --http-address=0.0.0.0:10902
ExecReload=/bin/kill -HUP $MAINPID
TimeoutStopSec=20s
Restart=always
[Install]
WantedBy=multi-user.target
# Reload systemd and start
[15:22:06 root@node1 ~]#systemctl daemon-reload
[15:22:13 root@node1 ~]#systemctl enable --now thanos-sidecar.service
[15:22:22 root@node1 ~]#systemctl status thanos-sidecar.service
● thanos-sidecar.service - thanos-sidecar
Loaded: loaded (/etc/systemd/system/thanos-sidecar.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2023-04-03 15:22:22 CST; 4s ago
Main PID: 4290 (thanos)
Tasks: 7 (limit: 2235)
Memory: 11.5M
CGroup: /system.slice/thanos-sidecar.service
└─4290 /usr/local/sbin/thanos sidecar
Apr 03 15:22:22 node1 thanos[4290]: level=info ts=2023-04-03T07:22:22.281242832Z caller=sidecar.go:201 msg="successfully loaded prometheus external labels" external_labels="{replica=\"A\"}"
# Note external_labels="{replica=\"A\"}" in the log: this is the label we defined in the
# Prometheus configuration to tell the replicas apart, which is what makes the
# Prometheus pair work as a cluster.
# The ports are listening
[15:22:49 root@node1 ~]#ss -ntl | grep 1090*
LISTEN 0 4096 *:10901 *:*
LISTEN 0 4096 *:10902 *:*
node-1:
{replica=\"A\"}
node-2:
{replica=\"B\"}
3.2 Deploying Thanos-query
Query is the component that queries every available data endpoint, such as sidecars or the storage gateway. We are not remote-writing data anywhere yet, so we only need to query the sidecars.
In other words, Query queries the Prometheus instances on both machines at once, abstracting the individual Prometheus servers away.
The following steps are again performed on both node-1 and node-2.
1. Write the service file
[15:59:29 root@node1 ~]#vim /etc/systemd/system/thanos-query.service
[Unit]
Description=thanos-query
Documentation=https://thanos.io/
After=network.target
[Service]
Type=simple
ExecStart=/usr/local/sbin/thanos query \
  --http-address=0.0.0.0:10903 \
  --grpc-address=0.0.0.0:10904 \
  --store=10.0.0.140:10901 \
  --store=10.0.0.141:10901 \
  --query.timeout=10m \
  --query.max-concurrent=200 \
  --query.max-concurrent-select=40 \
  --query.replica-label=replica
ExecReload=/bin/kill -HUP $MAINPID
TimeoutStopSec=20s
Restart=always
LimitNOFILE=20480000
[Install]
WantedBy=multi-user.target
# Parameter explanation
--http-address=0.0.0.0:10903   # the HTTP port Query itself listens on (UI and API)
--grpc-address=0.0.0.0:10904   # the gRPC endpoint other components call
--store=10.0.0.140:10901       # attach the sidecar gRPC endpoint on 10.0.0.140
--store=10.0.0.141:10901       # attach the sidecar gRPC endpoint on 10.0.0.141
--query.timeout=10m            # query timeout
--query.replica-label=replica  # deduplicate series across the replica label
[16:02:12 root@node1 ~]#systemctl daemon-reload
[16:02:18 root@node1 ~]#systemctl enable --now thanos-query.service
[16:02:28 root@node1 ~]#systemctl status thanos-query.service
Open it in a browser.
The UI looks very much like Prometheus, but what we are actually hitting is the Thanos Query address.
Both sidecars have been added, each tagged with its corresponding label.
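Thanos Query speaks the standard Prometheus HTTP query API, so deduplication can also be verified from the command line (a sketch; dedup is a Thanos-specific parameter, and jq is again assumed):
# with dedup=true the replica label is collapsed and each series shows up once
curl -s 'http://localhost:10903/api/v1/query?query=up&dedup=true' | jq '.data.result | length'
# with dedup=false you should see roughly twice as many series, one per replica
curl -s 'http://localhost:10903/api/v1/query?query=up&dedup=false' | jq '.data.result | length'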
4 Installing nginx for High Availability
The following steps are again performed on both node-1 and node-2.
[16:08:32 root@node1 ~]#apt install nginx
[16:10:40 root@node1 ~]#vim /etc/nginx/conf.d/thanos.conf
server {
    listen 80;
    server_name thanos.com;
    location / {
        # leave basic auth commented out for now
        # auth_basic "Please enter your password!";
        # auth_basic_user_file /etc/nginx/htpasswd;
        proxy_pass http://thanos;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
upstream thanos {
    server 10.0.0.140:10903;
    server 10.0.0.141:10903;
}
[16:15:05 root@node1 ~]#systemctl restart nginx.service
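Since thanos.com has no real DNS record yet, the proxy can be tested locally by forcing the Host header (a sketch; expect a 200, or a redirect into the Query UI):
# hit nginx on port 80 and let it forward to the upstream
curl -sI -H 'Host: thanos.com' http://127.0.0.1/ | head -n1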
5 Installing Grafana
1. Download the package
# Install dependencies
root@server:~# sudo apt-get install -y adduser libfontconfig1
# Download the deb package
root@server:~# wget https://dl.grafana.com/enterprise/release/grafana-enterprise_7.5.11_amd64.deb
# Install
root@server:~# sudo dpkg -i grafana-enterprise_7.5.11_amd64.deb
2. Configuration file
# Default location of the configuration file after installation.
# The most important section is [database]: for production you should switch the
# database to mysql or pgsql. Here I leave it unchanged.
root@server:~# vim /etc/grafana/grafana.ini
[database]
# You can configure the database connection by specifying type, host, name, user and password
# as separate properties or as on string using the url properties.
# Either "mysql", "postgres" or "sqlite3", it's your choice
;type = sqlite3 # sqlite3 is used by default
;host = 127.0.0.1:3306 # database address
;name = grafana # database name
;user = root # database user
# If the password contains # or ; you have to wrap it with triple quotes. Ex """#password;"""
;password = # database password
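For reference, pointing Grafana at PostgreSQL for production would mean uncommenting and filling in the section roughly like this (host, name, user, and password below are placeholder values of my own, not from the original setup):
[database]
# switch from the embedded sqlite3 to postgres
type = postgres
host = 127.0.0.1:5432
name = grafana
user = grafana
password = grafana-db-password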
3. Enable at boot and start the service
root@server:~# systemctl enable --now grafana-server.service
# Listens on port 3000 by default
root@server:~# ss -ntl | grep 3000
LISTEN 0 128 *:3000 *:*
The default username and password are both admin.
You are prompted to change the password on first login.
I set the new password to 666666.
5.1 Adding the thanos Data Source
Here we add the Thanos Query URL as the data source. To reach the nginx HA pair by name, I simply add entries to the hosts file on the grafana node; this is only to simulate it. In production you would still implement round-robin load balancing in DNS or an LB.
root@server:~#vim /etc/hosts
10.0.0.140 thanos.com
10.0.0.141 thanos.com
The data source is added against Query's port 10903.
After importing a dashboard template, we can see that data retrieval is working.
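As a final note, the data source does not have to be added by hand in the UI: Grafana can provision it from a YAML file on disk, which is easier to keep in version control. A minimal sketch (the filename is arbitrary; the directory is Grafana's default provisioning path):
# /etc/grafana/provisioning/datasources/thanos.yml
apiVersion: 1
datasources:
  - name: Thanos
    # Thanos Query is API-compatible with Prometheus, so the type is prometheus
    type: prometheus
    access: proxy
    # resolves through /etc/hosts to the nginx upstream in front of the two Query instances
    url: http://thanos.com
    isDefault: true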