# Operations Manual

## Version History

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0 | 2024-01-20 | Operations team | Initial version |

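## 1. Operations Overview

### 1.1 Operations Goals

Keep the livestock farming management platform running stably 24/7, and deliver a highly available, high-performance, secure, and reliable service.

### 1.2 Operations Responsibilities

- **System monitoring**: monitor system health in real time
- **Incident handling**: respond to and resolve system failures quickly
- **Performance optimization**: continuously tune system performance
- **Security management**: maintain the system's security protections
- **Backup and recovery**: keep data safe and recoverable
- **Capacity planning**: forecast and plan for capacity needs
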
### 1.3 Service Level Agreement (SLA)

| Metric | Target | Notes |
|--------|--------|-------|
| System availability | 99.9% | No more than 8.76 hours of downtime per year |
| Response time | < 500 ms | Average API response time |
| Recovery time | < 30 minutes | From failure onset to service restoration |
| Data backup | Daily | Backups retained for 30 days |
| Security incident response | < 15 minutes | Time to respond to a security incident |

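
The downtime figure in the availability row follows from simple arithmetic: a year has 8,760 hours, and 0.1% of that is 8.76 hours. A minimal sketch of the calculation, assuming a 365-day year; the function name `downtime_budget_hours` is illustrative and not part of the platform's tooling:

```bash
#!/bin/bash
# Sketch: convert an availability target (percent) into a yearly downtime budget.
# Assumes a 365-day year; downtime_budget_hours is a hypothetical helper name.
downtime_budget_hours() {
    local availability=$1
    awk -v a="$availability" 'BEGIN { printf "%.2f\n", (100 - a) / 100 * 365 * 24 }'
}

downtime_budget_hours 99.9    # the SLA target from the table
downtime_budget_hours 99.99   # a stricter target, for comparison
```

The same helper can be used to see how quickly the budget shrinks when a stricter target is proposed.
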
## 2. System Architecture Monitoring

### 2.1 Monitoring Architecture Diagram

```mermaid
graph TB
    subgraph "Metrics collection"
        A[Node Exporter] --> P[Prometheus]
        B[MySQL Exporter] --> P
        C[Redis Exporter] --> P
        D[Nginx Exporter] --> P
        E[Application Metrics] --> P
    end

    subgraph "Alerting"
        P --> AM[AlertManager]
        AM --> DT[DingTalk notifications]
        AM --> WX[WeCom]
        AM --> SMS[SMS alerts]
        AM --> EMAIL[Email alerts]
    end

    subgraph "Visualization"
        P --> G[Grafana]
        G --> DB[Dashboard]
    end

    subgraph "Logging"
        F[Filebeat] --> L[Logstash]
        L --> ES[Elasticsearch]
        ES --> K[Kibana]
    end
```

### 2.2 Monitoring Metrics

#### 2.2.1 Infrastructure Monitoring

```yaml
# prometheus/rules/infrastructure.yml
groups:
  - name: infrastructure
    rules:
      # CPU usage alert
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "Instance {{ $labels.instance }} CPU usage is {{ $value }}%"

      # Memory usage alert
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Instance {{ $labels.instance }} memory usage is {{ $value }}%"

      # Disk usage alert
      # tmpfs and overlay mounts are excluded so pseudo filesystems do not trigger it
      - alert: HighDiskUsage
        expr: (1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"})) * 100 > 85
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High disk usage"
          description: "Instance {{ $labels.instance }} mount {{ $labels.mountpoint }} disk usage is {{ $value }}%"

      # Disk I/O alert
      - alert: HighDiskIO
        expr: irate(node_disk_io_time_seconds_total[5m]) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High disk I/O utilization"
          description: "Instance {{ $labels.instance }} device {{ $labels.device }} I/O utilization is {{ $value }}%"
```

#### 2.2.2 Application Service Monitoring

```yaml
# prometheus/rules/application.yml
groups:
  - name: application
    rules:
      # API response time alert
      # Buckets are summed by "le" so the quantile covers all instances and routes
      - alert: HighAPIResponseTime
        expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow API responses"
          description: "API 95th percentile response time is {{ $value }} seconds"

      # API error rate alert
      # Both sides are aggregated with sum(); dividing the raw vectors would only
      # match series with identical labels and always yield 1 for 5xx series
      - alert: HighAPIErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High API error rate"
          description: "API error rate is {{ $value | humanizePercentage }}"

      # Service instance down alert
      - alert: ServiceInstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service instance down"
          description: "Instance {{ $labels.instance }} is down"

      # Database connection alert
      - alert: HighDatabaseConnections
        expr: mysql_global_status_threads_connected / mysql_global_variables_max_connections > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High database connection count"
          description: "Database connection usage is {{ $value | humanizePercentage }}"
```

#### 2.2.3 Business Metrics Monitoring

```yaml
# prometheus/rules/business.yml
groups:
  - name: business
    rules:
      # Abnormal user registration alert
      # rate() is per second: 0.1/s is roughly 360 registrations per hour
      - alert: LowUserRegistration
        expr: rate(user_registrations_total[1h]) < 0.1
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Abnormal user registration volume"
          description: "Registration rate over the past hour is {{ $value }} per second"

      # Transaction failure rate alert
      # sum() on both sides so the division is not restricted to identical label sets
      - alert: HighTransactionFailureRate
        expr: sum(rate(transactions_total{status="failed"}[5m])) / sum(rate(transactions_total[5m])) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High transaction failure rate"
          description: "Transaction failure rate is {{ $value | humanizePercentage }}"

      # Payment failure alert
      # This is an absolute rate (failures per second), not a ratio
      - alert: PaymentAbnormal
        expr: rate(payments_total{status="failed"}[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Abnormal payment failures"
          description: "Failed payments per second: {{ $value }}"
```

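Rule files like the three in this section can be checked for syntax errors before Prometheus loads them. A minimal sketch, assuming `promtool` (shipped with Prometheus) is on the `PATH` and that the files live under `prometheus/rules/` as the path comments suggest; the function name `validate_rule_files` is illustrative:

```bash
#!/bin/bash
# Sketch: validate every rule file in a directory with "promtool check rules".
# Returns non-zero if any file fails validation.
validate_rule_files() {
    local rules_dir=$1 failed=0 file
    for file in "$rules_dir"/*.yml; do
        [ -e "$file" ] || continue          # directory has no .yml files
        if promtool check rules "$file"; then
            echo "OK: $file"
        else
            echo "INVALID: $file"
            failed=1
        fi
    done
    return "$failed"
}

# Example (not executed here): validate_rule_files prometheus/rules
```

Running a check like this in CI or before a reload keeps a malformed rule file from reaching the running server.
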
### 2.3 Grafana Dashboard Configuration

#### 2.3.1 System Overview Dashboard

```json
{
  "dashboard": {
    "title": "System Overview",
    "panels": [
      {
        "title": "System load",
        "type": "stat",
        "targets": [
          {
            "expr": "avg(node_load1)",
            "legendFormat": "1-minute load"
          }
        ]
      },
      {
        "title": "CPU usage",
        "type": "graph",
        "targets": [
          {
            "expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "legendFormat": "{{ instance }}"
          }
        ]
      },
      {
        "title": "Memory usage",
        "type": "graph",
        "targets": [
          {
            "expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100",
            "legendFormat": "{{ instance }}"
          }
        ]
      },
      {
        "title": "Network traffic",
        "type": "graph",
        "targets": [
          {
            "expr": "irate(node_network_receive_bytes_total[5m])",
            "legendFormat": "Receive - {{ instance }}"
          },
          {
            "expr": "irate(node_network_transmit_bytes_total[5m])",
            "legendFormat": "Transmit - {{ instance }}"
          }
        ]
      }
    ]
  }
}
```

#### 2.3.2 Application Performance Dashboard

```json
{
  "dashboard": {
    "title": "Application Performance",
    "panels": [
      {
        "title": "API request rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])",
            "legendFormat": "{{ method }} {{ path }}"
          }
        ]
      },
      {
        "title": "API response time",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.50, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))",
            "legendFormat": "50th percentile"
          },
          {
            "expr": "histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))",
            "legendFormat": "95th percentile"
          },
          {
            "expr": "histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))",
            "legendFormat": "99th percentile"
          }
        ]
      },
      {
        "title": "Error responses per second",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"4..\"}[5m]))",
            "legendFormat": "4xx errors"
          },
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m]))",
            "legendFormat": "5xx errors"
          }
        ]
      }
    ]
  }
}
```

## 3. Routine Operations

### 3.1 Daily Checklist

```bash
#!/bin/bash
# daily-check.sh - daily health check script

LOG_FILE="/var/log/daily-check.log"
DATE=$(date '+%Y-%m-%d %H:%M:%S')

echo "=== Daily check started $DATE ===" | tee -a "$LOG_FILE"

# 1. System resources
echo "1. System resource check" | tee -a "$LOG_FILE"
echo "CPU load: $(uptime | awk -F'load average:' '{print $2}')" | tee -a "$LOG_FILE"
echo "Memory usage: $(free -h | grep Mem | awk '{print $3"/"$2}')" | tee -a "$LOG_FILE"
echo "Disk usage: $(df -h / | tail -1 | awk '{print $5}')" | tee -a "$LOG_FILE"

# 2. Service status
echo "2. Service status check" | tee -a "$LOG_FILE"
services=("mysql-master" "redis-master" "mongodb" "backend-api-1" "backend-api-2" "nginx")
for service in "${services[@]}"; do
    if docker ps --format "{{.Names}}" | grep -q "^${service}$"; then
        echo "✅ $service is running" | tee -a "$LOG_FILE"
    else
        echo "❌ $service is not running" | tee -a "$LOG_FILE"
    fi
done

# 3. Network connections
# Match the local address column exactly so :8080 is not counted as :80
echo "3. Network connection check" | tee -a "$LOG_FILE"
echo "HTTP connections: $(netstat -ant | awk '$4 ~ /:80$/ && $6 == "ESTABLISHED"' | wc -l)" | tee -a "$LOG_FILE"
echo "HTTPS connections: $(netstat -ant | awk '$4 ~ /:443$/ && $6 == "ESTABLISHED"' | wc -l)" | tee -a "$LOG_FILE"

# 4. Database status
echo "4. Database status check" | tee -a "$LOG_FILE"
mysql_connections=$(docker exec mysql-master mysql -u root -p"${MYSQL_ROOT_PASSWORD}" -e "SHOW STATUS LIKE 'Threads_connected';" | tail -1 | awk '{print $2}')
echo "MySQL connections: $mysql_connections" | tee -a "$LOG_FILE"

# redis-cli output ends lines with \r, which is stripped here
redis_connections=$(docker exec redis-master redis-cli info clients | grep connected_clients | cut -d: -f2 | tr -d '\r')
echo "Redis connections: $redis_connections" | tee -a "$LOG_FILE"

# 5. Errors in logs
echo "5. Log error check" | tee -a "$LOG_FILE"
error_count=$(docker logs backend-api-1 --since="24h" 2>&1 | grep -ci error)
echo "Backend error log lines: $error_count" | tee -a "$LOG_FILE"

# 6. Backup status
echo "6. Backup status check" | tee -a "$LOG_FILE"
backup_today=$(ls /backup/ | grep -c "$(date +%Y%m%d)")
echo "Backup files created today: $backup_today" | tee -a "$LOG_FILE"

echo "=== Daily check finished ===" | tee -a "$LOG_FILE"
```

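The daily check only logs its findings, so a scheduler such as cron cannot tell whether anything failed. One possible refinement, sketched under the assumption that each check can be expressed as a command whose exit status signals success; `run_check` and `FAILED_CHECKS` are hypothetical names, and the two example probes are placeholders:

```bash
#!/bin/bash
# Sketch: count failed checks so a wrapper script can exit non-zero.
FAILED_CHECKS=0

run_check() {
    local name=$1
    shift
    # Run the remaining arguments as the probe command, discarding its output.
    if "$@" >/dev/null 2>&1; then
        echo "✅ $name"
    else
        echo "❌ $name"
        FAILED_CHECKS=$((FAILED_CHECKS + 1))
    fi
}

# Placeholder probes; real ones would test containers, ports, or backups.
run_check "root directory exists" test -d /
run_check "marker file exists" test -e /nonexistent/marker

echo "Failed checks: $FAILED_CHECKS"
# A real script would typically end with: exit "$FAILED_CHECKS"
```

With a non-zero exit status, cron mail or a wrapper can raise an alert without parsing the log file.
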
### 3.2 Performance Optimization

#### 3.2.1 Database Performance Optimization

```sql
-- MySQL performance queries
-- 1. Slow queries from the last day (requires log_output = 'TABLE')
SELECT * FROM mysql.slow_log WHERE start_time > DATE_SUB(NOW(), INTERVAL 1 DAY);

-- 2. Running threads; the State column shows sessions waiting on locks
SHOW PROCESSLIST;

-- 3. Index definitions and cardinality
SELECT
    table_schema,
    table_name,
    index_name,
    cardinality,
    sub_part,
    packed,
    nullable,
    index_type
FROM information_schema.statistics
WHERE table_schema = 'xlxumu_db';

-- 4. Table sizes
SELECT
    table_name,
    ROUND(((data_length + index_length) / 1024 / 1024), 2) AS 'Size (MB)'
FROM information_schema.tables
WHERE table_schema = 'xlxumu_db'
ORDER BY (data_length + index_length) DESC;
```

```bash
#!/bin/bash
# mysql-optimize.sh - MySQL optimization script
# OPTIMIZE TABLE rebuilds InnoDB tables, so run this during off-peak hours.

# 1. Analyze tables
echo "Analyzing tables..."
docker exec mysql-master mysql -u root -p"${MYSQL_ROOT_PASSWORD}" xlxumu_db -e "
ANALYZE TABLE users, farms, animals, transactions;
"

# 2. Optimize tables
echo "Optimizing tables..."
docker exec mysql-master mysql -u root -p"${MYSQL_ROOT_PASSWORD}" xlxumu_db -e "
OPTIMIZE TABLE users, farms, animals, transactions;
"

# 3. Check tables
echo "Checking table integrity..."
docker exec mysql-master mysql -u root -p"${MYSQL_ROOT_PASSWORD}" xlxumu_db -e "
CHECK TABLE users, farms, animals, transactions;
"

echo "MySQL optimization complete"
```

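The scripts in this document pass the root password with `-p` on the command line, which exposes it in the process list. A common alternative is a client option file such as `~/.my.cnf`, which the `mysql` client reads automatically. A minimal sketch, assuming the password is available in `MYSQL_ROOT_PASSWORD` as elsewhere in this document and contains no double quotes; the function name and file locations are hypothetical:

```bash
#!/bin/bash
# Sketch: write a MySQL client option file that only its owner can read.
write_mysql_client_cnf() {
    local path=$1 password=$2
    # umask in a subshell so new files are created with mode 600
    ( umask 077 && printf '[client]\nuser=root\npassword="%s"\n' "$password" > "$path" )
    chmod 600 "$path"   # also tighten a file that already existed
}

# Example (not executed here; paths are illustrative):
#   write_mysql_client_cnf /root/xlxumu-client.cnf "$MYSQL_ROOT_PASSWORD"
#   docker cp /root/xlxumu-client.cnf mysql-master:/root/.my.cnf
#   docker exec mysql-master mysql -e "SHOW STATUS LIKE 'Threads_connected';"
```

Whether this fits depends on how the container is provisioned; it is shown only to illustrate keeping the password out of command-line arguments.
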
#### 3.2.2 Redis Performance Optimization

```bash
#!/bin/bash
# redis-optimize.sh - Redis optimization script

# 1. Check Redis memory usage
echo "Redis memory usage:"
docker exec redis-master redis-cli info memory

# 2. Check slow queries
echo "Redis slow log:"
docker exec redis-master redis-cli slowlog get 10

# 3. Sample key TTLs
# Redis removes expired keys on its own. This step only reports TTLs
# (-1 means no expiry) for the first 100 keys, so keys without an expiry can be spotted.
echo "Sampling key TTLs..."
docker exec redis-master redis-cli --scan --pattern "*" | head -100 | while read -r key; do
    echo "$key: $(docker exec redis-master redis-cli ttl "$key" < /dev/null)"
done

# 4. Check for big keys
echo "Checking for big keys..."
docker exec redis-master redis-cli --bigkeys

echo "Redis check complete"
```

### 3.3 Log Management

#### 3.3.1 Log Rotation Configuration

```bash
# /etc/logrotate.d/xlxumu
/var/log/xlxumu/*.log {
    daily
    missingok
    rotate 30
    compress
    delaycompress
    notifempty
    create 644 root root
    postrotate
        docker kill -s USR1 $(docker ps -q --filter name=backend-api)
    endscript
}

/var/log/nginx/*.log {
    daily
    missingok
    rotate 30
    compress
    delaycompress
    notifempty
    create 644 nginx nginx
    postrotate
        docker exec nginx-lb nginx -s reopen
    endscript
}
```

#### 3.3.2 Log Analysis Script

```bash
#!/bin/bash
# log-analysis.sh - log analysis script

LOG_DIR="/var/log/xlxumu"
ACCESS_LOG="/var/log/nginx/access.log"
REPORT_FILE="/tmp/log-report-$(date +%Y%m%d).txt"
# Nginx writes English month names; force the C locale so the date pattern matches
TODAY=$(LC_ALL=C date +%d/%b/%Y)

echo "=== Log analysis report $(date) ===" > "$REPORT_FILE"

# 1. Error log count
echo "1. Error log entries" >> "$REPORT_FILE"
grep -i error "$LOG_DIR"/*.log | wc -l >> "$REPORT_FILE"

# 2. Request count
echo "2. Requests today" >> "$REPORT_FILE"
grep -c "$TODAY" "$ACCESS_LOG" >> "$REPORT_FILE"

# 3. Status codes
# Filter by date first, then extract the status field; the reverse order
# would drop the date before grep can match it
echo "3. HTTP status codes" >> "$REPORT_FILE"
grep "$TODAY" "$ACCESS_LOG" | awk '{print $9}' | sort | uniq -c | sort -nr >> "$REPORT_FILE"

# 4. Slow requests (assumes the request time is the last field of the log format)
echo "4. Slow requests (>1s)" >> "$REPORT_FILE"
grep "$TODAY" "$ACCESS_LOG" | awk '$NF > 1.0' | wc -l >> "$REPORT_FILE"

# 5. Most requested APIs
echo "5. Top APIs" >> "$REPORT_FILE"
grep "$TODAY" "$ACCESS_LOG" | awk '{print $7}' | grep "/api/" | sort | uniq -c | sort -nr | head -10 >> "$REPORT_FILE"

echo "Log analysis complete, report saved to: $REPORT_FILE"
```

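Per-status counts are often easier to read when grouped by class (2xx, 4xx, 5xx). A minimal sketch, assuming the default Nginx `combined` log format, where the status code is the ninth whitespace-separated field; `status_class_summary` is an illustrative name:

```bash
#!/bin/bash
# Sketch: count requests per HTTP status class in an access log.
status_class_summary() {
    local log_file=$1
    # Only three-digit codes 100-599 are counted; malformed lines are skipped.
    awk '$9 ~ /^[1-5][0-9][0-9]$/ { classes[substr($9, 1, 1) "xx"]++ }
         END { for (c in classes) print c, classes[c] }' "$log_file" | sort
}

# Example (not executed here): status_class_summary /var/log/nginx/access.log
```

If the platform uses a custom `log_format`, the field number would need to change accordingly.
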
## 4. Backup and Recovery

### 4.1 Automated Backup Strategy

```bash
#!/bin/bash
# backup-system.sh - system backup script

BACKUP_DIR="/backup"
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_PATH="$BACKUP_DIR/xlxumu_$DATE"
RETENTION_DAYS=30

# Create the backup directory
mkdir -p "$BACKUP_PATH"

echo "Starting system backup: $DATE"

# 1. Back up MySQL
echo "Backing up MySQL..."
docker exec mysql-master mysqldump -u root -p"${MYSQL_ROOT_PASSWORD}" \
    --single-transaction \
    --routines \
    --triggers \
    --all-databases > "$BACKUP_PATH/mysql_backup.sql"

if [ $? -eq 0 ]; then
    echo "✅ MySQL backup succeeded"
else
    echo "❌ MySQL backup failed"
    exit 1
fi

# 2. Back up Redis
# redis-cli runs inside the container, so write the snapshot to a container
# path first and then copy it out to the host
echo "Backing up Redis..."
docker exec redis-master redis-cli --rdb /tmp/redis_backup.rdb \
    && docker cp redis-master:/tmp/redis_backup.rdb "$BACKUP_PATH/redis_backup.rdb"

if [ $? -eq 0 ]; then
    echo "✅ Redis backup succeeded"
else
    echo "❌ Redis backup failed"
fi

# 3. Back up MongoDB
# mongodump also writes inside the container; copy the dump to the host afterwards
echo "Backing up MongoDB..."
docker exec mongodb mongodump --out /tmp/mongodb_backup \
    && docker cp mongodb:/tmp/mongodb_backup "$BACKUP_PATH/mongodb_backup"

if [ $? -eq 0 ]; then
    echo "✅ MongoDB backup succeeded"
else
    echo "❌ MongoDB backup failed"
fi

# 4. Back up application configuration
echo "Backing up application configuration..."
cp -r ./config "$BACKUP_PATH/"
cp -r ./nginx "$BACKUP_PATH/"
cp .env.production "$BACKUP_PATH/"

# 5. Back up uploaded files
echo "Backing up uploaded files..."
if [ -d "./uploads" ]; then
    tar -czf "$BACKUP_PATH/uploads.tar.gz" ./uploads
fi

# 6. Compress the backup
echo "Compressing backup..."
cd "$BACKUP_DIR" || exit 1
tar -czf "xlxumu_$DATE.tar.gz" "xlxumu_$DATE/"
rm -rf "xlxumu_$DATE/"

# 7. Remove expired backups
echo "Removing expired backups..."
find "$BACKUP_DIR" -name "xlxumu_*.tar.gz" -mtime +$RETENTION_DAYS -delete

# 8. Upload to cloud storage (optional; uncomment to enable)
echo "Uploading backup to cloud storage..."
# aws s3 cp xlxumu_$DATE.tar.gz s3://your-backup-bucket/

echo "System backup complete: xlxumu_$DATE.tar.gz"

# 9. Send backup notification
curl -X POST "https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN" \
    -H 'Content-Type: application/json' \
    -d "{\"msgtype\": \"text\",\"text\": {\"content\": \"System backup complete: xlxumu_$DATE.tar.gz\"}}"
```

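A backup is only useful if the archive can be read back. A minimal sketch of a post-backup verification step, assuming `tar` with gzip support and `sha256sum` are available; `verify_backup` is a hypothetical helper and not part of the backup script:

```bash
#!/bin/bash
# Sketch: verify that a backup archive is readable and record its checksum.
verify_backup() {
    local archive=$1
    # Listing the archive forces tar and gzip to read it end to end.
    if ! tar -tzf "$archive" >/dev/null 2>&1; then
        echo "corrupt: $archive"
        return 1
    fi
    # Store the checksum next to the archive for later comparison.
    sha256sum "$archive" > "$archive.sha256"
    echo "verified: $archive"
}

# Example (not executed here; the file name is illustrative):
#   verify_backup /backup/xlxumu_20240120_020000.tar.gz
```

Before a restore, `sha256sum -c` on the stored `.sha256` file can confirm that the archive has not changed since it was written.
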
### 4.2 Data Recovery Procedure

```bash
#!/bin/bash
# restore-system.sh - system restore script
# Run from the project directory (where the docker-compose files live).

BACKUP_FILE=$1
BACKUP_DIR="/backup"

if [ -z "$BACKUP_FILE" ]; then
    echo "Usage: $0 <backup_file>"
    echo "Available backups:"
    ls -la "$BACKUP_DIR"/xlxumu_*.tar.gz
    exit 1
fi

# Accept either a full path or a file name inside $BACKUP_DIR
[ -f "$BACKUP_FILE" ] || BACKUP_FILE="$BACKUP_DIR/$BACKUP_FILE"

echo "Starting system restore: $BACKUP_FILE"

# 1. Extract the backup
# -C extracts into $BACKUP_DIR without changing directory, so the
# docker-compose and cp commands below still resolve their relative paths
echo "Extracting backup..."
tar -xzf "$BACKUP_FILE" -C "$BACKUP_DIR" || exit 1

BACKUP_NAME=$(basename "$BACKUP_FILE" .tar.gz)
RESTORE_PATH="$BACKUP_DIR/$BACKUP_NAME"

# 2. Stop services
# "stop" rather than "down": the containers must still exist for the
# docker cp and docker start steps below
echo "Stopping services..."
docker-compose stop

# 3. Restore MySQL
echo "Restoring MySQL..."
docker-compose -f docker-compose.mysql.yml up -d mysql-master
sleep 30

docker exec -i mysql-master mysql -u root -p"${MYSQL_ROOT_PASSWORD}" < "$RESTORE_PATH/mysql_backup.sql"

if [ $? -eq 0 ]; then
    echo "✅ MySQL restore succeeded"
else
    echo "❌ MySQL restore failed"
    exit 1
fi

# 4. Restore Redis
# Copy the snapshot while Redis is stopped so it is loaded on startup
# (if AOF is enabled, Redis loads the AOF instead of dump.rdb)
echo "Restoring Redis..."
docker cp "$RESTORE_PATH/redis_backup.rdb" redis-master:/data/dump.rdb
docker start redis-master

# 5. Restore MongoDB
echo "Restoring MongoDB..."
docker start mongodb
sleep 10
docker cp "$RESTORE_PATH/mongodb_backup" mongodb:/tmp/mongodb_backup
docker exec mongodb mongorestore /tmp/mongodb_backup

# 6. Restore application configuration
echo "Restoring application configuration..."
cp -r "$RESTORE_PATH/config" ./
cp -r "$RESTORE_PATH/nginx" ./
cp "$RESTORE_PATH/.env.production" ./

# 7. Restore uploaded files
echo "Restoring uploaded files..."
if [ -f "$RESTORE_PATH/uploads.tar.gz" ]; then
    tar -xzf "$RESTORE_PATH/uploads.tar.gz"
fi

# 8. Restart services
echo "Restarting services..."
docker-compose up -d

# 9. Health check
echo "Running health check..."
sleep 60
./scripts/health-check.sh

echo "System restore complete"
```

## 5. Incident Handling

### 5.1 Incident Response Workflow

```mermaid
graph TD
    A[Failure occurs] --> B[Monitoring system raises alert]
    B --> C[Operations staff receive alert]
    C --> D[Initial fault triage]
    D --> E{Severity assessment}

    E -->|P0 critical| F[Immediate response<br/>within 15 minutes]
    E -->|P1 major| G[Fast response<br/>within 30 minutes]
    E -->|P2 moderate| H[Normal response<br/>within 2 hours]
    E -->|P3 minor| I[Scheduled response<br/>within 24 hours]

    F --> J[Fault handling]
    G --> J
    H --> J
    I --> J

    J --> K[Service restored]
    K --> L[Root cause analysis]
    L --> M[Improvement actions]
    M --> N[Documentation update]
```

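The response windows in the workflow (15 minutes, 30 minutes, 2 hours, 24 hours) can be encoded so that alerting or ticketing scripts share one definition. A minimal sketch; the function name `response_window_minutes` is hypothetical, and the values are taken directly from the diagram:

```bash
#!/bin/bash
# Sketch: map an incident severity level to its response window in minutes.
response_window_minutes() {
    case $1 in
        P0) echo 15 ;;      # critical: immediate response
        P1) echo 30 ;;      # major: fast response
        P2) echo 120 ;;     # moderate: within 2 hours
        P3) echo 1440 ;;    # minor: within 24 hours
        *)  echo "unknown severity: $1" >&2
            return 1 ;;
    esac
}

response_window_minutes P0
response_window_minutes P2
```

Rejecting unknown levels with a non-zero status keeps a typo in a caller from silently mapping to no deadline at all.
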
### 5.2 Common Incident Runbook

#### 5.2.1 Service Unresponsive

```bash
#!/bin/bash
# fix-service-unresponsive.sh

SERVICE_NAME=$1

if [ -z "$SERVICE_NAME" ]; then
    echo "Usage: $0 <service_name>"
    exit 1
fi

echo "Handling unresponsive service: $SERVICE_NAME"

# 1. Check container status
echo "1. Checking container status"
docker ps -a | grep "$SERVICE_NAME"

# 2. Check container logs
echo "2. Checking container logs"
docker logs --tail 100 "$SERVICE_NAME"

# 3. Check resource usage
echo "3. Checking resource usage"
docker stats --no-stream "$SERVICE_NAME"

# 4. Try restarting the service
echo "4. Restarting service"
docker restart "$SERVICE_NAME"

# 5. Wait for the service to start
echo "5. Waiting for service to start"
sleep 30

# 6. Health check
echo "6. Running health check"
case $SERVICE_NAME in
    "backend-api-1")
        curl -f http://localhost:3001/health
        ;;
    "backend-api-2")
        curl -f http://localhost:3002/health
        ;;
    "nginx")
        curl -f http://localhost:80/health
        ;;
    *)
        # No HTTP endpoint known for this service: fall back to the container
        # state, otherwise an unmatched case would always report success
        [ "$(docker inspect -f '{{.State.Running}}' "$SERVICE_NAME")" = "true" ]
        ;;
esac

if [ $? -eq 0 ]; then
    echo "✅ Service is healthy again"
else
    echo "❌ Service is still failing, further investigation needed"
fi
```

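A fixed `sleep 30` either wastes time or is too short, depending on how long the service takes to come up. A bounded retry loop is a common alternative. A minimal sketch; `wait_until_healthy` is an illustrative name, and the probe command is whatever health check fits the service:

```bash
#!/bin/bash
# Sketch: retry a health probe up to N times with a delay between attempts.
# Usage: wait_until_healthy <attempts> <delay_seconds> <command...>
wait_until_healthy() {
    local attempts=$1 delay=$2 i
    shift 2
    for ((i = 1; i <= attempts; i++)); do
        if "$@" >/dev/null 2>&1; then
            echo "healthy after $i attempt(s)"
            return 0
        fi
        # No need to wait after the final failed attempt.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
    done
    echo "still unhealthy after $attempts attempt(s)"
    return 1
}

# Example (not executed here):
#   wait_until_healthy 10 3 curl -fs http://localhost:3001/health
```

The loop returns as soon as the probe succeeds, so a fast restart is confirmed in seconds while a slow one still gets the full window.
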
#### 5.2.2 Database Connection Failures

```bash
#!/bin/bash
# fix-database-connection.sh

echo "Handling database connection failure"

# 1. Check MySQL container status
echo "1. Checking MySQL container status"
docker ps | grep mysql-master

# 2. Check MySQL processes
echo "2. Checking MySQL processes"
docker exec mysql-master ps aux | grep mysql

# 3. Check MySQL connection count
echo "3. Checking MySQL connection count"
docker exec mysql-master mysql -u root -p"${MYSQL_ROOT_PASSWORD}" -e "SHOW STATUS LIKE 'Threads_connected';"

# 4. Check currently running queries (long-running ones show a high Time value)
echo "4. Checking running MySQL queries"
docker exec mysql-master mysql -u root -p"${MYSQL_ROOT_PASSWORD}" -e "SHOW PROCESSLIST;"

# 5. Check the MySQL error log
# The container may log to stderr, so merge it into stdout before filtering
echo "5. Checking MySQL error log"
docker logs --tail 50 mysql-master 2>&1 | grep -i error

# 6. Restart MySQL (if necessary)
read -p "Restart the MySQL service? (y/n): " restart_mysql
if [ "$restart_mysql" = "y" ]; then
    echo "Restarting MySQL..."
    docker restart mysql-master
    sleep 30

    # Check service status
    docker exec mysql-master mysql -u root -p"${MYSQL_ROOT_PASSWORD}" -e "SELECT 1;"
    if [ $? -eq 0 ]; then
        echo "✅ MySQL is healthy again"
    else
        echo "❌ MySQL is still failing"
    fi
fi
```

#### 5.2.3 Low Disk Space

```bash
#!/bin/bash
# fix-disk-space.sh
# WARNING: several steps below delete data. Review them before running in production.

echo "Handling low disk space"

# 1. Check disk usage
echo "1. Disk usage"
df -h

# 2. Find large files (-xdev keeps the search on the root filesystem)
echo "2. Large files (>100MB)"
find / -xdev -type f -size +100M -exec ls -lh {} \; 2>/dev/null | head -20

# 3. Clean up Docker resources
# Volume pruning removes volumes that no container references, which can include
# database volumes after "docker-compose down". Check "docker volume ls" first.
echo "3. Cleaning up Docker resources"
docker system prune -f
docker volume prune -f
docker image prune -a -f

# 4. Truncate old log files
echo "4. Cleaning up log files"
find /var/log -name "*.log" -type f -mtime +7 -exec truncate -s 0 {} \;

# 5. Clean up temporary files
# Only entries untouched for more than 7 days are removed, so files
# still in use by running processes survive
echo "5. Cleaning up temporary files"
find /tmp /var/tmp -mindepth 1 -mtime +7 -delete 2>/dev/null

# 6. Remove old backups
echo "6. Removing old backups"
find /backup -name "*.tar.gz" -mtime +30 -delete

# 7. Check disk space again
echo "7. Disk usage after cleanup"
df -h

echo "Disk cleanup complete"
```

### 5.3 Preventive Measures

#### 5.3.1 Preventive Maintenance Script

```bash
#!/bin/bash
# preventive-maintenance.sh

echo "Starting preventive maintenance"

# 1. System updates
echo "1. Checking for system updates"
yum check-update

# 2. Drop system caches (sync first so dirty pages are written to disk)
echo "2. Dropping system caches"
sync
echo 3 > /proc/sys/vm/drop_caches

# 3. Check system services
echo "3. Checking system services"
systemctl status docker
systemctl status firewalld

# 4. Check listening ports
# The leading colon and trailing whitespace keep 80 from matching 8080
echo "4. Checking network listeners"
netstat -tuln | grep -E ":(80|443|3000|3306|6379|27017)[[:space:]]"

# 5. Check SSL certificate validity
echo "5. Checking SSL certificate validity"
openssl x509 -in /etc/letsencrypt/live/www.xlxumu.com/cert.pem -noout -dates

# 6. Database maintenance
echo "6. Database maintenance"
docker exec mysql-master mysql -u root -p"${MYSQL_ROOT_PASSWORD}" -e "OPTIMIZE TABLE xlxumu_db.users, xlxumu_db.farms, xlxumu_db.animals;"

# 7. Performance baseline (requires curl-format.txt in the working directory)
echo "7. Performance baseline test"
curl -w "@curl-format.txt" -o /dev/null -s http://localhost/api/health

echo "Preventive maintenance complete"
```

## 6. Security Operations

### 6.1 Security Checklist

```bash
#!/bin/bash
# security-check.sh

echo "=== Security check started ==="

# 1. Check system users (UID >= 1000 are regular accounts)
echo "1. Checking system users"
awk -F: '$3 >= 1000 {print $1}' /etc/passwd

# 2. Check SSH configuration
echo "2. Checking SSH configuration"
grep -E "(PermitRootLogin|PasswordAuthentication|Port)" /etc/ssh/sshd_config

# 3. Check firewall status
echo "3. Checking firewall status"
firewall-cmd --list-all

# 4. Check open ports
echo "4. Checking open ports"
netstat -tuln

# 5. Check failed login attempts
echo "5. Checking failed login attempts"
grep "Failed password" /var/log/secure | tail -10

# 6. Check file permissions
echo "6. Checking permissions on key files"
ls -la /etc/passwd /etc/shadow /etc/ssh/sshd_config

# 7. Check Docker security
echo "7. Checking Docker security"
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock \
    -v /usr/local/bin/docker:/usr/local/bin/docker \
    docker/docker-bench-security

# 8. Check SSL certificate
echo "8. Checking SSL certificate"
echo | openssl s_client -servername www.xlxumu.com -connect www.xlxumu.com:443 2>/dev/null | openssl x509 -noout -dates

echo "=== Security check finished ==="
```

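The `grep` in step 2 also prints commented-out defaults, which makes the effective value easy to misread. A minimal sketch of a stricter lookup, relying on the fact that `sshd` uses the first occurrence of a keyword; `sshd_setting` is a hypothetical helper, and `Match` blocks are not taken into account:

```bash
#!/bin/bash
# Sketch: print the first uncommented value of an sshd_config keyword.
# Keywords are case-insensitive; lines starting with "#" never match
# because their first field is not the bare keyword.
sshd_setting() {
    local file=$1 key=$2
    awk -v k="$key" 'tolower($1) == tolower(k) { print $2; exit }' "$file"
}

# Example (not executed here):
#   sshd_setting /etc/ssh/sshd_config PermitRootLogin
```

On a live host, `sshd -T` prints the full effective configuration and is the more complete check; the helper is only meant for quick file inspection.
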
### 6.2 Security Hardening

```bash
#!/bin/bash
# security-hardening.sh
# Keep an existing SSH session open while this runs, and confirm that a new
# login on port 2222 works before closing it.

echo "Starting security hardening"

# 1. Disable unneeded services (the units may not exist on every host)
echo "1. Disabling unneeded services"
systemctl disable telnet 2>/dev/null || true
systemctl disable rsh 2>/dev/null || true
systemctl disable rlogin 2>/dev/null || true

# 2. Configure SSH
# The patterns match both commented and active lines, whatever their current value
echo "2. Configuring SSH"
sed -i 's/^#\?PermitRootLogin.*/PermitRootLogin no/' /etc/ssh/sshd_config
sed -i 's/^#\?PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
sed -i 's/^#\?Port .*/Port 2222/' /etc/ssh/sshd_config

# 3. Configure firewall rules
# Open the new port before sshd moves to it, and close port 22 only afterwards.
# On SELinux-enforcing hosts the port also needs a label:
#   semanage port -a -t ssh_port_t -p tcp 2222
echo "3. Configuring firewall rules"
firewall-cmd --permanent --add-port=2222/tcp
firewall-cmd --reload
sshd -t && systemctl restart sshd
firewall-cmd --permanent --remove-service=ssh
firewall-cmd --reload

# 4. Set file permissions
echo "4. Setting file permissions"
chmod 600 /etc/ssh/sshd_config
chmod 644 /etc/passwd
chmod 000 /etc/shadow

# 5. Configure log auditing (skip if the rule is already present)
echo "5. Configuring log auditing"
grep -q '^auth\.\*' /etc/rsyslog.conf || echo "auth.* /var/log/auth.log" >> /etc/rsyslog.conf
systemctl restart rsyslog

# 6. Install intrusion detection (fail2ban comes from the EPEL repository on CentOS/RHEL)
echo "6. Installing intrusion detection"
yum install -y fail2ban
systemctl enable fail2ban
systemctl start fail2ban

echo "Security hardening complete"
```

## 7. Capacity Planning

### 7.1 Capacity Monitoring Metrics

```bash
#!/bin/bash
# capacity-monitoring.sh

REPORT_FILE="/tmp/capacity-report-$(date +%Y%m%d).txt"

echo "=== Capacity monitoring report $(date) ===" > "$REPORT_FILE"

# 1. Server resource usage
echo "1. Server resource usage" >> "$REPORT_FILE"
echo "CPU usage: $(top -bn1 | grep "Cpu(s)" | awk '{print $2}')" >> "$REPORT_FILE"
echo "Memory usage: $(free | grep Mem | awk '{printf("%.2f%%"), $3/$2 * 100.0}')" >> "$REPORT_FILE"
echo "Disk usage: $(df -h / | tail -1 | awk '{print $5}')" >> "$REPORT_FILE"

# 2. Database capacity
echo "2. Database capacity" >> "$REPORT_FILE"
docker exec mysql-master mysql -u root -p"${MYSQL_ROOT_PASSWORD}" -e "
SELECT
    table_schema AS 'Database',
    ROUND(SUM(data_length + index_length) / 1024 / 1024, 2) AS 'Size (MB)'
FROM information_schema.tables
WHERE table_schema = 'xlxumu_db'
GROUP BY table_schema;
" >> "$REPORT_FILE"

# 3. User growth trend (10 most recent days with registrations)
echo "3. User growth trend" >> "$REPORT_FILE"
docker exec mysql-master mysql -u root -p"${MYSQL_ROOT_PASSWORD}" xlxumu_db -e "
SELECT
    DATE(created_at) as date,
    COUNT(*) as new_users
FROM users
WHERE created_at >= DATE_SUB(NOW(), INTERVAL 30 DAY)
GROUP BY DATE(created_at)
ORDER BY date DESC
LIMIT 10;
" >> "$REPORT_FILE"

# 4. Storage forecast
# df reports 1K blocks; the projection is linear (5% of current usage per month)
echo "4. Storage forecast" >> "$REPORT_FILE"
current_usage=$(df / | tail -1 | awk '{print $3}')
growth_rate=5  # assumed monthly growth of 5%
echo "Current usage: ${current_usage}KB" >> "$REPORT_FILE"
echo "Projected in 3 months: $((current_usage * (100 + growth_rate * 3) / 100))KB" >> "$REPORT_FILE"

echo "Capacity monitoring report generated: $REPORT_FILE"
```

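The storage forecast treats growth as linear (5% × 3 months = 15%). If usage instead compounds month over month, the projection is slightly higher. A minimal sketch of the compound version, assuming the same 5% monthly rate; `project_usage_kb` is a hypothetical helper:

```bash
#!/bin/bash
# Sketch: project storage usage with compound monthly growth.
# Usage: project_usage_kb <current_kb> <monthly_growth_percent> <months>
project_usage_kb() {
    local current_kb=$1 monthly_pct=$2 months=$3
    awk -v c="$current_kb" -v r="$monthly_pct" -v m="$months" \
        'BEGIN { printf "%.0f\n", c * (1 + r / 100) ^ m }'
}

# 1,000,000 KB growing 5% per month for 3 months
project_usage_kb 1000000 5 3
```

For short horizons the two models differ by under one percent, but the gap widens as the number of months grows, so the choice matters mainly for yearly planning.
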
### 7.2 扩容建议
|
|||
|
|
|
|||
|
|
```bash
#!/bin/bash
# scaling-recommendations.sh

# Collect current resource usage as integer percentages.
# CPU usage is 100 minus the idle percentage reported by top, rounded so that
# the integer comparisons below work.
cpu_usage=$(LC_ALL=C top -bn1 | grep "Cpu(s)" | sed 's/.*, *\([0-9.]*\)[ %]*id.*/\1/' | awk '{printf "%.0f", 100 - $1}')
cpu_usage=${cpu_usage:-0}
mem_usage=$(LC_ALL=C free | awk '/^Mem:/ {printf "%.0f", $3 / $2 * 100}')
disk_usage=$(df -P / | tail -1 | awk '{print $5}' | sed 's/%//')

echo "=== Scaling Recommendations ==="

# CPU scaling recommendations
if [ "$cpu_usage" -gt 70 ]; then
    echo "🔴 CPU usage is too high (${cpu_usage}%). Recommendations:"
    echo "   - Add CPU cores"
    echo "   - Optimize application performance"
    echo "   - Consider horizontal scaling"
elif [ "$cpu_usage" -gt 50 ]; then
    echo "🟡 CPU usage is elevated (${cpu_usage}%). Keep monitoring"
else
    echo "🟢 CPU usage is normal (${cpu_usage}%)"
fi

# Memory scaling recommendations
if [ "$mem_usage" -gt 80 ]; then
    echo "🔴 Memory usage is too high (${mem_usage}%). Recommendations:"
    echo "   - Add memory"
    echo "   - Optimize memory usage"
    echo "   - Check for memory leaks"
elif [ "$mem_usage" -gt 60 ]; then
    echo "🟡 Memory usage is elevated (${mem_usage}%). Keep monitoring"
else
    echo "🟢 Memory usage is normal (${mem_usage}%)"
fi

# Disk scaling recommendations
if [ "$disk_usage" -gt 85 ]; then
    echo "🔴 Disk usage is too high (${disk_usage}%). Recommendations:"
    echo "   - Free up disk space immediately"
    echo "   - Extend disk capacity"
    echo "   - Migrate data to other storage"
elif [ "$disk_usage" -gt 70 ]; then
    echo "🟡 Disk usage is elevated (${disk_usage}%). Keep monitoring"
else
    echo "🟢 Disk usage is normal (${disk_usage}%)"
fi

# Database scaling recommendations
# Note: a password passed with -p is visible in the process list; prefer a client option file in production.
db_connections=$(docker exec mysql-master mysql -u root -p"${MYSQL_ROOT_PASSWORD}" -e "SHOW STATUS LIKE 'Threads_connected';" | tail -1 | awk '{print $2}')
max_connections=$(docker exec mysql-master mysql -u root -p"${MYSQL_ROOT_PASSWORD}" -e "SHOW VARIABLES LIKE 'max_connections';" | tail -1 | awk '{print $2}')

# Skip the check when MySQL could not be queried, to avoid a division by zero
if [ -n "$db_connections" ] && [ "${max_connections:-0}" -gt 0 ]; then
    connection_usage=$((db_connections * 100 / max_connections))

    if [ "$connection_usage" -gt 80 ]; then
        echo "🔴 Database connection usage is too high (${connection_usage}%). Recommendations:"
        echo "   - Raise max_connections"
        echo "   - Tune the connection pool configuration"
        echo "   - Consider read/write splitting"
    fi
else
    echo "Could not read connection statistics from MySQL" >&2
fi
```
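
The CPU, memory, and disk checks above all follow the same pattern: compare a percentage against a warning threshold and a critical threshold. As a sketch, that pattern could be factored into one function; `classify_usage` is a hypothetical name introduced here and the thresholds in the example are the ones used by the script.

```bash
#!/bin/bash
# Sketch only: classify_usage is a hypothetical helper, not part of scaling-recommendations.sh.

# classify_usage VALUE WARN CRIT
# Prints "critical" if VALUE > CRIT, "warning" if VALUE > WARN, otherwise "ok".
# All arguments are integer percentages.
classify_usage() {
    local value=$1 warn=$2 crit=$3
    if [ "$value" -gt "$crit" ]; then
        echo "critical"
    elif [ "$value" -gt "$warn" ]; then
        echo "warning"
    else
        echo "ok"
    fi
}

# Example with the CPU thresholds (50 and 70), memory thresholds (60 and 80)
# and disk thresholds (70 and 85) from the script
echo "cpu:    $(classify_usage 75 50 70)"
echo "memory: $(classify_usage 65 60 80)"
echo "disk:   $(classify_usage 40 70 85)"
```

Keeping the thresholds as arguments makes it easier to adjust them per resource without duplicating the branching logic.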

## 8. Emergency Response Plan

### 8.1 Emergency Response Procedure

```bash
#!/bin/bash
# emergency-response.sh
# Usage: emergency-response.sh <incident_type> <severity>

INCIDENT_TYPE=$1
SEVERITY=$2  # accepted as an argument; the handlers below do not use it yet

case "$INCIDENT_TYPE" in
    "service_down")
        echo "Emergency handling: service down"
        # 1. Switch to the standby service immediately
        # 2. Notify the relevant personnel
        # 3. Start troubleshooting
        ;;
    "data_corruption")
        echo "Emergency handling: data corruption"
        # 1. Stop all write operations immediately
        # 2. Start the data recovery procedure
        # 3. Notify the business owners
        ;;
    "security_breach")
        echo "Emergency handling: security incident"
        # 1. Isolate the affected systems
        # 2. Collect evidence
        # 3. Notify the security team
        ;;
    *)
        # Unknown or missing incident type
        echo "Usage: $0 {service_down|data_corruption|security_breach} <severity>" >&2
        exit 1
        ;;
esac
```
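
The script above dispatches on the incident type passed as the first argument. A small validation step before the dispatch can catch typos early. The sketch below shows one way to do that; `validate_incident_type` is a hypothetical helper, and the list of types is taken from the `case` statement above.

```bash
#!/bin/bash
# Sketch only: validate_incident_type is a hypothetical helper, not part of emergency-response.sh.

# validate_incident_type TYPE
# Returns 0 if TYPE is one of the incident types handled by emergency-response.sh, 1 otherwise.
validate_incident_type() {
    case $1 in
        service_down|data_corruption|security_breach) return 0 ;;
        *) return 1 ;;
    esac
}

# Example: a misspelled incident type is rejected before any handler runs
if validate_incident_type "service_dwon"; then
    echo "known incident type"
else
    echo "unknown incident type"
fi
```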

### 8.2 Disaster Recovery Plan

```bash
#!/bin/bash
# disaster-recovery.sh
# Skeleton of the recovery sequence: each step only prints its name,
# and the commented actions must be filled in for the target environment.

echo "=== Disaster Recovery Plan ==="

# 1. Assess the extent of the damage
echo "1. Assessing the extent of system damage"

# 2. Bring up the standby system
echo "2. Starting the standby system"
# Switch to the standby data center

# 3. Restore data
echo "3. Starting data recovery"
# Restore data from the most recent backup

# 4. Verify services
echo "4. Verifying service functionality"
# Run the full functional test suite

# 5. Switch traffic
echo "5. Switching user traffic"
# Update DNS to point to the new system

echo "Disaster recovery complete"
```
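
Step 4 of the plan above (service verification) usually has to tolerate services that need some time to come up after a restore. A minimal sketch of a retry wrapper for such checks follows; `retry` is a hypothetical helper, and the health-check URL in the comment is an assumption for illustration only.

```bash
#!/bin/bash
# Sketch only: retry is a hypothetical helper, not part of disaster-recovery.sh.

# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
# Returns 0 on success, 1 if every attempt failed.
retry() {
    local max_attempts=$1 delay=$2 attempt=1
    shift 2
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "failed after $attempt attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    return 0
}

# Example with a command that always succeeds.
# A real check might look like (URL is an assumption):
#   retry 5 10 curl -fsS http://localhost:8080/health
if retry 3 1 true; then
    echo "service check passed"
fi
```

Traffic should only be switched in step 5 once the wrapped check has returned success.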

## 9. Operations Tools

### 9.1 Operations Script Collection

```bash
#!/bin/bash
# ops-toolkit.sh - operations toolkit

# Print the interactive menu
show_menu() {
    echo "=== Operations Toolkit ==="
    echo "1. System status check"
    echo "2. Restart a service"
    echo "3. View logs"
    echo "4. Performance monitoring"
    echo "5. Backup"
    echo "6. Troubleshooting"
    echo "7. Security check"
    echo "8. Capacity analysis"
    echo "0. Exit"
    echo "================="
}

while true; do
    show_menu
    read -r -p "Select an option: " choice

    case $choice in
        1)
            ./scripts/system-check.sh
            ;;
        2)
            read -r -p "Enter the service name: " service
            # Quote the name so that spaces or shell metacharacters are not interpreted
            docker restart "$service"
            ;;
        3)
            read -r -p "Enter the container name: " container
            # -f keeps following the log until interrupted
            docker logs --tail 100 -f "$container"
            ;;
        4)
            ./scripts/performance-monitor.sh
            ;;
        5)
            ./scripts/backup-system.sh
            ;;
        6)
            ./scripts/troubleshoot.sh
            ;;
        7)
            ./scripts/security-check.sh
            ;;
        8)
            ./scripts/capacity-monitoring.sh
            ;;
        0)
            echo "Exiting the operations toolkit"
            break
            ;;
        *)
            echo "Invalid choice, please try again"
            ;;
    esac

    echo "Press Enter to continue..."
    read -r
done
```
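
Options 2 and 3 of the toolkit pass operator input straight to `docker`. A name check before the call gives a clearer error for typos and stops input that starts with `-` from being read as an option. The sketch below assumes the usual Docker naming rule (an alphanumeric first character followed by one or more letters, digits, `_`, `.` or `-`); `is_valid_container_name` is a hypothetical helper.

```bash
#!/bin/bash
# Sketch only: is_valid_container_name is a hypothetical helper, not part of ops-toolkit.sh.

# is_valid_container_name NAME
# Returns 0 if NAME starts with a letter or digit, is at least two characters long,
# and contains only letters, digits, "_", "." and "-". Returns 1 otherwise.
is_valid_container_name() {
    case $1 in
        ""|?) return 1 ;;                  # empty or a single character
        [!a-zA-Z0-9]*) return 1 ;;         # must start with a letter or digit
        *[!a-zA-Z0-9_.-]*) return 1 ;;     # contains a character outside the allowed set
    esac
    return 0
}

# Example: check a few candidate names
for name in "mysql-master" "-f" "redis master"; do
    if is_valid_container_name "$name"; then
        echo "valid:   $name"
    else
        echo "invalid: $name"
    fi
done
```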

### 9.2 Automated Operations Script

```bash
#!/bin/bash
# auto-ops.sh - automated operations

# Add an entry to the current user's crontab unless it is already present.
# Run the script as root to install the jobs for root. Using crontab avoids
# duplicate entries on repeated runs and does not depend on the spool file path.
add_cron_job() {
    local entry=$1
    if ! crontab -l 2>/dev/null | grep -qxF "$entry"; then
        (crontab -l 2>/dev/null; echo "$entry") | crontab -
    fi
}

# Scheduled task configuration
setup_cron_jobs() {
    echo "Configuring scheduled tasks"

    # Daily backup at 02:00
    add_cron_job "0 2 * * * /opt/xlxumu/scripts/backup-system.sh"

    # Hourly system check
    add_cron_job "0 * * * * /opt/xlxumu/scripts/system-check.sh"

    # Daily log cleanup at 03:00
    add_cron_job "0 3 * * * /opt/xlxumu/scripts/log-cleanup.sh"

    # Weekly performance report, Mondays at 09:00
    add_cron_job "0 9 * * 1 /opt/xlxumu/scripts/performance-report.sh"
}

# Return 0 if a container with exactly this name is running
is_running() {
    docker ps --format '{{.Names}}' | grep -qxF "$1"
}

# Automatic failure recovery
auto_recovery() {
    # Check each service and restart it if it is not running
    local services=("mysql-master" "redis-master" "backend-api-1" "backend-api-2")
    local service

    for service in "${services[@]}"; do
        if ! is_running "$service"; then
            echo "Service $service is not running, attempting automatic recovery"
            docker restart "$service"
            sleep 30

            # Verify the result of the recovery
            if is_running "$service"; then
                echo "Service $service recovered successfully"
                # Send a recovery notification
                send_notification "Service auto-recovered" "Service $service has been restored automatically"
            else
                echo "Service $service could not be recovered, manual intervention required"
                # Send an alert
                send_alert "Service recovery failed" "Automatic recovery of $service failed, manual handling required"
            fi
        fi
    done
}

# Send a notification
send_notification() {
    local title=$1
    local message=$2

    # DingTalk notification (replace YOUR_TOKEN with the robot's access token).
    # title and message are inserted as-is, so they must not contain double quotes or backslashes.
    curl -sS -X POST "https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN" \
        -H 'Content-Type: application/json' \
        -d "{\"msgtype\": \"text\",\"text\": {\"content\": \"$title: $message\"}}"
}

# Send an alert
send_alert() {
    local title=$1
    local message=$2

    # Email alert
    echo "$message" | mail -s "$title" ops@xlxumu.com

    # SMS alert (requires integration with an SMS service)
    # curl -X POST "SMS_API_URL" -d "phone=13800000000&message=$message"
}

# Entry point
main() {
    case $1 in
        "setup")
            setup_cron_jobs
            ;;
        "recovery")
            auto_recovery
            ;;
        *)
            echo "Usage: $0 {setup|recovery}"
            ;;
    esac
}

main "$@"
```
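
The notification function above embeds the title and message directly in a JSON string, which produces invalid JSON if either contains a double quote, a backslash, or a newline. The following sketch shows one way to escape the text first. `json_escape` and `build_dingtalk_payload` are hypothetical helpers, the escaping covers only the characters listed in the comments, and it relies on bash parameter expansion.

```bash
#!/bin/bash
# Sketch only: json_escape and build_dingtalk_payload are hypothetical helpers,
# not part of auto-ops.sh. Only backslash, double quote, newline and tab are escaped.

# json_escape TEXT
# Prints TEXT with characters escaped so it can be placed inside a JSON string.
json_escape() {
    local s=$1
    s=${s//\\/\\\\}       # backslash first, so later escapes are not doubled
    s=${s//\"/\\\"}       # double quote
    s=${s//$'\n'/\\n}     # newline
    s=${s//$'\t'/\\t}     # tab
    printf '%s' "$s"
}

# build_dingtalk_payload TITLE MESSAGE
# Prints a text-message payload in the same shape as the one used by send_notification.
build_dingtalk_payload() {
    local content
    content=$(json_escape "$1: $2")
    printf '{"msgtype":"text","text":{"content":"%s"}}' "$content"
}

# Example: a message containing double quotes
build_dingtalk_payload "Service recovery failed" 'container "backend-api-1" did not start'
echo
```

The resulting string can be passed to `curl` with `-d "$(build_dingtalk_payload "$title" "$message")"`.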
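
## 10. Summary

### 10.1 Operations Best Practices

1. **Prevention first**: reduce failures through monitoring and preventive maintenance
2. **Automation first**: automate routine operations wherever possible
3. **Thorough documentation**: maintain detailed operations documentation and runbooks
4. **Continuous improvement**: refine processes and tools based on operational experience
5. **Team collaboration**: establish effective collaboration within the operations team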

### 10.2 Key Metrics to Monitor

- **Availability**: 99.9%+
- **Response time**: < 500ms
- **Error rate**: < 0.1%
- **Recovery time**: < 30 minutes
- **Backup success rate**: 100%
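
An availability target translates directly into a downtime budget. As a sketch of the arithmetic, the function below converts an availability percentage and a period length into allowed downtime in minutes; `downtime_budget_minutes` is a hypothetical helper introduced for illustration.

```bash
#!/bin/bash
# Sketch only: downtime_budget_minutes is a hypothetical helper.

# downtime_budget_minutes AVAILABILITY_PERCENT PERIOD_DAYS
# Prints the allowed downtime in minutes for the period, to one decimal place.
downtime_budget_minutes() {
    local availability=$1 period_days=$2
    awk -v a="$availability" -v d="$period_days" \
        'BEGIN { printf "%.1f\n", (100 - a) / 100 * d * 24 * 60 }'
}

# Example: the 99.9% target over a year and over a 30-day month
echo "per year:  $(downtime_budget_minutes 99.9 365) minutes"
echo "per month: $(downtime_budget_minutes 99.9 30) minutes"
```

At 99.9% the yearly budget is 525.6 minutes, which is the 8.76 hours stated in the SLA table in section 1.3, and the budget for a 30-day month is 43.2 minutes.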
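
### 10.3 Directions for Continuous Optimization

1. **Broader monitoring**: add more business-level metrics
2. **More automation**: extend the coverage of automated operations
3. **Failure prediction**: AI-based failure prediction and prevention
4. **Operational efficiency**: streamline operations tools and processes
5. **Stronger security**: keep strengthening security protections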

---

**Document version**: v1.0.0
**Last updated**: December 2024
**Maintained by**: Operations team
**Contact**: ops@xlxumu.com