Production Memory Blowup Turned Out to Be a Goroutine Leak: 3 Diagnostic Commands That Saved Me

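2:17 a.m. My phone went off.

Half asleep, I fumbled for it. Not my alarm: a P0 alert. The service's memory usage had shot from its normal 300MB to 8GB, and it was still climbing.

First reaction: kill -9 and restart.

Second reaction: wait, I restarted it just last week, and here it is again.

This third time, I decided to find out exactly who was eating my memory.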

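First, confirm: is it really a memory leak?

Start with the monitoring. A curve that climbs straight up like this looks a lot like a memory leak:

![Figure: Grafana screenshot of the memory spike; the time axis shows a straight-line climb starting at 2:00]

But there's a common misconception here. Go's GC is quite diligent, and true memory leaks are fairly rare in Go. What's far more common is a Goroutine leak: you start a goroutine, it gets stuck, it can never exit, and they keep piling up.

Each Goroutine starts with a 2KB stack, but that stack can grow into the MB range, and everything a blocked Goroutine references stays alive with it. With 100,000 stuck Goroutines, memory blows up.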
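To put rough numbers on that, here is a back-of-the-envelope sketch using only figures quoted in this post (a 2KB initial stack, about 100,000 Goroutines, 8GB of memory). The function names are made up for illustration, and the average assumes nearly all of the memory belongs to the stuck Goroutines, which is a simplification.

```go
package main

import "fmt"

const initialStackBytes = 2 * 1024 // initial Goroutine stack size cited above

// stackFloorMB is the least memory n Goroutines can hold in stacks alone.
func stackFloorMB(n int64) int64 {
	return n * initialStackBytes / (1 << 20)
}

// avgKBPerGoroutine spreads an observed footprint evenly over n Goroutines.
func avgKBPerGoroutine(totalBytes, n int64) int64 {
	return totalBytes / n / 1024
}

func main() {
	const goroutines = 100_000 // "100,000 stuck Goroutines"
	const observed = 8 << 30   // the 8GB from the alert

	fmt.Printf("stack floor: %d MB\n", stackFloorMB(goroutines))                          // stack floor: 195 MB
	fmt.Printf("average per Goroutine: %d KB\n", avgKBPerGoroutine(observed, goroutines)) // average per Goroutine: 83 KB
}
```

So the minimum stacks explain only about 195MB. An 8GB footprint works out to roughly 83KB per Goroutine, which suggests most of the memory is grown stacks plus the buffers and connections each stuck Goroutine keeps reachable.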

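It's easy to check:

import (
    _ "expvar"         // registers /debug/vars on http.DefaultServeMux
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
    "runtime"
)

func init() {
    // Importing expvar already registers /debug/vars;
    // registering the same path again with http.Handle would panic.
    go http.ListenAndServe(":6060", nil)
}

// Add an endpoint to your own code, or just curl pprof directly
func getGoroutineCount() int {
    return runtime.NumGoroutine()
}

Then run: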

$ curl 'http://localhost:6060/debug/pprof/goroutine?debug=1' | head -20

goroutine profile: total 104857
104856 @ 0x43e3c6 0x4068b0 0x4065a5 0x78f2a3 ...
#	0x78f2a2	main.processTask+0x42	/app/main.go:127

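See that 104857?

A healthy service typically runs a few dozen to a few hundred Goroutines. Anything over 10,000 deserves a closer look, and mine was past 100,000.

Suspect confirmed: this isn't a memory leak, it's a Goroutine leak.

Step 1: quick confirmation with runtime.NumGoroutine()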

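If you can't be bothered to set up pprof, a few lines of code will do the autopsy:

go func() {
    for {
        log.Printf("current Goroutine count: %d", runtime.NumGoroutine())
        time.Sleep(10 * time.Second)
    }
}()

Drop this into your service and watch the logs:

2025/04/07 02:18:12 current Goroutine count: 127
2025/04/07 02:18:22 current Goroutine count: 1024
2025/04/07 02:18:32 current Goroutine count: 8192
...

A climb you can see with the naked eye pretty much settles it.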

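Step 2: locate the leak with pprof

Knowing there's a leak is one thing; the key is finding where it comes from.

Go ships with exactly the right tool, so use it:

# Dump Goroutine stacks, grouped by identical stack
curl 'http://localhost:6060/debug/pprof/goroutine?debug=1' > goroutine.txt

# Or use debug=2: one full trace per Goroutine, including its state and how long it has been blocked
curl 'http://localhost:6060/debug/pprof/goroutine?debug=2' > goroutine_full.txt

Open goroutine.txt and you'll see output like this: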

goroutine profile: total 104857
104856 @ 0x43e3c6 0x46f5f2 0x78f2a3 0x4709a1
#	0x78f2a2	main.(*Worker).Process+0x242	/app/worker.go:89
#	0x78f3b1	main.(*Worker).Start.func1+0x51	/app/worker.go:56

1 @ 0x43e3c6 0x46f5f2 0x791102 0x4709a1
#	0x791101	main.main+0x301	/app/main.go:34

Focus on that huge number: 104856 @ ...

It means 104856 Goroutines are all stuck at worker.go:89.

Open the code:

// worker.go
func (w *Worker) Process(task Task) {
    result := w.client.Send(task)  // line 89
    w.handleResult(result)
}

Follow it into the Send method:

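func (c *Client) Send(t Task) Result {
    // Pitfall: no timeout is set!
    resp, err := c.httpClient.Post(c.url, "application/json", body)
    if err != nil {
        return Result{Err: err}
    }
    // Worse: the Body is never read and never closed!
    return Result{Data: resp}
}

Two fatal problems:

HTTP request has no timeout → when the downstream service hangs, the request waits forever (an http.Client with a zero Timeout never times out)
Response.Body is never closed → the connection can't return to the pool, so the connection and the transport Goroutines serving it leak

Every request gets its own Goroutine; when requests hang, the Goroutines pile up.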

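Step 3: visualize with a flame graph (optional but intuitive)

If you'd rather look at a picture than read text:

# Start the interactive viewer locally
go tool pprof -http=:8080 http://localhost:6060/debug/pprof/goroutine

# Or download the profile first, then analyze it
curl http://localhost:6060/debug/pprof/goroutine > goroutine.prof
go tool pprof -http=:8080 goroutine.prof

Then open http://localhost:8080 and pick the flame graph from the VIEW menu:

![Figure: pprof flame graph showing a wide band of Goroutines blocked in net/http.(*persistConn).roundTrip]

On the flame graph it's obvious at a glance: the widest block is the culprit.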

The fix: put a leash on your requests

With the problem found, the fix is only a few lines of code:

// Before (leaks); error handling omitted for brevity
func (c *Client) Send(t Task) Result {
    resp, _ := c.httpClient.Post(c.url, "application/json", body)
    return Result{Data: resp}  // Body never closed, no timeout
}

// After (safe)
func (c *Client) Send(ctx context.Context, t Task) (Result, error) {
    req, err := http.NewRequestWithContext(ctx, http.MethodPost, c.url, body)
    if err != nil {
        return Result{}, err
    }
    req.Header.Set("Content-Type", "application/json")

    resp, err := c.httpClient.Do(req)
    if err != nil {
        return Result{}, err  // handle the error first!
    }
    // ⚠️ Key point: the defer must come after the err check!
    // When err != nil, resp may be nil, and evaluating resp.Body then panics with a nil pointer dereference
    defer resp.Body.Close()

    // Even if you don't need the Body, drain it so the connection can be reused
    // Advanced: if resp.Body may be large, cap the read with io.LimitReader
    io.Copy(io.Discard, resp.Body)

    return Result{Status: resp.StatusCode}, nil
}

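The caller has to cooperate too:

// Before
go w.Process(task)  // bare go, no control at all

// After (Process now takes ctx and returns an error)
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()

if err := w.Process(ctx, task); err != nil {
    log.Printf("task failed or timed out: %v", err)
}

Core principles:

Every external request must carry a timeout (context.WithTimeout)
Every HTTP Response must be closed (defer resp.Body.Close())
Don't use a bare go statement without control: bound the concurrency (Worker Pool), make sure every Goroutine has a way to exit, and add recover so a panic in one task can't take down the process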

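Prevention checklist: run through it before every release

The postmortem is done, but the best debugging is the kind you never have to do. Bookmark this checklist and review it before you ship:

Check	Description	Severity
☐ Bare `go` keyword	Any direct `go func()` in business code? It needs an exit path (ctx or done channel) and a recover	🔴 P0
☐ HTTP without timeout	Every http.Client must set Timeout	🔴 P0
☐ Response not closed	Did you defer resp.Body.Close()?	🔴 P0
☐ Unbuffered channel	A send on an unbuffered chan blocks until a receiver is ready; with no receiver it blocks forever	🟡 P1
☐ time.After in a loop	time.After in a hot loop piles up Timers (before Go 1.23 they were not collected until they fired)	🟡 P1
☐ Third-party callbacks	Does the library start Goroutines internally for its callbacks?	🟢 P2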
Monitoring and alerting:

# Prometheus alerting rule
groups:
  - name: goroutine-alerts
    rules:
      - alert: HighGoroutineCount
        expr: go_goroutines{job="my-service"} > 5000
        for: 5m
        annotations:
          summary: "Abnormal Goroutine count: {{ $value }}"

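About the time.After pitfall:

// Problematic: when another case usually wins, each pass abandons a pending Timer
// (ch and handle are placeholders for your own channel and handler)
for {
    select {
    case msg := <-ch:
        handle(msg)
    case <-time.After(5 * time.Second):  // allocates a new Timer on every pass
        // handle the timeout
    }
}

// Better: create one Timer and reuse it with Reset
timer := time.NewTimer(5 * time.Second)
defer timer.Stop()
for {
    select {
    case msg := <-ch:
        handle(msg)
    case <-timer.C:
        // handle the timeout
    }
    // Stop and drain before Reset so a stale tick can't fire early (needed before Go 1.23)
    if !timer.Stop() {
        select {
        case <-timer.C:
        default:
        }
    }
    timer.Reset(5 * time.Second)
}

time.After creates a time.Timer under the hood. Before Go 1.23, that Timer and the memory tied to it could not be garbage collected until the timer fired, so a hot loop in which another case usually wins could pile up thousands of pending Timers in no time. From Go 1.23 on, unreferenced Timers are collected promptly, but reusing a single Timer still saves an allocation on every iteration.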

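Final thoughts

That night, with these 3 commands, I located the problem in 20 minutes, fixed it in 10, and verified the fix in 5.

The 3 commands, one more time:

runtime.NumGoroutine(): quickly confirm the suspect
curl /debug/pprof/goroutine?debug=1: locate the leaking line of code
go tool pprof -http=:8080: confirm it visually

Goroutines are Go's killer feature, but they're also the resource that's easiest to overlook. Unlike unreachable memory, a blocked Goroutine is never reclaimed by the GC; once it leaks, only a restart will free it.

Now go run this against your own project:

curl 'http://localhost:6060/debug/pprof/goroutine?debug=1' | head -5

See how many Goroutines your service is running right now. If something looks off, there's still time to act.

Have you run into a Goroutine leak? How did you track it down? Tell me about it in the comments and I'll give you a like.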
