A Guide to Using Built-in Profiling Tools in Go

Profiling helps you identify where your program spends time and memory. Go provides built-in tools to collect and analyze this data. This guide covers profiling, benchmarks, and tracing, including examples and instructions for interpreting results.

Profiling Overview

Profiling measures program performance by sampling the program as it runs. It identifies the code sections responsible for high CPU usage, memory allocations, or blocking operations.

Go provides four main diagnostic approaches:

  1. Profiling – Measures CPU, memory, and blocking costs

  2. Tracing – Tracks latency and concurrency across requests

  3. Debugging – Pauses execution to inspect state and flow (not covered in this guide)

  4. Runtime statistics – Provides a high-level overview of app health (not covered in this guide)

I’ll focus here on profiling, and especially on the pprof tool.

Collecting Profile Data

Profiling Tests

Use go test with profiling flags:

  • CPU profile. Records which functions consume CPU time by periodically sampling the running call stacks.

    go test -cpuprofile=cpu.out

  • Memory (heap) profile. Records the stack trace of sampled heap allocations.

    go test -memprofile=mem.out

  • Blocking profile. Records where goroutines block on synchronization primitives such as locks and channel operations.

    go test -blockprofile=block.out

Important. Avoid enabling more than one type of profile simultaneously, as the profiling mechanism itself can distort the result.
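
In practice, a profile is usually collected from a specific benchmark rather than the whole test suite. A typical invocation (using the BenchmarkProcessData example from the next section, with -run='^$' skipping the regular tests) might look like:

go test -run='^$' -bench=BenchmarkProcessData -cpuprofile=cpu.out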

Benchmarking

Benchmarks are only reliable if written properly. Example:

func BenchmarkProcessData(b *testing.B) {
    data := loadTestData() // setup runs once, outside the measurement
    b.ResetTimer()         // discard the time spent on setup
    for range b.N {        // Go 1.22+; use for i := 0; i < b.N; i++ on older versions
        processData(data)
    }
}
  • Always set up test data outside the timed loop

  • Reset the timer after setup to avoid measuring preparation time

  • Use -benchmem to collect memory allocation metrics:

go test -bench=. -benchmem

Output includes:

  • ns/op – Time per operation

  • B/op – Bytes allocated per operation

  • allocs/op – Number of heap allocations

Compare before and after optimizations to verify improvements.
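
To make before/after comparisons statistically meaningful, you can use the benchstat tool (installed with go install golang.org/x/perf/cmd/benchstat@latest); a typical workflow:

go test -bench=. -benchmem -count=10 > old.txt
# ...apply the optimization...
go test -bench=. -benchmem -count=10 > new.txt
benchstat old.txt new.txt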

Profiling Live Applications (using pprof)

For web servers or other long-running programs, import "net/http/pprof". This enables profiling at runtime, which is especially useful for web applications: the import installs diagnostic handlers under the /debug/pprof/ endpoint.

import (
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof handlers on the default mux
)

func main() {
    // Serve the profiling endpoints on a separate port.
    go http.ListenAndServe(":6060", nil)
    runApplication()
}

Access profiles in a browser or with go tool pprof:
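
For example, assuming the server above is listening on port 6060:

# sample the live server's CPU for 30 seconds, then open the interactive shell
go tool pprof 'http://localhost:6060/debug/pprof/profile?seconds=30'

# fetch the current heap profile
go tool pprof http://localhost:6060/debug/pprof/heap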

You can also write your own custom profilers: https://go.dev/wiki/CustomPprofProfiles
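
A minimal sketch of a custom profile, where the profile name example.conns and the conn type are made up for illustration:

package main

import (
    "net/http"
    _ "net/http/pprof" // custom profiles are served under /debug/pprof/ too
    "runtime/pprof"
)

// connProfile tracks every currently open connection.
var connProfile = pprof.NewProfile("example.conns")

type conn struct{}

func openConn() *conn {
    c := &conn{}
    connProfile.Add(c, 1) // record the stack trace of the opener
    return c
}

func (c *conn) closeConn() {
    connProfile.Remove(c) // drop the connection from the profile
}

func main() {
    c := openConn()
    defer c.closeConn()
    http.ListenAndServe(":6060", nil)
}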

Analyzing Profiles with pprof

To analyze a collected profile, load it into the interactive pprof shell:

go tool pprof cpu.out

Inside the shell, the most useful commands are:

  • topN – shows the top N samples by function (e.g. top10)

  • top -cum – sorts by cumulative time instead of flat time

  • list FunctionName – shows source code with samples per line

  • disasm FunctionName – shows annotated disassembly

  • web / gv – renders the profile graph for a browser or Ghostview

Example top2 output:

(pprof) top2
Showing nodes accounting for 1.2s, 60% of 2s total
      flat  flat%   sum%        cum   cum%
     0.8s  40%   40%      1.2s   60%  main.processData
     0.4s  20%   60%      0.4s   20%  main.calculate

How to read:

  • flat: Time spent in the function itself

  • cum: Cumulative time spent in the function and all functions it calls

Focus on functions with high cumulative time to target optimizations.

Interpreting Heap Profiles

A heap profile shows which parts of the program allocate the most memory. Memory profiling records the stack trace when a heap allocation happens; to keep overhead low, the profiler samples these events, recording roughly one per 512KB of allocated memory by default (this can be adjusted).
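
The sampling rate is controlled by the runtime.MemProfileRate variable. A sketch of adjusting it (setting it to 1 records every allocation, which is precise but expensive):

import "runtime"

func init() {
    // Set as early as possible; the tools assume the rate is constant
    // over the program's lifetime.
    runtime.MemProfileRate = 1
}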

It doesn’t track stack allocations because they are considered free. The Go compiler uses an algorithm called “escape analysis” to decide whether a value should live on the stack or on the heap; only heap allocations are recorded in the profile. This matters because the main goal of optimizing memory usage is to reduce the load on the garbage collector (GC): fewer allocations mean shorter collections and less GC-induced latency in the running application.
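
You can inspect the compiler’s escape-analysis decisions with the -gcflags='-m' build flag. A minimal sketch (the function names are made up):

// escape.go
package main

func onStack() int {
    v := 42 // stays on the stack: its address never leaves the function
    return v
}

func onHeap() *int {
    v := 42 // escapes to the heap: the returned pointer outlives the call
    return &v
}

func main() {
    _ = onStack()
    _ = onHeap()
}

Running go build -gcflags='-m' escape.go prints the decisions, including a line like moved to heap: v for onHeap.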

Once a profile log is created (e.g. mem.out), use the go tool pprof to read it.

  • If you run go tool pprof with the --inuse_objects flag, the tool will report allocation counts instead of sizes.

$ go tool pprof mem.out
(pprof) top

Example output:

Showing nodes accounting for 3MB, 60% of 5MB total
      flat  flat%   sum%        cum   cum%
     2MB   40%   40%       2MB   40%  main.loadData
     1MB   20%   60%       1MB   20%  main.buildObjects

How to read:

  • Focus on functions with high memory allocations

  • Frequent allocations in these functions can increase GC pressure

  • Consider reusing objects, pooling, or reducing allocations (see the sync.Pool sketch below)
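
A minimal sync.Pool sketch, assuming a hot path that repeatedly needs a scratch buffer (the render function is made up):

package main

import (
    "bytes"
    "fmt"
    "sync"
)

// bufPool hands out reusable buffers so the hot path avoids
// allocating a fresh one on every call.
var bufPool = sync.Pool{
    New: func() any { return new(bytes.Buffer) },
}

func render(name string) string {
    buf := bufPool.Get().(*bytes.Buffer)
    buf.Reset()            // clear leftovers from the previous user
    defer bufPool.Put(buf) // return the buffer for reuse

    fmt.Fprintf(buf, "hello, %s", name)
    return buf.String()
}

func main() {
    fmt.Println(render("pprof"))
}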

Visualization:

(pprof) web

Shows a call graph with memory usage per function.

Using the --inuse_objects flag, the output looks like this (example taken from the Go blog post “Profiling Go Programs”):

$ go tool pprof --inuse_objects havlak3 havlak3.mprof
Adjusting heap profiles for 1-in-524288 sampling rate
Welcome to pprof!  For help, type 'help'.
(pprof) list FindLoops
Total: 1763108 objects
ROUTINE ====================== main.FindLoops in /home/rsc/g/benchgraffiti/havlak/havlak3.go
720903 720903 Total objects (flat / cumulative)
...
     .      .  277:     for i := 0; i < size; i++ {
311296 311296  278:             nodes[i] = new(UnionFindNode)
     .      .  279:     }
     .      .  280:
     .      .  281:     // Step a:
     .      .  282:     //   - initialize all nodes as unvisited.
     .      .  283:     //   - depth-first traversal and numbering.
     .      .  284:     //   - unreached BB's are marked as dead.
     .      .  285:     //
     .      .  286:     for i, bb := range cfgraph.Blocks {
     .      .  287:             number[bb.Name] = unvisited
409600 409600  288:             nonBackPreds[i] = make(map[int]bool)
     .      .  289:     }
...
(pprof)

Tracing for Latency and Concurrency

Tracing captures latency across functions and goroutines:

go test -trace trace.out
go tool trace trace.out

Traces help identify:

  • Functions causing delays

  • Goroutines waiting on locks or channels

  • Latency bottlenecks across processes
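
For long-running programs, a trace can also be captured programmatically with the standard runtime/trace package. A minimal sketch (runApplication stands in for the real workload):

package main

import (
    "log"
    "os"
    "runtime/trace"
)

func main() {
    f, err := os.Create("trace.out")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    // Everything between Start and Stop is recorded into trace.out.
    if err := trace.Start(f); err != nil {
        log.Fatal(err)
    }
    defer trace.Stop()

    runApplication()
}

func runApplication() {
    // placeholder workload
}

The net/http/pprof handler shown earlier also serves live traces at /debug/pprof/trace.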

Takeaways

  1. Profile Before Optimizing: The most critical step is to identify bottlenecks using tools like go tool pprof. This helps to focus on the right areas.

  2. Prioritize Simple Data Structures: The CPU profile often reveals performance degradation caused by inefficient use of complex data types, such as Go’s map. The takeaway is that “there’s no reason to use a map when an array or slice will do” for indexed access or simple sets. Switching from maps to slices can significantly improve runtime (e.g., cutting it by nearly a factor of two); see the sketch after this list.

  3. Minimize Allocation to Reduce GC Pressure: If the CPU profile shows high time spent in runtime.mallocgc, the program is memory-bound. The memory profile helps pinpoint code sections responsible for allocating the most memory. The general principle is that the fastest program is often the one that makes the fewest memory allocations. Reducing allocations minimizes garbage collector (GC) work.

  4. Implement Memory Reuse for Inner Loops: Even necessary bookkeeping structures can generate significant allocations if created repeatedly in inner loops. Consider object pooling or reusing buffers to minimize GC pressure.

  5. Achieving Competitive Performance: The overall conclusion of the optimization study is that when Go programmers use profiling tools to meticulously manage the garbage generated by inner loops, the resulting Go program can be competitive with equivalent C++ code.
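
As an illustration of takeaway 2 (a hedged sketch, not taken from the study itself): when keys are small, dense integers, a slice can often replace a map outright:

package main

import "fmt"

const n = 8 // IDs are known to be dense integers in [0, n)

// countWithMap hashes every key and carries map bookkeeping overhead.
func countWithMap(ids []int) map[int]int {
    counts := make(map[int]int)
    for _, id := range ids {
        counts[id]++
    }
    return counts
}

// countWithSlice allocates once and uses direct indexing.
func countWithSlice(ids []int) []int {
    counts := make([]int, n)
    for _, id := range ids {
        counts[id]++
    }
    return counts
}

func main() {
    ids := []int{1, 3, 3, 7, 1}
    fmt.Println(countWithMap(ids))
    fmt.Println(countWithSlice(ids))
}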
