Introduction

It might be the fever talking, but I have found a new use for the much maligned runtime.SetFinalizer().

Between urgent care runs through the holidays for my family members and trying to adapt songs about Christmas into songs about coughing (aka "Hard Candy Christmas" becomes "Hard Coughing Christmas"), this little problem popped into my head:

How do you use a *sync.Pool to recover gRPC protocol buffers?

Go program speed seems linked to some combination of the number and size of allocations. Allocations create work for the GC, and most papers point to this as the biggest speed difference between non-GC languages and Go.

We can see that Java and C# programs often get close to Go's speed in common applications like web services, just not its memory usage.

Many Go programmers want to get close to C/C++/Rust speed with Go, so they spend a lot of time trying to control allocations: doing weird things with GOMAXPROCS, changing when GCs happen, or allocating large virtual memory chunks to trick the GC.

gRPC Problem

gRPC is my chosen platform for RPCs, though this problem affects most Go RPC frameworks, or any situation where control of an object leaves code you own and doesn't come back.

Repeated large allocations are bad for speed, according to every source, both because of the time it takes to allocate the objects and because of the time the garbage collector must spend tracking them.

In the old days, experts would tell you to build a free list out of a buffered channel to reuse expensive objects on demand. This lowered your large allocations, but the channel wouldn't automatically adjust its size: it might hold more memory than you needed, or not enough, so you would constantly create new objects. It needed some automation.
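
For reference, that older approach looked roughly like this (a sketch only; the slot count and buffer capacity are arbitrary choices for the example):

type bufFreeList struct {
	free chan []byte
}

func newBufFreeList(slots int) *bufFreeList {
	return &bufFreeList{free: make(chan []byte, slots)}
}

func (p *bufFreeList) get() []byte {
	select {
	case b := <-p.free:
		return b[:0] // reuse an existing buffer, reset its length
	default:
		return make([]byte, 0, 64*1024) // nothing available: allocate
	}
}

func (p *bufFreeList) put(b []byte) {
	select {
	case p.free <- b: // hand the buffer back for reuse
	default: // list is full: drop it and let the GC have it
	}
}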

The Go authors added a standard solution called sync.Pool: a fast free list that you can store heap objects in for reuse, and whose contents the GC is free to drop when they go unused.
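
The usual pattern looks something like the following minimal sketch. Note that we both Get and Put, because we still own the object's lifetime:

import (
	"bytes"
	"sync"
)

var bufferPool = sync.Pool{
	// New is called when the pool has nothing to hand out.
	New: func() interface{} { return new(bytes.Buffer) },
}

func handle(msg string) string {
	buf := bufferPool.Get().(*bytes.Buffer)
	buf.Reset()               // always reset pooled objects before reuse
	defer bufferPool.Put(buf) // we still control the lifetime, so we can Put it back

	buf.WriteString(msg)
	return buf.String()
}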

But here's the rub:

gRPC and third-party libraries control when an object will go out of scope. In gRPC, this can make expensive slices un-poolable: when you create and return an output object, you cannot pool a contained []byte slice, because the gRPC machinery takes the object and you no longer control its lifetime.

Deeper Look

This isn't actually just a gRPC problem: any time you have to pass an object to a third-party package, you lose the ability to reuse it via pooling.

Here is a simple proto definition for a gRPC service with a method called Record().

service Recorder {
   rpc Record(Input) returns (Output) {}
}

The code below implements the interface.

func (g *grpcService) Record(ctx context.Context, in *pb.Input) (*pb.Output, error) {
	out := &pb.Output{}
	...
	return out, nil
}

The first problem is that "in" is a complete loss. We can't reuse it because we cannot tell gRPC where to get its next input object.

The output object we create within the Record() function would be reusable, but it is returned to the gRPC service object that then has control of its lifetime.

gRPC is unfortunately caught in a bad position here: it cannot pool these objects itself either, because it cannot control the input/output objects' lifetimes.

Stubby, the Google internal version of gRPC, in the early days, had an interface that was similar to:

func (g *grpcService) Record(in *pb.Input, out *pb.Output) error {
	...
}

In this model they could have pooled, but the user would have to make copies of the input/output objects if they were to live past the Record() call. This is probably why they moved to a more standard function model: most SWEs would forget that detail and end up with data races. That is conjecture on my part.

Autopool - a use for runtime.SetFinalizer()

If you've never used runtime.SetFinalizer(), good for you. People like to think of finalizers as destructors, but in a GC'd language that makes no guarantees about object lifetime, that mindset just leads to problems.

A finalizer is simply a function that the runtime calls when the object it is attached to is about to be garbage collected.
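
A trivial example, using a made-up resource type:

import (
	"fmt"
	"runtime"
)

type resource struct {
	payload []byte
}

func attach() {
	r := &resource{payload: make([]byte, 1024)}

	// When the GC finds *r unreachable, it will (eventually) call this
	// function before freeing the memory. There is no guarantee about when
	// that happens, or that it happens at all before the program exits.
	runtime.SetFinalizer(r, func(r *resource) {
		fmt.Println("finalizing a resource of", len(r.payload), "bytes")
	})
}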

David Crawshaw has written a good article about finalizers being less than useful, so I will list his articles here and let you hear from an expert on why they are usually a bad idea:

What if we could use a finalizer to reclaim an object we have lost track of into a sync.Pool for reuse?

Enter autopool. Let's put it into the service.


type grpcService struct {
	...
	pool *autopool.Pool
	rescID int
	...
}

func newGRPC() *grpcService {
	...
	serv := &grpcService{}
	
	// Create our pool object.
	p := autopool.New()
	
	// Add a pool that will serve this object type and get the ID of the 
	// internal sync.Pool to pull from.
	serv.rescID = p.Add(reflect.TypeOf(&pb.Resource{}))
	serv.pool = p

	return serv
}

...

// Record implements the gRPC service Record() call defined in the protocol buffer.
func (g *grpcService) Record(ctx context.Context, in *pb.Input) (*pb.Output, error) {
	// Create our standard Output struct.
	var out = &pb.Output{...}
	
	// The output.Resource object has a []byte, which we want to be able to reuse.
	// So we yank it from our pool and reset the []byte to 0 length. You may have
	// to reset other fields.
	out.Resc = g.pool.Get(g.rescID).(*pb.Resource)
	out.Resc.Payload = out.Resc.Payload[0:0]
	
	// Somewhere here we'd want to modify the payload.
	...

	return out, nil
}

What you see happening here is that when we create our output object, we pull a sub-object that contains a []byte from our Pool.

So why are we able to reclaim our Protocol Buffers here where we could not before? And where is that happening?

autopool wraps standard sync.Pools for object types you define. When you pull one of these objects, it works exactly like a sync.Pool, except that we add a finalizer to the object which inserts it back into the pool when garbage collection tries to free it.
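
The real package has a bit more machinery (per-type pools and IDs, as in the service above), but the core trick looks roughly like this simplified, single-type sketch; this is not the actual autopool source:

import (
	"runtime"
	"sync"
)

// pool is a simplified, single-type sketch of the trick; the real autopool
// package manages a sync.Pool per registered type and hands out IDs.
type pool struct {
	p *sync.Pool
}

func newPool(newFn func() interface{}) *pool {
	return &pool{p: &sync.Pool{New: newFn}}
}

// Get behaves like sync.Pool.Get, except it attaches a finalizer so that
// when the GC decides the object is unreachable, the object is put back
// into the pool instead of being freed.
func (p *pool) Get() interface{} {
	v := p.p.Get()
	runtime.SetFinalizer(v, func(obj interface{}) {
		// By the time this runs, the runtime has detached the finalizer,
		// so the object can be handed out and finalized again by a later Get.
		p.p.Put(obj)
	})
	return v
}

Because a finalizer only runs once and is detached before it fires, Get can safely attach a fresh one every time the object is handed out.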

But you can't guarantee when the pool will actually be refilled, can you?

That is correct, especially if you are trying to hack your GC with a lot of the tricks I see around the web. The GC runs at certain memory pressures, so autopool finalizers won't necessarily run when the object goes out of scope, or ever.

But on any service that is getting a constant stream of requests, collections should happen often enough to keep the pool filled.

The cost of adding the finalizer is fairly low.

Is this worth doing?

From what I can tell, if your service is getting enough requests to keep a sync.Pool from freeing the memory and you have payloads at around 100KiB or higher, you start to see non-trivial gains.

Why not do this with all messages?

I gave that a try to see if it provided any benefit. I could not detect gains based purely on the number of allocations; the size of the allocations is what mattered.

Why not finalize just the slices then, wouldn't that be safer?

You can only set a finalizer on a pointer to an object obtained by calling new(), by taking the address of a composite literal, or by taking the address of a local variable. Reference types like slices and maps don't count.
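
To make that concrete, here is a rough sketch (wrapper is a hypothetical type):

import "runtime"

// wrapper is a hypothetical type used only to illustrate the restriction.
type wrapper struct {
	b []byte
}

func illustrate() {
	b := make([]byte, 1024)

	// Rejected at runtime: the first argument must be a pointer to an
	// allocated object, and a slice header is not one.
	//
	//	runtime.SetFinalizer(b, func([]byte) {})

	// This is allowed, because &wrapper{...} is the address of a composite
	// literal. But it finalizes the wrapper, not the backing array, and
	// nothing prevents other references to b's array from outliving it.
	w := &wrapper{b: b}
	runtime.SetFinalizer(w, func(w *wrapper) {
		// reclaim w.b here
	})
}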

Since my initial problem was about gRPC and protocol buffers (and proto3 specifically), I could not wrap my buffer. Even if I did, that would not guarantee that all references to the underlying array would be clear when the wrapper went away.

Keith Randall on go-nuts had a cool way of finalizing a slice's backing array; you can read about it here (thanks Keith).

However, that method did not allow me to capture the slice itself and was banking on a loophole that he was kind enough to point out is not spec compliant.

Garbage collection is a tricky beast; are you sure this won't cause problems?

Short answer: No

Longer answer:

Using this is like using the unsafe package: you had better be sure of what you are doing, and even then you might get bitten in the future.

When you manually call sync.Pool.Put(), you are asserting that the entire object is free for reuse; otherwise you get some nasty bugs. When an object is instead reclaimed by a finalizer, you have no idea whether a reference to an underlying slice is still held somewhere.

So this technique is not completely safe: you have to KNOW that no references to the contained slices or maps are held by the third-party code (like gRPC). When using this package, you need to pin the dependency's version in your mod file or vendor it, to avoid nasty surprises from changes in the upstream code.
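
To spell out the failure mode, here is a contrived sketch; thirdPartySend is a hypothetical stand-in for library code that retains a reference to the slice but not to the message:

// retained stands in for state the library keeps after the call returns.
var retained []byte

func thirdPartySend(r *pb.Resource) {
	retained = r.Payload // keeps the backing array alive, but not r
}

func hazard(p *autopool.Pool, rescID int) {
	r := p.Get(rescID).(*pb.Resource)
	r.Payload = append(r.Payload[:0], "request 1"...)
	thirdPartySend(r)
	// r is no longer referenced by us, so its finalizer may run and return
	// it to the pool even though retained still aliases r.Payload.

	r2 := p.Get(rescID).(*pb.Resource) // may hand back the same object
	r2.Payload = append(r2.Payload[:0], "request 2"...)
	// If it did, and the append reused the same backing array, retained
	// now silently reads "request 2": a corruption bug the pool cannot detect.
}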

gRPC seems quite happy at the moment with taking my output object and keeping no references to any of my fields once it serializes the output.

[Image: Long John Silver]
But imagine a pirate voice here: "Thar be bugs out thar!"

Let's See Some Numbers

I thought you'd never ask:

We have two types of benchmarks testing a gRPC service:

  • Using the autopool
  • Not using the autopool

My benchmark environment:

  • MacBook Pro laptop, circa 2015
  • Go 1.13

Note a few things:

  • I don't have a benchmarking machine
  • I'm not on Linux, for which I am sure the Go compiler produces more optimized binaries
  • I could have done something wrong in my benchmarks. This is likely
  • I could be drawing the wrong conclusions

Let's talk about what the server does:

  • Receives a message
  • Creates an output message. That output message has a []byte field.
  • The []byte field is filled to some buffer size, in 64-byte chunks.
  • Sends the output message back, which drops it

gRPC Service Benchmark


Without Pool Summary:
| Clients | Buffer Size | Requests | ns/op | B/op | allocs/op | Real | User | Sys |
| 100 | 1K | 100K | 1238696350 | 1703020680 | 16959429 | 0m1.965s | 0m8.653s | 0m1.095s |
| 100 | 10K | 100K | 3634045380 | 11715026968 | 17832296 | 0m4.472s | 0m17.188s | 0m7.466s |
| 100 | 50K | 10K | 1311663465 | 4924447952 | 1932345 | 0m2.072s | 0m4.658s | 0m2.731s |
| 100 | 50K | 100K | 15146798078 | 49332431456 | 19474287 | 0m16.331s | 0m40.992s | 0m27.367s |
| 100 | 100K | 10K | 2602280177 | 10144042400 | 2150441 | 0m3.793s | 0m7.413s | 0m5.290s |
| 100 | 100K | 100K | 33020455793 | 101572491024 | 21440598 | 0m34.581s | 1m14.539s | 0m57.384s |
| 100 | 3M | 10K | 185087993259 | 329795532600 | 9040208 | 3m7.449s | 4m16.773s | 7m6.174s |

With Pool Summary:
| Clients | Buffer Size | Requests | ns/op | B/op | allocs/op | Real | User | Sys |
| 100 | 1K | 100K | 1228820620 | 1514841392 | 16515117 | 0m1.942s | 0m8.842s | 0m1.170s |
| 100 | 10K | 100K | 3373151710 | 7457301736 | 16614532 | 0m4.187s | 0m15.015s | 0m6.795s |
| 100 | 50K | 10K | 1208806099 | 3166273448 | 1764917 | 0m1.992s | 0m3.987s | 0m2.550s |
| 100 | 50K | 100K | 12690099859 | 31439344120 | 17791186 | 0m13.965s | 0m31.267s | 0m23.260s |
| 100 | 100K | 10K | 2118462469 | 5874870160 | 1890988 | 0m3.351s | 0m5.465s | 0m4.063s |
| 100 | 100K | 100K | 29157830020 | 59277607656 | 19369437 | 0m30.743s | 0m57.534s | 0m52.227s |
| 100 | 3M | 10K | 131743623587 | 178073871536 | 8470306 | 2m14.139s | 2m53.578s | 4m32.568s |

Conclusions

1K Slices

| Has Pool | Buffer Size | Requests | ns/op | B/op | allocs/op | Real | User | Sys |
| No | 1K | 100K | 1238696350 | 1703020680 | 16959429 | 0m1.965s | 0m8.653s | 0m1.095s |
| Yes | 1K | 100K | 1228820620 | 1514841392 | 16515117 | 0m1.942s | 0m8.842s | 0m1.170s |

9.87573ms decrease in op time, roughly 179 MiB in allocation savings, 444,312 reduction in allocs.

Virtually no real time saved and a slight increase in kernel time. I'd say that there isn't enough benefit here to warrant usage.

10K Slices

| Has Pool | Buffer Size | Requests | ns/op | B/op | allocs/op | Real | User | Sys |
| No | 10K | 100K | 3634045380 | 11715026968 | 17832296 | 0m4.472s | 0m17.188s | 0m7.466s |
| Yes | 10K | 100K | 3373151710 | 7457301736 | 16614532 | 0m4.187s | 0m15.015s | 0m6.795s |

261ms decrease in op time, 4.0 GiB in allocation savings, 1,217,764 reduction in allocs.

Still almost no real world savings in time, slight reduction in user space and kernel space time. Wouldn't get excited about using it here.

50K Slices

| Has Pool | Buffer Size | Requests | ns/op | B/op | allocs/op | Real | User | Sys |
| No | 50K | 10K | 1311663465 | 4924447952 | 1932345 | 0m2.072s | 0m4.658s | 0m2.731s |
| Yes | 50K | 10K | 1208806099 | 3166273448 | 1764917 | 0m1.992s | 0m3.987s | 0m2.550s |

103ms decrease in op time, 1.6 GiB in allocation savings, 167,428 reduction in allocs.

Again, nothing to write home about here. But if you look at the 50K-buffer runs with 100K requests in the summary tables above, we start to see a reduction of several seconds in both real time and CPU time.

100K Slices

| Has Pool | Buffer Size | Requests | ns/op | B/op | allocs/op | Real | User | Sys |
| No | 100K | 10K | 2602280177 | 10144042400 | 2150441 | 0m3.793s | 0m7.413s | 0m5.290s |
| Yes | 100K | 10K | 2118462469 | 5874870160 | 1890988 | 0m3.351s | 0m5.465s | 0m4.063s |

484ms decrease in op time, 4 GiB in allocation savings, 259,453 reduction in allocs.

Here is where I think things get interesting. Real time doesn't really change that much, but we are spending noticeably less CPU time: roughly two seconds less User and over a second less Sys.

So if your byte slices average around 100KiB or more, this is where the technique might start to help.

3MiB Slices

| Has Pool | Buffer Size | Requests | ns/op | B/op | allocs/op | Real | User | Sys |
| No | 3M | 10K | 185087993259 | 329795532600 | 9040208 | 3m7.449s | 4m16.773s | 7m6.174s |
| Yes | 3M | 10K | 131743623587 | 178073871536 | 8470306 | 2m14.139s | 2m53.578s | 4m32.568s |

53s decrease in op time, 141 GiB in allocation savings, 569,902 reduction in allocs.

At the far end here we can see some significant savings: we saved nearly a minute of real time and several minutes of CPU time.

So in the MiB region of slice size, being able to recover these slices can provide some significant savings.

Final Conclusions

I think I've found a good use for finalizers that could really help speed up software where control of objects is lost to third party packages.

You might be asking why I only say that I think I've found a good use:

  • It is possible a mistake was made or an assumption went into this that is incorrect.
  • The conclusions may be wrong or attributed to another factor.
  • This is not peer reviewed.
  • This was not tested on the most popular platform, Linux. There could be optimizations there that make this moot.

I would also note that I'm not recommending this approach outright. There may be hidden gotchas I haven't thought of, and packages outside your control can easily change how they treat your slices. In most use cases, you give up control when you pass an object along.

The code is published. You are welcome to duplicate my findings or show where it is incorrect.

Until someone does, we won't know whether this was just the fever making me delusional or whether I have found something interesting.

Until then, cheers and happy holidays.

Note on some gotchas

  • Protocol Buffers have a Reset() function. In proto3, the generated Reset() simply replaces the struct with a fresh zero value, which drops the references to the slices. This means you cannot use Reset() together with this technique (see the sketch after this list).
  • If you are thinking of linking to this code, realize it is in a development branch, it is subject to change.
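
To see why, this is roughly what the proto3 generator (the pre-APIv2 golang/protobuf generator; details vary by generator version) emits for Reset():

// Roughly the generated code:
func (m *Resource) Reset() {
	*m = Resource{} // zeroes every field, dropping the Payload slice
}

Zeroing the struct drops the reference to Payload's backing array, which is exactly the memory we are trying to keep, so the Record() example above truncates the slice with Payload[0:0] by hand instead of calling Reset().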