Author here, would be happy to field any questions or feedback!

Thanks for the post, this is pretty cool! I feel like I've seen CUPTI have fairly high overhead depending on the CUDA version, but I'm not very confident -- did you happen to benchmark different workloads with CUPTI o