PyTorch Source Code Reading Notes

PyTorch Source Code Reading Notes (7): TorchDynamo

About TorchDynamo

TorchDynamo is a new feature in PyTorch 2.0 that can speed up most models without any changes to the model code. Basic usage:

    import torch

    def fn(x, y):
        a = torch.cos(x).cuda()
        b = torch.sin(y).cuda()
        return a + b

    new_fn = torch.compile(fn, backend="inductor")
    input_tensor = torch.randn(10000).to(device="cuda:0")
    a = new_fn(input_tensor, input_tensor)

How TorchDynamo works

The official documentation provides a diagram of how TorchDynamo works. For background on how Python compiles and runs code, see "Python Code Compilation and Execution (1): The Compilation Process" on K's blog (luokai.tech).

Bytecode optimization

TorchDynamo optimizes bytecode by capturing Python frame objects. Run the following code:

    from torch._dynamo import optimize
    import torch._dynamo.config
    import logging

    torch._dynamo.config.log_level = logging.INFO
    torch._dynamo.config.output_code = True

    @optimize()
    def toy_example(a, b):
        a *= 10
        b = b + 1
        return b

    for _ in range(100):
        toy_example(torch....
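To make "capturing the frame object and working on its bytecode" more concrete, here is a minimal sketch that uses only the standard-library `dis` module to look at the bytecode behind a function shaped like `toy_example`. It does not call TorchDynamo and does not reproduce its rewriting; it only shows the raw material (the code object and its instructions) that a frame-evaluation hook gets to see. The integer arguments are a stand-in for tensors, chosen so the snippet runs without PyTorch.

```python
import dis


def toy_example(a, b):
    a *= 10
    b = b + 1
    return b


# A frame executes a code object; the code object holds the bytecode
# and the names of the local variables.
code = toy_example.__code__
print(code.co_varnames)

# List the bytecode instructions. Exact opcode names differ between
# Python versions, so no specific listing is assumed here.
opnames = [ins.opname for ins in dis.get_instructions(toy_example)]
print(opnames)

# Plain integers stand in for tensors in this sketch.
result = toy_example(1, 2)
print(result)
```

TorchDynamo hooks into CPython's frame evaluation (PEP 523) to receive such frames before they run, which is why it can work without changes to the user's code.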

PyTorch Source Code Reading Notes (6): Building and Installing PyTorch 2.0

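1. About PyTorch 2.0

The PyTorch main branch is now at version 2.0 and adds a large number of new features; see PyTorch 2.0.

2. PyTorch 2.0 build environment

2.0 no longer supports CUDA 11.6. My previous build environment was WSL2 + Ubuntu 20.04 + CUDA 11.6 + gcc. This time I switched to WSL2 + Debian 11 + CUDA 11.7 + oneAPI MKL 2023.0.0 + gcc, and I also tried Windows 11 + CUDA 11.7 + the Visual Studio 2022 toolchain.

3. Building and installing for Python

2.0 can be installed directly with the following command:

    pip3 install numpy --pre torch[dynamo] --force-reinstall --extra-index-url https://download.pytorch.org/whl/nightly/cu117

To build and install it yourself, follow the official commands:

    python setup.py develop
    python setup.py install

A PyTorch installed with the commands above cannot run TorchDynamo. According to the official instructions, "To install GPU TorchDynamo dependencies, run make triton in the PyTorch repo root directory....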

PyTorch Source Code Reading Notes (5): TorchScript

Using TorchScript

Python API:

    import torch

    class MyCell(torch.nn.Module):
        def __init__(self):
            super(MyCell, self).__init__()
            self.linear = torch.nn.Linear(4, 4)

        def forward(self, x, h):
            new_h = torch.tanh(self.linear(x) + h)
            return new_h, new_h

    # Compile the module (in eval mode) to TorchScript.
    scripted_module = torch.jit.script(MyCell().eval())

C++ API:

    #include <torch/script.h> // One-stop header.

    #include <iostream>
    #include <memory>

    int main(int argc, const char* argv[]) {
      if (argc != 2) {
        std::cerr << "usage: example-app <path-to-exported-script-module>\n";
        return -1;
      }

      torch::jit::script::Module module;
      try {
        // Deserialize the ScriptModule from a file using torch::jit::load()....

PyTorch Source Code Reading Notes (4): The Autograd Tensor Library

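Tensor library

The tensor interface is defined in aten/src/ATen/core/TensorBody.h. The Tensor class contains a large amount of auto-generated code through which operators can be called. Tensor inherits from TensorBase, and many tensor-related functions are inherited from this parent class. TensorBase has one key member variable:

    protected:
      c10::intrusive_ptr<TensorImpl, UndefinedTensorImpl> impl_;

TensorImpl is the low-level representation of a tensor. It holds the actual data pointer and the metadata that describes the tensor. It inherits from c10::intrusive_ptr_target, which is the base class of the intrusive pointer facility in the c10 module.

PyTorch implements an intrusive pointer as a replacement for C++'s shared_ptr. A shared_ptr needs a separate control object to hold the reference count, whereas an intrusive pointer keeps the count inside the managed class itself, so it performs better. Every class used with an intrusive pointer has to provide the reference-counting functions, which here means inheriting from c10::intrusive_ptr_target. intrusive_ptr_target has the following two member variables: refcount_ records the strong reference count and weakcount_ records the weak reference count. Weak references make it possible to deal with reference cycles:

    mutable std::atomic<size_t> refcount_;
    mutable std::atomic<size_t> weakcount_;

TensorImpl has a member variable of type Storage, and Storage has the following member variable:

    protected:
      c10::intrusive_ptr<StorageImpl> storage_impl_;

StorageImpl inherits from c10::intrusive_ptr_target and is the class that actually holds the underlying data: it stores the raw data pointer. According to the official notes, the Storage design is inherited from the original Torch7 project; the maintainers would prefer to remove it, but doing so is troublesome and nobody has had the time.

Variable and Tensor

In recent versions of PyTorch, Variable and Tensor have been merged. The alias below is defined in the torch::autograd namespace, although the Variable-related APIs have not been removed completely:

    namespace torch::autograd { using Variable = at::Tensor; }

Autograd

Backpropagation API

Calling the backward function performs backpropagation:...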
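The difference between an intrusive pointer and shared_ptr described above can be illustrated with a toy model. This is only a sketch in Python, not the c10 implementation: the class and method names (`IntrusiveTarget`, `IntrusivePtr`, `copy`, `reset`, `on_last_release`) are hypothetical, the counters are plain integers rather than atomics, and weak references are left out. The point it shows is that the count lives in the managed object, so a handle needs no separately allocated control block.

```python
class IntrusiveTarget:
    """Toy stand-in for a class that embeds its own reference count."""

    def __init__(self):
        self.refcount = 0      # strong references, stored in the object itself
        self.released = False  # set once the last strong reference is gone

    def on_last_release(self):
        # Hypothetical hook: free owned resources when the count hits zero.
        self.released = True


class IntrusivePtr:
    """Toy strong handle that bumps the counter inside the target."""

    def __init__(self, target):
        self.target = target
        target.refcount += 1

    def copy(self):
        # Copying a handle only touches the target's embedded counter.
        return IntrusivePtr(self.target)

    def reset(self):
        if self.target is None:
            return
        self.target.refcount -= 1
        if self.target.refcount == 0:
            self.target.on_last_release()
        self.target = None


impl = IntrusiveTarget()
p = IntrusivePtr(impl)
q = p.copy()
print(impl.refcount)   # two handles share one embedded counter
p.reset()
q.reset()
print(impl.released)   # the last reset triggers the release hook
```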

PyTorch Source Code Reading Notes (3): Operator Calls

For operator registration, see "Native Operator Registration".

The operator call process

Finding the OperatorHandle

    // cmake-build-debug-wsl-gcc/aten/src/ATen/core/TensorBody.h
    inline at::Tensor Tensor::add(const at::Tensor & other, const at::Scalar & alpha) const {
        return at::_ops::add_Tensor::call(const_cast<Tensor&>(*this), other, alpha);
    }

    // cmake-build-debug-wsl-gcc/aten/src/ATen/ops/add_ops.h
    struct TORCH_API add_Tensor {
      using schema = at::Tensor (const at::Tensor &, const at::Tensor &, const at::Scalar &);
      using ptr_schema = schema*;
      // See Note [static constexpr char* members for windows NVCC]
      STATIC_CONSTEXPR_STR_INL_EXCEPT_WIN_CUDA(name, "aten::add")
      STATIC_CONSTEXPR_STR_INL_EXCEPT_WIN_CUDA(overload_name, "Tensor")
      STATIC_CONSTEXPR_STR_INL_EXCEPT_WIN_CUDA(schema_str, "add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor")
      static at::Tensor call(const at::Tensor & self, const at::Tensor & other, const at::Scalar & alpha);
      static at::Tensor redispatch(c10::DispatchKeySet dispatchKeySet, const at::Tensor & self, const at::Tensor & other, const at::Scalar & alpha);
    };

    // cmake-build-debug-wsl-gcc/aten/src/ATen/Operators_2....

PyTorch Source Code Reading Notes (2): Native Operator Registration

Operator definition

According to the official description, all native operators (functions) are defined in aten/src/ATen/native/native_functions.yaml. Taking the add operator as an example:

    - func: add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor
      device_check: NoCheck   # TensorIterator
      structured_delegate: add.out
      variants: function, method
      dispatch:
        SparseCPU, SparseCUDA: add_sparse
        SparseCsrCPU, SparseCsrCUDA: add_sparse_csr
        MkldnnCPU: mkldnn_add
        ZeroTensor: add_zerotensor
        NestedTensorCPU, NestedTensorCUDA: NestedTensor_add_Tensor
      tags: [canonical, pointwise]

Registering operator information

Operators register their schema through the following macro:

    // This file is auto-generated at cmake-build-debug-wsl-gcc/aten/src/ATen/RegisterSchema.cpp
    // TORCH_LIBRARY(aten, m) expands to:
    static void TORCH_LIBRARY_init_aten(torch::Library&);
    static const torch::detail::TorchLibraryInit TORCH_LIBRARY_static_init_aten(
        torch::Library::DEF, &TORCH_LIBRARY_init_aten, "aten", c10::nullopt, "_file_name_", 6);
    void TORCH_LIBRARY_init_aten(torch::Library& m) {
      m.def("add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor", {at::Tag::core, at::Tag::pointwise});
    }

Registration happens in m....
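The schema string passed to `m.def` packs several pieces of information into one line: the operator name, the overload name, the argument list and the return type. The sketch below splits such a string with plain string operations to make those parts visible. It is not PyTorch's schema parser; `split_schema` and its output keys are hypothetical names chosen for illustration, and prefixing the name with the library namespace ("aten") is an assumption based on the TORCH_LIBRARY(aten, m) context above.

```python
def split_schema(schema: str, namespace: str = "aten") -> dict:
    """Split a native_functions.yaml style schema string into its parts."""
    # "add.Tensor(...args...) -> Tensor": the return type follows the last "->".
    signature, returns = schema.rsplit("->", 1)
    # Everything before the first "(" is "name" or "name.overload".
    qualified, _, args = signature.partition("(")
    name, _, overload = qualified.partition(".")
    return {
        "name": f"{namespace}::{name.strip()}",
        "overload_name": overload.strip(),
        "arguments": args.strip().rstrip(")"),
        "returns": returns.strip(),
    }


parts = split_schema(
    "add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor"
)
for key, value in parts.items():
    print(f"{key}: {value}")
```

The pair (name, overload name) is what distinguishes `add.Tensor` from other overloads of `add`, which is why both appear as separate fields.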

PyTorch Source Code Reading Notes (0): Code Structure and Build

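PyTorch source tree layout

Following the official PyTorch description, the code is organized roughly as follows:

c10 - Core library: contains the most basic functionality; the code in aten/src/ATen/core is gradually being migrated here
aten - PyTorch's C++ tensor library, without autograd support
  aten/src -
    aten/src/ATen
      aten/src/ATen/core - core function library, gradually being migrated to c10
      aten/src/ATen/native - native operator library; most CPU operators live directly in this directory, except for operators with special compilation requirements, which are in the cpu directory
        aten/src/ATen/native/cpu - CPU operator implementations that must be compiled with special instruction sets such as AVX
        aten/src/ATen/native/cuda - CUDA operators
        aten/src/ATen/native/sparse - sparse-matrix operators for CPU and CUDA
        aten/src/ATen/native/mkl, aten/src/ATen/native/mkldnn, ... - the operators indicated by the folder names
torch - the actual PyTorch library; everything except torch/csrc is a Python module
  torch/csrc - library mixing Python and C++ code
    torch/csrc/jit - TorchScript JIT frontend
    torch/csrc/autograd - autograd implementation
    torch/csrc/api - The PyTorch C++ frontend.
    torch/csrc/distributed -
tools - code generation module; much of PyTorch's code is generated automatically at build time
test - unit tests for the Python frontend; unit tests for the C++ frontend are in other folders
caffe2 - the Caffe2 library merged into PyTorch; the official description of exactly what was merged is too abstract, so I will update this when I come across it

Building PyTorch for C++

PyTorch officially provides a separately packaged C++ library, libtorch...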

PyTorch Source Code Reading Notes (1): The Dispatcher

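What is the dispatcher

Edward Z. Yang, one of PyTorch's core authors, has written an introduction to the PyTorch dispatcher: Let's talk about the PyTorch dispatcher.

As a multi-platform neural network framework, PyTorch needs the following capability. Every general operator has to implement the same set of APIs, such as forward and backward, and these APIs have different implementations on different hardware: MKL may be used on CPU, CUDA on GPU, and each vendor's NPU accelerator may have its own low-level code. PyTorch has to call the matching implementation for the hardware and usage scenario at hand, and the dispatcher is what makes this possible. For each operator, the dispatcher maintains a table of function pointers that provides an implementation for each dispatch key.

Dispatcher

    class TORCH_API Dispatcher final {
      // Nested struct
      struct OperatorDef final {
        explicit OperatorDef(OperatorName&& op_name) : op(std::move(op_name)) {}
        impl::OperatorEntry op;
        size_t def_count = 0;
        size_t def_and_impl_count = 0;
      };

      // Member function
      C10_ALWAYS_INLINE static Dispatcher& singleton() {
        // ...
        static Dispatcher& s = realSingleton();
        /* Global singleton:
        C10_EXPORT Dispatcher& Dispatcher::realSingleton() {
          static Dispatcher _singleton;
          return _singleton;
        }
        */
        return s;
      }

      // Member variables
      LeftRight<ska::flat_hash_map<OperatorName, OperatorHandle>> operatorLookupTable_;
      std::list<OperatorDef> operators_;
    };

operatorLookupTable_ is the operator table. For the LeftRight implementation, see: Brief Announcement: Left-Right - A Concurrency Control Technique with Wait-Free Population Oblivious Reads. The rough idea is to keep two instances, left and right, of an arbitrary data structure. When reads and writes happen at the same time, readers use the left instance while the writer modifies the right one. Once the write completes, reads switch over to the right instance, and after all readers of the left instance have finished, the write is applied to the left instance as well. This form of concurrency control gives wait-free reads....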
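The idea of a per-operator table from dispatch key to implementation can be sketched in a few lines. This is a toy model, not the c10 Dispatcher: `ToyDispatcher`, `register` and `call` are hypothetical names, dispatch keys are plain strings, and the caller passes the key explicitly, whereas the real dispatcher derives it from the arguments. The kernels are placeholder lambdas that tag their result so the chosen path is visible.

```python
from typing import Callable, Dict


class ToyDispatcher:
    """Toy model: operator name -> {dispatch key -> kernel function}."""

    def __init__(self) -> None:
        self._tables: Dict[str, Dict[str, Callable]] = {}

    def register(self, op_name: str, dispatch_key: str, kernel: Callable) -> None:
        # Each operator gets its own table, filled in one key at a time.
        self._tables.setdefault(op_name, {})[dispatch_key] = kernel

    def call(self, op_name: str, dispatch_key: str, *args):
        table = self._tables[op_name]
        if dispatch_key not in table:
            raise NotImplementedError(
                f"no kernel registered for {op_name} with key {dispatch_key}"
            )
        return table[dispatch_key](*args)


dispatcher = ToyDispatcher()
# Placeholder kernels; real ones would call into MKL, CUDA, and so on.
dispatcher.register("aten::add", "CPU", lambda a, b: ("cpu kernel", a + b))
dispatcher.register("aten::add", "CUDA", lambda a, b: ("cuda kernel", a + b))

print(dispatcher.call("aten::add", "CPU", 1, 2))
print(dispatcher.call("aten::add", "CUDA", 1, 2))
```

Adding support for a new backend in this model means registering one more entry per operator, without touching the call sites.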