Commit df02585

sailxjx, HansBug, puyuan1996, LuciusMos, zjowowen authored

polish(xjx): 1.0 (#143)
* polish(xjx): change doc structure, add intro
* Replace head image
* polish(xjx): system, basic, middleware (#117)
* Add middleware
* Add system design
* Add middleware spec
* Change search background
* Add quick start
* Use pytorch theme
* Adjusting grammar and errors
* New logo
* doc(hansbug): add guides for unittest, visualization and code style. (#127)
* doc(hansbug): add 3 new pages
* doc(hansbug): add code style page
* dev(hansbug): add plantuml's documentation
* dev(hansbug): add note
* dev(hansbug): align the image to center
* dev(hansbug): add graphviz's documentation
* dev(hansbug): add documentation for draw.io
* dev(hansbug): fix problem on draw.io
* dev(hansbug): add introduction for snakeviz
* dev(hansbug): add introduction for snakeviz
* dev(hansbug): add former parts of unittest
* fix(hansbug): do some fix
* dev(hansbug): add writing guide for unittest
* fix(hansbug): fix bug of
* dev(hansbug): add running guide for unittest
* fix(hansbug): fix the last code block
* dev(hansbug): append features to visualization
* dev(hansbug): add code style guide
* dev(hansbug): move the docs to new path
* fix(hansbug): fix the problems in chinese pages
* fix(hansbug): use english tutorials
* feature(pu): add config_spec_zh, basic_rl_zh, exploration_rl_zh, imitation_learning_zh (#126)
* feature(pu): add config_spec_zh
* feature(pu): add config_spec_zh, basic_rl_zh, exploration_rl_zh, imitation_learning_zh
* polish(pu): polish index
* polish(pu): polish style
* polish(zlx): 24-cooperation (#122)
* polish(zlx): Init 24-cooperation (git + issue/pr)
* polish(zlx): Add git_guide and issue_pr
* fix(zlx): fix comments by xjx
* feature(zlx): Add en version of 24-cooperation
* polish(zlx): fix comments by xjx
* polish(zjow): polish and revise quickstart and installation. (#121)
* Polish Quickstart.
* Minor change.
* Minor change.
* polish(zlx): 13. envs (#118)
* polish(zlx): Move env to new place. Polish index and images
* polish(zlx): Modify image scale
* polish(zlx): Add space in zh version
* polish(zlx): Move env to new place. Polish index and images
* polish(zlx): Modify image scale
* fixbug(zlx): en index indent
* polish(zlx): polish format via make live
* polish(zlx): fix comments by xjx

Co-authored-by: zhaoliangxuan <[email protected]>

* doc(nyz): add distributed rl overview (#133)
* doc(nyz): add distributed rl overview
* polish(nyz): polish footnote and note
* doc(davide): transfer 12 policies (#120)
* filled index en and ch
* Update index.rst
* added dqn_zh in index
* doc(zms): 11_dizoo: add zh + en version of index (#130)
* 1st zh doc
* change
* change links
* add note
* draft version of en dizoo
* change a bit
* final version
* Update index.rst
* polish(nyz): add missing images and polish doc
* doc(lxl): 02_algo: add offline rl zh (#125)
* polish(lxl): fix grammar and typo
* resolve conflicts when changing branches
* doc(lxl): add 02_algo/offline_rl_zh draft
* add offline rl doc
* polish offline rl doc
* polish offline rl
* polish offline rl: reformat reference
* polish offline rl: fix typo
* doc(jrn): add 02_algo model_based_rl_zh (#128)
* doc(jrn): add 02_algo mbrl
* doc(jrn): add 02_algo mbrl
* doc(jrn): modify 02_algo mbrl
* doc(jrn): modify 02_algo mbrl
* doc(jrn): polish 02_algo mbrl zh
* modify(jrn): polish source/02_algo/model_based_rl_zh.rst
* polish model_based_rl_zh.rst again
* polish model_based_rl_zh.rst again
* doc(jrn): add 02_algo mbrl
* doc(jrn): add 02_algo mbrl
* doc(jrn): modify 02_algo mbrl
* doc(jrn): modify 02_algo mbrl
* doc(jrn): polish 02_algo mbrl zh
* modify(jrn): polish source/02_algo/model_based_rl_zh.rst
* polish model_based_rl_zh.rst again
* polish(zlx): 24-cooperation (#122)
* polish(zlx): Init 24-cooperation (git + issue/pr)
* polish(zlx): Add git_guide and issue_pr
* fix(zlx): fix comments by xjx
* feature(zlx): Add en version of 24-cooperation
* polish(zlx): fix comments by xjx
* polish model_based_rl_zh.rst again
* polish(zjow): polish and revise quickstart and installation. (#121)
* Polish Quickstart.
* Minor change.
* Minor change.
* polish(nyz): add offline rl and gtrxl images
* doc(pu): add exploration overview and footnote for exploration_rl_zh (#134)
* feature(pu): add config_spec_zh
* feature(pu): add config_spec_zh, basic_rl_zh, exploration_rl_zh, imitation_learning_zh
* polish(pu): polish index
* polish(pu): polish style
* polish(pu): polish style
* polish(pu): add exploration overview and footnote
* fix(pu): fix wrongly changed file
* polish(pu): add information-theory-based exploration part
* translate(zlx): Integrate past translation prs (#135)
* translate(nyp): diayn zh
* doc(py): ngu zh
* translate(gh): smac zh
* translate(gh): icm zh
* translate(cy): cartpole & gym-hybrid en
* translate(xzy): minigrid & pendulum en
* translate(zyc&yf): bipedalwalker & lunarlander en
* translate(hs): mujoco & procgen en
* translate(hs): r2d3 en
* polish(zlx): remove .. _ in 13_envs
* polish(zlx): polish format
* doc(wyh): algo02 MARL docs (#129)
* doc(wyh): marl
* doc(wyh): marl polish
* translation(lxl): add offline_rl_en & polish offline_rl_zh (#136)
* fix(nyz): fix offline rl author typo
* polish(zlx): polish mujoco & r2d3 by hs, which were ignored before (#138)
* doc(zms): add comments to "framework/middleware" (#137)
* add the refs to comments of "framework/middleware"
* change maxdepth of framework/index.rst from 4 to 2
* polish(lxl): polish offline_rl_zh, fix typo and grammar (#139)
* add offline_rl_en
* reorganize the description of Future & Outlooks
* polish
* doc(nyp): add best practice zh for our doc 1.0 (#132)
* doc(wzl): add pettingzoo.zh doc (#124)
* add pettingzoo_zh.rst
* update pettingzoo_zh.rst
* fix(hs): fix install atari_env error (#116)
* fix install atari_env error
* Update atari.rst
* Update atari_zh.rst
* add best practice zh for doc 1.0
* add rnn translation
* finish rnn; fix some translations (wrappers)
* modify regarding the comments
* change wrt comment
* modify unroll_len/sequence_len key
* fix multi-discrete action space

Co-authored-by: zerlinwang <[email protected]>
Co-authored-by: norman <[email protected]>
Co-authored-by: nieyunpeng <[email protected]>

* Cleanup old resources
* Space
* Fix offline rl

Co-authored-by: Hankson Bradley <[email protected]>
Co-authored-by: 蒲源 <[email protected]>
Co-authored-by: LuciusMos <[email protected]>
Co-authored-by: zjowowen <[email protected]>
Co-authored-by: zhaoliangxuan <[email protected]>
Co-authored-by: Swain <[email protected]>
Co-authored-by: Davide Liu <[email protected]>
Co-authored-by: zms <[email protected]>
Co-authored-by: lixl-st <[email protected]>
Co-authored-by: Jia Ruonan <[email protected]>
Co-authored-by: Weiyuhong-1998 <[email protected]>
Co-authored-by: Will-Nie <[email protected]>
Co-authored-by: zerlinwang <[email protected]>
Co-authored-by: norman <[email protected]>
Co-authored-by: nieyunpeng <[email protected]>
1 parent 6a7becd commit df02585

File tree

775 files changed: +9970 −13178 lines


.gitignore

+5 −4

@@ -1,10 +1,11 @@
-*.eps
-*.jpg
-*.svg
+*.puml.eps
+*.puml.jpg
+*.puml.svg
 .DS_Store
 build/
 source/_build
 _build/
 .vscode/
 venv/
-.idea/
+.idea/
+src/pytorch-sphinx-theme/

requirements.txt

+2 −1

@@ -1,8 +1,9 @@
 Pillow==8.2.0
 sphinx>=2.2.1,<=4.2
-sphinx_rtd_theme~=0.4.3
+sphinx_rtd_theme
 enum_tools
 sphinx-toolbox
 plantumlcli>=0.0.2
 sphinx-autobuild
 git+http://github.com/opendilab/DI-engine@main
+-e git+https://github.com/opendilab/pytorch_sphinx_theme.git#egg=pytorch_sphinx_theme

source/00_intro/index.rst

+37 (new file)

Introduction
===============================

What is DI-engine?
-------------------------------

DI-engine is a decision intelligence platform built by a group of enthusiastic researchers and engineers. It provides professional and convenient support for your reinforcement learning algorithm research and development, mainly including:

1. Comprehensive algorithm support, such as DQN, PPO, and SAC, plus many algorithms from research subfields: QMIX for multi-agent reinforcement learning, GAIL for inverse reinforcement learning, RND for exploration problems, and more.

2. A user-friendly interface: we abstract the most common objects in reinforcement learning tasks, such as environments and policies, and encapsulate complex reinforcement learning processes into middleware, allowing you to build your own learning process as you wish.

3. Flexible scalability: using the messaging components and event-driven programming interfaces integrated in the framework, you can scale your basic research work to industrial-grade large-scale training clusters, such as the StarCraft II agent `DI-star <https://github.com/opendilab/DI-star>`_.

.. image::
    ../images/system_layer.png

Key Concepts
-------------------------------

If you are not yet familiar with reinforcement learning, you can go to our `reinforcement learning tutorial <../10_concepts/index.html>`_ for a glimpse into the wonderful world of reinforcement learning.

If you have already been exposed to reinforcement learning, you will be familiar with its basic interacting objects: **environments** and **agents (or the policies that make them up)**.

Instead of creating more concepts, DI-engine abstracts the complex interaction logic between the two into declarative middleware, such as **collect**, **train**, **evaluate**, and **save_ckpt**. You can adapt each part of the process in the most natural way.

Using DI-engine is very easy: in the `quickstart <../01_quickstart/index.html>`_, we show how to quickly build a classic reinforcement learning pipeline with a simple example.
source/00_intro/index_zh.rst

+28 (new file)

Introduction to DI-engine
===============================

About DI-engine
-------------------------------

DI-engine is a decision intelligence platform built by a group of energetic researchers and engineers. It provides the most professional and convenient support for your reinforcement learning algorithm research and development, mainly including:

1. Complete algorithm support, such as DQN, PPO, and SAC, plus algorithms from many research subfields: QMIX in multi-agent reinforcement learning, GAIL in inverse reinforcement learning, RND for exploration problems, and so on.

2. A friendly user interface: we abstract most of the common objects in reinforcement learning tasks, such as environments and policies, and encapsulate complex reinforcement learning processes into rich middleware, letting you build your own learning process as you wish.

3. Elastic scalability: with the messaging components and event-driven programming interfaces integrated in the framework, you can flexibly scale basic research work to industrial-grade large-scale training clusters, such as the StarCraft II agent `DI-star <https://github.com/opendilab/DI-star>`_.

.. image::
    ../images/system_layer.png

Key Concepts
-------------------------------

If you are not yet familiar with reinforcement learning, you can go to our `reinforcement learning tutorial <../10_concepts/index_zh.html>`_ for a glimpse into its wonderful world.

If you have already been exposed to reinforcement learning, you will be familiar with its basic interacting objects: **environments** and **agents (or the policies that make them up)**.

Instead of creating more concepts, DI-engine abstracts the complex interaction logic between the two into declarative middleware, such as **collect**, **train**, **evaluate**, and **save_ckpt**. You can adjust each step of the process in the most natural way.

Using DI-engine is very simple: in the `quickstart <../01_quickstart/index_zh.html>`_ section, we use a simple example to show how to quickly build a classic reinforcement learning pipeline with DI-engine.
+109 (new file)

First Reinforcement Learning Program
======================================

.. toctree::
    :maxdepth: 2

CartPole is an ideal introductory environment for reinforcement learning, and the DQN algorithm lets CartPole converge (maintain equilibrium) in a very short time. We will introduce the use of DI-engine based on CartPole + DQN.

.. image::
    images/cartpole_cmp.gif
    :width: 1000
    :align: center

Using the Configuration File
------------------------------

DI-engine uses a global configuration file to control all variables of the environment and policy. Each has a corresponding default configuration, which can be found in `cartpole_dqn_config <https://github.com/opendilab/DI-engine/blob/main/dizoo/classic_control/cartpole/config/cartpole_dqn_config.py>`_; in this tutorial we use the default configuration directly:

.. code-block:: python

    from dizoo.classic_control.cartpole.config.cartpole_dqn_config import main_config, create_config
    from ding.config import compile_config

    cfg = compile_config(main_config, create_cfg=create_config, auto=True)

Initialize the Environments
------------------------------

In reinforcement learning, the way environment data is collected may differ between the training and evaluation processes. For example, training tends to run one training epoch for every n collected steps, while evaluation must complete a whole episode to obtain a score. We recommend initializing the collection and evaluation environments separately:

.. code-block:: python

    import gym
    from ding.envs import DingEnvWrapper, BaseEnvManagerV2

    collector_env = BaseEnvManagerV2(
        env_fn=[lambda: DingEnvWrapper(gym.make("CartPole-v0")) for _ in range(cfg.env.collector_env_num)],
        cfg=cfg.env.manager
    )
    evaluator_env = BaseEnvManagerV2(
        env_fn=[lambda: DingEnvWrapper(gym.make("CartPole-v0")) for _ in range(cfg.env.evaluator_env_num)],
        cfg=cfg.env.manager
    )

.. note::

    DingEnvWrapper is DI-engine's unified wrapper for different environment libraries. BaseEnvManagerV2 is a unified external interface for managing multiple environments, so you can use it to collect from multiple environments in parallel.

Select Policy
--------------

DI-engine covers most reinforcement learning policies; using them only requires selecting the right policy and model. Since DQN is off-policy, we also need to instantiate a buffer module.

.. code-block:: python

    from ding.model import DQN
    from ding.policy import DQNPolicy
    from ding.data import DequeBuffer

    model = DQN(**cfg.policy.model)
    buffer_ = DequeBuffer(size=cfg.policy.other.replay_buffer.replay_buffer_size)
    policy = DQNPolicy(cfg.policy, model=model)

Build the Pipeline
---------------------

With the various middleware provided by DI-engine, we can easily build the entire pipeline:

.. code-block:: python

    from ding.framework import task
    from ding.framework.context import OnlineRLContext
    from ding.framework.middleware import OffPolicyLearner, StepCollector, interaction_evaluator, data_pusher, eps_greedy_handler, CkptSaver

    with task.start(async_mode=False, ctx=OnlineRLContext()):
        # Evaluation; placed first so the random model's score serves as a baseline
        task.use(interaction_evaluator(cfg, policy.eval_mode, evaluator_env))
        task.use(eps_greedy_handler(cfg))  # Decay the explore-exploit probability
        task.use(StepCollector(cfg, policy.collect_mode, collector_env))  # Collect environment data
        task.use(data_pusher(cfg, buffer_))  # Push data to the buffer
        task.use(OffPolicyLearner(cfg, policy.learn_mode, buffer_))  # Train the model
        task.use(CkptSaver(cfg, policy, train_freq=100))  # Save the model
        # During evaluation, if the model is found to exceed the convergence score, training ends early here
        task.run()

Run the Code
--------------

The full example can be found in `DQN example <https://github.com/opendilab/DI-engine/blob/main/ding/example/dqn.py>`_ and can be run via ``python dqn.py``.

.. image::
    images/train_dqn.gif
    :width: 1000
    :align: center

You have now completed your first reinforcement learning task with DI-engine. You can try more algorithms in the `examples directory <https://github.com/opendilab/DI-engine/blob/main/ding/example>`_, or continue reading the documentation for a deeper understanding of DI-engine's `Algorithms <../02_algo/index.html>`_, `System Design <../03_system/index.html>`_ and `Best Practices <../04_best_practice/index.html>`_.
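The pipeline above decays the exploration probability with `eps_greedy_handler`. The sketch below shows the kind of schedule such a handler manages; the linear schedule and its parameters are assumptions for demonstration, not DI-engine's actual defaults:

```python
# Illustrative epsilon-greedy decay schedule: explore with probability
# epsilon, annealed linearly from `start` to `end` over `decay_steps`.
# Parameter values here are hypothetical, chosen only for demonstration.

def linear_epsilon(step, start=0.95, end=0.1, decay_steps=10000):
    """Linearly anneal the exploration probability from `start` to `end`."""
    if step >= decay_steps:
        return end
    frac = step / decay_steps
    return start + frac * (end - start)

eps_begin = linear_epsilon(0)      # act mostly at random early in training
eps_half = linear_epsilon(5000)    # halfway between start and end
eps_late = linear_epsilon(20000)   # clamped at `end`: mostly exploit
```

A collector would then pick a random action with probability `linear_epsilon(ctx.step)` and the greedy (argmax-Q) action otherwise.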
+100 (new file)

First Reinforcement Learning Program
======================================

.. toctree::
    :maxdepth: 2

CartPole is an ideal introductory environment for reinforcement learning, and the DQN algorithm lets CartPole converge (maintain equilibrium) in a very short time. We will introduce the usage of DI-engine based on CartPole + DQN.

.. image::
    images/cartpole_cmp.gif
    :width: 1000
    :align: center

Using the Configuration File
------------------------------

DI-engine uses a global configuration file to control all variables of the environment and policy. Each environment and policy has a corresponding default configuration; the full configuration can be found in `cartpole_dqn_config <https://github.com/opendilab/DI-engine/blob/main/dizoo/classic_control/cartpole/config/cartpole_dqn_config.py>`_. In this tutorial we use the default configuration directly:

.. code-block:: python

    from dizoo.classic_control.cartpole.config.cartpole_dqn_config import main_config, create_config
    from ding.config import compile_config

    cfg = compile_config(main_config, create_cfg=create_config, auto=True)

Initialize the Collection and Evaluation Environments
------------------------------------------------------

In reinforcement learning, the strategy for collecting environment data may differ between training and evaluation: training usually trains once for every n collected steps, while evaluation needs to finish a whole episode to obtain a score. We recommend initializing the collection and evaluation environments separately:

.. code-block:: python

    import gym
    from ding.envs import DingEnvWrapper, BaseEnvManagerV2

    collector_env = BaseEnvManagerV2(
        env_fn=[lambda: DingEnvWrapper(gym.make("CartPole-v0")) for _ in range(cfg.env.collector_env_num)],
        cfg=cfg.env.manager
    )
    evaluator_env = BaseEnvManagerV2(
        env_fn=[lambda: DingEnvWrapper(gym.make("CartPole-v0")) for _ in range(cfg.env.evaluator_env_num)],
        cfg=cfg.env.manager
    )

.. note::

    DingEnvWrapper is DI-engine's unified wrapper for different environment libraries. BaseEnvManagerV2 is a unified external interface for managing multiple environments; with it, multiple environments can be collected from in parallel.

Select Policy
--------------

DI-engine covers most reinforcement learning policies; using them only requires choosing the right policy and model. Since DQN is an off-policy algorithm, we also need to instantiate a buffer module.

.. code-block:: python

    from ding.model import DQN
    from ding.policy import DQNPolicy
    from ding.data import DequeBuffer

    model = DQN(**cfg.policy.model)
    buffer_ = DequeBuffer(size=cfg.policy.other.replay_buffer.replay_buffer_size)
    policy = DQNPolicy(cfg.policy, model=model)

Build the Training Pipeline
----------------------------

With the various middleware provided by DI-engine, we can easily build the whole training pipeline:

.. code-block:: python

    from ding.framework import task
    from ding.framework.context import OnlineRLContext
    from ding.framework.middleware import OffPolicyLearner, StepCollector, interaction_evaluator, data_pusher, eps_greedy_handler, CkptSaver

    with task.start(async_mode=False, ctx=OnlineRLContext()):
        task.use(interaction_evaluator(cfg, policy.eval_mode, evaluator_env))  # Evaluation; placed first to get the random model's score as a baseline
        task.use(eps_greedy_handler(cfg))  # Decay the explore-exploit probability
        task.use(StepCollector(cfg, policy.collect_mode, collector_env))  # Collect environment data
        task.use(data_pusher(cfg, buffer_))  # Save data to the buffer
        task.use(OffPolicyLearner(cfg, policy.learn_mode, buffer_))  # Train the model
        task.use(CkptSaver(cfg, policy, train_freq=100))  # Save the model
        task.run()  # During evaluation, if the model is found to exceed the convergence score, it ends early here

Run the Code
--------------

The complete example code can be found in `DQN example <https://github.com/opendilab/DI-engine/blob/main/ding/example/dqn.py>`_ and can be run via ``python dqn.py``.

.. image::
    images/train_dqn.gif
    :width: 1000
    :align: center

You have now completed your first reinforcement learning task with DI-engine. You can try more algorithms in the `examples directory <https://github.com/opendilab/DI-engine/blob/main/ding/example>`_, or continue reading the documentation for a deeper understanding of DI-engine's `Algorithms <../02_algo/index_zh.html>`_, `System Design <../03_system/index_zh.html>`_ and `Best Practices <../04_best_practice/index_zh.html>`_.
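Since DQN is off-policy, the quickstart instantiates a `DequeBuffer`. A minimal sketch of what such a FIFO replay buffer does, using only the standard library (the class and its methods below are illustrative, not `ding.data`'s actual API):

```python
import random
from collections import deque

# Minimal sketch of a FIFO replay buffer in the spirit of `DequeBuffer`.

class SimpleDequeBuffer:
    def __init__(self, size):
        # `maxlen` makes the deque evict the oldest transition on overflow
        self._data = deque(maxlen=size)

    def push(self, transition):
        self._data.append(transition)

    def sample(self, batch_size):
        # Uniform sampling without replacement from the stored transitions
        return random.sample(list(self._data), batch_size)

    def __len__(self):
        return len(self._data)

buf = SimpleDequeBuffer(size=3)
for i in range(5):
    buf.push({"obs": i, "reward": float(i)})
# With capacity 3, the transitions for obs 0 and 1 have been evicted.
batch = buf.sample(2)
```

In the pipeline, `data_pusher` fills such a buffer with collected transitions, and `OffPolicyLearner` draws training batches from it.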

source/01_quickstart/index.rst

+8 (new file)

Quickstart
============================

.. toctree::
    :maxdepth: 2

    installation
    first_rl_program

source/01_quickstart/index_zh.rst

+8 (new file)

Quickstart
============================

.. toctree::
    :maxdepth: 2

    installation_zh
    first_rl_program_zh

0 commit comments