Getting Started
===============

In this document, we provide some toy examples to get you started. All
of the examples in this document, and many more, are available in
`examples/ <https://github.com/datamllab/rlcard/tree/master/examples>`__.

Playing with Random Agents
--------------------------

We provide a random agent that can play randomly in each environment.
An example of applying a random agent to Blackjack is as follows:

.. code:: python

    import rlcard
    from rlcard.agents import RandomAgent
    from rlcard.utils import set_global_seed

    # Make environment
    env = rlcard.make('blackjack', config={'seed': 0})
    episode_num = 2

    # Set a global seed
    set_global_seed(0)

    # Set up agents
    agent_0 = RandomAgent(action_num=env.action_num)
    env.set_agents([agent_0])

    for episode in range(episode_num):

        # Generate data from the environment
        trajectories, _ = env.run(is_training=False)

        # Print out the trajectories
        print('\nEpisode {}'.format(episode))
        for ts in trajectories[0]:
            print('State: {}, Action: {}, Reward: {}, Next State: {}, Done: {}'.format(ts[0], ts[1], ts[2], ts[3], ts[4]))

The expected output should look something like the following:

::

    Episode 0
    State: {'obs': array([20, 3]), 'legal_actions': [0, 1]}, Action: 0, Reward: 0, Next State: {'obs': array([15, 3]), 'legal_actions': [0, 1]}, Done: False
    State: {'obs': array([15, 3]), 'legal_actions': [0, 1]}, Action: 1, Reward: -1, Next State: {'obs': array([15, 20]), 'legal_actions': [0, 1]}, Done: True

    Episode 1
    State: {'obs': array([15, 5]), 'legal_actions': [0, 1]}, Action: 1, Reward: 1, Next State: {'obs': array([15, 23]), 'legal_actions': [0, 1]}, Done: True

Note that the states and actions are wrapped by ``env`` in Blackjack. In
this example, ``[20, 3]`` means that the current player's hand scores 20
while the dealer's face-up card scores 3. Action 0 means "hit" and
action 1 means "stand". A reward of 1 means the player wins, -1 means
the dealer wins, and 0 means a tie. The above data can be fed directly
into an RL algorithm for training.
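
If you want to inspect these tuples in a more readable form, a small
helper such as the one below can be used. It is not part of RLCard; it
simply applies the action and reward meanings described above:

.. code:: python

    # A minimal helper (not part of RLCard) for pretty-printing the
    # Blackjack transitions produced by env.run(), using the action
    # meanings described above (0 = hit, 1 = stand).
    ACTION_NAMES = {0: 'hit', 1: 'stand'}

    def describe_transition(ts):
        state, action, reward, next_state, done = ts
        player_score, dealer_score = state['obs']
        print('Player {} vs. dealer {} -> {} (reward {}, done {})'.format(
            player_score, dealer_score, ACTION_NAMES[action], reward, done))

    # Usage with the trajectories generated above:
    # for ts in trajectories[0]:
    #     describe_transition(ts)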

Deep-Q Learning on Blackjack
----------------------------

The second example uses Deep-Q learning to train an agent on Blackjack.
We use this example to show how reinforcement learning algorithms can be
developed and applied with our toolkit. We design a ``run`` function
that plays one complete game and provides the data for training RL
agents. The example is shown below:

.. code:: python

    import tensorflow as tf
    import os

    import rlcard
    from rlcard.agents import DQNAgent
    from rlcard.utils import set_global_seed, tournament
    from rlcard.utils import Logger

    # Make environment
    env = rlcard.make('blackjack', config={'seed': 0})
    eval_env = rlcard.make('blackjack', config={'seed': 0})

    # Set the number of episodes and how frequently we evaluate the performance
    evaluate_every = 100
    evaluate_num = 10000
    episode_num = 100000

    # The initial memory size
    memory_init_size = 100

    # Train the agent every X steps
    train_every = 1

    # The path for saving the logs and learning curves
    log_dir = './experiments/blackjack_dqn_result/'

    # Set a global seed
    set_global_seed(0)

    with tf.Session() as sess:

        # Initialize a global step
        global_step = tf.Variable(0, name='global_step', trainable=False)

        # Set up the agent
        agent = DQNAgent(sess,
                         scope='dqn',
                         action_num=env.action_num,
                         replay_memory_init_size=memory_init_size,
                         train_every=train_every,
                         state_shape=env.state_shape,
                         mlp_layers=[10, 10])
        env.set_agents([agent])
        eval_env.set_agents([agent])

        # Initialize global variables
        sess.run(tf.global_variables_initializer())

        # Init a Logger to plot the learning curve
        logger = Logger(log_dir)

        for episode in range(episode_num):

            # Generate data from the environment
            trajectories, _ = env.run(is_training=True)

            # Feed transitions into agent memory, and train the agent
            for ts in trajectories[0]:
                agent.feed(ts)

            # Evaluate the performance
            if episode % evaluate_every == 0:
                logger.log_performance(env.timestep, tournament(eval_env, evaluate_num)[0])

        # Close files in the logger
        logger.close_files()

        # Plot the learning curve
        logger.plot('DQN')

        # Save model
        save_dir = 'models/blackjack_dqn'
        if not os.path.exists(save_dir):
            os.makedirs(save_dir)
        saver = tf.train.Saver()
        saver.save(sess, os.path.join(save_dir, 'model'))

The expected output is something like the following:

::

    ----------------------------------------
    timestep | 1
    reward | -0.7342
    ----------------------------------------
    INFO - Agent dqn, step 100, rl-loss: 1.0042707920074463
    INFO - Copied model parameters to target network.
    INFO - Agent dqn, step 136, rl-loss: 0.7888197302818298
    ----------------------------------------
    timestep | 136
    reward | -0.1406
    ----------------------------------------
    INFO - Agent dqn, step 278, rl-loss: 0.6946825981140137
    ----------------------------------------
    timestep | 278
    reward | -0.1523
    ----------------------------------------
    INFO - Agent dqn, step 412, rl-loss: 0.62268990278244025
    ----------------------------------------
    timestep | 412
    reward | -0.088
    ----------------------------------------
    INFO - Agent dqn, step 544, rl-loss: 0.69050502777099616
    ----------------------------------------
    timestep | 544
    reward | -0.08
    ----------------------------------------
    INFO - Agent dqn, step 681, rl-loss: 0.61789089441299444
    ----------------------------------------
    timestep | 681
    reward | -0.0793
    ----------------------------------------

In Blackjack, the player receives a payoff at the end of the game: 1 if
the player wins, -1 if the player loses, and 0 if it is a tie.
Performance is measured by the average payoff obtained over 10000
evaluation episodes. The example above shows that the agent achieves
progressively better performance during training. The logs and learning
curves are saved in ``./experiments/blackjack_dqn_result/``.
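
The saved checkpoint can later be restored with ``tf.train.Saver``. The
following is a minimal sketch, assuming the same ``DQNAgent`` settings
as in the script above, that reloads the model and plays one evaluation
game:

.. code:: python

    # A minimal sketch (assuming the same DQNAgent settings as above) of
    # restoring the saved checkpoint and running one evaluation game.
    import os

    import tensorflow as tf

    import rlcard
    from rlcard.agents import DQNAgent

    env = rlcard.make('blackjack', config={'seed': 0})

    with tf.Session() as sess:
        agent = DQNAgent(sess,
                         scope='dqn',
                         action_num=env.action_num,
                         replay_memory_init_size=100,
                         train_every=1,
                         state_shape=env.state_shape,
                         mlp_layers=[10, 10])
        saver = tf.train.Saver()
        saver.restore(sess, os.path.join('models/blackjack_dqn', 'model'))

        env.set_agents([agent])
        trajectories, payoffs = env.run(is_training=False)
        print('Payoff of the restored agent:', payoffs[0])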

Running Multiple Processes
--------------------------

The environments can be run with multiple processes to accelerate
training. Below is an example of training DQN on Blackjack with multiple
processes.

.. code:: python

    ''' An example of learning a Deep-Q Agent on Blackjack with multiple processes
        Note that we must use if __name__ == '__main__' for multiprocessing
    '''

    import tensorflow as tf
    import os

    import rlcard
    from rlcard.agents import DQNAgent
    from rlcard.utils import set_global_seed, tournament
    from rlcard.utils import Logger

    def main():
        # Make environment
        env = rlcard.make('blackjack', config={'seed': 0, 'env_num': 4})
        eval_env = rlcard.make('blackjack', config={'seed': 0, 'env_num': 4})

        # Set the number of iterations and how frequently we evaluate the performance
        evaluate_every = 100
        evaluate_num = 10000
        iteration_num = 100000

        # The initial memory size
        memory_init_size = 100

        # Train the agent every X steps
        train_every = 1

        # The path for saving the logs and learning curves
        log_dir = './experiments/blackjack_dqn_result/'

        # Set a global seed
        set_global_seed(0)

        with tf.Session() as sess:

            # Initialize a global step
            global_step = tf.Variable(0, name='global_step', trainable=False)

            # Set up the agent
            agent = DQNAgent(sess,
                             scope='dqn',
                             action_num=env.action_num,
                             replay_memory_init_size=memory_init_size,
                             train_every=train_every,
                             state_shape=env.state_shape,
                             mlp_layers=[10, 10])
            env.set_agents([agent])
            eval_env.set_agents([agent])

            # Initialize global variables
            sess.run(tf.global_variables_initializer())

            # Initialize a Logger to plot the learning curve
            logger = Logger(log_dir)

            for iteration in range(iteration_num):

                # Generate data from the environment
                trajectories, _ = env.run(is_training=True)

                # Feed transitions into agent memory, and train the agent
                for ts in trajectories[0]:
                    agent.feed(ts)

                # Evaluate the performance
                if iteration % evaluate_every == 0:
                    logger.log_performance(env.timestep, tournament(eval_env, evaluate_num)[0])

            # Close files in the logger
            logger.close_files()

            # Plot the learning curve
            logger.plot('DQN')

            # Save model
            save_dir = 'models/blackjack_dqn'
            if not os.path.exists(save_dir):
                os.makedirs(save_dir)
            saver = tf.train.Saver()
            saver.save(sess, os.path.join(save_dir, 'model'))

    if __name__ == '__main__':
        main()

Example output is as follows:

::

    ----------------------------------------
    timestep | 17
    reward | -0.7378
    ----------------------------------------

    INFO - Copied model parameters to target network.
    INFO - Agent dqn, step 1100, rl-loss: 0.40940183401107797
    INFO - Copied model parameters to target network.
    INFO - Agent dqn, step 2100, rl-loss: 0.44971221685409546
    INFO - Copied model parameters to target network.
    INFO - Agent dqn, step 2225, rl-loss: 0.65466868877410897
    ----------------------------------------
    timestep | 2225
    reward | -0.0658
    ----------------------------------------
    INFO - Agent dqn, step 3100, rl-loss: 0.48663979768753053
    INFO - Copied model parameters to target network.
    INFO - Agent dqn, step 4100, rl-loss: 0.71293979883193974
    INFO - Copied model parameters to target network.
    INFO - Agent dqn, step 4440, rl-loss: 0.55871248245239263
    ----------------------------------------
    timestep | 4440
    reward | -0.0736
    ----------------------------------------

Training CFR on Leduc Hold’em
-----------------------------

To show how we can use ``step`` and ``step_back`` to traverse the game
tree, we provide an example of solving Leduc Hold’em with CFR:
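
Although ``CFRAgent`` performs the traversal internally, the pattern
that ``allow_step_back`` enables looks roughly like the sketch below.
This is a simplified illustration rather than RLCard's CFR code, and it
assumes that ``env.step`` returns the next state together with the next
player's id, and that ``env.is_over`` and ``env.step_back`` behave as in
this version of RLCard:

.. code:: python

    import rlcard

    # A simplified illustration (not RLCard's CFR code) of depth-first
    # traversal with step/step_back: take an action, recurse into the
    # resulting state, then undo the action so siblings can be explored.
    env = rlcard.make('leduc-holdem', config={'seed': 0, 'allow_step_back': True})

    def traverse(env, state):
        if env.is_over():                                # terminal node
            return
        for action in state['legal_actions']:
            next_state, _next_player = env.step(action)  # descend into the child
            traverse(env, next_state)
            env.step_back()                              # roll back to this node

    # Starting a game from the root (the exact call may differ across
    # RLCard versions; CFRAgent handles this internally):
    # state, player_id = env.init_game()
    # traverse(env, state)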

.. code:: python

    import numpy as np

    import rlcard
    from rlcard.agents import CFRAgent
    from rlcard import models
    from rlcard.utils import set_global_seed, tournament
    from rlcard.utils import Logger

    # Make environment and allow step_back so that CFR can traverse the game tree
    env = rlcard.make('leduc-holdem', config={'seed': 0, 'allow_step_back': True})
    eval_env = rlcard.make('leduc-holdem', config={'seed': 0})

    # Set the number of iterations and how frequently we evaluate/save the plot
    evaluate_every = 100
    save_plot_every = 1000
    evaluate_num = 10000
    episode_num = 10000

    # The path for saving the logs and learning curves
    log_dir = './experiments/leduc_holdem_cfr_result/'

    # Set a global seed
    set_global_seed(0)

    # Initialize CFR Agent
    agent = CFRAgent(env)
    agent.load()  # If we have a saved model, we first load the model

    # Evaluate CFR against pre-trained NFSP
    eval_env.set_agents([agent, models.load('leduc-holdem-nfsp').agents[0]])

    # Init a Logger to plot the learning curve
    logger = Logger(log_dir)

    for episode in range(episode_num):
        agent.train()
        print('\rIteration {}'.format(episode), end='')
        # Evaluate the performance. Play with NFSP agents.
        if episode % evaluate_every == 0:
            agent.save()  # Save model
            logger.log_performance(env.timestep, tournament(eval_env, evaluate_num)[0])

    # Close files in the logger
    logger.close_files()

    # Plot the learning curve
    logger.plot('CFR')

In the above example, the performance is measured by playing against a
pre-trained NFSP model. The expected output is shown below:

::

    Iteration 0
    ----------------------------------------
    timestep | 192
    reward | -1.3662
    ----------------------------------------
    Iteration 100
    ----------------------------------------
    timestep | 19392
    reward | 0.9462
    ----------------------------------------
    Iteration 200
    ----------------------------------------
    timestep | 38592
    reward | 0.8591
    ----------------------------------------
    Iteration 300
    ----------------------------------------
    timestep | 57792
    reward | 0.7861
    ----------------------------------------
    Iteration 400
    ----------------------------------------
    timestep | 76992
    reward | 0.7752
    ----------------------------------------
    Iteration 500
    ----------------------------------------
    timestep | 96192
    reward | 0.7215
    ----------------------------------------

We observe that CFR achieves better performance than NFSP. However, CFR
requires full traversal of the game tree, which is infeasible in large
environments.

Having Fun with Pretrained Leduc Model
--------------------------------------

We have designed simple human interfaces to play against the pretrained
model. Leduc Hold’em is a simplified version of Texas Hold’em; its rules
can be found `here <games.md#leduc-holdem>`__. An example of playing
against the Leduc Hold’em CFR model is shown below:

.. code:: python

    import rlcard
    from rlcard import models
    from rlcard.agents import LeducholdemHumanAgent as HumanAgent
    from rlcard.utils import print_card

    # Make environment
    # Set 'record_action' to True because we need it to print results
    env = rlcard.make('leduc-holdem', config={'record_action': True})
    human_agent = HumanAgent(env.action_num)
    cfr_agent = models.load('leduc-holdem-cfr').agents[0]
    env.set_agents([human_agent, cfr_agent])

    print(">> Leduc Hold'em pre-trained model")

    while True:
        print(">> Start a new game")

        trajectories, payoffs = env.run(is_training=False)
        # If the human does not take the final action, we need to
        # print the other players' actions
        final_state = trajectories[0][-1][-2]
        action_record = final_state['action_record']
        state = final_state['raw_obs']
        _action_list = []
        for i in range(1, len(action_record)+1):
            if action_record[-i][0] == state['current_player']:
                break
            _action_list.insert(0, action_record[-i])
        for pair in _action_list:
            print('>> Player', pair[0], 'chooses', pair[1])

        # Let's take a look at the agent's card
        print('=============== CFR Agent ===============')
        print_card(env.get_perfect_information()['hand_cards'][1])

        print('=============== Result ===============')
        if payoffs[0] > 0:
            print('You win {} chips!'.format(payoffs[0]))
        elif payoffs[0] == 0:
            print('It is a tie.')
        else:
            print('You lose {} chips!'.format(-payoffs[0]))
        print('')

        input("Press any key to continue...")

Example output is as follows:

::

    >> Leduc Hold'em pre-trained model

    >> Start a new game!
    >> Agent 1 chooses raise

    =============== Community Card ===============
    ┌─────────┐
    │░░░░░░░░░│
    │░░░░░░░░░│
    │░░░░░░░░░│
    │░░░░░░░░░│
    │░░░░░░░░░│
    │░░░░░░░░░│
    │░░░░░░░░░│
    └─────────┘
    =============== Your Hand ===============
    ┌─────────┐
    │J        │
    │         │
    │         │
    │    ♥    │
    │         │
    │         │
    │        J│
    └─────────┘
    =============== Chips ===============
    Yours:   +
    Agent 1: +++
    =========== Actions You Can Choose ===========
    0: call, 1: raise, 2: fold

    >> You choose action (integer):

We also provide a running demo of a rule-based agent for UNO. Try it by
running ``examples/uno_human.py``.

Leduc Hold’em as Single-Agent Environment
-----------------------------------------

We have wrapped the environment as a single-agent environment by
assuming that the other players play with pre-trained models. The
interfaces are exactly the same as OpenAI Gym, so any single-agent
algorithm can be connected to the environment, as in the Leduc Hold’em
example below.
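
Before the full training example, here is a minimal sketch of the
Gym-like loop in single-agent mode. It plays one episode with random
legal actions and no learning, purely to illustrate the ``reset`` and
``step`` interface described above:

.. code:: python

    import numpy as np

    import rlcard

    # A minimal sketch of the Gym-style single-agent loop: the other
    # player is handled inside the environment, and we simply pick a
    # random legal action at every decision point.
    env = rlcard.make('leduc-holdem', config={'seed': 0, 'single_agent_mode': True})

    state = env.reset()
    done = False
    while not done:
        action = np.random.choice(state['legal_actions'])
        state, reward, done = env.step(action)
    print('Episode reward:', reward)

The complete example of training DQN in this single-agent environment is
shown below: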

.. code:: python

    import tensorflow as tf
    import os
    import numpy as np

    import rlcard
    from rlcard.agents import DQNAgent
    from rlcard.agents import RandomAgent
    from rlcard.utils import set_global_seed, tournament
    from rlcard.utils import Logger

    # Make environment
    env = rlcard.make('leduc-holdem', config={'seed': 0, 'single_agent_mode': True})
    eval_env = rlcard.make('leduc-holdem', config={'seed': 0, 'single_agent_mode': True})

    # Set the number of timesteps and how frequently we evaluate the performance
    evaluate_every = 1000
    evaluate_num = 10000
    timesteps = 100000

    # The initial memory size
    memory_init_size = 1000

    # Train the agent every X steps
    train_every = 1

    # The path for saving the logs and learning curves
    log_dir = './experiments/leduc_holdem_single_dqn_result/'

    # Set a global seed
    set_global_seed(0)

    with tf.Session() as sess:

        # Initialize a global step
        global_step = tf.Variable(0, name='global_step', trainable=False)

        # Set up the agent
        agent = DQNAgent(sess,
                         scope='dqn',
                         action_num=env.action_num,
                         replay_memory_init_size=memory_init_size,
                         train_every=train_every,
                         state_shape=env.state_shape,
                         mlp_layers=[128, 128])
        # Initialize global variables
        sess.run(tf.global_variables_initializer())

        # Init a Logger to plot the learning curve
        logger = Logger(log_dir)

        state = env.reset()

        for timestep in range(timesteps):
            action = agent.step(state)
            next_state, reward, done = env.step(action)
            ts = (state, action, reward, next_state, done)
            agent.feed(ts)
            # Advance to the next state; start a new game when the episode ends
            state = next_state
            if done:
                state = env.reset()

            # Evaluate the performance in the evaluation environment
            if timestep % evaluate_every == 0:
                rewards = []
                eval_state = eval_env.reset()
                for _ in range(evaluate_num):
                    action, _ = agent.eval_step(eval_state)
                    eval_state, reward, done = eval_env.step(action)
                    if done:
                        rewards.append(reward)
                        eval_state = eval_env.reset()
                logger.log_performance(env.timestep, np.mean(rewards))

        # Close files in the logger
        logger.close_files()

        # Plot the learning curve
        logger.plot('DQN')

        # Save model
        save_dir = 'models/leduc_holdem_single_dqn'
        if not os.path.exists(save_dir):
            os.makedirs(save_dir)
        saver = tf.train.Saver()
        saver.save(sess, os.path.join(save_dir, 'model'))