
update docs

master · lhenry15 committed 4 years ago · commit 64a7c8c4b1
9 changed files with 753 additions and 9 deletions:

  1. docs/source/conf.py (+11, -2)
  2. docs/source/doctree.rst (+31, -0)
  3. docs/source/getting_started.rst (+595, -0)
  4. docs/source/img/framework.pdf (BIN)
  5. docs/source/index.rst (+11, -3)
  6. docs/source/modules.rst (+1, -1)
  7. docs/source/overview.rst (+101, -0)
  8. docs/source/tods.rst (+1, -1)
  9. docs/source/tods.searcher.search.rst (+2, -2)

docs/source/conf.py (+11, -2)

@@ -28,7 +28,7 @@ def setup(app):

# -- Project information -----------------------------------------------------

- project = 'Time Series Outlier Detection System'
+ project = 'TODS'
copyright = '2020, DataLab@Texas A&M University'
author = 'DataLab@Texas A&M University'

@@ -56,21 +56,30 @@ extensions = [
templates_path = ['_templates']
source_suffix = '.rst'

# The master toctree document.
master_doc = 'doctree'

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = []

# The name of the Pygments (syntax highlighting) style to use.
pygments_style = None


# -- Options for HTML output -------------------------------------------------

# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
- html_theme = 'alabaster'
+ html_theme = 'sphinx_rtd_theme'

# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
html_sidebars = {
    '**': ['fulltoc.html', 'sourcelink.html', 'searchbox.html']
}


docs/source/doctree.rst (+31, -0)

@@ -0,0 +1,31 @@
.. rlcard documentation master file, created by
   sphinx-quickstart on Thu Sep 5 18:45:31 2019.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

.. toctree::
    :glob:
    :caption: Documentation:

    overview
    getting_started


.. toctree::
    :glob:
    :caption: API Documents:

    tods.data_processing
    tods.timeseries_processing
    tods.feature_analysis
    tods.detection_algorithm
    tods.reinforcement



Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`

docs/source/getting_started.rst (+595, -0)

@@ -0,0 +1,595 @@
Getting Started
===============

In this document, we provide some toy examples to get you started. All of
the examples in this document, and more, are available in
`examples/ <https://github.com/datamllab/rlcard/tree/master/examples>`__.

Playing with Random Agents
--------------------------

We have set up a random agent that plays random actions in each
environment. An example of applying a random agent to Blackjack is as
follows:

.. code:: python

    import rlcard
    from rlcard.agents import RandomAgent
    from rlcard.utils import set_global_seed

    # Make environment
    env = rlcard.make('blackjack', config={'seed': 0})
    episode_num = 2

    # Set a global seed
    set_global_seed(0)

    # Set up agents
    agent_0 = RandomAgent(action_num=env.action_num)
    env.set_agents([agent_0])

    for episode in range(episode_num):

        # Generate data from the environment
        trajectories, _ = env.run(is_training=False)

        # Print out the trajectories
        print('\nEpisode {}'.format(episode))
        for ts in trajectories[0]:
            print('State: {}, Action: {}, Reward: {}, Next State: {}, Done: {}'.format(ts[0], ts[1], ts[2], ts[3], ts[4]))

The expected output should look something like the following:

::

Episode 0
State: {'obs': array([20, 3]), 'legal_actions': [0, 1]}, Action: 0, Reward: 0, Next State: {'obs': array([15, 3]), 'legal_actions': [0, 1]}, Done: False
State: {'obs': array([15, 3]), 'legal_actions': [0, 1]}, Action: 1, Reward: -1, Next State: {'obs': array([15, 20]), 'legal_actions': [0, 1]}, Done: True

Episode 1
State: {'obs': array([15, 5]), 'legal_actions': [0, 1]}, Action: 1, Reward: 1, Next State: {'obs': array([15, 23]), 'legal_actions': [0, 1]}, Done: True

Note that the states and actions are wrapped by ``env`` in Blackjack. In
this example, ``[20, 3]`` means that the current player's score is 20
while the dealer's face-up card has a score of 3. Action 0 means "hit"
and action 1 means "stand". Reward 1 means the player wins, reward -1
means the dealer wins, and reward 0 means a tie. The above data can be
directly fed into an RL algorithm for training.
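
For illustration, the snippet below is a minimal sketch (not part of the
toolkit) that assumes the transition format shown above, i.e. each ``ts``
is a ``(state, action, reward, next_state, done)`` tuple whose
``state['obs']`` holds the player and dealer scores; it turns the raw
transitions into human-readable lines:

.. code:: python

    # Illustrative helper: decode the Blackjack transitions printed above.
    ACTION_NAMES = {0: 'hit', 1: 'stand'}

    def describe_transition(ts):
        state, action, reward, _, done = ts
        player_score, dealer_score = state['obs']
        if done:
            outcome = {1: 'player wins', -1: 'dealer wins', 0: 'tie'}[reward]
        else:
            outcome = 'in progress'
        return 'player {} vs dealer {} -> {} ({})'.format(
            player_score, dealer_score, ACTION_NAMES[action], outcome)

    for ts in trajectories[0]:
        print(describe_transition(ts))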

Deep-Q Learning on Blackjack
----------------------------

The second example is to use Deep-Q learning to train an agent on
Blackjack. We aim to use this example to show how reinforcement learning
algorithms can be developed and applied in our toolkit. We design a
``run`` function which plays one complete game and provides the data for
training RL agents. The example is shown below:

.. code:: python

    import tensorflow as tf
    import os

    import rlcard
    from rlcard.agents import DQNAgent
    from rlcard.utils import set_global_seed, tournament
    from rlcard.utils import Logger

    # Make environment
    env = rlcard.make('blackjack', config={'seed': 0})
    eval_env = rlcard.make('blackjack', config={'seed': 0})

    # Set the iteration numbers and how frequently we evaluate/save the plot
    evaluate_every = 100
    evaluate_num = 10000
    episode_num = 100000

    # The initial memory size
    memory_init_size = 100

    # Train the agent every X steps
    train_every = 1

    # The paths for saving the logs and learning curves
    log_dir = './experiments/blackjack_dqn_result/'

    # Set a global seed
    set_global_seed(0)

    with tf.Session() as sess:

        # Initialize a global step
        global_step = tf.Variable(0, name='global_step', trainable=False)

        # Set up the agents
        agent = DQNAgent(sess,
                         scope='dqn',
                         action_num=env.action_num,
                         replay_memory_init_size=memory_init_size,
                         train_every=train_every,
                         state_shape=env.state_shape,
                         mlp_layers=[10,10])
        env.set_agents([agent])
        eval_env.set_agents([agent])

        # Initialize global variables
        sess.run(tf.global_variables_initializer())

        # Init a Logger to plot the learning curve
        logger = Logger(log_dir)

        for episode in range(episode_num):

            # Generate data from the environment
            trajectories, _ = env.run(is_training=True)

            # Feed transitions into agent memory, and train the agent
            for ts in trajectories[0]:
                agent.feed(ts)

            # Evaluate the performance. Play with random agents.
            if episode % evaluate_every == 0:
                logger.log_performance(env.timestep, tournament(eval_env, evaluate_num)[0])

        # Close files in the logger
        logger.close_files()

        # Plot the learning curve
        logger.plot('DQN')

        # Save model
        save_dir = 'models/blackjack_dqn'
        if not os.path.exists(save_dir):
            os.makedirs(save_dir)
        saver = tf.train.Saver()
        saver.save(sess, os.path.join(save_dir, 'model'))

The expected output looks something like the following:

::

----------------------------------------
timestep | 1
reward | -0.7342
----------------------------------------
INFO - Agent dqn, step 100, rl-loss: 1.0042707920074463
INFO - Copied model parameters to target network.
INFO - Agent dqn, step 136, rl-loss: 0.7888197302818298
----------------------------------------
timestep | 136
reward | -0.1406
----------------------------------------
INFO - Agent dqn, step 278, rl-loss: 0.6946825981140137
----------------------------------------
timestep | 278
reward | -0.1523
----------------------------------------
INFO - Agent dqn, step 412, rl-loss: 0.62268990278244025
----------------------------------------
timestep | 412
reward | -0.088
----------------------------------------
INFO - Agent dqn, step 544, rl-loss: 0.69050502777099616
----------------------------------------
timestep | 544
reward | -0.08
----------------------------------------
INFO - Agent dqn, step 681, rl-loss: 0.61789089441299444
----------------------------------------
timestep | 681
reward | -0.0793
----------------------------------------

In Blackjack, the player will get a payoff at the end of the game: 1 if
the player wins, -1 if the player loses, and 0 if it is a tie. The
performance is measured by the average payoff the player obtains by
playing 10000 episodes. The above example shows that the agent achieves
better and better performance during training. The logs and learning
curves are saved in ``./experiments/blackjack_dqn_result/``.

Running Multiple Processes
--------------------------

The environments can be run with multiple processes to accelerate
training. Below is an example of training DQN on Blackjack with multiple
processes.

.. code:: python

    ''' An example of learning a Deep-Q Agent on Blackjack with multiple processes
    Note that we must use if __name__ == '__main__' for multiprocessing
    '''

    import tensorflow as tf
    import os

    import rlcard
    from rlcard.agents import DQNAgent
    from rlcard.utils import set_global_seed, tournament
    from rlcard.utils import Logger

    def main():
        # Make environment
        env = rlcard.make('blackjack', config={'seed': 0, 'env_num': 4})
        eval_env = rlcard.make('blackjack', config={'seed': 0, 'env_num': 4})

        # Set the iteration numbers and how frequently we evaluate performance
        evaluate_every = 100
        evaluate_num = 10000
        iteration_num = 100000

        # The initial memory size
        memory_init_size = 100

        # Train the agent every X steps
        train_every = 1

        # The paths for saving the logs and learning curves
        log_dir = './experiments/blackjack_dqn_result/'

        # Set a global seed
        set_global_seed(0)

        with tf.Session() as sess:

            # Initialize a global step
            global_step = tf.Variable(0, name='global_step', trainable=False)

            # Set up the agents
            agent = DQNAgent(sess,
                             scope='dqn',
                             action_num=env.action_num,
                             replay_memory_init_size=memory_init_size,
                             train_every=train_every,
                             state_shape=env.state_shape,
                             mlp_layers=[10,10])
            env.set_agents([agent])
            eval_env.set_agents([agent])

            # Initialize global variables
            sess.run(tf.global_variables_initializer())

            # Initialize a Logger to plot the learning curve
            logger = Logger(log_dir)

            for iteration in range(iteration_num):

                # Generate data from the environment
                trajectories, _ = env.run(is_training=True)

                # Feed transitions into agent memory, and train the agent
                for ts in trajectories[0]:
                    agent.feed(ts)

                # Evaluate the performance. Play with random agents.
                if iteration % evaluate_every == 0:
                    logger.log_performance(env.timestep, tournament(eval_env, evaluate_num)[0])

            # Close files in the logger
            logger.close_files()

            # Plot the learning curve
            logger.plot('DQN')

            # Save model
            save_dir = 'models/blackjack_dqn'
            if not os.path.exists(save_dir):
                os.makedirs(save_dir)
            saver = tf.train.Saver()
            saver.save(sess, os.path.join(save_dir, 'model'))

    if __name__ == '__main__':
        main()

Example output is as follows:

::

----------------------------------------
timestep | 17
reward | -0.7378
----------------------------------------

INFO - Copied model parameters to target network.
INFO - Agent dqn, step 1100, rl-loss: 0.40940183401107797
INFO - Copied model parameters to target network.
INFO - Agent dqn, step 2100, rl-loss: 0.44971221685409546
INFO - Copied model parameters to target network.
INFO - Agent dqn, step 2225, rl-loss: 0.65466868877410897
----------------------------------------
timestep | 2225
reward | -0.0658
----------------------------------------
INFO - Agent dqn, step 3100, rl-loss: 0.48663979768753053
INFO - Copied model parameters to target network.
INFO - Agent dqn, step 4100, rl-loss: 0.71293979883193974
INFO - Copied model parameters to target network.
INFO - Agent dqn, step 4440, rl-loss: 0.55871248245239263
----------------------------------------
timestep | 4440
reward | -0.0736
----------------------------------------

Training CFR on Leduc Hold’em
-----------------------------

To show how we can use ``step`` and ``step_back`` to traverse the game
tree, we provide an example of solving Leduc Hold’em with CFR:

.. code:: python

    import numpy as np

    import rlcard
    from rlcard.agents import CFRAgent
    from rlcard import models
    from rlcard.utils import set_global_seed, tournament
    from rlcard.utils import Logger

    # Make environment and enable step_back so that CFR can traverse the game tree
    env = rlcard.make('leduc-holdem', config={'seed': 0, 'allow_step_back': True})
    eval_env = rlcard.make('leduc-holdem', config={'seed': 0})

    # Set the iteration numbers and how frequently we evaluate/save the plot
    evaluate_every = 100
    save_plot_every = 1000
    evaluate_num = 10000
    episode_num = 10000

    # The paths for saving the logs and learning curves
    log_dir = './experiments/leduc_holdem_cfr_result/'

    # Set a global seed
    set_global_seed(0)

    # Initialize the CFR Agent
    agent = CFRAgent(env)
    agent.load()  # If we have a saved model, load it first

    # Evaluate CFR against pre-trained NFSP
    eval_env.set_agents([agent, models.load('leduc-holdem-nfsp').agents[0]])

    # Init a Logger to plot the learning curve
    logger = Logger(log_dir)

    for episode in range(episode_num):
        agent.train()
        print('\rIteration {}'.format(episode), end='')

        # Evaluate the performance. Play with NFSP agents.
        if episode % evaluate_every == 0:
            agent.save()  # Save model
            logger.log_performance(env.timestep, tournament(eval_env, evaluate_num)[0])

    # Close files in the logger
    logger.close_files()

    # Plot the learning curve
    logger.plot('CFR')

In the above example, the performance is measured by playing against a
pre-trained NFSP model. The expected output is as follows:

::

Iteration 0
----------------------------------------
timestep | 192
reward | -1.3662
----------------------------------------
Iteration 100
----------------------------------------
timestep | 19392
reward | 0.9462
----------------------------------------
Iteration 200
----------------------------------------
timestep | 38592
reward | 0.8591
----------------------------------------
Iteration 300
----------------------------------------
timestep | 57792
reward | 0.7861
----------------------------------------
Iteration 400
----------------------------------------
timestep | 76992
reward | 0.7752
----------------------------------------
Iteration 500
----------------------------------------
timestep | 96192
reward | 0.7215
----------------------------------------

We observe that CFR achieves better performance than NFSP. However, CFR
requires traversal of the game tree, which is infeasible in large
environments.

Having Fun with Pretrained Leduc Model
--------------------------------------

We have designed simple human interfaces for playing against the
pretrained model. Leduc Hold'em is a simplified version of Texas
Hold'em. The rules can be found `here <games.md#leduc-holdem>`__. An
example of playing against the Leduc Hold'em CFR model is shown below:

.. code:: python

    import rlcard
    from rlcard import models
    from rlcard.agents import LeducholdemHumanAgent as HumanAgent
    from rlcard.utils import print_card

    # Make environment
    # Set 'record_action' to True because we need it to print the results
    env = rlcard.make('leduc-holdem', config={'record_action': True})
    human_agent = HumanAgent(env.action_num)
    cfr_agent = models.load('leduc-holdem-cfr').agents[0]
    env.set_agents([human_agent, cfr_agent])

    print(">> Leduc Hold'em pre-trained model")

    while True:
        print(">> Start a new game")

        trajectories, payoffs = env.run(is_training=False)

        # If the human does not take the final action, we need to
        # print the other players' actions
        final_state = trajectories[0][-1][-2]
        action_record = final_state['action_record']
        state = final_state['raw_obs']
        _action_list = []
        for i in range(1, len(action_record)+1):
            if action_record[-i][0] == state['current_player']:
                break
            _action_list.insert(0, action_record[-i])
        for pair in _action_list:
            print('>> Player', pair[0], 'chooses', pair[1])

        # Let's take a look at the agent's card
        print('=============== CFR Agent ===============')
        print_card(env.get_perfect_information()['hand_cards'][1])

        print('=============== Result ===============')
        if payoffs[0] > 0:
            print('You win {} chips!'.format(payoffs[0]))
        elif payoffs[0] == 0:
            print('It is a tie.')
        else:
            print('You lose {} chips!'.format(-payoffs[0]))
        print('')

        input("Press any key to continue...")

Example output is as follows:

::

>> Leduc Hold'em pre-trained model

>> Start a new game!
>> Agent 1 chooses raise

=============== Community Card ===============
┌─────────┐
│░░░░░░░░░│
│░░░░░░░░░│
│░░░░░░░░░│
│░░░░░░░░░│
│░░░░░░░░░│
│░░░░░░░░░│
│░░░░░░░░░│
└─────────┘
=============== Your Hand ===============
┌─────────┐
│J │
│ │
│ │
│ ♥ │
│ │
│ │
│ J│
└─────────┘
=============== Chips ===============
Yours: +
Agent 1: +++
=========== Actions You Can Choose ===========
0: call, 1: raise, 2: fold

>> You choose action (integer):

We also provide a running demo of a rule-based agent for UNO. Try it by
running ``examples/uno_human.py``.

Leduc Hold’em as Single-Agent Environment
-----------------------------------------

We have wrapped the environment as a single-agent environment by
assuming that the other players act according to pre-trained models. The
interfaces are exactly the same as OpenAI Gym's. Thus, any single-agent
algorithm can be connected to the environment. An example on Leduc
Hold'em is shown below:

.. code:: python

    import tensorflow as tf
    import os
    import numpy as np

    import rlcard
    from rlcard.agents import DQNAgent
    from rlcard.agents import RandomAgent
    from rlcard.utils import set_global_seed, tournament
    from rlcard.utils import Logger

    # Make environment
    env = rlcard.make('leduc-holdem', config={'seed': 0, 'single_agent_mode': True})
    eval_env = rlcard.make('leduc-holdem', config={'seed': 0, 'single_agent_mode': True})

    # Set the iteration numbers and how frequently we evaluate/save the plot
    evaluate_every = 1000
    evaluate_num = 10000
    timesteps = 100000

    # The initial memory size
    memory_init_size = 1000

    # Train the agent every X steps
    train_every = 1

    # The paths for saving the logs and learning curves
    log_dir = './experiments/leduc_holdem_single_dqn_result/'

    # Set a global seed
    set_global_seed(0)

    with tf.Session() as sess:

        # Initialize a global step
        global_step = tf.Variable(0, name='global_step', trainable=False)

        # Set up the agents
        agent = DQNAgent(sess,
                         scope='dqn',
                         action_num=env.action_num,
                         replay_memory_init_size=memory_init_size,
                         train_every=train_every,
                         state_shape=env.state_shape,
                         mlp_layers=[128,128])

        # Initialize global variables
        sess.run(tf.global_variables_initializer())

        # Init a Logger to plot the learning curve
        logger = Logger(log_dir)

        state = env.reset()

        for timestep in range(timesteps):
            action = agent.step(state)
            next_state, reward, done = env.step(action)
            ts = (state, action, reward, next_state, done)
            agent.feed(ts)
            # Move to the next state, starting a new game when the current one ends
            state = env.reset() if done else next_state

            # Evaluate on the evaluation environment
            if timestep % evaluate_every == 0:
                rewards = []
                eval_state = eval_env.reset()
                for _ in range(evaluate_num):
                    action, _ = agent.eval_step(eval_state)
                    eval_state, reward, done = eval_env.step(action)
                    if done:
                        rewards.append(reward)
                        eval_state = eval_env.reset()
                logger.log_performance(env.timestep, np.mean(rewards))

        # Close files in the logger
        logger.close_files()

        # Plot the learning curve
        logger.plot('DQN')

        # Save model
        save_dir = 'models/leduc_holdem_single_dqn'
        if not os.path.exists(save_dir):
            os.makedirs(save_dir)
        saver = tf.train.Saver()
        saver.save(sess, os.path.join(save_dir, 'model'))

docs/source/img/framework.pdf (BIN)


docs/source/index.rst (+11, -3)

@@ -3,17 +3,25 @@
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.

- Welcome to Time Series Outlier Detection System's documentation!
+ Welcome to TODS's documentation!
================================================================

.. toctree::
-   :maxdepth: 2
+   :maxdepth: 4
    :caption: Contents:



- Indices and tables
+ API Documents
==================

.. toctree::
    :maxdepth: 4
    :caption: API Documents:

    tods.data_processing
    tods.timeseries_processing
    tods.feature_analysis
    tods.detection_algorithm
    tods.reinforcement

* :ref:`genindex`
* :ref:`modindex`


docs/source/modules.rst (+1, -1)

@@ -2,6 +2,6 @@ tods
====

.. toctree::
-   :maxdepth: 3
+   :maxdepth: 4

    tods

docs/source/overview.rst (+101, -0)

@@ -0,0 +1,101 @@
Overview
========

Design Principles
~~~~~~~~~~~~~~~~~

The toolkit wraps each game in an ``Env`` class with easy-to-use
interfaces. The goal of this toolkit is to let users focus on
algorithm development without worrying about the environment. The
following design principles are applied when developing the toolkit:

* **Reproducible.** Results on the environments can be reproduced: running with the same random seed should give the same result across runs.
* **Accessible.** The experiences are collected and well organized after each game with easy-to-use interfaces. Users can conveniently configure the state representation, action encoding, reward design, or even the game rules.
* **Scalable.** New card environments can be added conveniently to the toolkit following the above design principles. We also try to minimize the dependencies in the toolkit so that the code can be easily maintained.

TODS High-level Design
~~~~~~~~~~~~~~~~~~~~~~~~

This document introduces the high-level design for the environments, the
games, and the agents (algorithms).

.. image:: img/framework.pdf
    :width: 800



Data-Processing
---------------

We wrap each game with an ``Env`` class. The responsibility of ``Env``
is to help you generate trajectories of the games. For developing
Reinforcement Learning (RL) algorithms, we recommend the following
interfaces (a minimal usage sketch follows the list):

- ``set_agents``: This function tells the ``Env`` which agents will be
  used to perform actions in the game. Different games may have a
  different number of agents. The input of the function is a list of
  ``Agent`` instances. For example,
  ``env.set_agents([RandomAgent(), RandomAgent()])`` indicates that two
  random agents will be used to generate the trajectories.
- ``run``: After setting the agents, this interface runs a complete
  trajectory of the game, calculates the reward for each transition, and
  reorganizes the data so that it can be directly fed into an RL
  algorithm.
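
The two interfaces above are typically used together as in the minimal
sketch below (it simply reuses the Blackjack example from Getting
Started):

.. code:: python

    import rlcard
    from rlcard.agents import RandomAgent

    # Create the environment and register the agents that will act in it
    env = rlcard.make('blackjack', config={'seed': 0})
    env.set_agents([RandomAgent(action_num=env.action_num)])

    # Play one complete game; the transitions come back ready for RL training
    trajectories, payoffs = env.run(is_training=False)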

For advanced access to the environment, such as traversal of the game
tree, we provide the following interfaces:

- ``step``: Given the current state, the environment takes one step
  forward and returns the next state and the next player.
- ``step_back``: Takes one step backward. The environment restores
  the previous state. ``step_back`` is turned off by default since it
  requires expensively recording previous states. To turn it on, set
  ``allow_step_back = True`` when calling ``make``.
- ``get_payoffs``: At the end of the game, this function can be called
  to obtain the payoffs for each player.

We also support single-agent mode and human mode. Examples can be found
in ``examples/``.

- Single-agent mode: single-agent environments are developed by
  simulating the other players with pre-trained models or rule-based
  models. You can enable single-agent mode with
  ``rlcard.make(ENV_ID, config={'single_agent_mode': True})``. The
  ``step`` function then returns ``(next_state, reward, done)`` just as
  in common single-agent environments, and ``env.reset()`` resets the
  game and returns the first state, as in the sketch below.
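
A minimal single-agent-mode loop looks like the sketch below. It assumes
the state dictionary exposes ``legal_actions`` as in the Blackjack states
shown in Getting Started; the random policy is only a placeholder.

.. code:: python

    import numpy as np
    import rlcard

    # Single-agent mode: the other Leduc Hold'em player is simulated internally
    env = rlcard.make('leduc-holdem', config={'seed': 0, 'single_agent_mode': True})

    state = env.reset()
    done = False
    while not done:
        # Placeholder policy: pick a random legal action
        action = np.random.choice(state['legal_actions'])
        state, reward, done = env.step(action)
    print('Episode reward:', reward)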

Games
-----

Card games usually have similar structures. We abstract some concepts in
card games and follow the same design pattern. In this way,
users and developers can easily dig into the code and change the rules
for research purposes. Specifically, the following classes are used in
all the games:

- ``Game``: A game is defined as a complete sequence starting from one
  of the non-terminal states to a terminal state.
- ``Round``: A round is a part of the sequence of a game. Most card
  games can be naturally divided into multiple rounds.
- ``Dealer``: A dealer is responsible for shuffling and allocating a
  deck of cards.
- ``Judger``: A judger is responsible for making major decisions at the
  end of a round or a game.
- ``Player``: A player is a role who plays cards following a strategy.

To summarize, in one ``Game``, a ``Dealer`` deals the cards to each
``Player``. In each ``Round`` of the game, a ``Judger`` makes major
decisions about the next round and the payoffs at the end of the game,
as sketched below.
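
The skeleton below is purely illustrative (it is not the toolkit's actual
class code); it only sketches how the five concepts fit together.

.. code:: python

    # Illustrative skeleton only -- not the real classes in the toolkit.
    class Dealer:
        def shuffle(self): ...               # shuffle the deck
        def deal_cards(self, player): ...    # allocate cards to a player

    class Player:
        def play(self, legal_actions): ...   # choose an action following a strategy

    class Judger:
        def judge_round(self, rnd): ...      # major decisions at the end of a round
        def judge_game(self, players): ...   # payoffs at the end of the game

    class Round:
        def proceed(self, players): ...      # one part of the game sequence

    class Game:
        def run(self):
            ...  # deal cards, play rounds until a terminal state, settle payoffs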

Agents
------

We provide examples of several representative algorithms and wrap them
as ``Agent`` classes to show how a learning algorithm can be connected
to the toolkit. The first example is DQN, which represents the
Reinforcement Learning (RL) category. The second example is NFSP, which
represents RL with self-play. We also provide CFR and DeepCFR, which
belong to the Counterfactual Regret Minimization (CFR) category. Other
algorithms from these three categories can be connected in similar ways.
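
As a sketch of the connection point, any object that exposes the methods
used in Getting Started (``step`` and ``eval_step`` for acting, ``feed``
for receiving transitions) can be plugged in through ``set_agents``. The
class below is a hypothetical example, not one of the provided agents.

.. code:: python

    import numpy as np

    class MyRandomAgent:
        ''' Hypothetical agent: acts uniformly at random over the legal actions. '''

        def __init__(self, action_num):
            self.action_num = action_num

        def step(self, state):
            # Called when generating training trajectories
            return np.random.choice(state['legal_actions'])

        def eval_step(self, state):
            # Called during evaluation; returns (action, extra info)
            return self.step(state), {}

        def feed(self, ts):
            # Receives one (state, action, reward, next_state, done) transition;
            # a learning agent would store it and update its model here
            pass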

docs/source/tods.rst (+1, -1)

@@ -5,7 +5,7 @@ Subpackages
-----------

.. toctree::
-   :maxdepth: 4
+   :maxdepth: 2

    tods.data_processing
    tods.detection_algorithm


docs/source/tods.searcher.search.rst (+2, -2)

@@ -9,7 +9,7 @@ tods.searcher.search.brute\_force\_search module

.. automodule:: tods.searcher.search.brute_force_search
    :members:
-   :undoc-members:
+   :noindex:
    :show-inheritance:

Module contents
@@ -17,5 +17,5 @@ Module contents

.. automodule:: tods.searcher.search
    :members:
-   :undoc-members:
+   :noindex:
    :show-inheritance:
