deepword.agents package

Submodules

deepword.agents.base_agent module

class deepword.agents.base_agent.BaseAgent(hp: tensorflow.contrib.training.python.training.hparam.HParams, model_dir: str)

Bases: deepword.log.Logging

Base agent class that uses:
  1. action collector

  2. trajectory collector

  3. floor plan collector

  4. tree memory storage and sampling

__init__(hp: tensorflow.contrib.training.python.training.hparam.HParams, model_dir: str) → None

Initialize a base agent

Parameters
  • hp – hyper-parameters, refer to deepword.hparams

  • model_dir – path to model dir

act(obs: List[str], scores: List[int], dones: List[bool], infos: Dict[str, List[Any]]) → Optional[List[str]]

Acts upon the current list of observations. One text command must be returned for each observation.

Parameters
  • obs – observed texts for each game

  • scores – score obtained so far for each game

  • dones – whether a game is finished

  • infos – extra information requested from TextWorld

Returns

If all games are done, return None; otherwise return one action per game.

Notes

Commands returned for games marked as done have no effect. The states for finished games are simply copied over until all games are done.
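
As a rough illustration of this contract, the sketch below drives a batch of games until act() returns None. The env object and its reset()/step() signatures stand in for a TextWorld-style batched environment and are assumptions of this sketch, not part of this package.

from typing import List, Optional

def run_episodes(agent, env) -> None:
    # `agent` is a BaseAgent subclass instance; `env` is assumed to expose
    # TextWorld-style batched reset()/step() methods (an assumption here).
    obs, infos = env.reset()                        # one text observation per game
    scores: List[int] = [0] * len(obs)
    dones: List[bool] = [False] * len(obs)
    while True:
        commands: Optional[List[str]] = agent.act(obs, scores, dones, infos)
        if commands is None:                        # all games are done
            break
        obs, scores, dones, infos = env.step(commands)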

eval(load_best=True) → None

Call eval() before performing evaluation.

Parameters

load_best – load from the best weights, otherwise from last weights

property negative_scores

Total negative scores

property positive_scores

Total positive scores earned

reset(restore_from: Optional[str] = None) → None

reset is only used for evaluation during training; do not use it anywhere else.

Parameters

restore_from – where to restore the model, None goes to default

save_snapshot() → None
classmethod select_additional_infos() → textworld.core.EnvInfos

Additional information needed when playing the game. The infos requested here are required to run the agent.

train() → None

Call train() before performing training.

deepword.agents.competition_agent module

class deepword.agents.competition_agent.CompetitionAgent(hp, model_dir)

Bases: deepword.agents.base_agent.BaseAgent

The agent built for participating in the TextWorld competition. Includes action filtering and rule-based policies.

deepword.agents.cores module

class deepword.agents.cores.BaseCore(hp: tensorflow.contrib.training.python.training.hparam.HParams, model_dir: str)

Bases: deepword.log.Logging, abc.ABC

Core: used by agents to compute the policy. Core objects are isolated from games and gaming platforms. They work with agents, receiving trajectories and actions, and then compute a policy for the agents.

How to obtain trajectories and actions, and how to choose actions given a policy, is decided by the agents.

__init__(hp: tensorflow.contrib.training.python.training.hparam.HParams, model_dir: str) → None

Initialize a core for an agent.

Parameters
  • hp – hyper-parameters, see deepword.hparams

  • model_dir – path to save or load model

create_or_reload_target_model(restore_from: Optional[str] = None) → None

Create (if it does not exist) or reload weights for the target model

Parameters

restore_from – the path to restore weights

init(is_training: bool, load_best: bool = False, restore_from: Optional[str] = None) → None

Initialize models of the core.

Parameters
  • is_training – training or evaluation

  • load_best – load from best weights, otherwise last weights

  • restore_from – path to restore

policy(trajectory: List[deepword.agents.utils.ActionMaster], state: Optional[deepword.agents.utils.ObsInventory], action_matrix: numpy.ndarray, action_len: numpy.ndarray, action_mask: numpy.ndarray) → numpy.ndarray

Infer from policy.

Parameters
  • trajectory – a list of ActionMaster

  • state – the current game state of observation + inventory

  • action_matrix – a matrix of all actions for the game, 2D array, each row represents a tokenized and indexed action.

  • action_len – 1D array, length for each action.

  • action_mask – 1D array, indices of admissible actions from all actions of the game.

Returns

Q-values for actions in the action_matrix
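
The returned Q-vector is aligned with the rows of action_matrix, so an agent typically restricts its argmax to the indices listed in action_mask. A minimal numpy sketch of that selection step (the helper name is made up for illustration):

import numpy as np

def best_admissible_action(q_actions: np.ndarray, action_mask: np.ndarray) -> int:
    # Keep only Q-values of admissible actions, then map the winner back
    # to its index in the full action_matrix.
    masked_q = q_actions[action_mask]
    return int(action_mask[np.argmax(masked_q)])

# e.g. best_admissible_action(np.array([0.1, 0.9, 0.3, 0.7]), np.array([0, 2, 3])) -> 3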

save_model(t: Optional[int] = None) → None

Save current model with training steps

Parameters

t – training steps, None falls back to default global steps

train_one_batch(pre_trajectories: List[List[deepword.agents.utils.ActionMaster]], post_trajectories: List[List[deepword.agents.utils.ActionMaster]], pre_states: Optional[List[deepword.agents.utils.ObsInventory]], post_states: Optional[List[deepword.agents.utils.ObsInventory]], action_matrix: List[numpy.ndarray], action_len: List[numpy.ndarray], pre_action_mask: List[numpy.ndarray], post_action_mask: List[numpy.ndarray], dones: List[bool], rewards: List[float], action_idx: List[int], b_weight: numpy.ndarray, step: int, others: Any) → numpy.ndarray

Train the core with one batch of data.

Parameters
  • pre_trajectories – previous trajectories

  • post_trajectories – post trajectories

  • pre_states – previous states

  • post_states – post states

  • action_matrix – all actions for each of previous trajectories

  • action_len – length of actions

  • pre_action_mask – action masks for each of previous trajectories

  • post_action_mask – action masks for each of post trajectories

  • dones – game terminated or not for post trajectories

  • rewards – rewards received for reaching post trajectories

  • action_idx – actions used for reaching post trajectories

  • b_weight – 1D array, weight for each data point

  • step – current training step

  • others – other information passed for training purpose

Returns

Absolute loss between expected Q-value and predicted Q-value for each data point
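
For orientation, the expected Q-value here is typically a one-step TD target. The snippet below is a generic DQN-style sketch of how such a target and the per-datapoint absolute loss could be formed; it is illustrative only and not the exact computation of any particular core.

import numpy as np

def abs_td_loss(q_pred: np.ndarray, q_next_max: np.ndarray,
                rewards: np.ndarray, dones: np.ndarray,
                gamma: float = 0.9) -> np.ndarray:
    # expected Q = r + gamma * max_a' Q_target(s', a'), cut off at terminal steps
    expected_q = rewards + gamma * q_next_max * (1.0 - dones.astype(np.float32))
    return np.abs(expected_q - q_pred)   # absolute loss per data point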

class deepword.agents.cores.DQNCore(hp, model_dir)

Bases: deepword.agents.cores.TFCore

DQNAgent that treats actions as types

policy(trajectory: List[deepword.agents.utils.ActionMaster], state: Optional[deepword.agents.utils.ObsInventory], action_matrix: numpy.ndarray, action_len: numpy.ndarray, action_mask: numpy.ndarray) → numpy.ndarray

Get either a random action index with its action string, or the best predicted action index with its action string.

train_one_batch(pre_trajectories: List[List[deepword.agents.utils.ActionMaster]], post_trajectories: List[List[deepword.agents.utils.ActionMaster]], pre_states: Optional[List[deepword.agents.utils.ObsInventory]], post_states: Optional[List[deepword.agents.utils.ObsInventory]], action_matrix: List[numpy.ndarray], action_len: List[numpy.ndarray], pre_action_mask: List[numpy.ndarray], post_action_mask: List[numpy.ndarray], dones: List[bool], rewards: List[float], action_idx: List[int], b_weight: numpy.ndarray, step: int, others: Any) → numpy.ndarray

Train the core with one batch of data.

Parameters
  • pre_trajectories – previous trajectories

  • post_trajectories – post trajectories

  • pre_states – previous states

  • post_states – post states

  • action_matrix – all actions for each of previous trajectories

  • action_len – length of actions

  • pre_action_mask – action masks for each of previous trajectories

  • post_action_mask – action masks for each of post trajectories

  • dones – game terminated or not for post trajectories

  • rewards – rewards received for reaching post trajectories

  • action_idx – actions used for reaching post trajectories

  • b_weight – 1D array, weight for each data point

  • step – current training step

  • others – other information passed for training purpose

Returns

Absolute loss between expected Q-value and predicted Q-value for each data point

class deepword.agents.cores.DRRNCore(hp, model_dir)

Bases: deepword.agents.cores.TFCore

DRRN agent that treats actions as meaningful sentences

policy(trajectory: List[deepword.agents.utils.ActionMaster], state: Optional[deepword.agents.utils.ObsInventory], action_matrix: numpy.ndarray, action_len: numpy.ndarray, action_mask: numpy.ndarray) → numpy.ndarray

Get either a random action index with its action string, or the best predicted action index with its action string.

train_one_batch(pre_trajectories: List[List[deepword.agents.utils.ActionMaster]], post_trajectories: List[List[deepword.agents.utils.ActionMaster]], pre_states: Optional[List[deepword.agents.utils.ObsInventory]], post_states: Optional[List[deepword.agents.utils.ObsInventory]], action_matrix: List[numpy.ndarray], action_len: List[numpy.ndarray], pre_action_mask: List[numpy.ndarray], post_action_mask: List[numpy.ndarray], dones: List[bool], rewards: List[float], action_idx: List[int], b_weight: numpy.ndarray, step: int, others: Any) → numpy.ndarray

Train the core with one batch of data.

Parameters
  • pre_trajectories – previous trajectories

  • post_trajectories – post trajectories

  • pre_states – previous states

  • post_states – post states

  • action_matrix – all actions for each of previous trajectories

  • action_len – length of actions

  • pre_action_mask – action masks for each of previous trajectories

  • post_action_mask – action masks for each of post trajectories

  • dones – game terminated or not for post trajectories

  • rewards – rewards received for reaching post trajectories

  • action_idx – actions used for reaching post trajectories

  • b_weight – 1D array, weight for each data point

  • step – current training step

  • others – other information passed for training purpose

Returns

Absolute loss between expected Q-value and predicted Q-value for each data point

class deepword.agents.cores.DSQNCore(hp, model_dir)

Bases: deepword.agents.cores.DRRNCore

eval_snn(snn_data: Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray], batch_size: int = 32) → float
train_one_batch(pre_trajectories: List[List[deepword.agents.utils.ActionMaster]], post_trajectories: List[List[deepword.agents.utils.ActionMaster]], pre_states: Optional[List[deepword.agents.utils.ObsInventory]], post_states: Optional[List[deepword.agents.utils.ObsInventory]], action_matrix: List[numpy.ndarray], action_len: List[numpy.ndarray], pre_action_mask: List[numpy.ndarray], post_action_mask: List[numpy.ndarray], dones: List[bool], rewards: List[float], action_idx: List[int], b_weight: numpy.ndarray, step: int, others: Any) → numpy.ndarray

Train the core with one batch of data.

Parameters
  • pre_trajectories – previous trajectories

  • post_trajectories – post trajectories

  • pre_states – previous states

  • post_states – post states

  • action_matrix – all actions for each of previous trajectories

  • action_len – length of actions

  • pre_action_mask – action masks for each of previous trajectories

  • post_action_mask – action masks for each of post trajectories

  • dones – game terminated or not for post trajectories

  • rewards – rewards received for reaching post trajectories

  • action_idx – actions used for reaching post trajectories

  • b_weight – 1D array, weight for each data point

  • step – current training step

  • others – other information passed for training purpose

Returns

Absolute loss between expected Q-value and predicted Q-value for each data point

class deepword.agents.cores.DSQNZorkCore(hp, model_dir)

Bases: deepword.agents.cores.DQNCore

eval_snn(snn_data: Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray], batch_size: int = 32) → float
train_one_batch(pre_trajectories: List[List[deepword.agents.utils.ActionMaster]], post_trajectories: List[List[deepword.agents.utils.ActionMaster]], pre_states: Optional[List[deepword.agents.utils.ObsInventory]], post_states: Optional[List[deepword.agents.utils.ObsInventory]], action_matrix: List[numpy.ndarray], action_len: List[numpy.ndarray], pre_action_mask: List[numpy.ndarray], post_action_mask: List[numpy.ndarray], dones: List[bool], rewards: List[float], action_idx: List[int], b_weight: numpy.ndarray, step: int, others: Any) → numpy.ndarray

Train the core with one batch of data.

Parameters
  • pre_trajectories – previous trajectories

  • post_trajectories – post trajectories

  • pre_states – previous states

  • post_states – post states

  • action_matrix – all actions for each of previous trajectories

  • action_len – length of actions

  • pre_action_mask – action masks for each of previous trajectories

  • post_action_mask – action masks for each of post trajectories

  • dones – game terminated or not for post trajectories

  • rewards – rewards received for reaching post trajectories

  • action_idx – actions used for reaching post trajectories

  • b_weight – 1D array, weight for each data point

  • step – current training step

  • others – other information passed for training purpose

Returns

Absolute loss between expected Q-value and predicted Q-value for each data point

class deepword.agents.cores.GenDQNCore(hp, model_dir)

Bases: deepword.agents.cores.TFCore

decode_action(trajectory: List[deepword.agents.utils.ActionMaster]) → deepword.agents.utils.GenSummary
policy(trajectory: List[deepword.agents.utils.ActionMaster], state: Optional[deepword.agents.utils.ObsInventory], action_matrix: numpy.ndarray, action_len: numpy.ndarray, action_mask: numpy.ndarray) → numpy.ndarray

Infer from policy.

Parameters
  • trajectory – a list of ActionMaster

  • state – the current game state of observation + inventory

  • action_matrix – a matrix of all actions for the game, 2D array, each row represents a tokenized and indexed action.

  • action_len – 1D array, length for each action.

  • action_mask – 1D array, indices of admissible actions from all actions of the game.

Returns

Q-values for actions in the action_matrix

summary(token_idx: numpy.ndarray, col_eos_idx: numpy.ndarray, p_gen: numpy.ndarray, sum_logits: numpy.ndarray) → List[deepword.agents.utils.GenSummary]
train_one_batch(pre_trajectories: List[List[deepword.agents.utils.ActionMaster]], post_trajectories: List[List[deepword.agents.utils.ActionMaster]], pre_states: Optional[List[deepword.agents.utils.ObsInventory]], post_states: Optional[List[deepword.agents.utils.ObsInventory]], action_matrix: List[numpy.ndarray], action_len: List[numpy.ndarray], pre_action_mask: List[numpy.ndarray], post_action_mask: List[numpy.ndarray], dones: List[bool], rewards: List[float], action_idx: List[int], b_weight: numpy.ndarray, step: int, others: Any) → numpy.ndarray

Train the core with one batch of data.

Parameters
  • pre_trajectories – previous trajectories

  • post_trajectories – post trajectories

  • pre_states – previous states

  • post_states – post states

  • action_matrix – all actions for each of previous trajectories

  • action_len – length of actions

  • pre_action_mask – action masks for each of previous trajectories

  • post_action_mask – action masks for each of post trajectories

  • dones – game terminated or not for post trajectories

  • rewards – rewards received for reaching post trajectories

  • action_idx – actions used for reaching post trajectories

  • b_weight – 1D array, weight for each data point

  • step – current training step

  • others – other information passed for training purpose

Returns

Absolute loss between expected Q-value and predicted Q-value for each data point

class deepword.agents.cores.NLUCore(hp, model_dir)

Bases: deepword.agents.cores.TFCore

The agent that explores the commonsense ability of BERT models. This agent combines each trajectory with all its actions, separated by [SEP] in the middle, then feeds the sentence into BERT to get a score from the [CLS] token. Refer to https://arxiv.org/pdf/1810.04805.pdf for fine-tuning and evaluation.

policy(trajectory: List[deepword.agents.utils.ActionMaster], state: Optional[deepword.agents.utils.ObsInventory], action_matrix: numpy.ndarray, action_len: numpy.ndarray, action_mask: numpy.ndarray) → numpy.ndarray

Infer from policy.

Parameters
  • trajectory – a list of ActionMaster

  • state – the current game state of observation + inventory

  • action_matrix – a matrix of all actions for the game, 2D array, each row represents a tokenized and indexed action.

  • action_len – 1D array, length for each action.

  • action_mask – 1D array, indices of admissible actions from all actions of the game.

Returns

Q-values for actions in the action_matrix

train_one_batch(pre_trajectories: List[List[deepword.agents.utils.ActionMaster]], post_trajectories: List[List[deepword.agents.utils.ActionMaster]], pre_states: Optional[List[deepword.agents.utils.ObsInventory]], post_states: Optional[List[deepword.agents.utils.ObsInventory]], action_matrix: List[numpy.ndarray], action_len: List[numpy.ndarray], pre_action_mask: List[numpy.ndarray], post_action_mask: List[numpy.ndarray], dones: List[bool], rewards: List[float], action_idx: List[int], b_weight: numpy.ndarray, step: int, others: Any) → numpy.ndarray

Train the core with one batch of data.

Parameters
  • pre_trajectories – previous trajectories

  • post_trajectories – post trajectories

  • pre_states – previous states

  • post_states – post states

  • action_matrix – all actions for each of previous trajectories

  • action_len – length of actions

  • pre_action_mask – action masks for each of previous trajectories

  • post_action_mask – action masks for each of post trajectories

  • dones – game terminated or not for post trajectories

  • rewards – rewards received for reaching post trajectories

  • action_idx – actions used for reaching post trajectories

  • b_weight – 1D array, weight for each data point

  • step – current training step

  • others – other information passed for training purpose

Returns

Absolute loss between expected Q-value and predicted Q-value for each data point

class deepword.agents.cores.PGNCore(hp, model_dir)

Bases: deepword.agents.cores.TFCore

Generate admissible actions for games, given only trajectory

decode(trajectory: List[deepword.agents.utils.ActionMaster], beam_size: int, temperature: float, use_greedy: bool) → List[deepword.agents.utils.GenSummary]
generate_admissible_actions(trajectory: List[deepword.agents.utils.ActionMaster]) → List[str]
policy(trajectory: List[deepword.agents.utils.ActionMaster], state: Optional[deepword.agents.utils.ObsInventory], action_matrix: numpy.ndarray, action_len: numpy.ndarray, action_mask: numpy.ndarray) → numpy.ndarray

Infer from policy.

Parameters
  • trajectory – a list of ActionMaster

  • state – the current game state of observation + inventory

  • action_matrix – a matrix of all actions for the game, 2D array, each row represents a tokenized and indexed action.

  • action_len – 1D array, length for each action.

  • action_mask – 1D array, indices of admissible actions from all actions of the game.

Returns

Q-values for actions in the action_matrix

summary(action_idx: numpy.ndarray, col_eos_idx: numpy.ndarray, decoded_logits: numpy.ndarray, p_gen: numpy.ndarray, beam_size: int) → List[deepword.agents.utils.GenSummary]

Return [ids, tokens, generation probabilities of each token, q_action], sorted by q_action (descending). q_action is the average of the decoded logits of the selected tokens.

train_one_batch(pre_trajectories: List[List[deepword.agents.utils.ActionMaster]], post_trajectories: List[List[deepword.agents.utils.ActionMaster]], pre_states: Optional[List[deepword.agents.utils.ObsInventory]], post_states: Optional[List[deepword.agents.utils.ObsInventory]], action_matrix: List[numpy.ndarray], action_len: List[numpy.ndarray], pre_action_mask: List[numpy.ndarray], post_action_mask: List[numpy.ndarray], dones: List[bool], rewards: List[float], action_idx: List[int], b_weight: numpy.ndarray, step: int, others: Any) → numpy.ndarray

Train the core with one batch of data.

Parameters
  • pre_trajectories – previous trajectories

  • post_trajectories – post trajectories

  • pre_states – previous states

  • post_states – post states

  • action_matrix – all actions for each of previous trajectories

  • action_len – length of actions

  • pre_action_mask – action masks for each of previous trajectories

  • post_action_mask – action masks for each of post trajectories

  • dones – game terminated or not for post trajectories

  • rewards – rewards received for reaching post trajectories

  • action_idx – actions used for reaching post trajectories

  • b_weight – 1D array, weight for each data point

  • step – current training step

  • others – other information passed for training purpose

Returns

Absolute loss between expected Q-value and predicted Q-value for each data point

class deepword.agents.cores.TFCore(hp: tensorflow.contrib.training.python.training.hparam.HParams, model_dir: str)

Bases: deepword.agents.cores.BaseCore, abc.ABC

Agent core implemented with TensorFlow.

__init__(hp: tensorflow.contrib.training.python.training.hparam.HParams, model_dir: str) → None
Parameters
  • hp – hyper-parameters

  • model_dir – path to model dir

batch_trajectory2input(trajectories: List[List[deepword.agents.utils.ActionMaster]]) → Tuple[List[List[int]], List[int]]

Generate a batch of src and src_len from trajectories, trimmed to hp.num_tokens.

see deepword.agents.cores.TFCore.trajectory2input()

Parameters

trajectories – a batch of trajectories

Returns

a batch of src, and a batch of src_len

create_or_reload_target_model(restore_from: Optional[str] = None) → None

Create the target model if it does not exist, then load weights from the most recently saved checkpoint.

Parameters

restore_from – path to load target model, None falls back to default.

init(is_training: bool, load_best: bool = False, restore_from: Optional[str] = None) → None

Initialize the core.

  1. create the model

  2. load the model if there are saved models

  3. create target model for training

Parameters
  • is_training – True for training, False for evaluation

  • load_best – load best model, otherwise load last weights

  • restore_from – specify the load path; if given, load_best will be disabled

safe_loading(model: deepword.models.models.DQNModel, sess: tensorflow.python.client.session.Session, saver: tensorflow.python.training.saver.Saver, restore_from: str) → int

Load weights from restore_from into model. If the weights in the loaded model are incompatible with the current model, try to load only those weights that have the same name.

This method is useful when the saved model lacks the training part, e.g. the Adam optimizer.

Parameters
  • model – A tensorflow model

  • sess – A tensorflow session

  • saver – A tensorflow saver

  • restore_from – the path to restore the model

Returns

training steps

save_best_model() → None

Save current model to the best weights dir

save_model(t: Optional[int] = None) → None

Save model to model_dir with the number of training steps.

Parameters

t – number of training steps, None falls back to global step

set_d4eval(device: str) → None

Set the device for evaluation, e.g. “/device:CPU:0” or “/device:GPU:1”; otherwise, a default device allocation will be used.

Parameters

device – device name

trajectory2input(trajectory: List[deepword.agents.utils.ActionMaster]) → Tuple[List[int], int]

Generate src and src_len from a trajectory, trimmed to hp.num_tokens.

Parameters

trajectory – List of ActionMaster

Returns

src – the source token indices

src_len – the length of src

class deepword.agents.cores.TabularCore(hp, model_dir)

Bases: deepword.agents.cores.BaseCore

Tabular DQN agent that uses a matrix to store Q-vectors and uses hashed values of observation + inventory as game states.
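
For intuition, a hashed state selects one row of Q-values, one entry per action, and training applies a standard tabular Q-learning update to that row. The sketch below shows such an update; the dictionary layout, learning rate, and discount factor are assumptions for illustration, not the core's actual implementation.

import numpy as np

def tabular_q_update(q_table: dict, state_hash: str, next_state_hash: str,
                     action_idx: int, reward: float, done: bool,
                     n_actions: int, alpha: float = 0.1, gamma: float = 0.9) -> None:
    # q_table maps a state hash to a 1D Q-vector over all actions.
    for h in (state_hash, next_state_hash):
        q_table.setdefault(h, np.zeros(n_actions))
    target = reward if done else reward + gamma * np.max(q_table[next_state_hash])
    q_table[state_hash][action_idx] += alpha * (target - q_table[state_hash][action_idx])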

create_or_reload_target_model(restore_from: Optional[str] = None) → None

Create (if it does not exist) or reload weights for the target model

Parameters

restore_from – the path to restore weights

get_state_hash(state: deepword.agents.utils.ObsInventory) → str
init(is_training: bool, load_best: bool = False, restore_from: Optional[str] = None) → None

Initialize models of the core.

Parameters
  • is_training – training or evaluation

  • load_best – load from best weights, otherwise last weights

  • restore_from – path to restore

policy(trajectory: List[deepword.agents.utils.ActionMaster], state: Optional[deepword.agents.utils.ObsInventory], action_matrix: numpy.ndarray, action_len: numpy.ndarray, action_mask: numpy.ndarray) → numpy.ndarray

Infer from policy.

Parameters
  • trajectory – a list of ActionMaster

  • state – the current game state of observation + inventory

  • action_matrix – a matrix of all actions for the game, 2D array, each row represents a tokenized and indexed action.

  • action_len – 1D array, length for each action.

  • action_mask – 1D array, indices of admissible actions from all actions of the game.

Returns

Q-values for actions in the action_matrix

save_model(t: Optional[int] = None) → None

Save current model with training steps

Parameters

t – training steps, None falls back to default global steps

train_one_batch(pre_trajectories: List[List[deepword.agents.utils.ActionMaster]], post_trajectories: List[List[deepword.agents.utils.ActionMaster]], pre_states: Optional[List[deepword.agents.utils.ObsInventory]], post_states: Optional[List[deepword.agents.utils.ObsInventory]], action_matrix: List[numpy.ndarray], action_len: List[numpy.ndarray], pre_action_mask: List[numpy.ndarray], post_action_mask: List[numpy.ndarray], dones: List[bool], rewards: List[float], action_idx: List[int], b_weight: numpy.ndarray, step: int, others: Any) → numpy.ndarray

Train the core with one batch of data.

Parameters
  • pre_trajectories – previous trajectories

  • post_trajectories – post trajectories

  • pre_states – previous states

  • post_states – post states

  • action_matrix – all actions for each of previous trajectories

  • action_len – length of actions

  • pre_action_mask – action masks for each of previous trajectories

  • post_action_mask – action masks for each of post trajectories

  • dones – game terminated or not for post trajectories

  • rewards – rewards received for reaching post trajectories

  • action_idx – actions used for reaching post trajectories

  • b_weight – 1D array, weight for each data point

  • step – current training step

  • others – other information passed for training purpose

Returns

Absolute loss between expected Q-value and predicted Q-value for each data point

deepword.agents.dsqn_agent module

class deepword.agents.dsqn_agent.DSQNAgent(hp, model_dir)

Bases: deepword.agents.base_agent.BaseAgent

BaseAgent with hs2tj: a mapping from hashed states to trajectories, used for SNN training.

get_snn_pairs(batch_size: int) → Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray]

Sample SNN pairs for training the SNN part.

Parameters

batch_size – how many data points to generate. Notice that batch_size * 2 data points will be generated, one half for trajectory pairs with the same states; the other half for trajectory pairs with different states.

Returns

src – trajectories

src_len – lengths of the trajectories

src2 – the paired trajectories

src2_len – lengths of the paired trajectories

labels – 0 for same states; 1 for different states
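
To make the pairing concrete, the sketch below samples same-state pairs (label 0) and different-state pairs (label 1) from a mapping of hashed states to trajectory keys. The mapping layout, helper name, and returned key tuples are assumptions for illustration; the real method returns padded trajectory arrays.

import random
from typing import Dict, List, Tuple

def sample_snn_pairs(hs2tj: Dict[str, List[int]], batch_size: int
                     ) -> List[Tuple[int, int, int]]:
    # Returns (key1, key2, label) triples: batch_size same-state pairs
    # plus batch_size different-state pairs.
    pairs = []
    multi = [h for h, keys in hs2tj.items() if len(keys) >= 2]
    for _ in range(batch_size):
        h = random.choice(multi)
        a, b = random.sample(hs2tj[h], 2)
        pairs.append((a, b, 0))                      # same hashed state
        h1, h2 = random.sample(list(hs2tj), 2)       # assumes >= 2 distinct states
        pairs.append((random.choice(hs2tj[h1]), random.choice(hs2tj[h2]), 1))
    return pairs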

save_train_pairs(t: int, src: numpy.ndarray, src_len: numpy.ndarray, src2: numpy.ndarray, src2_len: numpy.ndarray, labels: numpy.ndarray) → None

Save SNN pairs for verification.

Parameters
  • t – current training steps

  • src – trajectories

  • src_len – length of trajectories

  • src2 – paired trajectories

  • src2_len – length of paired trajectories

  • labels – 0 or 1 for same or different states

class deepword.agents.dsqn_agent.DSQNCompetitionAgent(hp, model_dir)

Bases: deepword.agents.dsqn_agent.DSQNAgent, deepword.agents.competition_agent.CompetitionAgent

class deepword.agents.dsqn_agent.DSQNZorkAgent(hp, model_dir)

Bases: deepword.agents.dsqn_agent.DSQNAgent, deepword.agents.zork_agent.ZorkAgent

deepword.agents.gen_agent module

class deepword.agents.gen_agent.GenDQNAgent(hp, model_dir)

Bases: deepword.agents.base_agent.BaseAgent

GenDQNAgent works with deepword.agents.cores.GenDQNCore.

deepword.agents.gen_drrn_agent module

class deepword.agents.gen_drrn_agent.GenCompetitionDRRNAgent(hp, model_dir)

Bases: deepword.agents.competition_agent.CompetitionAgent

class deepword.agents.gen_drrn_agent.GenDRRNAgent(hp, model_dir)

Bases: deepword.agents.base_agent.BaseAgent

We generate admissible actions at every step, and then use DRRN to choose the best action to play.

This agent can be compared with the previous template-gen agent.

deepword.agents.utils module

class deepword.agents.utils.ActType(rnd, rule, rnd_walk, policy_drrn, policy_gen, jitter, policy_tbl)

Bases: deepword.agents.utils.ActType

class deepword.agents.utils.ActionDesc(action_type, action_idx, token_idx, action_len, action, q_actions)

Bases: deepword.agents.utils.ActionDesc

class deepword.agents.utils.ActionMaster(action_ids: List[int], master_ids: List[int], action: str, master: str)

Bases: object

property action
property action_ids
property ids
property lens
property master
property master_ids
class deepword.agents.utils.CommonActs(examine_cookbook, prepare_meal, eat_meal, look, inventory, gn, gs, ge, gw)

Bases: deepword.agents.utils.CommonActs

class deepword.agents.utils.EnvInfosKey(recipe, desc, inventory, max_score, won, lost, actions, templates, verbs, entities)

Bases: deepword.agents.utils.KeyInfo

class deepword.agents.utils.GenSummary(ids, tokens, gens, q_action, len)

Bases: deepword.agents.utils.GenSummary

class deepword.agents.utils.LinearDecayedEPS(decay_step, init_eps=1, final_eps=0)

Bases: deepword.agents.utils.ScheduledEPS

eps(t)
class deepword.agents.utils.Memolet(tid, sid, gid, aid, token_id, a_len, a_type, reward, is_terminal, action_mask, sys_action_mask, next_action_mask, next_sys_action_mask, q_actions)

Bases: deepword.agents.utils.Memolet

end_of_episode: the game stops by 1) winning, 2) losing, or 3) exceeding the maximum number of steps.

is_terminal: whether the current step reaches a terminal game state by winning or losing. is_terminal = True means that, for the current step, the Q-value equals the instant reward.

TODO: Notice that end_of_episode doesn’t imply is_terminal. Only winning or losing means is_terminal = True.

class deepword.agents.utils.ObsInventory(obs, inventory, sid, hs)

Bases: deepword.agents.utils.ObsInventory

class deepword.agents.utils.ScannerDecayEPS(decay_step, decay_range, next_init_eps_rate=0.8, init_eps=1, final_eps=0)

Bases: deepword.agents.utils.ScheduledEPS

eps(t)
class deepword.agents.utils.ScheduledEPS(name: Optional[str] = None)

Bases: deepword.log.Logging

eps(t)
deepword.agents.utils.batch_drrn_action_input(action_matrices: List[numpy.ndarray], action_lens: List[numpy.ndarray], action_masks: List[numpy.ndarray]) → Tuple[numpy.ndarray, numpy.ndarray, List[int], List[Dict[int, int]]]

Select actions from action_masks in a batch

see deepword.agents.utils.drrn_action_input()

deepword.agents.utils.bert_commonsense_input(action_matrix: numpy.ndarray, action_len: numpy.ndarray, trajectory: List[int], trajectory_len: int, sep_val_id: int, cls_val_id: int, num_tokens: int) → Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]

Given one trajectory and its admissible actions, create a training batch of inputs for BERT.

Notice: trajectory_len and action_len need to be chosen so that positions for the special tokens, e.g. [CLS] and [SEP], are reserved.

E.g. input: [1, 2, 3], action_matrix: [[1, 3], [2, PAD], [4, PAD]], and suppose we need the length to be 10.

output:

[[CLS, 1, 2, 3, SEP, 1, 3, SEP, PAD, PAD, PAD],
 [CLS, 1, 2, 3, SEP, 2, SEP, PAD, PAD, PAD, PAD],
 [CLS, 1, 2, 3, SEP, 4, SEP, PAD, PAD, PAD, PAD]]

segment ids of trajectory and actions:

[[0, 0, 0, 0, 0, 1, 1, 1],
 [0, 0, 0, 0, 0, 1, 1, 0],
 [0, 0, 0, 0, 0, 1, 1, 0]]

input size: [8, 7, 7]

Returns

trajectory + action; segmentation ids; sizes
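
To make the layout concrete, here is a hypothetical single-row assembly following the example above; the helper name, argument order, and the token ids in the usage comment are assumptions, not the package API.

import numpy as np

def assemble_bert_row(traj, action, cls_id, sep_id, pad_id, num_tokens):
    # [CLS] trajectory [SEP] action [SEP], padded to num_tokens;
    # segment id 0 covers the trajectory part, 1 covers the action part.
    tokens = [cls_id] + list(traj) + [sep_id] + list(action) + [sep_id]
    segments = [0] * (len(traj) + 2) + [1] * (len(action) + 1)
    size = len(tokens)
    tokens += [pad_id] * (num_tokens - size)
    segments += [0] * (num_tokens - size)
    return np.asarray(tokens), np.asarray(segments), size

# e.g. assemble_bert_row([1, 2, 3], [1, 3], cls_id=101, sep_id=102, pad_id=0, num_tokens=11)
# gives tokens [101, 1, 2, 3, 102, 1, 3, 102, 0, 0, 0],
# segments [0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0], and size 8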

deepword.agents.utils.categorical_without_replacement(logits, k=1)

Courtesy of https://github.com/tensorflow/tensorflow/issues/9260#issuecomment-437875125, also cited here:

@misc{vieira2014gumbel,
  title = {Gumbel-max trick and weighted reservoir sampling},
  author = {Tim Vieira},
  url = {http://timvieira.github.io/blog/post/2014/08/01/gumbel-max-trick-and-weighted-reservoir-sampling/},
  year = {2014}
}

Notice that the logits represent unnormalized log probabilities; per the citation above, there is no need to normalize them before adding the Gumbel random variate, which surprises me, since I thought it should be logits - tf.reduce_logsumexp(logits) + z.
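
For reference, the Gumbel-max trick samples k items without replacement by adding independent Gumbel noise to the unnormalized log probabilities and taking the top-k. A small numpy illustration of the idea (not the TensorFlow implementation used here):

import numpy as np

def gumbel_top_k(logits: np.ndarray, k: int = 1) -> np.ndarray:
    # Sample k distinct indices with probability proportional to softmax(logits).
    gumbel = -np.log(-np.log(np.random.uniform(size=logits.shape)))
    return np.argsort(logits + gumbel)[::-1][:k]   # indices of the k largest noisy logits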

deepword.agents.utils.drrn_action_input(action_matrix: numpy.ndarray, action_len: numpy.ndarray, action_mask: numpy.ndarray) → Tuple[numpy.ndarray, numpy.ndarray, int, Dict[int, int]]

Select actions from action_mask.

Parameters
  • action_matrix – action matrix for a game

  • action_len – lengths for actions in the action_matrix

  • action_mask – list of indices of selected actions

Returns

selected action matrix, selected action len, number of actions selected, and the mapping from real ID to mask ID.

real ID: the action index in the original action_matrix

mask ID: the action index in the action_mask

Examples

>>> a_mat = np.asarray([
>>>     [1, 2, 3, 4, 0],
>>>     [2, 2, 1, 3, 1],
>>>     [3, 1, 0, 0, 0],
>>>     [6, 9, 9, 1, 0]])
>>> a_len = np.asarray([4, 5, 2, 4])
>>> a_mask = np.asarray([1, 3])
>>> drrn_action_input(a_mat, a_len, a_mask)
[[2, 2, 1, 3, 1], [6, 9, 9, 1, 0]]
[5, 4]
2
{1: 0, 3: 1}
deepword.agents.utils.get_action_idx_pair(action_matrix: numpy.ndarray, action_len: numpy.ndarray, sos_id: int, eos_id: int) → Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]

Create action index pair for seq2seq training. Given action index, e.g. [1, 2, 3, 4, pad, pad, pad, pad], with 0 as sos_id, and -1 as eos_id, we create training pair: [0, 1, 2, 3, 4, pad, pad, pad] as the input sentence, and [1, 2, 3, 4, -1, pad, pad, pad] as the output sentence.

Notice that we remove the final pad to keep the action length unchanged. Notice also that pad should be indexed as 0.

Parameters
  • action_matrix – np array of action indices of shape N * K: there are N actions, each of length K (with paddings).

  • action_len – length of each action (remove paddings).

  • sos_id

  • eos_id

Returns

action index as input, action index as output, new action len
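
As a concrete illustration of this pairing, below is a hypothetical numpy re-implementation; it assumes pad == 0 and that every action keeps at least one trailing pad, as the notes above require.

import numpy as np

def action_idx_pair(action_matrix: np.ndarray, action_len: np.ndarray,
                    sos_id: int, eos_id: int):
    # input: shift right and prepend sos; output: place eos right after each action.
    n = action_matrix.shape[0]
    inp = np.concatenate([np.full((n, 1), sos_id), action_matrix[:, :-1]], axis=1)
    out = action_matrix.copy()
    out[np.arange(n), action_len] = eos_id
    return inp, out, action_len + 1

# e.g. action_idx_pair(np.array([[1, 2, 3, 4, 0, 0, 0, 0]]), np.array([4]), 0, -1)
# -> ([[0, 1, 2, 3, 4, 0, 0, 0]], [[1, 2, 3, 4, -1, 0, 0, 0]], [5])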

deepword.agents.utils.get_best_1d_q(q_actions: numpy.ndarray) → Tuple[int, float]

Find the best Q-value given a 1D Q-vector

Parameters

q_actions – a vector of Q-values

Returns

best action index, Q-value

Examples

>>> q_vec = np.asarray([0.1, 0.2, 0.3, 0.4])
>>> get_best_1d_q(q_vec)
3, 0.4
deepword.agents.utils.get_best_batch_ids(q_actions: numpy.ndarray, actions_repeats: List[int]) → List[int]
Get a batch of best action indices of Q-values, one for each group defined by actions_repeats.

Parameters
  • q_actions – a 1D Q-vector

  • actions_repeats – groups of number of actions, indicating how many elements are in the same group.

Returns

best action index for each group

Examples

>>> q_vec = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> repeats = [3, 4, 3]
>>> #Q-vector splits into three groups containing 3, 4, 3 Q-values
>>> # shaded_qs = [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]]
>>> get_best_batch_ids(np.asarray(q_vec), repeats)
[3, 7, 10]
deepword.agents.utils.get_hash_state(obs: str, inv: str) → str

Generate hash state from observation and inventory

Parameters
  • obs – observation of current step

  • inv – inventory of current step

Returns

hash state of current step

deepword.agents.utils.get_path_tags(path: str, prefix: str) → List[int]

Get tags from a path of saved objects. E.g. for actions-100.npz, 100 will be extracted. Make sure the items to be extracted are saved with the npz suffix.

Parameters
  • path – path to find files with prefix

  • prefix – prefix

Returns

list of all tags

Examples

>>> # suppose there are these files:
>>> # actions-99.npz, actions-100.npz, actions-200.npz
>>> get_path_tags("/path/to/data", "actions")
[99, 100, 200]
deepword.agents.utils.get_snn_keys(hash_states2tjs: Dict[str, Dict[int, List[int]]], tjs: deepword.trajectory.Trajectory, size: int) → Tuple[List[Tuple[int, int]], List[Tuple[int, int]], List[Tuple[int, int]]]

Get SNN training pairs from trajectories.

Parameters
  • hash_states2tjs – the mapping from hash state to trajectory

  • tjs – the trajectories

  • size – batch size

Returns

target_set, same_set, and diff_set; each set contains keys of (tid, sid) to locate a trajectory

deepword.agents.utils.id_real2batch(real_id: List[int], id_real2mask: List[Dict[int, int]], actions_repeats: List[int]) → List[int]

Transform real IDs to IDs in a batch

An explanation of the three ID systems for actions, depending on where the action appears.

In the action matrix of the game: real ID. E.g. for a game with three actions [“go east”, “go west”, “eat meal”], the real IDs are [0, 1, 2].

In the action mask for each step of game-playing: mask ID. E.g. when playing a step with admissible actions [“go east”, “eat meal”], the mask IDs are [0, 1], mapping to the real IDs [0, 2].

In a batch for training: batch ID. E.g. consider a batch of 2 entries, each from a different game, say game-1 and game-2.

Game-1, at the step of playing, has two actions, say [0, 2];

Game-2, at the step of playing, has three actions, say, [0, 4, 10].

Suppose the agent chooses action-0 from game-1 for entry-1 and action-4 from game-2 for entry-2. The real IDs are [0, 4]; however, the mask IDs are [0, 1].

Why does action-4 become action-1? Because for that step of game-2 there are only three actions [0, 4, 10], and action-4 is placed at position 1.

Converting mask IDs to batch IDs, we get [0, 3].

Why does action-1 become action-3? Because if we place the actions (mask IDs) of entry-1 and entry-2 together, we get [[0, 1], [0, 1, 2]]. The action list is then flattened into [0, 1, 0, 1, 2] and re-indexed as [0, 1, 2, 3, 4], so action-1 maps to action-3 for entry-2.

Parameters
  • real_id – action ids for each game in the original action_matrix of that game

  • id_real2mask – list of mappings from real IDs to mask IDs

  • actions_repeats – action sizes in each group

Returns

a list of batch IDs

Examples

>>> rids = [0, 4]
>>> id_maps = [{0: 0, 2: 1}, {0: 0, 4: 1, 10: 2}]
>>> repeats = [2, 3]
>>> id_real2batch(rids, id_maps, repeats)
[0, 3]
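
The mapping above boils down to “group offset plus mask ID”. A hypothetical re-implementation, matching the example:

import numpy as np

def real_to_batch(real_id, id_real2mask, actions_repeats):
    # batch ID = start index of the entry's group + the action's mask ID
    offsets = np.cumsum([0] + list(actions_repeats))[:-1]
    return [int(off) + mapping[rid]
            for rid, mapping, off in zip(real_id, id_real2mask, offsets)]

# real_to_batch([0, 4], [{0: 0, 2: 1}, {0: 0, 4: 1, 10: 2}], [2, 3]) -> [0, 3]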
deepword.agents.utils.remove_zork_version_info(text)
deepword.agents.utils.sample_batch_ids(q_actions: numpy.ndarray, actions_repeats: List[int], k: int) → List[int]

Get a batch of sampled action indices of Q-values. actions_repeats indicates how many elements are in the same group, e.g. with q_actions = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] and actions_repeats = [3, 4, 3], q_actions can be split into three groups: [1, 2, 3], [4, 5, 6, 7], [8, 9, 10].

We sample from the indices: the best index in each group becomes the first one of that group, and then another k - 1 elements are sampled for each group. If the number of elements in a group is smaller than k - 1, we sample with replacement.
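
A rough numpy sketch of the per-group sampling described above; the helper name and details are assumptions, not the library code.

import numpy as np

def sample_group_ids(q_actions: np.ndarray, actions_repeats, k: int):
    # Per group: the best index goes first, then k - 1 more sampled indices
    # (with replacement when the group has fewer than k - 1 elements).
    ids, start = [], 0
    for size in actions_repeats:
        group = np.arange(start, start + size)
        best = int(group[np.argmax(q_actions[group])])
        rest = np.random.choice(group, size=k - 1, replace=size < k - 1)
        ids.extend([best] + [int(i) for i in rest])
        start += size
    return ids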

deepword.agents.zork_agent module

class deepword.agents.zork_agent.ZorkAgent(hp, model_dir)

Bases: deepword.agents.base_agent.BaseAgent

The agent to run Zork.

TextWorld does not provide admissible actions for Zork as it does for the cooking games, so a loaded action file is required.

Module contents