deepword.agents package¶
Submodules¶
deepword.agents.base_agent module¶
-
class
deepword.agents.base_agent.BaseAgent(hp: tensorflow.contrib.training.python.training.hparam.HParams, model_dir: str)¶ Bases:
deepword.log.Logging
Base agent class that uses:
action collector
trajectory collector
floor plan collector
tree memory storage and sampling
-
__init__(hp: tensorflow.contrib.training.python.training.hparam.HParams, model_dir: str) → None¶ Initialize a base agent
- Parameters
hp – hyper-parameters, refer to
deepword.hparams
model_dir – path to model dir
-
act(obs: List[str], scores: List[int], dones: List[bool], infos: Dict[str, List[Any]]) → Optional[List[str]]¶ Acts upon the current list of observations. One text command must be returned for each observation.
- Parameters
obs – observed texts for each game
scores – score obtained so far for each game
dones – whether a game is finished
infos – extra information requested from TextWorld
- Returns
If all games are done, return None; otherwise return a list of actions.
Notes
Commands returned for games marked as done have no effect. The states for finished games are simply copied over until all games are done.
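A minimal game-loop sketch around act(), assuming a gym-style TextWorld batch environment env and an already constructed agent (both hypothetical here):
agent.train()                                    # or agent.eval(load_best=True) for evaluation
obs, infos = env.reset()                         # hypothetical batch environment
scores = [0] * len(obs)
dones = [False] * len(obs)
while not all(dones):
    commands = agent.act(obs, scores, dones, infos)   # one text command per game
    obs, scores, dones, infos = env.step(commands)
agent.act(obs, scores, dones, infos)             # returns None once every game is done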
-
eval(load_best=True) → None¶ Call eval() before performing evaluation.
- Parameters
load_best – load from the best weights, otherwise from last weights
-
property
negative_scores¶ Total negative scores
-
property
positive_scores¶ Total positive scores earned
-
reset(restore_from: Optional[str] = None) → None¶ reset is only used for evaluation during training; do not use it anywhere else.
- Parameters
restore_from – where to restore the model, None goes to default
-
save_snapshot() → None¶
-
classmethod
select_additional_infos() → textworld.core.EnvInfos¶ Additional information needed when playing the game. The infos requested here are required to run the agent.
-
train() → None¶ Call train() before performing training.
deepword.agents.competition_agent module¶
-
class
deepword.agents.competition_agent.CompetitionAgent(hp, model_dir)¶ Bases:
deepword.agents.base_agent.BaseAgent
The agent built for participating in the TextWorld competition. Includes action filtering and rule-based policies.
deepword.agents.cores module¶
-
class
deepword.agents.cores.BaseCore(hp: tensorflow.contrib.training.python.training.hparam.HParams, model_dir: str)¶ Bases:
deepword.log.Logging, abc.ABC
Core: used by agents to compute the policy. Core objects are isolated from games and gaming platforms. They work with agents, receiving trajectories and actions, and then compute a policy for the agents.
How to get trajectories and actions, and how to choose actions given a policy, is decided by the agents.
-
__init__(hp: tensorflow.contrib.training.python.training.hparam.HParams, model_dir: str) → None¶ Initialize a core for an agent.
- Parameters
hp – hyper-parameters, see
deepword.hparams
model_dir – path to save or load model
-
create_or_reload_target_model(restore_from: Optional[str] = None) → None¶ Create (if not exist) or reload weights for the target model
- Parameters
restore_from – the path to restore weights
-
init(is_training: bool, load_best: bool = False, restore_from: Optional[str] = None) → None¶ Initialize models of the core.
- Parameters
is_training – training or evaluation
load_best – load from best weights, otherwise last weights
restore_from – path to restore
-
policy(trajectory: List[deepword.agents.utils.ActionMaster], state: Optional[deepword.agents.utils.ObsInventory], action_matrix: numpy.ndarray, action_len: numpy.ndarray, action_mask: numpy.ndarray) → numpy.ndarray¶ Infer from policy.
- Parameters
trajectory – a list of ActionMaster
state – the current game state of observation + inventory
action_matrix – a matrix of all actions for the game, 2D array, each row represents a tokenized and indexed action.
action_len – 1D array, length for each action.
action_mask – 1D array, indices of admissible actions from all actions of the game.
- Returns
Q-values for actions in the action_matrix
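For illustration, a greedy choice over the returned Q-vector might look like the following sketch (core and the input tensors are assumed to be prepared by the agent; how the vector is indexed depends on the concrete core):
import numpy as np

q_values = core.policy(trajectory, state, action_matrix, action_len, action_mask)
best_idx = int(np.argmax(q_values))   # greedy action choice; an epsilon-greedy agent may instead explore randomly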
-
save_model(t: Optional[int] = None) → None¶ Save current model with training steps
- Parameters
t – training steps, None falls back to default global steps
-
train_one_batch(pre_trajectories: List[List[deepword.agents.utils.ActionMaster]], post_trajectories: List[List[deepword.agents.utils.ActionMaster]], pre_states: Optional[List[deepword.agents.utils.ObsInventory]], post_states: Optional[List[deepword.agents.utils.ObsInventory]], action_matrix: List[numpy.ndarray], action_len: List[numpy.ndarray], pre_action_mask: List[numpy.ndarray], post_action_mask: List[numpy.ndarray], dones: List[bool], rewards: List[float], action_idx: List[int], b_weight: numpy.ndarray, step: int, others: Any) → numpy.ndarray¶ Train the core with one batch of data.
- Parameters
pre_trajectories – previous trajectories
post_trajectories – post trajectories
pre_states – previous states
post_states – post states
action_matrix – all actions for each of previous trajectories
action_len – length of actions
pre_action_mask – action masks for each of previous trajectories
post_action_mask – action masks for each of post trajectories
dones – game terminated or not for post trajectories
rewards – rewards received for reaching post trajectories
action_idx – actions used for reaching post trajectories
b_weight – 1D array, weight for each data point
step – current training step
others – other information passed for training purpose
- Returns
Absolute loss between expected Q-value and predicted Q-value for each data point
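The per-example absolute loss returned here is typically fed back into a prioritized replay buffer; a hedged sketch (the buffer memo, its update_priorities method, and batch_ids are hypothetical names):
abs_loss = core.train_one_batch(
    pre_trajectories, post_trajectories, pre_states, post_states,
    action_matrix, action_len, pre_action_mask, post_action_mask,
    dones, rewards, action_idx, b_weight, step, others=None)
memo.update_priorities(batch_ids, abs_loss)   # hypothetical prioritized-replay priority update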
-
-
class
deepword.agents.cores.DQNCore(hp, model_dir)¶ Bases:
deepword.agents.cores.TFCore
DQN agent that treats actions as types.
-
policy(trajectory: List[deepword.agents.utils.ActionMaster], state: Optional[deepword.agents.utils.ObsInventory], action_matrix: numpy.ndarray, action_len: numpy.ndarray, action_mask: numpy.ndarray) → numpy.ndarray¶ Get either a random action index with its action string, or the best predicted action index with its action string.
-
train_one_batch(pre_trajectories: List[List[deepword.agents.utils.ActionMaster]], post_trajectories: List[List[deepword.agents.utils.ActionMaster]], pre_states: Optional[List[deepword.agents.utils.ObsInventory]], post_states: Optional[List[deepword.agents.utils.ObsInventory]], action_matrix: List[numpy.ndarray], action_len: List[numpy.ndarray], pre_action_mask: List[numpy.ndarray], post_action_mask: List[numpy.ndarray], dones: List[bool], rewards: List[float], action_idx: List[int], b_weight: numpy.ndarray, step: int, others: Any) → numpy.ndarray¶ Train the core with one batch of data.
- Parameters
pre_trajectories – previous trajectories
post_trajectories – post trajectories
pre_states – previous states
post_states – post states
action_matrix – all actions for each of previous trajectories
action_len – length of actions
pre_action_mask – action masks for each of previous trajectories
post_action_mask – action masks for each of post trajectories
dones – game terminated or not for post trajectories
rewards – rewards received for reaching post trajectories
action_idx – actions used for reaching post trajectories
b_weight – 1D array, weight for each data point
step – current training step
others – other information passed for training purpose
- Returns
Absolute loss between expected Q-value and predicted Q-value for each data point
-
-
class
deepword.agents.cores.DRRNCore(hp, model_dir)¶ Bases:
deepword.agents.cores.TFCore
DRRN agent that treats actions as meaningful sentences.
-
policy(trajectory: List[deepword.agents.utils.ActionMaster], state: Optional[deepword.agents.utils.ObsInventory], action_matrix: numpy.ndarray, action_len: numpy.ndarray, action_mask: numpy.ndarray) → numpy.ndarray¶ Get either a random action index with its action string, or the best predicted action index with its action string.
-
train_one_batch(pre_trajectories: List[List[deepword.agents.utils.ActionMaster]], post_trajectories: List[List[deepword.agents.utils.ActionMaster]], pre_states: Optional[List[deepword.agents.utils.ObsInventory]], post_states: Optional[List[deepword.agents.utils.ObsInventory]], action_matrix: List[numpy.ndarray], action_len: List[numpy.ndarray], pre_action_mask: List[numpy.ndarray], post_action_mask: List[numpy.ndarray], dones: List[bool], rewards: List[float], action_idx: List[int], b_weight: numpy.ndarray, step: int, others: Any) → numpy.ndarray¶ Train the core with one batch of data.
- Parameters
pre_trajectories – previous trajectories
post_trajectories – post trajectories
pre_states – previous states
post_states – post states
action_matrix – all actions for each of previous trajectories
action_len – length of actions
pre_action_mask – action masks for each of previous trajectories
post_action_mask – action masks for each of post trajectories
dones – game terminated or not for post trajectories
rewards – rewards received for reaching post trajectories
action_idx – actions used for reaching post trajectories
b_weight – 1D array, weight for each data point
step – current training step
others – other information passed for training purpose
- Returns
Absolute loss between expected Q-value and predicted Q-value for each data point
-
-
class
deepword.agents.cores.DSQNCore(hp, model_dir)¶ Bases:
deepword.agents.cores.DRRNCore
-
eval_snn(snn_data: Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray], batch_size: int = 32) → float¶
-
train_one_batch(pre_trajectories: List[List[deepword.agents.utils.ActionMaster]], post_trajectories: List[List[deepword.agents.utils.ActionMaster]], pre_states: Optional[List[deepword.agents.utils.ObsInventory]], post_states: Optional[List[deepword.agents.utils.ObsInventory]], action_matrix: List[numpy.ndarray], action_len: List[numpy.ndarray], pre_action_mask: List[numpy.ndarray], post_action_mask: List[numpy.ndarray], dones: List[bool], rewards: List[float], action_idx: List[int], b_weight: numpy.ndarray, step: int, others: Any) → numpy.ndarray¶ Train the core with one batch of data.
- Parameters
pre_trajectories – previous trajectories
post_trajectories – post trajectories
pre_states – previous states
post_states – post states
action_matrix – all actions for each of previous trajectories
action_len – length of actions
pre_action_mask – action masks for each of previous trajectories
post_action_mask – action masks for each of post trajectories
dones – game terminated or not for post trajectories
rewards – rewards received for reaching post trajectories
action_idx – actions used for reaching post trajectories
b_weight – 1D array, weight for each data point
step – current training step
others – other information passed for training purpose
- Returns
Absolute loss between expected Q-value and predicted Q-value for each data point
-
-
class
deepword.agents.cores.DSQNZorkCore(hp, model_dir)¶ Bases:
deepword.agents.cores.DQNCore
-
eval_snn(snn_data: Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray], batch_size: int = 32) → float¶
-
train_one_batch(pre_trajectories: List[List[deepword.agents.utils.ActionMaster]], post_trajectories: List[List[deepword.agents.utils.ActionMaster]], pre_states: Optional[List[deepword.agents.utils.ObsInventory]], post_states: Optional[List[deepword.agents.utils.ObsInventory]], action_matrix: List[numpy.ndarray], action_len: List[numpy.ndarray], pre_action_mask: List[numpy.ndarray], post_action_mask: List[numpy.ndarray], dones: List[bool], rewards: List[float], action_idx: List[int], b_weight: numpy.ndarray, step: int, others: Any) → numpy.ndarray¶ Train the core with one batch of data.
- Parameters
pre_trajectories – previous trajectories
post_trajectories – post trajectories
pre_states – previous states
post_states – post states
action_matrix – all actions for each of previous trajectories
action_len – length of actions
pre_action_mask – action masks for each of previous trajectories
post_action_mask – action masks for each of post trajectories
dones – game terminated or not for post trajectories
rewards – rewards received for reaching post trajectories
action_idx – actions used for reaching post trajectories
b_weight – 1D array, weight for each data point
step – current training step
others – other information passed for training purpose
- Returns
Absolute loss between expected Q-value and predicted Q-value for each data point
-
-
class
deepword.agents.cores.GenDQNCore(hp, model_dir)¶ Bases:
deepword.agents.cores.TFCore
-
decode_action(trajectory: List[deepword.agents.utils.ActionMaster]) → deepword.agents.utils.GenSummary¶
-
policy(trajectory: List[deepword.agents.utils.ActionMaster], state: Optional[deepword.agents.utils.ObsInventory], action_matrix: numpy.ndarray, action_len: numpy.ndarray, action_mask: numpy.ndarray) → numpy.ndarray¶ Infer from policy.
- Parameters
trajectory – a list of ActionMaster
state – the current game state of observation + inventory
action_matrix – a matrix of all actions for the game, 2D array, each row represents a tokenized and indexed action.
action_len – 1D array, length for each action.
action_mask – 1D array, indices of admissible actions from all actions of the game.
- Returns
Q-values for actions in the action_matrix
-
summary(token_idx: numpy.ndarray, col_eos_idx: numpy.ndarray, p_gen: numpy.ndarray, sum_logits: numpy.ndarray) → List[deepword.agents.utils.GenSummary]¶
-
train_one_batch(pre_trajectories: List[List[deepword.agents.utils.ActionMaster]], post_trajectories: List[List[deepword.agents.utils.ActionMaster]], pre_states: Optional[List[deepword.agents.utils.ObsInventory]], post_states: Optional[List[deepword.agents.utils.ObsInventory]], action_matrix: List[numpy.ndarray], action_len: List[numpy.ndarray], pre_action_mask: List[numpy.ndarray], post_action_mask: List[numpy.ndarray], dones: List[bool], rewards: List[float], action_idx: List[int], b_weight: numpy.ndarray, step: int, others: Any) → numpy.ndarray¶ Train the core with one batch of data.
- Parameters
pre_trajectories – previous trajectories
post_trajectories – post trajectories
pre_states – previous states
post_states – post states
action_matrix – all actions for each of previous trajectories
action_len – length of actions
pre_action_mask – action masks for each of previous trajectories
post_action_mask – action masks for each of post trajectories
dones – game terminated or not for post trajectories
rewards – rewards received for reaching post trajectories
action_idx – actions used for reaching post trajectories
b_weight – 1D array, weight for each data point
step – current training step
others – other information passed for training purpose
- Returns
Absolute loss between expected Q-value and predicted Q-value for each data point
-
-
class
deepword.agents.cores.NLUCore(hp, model_dir)¶ Bases:
deepword.agents.cores.TFCore
The agent that explores the commonsense ability of BERT models. This agent combines each trajectory with all of its actions, separated by [SEP] in the middle, then feeds the sentence into BERT to get a score from the [CLS] token. Refer to https://arxiv.org/pdf/1810.04805.pdf for fine-tuning and evaluation.
-
policy(trajectory: List[deepword.agents.utils.ActionMaster], state: Optional[deepword.agents.utils.ObsInventory], action_matrix: numpy.ndarray, action_len: numpy.ndarray, action_mask: numpy.ndarray) → numpy.ndarray¶ Infer from policy.
- Parameters
trajectory – a list of ActionMaster
state – the current game state of observation + inventory
action_matrix – a matrix of all actions for the game, 2D array, each row represents a tokenized and indexed action.
action_len – 1D array, length for each action.
action_mask – 1D array, indices of admissible actions from all actions of the game.
- Returns
Q-values for actions in the action_matrix
-
train_one_batch(pre_trajectories: List[List[deepword.agents.utils.ActionMaster]], post_trajectories: List[List[deepword.agents.utils.ActionMaster]], pre_states: Optional[List[deepword.agents.utils.ObsInventory]], post_states: Optional[List[deepword.agents.utils.ObsInventory]], action_matrix: List[numpy.ndarray], action_len: List[numpy.ndarray], pre_action_mask: List[numpy.ndarray], post_action_mask: List[numpy.ndarray], dones: List[bool], rewards: List[float], action_idx: List[int], b_weight: numpy.ndarray, step: int, others: Any) → numpy.ndarray¶ Train the core with one batch of data.
- Parameters
pre_trajectories – previous trajectories
post_trajectories – post trajectories
pre_states – previous states
post_states – post states
action_matrix – all actions for each of previous trajectories
action_len – length of actions
pre_action_mask – action masks for each of previous trajectories
post_action_mask – action masks for each of post trajectories
dones – game terminated or not for post trajectories
rewards – rewards received for reaching post trajectories
action_idx – actions used for reaching post trajectories
b_weight – 1D array, weight for each data point
step – current training step
others – other information passed for training purpose
- Returns
Absolute loss between expected Q-value and predicted Q-value for each data point
-
-
class
deepword.agents.cores.PGNCore(hp, model_dir)¶ Bases:
deepword.agents.cores.TFCore
Generate admissible actions for games, given only the trajectory.
-
decode(trajectory: List[deepword.agents.utils.ActionMaster], beam_size: int, temperature: float, use_greedy: bool) → List[deepword.agents.utils.GenSummary]¶
-
generate_admissible_actions(trajectory: List[deepword.agents.utils.ActionMaster]) → List[str]¶
-
policy(trajectory: List[deepword.agents.utils.ActionMaster], state: Optional[deepword.agents.utils.ObsInventory], action_matrix: numpy.ndarray, action_len: numpy.ndarray, action_mask: numpy.ndarray) → numpy.ndarray¶ Infer from policy.
- Parameters
trajectory – a list of ActionMaster
state – the current game state of observation + inventory
action_matrix – a matrix of all actions for the game, 2D array, each row represents a tokenized and indexed action.
action_len – 1D array, length for each action.
action_mask – 1D array, indices of admissible actions from all actions of the game.
- Returns
Q-values for actions in the action_matrix
-
summary(action_idx: numpy.ndarray, col_eos_idx: numpy.ndarray, decoded_logits: numpy.ndarray, p_gen: numpy.ndarray, beam_size: int) → List[deepword.agents.utils.GenSummary]¶ Return [ids, tokens, generation probabilities of each token, q_action], sorted by q_action (from larger to smaller). q_action is the average of the decoded logits of the selected tokens.
-
train_one_batch(pre_trajectories: List[List[deepword.agents.utils.ActionMaster]], post_trajectories: List[List[deepword.agents.utils.ActionMaster]], pre_states: Optional[List[deepword.agents.utils.ObsInventory]], post_states: Optional[List[deepword.agents.utils.ObsInventory]], action_matrix: List[numpy.ndarray], action_len: List[numpy.ndarray], pre_action_mask: List[numpy.ndarray], post_action_mask: List[numpy.ndarray], dones: List[bool], rewards: List[float], action_idx: List[int], b_weight: numpy.ndarray, step: int, others: Any) → numpy.ndarray¶ Train the core with one batch of data.
- Parameters
pre_trajectories – previous trajectories
post_trajectories – post trajectories
pre_states – previous states
post_states – post states
action_matrix – all actions for each of previous trajectories
action_len – length of actions
pre_action_mask – action masks for each of previous trajectories
post_action_mask – action masks for each of post trajectories
dones – game terminated or not for post trajectories
rewards – rewards received for reaching post trajectories
action_idx – actions used for reaching post trajectories
b_weight – 1D array, weight for each data point
step – current training step
others – other information passed for training purpose
- Returns
Absolute loss between expected Q-value and predicted Q-value for each data point
-
-
class
deepword.agents.cores.TFCore(hp: tensorflow.contrib.training.python.training.hparam.HParams, model_dir: str)¶ Bases:
deepword.agents.cores.BaseCore, abc.ABC
Agent core implemented with TensorFlow.
-
__init__(hp: tensorflow.contrib.training.python.training.hparam.HParams, model_dir: str) → None¶ - Parameters
hp – hyper-parameters
model_dir – path to model dir
-
batch_trajectory2input(trajectories: List[List[deepword.agents.utils.ActionMaster]]) → Tuple[List[List[int]], List[int]]¶ Generate a batch of src and src_len, trimmed to hp.num_tokens.
see
deepword.agents.cores.TFCore.trajectory2input()
- Parameters
trajectories – a batch of trajectories
- Returns
batch of src, batch of src_len
-
create_or_reload_target_model(restore_from: Optional[str] = None) → None¶ Create the target model if it does not exist, then load weights from the most recently saved checkpoint.
- Parameters
restore_from – path to load target model, None falls back to default.
-
init(is_training: bool, load_best: bool = False, restore_from: Optional[str] = None) → None¶ Initialize the core.
create the model
load the model if there are saved models
create target model for training
- Parameters
is_training – True for training, False for evaluation
load_best – load best model, otherwise load last weights
restore_from – specify the load path; if set, load_best is ignored
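A lifecycle sketch for a concrete core (DRRNCore is used only as an example; hp, model_dir, and the policy inputs are assumed to be prepared elsewhere, e.g. by an agent):
core = DRRNCore(hp, model_dir)
core.init(is_training=False, load_best=True)    # build the model and restore the best weights
q_values = core.policy(trajectory, state, action_matrix, action_len, action_mask)
# for training, use core.init(is_training=True) and checkpoint with core.save_model(t=step)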
-
safe_loading(model: deepword.models.models.DQNModel, sess: tensorflow.python.client.session.Session, saver: tensorflow.python.training.saver.Saver, restore_from: str) → int¶ Load weights from restore_from into model. If the weights in the loaded model are incompatible with the current model, try to load only those weights that have the same name.
This method is useful when the saved model lacks the training part, e.g. the Adam optimizer variables.
- Parameters
model – A tensorflow model
sess – A tensorflow session
saver – A tensorflow saver
restore_from – the path to restore the model
- Returns
training steps
-
save_best_model() → None¶ Save current model to the best weights dir
-
save_model(t: Optional[int] = None) → None¶ Save model to model_dir with the number of training steps.
- Parameters
t – number of training steps, None falls back to global step
-
set_d4eval(device: str) → None¶ Set the device for evaluation, e.g. “/device:CPU:0” or “/device:GPU:1”. If not set, a default device allocation will be used.
- Parameters
device – device name
-
trajectory2input(trajectory: List[deepword.agents.utils.ActionMaster]) → Tuple[List[int], int]¶ Generate src and src_len from a trajectory, trimmed to hp.num_tokens.
- Parameters
trajectory – List of ActionMaster
- Returns
src – source indices
src_len – length of the src
-
-
class
deepword.agents.cores.TabularCore(hp, model_dir)¶ Bases:
deepword.agents.cores.BaseCore
Tabular DQN agent that uses a matrix to store Q-vectors and uses hashed values of observation + inventory as game states.
-
create_or_reload_target_model(restore_from: Optional[str] = None) → None¶ Create (if not exist) or reload weights for the target model
- Parameters
restore_from – the path to restore weights
-
get_state_hash(state: deepword.agents.utils.ObsInventory) → str¶
-
init(is_training: bool, load_best: bool = False, restore_from: Optional[str] = None) → None¶ Initialize models of the core.
- Parameters
is_training – training or evaluation
load_best – load from best weights, otherwise last weights
restore_from – path to restore
-
policy(trajectory: List[deepword.agents.utils.ActionMaster], state: Optional[deepword.agents.utils.ObsInventory], action_matrix: numpy.ndarray, action_len: numpy.ndarray, action_mask: numpy.ndarray) → numpy.ndarray¶ Infer from policy.
- Parameters
trajectory – a list of ActionMaster
state – the current game state of observation + inventory
action_matrix – a matrix of all actions for the game, 2D array, each row represents a tokenized and indexed action.
action_len – 1D array, length for each action.
action_mask – 1D array, indices of admissible actions from all actions of the game.
- Returns
Q-values for actions in the action_matrix
-
save_model(t: Optional[int] = None) → None¶ Save current model with training steps
- Parameters
t – training steps, None falls back to default global steps
-
train_one_batch(pre_trajectories: List[List[deepword.agents.utils.ActionMaster]], post_trajectories: List[List[deepword.agents.utils.ActionMaster]], pre_states: Optional[List[deepword.agents.utils.ObsInventory]], post_states: Optional[List[deepword.agents.utils.ObsInventory]], action_matrix: List[numpy.ndarray], action_len: List[numpy.ndarray], pre_action_mask: List[numpy.ndarray], post_action_mask: List[numpy.ndarray], dones: List[bool], rewards: List[float], action_idx: List[int], b_weight: numpy.ndarray, step: int, others: Any) → numpy.ndarray¶ Train the core with one batch of data.
- Parameters
pre_trajectories – previous trajectories
post_trajectories – post trajectories
pre_states – previous states
post_states – post states
action_matrix – all actions for each of previous trajectories
action_len – length of actions
pre_action_mask – action masks for each of previous trajectories
post_action_mask – action masks for each of post trajectories
dones – game terminated or not for post trajectories
rewards – rewards received for reaching post trajectories
action_idx – actions used for reaching post trajectories
b_weight – 1D array, weight for each data point
step – current training step
others – other information passed for training purpose
- Returns
Absolute loss between expected Q-value and predicted Q-value for each data point
-
deepword.agents.dsqn_agent module¶
-
class
deepword.agents.dsqn_agent.DSQNAgent(hp, model_dir)¶ Bases:
deepword.agents.base_agent.BaseAgent
BaseAgent with hs2tj (hash states pointing to trajectories) for SNN training.
-
get_snn_pairs(batch_size: int) → Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray]¶ Sample SNN pairs for training the SNN part.
- Parameters
batch_size – how many data points to generate. Notice that batch_size * 2 data points will be generated, one half for trajectory pairs with the same states; the other half for trajectory pairs with different states.
- Returns
src – trajectories
src_len – length of them
src2 – the paired trajectories
src2_len – length of them
labels – 0 for same states; 1 for different states
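A sketch of feeding these pairs into the SNN evaluation of a DSQN core (agent and core are assumed to be initialized; the returned float is treated here as an accuracy-like metric):
snn_data = agent.get_snn_pairs(batch_size=32)    # (src, src_len, src2, src2_len, labels)
snn_metric = core.eval_snn(snn_data, batch_size=32)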
-
save_train_pairs(t: int, src: numpy.ndarray, src_len: numpy.ndarray, src2: numpy.ndarray, src2_len: numpy.ndarray, labels: numpy.ndarray) → None¶ Save SNN pairs for verification.
- Parameters
t – current training steps
src – trajectories
src_len – length of trajectories
src2 – paired trajectories
src2_len – length of paired trajectories
labels – 0 or 1 for same or different states
-
-
class
deepword.agents.dsqn_agent.DSQNCompetitionAgent(hp, model_dir)¶ Bases:
deepword.agents.dsqn_agent.DSQNAgent, deepword.agents.competition_agent.CompetitionAgent
-
class
deepword.agents.dsqn_agent.DSQNZorkAgent(hp, model_dir)¶ Bases:
deepword.agents.dsqn_agent.DSQNAgent, deepword.agents.zork_agent.ZorkAgent
deepword.agents.gen_agent module¶
-
class
deepword.agents.gen_agent.GenDQNAgent(hp, model_dir)¶ Bases:
deepword.agents.base_agent.BaseAgent
GenDQNAgent works with
deepword.agents.cores.GenDQNCore.
deepword.agents.gen_drrn_agent module¶
-
class
deepword.agents.gen_drrn_agent.GenCompetitionDRRNAgent(hp, model_dir)¶
-
class
deepword.agents.gen_drrn_agent.GenDRRNAgent(hp, model_dir)¶ Bases:
deepword.agents.base_agent.BaseAgent
We generate admissible actions at every step, and then use DRRN to choose the best action to play.
This agent can be compared with the previous template-gen agent.
deepword.agents.utils module¶
-
class
deepword.agents.utils.ActType(rnd, rule, rnd_walk, policy_drrn, policy_gen, jitter, policy_tbl)¶
-
class
deepword.agents.utils.ActionDesc(action_type, action_idx, token_idx, action_len, action, q_actions)¶
-
class
deepword.agents.utils.ActionMaster(action_ids: List[int], master_ids: List[int], action: str, master: str)¶ Bases:
object
-
property
action¶
-
property
action_ids¶
-
property
ids¶
-
property
lens¶
-
property
master¶
-
property
master_ids¶
-
class
deepword.agents.utils.CommonActs(examine_cookbook, prepare_meal, eat_meal, look, inventory, gn, gs, ge, gw)¶
-
class
deepword.agents.utils.EnvInfosKey(recipe, desc, inventory, max_score, won, lost, actions, templates, verbs, entities)¶ Bases:
deepword.agents.utils.KeyInfo
-
class
deepword.agents.utils.GenSummary(ids, tokens, gens, q_action, len)¶
-
class
deepword.agents.utils.LinearDecayedEPS(decay_step, init_eps=1, final_eps=0)¶ Bases:
deepword.agents.utils.ScheduledEPS
-
eps(t)¶
-
-
class
deepword.agents.utils.Memolet(tid, sid, gid, aid, token_id, a_len, a_type, reward, is_terminal, action_mask, sys_action_mask, next_action_mask, next_sys_action_mask, q_actions)¶ Bases:
deepword.agents.utils.Memolet
end_of_episode: the game stops by 1) winning, 2) losing, or 3) exceeding the maximum number of steps. is_terminal: whether the current step reaches a terminal game state by winning or losing. is_terminal = True means that for the current step, the Q-value equals the instant reward.
TODO: Notice that end_of_episode doesn’t imply is_terminal. Only winning or losing means is_terminal = True.
-
class
deepword.agents.utils.ObsInventory(obs, inventory, sid, hs)¶
-
class
deepword.agents.utils.ScannerDecayEPS(decay_step, decay_range, next_init_eps_rate=0.8, init_eps=1, final_eps=0)¶ Bases:
deepword.agents.utils.ScheduledEPS
-
eps(t)¶
-
-
class
deepword.agents.utils.ScheduledEPS(name: Optional[str] = None)¶ Bases:
deepword.log.Logging
-
eps(t)¶
-
-
deepword.agents.utils.batch_drrn_action_input(action_matrices: List[numpy.ndarray], action_lens: List[numpy.ndarray], action_masks: List[numpy.ndarray]) → Tuple[numpy.ndarray, numpy.ndarray, List[int], List[Dict[int, int]]]¶ Select actions from action_masks in a batch
-
deepword.agents.utils.bert_commonsense_input(action_matrix: numpy.ndarray, action_len: numpy.ndarray, trajectory: List[int], trajectory_len: int, sep_val_id: int, cls_val_id: int, num_tokens: int) → Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]¶ Given one trajectory and its admissible actions, create a training set of inputs for BERT.
Notice: trajectory_len and action_len need to be confirmed so that positions for special tokens, e.g. [CLS] and [SEP], are reserved.
E.g. input: [1, 2, 3], action_matrix: [[1, 3], [2, PAD], [4, PAD]]; suppose we need the length to be 10.
Output:
[[CLS, 1, 2, 3, SEP, 1, 3, SEP, PAD, PAD, PAD],
 [CLS, 1, 2, 3, SEP, 2, SEP, PAD, PAD, PAD, PAD],
 [CLS, 1, 2, 3, SEP, 4, SEP, PAD, PAD, PAD, PAD]]
Segments of trajectory and actions:
[[0, 0, 0, 0, 0, 1, 1, 1],
 [0, 0, 0, 0, 0, 1, 1, 0],
 [0, 0, 0, 0, 0, 1, 1, 0]]
Input sizes: [8, 7, 7]
- Returns
trajectory + action; segmentation ids; sizes
-
deepword.agents.utils.categorical_without_replacement(logits, k=1)¶ Courtesy of https://github.com/tensorflow/tensorflow/issues/9260#issuecomment-437875125
Also cited here:
@misc{vieira2014gumbel,
title = {Gumbel-max trick and weighted reservoir sampling},
author = {Tim Vieira},
url = {http://timvieira.github.io/blog/post/2014/08/01/gumbel-max-trick-and-weighted-reservoir-sampling/},
year = {2014}
}
Notice that the logits represent unnormalized log probabilities; per the citation above, there is no need to normalize them before adding the Gumbel random variate, which surprises me, since I thought it should be logits - tf.reduce_logsumexp(logits) + z.
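A minimal numpy sketch of the Gumbel-top-k trick described above (not the library's TensorFlow implementation): adding Gumbel(0, 1) noise to the unnormalized log-probabilities and taking the top-k indices samples k items without replacement.
import numpy as np

def gumbel_top_k(logits, k=1, rng=None):
    # Gumbel-max / Gumbel-top-k: no normalization of the logits is required.
    rng = rng or np.random.default_rng()
    z = rng.gumbel(size=len(logits))
    return np.argsort(logits + z)[::-1][:k]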
-
deepword.agents.utils.drrn_action_input(action_matrix: numpy.ndarray, action_len: numpy.ndarray, action_mask: numpy.ndarray) → Tuple[numpy.ndarray, numpy.ndarray, int, Dict[int, int]]¶ Select actions from action_mask.
- Parameters
action_matrix – action matrix for a game
action_len – lengths for actions in the action_matrix
action_mask – list of indices of selected actions
- Returns
selected action matrix, selected action len, number of actions selected, and the mapping from real ID to mask ID.
real ID: the action index in the original action_matrix; mask ID: the action index in the action_mask.
Examples
>>> a_mat = np.asarray([
>>>     [1, 2, 3, 4, 0],
>>>     [2, 2, 1, 3, 1],
>>>     [3, 1, 0, 0, 0],
>>>     [6, 9, 9, 1, 0]])
>>> a_len = np.asarray([4, 5, 2, 4])
>>> a_mask = np.asarray([1, 3])
>>> drrn_action_input(a_mat, a_len, a_mask)
[[2, 2, 1, 3, 1], [6, 9, 9, 1, 0]]
[5, 4]
{1: 0, 3: 1}
-
deepword.agents.utils.get_action_idx_pair(action_matrix: numpy.ndarray, action_len: numpy.ndarray, sos_id: int, eos_id: int) → Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]¶ Create action index pair for seq2seq training. Given action index, e.g. [1, 2, 3, 4, pad, pad, pad, pad], with 0 as sos_id, and -1 as eos_id, we create training pair: [0, 1, 2, 3, 4, pad, pad, pad] as the input sentence, and [1, 2, 3, 4, -1, pad, pad, pad] as the output sentence.
Notice that we remove the final pad to keep the action length unchanged. Notice also that pad should be indexed as 0.
- Parameters
action_matrix – np array of action index of N * K, there are N, and each of them has a length of K (with paddings).
action_len – length of each action (remove paddings).
sos_id –
eos_id –
- Returns
action index as input, action index as output, new action len
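An illustrative example following the transformation described above, with pad = 0, sos_id = 0, and eos_id = -1 (the exact formatting of the returned arrays may differ):
>>> a_mat = np.asarray([[1, 2, 3, 4, 0, 0, 0, 0]])
>>> a_len = np.asarray([4])
>>> inp, outp, new_len = get_action_idx_pair(a_mat, a_len, sos_id=0, eos_id=-1)
>>> # inp     -> [[0, 1, 2, 3, 4, 0, 0, 0]]
>>> # outp    -> [[1, 2, 3, 4, -1, 0, 0, 0]]
>>> # new_len -> [5]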
-
deepword.agents.utils.get_best_1d_q(q_actions: numpy.ndarray) → Tuple[int, float]¶ Find the best Q-value given a 1D Q-vector
- Parameters
q_actions – a vector of Q-values
- Returns
best action index, Q-value
Examples
>>> q_vec = np.asarray([0.1, 0.2, 0.3, 0.4])
>>> get_best_1d_q(q_vec)
3, 0.4
-
deepword.agents.utils.get_best_batch_ids(q_actions: numpy.ndarray, actions_repeats: List[int]) → List[int]¶ Get a batch of best action indices of Q-values, one for each group defined by actions_repeats.
- Parameters
q_actions – a 1D Q-vector
actions_repeats – groups of number of actions, indicating how many elements are in the same group.
- Returns
best action index for each group
Examples
>>> q_vec = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> repeats = [3, 4, 3]
>>> # Q-vector splits into three groups containing 3, 4, 3 Q-values
>>> # shaded_qs = [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]]
>>> get_best_batch_ids(np.asarray(q_vec), repeats)
[3, 7, 10]
-
deepword.agents.utils.get_hash_state(obs: str, inv: str) → str¶ Generate a hash state from observation and inventory.
- Parameters
obs – observation of current step
inv – inventory of current step
- Returns
hash state of current step
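A plausible stand-in, purely for illustration (the actual hashing scheme used by get_hash_state may differ):
import hashlib

def hash_state_sketch(obs: str, inv: str) -> str:
    # Hypothetical equivalent: digest of the concatenated observation and inventory text.
    return hashlib.sha256((obs + "\n" + inv).encode("utf-8")).hexdigest()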
-
deepword.agents.utils.get_path_tags(path, prefix)¶ Get tags from a path of saved objects. E.g. for actions-100.npz, 100 will be extracted. Make sure the item to be extracted is saved with the npz suffix.
- Parameters
path – path to find files with prefix
prefix – prefix
- Returns
list of all tags
Examples
>>> # suppose there are these files: >>> # actions-99.npz, actions-100.npz, actions-200.npz >>> get_path_tags("/path/to/data", "actions") [99, 100, 200]
-
deepword.agents.utils.get_snn_keys(hash_states2tjs: Dict[str, Dict[int, List[int]]], tjs: deepword.trajectory.Trajectory, size: int) → Tuple[List[Tuple[int, int]], List[Tuple[int, int]], List[Tuple[int, int]]]¶ Get SNN training pairs from trajectories.
- Parameters
hash_states2tjs – the mapping from hash state to trajectory
tjs – the trajectories
size – batch size
- Returns
target_set, same_set, and diff_set; each set contains keys of (tid, sid) to locate a trajectory
-
deepword.agents.utils.id_real2batch(real_id: List[int], id_real2mask: List[Dict[int, int]], actions_repeats: List[int]) → List[int]¶ Transform real IDs to IDs in a batch
An explanation of the three ID systems for actions, depending on where the action appears:
In the action matrix of the game: real ID. E.g. a game with three actions [“go east”, “go west”, “eat meal”], then the real IDs are [0, 1, 2]
In the action mask for each step of game-playing: mask ID. E.g. when playing a step with admissible actions [“go east”, “eat meal”], the mask IDs are [0, 1], mapping to the real IDs [0, 2].
In a batch for training. E.g. in a batch of 2 entries, each entry is from a different game, say, game-1 and game-2.
Game-1, at the step of playing, has two actions, say [0, 2];
Game-2, at the step of playing, has three actions, say, [0, 4, 10].
Suppose the agent chooses action-0 from game-1 for entry-1, and action-4 from game-2 for entry-2. Now the real IDs are [0, 4]. However, the mask IDs are [0, 1].
Why does action-4 become action-1? Because for that step of game-2, there are only three actions [0, 4, 10], and action-4 is placed at position 1.
Converting mask IDs to batch IDs, we get [0, 3].
Why does action-1 become action-3? Because if we place the actions (mask IDs) for entry-1 and entry-2 together, we get [[0, 1], [0, 1, 2]]. The action list is then flattened into [0, 1, 0, 1, 2] and re-indexed as [0, 1, 2, 3, 4]. So action-1 maps to action-3 for entry-2.
- Parameters
real_id – action ids for each game in the original action_matrix of that game
id_real2mask – list of mappings from real IDs to mask IDs
actions_repeats – action sizes in each group
- Returns
a list of batch IDs
Examples
>>> rids = [0, 4]
>>> id_maps = [{0: 0, 2: 1}, {0: 0, 4: 1, 10: 2}]
>>> repeats = [2, 3]
>>> id_real2batch(rids, id_maps, repeats)
[0, 3]
-
deepword.agents.utils.remove_zork_version_info(text)¶
-
deepword.agents.utils.sample_batch_ids(q_actions: numpy.ndarray, actions_repeats: List[int], k: int) → List[int]¶ Get a batch of sampled action indices of Q-values. actions_repeats indicates how many elements are in the same group. E.g. q_actions = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] and actions_repeats = [3, 4, 3]; then q_actions can be split into three groups: [1, 2, 3], [4, 5, 6, 7], [8, 9, 10].
We sample from the indices: the best index in each group is taken as the first element of that group, then another k - 1 elements are sampled for each group. If the number of elements in a group is smaller than k - 1, we sample with replacement.
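A call sketch for the example above (the sampled indices are random, so only the contract is shown; this assumes the function returns k indices per group, best-first, as described):
>>> q_vec = np.asarray([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
>>> repeats = [3, 4, 3]
>>> ids = sample_batch_ids(q_vec, repeats, k=2)
>>> # len(ids) == 2 * len(repeats) == 6; within each group the best index comes first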
deepword.agents.zork_agent module¶
-
class
deepword.agents.zork_agent.ZorkAgent(hp, model_dir)¶ Bases:
deepword.agents.base_agent.BaseAgent
The agent to run Zork. TextWorld will not provide admissible actions for Zork as it does for the cooking games, so a loaded action file is required.
