deepword.agents package¶
Submodules¶
deepword.agents.base_agent module¶
- 
class deepword.agents.base_agent.BaseAgent(hp: tensorflow.contrib.training.python.training.hparam.HParams, model_dir: str)¶
- Bases: deepword.log.Logging
- Base agent class that uses:
- action collector 
- trajectory collector 
- floor plan collector 
- tree memory storage and sampling 
 
 - 
__init__(hp: tensorflow.contrib.training.python.training.hparam.HParams, model_dir: str) → None¶
- Initialize a base agent - Parameters
- hp – hyper-parameters, refer to deepword.hparams
- model_dir – path to model dir 
 
 
 - 
act(obs: List[str], scores: List[int], dones: List[bool], infos: Dict[str, List[Any]]) → Optional[List[str]]¶
- Acts upon the current list of observations. One text command must be returned for each observation. - Parameters
- obs – observed texts for each game 
- scores – score obtained so far for each game 
- dones – whether a game is finished 
- infos – extra information requested from TextWorld 
 
- Returns
- if all games are done, return None; otherwise return actions 
 - Notes - Commands returned for games marked as done have no effect. The states of finished games are simply copied over until all games are done. 
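 - Examples
A minimal game-loop sketch of how act() is typically driven; the gym-style env and its reset()/step() API here are illustrative assumptions, not part of deepword.
>>> agent.eval(load_best=True)
>>> obs, infos = env.reset()  # hypothetical batched TextWorld-style env
>>> scores = [0] * len(obs)
>>> dones = [False] * len(obs)
>>> while not all(dones):
>>>     commands = agent.act(obs, scores, dones, infos)
>>>     obs, scores, dones, infos = env.step(commands)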
 - 
eval(load_best=True) → None¶
- Call eval() before performing evaluation. - Parameters
- load_best – load from the best weights, otherwise from last weights 
 
 - 
property negative_scores¶
- Total negative scores 
 - 
property positive_scores¶
- Total positive scores earned 
 - 
reset(restore_from: Optional[str] = None) → None¶
- reset() is only used for evaluation during training; do not use it anywhere else. - Parameters
- restore_from – where to restore the model, None goes to default 
 
 - 
save_snapshot() → None¶
 - 
classmethod select_additional_infos() → textworld.core.EnvInfos¶
- Additional information needed when playing the game. The infos requested here are required to run the agent. 
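 - Examples
A sketch of the kind of infos a subclass might request; the exact fields are an assumption for illustration, not the actual request of any particular deepword agent.
>>> from textworld import EnvInfos
>>> infos = EnvInfos(description=True, inventory=True, max_score=True,
>>>                  won=True, lost=True, admissible_commands=True)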
 - 
train() → None¶
- Call train() before performing training. 
 
deepword.agents.competition_agent module¶
- 
class deepword.agents.competition_agent.CompetitionAgent(hp, model_dir)¶
- Bases: deepword.agents.base_agent.BaseAgent
- The agent built to participate in the TextWorld competition. Includes action filtering and rule-based policies. 
deepword.agents.cores module¶
- 
class deepword.agents.cores.BaseCore(hp: tensorflow.contrib.training.python.training.hparam.HParams, model_dir: str)¶
- Bases: deepword.log.Logging, abc.ABC
- Core: used by agents to compute policies. Core objects are isolated from games and gaming platforms. They work with agents, receiving trajectories and actions, and then compute a policy for the agent. - How to get trajectories and actions, and how to choose actions given a policy, is decided by the agent; see the sketch below.
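- Examples
An illustrative sketch of how an agent might consume a core's policy; the argmax selection rule and variable names are assumptions, since action selection is left to the agent.
>>> q_values = core.policy(trajectory, state, action_matrix, action_len, action_mask)
>>> # choose the admissible action with the highest Q-value
>>> best_action_idx = action_mask[np.argmax(q_values[action_mask])]
- 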
__init__(hp: tensorflow.contrib.training.python.training.hparam.HParams, model_dir: str) → None¶
- Initialize a core for an agent. - Parameters
- hp – hyper-parameters, see deepword.hparams
- model_dir – path to save or load model 
 
 
 - 
create_or_reload_target_model(restore_from: Optional[str] = None) → None¶
- Create (if it does not exist) or reload weights for the target model. - Parameters
- restore_from – the path to restore weights 
 
 - 
init(is_training: bool, load_best: bool = False, restore_from: Optional[str] = None) → None¶
- Initialize models of the core. - Parameters
- is_training – training or evaluation 
- load_best – load from best weights, otherwise last weights 
- restore_from – path to restore 
 
 
 - 
policy(trajectory: List[deepword.agents.utils.ActionMaster], state: Optional[deepword.agents.utils.ObsInventory], action_matrix: numpy.ndarray, action_len: numpy.ndarray, action_mask: numpy.ndarray) → numpy.ndarray¶
- Infer from policy. - Parameters
- trajectory – a list of ActionMaster 
- state – the current game state of observation + inventory 
- action_matrix – a matrix of all actions for the game, 2D array, each row represents a tokenized and indexed action. 
- action_len – 1D array, length for each action. 
- action_mask – 1D array, indices of admissible actions from all actions of the game. 
 
- Returns
- Q-values for actions in the action_matrix 
 
 - 
save_model(t: Optional[int] = None) → None¶
- Save current model with training steps - Parameters
- t – training steps, None falls back to default global steps 
 
 - 
train_one_batch(pre_trajectories: List[List[deepword.agents.utils.ActionMaster]], post_trajectories: List[List[deepword.agents.utils.ActionMaster]], pre_states: Optional[List[deepword.agents.utils.ObsInventory]], post_states: Optional[List[deepword.agents.utils.ObsInventory]], action_matrix: List[numpy.ndarray], action_len: List[numpy.ndarray], pre_action_mask: List[numpy.ndarray], post_action_mask: List[numpy.ndarray], dones: List[bool], rewards: List[float], action_idx: List[int], b_weight: numpy.ndarray, step: int, others: Any) → numpy.ndarray¶
- Train the core with one batch of data. - Parameters
- pre_trajectories – previous trajectories 
- post_trajectories – post trajectories 
- pre_states – previous states 
- post_states – post states 
- action_matrix – all actions for each of previous trajectories 
- action_len – length of actions 
- pre_action_mask – action masks for each of previous trajectories 
- post_action_mask – action masks for each of post trajectories 
- dones – game terminated or not for post trajectories 
- rewards – rewards received for reaching post trajectories 
- action_idx – actions used for reaching post trajectories 
- b_weight – 1D array, weight for each data point 
- step – current training step 
- others – other information passed for training purpose 
 
 - Returns
- Absolute loss between expected Q-value and predicted Q-value for each data point 
 
 
- 
- 
class deepword.agents.cores.DQNCore(hp, model_dir)¶
- Bases: deepword.agents.cores.TFCore
- DQN core that treats actions as types. - 
policy(trajectory: List[deepword.agents.utils.ActionMaster], state: Optional[deepword.agents.utils.ObsInventory], action_matrix: numpy.ndarray, action_len: numpy.ndarray, action_mask: numpy.ndarray) → numpy.ndarray¶
- Get either a random action index with its action string, or the best predicted action index with its action string. 
 - 
train_one_batch(pre_trajectories: List[List[deepword.agents.utils.ActionMaster]], post_trajectories: List[List[deepword.agents.utils.ActionMaster]], pre_states: Optional[List[deepword.agents.utils.ObsInventory]], post_states: Optional[List[deepword.agents.utils.ObsInventory]], action_matrix: List[numpy.ndarray], action_len: List[numpy.ndarray], pre_action_mask: List[numpy.ndarray], post_action_mask: List[numpy.ndarray], dones: List[bool], rewards: List[float], action_idx: List[int], b_weight: numpy.ndarray, step: int, others: Any) → numpy.ndarray¶
- Train the core with one batch of data. - Parameters
- pre_trajectories – previous trajectories 
- post_trajectories – post trajectories 
- pre_states – previous states 
- post_states – post states 
- action_matrix – all actions for each of previous trajectories 
- action_len – length of actions 
- pre_action_mask – action masks for each of previous trajectories 
- post_action_mask – action masks for each of post trajectories 
- dones – game terminated or not for post trajectories 
- rewards – rewards received for reaching post trajectories 
- action_idx – actions used for reaching post trajectories 
- b_weight – 1D array, weight for each data point 
- step – current training step 
- others – other information passed for training purpose 
 
 - Returns
- Absolute loss between expected Q-value and predicted Q-value for each data point 
 
 
- 
- 
class deepword.agents.cores.DRRNCore(hp, model_dir)¶
- Bases: deepword.agents.cores.TFCore
- DRRN core that treats actions as meaningful sentences. - 
policy(trajectory: List[deepword.agents.utils.ActionMaster], state: Optional[deepword.agents.utils.ObsInventory], action_matrix: numpy.ndarray, action_len: numpy.ndarray, action_mask: numpy.ndarray) → numpy.ndarray¶
- Get either a random action index with its action string, or the best predicted action index with its action string. 
 - 
train_one_batch(pre_trajectories: List[List[deepword.agents.utils.ActionMaster]], post_trajectories: List[List[deepword.agents.utils.ActionMaster]], pre_states: Optional[List[deepword.agents.utils.ObsInventory]], post_states: Optional[List[deepword.agents.utils.ObsInventory]], action_matrix: List[numpy.ndarray], action_len: List[numpy.ndarray], pre_action_mask: List[numpy.ndarray], post_action_mask: List[numpy.ndarray], dones: List[bool], rewards: List[float], action_idx: List[int], b_weight: numpy.ndarray, step: int, others: Any) → numpy.ndarray¶
- Train the core with one batch of data. - Parameters
- pre_trajectories – previous trajectories 
- post_trajectories – post trajectories 
- pre_states – previous states 
- post_states – post states 
- action_matrix – all actions for each of previous trajectories 
- action_len – length of actions 
- pre_action_mask – action masks for each of previous trajectories 
- post_action_mask – action masks for each of post trajectories 
- dones – game terminated or not for post trajectories 
- rewards – rewards received for reaching post trajectories 
- action_idx – actions used for reaching post trajectories 
- b_weight – 1D array, weight for each data point 
- step – current training step 
- others – other information passed for training purpose 
 
 - Returns
- Absolute loss between expected Q-value and predicted Q-value for each data point 
 
 
- 
- 
class deepword.agents.cores.DSQNCore(hp, model_dir)¶
- Bases: deepword.agents.cores.DRRNCore - 
eval_snn(snn_data: Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray], batch_size: int = 32) → float¶
 - 
train_one_batch(pre_trajectories: List[List[deepword.agents.utils.ActionMaster]], post_trajectories: List[List[deepword.agents.utils.ActionMaster]], pre_states: Optional[List[deepword.agents.utils.ObsInventory]], post_states: Optional[List[deepword.agents.utils.ObsInventory]], action_matrix: List[numpy.ndarray], action_len: List[numpy.ndarray], pre_action_mask: List[numpy.ndarray], post_action_mask: List[numpy.ndarray], dones: List[bool], rewards: List[float], action_idx: List[int], b_weight: numpy.ndarray, step: int, others: Any) → numpy.ndarray¶
- Train the core with one batch of data. - Parameters
- pre_trajectories – previous trajectories 
- post_trajectories – post trajectories 
- pre_states – previous states 
- post_states – post states 
- action_matrix – all actions for each of previous trajectories 
- action_len – length of actions 
- pre_action_mask – action masks for each of previous trajectories 
- post_action_mask – action masks for each of post trajectories 
- dones – game terminated or not for post trajectories 
- rewards – rewards received for reaching post trajectories 
- action_idx – actions used for reaching post trajectories 
- b_weight – 1D array, weight for each data point 
- step – current training step 
- others – other information passed for training purpose 
 
 - Returns
- Absolute loss between expected Q-value and predicted Q-value for each data point 
 
 
- 
- 
class deepword.agents.cores.DSQNZorkCore(hp, model_dir)¶
- Bases: deepword.agents.cores.DQNCore - 
eval_snn(snn_data: Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray], batch_size: int = 32) → float¶
 - 
train_one_batch(pre_trajectories: List[List[deepword.agents.utils.ActionMaster]], post_trajectories: List[List[deepword.agents.utils.ActionMaster]], pre_states: Optional[List[deepword.agents.utils.ObsInventory]], post_states: Optional[List[deepword.agents.utils.ObsInventory]], action_matrix: List[numpy.ndarray], action_len: List[numpy.ndarray], pre_action_mask: List[numpy.ndarray], post_action_mask: List[numpy.ndarray], dones: List[bool], rewards: List[float], action_idx: List[int], b_weight: numpy.ndarray, step: int, others: Any) → numpy.ndarray¶
- Train the core with one batch of data. - Parameters
- pre_trajectories – previous trajectories 
- post_trajectories – post trajectories 
- pre_states – previous states 
- post_states – post states 
- action_matrix – all actions for each of previous trajectories 
- action_len – length of actions 
- pre_action_mask – action masks for each of previous trajectories 
- post_action_mask – action masks for each of post trajectories 
- dones – game terminated or not for post trajectories 
- rewards – rewards received for reaching post trajectories 
- action_idx – actions used for reaching post trajectories 
- b_weight – 1D array, weight for each data point 
- step – current training step 
- others – other information passed for training purpose 
 
 - Returns
- Absolute loss between expected Q-value and predicted Q-value for each data point 
 
 
- 
- 
class deepword.agents.cores.GenDQNCore(hp, model_dir)¶
- Bases: deepword.agents.cores.TFCore - 
decode_action(trajectory: List[deepword.agents.utils.ActionMaster]) → deepword.agents.utils.GenSummary¶
 - 
policy(trajectory: List[deepword.agents.utils.ActionMaster], state: Optional[deepword.agents.utils.ObsInventory], action_matrix: numpy.ndarray, action_len: numpy.ndarray, action_mask: numpy.ndarray) → numpy.ndarray¶
- Infer from policy. - Parameters
- trajectory – a list of ActionMaster 
- state – the current game state of observation + inventory 
- action_matrix – a matrix of all actions for the game, 2D array, each row represents a tokenized and indexed action. 
- action_len – 1D array, length for each action. 
- action_mask – 1D array, indices of admissible actions from all actions of the game. 
 
- Returns
- Q-values for actions in the action_matrix 
 
 - 
summary(token_idx: numpy.ndarray, col_eos_idx: numpy.ndarray, p_gen: numpy.ndarray, sum_logits: numpy.ndarray) → List[deepword.agents.utils.GenSummary]¶
 - 
train_one_batch(pre_trajectories: List[List[deepword.agents.utils.ActionMaster]], post_trajectories: List[List[deepword.agents.utils.ActionMaster]], pre_states: Optional[List[deepword.agents.utils.ObsInventory]], post_states: Optional[List[deepword.agents.utils.ObsInventory]], action_matrix: List[numpy.ndarray], action_len: List[numpy.ndarray], pre_action_mask: List[numpy.ndarray], post_action_mask: List[numpy.ndarray], dones: List[bool], rewards: List[float], action_idx: List[int], b_weight: numpy.ndarray, step: int, others: Any) → numpy.ndarray¶
- Train the core with one batch of data. - Parameters
- pre_trajectories – previous trajectories 
- post_trajectories – post trajectories 
- pre_states – previous states 
- post_states – post states 
- action_matrix – all actions for each of previous trajectories 
- action_len – length of actions 
- pre_action_mask – action masks for each of previous trajectories 
- post_action_mask – action masks for each of post trajectories 
- dones – game terminated or not for post trajectories 
- rewards – rewards received for reaching post trajectories 
- action_idx – actions used for reaching post trajectories 
- b_weight – 1D array, weight for each data point 
- step – current training step 
- others – other information passed for training purpose 
 
 - Returns
- Absolute loss between expected Q-value and predicted Q-value for each data point 
 
 
- 
- 
class deepword.agents.cores.NLUCore(hp, model_dir)¶
- Bases: deepword.agents.cores.TFCore
- The core that explores the commonsense ability of BERT models. It combines each trajectory with each of its actions, separated by [SEP] in the middle, then feeds the sequence into BERT to get a score from the [CLS] token. Refer to https://arxiv.org/pdf/1810.04805.pdf for fine-tuning and evaluation. - 
policy(trajectory: List[deepword.agents.utils.ActionMaster], state: Optional[deepword.agents.utils.ObsInventory], action_matrix: numpy.ndarray, action_len: numpy.ndarray, action_mask: numpy.ndarray) → numpy.ndarray¶
- Infer from policy. - Parameters
- trajectory – a list of ActionMaster 
- state – the current game state of observation + inventory 
- action_matrix – a matrix of all actions for the game, 2D array, each row represents a tokenized and indexed action. 
- action_len – 1D array, length for each action. 
- action_mask – 1D array, indices of admissible actions from all actions of the game. 
 
- Returns
- Q-values for actions in the action_matrix 
 
 - 
train_one_batch(pre_trajectories: List[List[deepword.agents.utils.ActionMaster]], post_trajectories: List[List[deepword.agents.utils.ActionMaster]], pre_states: Optional[List[deepword.agents.utils.ObsInventory]], post_states: Optional[List[deepword.agents.utils.ObsInventory]], action_matrix: List[numpy.ndarray], action_len: List[numpy.ndarray], pre_action_mask: List[numpy.ndarray], post_action_mask: List[numpy.ndarray], dones: List[bool], rewards: List[float], action_idx: List[int], b_weight: numpy.ndarray, step: int, others: Any) → numpy.ndarray¶
- Train the core with one batch of data. - Parameters
- pre_trajectories – previous trajectories 
- post_trajectories – post trajectories 
- pre_states – previous states 
- post_states – post states 
- action_matrix – all actions for each of previous trajectories 
- action_len – length of actions 
- pre_action_mask – action masks for each of previous trajectories 
- post_action_mask – action masks for each of post trajectories 
- dones – game terminated or not for post trajectories 
- rewards – rewards received for reaching post trajectories 
- action_idx – actions used for reaching post trajectories 
- b_weight – 1D array, weight for each data point 
- step – current training step 
- others – other information passed for training purpose 
 
 - Returns
- Absolute loss between expected Q-value and predicted Q-value for each data point 
 
 
- 
- 
class deepword.agents.cores.PGNCore(hp, model_dir)¶
- Bases: deepword.agents.cores.TFCore
- Generates admissible actions for games, given only the trajectory. - 
decode(trajectory: List[deepword.agents.utils.ActionMaster], beam_size: int, temperature: float, use_greedy: bool) → List[deepword.agents.utils.GenSummary]¶
 - 
generate_admissible_actions(trajectory: List[deepword.agents.utils.ActionMaster]) → List[str]¶
 - 
policy(trajectory: List[deepword.agents.utils.ActionMaster], state: Optional[deepword.agents.utils.ObsInventory], action_matrix: numpy.ndarray, action_len: numpy.ndarray, action_mask: numpy.ndarray) → numpy.ndarray¶
- Infer from policy. - Parameters
- trajectory – a list of ActionMaster 
- state – the current game state of observation + inventory 
- action_matrix – a matrix of all actions for the game, 2D array, each row represents a tokenized and indexed action. 
- action_len – 1D array, length for each action. 
- action_mask – 1D array, indices of admissible actions from all actions of the game. 
 
- Returns
- Q-values for actions in the action_matrix 
 
 - 
summary(action_idx: numpy.ndarray, col_eos_idx: numpy.ndarray, decoded_logits: numpy.ndarray, p_gen: numpy.ndarray, beam_size: int) → List[deepword.agents.utils.GenSummary]¶
- Return [ids, tokens, generation probabilities of each token, q_action], sorted by q_action (from largest to smallest). q_action is the average of the decoded logits of the selected tokens. 
 - 
train_one_batch(pre_trajectories: List[List[deepword.agents.utils.ActionMaster]], post_trajectories: List[List[deepword.agents.utils.ActionMaster]], pre_states: Optional[List[deepword.agents.utils.ObsInventory]], post_states: Optional[List[deepword.agents.utils.ObsInventory]], action_matrix: List[numpy.ndarray], action_len: List[numpy.ndarray], pre_action_mask: List[numpy.ndarray], post_action_mask: List[numpy.ndarray], dones: List[bool], rewards: List[float], action_idx: List[int], b_weight: numpy.ndarray, step: int, others: Any) → numpy.ndarray¶
- Train the core with one batch of data. - Parameters
- pre_trajectories – previous trajectories 
- post_trajectories – post trajectories 
- pre_states – previous states 
- post_states – post states 
- action_matrix – all actions for each of previous trajectories 
- action_len – length of actions 
- pre_action_mask – action masks for each of previous trajectories 
- post_action_mask – action masks for each of post trajectories 
- dones – game terminated or not for post trajectories 
- rewards – rewards received for reaching post trajectories 
- action_idx – actions used for reaching post trajectories 
- b_weight – 1D array, weight for each data point 
- step – current training step 
- others – other information passed for training purpose 
 
 - Returns
- Absolute loss between expected Q-value and predicted Q-value for each data point 
 
 
- 
- 
class deepword.agents.cores.TFCore(hp: tensorflow.contrib.training.python.training.hparam.HParams, model_dir: str)¶
- Bases: deepword.agents.cores.BaseCore, abc.ABC
- Agent core implemented with TensorFlow. - 
__init__(hp: tensorflow.contrib.training.python.training.hparam.HParams, model_dir: str) → None¶
- Parameters
- hp – hyper-parameters 
- model_dir – path to model dir 
 
 
 - 
batch_trajectory2input(trajectories: List[List[deepword.agents.utils.ActionMaster]]) → Tuple[List[List[int]], List[int]]¶
- Generate a batch of src and src_len, trimmed by hp.num_tokens - see deepword.agents.cores.TFCore.trajectory2input() - Parameters
- trajectories – a batch of trajectories 
- Returns
- batch of src; batch of src_len 
 
 - 
create_or_reload_target_model(restore_from: Optional[str] = None) → None¶
- Create the target model if not exists, then load model from the most recent saved weights. - Parameters
- restore_from – path to load target model, None falls back to default. 
 
 - 
init(is_training: bool, load_best: bool = False, restore_from: Optional[str] = None) → None¶
- Initialize the core. - create the model 
- load the model if there are saved models 
- create target model for training 
 - Parameters
- is_training – True for training, False for evaluation 
- load_best – load best model, otherwise load last weights 
- restore_from – specify the load path; load_best will be disabled 
 
 
 - 
safe_loading(model: deepword.models.models.DQNModel, sess: tensorflow.python.client.session.Session, saver: tensorflow.python.training.saver.Saver, restore_from: str) → int¶
- Load weights from restore_from into model. If the weights of the loaded model are incompatible with the current model, try to load only those weights that have the same name. - This method is useful when the saved model lacks the training part, e.g. the Adam optimizer. - Parameters
- model – A tensorflow model 
- sess – A tensorflow session 
- saver – A tensorflow saver 
- restore_from – the path to restore the model 
 
- Returns
- training steps 
 
 - 
save_best_model() → None¶
- Save current model to the best weights dir 
 - 
save_model(t: Optional[int] = None) → None¶
- Save model to model_dir with the number of training steps. - Parameters
- t – number of training steps, None falls back to global step 
 
 - 
set_d4eval(device: str) → None¶
- Set the device for evaluation, e.g. “/device:CPU:0” or “/device:GPU:1”; otherwise, a default device allocation will be used. - Parameters
- device – device name 
 
 - 
trajectory2input(trajectory: List[deepword.agents.utils.ActionMaster]) → Tuple[List[int], int]¶
- Generate src and src_len from a trajectory, trimmed by hp.num_tokens - Parameters
- trajectory – List of ActionMaster 
- Returns
- src: source token indices; src_len: length of the src 
 
 
- 
- 
class deepword.agents.cores.TabularCore(hp, model_dir)¶
- Bases: deepword.agents.cores.BaseCore
- Tabular DQN core that uses a matrix to store Q-vectors and uses hashed values of observation + inventory as game states; see the sketch below.
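- Examples
A minimal sketch of the tabular idea, under the assumption of a dict-of-rows Q-table; the real class manages its Q-matrix internally, so the names here are hypothetical.
>>> from collections import defaultdict
>>> q_table = defaultdict(lambda: np.zeros(n_actions))  # n_actions: hypothetical
>>> key = core.get_state_hash(state)  # hash of observation + inventory
>>> q_values = q_table[key]  # Q-vector for this hashed state
- 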
create_or_reload_target_model(restore_from: Optional[str] = None) → None¶
- Create (if it does not exist) or reload weights for the target model. - Parameters
- restore_from – the path to restore weights 
 
 - 
get_state_hash(state: deepword.agents.utils.ObsInventory) → str¶
 - 
init(is_training: bool, load_best: bool = False, restore_from: Optional[str] = None) → None¶
- Initialize models of the core. - Parameters
- is_training – training or evaluation 
- load_best – load from best weights, otherwise last weights 
- restore_from – path to restore 
 
 
 - 
policy(trajectory: List[deepword.agents.utils.ActionMaster], state: Optional[deepword.agents.utils.ObsInventory], action_matrix: numpy.ndarray, action_len: numpy.ndarray, action_mask: numpy.ndarray) → numpy.ndarray¶
- Infer from policy. - Parameters
- trajectory – a list of ActionMaster 
- state – the current game state of observation + inventory 
- action_matrix – a matrix of all actions for the game, 2D array, each row represents a tokenized and indexed action. 
- action_len – 1D array, length for each action. 
- action_mask – 1D array, indices of admissible actions from all actions of the game. 
 
- Returns
- Q-values for actions in the action_matrix 
 
 - 
save_model(t: Optional[int] = None) → None¶
- Save current model with training steps - Parameters
- t – training steps, None falls back to default global steps 
 
 - 
train_one_batch(pre_trajectories: List[List[deepword.agents.utils.ActionMaster]], post_trajectories: List[List[deepword.agents.utils.ActionMaster]], pre_states: Optional[List[deepword.agents.utils.ObsInventory]], post_states: Optional[List[deepword.agents.utils.ObsInventory]], action_matrix: List[numpy.ndarray], action_len: List[numpy.ndarray], pre_action_mask: List[numpy.ndarray], post_action_mask: List[numpy.ndarray], dones: List[bool], rewards: List[float], action_idx: List[int], b_weight: numpy.ndarray, step: int, others: Any) → numpy.ndarray¶
- Train the core with one batch of data. - Parameters
- pre_trajectories – previous trajectories 
- post_trajectories – post trajectories 
- pre_states – previous states 
- post_states – post states 
- action_matrix – all actions for each of previous trajectories 
- action_len – length of actions 
- pre_action_mask – action masks for each of previous trajectories 
- post_action_mask – action masks for each of post trajectories 
- dones – game terminated or not for post trajectories 
- rewards – rewards received for reaching post trajectories 
- action_idx – actions used for reaching post trajectories 
- b_weight – 1D array, weight for each data point 
- step – current training step 
- others – other information passed for training purpose 
 
 - Returns
- Absolute loss between expected Q-value and predicted Q-value for each data point 
 
 
- 
deepword.agents.dsqn_agent module¶
- 
class deepword.agents.dsqn_agent.DSQNAgent(hp, model_dir)¶
- Bases: deepword.agents.base_agent.BaseAgent
- BaseAgent with hs2tj: hash states pointing to trajectories, used for SNN training. - 
get_snn_pairs(batch_size: int) → Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray]¶
- Sample SNN pairs for SNN part training - Parameters
- batch_size – how many data points to generate. Notice that batch_size * 2 data points will be generated, one half for trajectory pairs with the same states; the other half for trajectory pairs with different states. 
- Returns
- src: trajectories; src_len: lengths of src; src2: the paired trajectories; src2_len: lengths of src2; labels: 0 for same states, 1 for different states 
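 - Examples
An illustrative usage sketch for one SNN training step; the variable names are assumptions.
>>> src, src_len, src2, src2_len, labels = agent.get_snn_pairs(batch_size=32)
>>> # batch_size * 2 = 64 data points: half same-state, half different-state pairs
>>> assert len(labels) == 64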
 
 - 
save_train_pairs(t: int, src: numpy.ndarray, src_len: numpy.ndarray, src2: numpy.ndarray, src2_len: numpy.ndarray, labels: numpy.ndarray) → None¶
- Save SNN pairs for verification. - Parameters
- t – current training steps 
- src – trajectories 
- src_len – length of trajectories 
- src2 – paired trajectories 
- src2_len – length of paired trajectories 
- labels – 0 or 1 for same or different states 
 
 
 
- 
- 
class deepword.agents.dsqn_agent.DSQNCompetitionAgent(hp, model_dir)¶
- Bases: deepword.agents.dsqn_agent.DSQNAgent, deepword.agents.competition_agent.CompetitionAgent
- 
class deepword.agents.dsqn_agent.DSQNZorkAgent(hp, model_dir)¶
- Bases: deepword.agents.dsqn_agent.DSQNAgent, deepword.agents.zork_agent.ZorkAgent
deepword.agents.gen_agent module¶
- 
class deepword.agents.gen_agent.GenDQNAgent(hp, model_dir)¶
- Bases: deepword.agents.base_agent.BaseAgent
- GenDQNAgent works with deepword.agents.cores.GenDQNCore.
deepword.agents.gen_drrn_agent module¶
- 
class deepword.agents.gen_drrn_agent.GenCompetitionDRRNAgent(hp, model_dir)¶
- 
class deepword.agents.gen_drrn_agent.GenDRRNAgent(hp, model_dir)¶
- Bases: deepword.agents.base_agent.BaseAgent
- We generate admissible actions at every step, then use DRRN to choose the best action to play. - This agent can be compared with the previous template-gen agent. 
deepword.agents.utils module¶
- 
class deepword.agents.utils.ActType(rnd, rule, rnd_walk, policy_drrn, policy_gen, jitter, policy_tbl)¶
- 
class deepword.agents.utils.ActionDesc(action_type, action_idx, token_idx, action_len, action, q_actions)¶
- 
class deepword.agents.utils.ActionMaster(action_ids: List[int], master_ids: List[int], action: str, master: str)¶
- Bases: object - 
property action¶
 - 
property action_ids¶
 - 
property ids¶
 - 
property lens¶
 - 
property master¶
 - 
property master_ids¶
 
- 
- 
class deepword.agents.utils.CommonActs(examine_cookbook, prepare_meal, eat_meal, look, inventory, gn, gs, ge, gw)¶
- 
class deepword.agents.utils.EnvInfosKey(recipe, desc, inventory, max_score, won, lost, actions, templates, verbs, entities)¶
- Bases: deepword.agents.utils.KeyInfo
- 
class deepword.agents.utils.GenSummary(ids, tokens, gens, q_action, len)¶
- 
class deepword.agents.utils.LinearDecayedEPS(decay_step, init_eps=1, final_eps=0)¶
- Bases: deepword.agents.utils.ScheduledEPS - 
eps(t)¶
 
- 
- 
class deepword.agents.utils.Memolet(tid, sid, gid, aid, token_id, a_len, a_type, reward, is_terminal, action_mask, sys_action_mask, next_action_mask, next_sys_action_mask, q_actions)¶
- Bases: deepword.agents.utils.Memolet
- end_of_episode: the game stops by 1) winning, 2) losing, or 3) exceeding the maximum number of steps. is_terminal: whether the current step reaches a terminal game state by winning or losing. is_terminal = True means that for the current step, the Q-value equals the instant reward. - TODO: Notice that end_of_episode doesn't imply is_terminal; only winning or losing means is_terminal = True. 
 
- 
class deepword.agents.utils.ObsInventory(obs, inventory, sid, hs)¶
- 
class deepword.agents.utils.ScannerDecayEPS(decay_step, decay_range, next_init_eps_rate=0.8, init_eps=1, final_eps=0)¶
- Bases: deepword.agents.utils.ScheduledEPS - 
eps(t)¶
 
- 
- 
class deepword.agents.utils.ScheduledEPS(name: Optional[str] = None)¶
- Bases: deepword.log.Logging - 
eps(t)¶
 
- 
- 
deepword.agents.utils.batch_drrn_action_input(action_matrices: List[numpy.ndarray], action_lens: List[numpy.ndarray], action_masks: List[numpy.ndarray]) → Tuple[numpy.ndarray, numpy.ndarray, List[int], List[Dict[int, int]]]¶
- Select actions from action_masks in a batch 
- 
deepword.agents.utils.bert_commonsense_input(action_matrix: numpy.ndarray, action_len: numpy.ndarray, trajectory: List[int], trajectory_len: int, sep_val_id: int, cls_val_id: int, num_tokens: int) → Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]¶
- Given one trajectory and its admissible actions, create a training set of inputs for BERT. - Notice: trajectory_len and action_len need to reserve positions for the special tokens, e.g. [CLS] and [SEP]. - E.g. given input [1, 2, 3] and action_matrix [[1, 3], [2, PAD], [4, PAD]], suppose we need the total length to be 11. Output:
- [[CLS, 1, 2, 3, SEP, 1, 3, SEP, PAD, PAD, PAD],
- [CLS, 1, 2, 3, SEP, 2, SEP, PAD, PAD, PAD, PAD],
- [CLS, 1, 2, 3, SEP, 4, SEP, PAD, PAD, PAD, PAD]]
 - Segments of trajectory and actions:
- [[0, 0, 0, 0, 0, 1, 1, 1],
- [0, 0, 0, 0, 0, 1, 1, 0],
- [0, 0, 0, 0, 0, 1, 1, 0]]
 - Input sizes: [8, 7, 7] - Returns
- trajectory + action; segmentation ids; sizes 
 
- 
deepword.agents.utils.categorical_without_replacement(logits, k=1)¶
- Courtesy of https://github.com/tensorflow/tensorflow/issues/9260#issuecomment-437875125, also cited here: @misc{vieira2014gumbel, title = {Gumbel-max trick and weighted reservoir sampling}, author = {Tim Vieira}, url = {http://timvieira.github.io/blog/post/2014/08/01/gumbel-max-trick-and-weighted-reservoir-sampling/}, year = {2014}} - Notice that the logits represent unnormalized log probabilities; per the citation above, there is no need to normalize them before adding the Gumbel random variate, which surprises me, since I thought it should be logits - tf.reduce_logsumexp(logits) + z. 
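 - Examples
A minimal NumPy sketch of the Gumbel-top-k trick this function relies on; the actual implementation uses TensorFlow ops, so this is illustrative only.
>>> import numpy as np
>>> def gumbel_top_k(logits, k=1):
>>>     # perturb each logit with i.i.d. Gumbel(0, 1) noise
>>>     z = -np.log(-np.log(np.random.uniform(size=np.shape(logits))))
>>>     # the indices of the k largest perturbed logits form a sample
>>>     # without replacement from softmax(logits)
>>>     return np.argsort(logits + z)[::-1][:k]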
- 
deepword.agents.utils.drrn_action_input(action_matrix: numpy.ndarray, action_len: numpy.ndarray, action_mask: numpy.ndarray) → Tuple[numpy.ndarray, numpy.ndarray, int, Dict[int, int]]¶
- Select actions from action_mask. - Parameters
- action_matrix – action matrix for a game 
- action_len – lengths for actions in the action_matrix 
- action_mask – list of indices of selected actions 
 
- Returns
- selected action matrix, selected action len, number of actions selected, and the mapping from real ID to mask ID 
 - real ID: the action index in the original action_matrix; mask ID: the action index in the action_mask 
 - Examples
>>> a_mat = np.asarray([
>>>     [1, 2, 3, 4, 0],
>>>     [2, 2, 1, 3, 1],
>>>     [3, 1, 0, 0, 0],
>>>     [6, 9, 9, 1, 0]])
>>> a_len = np.asarray([4, 5, 2, 4])
>>> a_mask = np.asarray([1, 3])
>>> drrn_action_input(a_mat, a_len, a_mask)
[[2, 2, 1, 3, 1], [6, 9, 9, 1, 0]]
[5, 4]
{1: 0, 3: 1}
- 
deepword.agents.utils.get_action_idx_pair(action_matrix: numpy.ndarray, action_len: numpy.ndarray, sos_id: int, eos_id: int) → Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]¶
- Create action index pairs for seq2seq training. Given action indices, e.g. [1, 2, 3, 4, pad, pad, pad, pad], with 0 as sos_id and -1 as eos_id, we create the training pair [0, 1, 2, 3, 4, pad, pad, pad] as the input sentence and [1, 2, 3, 4, -1, pad, pad, pad] as the output sentence. - Notice that we remove the final pad to keep the action length unchanged. Notice also that pad should be indexed as 0. - Parameters
- action_matrix – np array of action index of N * K, there are N, and each of them has a length of K (with paddings). 
- action_len – length of each action (remove paddings). 
- sos_id – index of the start-of-sentence token 
- eos_id – index of the end-of-sentence token 
 
- Returns
- action index as input, action index as output, new action len 
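 - Examples
A doctest-style sketch built from the docstring's own example (sos_id=0, eos_id=-1, pad=0); the exact output formatting is illustrative.
>>> a_mat = np.asarray([[1, 2, 3, 4, 0, 0, 0, 0]])
>>> a_len = np.asarray([4])
>>> get_action_idx_pair(a_mat, a_len, sos_id=0, eos_id=-1)
[[0, 1, 2, 3, 4, 0, 0, 0]], [[1, 2, 3, 4, -1, 0, 0, 0]], [5]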
 
- 
deepword.agents.utils.get_best_1d_q(q_actions: numpy.ndarray) → Tuple[int, float]¶
- Find the best Q-value given a 1D Q-vector - Parameters
- q_actions – a vector of Q-values 
- Returns
- best action index, Q-value 
 - Examples
>>> q_vec = np.asarray([0.1, 0.2, 0.3, 0.4])
>>> get_best_1d_q(q_vec)
(3, 0.4)
- 
deepword.agents.utils.get_best_batch_ids(q_actions: numpy.ndarray, actions_repeats: List[int]) → List[int]¶
- Get the best action index of Q-values for each group defined by actions_repeats 
 - Parameters
- q_actions – a 1D Q-vector 
- actions_repeats – groups of number of actions, indicating how many elements are in the same group. 
 
- Returns
- best action index for each group 
 - Examples
>>> q_vec = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> repeats = [3, 4, 3]
>>> # the Q-vector splits into three groups containing 3, 4, 3 Q-values:
>>> # shaded_qs = [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]]
>>> get_best_batch_ids(np.asarray(q_vec), repeats)
[3, 7, 10]
- 
deepword.agents.utils.get_hash_state(obs: str, inv: str) → str¶
- Generate a hash state from observation and inventory - Parameters
- obs – observation of current step 
- inv – inventory of current step 
 - Returns
- hash state of current step 
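 - Examples
A minimal sketch of one way such a hash could be computed; the actual hashing scheme is not specified here, so this is an assumption for illustration.
>>> import hashlib
>>> def get_hash_state(obs, inv):
>>>     # hash the concatenated observation and inventory texts
>>>     return hashlib.sha256((obs + "\n" + inv).encode("utf-8")).hexdigest()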
 
- 
deepword.agents.utils.get_path_tags(path: str, prefix: str) → List[int]¶
- Get tags from a path of saved objects. E.g. given actions-100.npz, 100 will be extracted. Make sure the items to be extracted are saved with the npz suffix. - Parameters
- path – path to find files with prefix 
- prefix – prefix 
 
- Returns
- list of all tags 
 - Examples
>>> # suppose there are these files:
>>> # actions-99.npz, actions-100.npz, actions-200.npz
>>> get_path_tags("/path/to/data", "actions")
[99, 100, 200]
- 
deepword.agents.utils.get_snn_keys(hash_states2tjs: Dict[str, Dict[int, List[int]]], tjs: deepword.trajectory.Trajectory, size: int) → Tuple[List[Tuple[int, int]], List[Tuple[int, int]], List[Tuple[int, int]]]¶
- Get SNN training pairs from trajectories. - Parameters
- hash_states2tjs – the mapping from hash state to trajectory 
- tjs – the trajectories 
- size – batch size 
 
- Returns
- target_set, same_set, and diff_set; each set contains keys of (tid, sid) to locate a trajectory 
 
- 
deepword.agents.utils.id_real2batch(real_id: List[int], id_real2mask: List[Dict[int, int]], actions_repeats: List[int]) → List[int]¶
- Transform real IDs into batch IDs. - There are three ID systems for actions, depending on where the action is located:
- In the action matrix of the game: real ID. E.g. a game with three actions [“go east”, “go west”, “eat meal”] has real IDs [0, 1, 2]. 
- In the action mask for each step of game-playing: mask ID. E.g. when playing a step with admissible actions [“go east”, “eat meal”], the mask IDs are [0, 1], mapping to real IDs [0, 2]. 
- In a batch for training: batch ID. E.g. take a batch of 2 entries, each from a different game, say game-1 and game-2. Game-1, at its step of playing, has two actions, say [0, 2]; game-2, at its step, has three actions, say [0, 4, 10]. Suppose the agent chooses action-0 from game-1 for entry-1 and action-4 from game-2 for entry-2. The real IDs are [0, 4], but the mask IDs are [0, 1]. Why does action-4 become action-1? Because for that step of game-2 there are only three actions [0, 4, 10], and action-4 is placed at position 1. Converting the mask IDs to batch IDs, we get [0, 3]. Why does action-1 become action-3? Because placing the actions (mask IDs) of entry-1 and entry-2 together gives [[0, 1], [0, 1, 2]]; flattening yields [0, 1, 0, 1, 2], which is re-indexed as [0, 1, 2, 3, 4], so action-1 maps to action-3 for entry-2. 
- Parameters
- real_id – action ids for each game in the original action_matrix of that game 
- id_real2mask – list of mappings from real IDs to mask IDs 
- actions_repeats – action sizes in each group 
 
- Returns
- a list of batch IDs 
 - Examples
>>> rids = [0, 4]
>>> id_maps = [{0: 0, 2: 1}, {0: 0, 4: 1, 10: 2}]
>>> repeats = [2, 3]
>>> id_real2batch(rids, id_maps, repeats)
[0, 3]
- 
deepword.agents.utils.remove_zork_version_info(text)¶
- 
deepword.agents.utils.sample_batch_ids(q_actions: numpy.ndarray, actions_repeats: List[int], k: int) → List[int]¶
- Get a batch of sampled action indices of Q-values. actions_repeats indicates how many elements are in each group. E.g. if q_actions = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] and actions_repeats = [3, 4, 3], then q_actions splits into three groups: [1, 2, 3], [4, 5, 6, 7], [8, 9, 10]. - We sample from the indices: the best index of each group comes first in that group, then another k - 1 elements are sampled for each group. If the number of elements in a group is smaller than k - 1, we sample with replacement. 
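 - Examples
An illustrative sketch; everything after the first index of each group is stochastic, so the values shown are just one possible draw.
>>> q_vec = np.asarray([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
>>> sample_batch_ids(q_vec, actions_repeats=[3, 4, 3], k=2)
[2, 0, 6, 4, 9, 7]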
deepword.agents.zork_agent module¶
- 
class deepword.agents.zork_agent.ZorkAgent(hp, model_dir)¶
- Bases: deepword.agents.base_agent.BaseAgent
- The agent to run Zork. - TextWorld does not provide admissible actions for Zork as it does for the cooking games, so a loaded action file is required. 
 
