deepword.agents package¶
Submodules¶
deepword.agents.base_agent module¶
-
class
deepword.agents.base_agent.BaseAgent(hp: tensorflow.contrib.training.python.training.hparam.HParams, model_dir: str)¶ Bases:
deepword.log.Logging
Base agent class that uses:
action collector
trajectory collector
floor plan collector
tree memory storage and sampling
-
__init__(hp: tensorflow.contrib.training.python.training.hparam.HParams, model_dir: str) → None¶ Initialize a base agent
- Parameters
hp – hyper-parameters, refer to
deepword.hparams
model_dir – path to model dir
-
act(obs: List[str], scores: List[int], dones: List[bool], infos: Dict[str, List[Any]]) → Optional[List[str]]¶ Acts upon the current list of observations. One text command must be returned for each observation.
- Parameters
obs – observed texts for each game
scores – score obtained so far for each game
dones – whether a game is finished
infos – extra information requested from TextWorld
- Returns
If all games are done, return None; otherwise return a list of actions.
Notes
Commands returned for games marked as done have no effect. The states for finished games are simply copied over until all games are done.
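A minimal game-loop sketch around act(), assuming a gym-style TextWorld batch environment env and an already constructed agent (both hypothetical here):
agent.train()                                    # or agent.eval(load_best=True) for evaluation
obs, infos = env.reset()                         # hypothetical batch environment
scores = [0] * len(obs)
dones = [False] * len(obs)
while not all(dones):
    commands = agent.act(obs, scores, dones, infos)   # one text command per game
    obs, scores, dones, infos = env.step(commands)
agent.act(obs, scores, dones, infos)             # returns None once every game is done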
-
eval(load_best=True) → None¶ Call eval() before performing evaluation.
- Parameters
load_best – load from the best weights, otherwise from last weights
-
property
negative_scores¶ Total negative scores
-
property
positive_scores¶ Total positive scores earned
-
reset(restore_from: Optional[str] = None) → None¶ reset is only used for evaluation during training; do not use it anywhere else.
- Parameters
restore_from – where to restore the model, None goes to default
-
save_snapshot() → None¶
-
classmethod
select_additional_infos() → textworld.core.EnvInfos¶ Additional information needed when playing the game. The infos requested here are required to run the agent.
-
train() → None¶ Call train() before performing training.
deepword.agents.competition_agent module¶
-
class
deepword.agents.competition_agent.CompetitionAgent(hp, model_dir)¶ Bases:
deepword.agents.base_agent.BaseAgent
The agent built for participating in the TextWorld competition. Includes action filtering and rule-based policies.
deepword.agents.cores module¶
-
class
deepword.agents.cores.BaseCore(hp: tensorflow.contrib.training.python.training.hparam.HParams, model_dir: str)¶ Bases:
deepword.log.Logging, abc.ABC
Core: used by agents to compute the policy. Core objects are isolated from games and gaming platforms. They work with agents, receiving trajectories and actions, and then compute a policy for the agents.
How to get trajectories and actions, and how to choose actions given a policy, is decided by the agents.
-
__init__(hp: tensorflow.contrib.training.python.training.hparam.HParams, model_dir: str) → None¶ Initialize a core for an agent.
- Parameters
hp – hyper-parameters, see
deepword.hparams
model_dir – path to save or load model
-
create_or_reload_target_model(restore_from: Optional[str] = None) → None¶ Create (if not exist) or reload weights for the target model
- Parameters
restore_from – the path to restore weights
-
init(is_training: bool, load_best: bool = False, restore_from: Optional[str] = None) → None¶ Initialize models of the core.
- Parameters
is_training – training or evaluation
load_best – load from best weights, otherwise last weights
restore_from – path to restore
-
policy(trajectory: List[deepword.agents.utils.ActionMaster], state: Optional[deepword.agents.utils.ObsInventory], action_matrix: numpy.ndarray, action_len: numpy.ndarray, action_mask: numpy.ndarray) → numpy.ndarray¶ Infer from policy.
- Parameters
trajectory – a list of ActionMaster
state – the current game state of observation + inventory
action_matrix – a matrix of all actions for the game, 2D array, each row represents a tokenized and indexed action.
action_len – 1D array, length for each action.
action_mask – 1D array, indices of admissible actions from all actions of the game.
- Returns
Q-values for actions in the action_matrix
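For illustration, a greedy choice over the returned Q-vector might look like the following sketch (core and the input tensors are assumed to be prepared by the agent; how the vector is indexed depends on the concrete core):
import numpy as np

q_values = core.policy(trajectory, state, action_matrix, action_len, action_mask)
best_idx = int(np.argmax(q_values))   # greedy action choice; an epsilon-greedy agent may instead explore randomly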
-
save_model(t: Optional[int] = None) → None¶ Save current model with training steps
- Parameters
t – training steps, None falls back to default global steps
-
train_one_batch(pre_trajectories: List[List[deepword.agents.utils.ActionMaster]], post_trajectories: List[List[deepword.agents.utils.ActionMaster]], pre_states: Optional[List[deepword.agents.utils.ObsInventory]], post_states: Optional[List[deepword.agents.utils.ObsInventory]], action_matrix: List[numpy.ndarray], action_len: List[numpy.ndarray], pre_action_mask: List[numpy.ndarray], post_action_mask: List[numpy.ndarray], dones: List[bool], rewards: List[float], action_idx: List[int], b_weight: numpy.ndarray, step: int, others: Any) → numpy.ndarray¶ Train the core with one batch of data.
- Parameters
pre_trajectories – previous trajectories
post_trajectories – post trajectories
pre_states – previous states
post_states – post states
action_matrix – all actions for each of previous trajectories
action_len – length of actions
pre_action_mask – action masks for each of previous trajectories
post_action_mask – action masks for each of post trajectories
dones – game terminated or not for post trajectories
rewards – rewards received for reaching post trajectories
action_idx – actions used for reaching post trajectories
b_weight – 1D array, weight for each data point
step – current training step
others – other information passed for training purpose
- Returns
Absolute loss between expected Q-value and predicted Q-value for each data point
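The per-example absolute loss returned here is typically fed back into a prioritized replay buffer; a hedged sketch (the buffer memo, its update_priorities method, and batch_ids are hypothetical names):
abs_loss = core.train_one_batch(
    pre_trajectories, post_trajectories, pre_states, post_states,
    action_matrix, action_len, pre_action_mask, post_action_mask,
    dones, rewards, action_idx, b_weight, step, others=None)
memo.update_priorities(batch_ids, abs_loss)   # hypothetical prioritized-replay priority update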
-
-
class
deepword.agents.cores.DQNCore(hp, model_dir)¶ Bases:
deepword.agents.cores.TFCore
DQN agent that treats actions as types.
-
policy(trajectory: List[deepword.agents.utils.ActionMaster], state: Optional[deepword.agents.utils.ObsInventory], action_matrix: numpy.ndarray, action_len: numpy.ndarray, action_mask: numpy.ndarray) → numpy.ndarray¶ Get either a random action index with its action string, or the best predicted action index with its action string.
-
train_one_batch(pre_trajectories: List[List[deepword.agents.utils.ActionMaster]], post_trajectories: List[List[deepword.agents.utils.ActionMaster]], pre_states: Optional[List[deepword.agents.utils.ObsInventory]], post_states: Optional[List[deepword.agents.utils.ObsInventory]], action_matrix: List[numpy.ndarray], action_len: List[numpy.ndarray], pre_action_mask: List[numpy.ndarray], post_action_mask: List[numpy.ndarray], dones: List[bool], rewards: List[float], action_idx: List[int], b_weight: numpy.ndarray, step: int, others: Any) → numpy.ndarray¶ Train the core with one batch of data.
- Parameters
pre_trajectories – previous trajectories
post_trajectories – post trajectories
pre_states – previous states
post_states – post states
action_matrix – all actions for each of previous trajectories
action_len – length of actions
pre_action_mask – action masks for each of previous trajectories
post_action_mask – action masks for each of post trajectories
dones – game terminated or not for post trajectories
rewards – rewards received for reaching post trajectories
action_idx – actions used for reaching post trajectories
b_weight – 1D array, weight for each data point
step – current training step
others – other information passed for training purpose
- Returns
Absolute loss between expected Q-value and predicted Q-value for each data point
-
-
class
deepword.agents.cores.DRRNCore(hp, model_dir)¶ Bases:
deepword.agents.cores.TFCore
DRRN agent that treats actions as meaningful sentences.
-
policy(trajectory: List[deepword.agents.utils.ActionMaster], state: Optional[deepword.agents.utils.ObsInventory], action_matrix: numpy.ndarray, action_len: numpy.ndarray, action_mask: numpy.ndarray) → numpy.ndarray¶ Get either a random action index with its action string, or the best predicted action index with its action string.
-
train_one_batch(pre_trajectories: List[List[deepword.agents.utils.ActionMaster]], post_trajectories: List[List[deepword.agents.utils.ActionMaster]], pre_states: Optional[List[deepword.agents.utils.ObsInventory]], post_states: Optional[List[deepword.agents.utils.ObsInventory]], action_matrix: List[numpy.ndarray], action_len: List[numpy.ndarray], pre_action_mask: List[numpy.ndarray], post_action_mask: List[numpy.ndarray], dones: List[bool], rewards: List[float], action_idx: List[int], b_weight: numpy.ndarray, step: int, others: Any) → numpy.ndarray¶ Train the core with one batch of data.
- Parameters
pre_trajectories – previous trajectories
post_trajectories – post trajectories
pre_states – previous states
post_states – post states
action_matrix – all actions for each of previous trajectories
action_len – length of actions
pre_action_mask – action masks for each of previous trajectories
post_action_mask – action masks for each of post trajectories
dones – game terminated or not for post trajectories
rewards – rewards received for reaching post trajectories
action_idx – actions used for reaching post trajectories
b_weight – 1D array, weight for each data point
step – current training step
others – other information passed for training purpose
- Returns
Absolute loss between expected Q-value and predicted Q-value for each data point
-
-
class
deepword.agents.cores.DSQNCore(hp, model_dir)¶ Bases:
deepword.agents.cores.DRRNCore
-
eval_snn(snn_data: Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray], batch_size: int = 32) → float¶
-
train_one_batch(pre_trajectories: List[List[deepword.agents.utils.ActionMaster]], post_trajectories: List[List[deepword.agents.utils.ActionMaster]], pre_states: Optional[List[deepword.agents.utils.ObsInventory]], post_states: Optional[List[deepword.agents.utils.ObsInventory]], action_matrix: List[numpy.ndarray], action_len: List[numpy.ndarray], pre_action_mask: List[numpy.ndarray], post_action_mask: List[numpy.ndarray], dones: List[bool], rewards: List[float], action_idx: List[int], b_weight: numpy.ndarray, step: int, others: Any) → numpy.ndarray¶ Train the core with one batch of data.
- Parameters
pre_trajectories – previous trajectories
post_trajectories – post trajectories
pre_states – previous states
post_states – post states
action_matrix – all actions for each of previous trajectories
action_len – length of actions
pre_action_mask – action masks for each of previous trajectories
post_action_mask – action masks for each of post trajectories
dones – game terminated or not for post trajectories
rewards – rewards received for reaching post trajectories
action_idx – actions used for reaching post trajectories
b_weight – 1D array, weight for each data point
step – current training step
others – other information passed for training purpose
- Returns
Absolute loss between expected Q-value and predicted Q-value for each data point
-
-
class
deepword.agents.cores.DSQNZorkCore(hp, model_dir)¶ Bases:
deepword.agents.cores.DQNCore
-
eval_snn(snn_data: Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray], batch_size: int = 32) → float¶
-
train_one_batch(pre_trajectories: List[List[deepword.agents.utils.ActionMaster]], post_trajectories: List[List[deepword.agents.utils.ActionMaster]], pre_states: Optional[List[deepword.agents.utils.ObsInventory]], post_states: Optional[List[deepword.agents.utils.ObsInventory]], action_matrix: List[numpy.ndarray], action_len: List[numpy.ndarray], pre_action_mask: List[numpy.ndarray], post_action_mask: List[numpy.ndarray], dones: List[bool], rewards: List[float], action_idx: List[int], b_weight: numpy.ndarray, step: int, others: Any) → numpy.ndarray¶ Train the core with one batch of data.
- Parameters
pre_trajectories – previous trajectories
post_trajectories – post trajectories
pre_states – previous states
post_states – post states
action_matrix – all actions for each of previous trajectories
action_len – length of actions
pre_action_mask – action masks for each of previous trajectories
post_action_mask – action masks for each of post trajectories
dones – game terminated or not for post trajectories
rewards – rewards received for reaching post trajectories
action_idx – actions used for reaching post trajectories
b_weight – 1D array, weight for each data point
step – current training step
others – other information passed for training purpose
- Returns
Absolute loss between expected Q-value and predicted Q-value for each data point
-
-
class
deepword.agents.cores.GenDQNCore(hp, model_dir)¶ Bases:
deepword.agents.cores.TFCore
-
decode_action(trajectory: List[deepword.agents.utils.ActionMaster]) → deepword.agents.utils.GenSummary¶
-
policy(trajectory: List[deepword.agents.utils.ActionMaster], state: Optional[deepword.agents.utils.ObsInventory], action_matrix: numpy.ndarray, action_len: numpy.ndarray, action_mask: numpy.ndarray) → numpy.ndarray¶ Infer from policy.
- Parameters
trajectory – a list of ActionMaster
state – the current game state of observation + inventory
action_matrix – a matrix of all actions for the game, 2D array, each row represents a tokenized and indexed action.
action_len – 1D array, length for each action.
action_mask – 1D array, indices of admissible actions from all actions of the game.
- Returns
Q-values for actions in the action_matrix
-
summary(token_idx: numpy.ndarray, col_eos_idx: numpy.ndarray, p_gen: numpy.ndarray, sum_logits: numpy.ndarray) → List[deepword.agents.utils.GenSummary]¶
-
train_one_batch(pre_trajectories: List[List[deepword.agents.utils.ActionMaster]], post_trajectories: List[List[deepword.agents.utils.ActionMaster]], pre_states: Optional[List[deepword.agents.utils.ObsInventory]], post_states: Optional[List[deepword.agents.utils.ObsInventory]], action_matrix: List[numpy.ndarray], action_len: List[numpy.ndarray], pre_action_mask: List[numpy.ndarray], post_action_mask: List[numpy.ndarray], dones: List[bool], rewards: List[float], action_idx: List[int], b_weight: numpy.ndarray, step: int, others: Any) → numpy.ndarray¶ Train the core with one batch of data.
- Parameters
pre_trajectories – previous trajectories
post_trajectories – post trajectories
pre_states – previous states
post_states – post states
action_matrix – all actions for each of previous trajectories
action_len – length of actions
pre_action_mask – action masks for each of previous trajectories
post_action_mask – action masks for each of post trajectories
dones – game terminated or not for post trajectories
rewards – rewards received for reaching post trajectories
action_idx – actions used for reaching post trajectories
b_weight – 1D array, weight for each data point
step – current training step
others – other information passed for training purpose
- Returns
Absolute loss between expected Q-value and predicted Q-value for each data point
-
-
class
deepword.agents.cores.NLUCore(hp, model_dir)¶ Bases:
deepword.agents.cores.TFCore
The agent that explores the commonsense ability of BERT models. This agent combines each trajectory with all of its actions, separated by [SEP] in the middle, then feeds the sentence into BERT to get a score from the [CLS] token. Refer to https://arxiv.org/pdf/1810.04805.pdf for fine-tuning and evaluation.
-
policy(trajectory: List[deepword.agents.utils.ActionMaster], state: Optional[deepword.agents.utils.ObsInventory], action_matrix: numpy.ndarray, action_len: numpy.ndarray, action_mask: numpy.ndarray) → numpy.ndarray¶ Infer from policy.
- Parameters
trajectory – a list of ActionMaster
state – the current game state of observation + inventory
action_matrix – a matrix of all actions for the game, 2D array, each row represents a tokenized and indexed action.
action_len – 1D array, length for each action.
action_mask – 1D array, indices of admissible actions from all actions of the game.
- Returns
Q-values for actions in the action_matrix
-
train_one_batch(pre_trajectories: List[List[deepword.agents.utils.ActionMaster]], post_trajectories: List[List[deepword.agents.utils.ActionMaster]], pre_states: Optional[List[deepword.agents.utils.ObsInventory]], post_states: Optional[List[deepword.agents.utils.ObsInventory]], action_matrix: List[numpy.ndarray], action_len: List[numpy.ndarray], pre_action_mask: List[numpy.ndarray], post_action_mask: List[numpy.ndarray], dones: List[bool], rewards: List[float], action_idx: List[int], b_weight: numpy.ndarray, step: int, others: Any) → numpy.ndarray¶ Train the core with one batch of data.
- Parameters
pre_trajectories – previous trajectories
post_trajectories – post trajectories
pre_states – previous states
post_states – post states
action_matrix – all actions for each of previous trajectories
action_len – length of actions
pre_action_mask – action masks for each of previous trajectories
post_action_mask – action masks for each of post trajectories
dones – game terminated or not for post trajectories
rewards – rewards received for reaching post trajectories
action_idx – actions used for reaching post trajectories
b_weight – 1D array, weight for each data point
step – current training step
others – other information passed for training purpose
- Returns
Absolute loss between expected Q-value and predicted Q-value for each data point
-
-
class
deepword.agents.cores.PGNCore(hp, model_dir)¶ Bases:
deepword.agents.cores.TFCore
Generate admissible actions for games, given only the trajectory.
-
decode(trajectory: List[deepword.agents.utils.ActionMaster], beam_size: int, temperature: float, use_greedy: bool) → List[deepword.agents.utils.GenSummary]¶
-
generate_admissible_actions(trajectory: List[deepword.agents.utils.ActionMaster]) → List[str]¶
-
policy(trajectory: List[deepword.agents.utils.ActionMaster], state: Optional[deepword.agents.utils.ObsInventory], action_matrix: numpy.ndarray, action_len: numpy.ndarray, action_mask: numpy.ndarray) → numpy.ndarray¶ Infer from policy.
- Parameters
trajectory – a list of ActionMaster
state – the current game state of observation + inventory
action_matrix – a matrix of all actions for the game, 2D array, each row represents a tokenized and indexed action.
action_len – 1D array, length for each action.
action_mask – 1D array, indices of admissible actions from all actions of the game.
- Returns
Q-values for actions in the action_matrix
-
summary(action_idx: numpy.ndarray, col_eos_idx: numpy.ndarray, decoded_logits: numpy.ndarray, p_gen: numpy.ndarray, beam_size: int) → List[deepword.agents.utils.GenSummary]¶ Return [ids, tokens, generation probabilities of each token, q_action], sorted by q_action (from larger to smaller). q_action is the average of the decoded logits of the selected tokens.
-
train_one_batch(pre_trajectories: List[List[deepword.agents.utils.ActionMaster]], post_trajectories: List[List[deepword.agents.utils.ActionMaster]], pre_states: Optional[List[deepword.agents.utils.ObsInventory]], post_states: Optional[List[deepword.agents.utils.ObsInventory]], action_matrix: List[numpy.ndarray], action_len: List[numpy.ndarray], pre_action_mask: List[numpy.ndarray], post_action_mask: List[numpy.ndarray], dones: List[bool], rewards: List[float], action_idx: List[int], b_weight: numpy.ndarray, step: int, others: Any) → numpy.ndarray¶ Train the core with one batch of data.
- Parameters
pre_trajectories – previous trajectories
post_trajectories – post trajectories
pre_states – previous states
post_states – post states
action_matrix – all actions for each of previous trajectories
action_len – length of actions
pre_action_mask – action masks for each of previous trajectories
post_action_mask – action masks for each of post trajectories
dones – game terminated or not for post trajectories
rewards – rewards received for reaching post trajectories
action_idx – actions used for reaching post trajectories
b_weight – 1D array, weight for each data point
step – current training step
others – other information passed for training purpose
- Returns
Absolute loss between expected Q-value and predicted Q-value for each data point
-
-
class
deepword.agents.cores.TFCore(hp: tensorflow.contrib.training.python.training.hparam.HParams, model_dir: str)¶ Bases:
deepword.agents.cores.BaseCore, abc.ABC
Agent core implemented with TensorFlow.
-
__init__(hp: tensorflow.contrib.training.python.training.hparam.HParams, model_dir: str) → None¶ - Parameters
hp – hyper-parameters
model_dir – path to model dir
-
batch_trajectory2input(trajectories: List[List[deepword.agents.utils.ActionMaster]]) → Tuple[List[List[int]], List[int]]¶ Generate a batch of src and src_len, trimmed to hp.num_tokens.
see
deepword.agents.cores.TFCore.trajectory2input()
- Parameters
trajectories – a batch of trajectories
- Returns
batch of src, batch of src_len
-
create_or_reload_target_model(restore_from: Optional[str] = None) → None¶ Create the target model if it does not exist, then load weights from the most recently saved checkpoint.
- Parameters
restore_from – path to load target model, None falls back to default.
-
init(is_training: bool, load_best: bool = False, restore_from: Optional[str] = None) → None¶ Initialize the core.
create the model
load the model if there are saved models
create target model for training
- Parameters
is_training – True for training, False for evaluation
load_best – load best model, otherwise load last weights
restore_from – specify the load path; if set, load_best is ignored
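A lifecycle sketch for a concrete core (DRRNCore is used only as an example; hp, model_dir, and the policy inputs are assumed to be prepared elsewhere, e.g. by an agent):
core = DRRNCore(hp, model_dir)
core.init(is_training=False, load_best=True)    # build the model and restore the best weights
q_values = core.policy(trajectory, state, action_matrix, action_len, action_mask)
# for training, use core.init(is_training=True) and checkpoint with core.save_model(t=step)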
-
safe_loading(model: deepword.models.models.DQNModel, sess: tensorflow.python.client.session.Session, saver: tensorflow.python.training.saver.Saver, restore_from: str) → int¶ Load weights from restore_from into model. If the weights in the loaded model are incompatible with the current model, try to load only those weights that have the same name.
This method is useful when the saved model lacks the training part, e.g. the Adam optimizer variables.
- Parameters
model – A tensorflow model
sess – A tensorflow session
saver – A tensorflow saver
restore_from – the path to restore the model
- Returns
training steps
-
save_best_model() → None¶ Save current model to the best weights dir
-
save_model(t: Optional[int] = None) → None¶ Save model to model_dir with the number of training steps.
- Parameters
t – number of training steps, None falls back to global step
-
set_d4eval(device: str) → None¶ Set the device for evaluation, e.g. “/device:CPU:0” or “/device:GPU:1”. If not set, a default device allocation will be used.
- Parameters
device – device name
-
trajectory2input(trajectory: List[deepword.agents.utils.ActionMaster]) → Tuple[List[int], int]¶ Generate src and src_len from a trajectory, trimmed to hp.num_tokens.
- Parameters
trajectory – List of ActionMaster
- Returns
src – source indices
src_len – length of the src
-
-
class
deepword.agents.cores.TabularCore(hp, model_dir)¶ Bases:
deepword.agents.cores.BaseCore
Tabular DQN agent that uses a matrix to store Q-vectors and uses hashed values of observation + inventory as game states.
-
create_or_reload_target_model(restore_from: Optional[str] = None) → None¶ Create (if not exist) or reload weights for the target model
- Parameters
restore_from – the path to restore weights
-
get_state_hash(state: deepword.agents.utils.ObsInventory) → str¶
-
init(is_training: bool, load_best: bool = False, restore_from: Optional[str] = None) → None¶ Initialize models of the core.
- Parameters
is_training – training or evaluation
load_best – load from best weights, otherwise last weights
restore_from – path to restore
-
policy(trajectory: List[deepword.agents.utils.ActionMaster], state: Optional[deepword.agents.utils.ObsInventory], action_matrix: numpy.ndarray, action_len: numpy.ndarray, action_mask: numpy.ndarray) → numpy.ndarray¶ Infer from policy.
- Parameters
trajectory – a list of ActionMaster
state – the current game state of observation + inventory
action_matrix – a matrix of all actions for the game, 2D array, each row represents a tokenized and indexed action.
action_len – 1D array, length for each action.
action_mask – 1D array, indices of admissible actions from all actions of the game.
- Returns
Q-values for actions in the action_matrix
-
save_model(t: Optional[int] = None) → None¶ Save current model with training steps
- Parameters
t – training steps, None falls back to default global steps
-
train_one_batch(pre_trajectories: List[List[deepword.agents.utils.ActionMaster]], post_trajectories: List[List[deepword.agents.utils.ActionMaster]], pre_states: Optional[List[deepword.agents.utils.ObsInventory]], post_states: Optional[List[deepword.agents.utils.ObsInventory]], action_matrix: List[numpy.ndarray], action_len: List[numpy.ndarray], pre_action_mask: List[numpy.ndarray], post_action_mask: List[numpy.ndarray], dones: List[bool], rewards: List[float], action_idx: List[int], b_weight: numpy.ndarray, step: int, others: Any) → numpy.ndarray¶ Train the core with one batch of data.
- Parameters
pre_trajectories – previous trajectories
post_trajectories – post trajectories
pre_states – previous states
post_states – post states
action_matrix – all actions for each of previous trajectories
action_len – length of actions
pre_action_mask – action masks for each of previous trajectories
post_action_mask – action masks for each of post trajectories
dones – game terminated or not for post trajectories
rewards – rewards received for reaching post trajectories
action_idx – actions used for reaching post trajectories
b_weight – 1D array, weight for each data point
step – current training step
others – other information passed for training purpose
- Returns
Absolute loss between expected Q-value and predicted Q-value for each data point
-
deepword.agents.dsqn_agent module¶
-
class
deepword.agents.dsqn_agent.DSQNAgent(hp, model_dir)¶ Bases:
deepword.agents.base_agent.BaseAgent
BaseAgent with hs2tj (hash states pointing to trajectories) for SNN training.
-
get_snn_pairs(batch_size: int) → Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray]¶ Sample SNN pairs for training the SNN part.
- Parameters
batch_size – how many data points to generate. Notice that batch_size * 2 data points will be generated, one half for trajectory pairs with the same states; the other half for trajectory pairs with different states.
- Returns
src – trajectories
src_len – length of them
src2 – the paired trajectories
src2_len – length of them
labels – 0 for same states; 1 for different states
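A sketch of feeding these pairs into the SNN evaluation of a DSQN core (agent and core are assumed to be initialized; the returned float is treated here as an accuracy-like metric):
snn_data = agent.get_snn_pairs(batch_size=32)    # (src, src_len, src2, src2_len, labels)
snn_metric = core.eval_snn(snn_data, batch_size=32)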
-
save_train_pairs(t: int, src: numpy.ndarray, src_len: numpy.ndarray, src2: numpy.ndarray, src2_len: numpy.ndarray, labels: numpy.ndarray) → None¶ Save SNN pairs for verification.
- Parameters
t – current training steps
src – trajectories
src_len – length of trajectories
src2 – paired trajectories
src2_len – length of paired trajectories
labels – 0 or 1 for same or different states
-
-
class
deepword.agents.dsqn_agent.DSQNCompetitionAgent(hp, model_dir)¶ Bases:
deepword.agents.dsqn_agent.DSQNAgent, deepword.agents.competition_agent.CompetitionAgent
-
class
deepword.agents.dsqn_agent.DSQNZorkAgent(hp, model_dir)¶ Bases:
deepword.agents.dsqn_agent.DSQNAgent, deepword.agents.zork_agent.ZorkAgent
deepword.agents.gen_agent module¶
-
class
deepword.agents.gen_agent.GenDQNAgent(hp, model_dir)¶ Bases:
deepword.agents.base_agent.BaseAgent
GenDQNAgent works with
deepword.agents.cores.GenDQNCore.
deepword.agents.gen_drrn_agent module¶
-
class
deepword.agents.gen_drrn_agent.GenCompetitionDRRNAgent(hp, model_dir)¶
-
class
deepword.agents.gen_drrn_agent.GenDRRNAgent(hp, model_dir)¶ Bases:
deepword.agents.base_agent.BaseAgent
We generate admissible actions at every step, and then use DRRN to choose the best action to play.
This agent can be compared with the previous template-gen agent.
deepword.agents.utils module¶
-
class
deepword.agents.utils.ActType(rnd, rule, rnd_walk, policy_drrn, policy_gen, jitter, policy_tbl)¶
-
class
deepword.agents.utils.ActionDesc(action_type, action_idx, token_idx, action_len, action, q_actions)¶
-
class
deepword.agents.utils.ActionMaster(action_ids: List[int], master_ids: List[int], action: str, master: str)¶ Bases:
object
-
property
action¶
-
property
action_ids¶
-
property
ids¶
-
property
lens¶
-
property
master¶
-
property
master_ids¶
-
class
deepword.agents.utils.CommonActs(examine_cookbook, prepare_meal, eat_meal, look, inventory, gn, gs, ge, gw)¶
-
class
deepword.agents.utils.EnvInfosKey(recipe, desc, inventory, max_score, won, lost, actions, templates, verbs, entities)¶ Bases:
deepword.agents.utils.KeyInfo
-
class
deepword.agents.utils.GenSummary(ids, tokens, gens, q_action, len)¶
-
class
deepword.agents.utils.LinearDecayedEPS(decay_step, init_eps=1, final_eps=0)¶ Bases:
deepword.agents.utils.ScheduledEPS
-
eps(t)¶
-
-
class
deepword.agents.utils.Memolet(tid, sid, gid, aid, token_id, a_len, a_type, reward, is_terminal, action_mask, sys_action_mask, next_action_mask, next_sys_action_mask, q_actions)¶ Bases:
deepword.agents.utils.Memolet
end_of_episode: the game stops by 1) winning, 2) losing, or 3) exceeding the maximum number of steps. is_terminal: whether the current step reaches a terminal game state by winning or losing. is_terminal = True means that for the current step, the Q-value equals the instant reward.
TODO: Notice that end_of_episode doesn’t imply is_terminal. Only winning or losing means is_terminal = True.
-
class
deepword.agents.utils.ObsInventory(obs, inventory, sid, hs)¶
-
class
deepword.agents.utils.ScannerDecayEPS(decay_step, decay_range, next_init_eps_rate=0.8, init_eps=1, final_eps=0)¶ Bases:
deepword.agents.utils.ScheduledEPS
-
eps(t)¶
-
-
class
deepword.agents.utils.ScheduledEPS(name: Optional[str] = None)¶ Bases:
deepword.log.Logging
-
eps(t)¶
-
-
deepword.agents.utils.batch_drrn_action_input(action_matrices: List[numpy.ndarray], action_lens: List[numpy.ndarray], action_masks: List[numpy.ndarray]) → Tuple[numpy.ndarray, numpy.ndarray, List[int], List[Dict[int, int]]]¶ Select actions from action_masks in a batch
-
deepword.agents.utils.bert_commonsense_input(action_matrix: numpy.ndarray, action_len: numpy.ndarray, trajectory: List[int], trajectory_len: int, sep_val_id: int, cls_val_id: int, num_tokens: int) → Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]¶ Given one trajectory and its admissible actions, create a training set of inputs for BERT.
Notice: trajectory_len and action_len need to be confirmed so that positions for special tokens, e.g. [CLS] and [SEP], are reserved.
E.g. input: [1, 2, 3], action_matrix: [[1, 3], [2, PAD], [4, PAD]]; suppose we need the length to be 10.
Output:
[[CLS, 1, 2, 3, SEP, 1, 3, SEP, PAD, PAD, PAD],
 [CLS, 1, 2, 3, SEP, 2, SEP, PAD, PAD, PAD, PAD],
 [CLS, 1, 2, 3, SEP, 4, SEP, PAD, PAD, PAD, PAD]]
Segments of trajectory and actions:
[[0, 0, 0, 0, 0, 1, 1, 1],
 [0, 0, 0, 0, 0, 1, 1, 0],
 [0, 0, 0, 0, 0, 1, 1, 0]]
Input sizes: [8, 7, 7]
- Returns
trajectory + action; segmentation ids; sizes
-
deepword.agents.utils.categorical_without_replacement(logits, k=1)¶ Courtesy of https://github.com/tensorflow/tensorflow/issues/9260#issuecomment-437875125
Also cited here:
@misc{vieira2014gumbel,
title = {Gumbel-max trick and weighted reservoir sampling},
author = {Tim Vieira},
url = {http://timvieira.github.io/blog/post/2014/08/01/gumbel-max-trick-and-weighted-reservoir-sampling/},
year = {2014}
}
Notice that the logits represent unnormalized log probabilities; per the citation above, there is no need to normalize them before adding the Gumbel random variate, which surprises me, since I thought it should be logits - tf.reduce_logsumexp(logits) + z.
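A minimal numpy sketch of the Gumbel-top-k trick described above (not the library's TensorFlow implementation): adding Gumbel(0, 1) noise to the unnormalized log-probabilities and taking the top-k indices samples k items without replacement.
import numpy as np

def gumbel_top_k(logits, k=1, rng=None):
    # Gumbel-max / Gumbel-top-k: no normalization of the logits is required.
    rng = rng or np.random.default_rng()
    z = rng.gumbel(size=len(logits))
    return np.argsort(logits + z)[::-1][:k]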
-
deepword.agents.utils.drrn_action_input(action_matrix: numpy.ndarray, action_len: numpy.ndarray, action_mask: numpy.ndarray) → Tuple[numpy.ndarray, numpy.ndarray, int, Dict[int, int]]¶ Select actions from action_mask.
- Parameters
action_matrix – action matrix for a game
action_len – lengths for actions in the action_matrix
action_mask – list of indices of selected actions
- Returns
selected action matrix, selected action len, number of actions selected, and the mapping from real ID to mask ID.
real ID: the action index in the original action_matrix; mask ID: the action index in the action_mask.
Examples
>>> a_mat = np.asarray([
>>>     [1, 2, 3, 4, 0],
>>>     [2, 2, 1, 3, 1],
>>>     [3, 1, 0, 0, 0],
>>>     [6, 9, 9, 1, 0]])
>>> a_len = np.asarray([4, 5, 2, 4])
>>> a_mask = np.asarray([1, 3])
>>> drrn_action_input(a_mat, a_len, a_mask)
[[2, 2, 1, 3, 1], [6, 9, 9, 1, 0]]
[5, 4]
{1: 0, 3: 1}
-
deepword.agents.utils.get_action_idx_pair(action_matrix: numpy.ndarray, action_len: numpy.ndarray, sos_id: int, eos_id: int) → Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]¶ Create action index pair for seq2seq training. Given action index, e.g. [1, 2, 3, 4, pad, pad, pad, pad], with 0 as sos_id, and -1 as eos_id, we create training pair: [0, 1, 2, 3, 4, pad, pad, pad] as the input sentence, and [1, 2, 3, 4, -1, pad, pad, pad] as the output sentence.
Notice that we remove the final pad to keep the action length unchanged. Notice also that pad should be indexed as 0.
- Parameters
action_matrix – np array of action index of N * K, there are N, and each of them has a length of K (with paddings).
action_len – length of each action (remove paddings).
sos_id –
eos_id –
- Returns
action index as input, action index as output, new action len
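An illustrative example following the transformation described above, with pad = 0, sos_id = 0, and eos_id = -1 (the exact formatting of the returned arrays may differ):
>>> a_mat = np.asarray([[1, 2, 3, 4, 0, 0, 0, 0]])
>>> a_len = np.asarray([4])
>>> inp, outp, new_len = get_action_idx_pair(a_mat, a_len, sos_id=0, eos_id=-1)
>>> # inp     -> [[0, 1, 2, 3, 4, 0, 0, 0]]
>>> # outp    -> [[1, 2, 3, 4, -1, 0, 0, 0]]
>>> # new_len -> [5]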
-
deepword.agents.utils.get_best_1d_q(q_actions: numpy.ndarray) → Tuple[int, float]¶ Find the best Q-value given a 1D Q-vector
- Parameters
q_actions – a vector of Q-values
- Returns
best action index, Q-value
Examples
>>> q_vec = np.asarray([0.1, 0.2, 0.3, 0.4])
>>> get_best_1d_q(q_vec)
3, 0.4
-
deepword.agents.utils.get_best_batch_ids(q_actions: numpy.ndarray, actions_repeats: List[int]) → List[int]¶ Get a batch of best action indices of Q-values, one for each group defined by actions_repeats.
- Parameters
q_actions – a 1D Q-vector
actions_repeats – groups of number of actions, indicating how many elements are in the same group.
- Returns
best action index for each group
Examples
>>> q_vec = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> repeats = [3, 4, 3]
>>> # Q-vector splits into three groups containing 3, 4, 3 Q-values
>>> # shaded_qs = [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]]
>>> get_best_batch_ids(np.asarray(q_vec), repeats)
[3, 7, 10]
-
deepword.agents.utils.get_hash_state(obs: str, inv: str) → str¶ Generate a hash state from observation and inventory.
- Parameters
obs – observation of current step
inv – inventory of current step
- Returns
hash state of current step
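A plausible stand-in, purely for illustration (the actual hashing scheme used by get_hash_state may differ):
import hashlib

def hash_state_sketch(obs: str, inv: str) -> str:
    # Hypothetical equivalent: digest of the concatenated observation and inventory text.
    return hashlib.sha256((obs + "\n" + inv).encode("utf-8")).hexdigest()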
-
deepword.agents.utils.get_path_tags(path, prefix)¶ Get tags from a path of saved objects. E.g. for actions-100.npz, 100 will be extracted. Make sure the item to be extracted is saved with the npz suffix.
- Parameters
path – path to find files with prefix
prefix – prefix
- Returns
list of all tags
Examples
>>> # suppose there are these files: >>> # actions-99.npz, actions-100.npz, actions-200.npz >>> get_path_tags("/path/to/data", "actions") [99, 100, 200]
-
deepword.agents.utils.get_snn_keys(hash_states2tjs: Dict[str, Dict[int, List[int]]], tjs: deepword.trajectory.Trajectory, size: int) → Tuple[List[Tuple[int, int]], List[Tuple[int, int]], List[Tuple[int, int]]]¶ Get SNN training pairs from trajectories.
- Parameters
hash_states2tjs – the mapping from hash state to trajectory
tjs – the trajectories
size – batch size
- Returns
target_set, same_set, and diff_set; each set contains keys of (tid, sid) to locate a trajectory
-
deepword.agents.utils.id_real2batch(real_id: List[int], id_real2mask: List[Dict[int, int]], actions_repeats: List[int]) → List[int]¶ Transform real IDs to IDs in a batch
An explanation of the three ID systems for actions, depending on where the action appears:
In the action matrix of the game: real ID. E.g. a game with three actions [“go east”, “go west”, “eat meal”], then the real IDs are [0, 1, 2]
In the action mask for each step of game-playing: mask ID. E.g. when playing a step with admissible actions [“go east”, “eat meal”], the mask IDs are [0, 1], mapping to the real IDs [0, 2].
In a batch for training. E.g. in a batch of 2 entries, each entry is from a different game, say, game-1 and game-2.
Game-1, at the step of playing, has two actions, say [0, 2];
Game-2, at the step of playing, has three actions, say, [0, 4, 10].
Suppose the agent chooses action-0 from game-1 for entry-1, and action-4 from game-2 for entry-2. Now the real IDs are [0, 4]. However, the mask IDs are [0, 1].
Why does action-4 become action-1? Because for that step of game-2, there are only three actions [0, 4, 10], and action-4 is placed at position 1.
Converting mask IDs to batch IDs, we get [0, 3].
Why does action-1 become action-3? Because if we place the actions (mask IDs) for entry-1 and entry-2 together, we get [[0, 1], [0, 1, 2]]. The action list is then flattened into [0, 1, 0, 1, 2] and re-indexed as [0, 1, 2, 3, 4]. So action-1 maps to action-3 for entry-2.
- Parameters
real_id – action ids for each game in the original action_matrix of that game
id_real2mask – list of mappings from real IDs to mask IDs
actions_repeats – action sizes in each group
- Returns
a list of batch IDs
Examples
>>> rids = [0, 4]
>>> id_maps = [{0: 0, 2: 1}, {0: 0, 4: 1, 10: 2}]
>>> repeats = [2, 3]
>>> id_real2batch(rids, id_maps, repeats)
[0, 3]
-
deepword.agents.utils.remove_zork_version_info(text)¶
-
deepword.agents.utils.sample_batch_ids(q_actions: numpy.ndarray, actions_repeats: List[int], k: int) → List[int]¶ Get a batch of sampled action indices of Q-values. actions_repeats indicates how many elements are in the same group. E.g. q_actions = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] and actions_repeats = [3, 4, 3]; then q_actions can be split into three groups: [1, 2, 3], [4, 5, 6, 7], [8, 9, 10].
We sample from the indices: the best index in each group is taken as the first element of that group, then another k - 1 elements are sampled for each group. If the number of elements in a group is smaller than k - 1, we sample with replacement.
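A call sketch for the example above (the sampled indices are random, so only the contract is shown; this assumes the function returns k indices per group, best-first, as described):
>>> q_vec = np.asarray([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
>>> repeats = [3, 4, 3]
>>> ids = sample_batch_ids(q_vec, repeats, k=2)
>>> # len(ids) == 2 * len(repeats) == 6; within each group the best index comes first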
deepword.agents.zork_agent module¶
-
class
deepword.agents.zork_agent.ZorkAgent(hp, model_dir)¶ Bases:
deepword.agents.base_agent.BaseAgent
The agent to run Zork. TextWorld will not provide admissible actions for Zork as it does for the cooking games, so a loaded action file is required.
