"We first collect a dataset of human-written demonstrations on prompts submitted to our API, and use this to train our supervised learning baselines. Next, we collect a dataset of human-labeled comparisons between two model outputs on a larger set of API prompts. We then train a reward model (RM) on this dataset to predict which output our labelers would prefer. Finally, we use this RM as a reward function and fine-tune our GPT-3 policy to maximize this reward using the PPO algorithm."
Pros: good coverage of related work on learning human values; the pipeline of training a reward model and then fine-tuning GPT-3 against it with PPO is interesting.
Cons: only API access is provided; no dataset or training code is released.
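The quoted pipeline's middle step, training an RM to predict which output labelers prefer, reduces to a pairwise (Bradley-Terry style) loss over comparison data. A toy sketch of that objective (my reconstruction; since no code is released, the function name and scalar rewards here are purely illustrative):

```python
import math

def pairwise_rm_loss(r_chosen: float, r_rejected: float) -> float:
    """-log sigmoid(r_chosen - r_rejected): low when the reward model
    scores the labeler-preferred output higher than the rejected one."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# A correctly ordered pair incurs a small loss ...
low = pairwise_rm_loss(2.0, -1.0)
# ... while a reversed pair is penalized heavily.
high = pairwise_rm_loss(-1.0, 2.0)
assert low < high
```

Minimizing this over the human comparisons yields the scalar reward that PPO then maximizes in the final step.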
"In the fine-tuning stage, we train LaMDA to perform a mix of generative tasks to generate natural-language responses to given contexts, and classification tasks on whether a response is safe and high-quality, resulting in a single multi-task model that can do both."
Pros: useful objectives & metrics proposed: Quality (Sensibleness, Specificity, Interestingness), Safety, Groundedness.
Cons: no dataset or code released.
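The quote describes one multi-task model that both generates responses and classifies them for safety and quality. A toy sketch of that generate-then-classify loop (all helper names and scores are hypothetical stand-ins, not the paper's implementation):

```python
def generate_candidates(context: str, n: int = 3) -> list:
    """Stand-in for sampling n responses from the generative head."""
    return [f"candidate {i} for {context!r}" for i in range(n)]

def classify(response: str) -> dict:
    """Stand-in for the classification tasks: safety and quality scores."""
    i = int(response.split()[1])  # toy: derive fixed scores from the index
    return {"safety": 0.0 if i == 0 else 1.0, "quality": i / 10}

def respond(context: str) -> str:
    """Generate, drop unsafe candidates, return the highest-quality one."""
    candidates = generate_candidates(context)
    safe = [c for c in candidates if classify(c)["safety"] > 0.5]
    return max(safe, key=lambda c: classify(c)["quality"])
```

The point of the single multi-task model is that the same network that generates can also filter and rank its own candidates at serving time.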
1. Gopher is a 280B parameter language model. Increasing the scale of a model boosts performance in areas like reading comprehension, fact-checking, and the identification of toxic language, but not in logical reasoning and common-sense tasks.
2. Gopher shows that a language model does not have to be trained specifically on dialogue data to achieve comparable performance in dialogue interactions, though hallucination, safety, and contradiction issues remain.
3. They also propose the Retrieval-Enhanced Transformer (RETRO), similar in spirit to OpenAI's WebGPT (though without web browsing during inference).
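The retrieval idea behind RETRO can be caricatured as nearest-neighbor lookup over a text database, with the retrieved chunks fed to the model as extra conditioning. A toy sketch (my simplification with word-overlap similarity; RETRO actually uses frozen-BERT embeddings and a trillion-token database):

```python
# Hypothetical miniature "retrieval database" of text chunks.
DATABASE = [
    "gopher is a 280b parameter language model",
    "retrieval improves factual grounding",
    "ppo fine-tunes a policy against a reward model",
]

def retrieve(query: str, k: int = 1) -> list:
    """Rank database chunks by word overlap with the query (toy similarity)."""
    q = set(query.lower().split())
    return sorted(DATABASE, key=lambda c: -len(q & set(c.split())))[:k]
```

For example, `retrieve("gopher parameter model")` surfaces the Gopher chunk first, which the transformer would then attend over while generating.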
"We taught GPT-3 to use a text-based web-browser. The model is provided with an open-ended question and a summary of the browser state, and must issue commands such as “Search ...”, “Find in page: ...” or “Quote: …”. In this way, the model collects passages from web pages, and then uses these to compose an answer."
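The quote names the commands the model can issue against the browser state. A minimal sketch of such an environment step (command strings from the quote; the state layout and internals are my guesses, as WebGPT's environment is not released):

```python
def step(state: dict, command: str) -> dict:
    """Apply one model-issued command to the text-browser state."""
    if command.startswith("Search "):
        state["page"] = f"results for {command[len('Search '):]}"
    elif command.startswith("Find in page: "):
        state["cursor"] = command[len("Find in page: "):]
    elif command.startswith("Quote: "):
        state["quotes"].append(command[len("Quote: "):])
    return state

state = {"page": None, "cursor": None, "quotes": []}
for cmd in ["Search language models", "Quote: LMs predict tokens"]:
    state = step(state, cmd)
# the collected quotes are later composed into the final answer
```

Each turn, the model sees a summary of this state and emits the next command, so answering becomes a sequential decision problem rather than one-shot generation.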
Pros: the main contributions are the environment and the explicit annotation of users' mental states; the task is also clearly defined for studying Theory of Mind.