Liang Qiu

Selected Reading (Feb 2022)

Updated: Feb 12, 2022

"We first collect a dataset of human-written demonstrations on prompts submitted to our API, and use this to train our supervised learning baselines. Next, we collect a dataset of human-labeled comparisons between two model outputs on a larger set of API prompts. We then train a reward model (RM) on this dataset to predict which output our labelers would prefer. Finally, we use this RM as a reward function and fine-tune our GPT-3 policy to maximize this reward using the PPO algorithm."

Pros: covers related work on learning human values. The way they train the reward model and use PPO to fine-tune GPT-3 is interesting.

Cons: only API access is provided. No dataset or training code is released.
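
The reward-model step in the quote above can be illustrated with a pairwise preference loss. This is a minimal sketch, assuming the common formulation in which the model assigns a scalar reward to each output and is penalized by -log(sigmoid(r_preferred - r_rejected)); the quote itself does not spell out the loss, and the function name and arguments here are hypothetical.

```python
import math


def pairwise_rm_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Pairwise preference loss: -log(sigmoid(r_preferred - r_rejected)).

    Assumption: each reward is a scalar score that a (hypothetical) reward
    model produced for one of the two outputs a labeler compared.
    """
    diff = reward_preferred - reward_rejected
    # -log(sigmoid(diff)) equals softplus(-diff); this form avoids overflow
    # for large positive or negative differences.
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


if __name__ == "__main__":
    # The loss shrinks as the preferred output is scored further above the other.
    for margin in (-2.0, 0.0, 2.0):
        print(margin, pairwise_rm_loss(margin, 0.0))
```

The loss is small when the preferred output already scores higher and large when the ranking is inverted, which is what pushes the reward model toward the labelers' preferences.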


"In the fine-tuning stage, we train LaMDA to perform a mix of generative tasks to generate natural-language responses to given contexts, and classification tasks on whether a response is safe and high-quality, resulting in a single multi-task model that can do both."

Pros: useful objectives & metrics proposed: Quality (Sensibleness, Specificity, Interestingness), Safety, Groundedness.

Cons: no dataset or code released.
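
One way a single model that both generates and classifies could be used at response time is to score its own candidates, drop the unsafe ones, and return the highest-quality survivor. The sketch below only illustrates that idea; the `Candidate` fields, the score ranges, and the threshold are assumptions, not details taken from the quote.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    text: str
    safety: float   # hypothetical classifier score in [0, 1]
    quality: float  # hypothetical classifier score in [0, 1]


def pick_response(candidates: List[Candidate],
                  safety_threshold: float = 0.5) -> Optional[Candidate]:
    """Filter candidates by safety, then return the highest-quality one.

    Returns None when no candidate passes the (assumed) safety threshold.
    """
    safe = [c for c in candidates if c.safety >= safety_threshold]
    if not safe:
        return None
    return max(safe, key=lambda c: c.quality)


if __name__ == "__main__":
    pool = [
        Candidate("a", safety=0.2, quality=0.9),  # filtered out as unsafe
        Candidate("b", safety=0.9, quality=0.7),
        Candidate("c", safety=0.8, quality=0.4),
    ]
    print(pick_response(pool))
```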


1. Gopher is a 280B parameter language model. Increasing the scale of a model boosts performance in areas like reading comprehension, fact-checking, and the identification of toxic language, but not in logical reasoning and common-sense tasks.

2. Gopher shows that a language model does not have to be trained specifically on dialog data to achieve similar performance on dialog interactions, while the hallucination, safety, and contradiction issues remain the same.

3. They also propose the Retrieval-Enhanced Transformer (RETRO), similar to OpenAI WebGPT (w/o retrieval during inference).


"We taught GPT-3 to use a text-based web-browser. The model is provided with an open-ended question and a summary of the browser state, and must issue commands such as “Search ...”, “Find in page: ...” or “Quote: …”. In this way, the model collects passages from web pages, and then uses these to compose an answer."
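
To make the command interface in the quote concrete, here is a minimal sketch of how an environment might parse the model's text commands. Only the three command prefixes come from the quote; the function name, the return shape, and the handling of unknown input are assumptions for illustration.

```python
from typing import Optional, Tuple

# Prefixes taken from the quoted examples; the actual command set may be larger.
COMMAND_PREFIXES = ("Search ", "Find in page:", "Quote:")


def parse_command(line: str) -> Optional[Tuple[str, str]]:
    """Split a model-issued command into (name, argument).

    Returns None when the line matches none of the known prefixes.
    """
    text = line.strip()
    for prefix in COMMAND_PREFIXES:
        if text.startswith(prefix):
            name = prefix.rstrip(": ")
            argument = text[len(prefix):].strip()
            return name, argument
    return None


if __name__ == "__main__":
    for cmd in ("Search retrieval transformers", "Find in page: PPO", "hello"):
        print(repr(cmd), "->", parse_command(cmd))
```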


Pros: the main contribution is the environment and explicit annotation of the users' mental states. The task is also clearly defined for studying the Theory of Mind.

