Course Notes

CS146S: The Modern Software Developer

Personal notes from lesson review, formatted for quick revision.

Deep Dive into LLMs

How pretraining data is prepared

Tokenization: converting text into numeric symbols (tokens)

Quick memory cue: data quality and filtering are as important as model architecture.
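To make the memory cue concrete, here is a minimal sketch of document-level quality filtering. The heuristics (minimum length, alphabetic-character ratio, repetition check) are hypothetical illustrations, not the actual filters used by any real pretraining pipeline.

```python
# Hypothetical quality-filter sketch: keep documents that pass
# a few cheap checks, drop markup-heavy or repetitive ones.

def looks_clean(doc: str) -> bool:
    """Return True if the document passes simple quality heuristics."""
    words = doc.split()
    if len(words) < 5:                       # too short to be useful
        return False
    alpha = sum(c.isalpha() for c in doc)
    if alpha / max(len(doc), 1) < 0.6:       # mostly markup/symbols
        return False
    if len(set(words)) / len(words) < 0.3:   # heavy repetition
        return False
    return True

docs = [
    "Quick memory cue: data quality matters as much as architecture.",
    "<div><div><div>1 2 3 4 5 6</div></div></div>",
    "spam spam spam spam spam spam",
]
print([looks_clean(d) for d in docs])  # [True, False, False]
```

Real pipelines layer many more signals (language ID, deduplication, perplexity filtering), but the shape is the same: score each document cheaply and keep only what passes.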

Text Diagram (same meaning)

Raw Web Pages -> Filtering -> Clean Text -> Bits (0/1) -> Bytes (0-255) -> BPE Tokens -> Train LLM
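The middle stages of this pipeline can be sketched in a few lines: UTF-8 encoding turns text into bytes (0-255), each byte is 8 bits, and BPE repeatedly merges the most frequent adjacent pair into a new token id. This is a toy illustration of one merge step, not any production tokenizer.

```python
# Toy sketch: text -> bits -> bytes -> one BPE-style merge step.
from collections import Counter

def text_to_bits(text: str) -> str:
    """UTF-8 encode, then show each byte as 8 bits (0/1)."""
    return " ".join(f"{b:08b}" for b in text.encode("utf-8"))

def bpe_one_merge(byte_ids: list[int]) -> list[int]:
    """Replace the most frequent adjacent pair with a new token id (256)."""
    pairs = Counter(zip(byte_ids, byte_ids[1:]))
    if not pairs:
        return byte_ids
    (a, b), _ = pairs.most_common(1)[0]
    out, i = [], 0
    while i < len(byte_ids):
        if i + 1 < len(byte_ids) and (byte_ids[i], byte_ids[i + 1]) == (a, b):
            out.append(256)   # new token id outside the byte range
            i += 2
        else:
            out.append(byte_ids[i])
            i += 1
    return out

text = "banana"
ids = list(text.encode("utf-8"))
print(text_to_bits("a"))   # 01100001
print(ids)                 # [98, 97, 110, 97, 110, 97]
print(bpe_one_merge(ids))  # most frequent adjacent pair merged into id 256
```

A real BPE tokenizer repeats this merge step thousands of times on a large corpus, assigning a fresh id (257, 258, ...) at each round; the final merge table is the vocabulary.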

Lesson Screenshots

[Screenshot] Pretraining pipeline + extracting plain text from raw web pages.
[Screenshot] Text represented as bits (0/1) before byte grouping.
[Screenshot] Combined demo view: pretraining context beside numeric representation.