
Artificial Intelligence: From First Principles to Agents

February 20, 2026
A guided mental map

Artificial Intelligence is a large and noisy landscape. If you open social media, you’ll hear about agents, AGI, prompt engineering, multimodality, alignment, fine-tuning, embeddings, scaling laws, reasoning models, and a dozen other terms, often used interchangeably and often without context. The result is conceptual fog. In this post, we won’t try to map the entire territory. Instead, we’ll walk a deliberate path through it, pointing to side roads along the way but staying focused on one trajectory: from the basic idea of learning from data to modern AI agents. The goal is to give you a clean mental model of:
  • What kind of problem AI systems solve
  • How modern machine learning is structured
  • Where deep learning fits
  • Why transformers matter
  • What LLMs actually are
  • And how real-world systems are built around them
We’ll take this step by step, using a few key research milestones as guideposts. This isn’t a literature review, but a hand-drawn map showing how the ideas evolved.
Much of early AI focused on rule-based systems. Researchers believed intelligence could be engineered explicitly:
  • Represent knowledge symbolically
  • Encode rules like “IF X THEN Y”
  • Apply logical inference to derive conclusions
This approach, often called symbolic AI, was deeply influenced by formal logic and philosophy. Intelligence was seen as structured reasoning over symbols. It’s important to note that this was not just a collection of simple if statements. Symbolic AI typically relied on richer forms of logic (e.g. first-order logic) where systems represented:
  • Objects (e.g., John)
  • Properties (e.g., has_fever(John))
  • General rules (e.g., “for all patients, fever AND cough imply possible flu”)
A separate inference engine would apply these rules to derive new conclusions. Knowledge and reasoning were explicitly encoded and mechanically applied. In constrained environments, it worked remarkably well. Systems were built for:
  • Playing chess
  • Proving theorems
  • Planning in well-defined domains
  • Medical expert diagnosis
But cracks appeared as AI moved into messier domains. Perception, language, and everyday reasoning turned out to be messy, ambiguous, and high-dimensional. Writing explicit rules for recognizing a cat in arbitrary lighting or handling the variability of natural language quickly became intractable. The problem wasn’t logic itself. The problem was scale. Encoding intelligence manually does not scale well to noisy, unstructured data. The major shift, often called the statistical turn, was conceptually simple:
Instead of writing the rules, let the system learn them from data.
This idea is the foundation of modern machine learning. At its core, machine learning defines a parameterized model and adjusts its parameters to optimize performance based on data. In the simplest supervised case, this looks like fitting a function:

$f_\theta(x) \approx y$

You don’t hardcode the mapping from input to output. You define a flexible function with parameters $\theta$, and you adjust those parameters during the training phase so that the function performs well on examples.

This way of thinking — models as parameterized functions optimized from data — did not appear overnight. It emerged gradually from statistical decision theory, learning theory, and neural network research in the late 20th century. As researchers began formalizing concepts like generalization, model capacity, and risk minimization, optimization became the unifying lens through which learning was understood. By the 1990s and early 2000s, this formulation had become standard in machine learning literature and was consolidated in influential textbooks such as Pattern Recognition and Machine Learning (Bishop, 2006). But we don’t need the math to keep the core idea:
A learning system is an optimized function.
That’s the first anchor.
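To make the “optimized function” idea concrete, here is a minimal sketch: a one-parameter model $f_\theta(x) = \theta x$ fit by gradient descent on toy data. The dataset, learning rate, and step count are arbitrary illustrations, not a recipe.

```python
# Minimal sketch: a learning system as an optimized function.
# We fit f_theta(x) = theta * x to noisy examples of y = 3x
# by gradient descent on squared error. (Toy illustration only.)

def fit(examples, lr=0.01, steps=500):
    theta = 0.0  # start with an uninformed parameter
    for _ in range(steps):
        # gradient of mean squared error with respect to theta
        grad = sum(2 * (theta * x - y) * x for x, y in examples) / len(examples)
        theta -= lr * grad  # move the parameter against the gradient
    return theta

examples = [(1, 3.1), (2, 5.9), (3, 9.2), (4, 11.8)]
theta = fit(examples)
print(round(theta, 2))  # close to 3: the rule was learned, not written
```

Nothing here was told the rule "multiply by 3"; the parameter was pushed toward it by the data.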
Before going further, it’s important to separate ideas that are often conflated in casual discussions about AI. When people say “training a model” they may be referring to very different design choices. Modern AI systems involve at least four distinct layers:
  1. Learning signal — Where does feedback come from? (Labels? Structure? Rewards?)
  2. Model architecture — What kind of function are we using? (Linear model? Neural network? Transformer?)
  3. Objective function — What are we optimizing? (Prediction error? Likelihood? Reward?)
  4. Optimization method — How do we update parameters? (Gradient descent? Backpropagation?)
Keeping these layers separate prevents confusion later, especially when we get to deep learning and language models. We’ll move through these layers gradually, but keeping them distinct will make the rest of the journey clearer.
At a high level, machine learning paradigms differ primarily in the kind of feedback signal available to the learner. Across decades of research, three major learning paradigms emerged. They are not distinguished by architecture, but by how information flows from the environment to the model.

In supervised learning, the system is given input–output pairs:
  • Image → Label
  • Text → Sentiment
  • Features → Price
The model adjusts its parameters to minimize prediction error, typically formalized as minimizing expected loss over a dataset. This paradigm was shaped by statistical decision theory and later formalized through statistical learning theory. Work by Vapnik and others on concepts like VC dimension (cf. Andy Jones' blog post), structural risk minimization, and generalization bounds helped clarify when and why models trained on finite data can generalize to unseen examples. Most industrial machine learning systems fall into this category. Mental model:
Supervised learning is function approximation with examples.

In unsupervised learning, there are no explicit labels. Instead, the system attempts to model the structure or distribution of the data itself. Historically, this includes:
  • Clustering methods
  • Dimensionality reduction techniques like principal component analysis (PCA)
  • Probabilistic generative models such as mixture models and latent variable models
Rather than predicting externally defined targets, the system discovers latent structure: compressing information, modeling distributions, or organizing inputs into meaningful internal representations. Mental model:
Unsupervised learning is about discovering structure.
This perspective becomes crucial later when we discuss embeddings and representation learning.
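To ground the “discovering structure” idea, here is a minimal sketch: a toy k-means with two clusters on one-dimensional points. No labels are given; the grouping emerges from the data alone. This is a simplified illustration (crude initialization, fixed k), not production clustering.

```python
# Minimal sketch: unsupervised learning as structure discovery.
# A tiny k-means (k=2) on 1-D points -- no labels, the algorithm
# finds the two groups on its own. (Toy illustration only.)

def kmeans_1d(points, iters=10):
    centers = [min(points), max(points)]  # crude initialization
    for _ in range(iters):
        groups = ([], [])
        for p in points:
            # assign each point to its nearest center
            i = 0 if abs(p - centers[0]) <= abs(p - centers[1]) else 1
            groups[i].append(p)
        # move each center to the mean of its assigned points
        centers = [sum(g) / len(g) for g in groups if g]
    return sorted(centers)

points = [1.0, 1.2, 0.8, 9.9, 10.1, 10.0]
clusters = kmeans_1d(points)
print(clusters)  # ~[1.0, 10.0]: two clusters, discovered rather than given
```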
In reinforcement learning (RL), the learner is embedded in an environment.
  • It takes actions
  • It receives rewards
  • It updates a policy to maximize cumulative long-term reward
The mathematical foundations of RL draw heavily from control theory and dynamic programming — particularly Bellman’s formulation of optimal control — and were later extended through temporal-difference learning, Q-learning, and policy gradient methods. Instead of direct labels, feedback is delayed and evaluative. Mental model:
Reinforcement learning is optimization through interaction.
RL will reappear later when we discuss how large language models are aligned with human preferences.
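The interaction loop above can be sketched with a toy two-armed bandit: an epsilon-greedy agent learns which arm pays more purely from reward feedback. The payout distributions and exploration rate are illustrative assumptions.

```python
# Minimal sketch: reinforcement learning as optimization through
# interaction. The environment's payouts are hidden from the agent;
# it learns arm values only from sampled rewards. (Toy illustration.)

import random

random.seed(0)

def pull(arm):
    # hidden environment: arm 1 pays more on average
    return random.gauss(1.0 if arm == 0 else 2.0, 0.1)

values = [0.0, 0.0]  # estimated value of each arm
counts = [0, 0]

for _ in range(500):
    # explore sometimes; otherwise exploit the current best estimate
    arm = random.randrange(2) if random.random() < 0.1 else values.index(max(values))
    reward = pull(arm)
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # running mean update

best = values.index(max(values))
print(best)  # 1: the better arm, learned by trial and error
```

Note the feedback is evaluative ("this action earned 1.9"), never instructive ("the correct action was arm 1") — that is the defining difference from supervised learning.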
So far, we’ve described how systems learn, including the kind of feedback they receive and the objectives they optimize. But this leaves an important question unanswered:
What exactly is being learned?
A model doesn’t just memorize input–output mappings. To generalize beyond its training data, it must construct internal representations: structured encodings of the world that capture patterns, regularities, and abstractions. Learning is optimization. But generalization depends on representation. And this is where modern AI took a decisive turn.

Before deep learning, most machine learning systems depended heavily on feature engineering. The workflow often looked like this:
  1. Humans design features.
  2. The model learns how to combine them.
In computer vision, engineers manually designed:
  • Edge detectors
  • Histogram of Oriented Gradients (HOG) features
  • SIFT descriptors
In natural language processing, models relied on:
  • Bag-of-words vectors
  • N-grams
  • Manually curated linguistic features
The learning algorithm itself was often relatively simple — logistic regression, SVMs, shallow neural networks. The real intelligence lived in the feature design. This created a bottleneck. Performance improved only as fast as humans could invent better representations.
The breakthrough of deep learning was not merely deeper networks or better optimization. It was the systematic automation of feature learning. Instead of feeding hand-crafted abstractions into a model, researchers began training neural networks directly on raw inputs:
  • Pixels instead of edge descriptors
  • Word sequences instead of precomputed linguistic features
Deep networks learned intermediate layers of representation automatically. Lower layers captured local patterns. Higher layers captured increasingly abstract structure. This hierarchical representation learning became the defining advantage of deep neural networks. The turning point was made visible in 2012 with the success of deep convolutional networks on ImageNet, most famously in ImageNet Classification with Deep Convolutional Neural Networks (Krizhevsky et al., 2012). The key result wasn’t just higher accuracy. It was that the network learned its own feature hierarchy from raw data, at scale.
What does it mean to “learn a representation”? It means transforming raw inputs into an internal form where the task becomes simpler. In many cases, this internal form is geometric: high-dimensional vectors. But the important idea is not the vector itself. It is the structure the vector captures. A good representation:
  • Places similar inputs near each other.
  • Separates distinct concepts.
  • Encodes relevant invariances.
  • Discards irrelevant variation.
In this sense, representation learning is about discovering the right "coordinate system" (more precisely, a high-dimensional vector space) for a problem. Once inputs are expressed in the right coordinates, prediction becomes easier.
This geometric intuition became especially clear in natural language processing with work like Efficient Estimation of Word Representations in Vector Space (Mikolov et al., 2013). Word embeddings demonstrated that semantic relationships could emerge as spatial relationships:
  • Similar words cluster together.
  • Analogies correspond to vector offsets.
The model was not explicitly programmed with definitions of “king” or “queen.”
It learned statistical structure from large corpora, and that structure manifested geometrically. At this point, a new mental model becomes useful:
Modern AI systems convert messy reality into structured geometry.
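The offset intuition can be sketched with hand-made toy vectors. These 3-D embeddings are invented for illustration — they are not real word2vec vectors, which have hundreds of learned dimensions — but they show how an analogy reduces to vector arithmetic plus nearest-neighbor search.

```python
# Minimal sketch of "analogies as vector offsets".
# The 3-D vectors below are hand-made toys, NOT learned embeddings.

import math

emb = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# king - man + woman ~= ?
target = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]
best = max((w for w in emb if w != "king"), key=lambda w: cosine(emb[w], target))
print(best)  # "queen": the offset captures the relation geometrically
```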
And once meaning is encoded geometrically, scaling models becomes a matter of learning richer, higher-dimensional representations. This shift, from hand-designed features to learned representations, is the real inflection point that set the stage for modern AI. The next step was learning how to represent sequences effectively. That is where transformers enter the story.

This idea that representation is often the real bottleneck isn’t just theoretical. In my PhD work, the central problem wasn’t really “which algorithm should we use?”. It was: how do we represent program behavior in a way that a machine learning model can actually work with? The research focused on modeling binary files as structured graphs — specifically, System-Call Dependency Graphs (SCDGs). The hard part wasn’t classification itself. It was turning those complex, structured objects into something that preserved meaningful behavioral information while remaining learnable. In practice, that meant constantly asking questions like:
  • What structure should be kept?
  • What detail can be discarded?
  • What makes two binaries “close” or “similar” in a meaningful sense?
  • How do we evaluate whether the representation captures what matters?
At the time, this felt like a very domain-specific engineering challenge. Retrospectively, though, it’s clear that the work sat squarely inside a broader shift in AI: the growing realization that representation is often the central problem. This was before transformers and large language models became part of everyday technical conversation. Yet the underlying question was already the same: how do we construct representations that make learning possible? Before scale, before LLMs, before the current wave of generative AI, there is always a representation question. And in many systems, that’s where most of the intellectual work actually lives.

By the mid-2010s, deep learning had established itself as a powerful framework for representation learning, especially in vision. Language, however, exposed a structural weakness. Unlike images, language is sequential. The meaning of a word depends on the words around it, sometimes far away in the sentence. Modeling these long-range dependencies proved harder than it looked. Earlier approaches relied on recurrent neural networks (RNNs) and later Long Short-Term Memory networks (LSTMs). These architectures processed sequences step by step, maintaining a hidden state that evolved over time. They worked, but imperfectly:
  • Sequential processing limited parallelization.
  • Long-range dependencies were hard to preserve.
  • Training became unstable for very long contexts.
Conceptually, the model compresses everything it has seen so far into a single evolving vector. That vector effectively becomes the model’s memory. This creates two constraints:
  1. Information bottleneck — all prior context must fit into one state.
  2. Sequential dependency — tokens must be processed in order, limiting parallelization.
For short sequences, this works well. For long contexts, information degrades and training becomes difficult. A different idea began to take shape:
What if, instead of processing tokens one at a time, the model could look at the entire sequence at once?
This was the key insight behind the 2017 paper Attention Is All You Need (Vaswani et al., 2017), which introduced the transformer architecture. The core innovation was the attention mechanism. Instead of forcing all prior information through a single evolving state, attention allows each token to compute its representation by directly weighing the relevance of every other token in the sequence. In practical terms:
  • Every word can directly “look at” every other word.
  • Context is modeled through weighted relationships.
  • Computation can be parallelized efficiently.
In other words:
Each word dynamically decides which other words matter for understanding it.
The model no longer relies on a single compressed memory. It constructs context dynamically, through weighted relationships. In attention-based models, every token can interact with every other token within a single layer. This enables richer and more flexible sequence representations, which makes large-scale training feasible. Under the hood, these relevance weights are learned similarity measures between token representations, allowing the model to discover which relationships matter for the task.
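A minimal sketch of single-head scaled dot-product attention makes this concrete. One simplifying assumption: queries, keys, and values are the raw token vectors themselves, whereas real transformers apply learned projection matrices and use multiple heads.

```python
# Minimal sketch of scaled dot-product attention (one head, no
# learned projections: Q = K = V = the raw token vectors).

import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(tokens):
    d = len(tokens[0])
    out = []
    for q in tokens:  # each token queries...
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]  # ...every token, including itself
        weights = softmax(scores)   # relevance weights sum to 1
        out.append([sum(w * v[i] for w, v in zip(weights, tokens))
                    for i in range(d)])  # weighted mix of value vectors
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ctx = attention(tokens)
print([round(x, 2) for x in ctx[0]])  # token 0's context-mixed vector
```

Each output row is a mixture of all token vectors, weighted by learned-similarity scores — exactly the "every word can look at every other word" property, and every row can be computed in parallel.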
This shift changes three things fundamentally:
  1. Information flow — Context is no longer forced through a narrow sequential channel. Relationships are modeled explicitly and directly.
  2. Parallelization — Because tokens are processed simultaneously rather than step-by-step, training can leverage modern hardware much more efficiently.
  3. Representation flexibility — Attention builds context-sensitive representations. The embedding of a word is not fixed; it changes depending on surrounding tokens.
Importantly, this shift was architectural rather than conceptual.
  • The model still optimized a predictive objective.
  • It still learned representations.
  • It still operated in high-dimensional space.
But the representational capacity of the architecture becomes far more expressive. And crucially, the architecture is stable and scalable. That combination, attention plus scale, set the stage for the emergence of large language models. By itself, the transformer architecture did not immediately produce the kind of systems we now associate with large language models. The decisive factor was scale:
  • Larger models.
  • More data.
  • More compute.
For years, increasing model size produced incremental improvements. But around the late 2010s, researchers began to notice something more systematic. Performance did not improve randomly. It improved predictably. This observation was formalized in Scaling Laws for Neural Language Models (Kaplan et al., 2020), which showed that model performance follows smooth power-law relationships with respect to model size, dataset size, and compute. In simple terms:
If you make the model bigger, feed it more data, and train it longer, performance improves in a surprisingly regular way.
There was no obvious saturation point. No sudden collapse. Just steady, measurable improvement. This had an important implication: Improving language models was no longer primarily about inventing new architectures. It was about scaling existing ones efficiently.
The next surprise came with Language Models are Few-Shot Learners (Brown et al., 2020), which introduced GPT-3. As models crossed certain size thresholds, new capabilities appeared:
  • Performing tasks without task-specific training
  • Following instructions from prompts
  • Translating, summarizing, answering questions with minimal examples
These abilities were not explicitly programmed. They emerged from next-token prediction at scale. This phenomenon became known as in-context learning. Rather than updating weights, the model could adapt its behavior based on examples provided in the prompt. The architecture had not fundamentally changed. The objective had not fundamentally changed. The training procedure was still next-token prediction. What changed was scale. And scale altered behavior.
Why does increasing size produce qualitatively different behavior? Several factors contribute:
  1. Larger models can store more fine-grained statistical structure.
  2. Deeper networks can build more abstract representations.
  3. Massive datasets expose models to broader patterns of language use.
  4. Optimization at scale smooths behavior across many tasks.
As capacity increases, the model’s internal geometry becomes richer. Patterns that were previously too subtle to capture become representable. This does not mean the model “understands” in a human sense. It means the representation space becomes expressive enough to simulate a wide range of linguistic behaviors. And once that happens, a new class of systems becomes possible. Large language models are not defined by a new learning paradigm. They are transformers trained at unprecedented scale. That is the crucial shift.

At this point, we can finally define what a large language model (LLM) is, without any mysticism. A large language model is a transformer-based neural network trained to predict the next token in a sequence. That’s it. More precisely, it learns a conditional probability distribution:

$P(\tau_N \mid \tau_1, \dots, \tau_{N-1})$

Given a sequence of text, the model assigns probabilities to all possible next tokens and selects one according to that distribution. This formulation is the modern scaled-up version of classic language modeling: estimating the probability of sequences from data (see, for example, A Neural Probabilistic Language Model (Bengio et al., 2003)). Everything else emerges from this objective.
LLMs are trained using self-supervised learning. Instead of requiring labeled datasets, they use raw text and treat the next token as the target. Self-supervised learning can be understood as a structured form of unsupervised learning, where the supervision signal is derived from the data's own structure rather than external labels. For example:
Input: "The capital of France is"
Target: "Paris"
No human annotation is needed. The data itself provides supervision. By repeating this objective across billions (or trillions) of tokens, the model learns:
  • Syntax
  • Grammar
  • Statistical regularities
  • Common facts
  • Patterns of reasoning present in text
The training objective remains simple. The scale is what changes.
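The self-supervised setup can be sketched in a few lines: every position in raw text yields its own (context, target) pair, with no annotation step. This is a word-level toy; real systems use subword tokenizers over vastly larger corpora.

```python
# Minimal sketch of self-supervised next-token targets: the raw
# text itself supplies every (context, target) training pair.

text = "the capital of france is paris"
tokens = text.split()  # toy word-level tokenization

# one training pair per position: predict token i from tokens[:i]
pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
for context, target in pairs:
    print(" ".join(context), "->", target)
```

Six words of text produced five supervised examples for free; at web scale, the same trick yields trillions.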
An LLM does not store explicit symbolic rules. It does not contain a database of logical statements. It does not execute a formal reasoning engine. Instead, knowledge is encoded implicitly in its parameters, distributed across high-dimensional vector space. When the model produces text that appears to reason, what it is doing is generating sequences that statistically resemble reasoning patterns it has learned from data. This distinction is subtle but important:
  • Fluency is not the same as understanding.
  • Coherence is not the same as truth.
The model optimizes likelihood, not factual accuracy.
Because the objective is probabilistic next-token prediction, the model’s goal is to produce text that is plausible given the context — not text that is verified or grounded. If the training data contains inconsistent or incomplete information, the model may confidently generate incorrect outputs. For example, asked about a fictional research paper, an LLM might fabricate a plausible-sounding title, author list, and abstract — fluent but entirely invented. This is not a bug in reasoning logic. It is a direct consequence of the training objective. This tension of fluency without grounding is one reason hallucinations show up in practice. It has been studied under the broader umbrella of factuality and faithfulness in generation (e.g., On the Dangers of Stochastic Parrots (Bender et al., 2021), and work on factual consistency such as TruthfulQA (Lin et al., 2022)).
Despite their simplicity of objective, LLMs feel qualitatively different from earlier models. That perception comes from three factors:
  1. Scale — large parameter counts allow rich internal representations.
  2. Pretraining — exposure to massive, diverse corpora.
  3. In-context learning — the ability to condition behavior on prompts without weight updates.
Together, these produce systems that can simulate many linguistic behaviors within a single unified model. But under the hood, the mechanism remains: a transformer predicting the next token.

If the sections so far were about how models learn, this section is about how we make them useful and reliable in the real world. A large language model is a component. An AI application is a system. This distinction is easy to miss, and much of the confusion around modern AI comes from collapsing it. A base LLM is a probabilistic sequence model trained to predict the next token. By itself, it has:
  • No persistent memory beyond its context window
  • No direct access to external databases
  • No ability to execute code
  • No built-in mechanism for verifying truth
What makes modern AI systems powerful is not just the model, but the scaffolding built around it.
Modern LLM development typically follows a two-phase paradigm: pre-training on massive corpora for next-token prediction, followed by post-training for alignment and task-specific behavior. One way to adapt a base model is fine-tuning. Instead of training from scratch, the model’s parameters are adjusted on more specific datasets:
  • Instruction-following corpora
  • Domain-specific documents
  • Conversational examples
Reinforcement Learning from Human Feedback (RLHF) goes further. Instead of optimizing purely for next-token likelihood, the model is adjusted to produce outputs that humans rate as helpful, safe, or aligned (cf. Training language models to follow instructions with human feedback (Ouyang et al., 2022)). This does not change the underlying objective structure. It refines the behavior of the same probabilistic model. The architecture remains a transformer. The optimization remains gradient-based.
Not all adaptation requires weight updates. Because LLMs exhibit in-context learning, behavior can be shaped directly through the prompt. Providing examples inside the context window effectively conditions the model to continue in a particular pattern. In this sense, prompting is a form of soft programming:
  • You don’t change the code.
  • You shape the conditions under which it runs.
This capability was one of the most surprising consequences of scale.
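A sketch of what that conditioning looks like in practice: assembling a few-shot prompt from examples. The helper name and format are illustrative, and no real model is called here.

```python
# Minimal sketch of few-shot prompting as "soft programming":
# the examples placed in the context shape the continuation.
# (`few_shot_prompt` is an illustrative helper, not a library API.)

def few_shot_prompt(instruction, examples, query):
    lines = [instruction, ""]
    lines += [f"{x} -> {y}" for x, y in examples]  # in-context examples
    lines.append(f"{query} ->")                    # slot the model completes
    return "\n".join(lines)

prompt = few_shot_prompt(
    "Translate English to French.",
    [("sea", "mer"), ("sky", "ciel")],
    "cat",
)
print(prompt)
# A real system would send `prompt` to an LLM, which continues the
# pattern -- adapting behavior with no weight updates at all.
```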
A core limitation of LLMs is that they generate plausible text, not necessarily grounded truth. Retrieval-Augmented Generation (RAG) addresses this by inserting external information into the model’s context. The system:
  1. Searches a knowledge base.
  2. Selects relevant documents.
  3. Injects them into the prompt.
  4. Lets the model generate a response conditioned on those documents.
Crucially, the model itself remains unchanged. What changes is the architecture around it. RAG is therefore not a new learning paradigm, but a systems pattern: retrieve, then generate. This idea was formalized in work such as Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis et al., 2020). By grounding generation in external sources, the system improves factual reliability without modifying the underlying next-token predictor.
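The four steps above can be sketched as a minimal retrieve-then-generate loop. The word-overlap scoring and the stand-in "LLM" are toy assumptions for illustration, not a real retriever or model API; production systems typically use embedding-based search.

```python
# Minimal sketch of the retrieve-then-generate (RAG) pattern.
# The scoring and the fake model are toys; only the system shape
# -- search, select, inject, generate -- is the point.

documents = [
    "The Eiffel Tower is in Paris.",
    "Mount Fuji is in Japan.",
    "The Colosseum is in Rome.",
]

def retrieve(query, docs, k=1):
    # toy relevance score: count overlapping lowercase words
    q = set(query.lower().rstrip("?").split())
    ranked = sorted(docs,
                    key=lambda d: len(q & set(d.lower().rstrip(".").split())),
                    reverse=True)
    return ranked[:k]

def answer_with(llm, query, context):
    # the model is conditioned on retrieved text via the prompt
    prompt = f"Context: {context}\nQuestion: {query}\nAnswer:"
    return llm(prompt)

def fake_llm(prompt):
    # stand-in "model": just echoes the context line back
    return prompt.split("Context: ")[1].split("\n")[0]

docs = retrieve("Where is the Eiffel Tower?", documents)
answer = answer_with(fake_llm, "Where is the Eiffel Tower?", docs[0])
print(answer)
```

Note that `fake_llm` never changes: grounding came entirely from the architecture around it.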
Perhaps the most important addition to modern LLM systems is tool use, often implemented through function calling. Here, the model does not merely generate text, it can produce structured outputs that specify:
  • Which tool to call
  • With which arguments
For example:
  • Invoking a calculator
  • Querying a database
  • Sending an API request
  • Triggering code execution
The LLM’s role becomes one of orchestration:
  1. Interpret user intent
  2. Decide which function is appropriate
  3. Produce structured parameters
  4. Integrate the returned result into the final response
This is what enables agency-like behavior. This “language-to-action” pattern has been explored in research such as ReAct: Synergizing Reasoning and Acting in Language Models (Yao et al., 2022), which interleaves reasoning traces with tool use, and Toolformer: Language Models Can Teach Themselves to Use Tools (Schick et al., 2023), where models learn when to call tools from data. As a brief aside, ReAct touches on a broader and active area of research often described as reasoning in large language models. Work in this direction investigates how models can generate intermediate steps, decompose problems, and structure multi-step solutions more reliably. We won’t explore reasoning methods in depth here, as that topic deserves its own discussion. For our purposes, the key point is architectural: the model itself is still predicting tokens, but those tokens now describe actions. The surrounding system executes those actions. The model then continues generation based on the results. In effect:
The language model becomes a reasoning interface over tools.
This is a crucial shift. The intelligence is no longer confined to internal representation. It is distributed across model, tools, and control logic.
When function calling is combined with iterative prompting and memory, systems begin to resemble agents. They can:
  • Plan multi-step actions
  • Call tools repeatedly
  • Evaluate intermediate results
  • Adjust behavior dynamically
Yet even here, the underlying mechanism remains the same: A transformer predicting the next token. The appearance of agency emerges from structured interaction loops built around probabilistic generation.
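That interaction loop can be sketched end to end. The `toy_model` policy below is a hard-coded stand-in for a real LLM, and the JSON tool-call format is an illustrative convention, not any specific vendor's API.

```python
# Minimal sketch of the agent loop: the "model" emits a structured
# action, the system executes it, and the result is fed back into
# the context until the model emits plain text. (Toy illustration.)

import json

TOOLS = {"calculator": lambda expr: str(eval(expr, {"__builtins__": {}}))}

def toy_model(context):
    # A real LLM would generate this JSON; here it is scripted.
    if "result:" not in context:
        return json.dumps({"tool": "calculator", "args": "6 * 7"})
    return "The answer is " + context.split("result:")[-1].strip()

def agent(task, max_steps=5):
    context = task
    for _ in range(max_steps):
        output = toy_model(context)
        try:
            call = json.loads(output)              # structured tool call?
        except json.JSONDecodeError:
            return output                          # plain text: final answer
        result = TOOLS[call["tool"]](call["args"])  # system runs the tool
        context += f"\nresult: {result}"           # feed result back in
    return context

answer = agent("What is 6 times 7?")
print(answer)  # "The answer is 42"
```

The "intelligence" of the loop lives as much in the control flow (parse, execute, re-prompt) as in the generator itself, which is exactly the point of this section.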
The base model provides:
  • Large-scale representation learning
  • Linguistic fluency
  • Pattern generalization
The system provides:
  • Grounding
  • Tool access
  • Memory
  • Verification
  • Iterative control
Understanding this separation reduces much of the hype. Large language models are powerful components. Modern AI systems are engineered architectures that wrap those components in structure. The difference between the two is where most of the real engineering happens. We’ve taken a deliberate path through modern AI.
  1. We started with the statistical turn: the shift from writing rules to learning from data.
  2. We separated learning signals from architectures, objectives, and optimization methods.
  3. We saw how representation learning became the real breakthrough, moving the bottleneck from hand-crafted features to learned internal structure.
  4. We examined how transformers reshaped sequence modeling by replacing sequential compression with attention.
  5. We saw that scale changed behavior; not by altering objectives, but by expanding representational capacity.
And we clarified what large language models actually are:
Transformer-based next-token predictors trained at massive scale.
Nothing more. Nothing less.
If you compress the entire landscape into a few durable ideas, it looks like this:
  • AI models learn optimized functions.
  • Performance depends critically on representation.
  • Transformers enable scalable sequence representations.
  • Scale alters qualitative behavior.
  • LLMs are components, not complete systems.
Most confusion about AI comes from mixing these layers. When architectures are confused with learning paradigms. When probabilistic fluency is mistaken for reasoning. When models are mistaken for systems. When scale is mistaken for a new kind of intelligence. Stripped of hype, modern AI is remarkably consistent:
It is gradient-based optimization of large parameterized models that learn structured representations from data.
The apparent intelligence emerges from geometry, scale, and system design. Understanding that hierarchy makes it easier to reason about what current systems can do, and where their limits lie. The same principles extend naturally to multimodal systems, where representations of text, images, and audio are learned and combined within a unified architecture. And that, more than any individual tool or model, is the durable map.
AI may look chaotic from the outside. Internally, its structure is surprisingly disciplined.