517. Super Washing Machines
Intuition
The calculated max_moves = M represents the peak requirement imposed by either a single machine needing to unload its excess or a single boundary needing a certain net flow. Since M moves provide enough capacity to meet this peak requirement, and moves can be coordinated to address both needs simultaneously, M moves are sufficient to ensure all machines reach the target state.
The primary driver for moves is often to correct imbalances between sections of the machine line. The balance_k value represents the net number of dresses that must eventually move from the left section (0…k) to the right section (k+1…n-1) (if balance_k > 0) or vice-versa (if balance_k < 0).
The core requirement is to achieve a net flow of balance_k dresses across boundary k over the entire process. We know that abs(balance_k) <= M.
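For reference, here is a minimal Python sketch (mine, not part of the original notes) of how max_moves = M is computed: it is the larger of the biggest single-machine surplus and the biggest absolute boundary flow.
def find_min_moves(machines):
    # M = max over (a) each machine's surplus over the target and
    #              (b) the absolute net flow required across each boundary.
    total, n = sum(machines), len(machines)
    if total % n != 0:
        return -1
    target = total // n
    max_moves = prefix_balance = 0
    for load in machines:
        prefix_balance += load - target   # balance(0, k) after this machine
        max_moves = max(max_moves, load - target, abs(prefix_balance))
    return max_moves

assert find_min_moves([1, 0, 5]) == 3   # the machine with 5 must unload 3 dresses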
Pseudo code for the optimal action plan
# balance(i, j) = sum(machines[i:j]) - (j - i) * target
# balance(0, 0) = 0 and balance(0, n) = 0
for round in range(M):  # M is the max of the bottlenecks
    next_round_state = machines[:]
    for i in range(n):  # n is the number of machines
        if machines[i] > 0:
            if balance(0, i) < 0:  # the left section needs dresses
                next_round_state[i] -= 1
                next_round_state[i - 1] += 1
            elif balance(i + 1, n) < 0:  # the right section needs dresses
                next_round_state[i] -= 1
                next_round_state[i + 1] += 1
    machines = next_round_state
When M > max(machines)
i.e., M = abs(balance(0, j)) for a TBD j
Around j the machine balances must look like {… surplus at j1, 0, …, 0, deficit at j2, …}, with j >= j1 + 1 and j <= j2; otherwise abs(balance(0, j)) would not be the maximum. Therefore, we can definitely pass one dress from left to right at each of j1, …, j2 - 1, and thus M = abs(balance(0, j)) is reduced by 1 after this round
When M == max(machines)
assuming machines[k] = max(machines) for a TBD k
Claim A: balance(0, k) <= 0 and balance(k+1, n) <= 0. Proof: by contradiction. If balance(0, k) > 0 and balance(k+1, n) <= 0, then balance(0, k+1) > M, which contradicts M's definition as the max of abs(balance(0, i)) over all i
Claim B: balance(0, k) < 0 or balance(k+1, n) < 0 (possibly both at the same time). Proof: by contradiction. If both were >= 0, then balance(0, n) > 0 because machines[k] exceeds the target, which contradicts the fact that all machines eventually hold the same number of dresses
Claim C: when balance(0, k) < 0 and balance(k+1, n) = 0, then machines[k-1] < M. Proof: by contradiction, assume machines[k-1] = M. At least one dress will be moved from k to k - 1, and then abs(balance(0, k-1)) > abs(balance(0, k)) = M, which contradicts that M is also the max of abs(balance(0, i))
Claim D: when balance(0, k) = 0 and balance(k+1, n) < 0, then machines[k+1] < M Proof: similar to proof of claim C, just in a mirrored direction
Claim E: when balance(0, k) < 0 and balance(k+1, n) < 0, then machines[k-1] < M or machines[k+1] < M. Proof: by contradiction, combining the arguments of Claims C and D
With Claims C, D, and E, we can prove that when M == max(machines), there is always a move from the machine holding M dresses to a neighboring machine with fewer dresses, that is, M is reduced by 1 after this round
When M < max(machines)
TODO
Communication lessons from Gemini 2.5 Pro
Contextual Requests: Avoid Defensive or Overly Justifying Tone
- Too Long/Detailed: Goes into unnecessary background or explains things the receiver likely already knows.
- Focus on Personal Effort/Stress: Emphasizes how much work you’ve done or how stressed you are about the deadline (e.g., “…because I’ve been working on this all day and I really need to get it off my plate”).
- Implies Potential Problems: Phrases it in a way that suggests negative consequences if the task isn’t done, rather than the positive outcome if it is (e.g., “…because if you don’t, the whole project will be delayed”).
- Slightly Accusatory or Patronizing Language: Uses phrases like “As you should know…” or over-emphasizes obvious points.
- If your colleague is doing that
- Acknowledge to yourself that the speaker might not intend it negatively, even if it feels that way. This can help manage your reaction.
- Try to mentally filter past the “as I mentioned” preamble and focus on the actual point being made.
- In your own communication, use softer ways to refer back, like: “Just circling back to the point about X…” or “Building on our previous discussion…” or simply restating the point without the preamble.
- Examples of the contrast
- Example (Contrast): “I really need you to finalize the slides, please try to get them done by 3 PM. Because, as you know, the practice run is at 4 PM, and I can’t possibly merge the decks if I don’t have your part, and it took me ages to get my section done, so we really can’t be late with this.” (Sounds stressed, over-explains the obvious dependency, focuses on personal effort and potential failure).
- Informative/Collaborative: “Here’s the task, and here’s the simple, direct reason why it’s needed for our shared goal.”
- Defensive/Justifying: “Here’s the task, and here are all the reasons why you absolutely must do it, please understand it’s important, I’ve worked hard, and bad things might happen if you don’t.”
Preach on known knowledge and skills
- Instead of focusing on the what (the principles you know), try to understand why they might be bringing up this specific principle right now. Is there a subtle company focus? A recent challenge elsewhere in the org? Sometimes the timing or emphasis can offer clues about broader priorities, even if the content itself isn’t new to you.
- Acknowledge and Pivot: Briefly acknowledge the principle, then immediately connect it to a specific action or question. “Yes, that principle of [Principle Name] is definitely key. I’ve been thinking about how it applies specifically to the [Project/Task Name]. For example, with [Specific Situation], I’m considering [Action A] vs. [Action B]. From your perspective, which aligns better with that principle in this practical case?” This shows you understand the principle and are actively applying it, shifting the focus to actionable advice.
- Share Your Application (Briefly): Casually mention how you’re already using the principle. “That’s a great reminder about [Principle]. We actually leveraged that last week when we [brief example of your action] and it led to [positive outcome]. It really proved effective.” This subtly demonstrates your competence without being defensive.
Inaccurate view of your approach or focus
- Is the Counter-Question a Good Strategy?
- probably not as your only or primary response. Use it sparingly, perhaps when the assumption seems particularly off-base or when you have the time and energy for a potential dialogue, and be very mindful of delivering it with a genuinely curious and non-confrontational tone.
- Gentle, Direct Correction & Restatement: This is often the most efficient and professional approach for frequent, minor misinterpretations.
- Example: “Actually, I focus mainly on the significant deals like X and Y since they have the biggest impact. I don’t track all the smaller ones quite so closely.”
- Why it works: It corrects the record quickly and factually without making a big deal out of it or putting the other person on the defensive. It clarifies your actual behaviour.
- Ignore (Selective Use): If the comment is passing, doesn’t materially affect the work, and you don’t have the energy, sometimes just letting it go is easiest. Choose your battles. If it doesn’t matter, maybe it doesn’t warrant correction every single time.
- Seek Feedback (If You Suspect a Root Cause): If you’re genuinely puzzled why this impression persists, you could ask a trusted colleague or even your manager if they have insights into how you might be perceived or if your role’s focus is clear to others.
Blunt communication
- Assume Neutrality, Not Malice: Try to consciously assume they aren’t trying to be hurtful. They might just be very direct, lack social awareness, be stressed, or think this is an efficient way to communicate. Attributing their bluntness to their style rather than a personal attack can lessen the sting.
- Focus on the Task: Remind yourself that the interaction is about the work, not about you personally. Their feedback or question, however poorly phrased, likely relates to a task, project, or process.
- Listen for the ‘What’, Not the ‘How’: Train yourself to filter out the tone and the rhetorical phrasing. Ask yourself: “What is the underlying point, question, or concern here?” What information do they actually need, or what disagreement are they trying to express?
- Translate the Rhetorical Question: When they ask “Are you… [doing something they disagree with]?”, mentally translate it to: “I have a concern about [the topic]. Can you explain your approach or thinking?” or “I don’t agree with [the implied action]. Here’s why…”
- Ask Open-Ended Questions: Instead of getting defensive, respond to “Are you…?” with questions that require them to explain their actual point.
- “Could you tell me more about your concern regarding [the topic]?”
- “What specific aspect of [the topic] are you questioning?”
- “Help me understand what you think should be happening instead.”
- Paraphrase for Understanding (Content Only): Repeat back what you think their substantive point is, ignoring the blunt delivery.
- “So, if I understand correctly, you’re concerned about [the specific issue] because [potential reason]? Is that right?”
- “Okay, it sounds like you disagree with [method/decision]. What approach do you suggest?” This forces them to move beyond the rhetorical question into a constructive discussion.
- It’s Not (Necessarily) About You: Consciously remind yourself that their communication style is likely how they interact with many people, not just you. It reflects their habits, not your shortcomings.
- Separate Feedback from Judgment: Even if the delivery feels judgmental, focus only on the actionable feedback or the core question about the work itself.
Culture and strategy
- Culture is the Enabler (or Disabler) of Strategy: Think of culture as the soil and climate, and strategy as the seeds you plant. You can have the best seeds (strategy), but if the soil is toxic or the climate hostile (culture), nothing will grow. Your focus on cultivating a culture where people tackle necessary work, not just interesting work, is precisely about improving the “soil” so that any strategy can take root and flourish.
- Working on This Specific Culture Aspect Is Strategic: You aren’t choosing culture over strategy. You are identifying a critical cultural deficiency (avoidance of necessary work) that directly threatens the execution of any strategy the company adopts. Fixing this is a strategic imperative. It’s foundational work required for future strategic success.
- Reconciling the Feedback: The feedback to “focus more on company strategy” might mean:
- Ensure Alignment: Make sure your team understands the company strategy and how their work (including the less glamorous parts) contributes to it. Use the strategy as the “why” behind the cultural push.
- Strategic Context: Ensure your understanding and communication of the strategy are sharp. You need to articulate which strategic goals are hampered by the current cultural issue.
- Don’t Lose Sight of the Big Picture: While fixing the cultural plumbing, don’t forget to keep an eye on the overall direction (strategy). Ensure the cultural changes you’re making actively support the intended strategic direction.
- Frame the Culture Work Strategically: When you talk to your team about sharing the load and doing necessary work, explicitly link it to achieving specific strategic objectives. “We can’t hit Goal X if tasks Y and Z are consistently delayed because they aren’t seen as ‘interesting’. Achieving our strategy requires all hands on deck for all necessary tasks.”
- Use Strategy to Prioritize Culture Efforts: Which aspects of the strategy are most at risk due to this cultural trait? Focus your cultural interventions there first.
- Communicate Upwards: Explain to those who gave you the feedback how your focus on this specific cultural aspect is essential for enabling the company’s strategy. Frame it not as an either/or, but as necessary groundwork. “To ensure we can execute on Strategy A, I’m currently prioritizing building a team culture that ensures reliable completion of all required tasks, B and C, which are foundational to A.”
Directional listening
- Challenge Literal Interpretation: Actively pause when you receive input. Ask yourself: “Is this a direct order, a strong suggestion, a brainstormed idea, or an expression of a concern?”
- Instead of just accepting the suggestion, probe deeper:
- “Thanks for that suggestion. Can you help me understand the specific problem you’re hoping this will solve?”
- “What’s the main outcome you’re looking for with that approach?”
- “Could you walk me through your thinking on how this would work?”
- “That’s an interesting idea. How do you see it fitting with [Project Goal X] or [Constraint Y]?”
- “Is this something you feel is essential, or more of an idea to explore?” (Use this carefully, but it can directly ask about the ‘directionality’).
- Show you’ve heard and understood the intent or the value in their suggestion, even if you disagree with the specific implementation.
- “I understand why you’re suggesting [the idea]. It makes sense that you’re concerned about [the underlying issue].”
- “Thank you, I appreciate you bringing that perspective. You’re right that [aspect they highlighted] is important.”
- “I can see the potential benefit of [their suggestion] in terms of [positive outcome].”
- articulate why their suggestion might not be feasible or the best approach in its current form. Be factual and objective.
- Refer to Goals/Scope: “While that’s a great idea, it might take us outside the agreed-upon scope for this phase. Perhaps we can log it for future consideration?”
- Mention Constraints: “Implementing that would require [resource/time/budget] that we currently don’t have allocated.” or “Based on the current technical setup, that approach might introduce [specific risk or complexity].”
- Cite Data/Research: “Our user research indicated [finding X], which suggests we might want to prioritize [alternative approach].”
- Explain Trade-offs: “We could do [their suggestion], but it would mean we’d have to sacrifice [something else important]. My recommendation is to stick with [original plan] because it better aligns with our primary objective of [Goal Y].”
- Frame your reasoning around the project’s success or the team’s objectives.
- “We both want [desired outcome]. My concern with [their suggestion] is that it might impact [shared goal] in this way: […]. How can we achieve [the stakeholder’s underlying need] while protecting [the shared goal]?”
- Show you’re still trying to incorporate the spirit of their feedback.
- “While we might not be able to do [exact suggestion] due to [reason], what if we addressed your concern by doing [alternative solution]?”
- “Could we perhaps implement a simpler version of your idea that fits within our current constraints?”
- “I understand the need for [what their suggestion achieves]. Could we explore [another way] to get a similar result?”
- Since tone can be tricky, follow up verbal discussions with brief emails summarizing the input, your understanding, the decision made, and the reasoning. This confirms understanding without relying on tone interpretation.
- “Hi [Stakeholder], thanks for discussing [topic] today. Just to confirm my understanding, you suggested [their idea] to address [their concern]. After considering it against [project goals/constraints], we decided to proceed with [chosen approach] because [brief reason]. We appreciate your input on this.”
Evals for LLM systems
Why evals should get more attention
- Unsuccessful AI products almost always share a common root cause: a failure to create robust evaluation systems. Conversely, creating high quality evals is one of the most impactful things you can do.
- Evaluating quality (ex: tests) and debugging issues (ex: logging & inspecting data) don’t get enough attention, but they improve your product past demo stage.
- Addressing one failure mode led to the emergence of others, resembling a game of whack-a-mole.
- There was limited visibility into the AI system’s effectiveness across tasks beyond vibe checks.
- Prompts expanded into long and unwieldy forms, attempting to cover numerous edge cases and examples.
Unit Tests
- Unlike typical unit tests, you want to organize these assertions for use in places beyond unit tests, such as data cleaning and automatic retries (using the assertion error to course-correct) during model inference.
- The most effective way to think about unit tests is to break down the scope of your LLM into features and scenarios.
- For example, here is one such prompt Rechat uses to generate synthetic inputs for a feature that creates and retrieves contacts.
Write 50 different instructions that a real estate agent can give to his assistant to create contacts on his CRM. The contact details can include name, phone, email, partner name, birthday, tags, company, address and job.
For each of the instructions, you need to generate a second instruction which can be used to look up the created contact.
The results should be a JSON code block with only one string as the instruction, like the following:
[
["Create a contact for John (johndoe@apple.com)",
"What's the email address of John Smith?"]
]
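To make the feature/scenario breakdown above concrete, here is a minimal sketch of reusable assertions over a logged trace. The trace fields (steps, final_answer, input) and the specific checks are illustrative assumptions, not Rechat's actual harness.
import json
import re

def check_contact_creation(trace):
    """Assertion-style checks for the 'create contact' feature; returns a list of failures."""
    failures = []
    tool_calls = [s for s in trace["steps"] if s["type"] == "tool_call"]
    # Scenario: exactly one CRM tool call is expected.
    if len(tool_calls) != 1 or tool_calls[0]["name"] != "create_contact":
        failures.append("expected exactly one create_contact tool call")
    # Scenario: no placeholder text should leak into the user-facing answer.
    if re.search(r"\bUUID\b|\bnull\b", trace["final_answer"]):
        failures.append("placeholder text leaked into the final answer")
    # Scenario: tool arguments must be valid JSON and echo an email if one was given.
    if tool_calls:
        try:
            args = json.loads(tool_calls[0]["arguments"])
            if "@" in trace["input"] and "email" not in args:
                failures.append("email in the instruction but missing from tool arguments")
        except json.JSONDecodeError:
            failures.append("tool arguments are not valid JSON")
    return failures
The same checks can be reused at inference time inside a retry loop, feeding the failure messages back to the model to course-correct.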
Model Eval
- A two stage process where the model first answers the question, then we ask a different model to look at the response to check if it’s correct.
- A prerequisite to performing human and model-based eval is to log your traces
- Researchers are now employing LLMs like GPT-4 to evaluate the outputs of similar models. This recursive use of LLMs for evaluation underscores the continuous cycle of improvement and refinement in the field. Human and GPT-4 judges can reach above 80% agreement on the correctness and readability score. If the requirement is smaller or equal to 1 score difference, the agreement level can reach above 95%
- Critiques from a good evaluator model can be used to curate high-quality synthetic data,
Example
- asks the model to write a funny joke. The model then generates a completion.
- We then create a new input to the model to answer the question: “Is this following joke funny? First reason step by step, then answer yes or no”
- We finally consider the original completion correct if the new model completion ends with “yes”.
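A minimal sketch of that two-stage check; complete(prompt) is a hypothetical stand-in for whatever LLM call your stack uses.
def complete(prompt):
    """Stand-in for an LLM API call; replace with your provider's SDK."""
    raise NotImplementedError

def joke_is_funny(topic):
    # Stage 1: the model under test generates the completion.
    joke = complete(f"Write a funny joke about {topic}.")
    # Stage 2: a grader model reasons step by step and must end with yes or no.
    verdict = complete(
        "Is the following joke funny? First reason step by step, then answer yes or no.\n\n"
        f"Joke: {joke}"
    )
    # The original completion is considered correct if the grader's answer ends with "yes".
    return verdict.strip().lower().endswith("yes")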
Methodology
- The evaluation begins with the creation of a benchmark dataset, which should be as representative as possible of the data the LLM will encounter in a live environment.
- One way to speed up the process of building eval datasets, is to use GPT-4 to generate synthetic data
- Once we have our evaluation test set complete with ground truth and responses generated by our LLM application, the next step is to grade these results. This phase involves using a mix of LLM-assisted evaluation prompts and more integrated, hybrid approaches.
- Opt for the most robust model you can afford: Advanced reasoning capabilities are often required to effectively critique outputs. Your deployed system might need to have low latency for user experience. However, a slower and more powerful LLM might be needed for evaluating outputs effectively.
- The evaluating model might make errors and give you a false sense of confidence in your system.
Human Eval
- I often find that “correctness” is somewhat subjective, and you must align the model with a human.
- A translation could score high on BLEU for having words in technically correct order, but still miss the mark in conveying the right tone, style, or even meaning as intended in the original text.
- LLMs may exhibit a preference for answers generated by other LLMs over human-authored text, potentially leading to a skewed evaluation favoring machine-generated content.
Best practices
- May need to build a custom tool to show
- What tool (feature) & scenario was being evaluated.
- Whether the trace resulted from a synthetic input or a real user input.
- Filters to navigate between different tools and scenario combinations.
- Links to the CRM and trace logging system for the current record.
- Create positive and negative evals: Something cannot be logically true and untrue at the same time. Carefully design your evals to increase confidence.
- We noticed that many failures involved small mistakes in the final output of the LLM (format, content, etc.). We decided to make the final output editable by a human so that we could curate & fix data for fine-tuning.
- When starting, you should examine as much data as possible. I usually read traces generated from ALL test cases and user-generated traces at a minimum. You can never stop looking at data—no free lunch exists. However, you can sample your data more over time, lessening the burden.
- One signal you are writing good tests and assertions is when the model struggles to pass them - these failure modes become problems you can solve with techniques like fine-tuning later on.
References
- Getting Started with OpenAI Evals
- All about LLM Evals
- Your AI Product Needs Evals
Building agents
Use cases for agents
- Prioritize workflows that have previously resisted automation, especially where traditional methods encounter friction:
- Workflows involving nuanced context-sensitive decisions, for example refund approval in customer service workflows.
- Systems that have become unwieldy due to extensive and intricate rulesets, making updates costly or error-prone, for example performing vendor security reviews.
- Scenarios that involve interpreting natural language, extracting meaning from documents, or interacting with users conversationally, for example processing a home insurance claim.
- Analyze logs and diagnose
- Analyze the user's query and statement
- Use system logs, monitoring, and the knowledge base to analyze
- Generate test data based on dialogs
- Use natural language and the knowledge base.
- Build atomic tools to cover core promotion, cart, order, and fulfillment flows
- Customer support
- Support interactions naturally follow a conversation flow while requiring access to external information and actions;
- Tools can be integrated to pull customer data, order history, and knowledge base articles;
- Actions such as issuing refunds or updating tickets can be handled programmatically
- Success can be clearly measured through user-defined resolutions.
- Coding agents
- Code solutions are verifiable through automated tests;
- Agents can iterate on solutions using test results as feedback;
- The problem space is well-defined and structured;
Constraints
- The LLM will potentially operate for many turns, and you must have some level of trust in its decision-making. Agents’ autonomy makes them ideal for scaling tasks in trusted environments.
- The autonomous nature of agents means higher costs, and the potential for compounding errors. We recommend extensive testing in sandboxed environments, along with the appropriate guardrails.
Start simple
- When building applications with LLMs, we recommend finding the simplest solution possible, and only increasing complexity when needed. This might mean not building agentic systems at all. Agentic systems often trade latency and cost for better task performance
- We suggest that developers start by using LLM APIs directly: many patterns can be implemented in a few lines of code. If you do use a framework, ensure you understand the underlying code. Incorrect assumptions about what’s under the hood are a common source of customer error.
- For many applications optimizing single LLM calls with retrieval and in-context examples is usually enough.
- Build your agent prototype with the most capable model for every task to establish a performance baseline. From there, try swapping in smaller models to see if they still achieve acceptable results.
Best practices
- Prioritize transparency by explicitly showing the agent’s planning steps.
- While building our agent for SWE-bench, we actually spent more time optimizing our tools than the overall prompt. For example, we found that the model would make mistakes with tools using relative filepaths after the agent had moved out of the root directory. To fix this, we changed the tool to always require absolute filepaths—and we found that the model used this method flawlessly.
Instructions
- Give the model enough tokens to “think” before it writes itself into a corner.
- Keep the format close to what the model has seen naturally occurring in text on the internet.
- Make sure there’s no formatting “overhead” such as having to keep an accurate count of thousands of lines of code, or string-escaping any code it writes.
- Rather than maintaining numerous individual prompts for distinct use cases, use a single flexible base prompt that accepts policy variables.
You are a call center agent. You are interacting with {user_first_name}, who has been a member for {user_tenure}. The user's most common complaints are about {user_complaint_categories}. Answer any questions the user may have
Routine
- Make sure every step in your routine corresponds to a specific action or output. For example, a step might instruct the agent to ask the user for their order number or to call an API to retrieve account details. Being explicit about the action (and even the wording of a user-facing message) leaves less room for errors in interpretation.
- Real-world interactions often create decision points such as how to proceed when a user provides incomplete information or asks an unexpected question. A robust routine anticipates common variations and includes instructions on how to handle them with conditional steps or branches such as an alternative step if a required piece of info is missing.
- Generate instructions from routine
You are an expert in writing instructions for an LLM agent. Convert the following help center document into a clear set of instructions, written in a numbered list. The document will be a policy followed by an LLM. Ensure that there is no ambiguity, and that the instructions are written as directions for an agent. The help center document to convert is the following
Workflows
- Prompt chaining: trade off latency for higher accuracy by making each LLM call an easier task (a minimal sketch follows this list)
- Generating Marketing copy, then translating it into a different language.
- Writing an outline of a document, checking that the outline meets certain criteria, then writing the document based on the outline.
- Routing
- Directing different types of customer service queries (general questions, refund requests, technical support) into different downstream processes, prompts, and tools.
- Routing easy/common questions to smaller models like Claude 3.5 Haiku and hard/unusual questions to more capable models like Claude 3.5 Sonnet to optimize cost and speed.
- Parallelization
- Implementing guardrails where one model instance processes user queries while another screens them for inappropriate content or requests. This tends to perform better than having the same LLM call handle both guardrails and the core response.
- Automating evals for evaluating LLM performance, where each LLM call evaluates a different aspect of the model’s performance on a given prompt.
- Reviewing a piece of code for vulnerabilities, where several different prompts review and flag the code if they find a problem. Evaluating whether a given piece of content is inappropriate, with multiple prompts evaluating different aspects or requiring different vote thresholds to balance false positives and negatives.
- Orchestrator-workers: complex tasks where you can’t predict the subtasks needed (in coding, for example, the number of files that need to be changed and the nature of the change in each file likely depend on the task)
- Coding products that make complex changes to multiple files each time.
- Search tasks that involve gathering and analyzing information from multiple sources for possible relevant information.
- Evaluator-optimizer
- Literary translation where there are nuances that the translator LLM might not capture initially, but where an evaluator LLM can provide useful critiques.
- Complex search tasks that require multiple rounds of searching and analysis to gather comprehensive information, where the evaluator decides whether further searches are warranted.
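As a concrete illustration of the first pattern above (prompt chaining), here is a sketch of the outline-then-write chain with a cheap programmatic gate between the two calls; complete is again a hypothetical stand-in for an LLM call.
def complete(prompt):
    """Stand-in for an LLM API call."""
    raise NotImplementedError

def write_document(topic, max_attempts=3):
    outline = ""
    for _ in range(max_attempts):
        # Call 1: produce an outline (an easier task than writing the whole document).
        outline = complete(f"Write a 5-bullet outline for a document about {topic}.")
        # Gate: a deterministic check between the chained calls.
        if outline.count("-") >= 5 and len(outline) < 2000:
            break
    # Call 2: write the document conditioned on the (checked) outline.
    return complete(f"Write the document about {topic}, following this outline:\n{outline}")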
Guardrails
- Often implemented as parallelization workflows. The primary agent proactively generates outputs while guardrails run concurrently, triggering exceptions if constraints are breached.
- Ensures agent responses stay within the intended scope by flagging off-topic queries. For example, “How tall is the Empire State Building?” is an off-topic user input and would be flagged as irrelevant.
- Detects unsafe inputs ( jailbreaks or prompt injections) that attempt to exploit system vulnerabilities.
- Prevents unnecessary exposure of personally identifiable information (PII) by vetting model output for any potential PII.
- Flags harmful or inappropriate inputs (hate speech, harassment, violence) to maintain safe, respectful interactions.
- Assess the risk of each tool available to your agent by assigning a rating—low, medium, or high—based on factors like read-only vs. write access, reversibility, required account permissions, and financial impact. Use these risk ratings to trigger automated actions, such as pausing for guardrail checks before executing high-risk functions or escalating to a human if needed.
- Simple deterministic measures (blocklists, input length limits, regex filters) to prevent known threats like prohibited terms or SQL injections.
- Ensures responses align with brand values via prompt engineering and content checks, preventing outputs that could harm your brand’s integrity.
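A minimal sketch of the deterministic layer listed above (blocklist, input length limit, simple regex); the specific patterns and thresholds are illustrative assumptions.
import re

BLOCKLIST = {"ignore previous instructions", "drop table"}
MAX_INPUT_CHARS = 4000
SSN_LIKE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # crude PII pattern for illustration

def deterministic_guardrail(user_input):
    """Cheap checks that can run before (or alongside) the LLM call. Returns (allowed, reason)."""
    lowered = user_input.lower()
    if len(user_input) > MAX_INPUT_CHARS:
        return False, "input exceeds length limit"
    if any(term in lowered for term in BLOCKLIST):
        return False, "blocklisted term detected"
    if SSN_LIKE.search(user_input):
        return False, "possible PII detected"
    return True, "ok"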
References
- A practical guide for building agents
- Building effective agents
How to Recognize Hidden Feedback
Why Vital Feedback Stays Hidden
- People sometimes incorrectly assume that they’ve given the feedback more explicitly than they really have (“I made the same correction three times. Isn’t it obvious that I’m displeased with this?”)
- People with critiques or suggestions may not even consciously realize that they have important feedback to give. They may be rushing quickly between tasks, out of touch with their actual concerns, lacking the words to express their intuitions, or only vaguely aware of the broader message that would be important to convey.
How to Detect and Learn from Hidden Feedback
- When stakeholders repeatedly return to ostensibly minor suggestions or questions, this may mask broader, unstated concerns about capability, readiness, or performance. These micro-moments typically aggregate into unspoken feedback about perceptions of leadership blind spots or weaknesses.
- “I’ve noticed you’re particularly interested in our timelines. Is there any broader concern about our execution speed, or anything else? If so, I’d love to understand it so we can step up and perform at our best.”
- When people suddenly engage (or send in proxies to engage) in decisions that would not typically involve them, or when they request more reviews than usual, it can indicate eroding confidence.
- When people decrease their involvement or participation without explaining why, it can be a signal that they either don’t want to engage or have deprioritized the work—both of which may indicate growing concerns about leadership effectiveness or strategic alignment.
- Why send delegates rather than attend the meetings themselves?
- “I’ve noticed a pattern in attendance at my senior talent meetings. I’m curious if there’s something about our process that isn’t working for you, or if you have suggestions for how to make them better.”
Make it safe for others to tell you the truth
- “As I look ahead, I’m trying to grow my leadership, and I value your perspective. What blind spots should I be aware of that I might not be seeing?”
- “I’d value your perspective on how my message landed in today’s meeting. What signals or reactions did you notice? And how did it land with you?”
- “Given your role, you might see things I’m missing. What patterns or concerns should I be aware of?”
Reframe as a request for strategic counsel if the barrier to direct feedback is too high.
- “As part of my professional development, I’m always looking to improve. What’s one thing you think I could be doing differently to raise my game?”
- “Based on your experience and vantage point, how might you approach this challenge differently than I am?”
- “If you were mentoring someone facing similar circumstances, what guidance would you offer?”
Articulate the pattern you’re seeing and more explicitly ask if there might be something important
- “I’ve noticed this topic coming up in several conversations. Is there a broader concern we should be discussing?”
- “This seems to be a recurring theme in our interactions. What underlying issues could we be addressing?”
- “I’m seeing a pattern here that might signal something important. Could you help me understand if there’s more to explore?”
Listen to learn
- “I’d love your perspective on {} . Can you give it some thought, and let’s discuss when we meet next?”
- Don’t assume the first thing they say is the only or even the most important feedback they have for you.
- “That’s interesting. Can you say more about that?”
- “Very helpful. Can you share an example?”
- “Great to know that. What else?”
From Gemini 2.5 Pro
- Consistently applying these techniques—actively seeking cues, creating safety, deep listening, paraphrasing, following up—requires considerable time and emotional energy from the leader, which could be a challenge in demanding roles.
- Instead of broad invitations, ask specific questions like, “What’s one thing you see me doing that gets in my way?” or “What’s one thing I could do differently?”.
- Faster methods are often better suited for gathering more direct, already-formed feedback or making delivery more efficient, but they might miss the underlying issues that aren’t being explicitly stated. The trade-off is often between speed and the depth or nuance of the feedback obtained.
Prompt Engineering
Mostly notes on the whitepaper from Google
- Prompt engineering is an iterative process
- Document various prompt attempts
- When zero shot does not work, try one shot and few shot prompting, i.e., single example and multiple examples. Rule of thumb is 3-5 examples. Include edge cases if the input could be diverse
- Prompting for a JSON format output forces the model to create a structure and limit hallucinations.
- Growing research suggests that focusing on positive instructions in prompting can be more effective than relying heavily on constraints. This approach aligns with how humans prefer positive instructions over lists of what not to do.
- Placing the detailed Context before the specific Task/Ask is often preferred as it helps the LLM build a complete understanding before acting.
System prompting
- sets the overall context and purpose for the language model. It defines the ‘big picture’ of what the model should be doing, like translating a language, classifying a review etc.
- The name ‘system prompt’ actually stands for ‘providing an additional task to the system’
Classify movie reviews as positive, neutral or negative. Return valid JSON.
Review:...
Schema:...
JSON Response:...
Contextual prompting
Provides immediate, task-specific information to guide the response. It’s highly specific to the current task or input, which is dynamic.
Context: You are writing for a blog about retro 80's arcade video games.
Suggest 3 topics to write an article about with a few lines of description of what this article should contain.
Role prompting
There can be considerable overlap between system, contextual, and role prompting. E.g. a prompt that assigns a role to the system, can also have a context.
I want you to act as a travel guide.
I will write to you about my location and you will suggest 3 places to visit near me.
In some cases, I will also give you the type of places I will visit.
My suggestion: "I am in Amsterdam and I want to visit only museums."
Travel Suggestions:
Step-back prompting
Prompting the LLM to first consider a general question related to the specific task at hand, and then feeding the answer to that general question into a subsequent prompt for the specific task. This ‘step back’ allows the LLM to activate relevant background knowledge and reasoning processes before attempting to solve the specific problem.
Based on popular first-person shooter action games, what are 5 fictional key settings that contribute to a challenging and engaging level storyline in a first-person shooter video game?
Context: 5 engaging themes for a first person shooter video game:
...
Take one of the themes and write a one paragraph storyline for a new level of a first-person shooter video game that is challenging and engaging.
CoT
- Generally, any task that can be solved by ‘talking through’ it is a good candidate for chain of thought. If you can explain the steps to solve the problem, try chain of thought.
When I was 3 years old, my partner was 3 times my age. Now, I am 20 years old. How old is my partner? Let's think step by step.
- You can combine it with few-shot prompting to get better results on more complex tasks that require reasoning before responding as it’s a challenge with a zero-shot chain of thought.
Q: When my brother was 2 years old, I was double his age. Now I am 40 years old. How old is my brother? Let's think step by step.
A: When my brother was 2 years, I was 2 * 2 = 4 years old. That's an age difference of 2 years and I am older. Now I am 40 years old, so my brother is 40 - 2 = 38 years old. The answer is 38.
Q: When I was 3 years old, my partner was 3 times my age. Now, I am 20 years old. How old is my partner? Let's think step by step.
A:
Self-consistency
- Generating diverse reasoning paths: The LLM is provided with the same prompt multiple times. A high temperature setting encourages the model to generate different reasoning paths and perspectives on the problem.
- Extract the answer from each generated response.
- Choose the most common answer.
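A sketch of the majority-vote loop; complete and extract_answer are hypothetical helpers (the latter would parse the final answer out of the reasoning text).
from collections import Counter

def complete(prompt, temperature=0.7):
    """Stand-in for an LLM API call with a temperature knob."""
    raise NotImplementedError

def extract_answer(text):
    """Stand-in for pulling the final answer out of a chain-of-thought response."""
    raise NotImplementedError

def self_consistent_answer(question, n_samples=5):
    votes = Counter()
    for _ in range(n_samples):
        # A high temperature encourages diverse reasoning paths on the same prompt.
        reasoning = complete(question + "\nLet's think step by step.", temperature=0.9)
        votes[extract_answer(reasoning)] += 1
    # The most common final answer wins.
    return votes.most_common(1)[0][0]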
Prompts for explaining code
Explain to me the below Bash code:
{code in markdown}
Prompts for debugging and reviewing code
The below Python code gives an error:
{log in markdown}
Debug what's wrong and explain how I can improve the code.
{code in markdown}
Use variables in prompts
VARIABLES
{city} = "Amsterdam"
PROMPT
You are a travel guide. Tell me a fact about {city}
How to make papers easier to review
The same ideas also apply to design docs
Common misconceptions
- Author: Record the process of how we reached the solution by going through complex subproblems. The reviewer should be impressed by the smart process
- Reviewer: Why one problem after another? Which is the main problem to solve? Which parts are common practice, and which are the author's own?
- Author: Spend two pages on thinking through the problem to transform to a different problem. Reviewer should appreciate this challenging mind game
- Reviewer: No intro and conclusion; what is the purpose of this thought process?
- Author: Added innovation to existing work to reach SOTA. The reviewer should appreciate the experimental results
- Reviewer: Yet another AAA + BBB. Then let me check the effect. What does this result mean? Is it a significant improvement?
Don’t record review process
- The reviewer wants to quickly get your new contribution. The goal is the final state, not the research progress.
Bad Example
- To solve problem 1, I have solution A, but A’s result is not ideal because of problem 2, and 2 is solved by my solution B.
Reviewer’s PoV
- how to prove 2 is the root cause of bad results of A
- B seems to have little difference from A
- A has some effect, how to prove this is because it solved problem 1
Author’s confusion
- The final result is B, plus I built A too
- Why do I need to prove the relationship between B and A?
Reviewer’s reaction
- You didn’t emphasize A
- A didn’t solve your problem 2, which you emphasized
- If A is not a major contribution, then B has little value
How to fix with a better framework
Flow: Problem -> theory behind it -> solution and implementation -> experiments to verify. This flow means we should avoid “based on” and “A+B”
- A new solution to an important problem
- A new framework and a common trick
- Prove that we need both. Without this trick, the result will be worse, because of an interesting underlying problem
- This trick can be applied to other methods
Reviewer’s PoV
- I can see what is new and I just need to decide if they are technically sound and if effects are significant
Readability
- Coherence comes from the logic itself, not from connecting words
- Explain the effect right after the name (we propose XXX, which is a two-layer MLP). Otherwise, the reader may be frustrated when they discover it is so simple after many paragraphs
- When you are expanding the story and the reviewer is deciding whether it makes sense, do not reference graphs that appear a few pages later, because a paper is linear storytelling.
- Around a table you should explain the key points with high information density
- If you want to emphasize Table 5, the sentence analyzing the result should be on the same page as Table 5, because readers may not jump between your text and the graph to find the related analysis
- Don't expect readers to compare results themselves. Compare them yourself; it is OK to repeat the comparison of certain results.
Rebuttal
- Reviewer demands optional data
- Add the data, also list influential parallel work and show that data is in general optional.
- Don’t give the AC the impression that you are preaching; e.g., ask them to weigh the quality of the review comments rather than asking to exclude the reviewer's scores.
- Be explicit about your action items to address comments
Story telling
Motivation
- Define the problem: Why is it a common problem?
- For example, to solve congestion, we need to control the send speed at the endpoint level, which means we need (1) to sense the network's congestion, and (2) an algorithm to control the send speed to reduce congestion
- What are problems with current algorithms?
- What is your key idea and what makes your key idea promising.
- Adjust granularity based on the audience
Contribution
- Insights, not a manual. Why does your design work? Prove that all parts of your design are needed, not just some of them.
- Can your insights be applied to other scenarios?
References
George Polya: Let's teach guessing
- all-important reminder that we must test our guesses
- Guess and then prove. All great discoveries are done this way
- Finished math consists of proofs, but math in the making/discovery consists of guesses
How many parts do 5 planes cut space into?
- Students started with random guesses
- Start with easier problems which could help
- A student asked if planes can be parallel, and prompted the clarification that planes are random
- 3 planes give 8
- Think of extreme cases in reasonable guessing
- Guess 4 planes give 16 based on observation of pattern, and then generalize, i.e., induction
- Need to test the guess/prediction
- Use analogy to help guess: what about lines in a plane instead of planes in space? 3 lines confine a triangle, just as 4 planes confine a tetrahedron
- It turns out 3 lines give only 7 regions, which suggests the guess of 16 in the plane version is incorrect: we count 1 triangle + 3 side-adjacent + 3 vertex-adjacent regions. In the planes-in-space case, 4 planes give 1 finite region + 4 face-adjacent + 6 edge-adjacent + 4 vertex-adjacent = 15
- Analogy: a line divided by points; we guessed by induction when we saw the pattern in lower dimensions
- Proving the 5-plane case directly is hard, so we try the 4-line case first. After verifying that 4 lines give 11 regions, we are more confident in the guess of 26
- Proving the answer is out of scope for this lecture
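For a quick check of the guesses above, the standard count for n cuts in general position in d dimensions is $\sum_{i=0}^{d} \binom{n}{i}$ (this is outside the lecture's scope, but it confirms the numbers).
from math import comb

def regions(n_cuts, dim):
    # Regions created by n cuts in general position in `dim` dimensions.
    return sum(comb(n_cuts, i) for i in range(dim + 1))

assert regions(5, 1) == 6    # 5 points cut a line into 6 pieces
assert regions(3, 2) == 7    # 3 lines give 7, not 8
assert regions(4, 2) == 11   # 4 lines give 11
assert regions(3, 3) == 8    # 3 planes give 8
assert regions(4, 3) == 15   # 4 planes give 15, not 16
assert regions(5, 3) == 26   # the guess of 26 checks out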
The bitter lesson
- General methods that leverage computation are the most effective by a large margin. The ultimate reason is generalized Moore’s law, i.e., continued exponentially falling cost per unit of computation.
- Most AI research has been conducted as if the computation available to the agent were constant (in which case leveraging human knowledge would be one of the only ways to improve performance) but, over a slightly longer time than a typical research project, massively more computation inevitably becomes available.
- General methods and human knowledge often run counter to each other
- they are contending for research time
- psychological commitments
- the human-knowledge approach tends to complicate methods in ways that make them less suited to taking advantage of general methods leveraging computation
Examples
- In computer chess, the methods that beat the world champion in 1997 used massive, deep search. When this simpler, search-based approach with special hardware and software proved vastly more effective, the researchers who had pursued human-knowledge-based methods were not good losers.
- In computer Go, as in computer chess, researchers’ initial effort was directed towards utilizing human understanding (so that less search was needed) and only much later was much greater success had by embracing search and learning.
- Learning by self play to learn a value function (as it was in many other games and even in chess, although learning did not play a big role in the 1997 program that first beat a world champion). Learning by self play, and learning in general, is like search in that it enables massive computation to be brought to bear. Search and learning are the two most important classes of techniques for utilizing massive amounts of computation in AI research.
- Speech recognition. The statistical methods won out over the human-knowledge-based methods. This led to a major change in all of natural language processing, gradually over decades, where statistics and computation came to dominate the field. The recent rise of deep learning in speech recognition is the most recent step in this consistent direction. Deep learning methods rely even less on human knowledge, and use even more computation, together with learning on huge training sets, to produce dramatically better speech recognition systems
- Computer vision. Early methods conceived of vision as searching for edges, or generalized cylinders, or in terms of SIFT features. But today all this is discarded. Modern deep-learning neural networks use only the notions of convolution and certain kinds of invariances, and perform much better.
- The two methods that seem to scale arbitrarily with increased computation are search and learning.
- we should stop trying to find simple ways to think about the contents of minds, such as simple ways to think about space, objects, multiple agents, or symmetries. All these are part of the arbitrary, intrinsically-complex, outside world. They are not what should be built in, as their complexity is endless; instead we should build in only the meta-methods that can find and capture this arbitrary complexity. Essential to these methods is that they can find good approximations, but the search for them should be by our methods, not by us. We want AI agents that can discover like we can, not which contain what we have discovered. Building in our discoveries only makes it harder to see how the discovering process can be done.
How I would implement row access policy and aggregation policy in SF
Row access policy
CREATE OR REPLACE ROW ACCESS POLICY security.sales_policy
AS (sales_region varchar) RETURNS BOOLEAN ->
'sales_executive_role' = CURRENT_ROLE()
OR EXISTS (
SELECT 1 FROM salesmanagerregions
WHERE sales_manager = CURRENT_ROLE()
AND region = sales_region
)
;
Snowflake determines whether a row access policy is set on a database object.
- SQL is parsed into AST
- At each FROM node with table or view, look up the policy with the (role, table name) tuple
Snowflake creates a dynamic secure view (i.e. a secure inline view) of the database object.
- Replace the FROM table node with a subtree that selects everything from the table and filters rows through the policy predicate (the secure inline view)
The values of the columns specified in the ALTER TABLE or ALTER VIEW command (i.e., when adding a row access policy to a table or view) are bound to the corresponding parameters in the policy, and the policy expression is evaluated.
- Execute the rewritten SQL
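As a toy, string-level illustration of that rewrite step (not Snowflake's actual implementation), wrapping the referenced table in a secure inline view might look like this:
def wrap_with_row_policy(table, policy_predicate):
    """Replace a bare table reference with a dynamic secure inline view."""
    return f"(SELECT * FROM {table} WHERE {policy_predicate}) AS {table}"

# SELECT * FROM sales  ->  SELECT * FROM (SELECT * FROM sales WHERE <policy>) AS sales
secure_view = wrap_with_row_policy(
    "sales",
    "'sales_executive_role' = CURRENT_ROLE() OR EXISTS ("
    "SELECT 1 FROM salesmanagerregions "
    "WHERE sales_manager = CURRENT_ROLE() AND region = sales.sales_region)",
)
print(f"SELECT * FROM {secure_view}")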
Aggregation policies
Aggregation policies can be used with or without an entity key. When aggregation policies are used without an entity key, they protect the privacy of individual rows in the data set (that is, row-level privacy). If you use an aggregation policy with an entity key, it protects the privacy of an entity, even if information about that entity appears in multiple rows (that is, entity-level privacy).
Entity-level privacy strengthens the privacy protections provided by aggregation policies. With entity-level privacy, Snowflake can ensure that an aggregation group contains a certain number of entities, not just a certain number of rows.
CREATE AGGREGATION POLICY my_agg_policy
AS () RETURNS AGGREGATION_CONSTRAINT ->
CASE
WHEN CURRENT_ROLE() = 'ADMIN'
THEN NO_AGGREGATION_CONSTRAINT()
ELSE AGGREGATION_CONSTRAINT(MIN_GROUP_SIZE => 3)
END;
ALTER TABLE t1 SET AGGREGATION POLICY my_agg_policy;
- SQL is parsed into an AST; when an aggregation node is discovered, replace it with conditional grouping
SELECT state, AVG(elevation) AS avg_elevation
FROM t1
GROUP BY state;
becomes
--this query is not valid for Postgres
SELECT
CASE
WHEN COUNT(*) >= 3 THEN state
ELSE 'Other'
END AS state_group,
AVG(elevation) AS avg_elevation
FROM t1
GROUP BY
CASE
WHEN COUNT(*) >= 3 THEN state
ELSE 'Other'
END;
-- one Postgres version
SELECT
state,
AVG(elevation) AS avg_elevation
FROM
(
SELECT CASE
WHEN policy.state IS NULL THEN NULL --filtered out
ELSE policy.state
END AS state,
elevation
FROM t1
LEFT JOIN (
SELECT state,
COUNT(*) as count
FROM t1
GROUP BY state
HAVING COUNT(*) >= 3
) policy ON t1.state = policy.state
) t1
GROUP BY
state;
Lecture 2 & 3
Lecture 2 - ML Refresher / Softmax Regression
Supervised learning
- classification (e.g., spam detection, image recognition) and regression tasks (e.g., predicting house prices, stock forecasting).
- Algorithms include decision trees, support vector machines (SVM), neural networks, and linear regression.
- Requires a large amount of accurately labeled data for effective training.
Unsupervised learning
- Useful for clustering (e.g., customer segmentation), association (e.g., market basket analysis), and dimensionality reduction (e.g., principal component analysis, or PCA)
- Algorithms include k-means clustering, hierarchical clustering, and autoencoders.
- Can work with unlabeled data, which is often more readily available and cost-effective to collect.
Multi-class classification setting
Matrix batch notation
- $X \in \mathbb{R}^{m \times n}$
- Matrix ops are more efficient than many vector ops
Loss function: classification error
- $l_{\text{err}}(h(x), y) = 0 \quad \text{if} \quad \text{argmax}_i \, h_i(x) = y$, and $1$ otherwise
- bad for optimization, because it is not differentiable
Loss function: softmax/cross-entropy loss
- Softmax(h(x)) = normalize(exp(h(x)))
- $l_{ce}(h(x), y) = -h_y(x) + \log \sum_{j=1}^k \exp(h_j(x))$, applied to a linear hypothesis class
- Find $\theta$ that minimizes the average loss over the training set
SGD
$\alpha$ is the step size / learning rate. Sample a minibatch of size $B$ and update $\theta$ by $-\frac{\alpha}{B}$ times the sum of the per-example gradients
Gradient of cross-entropy (batch form): $\nabla_\Theta l_{ce}(X\Theta, y) = \frac{1}{B} X^T (Z - I_y)$, where $Z = \text{normalize}(\exp(X\Theta))$ row-wise and $I_y$ is the matrix of one-hot labels
Hack to compute the loss's derivative: pretend everything is a vector, then rearrange and transpose matrices/vectors to make the sizes work. Make sure to check the numerical answers. This approach works for the matrix batch form of the loss
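A numpy sketch of the pieces above: the batch cross-entropy loss, its gradient $\frac{1}{B} X^T (Z - I_y)$, and one minibatch SGD step (shapes assumed: X is B x n, Theta is n x k, y holds integer class labels).
import numpy as np

def softmax_ce_loss(Theta, X, y):
    logits = X @ Theta
    logits = logits - logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(y.size), y].mean()

def sgd_step(Theta, X_batch, y_batch, alpha):
    B, k = X_batch.shape[0], Theta.shape[1]
    logits = X_batch @ Theta
    Z = np.exp(logits - logits.max(axis=1, keepdims=True))
    Z /= Z.sum(axis=1, keepdims=True)                         # softmax probabilities, (B, k)
    I_y = np.zeros((B, k))
    I_y[np.arange(B), y_batch] = 1                            # one-hot labels
    grad = X_batch.T @ (Z - I_y) / B                          # gradient of the average loss
    return Theta - alpha * grad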
Lecture 3 - “Manual” Neural Networks
Need to create a feature function $\phi$ so that the classifier is more powerful than the purely linear one.
- Traditional: manually extract features
- Neural network: learn features from data, automated
$\theta = \{W_1, W_2\}$
A two layer network can approximate any smooth function well
Deep networks are more efficient than a single layer at representing some functions that a single layer cannot learn compactly, e.g., the parity function. Empirically, deep networks work better when the number of parameters is fixed.
Backpropagation
Two-layer network: gradient w.r.t. $W_2$, then w.r.t. $W_1$
General form
Forward path: compute Z_i
Backward path: compute G_i
vector-Jacobian product
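A numpy sketch of the forward-compute-$Z_i$ / backward-compute-$G_i$ pattern for a two-layer ReLU network with softmax loss (my notation; weight shapes assumed as $W_1: n \times d$, $W_2: d \times k$).
import numpy as np

def two_layer_grads(W1, W2, X, y):
    """Manual backprop for softmax loss on ReLU(X @ W1) @ W2."""
    B, k = X.shape[0], W2.shape[1]
    # Forward path: compute and cache the Z_i.
    Z1 = np.maximum(X @ W1, 0)                        # hidden activations, shape (B, d)
    logits = Z1 @ W2                                  # shape (B, k)
    S = np.exp(logits - logits.max(axis=1, keepdims=True))
    S /= S.sum(axis=1, keepdims=True)                 # softmax probabilities
    I_y = np.zeros((B, k))
    I_y[np.arange(B), y] = 1
    # Backward path: compute the G_i via vector-Jacobian products.
    G2 = (S - I_y) / B                                # dLoss/dlogits
    grad_W2 = Z1.T @ G2
    G1 = (G2 @ W2.T) * (Z1 > 0)                       # propagate through the ReLU
    grad_W1 = X.T @ G1
    return grad_W1, grad_W2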
VII.3 Backpropagation and the Chain Rule
system of equations to solve for the weights that minimize L: The partial derivatives of L with respect to the weights x should be zero.
Backpropagation is a method to compute derivatives quickly, using the chain rule
$ dF/dx = \frac{d}{dx} \left( F_2(F_1(x)) \right) = \frac{dF_2}{dF_1} \left( F_1(x) \right) \frac{dF_1}{dx} \left( x \right) $
For a standard net, the steps in the chain rule can correspond to layers in the neural net.
Backpropagation has been discovered many times. Another name is automatic differentiation (AD). You will see that the steps can be arranged in two basic ways: forward-mode and backward-mode. The right choice of mode can make a large difference in the cost (a factor of thousands). That choice depends on whether you have many functions F depending on a few inputs, or few functions F depending on many inputs.
Deep learning has basically one loss function depending on many weights. The right choice is “backward-mode AD”. This is what we call backpropagation.
Derivatives of the Learning Function F
Find $dw_i/db_j$ and $dw_i/dA_{jk}$ for all components of $w = b + Av$. When $j$ is different from $i$, the $i$th output $w_i$ is not affected by $b_j$ or $A_{jk}$
Columns of I are 1-hot vectors.
$dw_i/db_j = \delta_{ij}$
$dw_i/dA_{jk} = \delta_{ij} v_k$
Combining Weights b and A into M
It is often convenient to combine the bias vector b and the matrix A into one matrix M of weights.
For each layer of the neural net, the top entry (the zeroth entry) is now fixed at 1. After multiplying that layer by M, the zeroth component on the next layer is still 1. Then ReLU(1) = 1 preserves that entry in every hidden layer.
$dw_i/dM_{jk} = \sigma _{ij}v_k$
Derivatives for Hidden Layers
$dw/dA_1 = d(A_2R(b_1 + A_1v_0))/dA_1 = A_2R’(b_1 + A_1v_0)d(b_1 + A_1v_0)/dA_1$
Automatic backpropagation goes backward from w to v
ReLU is a limiting case of an S-shaped sigmoid function
Delta function: $\delta (x) = d^2R/dx^2$. The delta function represents an impulse. It models a finite change in an infinitesimal time. It is physically impossible but mathematically convenient.
With ReLU, a neuron could stay at zero through all the steps of deep learning. This "dying ReLU" can be avoided in several ways; it is generally not a major problem. One way that firmly avoids it is to change to a Leaky ReLU with a nonzero gradient
Computational Graphs
Forward mode requires a new graph for each input x_i, to compute dF/dx_i
The reverse mode starts with the output F. It computes the derivatives with respect to both inputs. The computations go backward through the graph.
Instead of N forward graphs from N inputs, we will have one backward graph from one output. It finds the derivative of F with respect to every node. It starts with dF/dF = 1
The reverse mode finds all derivatives dF/dx_i by following all chains backward from output to input. Those chains all appear as paths on one graph, not as separate chain rules for exponentially many possible paths. This is the success of reverse mode.
The derivatives of F with respect to the matrices A (and the bias vectors b) are easiest for the last matrix $A_L$ in $A_L v_{L-1}$: $dF_i/d(A_L)_{jk} = \delta_{ij} v_k$. Next is the derivative of $A_L \, \text{ReLU}(A_{L-1} v_{L-2})$ with respect to $A_{L-1}$
Adjoint Methods
Hyperparameters
The stepsize/learning rate $s_k$ is the first and foremost hyperparameter in gradient descent, which may be accelerated (by momentum), adaptive (ADAM), or stochastic with a random minibatch of training data at each step $k$.
line search
No method is perfect. If $s_k$ is too small, gradient descent takes too long. If $s_k$ is too large, gradient descent will jump around the minimizing $x^*$.
After reaching weights x that are close to minimizing the loss function L(x, v), we may want to bring in new v's from a validation set. This is not yet production mode. The purpose of cross-validation is to confirm that the computed weights x are capable of producing accurate outputs from new data.
Cross-validation
A first step in cross-validation is to divide the available data into K subsets. If K = 2, these would essentially be the training set and test set, but we are usually aiming for more information from smaller sets before working with a big test set. K-fold cross-validation uses each of the K subsets separately as a test set. In every trial, the other K - 1 subsets form the training set. We are reworking the same data (moderate size) to learn more than one optimization can teach us.
Batch Normalization of Each Layer
As training goes forward, the mean and variance of the original population can change at every layer of the network. This change in the distribution of inputs is “covariate shift”. We often have to adjust the stepsize and other hyperparameters, due to this shift in the statistics of layers. A good plan is to normalize the input to each layer.
Normalization makes the training safer and faster. The need for dropout often disappears. Fewer iterations can now give more accurate weights. And the cost can be very moderate. Often we just train two additional parameters on each layer.
The key point is to normalize the inputs y_i to each new layer. What was good for the original batch of vectors (at layer zero) is also good for the inputs to each hidden layer.
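A minimal sketch of that normalization (gamma and beta stand for the two additional trained parameters; the names are mine):

import numpy as np

def batch_norm(Y, gamma, beta, eps=1e-5):
    # Y: (batch, features) inputs arriving at one layer
    mean = Y.mean(axis=0)
    var = Y.var(axis=0)
    Y_hat = (Y - mean) / np.sqrt(var + eps)      # zero mean, unit variance per feature
    return gamma * Y_hat + beta                  # learned rescale and shift

Y = np.random.randn(32, 8) * 3.0 + 5.0
out = batch_norm(Y, gamma=np.ones(8), beta=np.zeros(8))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))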
Dropout
Dropout is the removal of randomly selected neurons in the network. Typically hidden layer neurons might be given probability p = 0.5 of surviving, and input components might have p = 0.8 or higher. The main objective of random dropout is to avoid overfitting.
In training, each new v_0 (the feature vector of an input sample) leads to a new thinned network.
At test time, we use the full network (no dropout) with weights rescaled from the training weights. The outgoing weights from an undropped neuron are multiplied by p in the rescaling.
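A minimal sketch of the training-time mask and the test-time rescaling (survival probability p; function names are mine):

import numpy as np

rng = np.random.default_rng(0)

def dropout_train(h, p=0.5):
    mask = rng.random(h.shape) < p        # each neuron survives with probability p
    return h * mask                       # a thinned network for this sample

def rescale_for_test(outgoing_weights, p=0.5):
    return outgoing_weights * p           # full network at test time, weights scaled by p

print(dropout_train(np.ones(10)))         # roughly half of the neurons survive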
To compute gradients, use backpropagation for each training example in the minibatch. Then average those gradients. Stochastic gradient descent can still include acceleration (momentum added) and adaptive descent and weight decay.
LeCun emphasizes that for a multiparameter search, random sampling is the way to cover many possibilities quickly. Grid search is too slow in multiple dimensions.
Loss Functions
Quadratic cost (square loss) is not a favorite choice for deep learning. One reason is the parabolic shape of the graph of l(w, v) as we approach zero loss at w = y. The derivative also approaches zero.
A zero derivative at the minimum is normal for a smooth loss function, but it frequently leads to an unwanted result: The weights A and b change very slowly near the optimum. Learning slows down and many iterations are needed.
Cross-entropy loss: expect that the N outputs w_i from training the neural net have been normalized to z(w), with z_i in (0, 1)
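A minimal sketch of that normalization z(w) and the resulting loss (standard softmax cross-entropy, not copied from the text):

import numpy as np

def softmax(w):
    e = np.exp(w - w.max())               # subtract the max for numerical safety
    return e / e.sum()                    # probabilities z_i in (0, 1) summing to 1

def cross_entropy(w, true_class):
    z = softmax(w)
    return -np.log(z[true_class])         # small when z[true_class] is near 1

w = np.array([2.0, 0.5, -1.0])
print(softmax(w), cross_entropy(w, true_class=0))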
Shannon’s formula for entropy
Kullback-Leibler (KL) divergence.
Regularization
It adds a penalty term to the loss function L(x)
The penalty controls the size of x. Regularization is also called weight decay.
The coefficient \lambda is a hyperparameter. Its value can be based on cross-validation. The purpose of the penalty terms is to avoid overfitting (sometimes expressed as fitting the noise). Cross-validation for a given lambda finds the minimizing x on a test set. Then it checks by using those weights on a training set. If it sees errors from overfitting, lambda is increased.
A small value of lambda tends to increase the variance of the error: overfitting. Large lambda will increase the bias: underfitting, because the fitting term $\|b-Ax\|^2$ is less important
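For the quadratic case the penalized problem can be written out explicitly (a standard ridge-regression identity, not specific to these notes): minimize $\|b - Ax\|^2 + \lambda\|x\|^2$. Setting the gradient $2A^T(Ax - b) + 2\lambda x$ to zero gives $(A^TA + \lambda I)x = A^Tb$, and a larger $\lambda$ shrinks the weights x toward zero (weight decay).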
Recent experiments on MNIST make it unclear if explicit regularization is always necessary.
The Structure of AlphaGo Zero
- A convolution of 256 filters of kernel size 3 x 3 with stride 1 : E = 3, S = 1
- Batch normalization
- ReLU
- A convolution of 256 filters of kernel size 3 x 3 with stride 1
- Batch normalization
- A skip connection as in ResNets that adds the input to the block
- ReLU
- A fully connected linear layer to a hidden layer of size 256
- ReLU
Training was by stochastic gradient descent on a fixed data set that contained the final 2 million games of self-played data from a previous run of AlphaGo Zero. The CNN includes a fully connected layer that outputs a vector of size 19^2 + 1. This accounts for all positions on the 19 x 19 board, plus a pass move allowed in Go.
Recurrent Neural Networks
appropriate for data that comes in a definite order. This includes time series and natural language: speech or text or handwriting. In the network of connections from inputs v to outputs w, the new feature is the input from the previous time t - 1. This recurring input is determined by the function h(t- 1).
The inputs to h(t) are the new data v(t) and the recurrent data h( t- 1) from the previous time. The weights multiplying the data are x_in, x_recur, and x_out
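A minimal sketch of one recurrent step with scalar weights (tanh is my assumed nonlinearity; matrix weights work the same way):

import numpy as np

def rnn_step(v_t, h_prev, x_in, x_recur, x_out):
    h_t = np.tanh(x_in * v_t + x_recur * h_prev)   # new state from new data + old state
    w_t = x_out * h_t                              # output at time t
    return h_t, w_t

h = 0.0
for v_t in [1.0, 0.5, -0.2]:                       # a short input sequence
    h, w = rnn_step(v_t, h, x_in=0.8, x_recur=0.5, x_out=1.2)
    print(w)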
Google translate
Instead of programming every word and grammatical rule and exception in both languages, let the computer find the rules. Just give it enough correct translations. If we were recognizing images, the inputs would be many examples with correct labels (the training set). The machine creates the function F(x, v).
Part VII : Learning from Data
There are a great many points in the training data, but there are far more weights to be computed in a deep network. The art of deep learning is to find, among many possible solutions, one that will generalize to new data.
F(x,v): The training data is given by a set of feature vectors v. The weights that allow F to classify that data are in the vector x. To optimize F, gradient descent needs its derivatives dF/dx. The weights x are the matrices A_1…A_L and bias vectors b_1…b_L. Sample data v = v_0, output w = v_L
Real codes use automatic differentiation (AD) for backpropagation. Each hidden layer with its optimized weights learns more about the data and the population from which it comes-in order to classify new and unseen data from the same population.
The Functions of Deep Learning
So we start with M different images (the training set). An image will be a set of p small pixels-or a vector v. v_i is the greyscale of ith pixel. For every v in that training set we know the digit it represents.
Continuous Piecewise Linear (CPL) functions. Linear for simplicity, continuous to model an unknown but reasonable rule, and piecewise to achieve the nonlinearity that is an absolute requirement for real images and data.
If A_1 is q by p, the input space R^p is sliced by q hyperplanes into r pieces
Bias vs. Variance : Underfit vs. Overfit
A training set contains N vectors with m components each (the m features of each sample). For each of those N points in R^m, we are given y_i. We want to learn $y_i = f(x_i) + \epsilon$, with epsilon having mean 0
Our learning algorithm actually finds a function F(x) close to f(x)
bias-variance tradeoff. High bias from underfitting, high variance from overfitting.
bias = E[f(x)-F(x)]. variance = E[F(x)^2] - E[F(x)]^2
The Construction of Deep Neural Networks
create a learning function F(x, v) with weights x that capture information from the training data v
- Key operation: F=F_3(F_2(F_1(x, v)))
- key rule: Chain rule for x-derivatives of F
- Key algorithm: Stochastic gradient descent to find the best weights x
- Key subroutine: Backpropagation to execute the chain rule, find the x-derivatives of F to solve \nabla F = 0
- Key nonlinearity: ramp function ReLU(y). This makes the learning function F continuous and piecewise linear in v.
One Internal Layer L = 2
For a classification problem each sample v_0 of the training data is assigned 1 or -1. We want the output v_2 to have that correct sign (most of the time). For a regression problem we use the numerical value (not just the sign) of v_2
Machine learning doesn’t aim to capture every detail of the numbers 0, 1, 2 … , 9. It just aims to capture enough information to decide correctly which number it is.
The Initial Weights in Gradient Descent
A proper choice of the net and the initial x_0 uses random (and independent) weights that meet two requirements.
- x_0 has a carefully chosen variance, which controls the mean of the computed weights.
- The hidden layers in the neural net have enough neurons (not too narrow).
Many-layered depth can reduce the loss on the training set. But if $\sigma^2$ is wrong or width is sacrificed, then gradient descent can lose control of the weights. They can explode to infinity or implode to zero.
The good choice of variance is 2/fan-in. The fan-in is the maximum number of inputs to a neuron
The key point: Deep learning can go wrong if it doesn’t start right.
Stride and Subsampling
goal : Reduce the dimension from 128 to 64
- In two steps: Multiply the 128-component vector v by A, and then discard the odd-numbered components of the output. This is filtering followed by subsampling.
- In one step: Discard the odd-numbered rows of the matrix A. The new matrix $A_2$ becomes short and wide: 64 rows and 128 columns. The "stride" of the filter is now 2. Now multiply the 128-component vector v by $A_2$.
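A minimal sketch checking that the two routes agree (a random A stands in for the filtering matrix):

import numpy as np

n = 128
A = np.random.randn(n, n)              # any filtering matrix, just for the check
v = np.random.randn(n)

two_steps = (A @ v)[::2]               # filter, then keep the even-numbered outputs
A2 = A[::2, :]                         # discard odd-numbered rows: 64 x 128, stride 2
one_step = A2 @ v

print(np.allclose(two_steps, one_step))   # True: same 64 outputs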
Max-pooling
Multiply the 128-component vector v by A, as before. Then from each even-odd pair of outputs, keep the maximum.
For an image (a 2-dimensional signal) we might use max-pooling over every 2 by 2 square of pixels. This speeds up the training, when the number of neurons on a hidden layer is divided by 4.
Pooling also reduces the possibility of overfitting. Average pooling would keep the average of the numbers in each pool : now the pooling layer is linear.
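A minimal sketch of max-pooling over 2 by 2 squares (dimensions assumed divisible by 2):

import numpy as np

def max_pool_2x2(img):
    h, w = img.shape
    blocks = img.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))     # keep one maximum per 2 by 2 square

img = np.arange(16.0).reshape(4, 4)
print(max_pool_2x2(img))               # 4 x 4 image becomes 2 x 2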
The Graph of the Learning Function F(v)
The graph of F ( v) is a surface made up of many, many flat pieces-they are planes or hyperplanes that fit together along all the folds where ReLU produced a change of slope. This is like origami except that this graph has flat pieces going to infinity.
That polygon separating blue from orange (or plus from minus : this is classification) is the analog of a separating hyperplane in a Support Vector Machine. If we were limited to linear functions and a straight line between a blue ball and an orange ring around it, separation would be impossible.
Counting Flat Pieces in the Graph
Suppose $v_0$ has m components and $A_1v_0+b_1$ has N components. We have N linear functions of $v_0$. Each of those linear functions is zero along a hyperplane (dimension m - 1) in R^m. When we apply ReLU to that linear function it becomes piecewise linear, with a fold along that hyperplane. On one side of the fold the graph is sloping; on the other side the negative values are replaced by zero.
Convolutional Neural Nets
That fully connected net will be extremely inefficient for image recognition. We are looking for connections between faraway pixels. Almost always, the important connections in an image are local
the search for structure is essentially the same everywhere in the image. There is normally no reason to process one part of a text or image or video differently from other parts. We can use the same weights in all parts: Share the weights. The neural net of local connections between pixels is shift-invariant: the same everywhere.
Suppose each neuron is connected to only E neurons on the next layer, and those connections are the same for all neurons. Then the matrix A between those layers has only E independent weights x.
we have time to create several different channels with their own E or E^2 weights. They can look for edges in different directions (horizontal, vertical, and diagonal).
In one dimension, a banded shift-invariant matrix is a Toeplitz matrix or a filter. Multiplication by that matrix A is a convolution x * v.
A Toeplitz matrix is a type of structured matrix where each descending diagonal from left to right is constant. That means all the elements along any diagonal parallel to the main diagonal are the same.
Each shift has a diagonal of 1's: $A = x_1L + x_0C + x_{-1}R$, assuming S = 1
$d(Av)/dx_1 = Lv$
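A minimal sketch of that combination of shifts (one convention for L, C, R; the check at the end confirms $d(Av)/dx_1 = Lv$):

import numpy as np

n = 6
C = np.eye(n)
L = np.diag(np.ones(n - 1), k=1)       # ones above the main diagonal (shift one way)
R = np.diag(np.ones(n - 1), k=-1)      # ones below the main diagonal (shift the other way)

x1, x0, xm1 = 0.5, 1.0, -0.5
A = x1 * L + x0 * C + xm1 * R          # banded Toeplitz matrix: constant diagonals

v = np.random.randn(n)
# the part of Av that multiplies x1 is exactly Lv
print(np.allclose((A @ v - (x0 * C + xm1 * R) @ v) / x1, L @ v))   # True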
Convolutions in Two Dimensions
When the input v is an image, the convolution with x becomes two-dimensional. The 1-D weights $x_1, x_0, x_{-1}$ change to $E^2$ independent weights. v represents $(N+2)^2$ pixels, and the output has only $N^2$ pixels
The 2D convolution $A=x_{11}LU+x_{01}CU+x_{-11}RU+…+x_{-1-1}RD$
These nine derivatives are computed inside backpropagation.
CNN’s can readily afford to have B parallel channels (and that number B can vary as we go deeper into the net).
A convolution is a combination of shift matrices (producing a filter or Toeplitz matrix)
A cyclic convolution is a combination of cyclic shifts (producing a circulant matrix)
In deep learning, the coefficients in the combination will be the “weights” to be learned.
In two dimensions (for images) the matrix A is block Toeplitz. Each small block is E by E
The same weights are used around all pixels (shift-invariance). The matrix produces a 2D convolution. Frequently A is called a filter
The dot products between a smooth function and a moving filter window will be smooth. But when an edge in the image lines up with a diagonal wall, we see a spike.
The difficulty with two or more dimensions is that edges can have many directions. We will need horizontal and vertical and diagonal filters for the test images. And filters have many purposes, including smoothing, gradient detection, and edge detection.
- For a 2D function f, the natural smoother is convolution with a Gaussian. The filter removes noise (at a price in sharp edges)
- Gradient detection. Image processing (as distinct from learning by a CNN) needs filters that detect the gradient. They contain specially chosen weights. We mention some simple filters just to indicate how they can find gradients, the first derivatives of f.
- Edge detection. we don’t want smoothing, which would blur the edge. The good filters become Laplacians of Gaussians
The Stride of a Convolutional Filter
For a larger stride, the moving window takes longer steps as it moves across the image.
the length of the output y = Av is reduced by the factor S (the stride).
nonzero weights like x_1 are S columns apart for stride S
Extending the Signal
Instead of losing neurons at the edges of the image when A is not square, we can extend the input layer. We are “inventing” components beyond the image boundary. Then the output y = Av has equal dimensions for input and output.
For periodic signals, zero-padding is replaced by wraparound.
A more accurate choice is to go beyond the boundary by reflection.
This is what CNN’s usually do: Add more channels of weight matrices A to capture more features of the training sample.
Counting the Number of Inputs and Outputs
In a one-dimensional problem, suppose a layer has N neurons. We apply a convolutional matrix with E nonzero weights. The stride is S, and we pad the input signal by P zeros at each end. How many outputs (M numbers) does this filter produce?
Karpathy's formula: M = (N - E + 2P)/S + 1
In a 2D or 3D problem, this 1D formula applies in each direction.
This case 2P = E - 1 with stride S = 1 is the most common architecture for CNN’s.
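A tiny helper (my own, not from the text) applying the formula above, with integer division when the window does not fit exactly:

def conv_output_count(N, E, S, P):
    return (N - E + 2 * P) // S + 1

print(conv_output_count(N=128, E=3, S=1, P=1))   # 128: the common case 2P = E - 1, S = 1
print(conv_output_count(N=128, E=3, S=2, P=1))   # 64: stride 2 halves the length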
it is very common to end a CNN with fully-connected layers.
Softmax Outputs for Multiclass Networks
“Softmax” replaces the two-output case of logistic regression. We are turning n numbers into probabilities.
For CNN’s in computer vision, the final layer often has a special form. If the previous layers used ReLU and max-pooling (both piecewise linear), the last step can become a difference-of-convex program, and eventually a multiclass Support Vector Machine (SVM). Then optimization of the weights in a piecewise linear CNN can be one layer at a time.
Residual Networks
But depth brings dangers. Information can jam up and never reach the output. The problem of “vanishing gradients” can be serious: so many multiplications in propagating so far, with the result that computed gradients are exponentially small. When it is well designed, depth is a good thing-but you must create paths for learning to move forward.
“skip connections” that go directly to the next layer.
entire layers can be removed without significant impact
Part VI: Optimization
Gradient Descent Toward the Minimum
Minimize a function $f(x_1, …, x_n)$: all the first derivatives are 0 at the minimum. If we have n = 20 unknowns (a small number in deep learning), then minimizing one function f produces 20 equations $\partial f/\partial x_i = 0$
The steepest direction, in which f(x) decreases fastest, is given by the gradient $-\nabla f$
$\nabla f$ represents the vector of n partial derivatives of f: its gradient. $s_k$ is the stepsize or learning rate
The neat idea of automatic differentiation-rediscovered and extended as backpropagation in machine learning-makes those cost factors much smaller in practice.
df/dx = limit of $\Delta f/\Delta x$ = limit of $(f(x + \Delta x) - f(x))/\Delta x$
For n = 1, df/dx is the gradient
we get a better ratio (closer to the limiting slope df/dx) by averaging the forward difference (where $\delta x > 0$) with the backward difference (where $\delta x < 0$). The average is the more accurate centered difference
The gradient of $F(x_1, …, x_n)$ is the column vector $\nabla F = (\partial F/\partial x_1, \ldots, \partial F/\partial x_n)$. Its components are the n partial derivatives of F, and it points in the steepest direction.
For a constant vector a, $F(x) = a^Tx$ has gradient a.
For a symmetric matrix S, gradient of $F(x)=x^TSx$ is 2Sx
For a positive definite symmetric S, the gradient of $F(x)=\frac{1}{2}x^TSx-a^Tx$ is $Sx-a$
That steepest direction is perpendicular to the level direction
Stochastic Gradient Descent and ADAM
Gradient descent: $x_{k+1}=x_k-s_k\nabla L(x_k)$, where L(x) is the loss function. But for large networks with many samples in the training set, this algorithm (as it stands) is not successful
Computing $\nabla L$ at every descent step-the derivatives of the total loss L with respect to all the weights x in the network-is too expensive. Potentially millions of separate losses are computed and added in every computation of L
The number of weights is even larger, so there are many different choices $x^*$ of the weights. Each $x^*$ minimizes L(x) on the training data v. Some of those choices can give poor results on unseen test data.
Stochastic gradient descent uses only a “minibatch” of the training data at each step. B samples will be chosen randomly. This changes L(x) to a sum of only B losses.
Computing $\nabla l_i$ by backpropagation on B samples is much faster. Often B = 1
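A minimal sketch of minibatch SGD on a least-squares loss (the sizes and stepsize are made up for the example):

import numpy as np

rng = np.random.default_rng(0)
N, d, B, s = 1000, 5, 32, 0.1
A = rng.standard_normal((N, d))
b = A @ np.ones(d) + 0.01 * rng.standard_normal(N)    # true weights are all ones

x = np.zeros(d)
for step in range(500):
    idx = rng.choice(N, size=B, replace=False)         # the random minibatch
    grad = 2 * A[idx].T @ (A[idx] @ x - b[idx]) / B    # gradient of the B sample losses
    x -= s * grad
print(x.round(2))                                      # close to the all-ones vector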
The stochastic algorithm produces weights that also succeed on unseen data.
And for each sample in the training set, we know if it is "a cat or a dog", or if the text is "poetry or prose". We look for a learning function F that assigns good weights. Then for v in a similar population, F outputs the correct classification "cat" or "poetry".
We use this function F for unidentified test data. The features of the test data are the new inputs v. The output from F will be the correct (?) classification-provided the function has learned the training data in a way that generalizes.
Generalization by SGD is the ability to give the correct classification for unseen test data v, based on the weights x that were learned from the training data.
I compare overfitting with choosing a polynomial of degree 60 that fits exactly through 61 data points. Early stopping is needed
The Loss Function and the Learning Function
“loss” L(x) that our function will (approximately) minimize. This is the sum of the errors in classifying each of the training data vectors v. And we need to describe the form of the learning function F that classifies each data vector v
At the beginning of machine learning the function F was linear-a severe limitation. Now F is certainly nonlinear. Just the inclusion of one particular nonlinear function at each neuron in each layer has made a dramatic difference.
The gradient is determined by the network architecture-the “feedforward” steps whose weights we will optimize.
Square loss: sum over the training samples
Hinge loss
Cross-entropy loss: favorite for neural nets. Cross-entropy loss or “logistic loss” is preferred for logistic regression (with two choices only). For a minibatch of size B, replace N by B. And choose the B samples randomly.
Stochastic Descent Using One Sample Per Step
To simplify, suppose each minibatch contains only one sample (so B = 1). That sample is chosen randomly. The theory of stochastic descent usually assumes that the sample is replaced after use-in principle the sample could be chosen again at step k + 1. But replacement is expensive compared to starting with a random ordering of the samples. In practice, we often omit replacement and work through samples in a random order.
Each pass through the training data is one epoch of the descent algorithm. Ordinary gradient descent computes one epoch per step (batch mode). Stochastic gradient descent needs many steps (for minibatches).
Stochastic descent is more sensitive to the stepsizes than full gradient descent.
We are doing much less work per step (B inputs instead of all inputs from the training set). But we do not necessarily converge more slowly. A typical feature of stochastic gradient descent is “semi-convergence” : fast convergence at the start.
later iterations of SGD are frequently erratic. Convergence at the start changes to large oscillations near the solution. One response is to stop early and thereby we avoid overfitting the data.
If $x_k$ is outside I, then $x_{k+1}$ moves toward the interval I. Otherwise, it just bounces inside I
Randomized Kaczmarz is Stochastic Gradient Descent
We are randomly selecting row i of A at step k, and adjusting x_{k+1} to solve equation i in Ax=b, i.e., $a_i^Tx_{k+1}=b_i$. x_{k+1} is the projection of x_k onto one hyperplane $a_i^Tx=b_i$ that passes through $x^*=A^{-1}b$
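A minimal sketch of that projection step (uniform row sampling; one well-known variant samples row i with probability proportional to $\|a_i\|^2$):

import numpy as np

rng = np.random.default_rng(0)
n = 20
A = rng.standard_normal((n, n)) + 5 * np.eye(n)       # a well-conditioned example
x_true = rng.standard_normal(n)
b = A @ x_true

x = np.zeros(n)
for k in range(5000):
    i = rng.integers(n)                               # pick a random row of A
    a_i = A[i]
    x += (b[i] - a_i @ x) / (a_i @ a_i) * a_i         # project onto a_i^T x = b_i
print(np.linalg.norm(x - x_true))                     # small after many steps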
Adaptive Methods Using Earlier Gradients
That “memory” guides the choice of search direction D and the all-important stepsize s.
For a standard SGD iteration, D_k depends only on the current gradient $\nabla L_k$. $\nabla L_k(x_k, B)$ is evaluated only on a random minibatch B of the training data.
Adaptive Stochastic Gradient Descent: x_{k+1}=x_k - s_kD_k
Exponential moving averages in ADAM: recent gradients $\nabla L$ have greater weight than earlier gradients in both s_k and step direction D_k
Typical values are \delta = 0.9 and \beta = 0.999.
The actual computation of D_k and s_k will be a recursive combination of old and new
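A minimal sketch of those recursions (the usual ADAM update; delta weights the moving average of gradients, beta the moving average of squared gradients, and the bias corrections handle the startup):

import numpy as np

def adam_step(x, grad, m, v, k, s=0.001, delta=0.9, beta=0.999, eps=1e-8):
    m = delta * m + (1 - delta) * grad            # recent gradients weighted more
    v = beta * v + (1 - beta) * grad**2           # recent squared gradients weighted more
    m_hat = m / (1 - delta**k)                    # bias corrections for early steps
    v_hat = v / (1 - beta**k)
    x = x - s * m_hat / (np.sqrt(v_hat) + eps)    # adaptive direction and stepsize
    return x, m, v

x = np.array([1.0, -2.0]); m = np.zeros(2); v = np.zeros(2)
for k in range(1, 2001):                          # toy loss ||x||^2 / 2 has gradient x
    x, m, v = adam_step(x, grad=x, m=m, v=v, k=k)
print(x.round(3))                                 # moving toward the minimum at 0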
Generalization : Why is Deep Learning So Effective?
Often we have more free parameters in x than data in v. In that case we can expect many sets of weights (many vectors x) to be equally accurate on the training set. Those weights could be good or bad. They could generalize well or poorly. Our algorithm chooses a particular x and applies those weights to new data v_{test}
Part V: Probability and Statistics
The first 100 or 1000 flips do affect the sample mean. But 1000 flips will not affect its limit-because you are dividing by $N \to \infty$
F(x) is the cumulative distribution and its derivative p(x) is the probability density function. You could say that p(x) dx is the probability of a sample falling in between x and x + dx.
$E[x] = \int xp(x)dx$ $\sigma^2 = E[(x-m)^2] = \int p(x)(x-m)^2dx$
Normal distribution
Central Limit Theorem. Standard normal distribution N(0, 1).
$ F(\sigma) - F(-\sigma) \approx 2/3 $ $ F(2\sigma) - F(-2\sigma) \approx 0.95 $
The “binomial” probabilities count the number of heads in N coin flips. binomial approaches normal.
Stirling’s formula
Monte Carlo Estimation Methods
accepting uncertainty in the inputs and estimating the variance in the outputs. The error in approximating E[x] is usually of order $1/\sqrt{N}$
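A minimal sketch of that $1/\sqrt{N}$ behavior, estimating a known mean from N random samples:

import numpy as np

rng = np.random.default_rng(0)
for N in [100, 10_000, 1_000_000]:
    samples = rng.exponential(scale=2.0, size=N)      # true mean is 2.0
    print(N, abs(samples.mean() - 2.0))               # error shrinks roughly like 1/sqrt(N)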
2-level Monte Carlo
Probability Distributions
Binomial: Tossing a coin n times
Poisson
Rare events: $p \to 0$ over n trials, with expected $\lambda = np$ successes. What is the probability of k successes in n trials? In the Poisson limit, $\Pr(k) = \lambda^k e^{-\lambda}/k!$
we wait a long time, or we include a large population, to produce a decent chance that one or more events will actually occur. Poisson is not for the probability that you will be struck by lightning : say one in a million, over your lifetime. It is for the probability that someone in a city of 100,000 will be struck.
Poisson assumes independent events. One of the most difficult aspects of applied probability is to estimate the dependence of one event on other events.
Exponential
The exponential distribution is continuous (not discrete). It describes the waiting time in a Poisson process. How long until lightning strikes your city ? The key assumption is that The waiting time is independent of the time you have already waited.
The future is independent of the past. Is this true for a television set to fail? Is it true for a computer to fail? If failure depends on a random disaster, the waiting time has an exponential distribution. The failure rate $\lambda$ is constant. If failure comes from slow decay, the independence requirement will not hold.
The exponential distribution has no memory: the probability of waiting at least y hours more for a table is unfortunately not affected by already having waited x hours
If failures only occur at integer times t = 0, 1, 2, 3, … then the exponential distribution (which has continuous time) is replaced by the geometric distribution (discrete time).
Log-normal
When x has a normal distribution, the exponential $e^x$ has a log-normal distribution. That is, the distribution of x is log-normal when y = log(x) has a normal distribution.
The log-normal distribution appears when you take the product of many positive sample values. The arithmetic mean for normal compares with the geometric mean for log-normal.
Chi-squared
If x has the standard N(0, 1) distribution, then $x^2$ is chi-squared.
If $x_1….x_n$ are independent N(0, 1), then $x_1^2+…+ x_n^2$ has chi-squared distribution with n degrees of freedom
Multivariable Gaussian
When the variables are not independent, like stocks and bonds, C is symmetric and positive definite (in an extreme case, semidefinite). Then the joint distribution of Gaussian variables is “multivariate Gaussian”
Moments, Cumulants, and Inequalities
Suppose we know E[X], we want to know Pr(X>=a)
Markov's inequality: for $X \ge 0$, $\Pr(X \ge a) \le E[X]/a$
Chebyshev's inequality: $\Pr(|X - m| \ge a) \le \sigma^2/a^2$. The proof is to apply Markov's inequality to $(X - m)^2$
Connecting a list of numbers to a (generating) function $f(x)=\Sigma a_nx^n$
Moments and Central Moments
$m_n =E[x^n]$
central moments (around m) $\mu_n=E[(x-m)^n]$
The normalized third central moment is the skewness of the distribution.
We generally use the fourth moment of a normal distribution as a basis for comparison. Then the kurtosis of any distribution is called kappa
Generating Functions and Cumulants
Probability generating function
Characteristic function. uses the characteristic function to prove the central limit theorem
Moment generating function M(t)
Cumulant generating function K(t)
We are capturing an infinite set of coefficients in each function. The key point of cumulants is that they lead to this property: $K(t)=\log M(t)$
Chernoff’s Inequality for Sums
The mean value of a sum. If those variables are independent, then also the variance
Chebyshev for a sum
Markov and Chebyshev Inequalities for Matrices
an essential step in the statistics of deep learning is to develop inequalities for the eigenvalues and the trace of random matrices.
Markov's inequality. Suppose X is a positive semidefinite random matrix and A is any positive definite matrix. Then
Prob {A - X is not positive semidefinite} <= trace of $E[X]A^{-1}$
Chebyshev’s inequality for a random matrix X with mean 0:
Matrix Chernoff Inequalities
Erdos-Renyi Random Graphs
Each edge is present with probability p. Then the question is : For which p is the graph likely to be connected?.
Covariance Matrices and Joint Probabilities
Prove that independent experiments have covariance 0
Max covariance = $\sigma_1\sigma_2$ when the two variables are fully dependent
For each outcome, U = X - E(X) is a column vector, and $p_{ij}UU^T$ is positive semidefinite. So the whole matrix V (the sum of those rank 1 matrices) is at least semidefinite
The covariance matrix V is positive definite unless the experiments are dependent.
$V=E[(X-E(X))(X-E(X))^T]=\Sigma p_{ij}(X-E(X))(X-E(X))^T$
Variance of $c^TX$ = $c^TVc$
Diagonalizing the covariance matrix $V=Q\Lambda Q^T$ means finding M independent experiments as combinations of the original M experiments.
Another proof that covariance matrix V is positive semidefinite. V is the sum of the joint probability of each combination times the positive semidefinite matrix $(X-E(X))(X-E(X))^T$
The value of the expectation symbol E is that it also allows pdf's: probability density functions like p(x, y, z) for continuous random variables x and y and z.
The covariance matrix of $Z =AX$ is $ V_Z=AV_XA^T$
Suppose x and y are independent random variables with mean 0 and variance 1. What is the covariance matrix of Z = (x, y, ax + by)?
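A worked answer (using the rule $V_Z = AV_XA^T$ from the line above, with $A = \begin{bmatrix}1&0\\0&1\\a&b\end{bmatrix}$ and $V_X = I$): $V_Z = \begin{bmatrix}1&0&a\\0&1&b\\a&b&a^2+b^2\end{bmatrix}$. It is only semidefinite (rank 2), because the third variable is completely determined by the first two.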
Correlation
Normalize the random variables x to have their variance = 1 $ X = x/\sigma _x$
The correlation of x and y is the covariance of X and Y
Correlation matrix R
GPS Example
Distances from four or more satellites will pinpoint the receiver position (using least squares!) The speed of light changes in the ionosphere. But the correction will be almost the same for all nearby receivers. If one receiver stays in a known position, we can take differences from that one. Differential GPS reduces the error variance by fixing one receiver
Markov chains
$y_{n+1}=Py_n$
Find eigenvalues of a square matrix through determinant.
The determinant of a square matrix can be seen as a measure of the volume spanned by its row or column vectors. If the determinant is zero, it means these vectors are linearly dependent, and thus, the volume is zero. In geometric terms, it indicates that the transformation described by the matrix flattens the space into a lower dimension.
The determinant of a matrix is calculated from a specific combination of its elements, reflecting how rows or columns interact. If these are linearly dependent (i.e., one row or column can be expressed as a combination of others), then they don’t span an “n-dimensional volume” in the space, and the determinant becomes zero. The determinant can be viewed as a scaling factor for the volume of the unit cube under the linear transformation represented by the matrix: For a 2x2 matrix, it corresponds to the area of the parallelogram spanned by its columns (or rows). From an algebraic standpoint, if there are linearly dependent rows or columns, the matrix can be row-reduced to a form where one entire row (or column) becomes zero, leading the determinant to be zero, as any zero-row matrix has a zero determinant.
steady state with eigenvalue 1
$P=X\Lambda X^{-1}$; $P^\infty$ has the same split in both columns
The requirements for a positive Markov matrix: All p_{ij} > 0 and each column of P adds to 1. P has largest eigenvalue $\lambda _1 = 1$ and $x_1 > 0$
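A minimal sketch, assuming a column-stochastic P (each column adds to 1): iterating $y_{n+1} = Py_n$ approaches the eigenvector with eigenvalue 1.

import numpy as np

P = np.array([[0.8, 0.3],
              [0.2, 0.7]])          # columns sum to 1, all entries positive

y = np.array([1.0, 0.0])            # any starting probability vector
for _ in range(100):
    y = P @ y
print(y)                            # steady state, here (0.6, 0.4)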
Transition Probabilities
$p_{ij}$ = probability that x(n + 1) = i if x(n) = j. $p_{ij}$ does not depend on n
Finite Markov Chain, Infinite State Markov Chain, Continuous Markov Chain
1 is an eigenvalue, because the rows of P - I add to the zero row. And no entry of P is negative.
Perron-Frobenius Theorem: $P > 0$ guarantees success. For A > 0, all numbers in $Ax=\lambda _{max}x$ are strictly positive
If $|\lambda_2| < 1$, then $P^n$ converges. The matrix P might not have n independent eigenvectors; in that case we can't diagonalize P
Gambler’s Ruin
when a player has won all the money-the game is over and there is no way to leave either of those states. With those two steady states we must expect eigenvalue 1 twice.
Shape of P. What are the four eigenvalues of P? What is the probability that the game will continue forever with no winner?
Part 1.11 Norms
Every norm for vectors or functions or matrices must share these two properties of the absolute value |c| of a number: $\|cv\| = |c|\,\|v\|$ and $\|v + w\| \le \|v\| + \|w\|$
Euclidean norm, 1-norm, and infinity norm for a vector: $\|v\|_p=(|v_1|^p+\cdots+|v_n|^p)^{1/p}$
$v^Tw= \|v\|\,\|w\|\cos\theta$
S-norms
Choose any symmetric positive definite matrix S. S-norm: $\|v\|^2_S = v^TSv$. S-inner product: $(v, w)_S = v^TSw$
Norms of functions
A function f(x) is a “vector in function space”.
For a vector space to be "complete", every converging (Cauchy) sequence $v_n$ must have a limit $v_\infty$ in the space, with $\|v_n - v_\infty\| \to 0$
Banach space: a complete normed vector space. A Hilbert space is a Banach space whose norm comes from an inner product: $(v,v) = \|v\|^2$
The Frobenius Norm
A matrix norm follows the rules for a vector norm, plus $\|AB\| \le \|A\|\,\|B\|$
$\|A\|_F^2 = |a_{11}|^2+\cdots+|a_{mn}|^2$
When Q is an orthogonal matrix, we know that Qx has the same $l^2$ length as x. $\|QB\|_F = \|B\|_F$
$\|A\|_F^2$ = trace of $A^TA$ = sum of the eigenvalues of $A^TA$
Matrix Norms from Vector Norms
Choosing the vector v with the largest growth factor: $\|A\| = \max_{v \ne 0}(\|Av\|/\|v\|)$
$l^2$ norm: largest singular value
$l^1$ norm: largest $l^1$ norm of the columns of A
$l^\infty$ norm: largest $l^1$ norm of the rows of A
The Nuclear Norm
Also called the trace norm.
$\|A\|_N = \sigma _1 + \cdots + \sigma _n$. It is the minimum value of $\|U\|_F\|V\|_F$ with UV = A
$\|A\|_F^2 = \sigma _1^2 + \cdots + \sigma _n^2$
$\|A\|_2 = \sigma _1$
Factoring Matrices and Tensors
A matrix is just a two-way tensor.
To compute a factorization A = UV, we introduce a simple alternating iteration. Update U with V fixed, then update V with U fixed. This UV idea also fits the famous k-means algorithm.
NMF
The goal of NMF is to approximate a nonnegative matrix A >= 0 by a lower rank product UV of two nonnegative matrices U >= 0 and V >= 0
When A >= 0 is symmetric positive definite, we hope for a matrix B >= 0 that satisfies $B^TB=A$. Often no solution though, and we want to find a close enough $B^TB$
SPCA: Find sparse low rank matrices B and C so that BC is close to A
Facial Feature Extraction: Each column vector of the data matrix A will represent a face. Its components are the intensities of the pixels in that image. The goal is to find a few “basic faces” in B, so that their combinations come close to the many faces in A.
Text Mining and Document Classification
each column of A represents a document. Each row of A represents a word. NMF identifies topics and classifies the whole set of documents with respect to those topics.
$a_j \approx \sum_i c_{ij}b_i$, where the $b_i$ are topics and $c_{ij}$ measures the importance of topic i in document j
NMF is an NP-hard problem
Computing the Factors: A central idea is alternating factorization: Hold one factor fixed, and optimize the other factor. Hold that one fixed, optimize the first factor, and repeat. Using the Frobenius norm, each step is a form of least squares.
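A minimal sketch of that alternation for A ≈ UV, with negative entries clipped to zero as a crude way to keep the factors nonnegative (a rough stand-in, not a production NMF algorithm):

import numpy as np

rng = np.random.default_rng(0)
A = rng.random((20, 15))                  # nonnegative data matrix
r = 4
U = rng.random((20, r))
V = rng.random((r, 15))

for _ in range(200):
    V = np.clip(np.linalg.lstsq(U, A, rcond=None)[0], 0, None)        # fix U, solve for V
    U = np.clip(np.linalg.lstsq(V.T, A.T, rcond=None)[0].T, 0, None)  # fix V, solve for U
print(np.linalg.norm(A - U @ V) / np.linalg.norm(A))                   # relative error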
Tensors
A column vector is a 1-way tensor. A matrix is a 2-way tensor. Then a 3-way tensor has elements $T_{ijk}$ row number, column number, and “tube number”
The rows and columns and tubes are the fibers of T, with only one varying index.
A Color Image is a Tensor with 3 Slices
A black and white image is just a matrix of pixels. The numbers in the matrix are the grayscales of each pixel. Normally those numbers are between zero (black) and 255 (white). A color image is a tensor. It has 3 slices corresponding to red-green-blue. Each slice shows the density of one of the colors RGB.
The Joint Probability Tensor
Suppose we measure age a in years and height h in inches and weight w in pounds. We put N children into I age groups and J height groups and K weight groups.
joint probabilities $p_{ijk}=N_{ijk}/N$. For each combination i, j, k we count only the children who are in age group i and also height group j and also weight group k.
The Norm and Rank of a Tensor
A tensor can multiply vectors, matrices, or tensors. Then it is a linear operator. A tensor can contain data. Its entries could give the brightness of pixels in an image. A color image is 3-way, stacking RGB. A color video will be a 4-way tensor.
Part 1.9 Principal Components
Principal Component Analysis (PCA) uses the largest $\sigma$ connected to the first u’s and v’s to understand the information in a matrix of data. The closest rank k matrix to A is $A_k$. In statistics we are identifying the pieces of A with largest variance
PCA is “unsupervised” learning. Our only instructor is linear algebra. the SVD tells us to choose $A_k$.
Eckart-Young. Three choices for the matrix norm | A |
PCA
We have n samples. For each sample we measure m variables, giving an m by n data matrix (m rows, n columns).
The first step is to find the average (the sample mean) along each row of $A_0$. Subtract that mean from all n entries in the row. Now each row of the centered matrix A has mean zero
How will linear algebra find that closest line through (0, 0)? It is in the direction of the first singular vector
The Statistics Behind PCA
Subtracting those means in each row from each row of $A_0$ produced the centered A. The variances are the diagonal entries of the matrix $AA^T$. The covariances are the off-diagonal entries of the matrix $AA^T$. High covariance means that increased height goes with increased age. (Negative covariance means that one variable increases when the other decreases.).
sample covariance matrix $ S = AA^T / (n -1)$
Given the data in A, explicitly forming S would be a computational mistake. For large matrices, a direct SVD of A is faster and more accurate.
The leading eigenvector $u_1$ tells us the slope of the closest line in the scatter plot.
The sum of squared distances from the data points to the $u_1$ line is a minimum.
The total variance in the data comes from the Frobenius norm (squared) of A. This is the trace of S (the sum down the diagonal), and the trace equals the sum of the eigenvalues of the sample covariance matrix S
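A minimal sketch of PCA straight from the SVD of the centered data matrix (a made-up age/height example; S is never formed explicitly):

import numpy as np

rng = np.random.default_rng(0)
n = 200
age = rng.normal(10, 3, n)
height = 2.5 * age + rng.normal(0, 2, n)          # correlated with age
A0 = np.vstack([age, height])                     # m = 2 variables, n samples

A = A0 - A0.mean(axis=1, keepdims=True)           # center each row
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
u1 = U[:, 0]                                      # direction of the closest line
print(u1, (sigma**2 / (n - 1)).round(2))          # principal direction and its variance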
Rayleigh quotient for symmetric matrix
$R(x) = x^TSx/x^Tx$. The maximum value of R(x) is the largest eigenvalue. Other eigenvectors are saddle points. Saddles have first derivatives = zero but they are not maxima or minima.
$\|A\|_2^2 = \max(\|Ax\|^2/\|x\|^2) = \max R(x) = \lambda_1(S)$, with $S = A^TA$
Generalized Eigenvalues
Generalized Rayleigh quotient $R(x) = x^TSx / x^TMx$. M is symmetric. In dynamical problems M is often the “mass matrix” or the “inertia matrix”. In statistics M is generally the covariance matrix.
If M is positive definite, the maximum of R(x) is the largest eigenvalue of $M^{-1}S$
$H=M^{-1/2}SM^{-1/2}$ is symmetric
$x=M^{-1/2}y$ to convert $Sx = \lambda Mx$ into $Hy=\lambda y$
A crucial fact about a symmetric matrix S is that any two eigenvectors (with different eigenvalues) are orthogonal. Not true for generalized eigenvectors, but they are M-orthogonal: $x_1^TMx_2=0$
Positive Semidefinite M : Not Invertible
$x^TMx$ can be zero, so it is possible to have an infinite eigenvalue
In statistics M is often a covariance matrix. Its diagonal entries tell us the separate variances of two or more measurements. Its off-diagonal entries tell us the “covariances between the measurements”. If we are foolishly repeating exactly the same observations, or if one experiment is completely determined by another-then the covariance matrix M is singular. Its determinant is zero and it is not invertible
Fisher’s Linear Discriminant Analysis
We are given samples from two different populations, mixed together, along with the mean and covariance of each population. If all the samples are mixed together and we pick one, how do we tell if it probably came from population 1 or population 2?
Each sample has a feature vector f. Fisher finds a separation vector v: if $v^Tf>c$ then the sample likely came from population 1. v maximizes the separation ratio R.
$R = (v^Tm_1 - v^Tm_2)^2/(v^T\Sigma _1v + v^T\Sigma _2v)$, where $m_1, m_2$ are the mean vectors
$S = (m_1-m_2)(m_1-m_2)^T$ $M = \Sigma _1 + \Sigma _2$
$Sv = \lambda Mv$
$v = M^{-1} (m_1 - m_2)$
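A minimal sketch of that direction on synthetic data (the threshold c halfway between the projected means is my simple choice):

import numpy as np

rng = np.random.default_rng(0)
pop1 = rng.multivariate_normal([0, 0], [[1.0, 0.3], [0.3, 1.0]], size=300)
pop2 = rng.multivariate_normal([2, 1], [[1.0, -0.2], [-0.2, 1.5]], size=300)

m1, m2 = pop1.mean(axis=0), pop2.mean(axis=0)
M = np.cov(pop1.T) + np.cov(pop2.T)               # Sigma_1 + Sigma_2
v = np.linalg.solve(M, m1 - m2)                   # separation vector v = M^{-1}(m1 - m2)
c = v @ (m1 + m2) / 2                             # threshold halfway between projected means
print((pop1 @ v > c).mean(), (pop2 @ v > c).mean())   # near 1.0 for pop 1, near 0.0 for pop 2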
Neural networks will succeed by allowing separating surfaces that are not just planes.
Part 1.6: Eigenvalues
The eigenvectors of A don’t change direction when you multiply them by A. The output Ax is on the same line as the input vector x. The eigenvector x is just multiplied by its eigenvalue
$A^2x=\lambda ^2 x$
$Ax=\lambda x$ means that $A - \lambda I$ is singular: it is not invertible and its determinant is 0
Similar matrices
For every invertible matrix B, the eigenvalues of A are the same as the eigenvalues of $BAB^{-1}$. The eigenvector becomes Bx when $Ax=\lambda x$
To compute the eigenvalues of A, we gradually make $BAB^{-1}$ diagonal, and the eigenvalues show up on the main diagonal
Diagonalizing a Matrix
Suppose A has a full set of n independent eigenvectors. Put those eigenvectors into an invertible matrix X. $A=X\Lambda X^{-1}$. This property is used to compute the power of A
Symmetric positive definite matrices
$S = S^T$, Then eigenvectors q can be chosen orthogonal
Spectral Theorem: Every real symmetric matrix has the form $S=Q\Lambda Q^T$. Q is the orthonormal eigenvector matrix with $Q^TQ=I$
A positive definite matrix has all positive eigenvalues.
the energy test: S is positive definite if the energy $x^TSx$ is positive for all vectors x != 0
if $Sx=\lambda x$, then $x^TSx > 0$ if $\lambda > 0$
if every eigenvector has positive energy, then every vector does: because S is symmetric, the eigenvectors can be chosen orthogonal, and every vector is a linear combination of those eigenvectors
$S=A^TA$ for a matrix A with independent columns
For an ordinary function f(x) of one variable x, the minimum test is df/dx = 0 and second derivative > 0. For f(x, y) with two variables, the second derivatives go into a matrix, and there is a minimum when that matrix is positive definite.
There is a saddle point when S has both positive and negative eigenvalues. A saddle point matrix is "indefinite"
Optimization and Machine Learning
gradient descent. Each step takes the steepest direction toward the bottom point of the bowl. Calculus: The partial derivatives of f are all zero. Linear algebra: The matrix S of second derivative is positive definite.
If S is positive definite (or semidefinite) at all points x, then the function f(x) is convex
Machine learning produces “loss functions” with hundreds of thousands of variables. They measure the error-which we minimize. But computing all the second derivatives is completely impossible. We use first derivatives to tell us a direction to move-the error drops fastest in the steepest direction. Then we take another descent step in a new direction. This is the central computation in least squares and neural nets and deep learning.
Ellipse
Stay with a positive definite matrix S. Cut through that bowl at height energy = 1, then the curve that goes around the cut is an ellipse
The axes of the tilted ellipse point along those eigenvectors of S. Axis length = $1/\sqrt{\lambda}$
SVD
If A is not square then $Ax = \lambda x$ is impossible and eigenvectors fail. SVD helps in this case
$Av_1=\sigma _1u_1$. The v's are orthogonal in $R^n$; the u's are perpendicular to each other in $R^m$. The number r is the rank of A
$Av_{r+1} = 0$
$AV = U\Sigma$. V and U are square orthogonal matrices $V^T=V^{-1}$ because their columns are orthogonal unit vectors.
SVD: $A=U\Sigma V^T$. Reduced SVD so that $\Sigma _r$ is square
A special property of the SVD is that those pieces come in order of importance. $\sigma _1u_1v_1^T$ is the closest rank 1 matrix to A. The sum of the first k pieces is the best rank k approximation of A
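A minimal sketch of keeping the first k pieces (the spectral-norm error of the best rank-k approximation is $\sigma_{k+1}$):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 6))
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]       # sum of the k largest rank-1 pieces
print(np.linalg.norm(A - A_k, 2), sigma[k])           # the two numbers agree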
Eckart-Young
$A^TA=V\Sigma^T\Sigma V^T$. V contains orthonormal eigenvectors of $A^TA$. The $\sigma_i^2$ are the nonzero eigenvalues of both $A^TA$ and $AA^T$
KL transform’s connection to SVD
KL begins with a covariance matrix V of a zero-mean random process. V is symmetric and positive definite or semidefinite. In general V could be an infinite matrix or a covariance function. Then the KL expansion will be an infinite series. The KL transform is a stochastic (random) form of Principal Component Analysis
The Geometry of the SVD
The SVD separates a matrix into (orthogonal) x (diagonal) x (orthogonal). The orthogonal matrices U and V rotate the plane. The diagonal matrix stretches it along the axes.
Finite difference
The discrete form of a derivative is a finite difference. The discrete form of an integral is a sum.
D corresponds to the backward difference $f(x) - f(x - \Delta x)$
The eigenvectors v of $D^TD$ are the right singular vectors of D. They are discrete sines; the u's are discrete cosines. Suitably scaled, these are the famous DST and DCT matrices: Discrete Sine Transform and Discrete Cosine Transform. The DCT matrix has been the backbone of JPEG image compression. Actually JPEG increases the size to 8 by 8, which reduces the "blockiness" of the image: 8 by 8 blocks of pixels are transformed by a two-dimensional DCT, then compressed and transmitted.
Part I: Highlights of Linear Algebra
Ax = b : is the vector b in the column space of A?
$Ax = \lambda x$: eigenvector gives direction so that Ax keeps the direction of x. $A^2x = \lambda^2 x$
$Av = \sigma u$: SVD. Which part of that data matrix is important?
Ax is a linear combination of columns of A. Column space of A
3 independent columns in R^3 produce an invertible matrix.
A basis for a subspace is a full set of independent vectors. The rank of a matrix is the dimension of its column space.
A = CR. R = rref(A). The number of independent columns equals the number of independent rows
One column u times one row $v^T$: every nonzero matrix $uv^T$ has rank 1
Outer product approach is important in data science because we are looking for important parts of A
A = LU: Elimination. L is lower triangular and U is upper triangular
A = QR: orthogonalizing. $Q^TQ=I$ and R is upper triangular
$S=Q\Lambda Q^T$: S is symmetric. Eigenvalues on the diagonal of lambda. Orthonormal eigenvectors in the columns of Q
$A=X\Lambda X^{-1}$: diagonalization when A is n by n with n independent eigenvectors. Eigenvalues of A on the diagonal of lambda. Eigenvectors of A in the columns of X.
$A=U\Sigma V^T$: SVD. Orthonormal singular vectors in U and V. Singular values in sigma
Orthogonal matrix: $Q^T=Q^{-1}$. $q_i^Tq_j = 1$ if i = j, otherwise 0
$Sq = \lambda q$ Eigenvector q and eigenvalue lambda. Each eigenvalue and eigenvector contribute a rank one piece to S
Columns in $Q\Lambda$ are $\lambda _1q_1$ to $\lambda _nq_n$, because lambda is a diagonal matrix
nullspace: N(A) contains all solutions x to Ax = 0
left nullspace: $N(A^T)$ contains all solutions y to $A^Ty=0$
r independent equations Ax = 0 have n - r independent solutions
incidence matrix
A has 1 and -1 on every row, to show the end node and the start node for each edge
The r edges of a tree give r independent rows of A : rank = r = n - 1
There are m - r = m - n + 1 independent small loops in the graph (m = number of edges)
Rank
Rank of AB <= rank of A, because every column of AB is a combination of the columns of A
Rank of A + B <= (rank of A) + (rank of B)
Rank of $A^TA$ = rank of $AA^T$ = rank of A = rank of $A^T$, because $A^TA$ and A have the same nullspace.
If A is m by r and B is r by n, both with rank r, then AB also has rank r, because r = rank of $(A^TABB^T)$ <= rank of (AB) <= rank of A = r. However this does not hold for BA
Elimination factored A = LU into a lower triangular L times an upper triangular U
Zero cannot be the first pivot. If there is a nonzero number lower down in column 1, its row can be the pivot row. Good codes will choose the largest number to be the pivot.
Every invertible n by n matrix A leads to P A = LU : P = permutation. The inverse of every permutation matrix P is its transpose $P^T$
Orthogonal Matrices and Subspaces
Orthogonal vectors x and y: $x^Ty= x_1y_1 + .. + x_ny_n = 0$
Orthogonal subspaces R and N. Every vector in the space R is orthogonal to every vector in N.
Orthonormal basis: Orthogonal basis of unit vectors. Every $v_i^Tv_i = 1$
Tall thin matrices Q with orthonormal columns: $Q^TQ= I$. If this Q multiplies any vector x, the length of the vector does not change: $\|Qx\| = \|x\|$
$(Qx)^T(Qx) = x^TQ^TQx = x^Tx$
"Orthogonal matrices" are square with orthonormal columns, $Q^T=Q^{-1}$
Pythagoras Law for right triangles ($x^Ty = 0$): $\|x-y\|^2 = (x-y)^T(x-y)=x^Tx - y^Tx - x^Ty + y^Ty = \|x\|^2 + \|y\|^2$
The row space of A is orthogonal to the nullspace of A. The column space of A is orthogonal to the nullspace of $A^T$
if $P^2 = P = P^T$, then Pb is the orthogonal projection of b onto the column space of P
Multiplying orthogonal matrices produces an orthogonal matrix. $(Q_1Q_2)^T(Q_1Q_2) = I$. Rotation times rotation = rotation. Reflection times reflection = rotation. Rotation times reflection = reflection
Suppose the n by n orthogonal matrix Q has columns $q_1…q_n$. Every vector v can be written as a combination of those basis vectors. The coefficients in an orthonormal basis are just $c_i= q_i^Tv$, i.e., v = Qc
$Q^Tv=Q^TQc=c$
Deep Learning and Neural Nets
- Goal of a neural net: To construct a function that classifies the training data correctly, so it can generalize to unseen test data.
Learning function F
- One input v for each training sample. For the problem of identifying handwritten digits, each input sample will be an image-a matrix of pixels
- Output: 0 to 9. Those ten numbers are the possible outputs from the learning function.
- Train a learning function on part of the MNIST set. Validate the function by choosing unseen MNIST samples
Linear/affine learning function
The simplest learning function would be linear $w=Av+b$
- The entries in the matrix A are the weights to be learned
- b is the bias vector and is to be learned
- Linearity is a very limiting requirement. What would be halfway between I and XIX?
Neural network
$F(v) = L(R(L(R(…Lv))))$
$Lv = Av + b$
R/ReLU: nonlinear function on each component of $Lv$, ReLU(x) = max(0, x) is often picked
$F(x,v)$ depends on the input v and the weights x (A's and b's). $v_1=ReLU(A_1v+b_1)$ produces the first hidden layer of the neural network
The complete net starts with the input layer v and output layer $w=F(v)$
$L_k(v_{k-1})= A_kv_{k-1}+b_k$
The goal is to minimize the total loss over all training samples by choosing the weights in $A_k$ and $b_k$, although least squares is not the best loss function for deep learning
The number of weights $A_{ij}$ and $b_k$ is often larger than the number of training samples
For images, a CNN is often appropriate and weights are shared: the diagonals of A are constant
Weights are optimized by stochastic gradient descent
positive definite symmetric matrices S
Orthogonal eigenvector q. Positive eigenvalues $\lambda$
$S = \lambda_1 q_1q_1^T + \lambda_2 q_2q_2^T….$. Assuming $\lambda_1 > \lambda_2…$, the $\lambda_1 q_1q_1^T$ is the most informative part
Layers of F(x, v)
x is in the ramp function ReLU
Start with fully connected layers: every neuron on layer n connects to every neuron on layer n + 1.
CNN repeats same weights around all pixels in an image
Pooling layer reduces dimensions
Dropout randomly leaves out neurons
Batch normalization resets means and variance