517. Super Washing Machines
Intuition
The calculated max_moves = M represents the peak requirement imposed by either a single machine needing to unload its excess or a single boundary needing a certain net flow. Since M moves provide enough capacity to meet this peak requirement, and moves can be coordinated to address both needs simultaneously, M moves are sufficient to ensure all machines reach the target state.
The primary driver for moves is often to correct imbalances between sections of the machine line. The balance_k value represents the net number of dresses that must eventually move from the left section (0…k) to the right section (k+1…n-1) (if balance_k > 0) or vice-versa (if balance_k < 0).
The core requirement is to achieve a net flow of balance_k dresses across boundary k over the entire process. We know that abs(balance_k) <= M.
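For reference, here is a minimal Python sketch (mine, not part of the original notes) of how max_moves = M is computed: it is the larger of the biggest single-machine surplus and the biggest absolute boundary flow.
def find_min_moves(machines):
    # M = max over (a) each machine's surplus over the target and
    #              (b) the absolute net flow required across each boundary.
    total, n = sum(machines), len(machines)
    if total % n != 0:
        return -1
    target = total // n
    max_moves = prefix_balance = 0
    for load in machines:
        prefix_balance += load - target   # balance(0, k) after this machine
        max_moves = max(max_moves, load - target, abs(prefix_balance))
    return max_moves

assert find_min_moves([1, 0, 5]) == 3   # the machine with 5 must unload 3 dresses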
Pseudo code for the optimal action plan
# balance(i, j) = sum(machines[i:j]) - (j - i) * target
# balance(0, 0) = 0 and balance(0, n) = 0
for round in range(M):  # M is the max of the bottlenecks
    next_round_state = machines[:]
    for i in range(n):  # n is the number of machines
        if machines[i] > 0:
            if balance(0, i) < 0:  # the left section needs dresses
                next_round_state[i] -= 1
                next_round_state[i - 1] += 1
            elif balance(i + 1, n) < 0:  # the right section needs dresses
                next_round_state[i] -= 1
                next_round_state[i + 1] += 1
    machines = next_round_state
When M > max(machines)
i.e., M = abs(balance(0, j)) for a TBD j
Around j the machine balances must look like {… surplus at j1, 0, …, 0, deficit at j2, …}, with j >= j1 + 1 and j <= j2; otherwise abs(balance(0, j)) would not be the maximum. Therefore, we can definitely pass one dress from left to right at each of j1, …, j2 - 1, and thus M = abs(balance(0, j)) is reduced by 1 after this round
When M == max(machines)
assuming machines[k] = max(machines) for a TBD k
Claim A: balance(0, k) <= 0 and balance(k+1, n) <= 0. Proof: by contradiction. If balance(0, k) > 0 and balance(k+1, n) <= 0, then balance(0, k+1) > M, which contradicts M's definition as the max of abs(balance(0, i)) over all i
Claim B: balance(0, k) < 0 or balance(k+1, n) < 0 (possibly both at the same time). Proof: by contradiction. If both were >= 0, then balance(0, n) > 0 because machines[k] exceeds the target, which contradicts the fact that all machines eventually hold the same number of dresses
Claim C: when balance(0, k) < 0 and balance(k+1, n) = 0, then machines[k-1] < M. Proof: by contradiction, assume machines[k-1] = M. At least one dress will be moved from k to k - 1, and then abs(balance(0, k-1)) > abs(balance(0, k)) = M, which contradicts that M is also the max of abs(balance(0, i))
Claim D: when balance(0, k) = 0 and balance(k+1, n) < 0, then machines[k+1] < M Proof: similar to proof of claim C, just in a mirrored direction
Claim E: when balance(0, k) < 0 and balance(k+1, n) < 0, then machines[k-1] < M or machines[k+1] < M. Proof: by contradiction, combining the arguments of Claims C and D
With Claims C, D, and E, we can prove that when M == max(machines), there is always a move from the machine holding M dresses to a neighboring machine with fewer dresses, that is, M is reduced by 1 after this round
When M < max(machines)
TODO
Communication lessons from Gemini 2.5 Pro
Contextual Requests: Avoid Defensive or Overly Justifying Tone
- Too Long/Detailed: Goes into unnecessary background or explains things the receiver likely already knows.
- Focus on Personal Effort/Stress: Emphasizes how much work you’ve done or how stressed you are about the deadline (e.g., “…because I’ve been working on this all day and I really need to get it off my plate”).
- Implies Potential Problems: Phrases it in a way that suggests negative consequences if the task isn’t done, rather than the positive outcome if it is (e.g., “…because if you don’t, the whole project will be delayed”).
- Slightly Accusatory or Patronizing Language: Uses phrases like “As you should know…” or over-emphasizes obvious points.
- If your colleague is doing that
- Acknowledge to yourself that the speaker might not intend it negatively, even if it feels that way. This can help manage your reaction.
- Try to mentally filter past the “as I mentioned” preamble and focus on the actual point being made.
- In your own communication, use softer ways to refer back, like: “Just circling back to the point about X…” or “Building on our previous discussion…” or simply restating the point without the preamble.
- Examples of the contrast
- Example (Contrast): “I really need you to finalize the slides, please try to get them done by 3 PM. Because, as you know, the practice run is at 4 PM, and I can’t possibly merge the decks if I don’t have your part, and it took me ages to get my section done, so we really can’t be late with this.” (Sounds stressed, over-explains the obvious dependency, focuses on personal effort and potential failure).
- Informative/Collaborative: “Here’s the task, and here’s the simple, direct reason why it’s needed for our shared goal.”
- Defensive/Justifying: “Here’s the task, and here are all the reasons why you absolutely must do it, please understand it’s important, I’ve worked hard, and bad things might happen if you don’t.”
Preach on known knowledge and skills
- Instead of focusing on the what (the principles you know), try to understand why they might be bringing up this specific principle right now. Is there a subtle company focus? A recent challenge elsewhere in the org? Sometimes the timing or emphasis can offer clues about broader priorities, even if the content itself isn’t new to you.
- Acknowledge and Pivot: Briefly acknowledge the principle, then immediately connect it to a specific action or question. “Yes, that principle of [Principle Name] is definitely key. I’ve been thinking about how it applies specifically to the [Project/Task Name]. For example, with [Specific Situation], I’m considering [Action A] vs. [Action B]. From your perspective, which aligns better with that principle in this practical case?” This shows you understand the principle and are actively applying it, shifting the focus to actionable advice.
- Share Your Application (Briefly): Casually mention how you’re already using the principle. “That’s a great reminder about [Principle]. We actually leveraged that last week when we [brief example of your action] and it led to [positive outcome]. It really proved effective.” This subtly demonstrates your competence without being defensive.
Inaccurate view of your approach or focus
- Is the Counter-Question a Good Strategy?
- probably not as your only or primary response. Use it sparingly, perhaps when the assumption seems particularly off-base or when you have the time and energy for a potential dialogue, and be very mindful of delivering it with a genuinely curious and non-confrontational tone.
- Gentle, Direct Correction & Restatement: This is often the most efficient and professional approach for frequent, minor misinterpretations.
- Example: “Actually, I focus mainly on the significant deals like X and Y since they have the biggest impact. I don’t track all the smaller ones quite so closely.”
- Why it works: It corrects the record quickly and factually without making a big deal out of it or putting the other person on the defensive. It clarifies your actual behaviour.
- Ignore (Selective Use): If the comment is passing, doesn’t materially affect the work, and you don’t have the energy, sometimes just letting it go is easiest. Choose your battles. If it doesn’t matter, maybe it doesn’t warrant correction every single time.
- Seek Feedback (If You Suspect a Root Cause): If you’re genuinely puzzled why this impression persists, you could ask a trusted colleague or even your manager if they have insights into how you might be perceived or if your role’s focus is clear to others.
Blunt communication
- Assume Neutrality, Not Malice: Try to consciously assume they aren’t trying to be hurtful. They might just be very direct, lack social awareness, be stressed, or think this is an efficient way to communicate. Attributing their bluntness to their style rather than a personal attack can lessen the sting.
- Focus on the Task: Remind yourself that the interaction is about the work, not about you personally. Their feedback or question, however poorly phrased, likely relates to a task, project, or process.
- Listen for the ‘What’, Not the ‘How’: Train yourself to filter out the tone and the rhetorical phrasing. Ask yourself: “What is the underlying point, question, or concern here?” What information do they actually need, or what disagreement are they trying to express?
- Translate the Rhetorical Question: When they ask “Are you… [doing something they disagree with]?”, mentally translate it to: “I have a concern about [the topic]. Can you explain your approach or thinking?” or “I don’t agree with [the implied action]. Here’s why…”
- Ask Open-Ended Questions: Instead of getting defensive, respond to “Are you…?” with questions that require them to explain their actual point.
- “Could you tell me more about your concern regarding [the topic]?”
- “What specific aspect of [the topic] are you questioning?”
- “Help me understand what you think should be happening instead.”
- Paraphrase for Understanding (Content Only): Repeat back what you think their substantive point is, ignoring the blunt delivery.
- “So, if I understand correctly, you’re concerned about [the specific issue] because [potential reason]? Is that right?”
- “Okay, it sounds like you disagree with [method/decision]. What approach do you suggest?” This forces them to move beyond the rhetorical question into a constructive discussion.
- It’s Not (Necessarily) About You: Consciously remind yourself that their communication style is likely how they interact with many people, not just you. It reflects their habits, not your shortcomings.
- Separate Feedback from Judgment: Even if the delivery feels judgmental, focus only on the actionable feedback or the core question about the work itself.
Culture and strategy
- Culture is the Enabler (or Disabler) of Strategy: Think of culture as the soil and climate, and strategy as the seeds you plant. You can have the best seeds (strategy), but if the soil is toxic or the climate hostile (culture), nothing will grow. Your focus on cultivating a culture where people tackle necessary work, not just interesting work, is precisely about improving the “soil” so that any strategy can take root and flourish.
- Working on This Specific Culture Aspect Is Strategic: You aren’t choosing culture over strategy. You are identifying a critical cultural deficiency (avoidance of necessary work) that directly threatens the execution of any strategy the company adopts. Fixing this is a strategic imperative. It’s foundational work required for future strategic success.
- Reconciling the Feedback: The feedback to “focus more on company strategy” might mean:
- Ensure Alignment: Make sure your team understands the company strategy and how their work (including the less glamorous parts) contributes to it. Use the strategy as the “why” behind the cultural push.
- Strategic Context: Ensure your understanding and communication of the strategy are sharp. You need to articulate which strategic goals are hampered by the current cultural issue.
- Don’t Lose Sight of the Big Picture: While fixing the cultural plumbing, don’t forget to keep an eye on the overall direction (strategy). Ensure the cultural changes you’re making actively support the intended strategic direction.
- Frame the Culture Work Strategically: When you talk to your team about sharing the load and doing necessary work, explicitly link it to achieving specific strategic objectives. “We can’t hit Goal X if tasks Y and Z are consistently delayed because they aren’t seen as ‘interesting’. Achieving our strategy requires all hands on deck for all necessary tasks.”
- Use Strategy to Prioritize Culture Efforts: Which aspects of the strategy are most at risk due to this cultural trait? Focus your cultural interventions there first.
- Communicate Upwards: Explain to those who gave you the feedback how your focus on this specific cultural aspect is essential for enabling the company’s strategy. Frame it not as an either/or, but as necessary groundwork. “To ensure we can execute on Strategy A, I’m currently prioritizing building a team culture that ensures reliable completion of all required tasks, B and C, which are foundational to A.”
Directional listening
- Challenge Literal Interpretation: Actively pause when you receive input. Ask yourself: “Is this a direct order, a strong suggestion, a brainstormed idea, or an expression of a concern?”
- Instead of just accepting the suggestion, probe deeper:
- “Thanks for that suggestion. Can you help me understand the specific problem you’re hoping this will solve?”
- “What’s the main outcome you’re looking for with that approach?”
- “Could you walk me through your thinking on how this would work?”
- “That’s an interesting idea. How do you see it fitting with [Project Goal X] or [Constraint Y]?”
- “Is this something you feel is essential, or more of an idea to explore?” (Use this carefully, but it can directly ask about the ‘directionality’).
- Show you’ve heard and understood the intent or the value in their suggestion, even if you disagree with the specific implementation.
- “I understand why you’re suggesting [the idea]. It makes sense that you’re concerned about [the underlying issue].”
- “Thank you, I appreciate you bringing that perspective. You’re right that [aspect they highlighted] is important.”
- “I can see the potential benefit of [their suggestion] in terms of [positive outcome].”
- articulate why their suggestion might not be feasible or the best approach in its current form. Be factual and objective.
- Refer to Goals/Scope: “While that’s a great idea, it might take us outside the agreed-upon scope for this phase. Perhaps we can log it for future consideration?”
- Mention Constraints: “Implementing that would require [resource/time/budget] that we currently don’t have allocated.” or “Based on the current technical setup, that approach might introduce [specific risk or complexity].”
- Cite Data/Research: “Our user research indicated [finding X], which suggests we might want to prioritize [alternative approach].”
- Explain Trade-offs: “We could do [their suggestion], but it would mean we’d have to sacrifice [something else important]. My recommendation is to stick with [original plan] because it better aligns with our primary objective of [Goal Y].”
- Frame your reasoning around the project’s success or the team’s objectives.
- “We both want [desired outcome]. My concern with [their suggestion] is that it might impact [shared goal] in this way: […]. How can we achieve [the stakeholder’s underlying need] while protecting [the shared goal]?”
- Show you’re still trying to incorporate the spirit of their feedback.
- “While we might not be able to do [exact suggestion] due to [reason], what if we addressed your concern by doing [alternative solution]?”
- “Could we perhaps implement a simpler version of your idea that fits within our current constraints?”
- “I understand the need for [what their suggestion achieves]. Could we explore [another way] to get a similar result?”
- Since tone can be tricky, follow up verbal discussions with brief emails summarizing the input, your understanding, the decision made, and the reasoning. This confirms understanding without relying on tone interpretation.
- “Hi [Stakeholder], thanks for discussing [topic] today. Just to confirm my understanding, you suggested [their idea] to address [their concern]. After considering it against [project goals/constraints], we decided to proceed with [chosen approach] because [brief reason]. We appreciate your input on this.”
Evals for LLM systems
Why evals should get more attention
- Unsuccessful AI products almost always share a common root cause: a failure to create robust evaluation systems. Conversely, creating high quality evals is one of the most impactful things you can do.
- Evaluating quality (ex: tests) and debugging issues (ex: logging & inspecting data) don’t get enough attention, but they improve your product past demo stage.
- Addressing one failure mode led to the emergence of others, resembling a game of whack-a-mole.
- There was limited visibility into the AI system’s effectiveness across tasks beyond vibe checks.
- Prompts expanded into long and unwieldy forms, attempting to cover numerous edge cases and examples.
Unit Tests
- Unlike typical unit tests, you want to organize these assertions for use in places beyond unit tests, such as data cleaning and automatic retries (using the assertion error to course-correct) during model inference.
- The most effective way to think about unit tests is to break down the scope of your LLM into features and scenarios.
- For example, here is one such prompt Rechat uses to generate synthetic inputs for a feature that creates and retrieves contacts.
Write 50 different instructions that a real estate agent can give to his assistant to create contacts on his CRM. The contact details can include name, phone, email, partner name, birthday, tags, company, address and job.
For each of the instructions, you need to generate a second instruction which can be used to look up the created contact.
The results should be a JSON code block with only one string as the instruction, like the following:
[
["Create a contact for John (johndoe@apple.com)",
"What's the email address of John Smith?"]
]
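To make the feature/scenario breakdown above concrete, here is a minimal sketch of reusable assertions over a logged trace. The trace fields (steps, final_answer, input) and the specific checks are illustrative assumptions, not Rechat's actual harness.
import json
import re

def check_contact_creation(trace):
    """Assertion-style checks for the 'create contact' feature; returns a list of failures."""
    failures = []
    tool_calls = [s for s in trace["steps"] if s["type"] == "tool_call"]
    # Scenario: exactly one CRM tool call is expected.
    if len(tool_calls) != 1 or tool_calls[0]["name"] != "create_contact":
        failures.append("expected exactly one create_contact tool call")
    # Scenario: no placeholder text should leak into the user-facing answer.
    if re.search(r"\bUUID\b|\bnull\b", trace["final_answer"]):
        failures.append("placeholder text leaked into the final answer")
    # Scenario: tool arguments must be valid JSON and echo an email if one was given.
    if tool_calls:
        try:
            args = json.loads(tool_calls[0]["arguments"])
            if "@" in trace["input"] and "email" not in args:
                failures.append("email in the instruction but missing from tool arguments")
        except json.JSONDecodeError:
            failures.append("tool arguments are not valid JSON")
    return failures
The same checks can be reused at inference time inside a retry loop, feeding the failure messages back to the model to course-correct.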
Model Eval
- A two stage process where the model first answers the question, then we ask a different model to look at the response to check if it’s correct.
- A prerequisite to performing human and model-based eval is to log your traces
- Researchers are now employing LLMs like GPT-4 to evaluate the outputs of similar models. This recursive use of LLMs for evaluation underscores the continuous cycle of improvement and refinement in the field. Human and GPT-4 judges can reach above 80% agreement on the correctness and readability score. If the requirement is smaller or equal to 1 score difference, the agreement level can reach above 95%
- Critiques from a good evaluator model can be used to curate high-quality synthetic data,
Example
- asks the model to write a funny joke. The model then generates a completion.
- We then create a new input to the model to answer the question: “Is this following joke funny? First reason step by step, then answer yes or no”
- We finally consider the original completion correct if the new model completion ends with “yes”.
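A minimal sketch of that two-stage check; complete(prompt) is a hypothetical stand-in for whatever LLM call your stack uses.
def complete(prompt):
    """Stand-in for an LLM API call; replace with your provider's SDK."""
    raise NotImplementedError

def joke_is_funny(topic):
    # Stage 1: the model under test generates the completion.
    joke = complete(f"Write a funny joke about {topic}.")
    # Stage 2: a grader model reasons step by step and must end with yes or no.
    verdict = complete(
        "Is the following joke funny? First reason step by step, then answer yes or no.\n\n"
        f"Joke: {joke}"
    )
    # The original completion is considered correct if the grader's answer ends with "yes".
    return verdict.strip().lower().endswith("yes")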
Methodology
- The evaluation begins with the creation of a benchmark dataset, which should be as representative as possible of the data the LLM will encounter in a live environment.
- One way to speed up the process of building eval datasets, is to use GPT-4 to generate synthetic data
- Once we have our evaluation test set complete with ground truth and responses generated by our LLM application, the next step is to grade these results. This phase involves using a mix of LLM-assisted evaluation prompts and more integrated, hybrid approaches.
- Opt for the most robust model you can afford: Advanced reasoning capabilities are often required to effectively critique outputs. Your deployed system might need to have low latency for user experience. However, a slower and more powerful LLM might be needed for evaluating outputs effectively.
- The evaluating model might make errors and give you a false sense of confidence in your system.
Human Eval
- I often find that “correctness” is somewhat subjective, and you must align the model with a human.
- A translation could score high on BLEU for having words in technically correct order, but still miss the mark in conveying the right tone, style, or even meaning as intended in the original text.
- LLMs may exhibit a preference for answers generated by other LLMs over human-authored text, potentially leading to a skewed evaluation favoring machine-generated content.
Best practices
- May need to build a custom tool to show
- What tool (feature) & scenario was being evaluated.
- Whether the trace resulted from a synthetic input or a real user input.
- Filters to navigate between different tools and scenario combinations.
- Links to the CRM and trace logging system for the current record.
- Create positive and negative evals: Something cannot be logically true and untrue at the same time. Carefully design your evals to increase confidence.
- We noticed that many failures involved small mistakes in the final output of the LLM (format, content, etc.). We decided to make the final output editable by a human so that we could curate & fix data for fine-tuning.
- When starting, you should examine as much data as possible. I usually read traces generated from ALL test cases and user-generated traces at a minimum. You can never stop looking at data—no free lunch exists. However, you can sample your data more over time, lessening the burden.
- One signal you are writing good tests and assertions is when the model struggles to pass them - these failure modes become problems you can solve with techniques like fine-tuning later on.
References
- Getting Started with OpenAI Evals
- All about LLM Evals
- Your AI Product Needs Evals
Building agents
Use cases for agents
- Prioritize workflows that have previously resisted automation, especially where traditional methods encounter friction:
- Workflows involving nuanced context-sensitive decisions, for example refund approval in customer service workflows.
- Systems that have become unwieldy due to extensive and intricate rulesets, making updates costly or error-prone, for example performing vendor security reviews.
- Scenarios that involve interpreting natural language, extracting meaning from documents, or interacting with users conversationally, for example processing a home insurance claim.
- Analyze logs and diagnose
- Analyze the user's query and statement
- Use system logs, monitoring, and the knowledge base to analyze
- Generate test data based on dialogs
- Use natural language and the knowledge base.
- Build atomic tools to cover core promotion, cart, order, and fulfillment flows
- Customer support
- Support interactions naturally follow a conversation flow while requiring access to external information and actions;
- Tools can be integrated to pull customer data, order history, and knowledge base articles;
- Actions such as issuing refunds or updating tickets can be handled programmatically
- Success can be clearly measured through user-defined resolutions.
- Coding agents
- Code solutions are verifiable through automated tests;
- Agents can iterate on solutions using test results as feedback;
- The problem space is well-defined and structured;
Constraints
- The LLM will potentially operate for many turns, and you must have some level of trust in its decision-making. Agents’ autonomy makes them ideal for scaling tasks in trusted environments.
- The autonomous nature of agents means higher costs, and the potential for compounding errors. We recommend extensive testing in sandboxed environments, along with the appropriate guardrails.
Start simple
- When building applications with LLMs, we recommend finding the simplest solution possible, and only increasing complexity when needed. This might mean not building agentic systems at all. Agentic systems often trade latency and cost for better task performance
- We suggest that developers start by using LLM APIs directly: many patterns can be implemented in a few lines of code. If you do use a framework, ensure you understand the underlying code. Incorrect assumptions about what’s under the hood are a common source of customer error.
- For many applications optimizing single LLM calls with retrieval and in-context examples is usually enough.
- Build your agent prototype with the most capable model for every task to establish a performance baseline. From there, try swapping in smaller models to see if they still achieve acceptable results.
Best practices
- Prioritize transparency by explicitly showing the agent’s planning steps.
- While building our agent for SWE-bench, we actually spent more time optimizing our tools than the overall prompt. For example, we found that the model would make mistakes with tools using relative filepaths after the agent had moved out of the root directory. To fix this, we changed the tool to always require absolute filepaths—and we found that the model used this method flawlessly.
Instructions
- Give the model enough tokens to “think” before it writes itself into a corner.
- Keep the format close to what the model has seen naturally occurring in text on the internet.
- Make sure there’s no formatting “overhead” such as having to keep an accurate count of thousands of lines of code, or string-escaping any code it writes.
- Rather than maintaining numerous individual prompts for distinct use cases, use a single flexible base prompt that accepts policy variables.
You are a call center agent. You are interacting with {user_first_name}, who has been a member for {user_tenure}. The user's most common complaints are about {user_complaint_categories}. Answer any questions the user may have
Routine
- Make sure every step in your routine corresponds to a specific action or output. For example, a step might instruct the agent to ask the user for their order number or to call an API to retrieve account details. Being explicit about the action (and even the wording of a user-facing message) leaves less room for errors in interpretation.
- Real-world interactions often create decision points such as how to proceed when a user provides incomplete information or asks an unexpected question. A robust routine anticipates common variations and includes instructions on how to handle them with conditional steps or branches such as an alternative step if a required piece of info is missing.
- Generate instructions from routine
You are an expert in writing instructions for an LLM agent. Convert the following help center document into a clear set of instructions, written in a numbered list. The document will be a policy followed by an LLM. Ensure that there is no ambiguity, and that the instructions are written as directions for an agent. The help center document to convert is the following
Workflows
- Prompt chaining: trade off latency for higher accuracy by making each LLM call an easier task (a minimal sketch follows this list)
- Generating Marketing copy, then translating it into a different language.
- Writing an outline of a document, checking that the outline meets certain criteria, then writing the document based on the outline.
- Routing
- Directing different types of customer service queries (general questions, refund requests, technical support) into different downstream processes, prompts, and tools.
- Routing easy/common questions to smaller models like Claude 3.5 Haiku and hard/unusual questions to more capable models like Claude 3.5 Sonnet to optimize cost and speed.
- Parallelization
- Implementing guardrails where one model instance processes user queries while another screens them for inappropriate content or requests. This tends to perform better than having the same LLM call handle both guardrails and the core response.
- Automating evals for evaluating LLM performance, where each LLM call evaluates a different aspect of the model’s performance on a given prompt.
- Reviewing a piece of code for vulnerabilities, where several different prompts review and flag the code if they find a problem. Evaluating whether a given piece of content is inappropriate, with multiple prompts evaluating different aspects or requiring different vote thresholds to balance false positives and negatives.
- Orchestrator-workers: complex tasks where you can’t predict the subtasks needed (in coding, for example, the number of files that need to be changed and the nature of the change in each file likely depend on the task)
- Coding products that make complex changes to multiple files each time.
- Search tasks that involve gathering and analyzing information from multiple sources for possible relevant information.
- Evaluator-optimizer
- Literary translation where there are nuances that the translator LLM might not capture initially, but where an evaluator LLM can provide useful critiques.
- Complex search tasks that require multiple rounds of searching and analysis to gather comprehensive information, where the evaluator decides whether further searches are warranted.
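As a concrete illustration of the first pattern above (prompt chaining), here is a sketch of the outline-then-write chain with a cheap programmatic gate between the two calls; complete is again a hypothetical stand-in for an LLM call.
def complete(prompt):
    """Stand-in for an LLM API call."""
    raise NotImplementedError

def write_document(topic, max_attempts=3):
    outline = ""
    for _ in range(max_attempts):
        # Call 1: produce an outline (an easier task than writing the whole document).
        outline = complete(f"Write a 5-bullet outline for a document about {topic}.")
        # Gate: a deterministic check between the chained calls.
        if outline.count("-") >= 5 and len(outline) < 2000:
            break
    # Call 2: write the document conditioned on the (checked) outline.
    return complete(f"Write the document about {topic}, following this outline:\n{outline}")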
Guardrails
- Often implemented as parallelization workflows. The primary agent proactively generates outputs while guardrails run concurrently, triggering exceptions if constraints are breached.
- Ensures agent responses stay within the intended scope by flagging off-topic queries. For example, “How tall is the Empire State Building?” is an off-topic user input and would be flagged as irrelevant.
- Detects unsafe inputs ( jailbreaks or prompt injections) that attempt to exploit system vulnerabilities.
- Prevents unnecessary exposure of personally identifiable information (PII) by vetting model output for any potential PII.
- Flags harmful or inappropriate inputs (hate speech, harassment, violence) to maintain safe, respectful interactions.
- Assess the risk of each tool available to your agent by assigning a rating—low, medium, or high—based on factors like read-only vs. write access, reversibility, required account permissions, and financial impact. Use these risk ratings to trigger automated actions, such as pausing for guardrail checks before executing high-risk functions or escalating to a human if needed.
- Simple deterministic measures (blocklists, input length limits, regex filters) to prevent known threats like prohibited terms or SQL injections.
- Ensures responses align with brand values via prompt engineering and content checks, preventing outputs that could harm your brand’s integrity.
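A minimal sketch of the deterministic layer listed above (blocklist, input length limit, simple regex); the specific patterns and thresholds are illustrative assumptions.
import re

BLOCKLIST = {"ignore previous instructions", "drop table"}
MAX_INPUT_CHARS = 4000
SSN_LIKE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # crude PII pattern for illustration

def deterministic_guardrail(user_input):
    """Cheap checks that can run before (or alongside) the LLM call. Returns (allowed, reason)."""
    lowered = user_input.lower()
    if len(user_input) > MAX_INPUT_CHARS:
        return False, "input exceeds length limit"
    if any(term in lowered for term in BLOCKLIST):
        return False, "blocklisted term detected"
    if SSN_LIKE.search(user_input):
        return False, "possible PII detected"
    return True, "ok"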
References
- A practical guide for building agents
- Building effective agents
How to Recognize Hidden Feedback
Why Vital Feedback Stays Hidden
- People sometimes incorrectly assume that they’ve given the feedback more explicitly than they really have (“I made the same correction three times. Isn’t it obvious that I’m displeased with this?”)
- People with critiques or suggestions may not even consciously realize that they have important feedback to give. They may be rushing quickly between tasks, out of touch with their actual concerns, lacking the words to express their intuitions, or only vaguely aware of the broader message that would be important to convey.
How to Detect and Learn from Hidden Feedback
- When stakeholders repeatedly return to ostensibly minor suggestions or questions, this may mask broader, unstated concerns about capability, readiness, or performance. These micro-moments typically aggregate into unspoken feedback about perceptions of leadership blind spots or weaknesses.
- “I’ve noticed you’re particularly interested in our timelines. Is there any broader concern about our execution speed, or anything else? If so, I’d love to understand it so we can step up and perform at our best.”
- When people suddenly engage (or send in proxies to engage) in decisions that would not typically involve them, or when they request more reviews than usual, it can indicate eroding confidence.
- When people decrease their involvement or participation without explaining why, it can be a signal that they either don’t want to engage or have deprioritized the work—both of which may indicate growing concerns about leadership effectiveness or strategic alignment.
- Why send delegates rather than attend the meetings themselves?
- “I’ve noticed a pattern in attendance at my senior talent meetings. I’m curious if there’s something about our process that isn’t working for you, or if you have suggestions for how to make them better.”
Make it safe for others to tell you the truth
- “As I look ahead, I’m trying to grow my leadership, and I value your perspective. What blind spots should I be aware of that I might not be seeing?”
- “I’d value your perspective on how my message landed in today’s meeting. What signals or reactions did you notice? And how did it land with you?”
- “Given your role, you might see things I’m missing. What patterns or concerns should I be aware of?”
Reframe as a request for strategic counsel if the barrier to direct feedback is too high.
- “As part of my professional development, I’m always looking to improve. What’s one thing you think I could be doing differently to raise my game?”
- “Based on your experience and vantage point, how might you approach this challenge differently than I am?”
- “If you were mentoring someone facing similar circumstances, what guidance would you offer?”
Articulate the pattern you’re seeing and more explicitly ask if there might be something important
- “I’ve noticed this topic coming up in several conversations. Is there a broader concern we should be discussing?”
- “This seems to be a recurring theme in our interactions. What underlying issues could we be addressing?”
- “I’m seeing a pattern here that might signal something important. Could you help me understand if there’s more to explore?”
Listen to learn
- “I’d love your perspective on {} . Can you give it some thought, and let’s discuss when we meet next?”
- Don’t assume the first thing they say is the only or even the most important feedback they have for you.
- “That’s interesting. Can you say more about that?”
- “Very helpful. Can you share an example?”
- “Great to know that. What else?”
From Gemini 2.5 Pro
- Consistently applying these techniques—actively seeking cues, creating safety, deep listening, paraphrasing, following up—requires considerable time and emotional energy from the leader, which could be a challenge in demanding roles.
- Instead of broad invitations, ask specific questions like, “What’s one thing you see me doing that gets in my way?” or “What’s one thing I could do differently?”.
- Faster methods are often better suited for gathering more direct, already-formed feedback or making delivery more efficient, but they might miss the underlying issues that aren’t being explicitly stated. The trade-off is often between speed and the depth or nuance of the feedback obtained.
Prompt Engineering
Mostly notes on the whitepaper from Google
- Prompt engineering is an iterative process
- Document various prompt attempts
- When zero shot does not work, try one shot and few shot prompting, i.e., single example and multiple examples. Rule of thumb is 3-5 examples. Include edge cases if the input could be diverse
- Prompting for a JSON format output forces the model to create a structure and limit hallucinations.
- Growing research suggests that focusing on positive instructions in prompting can be more effective than relying heavily on constraints. This approach aligns with how humans prefer positive instructions over lists of what not to do.
- Placing the detailed Context before the specific Task/Ask is often preferred as it helps the LLM build a complete understanding before acting.
System prompting
- sets the overall context and purpose for the language model. It defines the ‘big picture’ of what the model should be doing, like translating a language, classifying a review etc.
- The name ‘system prompt’ actually stands for ‘providing an additional task to the system’
Classify movie reviews as positive, neutral or negative. Return valid JSON.
Review:...
Schema:...
JSON Response:...
Contextual prompting
Provides immediate, task-specific information to guide the response. It’s highly specific to the current task or input, which is dynamic.
Context: You are writing for a blog about retro 80's arcade video games.
Suggest 3 topics to write an article about with a few lines of description of what this article should contain.
Role prompting
There can be considerable overlap between system, contextual, and role prompting. E.g. a prompt that assigns a role to the system, can also have a context.
I want you to act as a travel guide.
I will write to you about my location and you will suggest 3 places to visit near me.
In some cases, I will also give you the type of places I will visit.
My suggestion: "I am in Amsterdam and I want to visit only museums."
Travel Suggestions:
Step-back prompting
Prompting the LLM to first consider a general question related to the specific task at hand, and then feeding the answer to that general question into a subsequent prompt for the specific task. This ‘step back’ allows the LLM to activate relevant background knowledge and reasoning processes before attempting to solve the specific problem.
Based on popular first-person shooter action games, what are 5 fictional key settings that contribute to a challenging and engaging level storyline in a first-person shooter video game?
Context: 5 engaging themes for a first person shooter video game:
...
Take one of the themes and write a one paragraph storyline for a new level of a first-person shooter video game that is challenging and engaging.
CoT
- Generally, any task that can be solved by ‘talking through’ it is a good candidate for chain of thought. If you can explain the steps to solve the problem, try chain of thought.
When I was 3 years old, my partner was 3 times my age. Now, I am 20 years old. How old is my partner? Let's think step by step.
- You can combine it with few-shot prompting to get better results on more complex tasks that require reasoning before responding as it’s a challenge with a zero-shot chain of thought.
Q: When my brother was 2 years old, I was double his age. Now I am 40 years old. How old is my brother? Let's think step by step.
A: When my brother was 2 years, I was 2 * 2 = 4 years old. That's an age difference of 2 years and I am older. Now I am 40 years old, so my brother is 40 - 2 = 38 years old. The answer is 38.
Q: When I was 3 years old, my partner was 3 times my age. Now, I am 20 years old. How old is my partner? Let's think step by step.
A:
Self-consistency
- Generating diverse reasoning paths: The LLM is provided with the same prompt multiple times. A high temperature setting encourages the model to generate different reasoning paths and perspectives on the problem.
- Extract the answer from each generated response.
- Choose the most common answer.
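A sketch of the majority-vote loop; complete and extract_answer are hypothetical helpers (the latter would parse the final answer out of the reasoning text).
from collections import Counter

def complete(prompt, temperature=0.7):
    """Stand-in for an LLM API call with a temperature knob."""
    raise NotImplementedError

def extract_answer(text):
    """Stand-in for pulling the final answer out of a chain-of-thought response."""
    raise NotImplementedError

def self_consistent_answer(question, n_samples=5):
    votes = Counter()
    for _ in range(n_samples):
        # A high temperature encourages diverse reasoning paths on the same prompt.
        reasoning = complete(question + "\nLet's think step by step.", temperature=0.9)
        votes[extract_answer(reasoning)] += 1
    # The most common final answer wins.
    return votes.most_common(1)[0][0]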
Prompts for explaining code
Explain to me the below Bash code:
{code in markdown}
Prompts for debugging and reviewing code
The below Python code gives an error:
{log in markdown}
Debug what's wrong and explain how I can improve the code.
{code in markdown}
Use variables in prompts
VARIABLES
{city} = "Amsterdam"
PROMPT
You are a travel guide. Tell me a fact about {city}
How to make papers easier to review
The same ideas also apply to design docs
Common misconceptions
- Author: Record the process of how we reached the solution by going through complex subproblems. The reviewer should be impressed by the smart process
- Reviewer: Why one problem after another? Which is the main problem to solve? Which parts are common practice, and which are the author's own?
- Author: Spend two pages on thinking through the problem to transform to a different problem. Reviewer should appreciate this challenging mind game
- Reviewer: No intro and conclusion; what is the purpose of this thought process?
- Author: Added innovation to existing work to reach SOTA. The reviewer should appreciate the experimental results
- Reviewer: Yet another AAA + BBB. Then let me check the effect. What does this result mean? Is it a significant improvement?
Don’t record review process
- The reviewer wants to quickly get your new contribution. The goal is the final state, not the research progress.
Bad Example
- To solve problem 1, I have solution A, but A’s result is not ideal because of problem 2, and 2 is solved by my solution B.
Reviewer’s PoV
- how to prove 2 is the root cause of bad results of A
- B seems to have little difference from A
- A has some effect, how to prove this is because it solved problem 1
Author’s confusion
- The final result is B, plus I built A too
- Why do I need to prove the relationship between B and A?
Reviewer’s reaction
- You didn’t emphasize A
- A didn’t solve your problem 2, which you emphasized
- If A is not a major contribution, then B has little value
How to fix with a better framework
Flow: Problem -> theory behind it -> solution and implementation -> experiments to verify. This flow means we should avoid “based on” and “A+B”
- A new solution to an important problem
- A new framework and a common trick
- Prove that we need both. Without this trick, the result will be worse, because of an interesting underlying problem
- This trick can be applied to other methods
Reviewer’s PoV
- I can see what is new and I just need to decide if they are technically sound and if effects are significant
Readability
- Coherence comes from the logic itself, not from connecting words
- Explain the effect right after the name (we propose XXX, which is a two-layer MLP). Otherwise, the reader may be frustrated when they discover it is so simple after many paragraphs
- When you are expanding the story and the reviewer is deciding whether it makes sense, do not reference graphs that appear a few pages later, because a paper is linear storytelling.
- Around a table you should explain the key points with high information density
- If you want to emphasize Table 5, the sentence analyzing the result should be on the same page as Table 5, because readers may not jump between your text and the graph to find the related analysis
- Don't expect readers to compare results themselves. Compare them yourself; it is OK to repeat the comparison of certain results.
Rebuttal
- Reviewer demands optional data
- Add the data, also list influential parallel work and show that data is in general optional.
- Don’t give the AC the impression that you are preaching; e.g., ask them to weigh the quality of the review comments rather than asking to exclude the reviewer's scores.
- Be explicit about your action items to address comments
Story telling
Motivation
- Define the problem: Why is it a common problem?
- For example, to solve congestion, we need to control the send speed at the endpoint level, which means we need (1) to sense the network's congestion, and (2) an algorithm to control the send speed to reduce congestion
- What are problems with current algorithms?
- What is your key idea and what makes your key idea promising.
- Adjust granularity based on the audience
Contribution
- Insights, not a manual. Why does your design work? Prove that all parts of your design are needed, not just some of them.
- Can your insights be applied to other scenarios?
References
George Polya: Let's teach guessing
- all-important reminder that we must test our guesses
- Guess and then prove. All great discoveries are done this way
- Finished math consists of proofs, but math in the making/discovery consists of guesses
How many parts do 5 planes cut space into?
- Students started with random guesses
- Start with easier problems which could help
- A student asked if planes can be parallel, and prompted the clarification that planes are random
- 3 planes give 8
- Think of extreme cases in reasonable guessing
- Guess 4 planes give 16 based on observation of pattern, and then generalize, i.e., induction
- Need to test the guess/prediction
- Use analogy to help guess: what about lines in a plane instead of planes in space? 3 lines confine a triangle, just as 4 planes confine a tetrahedron
- It turns out 3 lines give only 7 regions, which suggests the guess of 16 in the plane version is incorrect: we count 1 triangle + 3 side-adjacent + 3 vertex-adjacent regions. In the planes-in-space case, 4 planes give 1 finite region + 4 face-adjacent + 6 edge-adjacent + 4 vertex-adjacent = 15
- Analogy: a line divided by points; we guessed by induction when we saw the pattern in lower dimensions
- Proving the 5-plane case directly is hard, so we try the 4-line case first. After verifying that 4 lines give 11 regions, we are more confident in the guess of 26
- Proving the answer is out of scope for this lecture
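For a quick check of the guesses above, the standard count for n cuts in general position in d dimensions is $\sum_{i=0}^{d} \binom{n}{i}$ (this is outside the lecture's scope, but it confirms the numbers).
from math import comb

def regions(n_cuts, dim):
    # Regions created by n cuts in general position in `dim` dimensions.
    return sum(comb(n_cuts, i) for i in range(dim + 1))

assert regions(5, 1) == 6    # 5 points cut a line into 6 pieces
assert regions(3, 2) == 7    # 3 lines give 7, not 8
assert regions(4, 2) == 11   # 4 lines give 11
assert regions(3, 3) == 8    # 3 planes give 8
assert regions(4, 3) == 15   # 4 planes give 15, not 16
assert regions(5, 3) == 26   # the guess of 26 checks out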
The bitter lesson
- General methods that leverage computation are the most effective by a large margin. The ultimate reason is generalized Moore’s law, i.e., continued exponentially falling cost per unit of computation.
- Most AI research has been conducted as if the computation available to the agent were constant (in which case leveraging human knowledge would be one of the only ways to improve performance) but, over a slightly longer time than a typical research project, massively more computation inevitably becomes available.
- General methods and human knowledge often run counter to each other
- they are contending for research time
- psychological commitments
- the human-knowledge approach tends to complicate methods in ways that make them less suited to taking advantage of general methods leveraging computation
Examples
- In computer chess, the methods that beat the world champion in 1997 used massive, deep search. When this simpler, search-based approach with special hardware and software proved vastly more effective, the researchers who had pursued human-knowledge-based methods were not good losers.
- In computer Go, as in computer chess, researchers’ initial effort was directed towards utilizing human understanding (so that less search was needed) and only much later was much greater success had by embracing search and learning.
- Learning by self play to learn a value function (as it was in many other games and even in chess, although learning did not play a big role in the 1997 program that first beat a world champion). Learning by self play, and learning in general, is like search in that it enables massive computation to be brought to bear. Search and learning are the two most important classes of techniques for utilizing massive amounts of computation in AI research.
- Speech recognition. The statistical methods won out over the human-knowledge-based methods. This led to a major change in all of natural language processing, gradually over decades, where statistics and computation came to dominate the field. The recent rise of deep learning in speech recognition is the most recent step in this consistent direction. Deep learning methods rely even less on human knowledge, and use even more computation, together with learning on huge training sets, to produce dramatically better speech recognition systems
- Computer vision. Early methods conceived of vision as searching for edges, or generalized cylinders, or in terms of SIFT features. But today all this is discarded. Modern deep-learning neural networks use only the notions of convolution and certain kinds of invariances, and perform much better.
- The two methods that seem to scale arbitrarily with increased computation are search and learning.
- we should stop trying to find simple ways to think about the contents of minds, such as simple ways to think about space, objects, multiple agents, or symmetries. All these are part of the arbitrary, intrinsically-complex, outside world. They are not what should be built in, as their complexity is endless; instead we should build in only the meta-methods that can find and capture this arbitrary complexity. Essential to these methods is that they can find good approximations, but the search for them should be by our methods, not by us. We want AI agents that can discover like we can, not which contain what we have discovered. Building in our discoveries only makes it harder to see how the discovering process can be done.
How I would implement row access policy and aggregation policy in SF
Row access policy
CREATE OR REPLACE ROW ACCESS POLICY security.sales_policy
AS (sales_region varchar) RETURNS BOOLEAN ->
'sales_executive_role' = CURRENT_ROLE()
OR EXISTS (
SELECT 1 FROM salesmanagerregions
WHERE sales_manager = CURRENT_ROLE()
AND region = sales_region
)
;
Snowflake determines whether a row access policy is set on a database object.
- SQL is parsed into AST
- At each FROM node with table or view, look up the policy with the (role, table name) tuple
Snowflake creates a dynamic secure view (i.e. a secure inline view) of the database object.
- Replace the FROM table node with a subtree that selects everything from the table and filters rows through the policy predicate (the secure inline view)
The values of the columns specified in the ALTER TABLE or ALTER VIEW command (i.e., when adding a row access policy to a table or view) are bound to the corresponding parameters in the policy, and the policy expression is evaluated.
- Execute the rewritten SQL
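As a toy, string-level illustration of that rewrite step (not Snowflake's actual implementation), wrapping the referenced table in a secure inline view might look like this:
def wrap_with_row_policy(table, policy_predicate):
    """Replace a bare table reference with a dynamic secure inline view."""
    return f"(SELECT * FROM {table} WHERE {policy_predicate}) AS {table}"

# SELECT * FROM sales  ->  SELECT * FROM (SELECT * FROM sales WHERE <policy>) AS sales
secure_view = wrap_with_row_policy(
    "sales",
    "'sales_executive_role' = CURRENT_ROLE() OR EXISTS ("
    "SELECT 1 FROM salesmanagerregions "
    "WHERE sales_manager = CURRENT_ROLE() AND region = sales.sales_region)",
)
print(f"SELECT * FROM {secure_view}")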
Aggregation policies
Aggregation policies can be used with or without an entity key. When aggregation policies are used without an entity key, they protect the privacy of individual rows in the data set (that is, row-level privacy). If you use an aggregation policy with an entity key, it protects the privacy of an entity, even if information about that entity appears in multiple rows (that is, entity-level privacy).
Entity-level privacy strengthens the privacy protections provided by aggregation policies. With entity-level privacy, Snowflake can ensure that an aggregation group contains a certain number of entities, not just a certain number of rows.
CREATE AGGREGATION POLICY my_agg_policy
AS () RETURNS AGGREGATION_CONSTRAINT ->
CASE
WHEN CURRENT_ROLE() = 'ADMIN'
THEN NO_AGGREGATION_CONSTRAINT()
ELSE AGGREGATION_CONSTRAINT(MIN_GROUP_SIZE => 3)
END;
ALTER TABLE t1 SET AGGREGATION POLICY my_agg_policy;
- SQL is parsed into an AST; when an aggregation node is discovered, replace it with conditional grouping
SELECT state, AVG(elevation) AS avg_elevation
FROM t1
GROUP BY state;
becomes
--this query is not valid for Postgres
SELECT
CASE
WHEN COUNT(*) >= 3 THEN state
ELSE 'Other'
END AS state_group,
AVG(elevation) AS avg_elevation
FROM t1
GROUP BY
CASE
WHEN COUNT(*) >= 3 THEN state
ELSE 'Other'
END;
-- one Postgres version
SELECT
state,
AVG(elevation) AS avg_elevation
FROM
(
SELECT CASE
WHEN policy.state IS NULL THEN NULL --filtered out
ELSE policy.state
END AS state,
elevation
FROM t1
LEFT JOIN (
SELECT state,
COUNT(*) as count
FROM t1
GROUP BY state
HAVING COUNT(*) >= 3
) policy ON t1.state = policy.state
) t1
GROUP BY
state;
Lecture 2 & 3
Lecture 2 - ML Refresher / Softmax Regression
Supervised learning
- classification (e.g., spam detection, image recognition) and regression tasks (e.g., predicting house prices, stock forecasting).
- Algorithms include decision trees, support vector machines (SVM), neural networks, and linear regression.
- Requires a large amount of accurately labeled data for effective training.
Unsupervised learning
- Useful for clustering (e.g., customer segmentation), association (e.g., market basket analysis), and dimensionality reduction (e.g., principal component analysis, or PCA)
- Algorithms include k-means clustering, hierarchical clustering, and autoencoders.
- Can work with unlabeled data, which is often more readily available and cost-effective to collect.
Multi-class classification setting
Matrix batch notation
- $X \in \mathbb{R}^{m \times n}$
- Matrix ops are more efficient than many vector ops
Loss function: classification error
- $l_{\text{err}}(h(x), y) = 0 \quad \text{if} \quad \text{argmax}_i \, h_i(x) = y$, and $1$ otherwise
- bad for optimization, because it is not differentiable
Loss function: softmax/cross-entropy loss
- Softmax(h(x)) = normalize(exp(h(x)))
- $l_{ce}(h(x), y) = -h_y(x) + \log \sum_{j=1}^k \exp(h_j(x))$, applied to a linear hypothesis class
- Find $\theta$ that minimizes the average loss over the training set
SGD
$\alpha$ is the step size / learning rate. Sample a minibatch of size $B$ and update $\theta$ by $-\frac{\alpha}{B}$ times the sum of the per-example gradients
Gradient of cross-entropy (batch form): $\nabla_\Theta l_{ce}(X\Theta, y) = \frac{1}{B} X^T (Z - I_y)$, where $Z = \text{normalize}(\exp(X\Theta))$ row-wise and $I_y$ is the matrix of one-hot labels
Hack to compute the loss's derivative: pretend everything is a vector, then rearrange and transpose matrices/vectors to make the sizes work. Make sure to check the numerical answers. This approach works for the matrix batch form of the loss
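A numpy sketch of the pieces above: the batch cross-entropy loss, its gradient $\frac{1}{B} X^T (Z - I_y)$, and one minibatch SGD step (shapes assumed: X is B x n, Theta is n x k, y holds integer class labels).
import numpy as np

def softmax_ce_loss(Theta, X, y):
    logits = X @ Theta
    logits = logits - logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(y.size), y].mean()

def sgd_step(Theta, X_batch, y_batch, alpha):
    B, k = X_batch.shape[0], Theta.shape[1]
    logits = X_batch @ Theta
    Z = np.exp(logits - logits.max(axis=1, keepdims=True))
    Z /= Z.sum(axis=1, keepdims=True)                         # softmax probabilities, (B, k)
    I_y = np.zeros((B, k))
    I_y[np.arange(B), y_batch] = 1                            # one-hot labels
    grad = X_batch.T @ (Z - I_y) / B                          # gradient of the average loss
    return Theta - alpha * grad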
Lecture 3 - “Manual” Neural Networks
Need to create a feature function $\phi$ so that the classifier is more powerful than the purely linear one.
- Traditional: manually extract features
- Neural network: learn features from data, automated
$\theta = \{W_1, W_2\}$
A two layer network can approximate any smooth function well
Deep networks are more efficient than a single layer at representing some functions that a single layer cannot learn compactly, e.g., the parity function. Empirically, deep networks work better when the number of parameters is fixed.
Backpropagation
Two-layer network: gradient w.r.t. $W_2$, then w.r.t. $W_1$
General form
Forward path: compute Z_i
Backward path: compute G_i
vector-Jacobian product
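A numpy sketch of the forward-compute-$Z_i$ / backward-compute-$G_i$ pattern for a two-layer ReLU network with softmax loss (my notation; weight shapes assumed as $W_1: n \times d$, $W_2: d \times k$).
import numpy as np

def two_layer_grads(W1, W2, X, y):
    """Manual backprop for softmax loss on ReLU(X @ W1) @ W2."""
    B, k = X.shape[0], W2.shape[1]
    # Forward path: compute and cache the Z_i.
    Z1 = np.maximum(X @ W1, 0)                        # hidden activations, shape (B, d)
    logits = Z1 @ W2                                  # shape (B, k)
    S = np.exp(logits - logits.max(axis=1, keepdims=True))
    S /= S.sum(axis=1, keepdims=True)                 # softmax probabilities
    I_y = np.zeros((B, k))
    I_y[np.arange(B), y] = 1
    # Backward path: compute the G_i via vector-Jacobian products.
    G2 = (S - I_y) / B                                # dLoss/dlogits
    grad_W2 = Z1.T @ G2
    G1 = (G2 @ W2.T) * (Z1 > 0)                       # propagate through the ReLU
    grad_W1 = X.T @ G1
    return grad_W1, grad_W2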
VII.3 Backpropagation and the Chain Rule
system of equations to solve for the weights that minimize L: The partial derivatives of L with respect to the weights x should be zero.
Backpropagation is a method to compute derivatives quickly, using the chain rule
$ dF/dx = \frac{d}{dx} \left( F_2(F_1(x)) \right) = \frac{dF_2}{dF_1} \left( F_1(x) \right) \frac{dF_1}{dx} \left( x \right) $
For a standard net, the steps in the chain rule can correspond to layers in the neural net.
Backpropagation has been discovered many times. Another name is automatic differentiation (AD). You will see that the steps can be arranged in two basic ways: forward-mode and backward-mode. The right choice of mode can make a large difference in the cost (a factor of thousands). That choice depends on whether you have many functions F depending on a few inputs, or few functions F depending on many inputs.
Deep learning has basically one loss function depending on many weights. The right choice is “backward-mode AD”. This is what we call backpropagation.
Derivatives of the Learning Function F
Find $dw_i/db_j$ and $dw_i/dA_{jk}$ for all components of $w = b + Av$. When $j$ is different from $i$, the $i$th output $w_i$ is not affected by $b_j$ or $A_{jk}$
Columns of I are 1-hot vectors.
$dw_i/db_j = \delta_{ij}$
$dw_i/dA_{jk} = \delta_{ij} v_k$
Combining Weights b and A into M
It is often convenient to combine the bias vector b and the matrix A into one matrix M of weights.
For each layer of the neural net, the top entry (the zeroth entry) is now fixed at 1. After multiplying that layer by M, the zeroth component on the next layer is still 1. Then ReLU(1) = 1 preserves that entry in every hidden layer.
$dw_i/dM_{jk} = \sigma _{ij}v_k$
Derivatives for Hidden Layers
$dw/dA_1 = d(A_2R(b_1 + A_1v_0))/dA_1 = A_2R’(b_1 + A_1v_0)d(b_1 + A_1v_0)/dA_1$
Automatic backpropagation goes backward from w to v
ReLU is a limiting case of an S-shaped sigmoid function
Delta function: $\delta (x) = d^2R/dx^2$. The delta function represents an impulse. It models a finite change in an infinitesimal time. It is physically impossible but mathematically convenient.
With ReLU, a neuron could stay at zero through all the steps of deep learning. This "dying ReLU" can be avoided in several ways; it is generally not a major problem. One way that firmly avoids it is to change to a Leaky ReLU with a nonzero gradient
Computational Graphs
Forward mode requires a new graph for each input x_i, to compute dF/dx_i
The reverse mode starts with the output F. It computes the derivatives with respect to both inputs. The computations go backward through the graph.
Instead of N forward graphs from N inputs, we will have one backward graph from one output. It finds the derivative of F with respect to every node. It starts with dF/dF = 1
The reverse mode finds all derivatives dF/dx_i by following all chains backward from output to input. Those chains all appear as paths on one graph, not as separate chain rules for exponentially many possible paths. This is the success of reverse mode.
The derivatives of F with respect to the matrices A (and the bias vectors b) are easiest for the last matrix $A_L$ in $A_L v_{L-1}$: $dF_i/d(A_L)_{jk} = \delta_{ij} v_k$. Next is the derivative of $A_L \, \text{ReLU}(A_{L-1} v_{L-2})$ with respect to $A_{L-1}$
Adjoint Methods
Hyperparameters
The stepsize/learning rate $s_k$ is the first and foremost hyperparameter in gradient descent, which may be accelerated (by momentum), adaptive (ADAM), or stochastic with a random minibatch of training data at each step $k$.
line search
No method is perfect. If $s_k$ is too small, gradient descent takes too long. If $s_k$ is too large, gradient descent will jump around the minimizing $x^*$.
After reaching weights x that are close to minimizing the loss function L(x, v), we may want to bring in new v's from a validation set. This is not yet production mode. The purpose of cross-validation is to confirm that the computed weights x are capable of producing accurate outputs from new data.
Cross-validation
A first step in cross-validation is to divide the available data into K subsets. If K = 2, these would essentially be the training set and test set, but we are usually aiming for more information from smaller sets before working with a big test set. K-fold cross-validation uses each of the K subsets separately as a test set. In every trial, the other K - 1 subsets form the training set. We are reworking the same data (moderate size) to learn more than one optimization can teach us.
Batch Normalization of Each Layer
As training goes forward, the mean and variance of the original population can change at every layer of the network. This change in the distribution of inputs is “covariate shift”. We often have to adjust the stepsize and other hyperparameters, due to this shift in the statistics of layers. A good plan is to normalize the input to each layer.
Normalization makes the training safer and faster. The need for dropout often disappears. Fewer iterations can now give more accurate weights. And the cost can be very moderate. Often we just train two additional parameters on each layer.
The key point is to normalize the inputs y_i to each new layer. What was good for the original batch of vectors (at layer zero) is also good for the inputs to each hidden layer.
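A minimal sketch of that normalization (gamma and beta stand for the two additional trained parameters; the names are mine):

import numpy as np

def batch_norm(Y, gamma, beta, eps=1e-5):
    # Y: (batch, features) inputs arriving at one layer
    mean = Y.mean(axis=0)
    var = Y.var(axis=0)
    Y_hat = (Y - mean) / np.sqrt(var + eps)      # zero mean, unit variance per feature
    return gamma * Y_hat + beta                  # learned rescale and shift

Y = np.random.randn(32, 8) * 3.0 + 5.0
out = batch_norm(Y, gamma=np.ones(8), beta=np.zeros(8))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))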
Dropout
Dropout is the removal of randomly selected neurons in the network. Typically hidden layer neurons might be given probability p = 0.5 of surviving, and input components might have p = 0.8 or higher. The main objective of random dropout is to avoid overfitting.
In training, each new v_0 (the feature vector of an input sample) leads to a new thinned network.
At test time, we use the full network (no dropout) with weights rescaled from the training weights. The outgoing weights from an undropped neuron are multiplied by p in the rescaling.
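A minimal sketch of the training-time mask and the test-time rescaling (survival probability p; function names are mine):

import numpy as np

rng = np.random.default_rng(0)

def dropout_train(h, p=0.5):
    mask = rng.random(h.shape) < p        # each neuron survives with probability p
    return h * mask                       # a thinned network for this sample

def rescale_for_test(outgoing_weights, p=0.5):
    return outgoing_weights * p           # full network at test time, weights scaled by p

print(dropout_train(np.ones(10)))         # roughly half of the neurons survive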
To compute gradients, use backpropagation for each training example in the minibatch. Then average those gradients. Stochastic gradient descent can still include acceleration (momentum added) and adaptive descent and weight decay.
LeCun emphasizes that for a multiparameter search, random sampling is the way to cover many possibilities quickly. Grid search is too slow in multiple dimensions.
Loss Functions
Quadratic cost (square loss) is not a favorite choice for deep learning. One reason is the parabolic shape of the graph of l(w, v) as we approach zero loss at w = y. The derivative also approaches zero.
A zero derivative at the minimum is normal for a smooth loss function, but it frequently leads to an unwanted result: The weights A and b change very slowly near the optimum. Learning slows down and many iterations are needed.
Cross-entropy loss: expect that the N outputs w_i from training the neural net have been normalized to z(w), with z_i in (0, 1)
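A minimal sketch of that normalization z(w) and the resulting loss (standard softmax cross-entropy, not copied from the text):

import numpy as np

def softmax(w):
    e = np.exp(w - w.max())               # subtract the max for numerical safety
    return e / e.sum()                    # probabilities z_i in (0, 1) summing to 1

def cross_entropy(w, true_class):
    z = softmax(w)
    return -np.log(z[true_class])         # small when z[true_class] is near 1

w = np.array([2.0, 0.5, -1.0])
print(softmax(w), cross_entropy(w, true_class=0))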
Shannon’s formula for entropy
Kullback-Leibler (KL) divergence.
Regularization
It adds a penalty term to the loss function L(x)
The penalty controls the size of x. Regularization is also called weight decay.
The coefficient \lambda is a hyperparameter. Its value can be based on cross-validation. The purpose of the penalty terms is to avoid overfitting (sometimes expressed as fitting the noise). Cross-validation for a given lambda finds the minimizing x on a test set. Then it checks by using those weights on a training set. If it sees errors from overfitting, lambda is increased.
A small value of lambda tends to increase the variance of the error: overfitting. Large lambda will increase the bias: underfitting, because the fitting term $\|b-Ax\|^2$ is less important
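For the quadratic case the penalized problem can be written out explicitly (a standard ridge-regression identity, not specific to these notes): minimize $\|b - Ax\|^2 + \lambda\|x\|^2$. Setting the gradient $2A^T(Ax - b) + 2\lambda x$ to zero gives $(A^TA + \lambda I)x = A^Tb$, and a larger $\lambda$ shrinks the weights x toward zero (weight decay).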
Recent experiments on MNIST make it unclear if explicit regularization is always necessary.
The Structure of AlphaGo Zero
- A convolution of 256 filters of kernel size 3 x 3 with stride 1 : E = 3, S = 1
- Batch normalization
- ReLU
- A convolution of 256 filters of kernel size 3 x 3 with stride 1
- Batch normalization
- A skip connection as in ResNets that adds the input to the block
- ReLU
- A fully connected linear layer to a hidden layer of size 256
- ReLU
Training was by stochastic gradient descent on a fixed data set that contained the final 2 million games of self-played data from a previous run of AlphaGo Zero. The CNN includes a fully connected layer that outputs a vector of size 19^2 + 1. This accounts for all positions on the 19 x 19 board, plus a pass move allowed in Go.
Recurrent Neural Networks
appropriate for data that comes in a definite order. This includes time series and natural language: speech or text or handwriting. In the network of connections from inputs v to outputs w, the new feature is the input from the previous time t - 1. This recurring input is determined by the function h(t- 1).
The inputs to h(t) are the new data v(t) and the recurrent data h( t- 1) from the previous time. The weights multiplying the data are x_in, x_recur, and x_out
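A minimal sketch of one recurrent step with scalar weights (tanh is my assumed nonlinearity; matrix weights work the same way):

import numpy as np

def rnn_step(v_t, h_prev, x_in, x_recur, x_out):
    h_t = np.tanh(x_in * v_t + x_recur * h_prev)   # new state from new data + old state
    w_t = x_out * h_t                              # output at time t
    return h_t, w_t

h = 0.0
for v_t in [1.0, 0.5, -0.2]:                       # a short input sequence
    h, w = rnn_step(v_t, h, x_in=0.8, x_recur=0.5, x_out=1.2)
    print(w)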
Google translate
Instead of programming every word and grammatical rule and exception in both languages, let the computer find the rules. Just give it enough correct translations. If we were recognizing images, the inputs would be many examples with correct labels (the training set). The machine creates the function F(x, v).
Part VII : Learning from Data
There are a great many points in the training data, but there are far more weights to be computed in a deep network. The art of deep learning is to find, among many possible solutions, one that will generalize to new data.
F(x,v): The training data is given by a set of feature vectors v. The weights that allow F to classify that data are in the vector x. To optimize F, gradient descent needs its derivatives dF/dx. The weights x are the matrices A_1…A_L and bias vectors b_1…b_L. Sample data v = v_0, output w = v_L
Real codes use automatic differentiation (AD) for backpropagation. Each hidden layer with its optimized weights learns more about the data and the population from which it comes-in order to classify new and unseen data from the same population.
The Functions of Deep Learning
So we start with M different images (the training set). An image will be a set of p small pixels-or a vector v. v_i is the greyscale of ith pixel. For every v in that training set we know the digit it represents.
Continuous Piecewise Linear (CPL) functions. Linear for simplicity, continuous to model an unknown but reasonable rule, and piecewise to achieve the nonlinearity that is an absolute requirement for real images and data.
If A_1 is q by p, the input space R^p is sliced by q hyperplanes into r pieces
Bias vs. Variance : Underfit vs. Overfit
A training set contains N vectors with m components each (the m features of each sample). For each of those N points in R^m, we are given y_i. We want to learn $y_i = f(x_i) + \epsilon$, with epsilon having mean 0
Our learning algorithm actually finds a function F(x) close to f(x)
bias-variance tradeoff. High bias from underfitting, high variance from overfitting.
bias = E[f(x)-F(x)]. variance = E[F(x)^2] - E[F(x)]^2
The Construction of Deep Neural Networks
create a learning function F(x, v) with weights x that capture information from the training data v
- Key operation: F=F_3(F_2(F_1(x, v)))
- key rule: Chain rule for x-derivatives of F
- Key algorithm: Stochastic gradient descent to find the best weights x
- Key subroutine: Backpropagation to execute the chain rule, find the x-derivatives of F to solve \nabla F = 0
- Key nonlinearity: ramp function ReLU(y). This makes the learning function F continuous and piecewise linear in v.
One Internal Layer L = 2
For a classification problem each sample v_0 of the training data is assigned 1 or -1. We want the output v_2 to have that correct sign (most of the time). For a regression problem we use the numerical value (not just the sign) of v_2
Machine learning doesn’t aim to capture every detail of the numbers 0, 1, 2 … , 9. It just aims to capture enough information to decide correctly which number it is.
The Initial Weights in Gradient Descent
A proper choice of the net and the initial x_0 uses random (and independent) weights that meet two requirements.
- x_0 has a carefully chosen variance, which controls the mean of the computed weights.
- The hidden layers in the neural net have enough neurons (not too narrow).
Many-layered depth can reduce the loss on the training set. But if $\sigma^2$ is wrong or width is sacrificed, then gradient descent can lose control of the weights. They can explode to infinity or implode to zero.
The good choice of variance is 2/fan-in. The fan-in is the maximum number of inputs to a neuron
The key point: Deep learning can go wrong if it doesn’t start right.
Stride and Subsampling
goal : Reduce the dimension from 128 to 64
- In two steps: Multiply the 128-component vector v by A, and then discard the odd-numbered components of the output. This is filtering followed by subsampling.
- In one step: Discard the odd-numbered rows of the matrix A. The new matrix $A_2$ becomes short and wide: 64 rows and 128 columns. The "stride" of the filter is now 2. Now multiply the 128-component vector v by $A_2$.
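A minimal sketch checking that the two routes agree (a random A stands in for the filtering matrix):

import numpy as np

n = 128
A = np.random.randn(n, n)              # any filtering matrix, just for the check
v = np.random.randn(n)

two_steps = (A @ v)[::2]               # filter, then keep the even-numbered outputs
A2 = A[::2, :]                         # discard odd-numbered rows: 64 x 128, stride 2
one_step = A2 @ v

print(np.allclose(two_steps, one_step))   # True: same 64 outputs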
Max-pooling
Multiply the 128-component vector v by A, as before. Then from each even-odd pair of outputs, keep the maximum.
For an image (a 2-dimensional signal) we might use max-pooling over every 2 by 2 square of pixels. This speeds up the training, when the number of neurons on a hidden layer is divided by 4.
Pooling also reduces the possibility of overfitting. Average pooling would keep the average of the numbers in each pool : now the pooling layer is linear.
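A minimal sketch of max-pooling over 2 by 2 squares (dimensions assumed divisible by 2):

import numpy as np

def max_pool_2x2(img):
    h, w = img.shape
    blocks = img.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))     # keep one maximum per 2 by 2 square

img = np.arange(16.0).reshape(4, 4)
print(max_pool_2x2(img))               # 4 x 4 image becomes 2 x 2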
The Graph of the Learning Function F(v)
The graph of F ( v) is a surface made up of many, many flat pieces-they are planes or hyperplanes that fit together along all the folds where ReLU produced a change of slope. This is like origami except that this graph has flat pieces going to infinity.
That polygon separating blue from orange (or plus from minus : this is classification) is the analog of a separating hyperplane in a Support Vector Machine. If we were limited to linear functions and a straight line between a blue ball and an orange ring around it, separation would be impossible.
Counting Flat Pieces in the Graph
Suppose $v_0$ has m components and $A_1v_0+b_1$ has N components. We have N linear functions of $v_0$. Each of those linear functions is zero along a hyperplane (dimension m - 1) in R^m. When we apply ReLU to that linear function it becomes piecewise linear, with a fold along that hyperplane. On one side of the fold the graph is sloping; on the other side the negative values are replaced by zero.
Convolutional Neural Nets
That fully connected net will be extremely inefficient for image recognition. We are looking for connections between faraway pixels. Almost always, the important connections in an image are local
the search for structure is essentially the same everywhere in the image. There is normally no reason to process one part of a text or image or video differently from other parts. We can use the same weights in all parts: Share the weights. The neural net of local connections between pixels is shift-invariant: the same everywhere.
Suppose each neuron is connected to only E neurons on the next layer, and those connections are the same for all neurons. Then the matrix A between those layers has only E independent weights x.
we have time to create several different channels with their own E or E^2 weights. They can look for edges in different directions (horizontal, vertical, and diagonal).
In one dimension, a banded shift-invariant matrix is a Toeplitz matrix or a filter. Multiplication by that matrix A is a convolution x * v.
A Toeplitz matrix is a type of structured matrix where each descending diagonal from left to right is constant. That means all the elements along any diagonal parallel to the main diagonal are the same.
Each shift has a diagonal of 1's: $A = x_1L + x_0C + x_{-1}R$, assuming S = 1
$d(Av)/dx_1 = Lv$
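A minimal sketch of that combination of shifts (one convention for L, C, R; the check at the end confirms $d(Av)/dx_1 = Lv$):

import numpy as np

n = 6
C = np.eye(n)
L = np.diag(np.ones(n - 1), k=1)       # ones above the main diagonal (shift one way)
R = np.diag(np.ones(n - 1), k=-1)      # ones below the main diagonal (shift the other way)

x1, x0, xm1 = 0.5, 1.0, -0.5
A = x1 * L + x0 * C + xm1 * R          # banded Toeplitz matrix: constant diagonals

v = np.random.randn(n)
# the part of Av that multiplies x1 is exactly Lv
print(np.allclose((A @ v - (x0 * C + xm1 * R) @ v) / x1, L @ v))   # True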
Convolutions in Two Dimensions
When the input v is an image, the convolution with x becomes two-dimensional. The 1-D weights $x_1, x_0, x_{-1}$ change to $E^2$ independent weights. v represents $(N+2)^2$ pixels, and the output has only $N^2$ pixels
The 2D convolution $A=x_{11}LU+x_{01}CU+x_{-11}RU+…+x_{-1-1}RD$
These nine derivatives are computed inside backpropagation.
CNN’s can readily afford to have B parallel channels (and that number B can vary as we go deeper into the net).
A convolution is a combination of shift matrices (producing a filter or Toeplitz matrix)
A cyclic convolution is a combination of cyclic shifts (producing a circulant matrix)
In deep learning, the coefficients in the combination will be the “weights” to be learned.
In two dimensions (for images) the matrix A is block Toeplitz. Each small block is E by E
The same weights are used around all pixels (shift-invariance). The matrix produces a 2D convolution. Frequently A is called a filter
The dot products between a smooth function and a moving filter window will be smooth. But when an edge in the image lines up with a diagonal wall, we see a spike.
The difficulty with two or more dimensions is that edges can have many directions. We will need horizontal and vertical and diagonal filters for the test images. And filters have many purposes, including smoothing, gradient detection, and edge detection.
- For a 2D function f, the natural smoother is convolution with a Gaussian. The filter removes noise (at a price in sharp edges)
- Gradient detection. Image processing (as distinct from learning by a CNN) needs filters that detect the gradient. They contain specially chosen weights. We mention some simple filters just to indicate how they can find gradients, the first derivatives of f.
- Edge detection. we don’t want smoothing, which would blur the edge. The good filters become Laplacians of Gaussians
The Stride of a Convolutional Filter
For a larger stride, the moving window takes longer steps as it moves across the image.
the length of the output y = Av is reduced by the factor S (the stride).
nonzero weights like x_1 are S columns apart for stride S
Extending the Signal
Instead of losing neurons at the edges of the image when A is not square, we can extend the input layer. We are “inventing” components beyond the image boundary. Then the output y = Av has equal dimensions for input and output.
For periodic signals, zero-padding is replaced by wraparound.
A more accurate choice is to go beyond the boundary by reflection.
This is what CNN’s usually do: Add more channels of weight matrices A to capture more features of the training sample.
Counting the Number of Inputs and Outputs
In a one-dimensional problem, suppose a layer has N neurons. We apply a convolutional matrix with E nonzero weights. The stride is S, and we pad the input signal by P zeros at each end. How many outputs (M numbers) does this filter produce?
Karpathy's formula: M = (N - E + 2P)/S + 1
In a 2D or 3D problem, this 1D formula applies in each direction.
This case 2P = E - 1 with stride S = 1 is the most common architecture for CNN’s.
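A tiny helper (my own, not from the text) applying the formula above, with integer division when the window does not fit exactly:

def conv_output_count(N, E, S, P):
    return (N - E + 2 * P) // S + 1

print(conv_output_count(N=128, E=3, S=1, P=1))   # 128: the common case 2P = E - 1, S = 1
print(conv_output_count(N=128, E=3, S=2, P=1))   # 64: stride 2 halves the length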
it is very common to end a CNN with fully-connected layers.
Softmax Outputs for Multiclass Networks
“Softmax” replaces the two-output case of logistic regression. We are turning n numbers into probabilities.
For CNN’s in computer vision, the final layer often has a special form. If the previous layers used ReLU and max-pooling (both piecewise linear), the last step can become a difference-of-convex program, and eventually a multiclass Support Vector Machine (SVM). Then optimization of the weights in a piecewise linear CNN can be one layer at a time.
Residual Networks
But depth brings dangers. Information can jam up and never reach the output. The problem of “vanishing gradients” can be serious: so many multiplications in propagating so far, with the result that computed gradients are exponentially small. When it is well designed, depth is a good thing-but you must create paths for learning to move forward.
“skip connections” that go directly to the next layer.
entire layers can be removed without significant impact
Part VI: Optimization
Gradient Descent Toward the Minimum
Minimize a function $f(x_1, …, x_n)$: all the first derivatives are 0 at the minimum. If we have n = 20 unknowns (a small number in deep learning), then minimizing one function f produces 20 equations $\partial f/\partial x_i = 0$
The steepest direction, in which f(x) decreases fastest, is given by the gradient $-\nabla f$
$\nabla f$ represents the vector of n partial derivatives of f: its gradient. $s_k$ is the stepsize or learning rate
The neat idea of automatic differentiation-rediscovered and extended as backpropagation in machine learning-makes those cost factors much smaller in practice.
df/dx = limit of $\Delta f/\Delta x$ = limit of $(f(x + \Delta x) - f(x))/\Delta x$
For n = 1, df/dx is the gradient
we get a better ratio (closer to the limiting slope df/dx) by averaging the forward difference (where $\delta x > 0$) with the backward difference (where $\delta x < 0$). The average is the more accurate centered difference
The gradient of $F(x_1, …, x_n)$ is the column vector $\nabla F = (\partial F/\partial x_1, \ldots, \partial F/\partial x_n)$. Its components are the n partial derivatives of F, and it points in the steepest direction.
For a constant vector a, $F(x) = a^Tx$ has gradient a.
For a symmetric matrix S, gradient of $F(x)=x^TSx$ is 2Sx
For a positive definite symmetric S, the gradient of $F(x)=\frac{1}{2}x^TSx-a^Tx$ is $Sx-a$
That steepest direction is perpendicular to the level direction
Stochastic Gradient Descent and ADAM
Gradient descent: $x_{k+1}=x_k-s_k\nabla L(x_k)$, where L(x) is the loss function. But for large networks with many samples in the training set, this algorithm (as it stands) is not successful
Computing $\nabla L$ at every descent step-the derivatives of the total loss L with respect to all the weights x in the network-is too expensive. Potentially millions of separate losses are computed and added in every computation of L
The number of weights is even larger, so there are many different choices $x^*$ of the weights. Each $x^*$ minimizes L(x) on the training data v. Some of those choices can give poor results on unseen test data.
Stochastic gradient descent uses only a “minibatch” of the training data at each step. B samples will be chosen randomly. This changes L(x) to a sum of only B losses.
Computing $\nabla l_i$ by backpropagation on B samples is much faster. Often B = 1
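A minimal sketch of minibatch SGD on a least-squares loss (the sizes and stepsize are made up for the example):

import numpy as np

rng = np.random.default_rng(0)
N, d, B, s = 1000, 5, 32, 0.1
A = rng.standard_normal((N, d))
b = A @ np.ones(d) + 0.01 * rng.standard_normal(N)    # true weights are all ones

x = np.zeros(d)
for step in range(500):
    idx = rng.choice(N, size=B, replace=False)         # the random minibatch
    grad = 2 * A[idx].T @ (A[idx] @ x - b[idx]) / B    # gradient of the B sample losses
    x -= s * grad
print(x.round(2))                                      # close to the all-ones vector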
The stochastic algorithm produces weights that also succeed on unseen data.
And for each sample in the training set, we know if it is "a cat or a dog", or if the text is "poetry or prose". We look for a learning function F that assigns good weights. Then for v in a similar population, F outputs the correct classification "cat" or "poetry".
We use this function F for unidentified test data. The features of the test data are the new inputs v. The output from F will be the correct (?) classification-provided the function has learned the training data in a way that generalizes.
Generalization by SGD is the ability to give the correct classification for unseen test data v, based on the weights x that were learned from the training data.
I compare overfitting with choosing a polynomial of degree 60 that fits exactly through 61 data points. Early stopping is needed
The Loss Function and the Learning Function
“loss” L(x) that our function will (approximately) minimize. This is the sum of the errors in classifying each of the training data vectors v. And we need to describe the form of the learning function F that classifies each data vector v
At the beginning of machine learning the function F was linear-a severe limitation. Now F is certainly nonlinear. Just the inclusion of one particular nonlinear function at each neuron in each layer has made a dramatic difference.
The gradient is determined by the network architecture-the “feedforward” steps whose weights we will optimize.
Square loss: sum over the training samples
Hinge loss
Cross-entropy loss: favorite for neural nets. Cross-entropy loss or “logistic loss” is preferred for logistic regression (with two choices only). For a minibatch of size B, replace N by B. And choose the B samples randomly.
Stochastic Descent Using One Sample Per Step
To simplify, suppose each minibatch contains only one sample (so B = 1). That sample is chosen randomly. The theory of stochastic descent usually assumes that the sample is replaced after use-in principle the sample could be chosen again at step k + 1. But replacement is expensive compared to starting with a random ordering of the samples. In practice, we often omit replacement and work through samples in a random order.
Each pass through the training data is one epoch of the descent algorithm. Ordinary gradient descent computes one epoch per step (batch mode). Stochastic gradient descent needs many steps (for minibatches).
Stochastic descent is more sensitive to the stepsizes than full gradient descent.
We are doing much less work per step (B inputs instead of all inputs from the training set). But we do not necessarily converge more slowly. A typical feature of stochastic gradient descent is “semi-convergence” : fast convergence at the start.
later iterations of SGD are frequently erratic. Convergence at the start changes to large oscillations near the solution. One response is to stop early and thereby we avoid overfitting the data.
If $x_k$ is outside I, then $x_{k+1}$ moves toward the interval I. Otherwise, it just bounces inside I
Randomized Kaczmarz is Stochastic Gradient Descent
We are randomly selecting row i of A at step k, and adjusting x_{k+1} to solve equation i in Ax=b, i.e., $a_i^Tx_{k+1}=b_i$. x_{k+1} is the projection of x_k onto one hyperplane $a_i^Tx=b_i$ that passes through $x^*=A^{-1}b$
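A minimal sketch of that projection step (uniform row sampling; one well-known variant samples row i with probability proportional to $\|a_i\|^2$):

import numpy as np

rng = np.random.default_rng(0)
n = 20
A = rng.standard_normal((n, n)) + 5 * np.eye(n)       # a well-conditioned example
x_true = rng.standard_normal(n)
b = A @ x_true

x = np.zeros(n)
for k in range(5000):
    i = rng.integers(n)                               # pick a random row of A
    a_i = A[i]
    x += (b[i] - a_i @ x) / (a_i @ a_i) * a_i         # project onto a_i^T x = b_i
print(np.linalg.norm(x - x_true))                     # small after many steps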
Adaptive Methods Using Earlier Gradients
That “memory” guides the choice of search direction D and the all-important stepsize s.
For a standard SGD iteration, D_k depends only on the current gradient $\nabla L_k$. $\nabla L_k(x_k, B)$ is evaluated only on a random minibatch B of the training data.
Adaptive Stochastic Gradient Descent: x_{k+1}=x_k - s_kD_k
Exponential moving averages in ADAM: recent gradients $\nabla L$ have greater weight than earlier gradients in both s_k and step direction D_k
Typical values are \delta = 0.9 and \beta = 0.999.
The actual computation of D_k and s_k will be a recursive combination of old and new
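A minimal sketch of those recursions (the usual ADAM update; delta weights the moving average of gradients, beta the moving average of squared gradients, and the bias corrections handle the startup):

import numpy as np

def adam_step(x, grad, m, v, k, s=0.001, delta=0.9, beta=0.999, eps=1e-8):
    m = delta * m + (1 - delta) * grad            # recent gradients weighted more
    v = beta * v + (1 - beta) * grad**2           # recent squared gradients weighted more
    m_hat = m / (1 - delta**k)                    # bias corrections for early steps
    v_hat = v / (1 - beta**k)
    x = x - s * m_hat / (np.sqrt(v_hat) + eps)    # adaptive direction and stepsize
    return x, m, v

x = np.array([1.0, -2.0]); m = np.zeros(2); v = np.zeros(2)
for k in range(1, 2001):                          # toy loss ||x||^2 / 2 has gradient x
    x, m, v = adam_step(x, grad=x, m=m, v=v, k=k)
print(x.round(3))                                 # moving toward the minimum at 0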
Generalization : Why is Deep Learning So Effective?
Often we have more free parameters in x than data in v. In that case we can expect many sets of weights (many vectors x) to be equally accurate on the training set. Those weights could be good or bad. They could generalize well or poorly. Our algorithm chooses a particular x and applies those weights to new data v_{test}
Part V: Probability and Statistics
The first 100 or 1000 flips do affect the sample mean. But 1000 flips will not affect its limit-because you are dividing by $N \to \infty$
F(x) is the cumulative distribution and its derivative p(x) is the probability density function. You could say that p(x) dx is the probability of a sample falling in between x and x + dx.
$E[x] = \int xp(x)dx$ $\sigma^2 = E[(x-m)^2] = \int p(x)(x-m)^2dx$
Normal distribution
Central Limit Theorem. Standard normal distribution N(0, 1).
$ F(\sigma) - F(-\sigma) \approx 2/3 $ $ F(2\sigma) - F(-2\sigma) \approx 0.95 $
The “binomial” probabilities count the number of heads in N coin flips. binomial approaches normal.
Stirling’s formula
Monte Carlo Estimation Methods
accepting uncertainty in the inputs and estimating the variance in the outputs. The error in approximating E[x] is usually of order $1/\sqrt{N}$
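A minimal sketch of that $1/\sqrt{N}$ behavior, estimating a known mean from N random samples:

import numpy as np

rng = np.random.default_rng(0)
for N in [100, 10_000, 1_000_000]:
    samples = rng.exponential(scale=2.0, size=N)      # true mean is 2.0
    print(N, abs(samples.mean() - 2.0))               # error shrinks roughly like 1/sqrt(N)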
2-level Monte Carlo
Probability Distributions
Binomial: Tossing a coin n times
Poisson
Rare events: $p \to 0$ over n trials, with expected $\lambda = np$ successes. What is the probability of k successes in n trials? In the Poisson limit, $\Pr(k) = \lambda^k e^{-\lambda}/k!$
we wait a long time, or we include a large population, to produce a decent chance that one or more events will actually occur. Poisson is not for the probability that you will be struck by lightning : say one in a million, over your lifetime. It is for the probability that someone in a city of 100,000 will be struck.
Poisson assumes independent events. One of the most difficult aspects of applied probability is to estimate the dependence of one event on other events.
Exponential
The exponential distribution is continuous (not discrete). It describes the waiting time in a Poisson process. How long until lightning strikes your city ? The key assumption is that The waiting time is independent of the time you have already waited.
The future is independent of the past. Is this true for a television set to fail? Is it true for a computer to fail? If failure depends on a random disaster, the waiting time has an exponential distribution. The failure rate $\lambda$ is constant. If failure comes from slow decay, the independence requirement will not hold.
The exponential distribution has no memory: the probability of waiting at least y hours more for a table is unfortunately not affected by already having waited x hours
If failures only occur at integer times t = 0, 1, 2, 3, … then the exponential distribution (which has continuous time) is replaced by the geometric distribution (discrete time).
Log-normal
When x has a normal distribution, the exponential $e^x$ has a log-normal distribution. That is, the distribution of x is log-normal when y = log(x) has a normal distribution.
The log-normal distribution appears when you take the product of many positive sample values. The arithmetic mean for normal compares with the geometric mean for log-normal.
Chi-squared
If x has the standard N(0, 1) distribution, then $x^2$ is chi-squared.
If $x_1….x_n$ are independent N(0, 1), then $x_1^2+…+ x_n^2$ has chi-squared distribution with n degrees of freedom
Multivariable Gaussian
When the variables are not independent, like stocks and bonds, C is symmetric and positive definite (in an extreme case, semidefinite). Then the joint distribution of Gaussian variables is “multivariate Gaussian”
Moments, Cumulants, and Inequalities
Suppose we know E[X], we want to know Pr(X>=a)
Markov's inequality: for $X \ge 0$, $\Pr(X \ge a) \le E[X]/a$
Chebyshev's inequality: $\Pr(|X - m| \ge a) \le \sigma^2/a^2$. The proof is to apply Markov's inequality to $(X - m)^2$
Connecting a list of numbers to a (generating) function $f(x)=\Sigma a_nx^n$
Moments and Central Moments
$m_n =E[x^n]$
central moments (around m) $\mu_n=E[(x-m)^n]$
The normalized third central moment is the skewness of the distribution.
We generally use the fourth moment of a normal distribution as a basis for comparison. Then the kurtosis of any distribution is called kappa
Generating Functions and Cumulants
Probability generating function
Characteristic function. uses the characteristic function to prove the central limit theorem
Moment generating function M(t)
Cumulant generating function K(t)
We are capturing an infinite set of coefficients in each function. The key point of cumulants is that they lead to this property: $K(t)=\log M(t)$
Chernoff’s Inequality for Sums
The mean value of a sum. If those variables are independent, then also the variance
Chebyshev for a sum
Markov and Chebyshev Inequalities for Matrices
an essential step in the statistics of deep learning is to develop inequalities for the eigenvalues and the trace of random matrices.
Markov's inequality. Suppose X is a positive semidefinite random matrix and A is any positive definite matrix. Then
Prob {A - X is not positive semidefinite} <= trace of $E[X]A^{-1}$
Chebyshev’s inequality for a random matrix X with mean 0:
Matrix Chernoff Inequalities
Erdos-Renyi Random Graphs
Each edge is present with probability p. Then the question is : For which p is the graph likely to be connected?.
Covariance Matrices and Joint Probabilities
Prove that independent experiments have covariance 0
Max covariance = $\sigma_1\sigma_2$ when the two variables are fully dependent
For each outcome, U = X - E(X) is a column vector, and $p_{ij}UU^T$ is positive semidefinite. So the whole matrix V (the sum of those rank 1 matrices) is at least semidefinite
The covariance matrix V is positive definite unless the experiments are dependent.
$V=E[(X-E(X))(X-E(X))^T]=\Sigma p_{ij}(X-E(X))(X-E(X))^T$
Variance of $c^TX$ = $c^TVc$
Diagonalizing the covariance matrix $V=Q\Lambda Q^T$ means finding M independent experiments as combinations of the original M experiments.
Another proof that covariance matrix V is positive semidefinite. V is the sum of the joint probability of each combination times the positive semidefinite matrix $(X-E(X))(X-E(X))^T$
The value of the expectation symbol E is that it also allows pdf's: probability density functions like p(x, y, z) for continuous random variables x and y and z.
The covariance matrix of $Z =AX$ is $ V_Z=AV_XA^T$
Suppose x and y are independent random variables with mean 0 and variance 1. What is the covariance matrix of Z = (x, y, ax + by)?
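A worked answer (using the rule $V_Z = AV_XA^T$ from the line above, with $A = \begin{bmatrix}1&0\\0&1\\a&b\end{bmatrix}$ and $V_X = I$): $V_Z = \begin{bmatrix}1&0&a\\0&1&b\\a&b&a^2+b^2\end{bmatrix}$. It is only semidefinite (rank 2), because the third variable is completely determined by the first two.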
Correlation
Normalize the random variables x to have their variance = 1 $ X = x/\sigma _x$
The correlation of x and y is the covariance of X and Y
Correlation matrix R
GPS Example
Distances from four or more satellites will pinpoint the receiver position (using least squares!) The speed of light changes in the ionosphere. But the correction will be almost the same for all nearby receivers. If one receiver stays in a known position, we can take differences from that one. Differential GPS reduces the error variance by fixing one receiver
Markov chains
$y_{n+1}=Py_n$
Find eigenvalues of a square matrix through determinant.
The determinant of a square matrix can be seen as a measure of the volume spanned by its row or column vectors. If the determinant is zero, it means these vectors are linearly dependent, and thus, the volume is zero. In geometric terms, it indicates that the transformation described by the matrix flattens the space into a lower dimension.
The determinant of a matrix is calculated from a specific combination of its elements, reflecting how rows or columns interact. If these are linearly dependent (i.e., one row or column can be expressed as a combination of others), then they don’t span an “n-dimensional volume” in the space, and the determinant becomes zero. The determinant can be viewed as a scaling factor for the volume of the unit cube under the linear transformation represented by the matrix: For a 2x2 matrix, it corresponds to the area of the parallelogram spanned by its columns (or rows). From an algebraic standpoint, if there are linearly dependent rows or columns, the matrix can be row-reduced to a form where one entire row (or column) becomes zero, leading the determinant to be zero, as any zero-row matrix has a zero determinant.
steady state with eigenvalue 1
$P=X\Lambda X^{-1}$; $P^\infty$ has the same split in both columns
The requirements for a positive Markov matrix: All p_{ij} > 0 and each column of P adds to 1. P has largest eigenvalue $\lambda _1 = 1$ and $x_1 > 0$
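A minimal sketch, assuming a column-stochastic P (each column adds to 1): iterating $y_{n+1} = Py_n$ approaches the eigenvector with eigenvalue 1.

import numpy as np

P = np.array([[0.8, 0.3],
              [0.2, 0.7]])          # columns sum to 1, all entries positive

y = np.array([1.0, 0.0])            # any starting probability vector
for _ in range(100):
    y = P @ y
print(y)                            # steady state, here (0.6, 0.4)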
Transition Probabilities
$p_{ij}$ = probability that x(n + 1) = i if x(n) = j. $p_{ij}$ does not depend on n
Finite Markov Chain, Infinite State Markov Chain, Continuous Markov Chain
1 is an eigenvalue, because the rows of P - I add to the zero row. And no entry of P is negative.
Perron-Frobenius Theorem: $P > 0$ guarantees success. For A > 0, all numbers in $Ax=\lambda _{max}x$ are strictly positive
If $|\lambda_2| < 1$, then $P^n$ converges. The matrix P might not have n independent eigenvectors; in that case we can't diagonalize P
Gambler’s Ruin
when a player has won all the money-the game is over and there is no way to leave either of those states. With those two steady states we must expect eigenvalue 1 twice.
Shape of P. What are the four eigenvalues of P? What is the probability that the game will continue forever with no winner?
Part 1.11 Norms
Every norm for vectors or functions or matrices must share these two properties of the absolute value |c| of a number: $\|cv\| = |c|\,\|v\|$ and $\|v + w\| \le \|v\| + \|w\|$
Euclidean norm, 1-norm, and infinity norm for a vector: $\|v\|_p=(|v_1|^p+\cdots+|v_n|^p)^{1/p}$
$v^Tw= \|v\|\,\|w\|\cos\theta$
S-norms
Choose any symmetric positive definite matrix S. S-norm: $\|v\|^2_S = v^TSv$. S-inner product: $(v, w)_S = v^TSw$
Norms of functions
A function f(x) is a “vector in function space”.
For a vector space to be "complete", every converging (Cauchy) sequence $v_n$ must have a limit $v_\infty$ in the space, with $\|v_n - v_\infty\| \to 0$
Banach space: a complete normed vector space. A Hilbert space is a Banach space whose norm comes from an inner product: $(v,v) = \|v\|^2$
The Frobenius Norm
A matrix norm follows the rules for a vector norm, plus $\|AB\| \le \|A\|\,\|B\|$
$\|A\|_F^2 = |a_{11}|^2+\cdots+|a_{mn}|^2$
When Q is an orthogonal matrix, we know that Qx has the same $l^2$ length as x. $\|QB\|_F = \|B\|_F$
$\|A\|_F^2$ = trace of $A^TA$ = sum of the eigenvalues of $A^TA$
Matrix Norms from Vector Norms
Choosing the vector v with the largest growth factor: $\|A\| = \max_{v \ne 0}(\|Av\|/\|v\|)$
$l^2$ norm: largest singular value
$l^1$ norm: largest $l^1$ norm of the columns of A
$l^\infty$ norm: largest $l^1$ norm of the rows of A
The Nuclear Norm
Also called the trace norm.
$\|A\|_N = \sigma _1 + \cdots + \sigma _n$. It is the minimum value of $\|U\|_F\|V\|_F$ with UV = A
$\|A\|_F^2 = \sigma _1^2 + \cdots + \sigma _n^2$
$\|A\|_2 = \sigma _1$
Factoring Matrices and Tensors
A matrix is just a two-way tensor.
To compute a factorization A = UV, we introduce a simple alternating iteration. Update U with V fixed, then update V with U fixed. This UV idea also fits the famous k-means algorithm.
NMF
The goal of NMF is to approximate a nonnegative matrix A >= 0 by a lower rank product UV of two nonnegative matrices U >= 0 and V >= 0
When A >= 0 is symmetric positive definite, we hope for a matrix B >= 0 that satisfies $B^TB=A$. Often no solution though, and we want to find a close enough $B^TB$
SPCA: Find sparse low rank matrices B and C so that BC is close to A
Facial Feature Extraction: Each column vector of the data matrix A will represent a face. Its components are the intensities of the pixels in that image. The goal is to find a few “basic faces” in B, so that their combinations come close to the many faces in A.
Text Mining and Document Classification
each column of A represents a document. Each row of A represents a word. NMF identifies topics and classifies the whole set of documents with respect to those topics.
$a_j \approx \sum_i c_{ij}b_i$, where the $b_i$ are topics and $c_{ij}$ measures the importance of topic i in document j
NMF is an NP-hard problem
Computing the Factors: A central idea is alternating factorization: Hold one factor fixed, and optimize the other factor. Hold that one fixed, optimize the first factor, and repeat. Using the Frobenius norm, each step is a form of least squares.
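A minimal sketch of that alternation for A ≈ UV, with negative entries clipped to zero as a crude way to keep the factors nonnegative (a rough stand-in, not a production NMF algorithm):

import numpy as np

rng = np.random.default_rng(0)
A = rng.random((20, 15))                  # nonnegative data matrix
r = 4
U = rng.random((20, r))
V = rng.random((r, 15))

for _ in range(200):
    V = np.clip(np.linalg.lstsq(U, A, rcond=None)[0], 0, None)        # fix U, solve for V
    U = np.clip(np.linalg.lstsq(V.T, A.T, rcond=None)[0].T, 0, None)  # fix V, solve for U
print(np.linalg.norm(A - U @ V) / np.linalg.norm(A))                   # relative error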
Tensors
A column vector is a 1-way tensor. A matrix is a 2-way tensor. Then a 3-way tensor has elements $T_{ijk}$ row number, column number, and “tube number”
The rows and columns and tubes are the fibers of T, with only one varying index.
A Color Image is a Tensor with 3 Slices
A black and white image is just a matrix of pixels. The numbers in the matrix are the grayscales of each pixel. Normally those numbers are between zero (black) and 255 (white). A color image is a tensor. It has 3 slices corresponding to red-green-blue. Each slice shows the density of one of the colors RGB.
The Joint Probability Tensor
Suppose we measure age a in years and height h in inches and weight w in pounds. We put N children into I age groups and J height groups and K weight groups.
joint probabilities $p_{ijk}=N_{ijk}/N$. For each combination i, j, k we count only the children who are in age group i and also height group j and also weight group k.
The Norm and Rank of a Tensor
A tensor can multiply vectors, matrices, or tensors. Then it is a linear operator. A tensor can contain data. Its entries could give the brightness of pixels in an image. A color image is 3-way, stacking RGB. A color video will be a 4-way tensor.
Part 1.9 Principal Components
Principal Component Analysis (PCA) uses the largest $\sigma$ connected to the first u’s and v’s to understand the information in a matrix of data. The closest rank k matrix to A is $A_k$. In statistics we are identifying the pieces of A with largest variance
PCA is “unsupervised” learning. Our only instructor is linear algebra. the SVD tells us to choose $A_k$.
Eckart-Young. Three choices for the matrix norm | A |
PCA
We have n samples. For each sample we measure m variables, giving an m by n data matrix (m rows, n columns).
The first step is to find the average (the sample mean) along each row of $A_0$. Subtract that mean from all n entries in the row. Now each row of the centered matrix A has mean zero
How will linear algebra find that closest line through (0, 0)? It is in the direction of the first singular vector
The Statistics Behind PCA
Subtracting those means in each row from each row of $A_0$ produced the centered A. The variances are the diagonal entries of the matrix $AA^T$. The covariances are the off-diagonal entries of the matrix $AA^T$. High covariance means that increased height goes with increased age. (Negative covariance means that one variable increases when the other decreases.).
sample covariance matrix $ S = AA^T / (n -1)$
Given the data in A, explicitly forming S would be a computational mistake. For large matrices, a direct SVD of A is faster and more accurate.
The leading eigenvector $u_1$ tells us the slope of the closest line in the scatter plot.
The sum of squared distances from the data points to the $u_1$ line is a minimum.
The total variance in the data comes from the Frobenius norm (squared) of A. This is the trace of S (the sum down the diagonal), and the trace equals the sum of the eigenvalues of the sample covariance matrix S
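A minimal sketch of PCA straight from the SVD of the centered data matrix (a made-up age/height example; S is never formed explicitly):

import numpy as np

rng = np.random.default_rng(0)
n = 200
age = rng.normal(10, 3, n)
height = 2.5 * age + rng.normal(0, 2, n)          # correlated with age
A0 = np.vstack([age, height])                     # m = 2 variables, n samples

A = A0 - A0.mean(axis=1, keepdims=True)           # center each row
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
u1 = U[:, 0]                                      # direction of the closest line
print(u1, (sigma**2 / (n - 1)).round(2))          # principal direction and its variance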
Rayleigh quotient for symmetric matrix
$R(x) = x^TSx/x^Tx$. The maximum value of R(x) is the largest eigenvalue. Other eigenvectors are saddle points. Saddles have first derivatives = zero but they are not maxima or minima.
$\|A\|_2^2 = \max(\|Ax\|^2/\|x\|^2) = \max R(x) = \lambda_1(S)$, with $S = A^TA$
Generalized Eigenvalues
Generalized Rayleigh quotient $R(x) = x^TSx / x^TMx$. M is symmetric. In dynamical problems M is often the “mass matrix” or the “inertia matrix”. In statistics M is generally the covariance matrix.
If M is positive definite, the maximum of R(x) is the largest eigenvalue of $M^{-1}S$
$H=M^{-1/2}SM^{-1/2}$ is symmetric
$x=M^{-1/2}y$ to convert $Sx = \lambda Mx$ into $Hy=\lambda y$
A crucial fact about a symmetric matrix S is that any two eigenvectors (with different eigenvalues) are orthogonal. Not true for generalized eigenvectors, but they are M-orthogonal: $x_1^TMx_2=0$
Positive Semidefinite M : Not Invertible
$x^TMx$ can be zero, so it is possible to have an infinite eigenvalue
In statistics M is often a covariance matrix. Its diagonal entries tell us the separate variances of two or more measurements. Its off-diagonal entries tell us the “covariances between the measurements”. If we are foolishly repeating exactly the same observations, or if one experiment is completely determined by another-then the covariance matrix M is singular. Its determinant is zero and it is not invertible
Fisher’s Linear Discriminant Analysis
We are given samples from two different populations, mixed together, along with the mean and covariance of each population. If all the samples are mixed together and we pick one, how do we tell if it probably came from population 1 or population 2?
Each sample has a feature vector f. Fisher finds a separation vector v: if $v^Tf>c$ then the sample likely came from population 1. v maximizes the separation ratio R.
$R = (v^Tm_1 - v^Tm_2)^2/(v^T\Sigma _1v + v^T\Sigma _2v)$, where $m_1, m_2$ are the mean vectors
$S = (m_1-m_2)(m_1-m_2)^T$ $M = \Sigma _1 + \Sigma _2$
$Sv = \lambda Mv$
$v = M^{-1} (m_1 - m_2)$
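A minimal sketch of that direction on synthetic data (the threshold c halfway between the projected means is my simple choice):

import numpy as np

rng = np.random.default_rng(0)
pop1 = rng.multivariate_normal([0, 0], [[1.0, 0.3], [0.3, 1.0]], size=300)
pop2 = rng.multivariate_normal([2, 1], [[1.0, -0.2], [-0.2, 1.5]], size=300)

m1, m2 = pop1.mean(axis=0), pop2.mean(axis=0)
M = np.cov(pop1.T) + np.cov(pop2.T)               # Sigma_1 + Sigma_2
v = np.linalg.solve(M, m1 - m2)                   # separation vector v = M^{-1}(m1 - m2)
c = v @ (m1 + m2) / 2                             # threshold halfway between projected means
print((pop1 @ v > c).mean(), (pop2 @ v > c).mean())   # near 1.0 for pop 1, near 0.0 for pop 2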
Neural networks will succeed by allowing separating surfaces that are not just planes.
Part 1.6: Eigenvalues
The eigenvectors of A don’t change direction when you multiply them by A. The output Ax is on the same line as the input vector x. The eigenvector x is just multiplied by its eigenvalue
$A^2x=\lambda ^2 x$
$Ax=\lambda x$ means that $A - \lambda I$ is singular: it is not invertible and its determinant is 0
Similar matrices
For every invertible matrix B, the eigenvalues of A are the same as the eigenvalues of $BAB^{-1}$. The eigenvector becomes Bx when $Ax=\lambda x$
To compute the eigenvalues of A, we gradually make $BAB^{-1}$ diagonal, and the eigenvalues show up on the main diagonal
Diagonalizing a Matrix
Suppose A has a full set of n independent eigenvectors. Put those eigenvectors into an invertible matrix X. $A=X\Lambda X^{-1}$. This property is used to compute the power of A
Symmetric positive definite matrices
$S = S^T$, Then eigenvectors q can be chosen orthogonal
Spectral Theorem: Every real symmetric matrix has the form $S=Q\Lambda Q^T$. Q is the orthonormal eigenvector matrix with $Q^TQ=I$
A positive definite matrix has all positive eigenvalues.
the energy test: S is positive definite if the energy $x^TSx$ is positive for all vectors x != 0
if $Sx=\lambda x$, then $x^TSx > 0$ if $\lambda > 0$
if every eigenvector has positive energy, then every vector does: because S is symmetric, the eigenvectors can be chosen orthogonal, and every vector is a linear combination of those eigenvectors
$S=A^TA$ for a matrix A with independent columns
For an ordinary function f(x) of one variable x, the minimum test is df/dx = 0 and second derivative > 0. For f(x, y) with two variables, the second derivatives go into a matrix, and there is a minimum when that matrix is positive definite.
There is a saddle point when S has both positive and negative eigenvalues. A saddle point matrix is "indefinite"
Optimization and Machine Learning
gradient descent. Each step takes the steepest direction toward the bottom point of the bowl. Calculus: The partial derivatives of f are all zero. Linear algebra: The matrix S of second derivative is positive definite.
If S is positive definite (or semidefinite) at all points x, then the function f(x) is convex
Machine learning produces “loss functions” with hundreds of thousands of variables. They measure the error-which we minimize. But computing all the second derivatives is completely impossible. We use first derivatives to tell us a direction to move-the error drops fastest in the steepest direction. Then we take another descent step in a new direction. This is the central computation in least squares and neural nets and deep learning.
Ellipse
Stay with a positive definite matrix S. Cut through that bowl at height energy = 1, then the curve that goes around the cut is an ellipse
The axes of the tilted ellipse point along those eigenvectors of S. Axis length = $1/\sqrt{\lambda}$
SVD
If A is not square then $Ax = \lambda x$ is impossible and eigenvectors fail. SVD helps in this case
$Av_1=\sigma _1u_1$. The v's are orthogonal in $R^n$; the u's are perpendicular to each other in $R^m$. The number r is the rank of A
$Av_{r+1} = 0$
$AV = U\Sigma$. V and U are square orthogonal matrices $V^T=V^{-1}$ because their columns are orthogonal unit vectors.
SVD: $A=U\Sigma V^T$. Reduced SVD so that $\Sigma _r$ is square
A special property of the SVD is that those pieces come in order of importance. $\sigma _1u_1v_1^T$ is the closest rank 1 matrix to A. The sum of the first k pieces is the best rank k approximation of A
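A minimal sketch of keeping the first k pieces (the spectral-norm error of the best rank-k approximation is $\sigma_{k+1}$):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 6))
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]       # sum of the k largest rank-1 pieces
print(np.linalg.norm(A - A_k, 2), sigma[k])           # the two numbers agree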
Eckart-Young
$A^TA=V\Sigma^T\Sigma V^T$. V contains orthonormal eigenvectors of $A^TA$. The $\sigma_i^2$ are the nonzero eigenvalues of both $A^TA$ and $AA^T$
KL transform’s connection to SVD
KL begins with a covariance matrix V of a zero-mean random process. V is symmetric and positive definite or semidefinite. In general V could be an infinite matrix or a covariance function. Then the KL expansion will be an infinite series. The KL transform is a stochastic (random) form of Principal Component Analysis
The Geometry of the SVD
The SVD separates a matrix into (orthogonal) x (diagonal) x (orthogonal). The orthogonal matrices U and V rotate the plane. The diagonal matrix stretches it along the axes.
Finite difference
The discrete form of a derivative is a finite difference. The discrete form of an integral is a sum.
D corresponds to the backward difference $f(x) - f(x - \Delta x)$
The eigenvectors v of $D^TD$ are the right singular vectors of D. They are discrete sines; the u's are discrete cosines. Suitably scaled, these are the famous DST and DCT matrices: Discrete Sine Transform and Discrete Cosine Transform. The DCT matrix has been the backbone of JPEG image compression. Actually JPEG increases the size to 8 by 8, which reduces the "blockiness" of the image: 8 by 8 blocks of pixels are transformed by a two-dimensional DCT, then compressed and transmitted.
Part I: Highlights of Linear Algebra
Ax = b : is the vector b in the column space of A?
$Ax = \lambda x$: eigenvector gives direction so that Ax keeps the direction of x. $A^2x = \lambda^2 x$
$Av = \sigma u$: SVD. Which part of that data matrix is important?
Ax is a linear combination of columns of A. Column space of A
3 independent columns in R^3 produce an invertible matrix.
A basis for a subspace is a full set of independent vectors. The rank of a matrix is the dimension of its column space.
A = CR. R = rref(A). The number of independent columns equals the number of independent rows
One column u times one row $v^T$: every nonzero matrix $uv^T$ has rank 1
Outer product approach is important in data science because we are looking for important parts of A
A = LU: Elimination. L is lower triangular and U is upper triangular
A = QR: orthogonalizing. $Q^TQ=I$ and R is upper triangular
$S=Q\Lambda Q^T$: S is symmetric. Eigenvalues on the diagonal of lambda. Orthonormal eigenvectors in the columns of Q
$A=X\Lambda X^{-1}$: diagonalization when A is n by n with n independent eigenvectors. Eigenvalues of A on the diagonal of lambda. Eigenvectors of A in the columns of X.
$A=U\Sigma V^T$: SVD. Orthonormal singular vectors in U and V. Singular values in sigma
Orthogonal matrix: $Q^T=Q^{-1}$. $q_i^Tq_j = 1$ if i = j, otherwise 0
$Sq = \lambda q$ Eigenvector q and eigenvalue lambda. Each eigenvalue and eigenvector contribute a rank one piece to S
Columns in $Q\Lambda$ are $\lambda _1q_1$ to $\lambda _nq_n$, because lambda is a diagonal matrix
nullspace: N(A) contains all solutions x to Ax = 0
left nullspace: $N(A^T)$ contains all solutions y to $A^Ty=0$
r independent equations Ax = 0 have n - r independent solutions
incidence matrix
A has 1 and -1 on every row, to show the end node and the start node for each edge
The r edges of a tree give r independent rows of A : rank = r = n - 1
There are m - r = m - n + 1 independent small loops in the graph (m = number of edges)
Rank
Rank of AB <= rank of A, because every column of AB is a combination of the columns of A
Rank of A + B <= (rank of A) + (rank of B)
Rank of $A^TA$ = rank of $AA^T$ = rank of A = rank of $A^T$, because $A^TA$ and A have the same nullspace.
If A is m by r and B is r by n, both with rank r, then AB also has rank r, because r = rank of $(A^TABB^T)$ <= rank of (AB) <= rank of A = r. However this does not hold for BA
Elimination factored A = LU into a lower triangular L times an upper triangular U
Zero cannot be the first pivot. If there is a nonzero number lower down in column 1, its row can be the pivot row. Good codes will choose the largest number to be the pivot.
Every invertible n by n matrix A leads to P A = LU : P = permutation. The inverse of every permutation matrix P is its transpose $P^T$
Orthogonal Matrices and Subspaces
Orthogonal vectors x and y: $x^Ty= x_1y_1 + .. + x_ny_n = 0$
Orthogonal subspaces R and N. Every vector in the space R is orthogonal to every vector in N.
Orthonormal basis: Orthogonal basis of unit vectors. Every $v_i^Tv_i = 1$
Tall thin matrices Q with orthonormal columns: $Q^TQ= I$. If this Q multiplies any vector x, the length of the vector does not change: $\|Qx\| = \|x\|$
$(Qx)^T(Qx) = x^TQ^TQx = x^Tx$
"Orthogonal matrices" are square with orthonormal columns, $Q^T=Q^{-1}$
Pythagoras Law for right triangles ($x^Ty = 0$): $\|x-y\|^2 = (x-y)^T(x-y)=x^Tx - y^Tx - x^Ty + y^Ty = \|x\|^2 + \|y\|^2$
The row space of A is orthogonal to the nullspace of A. The column space of A is orthogonal to the nullspace of $A^T$
if $P^2 = P = P^T$, then Pb is the orthogonal projection of b onto the column space of P
Multiplying orthogonal matrices produces an orthogonal matrix. $(Q_1Q_2)^T(Q_1Q_2) = I$. Rotation times rotation = rotation. Reflection times reflection = rotation. Rotation times reflection = reflection
Suppose the n by n orthogonal matrix Q has columns $q_1…q_n$. Every vector v can be written as a combination of those basis vectors. The coefficients in an orthonormal basis are just $c_i= q_i^Tv$, i.e., v = Qc
$Q^Tv=Q^TQc=c$
Deep Learning and Neural Nets
- Goal of a neural net: To construct a function that classifies the training data correctly, so it can generalize to unseen test data.
Learning function F
- One input v for each training sample. For the problem of identifying handwritten digits, each input sample will be an image-a matrix of pixels
- Output: 0 to 9. Those ten numbers are the possible outputs from the learning function.
- Train a learning function on part of the MNIST set. Validate the function by choosing unseen MNIST samples
Linear/affine learning function
The simplest learning function would be linear $w=Av+b$
- The entries in the matrix A are the weights to be learned
- b is the bias vector and is to be learned
- Linearity is a very limiting requirement. What would be halfway between I and XIX?
Neural network
$F(v) = L(R(L(R(…Lv))))$
$Lv = Av + b$
R/ReLU: nonlinear function on each component of $Lv$, ReLU(x) = max(0, x) is often picked
$F(x,v)$ depends on the input v and the weights x (A's and b's). $v_1=ReLU(A_1v+b_1)$ produces the first hidden layer of the neural network
The complete net starts with the input layer v and output layer $w=F(v)$
$L_k(v_{k-1})= A_kv_{k-1}+b_k$
The goal is to minimize the total loss over all training samples by choosing the weights in $A_k$ and $b_k$, although least squares is not the best loss function for deep learning
The number of weights $A_{ij}$ and $b_k$ is often larger than the number of training samples
For images, a CNN is often appropriate and weights are shared: the diagonals of A are constant
Weights are optimized by stochastic gradient descent
positive definite symmetric matrices S
Orthogonal eigenvector q. Positive eigenvalues $\lambda$
$S = \lambda_1 q_1q_1^T + \lambda_2 q_2q_2^T….$. Assuming $\lambda_1 > \lambda_2…$, the $\lambda_1 q_1q_1^T$ is the most informative part
Layers of F(x, v)
x is in the ramp function ReLU
Start with fully connected layers: every neuron on layer n connects to every neuron on layer n + 1.
CNN repeats same weights around all pixels in an image
Pooling layer reduces dimensions
Dropout randomly leaves out neurons
Batch normalization resets means and variance