Startups as an RL problem
One of the most important things for a startup is a fast, honest feedback loop between your product and the people you want to serve.
A useful way to think about the journey to product–market fit (PMF) is as a sequential decision process. At the start ($t = 0$) you are far from PMF; your job is to reach it quickly and reliably while conserving runway.
The state is who you're serving (ICP), how painful the problem is, what the product can do, how it's distributed, and what usage and retention look like. PMF is the set of states where your ICP repeatedly pulls the product: people come back unprompted, recommend it, and revenue follows use.
Your actions are the things you can change: talk to users, ship changes, adjust onboarding or pricing, try new channels, reposition, or pivot.
The reward $r_t$ is the score you infer from observed events: signups, sessions, invites, purchases, churn. It is only a noisy proxy for true value; attribution leaks, seasonality, and randomness often distort short-term moves.
The game is to choose actions that move you into the PMF states quickly by maximizing the discounted return $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$, where $0 \le \gamma < 1$ is the discount factor. Short, clean feedback loops make credit assignment easier and learning faster; when rewards are sparse or delayed, it's much harder to see which decisions mattered.
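To make the idea of a discounted return concrete, here is a minimal sketch. It assumes the standard textbook definition (each reward is weighted by the discount factor raised to its delay). The reward numbers and the discount factor of 0.9 are made up for illustration. The sketch only shows how discounting penalizes late rewards; it does not model noise or credit assignment.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float = 0.9) -> float:
    """Return sum_k gamma**k * rewards[k], accumulated from the last step back."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical weekly reward signals over six weeks (made-up numbers).
dense = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]    # small, steady signal every week
delayed = [0.0, 0.0, 0.0, 0.0, 0.0, 6.0]  # same total, but it all arrives at the end

print(discounted_return(dense))
print(discounted_return(delayed))
```

Both sequences carry the same undiscounted total, but the delayed one scores lower once discounting is applied. That is one way to read "conserving runway": a signal that arrives late is worth less than the same signal arriving early.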
Here are a few notes from my experience.
- Build for yourself, but stay a real user.
  - A common failure mode: you solve a problem you had once or twice, but you no longer have it.
  - Keep asking: can you still relate to the problem? Would you still choose your solution over the alternatives? If not, is this fixable, or has something more fundamental changed?
  - RL lens: staying a real user keeps rewards on-policy and dense; stop using the product and rewards become off-policy and sparse, which slows learning.
- Choose the right reward signal: prefer retained usage over revenue.
  - Retained usage is a sturdier signal; revenue lags at best and misleads at worst.
  - Examples:
    - A 1k+ employee company buys a yearly contract. For you, the contract size is make-or-break; for them, it's a fraction of one FTE's salary. Your champion shifts priorities, moves teams, or leaves. Procurement forgets or can't be bothered.
    - A company buys for political reasons, e.g., the champion
      - is a buddy from university and wants to help you fundraise
      - wants someone to blame if their project goes south
      - wants to stay in touch for a potential acquihire
    - Consumers buy a subscription and either forget about it or keep it because they like the idea of using your product (e.g., the gym) or the bragging rights (e.g., an exclusive membership they use once a year).
  - Retained usage is harder to cheat. It's unlikely that someone logs into your app several times a week just to fake metrics. There are exceptions where, despite heavy usage, the unit economics don't work (yet), for example with heavily subsidized pricing. Even then, selling a product with temporarily negative margins that customers love is more fixable than selling one they barely use.
  - RL lens: retention is sustained nonzero reward $r_t$; one-off revenue spikes are sparse rewards that can mislead the value estimate $\hat{V}(s)$.
- Spend regular, casual time with your ICP. If getting a meeting with someone in your ICP takes significant effort every time, reconsider your ICP. Some hustle is always needed, but you'll accrue signals much faster if you build for a group you
  - are part of yourself
  - hang out with regularly in a low-stakes way
  - have easy, abundant, and cheap access to (e.g., consumers you can approach in the park)
  - RL lens: frequent, low-friction ICP contact shortens the effective horizon and yields more on-policy trajectories.
- A bad MVP that some users can't stop using is a great signal. For example, a buggy app that breaks every few minutes but still sees daily multi-hour use from power users.
  - RL lens: if some users' return $G_t$ stays high even with frequent "negative rewards" (bugs), keep going.
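The note on preferring retained usage over revenue can be illustrated with a small sketch. It assumes you can reduce your activity logs to "which weeks since signup was each account active". The account names, week sets, and revenue figures are all hypothetical, and `retention_by_week` is an illustrative helper, not an established metric definition.

```python
from typing import Dict, List, Set


def retention_by_week(active_weeks: Dict[str, Set[int]], horizon: int) -> List[float]:
    """Fraction of the signup cohort that is active in each week 0..horizon-1."""
    cohort_size = len(active_weeks)
    if cohort_size == 0:
        return [0.0] * horizon
    return [
        sum(1 for weeks in active_weeks.values() if week in weeks) / cohort_size
        for week in range(horizon)
    ]


# Hypothetical cohort: weeks (since signup) in which each account was active.
active_weeks = {
    "acct_a": {0, 1, 2, 3},
    "acct_b": {0, 1, 2, 3},
    "acct_c": {0},        # bought a yearly contract, then went quiet
    "acct_d": {0, 1},
}
# Hypothetical upfront revenue per account (made-up numbers).
revenue = {"acct_a": 100, "acct_b": 100, "acct_c": 5000, "acct_d": 100}

retention = retention_by_week(active_weeks, horizon=4)
top_payer = max(revenue, key=revenue.get)

print(retention)
print(top_payer, sorted(active_weeks[top_payer]))
```

In this toy cohort the account that contributes most of the revenue is also the one that stopped using the product after the first week. The retention curve shows that; a revenue total would hide it.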