Startup as an RL problem
One of the most important things for a startup is a fast, honest feedback loop between my product and the people I want to serve.
A useful way to think about the journey to product–market fit (PMF) is as a sequential decision process. At the start I am far from PMF; my job is to reach it quickly and reliably while conserving runway.
The state is who I'm serving (ICP), how painful the problem is, what the product can do, how it's distributed, and what usage and retention look like. PMF is the set of states where my ICP repeatedly pulls the product: people come back unprompted, recommend it, and revenue follows use.
My actions are the things I can change: talk to users, ship changes, adjust onboarding or pricing, try new channels, reposition, or pivot.
Observed rewards are signups, session durations, invites, and purchases; churn counts as negative reward.
The objective is to choose actions that move me toward PMF quickly by maximizing long-term return. Short, tight feedback loops make learning smooth; when rewards are sparse or delayed, it's much harder to assign credit.
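To make "long-term return" concrete, here is a minimal sketch using the standard discounted return from RL. The discount factor `gamma`, the function name `discounted_return`, and the weekly numbers are all my own illustrative assumptions, not measurements.

```python
def discounted_return(rewards, gamma=0.9):
    """Return the sum of rewards, each discounted by gamma per step of delay."""
    return sum((gamma ** step) * reward for step, reward in enumerate(rewards))


# Hypothetical weekly signal strength (illustrative numbers, not real data).
tight_loop = [1, 1, 1, 1, 1, 1]  # a little feedback every week
slow_loop = [0, 0, 0, 0, 0, 6]   # the same total, all arriving in week six

tight_value = discounted_return(tight_loop)
slow_value = discounted_return(slow_loop)

print(f"tight loop: {tight_value:.2f}")
print(f"slow loop:  {slow_value:.2f}")
```

Both streams add up to the same undiscounted total, but the tight loop scores higher once delay is penalized. The slow loop also gives me only one data point to attribute six weeks of actions to, which is the credit assignment problem in miniature.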
Here are a few notes from my experience.
- Build for myself, but stay a real user.
  - A common failure mode: I solve a problem I had once or twice, but I no longer have it.
  - I keep asking: can I still relate to the problem? Would I still choose my solution over the alternatives? If not, is this fixable, or has something more fundamental changed?
  - RL lens: staying a real user keeps my feedback on-policy and dense; once I stop using the product, feedback becomes off-policy and sparse, which slows learning.
- Choose the right reward: prefer retained usage over revenue.
  - Retained usage is a sturdier signal; revenue lags at best and misleads at worst.
  - Examples:
    - A 1k+ employee company buys a yearly contract. For me, the contract size is make-or-break; for them, it's a fraction of one FTE's salary. My champion shifts priorities, moves teams, or leaves. Procurement forgets or can't be bothered.
    - A company buys for political reasons, e.g., the champion
      - is a buddy from university and wants to help me fundraise
      - wants someone to blame if their project goes south
      - wants to stay in touch for a potential acquihire
    - Consumers buy a subscription and either forget about it or keep it because they like the idea of using my product (e.g., the gym) or the bragging rights (e.g., an exclusive membership they visit once a year).
  - Retained usage is harder to cheat. It's unlikely that someone logs into my app several times a week just to fake metrics. There are exceptions where, despite heavy usage, the unit economics don't work (yet), as with heavily subsidized pricing. Even then, a product with temporarily negative margins that customers love is more fixable than one they barely use.
  - RL lens: retention is sustained nonzero reward; one-off revenue spikes are sparse rewards that can mislead value estimates.
- Spend regular, casual time with my ICP. If getting a meeting with someone in my ICP takes significant effort every time, I reconsider my ICP. Some hustle is always needed, but I accrue signals much faster if I build for a group I
  - am part of myself
  - hang out with regularly in a low-stakes way
  - have easy, abundant, and cheap access to (e.g., consumers I can approach in the park)
  - RL lens: frequent, low-friction ICP contact shortens the effective horizon and yields more on-policy trajectories.
- A bad MVP that some users can't stop using is a great signal. For example, a buggy app that breaks every few minutes but still sees daily multi-hour use from power users.
  - RL lens: even with occasional negative rewards (bugs), the average reward stays high.
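The note above about one-off revenue misleading value estimates can be sketched with a running average. This is a toy illustration under my own assumptions: the helper `running_mean`, the twelve-month window, and the reward numbers are hypothetical and chosen only so that both streams have the same yearly total.

```python
def running_mean(rewards):
    """Return the average reward observed so far, after each period."""
    estimates = []
    total = 0.0
    for count, reward in enumerate(rewards, start=1):
        total += reward
        estimates.append(total / count)
    return estimates


# Hypothetical monthly rewards over one year (illustrative numbers only).
retained_usage = [1] * 12          # steady use every month
one_off_revenue = [12] + [0] * 11  # one contract up front, then nothing

usage_estimates = running_mean(retained_usage)
revenue_estimates = running_mean(one_off_revenue)

print("after month 1: ", usage_estimates[0], revenue_estimates[0])
print("after month 12:", usage_estimates[-1], revenue_estimates[-1])
```

The steady stream gives the same estimate every month, so I can trust it early. The spike makes the first estimate look far better than the year turns out to be, and the estimate only converges once most of my runway for that year is already spent.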