ShapeShiftr | AI-Lab

What It Is

A fitness tracker I use every week to log body metrics, upload progress photos, and receive AI coaching to help me get in better physical shape. Managing and analyzing my data was previously difficult because it was scattered across five sources: a smart scale for weight, an iOS body-scanner for body fat, Apple Health for exercise, MyFitnessPal for calories, and an Apple Note for skin fold measurements. ShapeShiftr was my first vibe-coded app, and it became the test bed for almost every new AI capability I wanted to learn.

Try the live demo: the app at shapeshiftr.lovable.app has a seeded demo account and dataset wired up, so you can poke around the dashboard, charts, progress photos, AI coach, and weekly analysis.

Photo gallery: screenshots 2, 5, and 6 show the main dashboard in light mode, dark mode, and table view. 3-4 cover AI features (progress analysis, chat coach) and the side-by-side photo comparison view. 7-9 show the data entry, goals, and affirmations modals. 10-12 are the Langfuse side: LLM judges, a live trace with eval scores, and the chat-coach prompt version history. 13-15 are mobile views.

What I Learned

Real-stakes alignment work is mostly judgment calls — Even though I'll never ship this commercially, fitness and nutrition advice is a real harm surface and I wanted reps on something that mattered. The hardest line to draw was frustration vs distress: "nothing is working, I'm stuck" needs coaching, not a therapist referral, but "I hate my body, I look disgusting" needs the referral even if the data tells a different story. Refusing to ask leading questions like "what did your doctor say?" (which lets a user fabricate clearance for risky exercise or dieting) was the other delicate part. No single rule cleanly captured either case.
A real eval pipeline is built by hand, one labeled trace at a time — I'd read Hamel Husain's playbook and thought I knew the moves. Doing it surfaced how much product judgment and human insight a real eval pipeline needs: building a synthetic dataset that actually represents production, hand-labeling traces, doing axial coding to find recurring failure modes, and then calibrating LLM judges against those labels.
Calibrating an LLM judge requires as much work as tuning the thing it judges — I run three judges on every chat-coach response (Factual Accuracy, Safety, and Quality), each with its own criteria and a stack of pass/fail tests. Tuning them to score production traffic took the same kind of work as tuning the coach itself: hand-labeling traces, axial coding the failures, deciding which model to run each judge on, and multiple rounds of prompt iteration to push judge-vs-human agreement above 86% on most dimensions.
Models that look similar can diverge on weird stuff — Even with Haiku 4.5 and Gemini 3.1 Flash-Lite (both small fast models), Gemini was substantially better at arithmetic across weeks (+24pp on factual accuracy) and Haiku was more disciplined about not appending unsolicited progress summaries. You don't see those quirks until you score the same prompts on both side by side.
AI features have to do real work for the user or they end up feeling like tack-ons — The Progress Analysis and Ask Coach both pull from the full weekly history, can look across months to spot trends, patterns, and correlations, and surface the affirmations and goals I've set when they're relevant.
AI compresses the build phase more than the polish phase — Lovable got me to a usable app in about four hours, but the long tail of bug fixes, edge cases, mobile responsiveness, and small usability calls stretched on for days even with AI leading.
Vibe-coding tools have a ceiling once you need robust evals and other integrations — Lovable was the right call for v1, but once I wanted prompt versioning, evals, and other third-party services, I needed out of its black box and into the codebase with a partner I could guide more directly. Running Claude Code inside Cursor gives me an IDE and MCPs (Langfuse, GitHub, Supabase) to close the gap on everything I was bouncing between tabs to do.
Staying in one workspace through the whole loop is hugely valuable — When I blew through my Supabase storage egress quota, pasting the dashboard screenshot into Claude got me a diagnosis and fix without ever leaving the project. MCPs (Langfuse, Supabase, GitHub, etc.) cut down on context/app switching and let the coding agent discover on its own which prompt version is deployed, what's in my database schema, and what the last commit touched.
A familiar codebase is the best test track for new tools — Every time a new LLM, image model, skill, or MCP shows up, I've got a real app I know inside out to put it through its paces. One example: using Gemini 3 Pro inside Cursor for a CSS overhaul that swapped the purple, AI-slop dark-mode defaults for teals, gold, and a pleasing light mode.
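
The judge-vs-human agreement figure mentioned above can be computed with very little code. What follows is a minimal sketch, not the project's actual calibration script: the `LabeledTrace` shape, the dimension keys, and both function names are assumptions made for illustration. It reports, per dimension, the fraction of traces where the judge's pass/fail verdict matches the hand label, and lists the trace IDs where they differ.

```typescript
// Hypothetical data shapes: the write-up does not show how labels are stored.
type Verdict = "pass" | "fail";
type Dimension = "factual_accuracy" | "safety" | "quality";

interface LabeledTrace {
  traceId: string;
  human: Record<Dimension, Verdict>; // hand label
  judge: Record<Dimension, Verdict>; // LLM judge verdict
}

const DIMENSIONS: Dimension[] = ["factual_accuracy", "safety", "quality"];

// Fraction of traces (0..1) where judge and human agree, per dimension.
// Returns 0 for every dimension when there are no traces.
function agreementByDimension(traces: LabeledTrace[]): Record<Dimension, number> {
  const result = {} as Record<Dimension, number>;
  for (const dim of DIMENSIONS) {
    const matches = traces.filter((t) => t.human[dim] === t.judge[dim]).length;
    result[dim] = traces.length === 0 ? 0 : matches / traces.length;
  }
  return result;
}

// Trace IDs where the judge and the human label differ on one dimension,
// i.e. the cases worth reading by hand.
function disagreements(traces: LabeledTrace[], dim: Dimension): string[] {
  return traces.filter((t) => t.human[dim] !== t.judge[dim]).map((t) => t.traceId);
}
```

Raw agreement can look high when nearly every trace passes, so it is worth reading the disagreement list as well as the headline number.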

How It Works

Five metrics (weight, body fat %, skin fold, exercise mins/day, calories/day) entered weekly, plotted in dual-metric charts with goal reference lines, plus a sortable history table. Six months in, the trends are what make the data motivating.
Photo upload uses pose detection ported from my Fitness Frame project (TensorFlow.js MoveNet) to auto-crop front and side photos to a consistent 9:16 ratio so weekly comparisons aren't thrown off by framing. Photos live in user-scoped Supabase Storage with RLS.
AI Progress Analysis runs on demand against the full weekly history on Claude Haiku 4.5, with hard-coded validation ranges that catch entry mistakes (e.g., body fat outside 3-60%) before the LLM ever sees the data.
AI Chat Coach handles interactive Q&A on Gemini 3.1 Flash-Lite. The system prompt injects the full weekly history table as context so the coach can compute deltas, name the rows it's comparing, and stay strict about daily-vs-weekly units.
Both system prompts (chat-coach-system, analyze-progress-system) live in Langfuse with production and staging labels; edge functions fetch by label at runtime with a 5-minute cache, so swapping prompts is a label move rather than a redeploy.
Prompt-level guardrails define the safety rules (medical deferral, body image referrals, no harmful framing of dangerous behaviors, no leading "what did your doctor say?" questions), and the judges verify compliance.
Three LLM judges (safety, quality, factual accuracy), all running on Haiku 4.5, score every coach response. Judge prompts are versioned in Langfuse and configured as production evaluators, so every live response gets scored without manual review.
Synthetic Langfuse dataset feeds the eval loop: coach generates responses, judges score them, a calibration script compares scores to my hand labels, and a disagreement inspector drills into specific failures. Same pipeline runs against production traces for ongoing monitoring.
A nightly score monitoring script checks recent eval pass rates per dimension and pings Slack if anything drops below threshold. Hasn't fired on real traffic since I'm the primary user, but the wiring is ready.
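
The threshold check behind the nightly monitor could look roughly like the sketch below. It is an illustration under stated assumptions, not the actual script: the `EvalScore` shape, the `findAlerts` and `formatAlert` names, the `minSamples` guard, and the message wording are all hypothetical. Fetching scores from Langfuse and posting to Slack are left out so the logic stays self-contained.

```typescript
// Hypothetical shape for one judge score on one response.
interface EvalScore {
  dimension: string; // e.g. "safety", "quality", "factual_accuracy"
  passed: boolean;
}

interface Alert {
  dimension: string;
  passRate: number; // 0..1
  sampleSize: number;
}

// Group scores by dimension and return the dimensions whose pass rate is
// below the threshold. minSamples (an assumption) avoids alerting on a
// handful of traces.
function findAlerts(scores: EvalScore[], threshold: number, minSamples = 1): Alert[] {
  const tally = new Map<string, { passed: number; total: number }>();
  for (const s of scores) {
    const t = tally.get(s.dimension) ?? { passed: 0, total: 0 };
    t.total += 1;
    if (s.passed) t.passed += 1;
    tally.set(s.dimension, t);
  }
  const alerts: Alert[] = [];
  tally.forEach((t, dimension) => {
    const passRate = t.passed / t.total;
    if (t.total >= minSamples && passRate < threshold) {
      alerts.push({ dimension, passRate, sampleSize: t.total });
    }
  });
  return alerts;
}

// Human-readable text that a notifier could send to a Slack channel.
function formatAlert(a: Alert, threshold: number): string {
  const pct = (x: number): string => `${(x * 100).toFixed(1)}%`;
  return `Eval pass rate for ${a.dimension} is ${pct(a.passRate)} over ${a.sampleSize} traces (threshold ${pct(threshold)})`;
}
```

Keeping the check as a pure function over a list of scores means it can be run against synthetic data long before real traffic is heavy enough to trigger it.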

Built With

Lovable kicked off v1, and Claude Code (running inside Cursor with MCPs for Langfuse, GitHub, and Supabase) carried it forward. Frontend is React + Vite + TypeScript with Tailwind, shadcn/ui, Recharts, and TensorFlow.js MoveNet for the pose-aware photo cropping ported in from Fitness Frame; backend is Supabase (Postgres with RLS, Auth, Storage, and Deno Edge Functions). Claude Haiku 4.5 powers the progress analysis and the LLM judges, Gemini 3.1 Flash-Lite powers the chat coach, and Langfuse handles prompt versioning, tracing, eval scoring, and Slack alerts. ChatGPT Images 2.0 generated the hero image.
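
The fetch-by-label pattern with a 5-minute cache, described under How It Works, can be sketched as a small time-to-live wrapper. This is an illustration, not the edge functions' actual code: `makeCachedPromptLoader` and the `Fetcher` type are hypothetical, and the `fetchPrompt` argument stands in for whatever call retrieves a prompt from Langfuse by name and label. The clock is injectable so expiry can be exercised without waiting.

```typescript
// Hypothetical stand-in for the call that fetches a prompt by name and label.
type Fetcher = (name: string, label: string) => Promise<string>;

function makeCachedPromptLoader(
  fetchPrompt: Fetcher,
  ttlMs: number = 5 * 60 * 1000, // 5-minute cache, as in the write-up
  now: () => number = Date.now,
): (name: string, label: string) => Promise<string> {
  // The in-flight promise is cached, so concurrent callers share one fetch.
  const cache = new Map<string, { pending: Promise<string>; expiresAt: number }>();

  return function getPrompt(name: string, label: string): Promise<string> {
    const key = `${name}@${label}`;
    const hit = cache.get(key);
    if (hit !== undefined && hit.expiresAt > now()) {
      return hit.pending; // still fresh
    }
    const pending = fetchPrompt(name, label);
    cache.set(key, { pending, expiresAt: now() + ttlMs });
    // Drop failed fetches so the next call retries instead of reusing the error.
    pending.catch(() => {
      if (cache.get(key)?.pending === pending) cache.delete(key);
    });
    return pending;
  };
}
```

With a loader like this, moving the production label to a new prompt version takes effect within one cache window, with no redeploy.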
