Technical Feasibility & AI Architecture
Clinical Trial Navigator - Product Viability Report
Technical Achievability
The Clinical Trial Navigator is highly achievable using modern low-code and API-first architectures. The core dependency, the ClinicalTrials.gov API (with the AACT database as an aggregated mirror), is public and well-documented. The primary technical challenge is not "can we build it" but "can we ensure medical accuracy and data freshness" at scale.
Leveraging Large Language Models (LLMs) for "Plain Language Summaries" and "Eligibility Parsing" is a proven use case; precedent exists in tools like Antidote.me and various MedTech startups. The main complexity lies in the asynchronous data processing required to index 450,000+ studies daily and in running vector similarity searches efficiently without incurring exorbitant cloud costs.
A functional prototype can be built in 4-6 weeks by a single developer using Supabase and OpenAI. The architecture is standard (CRUD + Vector Search + AI wrapper), meaning there is low R&D risk.
Key Risks:
- Medical Hallucination Risk: AI may misinterpret complex exclusion criteria.
- Data Latency: Syncing 450k+ records daily requires robust background job management.

Mitigations:
- Implement "human-in-the-loop" review for the first 1,000 trial summaries.
- Use pgvector (Postgres) instead of a separate vector DB to reduce complexity.
Recommended Technology Stack
| Layer | Technology | Rationale |
|---|---|---|
| Frontend | Next.js 14 + Tailwind CSS | Next.js offers Server-Side Rendering (crucial for SEO of trial pages) and API routes. Tailwind enables rapid UI development. PWA capabilities are native. |
| Backend | Python (FastAPI) or Node.js | Python is preferred for the AI processing pipeline (better library support). FastAPI is performant and async-ready for handling concurrent API requests. |
| Database | Supabase (Postgres + pgvector) | Supabase handles Auth, DB, and Storage in one platform. pgvector extension allows semantic search without a separate Pinecone bill. |
| AI Layer | OpenAI GPT-4o + LangChain | GPT-4o offers the best reasoning for medical text simplification. LangChain manages the prompt chains for eligibility parsing efficiently. |
| Infrastructure | Vercel (Web) + Railway (Worker) | Vercel provides optimal frontend performance. Railway hosts the Python background worker for data ingestion/processing (avoiding Vercel timeouts). |
System Architecture
Feature Implementation Complexity
| Feature | Complexity | Effort | Dependencies | Notes |
|---|---|---|---|---|
| User Authentication (HIPAA) | Low | 2 days | Supabase Auth | Enable 2FA, strict session management |
| ClinicalTrials.gov Ingestion | Medium | 5 days | Python, Background Jobs | Must handle large XML/JSON dumps efficiently |
| Smart Matching Engine | High | 10 days | OpenAI Embeddings, pgvector | Hybrid search: Semantic (AI) + Metadata (Location/Phase) |
| Plain Language Summaries | Medium | 4 days | OpenAI API | Cache results to save costs; prompt engineering critical |
| FHIR Health Record Import | High | 8 days | Smart/FHIR App Launch | Complex OAuth flows; defer to Phase 2 if needed |
| Logistics & Maps | Low | 3 days | Google Maps API | Standard geolocation + distance calc |
| Trial Tracker Dashboard | Low | 3 days | Frontend State | CRUD operations with status filtering |
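The "Smart Matching Engine" row above calls for a hybrid search: a cheap metadata filter first, then semantic ranking over the survivors. A minimal sketch of that ordering in plain Python (the field names `status`, `states`, and `embedding` are assumptions; in production the filtering and ranking would happen in Postgres/pgvector, not in application code):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def match_trials(patient_embedding, patient_state, trials, top_k=3):
    """Hybrid search: hard metadata filter first, then semantic ranking."""
    # 1. Metadata filter: recruiting trials with a site in the patient's state.
    candidates = [t for t in trials
                  if t["status"] == "RECRUITING" and patient_state in t["states"]]
    # 2. Semantic ranking over the (much smaller) candidate set.
    scored = [(cosine_similarity(patient_embedding, t["embedding"]), t["nct_id"])
              for t in candidates]
    scored.sort(reverse=True)
    return [nct for _, nct in scored[:top_k]]

trials = [
    {"nct_id": "NCT001", "status": "RECRUITING", "states": ["CA"], "embedding": [1.0, 0.0]},
    {"nct_id": "NCT002", "status": "COMPLETED",  "states": ["CA"], "embedding": [1.0, 0.0]},
    {"nct_id": "NCT003", "status": "RECRUITING", "states": ["CA"], "embedding": [0.2, 0.9]},
]
print(match_trials([0.9, 0.1], "CA", trials))  # ['NCT001', 'NCT003']
```

Note that NCT002 never reaches the similarity step despite an identical embedding: filtering first is what keeps the vector search cheap at 450k+ studies.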
AI/ML Implementation Strategy
Core Use Cases
1. Eligibility Parsing: Raw Criteria Text → GPT-4o (Structured Extraction) → JSON (Inclusion/Exclusion Arrays)
2. Semantic Matching: Patient Profile → Text-Embedding-3-Large → Vector Search → Ranked Trial List
3. Patient Briefs: Full Protocol → GPT-4o (Summarization Prompt) → 8th-Grade Reading Level Summary
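For the eligibility-parsing pipeline, the model's JSON output should be validated before it reaches the matcher, so a malformed generation is retried or flagged for human review rather than silently corrupting match data. A minimal sketch of that validation step (the `{"inclusion": [...], "exclusion": [...]}` shape is an assumed contract with the prompt, not an official format; the GPT-4o call itself is omitted):

```python
import json

def validate_criteria(raw: str) -> dict:
    """Validate the LLM's structured-extraction output.

    Raises ValueError on anything malformed, so the caller can retry
    the generation or route the trial to human review.
    """
    data = json.loads(raw)
    for key in ("inclusion", "exclusion"):
        if key not in data or not isinstance(data[key], list):
            raise ValueError(f"missing or non-list field: {key}")
        if not all(isinstance(item, str) and item.strip() for item in data[key]):
            raise ValueError(f"empty or non-string entry in: {key}")
    return {"inclusion": data["inclusion"], "exclusion": data["exclusion"]}

good = '{"inclusion": ["Age 18-65"], "exclusion": ["History of stroke"]}'
print(validate_criteria(good)["exclusion"])  # ['History of stroke']
```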
Quality & Cost Control
Use "Grounding" techniques. Force the AI to answer only based on the provided trial text. If information is missing, instruct it to say "Not specified" rather than inventing details.
Est. Cost: $0.05 - $0.15 per active user/month (heavily dependent on usage).
Strategy: Cache all trial summaries (generate once, serve to millions). Use cheaper models (GPT-3.5-Turbo) for initial filtering, GPT-4o only for final summarization.
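The "generate once, serve to millions" strategy can be sketched as a cache keyed on the trial ID, prompt version, and a hash of the protocol text, so an amended protocol or a revised prompt triggers regeneration while everything else is served for free. (Function and table names here are illustrative; `generate` stands in for the actual OpenAI call, and a production cache would live in a DB table, not a dict.)

```python
import hashlib

_summary_cache = {}  # in production: a Postgres table keyed the same way

def cached_summary(nct_id, protocol_text, generate, prompt_version="v1"):
    """Generate a plain-language summary at most once per (trial, prompt, text)."""
    digest = hashlib.sha256(protocol_text.encode()).hexdigest()[:16]
    key = (nct_id, prompt_version, digest)
    if key not in _summary_cache:
        _summary_cache[key] = generate(protocol_text)  # the expensive LLM call
    return _summary_cache[key]

calls = []
def fake_llm(text):
    calls.append(text)
    return "Plain summary of: " + text[:20]

cached_summary("NCT001", "Protocol text...", fake_llm)
cached_summary("NCT001", "Protocol text...", fake_llm)  # served from cache
print(len(calls))  # 1
```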
Data Requirements & Strategy
Data Sources
- Primary: ClinicalTrials.gov API (AACT Database).
- Volume: ~450k studies, ~2GB raw data.
- Update Freq: Daily (Delta updates).
- Secondary: User Input (Questionnaires), FHIR exports.
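The daily delta update above reduces to a paginated "fetch everything modified since the last sync" loop. A sketch with the HTTP call injected as a callable, which keeps the loop testable offline (the ClinicalTrials.gov v2 API paginates with a page token, but the exact parameter names wrapped inside `fetch_page` are assumptions):

```python
def sync_deltas(fetch_page, upsert, since):
    """Pull studies updated since the last sync, one page at a time.

    `fetch_page(since, page_token)` wraps the HTTP request and returns
    (list_of_studies, next_page_token_or_None); `upsert` writes a batch
    to the database. Returns the number of studies synced.
    """
    count, token = 0, None
    while True:
        studies, token = fetch_page(since, token)
        upsert(studies)
        count += len(studies)
        if token is None:
            return count

# Fake two-page fetcher standing in for the real HTTP client:
pages = {None: ([{"nct": "NCT001"}, {"nct": "NCT002"}], "p2"),
         "p2": ([{"nct": "NCT003"}], None)}
stored = []
total = sync_deltas(lambda since, tok: pages[tok], stored.extend, "2024-01-01")
print(total, len(stored))  # 3 3
```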
Data Schema
- Users: ID, Health Profile (JSONB), Preferences.
- Trials: NCT_ID, Criteria (Structured), Status, Locations.
- Matches: User_ID, Trial_ID, Score, Match_Reasoning.
- Subscriptions: User_ID, Query_Filters, Notification_Status.
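The Trials and Matches entities above can be sketched as application-level models; storing `match_reasoning` alongside the score is what lets the UI explain *why* a trial matched. (Field names follow the schema list; defaults and types are assumptions.)

```python
from dataclasses import dataclass, field

@dataclass
class Trial:
    nct_id: str
    status: str
    inclusion: list[str] = field(default_factory=list)  # structured criteria
    exclusion: list[str] = field(default_factory=list)
    locations: list[str] = field(default_factory=list)

@dataclass
class Match:
    user_id: str
    trial_id: str
    score: float
    match_reasoning: str  # surfaced in the UI so matches are explainable

m = Match(user_id="u1", trial_id="NCT001", score=0.92,
          match_reasoning="Condition and age range align; site within 50 miles.")
print(m.score)  # 0.92
```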
Privacy (HIPAA)
- PII: Encrypt names/email at rest.
- Health Data: Isolate in separate schema; strict access controls.
- BAA: Required for Supabase, Vercel, OpenAI, SendGrid.
- Retention: Allow immediate account/data export/deletion.
Third-Party Integrations
| Service | Purpose | Complexity | Cost | Criticality |
|---|---|---|---|---|
| Supabase | Auth, DB, Storage | Medium | Free → $25/mo | Must-have |
| OpenAI | LLM Processing | Simple API | Usage-based | Must-have |
| ClinicalTrials.gov | Trial Data Source | Complex Parsing | Free | Must-have |
| SendGrid | Transactional Email | Simple API | Free → $20/mo | Must-have |
| Stripe | Payments (Premium) | Medium | 2.9% + 30¢ | High |
| Google Maps API | Logistics/Distance | Simple API | $200 free credit | Nice-to-have |
Scalability Analysis
Bottlenecks
- AI Latency: Generating summaries takes 3-5s. Must be async (background job) to avoid blocking UI.
- Vector Search: As the trial count grows, similarity search slows. Solution: Index optimization (HNSW) and filtering by location first to reduce search space.
- API Rate Limits: ClinicalTrials.gov has rate limits. Must use bulk download (XML) for initial sync, API for daily deltas.
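The HNSW and filter-first points above translate into a pgvector index plus a query that narrows on metadata before ordering by vector distance. A sketch of both statements (table and column names are assumptions; `<=>` is pgvector's cosine-distance operator):

```python
# DDL for an HNSW index on the trial embeddings (pgvector >= 0.5).
CREATE_INDEX = """
CREATE INDEX trials_embedding_hnsw
    ON trials USING hnsw (embedding vector_cosine_ops);
"""

# Filter on cheap metadata first so the nearest-neighbor scan touches fewer rows.
MATCH_QUERY = """
SELECT nct_id, embedding <=> %(patient_vec)s AS distance
FROM trials
WHERE status = 'RECRUITING' AND state = %(state)s
ORDER BY distance
LIMIT 20;
"""

print("USING hnsw" in CREATE_INDEX, "ORDER BY distance" in MATCH_QUERY)  # True True
```

One caveat worth load-testing: highly selective WHERE clauses can cause Postgres to skip the HNSW index, so the filter/index interplay should be checked with `EXPLAIN` on realistic data.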
Cost at Scale
| Users | 1,000 | 10,000 | 100,000 |
|---|---|---|---|
| Hosting (DB/Web) | $50 | $200 | $1,500 |
| AI Processing | $100 | $800 | $6,000 |
| Total Est. Monthly | $150 | $1,000 | $7,500 |
Technology Risks & Mitigations
AI Hallucination / Medical Inaccuracy
HIGH SEVERITY: The LLM might incorrectly interpret an exclusion criterion (e.g., misreading "history of" as "current") or hallucinate a benefit not listed in the protocol. This could lead patients to pursue trials they are ineligible for, endangering health and exposing the company to liability.
Use strict JSON schema validation for AI outputs. Implement a "Confidence Score" for AI matches. Always display the original eligibility criteria alongside the AI summary. Add a "Verify with Doctor" call-to-action on every match.
ClinicalTrials.gov API Changes
MED SEVERITY: The product depends on a government API. Changes in data structure, unscheduled downtime, or rate limit reductions could break the matching engine or leave the database stale.
Build an abstraction layer for data ingestion. Monitor the AACT database (Aggregate Analysis of ClinicalTrials.gov) as a backup source. Implement robust error logging for the daily sync job to catch schema drift immediately.
HIPAA Compliance Violation
MED SEVERITY: Accidental exposure of Protected Health Information (PHI) via logs, unencrypted storage, or sending PII to non-HIPAA-compliant AI models.
Sign BAAs with all vendors before processing data. Use PII stripping (e.g., replace names with "Patient") before sending text to OpenAI. Enable audit logging on all database access.
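The PII-stripping mitigation can be sketched as a scrub pass applied to any text before an LLM call. The regexes below are illustrative only; a production system should use a vetted de-identification tool, since patterns like these will miss names, addresses, and many other PII forms:

```python
import re

# Illustrative patterns only: email, US-style phone, SSN.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    (re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"), "[PHONE]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
]

def scrub(text: str) -> str:
    """Replace obvious PII with placeholders before any LLM call."""
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

print(scrub("Contact jane.doe@example.com or 555-123-4567."))
# Contact [EMAIL] or [PHONE].
```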
Development Timeline (10 Weeks)
Phase 1: Foundation (Weeks 1-2)
- Setup Supabase project (Auth, Postgres with pgvector).
- Deploy Next.js frontend to Vercel.
- Implement "Questionnaire" UI for user health profile.
- Deliverable: User can sign up and save a health condition.
Phase 2: Data & AI Engine (Weeks 3-5)
- Build Python worker to ingest ClinicalTrials.gov data.
- Implement OpenAI embedding generation for trial descriptions.
- Build "Smart Match" API endpoint (Vector search + Filters).
- Deliverable: System returns relevant trials for a condition.
Phase 3: Features & Polish (Weeks 6-8)
- Develop "Plain Language Summary" generation pipeline.
- Build Trial Tracker Dashboard (Save/Status features).
- Integrate Maps API for logistics.
- Deliverable: Functional MVP with core workflows.
Phase 4: Launch Prep (Weeks 9-10)
- Security audit (HIPAA check).
- Load testing (simulate 100 concurrent users).
- Setup Analytics (PostHog/Mixpanel).
- Deliverable: Production-Ready v1.0.
Team Composition
Solo Founder Feasibility
Possible, but challenging. A solo technical founder can build the MVP using the low-code stack (Supabase + Vercel). However, prompt engineering and medical data validation require a significant time investment.
Required Skills: React/Next.js, Python (for data worker), SQL, Prompt Engineering.