# Building a Touchless "XR" Gallery with MediaPipe & JavaScript
With the rise of the Apple Vision Pro and Meta Quest, "Spatial Computing" is the buzzword of the year. But you don't need a $3,500 headset to experience the future of interaction. Today, we are going to build a Minority Report-style web interface that runs in a standard web browser and is controlled entirely by hand gestures.
In this post, I'll break down how I built the Touchless XR Gallery: an infinite-scroll web app using Google MediaPipe for computer vision and Vanilla JavaScript for the physics engine.
## The Concept
The goal was simple but ambitious: create a web page where a user can scroll, pan through image carousels, and select items without touching the keyboard or mouse.
However, hand tracking has a history of being jittery and frustrating, and holding an arm in the air for long stretches is tiring (the infamous "Gorilla Arm" fatigue). To solve this, I focused heavily on UX physics, specifically implementing three key features:
- The Joystick Model (for effortless scrolling)
- Magnetic Friction (for precision)
- The Smart Clutch (for stability)
## The Tech Stack
| Technology | Purpose |
|---|---|
| HTML5 & Vanilla JS | No frameworks, just raw performance |
| Google MediaPipe Hands | Real-time skeletal tracking (via CDN) |
| Tailwind CSS | Sleek, dark-mode "XR" aesthetic |
| Picsum Photos | Infinite random stock imagery |
## The "Secret Sauce": UX Mechanics
Getting the camera to see your hand is easy. Making it feel good to use is hard. Here is the logic I used to solve common gesture control problems.
### 1. The "Joystick" Navigation Model
Direct mapping, where every hand movement translates straight into scroll distance (say, 1 inch of hand travel moves the page 100 pixels), is exhausting. Instead, I used a Zone-based Joystick model.
The screen is divided into invisible trigger zones:
- Top 20%: Scroll Up
- Bottom 20%: Scroll Down
- Left/Right 15%: Pan the gallery carousels
- Center: Neutral (Stop)
This allows the user to rest their hand in a zone and let the content flow, rather than constantly waving their arm.
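The zones above can be sketched as a pure function that turns a cursor position into a scroll velocity. This is a minimal sketch, not the demo's actual code: the function name `zoneVelocity`, the normalized 0 to 1 coordinates, and the `BASE_SPEED` value are assumptions made for illustration; only the 20% and 15% zone sizes come from the list.

```javascript
// Zone sizes are taken from the list above; the speed is an assumed placeholder.
const VERTICAL_ZONE = 0.20;   // top and bottom 20% of the screen
const HORIZONTAL_ZONE = 0.15; // left and right 15% of the screen
const BASE_SPEED = 12;        // pixels per frame (hypothetical value)

/**
 * Map a normalized cursor position (x and y in 0..1, origin at top-left)
 * to a scroll velocity. The center of the screen is neutral on both axes.
 * In a corner, both axes are active in this sketch.
 */
function zoneVelocity(x, y) {
  let scrollY = 0;
  let panX = 0;

  if (y < VERTICAL_ZONE) scrollY = -BASE_SPEED;          // top zone: scroll up
  else if (y > 1 - VERTICAL_ZONE) scrollY = BASE_SPEED;  // bottom zone: scroll down

  if (x < HORIZONTAL_ZONE) panX = -BASE_SPEED;           // left zone: pan left
  else if (x > 1 - HORIZONTAL_ZONE) panX = BASE_SPEED;   // right zone: pan right

  return { scrollY, panX };
}
```

Because the velocity depends only on which zone the hand is resting in, the content keeps moving for as long as the hand stays there, which is what makes it feel like a joystick.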
### 2. "Magnetic" Friction
Trying to click a moving target with a free-floating hand is difficult. To fix this, I implemented Friction.
When the virtual cursor hovers over an interactive element (like an image card), the horizontal scrolling speed automatically drops to 30%. This creates a "sticky" or "magnetic" feeling, giving the user time to lock onto the item they want without the carousel zooming past them.
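In code, the friction rule can be as small as one multiplier. The sketch below assumes a hypothetical `applyFriction` helper and an `isHovering` flag computed elsewhere; only the 30% figure comes from the description above.

```javascript
const FRICTION_FACTOR = 0.3; // hovering drops horizontal speed to 30%

/**
 * Slow the carousel while the cursor is over an interactive element.
 * `panSpeed` is the horizontal speed in pixels per frame;
 * `isHovering` is assumed to be computed elsewhere by a hit test.
 */
function applyFriction(panSpeed, isHovering) {
  return isHovering ? panSpeed * FRICTION_FACTOR : panSpeed;
}
```

In the browser, one plausible way to compute `isHovering` is to call `document.elementFromPoint(x, y)` at the cursor position and check whether the result is (or sits inside) an image card.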
### 3. The "Smart Clutch"
This is the most critical feature. In gesture interfaces, users often trigger an unintended scroll while they are trying to click (the "Midas Touch" problem).
I implemented a Clutch system. As soon as the computer vision detects your thumb and index finger coming together (starting a pinch), it acts as a hard brake. All scrolling stops instantly. This freezes the UI, allowing you to finish the pinch and click safely.
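One way to make the brake engage while the pinch is still forming is to use two distances: a wider one that stops scrolling and a tighter one that counts as the click. The sketch below is an assumption on my part, not necessarily how the demo does it: `CLUTCH_THRESHOLD` and both function names are hypothetical, and `CLICK_THRESHOLD` reuses the 0.06 pinch threshold this post uses for pinch detection.

```javascript
const CLICK_THRESHOLD = 0.06;  // pinch distance from this post
const CLUTCH_THRESHOLD = 0.10; // hypothetical: brake slightly before the click registers

/**
 * Classify a thumb-to-index distance (normalized units).
 * `braking` turns on first, as the fingers approach each other;
 * `clicking` turns on only once the pinch is complete.
 */
function clutchState(pinchDistance) {
  return {
    braking: pinchDistance < CLUTCH_THRESHOLD,
    clicking: pinchDistance < CLICK_THRESHOLD,
  };
}

/** Zero out any scroll velocity while the clutch is engaged. */
function applyClutch(velocity, pinchDistance) {
  return clutchState(pinchDistance).braking
    ? { scrollY: 0, panX: 0 }
    : velocity;
}
```

With a single threshold the same idea still works: treat "pinching" as both the brake and the click, and check it before any scroll logic runs.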
## The Implementation
### Step 1: The Virtual Cursor & Smoothing
Raw webcam data is noisy; your hand shakes even when you think it's still. To fix this, we don't map the hand position directly to the cursor. We use Linear Interpolation (Lerp) to smooth it out.
The smoothing algorithm works by calculating a weighted average between the previous cursor position and the new input:
- A smoothing factor of 0.15 creates a fluid, "heavy" cursor feel
- New cursor position = (Previous position × 0.85) + (New input × 0.15)
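The weighted average above can be written as a small lerp helper. This is a minimal sketch; the names `lerp` and `smoothCursor` and the `{ x, y }` point shape are assumptions, while the 0.15 factor is the one quoted above.

```javascript
const SMOOTHING = 0.15; // smoothing factor from the list above

/**
 * Linear interpolation: move `previous` a fraction of the way toward `target`.
 * Algebraically equal to previous * (1 - factor) + target * factor.
 */
function lerp(previous, target, factor = SMOOTHING) {
  return previous + (target - previous) * factor;
}

/** Smooth a 2D cursor toward the latest raw hand position. */
function smoothCursor(cursor, input) {
  return {
    x: lerp(cursor.x, input.x),
    y: lerp(cursor.y, input.y),
  };
}
```

Calling `smoothCursor` once per frame makes the cursor close 15% of the remaining gap each time, so small tremors are absorbed while larger movements still arrive within a few frames.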
### Step 2: Pinch Detection
MediaPipe provides 21 skeletal landmarks for the hand. To detect a click, we calculate the Euclidean distance between Landmark 8 (Index Tip) and Landmark 4 (Thumb Tip).
The threshold is set to 0.06, measured in MediaPipe's normalized landmark coordinates (where 0 to 1 spans the camera frame). If the distance is less than this value, we are pinching.
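A sketch of that check is shown below. It assumes `landmarks` is the array of 21 `{ x, y, z }` points MediaPipe reports for one hand, and it measures distance in 2D only; whether the demo also uses `z` is not stated, so treat that choice as an assumption. The helper names are hypothetical.

```javascript
const THUMB_TIP = 4;          // landmark index for the thumb tip
const INDEX_TIP = 8;          // landmark index for the index fingertip
const PINCH_THRESHOLD = 0.06; // threshold quoted above, in normalized units

/** Euclidean distance between thumb tip and index tip (x/y only, by assumption). */
function pinchDistance(landmarks) {
  const thumb = landmarks[THUMB_TIP];
  const index = landmarks[INDEX_TIP];
  return Math.hypot(index.x - thumb.x, index.y - thumb.y);
}

/** True when the two fingertips are closer than the pinch threshold. */
function isPinching(landmarks) {
  return pinchDistance(landmarks) < PINCH_THRESHOLD;
}
```

Because the coordinates are normalized to the camera frame, the same physical pinch measures smaller when the hand is farther from the camera, so the threshold may need tuning for a given setup.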
### Step 3: The Interaction Loop
The app runs a `gameLoop` on every animation frame and checks the following conditions in priority order:
- Check Pinch: If pinching, STOP everything (Brake)
- Check Vertical: If hand is high/low, scroll window
- Check Horizontal: If hand is left/right, scroll carousels
- Apply Friction: If hovering an image, multiply speed by 0.3
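The priority order above can be sketched as a single per-frame decision function. This is an illustration under assumptions rather than the demo's actual `gameLoop`: the `stepInteraction` name, the shape of the `state` object, and the `SPEED` value are all hypothetical, and friction is applied to the horizontal speed only, following the friction description earlier in the post.

```javascript
const SPEED = 12; // pixels per frame (hypothetical value)

/**
 * Decide this frame's scroll velocity.
 * `state` is assumed to hold a normalized cursor position (x, y in 0..1)
 * plus two booleans: `pinching` and `hovering`.
 */
function stepInteraction(state) {
  // 1. Check Pinch: the clutch overrides everything else.
  if (state.pinching) return { scrollY: 0, panX: 0 };

  let scrollY = 0;
  let panX = 0;

  // 2. Check Vertical: top or bottom 20% scrolls the window.
  if (state.y < 0.2) scrollY = -SPEED;
  else if (state.y > 0.8) scrollY = SPEED;

  // 3. Check Horizontal: left or right 15% pans the carousels.
  if (state.x < 0.15) panX = -SPEED;
  else if (state.x > 0.85) panX = SPEED;

  // 4. Apply Friction: slow the carousel while hovering a card.
  if (state.hovering) panX *= 0.3;

  return { scrollY, panX };
}
```

In the browser, the result would typically be applied inside a `requestAnimationFrame` callback, for example with `window.scrollBy(0, scrollY)` for the page and by adding `panX` to the hovered carousel's `scrollLeft`.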
## Optimizing for Performance
Since this runs in the browser, performance is key.
- No High-Res Bloat: I request the stock images at 250x350px. They load quickly and look sharp enough for cards.
- CSS Transforms: The cursor and hover effects use hardware-accelerated CSS transforms (translate, scale) to help keep the framerate at 60fps even while the computer vision model is crunching numbers.
- Debouncing: The "Toast Notification" system uses timeouts to prevent spamming the user if they pinch repeatedly.
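The toast guard can be sketched as a small cooldown gate. Note that this version compares timestamps instead of using `setTimeout` as the post describes, which makes it easy to test without timers; the `createToastGate` name and the 1500 ms default are assumptions.

```javascript
/**
 * Create a gate that lets a toast through at most once per cooldown window.
 * `cooldownMs` is a hypothetical default; tune it to taste.
 */
function createToastGate(cooldownMs = 1500) {
  let lastShown = -Infinity; // no toast has been shown yet

  // `now` is injectable so the gate can be exercised without real time passing.
  return function shouldShow(now = Date.now()) {
    if (now - lastShown < cooldownMs) return false; // still cooling down
    lastShown = now;
    return true;
  };
}
```

A pinch handler would call the gate first and only render the notification when it returns `true`, so a burst of rapid pinches produces a single toast.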
## Key Implementation Details
The application uses several key CSS styles for the XR cursor and the gallery cards:
- Hovering state: Green border (#4ade80), 40x40px size with subtle glow
- Clutch active: Orange background with 16x16px size
- Gallery cards: 260x360px with smooth scale transitions on hover
The MediaPipe Hands library is loaded via CDN and runs at 30+ FPS on modern browsers, enabling real-time tracking of 21 hand landmarks.
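To connect the tracking results to the virtual cursor, something like the helper below could sit inside the results callback. It is a sketch under assumptions: the post does not say which landmark drives the cursor, so using the index fingertip is a guess, as is mirroring the x axis so the cursor follows the hand like a mirror. The `cursorFromResults` name is hypothetical; `multiHandLandmarks` is the field the MediaPipe Hands JavaScript solution reports detected hands in.

```javascript
const INDEX_TIP = 8; // landmark index for the index fingertip

/**
 * Convert a MediaPipe Hands results object into viewport pixel coordinates.
 * Returns null when no hand is visible.
 */
function cursorFromResults(results, viewportWidth, viewportHeight) {
  const hands = results.multiHandLandmarks;
  if (!hands || hands.length === 0) return null;

  const tip = hands[0][INDEX_TIP]; // first detected hand only (assumption)
  return {
    x: (1 - tip.x) * viewportWidth, // mirrored so moving right moves the cursor right
    y: tip.y * viewportHeight,
  };
}
```

With the CDN build of MediaPipe Hands, this would typically be called from the function registered through `hands.onResults(...)`, passing `window.innerWidth` and `window.innerHeight` as the viewport size.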
## Conclusion
The web is changing. We aren't just building for screens anymore; we are building for cameras, sensors, and 3D spaces.
This project shows that with just a few hundred lines of JavaScript, we can create immersive, accessible, and futuristic interfaces today. You don't need to wait for the next VR headset; you just need a webcam and some creativity.
## 🔗 Try the Demo
Live Demo: XR Gallery Demo
The future of interaction is gestural. Start building it today.