I Tried Running FunctionGemma On-Device in a React Native App. Here's How It Went.

It started with a tweet. Google Devs posted a demo of FunctionGemma running a game, and I watched this tiny model parse natural language into structured function calls in real time. My immediate thought was: can I get this running in a React Native app?
Not because I had an app idea. I didn't. I just wanted to know if a 270M parameter model could do function calling on-device, inside a real mobile app, with no cloud API in the loop. A proof of concept. Could it actually work?
Finding a demo idea
I needed an app that would make the model do something useful. The constraint was that everything had to happen inside the app. No OS-level features like setting timers or sending messages. Just structured data going in and out of a local database.
I threw this at ChatGPT and got back a list: habit tracker, study planner, finance tracker, notes organizer, meal planner, form builder. I evaluated them on how well they'd fit structured function calling, how much reasoning the model would need, and how easy they'd be to demo. The finance tracker won because expense data is naturally structured. Amount, category, date, income or expense. It maps directly to function call parameters with almost no ambiguity.
That's it. The expense tracker isn't the point of this project. It's the vehicle. The real question was always about the model.
Why FunctionGemma
FunctionGemma is a specialized version of Gemma 3 at 270M parameters, trained specifically for function calling. That's important. I didn't need a chatbot. I needed a model that could hear "Uber was $18" and output something like:
<start_function_call>call:add_expense{amount:18,transaction_type:<escape>expense<escape>,category:<escape>transport<escape>,date:<escape>2026-03-14<escape>}<end_function_call>
String values wrapped in <escape> tokens, numbers left bare. Weird format, but once you parse it, it's reliable.
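Here is a rough TypeScript sketch of what parsing that output can look like. The real parser in this project lives in the Swift module, so treat the function name, the types, and the regexes below as illustrative assumptions, not the actual implementation. It accepts both <escape>-wrapped strings and bare values, and returns null when the model emits no call at all.

```typescript
type ArgValue = string | number;

interface FunctionCall {
  name: string;
  args: Record<string, ArgValue>;
}

// Hypothetical helper: extracts the first function call from raw model output.
function parseFunctionCall(output: string): FunctionCall | null {
  const callRe = /<start_function_call>call:(\w+)\{([\s\S]*?)\}<end_function_call>/;
  const match = callRe.exec(output);
  if (match === null) return null; // no tool call, e.g. small talk

  const name = match[1] ?? "";
  const body = match[2] ?? "";

  // One argument: a key, then either an <escape>-wrapped string or a bare token.
  const argRe = /(\w+)\s*:\s*(?:<escape>([\s\S]*?)<escape>|([^,{}]+))/g;
  const args: Record<string, ArgValue> = {};
  let m: RegExpExecArray | null;
  while ((m = argRe.exec(body)) !== null) {
    const key = m[1];
    if (key === undefined) continue;
    if (m[2] !== undefined) {
      args[key] = m[2]; // escaped values are always strings
    } else {
      const bare = (m[3] ?? "").trim();
      const asNumber = Number(bare);
      // Bare numeric tokens become numbers; anything else stays a string.
      args[key] = bare !== "" && isFinite(asNumber) ? asNumber : bare;
    }
  }
  return { name, args };
}
```

Returning null instead of throwing keeps the "model did nothing" case cheap to handle: the caller can show a brief error and reset.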
At fp16 the model is about 540MB. Not tiny for a mobile download, but totally fine as a one-time cost. And that size matters, because I tried quantizing to 4-bit (about 180MB) and the model completely fell apart. It started confusing its own tool schema with the output format, spitting out field names like description and type instead of actual values like 45 and food. At 270M parameters, every weight matters too much for aggressive quantization. fp16 or nothing.
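The fp16 figure follows directly from the parameter count: two bytes per weight. A quick back-of-the-envelope helper (the function name is made up for illustration) makes the arithmetic explicit:

```typescript
// Raw weight storage in megabytes (1 MB = 1,000,000 bytes), ignoring any file overhead.
function weightSizeMB(paramCount: number, bitsPerWeight: number): number {
  return (paramCount * bitsPerWeight) / 8 / 1e6;
}

const fp16MB = weightSizeMB(270e6, 16); // 540
const int4MB = weightSizeMB(270e6, 4); // 135
```

The raw 4-bit number comes out below the roughly 180MB file mentioned above. Presumably the difference is quantization metadata such as per-group scales, plus any tensors kept at higher precision, but that part is an assumption and not something measured here.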
The stack
The app is React Native with Expo, running on iOS. For on-device inference I used Apple's MLX framework through a custom Expo native module I built in Swift. The module handles the full lifecycle: downloading the model from HuggingFace, loading it into memory, running inference, and parsing tool calls from the output.
Voice input uses expo-speech-recognition with on-device recognition forced on, so even the speech-to-text never hits a server. SQLite stores everything locally. The entire pipeline from voice to database runs without a single network request.
Training was easier than I expected
This was my first time fine-tuning a model and pushing it to HuggingFace. I honestly expected it to take days. It took about two hours.
I wrote around 1,000 training examples covering two tools: add_expense and query_expenses. Twenty examples per category across 16 categories, plus income variations, relative dates ("paid rent yesterday"), and about 50 no-op examples where the model should do nothing. Those no-ops were critical. Without them, the model tries to turn "hello, how are you?" into an expense.
Training ran on Google Colab with HuggingFace's TRL library. Standard supervised fine-tuning, nothing exotic. Claude Code helped me structure the dataset and get the training script right. I was genuinely surprised how straightforward it was. The tooling around this stuff has gotten really good.
But here's the thing. My first training run produced garbage. The model would kind of work but mostly output wrong parameters or malformed calls. I spent a while debugging the model itself before realizing the problem was my training data format. The examples weren't matching FunctionGemma's expected token structure. Once I fixed the format to match what apply_chat_template actually produces, retrained, and pushed to HuggingFace, everything clicked.
The moment it actually worked
After pushing the retrained model, I opened the app on my phone and started testing. The results were wrong. Still broken. I sat there confused for a few minutes before realizing I'd forgotten to delete the app and reinstall. The old, badly-trained model was still cached on the device.
Fresh install. Launched the app. Said "add $10 coffee." Transaction appeared. "Spent $25 on dinner." There it is. "Got paid $3000 salary." Income, correct category. Five add operations back to back, all correct.
That was the moment. Not a gradual improvement. Just suddenly, five in a row, all working. On-device, no server, sub-second. I genuinely said "wow" out loud.
The hard parts
Before I trained the model, when I was still using the base FunctionGemma with just prompt engineering, the results were shaky enough that I started questioning whether I'd picked the wrong app idea. Delete and update operations are really hard to do with voice-only interaction. How do you say "delete the coffee from yesterday" when you have three coffees from yesterday? The model can't show you options and ask you to pick.
I almost pivoted to a completely different demo. Instead I scaled back the scope. Keep only add_expense and query_expenses as function calls. Handle delete and update through normal touch UI. Voice for intent, touch for precision. That hybrid approach saved the project.
The other hard lesson was about prompt formatting. Swift dictionaries don't guarantee iteration order. When I dynamically built the tool declaration from a JSON schema, property keys would shuffle randomly. The model had seen properties in a specific order during training, and any deviation meant it would silently fail. No error, no crash, just no tool call.
The fix was unglamorous. I hardcoded the entire tool declaration as a string literal in Swift. Both tools, every parameter, every <escape> token. It's not pretty, but the model sees exactly what it was trained on, every time.
let fullPrompt = "<start_of_turn>developer\nYou are an expense tracking assistant..."
    + toolDecls // hardcoded, order-preserved declaration string
    + "<end_of_turn>\n<start_of_turn>user\n"
    + prompt
    + "<end_of_turn>\n<start_of_turn>model\n"
Temperature is set to 0.0 for the same reason. Higher values cause this small model to mix up declaration format with call format. Greedy decoding, maximum reliability. That's the tradeoff you make at 270M parameters.
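To make "greedy" concrete, here is a toy TypeScript sketch of token selection. It is not how MLX implements sampling internally; the function names and the injectable random source are assumptions for illustration. At temperature 0 the choice is a plain argmax, so the same prompt always yields the same tokens. Any positive temperature turns the logits into a probability distribution and lets lower-ranked tokens through.

```typescript
// Index of the largest logit (first one wins on ties).
function argmax(logits: number[]): number {
  let best = 0;
  for (let i = 1; i < logits.length; i++) {
    if ((logits[i] ?? -Infinity) > (logits[best] ?? -Infinity)) best = i;
  }
  return best;
}

// Picks a token index. `random` is injectable so the behavior can be tested.
function pickToken(
  logits: number[],
  temperature: number,
  random: () => number = Math.random
): number {
  if (temperature <= 0) return argmax(logits); // greedy decoding

  // Softmax over temperature-scaled logits, shifted by the max for stability.
  const scaled = logits.map((l) => l / temperature);
  const max = Math.max(...scaled);
  const weights = scaled.map((s) => Math.exp(s - max));
  const total = weights.reduce((a, b) => a + b, 0);

  // Walk the cumulative distribution until the random draw is used up.
  let r = random() * total;
  for (let i = 0; i < weights.length; i++) {
    r -= weights[i] ?? 0;
    if (r < 0) return i;
  }
  return weights.length - 1;
}
```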
What the demo actually does
You open the app, say "spent $12 on coffee," and a transaction row appears in a date-grouped list. The voice pill at the bottom shows live state: idle, listening with a live transcript, processing with the model, then done. If the model doesn't understand your input, you get a brief error and it resets.
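The pill's lifecycle is a small state machine. The sketch below is one way to model it in TypeScript; the state and event names are invented for illustration and are not taken from the app's code.

```typescript
type PillState =
  | { kind: "idle" }
  | { kind: "listening"; transcript: string }
  | { kind: "processing"; transcript: string }
  | { kind: "done" }
  | { kind: "error"; message: string };

type PillEvent =
  | { type: "start" }
  | { type: "transcript"; text: string }
  | { type: "stop" }
  | { type: "success" }
  | { type: "failure"; message: string }
  | { type: "reset" };

// Pure transition function: events that make no sense in a state are ignored.
function nextState(state: PillState, event: PillEvent): PillState {
  switch (state.kind) {
    case "idle":
      return event.type === "start" ? { kind: "listening", transcript: "" } : state;
    case "listening":
      if (event.type === "transcript") return { kind: "listening", transcript: event.text };
      if (event.type === "stop") return { kind: "processing", transcript: state.transcript };
      return state;
    case "processing":
      if (event.type === "success") return { kind: "done" };
      if (event.type === "failure") return { kind: "error", message: event.message };
      return state;
    case "done":
    case "error":
      return event.type === "reset" ? { kind: "idle" } : state;
  }
}
```

Keeping the transitions in one pure function means the UI only has to render whatever state it is handed, and a stray event (say, a late transcript update during processing) cannot put the pill in an impossible state.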
You can say "show my food expenses" and a modal pops up with filtered results, transaction count, and total.
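The query path boils down to filter, count, and sum. In the app that work is done against SQLite; the in-memory TypeScript version below is only a sketch of the same logic, with hypothetical type and field names.

```typescript
interface Transaction {
  amount: number;
  transactionType: "expense" | "income";
  category: string;
  date: string; // ISO yyyy-mm-dd
}

interface QueryFilter {
  category?: string;
  transactionType?: "expense" | "income";
}

interface QuerySummary {
  matches: Transaction[];
  count: number;
  total: number;
}

// Applies whichever filter fields are set, then totals the matching rows.
function summarize(transactions: Transaction[], filter: QueryFilter): QuerySummary {
  const matches = transactions.filter(
    (t) =>
      (filter.category === undefined || t.category === filter.category) &&
      (filter.transactionType === undefined || t.transactionType === filter.transactionType)
  );
  const total = matches.reduce((sum, t) => sum + t.amount, 0);
  return { matches, count: matches.length, total };
}
```

A parsed query_expenses call with category food would map to summarize(rows, { category: "food" }), and the modal renders the three fields of the result.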
Relative dates work. "Paid rent yesterday" gets the right date because the prompt builder injects context like "Today's date is 2026-03-28. Yesterday was 2026-03-27." The model doesn't need to know what day it is. The prompt just tells it.
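Building that context string takes a few lines. The TypeScript sketch below shows the idea; the function names are assumptions, and the actual prompt builder in this project is part of the Swift module. It uses the device's local calendar date, since "yesterday" should mean the user's yesterday, not UTC's.

```typescript
// Formats a Date as yyyy-mm-dd using local time.
function formatLocalDate(d: Date): string {
  const pad = (n: number): string => (n < 10 ? "0" + n : String(n));
  return d.getFullYear() + "-" + pad(d.getMonth() + 1) + "-" + pad(d.getDate());
}

// Produces the date sentence that gets injected into the prompt.
function buildDateContext(now: Date): string {
  // Constructing from calendar components lets Date handle month and year rollover.
  const yesterday = new Date(now.getFullYear(), now.getMonth(), now.getDate() - 1);
  return (
    "Today's date is " + formatLocalDate(now) + ". " +
    "Yesterday was " + formatLocalDate(yesterday) + "."
  );
}
```

Subtracting a calendar day through the Date constructor, instead of subtracting 24 hours in milliseconds, avoids off-by-one results around daylight saving changes.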
It's not perfect. Spelled-out amounts like "fifty dollars" sometimes fail because the model needs digits. It sometimes echoes your exact words as the category ("dinner" instead of "food"). And it's single-turn only, so you can't have a back-and-forth conversation. But for direct commands, it works reliably.
What FunctionGemma is actually good at
To be clear about what this model is designed for: simple, focused tool calls. The official docs show examples like "what's the weather in Tokyo" mapping to get_current_weather, or "check Google's stock price" mapping to get_stock_price. Single turn, one intent, one function call. That's the sweet spot.
It's not going to replace a cloud LLM for anything complex. But for that narrow pattern of "user says a thing, app calls a function," it works surprisingly well on-device. My expense tracker is one example. A voice-controlled smart home toggle, a quick note tagger, a workout logger. Anything where the user's intent maps cleanly to a function with a few parameters. That's where a model like this fits.
What I'd tell another mobile dev
If you're curious about on-device inference, just try it. That's the real takeaway. I went from seeing a tweet to having a working proof of concept, and the hardest parts weren't the things I expected. Training was fast. MLX Swift was pleasant. The tooling around HuggingFace and fine-tuning has gotten genuinely good.
The hard parts were small and specific: getting the prompt format to match training exactly, figuring out that quantization breaks small models, realizing my training data was in the wrong format. Stuff you solve once and move on.
A 270M model doing function calling on a phone felt like science fiction not that long ago. Now it's a weekend project. The barrier to trying this stuff is way lower than most mobile devs think.
The full code is on GitHub.



