Why Your Family Document AI Won't Use Your Data for Training
The most common question we hear from families considering AI document tools: "Will my data be used to train AI models?" It's a fair question -- and for most AI products, the answer is complicated.
At Archevi, the answer is simple: no, never, by architecture.
This isn't a policy decision we could reverse tomorrow. It's an architectural guarantee. Our privacy layer anonymizes your data before it ever reaches an AI provider, making training on real data structurally impossible.
The Industry Standard (And Why It's Not Enough)
Most AI-powered products follow a familiar pattern: your data trains the model by default, with an opt-out buried in settings. OpenAI's ChatGPT, for example, trains on consumer conversations unless you opt out in your data controls or move to a business tier. Google's Gemini follows a similar pattern.
Even when providers offer opt-out controls, two problems remain:
- Opt-out isn't retroactive -- data you submitted before finding the toggle may already be in a training dataset
- You're trusting policy, not architecture -- a policy can change with the next terms-of-service update
How Archevi Is Different: Three Layers of Protection
Instead of relying on policy alone, Archevi uses three independent layers to ensure your data never trains any AI model.
Layer 1: No-Training Providers
We chose AI providers with contractual commitments against training on customer data:
- Groq processes our language queries using open-source Llama models under a zero-retention policy -- your query is processed and immediately discarded (a minimal call sketch follows below).
- Cohere powers our semantic search. It's SOC 2 Type II certified with explicit contractual no-training guarantees. Both providers are audited regularly.
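Here's roughly what a zero-retention inference call looks like, assuming the official `groq` Python SDK. The model name and query are illustrative, not our production configuration:

```python
# Minimal sketch of a zero-retention inference call via the Groq API.
# Assumes the official `groq` SDK; model name and query are illustrative.
import os

from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

# By this point the query text is already anonymized (see Layer 2):
# only surrogate names and details ever reach the provider.
completion = client.chat.completions.create(
    model="llama-3.3-70b-versatile",  # an open-source Llama model hosted by Groq
    messages=[
        {"role": "user", "content": "When does Jordan Miles's home insurance renew?"}
    ],
)
print(completion.choices[0].message.content)
```

Groq processes the request and discards it; nothing is retained for training.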
But we don't stop there. What if a provider changes their terms? What if there's a breach? We built a second layer.
Layer 2: Boundary Anonymization
Before any query reaches an AI provider, we automatically detect and replace personal information with realistic surrogates. Names become different names. Emails become different emails. Locations become different locations.
This means even in a worst-case scenario -- if a provider broke every commitment -- they would only have synthetic surrogate data. There is nothing identifiable to train on.
Our anonymization engine uses Microsoft Presidio, the open-source PII detection and anonymization framework enterprises use for data loss prevention. It detects names, emails, phone numbers, locations, and organizations in real time.
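For the technically curious, here's a minimal sketch of that boundary using Presidio's analyzer and anonymizer, with the Faker library standing in as the surrogate generator -- our production surrogate logic is internal and differs in detail:

```python
# Minimal sketch of boundary anonymization with Microsoft Presidio.
# Faker stands in for Archevi's surrogate generator; example text is made up.
from faker import Faker
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

fake = Faker()
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

text = "Email sarah.chen@example.com about Sarah Chen's cottage in Muskoka."

# Detect PII entities (names, emails, locations, ...) in the query.
results = analyzer.analyze(text=text, language="en")

# Replace each detected entity with a realistic surrogate of the same type.
anonymized = anonymizer.anonymize(
    text=text,
    analyzer_results=results,
    operators={
        "PERSON": OperatorConfig("custom", {"lambda": lambda _: fake.name()}),
        "EMAIL_ADDRESS": OperatorConfig("custom", {"lambda": lambda _: fake.email()}),
        "LOCATION": OperatorConfig("custom", {"lambda": lambda _: fake.city()}),
    },
)
print(anonymized.text)
# e.g. "Email wjohnson@example.org about Dana Reyes's cottage in Kamloops."
```

One real-world wrinkle this sketch skips: surrogates need to stay consistent within a conversation, so the same real name always maps to the same surrogate and follow-up questions still make sense.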
Layer 3: Hard Redaction
Highly sensitive data -- Social Insurance Numbers, credit card numbers, bank account numbers -- is never sent anywhere, not even as surrogates. If our system detects this data in your query, the query is blocked entirely before it reaches any external service.
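A simplified sketch of what that check can look like -- the regex patterns and Luhn checksum are standard, while the blocking behaviour shown is a stand-in for our actual pipeline:

```python
# Illustrative hard-redaction check: queries containing a Canadian SIN
# or a credit card number are blocked before any external call is made.
import re

SIN_PATTERN = re.compile(r"\b\d{3}[- ]?\d{3}[- ]?\d{3}\b")
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){13,19}\b")

def luhn_valid(digits: str) -> bool:
    """Checksum shared by credit card numbers and Canadian SINs."""
    total, parity = 0, len(digits) % 2
    for i, ch in enumerate(digits):
        d = int(ch)
        if i % 2 == parity:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def must_block(query: str) -> bool:
    """True if the query contains data that must never leave the system."""
    for pattern in (SIN_PATTERN, CARD_PATTERN):
        for match in pattern.finditer(query):
            digits = re.sub(r"\D", "", match.group())
            if luhn_valid(digits):
                return True
    return False

if must_block("My SIN is 046-454-286"):  # a well-known test SIN, not a real one
    print("Query blocked: sensitive identifier detected.")
```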
The Defence-in-Depth Approach
Here's what makes this architecture powerful: each layer works independently.
- If provider commitments hold (expected), your anonymized data isn't trained on -- and it wouldn't be identifiable anyway.
- If provider commitments fail (unlikely), they only have surrogate data with no connection to real people.
- If anonymization missed something (edge case), provider commitments still prevent training.
No single layer needs to be perfect. The combination makes real-world data exposure extremely unlikely.
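Put together in code, the request path looks something like this -- a hypothetical sketch where every function name is illustrative, standing in for the earlier sketches rather than our internal API:

```python
# Hypothetical composition of the three layers on each request.
# All function bodies are placeholders for the earlier sketches.

def must_block(query: str) -> bool:                # Layer 3: hard redaction
    return "046-454-286" in query                  # placeholder detector

def anonymize(query: str) -> str:                  # Layer 2: Presidio surrogates
    return query.replace("Sarah Chen", "Dana Reyes")  # placeholder replacement

def call_no_training_provider(query: str) -> str:  # Layer 1: Groq/Cohere call
    return f"answer for: {query}"                  # placeholder provider call

def process_query(query: str) -> str:
    # Hard redaction runs first: block outright, nothing leaves the system.
    if must_block(query):
        raise PermissionError("Blocked: query contains hard-redacted data.")
    # Anonymization next: only surrogate identities cross this boundary.
    safe_query = anonymize(query)
    # Finally, the no-training provider sees surrogate data only.
    return call_no_training_provider(safe_query)

print(process_query("When does Sarah Chen's home insurance renew?"))
```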
Want to verify any AI provider's training practices? Check three things: (1) their Terms of Service for data usage clauses, (2) whether they hold SOC 2 or equivalent certification, (3) whether their API terms differ from their consumer product terms (they often do).
Your Documents Never Leave Canada
Your uploaded documents are stored on Canadian infrastructure under PIPEDA (Canada's federal privacy law). The documents themselves are never sent to AI providers. Only anonymized query text and anonymized document snippets are sent for processing -- and those contain surrogates, not your real information.
How This Compares
| Feature | Archevi | ChatGPT (Free) | Google Gemini |
|---|---|---|---|
| Trains on your data | Never | Yes (default) | Yes (default) |
| Opt-out available | N/A (not needed) | Yes (manual) | Yes (manual) |
| Anonymized before AI | Yes (automatic) | No | No |
| Hard redaction for SIN/credit cards | Yes | No | No |
| Data residency | Canada | US | US |
| No-training contract | Yes | Paid tiers only | Enterprise only |
Bottom Line
If you're storing sensitive family documents -- insurance, financial, medical, legal -- the AI that processes them should be designed so it can't expose your data, not just won't. That's the difference between policy and architecture.
Ready to try it? Start a free 14-day trial or read more about our security architecture.
For the full walkthrough of how our anonymization works in practice, see how Archevi protects your family's privacy.
Related Posts
How Archevi Protects Your Family's Privacy
Privacy isn't just a feature at Archevi -- it's the architecture. Learn how boundary anonymization, hard redaction, and Canadian data residency protect your family's sensitive documents from AI exposure.
Archevi vs. Google Drive: Why Families Need More Than Storage
Google Drive is great for storing files, but managing a family's important documents requires more than just storage. Compare AI search, privacy, expiry tracking, and family features side by side.
Why We Self-Host Everything on One Server
Most startups spread their stack across a dozen SaaS platforms. We put everything -- website, CMS, database, analytics, and AI pipeline -- on a single server. Here's why, and what it actually costs us in ways that aren't just money.