Why Your Family Document AI Won't Use Your Data for Training
The most common question we hear from families considering AI document tools: "Will my data be used to train AI models?" It's a fair question -- and for most AI products, the answer is complicated.
At Archevi, the answer is simple: no, never, by architecture.
This isn't a policy decision we could reverse tomorrow. It's an architectural guarantee. Our privacy layer anonymizes your data before it ever reaches an AI provider, making training on real data structurally impossible.
The Industry Standard (And Why It's Not Enough)
Most AI-powered products follow a familiar pattern: your data trains the model by default, with an opt-out buried in settings. OpenAI's ChatGPT, for example, trains on consumer conversations unless you opt out in your data controls or move to a business tier. Google's Gemini follows a similar pattern.
Even when providers offer opt-out controls, two problems remain:
- Opt-out isn't retroactive -- data you submitted before finding the toggle may already be in a training dataset
- You're trusting policy, not architecture -- a policy can change with the next terms-of-service update
How Archevi Is Different: Three Layers of Protection
Instead of relying on policy alone, Archevi uses three independent layers to ensure your data never trains any AI model.
Layer 1: No-Training Providers
We chose AI providers with contractual commitments against training on customer data:
- Groq processes our language queries using open-source Llama models under a zero-retention policy -- your query is processed and immediately discarded (a minimal call sketch follows below).
- Cohere powers our semantic search. It's SOC 2 Type II certified with explicit contractual no-training guarantees. Both providers are audited regularly.
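Here's roughly what a zero-retention inference call looks like, assuming the official `groq` Python SDK. The model name and query are illustrative, not our production configuration:

```python
# Minimal sketch of a zero-retention inference call via the Groq API.
# Assumes the official `groq` SDK; model name and query are illustrative.
import os

from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

# By this point the query text is already anonymized (see Layer 2):
# only surrogate names and details ever reach the provider.
completion = client.chat.completions.create(
    model="llama-3.3-70b-versatile",  # an open-source Llama model hosted by Groq
    messages=[
        {"role": "user", "content": "When does Jordan Miles's home insurance renew?"}
    ],
)
print(completion.choices[0].message.content)
```

Groq processes the request and discards it; nothing is retained for training.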
But we don't stop there. What if a provider changes their terms? What if there's a breach? We built a second layer.
Layer 2: Boundary Anonymization
Before any query reaches an AI provider, we automatically detect and replace personal information with realistic surrogates. Names become different names. Emails become different emails. Locations become different locations.
This means even in a worst-case scenario -- if a provider broke every commitment -- they would only have synthetic surrogate data. There is nothing identifiable to train on.
Our anonymization engine uses Microsoft Presidio, the open-source PII detection and anonymization framework enterprises use for data loss prevention. It detects names, emails, phone numbers, locations, and organizations in real time.
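For the technically curious, here's a minimal sketch of that boundary using Presidio's analyzer and anonymizer, with the Faker library standing in as the surrogate generator -- our production surrogate logic is internal and differs in detail:

```python
# Minimal sketch of boundary anonymization with Microsoft Presidio.
# Faker stands in for Archevi's surrogate generator; example text is made up.
from faker import Faker
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

fake = Faker()
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

text = "Email sarah.chen@example.com about Sarah Chen's cottage in Muskoka."

# Detect PII entities (names, emails, locations, ...) in the query.
results = analyzer.analyze(text=text, language="en")

# Replace each detected entity with a realistic surrogate of the same type.
anonymized = anonymizer.anonymize(
    text=text,
    analyzer_results=results,
    operators={
        "PERSON": OperatorConfig("custom", {"lambda": lambda _: fake.name()}),
        "EMAIL_ADDRESS": OperatorConfig("custom", {"lambda": lambda _: fake.email()}),
        "LOCATION": OperatorConfig("custom", {"lambda": lambda _: fake.city()}),
    },
)
print(anonymized.text)
# e.g. "Email wjohnson@example.org about Dana Reyes's cottage in Kamloops."
```

One real-world wrinkle this sketch skips: surrogates need to stay consistent within a conversation, so the same real name always maps to the same surrogate and follow-up questions still make sense.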
Layer 3: Hard Redaction
Highly sensitive data -- Social Insurance Numbers, credit card numbers, bank account numbers -- is never sent anywhere, not even as surrogates. If our system detects this data in your query, the query is blocked entirely before it reaches any external service.
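A simplified sketch of what that check can look like -- the regex patterns and Luhn checksum are standard, while the blocking behaviour shown is a stand-in for our actual pipeline:

```python
# Illustrative hard-redaction check: queries containing a Canadian SIN
# or a credit card number are blocked before any external call is made.
import re

SIN_PATTERN = re.compile(r"\b\d{3}[- ]?\d{3}[- ]?\d{3}\b")
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){13,19}\b")

def luhn_valid(digits: str) -> bool:
    """Checksum shared by credit card numbers and Canadian SINs."""
    total, parity = 0, len(digits) % 2
    for i, ch in enumerate(digits):
        d = int(ch)
        if i % 2 == parity:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def must_block(query: str) -> bool:
    """True if the query contains data that must never leave the system."""
    for pattern in (SIN_PATTERN, CARD_PATTERN):
        for match in pattern.finditer(query):
            digits = re.sub(r"\D", "", match.group())
            if luhn_valid(digits):
                return True
    return False

if must_block("My SIN is 046-454-286"):  # a well-known test SIN, not a real one
    print("Query blocked: sensitive identifier detected.")
```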
The Defence-in-Depth Approach
Here's what makes this architecture powerful: each layer works independently.
- If provider commitments hold (expected), your anonymized data isn't trained on -- and it wouldn't be identifiable anyway.
- If provider commitments fail (unlikely), they only have surrogate data with no connection to real people.
- If anonymization missed something (edge case), provider commitments still prevent training.
No single layer needs to be perfect. The combination makes real-world data exposure extremely unlikely.
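Put together in code, the request path looks something like this -- a hypothetical sketch where every function name is illustrative, standing in for the earlier sketches rather than our internal API:

```python
# Hypothetical composition of the three layers on each request.
# All function bodies are placeholders for the earlier sketches.

def must_block(query: str) -> bool:                # Layer 3: hard redaction
    return "046-454-286" in query                  # placeholder detector

def anonymize(query: str) -> str:                  # Layer 2: Presidio surrogates
    return query.replace("Sarah Chen", "Dana Reyes")  # placeholder replacement

def call_no_training_provider(query: str) -> str:  # Layer 1: Groq/Cohere call
    return f"answer for: {query}"                  # placeholder provider call

def process_query(query: str) -> str:
    # Hard redaction runs first: block outright, nothing leaves the system.
    if must_block(query):
        raise PermissionError("Blocked: query contains hard-redacted data.")
    # Anonymization next: only surrogate identities cross this boundary.
    safe_query = anonymize(query)
    # Finally, the no-training provider sees surrogate data only.
    return call_no_training_provider(safe_query)

print(process_query("When does Sarah Chen's home insurance renew?"))
```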
Want to verify any AI provider's training practices? Check three things: (1) their Terms of Service for data usage clauses, (2) whether they hold SOC 2 or equivalent certification, (3) whether their API terms differ from their consumer product terms (they often do).
Your Documents Never Leave Canada
Your uploaded documents are stored on Canadian infrastructure under PIPEDA (Canada's federal privacy law). The documents themselves are never sent to AI providers. Only anonymized query text and anonymized document snippets are sent for processing -- and those contain surrogates, not your real information.
How This Compares
| Feature | Archevi | ChatGPT (Free) | Google Gemini |
|---|---|---|---|
| Trains on your data | Never | Yes (default) | Yes (default) |
| Opt-out available | N/A (not needed) | Yes (manual) | Yes (manual) |
| Anonymized before AI | Yes (automatic) | No | No |
| Hard redaction for SIN/credit cards | Yes | No | No |
| Data residency | Canada | US | US |
| No-training contract | Yes | Paid tiers only | Enterprise only |
Bottom Line
If you're storing sensitive family documents -- insurance, financial, medical, legal -- the AI that processes them should be designed so it can't expose your data, not just won't. That's the difference between policy and architecture.
Ready to try it? Start a free 14-day trial or read more about our security architecture.
For the full walkthrough of how our anonymization works in practice, see how Archevi protects your family's privacy.
Related Posts
How Archevi Protects Your Family's Privacy
Privacy isn't just a feature at Archevi -- it's the architecture. Learn how boundary anonymization, hard redaction, and Canadian data residency protect your family's sensitive documents from AI exposure.
Archevi vs. Google Drive: Why Families Need More Than Storage
Google Drive is great for storing files, but managing a family's important documents requires more than just storage. Compare AI search, privacy, expiry tracking, and family features side by side.
Why We Self-Host Everything on One Server
Most startups spread their stack across a dozen SaaS platforms. We put everything -- website, CMS, database, analytics, and AI pipeline -- on a single server. Here's why, and what it actually costs us in ways that aren't just money.