DNA-AI is a local, privacy-focused bioinformatics tool that analyzes raw DNA data (from 23andMe, AncestryDNA, etc.) against the NCBI ClinVar database. It combines deterministic data matching with a local Large Language Model (Llama 3) to explain health risks in plain English.
- 100% Private & Offline: Your DNA data never leaves your computer. The AI runs locally.
- Robust Matching: Matches user DNA against the ClinVar database using Chromosome and Position.
- Smart Filtering:
  - Zygosity Detection: Distinguishes between Carriers (1 Copy) and Affected (2 Copies).
  - Strand Flip Protection: Automatically detects and hides ~90% of false positives caused by reverse-strand sequencing errors (e.g., A>T or C>G palindromes).
  - Strict Mode: Hides variants with "Conflicting interpretations" among scientists.
- AI Geneticist: Chat with a local Llama 3 AI to ask questions about your specific results.
- Universal Loader: Handles .txt, .csv, .zip, and .gz files automatically.
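The loader behavior listed above could look roughly like the following. This is only a sketch built on the Python standard library, not the project's actual code: the function names `open_text`, `parse_raw_dna`, and `load_raw_dna` are hypothetical, and it assumes a four-column rsid, chromosome, position, genotype layout with `#` comment lines.

```python
import gzip
import io
import zipfile
from pathlib import Path
from typing import Iterable, List


def open_text(path: str):
    """Return a text-mode file object for a .txt, .csv, .gz, or .zip file."""
    suffix = Path(path).suffix.lower()
    if suffix == ".gz":
        return gzip.open(path, "rt", encoding="utf-8")
    if suffix == ".zip":
        # Assumption: the archive holds a single data file, so read the first member.
        with zipfile.ZipFile(path) as archive:
            data = archive.read(archive.namelist()[0])
        return io.StringIO(data.decode("utf-8"))
    return open(path, "r", encoding="utf-8")


def parse_raw_dna(lines: Iterable[str]) -> List[dict]:
    """Parse rsid, chromosome, position, genotype rows; skip comments and headers."""
    records = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        delimiter = "\t" if "\t" in line else ","
        fields = [field.strip().strip('"') for field in line.split(delimiter)]
        if len(fields) < 4 or not fields[2].isdigit():
            continue  # header row or malformed line
        rsid, chromosome, position, genotype = fields[:4]
        records.append({
            "rsid": rsid,
            "chromosome": chromosome,
            "position": int(position),
            "genotype": genotype,
        })
    return records


def load_raw_dna(path: str) -> List[dict]:
    """Open any supported file type and return its parsed rows."""
    with open_text(path) as handle:
        return parse_raw_dna(handle)
```

Files that split the genotype into two allele columns would need an extra step to join them before this parser applies.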
Before running the program, ensure you have the following installed:
- Python 3.10+
- Ollama: Required for the AI features. Download it from the Ollama website.
Open your terminal/command prompt and run:
ollama pull llama3

It is recommended to use a virtual environment.
# Create venv
python -m venv venv
# Activate venv (Windows)
venv\Scripts\activate
# Activate venv (macOS/Linux)
source venv/bin/activate
# Install requirements
pip install streamlit pandas langchain langchain-community langchain-chroma langchain-ollama pypdf

You need two files to run this program.
You need the "Variant Summary" file from NCBI.
- Download Link: variant_summary.txt.gz
- Location: NCBI FTP Site
- Note: Do NOT unzip this file. The program reads the .gz file directly to save space.
- Download your "Raw Data" from 23andMe, AncestryDNA, or MyHeritage.
- The file should look like a text list of rsid, chromosome, position, genotype.
- Open your terminal in the project folder.
- Activate your virtual environment (if used).
- Run the Streamlit app:
streamlit run dna_chat_app.py --server.maxUploadSize=2000

- ⚠️ Confirmed Risks Only: Hides "Benign" and "Uncertain" results.
- 🔥 Strict Mode: Hides results where labs disagree (e.g., one lab says "Pathogenic" but another says "Benign").
- 🧬 Hide Strand Ambiguity: (Recommended: ON). Hides ambiguous "Palindrome" mutations (A↔T, C↔G) which are often technical errors in 23andMe data.
- Heterozygous (1 Copy / Carrier): You have one normal copy and one mutated copy of the gene. For recessive conditions, you are usually healthy but can pass the variant on to your children.
- Homozygous (2 Copies) ⚠️: You have two mutated copies of the gene. This is a significant finding and warrants further investigation or discussion with a professional.
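The distinction between one and two copies can be illustrated by counting how often the flagged allele appears in a genotype string. This is a minimal sketch, not the project's code: `classify_zygosity` is a hypothetical helper, and it assumes a two-letter genotype (such as AG) and a single-letter risk allele taken from the ClinVar record.

```python
def classify_zygosity(genotype: str, risk_allele: str) -> str:
    """Label a genotype by how many copies of the risk allele it contains."""
    copies = genotype.upper().count(risk_allele.upper())
    if copies >= 2:
        return "Homozygous (2 Copies)"
    if copies == 1:
        return "Heterozygous (1 Copy / Carrier)"
    # Covers genotypes without the allele and no-calls such as "--".
    return "Not detected"


for genotype in ("AG", "AA", "GG", "--"):
    print(genotype, "->", classify_zygosity(genotype, "A"))
```

Single-letter genotypes, which some files report for the X and Y chromosomes in males, do not fit this two-copy model and would need separate handling.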
"Found 0 matches"
- Ensure your DNA file uses Build 37 (GRCh37/hg19) coordinates (standard for 23andMe/Ancestry).
- If using very new clinical data (Build 38), the positions will not match ClinVar.
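Because matching is done on chromosome and position, both files have to use the same genome build. The sketch below shows one way this could work with the standard library; it is an illustration, not the project's implementation. The function names are hypothetical, and the column names (`Assembly`, `Chromosome`, `Start`, `ClinicalSignificance`) reflect my understanding of the variant_summary layout and should be checked against the header of your download.

```python
import csv
import gzip
from typing import Dict, Iterable, Iterator, List, Tuple

Key = Tuple[str, int]  # (chromosome, position)


def iter_clinvar_rows(path: str, assembly: str = "GRCh37") -> Iterator[dict]:
    """Stream rows for one genome build straight from the .gz file."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            start = row.get("Start") or ""
            # Skip rows from other builds and rows without a numeric position.
            if row.get("Assembly") != assembly or not start.isdigit():
                continue
            yield {
                "chromosome": row["Chromosome"],
                "position": int(start),
                "significance": row["ClinicalSignificance"],
            }


def build_index(rows: Iterable[dict]) -> Dict[Key, List[dict]]:
    """Group ClinVar rows by (chromosome, position) for fast lookup."""
    index: Dict[Key, List[dict]] = {}
    for row in rows:
        index.setdefault((row["chromosome"], row["position"]), []).append(row)
    return index


def match_variants(user_rows: Iterable[dict], index: Dict[Key, List[dict]]) -> List[dict]:
    """Return one merged record per ClinVar row found at a user's position."""
    matches = []
    for user in user_rows:
        key = (str(user["chromosome"]), int(user["position"]))
        for record in index.get(key, []):
            matches.append({**record, "rsid": user["rsid"], "genotype": user["genotype"]})
    return matches
```

A user file in Build 38 coordinates looked up in an index built from GRCh37 rows will mostly miss, which is one way to end up with zero matches.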
"Found 30,000 matches"
- This is normal before filtering. Turn on "Confirmed Risks Only" and "Hide Strand Ambiguity".
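To make the effect of these filters concrete, here is a small sketch of how they might be applied to matched variants. It is an assumption-laden illustration rather than the project's code: `is_palindromic` and `apply_filters` are hypothetical names, and each variant is assumed to be a dict with `significance`, `ref`, and `alt` keys. An A/T or C/G variant is its own complement, so the data alone cannot show which strand was reported.

```python
from typing import Iterable, List

# Allele pairs that are their own reverse complement.
PALINDROMIC_PAIRS = {frozenset("AT"), frozenset("CG")}


def is_palindromic(ref: str, alt: str) -> bool:
    """True for A<->T and C<->G single-base changes."""
    return frozenset((ref.upper(), alt.upper())) in PALINDROMIC_PAIRS


def apply_filters(
    variants: Iterable[dict],
    confirmed_only: bool = True,
    strict: bool = True,
    hide_strand_ambiguity: bool = True,
) -> List[dict]:
    """Drop variants according to the three toggles."""
    kept = []
    for variant in variants:
        significance = variant["significance"].lower()
        if confirmed_only and ("benign" in significance or "uncertain" in significance):
            continue
        if strict and "conflicting" in significance:
            continue
        if hide_strand_ambiguity and is_palindromic(variant["ref"], variant["alt"]):
            continue
        kept.append(variant)
    return kept
```

The substring checks are a simplification; exact label matching against the ClinVar significance values would be stricter.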
"The AI isn't responding"
- Make sure Ollama is running in the background. Open a separate terminal and run the command ollama serve.
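A quick way to check whether the Ollama server is reachable is to query its local HTTP API. The sketch below assumes Ollama's default address, http://localhost:11434, and its model-listing endpoint `/api/tags`; `list_ollama_models` is a hypothetical helper, not part of this project.

```python
import json
import urllib.request
from typing import List, Optional


def list_ollama_models(
    base_url: str = "http://localhost:11434", timeout: float = 2.0
) -> Optional[List[str]]:
    """Return installed model names, or None if the server cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as response:
            payload = json.load(response)
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts; ValueError covers bad JSON.
        return None
    return [model.get("name", "") for model in payload.get("models", [])]


models = list_ollama_models()
if models is None:
    print("Ollama is not reachable. Start it with: ollama serve")
elif not any(name.startswith("llama3") for name in models):
    print("Ollama is running, but llama3 is missing. Run: ollama pull llama3")
else:
    print("Ollama is running with llama3 available.")
```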
This project is open-source. Data provided by NCBI ClinVar. Public domain test data provided by Harvard Personal Genome Project.