Running Deepseek locally

I have been asked to design a machine capable of running Deepseek locally for a company that is interested in leveraging AI to improve their processes. The company is strict about data governance, security, and—importantly—budget.
After some research, I decided to write up a guide on the requirements to run Deepseek locally. This serves both as a backup for myself and as a reference to share with the company. Below are my findings and recommendations to ensure you have the right machine for your intended use of the Deepseek AI model.
What is Deepseek?
Deepseek is a family of large language models (LLMs) developed by the Chinese AI company DeepSeek. The models are released as openly downloadable weights and compete with proprietary systems such as OpenAI's GPT-4; they are designed for a wide range of natural-language tasks, including text generation, question answering, and summarization.
The main motivation for implementing Deepseek at the company is to provide a tool that can interact with uploaded documents, extract and summarize information, answer questions, generate ideas and tables, and generally assist users with any request related to the content of those documents.
Deepseek Model Variants
Deepseek offers two main types of models:
- Full Models: Large, high-accuracy models (often in the news) that require significant hardware resources.
- Distilled Models: Smaller, optimized versions of the full models. These are much more hardware-friendly and suitable for local or budget-conscious deployments.
About Distilled Models
Distilled models are created by compressing larger models into smaller, faster, and more efficient versions, while retaining much of the original performance. This makes them ideal for local deployment, edge devices, or scenarios with limited hardware.
Deepseek Distilled Model Comparison
Model Name | Parameters | File Size (FP16) | VRAM (Min) | System RAM | Typical Use Case |
---|---|---|---|---|---|
1.5B | 1.5B | ~3GB | None (GPU optional) | 8GB | Testing, small-scale tasks |
7B | 7B | ~14GB | 8GB | 16GB | Chatbots, document Q&A |
8B | 8B | ~16GB | 8GB | 16GB | Chatbots, light summarization |
14B | 14B | ~28GB | 16GB | 32GB | Advanced Q&A, summarization |
32B | 32B | ~64GB | 24GB | 64GB | High-quality, complex tasks |
70B | 70B | ~140GB | 48GB+ | 128GB | Enterprise, multi-user |
671B (quant) | 671B | ~131GB (quant.) | 131GB+ | 128GB+ | Research, large-scale analysis |
- FP16: Half-precision floating point (standard for most LLMs)
- Quantized: Lower-precision weights; smaller file size and lower VRAM requirement, at the cost of some accuracy (and, with very aggressive quantization, sometimes speed). A rough sizing sketch follows below.
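As a rule of thumb, FP16 weights take about 2 bytes per parameter, while 4-bit quantization brings that down to roughly 0.5 bytes per parameter; the memory needed at runtime is somewhat higher because of the KV cache and activations. The Python sketch below illustrates that arithmetic; the 20% overhead factor is an assumption for illustration, not a measured value.

```python
# Rough memory sizing for LLM weights, assuming ~20% overhead for the
# KV cache and activations (the overhead factor is an assumption).

def estimate_memory_gb(params_billion: float, bits_per_param: float, overhead: float = 0.20) -> float:
    """Estimate the memory needed to hold and run the weights, in GB."""
    weight_bytes = params_billion * 1e9 * (bits_per_param / 8)
    return weight_bytes * (1 + overhead) / 1e9

if __name__ == "__main__":
    for name, params in [("1.5B", 1.5), ("7B", 7), ("14B", 14), ("32B", 32), ("70B", 70)]:
        fp16 = estimate_memory_gb(params, 16)  # half precision
        q4 = estimate_memory_gb(params, 4)     # 4-bit quantized
        print(f"{name}: ~{fp16:.0f} GB at FP16, ~{q4:.0f} GB at 4-bit")
```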
Where to Obtain Deepseek Models
- Official Website: Deepseek — links to model downloads and documentation.
- Hugging Face: Many Deepseek model weights are hosted on Hugging Face (search for "deepseek" and select the desired version/size); a download sketch follows after this list.
- Community Forums: Reddit, Discord, and specialized AI forums often provide guides and links for downloading, quantizing, and running Deepseek models.
- License: Most Deepseek models are available for free for research and commercial use, but always check the latest license terms.
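For the Hugging Face route mentioned above, one option is the `huggingface_hub` Python package. This is a minimal sketch; the repository name (the distilled 7B model) and local directory are examples, so substitute whichever variant you actually chose, and check its license first.

```python
# Download a Deepseek model snapshot from Hugging Face.
# Requires: pip install huggingface_hub
# The repo_id below is an example (the distilled 7B model); swap in the
# variant you actually want.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",  # example repository
    local_dir="./models/deepseek-r1-7b",                # example target folder
)
print(f"Model files downloaded to {local_path}")
```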
When to Use Each Model
- 1.5B–8B: For personal, prototyping, or small business use. Fast and can run on consumer hardware.
- 14B–32B: For higher accuracy, advanced summarization, or multi-user scenarios. Requires more powerful GPUs.
- 70B+: For enterprise, research, or production deployments with high concurrency or complex tasks. Needs multi-GPU or high-end setups.
Hardware and Software Requirements
To run Deepseek models efficiently, ensure your system meets the following requirements. These are based on the latest public documentation and community best practices.
Model-Specific Requirements
1.5B Model
- A CPU no more than roughly 10 years old (a modern multi-core CPU is recommended).
- At least 8GB of RAM.
- Dedicated VRAM not required (CPU-only is sufficient).
- For faster inference, a dedicated GPU such as NVIDIA RTX 3060 (12GB VRAM) is recommended.
- Runs entirely on CPU, or uses a GPU when present for faster responses (a minimal usage sketch follows below).
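If the 1.5B model is served through Ollama (one of the inference engines listed under Software Requirements), it can be driven from Python over Ollama's local HTTP API. A minimal sketch, assuming Ollama is running on its default port (11434) and the model has already been pulled (for example with `ollama pull deepseek-r1:1.5b`):

```python
# Minimal request against a locally running Ollama server (default port 11434).
# Assumes the model has already been pulled, e.g. `ollama pull deepseek-r1:1.5b`.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-r1:1.5b",   # example tag; use whatever you pulled
        "prompt": "Summarize the main risks of running LLMs on shared hardware.",
        "stream": False,               # return a single JSON object instead of a stream
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])
```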
7B and 8B Models
- Dedicated GPU required.
- At least 8GB of dedicated VRAM (e.g., GeForce RTX 3060 Ti).
- At least 16GB system RAM recommended.
14B Model
- Dedicated GPU required.
- At least 16GB of dedicated VRAM (e.g., GeForce RTX 4080).
- At least 32GB system RAM recommended.
32B Model
- Dedicated GPU required.
- At least 24GB of dedicated VRAM (e.g., GeForce RTX 3090).
- At least 64GB system RAM recommended.
70B Model
- Dedicated GPU required.
- 48GB VRAM is the minimum for quantized versions; full-precision (unquantized) inference may require 80–180GB VRAM, typically spread across multiple GPUs (e.g., several NVIDIA RTX 4090s or A100s).
- At least 128GB system RAM recommended.
671B Model (Quantized 1.58-bit)
- Dedicated GPU(s) required.
- At least 131GB VRAM for the quantized model, typically spread across multiple high-end GPUs (the unquantized model requires on the order of 480GB+ VRAM).
- At least 128GB system RAM (more recommended for optimal performance).
- Fast SSD storage (model file is ~131GB).
- Note: Inference will be slow on consumer hardware, but running the model is possible.
GPU Recommendations by Model Size
Model Size | Recommended GPU(s) | VRAM Requirement |
---|---|---|
Small (1.5B) | NVIDIA RTX 3060 | 12GB (optional) |
Mid-Range (7B–8B) | NVIDIA RTX 3060/3080/4070 | 8–12GB |
High-End (14B–32B) | NVIDIA RTX 4090 | 16–24GB |
Enterprise (70B+) | NVIDIA RTX 4090/A100, Multi-GPU | 48–180GB+ |
General Hardware Recommendations
- System RAM: Should match or exceed the model's VRAM requirement (e.g., 128GB+ for the quantized 671B model); a quick resource-check sketch follows after this list.
- Storage: SSD with sufficient space for model files (up to 150GB+ for quantized 671B).
- CPU: Modern multi-core CPU (Intel i7/Ryzen 7 or better recommended for CPU-only models).
- Power Supply: Sufficient wattage for multi-GPU setups.
- Cooling: Adequate cooling for high-end GPUs and multi-GPU configurations.
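Before committing to a model size, it is worth confirming what the target machine actually has. The sketch below reads system RAM with `psutil` and queries NVIDIA VRAM through `nvidia-smi`; the thresholds are just the 7B/8B figures from the comparison table above, and the check only covers NVIDIA GPUs.

```python
# Quick check of system RAM and NVIDIA VRAM against a target model's needs.
# Requires: pip install psutil, and nvidia-smi on the PATH (NVIDIA GPUs only).
import subprocess
import psutil

REQUIRED_VRAM_GB = 8    # example: minimum VRAM for the 7B/8B models
REQUIRED_RAM_GB = 16    # example: recommended system RAM for the 7B/8B models

ram_gb = psutil.virtual_memory().total / 1e9
print(f"System RAM: {ram_gb:.0f} GB (need {REQUIRED_RAM_GB} GB)")

try:
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        text=True,
    )
    # One MiB value per GPU; take the largest card and convert to GiB.
    vram_gb = max(int(line.strip()) for line in out.splitlines() if line.strip()) / 1024
    print(f"Largest GPU VRAM: {vram_gb:.0f} GB (need {REQUIRED_VRAM_GB} GB)")
except (FileNotFoundError, subprocess.CalledProcessError):
    print("nvidia-smi not available; assuming CPU-only or non-NVIDIA setup")
```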
GPU Compatibility
- NVIDIA GPUs: CUDA toolkit required.
- AMD GPUs: Supported via ROCm where available (support is limited in some frameworks).
Software Requirements
- Llama.cpp, Ollama, or a similar inference engine (required for running quantized models); a loading sketch follows after this list.
- Open WebUI, Chatbox, or other compatible UI for interaction.
- CUDA Toolkit (for NVIDIA GPUs).
- Latest GPU drivers.
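As an example of the inference-engine layer, the sketch below loads a quantized GGUF file with the `llama-cpp-python` bindings for Llama.cpp. The model path and settings are assumptions; `n_gpu_layers=-1` offloads all layers to the GPU (if the package was built with CUDA support), while `0` keeps inference on the CPU.

```python
# Load a quantized Deepseek GGUF with llama.cpp's Python bindings.
# Requires: pip install llama-cpp-python (built with CUDA for GPU offload).
# The model path is an example; point it at the GGUF file you downloaded.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/deepseek-r1-7b-q4_k_m.gguf",  # example path
    n_ctx=4096,        # context window size
    n_gpu_layers=-1,   # -1 = offload all layers to the GPU; 0 = CPU-only
)

out = llm("Q: What is a distilled model?\nA:", max_tokens=128)
print(out["choices"][0]["text"])
```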
Performance Notes
- Even with optimal hardware, large models (especially 671B) will have slow inference speeds on non-enterprise hardware.
- For CPU-only setups, expect significantly slower performance; the sketch below shows one way to measure the throughput you actually get.
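To put a number on "slow", you can measure tokens per second directly. The sketch below uses the timing fields Ollama reports in a non-streaming /api/generate response (`eval_count` and `eval_duration`); the model tag and prompt are examples.

```python
# Measure generation throughput against a local Ollama server.
# eval_count / eval_duration come from Ollama's response metadata
# (eval_duration is reported in nanoseconds).
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-r1:7b",  # example tag
        "prompt": "Write a one-paragraph summary of what a KV cache is.",
        "stream": False,
    },
    timeout=600,
).json()

tokens = resp["eval_count"]
seconds = resp["eval_duration"] / 1e9
print(f"Generated {tokens} tokens in {seconds:.1f}s -> {tokens / seconds:.1f} tokens/s")
```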