Running Deepseek locally

I have been asked to design a machine capable of running Deepseek locally for a company that is interested in leveraging AI to improve their processes. The company is strict about data governance, security, and—importantly—budget.
After some research, I decided to write up a guide on the requirements to run Deepseek locally. This serves both as a backup for myself and as a reference to share with the company. Below are my findings and recommendations to ensure you have the right machine for your intended use of the Deepseek AI model.
What is Deepseek?
Deepseek is a family of large language models (LLMs) developed by the Chinese AI company DeepSeek. The models are released as openly downloadable weights and compete with proprietary systems such as OpenAI's GPT-4; they are designed for a wide range of natural-language tasks, including text generation, question answering, and summarization.
The main motivation for implementing Deepseek at the company is to provide a tool that can interact with uploaded documents, extract and summarize information, answer questions, generate ideas and tables, and generally assist users with any request related to the content of those documents.
Deepseek Model Variants
Deepseek offers two main types of models:
- Full Models: Large, high-accuracy models (often in the news) that require significant hardware resources.
- Distilled Models: Smaller, optimized versions of the full models. These are much more hardware-friendly and suitable for local or budget-conscious deployments.
About Distilled Models
Distilled models are created by compressing larger models into smaller, faster, and more efficient versions, while retaining much of the original performance. This makes them ideal for local deployment, edge devices, or scenarios with limited hardware.
Deepseek Distilled Model Comparison
Model Name | Parameters | File Size (FP16) | VRAM (Min) | System RAM | Typical Use Case |
---|---|---|---|---|---|
1.5B | 1.5B | ~3GB | None (GPU optional) | 8GB | Testing, small-scale tasks |
7B | 7B | ~14GB | 8GB | 16GB | Chatbots, document Q&A |
8B | 8B | ~16GB | 8GB | 16GB | Chatbots, light summarization |
14B | 14B | ~28GB | 16GB | 32GB | Advanced Q&A, summarization |
32B | 32B | ~64GB | 24GB | 64GB | High-quality, complex tasks |
70B | 70B | ~140GB | 48GB+ | 128GB | Enterprise, multi-user |
671B (quant) | 671B | ~131GB (quant.) | 131GB+ | 128GB+ | Research, large-scale analysis |
- FP16: Half-precision floating point (standard for most LLMs)
- Quantized: Lower-precision weights; smaller file size and lower VRAM requirement, at the cost of some accuracy (and, with very aggressive quantization, sometimes speed). A rough sizing sketch follows below.
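As a rule of thumb, FP16 weights take about 2 bytes per parameter, while 4-bit quantization brings that down to roughly 0.5 bytes per parameter; the memory needed at runtime is somewhat higher because of the KV cache and activations. The Python sketch below illustrates that arithmetic; the 20% overhead factor is an assumption for illustration, not a measured value.

```python
# Rough memory sizing for LLM weights, assuming ~20% overhead for the
# KV cache and activations (the overhead factor is an assumption).

def estimate_memory_gb(params_billion: float, bits_per_param: float, overhead: float = 0.20) -> float:
    """Estimate the memory needed to hold and run the weights, in GB."""
    weight_bytes = params_billion * 1e9 * (bits_per_param / 8)
    return weight_bytes * (1 + overhead) / 1e9

if __name__ == "__main__":
    for name, params in [("1.5B", 1.5), ("7B", 7), ("14B", 14), ("32B", 32), ("70B", 70)]:
        fp16 = estimate_memory_gb(params, 16)  # half precision
        q4 = estimate_memory_gb(params, 4)     # 4-bit quantized
        print(f"{name}: ~{fp16:.0f} GB at FP16, ~{q4:.0f} GB at 4-bit")
```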
Where to Obtain Deepseek Models
- Official Website: Deepseek — links to model downloads and documentation.
- Hugging Face: Many Deepseek model weights are hosted on Hugging Face (search for "deepseek" and select the desired version/size); a download sketch follows after this list.
- Community Forums: Reddit, Discord, and specialized AI forums often provide guides and links for downloading, quantizing, and running Deepseek models.
- License: Most Deepseek models are available for free for research and commercial use, but always check the latest license terms.
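For the Hugging Face route mentioned above, one option is the `huggingface_hub` Python package. This is a minimal sketch; the repository name (the distilled 7B model) and local directory are examples, so substitute whichever variant you actually chose, and check its license first.

```python
# Download a Deepseek model snapshot from Hugging Face.
# Requires: pip install huggingface_hub
# The repo_id below is an example (the distilled 7B model); swap in the
# variant you actually want.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",  # example repository
    local_dir="./models/deepseek-r1-7b",                # example target folder
)
print(f"Model files downloaded to {local_path}")
```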
When to Use Each Model
- 1.5B–8B: For personal, prototyping, or small business use. Fast and can run on consumer hardware.
- 14B–32B: For higher accuracy, advanced summarization, or multi-user scenarios. Requires more powerful GPUs.
- 70B+: For enterprise, research, or production deployments with high concurrency or complex tasks. Needs multi-GPU or high-end setups.
Hardware and Software Requirements
To run Deepseek models efficiently, ensure your system meets the following requirements. These are based on the latest public documentation and community best practices.
Model-Specific Requirements
1.5B Model
- A CPU no more than roughly 10 years old (a modern multi-core CPU is recommended).
- At least 8GB of RAM.
- Dedicated VRAM not required (CPU-only is sufficient).
- For faster inference, a dedicated GPU such as NVIDIA RTX 3060 (12GB VRAM) is recommended.
- Runs entirely on CPU, or uses a GPU when present for faster responses (a minimal usage sketch follows below).
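If the 1.5B model is served through Ollama (one of the inference engines listed under Software Requirements), it can be driven from Python over Ollama's local HTTP API. A minimal sketch, assuming Ollama is running on its default port (11434) and the model has already been pulled (for example with `ollama pull deepseek-r1:1.5b`):

```python
# Minimal request against a locally running Ollama server (default port 11434).
# Assumes the model has already been pulled, e.g. `ollama pull deepseek-r1:1.5b`.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-r1:1.5b",   # example tag; use whatever you pulled
        "prompt": "Summarize the main risks of running LLMs on shared hardware.",
        "stream": False,               # return a single JSON object instead of a stream
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])
```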
7B and 8B Models
- Dedicated GPU required.
- At least 8GB of dedicated VRAM (e.g., GeForce RTX 3060 Ti).
- At least 16GB system RAM recommended.
14B Model
- Dedicated GPU required.
- At least 16GB of dedicated VRAM (e.g., GeForce RTX 4080).
- At least 32GB system RAM recommended.
32B Model
- Dedicated GPU required.
- At least 24GB of dedicated VRAM (e.g., GeForce RTX 3090).
- At least 64GB system RAM recommended.
70B Model
- Dedicated GPU required.
- 48GB VRAM is the minimum for quantized versions; full-precision (unquantized) inference may require 80–180GB VRAM, typically spread across multiple GPUs (e.g., several NVIDIA RTX 4090s or A100s).
- At least 128GB system RAM recommended.
671B Model (Quantized 1.58-bit)
- Dedicated GPU(s) required.
- At least 131GB VRAM for the quantized model, typically spread across multiple high-end GPUs (the unquantized model requires on the order of 480GB+ VRAM).
- At least 128GB system RAM (more recommended for optimal performance).
- Fast SSD storage (model file is ~131GB).
- Note: Inference will be slow on consumer hardware, but running the model is possible.
GPU Recommendations by Model Size
Model Size | Recommended GPU(s) | VRAM Requirement |
---|---|---|
Small (1.5B) | NVIDIA RTX 3060 | 12GB (optional) |
Mid-Range (7B–8B) | NVIDIA RTX 3060/3080/4070 | 8–12GB |
High-End (14B–32B) | NVIDIA RTX 4090 | 16–24GB |
Enterprise (70B+) | NVIDIA RTX 4090/A100, Multi-GPU | 48–180GB+ |
General Hardware Recommendations
- System RAM: Should match or exceed the model's VRAM requirement (e.g., 128GB+ for the quantized 671B model); a quick resource-check sketch follows after this list.
- Storage: SSD with sufficient space for model files (up to 150GB+ for quantized 671B).
- CPU: Modern multi-core CPU (Intel i7/Ryzen 7 or better recommended for CPU-only models).
- Power Supply: Sufficient wattage for multi-GPU setups.
- Cooling: Adequate cooling for high-end GPUs and multi-GPU configurations.
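Before committing to a model size, it is worth confirming what the target machine actually has. The sketch below reads system RAM with `psutil` and queries NVIDIA VRAM through `nvidia-smi`; the thresholds are just the 7B/8B figures from the comparison table above, and the check only covers NVIDIA GPUs.

```python
# Quick check of system RAM and NVIDIA VRAM against a target model's needs.
# Requires: pip install psutil, and nvidia-smi on the PATH (NVIDIA GPUs only).
import subprocess
import psutil

REQUIRED_VRAM_GB = 8    # example: minimum VRAM for the 7B/8B models
REQUIRED_RAM_GB = 16    # example: recommended system RAM for the 7B/8B models

ram_gb = psutil.virtual_memory().total / 1e9
print(f"System RAM: {ram_gb:.0f} GB (need {REQUIRED_RAM_GB} GB)")

try:
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        text=True,
    )
    # One MiB value per GPU; take the largest card and convert to GiB.
    vram_gb = max(int(line.strip()) for line in out.splitlines() if line.strip()) / 1024
    print(f"Largest GPU VRAM: {vram_gb:.0f} GB (need {REQUIRED_VRAM_GB} GB)")
except (FileNotFoundError, subprocess.CalledProcessError):
    print("nvidia-smi not available; assuming CPU-only or non-NVIDIA setup")
```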
GPU Compatibility
- NVIDIA GPUs: CUDA toolkit required.
- AMD GPUs: Supported via ROCm where available (support is limited in some frameworks).
Software Requirements
- Llama.cpp, Ollama, or a similar inference engine (required for running quantized models); a loading sketch follows after this list.
- Open WebUI, Chatbox, or other compatible UI for interaction.
- CUDA Toolkit (for NVIDIA GPUs).
- Latest GPU drivers.
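As an example of the inference-engine layer, the sketch below loads a quantized GGUF file with the `llama-cpp-python` bindings for Llama.cpp. The model path and settings are assumptions; `n_gpu_layers=-1` offloads all layers to the GPU (if the package was built with CUDA support), while `0` keeps inference on the CPU.

```python
# Load a quantized Deepseek GGUF with llama.cpp's Python bindings.
# Requires: pip install llama-cpp-python (built with CUDA for GPU offload).
# The model path is an example; point it at the GGUF file you downloaded.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/deepseek-r1-7b-q4_k_m.gguf",  # example path
    n_ctx=4096,        # context window size
    n_gpu_layers=-1,   # -1 = offload all layers to the GPU; 0 = CPU-only
)

out = llm("Q: What is a distilled model?\nA:", max_tokens=128)
print(out["choices"][0]["text"])
```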
Performance Notes
- Even with optimal hardware, large models (especially 671B) will have slow inference speeds on non-enterprise hardware.
- For CPU-only setups, expect significantly slower performance; the sketch below shows one way to measure the throughput you actually get.
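To put a number on "slow", you can measure tokens per second directly. The sketch below uses the timing fields Ollama reports in a non-streaming /api/generate response (`eval_count` and `eval_duration`); the model tag and prompt are examples.

```python
# Measure generation throughput against a local Ollama server.
# eval_count / eval_duration come from Ollama's response metadata
# (eval_duration is reported in nanoseconds).
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-r1:7b",  # example tag
        "prompt": "Write a one-paragraph summary of what a KV cache is.",
        "stream": False,
    },
    timeout=600,
).json()

tokens = resp["eval_count"]
seconds = resp["eval_duration"] / 1e9
print(f"Generated {tokens} tokens in {seconds:.1f}s -> {tokens / seconds:.1f} tokens/s")
```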