Support Deepseek-V2 #4650
Conversation
ERROR 05-08 20:22:08 worker_base.py:145] ValueError: Model architectures ['DeepseekV2ForCausalLM'] are not supported for now. Supported architectures: ['AquilaModel', 'AquilaForCausalLM', 'BaiChuanForCausalLM', 'BaichuanForCausalLM', 'BloomForCausalLM', 'ChatGLMModel', 'ChatGLMForConditionalGeneration', 'CohereForCausalLM', 'DbrxForCausalLM', 'DeciLMForCausalLM', 'DeepseekForCausalLM', 'FalconForCausalLM', 'GemmaForCausalLM', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTJForCausalLM', 'GPTNeoXForCausalLM', 'InternLMForCausalLM', 'InternLM2ForCausalLM', 'JAISLMHeadModel', 'LlamaForCausalLM', 'LlavaForConditionalGeneration', 'LLaMAForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'QuantMixtralForCausalLM', 'MptForCausalLM', 'MPTForCausalLM', 'MiniCPMForCausalLM', 'OlmoForCausalLM', 'OPTForCausalLM', 'OrionForCausalLM', 'PhiForCausalLM', 'Phi3ForCausalLM', 'QWenLMHeadModel', 'Qwen2ForCausalLM', 'Qwen2MoeForCausalLM', 'RWForCausalLM', 'StableLMEpochForCausalLM', 'StableLmForCausalLM', 'Starcoder2ForCausalLM', 'XverseForCausalLM']
It seems the model architecture is not supported in vLLM.
What's the reason it is not supported in this PR?
Hi, with only MHA, is it possible to reach max_model_len = 128k? In my test it only reached about 12k.
Our internal inference implementation supports MLA. The vLLM implementation here is mainly about adding support quickly and matching the model parameters to the code, so its efficiency for LLM serving is not high enough. I think the current PR could be reviewed and merged as soon as possible; the community can then consider implementing an optimized, integrated version.
Hi @zwd003, could you merge the latest main branch and fix the conflicts? Thanks.
Is MLA support currently being developed?
ok |
Hi @zwd003, this error occurred during deployment. How can it be solved? Thanks! (RayWorkerWrapper pid=52311) ERROR 05-11 18:04:33 worker_base.py:145] File "/opt/vllm/vllm/model_executor/models/deepseek_v2.py", line 156, in forward
I encountered the same error.
Thanks! :D
Hello, I encountered this error when the QPS was increased to 2.
Could you point me to the lines that implement the KV compression? Thanks.
The following error is reported while loading the model: Cache shape torch.Size([163840, 64]) [repeated 6x across cluster] Process finished with exit code 1
Any update? Looking forward to it.
vllm/config.py (outdated)
@@ -250,6 +250,9 @@ def get_hidden_size(self) -> int:
        return self.hf_text_config.hidden_size

    def get_head_size(self) -> int:
        if hasattr(self.hf_text_config, "model_type") and self.hf_text_config.model_type == 'deepseek_v2':
Can you add the head_dim to the huggingface config instead of hard-coding this here?
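A minimal sketch of what that suggestion could look like, assuming the HuggingFace config exposed a head_dim attribute (the attribute name and the fallback rule below are assumptions, not the final implementation):

    def get_head_size(self) -> int:
        # Prefer an explicit head_dim from the HuggingFace config if present;
        # for DeepSeek-V2 this would be qk_nope_head_dim + qk_rope_head_dim
        # = 128 + 64 = 192.
        head_dim = getattr(self.hf_text_config, "head_dim", None)
        if head_dim is not None:
            return head_dim
        # Fall back to the usual hidden_size / num_attention_heads rule.
        return (self.hf_text_config.hidden_size //
                self.hf_text_config.num_attention_heads)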
    # TODO remove hard code
    if hasattr(self.hf_text_config, "model_type"
               ) and self.hf_text_config.model_type == 'deepseek_v2':
        # FlashAttention supports only head_size 32, 64, 128, 256,
Is this true? According to def get_supported_head_sizes() -> List[int]:
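For context, a sketch of the kind of check that function implies; the exact list of supported head sizes differs between vLLM and FlashAttention versions, so the values below are illustrative rather than authoritative:

    from typing import List

    def get_supported_head_sizes() -> List[int]:
        # Illustrative list; if 192 is supported (as in newer FlashAttention
        # builds), DeepSeek-V2's qk head size of 192 would not need padding
        # to 256.
        return [32, 64, 96, 128, 160, 192, 224, 256]

    def check_head_size(head_size: int) -> None:
        supported = get_supported_head_sizes()
        if head_size not in supported:
            raise ValueError(
                f"Head size {head_size} is not supported by FlashAttention. "
                f"Supported head sizes are: {supported}.")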
deepseek_v2.py -- that should make it quite a bit simpler :)
Thanks, I will test it later with the latest flash-attn.
    return 0.1 * mscale * math.log(scale) + 1.0


class DeepseekScalingRotaryEmbedding(RotaryEmbedding):
This is extremely similar to YaRNScalingRotaryEmbedding; can you extend that one instead to support mscale?
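For illustration, a rough sketch of how the DeepSeek mscale correction relates to the YaRN formula in the diff above; the helper names and the way the ratio would be folded into the class are assumptions based on the DeepSeek-V2 rope_scaling config, not the PR's final code:

    import math

    def yarn_get_mscale(scale: float = 1.0, mscale: float = 1.0) -> float:
        # Same helper as in the diff above; DeepSeek generalizes the YaRN
        # magnitude correction with an extra `mscale` coefficient.
        if scale <= 1:
            return 1.0
        return 0.1 * mscale * math.log(scale) + 1.0

    def deepseek_cache_mscale(scaling_factor: float,
                              mscale: float,
                              mscale_all_dim: float) -> float:
        # Ratio of the two corrections (mscale vs. mscale_all_dim from
        # rope_scaling); multiplying the YaRN cos/sin cache by this factor
        # is, as far as I can tell, the only extra step DeepSeek needs on
        # top of plain YaRN scaling.
        return (yarn_get_mscale(scaling_factor, mscale) /
                yarn_get_mscale(scaling_factor, mscale_all_dim))

In that reading, subclassing YaRNScalingRotaryEmbedding and applying this one factor would avoid duplicating the rest of the class.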
@zwd003 I did the refactoring of the MoE code for you; can you look into the other comments I just added?
I used the GPTQ (int4) method to quantize the DeepSeek-V2 model. When I load the quantized model with vLLM, I get the error below.
vLLM parameters: --dtype float16 --load-format safetensors --trust-remote-code --tensor-parallel-size 2 --enforce-eager --device cuda --max-model-len 1024
Generated config.json:
"quantization_config": {
"bits": 4,
"damp_percent": 0.1,
"desc_act": false,
"group_size": 128,
"modules_in_block_to_quantize": null,
"quant_method": "gptq",
"sym": true,
"true_sequential": true
}
I followed the file changes and modified the files, but when I run the code below I get an error:
How can I solve it?
(deepseek) ailearn@gpts:/data/sdd/models$ cd /data/sdd/models/ ; CUDA_VISIBLE_DEVICES=0,1,2,3 python -m vllm.entrypoints.openai.api_server --gpu-memory-utilization 0.99 --max-model-len 1024 --model DeepSeek-V2-Lite-Chat --enforce-eager --trust-remote-code --tensor-parallel-size 4 --host 0.0.0.0 --port 8008
You can see that
I made a fix with recent changes on vLLM. Assuming you have an 8xH100 machine, to run:
Thank you for your help. Can I use this option with 8x A100 GPUs?
I think so. If you need more memory, vLLM will complain about
Thanks a lot! I am going to test it and share results.
Thank you for your attempt. What's the result?
OK
I also hit this error.
Same problem: [rank0]: TypeError: DeepseekV2ForCausalLM.__init__() got an unexpected keyword argument 'cache_config'
@xxll88 @WhatGhost please use this for now: https://github.com/seungduk-yanolja/vllm-deepseek
Thanks. In a CPU environment the model loads fine, but this error occurs during inference:
When will this pull request be merged?
Is this PR working? It is also failing lint and has merge conflicts. Once fixed, please ping us for a review. (cc @youkaichao to help with merging in case I'm not available.)
Getting the same error.
@zwd003 do you need any help to get this over the line?
Description:
This PR introduces support for the recently released DeepSeek-V2 model by DeepSeek-AI.
Key Updates:
Related Resources:
Todo:
We look forward to community feedback and suggestions to help us continuously improve and refine the integration and inference implementation of the DeepSeek-V2 model.
Testing
Note: Currently, only the inference method using the Multi-Head Attention (MHA) approach has been implemented, and the efficient inference mode mentioned in the paper has not yet been realized.
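For a quick functional check, something along these lines can be run once this branch is installed; the model path, parallelism degree, and sampling settings below are placeholders, not part of the PR:

    from vllm import LLM, SamplingParams

    # Smoke test for DeepSeek-V2 support; adjust the model path and
    # tensor_parallel_size to your hardware. trust_remote_code is needed
    # for the DeepSeek-V2 config and tokenizer.
    llm = LLM(model="deepseek-ai/DeepSeek-V2-Lite-Chat",
              trust_remote_code=True,
              tensor_parallel_size=1,
              max_model_len=4096,
              enforce_eager=True)

    outputs = llm.generate(
        ["Give me a short introduction to DeepSeek-V2."],
        SamplingParams(temperature=0.7, max_tokens=128))

    for out in outputs:
        print(out.outputs[0].text)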