A curated collection of papers, data, and repositories related to LLM quantization.
- [ICML] BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [code]
- [ICML] QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks [code]
- [ICML] FrameQuant: Flexible Low-Bit Quantization for Transformers
- [ICML] SqueezeLLM: Dense-and-Sparse Quantization [code]
- [ICML] Extreme Compression of Large Language Models via Additive Quantization [code]
- [ICML] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache [code]
- [ICML] BiE: Bi-Exponent Block Floating-Point for Large Language Models Quantization
- [ICML] Compressing Large Language Models by Joint Sparsification and Quantization [code]
- [ICML] LQER: Low-Rank Quantization Error Reconstruction for LLMs [code]
- [ICML] Accurate LoRA-Finetuning Quantization of LLMs via Information Retention [code]
- [ACL] DB-LLM: Accurate Dual-Binarization for Efficient LLMs
- [ACL] Quantized Side Tuning: Fast and Memory-Efficient Tuning of Quantized Large Language Models [code]
- [ACL] LRQuant: Learnable and Robust Post-Training Quantization for Large Language Models [code]
- [ACL] Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression [code]
- [ACL] BitDistiller: Unleashing the Potential of Sub-4-Bit LLMs via Self-Distillation [code]
- [ACL] Improving Conversational Abilities of Quantized Large Language Models via Direct Preference Alignment
- [NeurIPS] LLM-MQ: Mixed-precision Quantization for Efficient LLM Deployment
- [NeurIPS] FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs
- [ICLR] OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models [code]
- [ICLR] LoftQ: LoRA-Fine-Tuning-aware Quantization for Large Language Models [code]
- [ICLR] Rethinking Channel Dimensions to Isolate Outliers for Low-bit Weight Quantization of Large Language Models [code]
- [ICLR] AffineQuant: Affine Transformation Quantization for Large Language Models [code]
- [ICLR] SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression [code]
- [ICLR] QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models [code]
- [ICLR] Compressing LLMs: The Truth is Rarely Pure and Never Simple [code]
- [ICLR] LQ-LoRA: Low-Rank plus Quantized Matrix Decomposition for Efficient Language Model Finetuning [code]
- [ICLR] QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models [code]
- [NAACL] ALoRA: Allocating Low-Rank Adaptation for Fine-tuning Large Language Models
- [NAACL] Divergent Token Metrics: Measuring degradation to prune away LLM components – and optimize quantization
- [NAACL] Compensate Quantization Errors: Make Weights Hierarchical to Compensate Each Other
- [ICML] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models [code]
- [EMNLP] Outlier Suppression+: Accurate quantization of large language models by equivalent and effective shifting and scaling [code]
- [EMNLP] EasyQuant: An Efficient Data-free Quantization Algorithm for LLMs
- [EMNLP] Zero-shot Sharpness-Aware Quantization for Pre-trained Language Models
- [EMNLP] A Frustratingly Easy Post-Training Quantization Scheme for LLMs [code]
- [EMNLP] Enhancing Computation Efficiency in Large Language Models through Weight and Activation Quantization
- [ACL] Self-Distilled Quantization: Achieving High Compression Rates in Transformer-Based Language Models
- [NeurIPS] Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization
- [NeurIPS] QuIP: 2-Bit Quantization of Large Language Models With Guarantees [code]
- [NeurIPS] Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing [code]
- [NeurIPS] QLoRA: Efficient Finetuning of Quantized LLMs [code]
- [NeurIPS] Training Transformers with 4-bit Integers
- [NeurIPS] TexQ: Zero-shot Network Quantization with Texture Feature Distribution Calibration
- [ACL] PreQuant: A Task-agnostic Quantization Approach for Pre-trained Language Models
- [ICLR] OPTQ: Accurate Quantization for Generative Pre-trained Transformers [code]
- [ICLR] FIT: A Metric for Model Sensitivity
- [ICLR] PowerQuant: Automorphism Search for Non-Uniform Quantization
- [ICLR] Accurate Neural Training with 4-bit Matrix Multiplications at Standard Formats
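Many of the post-training quantization papers above improve on the same round-to-nearest baseline: quantize each weight row to a low-bit integer grid with a per-channel scale. Below is a minimal sketch of that baseline (symmetric, per-output-channel, 4-bit); the function names are illustrative, not from any specific paper's codebase.

```python
import numpy as np

def quantize_per_channel(w: np.ndarray, bits: int = 4):
    """Symmetric per-output-channel round-to-nearest (RTN) quantization.

    w: weight matrix of shape (out_features, in_features).
    Returns integer codes and per-channel scales such that w ~= codes * scales.
    """
    qmax = 2 ** (bits - 1) - 1                   # e.g. 7 for 4-bit signed
    scales = np.abs(w).max(axis=1, keepdims=True) / qmax
    scales = np.where(scales == 0, 1.0, scales)  # guard all-zero channels
    codes = np.clip(np.round(w / scales), -qmax - 1, qmax).astype(np.int8)
    return codes, scales

def dequantize(codes: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Map integer codes back to approximate float weights."""
    return codes.astype(np.float32) * scales

# Toy usage: quantize a random weight matrix and check the error bound.
rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16)).astype(np.float32)
codes, scales = quantize_per_channel(w, bits=4)
w_hat = dequantize(codes, scales)
max_err = np.abs(w - w_hat).max()  # RTN error is at most 0.5 * scale per channel
```

Methods such as OPTQ/GPTQ, QuIP, and SpQR keep this storage format but choose the codes (or handle outlier weights) more carefully than plain rounding, which is what recovers accuracy at 2-4 bits.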