# Understanding DeepSeek R1
DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that has been making waves in the AI community. Not only does it match, or even surpass, OpenAI's o1 model on many benchmarks, but it also ships with fully MIT-licensed weights. This makes it the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible way.

What makes DeepSeek-R1 especially exciting is its transparency. Unlike the less open approaches from some industry leaders, DeepSeek has published a detailed training methodology in their paper. The model is also remarkably cheap to run, with input tokens costing just $0.14-0.55 per million (vs. o1's $15) and output tokens at $2.19 per million (vs. o1's $60).
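To put that pricing in perspective, here is a quick back-of-the-envelope comparison, assuming an illustrative workload of one million input and one million output tokens and using R1's higher cache-miss input rate:

```python
# Prices in USD per million tokens, taken from the figures above.
r1_input, r1_output = 0.55, 2.19    # DeepSeek-R1 (cache-miss input rate)
o1_input, o1_output = 15.00, 60.00  # OpenAI o1

# Illustrative workload: 1M input tokens + 1M output tokens.
r1_cost = r1_input + r1_output
o1_cost = o1_input + o1_output
print(f"R1: ${r1_cost:.2f}  o1: ${o1_cost:.2f}  (~{o1_cost / r1_cost:.0f}x cheaper)")
# -> R1: $2.74  o1: $75.00  (~27x cheaper)
```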
Until roughly GPT-4, the conventional wisdom was that better models required more data and compute. While that still holds, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.
## The Essentials
The DeepSeek-R1 paper presented multiple models, but the main ones are R1 and R1-Zero. Following these is a series of distilled models that, while interesting, I won't discuss here.

DeepSeek-R1 builds on two major ideas:

1. A multi-stage pipeline where a small amount of cold-start data kickstarts the model, followed by large-scale RL.
2. Group Relative Policy Optimization (GRPO), a reinforcement learning method that relies on comparing multiple model outputs per prompt to avoid the need for a separate critic.
R1 and R1-Zero are both reasoning models. This essentially means they do Chain-of-Thought before answering. For the R1 series of models, this takes the form of thinking within a `<think>` tag before answering with a final summary.
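As a concrete illustration, here is a minimal sketch of how you might separate the reasoning trace from the final answer in an R1-style response (the exact tag handling and the example text are assumptions for illustration, not part of DeepSeek's tooling):

```python
import re

# Example R1-style output: reasoning inside <think> tags, then the final answer.
raw_output = (
    "<think>The user asks for 12 * 17. 12 * 17 = 12 * 10 + 12 * 7 = 120 + 84 = 204.</think>"
    "12 * 17 = 204."
)

match = re.match(r"\s*<think>(.*?)</think>(.*)", raw_output, re.DOTALL)
reasoning, answer = match.group(1).strip(), match.group(2).strip()
print("Reasoning:", reasoning)
print("Answer:", answer)
```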
## R1-Zero vs R1
R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base with no supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward. R1-Zero attains excellent accuracy but sometimes produces confusing outputs, such as mixing multiple languages in a single response. R1 fixes that by incorporating limited supervised fine-tuning and multiple RL passes, which improves both accuracy and readability.

It is interesting how some languages may express certain concepts better, which leads the model to pick the most expressive language for the task.
## Training Pipeline
The training pipeline that DeepSeek published in the R1 paper is immensely interesting. It showcases how they developed such strong reasoning models, and what you can expect from each stage. This includes the problems that the resulting models from each stage have, and how they solved them in the next stage.

It's interesting that their training pipeline deviates from the usual one:

The usual training approach: Pretraining on a large dataset (training to predict the next word) to get the base model → supervised fine-tuning → preference tuning via RLHF

R1-Zero: Pretrained → RL

R1: Pretrained → Multistage training pipeline with multiple SFT and RL stages
Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to ensure the RL process has a decent starting point. This gives a good model to begin RL from.

First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as forcing chain-of-thought into thinking tags). When they were near convergence in the RL process, they moved to the next step. The result of this step is a strong reasoning model but with weak general capabilities, e.g., poor formatting and language mixing.

Rejection Sampling + general data: Create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model (see the sketch after this overview). They collected around 600k high-quality reasoning samples.

Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step resulted in a strong reasoning model with general capabilities.

Second RL Stage: Add more reward signals (helpfulness, harmlessness) to refine the final model, in addition to the reasoning rewards. The result is DeepSeek-R1.

They also did model distillation for several Qwen and Llama models on the reasoning traces to get distilled-R1 models.
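To make the rejection-sampling step above concrete, here is a minimal sketch of how such SFT data can be collected. The callables, sample count, and filtering rule are illustrative assumptions, not DeepSeek's actual code:

```python
from typing import Callable, Iterable

def collect_sft_data(
    prompts: Iterable[str],
    reference_answers: Iterable[str],
    generate: Callable[[str], str],        # e.g. a call into the RL checkpoint
    extract_answer: Callable[[str], str],  # e.g. pull the text after </think>
    samples_per_prompt: int = 8,
) -> list[dict]:
    """Rejection sampling: sample several completions per prompt and keep
    only those whose extracted final answer matches the reference."""
    sft_examples = []
    for prompt, reference in zip(prompts, reference_answers):
        candidates = [generate(prompt) for _ in range(samples_per_prompt)]
        for completion in candidates:
            if extract_answer(completion) == reference:
                sft_examples.append({"prompt": prompt, "completion": completion})
    return sft_examples
```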
Model distillation is a technique where you use a teacher model to improve a student model by generating training data for the student. The teacher is typically a larger model than the student.
## Group Relative Policy Optimization (GRPO)
The fundamental idea behind using reinforcement learning for LLMs is to fine-tune the model's policy so that it naturally produces more accurate and useful responses. They used a reward system that checks not only for correctness but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these quality criteria.
In this paper, they encourage the R1 model to generate chain-of-thought reasoning through RL training with GRPO. Rather than adding a separate module at inference time, the training process itself nudges the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.
What makes their approach particularly interesting is its reliance on straightforward, rule-based reward functions. Instead of depending on expensive external models or human-graded examples as in traditional RLHF, the RL used for R1 uses simple criteria: it might give a higher reward if the answer is correct, if it follows the expected `<think>`/`<answer>` format, and if the language of the answer matches that of the prompt. Not depending on a reward model also means you don't have to spend time and effort training it, and it doesn't take memory and compute away from your main model.
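To make this concrete, here is a minimal sketch of what such a rule-based reward might look like. The tag names, scoring weights, and the `detect_language` helper are illustrative assumptions, not DeepSeek's actual implementation:

```python
import re

def detect_language(text: str) -> str:
    """Toy stand-in for a real language detector (illustrative only)."""
    return "zh" if re.search(r"[\u4e00-\u9fff]", text) else "en"

def rule_based_reward(prompt: str, completion: str, reference_answer: str) -> float:
    """Score a completion on format, correctness, and language consistency."""
    reward = 0.0

    # Format reward: the completion should wrap its reasoning in <think> tags.
    match = re.match(r"\s*<think>(.*?)</think>(.*)", completion, re.DOTALL)
    if match:
        reward += 0.5
        answer = match.group(2).strip()
    else:
        answer = completion.strip()

    # Accuracy reward: exact match against a verifiable reference answer.
    if answer == reference_answer.strip():
        reward += 1.0

    # Language-consistency reward: the answer should match the prompt's language.
    if detect_language(answer) == detect_language(prompt):
        reward += 0.25

    return reward
```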
GRPO was introduced in the DeepSeekMath paper. Here's how GRPO works:
1. For each input prompt, the model generates several different responses.
2. Each response receives a scalar reward based on factors like accuracy, formatting, and language consistency.
3. Rewards are normalized relative to the group's performance, essentially measuring how much better each response is compared to the others.
4. The model updates its policy slightly to favor responses with higher relative rewards. It only makes small adjustments, using techniques like clipping and a KL penalty, to ensure the policy doesn't drift too far from its original behavior.
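Here is a minimal sketch of the group-relative advantage computation and the clipped update at the heart of GRPO. The rewards, the clipping threshold, and the sequence-level simplification are illustrative; this is not the paper's full objective, which also includes a KL penalty term:

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Normalize each reward against its group: A_i = (r_i - mean(r)) / std(r)."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def grpo_policy_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate loss, using group-relative advantages
    instead of advantages from a separately trained critic."""
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Toy example: four sampled responses for one prompt, scored by rule-based rewards.
rewards = torch.tensor([1.75, 0.5, 1.75, 0.0])
print(grpo_advantages(rewards))  # above-average responses get positive advantages
```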
A cool aspect of GRPO is its flexibility. You can use simple rule-based reward functions, for instance, awarding a bonus when the model correctly uses the `<think>` syntax, to guide the training.

While DeepSeek used GRPO, you could use alternative methods instead (PPO or PRIME).
For those looking to dive deeper, Will Brown has written quite a nice implementation of training an LLM with RL using GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another good resource. Finally, Yannic Kilcher has a great video explaining GRPO by going through the DeepSeekMath paper.
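If you want to experiment yourself, TRL exposes GRPO behind its usual trainer API. The following rough sketch is adapted from the TRL quickstart; the model, dataset, and toy length-based reward are assumptions, and the API may have changed since, so check the TRL documentation:

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")

# Toy reward: prefer completions close to 20 characters. In practice you would
# plug in rule-based rewards like the correctness/format checks described above.
def reward_len(completions, **kwargs):
    return [-abs(20 - len(completion)) for completion in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_len,
    args=GRPOConfig(output_dir="Qwen2-0.5B-GRPO"),
    train_dataset=dataset,
)
trainer.train()
```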
## Is RL on LLMs the path to AGI?
As a final note on explaining DeepSeek-R1 and the methods they have presented in their paper, I want to highlight a passage from the DeepSeekMath paper, based on a point Yannic Kilcher made in his video.

"These findings suggest that RL enhances the model's overall performance by rendering the output distribution more robust; in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities."
In other words, RL fine-tuning tends to shape the output distribution so that the highest-probability outputs are more likely to be correct, even though the overall capability (as measured by the diversity of correct answers) is largely already present in the pretrained model.

This suggests that reinforcement learning on LLMs is more about refining and "shaping" the existing distribution of responses rather than endowing the model with entirely new capabilities. Consequently, while RL techniques such as PPO and GRPO can produce substantial performance gains, there appears to be an inherent ceiling determined by the underlying model's pretrained knowledge.

It is unclear to me how far RL will take us. Perhaps it will be the stepping stone to the next big milestone. I'm excited to see how it unfolds!
## Running DeepSeek-R1
I have used DeepSeek-R1 via the official chat interface for various problems, which it seems to solve well enough. The additional search functionality makes it even nicer to use.

Interestingly, o3-mini(-high) was released as I was writing this post. From my initial testing, R1 seems stronger at math than o3-mini.
I also rented a single H100 via Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some experiments. The main goal was to see how the model would perform when deployed on a single H100 GPU, not to extensively test the model's capabilities.
### 671B via llama.cpp
DeepSeek-R1 1.58-bit (UD-IQ1_S) quantized model by Unsloth, with a 4-bit quantized KV cache and partial GPU offloading (29 layers running on the GPU), via llama.cpp:
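For reference, a minimal sketch of loading that quant through the llama-cpp-python bindings; the file path and the prompt are assumptions, and the actual run used the llama.cpp CLI directly:

```python
from llama_cpp import Llama  # pip install llama-cpp-python (built with CUDA support)

# Hypothetical local path to the first shard of Unsloth's UD-IQ1_S GGUF.
MODEL_PATH = "DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf"

llm = Llama(
    model_path=MODEL_PATH,
    n_gpu_layers=29,  # offload 29 layers to the H100; the rest run on the CPU
    n_ctx=8192,       # context window used for the experiment
)
# Note: the 4-bit KV cache corresponds to llama.cpp's --cache-type-k/--cache-type-v
# q4_0 options when using the CLI.

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "How many r's are in 'strawberry'?"}],
    max_tokens=2048,
)
print(out["choices"][0]["message"]["content"])
```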
29 layers seemed to be the sweet spot given this configuration.
Performance:

A r/localllama user reported that they were able to get over 2 tok/sec with DeepSeek R1 671B, without using their GPU, on their local gaming setup.

Digital Spaceport wrote a full guide on how to run DeepSeek R1 671B fully locally on a $2000 EPYC server, on which you can get ~4.25 to 3.5 tokens per second.
As you can see, the tokens/s isn't quite bearable for any serious work, but it's fun to run these large models on accessible hardware.
What matters most to me is a combination of usefulness and time-to-usefulness in these models. Since reasoning models need to think before answering, their time-to-usefulness is usually higher than for other models, but their usefulness is also usually higher. We need to both maximize usefulness and minimize time-to-usefulness.
### 70B via Ollama
70.6B params, 4-bit KM quantized DeepSeek-R1 running via Ollama:
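A minimal sketch of driving the same distilled 70B model through the Ollama Python client; the model tag and prompt are assumptions, and the original run simply used the Ollama CLI:

```python
import ollama  # pip install ollama; assumes a local Ollama server with the model pulled

# "deepseek-r1:70b" is Ollama's tag for the 70B distilled model (4-bit K_M quant by default).
response = ollama.chat(
    model="deepseek-r1:70b",
    messages=[{"role": "user", "content": "Explain GRPO in two sentences."}],
)
print(response["message"]["content"])
```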
GPU utilization shoots up here, as expected when compared to the mostly CPU-powered run of 671B that I showcased above.
## Resources
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
- [2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
- DeepSeek R1 - Notion (Building a fully local "deep researcher" with DeepSeek-R1 - YouTube)
- DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs
- The Illustrated DeepSeek-R1 - by Jay Alammar
- Explainer: What's R1 & Everything Else? - Tim Kellogg
- DeepSeek R1 Explained to your grandmother - YouTube
## DeepSeek
- Try R1 at chat.deepseek.com.
- GitHub - deepseek-ai/DeepSeek-R1.
- deepseek-ai/Janus-Pro-7B · Hugging Face (January 2025): Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It can both understand and generate images.
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (January 2025): This paper introduces DeepSeek-R1, an open-source reasoning model that rivals the performance of OpenAI's o1. It presents a detailed methodology for training such models using large-scale reinforcement learning techniques.
- DeepSeek-V3 Technical Report (December 2024): This report discusses the implementation of an FP8 mixed-precision training framework validated on an extremely large-scale model, achieving both accelerated training and reduced GPU memory usage.
- DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (January 2024): This paper delves into scaling laws and presents findings that facilitate the scaling of large-scale models in open-source configurations. It introduces the DeepSeek LLM project, dedicated to advancing open-source language models with a long-term perspective.
- DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence (January 2024): This research introduces the DeepSeek-Coder series, a range of open-source code models trained from scratch on 2 trillion tokens. The models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task to enhance code generation and infilling.
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024): This paper presents DeepSeek-V2, a Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.
- DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (June 2024): This research introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT-4 Turbo in code-specific tasks.
## Interesting events
- Hong Kong University replicates R1 results (Jan 25, '25).
- Huggingface announces huggingface/open-r1: a fully open reproduction of DeepSeek-R1 to replicate R1, fully open source (Jan 25, '25).
- OpenAI researcher confirms the DeepSeek team independently found and used some core ideas the OpenAI team used on the way to o1.
Liked this post? Join the newsletter.