diff --git a/Understanding-DeepSeek-R1.md b/Understanding-DeepSeek-R1.md
new file mode 100644
index 0000000..79ddc53
--- /dev/null
+++ b/Understanding-DeepSeek-R1.md
@@ -0,0 +1,92 @@
+
DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that's been making waves in the AI community. Not only does it match, or even surpass, OpenAI's o1 model in many benchmarks, but it also comes with fully MIT-licensed weights. This makes it the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible way.
+
What makes DeepSeek-R1 especially exciting is its transparency. Unlike the less-open approaches of some industry leaders, DeepSeek has published a detailed training methodology in their paper.
+The model is also remarkably cheap, with input tokens costing just $0.14-0.55 per million (vs o1's $15) and output tokens at $2.19 per million (vs o1's $60).
+
Until ~GPT-4, the conventional wisdom was that better models required more data and compute. While that's still true, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.
+
The Essentials
+
The DeepSeek-R1 paper presented multiple models, but the main ones among them are R1 and R1-Zero. Following these are a series of distilled models that, while interesting, I won't discuss here.
+
DeepSeek-R1 uses two major ideas:
+
1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by large-scale RL.
+2. Group Relative Policy Optimization (GRPO), a reinforcement learning approach that relies on comparing multiple model outputs per prompt to avoid the need for a separate critic.
+
R1 and R1-Zero are both reasoning models. This essentially means they do Chain-of-Thought before answering. For the R1 series of models, this takes the form of thinking within a <think> tag, before answering with a final summary.
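
To make the format concrete, here is a minimal sketch (my own illustration, not DeepSeek's code) of splitting such a response into its reasoning and answer parts, assuming the <think>...</think> convention from the R1 paper; the function name and example text are hypothetical:

```python
# Minimal sketch: split an R1-style completion into chain-of-thought and final answer.

def split_reasoning(response: str) -> tuple[str, str]:
    """Return (chain_of_thought, final_answer) from an R1-style completion."""
    open_tag, close_tag = "<think>", "</think>"
    if open_tag in response and close_tag in response:
        start = response.index(open_tag) + len(open_tag)
        end = response.index(close_tag)
        return response[start:end].strip(), response[end + len(close_tag):].strip()
    return "", response.strip()  # no thinking block found

cot, answer = split_reasoning("<think>2 + 2 = 4, and 4 * 3 = 12.</think>The answer is 12.")
print(cot)     # 2 + 2 = 4, and 4 * 3 = 12.
print(answer)  # The answer is 12.
```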
+
R1-Zero vs R1
+
R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base with no supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward.
+R1-Zero achieves excellent accuracy but sometimes produces confusing outputs, such as mixing multiple languages in a single response. R1 fixes that by incorporating limited supervised fine-tuning and multiple RL passes, which improves both accuracy and readability.
+
It is interesting how some languages may express certain ideas better, which leads the model to pick the most expressive language for the task.
+
Training Pipeline
+
The training pipeline that DeepSeek published in the R1 paper is immensely interesting. It showcases how they developed such strong reasoning models, and what you can expect from each stage. This includes the problems that the resulting models from each stage have, and how they solved them in the next stage.
+
It's interesting that their training pipeline differs from the usual one:
+
The usual training approach: Pretraining on a large dataset (training to predict the next word) to get the base model → supervised fine-tuning → preference tuning via RLHF
+R1-Zero: Pretrained → RL
+R1: Pretrained → Multi-stage training pipeline with multiple SFT and RL stages
+
Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to ensure the RL process has a decent starting point. This gives the model a good foundation to begin RL from.
+First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as forcing the chain-of-thought into thinking tags). When they were near convergence in the RL process, they moved to the next step. The result of this step is a strong reasoning model but with weak general capabilities, e.g., poor formatting and language mixing.
+Rejection Sampling + general data: Create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model (a minimal sketch of this filtering step follows the list). They collected around 600k high-quality reasoning samples.
+Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step resulted in a strong reasoning model with general capabilities.
+Second RL Stage: Add more reward signals (helpfulness, harmlessness) to refine the final model, in addition to the reasoning rewards. The result is DeepSeek-R1.
+They also did model distillation for several Qwen and Llama models on the reasoning traces to get distilled-R1 models.
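
To make the rejection-sampling step concrete, here is a minimal sketch of one way it could work: sample several completions per prompt from the RL checkpoint and keep only those an automatic correctness/format filter accepts. The `generate` and `accept` helpers are hypothetical placeholders, not DeepSeek's code:

```python
# Minimal sketch of rejection sampling for building SFT data: sample candidates from
# the RL checkpoint and keep only the ones an automatic filter accepts.

def rejection_sample(prompts, generate, accept, n_samples=16):
    """Collect (prompt, completion) pairs whose completions pass the filter."""
    sft_data = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n_samples)]
        kept = [c for c in candidates if accept(prompt, c)]
        if kept:
            # Keep one accepted completion per prompt to limit near-duplicates.
            sft_data.append({"prompt": prompt, "completion": kept[0]})
    return sft_data
```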
+
Model distillation is a technique where you use a teacher model to improve a student model by generating training data for the student model.
+The teacher is typically a larger model than the student.
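
For the distilled R1 models, distillation amounts to supervised fine-tuning on the teacher's reasoning traces. A minimal sketch of that objective, assuming a Hugging Face-style causal LM and tokenizer (the helper name and the approximate prompt-length masking are illustrative assumptions):

```python
# Minimal sketch of distillation via SFT on teacher traces: the student is trained to
# imitate the teacher's full reasoning trace with ordinary next-token cross-entropy.

def distillation_loss(student, tokenizer, prompt: str, teacher_trace: str):
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + teacher_trace, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, :prompt_len] = -100   # only the teacher's completion contributes to the loss
    return student(input_ids=full_ids, labels=labels).loss
```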
+
Group Relative Policy Optimization (GRPO)
+
The fundamental idea behind using reinforcement learning for LLMs is to fine-tune the model's policy so that it naturally produces more accurate and useful answers.
+They used a reward system that checks not only for correctness but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these quality criteria.
+
In this paper, they encourage the R1 model to generate chain-of-thought reasoning through RL training with GRPO.
+Rather than adding a separate module at inference time, the training process itself nudges the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.
+
What makes their approach particularly interesting is its reliance on straightforward, rule-based reward functions.
+Instead of depending on expensive external models or human-graded examples as in traditional RLHF, the RL used for R1 uses simple criteria: it might give a higher reward if the answer is correct, if it follows the expected format, and if the language of the answer matches that of the prompt.
+Not relying on a reward model also means you don't have to spend time and effort training it, and it doesn't take memory and compute away from your main model.
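
Here is a minimal sketch of what such a rule-based reward could look like. The paper doesn't spell out exact rules or weights, so the checks, the weights, and the `same_language` helper below are illustrative assumptions:

```python
import re

# Minimal sketch of a rule-based reward in the spirit described above: no learned
# reward model, just checks for correctness, format, and language consistency.

THINK_RE = re.compile(r"^<think>.*</think>.+", re.DOTALL)

def rule_based_reward(prompt: str, response: str, reference_answer: str, same_language) -> float:
    reward = 0.0
    # 1. Accuracy: the final answer (text after </think>) matches the reference.
    final_answer = response.split("</think>")[-1].strip()
    if final_answer == reference_answer:
        reward += 1.0
    # 2. Format: the reasoning is wrapped in <think>...</think> before the answer.
    if THINK_RE.match(response.strip()):
        reward += 0.5
    # 3. Language consistency: the answer stays in the language of the prompt.
    if same_language(prompt, final_answer):
        reward += 0.5
    return reward
```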
+
GRPO was introduced in the DeepSeekMath paper. Here's how GRPO works (a minimal sketch of the advantage computation follows the list):
+
1. For each input prompt, the model generates several responses.
+2. Each response gets a scalar reward based on factors like accuracy, formatting, and language consistency.
+3. Rewards are adjusted relative to the group's performance, essentially measuring how much better each response is compared to the others.
+4. The model updates its policy slightly to favor responses with higher relative rewards. It only makes small adjustments, using techniques like clipping and a KL penalty, to make sure the policy doesn't drift too far from its original behavior.
+
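Steps 3 and 4 are the heart of the method. Here is a minimal sketch of the group-relative advantage computation, with the clipped, KL-regularized update only hinted at in a comment; the shapes and the epsilon constant are my own illustrative choices:

```python
import torch

# Minimal sketch of GRPO's group-relative advantages: each reward is normalized
# against the other responses sampled for the *same* prompt, so no learned value
# network (critic) is needed.

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: shape (group_size,), one scalar reward per sampled response."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

print(group_relative_advantages(torch.tensor([2.0, 0.0, 1.5, 0.5])))

# Each response's tokens then share its advantage, and the policy is updated with a
# PPO-style clipped ratio plus a KL penalty toward a reference policy, roughly:
#   loss = -min(ratio * adv, clip(ratio, 1 - eps, 1 + eps) * adv) + beta * KL(policy || ref)
```
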
A cool aspect of GRPO is its flexibility. You can use simple rule-based reward functions, for instance, awarding a bonus when the model correctly uses the expected syntax, to guide the training.
+
While DeepSeek used GRPO, you could use alternative methods instead (PPO or PRIME).
+
For those looking to dive deeper, Will Brown has written quite a nice implementation of training an LLM with RL using GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another good resource.
+Finally, Yannic Kilcher has a great video explaining GRPO by going through the DeepSeekMath paper.
+
Is RL on LLMs the path to AGI?
+
As a final note on explaining DeepSeek-R1 and the methods they've presented in their paper, I want to highlight a passage from the DeepSeekMath paper, based on a point Yannic Kilcher made in his video.
+
"These findings suggest that RL enhances the model's overall performance by rendering the output distribution more robust; in other words, it appears that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities."
+
To put it simply, RL fine-tuning tends to shape the output distribution so that the highest-probability outputs are more likely to be correct, even though the overall capability (as measured by the diversity of correct answers) is largely already present in the pretrained model.
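
One way to make that claim measurable (my own illustration, not an experiment from the paper): compare top-1 accuracy with pass@k before and after RL; distribution sharpening shows up as top-1 rising while pass@k stays roughly flat. The `sample_answers` and `is_correct` helpers are hypothetical:

```python
# Illustrative only: if RL mainly sharpens the distribution, top-1 accuracy rises
# after RL while pass@k (is *any* of k samples correct?) stays roughly where the
# base model already was.

def top1_accuracy(model, problems, sample_answers, is_correct):
    return sum(is_correct(p, sample_answers(model, p, n=1)[0]) for p in problems) / len(problems)

def pass_at_k(model, problems, sample_answers, is_correct, k=16):
    hits = sum(any(is_correct(p, a) for a in sample_answers(model, p, n=k)) for p in problems)
    return hits / len(problems)
```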
+
This suggests that reinforcement learning on LLMs is more about refining and "shaping" the existing distribution of responses rather than endowing the model with entirely new capabilities.
+Consequently, while RL techniques such as PPO and GRPO can produce substantial performance gains, there appears to be an inherent ceiling determined by the underlying model's pretrained knowledge.
+
It is unclear to me how far RL will take us. Perhaps it will be the stepping stone to the next big milestone. I'm excited to see how it unfolds!
+
Running DeepSeek-R1
+
I've used DeepSeek-R1 via the official chat interface for various problems, which it seems to solve well enough. The additional search functionality makes it even nicer to use.
+
Interestingly, o3-mini(-high) was released as I was writing this post. From my initial testing, R1 seems stronger at math than o3-mini.
+
I also rented a single H100 via Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some experiments.
+The main goal was to see how the model would perform when deployed on a single H100 GPU, not to extensively evaluate the model's capabilities.
+
671B via Llama.cpp
+
DeepSeek-R1 1.58-bit (UD-IQ1_S) quantized model by Unsloth, with a 4-bit quantized KV-cache and partial GPU offloading (29 layers running on the GPU), via llama.cpp:
+
29 layers seemed to be the sweet spot given this configuration.
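
For reference, a minimal sketch of loading a GGUF quant with the same kind of partial offload through llama-cpp-python; the file path and prompt are illustrative, and the 4-bit KV-cache quantization is set via llama.cpp's --cache-type-k flag rather than shown here:

```python
from llama_cpp import Llama  # llama-cpp-python bindings for llama.cpp

# Minimal sketch: a GGUF quant with 29 layers offloaded to the GPU. The model path
# is a hypothetical local file, and the DeepSeek chat template is omitted for brevity.

llm = Llama(
    model_path="DeepSeek-R1-UD-IQ1_S.gguf",  # hypothetical path to the Unsloth quant
    n_gpu_layers=29,                         # partial offload: 29 layers on the GPU
    n_ctx=4096,
)

out = llm("Explain in one sentence why the sky is blue.", max_tokens=256)
print(out["choices"][0]["text"])
```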
+
Performance:
+
An r/localllama user reported that they were able to get over 2 tok/sec with DeepSeek R1 671B, without using their GPU, on their local gaming setup.
+Digital Spaceport wrote a full guide on how to run DeepSeek R1 671B fully locally on a $2000 EPYC server, on which you can get ~4.25 to 3.5 tokens per second.
+
As you can see, the tokens/s isn't quite bearable for any serious work, but it's fun to run these huge models on accessible hardware.
+
What matters most to me is a combination of usefulness and time-to-usefulness in these models. Since reasoning models need to think before answering, their time-to-usefulness is usually higher than that of other models, but their usefulness is also usually higher.
+We need to both maximize usefulness and minimize time-to-usefulness.
+
70B through Ollama
+
70.6B params, 4-bit KM-quantized DeepSeek-R1 running via Ollama:
+
GPU usage shoots up here, as expected when compared to the mainly CPU-powered run of 671B that I showcased above.
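
A minimal sketch of querying a model like this through the Ollama Python client, assuming it has been pulled under the `deepseek-r1:70b` tag:

```python
import ollama  # the official Ollama Python client

# Minimal sketch: chat with the 70B distill served by a local Ollama instance.
# Assumes `ollama pull deepseek-r1:70b` has been run beforehand.

response = ollama.chat(
    model="deepseek-r1:70b",
    messages=[{"role": "user", "content": "How many r's are in 'strawberry'?"}],
)
print(response["message"]["content"])  # the reply includes the <think> block
```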
+
Resources
+
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
+[2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
+DeepSeek R1 - Notion (Building a fully local "deep researcher" with DeepSeek-R1 - YouTube).
+DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs.
+The Illustrated DeepSeek-R1 - by Jay Alammar.
+Explainer: What's R1 & Everything Else? - Tim Kellogg.
+DeepSeek R1 Explained to your grandmother - YouTube
+
DeepSeek
+
- Try R1 at chat.deepseek.com.
+GitHub - deepseek-ai/DeepSeek-R1.
+deepseek-ai/Janus-Pro-7B · Hugging Face (January 2025): Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It can both understand and generate images.
+DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (January 2025) This paper introduces DeepSeek-R1, an open-source reasoning model that rivals the performance of OpenAI's o1. It provides a detailed methodology for training such models using large-scale reinforcement learning techniques.
+DeepSeek-V3 Technical Report (December 2024) This report discusses the implementation of an FP8 mixed precision training framework validated on an extremely large-scale model, achieving both accelerated training and reduced GPU memory usage.
+DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (January 2024) This paper delves into scaling laws and presents findings that facilitate the scaling of large-scale models in open-source settings. It introduces the DeepSeek LLM project, dedicated to advancing open-source language models with a long-term perspective.
+DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence (January 2024) This research introduces the DeepSeek-Coder series, a range of open-source code models trained from scratch on 2 trillion tokens. The models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task to enhance code generation and infilling.
+DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024) This paper presents DeepSeek-V2, a Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.
+DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (June 2024) This research introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT-4 Turbo in code-specific tasks.
+
Interesting events
+
- Hong Kong University replicates R1 results (Jan 25, '25).
+- Hugging Face announces huggingface/open-r1: Fully open reproduction of DeepSeek-R1, fully open source (Jan 25, '25).
+- OpenAI researcher confirms the DeepSeek team independently found and used some core ideas the OpenAI team used on the way to o1.
+
Liked this post? Join the newsletter.
\ No newline at end of file