For local runs, I made some GGUFs! You need around RAM + VRAM >= 250GB for good performance with the dynamic 2-bit quant (2-bit MoE layers, 6-8-bit for the rest). You can also do SSD offloading, but it'll be slow.
```
./llama.cpp/llama-cli -hf unsloth/DeepSeek-V3.1-GGUF:UD-Q2_K_XL -ngl 99 --jinja -ot ".ffn_.*_exps.=CPU"
```
More details on running + optimal params here: https://docs.unsloth.ai/basics/deepseek-v3.1
By the way, I'm wondering why unsloth (a goddamn python library) tries to run apt-get with sudo (and fails on my NixOS). Like how tf are we supposed to use that?
Oh hey I'm assuming this is for conversion to GGUF after a finetune? If you need to quantize to GGUF Q4_K_M, we have to compile llama.cpp, hence apt-get and compiling llama.cpp within a Python shell.
There is a way to convert to Q8_0, BF16, F16 without compiling llama.cpp; it's enabled if you use `FastModel` and not `FastLanguageModel`.
Essentially I try `sudo apt-get`; if that fails, then plain `apt-get`; and if all fails, it just fails. We need `build-essential cmake curl libcurl4-openssl-dev`.
See https://github.com/unslothai/unsloth-zoo/blob/main/unsloth_z...
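Paraphrased, that fallback looks something like this (a hedged sketch, not the actual unsloth-zoo code - see the linked file for the real thing):

```python
# Hypothetical sketch of the install fallback described above.
import subprocess

PACKAGES = ["build-essential", "cmake", "curl", "libcurl4-openssl-dev"]

def try_install(packages=PACKAGES):
    for prefix in (["sudo", "apt-get"], ["apt-get"]):
        try:
            subprocess.run(prefix + ["install", "-y"] + packages, check=True)
            return True
        except (subprocess.CalledProcessError, FileNotFoundError):
            continue  # sudo (or apt-get itself) missing, or install failed
    return False  # "if all fails, it just fails"
```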
It seems Unsloth is useful and popular, and you seem responsive and helpful. I'd be down to try to improve this and maybe package Unsloth for Nix as well, if you're up for reviewing and answering questions; seems fun.
Imo it's best to just depend on the required fork of llama.cpp at build time (or not) according to some configuration. Installing things at runtime is nuts (especially if it means modifying the existing install path). But if you don't want to do that, I think this would also be an improvement:

- see if llama.cpp is on the PATH and already has the requisite features
- if not, check /etc/os-release to determine the distro
- if unavailable, guess the distro class based on the presence of high-level package managers (apt, dnf, yum, zypper, pacman) on the PATH
- bail, explain the problem to the user, and give copy/paste-friendly instructions if we managed to figure out where we're running (a sketch follows this list)
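For concreteness, that detection order might look roughly like this (a hypothetical sketch; the binary names and fallback order are assumptions):

```python
# Hypothetical sketch of the detection order suggested above.
import shutil

def detect_environment():
    # 1. llama.cpp already on PATH with the needed tools?
    if shutil.which("llama-quantize") and shutil.which("llama-cli"):
        return "ready", None  # nothing to install
    # 2. check /etc/os-release to determine the distro
    distro = ""
    try:
        with open("/etc/os-release") as f:
            fields = dict(line.rstrip().split("=", 1) for line in f if "=" in line)
        distro = fields.get("ID", "").strip('"')
    except FileNotFoundError:
        pass
    # 3. guess the distro class from package managers on the PATH
    for pm in ("apt-get", "dnf", "yum", "zypper", "pacman"):
        if shutil.which(pm):
            return distro, pm
    # 4. bail: caller prints copy/paste-friendly instructions
    return distro, None
```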
Is either sort of change potentially agreeable enough that you'd be happy to review it?

As an update, I pushed https://github.com/unslothai/unsloth-zoo/commit/ae675a0a2d20...
(1) Removed and disabled sudo
(2) Installing via apt-get will ask user's input() for permission
(3) Added an error if llama.cpp fails, with instructions to manually compile llama.cpp
Thanks for the suggestions! Apologies again - I'm pretty bad at packaging, hence the current setup.
1. So I added a `check_llama_cpp` which checks if llama.cpp already exists and, if so, uses the prebuilt one: https://github.com/unslothai/unsloth-zoo/blob/main/unsloth_z...
2. Yes I like the idea of determining distro
3. Agreed on bailing - I was also thinking of doing a Python input() with a 30-second timeout for apt-get, if that's OK? We tell the user we will apt-get some packages (only if apt exists, and no sudo), and after 30 seconds it'll just error out - see the sketch below
4. I will remove sudo immediately (ie now), and temporarily just do (3)
But more than happy to fix this asap - again sorry for being dumb
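A minimal sketch of the timed consent prompt floated in (3) (hypothetical code; Unix-only, since it waits on stdin with select):

```python
# Hypothetical sketch of the 30-second consent prompt from (3).
import select
import sys

def confirm_apt_get(timeout=30):
    print("Unsloth wants to run `apt-get install build-essential cmake "
          "curl libcurl4-openssl-dev` (no sudo). Press Enter to continue...")
    ready, _, _ = select.select([sys.stdin], [], [], timeout)
    if not ready:
        raise RuntimeError(
            "No confirmation within 30 seconds - please compile llama.cpp "
            "manually (see the Unsloth docs) and re-run."
        )
    sys.stdin.readline()  # consume the Enter keypress
```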
You realize that not everyone is on a Debian derived distro, right? The right thing to do is handle package management separately from your actual code.
From the code:

    f"[FAIL] Unsloth: apt-get does not exist when installing {package}? Is this NOT a Linux / Mac based computer?"

Not all Linux systems use apt-get, and Macs don't use apt-get. Was this whole thing hallucinated by an LLM?

You should never do anything like this. Package management should be handled separately from your code. You should just specify what your dependencies are, and if you want to make it nicer, provide packages in the system package format that let users install dependencies, or maybe add a helper script ("Install Debian/Ubuntu dependencies"), but you should never just go ahead and assume that it's a Debian/Ubuntu system and try to use sudo to install random packages.
I added it since many people who use Unsloth don't know how to compile llama.cpp, so the only way from Python's side is to either (1) install it via apt-get within the Python shell, or (2) error out, tell the user to install it first, then continue again
I chose (1) since it was mainly for ease of use for the user - but I agree it's not a good idea sorry!
:( I also added a section to manually compile llama.cpp here: https://docs.unsloth.ai/basics/troubleshooting-and-faqs#how-...
But I agree I should remove apt-gets - will do this asap! Thanks for the suggestions :)
Hey man, I've been seeing your comments, and you do seem to respond to each and every one nicely regarding this sudo shenanigan.
I think you have removed sudo now, which is nice. My suggestion is pretty similar to pxc's (basically detect the different distros and handle them as such).
I wonder if we will ever get a working universal package manager on Linux. To me, Flatpak genuinely makes the most sense, sometimes even for CLI tools, but Flatpak isn't built for CLI, unlike Snap, which supports both CLI and GUI - but Snap is proprietary.
Hey :) I love suggestions and keep them coming! :)
I agree on handling different distros - sadly I'm not familiar with others, so any help would be appreciated! For now I'm most familiar with apt-get, but would 100% want to expand out!
Interesting will check flatpak out!
> Was this whole thing hallucinated by an LLM?
probably not, because LLMs are a little more competent than this
Dude, this is NEVER ok. What in the world??? A third party LIBRARY running sudo commands? That’s just insane.
You just fail and print a nice error message telling the user exactly what they need to do, including the exact apt command or whatever that they need to run.
As an update, I pushed https://github.com/unslothai/unsloth-zoo/commit/ae675a0a2d20...
(1) Removed and disabled sudo
(2) Installing via apt-get will ask user's input() for permission
(3) Added an error if llama.cpp fails, with instructions to manually compile llama.cpp
Again apologies on my dumbness and thanks for pointing it out!
Yes I had that at the start, but people kept complaining they don't know how to actually run terminal commands, hence the shortcut :(
I was thinking I could do it during the pip install, or via setup.py, which would do the apt-get instead.
As a fallback, I'll probably for now remove shell executions and just warn the user
Don't optimize for these people.
Yep agreed - I primarily thought it was a reasonable "hack", but it's pretty bad security wise, so apologies again.
The current solution hopefully is in between - ie sudo is gone, apt-get will run only after the user agrees by pressing enter, and if it fails, it'll tell the user to read docs on installing llama.cpp
How unusual is this for the ecosystem?
For such dynamic 2-bit quants, are there any benchmark results showing how much performance I would give up compared to the original model? Thanks.
Currently no, but I'm running them! Some people on the aider discord are running some benchmarks!
For reference, here is the terminal-bench leaderboard:
https://www.tbench.ai/leaderboard
Looks like it doesn't get close to GPT-5, Claude 4, or GLM-4.5, but still does reasonably well compared to other open weight models. Benchmarks are rarely the full story though, so time will tell how good it is in practice.
garbage benchmark, inconsistent mix of "agent tools" and models. if you wanted to present a meaningful benchmark, the agent tools would stay the same and then we could really compare the models.
there are plenty of other benchmarks that disagree with these, with that said. from my experience most of these benchmarks are trash. use the model yourself, apply your own set of problems and see how well it fares.
Hey. I like your roast on benchmarks.
I also publish my own evals on new models (using coding tasks that I curated myself, without tools, rated by human with rubrics). Would love you to check out and give your thoughts:
Example recent one on GPT-5:
https://eval.16x.engineer/blog/gpt-5-coding-evaluation-under...
All results:
https://eval.16x.engineer/evals/coding
My personal experience is that it produces high quality results.
Any example or prompt you used to make this statement?
I remember asking for quotes about the Spanish conquest of South America because I couldn't remember who said a specific thing. The GPT model started hallucinating quotes on the topic, while DeepSeek responded with, "I don't know a quote about that specific topic, but you might mean this other thing," or something like that, then cited a real quote on the same topic, after acknowledging that it wasn't able to find the one I had read in an old book. I don't use it for coding, but for things that are more unique I feel it's more precise.
Was that true for GPT-5? They claim it is much better at not hallucinating
I wonder if Conway's law is at all responsible for that, in the sense that regionally trained data carries concept biases which the model sends back in its responses.
I'm doing coreference resolution and this model (w/o thinking) performs at the Gemini 2.5-Pro level (w/ thinking_budget set to -1) at a fraction of the cost.
Strong claim there!
Vine is about the only benchmark I think is real.
We made objective systems turn out subjective answers… why the shit would anyone think objective tests would be able to grade them?
tbh companies like Anthropic and OpenAI create custom agents for specific benchmarks
Do you have a source for this? I’m intrigued
https://www-cdn.anthropic.com/07b2a3f9902ee19fe39a36ca638e5a... "we iteratively refine prompting by analyzing failure cases and developing prompts to address them."
Aren't good benchmarks supposed to be secret?
This industry is currently burning billions a month. With that much money around I don't think any secrets can exist.
The DeepSeek R1 in that list is the old model that's been replaced. Update: Understood.
Yes, and 31.3% is given in the announcement as the performance of the new v3.1, which would put it in sixteenth place.
Yeah, but the pricing is insane. I don't care about SOTA as long as it doesn't break my bank.
Depends on the agent. Ranks 5 and 15 are Claude 4 Sonnet, and this stands close to 15th.
Looks to be the ~same intelligence as gpt-oss-120B, but about 10x slower and 3x more expensive?
https://artificialanalysis.ai/models/deepseek-v3-1-reasoning
I don't think you're necessarily wrong, but your source is currently only showing a single provider. Comparing:
https://openrouter.ai/openai/gpt-oss-120b and https://openrouter.ai/deepseek/deepseek-chat-v3.1 for the same providers is probably better, although gpt-oss-120b has been around long enough to have more providers, and presumably for hosters to get comfortable with it / optimize hosting of it.
It's a hybrid reasoning model. It's good with tool calls and doesn't think too much about everything, but it regularly uses outdated tool formats randomly instead of the standard JSON format. I guess the V3 training set has a lot of those.
Did you try the strict (beta) function calling? https://api-docs.deepseek.com/guides/function_calling
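For reference, the request shape is OpenAI-compatible; the sketch below assumes the beta base URL and the `strict` flag placement described in the linked guide, so treat both as assumptions and verify against the docs:

```python
# Hedged sketch of a strict tool definition against DeepSeek's
# OpenAI-compatible API. The beta base_url and the "strict" flag
# placement are assumptions - check the linked guide.
from openai import OpenAI

client = OpenAI(api_key="sk-...", base_url="https://api.deepseek.com/beta")

tools = [{
    "type": "function",
    "function": {
        "name": "execute_shell",
        "description": "Run a shell command",
        "strict": True,  # assumed strict-mode flag
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}]

resp = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Show the current directory."}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)
```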
What formats? I thought the very schema of json is what allows these LLMs to enforce structured outputs at the decoder level? I guess you can do it with any format, but why stray from json?
Sometimes it will randomly generate something like this in the body of the text:

```
<tool_call>executeshell
<arg_key>command</arg_key>
<arg_value>echo "" >> novels/AI_Voodoo_Romance/chapter-1-a-new-dawn.txt</arg_value>
</tool_call>
```

or this:

```
<|toolcallsbegin|><|toolcallbegin|>executeshell<|toolsep|>{"command": "pwd && ls -la"}<|toolcallend|><|toolcallsend|>
```
Prompting it to use the right format doesn't seem to work. Claude, Gemini, GPT-5, and GLM 4.5 don't do that. To accommodate DeepSeek, the tiny agent that I'm building will have to support all the weird formats.
In the modes in APIs, the sampling code essentially "rejects and reinference" any token sampled that wouldn't create valid JSON under a grammar created from the schema. Generally, the training is doing 99% of the work, of course; it's just that "strict" means "we'll check its work to the point that a GBNF grammar created from the schema will validate."
One of the funnier info scandals of 2025 has been that only Claude was even close to properly trained on JSON file edits until o3 was released, and even then it needed a bespoke format. Geminis have required using a non-formalized diff format by Aider. It wasn't until June that Gemini could do diff-string-in-JSON better than 30% of the time, and until GPT-5 that an OpenAI model could. (Though v4a, as OpenAI's bespoke edit format is called, is fine because it at least worked well in tool calls. Gemini's was a clown show; you had to post-process regular text completions to parse out any diffs.)
> In the modes in APIs, the sampling code essentially "rejects and reinference" any token sampled that wouldn't create valid JSON under a grammar created from the schema.
I thought the APIs in use generally interface with backend systems supporting logit manipulation, so there is no need to reject and reinference anything; it's guaranteed right the first time because any token that would be invalid has a 0% chance of being produced.
I guess for the closed commercial systems that's speculative, but all the discussion of the internals of the open source systems I’ve seen has indicated that and I don't know why the closed systems would be less sophisticated.
I maintain a cross-platform llama.cpp client - you're right to point out that generally we expect nuking logits can take care of it.
There is a substantial performance cost to nuking; the open source internals discussion may have glossed over that for clarity (see the github.com/llama.cpp/... link below). The cost is very high, so the default in the API* is to not artificially lower other logits, and to only do that if the first inference attempt yields a token that is invalid under the compiled grammar.
Similarly, I was hoping to be on target w/r/t what strict mode is in an API, and am sort of describing the "outer loop" of sampling.
* blissfully, you do not have to implement it manually anymore - it is a parameter in the sampling params member of the inference params
* "the grammar constraints applied on the full vocabulary can be very taxing. To improve performance, the grammar can be applied only to the sampled token..and nd only if the token doesn't fit the grammar, the grammar constraints are applied to the full vocabulary and the token is resampled." https://github.com/ggml-org/llama.cpp/blob/54a241f505d515d62...
Thanks for the explanation!
Sad to see the off peak discount go. I was able to crank tokens like crazy and not have it cost anything. That said the pricing is still very very good so I can't complain too much.
It seems behind Qwen3 235B 2507 Reasoning (which I like) and gpt-oss-120B: https://artificialanalysis.ai/models/deepseek-v3-1-reasoning
Pricing: https://openrouter.ai/deepseek/deepseek-chat-v3.1
Those Qwen3 2507 models are the local creme-de-la-creme right now. If you've got any sort of GPU and ~32gb of RAM to play with, the A3B one is great for pair-programming tasks.
Do we get these good qwen models when using qwen-code CLI tool and authing via qwen.ai account?
Do you happen to know if it can be run via an eGPU enclosure with f.ex. RTX 5090 inside, under Linux?
I've been considering buying a Linux workstation lately, and I want it full AMD. But if I can just plug in an NVIDIA card via an eGPU enclosure for self-hosting LLMs, then that would be amazing.
I'm running Ollama on 2 eGPUs over Thunderbolt. Works well for me. You're still dealing with an NVIDIA device, of course. The connection type is not going to change that hassle.
Thank you for the validation. As much as I don't like NVIDIA's shenanigans on Linux, having a local LLM is very tempting and I might put my ideological problems to rest over it.
Though I have to ask: why two eGPUs? Is the LLM software smart enough to be able to use any combination of GPUs you point it at?
Yes, Ollama is very plug-and-play when it comes to multi GPU.
llama.cpp probably is too, but I haven't tried it with a bigger model yet.
Even today some progress was released on parallelizing WAN video generation over multiple GPUs. LLMs are way easier to split up.
You would still need drivers and all the stuff that's difficult with NVIDIA on Linux with an eGPU (it's not necessarily terrible, just suboptimal). Rather just add the second GPU in the workstation, or just run the LLM on your AMD GPU.
Oh, we can run LLMs efficiently with AMD GPUs now? Pretty cool, I haven't been following, thank you.
I've been running LLM models on my Radeon 7600 XT 16GB for the past 2-3 months without issues (Windows 11). I've been using llama.cpp only. The only thing from AMD I installed (apart from the latest Radeon drivers) is the "AMD HIP SDK" (a very straightforward installer). After unzipping (the zip from the GitHub releases page must contain hip-radeon in the name), all I do is this:
```
llama-server.exe -ngl 99 -m Qwen3-14B-Q6_K.gguf
```
And then I connect to llama.cpp via browser at localhost:8080 for the WebUI (it's basic but does the job; screenshots can be found on Google). You can connect more advanced interfaces to it because llama.cpp actually has an OpenAI-compatible API.
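For example, any OpenAI-style client can point at it (a minimal sketch; the model name is an arbitrary label when a single model is loaded):

```python
# Point an OpenAI-compatible client at the local llama-server endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")
resp = client.chat.completions.create(
    model="Qwen3-14B",  # arbitrary when one model is loaded
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```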
Sure, though you'll be bottlenecked by the interconnect speed if you're tiling between system memory and the dGPU memory. That shouldn't be an issue for the 30B model, but would definitely be an issue for the 480B-sized models.
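Rough numbers (assuming a Thunderbolt-class link at ~40 Gbit/s ≈ 5 GB/s, and that the weights read for each token sit on the far side of the link): throughput is roughly link bandwidth divided by bytes crossing per token, so ~50 GB of offloaded weights touched per token gives 5 GB/s ÷ 50 GB ≈ 0.1 t/s, whereas a 30B-class quant that fits entirely in the card's VRAM never pays that cost.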
I use it on a 24gb gpu Tesla P40. Very happy with the result.
Out of interest, roughly how many tokens per second do you get on that?
Like 4. Definitely single digit. The P40s are slow af
P40 has memory bandwidth of 346GB/s which means it should be able to do around 14+ t/s running a 24 GB model+context.
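That estimate is just bandwidth divided by bytes read per token: 346 GB/s ÷ 24 GB ≈ 14.4 t/s, and it's an upper bound, since compute and overhead push real throughput lower.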
With qwen code?
Seems to hallucinate more than any model I've ever worked with in the past 6 months.
What context length did you use?
Did they "borrow" bad data this time?
About halfway between V3 and Qwen3 Coder.
https://brokk.ai/power-ranking?version=openround-2025-08-20&...
Is gpt-5 Mini free from any providers?
Duck.ai has it as an option
Looks quite competitive among open-weight models, but I guess it's still well behind GPT-5 or Claude.
I have yet to see evidence that it is better for agentic coding tasks than GLM-4.5
Is that it? Nothing else you haven't seen evidence for?
Unrelated, but it would really be nice to have a chart breaking down Price Per Token Per Second for various model, prompt, and hardware combinations.
There is one: https://pricepertoken.com/
Claude's Opus pricing is nuts. I'd be surprised if anyone uses it without the top max subscription.
Sure I do, but not as part of any tools, just for one-off conversations where I know it's going to be the best out there. For tasks where reasoning helps little to none, it's often still number one.
Some people have startup credits
Cheep!
$0.56 per million tokens in — and $1.68 per million tokens out.
That's actually a big bump from the previous pricing: $0.27/$1.10
And unfortunately no more half price 8-hours a day either :(
They say the SWE bench verified score is 66%. Claude Sonnet 4 is 67%. Not sure if the 1% difference here is statistically significant or not.
I'll have to see how things go with this model after a week, once the hype has died down.
I'm doing this model