Local AI needs to be the norm

(unix.foo)

254 points | by cylo 4 hours ago

39 comments

  • TheJCDenton 1 hour ago
    For the mainstream audience, the sentiment around local AI today is the same as what they had around open source a few decades ago. For a few product categories, some paid solutions were so much more advanced that open source was very often completely overlooked. Why bother? And the like. Then we got captive SaaS and other platforms, and now it's obviously wrong for most of us.

    The dependency we have on Anthropic and OpenAI for coding, for instance, is insane. Most accept it because either they don't care, or they just hope the Chinese labs will never stop releasing open weights. The business model of open weights is very new, involves power plays between countries and labs, and moves an absurd amount of money without any concrete oversight from most people.

    It's a very dangerous gamble. Today incredible value is available to nearly everyone. But it may stop without any warning, for reasons outside our control.

    • apublicfrog 2 minutes ago
      > It's a very dangerous gamble. Today incredible value is available to nearly everyone. But it may stop without any warning, for reasons outside our control.

      What stops you from running the best open-weight LLMs currently available on consumer-grade hardware for the rest of time? They're good enough for 95% of use cases, and they don't have a use-by date. From what I can see, the "danger" is not having the next tier that comes out, but the impact of that is very low.

    • slicktux 7 minutes ago
      I’m just waiting for the US Government to implement their own local AI. Which will eventually lead to them open-sourcing it, because it's taxpayer funded. And given that the NSA has decades' worth of internet data they can train on, the open weights would be just as good as any company's…
    • oytis 1 hour ago
      What is the business model of open weight AI? I don't think there is any. At best it can serve as an advertisement for the more advanced models you sell.

      The huge difference to open source is that you can't just train an LLM with free time and motivation. You need lots of data and a lot of compute.

      I sure want to be wrong on that; I definitely like the open-weight version of the future more.

      • wood_spirit 1 hour ago
        Meta released Llama just when OpenAI was so hot and its valuation was going through the roof. Speculating, but Meta probably thought the model wasn't competitive enough to keep as a secret weapon, but was good enough to commercially damage OpenAI, who had suddenly become a competitor for most-valued company.

        In the same way you can imagine the Chinese government pushing the release of deepseek etc to make sure no one thinks the US has “won” and to keep everyone aware that a foreign model might leapfrog in the short term future etc.

        At some point, though, if OpenAI/Anthropic/Google plateau or go bust, then the open source sponsorship becomes less likely, as making it open source was a weapon, not a principle.

        • 2ndorderthought 28 minutes ago
          I disagree. I think DeepSeek, Qwen, and Kimi earn a lot of trust by open-sourcing their models, while still profiting.

          Effectively they are saying "yea, don't crowd our data centers with small queries, go ahead and send your frontier questions to our frontier models. Oh, btw, those US models? You can run something about as good for free from us if you want, hah." It's a power and marketing move. It's also insanely smart for staying sustainable as a brand, especially given how small their investments into this are.

          Look at Anthropic's growing pains. DeepSeek has other hosts spreading their brand for free while they grow. Brilliant, honestly. In my opinion it makes Anthropic and OpenAI look clueless on a lot of levels.

          China is playing a different game here. To them this is commoditizing their complement and building goodwill. The Chinese economy doesn't teeter on the brink of collapse to deliver frontier-grade LLMs. Nope, Alibaba just made Qwen because it needs it. It needs efficient models. Similarly, China manufactures and automates so much more than the US ever could. LLMs to them are a topping, not the whole meal like they are in the US.

      • PAndreew 1 hour ago
        Perhaps you can create a compelling UX around it and sell it as a subscription. "Normies" will not be able/willing to build it. You can then patch the model / ship new features around it as it evolves. For example, I have built an ambient todo list / health data extractor using Gemma 4 2EB and Whisper. Nothing to brag about, but it does a fairly decent job even in foreign languages.
      • majormajor 49 minutes ago
        > What is the business model of open weight AI? I don't think there is any. At best it can serve as an advertisement for the more advanced models you sell.

        I don't think local will necessarily be open-weight. And then it's not that different from personal computing: you're giving up the big lucrative corporate mainframe, thin-client model for "sell copies to a ton of individuals."

        So it'd be someone else (an Apple, or the present-day equivalent of 1976 Apple) who'd start eating into that. There are a few on-device things today, but not for much heavy lifting. At first it's a toy; it could become more realized on a still-toy-like basis, like a fully-local Alexa; in the future it grows until it eats 80-90% of the OpenAI/Anthropic use cases.

        Incumbents would always rather you pay a subscription or per-use forever, but if the market looks big enough, someone will try to disrupt it.

      • js8 42 minutes ago
        What is the business model of Wikipedia? I don't think there is any.

        Not everything good in our society needs to have a "business model". People still work on it. It's FINE.

        • sroussey 6 minutes ago
          > What is the business model of Wikipedia?

          Donations. Have you donated lately?

          Wikipedia is cheap compared to creating and training models.

          I don’t think donations will suffice at all.

          As an example, we had millions of web developers download and install Firebug before browsers shipped their own dev tools. Donations over the course of multiple years would have paid my salary for a month if I were not a volunteer.

          But from the “it’s fine” point of view, models will be baked into your OS.

          Then later, models will be embedded into hardware. Likely only the OS makers' models.

        • avidphantasm 24 minutes ago
          Ultimately, information is a public good: it is non-excludable (you can’t stop people from using it) and it is non-rival (we can all use it at the same time). Public goods are often very useful, and because they are non-excludable and non-rival, ultimately can’t have a market-based business model. I would class open-weights AI models as public goods, and would support government expenditure to produce them.
        • phainopepla2 17 minutes ago
          Training AI models is capital intensive, though. Unless there's some sort of mega-crowdfunding effort for open weight model training there needs to be a way to recoup that money on the other end. Either that or state sponsorship I guess
      • dleslie 38 minutes ago
        This is where government funding can play a role.

        Sometimes there are things where the public good is best served with public expenditure.

      • karussell 1 hour ago
        > What is the business model of open weight AI?

        This is what I do not understand as well and advertising the knowledge and more advanced model is also the only thing that comes to my mind.

        For the past month I have been using Gemma 4 locally, successfully, on an MBP M2 for many search queries (Wikipedia-style questions) and it is really good, fast enough (30-40 t/s), and feels nice as it keeps these queries private. But I don't understand why Google does this, and so I think "we" need to find a better solution where the entire pipeline is open and the compute somehow crowdfunded. Because there will be a time when these local models will get more closed, like Android is closing down. One restriction they might enforce in the future could be crippling the models for "sensitive" topics like cybersecurity or health. Or the government could even feel the need to force them to do so.

        • 2ndorderthought 1 hour ago
          Why would you want to try to support all users' simple queries on your AI data center if they could run them on their own computers?

          It also builds goodwill and shows research prowess.

          For China it's different. They need to show Americans, who don't trust them at all because of propaganda, that they have no tricks up their sleeve. It also doesn't hurt when Chinese companies drop models for free that people can run at home and that are about as good as Sonnet. Serious mic drop.

          • TheJCDenton 7 minutes ago
            Very good point on using local AI to avoid data center costs.

            Running AI models on local hardware was exploratory at first, and if it's so easy today it's thanks to open source. It's a little bit coincidental that we have this today, and that mainstream hardware has this capability. The fact that a phone can run very small models is exploratory, or some kind of marketing opportunity at best.

            Why would hardware companies ship cards with more AI capabilities (like more VRAM) in the foreseeable future? On what grounds will the marketing for on-device AI keep generating interest? For something this important, it's very uncertain. But above all, it should not depend on such brittle justifications.

            Showing goodwill in distribution and research prowess today is positive communication, but it can be exactly the opposite if/when an attack using those small models reaches a high-value target.

            For China, the cultural difference is so huge it's difficult to say. I would think they first and foremost need to show everyone inside and outside of China that they match American models. Second, I would say that where Americans prefer a few very powerful companies from the get-go, because they can leverage a lot of capital rapidly to industrialize, China will prefer leveraging a lot of smaller companies exploring a lot of things simultaneously (so doing a lot of research), THEN creating legislation to let only the best (or a few) survive effectively. In the end it's the same result (monopoly or oligopoly), but China may end up with a stronger core (research) and America with stronger productive capital, which may prove obsolete... In the long run, on either side, it's a gamble, again.

          • karussell 1 hour ago
            Indeed cost can be another factor. Maybe also the main reason why Chrome added an offline model.
            • 2ndorderthought 42 minutes ago
              That, and it's lucrative for Android/Chrome to have a text summarizer model embedded on your phone, probably for government contracts and data exfil, but we won't go there.
      • worldsayshi 1 hour ago
        It should be feasible to crowdfund training runs, right?
        • dmd 1 hour ago
          A training run costs somewhere in the neighborhood of a billion dollars. That’s a thousand millions.

          How many crowdfunded projects do you know that have raised even one percent of that? Who’s going to be in charge of collecting that scale of money? Perhaps some sort of company formed for the benefit of humanity, which will promise to be a non-profit? Some sort of “Open” AI?

          Oh, wait.

          • iugtmkbdfil834 56 minutes ago
            << That’s a thousand millions.

            I can't say that you are lying, and you are not exactly exaggerating either. It is true that a new SOTA model -- from literal scratch -- would be expensive.

            But, and it is not a small but, is the starting point really zero?

      • fragmede 45 minutes ago
        The business model is the total lack of attention Qwen and Kimi would get if their models weren't downloadable. Before they released the weights, basically zero attention was paid to them in the Western hemisphere, for whatever reason. By releasing the weights, they're relevant in the Western world. The business model is to get people in the West, who otherwise would never have heard of them, to pay to use their platform hosting their AI. As you said, advertising/marketing, essentially.
    • aabhay 1 hour ago
      Disagree with this. When cost becomes an important factor, or the free-but-worse option becomes compelling and accessible (i.e. an on-device agent via Apple-style UX), there has been a significant shift in user behavior towards local. Think about things like removing backgrounds from photos or OCR on PDFs: who uses paid services for casual usage of these?
    • iLoveOncall 15 minutes ago
      The mainstream audience does not have the faintest idea that "local AI" is even a thing.
      • CamperBob2 1 minute ago
        Just as their counterparts in 1975 had no idea that "personal computers" were even a thing.

        Read through an old issue of Popular Electronics and then surf /r/LocalLlama, and you'll get a sense of real-time deja vu.

    • RataNova 12 minutes ago
      [dead]
  • pronik 52 minutes ago
    They will be, and that moment is not that far off. We've got the progression in place already: first, only large data centers could run performant LLMs; we are now firmly in "a bunch of servers with a couple of H100s each" territory, slowly moving into "128 GB VRAM on a MacBook Pro or a Strix Halo". Within the next year, the pattern of "expensive remote LLM for planning, local slow-but-faster-than-human LLM for execution" will become the norm for companies, slowly moving to "using a local LLM for everything is good enough". And then we'll have the equilibrium we already have with the "classic cloud": you either self-host or pay for flexibility and speed. The question will be: how much of the current compute capacity craze will local hosting give the kiss of death to, and what will that mean for the market?
    • RataNova 10 minutes ago
      The biggest impact of local models may simply be that they prevent remote inference from becoming the only game in town
    • dakolli 39 minutes ago
      This is simply delusional. It costs $20-30k a month to run Kimi 2.6, and the tokens are sold at $3 per million.

      To sell tokens profitably you'd need to be able to run inference at 150 tokens per second for less than $1,000 USD a month.

      I don't think people realize how expensive it is to host decently capable models, and how much their use of capable models is subsidized.

      You can only squeeze so many parameters onto consumer-grade hardware (that's actually affordable; two 4090s is not consumer grade, and neither are 128GB MacBooks, which are incredibly expensive for the average person), and the models you can still run are not "good enough", they are still essentially useless.

      People are betting their competency on a future where billionaires are forever generous, subsidizing inference at a 10:1 or 20:1 loss ratio. Guess what, that WILL end, and probably soon. This idea that companies can afford to give you access to $2MM in GPUs for 5 hours a day at a rate of $200.00 a month is simply unsustainable.

      Right now they are trying to get you hooked. DON'T FALL FOR IT. Study, work hard, sweat, and you'll reap the benefits. The guy making handmade watches in Switzerland, one a month, makes a whole lot more than the guy running a manufacturing line making 50k in China. Just write your own fkin code, people.

      Don't bet your future on having access to some billionaire's thinking machine. Intelligence, knowledge, and competency aren't fungible; the LLM hype is a lie to convince you that they are.

      • zozbot234 30 minutes ago
        No one runs SOTA models 24/7 for individual use or even for a single household or small business, whereas you can run your own hardware basically 24/7 for AI inference.

        With the new DeepSeek V4 series and its uniquely memory-light KV cache you can even extend this to parallel inference in order to hide memory bandwidth bottlenecks and increase compute intensity.

        This is perhaps not so useful on a 128GB or 96GB RAM Apple Silicon device (I've seen recent reports of DS4 runs with even one agent flow hitting serious thermal and power limits on these devices, so increasing compute intensity will probably not be useful there) but it will become useful with 64GB devices or lower that have to stream from a slow disk, or with things like the DGX Spark or to a lesser extent Strix Halo, that greatly overprovision compute while being bottlenecked on memory bandwidth.

        • dakolli 21 minutes ago
          Please go let me know what that's actually useful for other than spawning your next AI girlfriend to role play with.
      • hparadiz 31 minutes ago
        Posts like this are so funny to me. I'm staring at a mountain of old hardware right now that cost about $20k ten years ago. I have to pay someone now to come haul it away. What makes you think the current new hardware won't meet the same fate?

        > Just write your own fkin code people

        Bro is nostalgic for googling random stack overflow threads for 10 days to figure out a bug the agent fixes in an hour.

        • dakolli 23 minutes ago
          I'm just saying that the agent that can fix your bugs actually costs $100-150 an hour to run, and you're getting it for essentially $200.00 a month.

          The cost of cloud compute actually hasn't gone down all that much for old hardware; it still costs $500.00 a year to rent a 4-core i7-7700K that's 10 years old. Don't expect much more valuable hardware, like modern GPUs, to deflate in price all that quickly.

          There are 3 fabs in the world that make DDR7, and they aren't going to be selling their stock to consumers going forward; it will be purchased almost entirely by datacenters and stay in them until EOL.

          Your brain is going to atrophy (this is proven), they'll raise the price to something that's closer to break-even, and you'll be forced to pay it because you no longer have those muscles.

        • cindyllm 26 minutes ago
          [dead]
      • nullc 23 minutes ago
        > two 4090s is not consumer grade

        I think that is a very narrow perspective. Enormous numbers of consumers own $50,000 cars, but a pair of $2000 GPUs is "not consumer"?

        I agree with your view that cheap tokens on SOTA are a trap-- people should use local AI or no AI.

        • dakolli 17 minutes ago
          I would still question what usefulness there is in a local model even with $10k in GPUs. I certainly haven't seen any great uses myself for these smaller models (<500B parameters), except claims from people who are totally enamored with AI and impressed by basically anything an LLM outputs, like a toddler entertained by the sound their velcro shoes make.
  • wrxd 16 minutes ago
    The example in the post confirms my theory that for local models to succeed they need to be "good enough", not so big that they can compete with frontier models.

    They need to be able to do a small task well and they need to be able to run reasonably on consumer-class devices. Even better if they can run on mobile phones.

    In my experiments with local LLMs I noticed that while increasing the size of the model is nice, the real thing that turns a nearly useless model into something useful is the ability to use tools. Giving my models the ability to search the web and fetch web pages did way more to solve hallucinations than getting a bigger model, and it sidesteps the training cutoff. Sure, the bigger model is probably better at using tools, but I often find the smaller models to be good enough.
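
    To make it concrete, here's roughly what the wiring looks like (a minimal sketch, assuming a local OpenAI-compatible server such as llama-server or Ollama on localhost; fetch_page is a naive stand-in for a real page-fetching tool):

      # Sketch: tool calling against a local OpenAI-compatible endpoint.
      # Assumes llama-server / Ollama / LM Studio listening on localhost;
      # fetch_page is a naive stand-in for a real helper.
      import json
      import urllib.request
      from openai import OpenAI

      client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

      def fetch_page(url: str) -> str:
          # A real helper would strip HTML boilerplate; this just truncates.
          with urllib.request.urlopen(url) as r:
              return r.read().decode("utf-8", errors="replace")[:8000]

      tools = [{
          "type": "function",
          "function": {
              "name": "fetch_page",
              "description": "Fetch a web page and return its text",
              "parameters": {
                  "type": "object",
                  "properties": {"url": {"type": "string"}},
                  "required": ["url"],
              },
          },
      }]

      messages = [{"role": "user", "content": "Summarize https://example.com"}]
      resp = client.chat.completions.create(model="local", messages=messages, tools=tools)
      msg = resp.choices[0].message

      if msg.tool_calls:  # the model decided to call the tool
          call = msg.tool_calls[0]
          messages.append(msg)
          messages.append({"role": "tool", "tool_call_id": call.id,
                           "content": fetch_page(**json.loads(call.function.arguments))})
          resp = client.chat.completions.create(model="local", messages=messages, tools=tools)

      print(resp.choices[0].message.content)  # answer grounded in the fetched page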

  • mattlondon 1 hour ago
    Yet there is another post a few rows down where people are losing their shit that Chrome has a local LLM model that uses a couple of GB of space for local-inference.

    Damned if they do, damned if they don't.

    • dlcarrier 1 hour ago
      Maybe don't use gigabytes of bandwidth and storage space, without asking.
      • hparadiz 29 minutes ago
        Easy. Stop using Chrome.
    • bytecauldron 59 minutes ago
      This is a bit disingenuous. People aren't losing their shit about a local model being installed. It's the lack of user autonomy. Just give the option to download a model instead of a silent install. It's not that hard. This is how every other local option works.
      • wmf 33 minutes ago
        AFAIK Apple and MS auto-download local models.
    • fg137 43 minutes ago
      You might want to read the comments to understand what people are actually complaining about.

      This comment is quite dishonest about the nature of the discussion.

    • aabhay 1 hour ago
      This is a weird take. If it's not opt-in, or you're shoehorning it into a browser, then that sucks. Nobody is getting enraged that an app for running local LLMs downloads data to do so.
      • avadodin 18 minutes ago
        Although you can opt out, and in some cases even disable the download feature when you build them, most of the local LLM tools are too download-happy by default.
    • themafia 1 hour ago
      If it was such a good and laudable idea why didn't they tell me about it before they activated it? It seems to me like they avoided it in the hopes that I wouldn't notice, because, presumably if I had, I would have IMMEDIATELY disabled it.

      Also, why doesn't their task manager show that it's actually the one downloading? Why does it go out of its way to hide this activity?

      Since I have conky on my desktop I could catch this immediately, and take the action I preferred with my own computer, which was to _immediately_ disable it.

      • StilesCrisis 1 hour ago
        I'm guessing you immediately close the What's New Chrome tab when you update?

        https://developer.chrome.com/blog/new-in-chrome-148#prompt-a...

        https://www.google.com/chrome/ai-innovations/

        They have absolutely not been shy about any of this.

        • themafia 56 minutes ago
          I've never had a "What's new" tab ever open because I disable the customized home page where that's displayed. I'm guessing you're not aware that's an option.

          Please show me where in either of those documents it explains it's going to download a 4GB model.

          • crazygringo 1 minute ago
            I use an extension that gives me a customized homepage, but I still always get the "what's new" tab on every major version upgrade.

            It's a totally separate tab that opens. It's got nothing to do with what you use as your homepage.

    • ekjhgkejhgk 1 hour ago
      You don't understand the difference between "I run a local LLM because I chose to" vs "The browser chose to run a local LLM and I have no say"? You don't understand?

      Not to mention that the LLM that I choose to run requires a monster machine and is infinitely more capable than whatever google chose to put on their browser?

      I mean, none of this affects me because I don't use chrome, obviously, but you don't see the difference? Bewildering.

      • StilesCrisis 59 minutes ago
        Did you opt into WebGPU? QUIC? Canvas 2D? Brotli? Browsers don't work that way.
        • za_creature 35 minutes ago
          The size difference between the local LLM and all of the above is about... the size of the local LLM.
  • scriptsmith 37 minutes ago
    I've got some demos of what the new Prompt API in Chrome, which uses a local model, can do: https://adsm.dev/posts/prompt-api/#what-could-you-build-with...

    As OP says, it shines in constrained environments where the model is transforming user-owned data. Definitely less useful for anything more open-ended.

    • 2ndorderthought 33 minutes ago
      Yea, I do not recommend treating Chrome's Prompt API as a good example of local LLMs. It's fine and all, but it's really weak. 8B models from a year ago are better in some ways, and a lot of the recent model drops are meaningfully better.
      • scriptsmith 30 minutes ago
        It's based on a Gemma 3n model, and yeah it's not the best. But if you have a use case that needs constrained JSON output for example, it's pretty neat.

        Maybe it would do better with the new Gemma 4 models, which the Chrome devs have been hinting at moving to. And why the API doesn't let you introspect / pick the model, I'm still not sure.

    • dakolli 13 minutes ago
      So you're running an LLM, on a 1,000-watt power supply, to do data transformation that deterministic processes would be much better suited for. Wild.
  • Guillaume86 13 minutes ago
    I think we should separate the private AI discussion from the local AI discussion. The pragmatic choice to run big LLMs is one/several big servers online, but that doesn't mean private companies should be the only ones to run them.

    A self-hosted inference solution that offers good tenant isolation guarantees (ideally zero trust) and is easy enough to deploy and maintain (think Plex for AI) would be my choice for privacy. Now, to be honest, I have done zero research on this and have zero idea how feasible it is; maybe it already exists and there are some Discord servers I should join?

    Edit: I don't need to mention it here, but what's incredible is that open models are in the ballpark of the best commercial models, so supposedly the hardest part by far is already solved.

  • timeattack 1 hour ago
    My problem with LLMs (apart from philosophical aspects and economic impact) is that it would be unlikely for any of us to be able to train something functional locally (toy-like LLMs, sure, but something really useful, no). Apart from requiring immense computing power, it also requires a dataset which is, for the most part, obtained illegally.
    • kibwen 1 hour ago
      This seems overly pessimistic.

      I may personally be of modest intelligence, but to acquire the intelligence that I do have, I did not need to train on every book ever written, every Wikipedia article ever written, every blog post ever written, every reference manual ever written, every line of code ever written, and so on. In fact, I didn't train on even 1% of those materials, or even 0.00000000001% of those. The texts themselves were demonstrably not a prerequisite for intelligence.

      At minimum, given that it only took me about 20 years of casual observation of my surroundings to approximate intelligence, this is proof positive that the only "dataset" you need is a bunch of sensors and the world around you.

      And yes, of course, the human brain does not start from zero; it had a few million years of evolution to produce a fertile plot for intelligence to take root. But that fundamental architecture is fairly generic, and does not at all seem predicated on any sort of specific training set. You could feasibly evolve it artificially.

      • krupan 32 minutes ago
        What does this even have to do with the parent? Your capabilities have nothing to do with LLM capabilities. The two work in completely different ways. The reason LLMs work is because they are huge and have been trained on vast amounts of data, full stop. Sure, there's potential someday to get something useful using less data, but we aren't there.
      • _heimdall 1 hour ago
        You're also embodied and experiencing the world around you with more senses than only the ability to read text.
        • rogerrogerr 1 hour ago
          > the only "dataset" you need is a bunch of sensors and the world around you.
    • krupan 31 minutes ago
      And this is important because even though you are running a model locally, it's still a proprietary model. You have no say in what it was trained on, how that training data is labeled, what the guardrails are, what biases it might have, none of that.
    • RataNova 9 minutes ago
      That's a fair concern, but I'd separate training from inference here
    • pronik 47 minutes ago
      There is so much technology that we are unable to reproduce locally; I don't think LLMs are in any way different. There will be large LLM manufacturers, small LLM manufacturers, LLM artisans, LLM enthusiasts, and of course LLM consumers, just like with everything else.
    • dlcarrier 1 hour ago
      Not the whole thing, at least with current technology, but LoRAs are really good at fine-tuning, and can be generated in a few hours on high-end gaming computers, so as long as the base model is in your language, you likely have enough spare computing power, in whatever electronics you own, to train a few LoRAs a month.

      In the future, when regular home computers have the capabilities of modern servers, we'll be able to train the entire LLM at home.
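
      For scale, the whole setup is a few lines with Hugging Face PEFT (a sketch; the base model name and the hyperparameters here are illustrative, not a recipe):

        # Sketch: attach a LoRA adapter to a small base model with PEFT.
        # Only the adapter weights train, which is why a gaming GPU suffices.
        from transformers import AutoModelForCausalLM
        from peft import LoraConfig, get_peft_model

        base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B")
        config = LoraConfig(
            r=16,                                 # adapter rank (illustrative)
            lora_alpha=32,
            target_modules=["q_proj", "v_proj"],  # which layers get adapters
            task_type="CAUSAL_LM",
        )
        model = get_peft_model(base, config)
        model.print_trainable_parameters()  # typically well under 1% of the base
        # ...then run a normal Trainer loop over your domain data.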

    • Ucalegon 1 hour ago
      Depends on the domain. There are plenty of use cases where the data needed for training is available for personal or non-commercial use. At that point, it comes down to compute/time for training, and if you are willing to wait, consumer-grade hardware is perfectly capable of producing useful models.
    • cyanydeez 1 hour ago
      That sounds like government. So your problem is mostly that you expect to have a collective social effort, but not enough to pay for it as a public good.
  • hackyhacky 25 minutes ago
    I would like a standardized API for local AI to exist outside of the Apple ecosystem. The Prompt API in Chrome is halfway there.

    * What is the answer to local AI for native apps on Windows?

    * What is the answer to local AI for Linux?

    This is a big opportunity for Linux, given the high quality of open-weight models. I hope some answer emerges before designs fracture and we get a dozen mutually incompatible answers.

  • ksec 13 minutes ago
    While I agree that should be the goal, we are too early for that. Just like how speech recognition used to require many servers in a datacenter to process the data you sent over; it now runs completely on-device.

    We are at least 5 years away from that. And DRAM needs a substantial breakthrough in cost reduction.

  • vb-8448 1 hour ago
    > Use cloud models only when they’re genuinely necessary.

    The problem is that it's much easier to use the SOTA models (especially if they are subsidized) than to spend time tweaking the knobs on a local one.

    I just realized this with coding agents: yeah, you probably shouldn't always use the latest version at xhigh, but you end up doing it anyway because you get the job done in less time, with less "effort", and basically at the same price.

    I guess we'll see a real effort for local AI only when major vendors start billing based on actual token usage.

    • lelanthran 16 minutes ago
      > The problem is that it's much easier to use the SOTA models (especially if they are subsidized) instead of spending time fixing the knobs with the local one.

      That's not a problem, that's a feature; I have something like 8 tabs open to different free-tier providers. ChatGPT, Claude and Gemini are the SOTA ones.

      I have no problem maxing one out, then moving to the next. I can do this all day, having them implement specific functions (or classes) in my code. The thing is, because I actually know how to write and design software, I don't need to run an agent in a loop to produce everything in a day; I can use the web chatbots with copy/paste to literally generate thousands of lines of code per hour while still having a strong mental model of the code, so I can go in and change whatever I need to.[1]

      ---------------------

      [1] Just did that this morning on a Python project: because I designed what I needed, each generation was me prompting for a single function. So when I needed to add something this morning I didn't even bother asking a chatbot to do it; I just went directly to the correct place and did it.

      You can't do that if you generate the entire thing from specs.

      • vb-8448 6 minutes ago
        We are talking about local AI, and having all these SOTA models basically for free is blocking the progress of local or independent third-party setups.
    • RataNova 8 minutes ago
      The path of least resistance usually wins, especially when the pricing hides the real cost
    • Analemma_ 1 hour ago
      I'm also just not seeing good performance from local models. Every time a thread about LLMs comes up, there are tons of people in the comments insisting that they're getting just as good results from the latest DeepSeek/qwen/whatever as with Opus, and that just hasn't been my experience at all: open-source models just fall over completely compared to Claude when asked to do anything remotely complicated.

      I have a sneaking suspicion this is kinda like the situation with Linux in the 90s, where it kinda worked but it reeeeeally wasn't ready for the home user, but you had a lot of people who would insist to your face everything was fine, mostly for ideological reasons.

      • lelanthran 12 minutes ago
        > Every time a thread about LLMs comes up, there are tons of people in the comments insisting that they're getting just as good results from the latest DeepSeek/qwen/whatever as with Opus, and that just hasn't been my experience at all: open-source models just fall over completely compared to Claude when asked to do anything remotely complicated.

        Different usage patterns - you want to issue a single spec then walk away and come back later (when it has consumed $10k worth of API tokens inside your $200/m subscription) to a finished product.

        Many people issue a spec for a single function, a single class, or similar. When you break it down like that, the advantage of SOTA models shrinks.

        • vb-8448 1 minute ago
          My experience in medium/big codebases is that even for single functions, going with xhigh is basically better from a user perspective (faster to get the result, and you can trust it), while with lower models (e.g. Sonnet instead of Opus) you always have to carefully review the output, because 1 time in 10 it will hallucinate, you won't catch it immediately, and at some point it will bite you.
      • kgeist 46 minutes ago
        It depends a lot on how you run those models; I think a lot of the disagreement stems from that. A lot of people run local models with incredibly small context windows (which makes an agentic LLM go in circles), use very small quants (like 4-bit => huge degradation), don't set the recommended parameters (like top-p/temperature), or download GGUFs with broken chat templates. And then they claim model X is bad :)

        I'm currently running both Sonnet 4.6 and Qwen 3.6-27b on the same codebase (via OpenCode, the parameters were carefully tuned to have a good quality/context size ratio), and on this project, they both struggle with complex non-trivial tasks, and both work flawlessly otherwise. Sonnet 4.6 understands the intent better if my task is ambiguously formulated, but otherwise the gap is pretty small for coding under a harness.
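
        To be concrete, the difference between "model X is bad" and "model X is fine" is often just a few request fields (a sketch against a local OpenAI-compatible server; the values are illustrative, not any model's official recommendation):

          # Sketch: per-request sampling parameters; check the model card
          # for the recommended values instead of trusting the defaults.
          from openai import OpenAI

          client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")
          resp = client.chat.completions.create(
              model="local",
              messages=[{"role": "user", "content": "Refactor this function..."}],
              temperature=0.7,  # too high => rambling, too low => loops
              top_p=0.8,
              max_tokens=4096,  # agents need room; tiny limits cause circling
          )
          print(resp.choices[0].message.content)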

      • bilbo0s 13 minutes ago
        This.

        I’ve begun to suspect that most people are probably running different hardware. Sure, if you run the latest DeepSeek flash on your brand-new M5 with 128GB, maybe you get acceptable performance?

        But honestly, how many people have an extra $9000 laying around these days?

        Right now, running with acceptable performance is kind of a luxury. I wish the people who always say "This is great!" would realize that not everyone has their hardware.

  • RataNova 13 minutes ago
    I mostly agree, though I think local AI will need better UX around failure modes. Cloud models are often used not just because developers are lazy, but because they are more capable and easier to support consistently across devices.
  • Animats 58 minutes ago
    Question: for software development, how much of an AI do you need for local development? Can it be run locally? Can someone train something that knows a lot about software but lacks comprehensive coverage of history, politics, and popular culture?
    • mrkeen 52 minutes ago
      This is a good snapshot of things:

      https://news.ycombinator.com/item?id=48050751

      A specialist hand-rolls a cut-down framework to power a 1- or 2-bit quantised version of a cut-down, sort-of-frontier model.

      It can be yours if you have 128GB or 256GB of RAM.

    • dd8601fn 55 minutes ago
      The ones that are good for more than elaborate auto-complete are pretty hefty, but it can be done. They’re still not Opus behind Claude Code.
  • revolvingthrow 1 hour ago
    A local Answer Machine is the dream, especially with the internet decaying and generally on its last legs, but the hardware requirements seem like a huge mountain to climb. Things are progressing tremendously - DeepSeek V4 flash is very good for what it is - but even that goes beyond any reasonable local setup, which IMO is 128 GB RAM + 16 GB VRAM. Populating 4 RAM slots on a consumer board craters RAM speed, 256 GB Macs are too expensive, and even then the inference is ungodly slow.

    On the other hand… the V4 flash model is actual magic compared to what was available 2 years ago. If the rate of improvement stays as is, we'll get similar performance in a ~120B model in a year, which is viable (if expensive) on everyman hardware. Possibly you'll be able to run its equivalent on a ~$1200 laptop by 2028, which to me-in-2020 would sound straight out of a sci-fi movie. A good harness that lets the model fetch data from other sources, like a local Wikipedia copy from Kiwix, could do a lot for factual knowledge too; there's only so much you can encode in the model itself, but even a cheapish (at pre-current prices) 2TB drive can hold an immense amount of LLM-accessible data.

    Big caveat: I don’t see local models for programming or generally demanding agentic tasks being worth it anytime soon. You likely want bleeding edge models for it, and speed is far more important. Chat at 20tok/s is fine; working on even a small codebase at 20tok/s, especially on a noticeably weaker model, is just a waste of time. Maybe it’s a PEBKAC but I have no idea how people make any meaningful use out of qwen 3.6.

    • zozbot234 16 minutes ago
      > and even then the inference is ungodly slow.

      This is the wrong way of putting it. Local inference with SOTA models is all about slowing down compute for the sake of fitting on bespoke repurposed hardware. You don't need to go fast if you have the whole machine to yourself 24/7. Cloud AI vendors can't match that kind of economics.

  • 1a527dd5 9 minutes ago
    Consumer/private needs to be local.

    Work? I don't want it local at all. I want it all cloud agent.

  • holtkam2 1 hour ago
    I wish I could upvote this twice. We (devs) really REALLY need to consider on-device compute before going to the cloud for LLM inference.
  • krupan 23 minutes ago
    Here I was hoping that this was some plea for us to get away from proprietary solutions that we have no control over and go back to open source, but no, not that at all.
  • daishi55 33 minutes ago
    > We are building applications that stop working the moment the server crashes or a credit card expires

    Isn’t this true of any application that accesses anything not running on your computer? This is just describing what it means to add an API call to your app. Nothing to do with AI (?)

  • jjordan 2 hours ago
    It feels like we're one technological breakthrough away from all of these data centers going up to be deemed irrelevant.
    • krupan 26 minutes ago
      It took us only, what 70-ish years of computer and AI research to get to this point, so yeah, probably just one little thing and then we'll have it </sarcasm>

      Seriously. I have never ever seen so many people so willingly drink the marketing kool-aid from companies selling their product before. It's scarier to me than any threats of AI actually disrupting society (because it is so far from being capable of doing that).

    • Lalabadie 1 hour ago
      The cynical take is getting more and more to be the only rational one:

      The promised mega-data center deals are meant to boost valuations today, not serve tons of customers three years from now.

      • _heimdall 1 hour ago
        It seems pretty clearly in line with the dotcom bubble to me. Every company claims to be a leading AI company, those building infrastructure are promising the moon and getting 1/3 of the way there, and no one knows how to monetize it to justify the hype or expense.
      • jjordan 1 hour ago
        oof, this bubble popping is gonna be brutal.
    • i_love_retros 1 hour ago
      What would that breakthrough be?
      • Waterluvian 1 hour ago
        Magic math and computer science that allows us to get the same quality response for a fraction of the GPU.
        • intothemild 1 hour ago
          That's already happening. Qwen3.6 and Gemma4.

          Basically small and medium models that are crazy well trained for their sizes.

          Then we have a lot of speculative decoding stuff, like MTP and others, coming to speed up responses, and finally better quantisation to use less memory.

          Local LLMs are the future, and the larger labs know that the open models will eat their lunch once people realise that the gap is only a few months: if the closed models were good enough a couple of months ago, the open models are good enough now.

          • krupan 29 minutes ago
            And how were those models developed and trained?
            • lelanthran 7 minutes ago
              > And how were those models developed and trained?

              That's irrelevant to my decision to use local or not.

        • YZF 1 hour ago
          The current LLMs are also "magic" so anything is possible. AFAIK there is no proof that the current architecture is optimal. And we have our brains as a pretty powerful local thinking machine as a counter-example to the idea that thinking has to happen in data centers.
          • _heimdall 1 hour ago
            I want to ask what makes them magic, but even those building LLMs don't really know what happens when they run inference...

            I have to assume current architectures aren't optimal though, the idea that we stumbled into the one and only optimal solution seems almost impossible.

        • toufka 1 hour ago
          I mean, the most cutting-edge iPhones, iPads, and MacBook Pros _today_ are quite capable of running today's high-end local LLMs in realtime.

          If you project out that hardware just a couple of years, and the trained models out a couple of years, you end up in a place where it makes so much more sense to run them locally, for all sorts of latency, privacy, efficacy, and domain-specific reasons.

          Not all that different from the old terminal & mainframe->pc shifts.

          Finally - hardware has seemingly gotten out ahead of software that most folks use - watching YouTube, listening to music, playing a game or two. There was a time when playing an mp3 or watching a 4k video really taxed all but the nicest systems. Hardware fixed that problem, like it very well could this one.

          • sofixa 1 hour ago
            > I mean, the most cutting edge of iPhones, iPads and MacBook Pros _today_ are quite capable of running in realtime today’s high-end local LLMs

            Definitely not the high end local LLMs. The small ones, yes, absolutely.

            > If you project out that hardware just a couple of years

            One of the biggest bottlenecks for LLMs is memory capacity and bandwidth. With the current memory crunch, it's unlikely we'll see big advancements in average memory available, or its bandwidth, on regular (not super-high-end) devices in the coming years.

            Alternatively, it's possible we get dedicated SLMs for e.g. phone-specific use cases that are optimised and run well.

      • _heimdall 1 hour ago
        I'd assume it's a totally different architecture that isn't based on storing a compressed dataset of all digital human text.
  • barrkel 1 hour ago
    Local models are extraordinarily expensive if you're not maximizing throughput, and you're not going to be maximizing it.

    Local models need to be resident in expensive RAM, the kind that has fat pipes to compute. And if you have a local app, how do you take a dependency on whatever random model is installed? Does it support your tool calling complexity? Does it have multimodal input? Does it support system messages in the middle of the conversation or not? Is it dumb enough to need reminders all the time?

    Spend enough time building against local models and you'll see they're jagged in performance. You need to tune context size, trade off system message complexity with progressive disclosure. You simply can't rely on intelligence. A bunch of work goes into the harness.

    Meanwhile, third party inference is getting the benefits of scale. You only need to rent a timeslice of memory and compute. It's consistent and everybody gets the same experience. And yes, it needs paying for, but the economics are just better.

    • LPisGood 1 hour ago
      > And if you have a local app, how do you take a dependency on whatever random model is installed?

      Reading the tea leaves here, it will probably become common for OSes to have built-in models that can be accessed via API. Apple already does this.

    • bheadmaster 1 hour ago
      > And if you have a local app, how do you take a dependency on whatever random model is installed?

      Why not ship your own model? In the age of Electron apps, 10GB+ apps are not unheard of.

      • _heimdall 1 hour ago
        Personally I wouldn't want a couple dozen apps installed all with their own model.

        It seems easier to have industry specs that define a common interface for local models.

        I also assume the OS can, or would need to, be involved in providing the models. That may not be a good thing depending on your views of OS vendors, but sharing a single local model does seem more like an OS concern.

        • alex7o 1 hour ago
          I mean, the OpenAI API is the industry standard for allowing apps to communicate with models: llama-server has it, oMLX has it, Ollama has it, vLLM has it, LM Studio as well. I don't think this is such a hard thing to do, but it requires people to set it up.
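
          A sketch of how interchangeable that makes things (the ports are each tool's common default, but check your setup):

            # Sketch: identical client code; only base_url changes per runtime.
            from openai import OpenAI

            backends = {
                "llama-server": "http://localhost:8080/v1",
                "ollama": "http://localhost:11434/v1",
                "lmstudio": "http://localhost:1234/v1",
            }
            client = OpenAI(base_url=backends["ollama"], api_key="unused")
            out = client.chat.completions.create(
                model="whatever-you-pulled",  # local model name, tool-specific
                messages=[{"role": "user", "content": "hello"}],
            )
            print(out.choices[0].message.content)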
          • _heimdall 1 hour ago
            I don't know enough about that API surface to know if it's a particularly good one for the use cases we'd have, but yes, defining a universal spec for all implementors to support wouldn't be a big lift, and it's done in plenty of other areas already.
      • alex7o 1 hour ago
        There is no other way than shipping your own model, because you will want an abstracted API over the inference, and you don't know what the user has installed. Also, you could ship a 9B FP4 model, but it all just depends.
        • _heimdall 1 hour ago
          Knowing what's installed would have to be an OS API, with LLMs providing a standard API surface to the OS, likely including metadata related to feature support.
        • LPisGood 1 hour ago
          You can know what the user has installed if the OS developer offers something.
  • rduffyuk 26 minutes ago
    I agree with the article, but in my experiments the limitation on local LLM usefulness is their limited scope. Eventually, context-heavy data pipelines require larger models, which consumer hardware can't deal with yet. The local model summarizing a page, as the article describes, could be done via code as well; I've found that using an LLM isn't always the right choice. For example, I use NER tagging in my md docs for better indexing and LLM search capabilities. This is purely code-based, not via an LLM; I tried it with an LLM and the results were a lot worse. Augmenting tools so the LLM produces better outputs gives better results.
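
    A sketch of the kind of code-based tagging I mean (spaCy's stock small English pipeline; the helper itself is illustrative):

      # Sketch: deterministic NER tagging for markdown docs with spaCy.
      # en_core_web_sm is the standard small English pipeline; swap as needed.
      import spacy

      nlp = spacy.load("en_core_web_sm")

      def tag_doc(text: str) -> dict[str, list[str]]:
          """Map entity label -> unique surface forms, usable as index tags."""
          tags: dict[str, list[str]] = {}
          for ent in nlp(text).ents:
              bucket = tags.setdefault(ent.label_, [])
              if ent.text not in bucket:
                  bucket.append(ent.text)
          return tags

      print(tag_doc("Anthropic and OpenAI both ship models from San Francisco."))
      # e.g. {'ORG': ['Anthropic', 'OpenAI'], 'GPE': ['San Francisco']}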
  • msteffen 1 hour ago
    > One of the current trends in modern software is for developers to slap an API call to OpenAI or Anthropic for features within their app.

    Well there’s your problem, control needs to go the other way. If you want your app to be AI-enabled, you need to make it easy for AI to control your app. Have you used OpenClaw? It’s awesome!

  • Galanwe 2 hours ago
    I would love for local inference to be practical, but from my experience, Kimi 2.6 is the only model that would be worth it, and it's a $10k (M3 Ultra max-spec'd; 30s TTFT, so kind of slow) to $30k (RTX 6000 / 700GB+ DDR5) upfront investment, noise / power consumption aside.
    • mft_ 2 hours ago
      You're maybe missing the article's point, which is to use local models appropriately:

      > “But Local Models Aren’t As Smart”

      > Correct.

      > But also so what?

      > Most app features don’t need a model that can write Shakespeare, explain quantum mechanics, and pass the bar exam. They need a model that can do one of these reliably: summarize, classify, extract, rewrite, or normalize.

      > And for those tasks, local models can be truly excellent.

      • Galanwe 1 hour ago
        This is a bit naive IMHO...

        I have tried quite a bunch of local models, and the reality is that it's not just a matter of "it's a small model that should be hostable easily". It's also a matter of what your acceptable prefill TTFT and decode t/s are.

        All the local models I used, on a _consumer grade_ server (32GB DDR5, AMD Ryzen), have been mostly unusable interactively (no decent use as a coding agent possible), and even for things like classification, context size is immediately an issue.

        I say that with 6 months' experience running various local models for classifying and summarizing my RSS feeds. Just offline summarizing and tagging the HN articles published on the front page barely keeps the queue sustainable and not growing continuously.

        • mft_ 58 minutes ago
          1) Again, I suspect you're missing the point of the article. The iPhone's on-device LLM is (apparently) ~3 Bn parameters - and runs well/fast enough to be used in the manner described. Of course, the iPhone has its GPU to leverage.

          2) It's probably not the time/place to trouble-shoot your "consumer grade server" LLM experience, but if you're running on CPU (you don't mention a GPU) then yeah, your inference speed will be slow.

          3) Counterpoint: my consumer-grade Macbook Pro (M1 Max, 64GB) runs Qwen3.6-35B-A3B fast enough to be very usable for regular interactive coding support. (And it would fly with smaller models performing simpler tasks.)

      • mikrl 1 hour ago
        One of my hobbyist workflows involved transcribing ETF prospectuses into YAML for an optimizer to optimize over.

        Used to take me maybe 10-20 minutes per sheet.

        Then I got Codex to whip up a script that sends each sheet to a fairly low-parameter, locally running LLM, and now I have the YAML in a couple of seconds.

        My dream is to bootstrap myself to local productivity with providers… I know I’ll never get there because hedonic treadmill etc, but I do feel there’s lots more juice to squeeze. I just need to invest more time into AI engineering…

  • refulgentis 37 minutes ago
    The shitty thing here is, either everyone's shipping at least 800 MB with their binary, or you have to rely on the platform vendor anyway. I'm hoping there's enough external pressure that the OS vendors turn it into more of a repository than a blessed-model garden.
    • wrxd 9 minutes ago
      To be fair, the author of the post is using the model Apple provides with the OS, so it doesn't add any extra binary size.
  • vegabook 1 hour ago
    >> years ago I launched "The Brutalist Report"

    proceeds to brutalise the reader with an 88-point headline font.

  • krupan 24 minutes ago
    If you don't need a lot of smarts, do you even need an LLM? Aren't older machine learning techniques just as good, or like, you know, old-school algorithms?
  • agentifysh 1 hour ago
    Until the hardware is economical and powerful enough, local AI that can compete with today's frontier models is still far off.

    If we could even get something like GPT 5.5 running locally that would be quite useful.

  • wilg 1 hour ago
    Two issues -

    1. Local models are likely to be more power-expensive to run (per-"unit-of-intelligence") than remote models, due to datacenter economies of scale. People do not like to engage with this point, but if you have environmental concerns about AI, this is a pretty important one.

    2. Using dumb models for simple tasks seems like a good idea, but it ends up being pretty clear pretty quick that you just want the smartest model you can afford for absolutely every task.

    • manc_lad 55 minutes ago
      I think using the best model for every task makes sense while these models are subsidised. When the prices go up (assuming they do), this could trigger a more varied approach, assuming the model doesn't self-select for you.
  • dana321 1 hour ago
    "NO AI" needs to be the norm, we should be working on better ways of sharing information and better documentation instead of fighting with computers for substandard results.
  • eyk19 1 hour ago
    Apple stock is going to skyrocket
  • holoduke 55 minutes ago
    We need computers with 128GB or maybe even 192GB of memory before local use makes sense. From my own experience, 32B LLMs are the absolute minimum for proper tool use and decent output quality. But for local AI you also want vision models and maybe even various LLMs, plus some memory for the system, of course. On my 36GB M3, the 24B Gemma model is nice, but the entire system gets allocated to that thing.
  • williamtrask 2 hours ago
    I wonder if a popularization moment for local AI will ultimately be the pin-prick that pops the AI bubble. Like the deepseek or openclaw moments but bigger/next.
    • gdulli 1 hour ago
      That's like wondering if enough people discovering local media streaming will disrupt commercial streaming services. It's not going to happen. Most people are not ambitious and will let themselves be controlled by the services of least resistance.

      And you can't take comfort in knowing that you, personally, will remain in control of your own computing. The majority will let the range and direction of their thoughts and output be determined by the will of the tech giant whose AI they adopt. And that will shape society.

      • williamtrask 33 minutes ago
        Yeah... probably right. I do hold out hope that this is mostly a timeframe thing. Like, the library, printing press, etc. all had their moments of centralization. But eventually they federated.
  • hypfer 1 hour ago
    Same as local compute.

    Welcome back to 2014. Let us now continue yelling at the cloud.

  • shmerl 1 hour ago
    Depending on some remote AI provider is a major lock-in pitfall. But it's exactly what those AI providers want you to do.
  • artursapek 1 hour ago
    I'm someone who is trying to build a subscription-based business to cover underlying LLM costs, and very hopeful I can one day just sell a permanent license to the software instead with customers using local LLMs to power it.
  • cubefox 1 hour ago
    Local AI is a bit like wind farms: everyone is in favor, except when they're in your own backyard. There was recently a huge outcry when Chrome shipped a local 4 GB AI model: https://news.ycombinator.com/item?id=48019219

    I have to conclude that people would like to have powerful local AI but it should at the same time only be a tiny model. In which case it wouldn't be powerful.

  • sgt 4 hours ago
    I guess Google got that memo!
  • qwertmax 16 minutes ago
    [flagged]
  • throwaway613746 2 minutes ago
    [dead]
  • debpalash 19 minutes ago
    [dead]