SantaCoder: A new 1.1B code model for generation and infilling (huggingface.co)
168 points by moyix on Dec 22, 2022 | 74 comments


Looks like Santa's going to be too busy to deliver presents this Christmas:

    def all_prime_elements(sequence):
        """Returns every prime element of the sequence."""
        return [i for i in sequence if is_prime(i)]

    def is_prime(i):
        """Returns True if i is prime, False otherwise."""
        if i == 2:
            return True
        if i == 3:
            return True
        if i == 5:
            return True
        if i == 7:
            return True
        if i == 11:
            return True
        if i == 13:
            return True
        if i == 17:
            return True
        if i == 19:
            return True
        if i == 23:
            return True
        if i == 29:
            return True
        if i == 31:


Mine never defines is_prime():

  def all_prime_elements(sequence):
    """Returns every prime element of the sequence."""
    return [x for x in sequence if is_prime(x)]


  def all_prime_factors(sequence):
    """Returns every prime factor of the sequence."""
    return [x for x in sequence if is_prime(x) and x!= 1]


  def all_prime_factors_with_limit(sequence, limit):
    """Returns every prime factor of the sequence up to the given limit."""
    return [x for x in sequence if is_prime(x) and x!= 1]


  def all_prime_factors_with_limit_and_step(sequence, limit, step):
    """Returns every prime factor of the sequence up to the given limit."""
    return [x for x in sequence if is_prime(x) and x!= 1]


  def all_prime_factors_with_limit_and_step_and_start(sequence, limit, step, start):
    """Returns every prime factor of the sequence up to the given limit."""
    return [x for x in sequence if is_prime(x) and x!= 1]


  def all_prime_factors_with_limit_and_step_and_start_and_end(sequence, limit, step, start, end):
    """Returns every prime factor of the sequence up to the given limit."""
    return [x for x in sequence if is_prime(x) and x!= 1]


  def all_prime_factors_with_limit_and_step_and_start_and_end_and_step(sequence, limit, step, start, end, step):
    """Returns every prime factor of the sequence up to the given limit."""
    return [x for x in sequence if is_prime(x) and x!= 1]
(and so on)


If I prompt it, it actually comes up with a decent function:

  def is_prime(element):
    """Returns whether a number is prime."""
    if element < 2:
      return False
    if element == 2:
      return True
    if element % 2 == 0:
      return False
    for i in range(3, int(math.sqrt(element)) + 1, 2):
      if element % i == 0:
        return False
    return True
Of course, if you ask it to keep generating past that there's the usual slow descent into madness.


I got something similar:

    def is_prime(n):
      """ Use sieve of erasthotenes to check if n is prime. """
      if n < 2:
         return False
      if n == 2:
         return True
      if n % 2 == 0:
         return False
      for i in range(3, int(n**0.5)+1, 2):
         if n % i == 0:
            return False
      return True


Makes me wonder if it used the "CS Grad" memes as part of its training set:

https://i.imgur.com/RGLFim0_d.webp?maxwidth=2560&fidelity=hi...


I see Yandere Simulator got into the training dataset


Despite being only 1.1B params, SantaCoder outperforms Facebook's InCoder (6.7B params) and Salesforce's CodeGen-Multi-2.7B.

Paper: https://hf.co/datasets/bigcode/admin/resolve/main/BigCode_Sa...

Dataset search: https://huggingface.co/spaces/bigcode/santacoder-search

Model weights: https://huggingface.co/bigcode/santacoder


SantaCoder's impressive, but that's probably misleading. It's reported that InCoder doesn't generate as diverse a set of solutions, but does better on the ones it does generate. That means it performs well with a lower number of tries compared to other similar models, which is what matters in practice. The numbers reported here required many trials.

With a fuller context and just a handful of tries, it's unlikely that the 6.7B version of InCoder will be outperformed by SantaCoder.


The amount of context is dictated by the benchmark, but I agree it would be good to see what the pass@1 and pass@10 numbers are – if the raw data is available somewhere, that can easily be computed.
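For anyone computing it from raw samples, the unbiased pass@k estimator from the Codex paper (Chen et al., 2021) is only a few lines of numpy. A minimal sketch, where n is the number of samples per problem and c the number of those samples that pass the tests:

    import numpy as np

    def pass_at_k(n, c, k):
        """Unbiased pass@k: 1 - C(n-c, k) / C(n, k), computed stably."""
        if n - c < k:
            return 1.0
        return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

Averaging pass_at_k over all benchmark problems gives the reported pass@1 / pass@10 numbers.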


Any idea how this, and those other two models, would compare to GitHub Copilot?


Based on the reverse engineering done by Parth Thakkar [1], the model used by Copilot is probably about 10x as large (12B parameters), so I would expect Copilot to still win pretty handily (especially since the Codex models are generally a lot better trained than Salesforce CodeGen or InCoder). It's also a little bit hard to compare directly because as Parth documents, there are a lot of extra smarts that go into Copilot on the client side.

The SantaCoder paper does have some benchmarks on MultiPL-E though, so you could compare them to the Codex results on that benchmark reported here (but keep in mind that code-davinci-002 is probably even larger than the model used by Copilot): https://arxiv.org/abs/2208.08227

[1] https://thakkarparth007.github.io/copilot-explorer/posts/cop...


Just out of curiosity, in what sense is Codex better trained than CodeGen?


OpenAI hasn't said exactly how they trained code-davinci-002, so this is speculative, but I'm reasonably sure it was trained on more data and languages than CodeGen, and for longer. It was also trained using fill-in-the-middle [1].

[1] https://arxiv.org/abs/2207.14255


Not even close


If you haven't noticed, BigCode has also released "The Stack", a 3TB (!) dataset of code (https://huggingface.co/datasets/bigcode/the-stack). Also, they have a special policy where "The Stack" only contains permissively-licensed code, and anyone can see if their data is included and opt out.

It's true they haven't actually trained a model on the stack, and this is...not copilot. But I like what they're doing and I think it should be appreciated. Honestly, I may even say they're doing with code what stability.ai is doing with images.


> It's true they haven't actually trained a model on the stack

What do you mean? SantaCoder is trained on The Stack:

> Dataset

> The base training dataset for the experiments in this paper contains 268 GB of Python, Java and JavaScript files from The Stack v1.1 (Kocetkov et al., 2022) after removing data from opt-out requests, near-deduplication, PII-redaction (see Section 4), and filtering based on line-length and percentage of alphanumeric characters. This dataset was also decontaminated by removing files that contained test-samples from the following benchmarks: HumanEval (Chen et al., 2021), APPS (Hendrycks et al., 2021), MBPP (Austin et al., 2021) and MultiPL-E (Cassano et al., 2022).

It's definitely not on par with Copilot yet, but SantaCoder is a trial run for a larger & better model that they're planning to train in 2023. Stay tuned! :)


Increase the number of tokens to a large number and you end up with masterpieces like this:

def all_elements_in_range_excluding_and_including_and_excluding_and_including_and_excluding(sequence, start, end):


I think my job is safe.

    def all_odd_prime_elements(sequence):
        """Returns every odd prime element of the sequence."""
        return [x for x in sequence if x % 2 == 1]
    
    
    def all_even_prime_elements(sequence):
        """Returns every even prime element of the sequence."""
        return [x for x in


You can increase the number of tokens to be generated in "Advanced Settings"


That's not the issue here, it's just saying any odd number is prime, which is false


I'm not sure about job safety for either of you, since the function doc says what it should do and it's definitely not searching for primes :)


“Every odd prime element”: the code does not check primality, only oddness.
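For what it's worth, what those docstrings describe would look more like this (a quick sketch, reusing the is_prime helper from upthread):

    def all_odd_prime_elements(sequence):
        """Returns every odd prime element of the sequence."""
        return [x for x in sequence if x % 2 == 1 and is_prime(x)]

    def all_even_prime_elements(sequence):
        """Returns every even prime element of the sequence."""
        return [x for x in sequence if x % 2 == 0 and is_prime(x)]  # only 2 qualifies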


Is anyone else here building AI programming services based on models like this? I see a lot of comments saying the models can't do much programming, but I suspect there must be a silent contingent that is also working on services like that, and maybe less likely to promote the abilities of these models because it encourages competition.


We are at Codeium (codeium.com)! Not the SantaCoder model specifically, but the same types of LLM architectures. We've started with AI-based code autocomplete, but we think there is a lot more we can do.

We wrote up some of our learnings so far in @swyx's blog recently: https://lspace.swyx.io/p/what-building-copilot-for-x-really


What I would really like is something I saw someone talking about here: I'd like the editor to brighten text it finds "unexpected", which could immediately alert me to bugs, or to the fact that the code I'm writing looks weird in some way and might either need restructuring or an explanatory comment.


Yep, these kinds of applications are on our mind! We consider autocomplete to be the "baseline" task since there are plenty of benchmarks and research to compare our model's performance to, but there are lots of things like highlighting code, upgrading to new libraries/conventions, etc. that we can do with a good base model.


My unsolicited advice: pick an X. What is the one best use case for this other than code? Law? Finance? Focus on that vertical. If you have no idea what that could be, or if that market is too small, you're already in trouble.

I don't use autocomplete at all. What I would like is something that can take my current, bad code and style-transfer it into proper, modern code. Best case: take code as I write it naturally and conform it to the style guide of my organization.


We're building tools like this at Grit: https://www.grit.io/

These kinds of models are particularly good at repetitive, boring work like refactoring legacy code and completing framework migrations. Unlike Copilot, we've specialized specifically in these areas and in completing them end-to-end (instead of just sitting in the IDE, we open already-verified PRs).


May I ask what model you are using?


We use a few depending on the task (Codex, fine-tuned T5, BERT models, etc.), and we're constantly experimenting with different variations. Since we focus on solving narrower problems in more depth, it leaves more room for optimizing accuracy.


Have you been able to get your rate limit increased for code-davinci-002? It defaults to a very small amount.


I recommend trying davinci-003 as well.


Yeah, been messing with that a lot.


I've been pretty impressed with ChatGPT generating working implementations of various algorithms in different languages. Crucially, it actually knows about algorithms. I was trying to get it to generate some algorithm for calculating concave hulls the other day and ended up learning a thing or two about the various algorithms in this space. It almost but not quite worked for my use case. It seems limited in the amount of code it can generate in one go, but otherwise I was pretty impressed.

So, we're not that far off from basically pair programming with an AI that will do most of the boring/tedious work we currently do manually. Something like ChatGPT integrated into an IDE could be useful right now.


Copilot or Codium


Yes, we are incorporating this into Graphistry as part of how we help sec/fraud/misinfo/crime/etc. analyst teams investigate their data. Our platform does all sorts of GPU visual graph analytics & graph AI once data gets loaded in, and as part of our visual playbooks automation layer, this helps users make automations and fancier queries. Think Splunk, Spark, Neo4j, ... .

IMO tough question of who can do codegen as a scalable standalone startup, but that's ok. Pretty darn easy & useful for many productivity platforms like ours where it's just a super nice feature as part of delivering a broader magical experience.

Related: we are hiring a k8s/pydata person, ideally someone who has been a user & builder of investigation platforms, as we are working with companies like Nvidia to bring this kind of thing to some pretty major enterprise & gov teams. See the gdoc linked on our careers page.


There's Replit, which is constantly announcing new features around such models. They introduced "Ghostwriter" a while back, and yesterday or so they announced Ghostwriter Chat.


Yes, we are building something that is somewhat like ILP/IFP (and other tried-and-tested but non-scalable techniques) with the search space reduced by using modern ML language models. And indeed, the thing that works best in our system has not been done in the open yet. Of course, we have no idea if it's viable for the masses; maybe if people see how well it works.


We built a semantic code search CLI tool (fully local and open source) using a similar model that I tuned: https://github.com/sturdy-dev/semantic-code-search


Yup, as part of another thing. ML assisted everything is here to stay.


I've been messing around some. Flan-T5 generates surprisingly close stuff occasionally for simple prompts like #square x or #sum the elements in the list.


There are a bunch of really good ideas used to train this model: multi-query attention, infilling, near-deduplication, and dataset cleaning.

I do wish that the demo was a little more interactive (not needing to click buttons to create a generation) since it makes it hard to see the full power of the model.

One of the things we tried at Codeium for our in-browser playground was to make it super clear how well the model performs by making the experience interactive: https://www.codeium.com/playground


I found this really interesting:

> We investigate the impact of 4 preprocessing methods on the training data: filtering files from repositories with 5+ GitHub stars, filtering files with a high comments-to-code ratio, more aggressive filtering of near-duplicates, and filtering files with a low character-to-token ratio. We observe modest impact of the new filters except for the stars filter, which deteriorates performance on text2code benchmarks significantly. This is an interesting result given that previous work has explicitly filtered for GitHub Stars as a proxy for data quality.


I am having trouble getting the demo to run. It just errors out


Give this notebook a shot:

https://github.com/arjunguha/BigCode-demos/blob/main/bigcode...

A GPU will help, but I found it passable on a CPU as well.


The demo is up again!


Same here


Might be overloaded – if you have a GPU you can try running it locally by getting the model weights here: https://huggingface.co/bigcode/santacoder


Any idea how much GPU memory you'd need to run this locally?

EDIT: just tried it and it didn't seem to go past ~6 GB


It's a 1.1B-parameter model, so with FP16 precision it's 4-6 GB max.
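That lines up: 1.1B parameters at 2 bytes each is roughly 2.2 GB of weights, plus activations, KV cache, and framework overhead. A minimal local-inference sketch with transformers (dtype, device, and prompt are illustrative choices; the trust_remote_code flag is needed because the checkpoint ships custom modeling code):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    checkpoint = "bigcode/santacoder"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    # Load weights in half precision to keep memory around the numbers above.
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint, torch_dtype=torch.float16, trust_remote_code=True
    ).to("cuda")

    inputs = tokenizer("def is_prime(n):", return_tensors="pt").to("cuda")
    outputs = model.generate(inputs.input_ids, max_new_tokens=64)
    print(tokenizer.decode(outputs[0]))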


As a software engineer, what is the use case for these kinds of 'code generation' tools? Are they good enough to generate different scripts for OS tasks? Can they automate CRUD APIs? What level of detail is required to use them? Like, do I basically have to describe an algorithm in English or can I go up to a higher level and talk about features and what the software ought to do? Are these tools good enough to improve my productivity in any way, or is this more for demos?


The current use case is generating simplistic boilerplate examples to post on hacker news and criticise / praise depending on your AI bias


I've been using ChatGPT with my side projects. Its ability to generate boilerplate for APIs just from what would normally be a Google search prompt means you can often go straight from idea to the part where you're adding the interesting features for your app.

Its ability to generate what are essentially highly specialized tutorials that match exactly your use cases is also a really big deal.

Overall it's really extended what I'm capable of doing. Not because I couldn't do the things before but because I can skip over the boring part in the beginning and save my emotional energy for the part that actually matters.

>do I basically have to describe an algorithm in English or can I go up to a higher level and talk about features and what the software ought to do?

It understands any "well known" algorithm, API, paradigm, or pattern that was written about before 2022. Even pretty obscure stuff. One thing I tried was copying and pasting some of my code into it and having it generate unit tests.


Quite unimpressive, so far.

Only works somewhat well for very simple and well-known tasks. Anything mildly more complex and it fails. It also seems to have no understanding of imports. It barfs out a dozen one-line functions for common tasks, all of which are just a call to some library function, and half of those aren't even in the standard Python library.

It's also kind of strange that at some point it drifts away from the requested task, or just ends on unfinished code if the token count is too small. For example, I asked for some code relating to XML parsing and handling, and after some XML functions it moved on to JSON and YAML.

I guess with some optimization and integration there might be some benefit to this as a replacement for the usual Stack Overflow copy'n'paste. But I don't yet see it adding significant value to actual work.


I think we’re a good 10 years away from “read my codebase and add the feature my PM asked for.”


We’ll become diff revisers


ChatGPT is great at analyzing diffs so… I’m gonna be reviewing the reviews


Also at summarizing, so I'm gonna review ChatGPT's summary of ChatGPT's review.


Software Developer will become Software Debugger.


Didn't work for me:

    def closest_point_on_curve(target,curve):
        """Returns the 3D point on the curve closest to target 3D point"""

Response:

    return closest_point_on_curve_3d(target,curve)


For the following example it just goes on generating a never-ending sequence of calls:

def point_line_projection(line,point): """Returns the perpendicular projection of the point on the line.""" return line.point_projection(point)

def line_intersection(line1,line2): """Returns the intersection point of two lines.""" return line1.intersection(line2)

def line_intersection_point(line1,line2): """Returns the intersection point of two lines.""" return line1.intersection_point(line2)

def line_intersection_point_line(line1,line2): """Returns the intersection point of two lines.""" return line1.intersection_point_line(line2)

def line_intersection_point_line_parallel(line1,line2): """Returns the intersection point of two lines.""" return line1.intersection_point_line_parallel(line2)

def line_intersection_point_line_parallel_point(line1,line2,point): """Returns the intersection point of two lines.""" return line1.intersection_point_line_parallel

...


He's got ya' there. :D


As soon as I see infilling, I want to see if it can solve equations.

Prompt:

> let x = <FILL-HERE>;
> assert(x + 50 === 200);

Output:

> let x = 100;
> assert(x + 50 === 200);

Not yet :/
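For what it's worth, infilling with the raw model works by wrapping the prompt in fill-in-the-middle sentinel tokens and letting it generate the missing middle span; roughly like this (a sketch following the model card's example, so double-check the exact token spellings against the tokenizer's special tokens; assumes a tokenizer and model loaded as elsewhere in the thread):

    prefix = "let x = "
    suffix = ";\nassert(x + 50 === 200);"
    # Prefix, then suffix, then a marker telling the model to generate the middle.
    fim_prompt = f"<fim-prefix>{prefix}<fim-suffix>{suffix}<fim-middle>"
    inputs = tokenizer(fim_prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(inputs.input_ids, max_new_tokens=8)
    print(tokenizer.decode(outputs[0]))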


Any VS Code extension?


Are these models merely interpolating between existing solutions, or also extrapolating, i.e., generating novel solutions?

Have there been examples of novel code, i.e. code that was not in the input set?


Are there any models that edit or generate file trees?


Like, an entire repository of files?

No.

How would a model get trained on that? You'd have to pass in the entire repository for each sample. It's prohibitively difficult to create that sort of model.

If you want that, you'll have to build tooling on top of a text model (i.e. an application that calls the model repeatedly) that takes a prompt, breaks it up into per-file prompts, and then incrementally generates the files while passing in the context of the previous files. Even then the 'context' would be too large, so you'd get large-scale consistency errors.

Broadly speaking, the number of tokens = the size of the text it can generate.

With small models, the number is trivial (a code fragment), so generally speaking, 'generate an entire application' one-step models currently don't exist.

That said, Stable Diffusion has proved that you can iterate in latent space and use a VAE to upscale to larger sizes, reducing the overall model size while still having output that is ~an order of magnitude larger than the latent space.

...so it's not totally out of the question that this is coming.

However, right now? no.


> so generally speaking, 'generate an entire application' one-step models currently don't exist.

This already exists. People have created full apps in ChatGPT by doing this. Many examples online.


No, they haven't. ChatGPT has the same prompt history limitations as other models, and cannot generate entire end-to-end applications.

What you've seen is applications running on top of ChatGPT, using it iteratively to generate multiple code segments.


But the context window size of ChatGPT is sufficient.

And again it does generate a full application, just not all of it at once.

Just see https://news.ycombinator.com/item?id=33854638

Sure you can't prompt it once and download a zip, but you still get a consistent full app at the end that can be used as a base from this prompting.


What are the memory requirements of this model?


Are there model weights?



A few more "getting started" examples would be nice.



