
It'll be 7B they're referring to. On my M1 Max (32 GB) with a 4000-token output request, I get 67 ms/token on 7B (4-bit) and 154 ms/token on 13B (4-bit)... I've made a tweak to the code to increase the context size, but it doesn't seem to change perf.

  main: mem per token = 22357508 bytes
  main:     load time =  2741.67 ms
  main:   sample time =   156.68 ms
  main:  predict time = 11399.12 ms / 154.04 ms per token
  main:    total time = 14914.39 ms


This was from generating 2000 tokens, so it seems to get slightly faster on longer generation runs, maybe?
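
To put those ms/token figures in more familiar units, here's a quick standalone conversion to tokens/sec (just arithmetic on the numbers above; the file name and build line are my own):

  // throughput.cpp -- convert the ms/token figures above into tokens/sec.
  // Build: c++ -std=c++17 throughput.cpp -o throughput
  #include <cstdio>

  int main() {
      const double ms_per_token_7b  = 67.0;   // 7B, 4-bit, from the post
      const double ms_per_token_13b = 154.04; // 13B, 4-bit, from the log above

      std::printf("7B  (4-bit): %.1f tokens/s\n", 1000.0 / ms_per_token_7b);   // ~14.9
      std::printf("13B (4-bit): %.1f tokens/s\n", 1000.0 / ms_per_token_13b);  // ~6.5
  }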

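For anyone curious what the context-size tweak looks like, here's a minimal mock sketch, assuming (from memory, not confirmed by the post) that llama.cpp at the time hardcoded n_ctx as a literal at the model-load call site; the loader below is a compilable stand-in, not the real API:

  // ctx_tweak.cpp -- illustrative mock of the context-size tweak.
  // Assumption: n_ctx was a hardcoded literal (e.g. 512) at the call site.
  #include <cstdio>
  #include <string>

  struct llama_model {};  // stand-in for the real model struct

  // Stand-in loader: the real one also takes a vocab and reads a ggml file;
  // only the n_ctx plumbing matters for this sketch.
  static bool llama_model_load(const std::string &path, llama_model &, int n_ctx) {
      std::printf("loading %s with n_ctx = %d\n", path.c_str(), n_ctx);
      return true;
  }

  int main() {
      llama_model model;
      const int n_ctx = 2048;  // was 512 -- bumping this literal is the whole tweak
      if (!llama_model_load("models/13B/ggml-model-q4_0.bin", model, n_ctx)) {
          return 1;
      }
  }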


