
Surprised nobody has pointed this out yet — this is not a GPT 4.5 level model.

The source for this claim is apparently a chart in the second tweet in the thread, which compares ERNIE-4.5 to GPT-4.5 across 15 benchmarks and shows that ERNIE-4.5 scores an average of 79.6 vs 79.14 for GPT-4.5.

The problem is that the benchmarks they included in the average are cherry-picked.

They included benchmarks on 6 Chinese language datasets (C-Eval, CMMLU, Chinese SimpleQA, CNMO2024, CMath, and CLUEWSC) along with many of the standard datasets that all of the labs report results for. On 4 of these Chinese benchmarks, ERNIE-4.5 outperforms GPT-4.5 by a big margin, which skews the whole average.
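To make the skew concrete, here's a tiny sketch with made-up numbers (purely illustrative, not the reported scores): a few lopsided benchmarks are enough to flip the overall average even when the model is tied or slightly behind everywhere else.

    # Hypothetical scores, just to illustrate how the averaging works.
    # Model A is slightly behind on 9 shared benchmarks, far ahead on 6 others.
    shared = [(80.0, 81.0)] * 9   # (model_A, model_B): roughly tied, B slightly ahead
    skewed = [(90.0, 70.0)] * 6   # A far ahead

    def avg(pairs, idx):
        return sum(p[idx] for p in pairs) / len(pairs)

    print(avg(shared + skewed, 0), avg(shared + skewed, 1))  # 84.0 vs 76.6: A "wins" overall
    print(avg(shared, 0), avg(shared, 1))                    # 80.0 vs 81.0: A trails on the shared set

The overall average over all 15 hides that the entire gap comes from one cluster of datasets.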

This is not how results are normally reported and (together with the name) seems like a deliberate attempt to misrepresent how strong the model is.

Bottom line, ERNIE-4.5 is substantially worse than GPT-4.5 on most of the difficult benchmarks, matches GPT-4.5 and other top models on saturated benchmarks, and is better only on (some) Chinese datasets.



To try to head off the inevitable long argument about which benchmark, or set of benchmarks, is universally better: there is no such thing anymore. And even within benchmarks, we're increasingly squinting to see the difference.


Do the benchmarks reflect real-world usability? My feeling is that benchmark numbers stop being meaningful above roughly 75%.

In a real problem you may need to get 100 things right in a chain, which means that a 99% chance of getting each single step correct gives only about a 37% chance of getting the correct end result. But building a diverse test that can reliably distinguish 99% correctness in complex domains sounds very hard, since the answers are often nuanced in details where correctness is hard to define and determine. Working in complex domains as a human, it's often not clear whether something is right, wrong, or in a somewhat undefined and underexplored grey area. Yet we have to operate in those areas and, over many iterations, converge on a result that works.
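As a back-of-the-envelope check of that compounding (assuming the steps are independent):

    # Chance of an error-free chain of n independent steps,
    # each with per-step accuracy p.
    def chain_success(p, n):
        return p ** n

    print(chain_success(0.99, 100))  # ~0.366, i.e. roughly a 37% chance of a fully correct result
    print(chain_success(0.95, 100))  # ~0.006, collapses quickly as per-step accuracy drops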

Not sure how such complex domains should be benchmarked, or how we would objectively compare the results.


GPT-4.5's advantages are supposed to be in aspects that aren't being captured well in current benchmarks, so the claim would be shaky even if ERNIE's benchmarks actually showed better performance.


You know what's sad? Every Western company has been using this technique for a long time...


So, fairly accurate if you're Chinese?


It doesn't really matter what nationality or ethnicity you are, but if you communicate with the model in Chinese you might get better results from it.

Then again, if they've misrepresented the strength of the model overall, there might be some other shenanigans with their results. The fact that their results show their model is worse than GPT-4.5 on 2 Chinese language benchmarks, while it's so much stronger on some of the others, is a bit weird.



