Testing the New 3DMark CPU Benchmark: For the Boids

A few weeks ago, UL (formerly Futuremark) launched the newest test in its ongoing 3DMark gaming benchmark suite, CPU Profile. The premise behind this new CPU-specific test is a simulation to measure how processor performance scales with cores and threads. Usually 3DMark tests are designed to measure overall gaming performance, and thus are largely a GPU benchmark, but this one is a little different because it focuses more specifically on CPU performance. So we wanted to take a look at UL’s latest test to get a better idea of what exactly it’s testing, what exactly it’s trying to accomplish, and just how useful it might be.

UL’s 3DMark and the New Test

The 3DMark software (with the tagline ‘The Gamer’s Benchmark’) has been a staple of the synthetic benchmarking community for its variety of tests designed to emulate different levels of gaming complexity. From a single interface, users can run simple tests aimed at mobile and integrated graphics performance, to mid-level gaming at reasonable resolutions and detail, up to overengineered tests for systems that don’t exist yet. Each of the tests provides a baseline set of graphics calculations designed to emulate video game performance and produces a composite number to represent that performance for that market. If you’ve ever heard of Time Spy or Fire Strike, two popular benchmarking tests particularly for overclockers, then 3DMark is where they come from.

3DMark also acts as a vehicle for new feature tests. Over the years UL has released separate specific tests to find draw call limitations, DirectX Raytracing processing performance, Variable Rate Shading (VRS) performance, PCIe 4.0 testing, and NVIDIA DLSS performance. The latest test in this portfolio is the CPU Profile, the focus of this article.

What’s the CPU Test Measuring

The CPU Profile test showcases a simple low resolution scene derived from the imagery of the latest gaming tests. The rate limiter of this scene is the raw CPU calculations in the background: the test runs an effective 150 frames of images, but each frame involves a parallel compute framework based on the flocking of birds.

Bird flocking, also known in simulation as boids (bird-oid object, not an accent thing), involves the interaction of a number of objects in motion relative to each other, depending on small random movement and rules regarding separation, alignment, and cohesion. Each boid has to account for:

  • its distance to other boids in a pack (separation),
  • its direction of travel relative to others (alignment), and
  • its desire to move towards an average position within line of sight (cohesion)

We’ve all seen how birds move in mass flocks, or fish in shoals, and there are actual mathematical models that can be used to simulate it. A minor adjustment in separation, alignment, and cohesion can change exactly how they all interact and move.

From a simulation standpoint, each boid is independent in its actions such that it can be calculated in parallel to others, however each boid needs to have knowledge of its local environment and the positions and directions of other boids within that environment. The more boids in the local environment, the bigger the lookup table for that individual needs to be. The size of that lookup table on each time step is often a mix between separation distance and line of sight: the more objects an individual can see or is interacting with at once, the bigger that calculation. The data for this lookup table needs to be polled from many different places in cache and memory, almost at random, and for perfect simulation, on every time step as well.
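To make the three rules and the per-step neighbor lookup concrete, here is a minimal brute-force sketch in Python. It is not UL’s implementation (which UL describes as highly optimized SSSE3/AVX2 code); the function name `step`, the tuple layout, and every radius and weight are illustrative assumptions, and the small random movement is left out for brevity. Because each new state is computed only from the previous time step, the outer loop could in principle be split across threads.

```python
import math
import random


def step(boids, view_radius=50.0, min_distance=15.0,
         w_separation=0.05, w_alignment=0.05, w_cohesion=0.005):
    """Return the next state of every boid; each boid is (x, y, vx, vy).

    All radii and weights are illustrative values, not taken from 3DMark.
    """
    updated = []
    for i, (x, y, vx, vy) in enumerate(boids):
        neighbors = 0
        sum_x = sum_y = sum_vx = sum_vy = 0.0
        push_x = push_y = 0.0
        # Brute-force neighbor lookup: check every other boid on every step.
        for j, (ox, oy, ovx, ovy) in enumerate(boids):
            if i == j:
                continue
            dist = math.hypot(ox - x, oy - y)
            if dist >= view_radius:
                continue  # outside this boid's line of sight
            neighbors += 1
            sum_x += ox
            sum_y += oy
            sum_vx += ovx
            sum_vy += ovy
            if dist < min_distance:
                # Separation: accumulate a push away from close neighbors.
                push_x += x - ox
                push_y += y - oy
        dvx = dvy = 0.0
        if neighbors:
            # Cohesion: steer towards the average position of visible boids.
            dvx += w_cohesion * (sum_x / neighbors - x)
            dvy += w_cohesion * (sum_y / neighbors - y)
            # Alignment: steer towards the average velocity of visible boids.
            dvx += w_alignment * (sum_vx / neighbors - vx)
            dvy += w_alignment * (sum_vy / neighbors - vy)
        dvx += w_separation * push_x
        dvy += w_separation * push_y
        vx += dvx
        vy += dvy
        # Only the previous state is read, so each boid's update is independent.
        updated.append((x + vx, y + vy, vx, vy))
    return updated


# Usage: 100 boids with random positions and velocities, advanced 10 steps.
random.seed(0)
flock = [(random.uniform(0, 200), random.uniform(0, 200),
          random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(100)]
for _ in range(10):
    flock = step(flock)
```

The inner loop is where the cost grows: with this naive lookup every boid inspects every other boid, which is why the number of visible neighbors dominates the work per time step.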

For anyone who wants to play with a 100 boid simulation in their browser, Ben Eater has a very good one, or users can play with GitHub code here with a JavaScript version. This is a single-threaded design, and can easily scale to a few thousand boids on a single core without any optimized code.



Boids with simple edge boundary conditions

Beyond that, boid simulation isn’t usually run on CPU cores anyway. Users can interact with a GPU version in their browser today, with 65000+ boids running very happily.

So with all this talk about boids, the CPU Profile test in 3DMark is doing exactly this simulation exclusively*. The workload description on 3DMark’s website states that they have a simple, highly optimized simulation of boids split into two parts.

  • One: Half the boids use SSSE3 optimized instructions
  • Two: Half the boids use AVX2 optimized instructions where available, otherwise SSSE3

The benchmark does six separate sub-tests based on the number of threads: 1, 2, 4, 8, 16, max. Rather than giving an overall score, the test hands the user six different scores, based on a simple calculation:

  • Score = 350,000 / average frame time (in milliseconds)
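As a quick sketch of that calculation, the Python below turns a list of per-frame times into a score. The function names are hypothetical, the frame time is assumed to be in milliseconds (as the article states further down), and rounding to the nearest whole number is our assumption based on the integer scores the benchmark reports.

```python
REFERENCE_VALUE = 350_000  # numerator from the score formula above


def average_frame_time_ms(frame_times_ms):
    """Mean of the per-frame times, in milliseconds."""
    if not frame_times_ms:
        raise ValueError("need at least one frame time")
    return sum(frame_times_ms) / len(frame_times_ms)


def cpu_profile_score(avg_frame_time_ms):
    """Score = 350,000 / average frame time (ms); rounding is an assumption."""
    if avg_frame_time_ms <= 0:
        raise ValueError("average frame time must be positive")
    return round(REFERENCE_VALUE / avg_frame_time_ms)


# Usage: a made-up run of 150 frames that each took 100 ms.
frames = [100.0] * 150
score = cpu_profile_score(average_frame_time_ms(frames))
```

Halving the average frame time doubles the score, so the number behaves like a scaled frame rate rather than a time.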

The simulation lasts for a fixed 150 frames, so each sub-test has the same fixed calculation (and we assume the same fixed seeds for RNG). On the fastest processors, the max threads section can take under 10 seconds, allowing the simulation to run with CPUs operating only within turbo clockspeeds (we’ll get back to why this matters later), whereas the single thread section on the slowest processors can take 5 minutes or so.

The end results page is something that looks like this:

The test gives you six different results along with a system information tracker if it was enabled.

The ultimate purpose of the test is to benchmark CPU performance at several different thread counts, creating a test that can scale up to use all of the threads a consumer CPU can provide, but that also offers a look at performance with lower thread counts, which is where many games lie today. Put another way, on deciding whether to have a single-threaded or multi-threaded gaming test, UL decided to do both by testing with multiple thread counts.

*At launch UL’s website said the test was in two parts with a physics engine, however UL has clarified to us in email that this was a copy/paste error from a previous test. The website has since been updated.

Discussion of the Test

Usually when probing a new test for our benchmark suite, it pays off to take a critical eye to what exactly the test is measuring and how it relates to the real world. Every benchmark has a place in a review lineup, though it’s always important to quantify where it should be, and what weight should be given to the results. For example, we have real-world tests that show performance on that specific software, but we also have a mix of synthetic tests for overall performance perception. Usually we focus more on the real-world tests for analysis and recommendation, but the small portion of synthetics helps in maintaining baselines and is there for those who want to see them.

Usually we filter 3DMark’s gaming tests into that latter portion of synthetic testing. With the same program version and the same video drivers, we can see how different processors and graphics cards scale in light of the synthetic workload, even if the synthetic workload is trying to emulate an average gaming experience. UL has been quite clear that the goal of 3DMark’s gaming tests is to do just that: emulate real world performance.

Unfortunately, the commentary around the CPU Profile test is a little unclear. You might be forgiven for thinking that the test is designed to showcase where a processor might be limited in gaming; after all, the test is shipped alongside a half-dozen other GPU gaming tests, and during the test itself we’re treated to some very game-looking imagery.



The arrows on the left look to be boids (300-ish?), but it is unclear if they are related to the simulation at all

In practice, it’s unclear whether the images shown on screen have anything to do with the simulation at hand (while UL has responded to a couple of emails, they haven’t answered this directly yet). We only see 300 or so boids on screen, and yet a simple simulation on a single core of a Core i7-6950X can easily do a few thousand.

If we go into UL’s press release for the test, the headline for the page is ‘New CPU benchmarks for gamers and overclockers’, and the page describes that it runs a CPU simulation across 1, 2, 4, 8, 16, and max threads. For each of those sub-tests, it also gives a brief indication of what the test is useful for. Here is our summary of UL’s press release on the sub-tests:

  • 1 Thread: Raw CPU performance, but other scores are better indicators of gaming.
  • 2 Threads: Best for DX9 games such as DOTA2, League, and CS:GO
  • 4 Threads: Best for DX9 games such as DOTA2, League, and CS:GO
  • 8 Threads: Modern DX12 games, correlates well with 3DMark Time Spy
  • 16 Threads: Computational tasks, less relevant for gaming
  • Max Threads: Full performance, not relevant for gaming

In gaming workloads, we would generally agree with this. However, the underlying workload used in CPU Profile is not a gaming workload. This is where the confusion kicks in. UL says that its boid simulation is akin to similar situations in games, even to the point where having half using SSSE3 and half using AVX2 is more akin to game engines using different optimizations; however it completely skips over the fact that in every one of its sub-tests, the ‘game’ is CPU limited, even at 8 threads, and at 16 threads. This is fine for a CPU-specific test, but it’s naive about how most games perform on high-end hardware.

As mentioned above, UL hasn’t stated how dense its boid simulation is, nor how it scales; by AnandTech’s estimates you need at least 2000+ boids to saturate a single thread with unoptimized code, so with optimized code scaled across 8 threads or 16 threads, we must be looking at 50000 or 100000 flocking objects in a simulation space. For games that showcase boid flocking environments, most of them are using secondary physics, i.e. unable to be influenced by the character, but those that do have interacting physics are unlikely to be simulating at this scale. Moreover, there’s nothing to say that a game engine won’t simply increase or decrease the boids in the simulation based on performance.

Orthogonal to all of this is the length of the test. Because the test is a fixed 150 frames regardless of how many threads are working, it means the best processors can churn through the max threads in a few seconds, while the slowest processors take several minutes in 1T mode. The discussion point here is down to how each processor invokes its Turbo modes.

At various times in the past decade, Intel and AMD have privately expressed concern about large max thread workloads that take a few seconds to complete. Usually max thread workloads require enough time for a processor to hit a steady state frequency, and so completing within the turbo window makes the test an unrepresentative metric. Take, for example, CineBench R20, which can complete in 5 seconds with a higher average pixels per second than a Cinema4D test that can take a few hours. Moreover, gaming moves in and out of turbo over time, and is not a fixed workload constantly at turbo. If Intel and AMD have previously stated that these sorts of in-turbo max thread tests are irrelevant for performance comparisons, then the new CPU Profile test would fall to a similar fate.

We approached UL with this, along with the idea that the CPU Profile simulation should be a fixed time instead of a fixed set of frames, but ultimately UL disagreed. One of the goals of the test was apparently having a short test length. They wanted the 8 thread result to correlate to Time Spy Extreme results, which meant finding a time that worked while also being short was a goal of the project. UL also stated that a 150 fixed frame test results in a fixed amount of work, and suggested that slower systems will process less with fixed time steps. I should point out this is irrelevant if you’re taking an average when fixed time steps are in place. Over 150 frames, UL stated they could guarantee a balanced workload across all threads (something which doesn’t happen in gaming), and beyond that the consistency of the test would diverge in its results.
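The difference between the two approaches can be sketched in a few lines of Python. This is not UL’s harness or our 3DPM code; `run_fixed_frames`, `run_fixed_time`, and the fake clock are hypothetical names used only for illustration, and the fake workload assumes every frame costs the same amount of time, which a real boid simulation would not guarantee.

```python
import time


def run_fixed_frames(step, n_frames, clock=time.perf_counter):
    """Fixed work: run exactly n_frames steps, return (frames, avg s/frame)."""
    if n_frames <= 0:
        raise ValueError("n_frames must be positive")
    start = clock()
    for _ in range(n_frames):
        step()
    return n_frames, (clock() - start) / n_frames


def run_fixed_time(step, duration_s, clock=time.perf_counter):
    """Fixed time: run whole steps until duration_s has elapsed."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    start = clock()
    frames = 0
    while clock() - start < duration_s:
        step()
        frames += 1
    return frames, (clock() - start) / frames


def make_fake_workload(seconds_per_frame):
    """Return (clock, step) where each step 'costs' a fixed, simulated time."""
    now = [0.0]

    def clock():
        return now[0]

    def step():
        now[0] += seconds_per_frame

    return clock, step


# Usage: a simulated slow CPU that needs 0.5 s per frame.
clock, step = make_fake_workload(0.5)
fixed_work = run_fixed_frames(step, 150, clock)  # 150 frames, 75 simulated seconds
fixed_time = run_fixed_time(step, 10.0, clock)   # 20 frames in 10 simulated seconds
```

In this idealized case both runs report the same average frame time even though the fixed-time run processed far fewer frames. What does change is how long the processor is kept under load: the fixed-time run lasts the same for every CPU, whereas the fixed-frame run gets shorter as the CPU gets faster.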

Ultimately, I disagree with some of UL’s decisions here, and find that a lot of these arguments seem arbitrary at best, especially given my own experience in building our in-house tests such as 3DPM (which incidentally does do fixed time, not fixed compute). This also means that I’m having a hard time correlating what this benchmark is doing to how a user can interpret the results for a gaming workload. What UL has done here is create a CPU benchmark, first and foremost, and it appears that simply using a simulation mechanic ‘that can be used in games’ is being described as a tool to help identify gaming performance. At this stage, with the information I have at hand, I remain unconvinced that the workload is gaming-relevant.

Results

Typical for a UL benchmark, CPU Profile generates a series of dimensionless scores. These scores directly correlate to the underlying benchmark, but they aren’t a specific measurement in and of themselves. Complicating matters a bit for CPU Profile, the benchmark generates half a dozen scores, so unless you read the documentation, the data can come off as a glut of numbers that are lacking context.



Example from UL’s website

Taking a look at these numbers, UL states on its website that the results help showcase the result compared to others, but also the overclocking potential of your processor. This is a hint that this benchmark is actually better for overclockers than anyone else, as having six different result numbers and six different suggestions for CPU overclocking doesn’t help much with interpreting gaming performance, especially given the bar showcasing the score is quite small and does not offer any additional context.



Results from one of our CPUs, hard to see these bars

There’s also the matter of presenting the result as a score. All of UL’s tests give a score at the end, and as we’ve showcased above the result for this test is a calculation of an arbitrary number (350000) divided by the average frame time (in milliseconds). The reason for not giving the results as a raw frame time is simple psychology: bigger numbers look better on graphs and are easier to interpret. So by dividing a number by the average frame time, everything gets a scale. It also helps that removing the units of the result might reduce confusion. The downside here is that the initial number is very arbitrary.

On the website, UL calls it a reference value using ‘a time constant set to 70 multiplied by a score constant set to 5000’, which comes to 350000. There are no explanations as to why these numbers exist, though we can interpret that 70 is meant to be 70 milliseconds, and if a processor achieves an average frame time of 70 milliseconds (note you need an 8 core processor to get that) then the final result is 5000 points. Almost all processors in all sub-tests will score under this, showcasing that the pivot for the results scaling is actually higher than most processors will achieve.

With the data, UL could have simply represented the results as an average frame rate. For example, here are some results for the Ryzen 7 2700X, an 8 core/16 thread processor, running at stock with JEDEC memory. The table showcases the raw average frame time, UL’s score, and an average frame rate metric.


3DMark CPU Profile: AMD Ryzen 7 2700X

AnandTech   Average Frame Time (milliseconds)   3DMark CPU Score   Average Frames Per Second
1T          660.9 ms                            530                1.5 fps
2T          380.8 ms                            919                2.6 fps
4T          217.3 ms                            1611               4.6 fps
8T          121.5 ms                            2881               8.2 fps
16T         78.6 ms                             4453               12.7 fps
nT          78.0 ms                             4487               12.8 fps

Note that if your game is running at 12 frames per second on a Ryzen 7 2700X, then something is set too high anyway.
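For readers who want to reproduce the frame rate column, or see how the 2700X scales with threads, the sketch below works only from the average frame times in the table above. The speedup relative to the 1T result is our own derived figure rather than something 3DMark reports, and the names used are hypothetical.

```python
# Average frame times (ms) for the Ryzen 7 2700X, copied from the table above.
FRAME_TIMES_MS = {
    "1T": 660.9,
    "2T": 380.8,
    "4T": 217.3,
    "8T": 121.5,
    "16T": 78.6,
    "nT": 78.0,
}


def frames_per_second(frame_time_ms):
    """Convert an average frame time in milliseconds to frames per second."""
    return 1000.0 / frame_time_ms


def speedup_vs_single_thread(frame_times_ms):
    """How many times faster each sub-test is than the 1T sub-test."""
    base = frame_times_ms["1T"]
    return {label: base / ms for label, ms in frame_times_ms.items()}


speedups = speedup_vs_single_thread(FRAME_TIMES_MS)
for label, ms in FRAME_TIMES_MS.items():
    print(f"{label:>3}  {ms:6.1f} ms  {frames_per_second(ms):5.1f} fps  "
          f"{speedups[label]:5.2f}x vs 1T")
```

On these numbers the max threads run comes out roughly 8.5 times faster than the single thread run on this 8 core/16 thread part, with almost no gain between 16 threads and max, as both use all 16 threads on this chip.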

But as we start listing multiple processors, this data gets excessive and dense very fast.


3DMark CPU Profile: Results Given as Average FPS

AnandTech   R9 5950X   R9 3950X   R7 2700X   i9 11900K   i9 9900KS
1T          2.7        2.2        1.5        3.1         2.3
2T          5.1        4.0        2.6        6.2         4.7
4T          8.4        6.4        4.6        11.7        9.2
8T          14.1       11.0       8.2        20.7        17.2
16T         22.4       19.1       12.7       24.8        20.7
nT          31.1       28.6       12.8       24.8        20.7

How should we order this table? Should it be ordered by 1T results, or by max thread results? If we’re focusing on gaming, perhaps we should order by 2T/4T or 8T instead, which makes the other results extra data that we’re discarding for being irrelevant or making it too complex. As is often the case, the downside to offering multi-dimensional data (in this case, results with multiple quantities of threads) is that it becomes a whole lot harder to present it in a simple manner.

So far I’ve run the test on 24 processors, from a 64-core Threadripper down to a dual core Apollo Lake. Rather than a table of results, these results are ordered by which processor scores the highest for each of the sub-tests. There’s even a Sandy Bridge and an Ivy Bridge in there.

The resulting graph is quite noisy, especially as the fastest high thread count processors aren’t the fastest low thread count processors (and vice versa). Ultimately a graph like this might look better with just a few elements on it, such as here:

This showcases that the Core i9-11900K scores best in this test, until it hits 16 threads when the extra memory bandwidth of the 3990X takes over. It should be noted that Tiger Lake does abysmally in this test, just behind the R9 3950X in 1T and behind the i3-9100F in max threads, as the power limits of the mobile processor matter more than the extra threads. I will need to test with a U-series AMD processor to see what the difference is here.

By and large when we scale out to more threads, we see that having a more complete system helps in this test, however in the single threaded mode, it doesn’t all seem to be about IPC, which is perhaps one of the limits of the boid simulation. We can actually see the Core i3 perform better in 2T/4T compared to the Ryzen 9 3950X, perhaps due to cross-thread talk over the chiplets being more of a concern.



The benchmark in full: unclear if any of this relates to what’s actually being calculated…

How this all relates to gaming though is a question left unanswered. It’s a strong CPU test, and as a simulation of flocking behavior, has the right elements for a scientific workload worth analyzing. However, interpreting the performance scaling as a function of gaming performance with a CPU-limited workload isn’t really relevant here, I feel, at least not without more information from UL about how they’re interpreting this test. We’ve been emailing with UL back and forth to understand the test, and we’re waiting to see if any further information will be made available. The thing is that most games that are CPU limited, especially DX9 esports titles, are bottlenecking on draw calls from the processor to the GPU, and are not waiting on CPU compute except in a few fractional scenarios. This makes the CPU Profile test more for the extreme overclockers in that regard, trying to eke out performance across CPU and memory.
