IceInSpace > Equipment > Software and Computers
#1 | 16-11-2024, 10:49 AM
g__day (Matthew), Sydney
Dual Xeons - a processing platform I am playing with

Living 13 km north-east of Sydney there is not much I can do about our night-time skies and light pollution - but I did want to experiment with what one can achieve by integrating massive amounts of data on a target. To do that in a sensible amount of time requires a lot of compute and I/O power.

So it got me thinking about my options - and I could see four obvious paths:

1. Upgrade my existing workstation from an 8-core Extreme i7 to an 18-core i9-10980XE
2. Build an AMD Ryzen Threadripper 3990X based rig (64 cores)
3. Buy an old HP Z640 dual Xeon E5-2699 v3 (36 cores)
4. Rent a massive AWS compute and I/O cluster when I want to stack and integrate shots

The first option I may still do down the track - the chips aren't so common, and finding one plus potentially upsizing my water cooler for the larger, hotter CPU meant this path would probably cost $1,400.

The second option is likely the most powerful - the CPU alone costs around $5K (it's basically a baby EPYC) - but a build down that pathway could run $8K - $10K - a bit too much for a processing hobby.

Option 4 was a bit tricky to justify - there would be a lot of data to move up and down from the cloud, which would require considerable time and internet bandwidth - and I would need not just very high compute but high IOPS too - so lacking the experience I discounted this option.

So that left option 3 - and during a road trip to the Sunshine Coast over the past two weeks, opportunity struck. The TechFactory just before Brisbane had an HP Z640 (Win 10 Pro) dual Xeon with 36 cores and 72 threads, a Quadro P4000 graphics card (roughly equivalent to an NVIDIA GTX 1070), a 1TB SSD, an 8TB HDD and 256GB RAM for $1,900 - so I jumped on it.

I also picked up a 27" 2K monitor, keyboard and mouse for $200, and my wife got a great MacBook Air - so it was a really fun visit to a place that re-purposes end-of-life high-end servers, switches and workstations. It was a real geek's playground - they had military-grade equipment everywhere.

So I got back home and set it all up. First task was to improve I/O - so I added an old ASUS Hyper M.2 x16 Gen 4 PCIe card that I had bought years ago and never used (for lack of 16 spare PCIe lanes). To this I added 4 x 2TB MP600 Elite drives and put them in RAID 0 as my working space - an 8TB scratch file area. All up that is another $1K in gear. The Z640 BIOS makes it easy to bifurcate a PCIe x16 slot into x4/x4/x4/x4, which you need for all four M.2 drives to be visible, so you can then stripe them into a dynamic disk under Windows Disk Management. The only loss is that the Z640 is only PCIe Gen 3 - whereas the drives and Hyper card are Gen 4 - so although each drive is rated at 7,000 MB/sec under Gen 4, this roughly halves under Gen 3. But in RAID 0, scores above 8,000 MB/sec are commonplace, which makes for blindingly fast I/O.

All that was left was to install the latest NVIDIA Quadro P4000 drivers, install PI, and install GPU acceleration for PI using CUDA. These CPUs are not approved for Windows 11 - so I am stuck on Win 10 Pro for a while.

Testing PI's WBPP showed interesting results - PI only uses half the processor cores. It sees 72 logical cores and schedules work for that many - but Windows seems to dispatch all the work to only one of the CPUs - the first NUMA node of 36 logical cores - while the rest just sit idle.

So at 50% CPU load this new workstation took about 3 hours to stack and integrate 730 subs - a job that took my old workstation 4 hours. If half the processors can do that, I will be really keen to see what the rig can do fully loaded! It is also super quiet - at full load it is basically silent and cool. The PI guys are working on release 1.9, due out in the next few weeks, but are still trying to investigate why mine and a few other rigs only use half the available processors (and this is only a Windows behaviour - on Linux all processors get used).

I also saw very unusual CPU scheduling behaviour with the RC-Astro Xterminator suite. My old workstation could process an image with BlurXterminator in about 3 minutes on CPU and 20 seconds on GPU; the new one took 40 minutes on CPU (with only about 5-6 cores at roughly 5% load) but on GPU it did it in under 40 seconds - a 60-fold improvement - pointing to something really off with the workload dispatcher.

So for anyone thinking of having a dedicated astro processing rig - a used HP Z640 of the right configuration, core count and memory size can be bought for a real steal nowadays!
Attached Thumbnails: disk-io-v4 NVME 4 drives in RAID 0 Gen 4 on PCIE Gen 3 interface - Windows RAID controller.PNG | PI first WBPP run 763 subs.jpg | PI benchmark half cores.jpg

Last edited by g__day; 25-11-2024 at 07:56 PM.
#2 | 17-11-2024, 11:01 AM
rustigsmed (Russell), Mornington Peninsula, Australia
Looks like a fun project Matthew - so many cores! It will be interesting to see how it runs when all cores are up and running properly. Might be worth dual booting to see the Linux performance - it generally scores higher. The system will be great at core-heavy tasks!
Attached Thumbnails: Screenshot_20241117_115812.png
#3 | 17-11-2024, 03:32 PM
Leo.G (Leo), Lithgow, NSW, Australia
Quote:
a used HP Z640 of the right configuration, core count and memory size can be nought for a real steal nowadays

Did you mean "Bought" or "Nought"?


I went down this path several years back with an old IBM desktop server with twin Xeon processors, nowhere near as much RAM, and I can't remember which graphics card. My son at the time had his Lenovo home server with a much later Xeon processor and more RAM, and it was by far quicker, with less noise and a lot less electricity.
In saying that, my son has an HP C7000 blade cabinet fully populated, including one double-sized unit with 4 x Xeon processors and a few disk shelves, along with a 48-tape library, but we haven't played with any of it for processing. It's all large equipment and not so cheap to run.

One thing - Registax, from memory, was the one program which always threw an exception error if I tried to run more than one core for processing, and it didn't matter which version I tried (or it could have been DSS, I can't remember).
We used to buy a lot of cheap server gear from Grays auctions going back a few years. I've always wanted to try one of the smaller HP units, or the near-identical machine from the other brand whose name currently eludes my pathetic brain.

The same with my son's Lenovo home/small business server (S30) - it was great for everything, including gaming, when we first got it.
I have a D30 which takes slower dual processors, but it has a dead BIOS my son was re-writing for it; he may still finish it one day. It's a HUGE machine though.

We suffer from small house syndrome - too much junk, not enough house.
#4 | 18-11-2024, 09:26 AM
AlexN, Caboolture, Australia
I have an HP Z640 too that is my PI machine.

Its specs are:
Dual Xeon E5-2680 v4s (28c/56t total)
256GB DDR4
12GB NVIDIA Titan Xp
512GB SSD for boot/Windows
PCIe riser with 4 x 2TB NVMe SSDs for storage/processing swap etc.

Including the Titan Xp and NVMe drives etc it was less than $1,200... and in PI, it's ridiculous. A WBPP full pipeline including normalisation, calibration, image solving, integration and drizzle integration usually takes less than 1 hour for 300-400 subs, and less than 10 minutes for preliminary stacks of 56 or fewer images (56 individual threads means each operation runs in parallel if I have 56 or fewer subs).

The Titan Xp GPU pumps through Blur/Noise/Star Xterminator or Graxpert in a matter of seconds too...

I literally couldn't think of a better machine for PI. I'd love to get a pair of 2699s for it, just to boost the thread count and base speed a little - but I think the 28c/56t at 2.4GHz base, 3.3GHz boost is plenty.
#5 | 18-11-2024, 09:35 AM
AlexN, Caboolture, Australia
Oh, worth noting that LOTS of users seem to complain that dual-CPU rigs do not get fully loaded by PI, as only one CPU gets loaded.

Mine gets 100% load across all cores/threads during WBPP. I don't recall doing anything specific to make that happen - it may have been a BIOS setting or a CPU affinity setting from when I first configured the rig.
#6 | 22-11-2024, 12:31 AM
g__day (Matthew), Sydney
An update on what I know so far - coming from discussion on the PixInsight forum, including running some code Juan, the CEO of PI, shared on my machines.

https://pixinsight.com/forum/index.p...11-24h2.24249/

So I haven't programmed in 40 years - but last week I downloaded a C++ compiler and, with some amazing help from ChatGPT, trialled a few simple parallel programs on my workstations to confirm something that hasn't been well publicised about PI.

On boot-up it detects how many logical processors (72) are on my machine. But when running benchmarks and general PI tasks, PI only uses 36 cores and reports it sees only 36 logical cores - although PI's CEO confirms that from the last version PI is meant to be multi-core aware.

Windows divides machines with over 64 logical processors into two (or more) processor groups (on this machine, matching the NUMA nodes), each with an equal number of cores. The C++ code under Windows to count cores has to be slightly more sophisticated than what Juan shared, in order to correctly determine how many logical cores a machine with over 64 of them has. So that is problem one - getting the logical core count correct.

Problem two - making sure every core gets work. Once you have over 64 logical cores, your program has to be processor-group (or NUMA node) affinity aware - otherwise all work goes to whichever group the main program started on.


So until this is sorted by PI I have decided on a simple workaround - turn off Hyper-Threading! This lowers the logical CPU count from 72 back down to 36, and being under the 64 limit, PI then assigns work to all processors.


Now Hyper-Threading improves CPU throughput by about 10% from memory - meaning 36 physical cores will perform on many workloads like 40 physical cores. But if switching it on means I am only using one NUMA node - 36 logical cores representing only 18 physical ones - well, I will take 36 physical cores over that.

My CPU scores doing this went from the low 12,000s in PI to 15,500 - so I will take that gain.

Just have to see that everything is stable now. With HT on, everything worked very reliably. When I first switched it off, Windows didn't get past the splash screen the first time, and later stalled with stuck CPU readings in Task Manager. Now I am running a WBPP workload that took 3 hrs 15 mins last time I ran it - we'll see how it performs this time!
Attached Thumbnails: PI no HT 100 percent load bemchmark.PNG | PI no HT 100 percent load bemchmark full.PNG | PI no HT 100 percent load.PNG

Last edited by g__day; 25-11-2024 at 07:59 PM.
#7 | 22-11-2024, 05:19 PM
g__day (Matthew), Sydney
So to complete this update - PI WBPP ran in 2 hours 18 mins once it could throw the work at both of my new workstation's CPUs.

That is roughly half the time my old workstation - which was no slouch - took to process 721 subs!

Put another way - my total processing time has decreased from about 20 seconds a sub to about 11.5 seconds.
Attached Thumbnails: WBPP rno HT success - 2hrs 18 mins 721 subs.PNG | Blackmagic Speedtest original workstation.jpg | BlackMagicSpeed test.jpg | PI benchmark torture test score 17325.PNG | PI no HT 100 percent load bemchmark full.PNG
#8 | 22-11-2024, 05:37 PM
AlexN, Caboolture, Australia
That's awesome Matthew.

The old Z640 is an awesome processing rig; I've been thoroughly impressed by it over the last 12 months... It hammers in PI.
#9 | 23-11-2024, 02:07 PM
joshman (Josh), Coffs Harbour, Australia
How well do the benchmark results translate into actual PI performance?

I recently built a machine dedicated to PI processing (and some casual gaming):
AMD 7950X, 64GB RAM, NVIDIA 4070 Ti Super, 2TB NVMe for Windows, 4TB NVMe for PI working space/file storage, and a dedicated 2 x 500GB NVMe for PI swap space.

I had wanted to put the 2 x 500GB into RAID 0 for swap, but ended up just directing 8 swap folder pointers to each drive in PI, and setting up 32 file/folder I/O threads.

From a pure benchmark result, it's fantastic.
Attached Thumbnails: Screenshot 2024-11-23 150017.png
#10 | 23-11-2024, 04:20 PM
rmuhlack (Richard), Strathalbyn, SA
I have been using refurbished workstations for PixInsight for a while now, and I agree - fantastic bang for buck.

My current workstation is a Dell T7910 which also features a dual Xeon E5-2699 v3 setup (36 physical processors, 72 logical), with 128GB RAM and a Quadro P4000 GPU. I also encountered the issue with only one CPU loading up (apparently a Win10 limitation), so the answer to utilising all logical cores is to use Linux - which yielded a total PI benchmark score of 25992.
Attached Thumbnails: pixinsight benchmark.JPG
#11 | 24-11-2024, 12:14 AM
AlexN, Caboolture, Australia
Yeah, mine is in the mid-20k range too with the 2 x E5-2680 v4s, RAID 0 512GB NVMe drives for swap, and data on a 1TB NVMe drive.
#12 | 24-11-2024, 05:50 PM
g__day (Matthew), Sydney
Josh - that is an awesome result - what was the build cost of your rig? It shows well what modern technology can deliver!

Richard - that is a brilliant result - do you remember what your rig was scoring on Windows 10? I didn't know Linux was that much faster than Windows!

I do follow STAstro's analysis of the last 3 versions of PI on Windows, and how his rig went from 31K on the benchmark on 1.8.9-1 to 7K on 1.8.9-3.

https://pixinsight.com/forum/index.p...11-24h2.24249/

So PI seems to have issues under Windows getting all it can from a dual Xeon. If I could double my CPU scores I would be stoked too!
Attached Thumbnails: 1729587601801.png

Last edited by g__day; 25-11-2024 at 08:02 PM.
#13 | 25-11-2024, 09:53 AM
AlexN, Caboolture, Australia
As a note - my rig is running Windows 10.

I think it must be the CPUs that you're running, resulting in > 64 logical processors as was mentioned earlier.

Under Windows 10, with my 28c/56t dual 2680 v4s, many operations in WBPP - namely calibration and integration - pin all 56 logical processors at 100% utilisation whenever I have 56 or more subs.

It would be interesting to see if jumping to Linux would make any sort of difference; however, this machine is used for more than just PI, and some of those tasks are either not possible in Linux (due to drivers/applications not existing) or far less performant.

I'm stuck with Windows... but with 56 threads, I'm having a good time in PI with the 2680 v4s.
#14 | 25-11-2024, 12:48 PM
g__day (Matthew), Sydney
Watching a YouTube video on dual booting Ubuntu for a Windows user, the steps don't seem so hard - so I think I will wait until release 1.9 comes out and see how performance is reported on large-core rigs on Ubuntu versus Windows.

If I were to dual boot I would likely add a new drive to be the boot drive, then see if you can put a boot manager on a PCIe x4 NVMe drive and have the rig boot from that - and likely switch Hyper-Threading back on when I play with Linux.

It seems like a lot of effort to get back the 30% - 70% of performance that has been lost from release 1.8.9-1 to release 1.8.9-3, for reasons Juan as CEO of PI simply can't get to the bottom of!

But yes, on the upside - these systems are still very fast!
#15 | 26-11-2024, 09:07 AM
rustigsmed (Russell), Mornington Peninsula, Australia
Quote:
Originally Posted by g__day View Post
Watching a Youtube video of dual booting Ubuntu for a Windows user - the steps don't seem so hard - so I think I will wait until release 1.9 comes out and how performance is reported on large core rigs on Ubuntu versus Windows.

If I were to dual boot I would likely add a new drive to be the boot drive then its see if you can add a boot manager to a PCIE x4 NVME drive and have the rig boot from that - and likely switch Hyperthreading back on when I play with Linux.

Seems like lot of effort to get back 30% - 70% of performance that is lost since release 1.9.8.1 to release 1.8.9.3 for reasons Jaun as CEO of the PI simply can't get to the bottom of!


But yes on the upside - these systems are still very fast!

Hi Matthew - great idea on dual booting, and having a separate drive makes it cleaner rather than reducing the Windows partition size (though that can be done). Just be aware that the systems will run different partition formats - Windows uses NTFS, and on Linux you can choose either ext4 or btrfs. What this effectively means is that Windows won't be able to see files on the Linux-formatted drives, while on the other hand Linux can see the Windows partition and copy/paste files from the Windows drives. I was thinking you may have had some trouble editing files on the Windows partition from Linux, but I've just tested it and it seems to work fine.


Also, if you don't like the GNOME desktop look that Ubuntu uses, you can choose another of the Ubuntu flavours that is Ubuntu under the hood but with a different desktop environment (the look and feel of the icons, some default programs, application menu/launcher etc) https://ubuntu.com/desktop/flavours - I would recommend Kubuntu 24.10 https://kubuntu.org/getkubuntu/ which uses the KDE Plasma desktop environment if you want it to be a bit more Windows-familiar in layout. Alternatively you can install different desktop environments side by side and choose which one you want at log in https://en.ubunlog.com/how-to-have-m...d-derivatives/ but it is cleaner and takes up way less space just having one. The Explaining Computers YouTube channel has a lot of good info on dual booting and introducing people to Linux.



keep us posted on your results
#16 | 01-12-2024, 11:28 AM
g__day (Matthew), Sydney
Well, the latest workstation news is I wanted to up its CUDA capabilities - for the PI Xterminators, for SetiAstro Suite denoise and sharpen, and for DaVinci Resolve encoding. So after a lot of research about power and cabling, I decided to step away from 2nd-hand cards like the P6000 and go with an ASUS ProArt GeForce RTX 4070 - the 8-pin power connector version (using a dual female 6-pin PCIe to single male 8-pin adapter).

The GPU will be available on Monday and hopefully the required power cable will arrive next week so I can put it all together. I am still trying to work out if it will/should have a bracket to help support its weight - the way the P4000 currently does.

This upgrade takes the number of CUDA cores from 1,792 to 5,888. Now from a gaming perspective one wants to keep CPU and GPU speed balanced - you would generally look for a 4-5 GHz CPU to pair with a 4070 or faster card - but for compute processing, a ton of 2.8 GHz cores trying to keep a 4070 busy will be interesting to observe. And I guess it could still play games, though that is not its key purpose.

The 4070 was the most modern card I was willing to go to - not a Super or Ti version, or a 4080 or 4090, or waiting a few months for the 5000 series - because of power requirements and unknowns about the 5000 series. All the larger 4xxx cards need more than a single 8-pin PCIe power cable to feed a 12- or 16-pin connector - and I didn't want crazy wiring, joining 6-pin PCIe and 2 x SATA power cables, to try and clear that hurdle.

So hopefully in the next few days I can report how it all goes down!

Last edited by g__day; 04-12-2024 at 05:35 AM.
#17 | 07-12-2024, 05:26 PM
g__day (Matthew), Sydney
So the RTX 4070 is added now - glad that is all set up - a tad tricky, but it's rocking along - and the PixInsight Xterminators are far faster, as are SetiAstro AstroSuite and DaVinci Resolve - things are between 3x - 6x faster in initial benchmarks - but I will post a more comprehensive update later.

Bottom line - I am completely happy with this set up now!


Test results:

So, a bit more of an astronomy workload benchmarking afternoon since installing a current-generation graphics card (ASUS ProArt RTX 4070) in my new workstation. I concentrated on benchmarking with SetiAstro's amazing astro suite, running Sharpen and Denoise on a 73 MB test image - testing both the CPUs and new vs old GPU - then I ran PassMark for good measure.


The results are rather solid and pleasing:


SetiAstro Sharpen:
1. Dual CPUs, 36 cores at 100% load: 15 mins 30 seconds
2. Original Quadro P4000 GPU: 3 mins 20 seconds
3. New RTX 4070 GPU: 76 seconds


SetiAstro Denoise:
1. Dual CPUs: 5 mins 40 seconds
2. Original P4000 GPU: 3 mins 27 seconds
3. RTX 4070 GPU: 47 seconds



The PassMark results were all pleasing for a system whose CPU and memory setup had its heyday ten years ago!
Attached Thumbnails: Passmark run 3.PNG | Passmark drive performance NVME RAIDo 0.PNG | Userbenchmark RTX 4070 page 1.jpg | sav10 dual xeon Sharpen GPU RTX 4070 1 min 16 secs.jpg

Last edited by g__day; 08-12-2024 at 05:48 PM.
#18 | 15-12-2024, 11:57 AM
g__day (Matthew), Sydney
So it appears the Qt6 library that PI is compiled with under Windows may be causing some material delays in all processing since 1.8.9-2.

Interestingly, I see in the PI benchmarks a rig with the exact same CPUs as mine completing the benchmark in just 60% of the time mine does, under Linux - so that is something worth looking into!
#19 | 17-12-2024, 04:38 PM
By.Jove (Jove), Sydney
Switch to the dark side - Apple Silicon M4 will toast your rig.
#20 | 19-12-2024, 06:54 AM
rustigsmed (Russell), Mornington Peninsula, Australia
Quote:
Originally Posted by By.Jove View Post
Switch to the dark side - Apple Silicon M4 will toast your rig.

not so sure about that ... for PI purposes anyway..



fastest M4 benchmark - https://www.pixinsight.com/benchmark...X051MI27XK407Z (Total 26356)


fastest E5-2699 v3 (linux) - https://www.pixinsight.com/benchmark...N64CY8O8L08466 (Total 25992)
Powered by vBulletin Version 3.8.7 | Copyright ©2000 - 2025, Jelsoft Enterprises Ltd.