. "2017-06-13T05:37:17Z" . . "%VOSWARNING%\n\n\n%META:TOPICINFO{author=\"RumiTsekova\" date=\"1159179079\" format=\"1.1\" version=\"1.1\"}%\n%META:TOPICPARENT{name=\"VOSArticles\"}%\n---+ Virtuoso and Database Scalability\n\nBy Orri Erling\n\n\n%TOC%\n\n\n---++ Abstract\n\nThis article shows results of running a [[http://tpc.org/tpcc/default.asp][TPC-C]]-based \nbenchmark on a number of different platforms with different CPUs, amounts of memory, and \nnumbers of disks. The motivation for the tests was finding a suitable building block for \na Virtuoso-based web application server farm.\n\nWe concentrate on hardware and database scales that are most likely relevant for developers \nof small to mid-size online applications. The database size starts at 4G, and the test has \n80 concurrent clients. We compare systems with between 512M and 4G of memory, and 1 to 4 disks. \n\nThe subject of benchmarking and optimizing the operation of a database server on a particular \nOS for a particular workload is extremely complex. In this article, we show some sample results \nand outline dependencies between factors affecting performance but do not claim to present an \nexhaustive study of the subject.\n\nThe numbers obtained are not comparable with the official TPC-C metric. The type of system and \nthe test rules are quite different.\n\n\n---++ Metrics\n\nThe benchmarks measure the following:\n\n * Time to load a 40 warehouse TPC C database. This is about 4G worth of short data rows, \ninserted in ascending order from an ODBC client application. This measures the speed of \na single CPU, as the operation is mostly CPU bound with serial disk writes taking place in \nthe background. Before this is done, a 12GB empty database is preallocated, striped on as \nmany disks as are available.\n * The transactions per minute metric, running 80 clients, 2 per warehouse, each client \nproducing the TPC-C transaction mix, starting the next transaction as soon as the previous \ncompletes. This is affected by a combination of random disk access latency, amount of memory \nand CPU speed.\n * Serial read - This measures the disk throughput for reading through a large fragmented \ndatabase table, a typical operation for any business intelligence query.\n\n\n---++ Results\n\n---+++ Transaction Throughput\n\n| *Throughput* | *disk type and quantity* | *RAM* | *processor kind, speed, and quantity* | *CPU%* | *System* |\n| 19001 | 3 x SCSI | 44 GB | 2 x SPARC, 1.6GHz | 120% | Solaris 10 on SunFire 440 |\n| 11580 | 4 x SATA | 1 GB | 1 x AMD64, 2 GHz | 50% | FC4 on AMD64 Generic PC |\n| 8138 | 1 x SCSI | 2 GB | 2 x AMD64, 1.8GHz | 42% | Solaris 10 on SunFire V20z |\n| 6081 | 4 x SCSI RAID | 2 GB | 4 x Pentium 3 700 MHz | N/A | Windows 2000 on Dell |\n| 2989 | 2 x SATA | 1 GB | 1 x AMD64 2GHz | 10% | FC4 on Generic PC |\n| 2701 | 4 x SCSI | 0.5 GB | 2 x Pentium 3 600MHz | 102% | Solaris 7 on Generic PC |\n| 83937 | 4 x SATA 7200rpm | 8 GB | 2 x dual core Xeon 5130, 2GHz, 64 bit mode | 274% | Supermicro 7045 A/T |\n| 49450 | 4 x SATA 7200rpm | 8 GB | 2 x dual core Xeon 5130, 2GHz, 64 bit mode | 187% | Supermicro 7045 A/T |\n| 2607 | 1 x 80G built-in | 2 GB | 1 x Intel Core Duo 1.66GHz | 27% | Mac OS X on Mac mini |\n\nThe throughput is the count of successful new order transactions made during the test run, \nscaled to a transactions per minute count. For every 10 new orders, there are 10 payments, \n1 delivery, 1 order status, and 1 stock level transaction. 
---+++ Initializing 40 Warehouses

| *Time* | *Disk type and quantity* | *Processor kind, speed, and quantity* | *System* |
| 12m 6.874s | 4 x SATA | 1 x AMD64, 2GHz | FC4 on AMD64 Generic PC |
| 17m 4.007s | 3 x SCSI | 2 x SPARC, 1.6GHz | Solaris 10 on SunFire 440 |
| 17m 50.294s | 1 x SCSI | 2 x AMD64, 1.8GHz | Solaris 10 on SunFire V20z |
| 44m | 4 x SCSI | 2 x Pentium 3, 600MHz | Solaris 7 on Generic PC |
| 4m 34s | 4 x SATA 7200rpm | 2 x dual core Xeon 5130, 2GHz, 64 bit mode | Supermicro 7045 A/T |
| 14m 14s | 1 x 80G built-in | 1 x Intel Core Duo, 1.66GHz | Mac OS X on Mac mini |


---+++ Serial Read

| *MB per second* | *Disk type and quantity* | *System* |
| 8.25 | 3 x SCSI | Solaris 10 on SunFire 440 |
| 7.77 | 4 x SATA | FC4 on AMD64 Generic PC |
| 2.69 | 1 x SCSI | Solaris 10 on SunFire V20z |


---++ Test Conditions

Each system was tested with the database and 80 clients running on the same machine. First a 12GB preallocated test database was made, then filled with 40 warehouses. After this, 80 test drivers were started, two per warehouse, and left to run for 60 minutes of real time. After 60 minutes the database was shut down, disconnecting the clients. The database was then restarted and the number of new orders was counted. This number, minus the original number of orders, divided by 60, is the metric. After this, the sequential read test was run.

Below are the relevant lines of the virtuoso.ini file used for the test:

<verbatim>
[Database]
TransactionFile = tpcc.trx
Striping = 1

[Parameters]
ServerThreads = 1000
; These are allocated as needed; the essential thing
; is to have 30% more than the number of
; concurrent clients.
CheckpointInterval = 20
; To have a steady-state test, one that does
; not accumulate an infinite transaction log,
; we checkpoint every 20 minutes, three times
; per one-hour test run.
NumberOfBuffers = 43000
; 43000 if 512MB RAM; 120000 if 2GB. Systems
; were tried with different settings and the
; best result was taken.
MaxDirtyBuffers = 30000
; About 3/4 of the number of buffers.
MaxCheckpointRemap = 500000
; Make this larger than the working set, so
; that checkpoints will just be buffer flushes
; without any need to move data inside db files.
; 500000 pages is about 4GB.
FDsPerFile = 16
; If the database is a single file or a few
; files, especially on a striped RAID, this
; should be large. It controls how many threads
; may have a read/write system call going on a
; single file at once. The OS gets to sort them
; and/or do them in parallel if the file system
; is striped. It is best to do striping at the
; db level and to have disks that are known to
; be independent, rather than relying on a RAID
; controller for this.

[Striping]
Segment1 = 12G, tpcc1.seg q1
; Add files here, one file per independent disk,
; with q2...qn after each to allocate a dedicated
; I/O thread per device.
</verbatim>
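As a concrete illustration of the [Striping] comment above, a configuration for a machine with four independent disks might look as follows. This is a sketch based on the striping syntax shown above, not a configuration taken from the tested systems; the mount points and file names are made up, and the 12G segment is spread over four stripes, each with its own background I/O queue:

<verbatim>
[Striping]
Segment1 = 12G, /data1/tpcc1.seg q1, /data2/tpcc2.seg q2, /data3/tpcc3.seg q3, /data4/tpcc4.seg q4
</verbatim>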
---++ Analysis

---+++ Working Set

To understand the dynamic between main memory and I/O, we have to look at the working set of the transaction mix. The stock and customer tables amount to about 58 MB per warehouse and are subject to constant updating. The frequency of access is not equal for all rows, but we may idealize the situation by assuming that, at the page level, all pages of these tables see an even frequency of access even if the rows do not. Additionally, there are 10 points per warehouse in the *orders* and *order_line* tables where sequential inserts take place. Both the delivery and stock level transactions concern recently inserted orders. The delivery transaction updates orders about 2.8 MB worth of inserts after they were first inserted. If we counted new orders as part of the working set from the time they are inserted until they are updated by the delivery transaction, after which they are not touched again, we would get an extra 28 MB of working set per warehouse. With low memory, however, order lines are not likely to stay in cache for the interval between initial insertion and the update on delivery, so we can roughly estimate the per-warehouse working set to be 66 MB, counting 8 MB for the 40 insert or update points corresponding to inserts in *orders* and *order_line* and for the recent order lines scanned by the stock level transaction.

For 40 warehouses, this gives a working set of 2640 MB. Since the activity on orders and order lines is localized and constant, the cache misses are divided between the stock and customer tables.


---+++ Effect of Server Cache Size

On a system with 2GB RAM, we have about half the working set in memory if we configure half the space for database buffers; the other half will go mostly towards OS file caching. If we configure most of the space for database buffers, the OS will swap database buffers out in order to keep more file cache. This is very bad for performance.

On a system with 1GB, we may have up to a quarter of the working set in memory, again if we configure half the memory for database buffers.

Means of avoiding double buffering are system dependent and are not addressed in this article, but they may be revisited in a follow-up.

We ran the test on a SunFire V20z Solaris 10 machine with 2GB of RAM and different amounts of database buffers. The measurements were:

| *Buffers* | *Throughput* |
| 43000 | 3872 |
| 86000 | 6955 |
| 100000 | 7819 |
| 120000 | 8137 |
| 150000 | 7253 |

There are two factors explaining the large effect of the amount of server disk cache:
   * There is a difference between having the data in a warm OS file cache and having it in the user space of the process. For a SELECT COUNT(*) doing a full table scan, having the pages in the server cache, as opposed to copying them from a warm OS file cache into the server cache, makes approximately a threefold difference, measured on Linux.
   * Having more buffers in the server cache significantly reduces the number of disk writes per transaction. With 43000 buffers, we typically have 3 disk writes for every new order. With 140000 buffers, we have about 1.4 writes per new order. Additional tuning could drop this still further.

When the amount of main database cache goes over 2/3 of main memory, performance drops sharply, mostly due to the OS swapping out the database process in favor of OS disk cache. Wiring down the database process may help, but results will depend on the OS.
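To make the buffer arithmetic above easy to reproduce, here is a small sketch. It assumes that one buffer caches one 8 KB database page, which matches the ini comment that 500000 pages is about 4GB; the helper names are ours, not part of Virtuoso.

<verbatim>
# Rough sizing helpers for NumberOfBuffers, assuming 8 KB pages
# (one buffer = one page) and the guidelines discussed above:
# give the cache about half of RAM, and keep it under ~2/3 of RAM
# so the OS does not swap out the server process.
PAGE_BYTES = 8 * 1024

def buffers_for_ram(ram_bytes, fraction=0.5):
    """Suggest a NumberOfBuffers value using `fraction` of physical RAM."""
    return int(ram_bytes * fraction) // PAGE_BYTES

def cache_bytes(number_of_buffers):
    """Approximate memory used by the buffer cache itself."""
    return number_of_buffers * PAGE_BYTES

def max_dirty_buffers(number_of_buffers):
    """About 3/4 of NumberOfBuffers, as in the ini file above."""
    return number_of_buffers * 3 // 4

# buffers_for_ram(2 * 2**30) -> 131072, in the same range as the 120000
# that gave the best throughput on the 2GB machine; cache_bytes(43000)
# is about 336 MiB, the setting used for the 512MB configurations.
</verbatim>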
---+++ CPU Load

The CPU load due to the clients is minimal, about 1/20 of the CPU usage of the server. The clients do nothing except execute stored procedure calls one after the other and do virtually no other I/O.

We see that the database load time essentially reflects the clock speed. It is a single-CPU, CPU-bound operation where all I/O consists of sequential background writes.

The transaction rate, on the other hand, reflects the number of independently addressable disks and the amount of memory.

The split between user and system CPU time was around 15% system CPU, as a percentage of total CPU; it rarely exceeded 25%. This was the case on all operating systems.

To find out the difference between CPU-bound and I/O-bound situations, we decreased the working set to 10 warehouses, about 650MB, and ran the same test, again with two clients per warehouse. After a ramp-up of 7 minutes, when the test had reached a steady state on our dual AMD64 SunFire, we measured CPU at around 70%, varying between 60% and 80%, meaning around one and a half of the 2 CPUs were busy with the Virtuoso server process. We continued to have 350% disk reads, meaning an average of 3.5 read system calls pending at any one time, down from over 20 pending at a time when we ran with the 40-warehouse configuration. We note, however, that writes to disk do occur, both for the transaction log and for dirty buffers, even though the frequently updated working set tends not to get written to disk and at any time about 2/3 of the buffers are dirty. Thus the test is not free of I/O; however, it is mostly CPU bound.

The mostly memory-based database throughput was 42748 transactions per minute, with 64% CPU, that is, 128% of one CPU. This is about five times the throughput of the heavily I/O-bound scenario with the 40-warehouse working set.


---++ Disk Parallelism

If the working set is significantly larger than the server's disk cache, performance grows nearly linearly with the number of disks, at least between 1 and 4 disks.

We see nearly the same throughput for an old 600MHz dual Pentium 3 with 512 MB and 4 SCSI disks as for a new 2GHz AMD64 with 1GB and 2 SATA disks. In both cases the benchmark is heavily I/O bound, so having twice the number of parallel disk seeks compensates for having only half the memory.

The fragmented sequential read metric also reflects the number of disks. There is a per-disk ordered read-ahead, done with one thread per device, so that the throughput is roughly linear in the number of disks, at least at the beginning. This is independent of the amount of memory, as the read is done into a cold cache and no row is visited twice. There is substantial fragmentation, so this should not be compared with raw disk data transfer rates.
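The serial read figure can be approximated with a simple timing of a cold-cache full table scan, in the spirit of the SELECT COUNT(*) example mentioned earlier. The sketch below is an assumption on our part: the table name, its size, and the use of Python/pyodbc are placeholders, not the measurement harness behind the numbers above.

<verbatim>
# Illustrative cold-cache serial read measurement: time a full scan of a
# large table and divide its on-disk size by the elapsed time.
# Restart the server (or otherwise flush caches) before running, so that
# the scan really reads from disk, as in the test conditions above.
import time
import pyodbc

def serial_read_mb_per_sec(dsn, table="order_line", table_mb=2000.0):
    # `table` and `table_mb` are placeholders; substitute the real table
    # name and its approximate size in megabytes.
    con = pyodbc.connect(dsn)
    cur = con.cursor()
    started = time.time()
    cur.execute("SELECT COUNT(*) FROM " + table)  # drives a full scan
    cur.fetchone()
    elapsed = time.time() - started
    con.close()
    return table_mb / elapsed
</verbatim>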
---++ Implications for Web Applications

The TPC-C workload is I/O intensive. Each transaction requires fairly little CPU in comparison with composing a typical dynamic web page, for example. When profiling the Virtuoso web applications suite, it is not uncommon for about 15% of the work to be database related, with the rest having to do with composing the page.

TPC-C is not representative of an e-commerce application, since most of the clicks such an application processes have to do with browsing the catalogue or filling the shopping cart, as opposed to processing an order.

It is safe to estimate that a typical web application, for the same amount of I/O as is generated by TPC-C, will use 5-10x more CPU. Thus, while there was little or no benefit from multiple CPUs in the cases covered here, a web application is likely to benefit from 2 or 4 CPUs for the same amount of I/O.


---++ Other DBMS

The same observations will apply to basically any DBMS. We have ported the Virtuoso TPC-C test driver and stored procedures to Oracle and Microsoft SQL Server. The Oracle stored procedures can be found in binsrc/tests/oracle_tpcc. The DB2 and Microsoft SQL Server procedures can be found in any TPC full disclosure report. For the interested, making the tests run on any database with stored procedures and ODBC should be a simple task. In many cases, license terms prohibit us from publishing results of tests run on databases other than our own.


---++ Conclusions

From the measurements we infer that the lowest cost per unit of throughput would be found with an AMD64 PC with 4GB of memory and as many SATA disks as possible. While high-end disks provide better seek times and transfer rates, these are offset by having a larger number of independently seekable units, whether for sequential or random access. A commodity PC with this specification costs around $2000-2500, whereas the Sun servers tested are more expensive. We do not have exact prices for all the systems tested.

An in-depth study of scalability would have to involve measurements with system-specific OS tuning to eliminate double buffering and prevent swapping of the server process. Also, the performance per added disk should be graphed, up to the point of diminishing returns. Indeed, each answer gives rise to more questions.

We trust, however, that the data presented here gives an initial feel for the factors affecting Virtuoso performance on common hardware configurations.