Skip navigation

Monthly Archives: September 2008

If you suddenly needed a cronnable Postgresql database update command for SQL text files, you would probably just type:

cat /path/to/some/dir/*.sql | psql -U postgres someDatabase

So, I am asking myself, have I created something pointless?

As it turns out:

  • pgBee keeps track of the update process. If a pgBee instance is killed, the next invocation will carry on from where the previous one has stopped. And if it finds SQL errors, it will report how far it got in the input files before quitting.
  • pgBee is actually faster than psql when executing SQL statements from a text file. psql took 112m (with one transaction for each statement), psql -1 took 97m (with one transaction for the entire file) but pgBee finished in 21m !!! (with one transaction per batch) That’s a whopping 898 operations per second. All tests were run on the same database server (localhost), pgBee was batching groups of 100 statements at a time and a real data file was used, with 1131753 SQL statements in total (511335 DELETEs and 567577 INSERTs).

In a previous post, I promised some examples/tutorials on load-testing Postgresql servers with Tsung. Well, I have tried to develop a database performance testing methodology that may be: a. application-specific, and b. easily applied to different servers and configurations, to assess their relative performance.

Tsung is ideally suited for application-specific Postgresql testing, as it supports a “proxy mode” to record SQL sessions, which are then turned into a scenario file and replayed any number of times. It also supports including alternative sessions in the same scenario file, so that each simulated new user may send a different set of SQL statements, according to the probability assigned to each session.

Different parts of a session may be grouped into transactions (Tsung-speak — nothing to do with your normal database transactions) for statistical monitoring of SQL groups. Transactions are characterised by their name, and names may be shared across sessions. This way, there are tremendous reporting possibilities, as all sessions may have a “connection” transaction offering global connection statistics, while transactions with unique names produce statistics on a specific use-case basis (e.g. complex data search, typical page load etc.).

I’d say there are two main preparation stages for meaningful Postgresql load-testing with Tsung:

These stages will be analyzed each in their respective post. It turns out capturing SQL statements and turning them into a Tsung scenario file was not as easy as I thought.

a Postgresql Bulk Updater in Java

pgBee is a set of Java classes I wrote for automating bulk updates of Postgresql databases on Linux servers. It requires Java (doh!) and Ant (as a build/execute front-end), it is cronnable and performs very well, especially in multi-threaded mode, which takes full advantage of multi-core CPUs in modern servers. The source of inspiration for pgBee has been previously described.

This code is released under a GNU General Public License (GPL).

Ant sometimes refuses to run in the background, so the best way to make pgBee work as a cron job is probably to call a simple shell script from cron, like the one below:

export JAVA_HOME=/usr
export ANT_HOME=/usr/local/ant
/usr/local/bin/ant -f /path/to/build.xml run </dev/null &

All configuration is done in the settings.xml file, but some options may be set through the command line, e.g.

ant -f /path/to/build.xml -Dlock=yes -Dthreads=8 -Dparallel=yes run

pgBee processes all files it finds in a particular (in) directory and moves them to either a done directory or a rejects directory, if there were SQL errors. You’ll need to create the right directory structure and configure pgBee settings before starting. The pgBee process catches SIGTERM, SIGHUP etc. signals and exits gracefully, ready to resume from where it stopped the next time it is run. So, it should be quite reliable, in the absence of hard resets and kill -9. Having said that, I am supplying no guarantees of fitness for any purpose of any kind 🙂 Please use at your own risk.

If you need to make sure a particular set of statements is processed in the same transaction, you only have to include all statements in the same line of an input file, separated by semi-colons. There’s no limit to how many SQL statements you may include in a single line. More information about input file format, usage and configuration may be found in the downloadable tarball

Data models are good and they are clear, if you’re the person writing the application and devising the model. Hell, sometimes, they are not clear even then! So, imagine what happens when you get someone from the street to connect to your database and read your schema in order to understand it. No chance!

Now, this is not about some poor wardriver who doesn’t know how to read the implicit relationships between tables in your model – they had it coming! But what about your legit users, working on a particular aspect of your infrastructure or application, such as developers, DBAs etc. ? How on earth do they make sense of it all when they first start?

Yes, yes, in an ideal world everything’s properly documented, but when was the last time you saw that in a real life situation? Real IT people don’t write helpful comments when they create their tables, views, functions etc. Referential integrity? Don’t make me laugh! Most developers avoid database constraints, to keep the application portable between database systems and database error messages to a bare minimum. Integrity rules are usually enforced at the application level. From a DBA’s perspective, most enterprise-level databases are big collections of seemingly unrelated tables, with no business logic in the DB system itself.

But don’t despair! Help is at hand. Enter Schema Spy:

You dowload the jar file, and then you run something along the lines of

java -jar schemaSpy_3.1.1.jar -t pgsql -cp /path/to/jdbc.jar \
                              -u user -p password -s schema \
                              -db dbname -host localhost:5432 \
                              -o output-dir

After a while, you have a look in output-dir, and the reports are really nice.

Schema Spy even deduces table relationships from field names and types. And it seems to support several different database systems, including Oracle and MySQL. Hurrah!