The UNION clause combines the results of two SELECT statements into a single set of output. Now why would you want to do that? If you are pulling data from a single table, you can always just adjust the WHERE clause to cover both conditions. But what about when you are pulling data from more than one table? That is where the UNION clause can help.
My first use of the UNION clause came as I was creating a cross-tab report. I was pulling data from multiple database tables and was able to massage the data so that I could use it with the UNION to create one set of output that could then be displayed in a dynamic spreadsheet-like report. When I was done, I had a 650-line SQL query with over 100 UNION clauses. Here is a simple example of how it works:
Suppose you have two tables: favorite_sports and favorite_hobbies. Nobody would actually create tables this way, but it keeps the example clear. The favorite_sports table has the following contents:
id | name | sport
----+------+---------------
1 | Matt | Skiing
2 | Nate | Rock Climbing
3 | Dave | Skiing
And the favorite_hobbies table has the following contents:
id | name | hobby
----+------+-----------
1 | Matt | Ham Radio
The table definitions don't have to match exactly to use the UNION operator; the two SELECT lists just need the same number of columns with compatible types. In this example, I used a varchar(20) for the sport column and a varchar(25) for the hobby column. That works fine; just don't try to match up an integer with a string.
A simple example of the UNION clause would be:
SELECT * FROM favorite_sports
UNION
SELECT * FROM favorite_hobbies;
And would give the following results:
id | name | sport
----+------+---------------
1 | Matt | Ham Radio
2 | Nate | Rock Climbing
3 | Dave | Skiing
1 | Matt | Skiing
Notice how we get all of the rows out of both tables. (UNION also removes exact duplicate rows; use UNION ALL if you want to keep them.) Unfortunately the column heading for the sport or hobby is "sport". We can fix that by using a column alias, as in the following query:
SELECT id, name, sport AS "sport/hobby"
FROM favorite_sports
UNION
SELECT *
FROM favorite_hobbies;
And gives us the following results:
id | name | sport/hobby
----+------+---------------
1 | Matt | Ham Radio
2 | Nate | Rock Climbing
3 | Dave | Skiing
1 | Matt | Skiing
Notice the use of double quotes for the alias name. Single quotes are the norm in SQL, but they are for string literals; identifiers such as column aliases take double quotes, so a single-quoted alias would generate a syntax error here. Now if we want to add a type column, it is as simple as running the following query:
SELECT id, name, sport AS "sport/hobby", 'sport' AS type
FROM favorite_sports
UNION
SELECT *, 'hobby'
FROM favorite_hobbies;
Which gives us the following results:
id | name | sport/hobby | type
----+------+---------------+-------
1 | Matt | Ham Radio | hobby
1 | Matt | Skiing | sport
2 | Nate | Rock Climbing | sport
3 | Dave | Skiing | sport
Hopefully this gives you a concrete example of the UNION clause and helps you understand when to use it.
Thursday, February 20, 2014
Wednesday, February 19, 2014
Monitoring PostgreSQL
PostgreSQL often does not require a lot of monitoring. For those developers creating applications or trying things out on their own system, it isn't required at all. Once you move your application into production, the database may or may not need to be monitored depending on how much traffic it sees. Production environments that see a lot of traffic and run close to system capacity need a bit more care.
PostgreSQL includes a number of system catalog tables that you can use to find out information about your database cluster. I've always referred to them as Data Dictionary tables because that's what they would be called in Oracle, the first database system I learned. PostgreSQL also includes a number of functions that combine the information in some of these tables and present the results in a more readable form. Monitoring PostgreSQL comes down to querying these tables and running these functions. Here are some of the common ones that I have used:
Monitoring Current Database Commands
When things start slowing down and traffic backs up, it is good to be able to see which queries or commands are currently running. Perhaps one query is slowing everything down, or a large number of statements are being executed all at once. Looking at the pg_stat_activity view will show you what is running along with some important information about each statement. If you describe the pg_stat_activity view using the \d command in psql, you get the following output:
View "pg_catalog.pg_stat_activity"
Column | Type | Modifiers
------------------+--------------------------+-----------
datid | oid |
datname | name |
pid | integer |
usesysid | oid |
usename | name |
application_name | text |
client_addr | inet |
client_hostname | text |
client_port | integer |
backend_start | timestamp with time zone |
xact_start | timestamp with time zone |
query_start | timestamp with time zone |
state_change | timestamp with time zone |
waiting | boolean |
state | text |
query | text |
The last column in the view is named query, but it shows whatever statement the backend is running, including INSERTs and DELETEs, not just queries. Another column of note is pid, the process ID of the backend running the statement. It allows you to go out to the operating system and use tools like ps to find more information about a specific statement that is running.
Running the command on a development server with no current traffic gives the following results:
datid | datname | pid | usesysid | usename | application_name | client_addr | client_hostname | clie
nt_port | backend_start | xact_start | query_start
| state_change | waiting | state | query
-------+----------+------+----------+----------+------------------+-------------+-----------------+-----
--------+-------------------------------+-------------------------------+-------------------------------
+------------------------------+---------+--------+---------------------------------
12914 | postgres | 3389 | 16391 | mbennett | psql | | |
-1 | 2014-02-19 09:05:51.377371-07 | 2014-02-19 09:10:28.021888-07 | 2014-02-19 09:10:28.021888-07
| 2014-02-19 09:10:28.02189-07 | f | active | select * from pg_stat_activity;
(1 row)
Listing everything in a single line makes it rather difficult to read, but this gives you an idea of what your output will look like. On a larger server in a high-traffic production environment you will get many rows of output, so you may want to select only the columns that help you diagnose problems, such as pid, application_name, query_start, and query.
Calculating the Size of Your Database
Unix and Linux do not behave very well when you fill the disk, so it is important to keep a close eye on how large your database is and make sure you don't run out of space. There is a database function named pg_database_size that reports the size of your database. You just need to run the following query:
SELECT pg_database_size('database_name');
PostgreSQL can have multiple databases running on the same server; you can get a list of them all with \l in psql. Be sure to replace the string database_name with the name of your actual database. The result is output in bytes; wrapping the call in pg_size_pretty() gives a human-readable value instead.
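Since pg_database_size returns a plain byte count, a little arithmetic makes the number easier to read. Here is a quick sketch (the byte count is a made-up example value, not real output):

```shell
#!/bin/sh
# Hypothetical example: convert a byte count, as returned by
# pg_database_size, into megabytes for easier reading.
bytes=734003200            # example value only
mb=$(( bytes / 1024 / 1024 ))
echo "${mb} MB"            # prints "700 MB"
```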
Calculating the Size of a Specific Table
Knowing the size of your database helps keep you from running out of disk space; knowing the size of a specific table helps with performance. Once a table grows beyond a certain size, indexes will help reduce query times, and you may also want to consider partitioning for larger tables. You can find a table's size with the pg_total_relation_size function using the following command:
SELECT pg_total_relation_size('table_name');
Be sure to replace table_name with the actual name of the table you are interested in. Once again the result is in bytes, and it includes the table's indexes and TOAST data.
Finding the Largest Tables in Your Database
The pg_total_relation_size function is good if you already know which of your tables are large. But what do you do if you are new to a project and nobody has had time to explain which tables are the biggest? The pg_class catalog contains that information, and finding the top 10 largest tables is as simple as the following query:
SELECT relname, relpages
FROM pg_class
ORDER BY relpages DESC
LIMIT 10;
Changing the LIMIT clause will allow you to see more or fewer relations. Leaving it off completely shows everything in your database, including indexes and the system catalog or data dictionary tables and views. The relpages column shows how many database pages each relation uses, so it is a relative size rather than a byte count; pages are normally 8 kB, and relpages is only updated by commands like VACUUM and ANALYZE, so treat it as an estimate. You can use the pg_total_relation_size function to see the number of bytes used by a table, as shown above.
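Since a page is normally 8 kB, you can turn a relpages figure into an approximate byte count with simple arithmetic. A sketch, assuming the default 8192-byte block size (the page count is a made-up example):

```shell
#!/bin/sh
# Hypothetical example: convert a relpages count into approximate bytes,
# assuming the default 8 kB (8192-byte) PostgreSQL block size.
relpages=1280              # example value only
bytes=$(( relpages * 8192 ))
echo "${bytes} bytes"      # prints "10485760 bytes"
```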
Checking for Uncommitted Transaction Locks
In high-volume transaction environments there is a chance that you will run into deadlocks. The pg_locks view will help you figure out locking issues. Describing the view with the \d command in psql gives the following:
View "pg_catalog.pg_locks"
Column | Type | Modifiers
--------------------+----------+-----------
locktype | text |
database | oid |
relation | oid |
page | integer |
tuple | smallint |
virtualxid | text |
transactionid | xid |
classid | oid |
objid | oid |
objsubid | smallint |
virtualtransaction | text |
pid | integer |
mode | text |
granted | boolean |
fastpath | boolean |
The column we are most interested in is granted. If the value of granted is true, the lock has been granted. If the value is false, the backend is waiting for a lock, which is a good clue as to what might be causing a problem.
Tuesday, February 18, 2014
Using pg_dump and pg_dumpall for Database Backups
The pg_dump and pg_dumpall utilities are great for moving data between databases and for making quick backups. While log shipping may have you back in business after a database failure, it doesn't protect you against user error, which pg_dump and pg_dumpall will. I actually like to set up both nightly pg_dump runs and log shipping on production databases.
There really isn't a need to keep pg_dump files around for a long time and so it is helpful to implement some kind of automatic backup rotation. Otherwise you will constantly have to monitor space on the filesystem where you keep your backups. I like to implement the following strategy for my database dumps:
- Perform a nightly backup and save them for a week
- Keep a copy of each Monday backup for a month
- Keep a copy of each first day of the month backup for a year
15 1 * * * <PATH>/nightly_backup.sh >> <PATH>/backup.log 2>&1
15 2 * * 1 <PATH>/weeksave.sh >> <PATH>/backup.log 2>&1
15 3 1 * * <PATH>/monthsave.sh >> <PATH>/backup.log 2>&1
This runs the nightly backup script at 1:15 every day, so it is more of a morning backup instead of a nightly one. You can modify the hour of the backup to meet your needs. I also allow an hour for the backup to complete before I try saving a copy for the weekly backup. You may need to adjust your timing accordingly.
I then have three simple scripts: nightly_backup.sh, weeksave.sh, and monthsave.sh. Here are the contents of each:
nightly_backup.sh
#!/bin/sh
/usr/bin/pg_dumpall > <PATH>/db_backup_$(date +%A).sql
weeksave.sh
#!/bin/sh
# date +%d zero-pads the day (e.g. "08"), which shell arithmetic
# would reject as invalid octal, so strip the leading zero first
DAY=$(date +%d)
WEEK=$(( (${DAY#0} - 1) / 7 + 1 ))
/bin/cp <PATH>/db_backup_$(date +%A).sql <PATH>/db_backup_week$WEEK.sql
monthsave.sh
#!/bin/sh
/bin/cp <PATH>/db_backup_$(date +%A).sql <PATH>/db_backup_$(date +%B).sql
# Change %B to %B%Y if you want to keep monthly backups forever
You will want to replace all <PATH> references to the paths used in your environment.
The key to all of the scripts is to use dates as part of the file names for each of the backup files. The nightly backup includes the day of the week so that when Monday rolls around again, it overwrites the previous Monday's backup. The weekly backup uses Monday's backup and just copies the file so that it has week1, week2, week3, week4, or week5 in the name.
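The week1 through week5 numbers fall out of simple integer arithmetic on the day of the month. Here is that calculation as a standalone sketch (the helper function name is mine, not part of the scripts):

```shell
#!/bin/sh
# Hypothetical helper: map a day of the month (1-31) to its
# week-of-month number (1-5), the same arithmetic weeksave.sh uses.
week_of_month() {
  echo $(( ($1 - 1) / 7 + 1 ))
}
week_of_month 1    # prints 1
week_of_month 8    # prints 2
week_of_month 31   # prints 5
```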
Notice the comment in the monthsave.sh script that shows how to keep your monthly backups forever. It just adds the year to the month used in the file name.
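All of the rotation hinges on date's format strings. A quick sketch of the formats the three scripts rely on (actual output depends on the day you run it):

```shell
#!/bin/sh
# The date formats the backup scripts rely on:
date +%A     # weekday name, e.g. "Monday"          -> nightly rotation
date +%B     # month name, e.g. "February"          -> monthly rotation
date +%B%Y   # month plus year, e.g. "February2014" -> keep months forever
```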
Wednesday, February 12, 2014
Creating a Filesystem Backup
One simple way of backing up your PostgreSQL database is to take a filesystem backup. Perhaps you have a system that backs up all of your servers on a nightly basis and you just want to include your PostgreSQL database. There is one condition that must be met before you can use this method safely: the database must be shut down. Some environments allow the database to be shut down on a nightly basis, and that will let you make a filesystem backup.
Doing so is as simple as using a utility like tar to backup the data directory as in the following command:
tar -cvzf my_backup.tgz ~/data
The "z" option tells tar to compress the resulting file so it takes less space.
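Here is a self-contained sketch of the whole approach, using a scratch directory in place of a real PostgreSQL data directory (all of the paths and file names are made up for the example):

```shell
#!/bin/sh
# Sketch: tar-and-gzip a data directory, then list the archive's
# contents to verify the backup. Uses a throwaway scratch directory,
# not a real PostgreSQL data directory.
workdir=$(mktemp -d)
mkdir -p "$workdir/data/base"
echo "sample" > "$workdir/data/base/12914"
tar -czf "$workdir/my_backup.tgz" -C "$workdir" data
tar -tzf "$workdir/my_backup.tgz"   # lists data/, data/base/, data/base/12914
rm -rf "$workdir"                   # clean up the scratch directory
```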
If you are running on hardware that allows you to take a consistent snapshot of the filesystem, you can do so with a running PostgreSQL database. Should you need to restore from that backup, PostgreSQL will think it was shut down improperly and replay the WAL files, which is not a problem.
If you implement streaming replication or log shipping, you can set the database into a backup mode that works with write-ahead logging (WAL) to ensure your database files are kept in sync. If you try to run the SELECT pg_start_backup('x'); command on a database that does not have log shipping enabled, you will get an error indicating that WAL archiving is not active.
Tuesday, February 11, 2014
Configuring a Hot Standby Server Using Log Shipping
Log shipping has been around in PostgreSQL since before version 8.4. Originally the secondary or slave servers could not be used for anything unless something went wrong with the master, at which point one could be triggered to become the new primary. This is what is referred to as a warm standby: it is almost ready to use and doesn't require time to restore a database backup. Version 9.0 introduced hot standby, where the slave servers can be used as read-only copies of the data. This allows them to serve as reporting servers, offloading some processing from the primary database server.
Log shipping or streaming replication in PostgreSQL 9.x is near real-time, so it takes less than a second for a change made in the master database to propagate to the replicated database. Here are the steps necessary to configure streaming replication on Linux:
1. Set up two identical database server machines. Make sure to disable the firewall on both machines (or at least open the PostgreSQL port, 5432 by default). For the purposes of this tutorial, one of the machines is named PG9master and the other PG9slave. Their names should indicate which one is which.
2. Install the proper PostgreSQL packages for your given distribution on both servers. RedHat, Fedora, and CentOS require postgresql-libs, postgresql-server, and postgresql-contrib. Ubuntu and Debian require postgresql-9.x and postgresql-contrib-9.x, where the x can be 0, 1, 2, or 3 (9.3 is the latest version of PostgreSQL at the time of this writing).
3. Set up some form of file transfer from PG9master to PG9slave. The example in this tutorial assumes that scp will be used. You will need to generate public and private keys that allow scp to transfer files without the need for password exchanges. As the postgres user on PG9master, run the following:
ssh-keygen -t rsa
Don't enter a passphrase; it asks for it twice, so just hit return both times. Your key files will be saved in the .ssh directory. You will need to copy id_rsa.pub to the PG9slave server using the following command from PG9master:
scp ~/.ssh/id_rsa.pub PG9slave:.ssh/authorized_keys
You will be asked to enter a password for the postgres user on PG9slave. Once the scp is finished, you can test not needing the password by using ssh to log into PG9slave from PG9master using the following:
ssh PG9slave
You should not need to enter the password.
4. You will need to create a directory where log files can be copied. On PG9slave, create a directory called:
~/master_logs
5. You are now ready to edit the postgresql.conf file on PG9master. If you are running one of the RedHat flavors of Linux, it will be located underneath the postgres user's home directory. If you are running Ubuntu, it will be located under the /etc/postgresql directory. Adjust the following parameters:
wal_level = hot_standby
archive_mode = on
archive_command = 'scp %p PG9slave:master_logs/%f'
listen_addresses = '*'
archive_timeout = 120   # 2 minute window if streaming fails
max_wal_senders = 1     # assume a single slave
6. Now edit the pg_hba.conf file on PG9master. It will be found in the same directory as the postgresql.conf file. Add the following being sure to replace <IP of PG9slave> with the actual IP address of PG9slave:
# TYPE DATABASE USER CIDR-ADDRESS METHOD
host replication pgrep <IP of PG9slave>/32 md5
7. If you are running the RedHat flavors of Linux, start the PostgreSQL server on PG9master using the following command as the root user:
/etc/init.d/postgresql-9.x start
Make sure to replace the .x with the correct dot release of PostgreSQL that you are using. If you are using the Ubuntu flavor of Linux, restart PostgreSQL so it picks up the postgresql.conf changes by running the following command as the root user:
service postgresql restart
8. Create a database user on PG9master named pgrep by going into psql and running the following command:
CREATE USER pgrep WITH superuser login PASSWORD 'pgrep_pass';
9. You are now ready to make a filesystem backup of the master database and copy it to the slave. On PG9master, run psql and issue the following command:
SELECT pg_start_backup('backup');
Exit psql using \q and make a copy of the data directory using the following:
cd 9.x
tar -cvf db_backup.tar data
Be sure to replace the .x with the dot version of PostgreSQL you are using. When the backup completes, go back into psql and issue the following command:
SELECT pg_stop_backup();
Copy the tar file to the slave machine using the following scp command:
scp db_backup.tar PG9slave:
10. Restore the master database on the slave. Log into PG9slave as the postgres user and issue the following commands:
cd 9.x
mv ../db_backup.tar .
tar -xvf db_backup.tar
cd data
rm postmaster.pid
rm -rf pg_xlog/*
Being sure to replace .x with the dot version of PostgreSQL you are using.
11. Edit the postgresql.conf file on PG9slave and set the following parameter:
hot_standby = on
If you are running the RedHat flavor of Linux, you will also need to comment out the following parameters that you brought over when you moved the postgresql.conf file from PG9master:
archive_mode
archive_command
archive_timeout
12. Create the recovery.conf file on PG9slave. If you are running the RedHat flavor of Linux, it will be in the same directory as the postgresql.conf file. If you are running Ubuntu, it will need to go in the data directory. It is a new file that should contain the following information and assumes you are running the RedHat flavor of Linux:
standby_mode = 'on'
primary_conninfo = 'host=PG9master port=5432 user=pgrep password=pgrep_pass'
restore_command = 'cp -i /var/lib/pgsql/master_logs/%f %p'
archive_cleanup_command = 'pg_archivecleanup /var/lib/pgsql/master_logs %r'
trigger_file = '/tmp/stop_pg_streaming.tmp'
If you are running Ubuntu Linux, you will need to replace the paths in the restore and archive cleanup commands with those used by Ubuntu.
Make sure that the pg_archivecleanup program is in the postgres user's path. If it is not, you can symbolically link it. On RedHat you would use:
ln -s /usr/pgsql-9.x/bin/pg_archivecleanup /usr/bin/pg_archivecleanup
On Ubuntu, you would use:
ln -s /usr/lib/postgresql/9.x/bin/pg_archivecleanup /usr/bin/pg_archivecleanup
Again, be sure to replace the .x with the appropriate dot release of Postgresql that you are using.
13. Start the database on PG9slave as root using the following commands. On RedHat use:
/etc/init.d/postgresql-9.x start
Be sure to replace the .x with the point release of PostgreSQL you are using. On Ubuntu, use:
service postgresql start
14. Test everything out. The psql program will work on both systems. You should be able to create a table on PG9master and have it show up on PG9slave almost immediately. If changes only appear after the 2-minute archive_timeout window, streaming replication is not working and you are falling back to file-based log shipping.
Try logging into the PG9master database from PG9slave using:
psql -h PG9master -U pgrep postgres
If you get an error, the slave cannot connect to the master and streaming replication will not work.
Make sure that the "wal sender process" is running on PG9master by doing the following:
ps aux | grep post
Also make sure there is a "wal receiver process" running on PG9slave.
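You can wrap those process checks in a small script. This is my own sketch rather than anything shipped with PostgreSQL, and the helper function name is made up:

```shell
#!/bin/sh
# Hypothetical helper: check a ps-style listing for a process name,
# filtering out the grep command itself so it doesn't match.
has_process() {
  printf '%s\n' "$1" | grep -v grep | grep -q "$2"
}

listing=$(ps aux)
if has_process "$listing" "wal sender"; then
  echo "wal sender process is running"
else
  echo "no wal sender process found"
fi
```

Run the same check with "wal receiver" on PG9slave.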
Log shipping or streaming replication in PostgreSQL 9.x is near real-time, so it takes less than a second for a change made in the master database to propagate to the replicated database. Here are the steps necessary to configure streaming replication on Linux:
1. Set up two identical database server machines. Make sure to disable the firewall on both machines. For the purposes of this tutorial, one of the machines is named PG9master and the other PG9slave. Their names should indicate which one is which.
2. Install the proper PostgreSQL packages for your given distribution on both servers. RedHat, Fedora, and CentOS require postgresql-libs, postgresql-server, and postgresql-contrib. Ubuntu and Debian require postgresql-9.x and postgresql-contrib-9.x, where the x can be 0, 1, 2, or 3 (9.3 is the latest version of PostgreSQL at the time of this writing).
3. Set up some form of file transfer from PG9master to PG9slave. The example in this tutorial assumes that scp will be used. You will need to generate public and private keys that allow scp to transfer files without the need for password exchanges. As the postgres user on PG9master, run the following:
ssh-keygen -t rsa
Don't enter a passphrase. As it asks for it twice, be sure to just hit return twice. Your key files will be saved in the .ssh directory. You will need to copy id_rsa.pub to the PG9slave server using the following command from PG9master:
scp ~/.ssh/id_rsa.pub PG9slave:.ssh/authorized_keys
You will be asked to enter a password for the postgres user on PG9slave. Once the scp is finished, you can test not needing the password by using ssh to log into PG9slave from PG9master using the following:
ssh PG9slave
You should not need to enter the password.
4. You will need to create a directory where log files can be copied. On PG9slave, create a directory called:
~/master_logs
5. You are now ready to edit the postgresql.conf file on PG9master. If you are running one of the RedHat flavors of Linux, it will be located underneath the postgres user's home directory. If you are running Ubuntu, it will be located under the /etc/postgresql directory. Adjust the following parameters:
wal_level = hot_standby
archive_mode = on
archive_command = 'scp %p PG9slave:master_logs/%f'
listen_addresses = '*'
archive_timeout = '120 #2 minute window if streaming fails
max_wal_senders = 1 #assume a single slave
6. Now edit the pg_hba.conf file on PG9master. It will be found in the same directory as the postgresql.conf file. Add the following being sure to replace <IP of PG9slave> with the actual IP address of PG9slave:
# TYPE DATABASE USER CIDR-ADDRESS METHOD
host replication pgrep <IP of PG9slave>/32 md5
7. If you are running the RedHat flavors of Linux, start the PostgreSQL server on PG9master using the following command as the root user:
/etc/init.d/postgresql-9.x start
Make sure to replace the .x with the correct dot release of PostgreSQL that you are using. If you are using the Ubuntu flavor of Linux, you will need to reload PostgreSQL's postgresql.conf file by running the following command as the root user:
service postgresql restart
8. Create a database user on PG9master named pgrep by going into psql and running the following command:
CREATE USER pgrep WITH superuser login PASSWORD 'pgrep_pass';
9. You are now ready to make a filesystem backup of the master database and copy it to the slave. On PG9master, run psql and issue the following command:
SELECT pg_start_backup('backup');
Exit psql using \q and make a copy of the data directory using the following:
cd 9.x
tar -cvf db_backup.tar data
Be sure to replace the .x with the dot version of PostgreSQL you are using. When the backup completes, go back into psql and issue the following command:
SELECT pg_stop_backup();
Copy the tar file to the slave machine using the following scp command:
scp db_backup.tar PG9slave:
10. Restore the master database on the slave. Log into PG9slave as the postgres user and issue the following commands:
cd 9.x
mv ../db_backup.tar .
tar -xvf db_backup.tar
cd data
rm postmaster.pid
rm -rf pg_xlog/*
Be sure to replace .x with the dot version of PostgreSQL you are using.
11. Edit the postgresql.conf file on PG9slave and set the following parameter:
hot_standby = on
If you are running the RedHat flavor of Linux, you will also need to comment out the following parameters that you brought over when you moved the postgresql.conf file from PG9master:
archive_mode
archive_command
archive_timeout
12. Create the recovery.conf file on PG9slave. If you are running the RedHat flavor of Linux, it will be in the same directory as the postgresql.conf file. If you are running Ubuntu, it will need to go in the data directory. It is a new file that should contain the following information and assumes you are running the RedHat flavor of Linux:
standby_mode = 'on'
primary_conninfo = 'host=PG9master port=5432 user=pgrep password=pgrep_pass'
restore_command = 'cp -i /var/lib/pgsql/master_logs/%f %p'
archive_cleanup_command = 'pg_archivecleanup /var/lib/pgsql/master_logs %r'
trigger_file = '/tmp/stop_pg_streaming.tmp'
If you are running Ubuntu Linux, you will need to replace the paths in the restore and archive cleanup commands with those used by Ubuntu.
Make sure that the program pg_archivecleanup is in the postgres user's path. If not, you can symbolically link it. On RedHat you would use:
ln -s /usr/pgsql-9.x/bin/pg_archivecleanup /usr/bin/pg_archivecleanup
On Ubuntu, you would use:
ln -s /usr/lib/postgresql/9.x/bin/pg_archivecleanup /usr/bin/pg_archivecleanup
Again, be sure to replace the .x with the appropriate dot release of PostgreSQL that you are using.
13. Start the database on PG9slave as root using the following commands. On RedHat use:
/etc/init.d/postgresql-9.x start
Be sure to replace the .x with the point release of PostgreSQL you are using. On Ubuntu, use:
service postgresql start
14. Test everything out. The psql program will work on both systems. You should be able to create a table on PG9master and have it immediately show up on PG9slave. If changes only appear after a delay of about 2 minutes, then streaming replication is not working and the slave is falling back to the archived WAL files (the archive_timeout window set earlier).
Troubleshooting
Always check the log files. As usual, the RedHat flavors of Linux store them in a different place than Ubuntu does, but if there is a problem you should get some hint of what has gone wrong from the log file.
Try logging into the PG9master database from PG9slave using:
psql -h PG9master -U pgrep postgres
If you get an error, then streaming replication will not be able to connect either, since it uses the same credentials and network path.
Make sure that the "wal sender process" is running on PG9master by doing the following:
ps aux | grep post
Also make sure there is a "wal receiver process" running on PG9slave.
Monday, February 10, 2014
Backing up Your PostgreSQL Database
PostgreSQL is one of the most reliable database management systems available. As with anything, it is not 100% foolproof though. There are a number of things that can happen to cause data loss and so it is important to back it up. PostgreSQL has three primary methods of saving copies of your data should something go wrong. They all have their strengths and so you may use more than one or even all three, depending on how secure you need your data to be. These are not the only backup options with PostgreSQL, but they are commonly used and illustrate the range of options available.
Tutorial for pg_dump and pg_dumpall
Tutorial for Filesystem Backup
Tutorial for Streaming Replication
The pg_dump and pg_dumpall Utilities
PostgreSQL has two similar utilities to help with backups: pg_dump and pg_dumpall. The first is used to create a backup of a single database while the latter is used to back up all of the databases on the server. Both utilities create files that you can move to other PostgreSQL database servers and use to load the data. You have the option of creating a compressed file if space is a concern, or you can create SQL files that you can view and edit before running them to load data into the new database.
Tutorial for pg_dump and pg_dumpall
Filesystem Backup
PostgreSQL stores data in regular operating system files and so you can use regular backup tools that you may already have running in your current server environment. Unfortunately the database will need to be shut down first unless you trust your filesystem to take consistent snapshots. For some environments, this isn't a problem. For others, it makes this backup method unusable.
Tutorial for Filesystem Backup
Streaming Replication
PostgreSQL has a backup method called streaming replication, or log-shipping standby servers, that replays the changes made on a master database onto any number of slaves. The slave databases can then be used as read-only copies for reporting purposes. Should something go wrong with your primary database server, it is simply a matter of flipping a software switch and one of the slave database servers can take over as the primary so business can continue. Other backup options require lengthy restores that may not be possible for time-sensitive database applications.
Tutorial for Streaming Replication
Which One To Use
Streaming replication seems to be the most promising of these three backup strategies. So why would you use either of the other two? The short answer is: operator error. What happens when the database administrator does something like drop an important table, which can't be undone with a ROLLBACK? It is time to go back to that pg_dump file and fix the mistake. In addition, in order to set up streaming replication, you need to leverage a filesystem backup. Ultimately they all have their place and should be used appropriately.
Recovering your Database
Having a database backup is kind of like climbing Mt. Everest: getting to the top is only half of the trip, as you still have to climb back down. Sure, you may have a backup of your data, but what is the best way to restore it? There are a number of considerations that should be decided long before you need to use your backup. Understanding them and practicing with your backups will help when you really need to use them.
Thursday, February 6, 2014
Filtering A Query With a Value in the GROUP BY Clause
The GROUP BY clause used in SQL queries allows you to condense rows that share the same values for the grouped columns or expressions into a single row. It can be very helpful for doing aggregations. Suppose we have the following table named favorite_sports:
id | name | sport
----+------+---------------
1 | Matt | Skiing
2 | Nate | Rock Climbing
3 | Dave | Skiing
We can find out how many people like each sport using the following SQL query:
SELECT sport, count(*) AS people
FROM favorite_sports
GROUP BY sport;
This produces the following results:
sport | people
---------------+--------
Skiing | 2
Rock Climbing | 1
Now suppose we only want to show those sports that have more than one person interested in them. Putting the count(*) function in the WHERE clause will throw an error saying aggregates are not allowed there. Instead we must use the HAVING clause, which appears after the GROUP BY clause, as in the following example:
SELECT sport, count(*) AS people
FROM favorite_sports
GROUP BY sport HAVING count(*) > 1;
Which produces the following results:
sport | people
--------+--------
Skiing | 2
Don't be tempted to use the alias in the HAVING clause as you will get an error saying that the alias column does not exist.
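As a quick sanity check, the whole sequence can be sketched outside of PostgreSQL. Here is a minimal Python example using the built-in sqlite3 module, whose GROUP BY and HAVING behavior matches PostgreSQL's for this particular query (the table and data are the ones from the example above):

```python
import sqlite3

# In-memory database standing in for PostgreSQL.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE favorite_sports (id INTEGER, name TEXT, sport TEXT)")
cur.executemany(
    "INSERT INTO favorite_sports VALUES (?, ?, ?)",
    [(1, "Matt", "Skiing"), (2, "Nate", "Rock Climbing"), (3, "Dave", "Skiing")],
)

# HAVING filters on the aggregate after the rows are grouped,
# which is something WHERE cannot do.
rows = cur.execute(
    "SELECT sport, count(*) AS people "
    "FROM favorite_sports "
    "GROUP BY sport HAVING count(*) > 1"
).fetchall()
print(rows)  # [('Skiing', 2)]
```

Only Skiing survives the HAVING filter, matching the psql output shown above.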
Deleting Duplicate Rows
A good database developer will never create a situation where two rows are exactly alike in a table. Everyone knows that it is important to have a primary key that uniquely identifies each row in the table. Sometimes it is an artificial key, like a unique integer created by a sequence, while other times it is a natural key, like a social security number combined with the person's name (note: social security numbers do not uniquely identify individuals without a name). Yet somehow someone will be testing something and a duplicate row will appear. Perhaps the primary key gets dropped, or one was never created, and all of a sudden you discover a duplicate row or two.
For example, assume we have the following table named duplicate_rows:
id | name | sport
----+------+---------------
1 | Matt | Skiing
2 | Nate | Rock Climbing
1 | Matt | Skiing
You can see there are 2 entries for Matt. Sure it would be a lot easier to just drop the table and recreate it, but that would defeat the purpose of the example. Here are some of the more common ways to delete the duplicate row.
Adding a Column and Populating it with a Sequence
It is very easy to add a column to a table and can be done with the following command:
ALTER TABLE duplicate_rows ADD COLUMN new_pk serial;
Serial is a special data type that gets its value from a sequence. The nice thing about it is that it automatically populates the existing rows of the table, so you don't have to do anything else. Querying the table using
SELECT * FROM duplicate_rows
returns the following results:
id | name | sport | new_pk
----+------+---------------+--------
1 | Matt | Skiing | 1
2 | Nate | Rock Climbing | 2
1 | Matt | Skiing | 3
You can then use the following DELETE statement to remove the duplicate row:
DELETE FROM duplicate_rows WHERE new_pk = 3;
Your table will now contain the correct data. You can drop the column using:
ALTER TABLE duplicate_rows DROP COLUMN new_pk;
In addition to removing the column, the sequence automatically created for the serial data type will also be dropped. Unfortunately there is a side-effect from adding a column and then dropping it: PostgreSQL never really drops the column, it just hides it from you. For those concerned about hidden columns being left around, this solution ends up being a little messy. There must be a better way.
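The add-a-column approach can be sketched end-to-end with Python's built-in sqlite3 module. SQLite has no serial type, so this sketch fills the helper column from SQLite's rowid by hand; in PostgreSQL the serial column populates itself as described above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE duplicate_rows (id INTEGER, name TEXT, sport TEXT)")
cur.executemany(
    "INSERT INTO duplicate_rows VALUES (?, ?, ?)",
    [(1, "Matt", "Skiing"), (2, "Nate", "Rock Climbing"), (1, "Matt", "Skiing")],
)

# Add a throwaway column and number every row -- a stand-in for what
# the serial type does automatically in PostgreSQL.
cur.execute("ALTER TABLE duplicate_rows ADD COLUMN new_pk INTEGER")
cur.execute("UPDATE duplicate_rows SET new_pk = rowid")

# Now each copy of the duplicate row can be told apart and deleted by
# its unique number.
cur.execute("DELETE FROM duplicate_rows WHERE new_pk = 3")

rows = cur.execute("SELECT id, name, sport FROM duplicate_rows").fetchall()
print(rows)  # [(1, 'Matt', 'Skiing'), (2, 'Nate', 'Rock Climbing')]
```

In PostgreSQL you would finish by dropping the helper column as shown above; the sketch simply stops after the delete.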
Using One of PostgreSQL's Hidden Columns
PostgreSQL tables contain a number of hidden columns that you can use to help delete duplicate rows. Some suggest using the object ID column or OID but not all tables are created with the OID column. My preference is to use the current tuple ID or CTID column combined with the EXISTS statement as in the following:DELETE FROM duplicate_rows AS d
WHERE EXISTS (SELECT 'x',
FROM duplicate_rows AS s
WHERE s.id = d.id
AND s.name = d.name
AND s.sport = d.sport
AND s.ctid < d.ctid);
The SQL command references the table twice, so it is helpful to use table aliases, which appear after AS. The DELETE command aliases the table using the letter "d" while the SELECT command uses the letter "s." For each row in the duplicate_rows table, the subquery checks whether an identical row with a smaller current tuple ID exists; if one does, the row is deleted.
WARNING: If you make the mistake of using the not equal operator (!= or <>), you will delete both copies of the row and that is not what you want.
The downside to this method is that it can be rather slow for a table with a lot of rows, as each row must be compared against every other row in the table. However it does have the advantage of not creating unneeded columns that become hidden from you.
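The CTID technique can also be sketched with Python's sqlite3 module. SQLite has no ctid, but its hidden rowid column plays the same role for this example (an assumed analog, not the PostgreSQL system column itself):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE duplicate_rows (id INTEGER, name TEXT, sport TEXT)")
cur.executemany(
    "INSERT INTO duplicate_rows VALUES (?, ?, ?)",
    [(1, "Matt", "Skiing"), (2, "Nate", "Rock Climbing"), (1, "Matt", "Skiing")],
)

# Delete any row for which an identical row with a smaller rowid
# exists -- the "<" comparison keeps exactly one copy of each set of
# duplicates, just like the ctid version in the text.
cur.execute("""
    DELETE FROM duplicate_rows AS d
    WHERE EXISTS (SELECT 1
                  FROM duplicate_rows AS s
                  WHERE s.id = d.id
                    AND s.name = d.name
                    AND s.sport = d.sport
                    AND s.rowid < d.rowid)
""")

rows = cur.execute("SELECT * FROM duplicate_rows").fetchall()
print(rows)  # [(1, 'Matt', 'Skiing'), (2, 'Nate', 'Rock Climbing')]
```

Note the strict "<" comparison: using a not-equal operator here would make each duplicate delete the other, removing both copies.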
Wednesday, February 5, 2014
Important PostgreSQL Resources
There are a number of important resources for PostgreSQL that can be found on the Internet. While you can simply type your question into a search engine like Google, sometimes it is much quicker to just head to one of the following sites:
The Short List
The PostgreSQL site is usually the best place to start for most PostgreSQL-related questions. It includes information about the latest releases, news about PostgreSQL, PostgreSQL-related events, and other blogs with technical tips.
PostgreSQL has some of the best documentation of any software. As I mentioned in the introduction, the purpose of this blog is not to recreate what already exists, but to add to it in a complementary fashion. When you look at the home page for the documentation, they list the latest supported versions and allow you to go to documentation with comments from other users or documentation without. My preference is the version with comments because there are often little tips and tricks that help out a lot.
The PostgreSQL Wiki has a lot of information above and beyond the regular documentation. It includes links to frequently asked questions (FAQ), additional tutorials and articles, developer resources, and a number of other useful resources not found in the official documentation.
The PostgreSQL Mailing Lists page contains a number of links to various mail lists that are very helpful in solving problems. You can view answers to questions that have already been asked or find out how to ask your own.
Professional PostgreSQL Companies
OpenSCG was started by Denis Lussier, one of the founders of EnterpriseDB, after he felt he could fix some of the mistakes he made on his first attempt at a PostgreSQL company. They provide consulting services and support for PostgreSQL.
Command Prompt is a company run by Josh Drake that specializes in PostgreSQL. They have a number of products but also provide services and support for PostgreSQL.
EnterpriseDB is still around and provides packaged distributions of PostgreSQL along with an enhanced version that looks and feels a lot like Oracle's database software. They have several of the core developers on staff, including Bruce Momjian.
2nd Quadrant is a company that includes Simon Riggs, a former EnterpriseDB coworker and one of the core PostgreSQL developers.
Heroku Postgres provides PostgreSQL services and specializes in running your database in the cloud.
There are a number of other professional services companies that can be found on the PostgreSQL site. Unfortunately I don't have direct experience with them and don't know if they continue to provide support. If I have left anyone out, please comment on this page and I will be sure to edit the content and add you to the list.
Tuesday, February 4, 2014
Introduction
I have always been a keeper of engineering notebooks and this blog serves as an evolution of them. My Dad used to keep a notebook and I remember asking him about it when I was young. He told me it was easier to put all of his notes in one place than it was to keep track of a bunch of pieces of paper scattered over his desk. That made sense to me. My Dad wasn't an engineer but he did start his career selling computers to them.
When I got to college and started studying Electrical Engineering, my professors had me create notebooks for each of my classes. They had specific formats for all students to follow and they drilled it into us that our employers would require us to keep notebooks. Furthermore the professors indicated our employers would ask to keep those notebooks when we left for other opportunities. I made sure to ask for a notebook when I started at Oracle Corporation fresh out of college. Then I made sure to keep detailed notes of everything I did. I also noticed that I was the only one with a notebook. When my group got spun out into another company, nobody asked for my notebooks. Nobody asked for them when I left that company to head out on my own to do consulting. In fact, I have 20 years of notebooks full of wonderful information. Some of it is dated, of course, but there is also a lot of useful information.
About the same time I left Oracle's spin-off, I discovered PostgreSQL. While at Oracle, it made sense that any time you needed a database, you would use Oracle. There were no license fees to worry about and so it was like free software. When I left to do my own consulting, I discovered that Oracle wasn't free. In fact it was rather expensive. Not everyone could afford it and so I looked for another solution. At that time, there were 2 possible open source databases: MySQL and PostgreSQL. MySQL was for small projects and PostgreSQL was for larger ones. Coming from Oracle, I decided to learn PostgreSQL and found it to be amazingly similar. It wasn't exactly the same but it had a lot of the same features.
Fast forward a few years and in 2005 I saw an article about a new company called EnterpriseDB that wanted to take PostgreSQL and make it look like Oracle for less money. I had the same idea years earlier and even started talking with several venture capitalists who were willing to fund it. Instead of trying to compete with EnterpriseDB, I decided to work with them and became one of their first employees.
Working at EnterpriseDB was a lot of fun. It was a small company and reminded me a little bit of Oracle when I worked for them in high school. There was a lot of energy and we all thought that anything was possible. The whole time I was there, I continued to keep detailed notebooks of all the technical issues I was solving.
Eventually I found myself working for one of EnterpriseDB's customers and my technical knowledge of PostgreSQL continued to grow. While I had once been an Oracle expert, I now found myself favoring PostgreSQL. Now I work for a large video game company and we use PostgreSQL extensively.
A few days ago I was going through my notebook and my boss commented that I should write a book. I have done that before and felt like it was too little reward for such a huge effort. Before I could express this to him, he suggested that I start a blog and compile all of my PostgreSQL technical notes in one place. Then others looking for solutions to problems that I have already solved will be edified. So it is with that thought that I have created this blog.
It is not my intent to have this blog replace PostgreSQL's own excellent documentation. I actually plan to reference it extensively. Instead it is my hope that entries from my own engineering notebooks will help further explain difficult topics and provide another example for PostgreSQL developers to follow.
One downside to a blog instead of writing a book is that I have to act as my own editor. I will try to do a good initial job but mistakes will be made. Please be patient and politely let me know of any typos or wording that isn't quite clear. I will do my best to fix things in a timely manner. I will also make sure to verify all code examples and SQL statements to make sure they work as advertised.