Increasing performance on the database 10 Oct 03

I am in the process of reducing the total size of the database and increasing performance a bit in the process. As it stands the links_found table is holding duplicate copies of the home_page table. This was necessary because I wanted to be able to see what links belong to what web pages at the start for testing purposes. I no longer need to do this. What I am going to change is the format that they are stored in. I am going to make the following changes to the link_found table.

FORM
links_found
(
parent_url_id int4,
found_url varchar(2000)
);

TO
child_links
(
parent_url_id int4,
found_url int4
);

I have written a postgres function to carry out the migration. As you can see beneath the space savings are fantastic. I should also see an improvement in my indexes on this table. I am trying very hard to postpone buying any extra hardware. It will make it a bit more awkward to use the data now but this is not my main priority at the moment. When I have lots of disk space I can the create temporary tables for any manipulations as I require them.

You can see below that I have created a new table called child_links. I have converted all the URL's into url_id's which are "int4" types. I have also created indexes on this table. I can now remove all relations relating to the links_found table.



links# select relname, relfilenode, relpages from pg_class order by relpages limit 10;

relname | relfilenode | relpages
-------------------+-------------+----------
links_found | 188163825 | 644114
links_found_pkey | 246168688 | 588185
lf_found_url_idx | 246168682 | 559585
child_links | 246168690 | 255817
child_links_pkey | 246168692 | 216185
home_page | 188163817 | 118338
parent_url_id_idx | 299992353 | 116508
child_url_id_idx | 299992259 | 116231
home_page_pkey | 246168684 | 103223
home_page_url_key | 246168686 | 100120
hp_url_id_index | 246168683 | 15857
hp_url_id_idx | 301324542 | 15660
(12 rows)

Before I remove any of the relations I wanted to see my actual disk savings.

File system 1M-blocks Used Available Use% Mounted on
/dev/hdc2 3938 3125 613 84% /
/dev/hdc1 30 9 20 30% /boot
none 505 0 504 0% /dev/shm
/dev/hdc3 3938 2177 1561 59% /usr
/dev/hdc5 8439 1466 6551 19% /links/pg_xlog
/dev/hdb1 9628 5523 3616 61% /links/tables
/dev/hdb2 9605 7044 2073 78% /links/temp
/dev/sda5 17364 13812 2684 84% /links/database

So that you can see what I had done to the files and where they where all pointing here is a listing of the links database directory for all the big relations.

-rw------- 1 postgres postgres 8.0k Oct 12 03:54 188163815
lrwxrwxrwx 1 postgres postgres 33 Oct 10 23:17 188163817 -> /links/pg_xlog/postgres/188163817
lrwxrwxrwx 1 postgres postgres 30 Oct 10 22:35 188163825 -> /links/temp/postgres/188163825
lrwxrwxrwx 1 postgres postgres 32 Oct 10 22:37 188163825.1 -> /links/temp/postgres/188163825.1
lrwxrwxrwx 1 postgres postgres 32 Oct 10 22:38 188163825.2 -> /links/temp/postgres/188163825.2
lrwxrwxrwx 1 postgres postgres 32 Oct 10 22:39 188163825.3 -> /links/temp/postgres/188163825.3
lrwxrwxrwx 1 postgres postgres 32 Oct 10 22:41 188163825.4 -> /links/temp/postgres/188163825.4
lrwxrwxrwx 1 postgres postgres 32 Oct 10 23:30 246168682 -> /links/tables/postgres/246168682
lrwxrwxrwx 1 postgres postgres 34 Oct 10 23:44 246168682.1 -> /links/tables/postgres/246168682.1
lrwxrwxrwx 1 postgres postgres 34 Oct 11 00:41 246168682.2 -> /links/tables/postgres/246168682.2
lrwxrwxrwx 1 postgres postgres 34 Oct 11 00:41 246168682.3 -> /links/tables/postgres/246168682.3
lrwxrwxrwx 1 postgres postgres 34 Oct 11 00:42 246168682.4 -> /links/tables/postgres/246168682.4
-rw------- 1 postgres postgres 132M Oct 12 03:54 246168683
-rw------- 1 postgres postgres 855M Oct 12 03:55 246168684
-rw------- 1 postgres postgres 871M Oct 12 03:55 246168686
-rw------- 1 postgres postgres 1.0G Oct 11 01:59 246168688
-rw------- 1 postgres postgres 1.0G Oct 11 01:38 246168688.1
-rw------- 1 postgres postgres 1.0G Oct 11 01:54 246168688.2
-rw------- 1 postgres postgres 1.0G Oct 11 01:59 246168688.3
-rw------- 1 postgres postgres 499M Oct 11 01:59 246168688.4
-rw------- 1 postgres postgres 1.0G Oct 11 14:32 246168690
-rw------- 1 postgres postgres 1.0G Oct 12 00:42 246168690.1
-rw------- 1 postgres postgres 52M Oct 12 03:55 246168690.2
-rw------- 1 postgres postgres 1.0G Oct 12 03:52 246168692
-rw------- 1 postgres postgres 750M Oct 12 03:55 246168692.1
-rw------- 1 postgres postgres 1005M Oct 12 03:55 299992259
-rw------- 1 postgres postgres 995M Oct 12 03:55 299992353
-rw------- 1 postgres postgres 130M Oct 12 03:55 301324542

After droping the relations and deleting all trace of them I had the following results. You may notice that the filenames are completely different. I had a major problem on the SCSI disk again. I had to recreate the database using initdb because some of my pg_clog files went missing. To allow me to complete a vacuum I copied the last pg_clog file to the file that was missing. I am pretty sure that this is very dangerous but it allowed me to complete the vacuum on the table. I was hoping to see some more errors but I got none. I completely dropped the database reformatted the hard disk and created a new file system on it. I had originally used the "largefile4" option of mke2fs but I have now left it at default.

]# df -m

File system 1M-blocks Used Available Use% Mounted on
/dev/hdc2 3938 3124 613 84% /
/dev/hdc1 30 9 20 30% /boot
none 505 0 504 0% /dev/shm
/dev/hdc3 3938 2177 1561 59% /usr
/dev/hdc5 8439 2369 5648 30% /links/pg_xlog
/dev/hdb1 9628 2034 7105 23% /links/tables
/dev/hdb2 9605 2007 7110 23% /links/temp
/dev/sda5 17093 7513 8712 47% /links/database

Interesting bits from my base directory

-rw------- 1 postgres postgres 1.0G Oct 12 2003 16992
-rw------- 1 postgres postgres 927M Oct 12 2003 16992.1
-rw------- 1 postgres postgres 121M Oct 12 2003 58021849
-rw------- 1 postgres postgres 783M Oct 12 2003 58021850
-rw------- 1 postgres postgres 752M Oct 12 2003 58021852
-rw------- 1 postgres postgres 1.0G Oct 12 2003 58021854
-rw------- 1 postgres postgres 68M Oct 12 2003 58021854.1
-rw------- 1 postgres postgres 872M Oct 12 2003 58021856
-rw------- 1 postgres postgres 872M Oct 12 2003 58021857

We can see that we have dropped the total size of the database considerably.

links=# select relname, relfilenode, relpages from pg_class order by relpages desc limit 9;
relname | relfilenode | relpages
--------------------------------+-------------+----------
child_links | 16992 | 249791
child_links_pkey | 58021854 | 139723
home_page | 16982 | 116267
parent_url_id_idx | 58021856 | 111577
child_url_id_idx | 58021857 | 111577
home_page_pkey | 58021850 | 100253
home_page_url_key | 58021852 | 96209
hp_url_id_index | 58021849 | 15434
pg_proc_proname_args_nsp_index | 16640 | 125

50 Million links found
6.7 Million unique links found

Add to delicious Digg This Add to My Yahoo! Add to Google Add to StumbleUpon
| | Comments (0)

Leave a comment

About this Entry

This page contains a single entry by Harry published on October 10, 2003 12:31 AM.

Few million links more 09 Oct 03 was the previous entry in this blog.

Checking URL's 12 Oct 03 is the next entry in this blog.

Find recent content on the main index or look in the archives to find all content.

Powered by Movable Type 4.01