
Big Fat Catalog Indexes

by Roché Compaan, posted on Sep 23, 2008 10:01 PM, last modified Sep 24, 2008 10:23 PM

Some benchmarks that show how abnormally big Zope's catalog indexes are in relation to the data that is indexed.

I've been doing some indexing benchmarks on Plone and got some surprising stats on the pickle size of BTrees and their buckets that are persisted with each transaction. Surprising in the sense that they are very big in relation to the actual data indexed.

In the benchmark I add and index 10000 ATDocuments. I commit after each document to simulate a transaction-per-request environment. Each document has a 100-byte description and 100 bytes in its body. The total transaction size, however, is about 40K in the beginning. The transaction sizes grow linearly to about 350K when reaching 10000 documents.

What concerns me is that the footprint of indexed data in terms of BTrees, Buckets and Sets is huge! The total amount of data committed that relates directly to ATDocument is around 30 Mbyte. The total for BTrees, Buckets and IISets is more than 2 Gbyte. Even taking into account that Plone has a lot of catalog indexes and metadata columns (I think 71 in total), this seems very high. I hope that this benchmark will alert developers to the negative side effects of stuffing more indexes into the catalog.

This is a summary of total data committed per class:

Classname                                            Object Count  Total Size (Kbytes)
BTrees._IIBTree.IISet                                640686        1024506
BTrees._IOBTree.IOBucket                             655025        1007623
BTrees._IIBTree.IIBucket                             252121        163524
BTrees._OIBTree.OIBucket                             132417        101472
BTrees._IOBTree.IOBTree                              25645         71072
BTrees._OOBTree.OOBucket                             115332        70789
BTrees._IIBTree.IIBTree                              143942        53566
BTrees._OOBTree.OOBTree                              15875         52354
BTrees._IIBTree.IITreeSet                            49383         25975
BTrees._OIBTree.OIBTree                              4613          23008
Products.ATContentTypes.content.document.ATDocument  10000         15077
Persistence.mapping.PersistentMapping                20000         8261
Products.Archetypes.BaseUnit.BaseUnit                30000         7504
BTrees.Length.Length                                 220107        6382
OFS.Folder.Folder                                    10000         537
Products.PlonePAS.tools.memberdata.MemberData        1             0
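
To put the table in perspective, here is a small sketch that splits the totals into index structures and everything else. The numbers are copied from the table; the grouping rule (every class whose name starts with "BTrees." counts as index data) is my own simplifying assumption, since BTrees are also used outside the catalog, so treat the ratio as a rough indication only.

```python
# Total size per class in Kbytes, copied from the summary table.
TOTALS_KB = {
    "BTrees._IIBTree.IISet": 1024506,
    "BTrees._IOBTree.IOBucket": 1007623,
    "BTrees._IIBTree.IIBucket": 163524,
    "BTrees._OIBTree.OIBucket": 101472,
    "BTrees._IOBTree.IOBTree": 71072,
    "BTrees._OOBTree.OOBucket": 70789,
    "BTrees._IIBTree.IIBTree": 53566,
    "BTrees._OOBTree.OOBTree": 52354,
    "BTrees._IIBTree.IITreeSet": 25975,
    "BTrees._OIBTree.OIBTree": 23008,
    "Products.ATContentTypes.content.document.ATDocument": 15077,
    "Persistence.mapping.PersistentMapping": 8261,
    "Products.Archetypes.BaseUnit.BaseUnit": 7504,
    "BTrees.Length.Length": 6382,
    "OFS.Folder.Folder": 537,
    "Products.PlonePAS.tools.memberdata.MemberData": 0,
}


def split_totals(totals_kb):
    """Return (index_kb, other_kb).

    Assumption: anything in the BTrees package is treated as index data.
    """
    index_kb = sum(kb for name, kb in totals_kb.items()
                   if name.startswith("BTrees."))
    other_kb = sum(totals_kb.values()) - index_kb
    return index_kb, other_kb


index_kb, other_kb = split_totals(TOTALS_KB)
ratio = index_kb / other_kb

print("BTrees data:  %d Kbytes" % index_kb)
print("other data:   %d Kbytes" % other_kb)
print("ratio:        %.1f to 1" % ratio)
```

Under that assumption, the BTrees classes account for roughly 80 times as much committed data as the content objects themselves.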

Here is a summary of transaction sizes for the first few transactions:
 
Txn id Object count Txn size (bytes)
#00099 179 42119
#00100 175 40021
#00101 167 41746
#00102 171 45480
#00103 171 48411
#00104 173 51524
#00105 171 54265
#00106 175 57744
#00107 175 60380
#00108 180 64854
#00109 172 61819
#00110 176 66281
#00111 173 66906
#00112 176 70307
#00113 174 71629
#00114 184 78853
#00115 181 79756
#00116 188 84928

And the last few transactions:

Txn id Object count Txn size (bytes)
#10081 234 343926
#10082 226 341061
#10083 245 394237
#10084 237 367932
#10085 228 338461
#10086 184 310049
#10087 189 314684
#10088 246 405305
#10089 215 334854
#10090 221 346977
#10091 195 318492
#10092 224 351770
#10093 221 345032
#10094 206 332271
#10095 241 541394
#10096 191 283578
#10097 236 323354
#10098 242 329099
#10099 226 339302
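
A quick calculation on the first and last transactions listed shows that the growth comes from bigger pickles rather than from more objects per transaction. This is only a sketch using two rows from the tables (txn #00099 and txn #10099); the helper name is made up for illustration.

```python
# (object count, transaction size in bytes), copied from the tables.
FIRST_TXN = (179, 42119)    # txn #00099, first document
LAST_TXN = (226, 339302)    # txn #10099, last document


def bytes_per_object(txn):
    """Average pickle size per object written in one transaction."""
    count, size = txn
    return size / count


first_avg = bytes_per_object(FIRST_TXN)
last_avg = bytes_per_object(LAST_TXN)

# How much did each quantity grow between the two transactions?
count_growth = LAST_TXN[0] / FIRST_TXN[0]
size_growth = LAST_TXN[1] / FIRST_TXN[1]

print("first txn: %.0f bytes per object" % first_avg)
print("last txn:  %.0f bytes per object" % last_avg)
print("object count grew %.2fx, txn size grew %.2fx"
      % (count_growth, size_growth))
```

The number of objects written per transaction grows by about a quarter, while the transaction size grows about eightfold, so the average pickle is several times bigger at the end of the run.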

Transaction detail for txn #00099 (first document):

Txn id  Classname                                            Object count  Size (bytes)
#00099  BTrees._IIBTree.IIBTree                              3             286
#00099  OFS.Folder.Folder                                    1             55
#00099  BTrees._IOBTree.IOBucket                             9             4572
#00099  BTrees._OIBTree.OIBucket                             5             2964
#00099  BTrees._IOBTree.IOBTree                              39            17552
#00099  BTrees.Length.Length                                 27            768
#00099  Persistence.mapping.PersistentMapping                2             846
#00099  Products.ATContentTypes.content.document.ATDocument  1             1544
#00099  BTrees._OOBTree.OOBTree                              20            3986
#00099  BTrees._IIBTree.IISet                                3             184
#00099  BTrees._OIBTree.OIBTree                              9             1404
#00099  Products.Archetypes.BaseUnit.BaseUnit                3             767
#00099  BTrees._OOBTree.OOBucket                             2             3286
#00099  BTrees._IIBTree.IITreeSet                            55            3905

Transaction detail for txn #10099 (last document):

Txn id  Classname                                            Object count  Size (bytes)
#10099  BTrees._IIBTree.IIBTree                              8             2517
#10099  OFS.Folder.Folder                                    1             55
#10099  BTrees._IOBTree.IOBucket                             57            81564
#10099  BTrees._OIBTree.OIBucket                             13            9872
#10099  BTrees._IIBTree.IIBucket                             29            20024
#10099  BTrees._IOBTree.IOBTree                              1             85
#10099  Persistence.mapping.PersistentMapping                2             846
#10099  BTrees.Length.Length                                 22            655
#10099  Products.ATContentTypes.content.document.ATDocument  1             1544
#10099  BTrees._OOBTree.OOBTree                              6             30455
#10099  BTrees._IIBTree.IISet                                65            182708
#10099  Products.Archetypes.BaseUnit.BaseUnit                3             767
#10099  BTrees._OOBTree.OOBucket                             16            8088
#10099  BTrees._IIBTree.IITreeSet                            2             122
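
To see which classes dominate a single transaction, the two detail listings can be ranked by size. The sketch below copies the numbers from the listings for txn #00099 and txn #10099; the function name and the choice to show the top three are arbitrary and only for illustration.

```python
# class name -> (object count, size in bytes), copied from the detail listings.
FIRST = {  # txn #00099
    "BTrees._IIBTree.IIBTree": (3, 286),
    "OFS.Folder.Folder": (1, 55),
    "BTrees._IOBTree.IOBucket": (9, 4572),
    "BTrees._OIBTree.OIBucket": (5, 2964),
    "BTrees._IOBTree.IOBTree": (39, 17552),
    "BTrees.Length.Length": (27, 768),
    "Persistence.mapping.PersistentMapping": (2, 846),
    "Products.ATContentTypes.content.document.ATDocument": (1, 1544),
    "BTrees._OOBTree.OOBTree": (20, 3986),
    "BTrees._IIBTree.IISet": (3, 184),
    "BTrees._OIBTree.OIBTree": (9, 1404),
    "Products.Archetypes.BaseUnit.BaseUnit": (3, 767),
    "BTrees._OOBTree.OOBucket": (2, 3286),
    "BTrees._IIBTree.IITreeSet": (55, 3905),
}
LAST = {  # txn #10099
    "BTrees._IIBTree.IIBTree": (8, 2517),
    "OFS.Folder.Folder": (1, 55),
    "BTrees._IOBTree.IOBucket": (57, 81564),
    "BTrees._OIBTree.OIBucket": (13, 9872),
    "BTrees._IIBTree.IIBucket": (29, 20024),
    "BTrees._IOBTree.IOBTree": (1, 85),
    "Persistence.mapping.PersistentMapping": (2, 846),
    "BTrees.Length.Length": (22, 655),
    "Products.ATContentTypes.content.document.ATDocument": (1, 1544),
    "BTrees._OOBTree.OOBTree": (6, 30455),
    "BTrees._IIBTree.IISet": (65, 182708),
    "Products.Archetypes.BaseUnit.BaseUnit": (3, 767),
    "BTrees._OOBTree.OOBucket": (16, 8088),
    "BTrees._IIBTree.IITreeSet": (2, 122),
}


def top_classes(detail, n=3):
    """Return the n largest classes as (name, size, share of txn size)."""
    total = sum(size for _count, size in detail.values())
    ranked = sorted(detail.items(), key=lambda item: item[1][1], reverse=True)
    return [(name, size, size / total) for name, (_count, size) in ranked[:n]]


for label, detail in (("first", FIRST), ("last", LAST)):
    print("%s document:" % label)
    for name, size, share in top_classes(detail):
        print("  %-28s %7d bytes  %5.1f%%" % (name, size, 100 * share))
```

In the last transaction the IISet and IOBucket pickles together make up more than three quarters of the bytes written, while the ATDocument itself is still 1544 bytes.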

For a discussion on the above benchmarks, read the thread on ZODB-DEV at
http://mail.zope.org/pipermail/zodb-dev/2008-August/012055.html

Since the discussion on this thread, I've tried out collective.solr, but since I don't really know it that well I haven't spent too much time with it. I started developing collective.alchemyindex (not checked in yet), which indexes data in an RDBMS using SQLAlchemy. Doing the above benchmark using Postgres for indexing resulted in a Data.fs of around 368MB and a total Postgres database of only 135MB. This seems a lot more acceptable size-wise. I'll document this benchmark in a future post.