Big Fat Catalog Indexes
Some benchmarks that show how abnormally big Zope's catalog indexes are in relation to the data that is indexed.
I've been doing some indexing benchmarks on Plone and got some surprising stats on the pickle size of the BTrees and buckets that are persisted with each transaction. Surprising in the sense that they are very big in relation to the actual data indexed.
In the benchmark I add and index 10000 ATDocuments. I commit after each document to simulate a transaction-per-request environment. Each document has a 100-byte description and 100 bytes in its body. The total transaction size, however, is about 40K in the beginning. The transaction sizes grow roughly linearly to about 350K when reaching 10000 documents.
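As a rough sketch, per-class sizes like the ones reported below could be collected by walking the transactions in a Data.fs. This assumes ZODB's `FileStorage.iterator()` and `ZODB.utils.get_pickle_metadata`; the helper names `iter_records` and `summarize` are hypothetical, and this is not necessarily the script that produced the numbers in this post.

```python
from collections import defaultdict


def iter_records(path):
    """Yield (tid, class name, pickle size) for every record in a Data.fs.

    Assumes ZODB is installed. The import is lazy so that summarize()
    below works on any iterable of records without ZODB.
    """
    from ZODB.FileStorage import FileStorage
    from ZODB.utils import get_pickle_metadata

    storage = FileStorage(path, read_only=True)
    try:
        for txn in storage.iterator():
            for record in txn:
                if record.data is None:  # no pickle stored for this record
                    continue
                module, name = get_pickle_metadata(record.data)
                yield txn.tid, "%s.%s" % (module, name), len(record.data)
    finally:
        storage.close()


def summarize(records):
    """Return {class name: [object count, total bytes]}."""
    totals = defaultdict(lambda: [0, 0])
    for _tid, classname, size in records:
        totals[classname][0] += 1
        totals[classname][1] += size
    return dict(totals)
```

Something like `summarize(iter_records("var/filestorage/Data.fs"))` (the path is only an example) would then give one row per class.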
What concerns me is that the footprint of indexed data in terms of BTrees, Buckets and Sets is huge! The total amount of data committed that relates directly to ATDocument is around 30 Mbyte. The total for BTrees, Buckets and IISets is more than 2 Gbyte. Even taking into account that Plone has a lot of catalog indexes and metadata columns (I think 71 in total), this seems very high. I hope that this benchmark will alert developers to the negative side effects of stuffing more indexes into the catalog.
This is a summary of total data committed per class:
| Classname | Object Count | Total Size (Kbytes) |
| BTrees._IIBTree.IISet | 640686 | 1024506 |
| BTrees._IOBTree.IOBucket | 655025 | 1007623 |
| BTrees._IIBTree.IIBucket | 252121 | 163524 |
| BTrees._OIBTree.OIBucket | 132417 | 101472 |
| BTrees._IOBTree.IOBTree | 25645 | 71072 |
| BTrees._OOBTree.OOBucket | 115332 | 70789 |
| BTrees._IIBTree.IIBTree | 143942 | 53566 |
| BTrees._OOBTree.OOBTree | 15875 | 52354 |
| BTrees._IIBTree.IITreeSet | 49383 | 25975 |
| BTrees._OIBTree.OIBTree | 4613 | 23008 |
| Products.ATContentTypes.content.document.ATDocument | 10000 | 15077 |
| Persistence.mapping.PersistentMapping | 20000 | 8261 |
| Products.Archetypes.BaseUnit.BaseUnit | 30000 | 7504 |
| BTrees.Length.Length | 220107 | 6382 |
| OFS.Folder.Folder | 10000 | 537 |
| Products.PlonePAS.tools.memberdata.MemberData | 1 | 0 |
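To put the table above into one number, here is a small sketch that compares index structures with content. The grouping is an assumption: "content" is taken to be ATDocument, PersistentMapping and BaseUnit (which together come to roughly the 30 Mbyte mentioned earlier), and "index" is every class in a `BTrees._*` module, so `BTrees.Length.Length` is left out.

```python
# Total size in Kbytes per class, copied from the table above.
TOTALS_KB = {
    "BTrees._IIBTree.IISet": 1024506,
    "BTrees._IOBTree.IOBucket": 1007623,
    "BTrees._IIBTree.IIBucket": 163524,
    "BTrees._OIBTree.OIBucket": 101472,
    "BTrees._IOBTree.IOBTree": 71072,
    "BTrees._OOBTree.OOBucket": 70789,
    "BTrees._IIBTree.IIBTree": 53566,
    "BTrees._OOBTree.OOBTree": 52354,
    "BTrees._IIBTree.IITreeSet": 25975,
    "BTrees._OIBTree.OIBTree": 23008,
    "Products.ATContentTypes.content.document.ATDocument": 15077,
    "Persistence.mapping.PersistentMapping": 8261,
    "Products.Archetypes.BaseUnit.BaseUnit": 7504,
    "BTrees.Length.Length": 6382,
    "OFS.Folder.Folder": 537,
    "Products.PlonePAS.tools.memberdata.MemberData": 0,
}

# Assumption: these classes make up the data "directly related" to a document.
CONTENT_CLASSES = (
    "Products.ATContentTypes.content.document.ATDocument",
    "Persistence.mapping.PersistentMapping",
    "Products.Archetypes.BaseUnit.BaseUnit",
)

# Assumption: index structures are the classes in the BTrees._* modules.
index_kb = sum(kb for name, kb in TOTALS_KB.items() if name.startswith("BTrees._"))
content_kb = sum(TOTALS_KB[name] for name in CONTENT_CLASSES)
ratio = index_kb / content_kb

print("index: %d KB, content: %d KB, ratio: %.1f" % (index_kb, content_kb, ratio))
```

With this grouping the index structures come out at a bit more than 80 times the size of the content they index.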
Here is a summary of transaction sizes for the first few transactions:
| Txn id | Object count | Txn size (bytes) |
| #00099 | 179 | 42119 |
| #00100 | 175 | 40021 |
| #00101 | 167 | 41746 |
| #00102 | 171 | 45480 |
| #00103 | 171 | 48411 |
| #00104 | 173 | 51524 |
| #00105 | 171 | 54265 |
| #00106 | 175 | 57744 |
| #00107 | 175 | 60380 |
| #00108 | 180 | 64854 |
| #00109 | 172 | 61819 |
| #00110 | 176 | 66281 |
| #00111 | 173 | 66906 |
| #00112 | 176 | 70307 |
| #00113 | 174 | 71629 |
| #00114 | 184 | 78853 |
| #00115 | 181 | 79756 |
| #00116 | 188 | 84928 |
And the last few transactions:
| Txn id | Object count | Txn size (bytes) |
| #10081 | 234 | 343926 |
| #10082 | 226 | 341061 |
| #10083 | 245 | 394237 |
| #10084 | 237 | 367932 |
| #10085 | 228 | 338461 |
| #10086 | 184 | 310049 |
| #10087 | 189 | 314684 |
| #10088 | 246 | 405305 |
| #10089 | 215 | 334854 |
| #10090 | 221 | 346977 |
| #10091 | 195 | 318492 |
| #10092 | 224 | 351770 |
| #10093 | 221 | 345032 |
| #10094 | 206 | 332271 |
| #10095 | 241 | 541394 |
| #10096 | 191 | 283578 |
| #10097 | 236 | 323354 |
| #10098 | 242 | 329099 |
| #10099 | 226 | 339302 |
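A back-of-the-envelope estimate of the growth rate, using only the first and last transactions listed above. Individual transactions are noisy (#10095 is well above its neighbours, for example), and the first few transactions grow faster than the overall average, so treat this as a rough average rather than a fit.

```python
# Endpoints taken from the two tables above: (transaction id, size in bytes).
first_txn, first_size = 99, 42119
last_txn, last_size = 10099, 339302

# Average number of extra bytes per transaction for each document
# that is already in the catalog.
growth_per_doc = (last_size - first_size) / (last_txn - first_txn)

print("average growth: %.1f bytes per document" % growth_per_doc)
```

That works out to roughly 30 bytes of extra transaction size for every document already indexed.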
Transaction detail for txn #00099 (first document):
| Txn id | Classname | Object count | Size (bytes) |
| #00099 | BTrees._IIBTree.IIBTree | 3 | 286 |
| #00099 | OFS.Folder.Folder | 1 | 55 |
| #00099 | BTrees._IOBTree.IOBucket | 9 | 4572 |
| #00099 | BTrees._OIBTree.OIBucket | 5 | 2964 |
| #00099 | BTrees._IOBTree.IOBTree | 39 | 17552 |
| #00099 | BTrees.Length.Length | 27 | 768 |
| #00099 | Persistence.mapping.PersistentMapping | 2 | 846 |
| #00099 | Products.ATContentTypes.content.document.ATDocument | 1 | 1544 |
| #00099 | BTrees._OOBTree.OOBTree | 20 | 3986 |
| #00099 | BTrees._IIBTree.IISet | 3 | 184 |
| #00099 | BTrees._OIBTree.OIBTree | 9 | 1404 |
| #00099 | Products.Archetypes.BaseUnit.BaseUnit | 3 | 767 |
| #00099 | BTrees._OOBTree.OOBucket | 2 | 3286 |
| #00099 | BTrees._IIBTree.IITreeSet | 55 | 3905 |
Transaction detail for txn #10099 (last document):
| Txn id | Classname | Object count | Size (bytes) |
| #10099 | BTrees._IIBTree.IIBTree | 8 | 2517 |
| #10099 | OFS.Folder.Folder | 1 | 55 |
| #10099 | BTrees._IOBTree.IOBucket | 57 | 81564 |
| #10099 | BTrees._OIBTree.OIBucket | 13 | 9872 |
| #10099 | BTrees._IIBTree.IIBucket | 29 | 20024 |
| #10099 | BTrees._IOBTree.IOBTree | 1 | 85 |
| #10099 | Persistence.mapping.PersistentMapping | 2 | 846 |
| #10099 | BTrees.Length.Length | 22 | 655 |
| #10099 | Products.ATContentTypes.content.document.ATDocument | 1 | 1544 |
| #10099 | BTrees._OOBTree.OOBTree | 6 | 30455 |
| #10099 | BTrees._IIBTree.IISet | 65 | 182708 |
| #10099 | Products.Archetypes.BaseUnit.BaseUnit | 3 | 767 |
| #10099 | BTrees._OOBTree.OOBucket | 16 | 8088 |
| #10099 | BTrees._IIBTree.IITreeSet | 2 | 122 |
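To see which classes account for the growth between the two transactions above, the per-class sizes can be subtracted from each other. This is only a sketch over the two detail tables; the `growth` helper is hypothetical, and a class that is missing from one of the tables is counted as 0 bytes there.

```python
# Size in bytes per class, copied from the two detail tables above.
FIRST = {  # txn #00099
    "BTrees._IIBTree.IIBTree": 286,
    "OFS.Folder.Folder": 55,
    "BTrees._IOBTree.IOBucket": 4572,
    "BTrees._OIBTree.OIBucket": 2964,
    "BTrees._IOBTree.IOBTree": 17552,
    "BTrees.Length.Length": 768,
    "Persistence.mapping.PersistentMapping": 846,
    "Products.ATContentTypes.content.document.ATDocument": 1544,
    "BTrees._OOBTree.OOBTree": 3986,
    "BTrees._IIBTree.IISet": 184,
    "BTrees._OIBTree.OIBTree": 1404,
    "Products.Archetypes.BaseUnit.BaseUnit": 767,
    "BTrees._OOBTree.OOBucket": 3286,
    "BTrees._IIBTree.IITreeSet": 3905,
}
LAST = {  # txn #10099
    "BTrees._IIBTree.IIBTree": 2517,
    "OFS.Folder.Folder": 55,
    "BTrees._IOBTree.IOBucket": 81564,
    "BTrees._OIBTree.OIBucket": 9872,
    "BTrees._IIBTree.IIBucket": 20024,
    "BTrees._IOBTree.IOBTree": 85,
    "Persistence.mapping.PersistentMapping": 846,
    "BTrees.Length.Length": 655,
    "Products.ATContentTypes.content.document.ATDocument": 1544,
    "BTrees._OOBTree.OOBTree": 30455,
    "BTrees._IIBTree.IISet": 182708,
    "Products.Archetypes.BaseUnit.BaseUnit": 767,
    "BTrees._OOBTree.OOBucket": 8088,
    "BTrees._IIBTree.IITreeSet": 122,
}


def growth(first, last):
    """Return (class name, byte delta) pairs, largest increase first."""
    classes = set(first) | set(last)
    deltas = {c: last.get(c, 0) - first.get(c, 0) for c in classes}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)


for classname, delta in growth(FIRST, LAST)[:3]:
    print("%-30s %+d bytes" % (classname, delta))
```

The largest increases are in IISet and IOBucket, while the ATDocument, BaseUnit and PersistentMapping sizes are identical in both transactions.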
For a discussion on the above benchmarks, read the thread on ZODB-DEV at
http://mail.zope.org/pipermail/zodb-dev/2008-August/012055.html
Since the discussion on this thread, I've tried out collective.solr, but since I don't really know it that well I haven't spent too much time with it. I started developing collective.alchemyindex (not checked in yet), which indexes data in an RDBMS using SQLAlchemy. Running the above benchmark with Postgres for indexing resulted in a Data.fs of around 368MB and a total Postgres database of only 135MB. This seems a lot more acceptable size-wise. I'll document this benchmark in a future post.






