Sunday, April 10, 2011

Using Solr as a fast database cache

In gss (the open source project that the "Pithos" service and mynetworkfolders are based upon), we use Apache Solr for full-text indexing and searching of the stored documents. However, when a user searches for some terms, we have to show her only results from documents that she has read permission on, i.e. her own documents, documents shared with her by others and documents made public.

During some recent benchmarks we observed extremely high response times, even for searches that had very few results. After some code reviews and more fine-grained benchmarks, we realized that over 60% of the time it took for a search to complete was spent on permission checks in the database that stores the document metadata. For example, a search for the term 'java' returned something in the area of 2000 results from Solr. After that, each result had to have its permissions checked to see whether the user who did the search had read permission on it, and results that the user could not read were filtered out. The response from Solr was blazingly fast; the transformation of the SolrDocument objects to gss resources and the marshaling to JSON took around 40% of the total time, and the remaining 60% went to permission checking.

So, we thought: if the Solr search is so fast, why don't we store the document permissions in the index and transform the search query to include the user? That way the search will return only the relevant results (those that the user has read permission on) and no permission checking and filtering will be necessary. More specifically, whenever a file is created or updated, we store in the index the user and group ids that have read permission on the file. Now, when a user does a search, we retrieve the groups that the user belongs to and append to the search query a term that checks whether the user id or any of the group ids is among those stored with the file. That way the search returns only the relevant results, improving search times by more than 60%.
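As a rough illustration of the query transformation described above, here is a minimal Python sketch (gss itself is not written in Python, and this is not its actual code). The field names `ureaders` and `public` are the ones mentioned in the comments below; `greaders` is a hypothetical name for the group field, and the helper function names are made up for this example. It also assumes that ids are plain numbers, so no query escaping is needed.

```python
def build_permission_filter(user_id, group_ids):
    """Return a Solr query clause matching only documents the user may read.

    Assumed index fields: 'public' (boolean), 'ureaders' (user ids with read
    permission) and 'greaders' (hypothetical name, group ids with read permission).
    """
    clauses = ["public:true", f"ureaders:{user_id}"]
    if group_ids:
        # Sort so that the generated query is stable across calls.
        groups = " OR ".join(str(g) for g in sorted(group_ids))
        clauses.append(f"greaders:({groups})")
    return "(" + " OR ".join(clauses) + ")"


def restrict_query(user_query, user_id, group_ids):
    """AND the user's search terms with the permission clause."""
    return f"({user_query}) AND {build_permission_filter(user_id, group_ids)}"


if __name__ == "__main__":
    # User 42 belongs to groups 7 and 3 and searches for 'java'.
    print(restrict_query("java", 42, [7, 3]))
    # prints: (java) AND (public:true OR ureaders:42 OR greaders:(3 OR 7))
```

The permission clause could probably also be sent separately from the search terms as a Solr filter query (the `fq` parameter) instead of being appended to the main query string; the sketch keeps it in a single string to match the description above.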

Note: Care should be taken to update the index not only when a file is updated but also whenever its permissions are updated. However, this is not something that happens often, and index updating is done asynchronously through a message queue, so the load imposed on the server is insignificant.
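The note above could be sketched roughly as follows, again in Python and only as an illustration. An in-process `queue.Queue` stands in for the real message queue, `greaders` is a hypothetical field name, and `load_file` and `send_to_index` are hypothetical callbacks for reading the file metadata from the database and for posting a document to Solr.

```python
import queue

# Stand-in for the real message queue: holds ids of files that need reindexing.
reindex_queue = queue.Queue()


def to_index_document(file_id, text, reader_user_ids, reader_group_ids, is_public):
    """Build the document sent to the index, including the permission fields."""
    return {
        "id": file_id,
        "text": text,
        "ureaders": sorted(reader_user_ids),   # users with read permission
        "greaders": sorted(reader_group_ids),  # groups with read permission (assumed name)
        "public": is_public,
    }


def on_permissions_changed(file_id):
    """Called when a file's permissions change; only enqueues, does no indexing."""
    reindex_queue.put(file_id)


def drain(load_file, send_to_index):
    """Reindex every queued file and return how many were processed."""
    processed = 0
    while True:
        try:
            file_id = reindex_queue.get_nowait()
        except queue.Empty:
            return processed
        send_to_index(to_index_document(**load_file(file_id)))
        processed += 1
```

The point of the split is that the request that changes a permission only pays for an enqueue; the reindexing itself happens later in a separate consumer.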

6 comments:

manos said...

Interesting. Could you provide an update with the relationship model details and user/rights ratio?

chstath said...

@manos: Certainly. I will post about the data model and especially the permissions model. I am not sure if I can get detailed statistics from the production system but I'll give it a try.

Maggie said...

Any update on how you were able to accomplish this? I am trying to achieve the same and I could use some pointers.

chstath said...

This stores group ids that have read permission



This indicates if the file is public

Then the query is modified to filter for public files or files where the user's id is in the ureaders field, e.g. query AND (public:true OR ureaders:userid)

solr_noob said...

awesome post. Have you considered using manifoldCF? not suggesting the solution for/to you, just that I'm trying to solve this problem where users can search for their own documents as well as documents shared with them. In a sense, it is like indexing a sharepoint-like infrastructure, where users have certain permission levels etc.

And if you have considered manifoldCF, would you be willing to share why you chose not to use it?

Thanks :)

chstath said...

At the time, we didn't even know about manifoldCF :-) Anyway since the database and server were built in-house, it was very simple to use solr directly.