In gss (the open source project that the "Pithos" service and mynetworkfolders are based upon), we use Apache Solr for full-text indexing and searching of the stored documents. However, when a user searches for some terms, we have to show her only results from documents that she has read permission on, i.e. her own documents, documents shared with her by others and documents made public.
During some recent benchmarks we observed extremely high response times even for searches that had very few results. After some code reviews and more fine-grained benchmarks, we realized that over 60% of the time it took for a search to complete was spent on permission checks in the database that stores the document metadata. For example, a search for the term 'java' returned something in the area of 2000 results from Solr. Each result then had to have its permissions checked to see whether the user who did the search had read permission on it, and results that the user could not read had to be filtered out. The response from Solr was blazingly fast; the transformation of the SolrDocument objects to gss resources and the marshaling to JSON took around 40% of the total time, and the remaining 60% went to permission checking.
So we thought: if the Solr search is so fast, why don't we store the document permissions in the index and transform the search query to include the user? That way the search returns only the relevant results (those that the user has read permission on), and no permission checking or filtering is necessary afterwards. More specifically, whenever a file is created or updated, we store in the index the user and group ids that have read permission on the file. When a user does a search, we retrieve the groups that the user belongs to and append to the search query a term that checks whether the user id or any of those group ids is among the ids stored with the file. The search then returns only the relevant results, improving search times by more than 60%.
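The query transformation described above can be sketched roughly as follows. This is only an illustration, not the actual gss code: the index field names `readUsers` and `readGroups` and the helper `restrict` are hypothetical, and the user's search terms are assumed to be already escaped for the Solr query syntax.

```java
import java.util.List;
import java.util.stream.Collectors;

class PermissionQuery {
    // Hypothetical index field names; the real gss schema may use different ones.
    static final String USER_FIELD = "readUsers";
    static final String GROUP_FIELD = "readGroups";

    /**
     * Wraps the user's query so that it only matches documents whose stored
     * reader ids contain the user id or one of the user's group ids.
     */
    static String restrict(String userQuery, long userId, List<Long> groupIds) {
        StringBuilder clause = new StringBuilder(USER_FIELD + ":" + userId);
        if (!groupIds.isEmpty()) {
            String groups = groupIds.stream()
                    .map(String::valueOf)
                    .collect(Collectors.joining(" OR "));
            clause.append(" OR ").append(GROUP_FIELD)
                  .append(":(").append(groups).append(")");
        }
        return "(" + userQuery + ") AND (" + clause + ")";
    }

    public static void main(String[] args) {
        // User 42 belongs to groups 7 and 9 and searches for "java".
        System.out.println(restrict("java", 42L, List.of(7L, 9L)));
        // prints: (java) AND (readUsers:42 OR readGroups:(7 OR 9))
    }
}
```

Public documents would need one more clause in the same style, for example a boolean field on each indexed file; that part is left out of the sketch because the post does not describe how it is stored.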
Note: Care should be taken to update the index not only when a file is updated but also whenever its permissions are updated. However, this does not happen often, and index updating is done asynchronously through a message queue, so the load imposed on the server is insignificant.
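As a rough illustration of that note, the sketch below routes both kinds of change through the same reindex queue. It is an assumption-laden stand-in: an in-process `BlockingQueue` replaces the real message queue, and the class and method names (`ReindexQueue`, `onFileContentChanged`, `onPermissionsChanged`, `drain`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class ReindexQueue {
    // In-process stand-in for the message queue mentioned in the post.
    private final BlockingQueue<Long> pending = new LinkedBlockingQueue<>();
    // Stand-in for the index: records which file ids were reindexed, in order.
    final List<Long> reindexed = new ArrayList<>();

    // A content change and a permission change both enqueue the same work,
    // because the stored reader ids live in the same index entry as the text.
    void onFileContentChanged(long fileId) { pending.add(fileId); }
    void onPermissionsChanged(long fileId) { pending.add(fileId); }

    /** Processes everything queued so far; a real consumer would run this off the request path. */
    int drain() {
        List<Long> batch = new ArrayList<>();
        pending.drainTo(batch);
        reindexed.addAll(batch); // here the real code would rewrite the index entries
        return batch.size();
    }

    public static void main(String[] args) {
        ReindexQueue queue = new ReindexQueue();
        queue.onFileContentChanged(1L);
        queue.onPermissionsChanged(2L);
        System.out.println(queue.drain() + " files reindexed: " + queue.reindexed);
    }
}
```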