Hello,
I'm using Hadoop for distributed text mining of a large collection of documents. While optimizing the job, I'd like to speed up one step, and I want to know how to do it with Hadoop.
Each Map process takes a group of documents, analyses each sentence, and for certain patterns queries a database and some indexes to provide a proper answer. This step can take a while, so I'm caching the results in a LinkedHashMap, which works pretty well for standalone jobs and avoids repeated queries for the same pattern in a document.
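To make the setup concrete, here is a minimal sketch of the kind of per-Map cache described above. The class name PatternCache, the slowLookup function (standing in for the database and index queries) and the maxEntries limit are hypothetical names chosen for illustration, not the actual job code, and the size bound is an assumption rather than something the job necessarily does.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical per-Map cache: pattern -> answer, bounded with LRU eviction. */
final class PatternCache {
    private final Map<String, String> cache;
    // Stand-in for the expensive database/index query (assumption).
    private final Function<String, String> slowLookup;
    private int misses = 0;

    PatternCache(final int maxEntries, Function<String, String> slowLookup) {
        this.slowLookup = slowLookup;
        // accessOrder = true keeps entries ordered from least to most recently used.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                // Evict the least recently used entry once the limit is exceeded.
                return size() > maxEntries;
            }
        };
    }

    /** Returns the cached answer, running the slow lookup only on a miss. */
    String resolve(String pattern) {
        String answer = cache.get(pattern);
        if (answer == null) {
            // Note: a null answer from the lookup is never treated as a hit.
            answer = slowLookup.apply(pattern);
            misses++;
            cache.put(pattern, answer);
        }
        return answer;
    }

    int misses() { return misses; }

    int size() { return cache.size(); }
}
```

An instance like this lives inside a single Map task's JVM, and it is not thread-safe, so as written it only saves repeated lookups within that one task.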
I think it would be great to share this LinkedHashMap cache object among all Map instances, so that if Map #2 finds a pattern that Map #1 has already seen in another document, it can use the cached result that Map #1 stored there, saving some time.
Right now, the DistributedCache only shares files, archives and jar files. Is there any way to share a Java object such as a LinkedHashMap, synchronized or not?