feat: Proposed SIMBAUQ Sampling Strategy#785
Conversation
Signed-off-by: Radu Marinescu <radu.marinescu@ie.ibm.com>
Signed-off-by: Radu Marinescu <radu.marinescu@ie.ibm.com>
Signed-off-by: Radu Marinescu <radu.marinescu@ie.ibm.com>
Signed-off-by: Radu Marinescu <radu.marinescu@ie.ibm.com>
|
The PR description has been updated. Please fill out the template for your PR to be reviewed. |
planetf1
left a comment
There was a problem hiding this comment.
Also noticed we don't export SOFAISamplingStrategy in all - not an issue from this PR, but observed
Co-authored-by: Nigel Jones <nigel.l.jones+git@gmail.com>
Co-authored-by: Nigel Jones <nigel.l.jones+git@gmail.com>
Co-authored-by: Nigel Jones <nigel.l.jones+git@gmail.com>
Co-authored-by: Nigel Jones <nigel.l.jones+git@gmail.com>
Co-authored-by: Nigel Jones <nigel.l.jones+git@gmail.com>
Co-authored-by: Nigel Jones <nigel.l.jones+git@gmail.com>
Co-authored-by: Nigel Jones <nigel.l.jones+git@gmail.com>
Co-authored-by: Nigel Jones <nigel.l.jones+git@gmail.com>
Signed-off-by: Radu Marinescu <radu.marinescu@ie.ibm.com>
Signed-off-by: Radu Marinescu <radu.marinescu@ie.ibm.com>
Signed-off-by: Radu Marinescu <radu.marinescu@ie.ibm.com>
|
I made all the required changes. I also replaced the RITS backend in my example with the ollama one. However, I need some guidance with the following: the "classifier" confidence estimation method we developed requires a probabilisitic (skelearn) classifier, which we either receive from the user or we train it on-the-fly based off of the training examples provided by the user. I'd like to pre-train one using the datasets we already collected for our paper and have it as default option but it needs to live somewhere in the package as a serialised object (e.g., pickle file). What would be the best way to do that without messing up too much with the package structure. Thanks. |
Do you have an estimate on how large this file would be? If it's tens of MBs that's probably not a problem, but if we're looking at hundreds of MBs or a GB+ then could be a different story. |
|
@psschwei it's actually not that big. the one we trained for our paper was about 250KB. it's a basic sklearn RandomForestClassifier. |
Signed-off-by: Radu Marinescu <radu.marinescu@ie.ibm.com>
cool, I don't think that will be a problem |
|
I'm assuming then we would need to add sklearn as another dependency? If we're pickling in a file, we probably should consider how to best do so safely. |
Signed-off-by: Radu Marinescu <radu.marinescu@ie.ibm.com>
|
@psschwei yes, added |
sure, any suggestion is welcome :). i usually do this: pickle.dump(model, open(filename, 'wb'))
# some time later...
loaded_model = pickle.load(open(filename, 'rb')) |
| except ImportError: | ||
| msg = ( | ||
| "scipy is required for harmonic mean aggregation. " | ||
| "Please install with `pip install scipy`." |
There was a problem hiding this comment.
correct command, but if this is base library should we have scipy as a core dependency? And what's the relationship to granite-retriever?
There was a problem hiding this comment.
I added scipy to the core dependencies next to numpy and scikit-learn required by this sampling strategy. However, as far as I know scikit-learn requires scipy, so probably we only need numpy and scikit-learn. Please advise.
The granite_retriever dependency group defined in pyproject.toml already contains the sentence-transformers required by the sbert similarity metric. Should we move sentence-transformers to the core dependencies?
Apologies noticed this after doing a per-line review. Also I'm not 100% sure of the intent -- if this is stdlib is it core function? If so shouldn't all dependencies be in our core dependencies. |
Co-authored-by: Nigel Jones <nigel.l.jones+git@gmail.com>
Co-authored-by: Nigel Jones <nigel.l.jones+git@gmail.com>
Co-authored-by: Nigel Jones <nigel.l.jones+git@gmail.com>
Co-authored-by: Nigel Jones <nigel.l.jones+git@gmail.com>
Co-authored-by: Nigel Jones <nigel.l.jones+git@gmail.com>
Agree, deps should be in the core deps if this is going into core. |
So the key decision here is where this code belongs. What do you think @jakelorocco -- is this core stdlib, or an optional, additional strategy? I can see there might be three options once we've agreed on this we can nail the actual dependency config needed. |
Signed-off-by: Radu Marinescu <radu.marinescu@ie.ibm.com>
|
@planetf1 either option is fine with me. Another question: I’ve trained an sklearn classifier on the datasets from our paper so it can be used with the classifier confidence estimation method. It’s stored as a standard pickle file. Would it be possible to ship it with mellea? |
You could use @jakelorocco @HendrikStrobelt any thoughts on where this code should go. mellea (core, optional), contrib? |
Signed-off-by: Radu Marinescu <radu.marinescu@ie.ibm.com>
|
@planetf1 Actually I realised that providing a trained classifier (i.e., pickle file) with the library is not really a good idea. The two distributions (target domain and classifier) shouldn't be too different in order to get good performance. Instead, I extended the example script |
Signed-off-by: Radu Marinescu <radu.marinescu@ie.ibm.com>
|
Is there a reason this is proposed for |
|
@jakelorocco suggested adding it there when i first proposed it. i don't think the exact location matters, as long as it's usable downstream. |
planetf1
left a comment
There was a problem hiding this comment.
Thanks for the continued work on this, @radum2275 — a few things not caught in previous rounds below.
On @nrfulton's question about whether this belongs in stdlib: I think it does conceptually — it's a sampling algorithm that sits naturally alongside sofai.py and majority_voting.py. The dependency question is separable though. The existing pattern in this repo is optional extras (see granite_retriever for sentence-transformers). A simbauq extra for scikit-learn would keep the strategy in mellea.stdlib.sampling without forcing the ML deps on everyone:
[project.optional-dependencies]
simbauq = ["scikit-learn", "numpy<=2.2"]scipy can be dropped entirely (see inline comment on _aggregate), and sentence-transformers is already behind granite_retriever. The import guards are already in place — they'd just need the install hint updated to point at the new extra.
| result_mot.parsed_repr = action.parse(result_mot) | ||
| all_mots.append(result_mot) | ||
| all_contexts.append(result_ctx) | ||
| all_actions.append(action) |
There was a problem hiding this comment.
This appends the same action reference N times — the deep copies created for each generation task are discarded here. So SamplingResult.sample_actions[0] through [N-1] all point at the same object. Either keep the per-task copies or deepcopy(action) here.
Sampling Strategy PR
Use this template when adding or modifying sampling strategies in
mellea/stdlib/sampling/.Description
Implementation Checklist
Base Class
BaseSamplingStrategyif your changes are mostly modifying therepairand/orselect_from_failurefunctionsSamplingStrategyif your changes involve a newsamplemethodReturn Value
SamplingResult. Specifically, this means:ModelOutputThunks insample_generationsare properly typed from the Component and theparsed_repris the expected type.Integration
mellea/stdlib/sampling/__init__.pyTesting
tests/sampling/