Making OpenML understandable for people with data (who aren't machine learners) #36
Replies: 16 comments
-
|
Hey there, this is amazing. I believe I am exactly your target audience: I have no idea about machine learning (or maths, algorithms, etc. in general), but I do have a lot of data I work with which I think would benefit greatly from ML. Here is a small example of something I'm currently working on: a small parser written in Python (it is part of an ETL framework, but that doesn't matter for the sake of this discussion). This parser goes over support criteria titles from the Israeli government. The titles are all entered by hand by government officials, but they all provide similar details. Translated to English, the titles look something like this:
My parser extracts the data from these titles:
I think this would be a good match for ML, but I have no idea where to start. I have other examples as well: I work with a lot of public data, and we have many use cases for ML. Some kind of pluggable, easy-to-use ML option would be amazing. |
-
|
Awesome @OriHoch! Do you have a data set in your repo so I can take a look at it? Maybe something like:
or
The second one would probably be better for now. |
-
|
Well, it's in Hebrew, so I'm not sure how useful it will be for you, but there is this CSV I use for unit tests. It's still a work in progress, but you can see the |
-
|
I'm happy to provide other examples, think about other possible use cases for ML, and provide data for you. We have a lot of data, so let me know what kind of use cases you are looking for. |
-
|
Sounds like a text mining application?
OpenML doesn't have very good support for that yet. For now, you'll need to
create a bag-of-words or a word2vec / doc2vec representation yourself, so
that you obtain a table of numbers that a machine learning algorithm can
work with.
@heidi: this would be good input for openml/OpenML#456 or openml/OpenML#457: to create a simple guide
or provide some code to do this.
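To make the bag-of-words idea concrete, here is a minimal sketch using only the Python standard library. The example titles are hypothetical (they are not taken from the real data set), and the whitespace tokenizer is an assumption that real text, especially Hebrew, would likely need to replace with something more careful.

```python
from collections import Counter


def tokenize(text):
    # Lowercase and split on whitespace; real text usually needs more care.
    return text.lower().split()


def bag_of_words(documents):
    """Return (vocabulary, rows); each row counts the vocabulary words in one document."""
    tokenized = [tokenize(doc) for doc in documents]
    # The vocabulary is every distinct word, sorted so that columns are stable.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


# Hypothetical example titles, not taken from the real data set.
titles = [
    "support for culture institutions",
    "support for sports institutions",
]
vocabulary, table = bag_of_words(titles)
```

Each row of `table` is the "table of numbers" for one title, with one column per word in `vocabulary`. A label column (for example the related office) could then be attached to each row.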
--
Thank you,
Joaquin
|
-
|
Of course, I don't expect OpenML to automatically take a CSV with text and parse it. I understand that data needs to be transformed to a compatible format, converted to numbers, etc. That's fine and I'm happy to do that, but currently, without any ML knowledge, I have no idea where to start. |
-
|
Very interesting that you use a parser for this task! How well does it work and how satisfied are you? My guess is that the optimal way to approach this problem would be to combine ML and parsing. Just looking at the office parsing, here is how I would approach your problem from a supervised machine learning perspective:
B. (machine learning:) Note that A. is far simpler than B., and you may not actually need machine learning. So what is the potential of machine learning for this problem? It becomes useful if it is too hard to create a coding index because people use too many different ways to describe the entity "culture ministry". In this case, it may be easier to use the input texts directly for classification (i.e. let the machine learn how to use the texts) instead of thinking in advance about what words people may use to express their thoughts. |
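For comparison, the simpler coding-index approach could look roughly like the sketch below. The index entries, the office names, and the `lookup_office` helper are all hypothetical and only illustrate the idea of mapping known phrasings to one entity. This is not the parser discussed in this thread.

```python
# Hypothetical coding index: known phrase variants mapped to an office name.
CODING_INDEX = {
    "culture ministry": "Ministry of Culture",
    "ministry of culture": "Ministry of Culture",
    "education ministry": "Ministry of Education",
}


def lookup_office(title):
    """Return the office for the first index phrase found in the title, else None."""
    # Normalize case and collapse repeated whitespace before matching.
    normalized = " ".join(title.lower().split())
    for phrase, office in CODING_INDEX.items():
        if phrase in normalized:
            return office
    return None
```

Every new phrasing needs a new index entry, which is exactly the maintenance cost that may make a learned classifier attractive once the variants become too numerous.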
-
|
Great. Maybe you could Google a bit yourself here?
Here are a few starting points:
* https://www.quora.com/Where-can-I-find-some-pre-trained-word-vectors-for-natural-language-processing-understanding
* https://www.oreilly.com/learning/capturing-semantic-meanings-using-deep-learning
* For Hebrew: https://github.com/liorshk/wordembedding-hebrew
Most work today uses a deep learning algorithm that is trained on lots of
words (entire documents) in a given language. It learns a representation
for those words, i.e. a vector of numbers to represent each word. Once
trained, every text input returns a numeric output. For many languages you
can also find pre-trained models (not sure about Hebrew, but with the third
link you can train one yourself).
You can then build a table with those numbers (and attach any labels you
want), and use that to train another machine learning algorithm to
classify/predict something (e.g. in your case the related office).
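As a rough illustration of that second step, the sketch below averages word vectors into one vector per text and then classifies with a nearest-centroid rule. Everything here is an assumption for illustration: the 2-dimensional `WORD_VECTORS` are made up (real ones would come from a trained embedding model), the texts and labels are hypothetical, and nearest-centroid is just one simple choice of "another machine learning algorithm".

```python
# Toy 2-dimensional word vectors; real ones would come from a trained model.
WORD_VECTORS = {
    "theatre": [0.9, 0.1],
    "museum": [0.8, 0.2],
    "school": [0.1, 0.9],
    "teachers": [0.2, 0.8],
}


def document_vector(text):
    """Average the vectors of known words; unknown words are skipped."""
    vectors = [WORD_VECTORS[w] for w in text.lower().split() if w in WORD_VECTORS]
    if not vectors:
        return [0.0, 0.0]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def train_centroids(texts, labels):
    """Compute one mean document vector (centroid) per label."""
    grouped = {}
    for text, label in zip(texts, labels):
        grouped.setdefault(label, []).append(document_vector(text))
    return {
        label: [sum(column) / len(rows) for column in zip(*rows)]
        for label, rows in grouped.items()
    }


def predict(centroids, text):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    vector = document_vector(text)

    def distance(label):
        return sum((a - b) ** 2 for a, b in zip(vector, centroids[label]))

    return min(centroids, key=distance)


# Hypothetical labelled examples: text plus the related office category.
texts = ["support for theatre", "support for school"]
labels = ["culture", "education"]
centroids = train_centroids(texts, labels)
```

With these toy vectors, `predict(centroids, "museum grants")` lands on the "culture" centroid. With real data the table of document vectors and labels would be much larger, and any standard classifier could take the place of the centroid rule.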
Just by the by: for many text parsing applications you can also get away
with some clever pattern matching. Machine learning is more powerful (it
can do much more complex things), but also messier (it makes mistakes) and
harder to set up correctly.
--
Thank you,
Joaquin
|
-
|
+1 for @malsch's comments. Think well about the problem to see whether an engineering or ML approach may be best. |
-
|
I think we should take @OriHoch 's case as an example and provide a step by step guide on what to do to put the data on OpenML and create a task. Does that make sense? IMHO we shouldn't try to solve his problem in this task (although it is for sure helpful), but think about the bigger picture on how to help people with his knowledge, data and question. @OriHoch would you feel comfortable with that? |
-
|
Thanks everyone for your input. This is a great discussion, and I think it will be great if you / we are able to crack this issue (making ML accessible to people who aren't machine learners), even if only for a limited use case. Regarding your comments about the example: I think you are focusing too much on the specific example I provided, which may or may not be suitable and which may have better alternatives. This was just off the top of my head, from something I'm actively working on. As I wrote in a previous comment, this is just an example; we have a lot of data and might be able to figure out a better use case for this issue. It would be good to get some more input from you about what kind of use case would be best as the first one for this issue (making ML accessible to people who don't know ML). In any case, I'll think about your input and try to figure out a suitable follow-up. |
-
|
Here are some more examples of data we have about the Israeli government. DB tables (some in Hebrew, but all the source tables are in English below the Hebrew-friendly tables); text files which we parse from meeting protocols in doc/rtf format: |
-
|
Ori, you are sure that this is public data, right? Just checking.
On Tue, 3 Oct 2017 at 11:09 Ori Hoch ***@***.***> wrote:
here are some more examples of data we have about the Israeli government -
DB tables (some in hebrew, but all the source tables are in english below
the hebrew friendly tables)
https://next.oknesset.org/metabase/public/dashboard/57604bd2-73f3-4fbc-943f-53bf45287641
https://next.oknesset.org/metabase/public/dashboard/edf65569-8ca3-41cb-a917-39951c80b9bc
https://next.oknesset.org/metabase/public/dashboard/0c78c5f7-2d1b-4d99-9800-0c7495e2f7be
text files which we parse from meeting protocols in doc/rtf format:
https://next.oknesset.org/data/committee-meeting-protocols-parsed/
available via json as well -
https://next.oknesset.org/data-json/committee-meeting-protocols-parsed/
--
Thank you,
Joaquin
|
-
|
Yes, it's all public. You can read more about our organization here: http://www.hasadna.org.il/en/ |
-
|
Making OpenML more accessible to people with data is still important. I am going to transfer this to the discussion board though, since it seems more like a design problem than an implementation issue at this point. Feedback is still welcome. We also recently had some students at Leiden University build a tool which can hopefully make OpenML more accessible, as it allows users to upload any data and then start a discussion around it to try and get it in shape for upload to OpenML. |
-
|
Another good point that Heidi raised is that in some cases people may need to understand that their data is not (yet) suitable for ML: openml/OpenML#459 |
-
We need to create an OpenML guide for people with data and prediction problems. We really want to attract applied researchers and people who have interesting problems, but at the moment they probably feel intimidated by OpenML.
This is an issue for OpenML newbies and non-technical people, since they know best what they need in order to work with OpenML. Please start working on this issue by commenting below. We'll then get in touch with you to discuss further steps.