Making OpenML understandable for people with data (who aren't machine learners) #36

HeidiSeibold · 2017-10-03T06:27:38Z

HeidiSeibold
Oct 3, 2017

We need to create an OpenML guide for people with data and prediction problems. We really want to attract applied researchers and people who have interesting problems, but at the moment they probably feel intimidated by OpenML.

How can we best welcome people with data?
How can we make OpenML understandable for non-MLers?
How can we make OpenML attractive for people to bring their data to OpenML?

This is an issue for OpenML newbies and non-technical people, since they know best what they need to work with OpenML. Please start working on this issue by commenting below. We'll then get in touch with you to discuss further steps.

OriHoch · 2017-10-03T06:46:46Z

OriHoch
Oct 3, 2017

hey there, this is amazing, I believe I am exactly your target audience - I have no idea about machine learning (or maths / algorithms etc. in general)

but - I do have a lot of data I work with which I think would benefit greatly from ML

here is a small example of something I'm currently working on:

it's a small parser written in Python (which is part of an ETL framework - but it doesn't matter for the sake of this discussion)

https://github.com/OriHoch/budgetkey-data-pipelines/blob/096b1e72ed4ddce605f3866663648ecbdaa88016/datapackage_pipelines_budgetkey/pipelines/supports/criteria/parser.py

This parser goes over support criteria titles from the Israeli government. The titles are all entered by hand by government officials, but they all provide similar details. Translated to English, the titles look something like this:

request for support criteria for the culture ministry to help disabled people access theatres
helping disable people access theatres - a support criteria request by the ministry of culture
support request for amount of 1 Billion Dollar to support disable people access to theatres paid by the Israeli culture office

my parser extracts the data from these titles:

parsing the related office - culture ministry / ministry of culture / the Israeli culture office
parsing the amount (if exists)
parsing the required support - help disabled people access theatres / helping disable people access theatres / support disable people access to theatres
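
To make the "amount" step concrete, here is a minimal sketch of that kind of extraction using a regular expression. It is not the linked parser: `MULTIPLIERS`, `AMOUNT_RE` and `parse_amount` are hypothetical names, and it assumes the English translations shown above instead of the original Hebrew titles.

```python
import re

# Hypothetical scale words for amounts written out in text; extend as needed.
MULTIPLIERS = {"thousand": 10**3, "million": 10**6, "billion": 10**9}

# Matches e.g. "amount of 1 Billion" or "amount of 300" (scale word is optional).
AMOUNT_RE = re.compile(
    r"amount of\s+(\d+(?:\.\d+)?)\s*(thousand|million|billion)?",
    re.IGNORECASE,
)


def parse_amount(title):
    """Return the amount mentioned in a title as an int, or None if absent."""
    match = AMOUNT_RE.search(title)
    if match is None:
        return None
    number = float(match.group(1))
    scale = match.group(2)
    multiplier = MULTIPLIERS[scale.lower()] if scale else 1
    return int(number * multiplier)


title = (
    "support request for amount of 1 Billion Dollar to support disable "
    "people access to theatres paid by the Israeli culture office"
)
print(parse_amount(title))  # 1000000000
```

Titles that do not mention an amount simply return `None`, which matches the "if exists" case above.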

I think this would be a good match for ML - but I have no idea where to start..

I have other examples as well, I work with a lot of public data - and we have many use-cases for ML

Some kind of pluggable / easy to use ML option would be amazing

HeidiSeibold · 2017-10-03T06:58:08Z

HeidiSeibold
Oct 3, 2017
Author

Awesome @OriHoch! Do you have a data set in your repo so I can take a look at it?

Maybe something like:

| parsed text | unparsed text |
| --- | --- |
| ministry of culture | request for support criteria for the culture ministry to help disabled people access theatres |
| ... | ... |

or

| amount | text |
| --- | --- |
| 1000000000 | support request for amount of 1 Billion Dollar to support disable people access to theatres paid by the Israeli culture office |
| ... | ... |

The second one would probably be better for now.

OriHoch · 2017-10-03T07:15:22Z

OriHoch
Oct 3, 2017

well, it's in Hebrew.. so I'm not sure how useful it will be for you, but there is this CSV I use for unit tests -
https://github.com/OriHoch/budgetkey-data-pipelines/blob/096b1e72ed4ddce605f3866663648ecbdaa88016/tests/pipelines/support/criteria.csv

it's still a work in progress, but you can see the title column - which is the source unparsed title, and the expected_purpose column which is the part of the title that contains the purpose of the support (e.g. disabled people access to theatres)

OriHoch · 2017-10-03T07:17:24Z

OriHoch
Oct 3, 2017

I'm happy to provide other examples / think about other possible use-cases for ML and provide data for you - we have a lot of data.. let me know what kind of use-cases you are looking for

joaquinvanschoren · 2017-10-03T08:18:17Z

joaquinvanschoren
Oct 3, 2017
Maintainer

Sounds like a text mining application? OpenML doesn't have very good support for that yet. For now, you'll need to create a bag-of-words or a word2vec / doc2vec representation yourself, so that you obtain a table of numbers that a machine learning algorithm can work with. @heidi: this would be good input for openml/OpenML#456 or openml/OpenML#457: to create a simple guide or provide some code to do this.
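
For readers who have not seen a bag-of-words representation before, here is a minimal sketch in plain Python. The function names (`tokenize`, `bag_of_words`) are made up for illustration and the whitespace tokenizer is an assumption; real text, and Hebrew in particular, would need proper tokenization.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; real text needs better tokenization."""
    return text.lower().split()


def bag_of_words(texts):
    """Return (vocabulary, rows); each row counts the vocabulary words in one text."""
    counts = [Counter(tokenize(text)) for text in texts]
    vocabulary = sorted(set().union(*counts))
    rows = [[count[word] for word in vocabulary] for count in counts]
    return vocabulary, rows


titles = [
    "request for support criteria for the culture ministry",
    "a support criteria request by the ministry of culture",
]
vocabulary, rows = bag_of_words(titles)
print(vocabulary)
for row in rows:
    print(row)
```

Each title becomes one row of word counts over a shared vocabulary, which is the "table of numbers" a machine learning algorithm can work with.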

OriHoch · 2017-10-03T08:25:44Z

OriHoch
Oct 3, 2017

of course, I don't expect openML to automatically take a csv with text and parse it

I understand that data need to be transformed to compatible format / convert to numbers etc.. that's fine and I'm happy to do that

but currently, without any ML knowledge, I have no idea where to start

malsch · 2017-10-03T08:42:20Z

malsch
Oct 3, 2017

Very interesting that you use a parser for this task! How well does it work and how satisfied are you? My guess is that the optimal way to approach this problem would be to combine ML and parsing.

Just looking at the office parsing, here is how I would approach your problem from a supervised machine learning perspective:

1. (not machine learning:) Aren't "culture ministry" / "ministry of culture" / "the Israeli culture office" all the same, and shouldn't they have a single code? Create a coding frame (e.g. one entity in this coding frame would be the culture ministry). This is a classical content analysis problem and software like MAXQDA exists to support this work.
2. Now you have two options for how to proceed:

   A. (not machine learning:) You may want to create a coding index: Words like "culture ministry" / "ministry of culture" / "the Israeli culture office" could be inserted here and whenever this text occurs, you code the text into the coding frame entity "culture ministry"

   B. (machine learning:)
      a. You need training data, the more the better: hundreds or thousands of titles need to be labeled by hand, i.e. you would need to assign each text to a category from the coding scheme.
      b. Use the training data to predict the categories for titles that have not been labeled by hand.
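
As a rough illustration of the coding index in option A, here is a minimal Python sketch. `CODING_INDEX` and `code_office` are hypothetical names, and the phrases are the English translations quoted earlier in this thread, not the real Hebrew titles.

```python
# Hypothetical coding index: each known phrase maps to one coding-frame entity.
CODING_INDEX = {
    "culture ministry": "culture ministry",
    "ministry of culture": "culture ministry",
    "the israeli culture office": "culture ministry",
}


def code_office(title):
    """Return the coding-frame entity for the first known phrase found, else None."""
    lowered = title.lower()
    for phrase, entity in CODING_INDEX.items():
        if phrase in lowered:
            return entity
    return None


print(code_office("a support criteria request by the ministry of culture"))
```

A title that matches none of the phrases returns `None`. A growing pile of such unmatched titles is one sign that the index is becoming hard to maintain by hand.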

Note that A. is far simpler than B. and you may actually not need machine learning. So what is the potential of machine learning for this problem? It pays off when creating a coding index is too hard because people use too many different ways to describe the entity "culture ministry". In this case, it may be easier to use the input texts directly for classification (i.e. let the machine learn how to use the texts) instead of thinking in advance what words people may use to express their thoughts.
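
To give an idea of what option B looks like in practice, here is a small sketch of a naive Bayes text classifier in plain Python. Naive Bayes is just one simple choice, not a recommendation from this thread. The function names are hypothetical, and the "health ministry" titles are invented purely so that there is a second category to learn. Four titles are far too few for real use, as noted in step B.a.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a simplifying assumption)."""
    return text.lower().split()


def train(labeled_titles):
    """Count words per label from (title, label) pairs labeled by hand."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter()            # label -> number of titles
    for title, label in labeled_titles:
        label_counts[label] += 1
        word_counts[label].update(tokenize(title))
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocabulary


def predict(model, title):
    """Return the label with the highest log-probability for the title."""
    word_counts, label_counts, vocabulary = model
    total_titles = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_titles in label_counts.items():
        score = math.log(n_titles / total_titles)
        total_words = sum(word_counts[label].values())
        for word in tokenize(title):
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            p = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hand-labeled training data; the health ministry titles are invented examples.
training = [
    ("request for support criteria for the culture ministry to help disabled people access theatres", "culture ministry"),
    ("helping disable people access theatres - a support criteria request by the ministry of culture", "culture ministry"),
    ("support criteria request by the health ministry for hospital equipment", "health ministry"),
    ("request for support for rural clinics by the ministry of health", "health ministry"),
]
model = train(training)
print(predict(model, "support request for theatres paid by the israeli culture office"))
```

Note that the last title uses the phrase "culture office", which appears nowhere in the training data. The classifier still has the words "culture" and "theatres" to go on, which is the kind of flexibility a fixed coding index lacks.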

joaquinvanschoren · 2017-10-03T08:43:08Z

joaquinvanschoren
Oct 3, 2017
Maintainer

Great. Maybe you could Google a bit yourself here? Here are a few starting points:

* https://www.quora.com/Where-can-I-find-some-pre-trained-word-vectors-for-natural-language-processing-understanding
* https://www.oreilly.com/learning/capturing-semantic-meanings-using-deep-learning
* For Hebrew: https://github.com/liorshk/wordembedding-hebrew

Most work today uses a pre-trained deep learning algorithm that you show lots of words (on entire documents) in a given language. It will learn a representation for those words, i.e. a vector of numbers to represent each word. Once trained, every text input returns a numeric output. For many languages you can also find pre-trained models (not sure about Hebrew, but with the third link you can do this yourself). You can then build a table with those numbers (and attach any labels you want), and use that to train another machine learning algorithm to classify/predict something (e.g. in your case the related office).

Just by the by: for many text parsing applications you can also get away with some clever pattern matching. Machine learning is more powerful (it can do much more complex things), but also messier (it makes mistakes) and harder to set up correctly.
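
To sketch the "table with those numbers" step: one common and simple way to turn word vectors into a single row per text is to average the vectors of the words it contains. The example below is only an illustration under that assumption. The 3-dimensional vectors in `WORD_VECTORS` are invented, and `text_to_vector` and `build_table` are hypothetical names. Real pre-trained vectors would be loaded from a model file and have many more dimensions.

```python
# Toy 3-dimensional word vectors, invented for illustration only.
WORD_VECTORS = {
    "culture": [0.9, 0.1, 0.0],
    "theatres": [0.8, 0.2, 0.1],
    "ministry": [0.1, 0.9, 0.0],
    "support": [0.0, 0.3, 0.7],
}


def text_to_vector(text, dimensions=3):
    """Average the vectors of known words; unknown words are skipped."""
    vectors = [WORD_VECTORS[w] for w in text.lower().split() if w in WORD_VECTORS]
    if not vectors:
        return [0.0] * dimensions
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def build_table(labeled_titles):
    """Return one row per title: the averaged features followed by the label."""
    return [text_to_vector(title) + [label] for title, label in labeled_titles]


table = build_table([
    ("a support criteria request by the ministry of culture", "culture ministry"),
])
for row in table:
    print(row)
```

Each row is a fixed-length list of numbers plus a label, which is the shape a standard classification algorithm expects.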

joaquinvanschoren · 2017-10-03T08:46:15Z

joaquinvanschoren
Oct 3, 2017
Maintainer

+1 for @malsch's comments. Think well about the problem to see whether an engineering or ML approach may be best.

HeidiSeibold · 2017-10-03T08:48:16Z

HeidiSeibold
Oct 3, 2017
Author

I think we should take @OriHoch 's case as an example and provide a step by step guide on what to do to put the data on OpenML and create a task. Does that make sense?

IMHO we shouldn't try to solve his problem in this task (although it is for sure helpful), but think about the bigger picture on how to help people with his knowledge, data and question. @OriHoch would you feel comfortable with that?

OriHoch · 2017-10-03T08:55:23Z

OriHoch
Oct 3, 2017

Thanks everyone for your input - this is a great discussion and I think it will be great if you / we are able to crack this issue (making ML accessible to people who aren't machine learners) even if for a limited use-case

Regarding your comments about the example - I think you are focusing too much on the specific example I provided - which may or may not be suitable and which may have better alternatives. This was just off the top of my head from something I'm actively working on.

Like I wrote in previous comment - this is just an example, we have a lot of data and might be able to figure out a better use-case for this issue. Would be good to get some more input from you regarding what kind of use-case is best as the first use-case of this issue (making ML accessible to people who don't know ML)

In any case, I'll think about your input and try to figure out a suitable follow-up.

OriHoch · 2017-10-03T09:09:53Z

OriHoch
Oct 3, 2017

here are some more examples of data we have about the Israeli government -

DB tables (some in Hebrew, but all the source tables are in English below the Hebrew-friendly tables)
https://next.oknesset.org/metabase/public/dashboard/57604bd2-73f3-4fbc-943f-53bf45287641
https://next.oknesset.org/metabase/public/dashboard/edf65569-8ca3-41cb-a917-39951c80b9bc
https://next.oknesset.org/metabase/public/dashboard/0c78c5f7-2d1b-4d99-9800-0c7495e2f7be

text files which we parse from meeting protocols in doc/rtf format:
https://next.oknesset.org/data/committee-meeting-protocols-parsed/
available via json as well -
https://next.oknesset.org/data-json/committee-meeting-protocols-parsed/

joaquinvanschoren · 2017-10-03T10:04:13Z

joaquinvanschoren
Oct 3, 2017
Maintainer

Ori, you are sure that this is public data, right? Just checking.

OriHoch · 2017-10-03T10:12:43Z

OriHoch
Oct 3, 2017

yes, it's all public, you can read more about our organization here - http://www.hasadna.org.il/en/

PGijsbers · 2026-06-24T08:44:15Z

PGijsbers
Jun 24, 2026
Maintainer

Making OpenML more accessible to people with data is still important. I am going to transfer this to the discussion board though, since it seems more like a design problem than an implementation issue at this point. Feedback is still welcome. We also recently had some students at Leiden University build a tool which can hopefully make OpenML more accessible, as it allows users to upload any data and then start a discussion around it to try and get it in shape for upload to OpenML.

PGijsbers · 2026-06-24T08:50:16Z

PGijsbers
Jun 24, 2026
Maintainer

Another good point that Heidi raised is that in some cases people may need to understand that their data is not (yet) suitable for ML: openml/OpenML#459
We should consider giving people some guidance there, too. Have a look at the issue for a draft of potential docs.

Replies: 16 comments
