This project builds a complete machine learning pipeline for a dataset describing hypothetical samples of 23 species of gilled mushrooms in the Agaricus and Lepiota family. Each species is labeled as definitely edible, definitely poisonous, or of unknown edibility; the last group is combined with the poisonous class, which makes this a binary classification problem. The classifier I use is Linear Discriminant Analysis (LDA).
website which is the source of this data.
The attribute descriptions provided at the source are listed below:
| # | Attribute | Possible Values |
|---|---|---|
| 1 | cap-shape | bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s |
| 2 | cap-surface | fibrous=f, grooves=g, scaly=y, smooth=s |
| 3 | cap-color | brown=n, buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y |
| 4 | bruises? | bruises=t, no=f |
| 5 | odor | almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s |
| 6 | gill-attachment | attached=a, descending=d, free=f, notched=n |
| 7 | gill-spacing | close=c, crowded=w, distant=d |
| 8 | gill-size | broad=b, narrow=n |
| 9 | gill-color | black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, yellow=y |
| 10 | stalk-shape | enlarging=e, tapering=t |
| 11 | stalk-root | bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=? |
| 12 | stalk-surface-above-ring | fibrous=f, scaly=y, silky=k, smooth=s |
| 13 | stalk-surface-below-ring | fibrous=f, scaly=y, silky=k, smooth=s |
| 14 | stalk-color-above-ring | brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y |
| 15 | stalk-color-below-ring | brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y |
| 16 | veil-type | partial=p, universal=u |
| 17 | veil-color | brown=n, orange=o, white=w, yellow=y |
| 18 | ring-number | none=n, one=o, two=t |
| 19 | ring-type | cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z |
| 20 | spore-print-color | black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, white=w, yellow=y |
| 21 | population | abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y |
| 22 | habitat | grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d |
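
Every attribute in the table is a single-letter code, so the values have to be turned into numbers before a linear model such as LDA can use them. The source does not say which encoding was used; the sketch below assumes one-hot encoding and uses only two attributes from the table. The `one_hot` helper and the column naming scheme are hypothetical, chosen for illustration.

```python
# Value codes copied from the attribute table (only two attributes shown).
ODOR = {"a": "almond", "l": "anise", "c": "creosote", "y": "fishy", "f": "foul",
        "m": "musty", "n": "none", "p": "pungent", "s": "spicy"}
BRUISES = {"t": "bruises", "f": "no"}


def one_hot(code, codebook, prefix):
    """Return {column_name: 0 or 1}, with exactly one column set to 1."""
    if code not in codebook:
        raise ValueError(f"unknown code {code!r} for attribute {prefix!r}")
    return {f"{prefix}_{name}": int(c == code) for c, name in codebook.items()}


# A made-up sample: foul odor, bruises present.
sample = {"odor": "f", "bruises?": "t"}

encoded = {
    **one_hot(sample["odor"], ODOR, "odor"),
    **one_hot(sample["bruises?"], BRUISES, "bruises"),
}
print(encoded)
```

Each categorical attribute becomes one 0/1 column per possible value, so a coefficient in a linear model can be read as the weight of one specific value, for example a foul odor.
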
LDA assumes that the features within each class follow a Gaussian distribution with a shared covariance matrix, which purely categorical data does not satisfy. Despite this theoretical mismatch, the model performs perfectly as a binary classifier: on the test set it achieved an F1 score of 1.0 and an ROC-AUC of 1.0. These results suggest that, once the attributes are encoded numerically, the two classes are linearly separable. The distinction between edible and poisonous samples is clear enough that even a simple linear model can find a perfect decision boundary.
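
To make the two metrics concrete, here is a minimal sketch that computes F1 and ROC-AUC from their definitions. The labels and scores are made up for illustration (1 = poisonous) and are not results from this project; the function names are hypothetical. In practice a library implementation would normally be used instead.

```python
def f1_score_binary(y_true, y_pred):
    """F1 = harmonic mean of precision and recall for the positive class (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc_binary(y_true, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: scores play the role of a linear decision function,
# and every positive sample scores above every negative one.
y_true = [1, 1, 0, 0, 1, 0]
scores = [2.3, 1.1, -0.7, -1.9, 0.4, -0.2]
y_pred = [int(s > 0) for s in scores]

print(f1_score_binary(y_true, y_pred), roc_auc_binary(y_true, scores))
```

An ROC-AUC of 1.0 means every poisonous sample is scored above every edible one, and an F1 of 1.0 means the chosen threshold produces no false positives and no false negatives. Together they indicate that a single linear score separates the evaluated samples completely.
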
The dataset description provided by the author states that there are no simple rules for determining the edibility of a mushroom. However, the LDA coefficients provide valuable clues about which features should guide identification. If you had to rely on specific traits, the odor and the stalk are the most valuable ones to examine.
- Poisonous Indicators: The model assigned the highest weights to Odor, specifically identifying foul, spicy, and fishy smells as primary signs. Additionally, the presence of bruises and a scaly stalk surface further increased the chance that the mushroom was poisonous.
- Edible Indicators: The largest coefficients on the edible side were for specific Stalk Root shapes, particularly club or rooted forms. Other significant predictors included a white veil color and the presence of two rings on the stem.