Project Description

The goal of this project is to build a complete machine learning pipeline for a dataset containing descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family. Each species is identified as definitely edible, definitely poisonous, or of unknown edibility. The last group is combined with the poisonous class, which makes this a binary classification problem. The classification method I will be using is Linear Discriminant Analysis (LDA).

website which is the source of this data.

The attribute descriptions provided at the source are listed below:

#	Attribute	Possible Values
1	cap-shape	bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s
2	cap-surface	fibrous=f, grooves=g, scaly=y, smooth=s
3	cap-color	brown=n, buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y
4	bruises?	bruises=t, no=f
5	odor	almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s
6	gill-attachment	attached=a, descending=d, free=f, notched=n
7	gill-spacing	close=c, crowded=w, distant=d
8	gill-size	broad=b, narrow=n
9	gill-color	black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, yellow=y
10	stalk-shape	enlarging=e, tapering=t
11	stalk-root	bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=?
12	stalk-surface-above-ring	fibrous=f, scaly=y, silky=k, smooth=s
13	stalk-surface-below-ring	fibrous=f, scaly=y, silky=k, smooth=s
14	stalk-color-above-ring	brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
15	stalk-color-below-ring	brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
16	veil-type	partial=p, universal=u
17	veil-color	brown=n, orange=o, white=w, yellow=y
18	ring-number	none=n, one=o, two=t
19	ring-type	cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z
20	spore-print-color	black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, white=w, yellow=y
21	population	abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y
22	habitat	grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d

Results

Model efficiency

Despite the theoretical mismatch of applying LDA to purely categorical data (LDA assumes normally distributed features with a shared covariance matrix), the model performs perfectly as a binary classifier. On the test set, the model achieved an F1 score of 1.0 and an ROC-AUC of 1.0. These results suggest that the two classes are linearly separable once the categorical attributes are encoded numerically. The distinction between edible and poisonous samples is so clear that even a simple linear model can find a perfect decision boundary.
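As an illustration of the kind of pipeline described above, here is a minimal sketch, not the project's actual code. It assumes one-hot encoding via `pandas.get_dummies` and scikit-learn's `LinearDiscriminantAnalysis`, and it runs on a small synthetic table with a made-up labelling rule and 5% label noise. The scores it prints therefore say nothing about the results reported here.

```python
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-in for the mushroom table (NOT the real data).
df = pd.DataFrame({
    "odor": rng.choice(list("alnfsy"), size=n),
    "bruises": rng.choice(list("tf"), size=n),
})

# Hypothetical labelling rule: foul/spicy/fishy odors are poisonous (1).
# 5% of the labels are flipped so that every odor occurs in both classes.
y = df["odor"].isin(["f", "s", "y"]).to_numpy().astype(int)
flip = rng.random(n) < 0.05
y = np.where(flip, 1 - y, y)

# One-hot encode the categorical columns; drop_first avoids exactly
# collinear dummy columns, which LDA would otherwise warn about.
X = pd.get_dummies(df, drop_first=True).astype(float)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

# F1 uses hard predictions; ROC-AUC uses the continuous decision scores.
f1 = f1_score(y_test, lda.predict(X_test))
auc = roc_auc_score(y_test, lda.decision_function(X_test))
print(f"F1: {f1:.3f}  ROC-AUC: {auc:.3f}")
```

Encoding before the split is a shortcut that keeps the sketch short. With real data, fitting the encoder on the training portion only is the safer habit.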

Key Coefficients

The dataset description provided at the source states that there is no simple rule for determining the edibility of a mushroom. However, the LDA coefficients provide valuable clues about which features should guide identification. If you had to rely on specific traits, the odor and the stalk are the most valuable to examine.

Poisonous Indicators: The model assigned the largest weights to odor, with foul, spicy, and fishy smells as the primary signs. In addition, bruising and a scaly stalk surface further increased the likelihood that a mushroom was classified as poisonous.
Edible Indicators: On the edible side, the largest coefficients belong to specific stalk-root shapes, particularly club and rooted forms. Other significant predictors were a white veil color and the presence of two rings on the stem.
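A ranking like the one above can be read off a fitted model's coefficients. The sketch below is one possible way to do it, again on synthetic data with a hypothetical labelling rule, so the ordering it prints is not the project's result. It assumes scikit-learn's `LinearDiscriminantAnalysis`, whose `coef_` attribute has one row for a binary problem. Positive values push a sample toward the class labelled 1 (poisonous here) and negative values toward class 0.

```python
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
n = 2000

# Synthetic stand-in (NOT the real data): two categorical attributes.
df = pd.DataFrame({
    "odor": rng.choice(list("alnfsy"), size=n),
    "ring-number": rng.choice(list("ot"), size=n),
})

# Hypothetical rule: foul/spicy/fishy odors are poisonous (1), 5% noise.
y = df["odor"].isin(["f", "s", "y"]).to_numpy().astype(int)
y = np.where(rng.random(n) < 0.05, 1 - y, y)

# With drop_first=True each coefficient is relative to the dropped
# baseline level of its attribute (odor "a" and ring-number "o" here).
X = pd.get_dummies(df, drop_first=True).astype(float)
lda = LinearDiscriminantAnalysis().fit(X, y)

# One coefficient per encoded column, sorted from most "edible"
# (most negative) to most "poisonous" (most positive).
coefs = pd.Series(lda.coef_[0], index=X.columns).sort_values()
top3 = set(coefs.index[-3:])
print(coefs)
```

Coefficients of one-hot columns are only comparable within the same encoding and baseline choice, so a ranking like this is best treated as a hint rather than a rule.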
