fix: set columns and dtype explicitly in dataset readers#1706

Open
ShresthSamyak wants to merge 1 commit into deeppavlov:master from ShresthSamyak:fix/explicit-columns-dtype

@ShresthSamyak

Fixes #1654

Changes

  • `docred_reader.py`: replaced generic `d1, d2, d3, d4` column names with `rel_id`, `train`, `valid`, `test`
  • `basic_classification_reader.py`: added `dtype` to the passthrough keys for both `csv` and `json` reads so callers can explicitly
    control column types
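
A minimal sketch of why the `dtype` passthrough matters, assuming a small in-memory CSV. The helper name `read_csv_with_passthrough`, the `CSV_KEYS` constant, and the sample data are hypothetical and only mirror the key-filtering pattern used in the reader; they are not part of the DeepPavlov API.

```python
import io

import pandas as pd

# Mirrors the whitelist of keys forwarded to pandas.read_csv in this PR.
CSV_KEYS = ("sep", "header", "names", "dtype")


def read_csv_with_passthrough(file, **kwargs):
    """Hypothetical helper: forward only whitelisted kwargs to pandas.read_csv."""
    options = {k: kwargs[k] for k in CSV_KEYS if k in kwargs}
    return pd.read_csv(file, **options)


# Illustrative data: labels that look numeric but carry leading zeros.
raw = "text,label\nhello,007\nworld,042\n"

# Without dtype, pandas infers an integer column and drops the leading zeros.
inferred = read_csv_with_passthrough(io.StringIO(raw))

# With dtype passed through, the labels stay as the original strings.
explicit = read_csv_with_passthrough(io.StringIO(raw), dtype={"label": str})

print(inferred["label"].tolist())
print(explicit["label"].tolist())
```

Keys outside the whitelist are silently ignored by this pattern, so a caller can only control column types once `dtype` is listed among the forwarded keys.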

Code smell addressed

Columns and DataType Not Explicitly Set — Zhang et al., CAIN 2022

  - docred_reader: replace generic d1-d4 column names with rel_id, train, valid, test
  - basic_classification_reader: add dtype to passthrough keys for csv and json reads

  Fixes deeppavlov#1654
Copilot AI review requested due to automatic review settings May 7, 2026 16:05

Copilot AI left a comment

Pull request overview

This PR addresses issue #1654 by making dataset reader outputs more explicit and controllable: column names in the statistics output become self-describing, and callers can set column types instead of relying on silent type inference.

Changes:

  • Updated DocREDDatasetReader.print_statistics() to use meaningful column names (rel_id, train, valid, test) instead of generic d1..d4.
  • Extended BasicClassificationDatasetReader to pass through dtype to pandas.read_csv() and pandas.read_json() so callers can explicitly control column types.
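
The column renaming can be sketched as follows. This is an illustration under stated assumptions, not the code from `docred_reader.py`: the row values and relation IDs are made up, and only the four column names come from this PR.

```python
import pandas as pd

# Hypothetical per-relation counts; real values would come from the reader's statistics.
rows = [("P17", 120, 15, 14), ("P131", 80, 9, 11)]

# Explicit column names replace the generic d1..d4 labels,
# and the count columns get an explicit integer dtype.
stats = pd.DataFrame(rows, columns=["rel_id", "train", "valid", "test"]).astype(
    {"train": "int64", "valid": "int64", "test": "int64"}
)

print(stats)
```

With named columns, downstream code can select `stats["train"]` directly instead of having to remember which positional label held which split.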

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| deeppavlov/dataset_readers/docred_reader.py | Renames statistics DataFrame columns to explicit, semantically meaningful names for clearer logging/output. |
| deeppavlov/dataset_readers/basic_classification_reader.py | Adds dtype passthrough to pandas readers to allow explicit column type control. |

Comment on lines 80 to 87

      if format == 'csv':
    -     keys = ('sep', 'header', 'names')
    +     keys = ('sep', 'header', 'names', 'dtype')
          options = {k: kwargs[k] for k in keys if k in kwargs}
          df = pd.read_csv(file, **options)
      elif format == 'json':
    -     keys = ('orient', 'lines')
    +     keys = ('orient', 'lines', 'dtype')
          options = {k: kwargs[k] for k in keys if k in kwargs}
          df = pd.read_json(file, **options)

Development

Successfully merging this pull request may close these issues.

Columns and DataType Not Explicitly Set on line 411 of docred_reader.py