Support pandas 2 #138

Merged
merged 8 commits into from
Sep 17, 2024

tung-vu-td
Contributor

A couple of sanity tests I ran for pandas 2 support:

Setup

import os

import numpy as np
import pandas as pd
import pytd


def fake_data(n):
    # Build an (n, 3) float array: user id, item id, random score.
    users = np.random.choice([0., 1., 2.], (n, 1))
    items = np.random.choice([0., 1., 2.], (n, 1))
    weight = np.random.rand(n, 1)
    return np.concatenate((users, items, weight), axis=1)


d1 = fake_data(10_000_000)
df = pd.DataFrame(d1, columns=["users", "items", "scores"])

# Requires TD_API_KEY to be set in the environment.
client = pytd.Client(database="sample_datasets", apikey=os.environ["TD_API_KEY"])

Upload to TD table

client.load_table_from_dataframe(df, "tung_db.pytd_test", writer="bulk_import", if_exists="overwrite", fmt="msgpack", max_workers=6, chunk_record_size=1_000_000)

Result: the table was successfully imported into TD.

Query TD table

result = client.query('select symbol, count(1) as cnt from nasdaq group by 1 order by 1')
df2 = pd.DataFrame(**result)
print(df2)

Result:

     symbol   cnt
0      AAIT   590
1       AAL    82
2      AAME  9252
3      AAOI   253
4      AAON  5980
...     ...   ...
2827   ZNGA   698
2828   ZOOM  6021
2829   ZSPH    71
2830     ZU   217
2831   ZUMZ  2364

[2832 rows x 2 columns]
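
The `pd.DataFrame(**result)` call above relies on the shape of the query result. As a minimal sketch, assuming `client.query` returns a dict with `columns` and `data` keys (as the pytd README describes), the unpacking can be reproduced offline with a stand-in dict. The `result` literal below is hand-written from the first three rows of the output above, not fetched from TD:

```python
import pandas as pd

# Stand-in for the dict returned by client.query (assumed shape:
# "columns" holds the column names, "data" holds the rows).
result = {
    "columns": ["symbol", "cnt"],
    "data": [["AAIT", 590], ["AAL", 82], ["AAME", 9252]],
}

# Equivalent to pd.DataFrame(data=result["data"], columns=result["columns"]).
df2 = pd.DataFrame(**result)
print(df2)
```

Because the keys match the `data` and `columns` keyword arguments of the `pd.DataFrame` constructor, no pandas-version-specific handling is involved in this step.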

Read through pandas 2.0.0 release notes

https://pandas.pydata.org/docs/dev/whatsnew/v2.0.0.html

It seems the only change that could break backward compatibility is "Construction with datetime64 or timedelta64 dtype with unsupported resolution".

Currently, we only support datetime64[ns] for the time column. In pandas 1, if pytd users specify a different resolution for datetime64, for example datetime64[s], pandas silently casts it to datetime64[ns], so they can still convert and import the column correctly. In pandas 2, the requested resolution is kept, so they won't be able to convert and import the time column.
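
One possible user-side workaround is to cast the time column to datetime64[ns] explicitly before uploading. The sketch below is illustrative only: `normalize_time_column` is a hypothetical helper, not part of pytd, and the column name `time` is an assumption.

```python
import pandas as pd


def normalize_time_column(df, column="time"):
    """Return a copy of df whose `column` has dtype datetime64[ns].

    Hypothetical helper for illustration; not a pytd API.
    """
    out = df.copy()
    out[column] = out[column].astype("datetime64[ns]")
    return out


# pandas 2 keeps the requested datetime64[s] resolution here;
# pandas 1 silently converts it to datetime64[ns].
frame = pd.DataFrame(
    {
        "time": pd.Series(["2016-01-01", "2016-01-02"], dtype="datetime64[s]"),
        "value": [1, 2],
    }
)

fixed = normalize_time_column(frame)
print(fixed.dtypes)
```

The explicit `astype` is a no-op on pandas 1 (the column is already nanosecond resolution) and a real conversion on pandas 2, so the same code can run under both major versions.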

@@ -15,12 +15,16 @@ jobs:
 matrix:
 os: [ubuntu-latest, windows-latest]
 python-version: ["3.8", "3.9", "3.10", "3.11"]
-pandas-version: ["1.3.5", "1.4.4", "1.5.3", "2.2.2"]
+pandas-version: ["1.3.5", "1.4.4", "1.5.3", "2.0.3", "2.1.4", "2.2.2"]
Member

I think testing over two versions would be enough, i.e., 1.5.3 and 2.2.2. Is there any specific reason to test for 1.3.5, 1.4.4, 2.0.3, and 2.1.4?

Contributor Author

No specific reason. I'm just following the current convention (1.3, 1.4, 1.5).

If you think it's unnecessary, I'll test major versions only. I think testing minor versions has limited benefit anyway (unless there is a bug in one particular pandas version). Testing only 2 versions will also drastically reduce test time (now close to 1 hour 😬).
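
To put the test-time point in rough numbers, here is a small sketch that counts matrix combinations for the six-version list versus the two versions proposed in review (1.5.3 and 2.2.2). It assumes the CI matrix is a full cross product of the three axes shown in the diff; the actual workflow may exclude some combinations, so the real job count can differ.

```python
from itertools import product

os_list = ["ubuntu-latest", "windows-latest"]
python_versions = ["3.8", "3.9", "3.10", "3.11"]
pandas_before = ["1.3.5", "1.4.4", "1.5.3", "2.0.3", "2.1.4", "2.2.2"]
pandas_after = ["1.5.3", "2.2.2"]  # one version per pandas major


def matrix_size(*axes):
    # Number of combinations in a full cross product of the given axes
    # (assumption: no include/exclude rules are applied).
    return sum(1 for _ in product(*axes))


jobs_before = matrix_size(os_list, python_versions, pandas_before)
jobs_after = matrix_size(os_list, python_versions, pandas_after)
print(jobs_before, jobs_after)
```

Under that assumption the matrix shrinks from 48 combinations to 16, a threefold reduction.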

Member

Okay, then let's test 2 major versions only.

Contributor Author

Updated in cad4317

tung-vu-td merged commit 39550b8 into master on Sep 17, 2024
15 checks passed
tung-vu-td deleted the support-pandas-2 branch on September 17, 2024 08:28