this post was submitted on 11 Dec 2023
7 points (88.9% liked)

Hi, when I'm working with big dataframes I often need to create columns based on functions, so I have code like this:

def function(row): ...

And then I run the function on the df as

df['new column'] = df.apply(function, axis=1)

But I do this with 10 or more columns/functions at a time. I don't think this is efficient, because each time a column is created it has to parse the entire dataframe. Is there a way to create all the columns at the same time while parsing the rows only once?

Thanks for any help.

top 5 comments
[email protected] 6 points 11 months ago

Then change your function to operate across the dataframe and return a different dataframe?

You can add multiple columns at the same time or do a merge statement of some kind
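A minimal sketch of that idea, assuming the derived values can be computed from whole columns rather than row by row. The function name `build_columns` and the column names `a`, `b`, `total`, and `ratio` are hypothetical, not from the original post:

```python
import pandas as pd


def build_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a new dataframe that holds only the derived columns."""
    return pd.DataFrame(
        {
            "total": df["a"] + df["b"],  # column-wise addition
            "ratio": df["a"] / df["b"],  # column-wise division
        },
        index=df.index,  # keep the original index so the join lines up
    )


df = pd.DataFrame({"a": [1, 2, 3], "b": [2, 4, 6]})

# Attach all derived columns in one step (join aligns on the index).
df = df.join(build_columns(df))
```

Whether this is faster depends on whether the real functions can be rewritten as column operations; that is an assumption here.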

[email protected] 5 points 11 months ago* (last edited 11 months ago) (1 children)

Whatever you do, as long as the dataframe fits in memory it should usually be pretty fast. Depending on the functions you're using, applymap on slices of columns might be faster, but code readability will suffer.

How big is your dataset? If it's huge or your needs are complex, you'll get way more performance by switching from Pandas to Polars dataframes rather than trying to optimize Pandas operations.

[email protected] 2 points 11 months ago

6M rows (it grows by approx. 35K rows a month), 6 columns. After the functions it goes to 17 columns and then finally to 9, which is where I start processing. Currently pd.read_csv() takes 8 min and creating the columns takes 20 min. I would like to reduce that 20 min step.

[email protected] 4 points 11 months ago* (last edited 11 months ago) (1 children)

In that case you can iterate over the rows instead of using apply()

Test it out and see if it's more efficient.
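One way to iterate over the rows could look like the sketch below. It uses `DataFrame.itertuples()`; the column names `a`, `b`, `c` and the body of `function` are placeholders, not the original poster's code:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})


def function(a, b):
    # Placeholder logic; the real function would go here.
    return a * b


# itertuples() yields one lightweight namedtuple per row,
# so each row is visited exactly once.
df["c"] = [function(row.a, row.b) for row in df.itertuples(index=False)]
```

As the comment says, it is worth timing this against apply() on the real data before committing to it.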

Also, you can improve performance by only passing the required columns to apply()

df['c'] = df[['a','b']].apply(function, axis=1)

Actually this seems like a better solution for you.

Here's another approach, I like this one more because it is a closer match to the problem you described.

Check the result_type='expand' argument for df.apply()
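A small sketch of what `result_type='expand'` does: the function returns several values per row, and apply() spreads them into separate columns, so the rows are only walked once. The column names `sum` and `product` and the function body are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})


def function(row):
    # Return every derived value for this row in a single tuple.
    return row["a"] + row["b"], row["a"] * row["b"]


# result_type="expand" turns each returned tuple into one column per element.
df[["sum", "product"]] = df.apply(function, axis=1, result_type="expand")
```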

[email protected] 3 points 11 months ago* (last edited 11 months ago)

Actually this seems like a better solution for you.

Here's another approach, I like this one more because it is a closer match to the problem you described.

Thanks, I tried the first approach but it was slower than what I was doing. The second one didn't work in a single pass because I use some of the newly generated columns to create other ones, but running the process twice, so that the second pass uses the new columns to create the additional ones, worked well and reduced the processing time from 22 min to 13 min. Maybe there are ways to optimize the code even more, but 13 minutes is good enough for me.

Edit: for some reason it broke the information in some way and the next steps of the process are giving me errors

Edit2: I'm an idiot, I made an error while updating the code to the new method.
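The two-pass approach described in this comment might look roughly like the sketch below. The functions `first_pass` and `second_pass` and all column names are hypothetical stand-ins, since the real functions were not posted:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})


def first_pass(row):
    # Columns that depend only on the original data.
    return row["a"] + row["b"], row["b"] - row["a"]


def second_pass(row):
    # Columns that depend on the output of the first pass.
    return row["sum"] * 2, row["sum"] + row["diff"]


# Pass 1: create the columns the later ones depend on.
df[["sum", "diff"]] = df.apply(first_pass, axis=1, result_type="expand")

# Pass 2: create the remaining columns from the new ones.
df[["double_sum", "combined"]] = df.apply(second_pass, axis=1, result_type="expand")
```

This walks the rows twice instead of once per column, which fits the drop in run time reported above.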