One month's worth of data is prepared for each of 3 million IDs. The data consist of one explanatory variable and one objective variable, so the table has three columns: ID, explanatory variable x, and objective variable y. The number of records is 3 million IDs x 30 days ≒ 90 million.
For each of the 3 million IDs, I run a simple regression of the 30 days of explanatory and objective variables, and I want to store the correlation coefficient, slope, and p-value for each ID as the output.
The regression is performed in a for loop over the 3 million IDs, the results are accumulated in lists, and finally the lists are combined into a data frame. The speed of each approach I tried is shown below.
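For reference, here is a minimal sketch of that baseline loop. It is an illustration, not the original code: it assumes the data frame has the columns id, x, and y described above, and it uses scipy.stats.linregress (an assumption, since the post does not say which regression routine is used), which returns the slope, correlation coefficient, and p-value in one call.
sketch_baseline.py
from scipy.stats import linregress
import pandas as pd

# Sketch only: loop over the IDs, regress y on x for each ID's 30 records,
# store the statistics in lists, and combine the lists into a data frame.
corrs, slopes, pvalues = [], [], []
for i in id_list:
    sub = df[df.id == i]              # the 30 records belonging to this ID
    res = linregress(sub.x, sub.y)    # simple regression of y on x
    corrs.append(res.rvalue)
    slopes.append(res.slope)
    pvalues.append(res.pvalue)
out = pd.DataFrame({"id": id_list, "corr": corrs, "slope": slopes, "pvalue": pvalues})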
--EC2 instance (Ubuntu, r5d.4xlarge)
--Simply querying the data frame to extract the records for each ID is slow (about 13 seconds per ID)
code1.py
# Naive approach: filter the full data frame with a boolean mask for every ID
for id in id_list:
    tmp_x = df[df.id == id].x
    tmp_y = df[df.id == id].y
--Speed up by setting id as the index and extracting with df.loc[] (about 3.9 seconds per ID)
code2.py
# Set id as the index so rows can be retrieved with .loc label lookups
df.index = df.id
for id in id_list:
    tmp_x = df.loc[id].x
    tmp_y = df.loc[id].y
--In combination with the above, use a dask dataframe instead of a pandas dataframe (about 1.7 seconds per ID) *What is dask?
code3.py
import dask.dataframe as dd
import multiprocessing

df.index = df.id
# cpu_count in the current environment is 32
ddf = dd.from_pandas(df, npartitions=multiprocessing.cpu_count())
for id in id_list:
    tmp_x = ddf.loc[id].x.compute()
    tmp_y = ddf.loc[id].y.compute()
It is still too slow. At this rate, it would take about two months to process all the data.
Currently there are 30 records per ID, but if the 30 days' worth of data are stored as a list in a single cell, each ID becomes one record. The per-ID loop can then be replaced with a list comprehension, which may improve the processing speed (see the sketch below). However, it remains to be seen how long the conversion from 30 records to 1 record takes in the first place... I would like to do it with pivot_table.
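As a rough illustration of this idea, here is a minimal sketch. It uses groupby().agg(list) to collapse the 30 rows per ID into one row of lists (the post mentions pivot_table, which could serve as an equivalent reshaping step), and again assumes scipy.stats.linregress for the regression itself.
sketch_one_row_per_id.py
from scipy.stats import linregress
import pandas as pd

# Sketch only: one record per ID, with the 30 daily values of x and y
# stored as lists in single cells (the post's pivot_table idea).
wide = df.groupby("id").agg({"x": list, "y": list})

# Replace the explicit per-ID loop with a comprehension over the collapsed frame
rows = [(row.Index, linregress(row.x, row.y)) for row in wide.itertuples()]
results = pd.DataFrame(
    [(i, res.rvalue, res.slope, res.pvalue) for i, res in rows],
    columns=["id", "corr", "slope", "pvalue"],
)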