What is the fastest way to repeatedly resample timeseries data of the same shape from hourly to yearly in Python?

Question

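Problem: I have 30 years of hourly timeseries data that I want to resample to yearly values by calendar year (resample rule 'AS'). I need both the mean and the sum for each year. There are no missing hours. I then need to do this more than 10,000 times. In the script I am writing, this resampling step takes by far the most time and is the limiting factor for optimising the run time. Because of leap years, I cannot simply resample in consistent blocks of 8760 hours, since every fourth year has 8784 hours.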
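To make the leap-year point concrete, here is a small sketch that counts the hourly samples in each calendar year of the same date range used in the question. The variable names are only illustrative.

```python
import pandas as pd

# Same hourly range as in the question: 2020-01-01 00:00 through 2050-12-31 23:00.
index = pd.date_range("2020-01-01 00:00", "2050-12-31 23:00", freq="60min")

# Count how many hourly samples fall into each calendar year.
hours_per_year = pd.Series(index.year).value_counts().sort_index()

print(hours_per_year.loc[2020])          # leap year: 8784
print(hours_per_year.loc[2021])          # regular year: 8760
print(sorted(hours_per_year.unique()))   # [8760, 8784]
```

Because the block lengths differ, a plain reshape into (n_years, 8760) is not possible; any fast approach has to carry the per-year boundaries along.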

Example code:

import pandas as pd
import numpy as np
import time

hourly_timeseries = pd.DataFrame(
    index=pd.date_range(
    pd.Timestamp(2020, 1, 1, 0, 0),
    pd.Timestamp(2050, 12, 31, 23, 30),
    freq="60min")
)
hourly_timeseries['value'] = np.random.rand(len(hourly_timeseries))
# Constraints imposed by wider problem:
# 1. each hourly_timeseries is unique
# 2. each hourly_timeseries is the same shape and has the same datetimeindex
# 3. a maximum of 10 timeseries can be grouped as columns in dataframe
start_time = time.perf_counter()
for num in range(100): # setting as 100 so it runs faster, this is 10,000+ in practice
    # resample by calendar year; newer pandas releases deprecate 'AS' in favour of the equivalent 'YS'
    yearly_timeseries_mean = hourly_timeseries.resample('AS').mean()
    yearly_timeseries_sum = hourly_timeseries.resample('AS').sum()
finish_time = time.perf_counter()
# elapsed time is finish minus start (the reverse order prints a negative duration)
print(f"Ran in {finish_time - start_time:0.4f} seconds")
>>> Ran in 3.0516 seconds

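Solutions I have explored:

I have made some speed improvements by aggregating multiple timeseries into one dataframe and resampling them at the same time. However, due to the set-up of the wider problem I am solving, I am limited to 10 timeseries in each dataframe. So the question still stands: is there a way to dramatically speed up resampling of timeseries data if you know the shape of the array will always be the same?
I also looked into using numba, but it does not make pandas functions any quicker.

Possible solutions that sound reasonable but that I could not find after researching:

Resample a 3D array of timeseries data with numpy
Cache the index that is being resampled and then somehow make every resample after the first one much faster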
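One way the "cache the index" idea could look in plain numpy is sketched below. It is only an illustration under stated assumptions: the index is sorted so each year is one contiguous block, all series share that index, and the series are stacked as columns of one array. The names `year_starts`, `year_lengths`, and `yearly_sum_and_mean` are hypothetical; `np.add.reduceat` is the only library call doing the aggregation.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2020-01-01 00:00", "2050-12-31 23:00", freq="60min")

# Computed once and reused: the row where each calendar year starts
# and the number of hourly rows in that year (8760 or 8784).
years = index.year.to_numpy()
year_starts = np.flatnonzero(np.r_[True, years[1:] != years[:-1]])
year_lengths = np.diff(np.r_[year_starts, len(years)])


def yearly_sum_and_mean(values):
    """values has shape (n_hours, n_series); rows are aligned with `index`."""
    # reduceat sums each contiguous block [start_i, start_{i+1}) along axis 0.
    sums = np.add.reduceat(values, year_starts, axis=0)
    means = sums / year_lengths[:, None]
    return sums, means


# Ten random series stacked as columns (hypothetical data).
rng = np.random.default_rng(0)
stacked = rng.random((len(index), 10))
sums, means = yearly_sum_and_mean(stacked)
print(sums.shape)  # (31, 10)
```

Since the boundaries depend only on the shared index, they can be reused for every one of the 10,000+ repetitions; whether this beats other approaches would need to be timed on the real data.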

Thanks for your help :)

dankal444 · Answer 1 · 2021-11-21T21:00:47

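As I wrote in the comments, I prepared the indices for each year and used them to calculate the yearly sums much faster.

Next, I removed the unnecessary second pass over the data for the mean, and instead calculate the mean as sum/length_of_indices for each year.

For N=1000 it is ~9x faster.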

import pandas as pd
import numpy as np
import time

hourly_timeseries = pd.DataFrame(
    index=pd.date_range(
    pd.Timestamp(2020, 1, 1, 0, 0),
    pd.Timestamp(2050, 12, 31, 23, 30),
    freq="60min")
)
hourly_timeseries['value'] = np.random.rand(len(hourly_timeseries))
# Constraints imposed by wider problem:
# 1. each hourly_timeseries is unique
# 2. each hourly_timeseries is the same shape and has the same datetimeindex
# 3. a maximum of 10 timeseries can be grouped as columns in dataframe
start_time = time.perf_counter()
for num in range(100): # setting as 100 so it runs faster, this is 10,000+ in practice
    yearly_timeseries_mean = hourly_timeseries.resample('AS').mean() # resample by calendar year
    yearly_timeseries_sum = hourly_timeseries.resample('AS').sum()
finish_time = time.perf_counter()
print(f"Ran in {finish_time - start_time:0.4f} seconds")


start_time = time.perf_counter()
events_years = hourly_timeseries.index.year
unique_years = np.sort(np.unique(events_years))
indices_per_year = [np.where(events_years == year)[0] for year in unique_years]
len_indices_per_year = np.array([len(year_indices) for year_indices in indices_per_year])
for num in range(100):  # setting as 100 so it runs faster, this is 10,000+ in practice
    temp = hourly_timeseries.values
    yearly_timeseries_sum2 = np.array([np.sum(temp[year_indices]) for year_indices in indices_per_year])
    yearly_timeseries_mean2 = yearly_timeseries_sum2 / len_indices_per_year

finish_time = time.perf_counter()
print(f"Ran in {finish_time - start_time:0.4f} seconds")
assert np.allclose(yearly_timeseries_sum.values.flatten(), yearly_timeseries_sum2)
assert np.allclose(yearly_timeseries_mean.values.flatten(), yearly_timeseries_mean2)

Ran in 0.9950 seconds
Ran in 0.1386 seconds
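Note that `np.sum(temp[year_indices])` has no `axis` argument, so it adds up every column together and only matches `resample` for a single-column frame. A possible variant for frames with several columns (up to 10 under the question's constraints) is sketched below. The helper name `make_yearly_aggregator` is hypothetical, and the sketch assumes every frame shares the same index.

```python
import numpy as np
import pandas as pd


def make_yearly_aggregator(index):
    """Precompute per-year row positions once; return a reusable aggregator."""
    years = index.year.to_numpy()
    unique_years = np.unique(years)  # np.unique returns sorted values
    indices_per_year = [np.flatnonzero(years == y) for y in unique_years]
    counts = np.array([len(ix) for ix in indices_per_year])

    def aggregate(frame):
        values = frame.to_numpy()
        # axis=0 keeps one sum per column instead of collapsing all columns.
        sums = np.array([values[ix].sum(axis=0) for ix in indices_per_year])
        means = sums / counts[:, None]
        as_frame = lambda a: pd.DataFrame(a, index=unique_years, columns=frame.columns)
        return as_frame(sums), as_frame(means)

    return aggregate


index = pd.date_range("2020-01-01 00:00", "2050-12-31 23:00", freq="60min")
aggregate = make_yearly_aggregator(index)

# Three random series as columns (hypothetical data).
frame = pd.DataFrame(np.random.rand(len(index), 3), index=index, columns=["a", "b", "c"])
yearly_sum, yearly_mean = aggregate(frame)
print(yearly_sum.shape)  # (31, 3)
```

The result is indexed by integer year rather than by year-start timestamp, which differs from what `resample` returns.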
