I have a folder with several CSV files. Here is an example of the dataframes built from the CSV files in the directory:
```python
import pandas as pd

d1 = {'timestamp': ['2013-01-30', '2015-02-29', '2014-03-25', '2016-01-01',
                    '2018-02-20', '2012-05-05', '2018-02-04'],
      'site': ['plus.google.com', 'vk.com', 'yandex.ru', 'plus.google.com',
               'vk.com', 'oracle.com', 'oracle.com']}
df1 = pd.DataFrame(data=d1)

d2 = {'timestamp': ['2013-01-30', '2015-02-29', '2014-03-25', '2016-01-01',
                    '2018-02-20'],
      'site': ['plus.google.com', 'meduza.ru', 'yandex.ru', 'google.com',
               'meduza.ru']}
df2 = pd.DataFrame(data=d2)
```
I need to make a function that accepts the path to the file directory and returns one site-frequency dictionary for all sites in the directory, keyed by unique site name, of the following kind: `{'site_string': (site_id, site_freq)}`. For our example it would be:

```python
{'vk.com': (1, 2), 'plus.google.com': (2, 3), 'yandex.ru': (3, 2),
 'meduza.ru': (4, 2), 'oracle.com': (5, 2), 'google.com': (6, 1)}
```
I tried to apply value_counts() to every dataframe, made dictionaries from them, and tried to merge the dicts, but the counts for duplicate keys are overwritten that way. How can I solve this issue? What should I do?
```python
from glob import glob


def prepare_train_set(path_to_csv_files):
    frequency = {}
    for filename in glob(f'{path_to_csv_files}/*'):
        sub_iterationed_df = pd.read_csv(filename)
        value_counts_dict = dict(sub_iterationed_df["site"].value_counts())
        # dict.update() overwrites existing keys instead of summing counts
        frequency.update(value_counts_dict)
    return frequency
```
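For example, `dict.update()` replaces the count that came from an earlier file instead of adding to it:

```python
frequency = {'plus.google.com': 2}        # count from the first file
frequency.update({'plus.google.com': 1})  # count from the second file
print(frequency)  # {'plus.google.com': 1} -- the earlier count is lost
```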
I also tried to make lists from the keys and values of the value_counts() dictionary and then build a dictionary with the zip function, but I get the error "list assignment index out of range". Why does this error occur and how can I get around it?
```python
def CheckForDuplicates(keys_list, values_list):
    keys_list = list(value_counts_dict.keys())
    values_list = list(value_counts_dict.values())
    keys_list_constant = keys_list[:]
    values_list_constant = values_list[:]
    for i in range(len(keys_list_constant)):  # range is fixed at the original length
        checking_dup_keys_list = keys_list[:i]
        checking_dup_values_list = values_list[:i]
        key_value = keys_list_constant[i]
        if key_value in checking_dup_keys_list:
            duplicate_index = checking_dup_keys_list.index(key_value)
            values_list[duplicate_index] = values_list[duplicate_index] + values_list_constant[i]
            # deleting while iterating shrinks the lists, so a later
            # del with the original index i can run past the end
            del values_list[i]
            del keys_list[i]
    return (keys_list, values_list)


CheckForDuplicates(keys_list, values_list)
```
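A stripped-down version of the same delete-while-iterating pattern reproduces the error:

```python
lst = [10, 20, 30]
for i in range(len(lst)):  # range(3) is fixed, but the list shrinks below i
    del lst[i]
# i=0 removes 10 -> [20, 30]; i=1 removes 30 -> [20];
# i=2 raises IndexError: list assignment index out of range
```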
Answer
You could use a Counter instead of a plain dictionary:
```python
from collections import Counter
from glob import glob

import pandas as pd


def prepare_train_set(path_to_csv_files):
    frequency = Counter()
    for filename in glob(f'{path_to_csv_files}/*'):
        sub_iterationed_df = pd.read_csv(filename)
        value_counts_dict = sub_iterationed_df['site'].value_counts().to_dict()
        # Counter.update() adds the counts instead of replacing them
        frequency.update(value_counts_dict)
    return frequency
```
From the docs:
> `update([iterable-or-mapping])`
>
> Elements are counted from an iterable or added-in from another mapping (or counter). Like `dict.update()` but adds counts instead of replacing them.
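The expected output in the question also carries a `site_id`. One way to derive it from the `Counter` (a sketch that assumes ids can simply be assigned in descending-frequency order via `most_common()`; the exact numbering in the question may differ):

```python
freq = prepare_train_set('path/to/csv_files')  # hypothetical directory

# Assumption: ids 1..N are assigned in descending-frequency order.
sites_dict = {site: (site_id, count)
              for site_id, (site, count) in enumerate(freq.most_common(), start=1)}
```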
Or concatenate all the dataframes and take `.value_counts()` afterwards.
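A minimal sketch of that alternative, under the same assumption that every file in the directory is a CSV with a `site` column:

```python
from glob import glob

import pandas as pd


def prepare_train_set(path_to_csv_files):
    # Stack the 'site' columns from every file, then count once
    all_sites = pd.concat(pd.read_csv(filename)['site']
                          for filename in glob(f'{path_to_csv_files}/*'))
    return all_sites.value_counts().to_dict()
```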