
Appending changing variables to a pandas df under multithreading: problems creating the initial index for the df, and is pd the correct tool for this?

I need to:

  1. Create a df that looks like this:
items  y  y  y
item   z  z  z
item   z  z  z
item   z  z  z

The first column is named ['items'] for convenience, because the rows created under this custom index are based on the changing variable item. Each new item value is passed into the items column to create a new row, and the variable's value is later used as a selector for specific rows. The variables y and z will be constantly changing as well, and appended to the df.

The z values might be NaN for, say, the range y to y50 or so, and only hold real values from roughly y50 to y5000.

For that reason, please let me know if this is even the proper tool for the job! I have read that unevenly structured data can sometimes cause problems: why is that? What I am trying to accomplish is strictly a "triangular" sort of data structure, and I chose pd because of the ease of later analysis of the data obtained (and because I like it). Most importantly: if this is doable in pd, this data structure would satisfy all operations I will perform on the data later on.
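For illustration, a minimal sketch of what that triangular shape looks like in pandas, using made-up item and y labels: missing cells simply hold NaN, which is why unevenly filled rows are representable, even if repeated appends are slow.

    import pandas as pd

    # Made-up triangular data: each item has values for a different
    # number of y columns; pandas fills the missing cells with NaN.
    rows = {
        "item_1": {"y1": 0.1},
        "item_2": {"y1": 0.2, "y2": 0.4},
        "item_3": {"y1": 0.3, "y2": 0.6, "y3": 0.9},
    }

    df = pd.DataFrame.from_dict(rows, orient="index")
    df.index.name = "items"
    print(df)
    #          y1   y2   y3
    # items
    # item_1  0.1  NaN  NaN
    # item_2  0.2  0.4  NaN
    # item_3  0.3  0.6  0.9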

Questions:

  • how do I make sure the 500th value of the item variable is appended exactly under the initial index column named 'items', creating a 500th row whose "index" value is item? It also has to be filled with a numeric 0 or NaN up until the point where it first has a value, say in the 501st column named after consecutive values of y. Think of the y columns as TIME (I know, usually people map it out on the y axis). (A sketch addressing this follows after the next question below.)

I get the feeling it's really easy, but I need assistance here, as the whole project could fail if this is not set up properly from the beginning.

So, naturally:

  • how to create this properly for this task so that the appending works seamlessly?
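One common pandas idiom that may fit here (a sketch only, with hypothetical item, y, and z values): assigning through df.loc to a row or column label that does not exist yet enlarges the frame and fills every other cell with NaN automatically.

    import pandas as pd

    df = pd.DataFrame(columns=['y1', 'y2'])
    df.index.name = 'items'

    # Assigning through .loc to labels that do not exist yet
    # enlarges the frame; every unassigned cell becomes NaN.
    item, y, z = 'item_500', 'y501', 3.14  # made-up values
    df.loc[item, y] = z

    print(df)
    #            y1   y2  y501
    # items
    # item_500  NaN  NaN  3.14

If 0 is preferred over NaN, df.fillna(0) can be applied before analysis.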

Basically the 1st main thread picks up data, checks it against certain criteria, and only then passes it to the dataframe's endless loop, which continuously scans the df to find any anomalies in the data. I want to make this df the central point of this project, and I don't know if that's a good idea; please enlighten me.
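If a threaded handoff like this is kept, one safe pattern, sketched below with made-up producer data, is to let a single thread own the df and feed it through a queue.Queue, since pandas makes no thread-safety guarantees of its own.

    import queue
    import threading

    import pandas as pd

    q = queue.Queue()

    def producer():
        # Hypothetical stand-in for the main thread that picks up
        # data and checks it against the criteria.
        for i in range(3):
            q.put((f'item_{i}', f'y{i}', float(i)))
        q.put(None)  # sentinel: no more data

    def consumer():
        # The only thread that ever touches the DataFrame.
        df = pd.DataFrame()
        df.index.name = 'items'
        while True:
            msg = q.get()
            if msg is None:
                break
            item, y, z = msg
            df.loc[item, y] = z
            # ... anomaly scan over df would go here ...
        print(df)

    t1 = threading.Thread(target=producer)
    t2 = threading.Thread(target=consumer)
    t1.start(); t2.start()
    t1.join(); t2.join()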

EDIT: made some progress

What I did so far:

    df_buffer = pd.DataFrame({'items': [f'{item}']})
    df_buffer = df_buffer.insert(loc=1, column=[f'{y}'], value=int)

which raises:

    TypeError: unhashable type: 'list'

and with (...) column=f'{y}' (...) instead, the output is None.
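For what it's worth, a sketch of how both errors can be avoided, assuming item and y are plain scalars: insert() needs a hashable column label (a string, not a one-element list), it returns None because it modifies the frame in place, and value should be an actual fill value such as 0 rather than the type int.

    import pandas as pd

    item, y = 'item_1', 42  # made-up example values

    df_buffer = pd.DataFrame({'items': [f'{item}']})

    # insert() wants a hashable (e.g. string) column label, not a
    # list (hence the "unhashable type" error), and it works in
    # place, returning None, so its result must not be assigned
    # back to df_buffer.
    df_buffer.insert(loc=1, column=f'{y}', value=0)

    print(df_buffer)
    #     items  42
    # 0  item_1   0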

I'd appreciate any help.


Answer

Dataframes are not good for multithreaded operations. I would suggest using Python's built-in data structures for this. You can probably use a dictionary of dictionaries and perform operations/updates on that dictionary, and then, for analytical purposes, use the dictionary to create a dataframe.

    df_dict = {
        "item_1": {"x": x_value, "y": y_value, "z": z_value},
        "item_2": {"x": x_value, "y": y_value, "z": z_value}
    }

This answer explains how to make dataframes from a dictionary of dictionaries.
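To make that concrete, here is a minimal sketch (with made-up items and values) of guarding the dictionary with a threading.Lock during updates and building the dataframe only once, when analysis is needed; from_dict with orient='index' turns each inner dict into one row.

    import threading

    import pandas as pd

    df_dict = {}
    lock = threading.Lock()

    def update(item, col, value):
        # Plain dicts are cheap to mutate; the lock keeps updates
        # from different threads from interleaving.
        with lock:
            df_dict.setdefault(item, {})[col] = value

    update('item_1', 'y1', 0.1)  # made-up values
    update('item_2', 'y1', 0.2)
    update('item_2', 'y2', 0.4)

    # Build the DataFrame once, only for analysis.
    df = pd.DataFrame.from_dict(df_dict, orient='index')
    print(df)
    #          y1   y2
    # item_1  0.1  NaN
    # item_2  0.2  0.4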
