String modification and sampling change

I have this table:

ID  points  values (x1;y1|x2;y2|x3;y3|x4;y4|...)
1   8       0,5;1|1;1,5|4;6|5;7|6;9|8;10|10;12|15;18
2   4       20;30|21;32|22;36|25;37
3   306     1;2|3;6|7;9|10;17|11;18|13;22|14;25|19;26|..

The points column gives the number of points in each row. For example, 306 means 306 x values and 306 y values.

My overall goal is to halve the sampling density while keeping the start and end points: when I have 8 points, I want 4 points, and when I have 306 points, I want 153 points.

I started like this:

df['values']=df['values'].str.replace('|', ';')
df['values'] = df['values'].str.split(';',expand=True)
ID  points  values (x1;y1|x2;y2|x3;y3|x4;y4|...)      1    2    3    4    5    6    7    8    9    10   11   12   13   14   15   16
1   8       0,5;1|1;1,5|4;6|5;7|6;9|8;10|10;12|15;18  0,5  1    1    1,5  4    6    5    7    6    9    8    10   10   12   15   18
..  ..      ..                                        ..   ..   ..   ..   ..   ..   ..   ..   ..   ..   ..   ..   ..   ..   ..   ..

But I would like, as I wrote above, half as many samples and columns named as follows:

ID  points  values (x1;y1|x2;y2|x3;y3|x4;y4|...)      x1   y1   x2   y2   x3   y3   x4   y4
1   8       0,5;1|1;1,5|4;6|5;7|6;9|8;10|10;12|15;18  0,5  1    4    6    8    10   15   18
..  ..      ..                                        ..   ..   ..   ..   ..   ..   ..   ..

Answer

It seems wasteful to create so many new dataframe columns, when many of the cells will be empty, and there is no relation between the values in any given column. More naturally, you could store each sample of points as a list containing pairs, all within one new column of the dataframe.

To obtain the point lists, you can rewrite each values string into Python's list-of-tuples syntax and then pass it to eval(), provided you can trust the data source to contain no malicious code.
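If the data source cannot be trusted, the same parsing can be done without eval(). The following is only a sketch: parse_pairs is a hypothetical helper name, and it assumes every string has exactly the x;y|x;y layout with decimal commas shown in the question. Unlike the eval() route, it always returns floats.

```python
def parse_pairs(s):
    """Parse 'x1;y1|x2;y2|...' (decimal commas) into a list of (x, y) float tuples."""
    pairs = []
    for chunk in s.split('|'):
        # Each chunk looks like '0,5;1': split into x and y, then fix the decimal comma.
        x, y = chunk.split(';')
        pairs.append((float(x.replace(',', '.')), float(y.replace(',', '.'))))
    return pairs


print(parse_pairs('0,5;1|1;1,5|4;6'))
# [(0.5, 1.0), (1.0, 1.5), (4.0, 6.0)]
```

A malformed chunk raises a ValueError here instead of being executed, which is usually the preferable failure mode.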

The sampling can then be done with Python’s slicing syntax, although it’s a bit tricky, because you want to include the first and last values.
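To see which positions such a slice keeps, it helps to run it on plain indices first. The sketch below uses the same slice expression as the list_sample function further down; sample_indices is a hypothetical name used only for this illustration.

```python
from math import ceil


def sample_indices(n):
    """Return the indices kept from a list of length n."""
    idx = list(range(n))
    mid = ceil(n / 2)
    front = idx[:mid:2]          # every other index, walking forwards from the start
    back = idx[-1:mid - 1:-2]    # every other index, walking backwards from the end
    return front + back[::-1]    # put the back half into ascending order again


print(sample_indices(4))   # [0, 3]
print(sample_indices(8))   # [0, 2, 5, 7]
print(sample_indices(10))  # [0, 2, 4, 5, 7, 9]
```

The first and last index are always kept, but the count is only roughly half: 8 points give 4, while 10 points give 6, and 306 points give 154 rather than 153.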

The above transformations can be defined as a function, so that you can easily apply them to each string in the values column:

import pandas as pd
from math import ceil

df = pd.DataFrame({'ID': [1, 2, 3],
                   'points': [8, 4, 306],
                   'values': ['0,5;1|1;1,5|4;6|5;7|6;9|8;10|10;12|15;18',
                              '20;30|21;32|22;36|25;37',
                              '1;2|3;6|7;9|10;17|11;18|13;22|14;25|19;26']})


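def list_sample(s):
    """
    Convert string s to a list of value pairs
    and return roughly every other pair, always keeping
    the first and last pair (so the middle may have
    no gap or a double gap).
    """
    # Rewrite '0,5;1|1;1,5' as '[(0.5,1), (1,1.5)]' so that Python can parse it.
    pair_string = '[(' + s.replace(',', '.').replace(
        ';', ',').replace('|', '), (') + ')]'
    # eval() executes arbitrary code: only use it on trusted data.
    pair_list = eval(pair_string)
    mid = ceil(len(pair_list) / 2)
    # Every other pair forwards from the start, plus every other pair backwards from the end.
    return pair_list[:mid:2] + list(reversed(pair_list[-1:(mid-1):-2]))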


df['sample'] = df['values'].apply(list_sample)
df
  ID points values                                    sample
0 1  8      0,5;1|1;1,5|4;6|5;7|6;9|8;10|10;12|15;18  [(0.5, 1), (4, 6), (8, 10), (15, 18)]
1 2  4      20;30|21;32|22;36|25;37                   [(20, 30), (25, 37)]
2 3  306    1;2|3;6|7;9|10;17|11;18|13;22|14;25|19;26 [(1, 2), (7, 9), (13, 22), (19, 26)]
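If you do need the wide x1, y1, x2, y2, ... layout from the question, each list of pairs can be flattened into a dict keyed by column name. This is a sketch only; flatten is a hypothetical helper, not part of pandas.

```python
def flatten(pairs):
    """Turn [(x1, y1), (x2, y2), ...] into {'x1': x1, 'y1': y1, 'x2': x2, ...}."""
    row = {}
    for i, (x, y) in enumerate(pairs, start=1):
        row[f'x{i}'] = x
        row[f'y{i}'] = y
    return row


print(flatten([(0.5, 1), (4, 6), (8, 10), (15, 18)]))
# {'x1': 0.5, 'y1': 1, 'x2': 4, 'y2': 6, 'x3': 8, 'y3': 10, 'x4': 15, 'y4': 18}
```

Building a frame from these dicts, for example with pd.DataFrame([flatten(p) for p in df['sample']], index=df.index), and joining it to df should give one column per coordinate, with NaN wherever a row has fewer points than the longest one.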

User contributions licensed under: CC BY-SA