Skip to content
Advertisement

Issue with conversion of text data into a dataframe

I have a text file where I have several lines and between them, some data which I need to convert to the dataframe(useful data).

I iterated the text file line by line and captured the useful data with the help of a regex.

Something like this,

pattern = r'^(s)(d+)(s+)(d)(s+)(w+)(s+)(w+)(.*)'
capture_data = []

with open(file,'r') as file_obj:
            lineList = file_obj.readlines()
            for line in lineList:
                info_list = re.search(pattern, line)
                if info_list is not None:
                    capture_data.append(line)

The data captured look like this

' 100         0    PASS     Continuity_PPMU_mV        XSCI         140       -1.0000 V      -427.9508 mV   -300.0000 mV   -100.0000 uA   0         n'

' 100         1    PASS     Continuity_PPMU_mV        XSCI         12        -1.0000 V      -430.3089 mV   -300.0000 mV   -100.0000 uA   0         n'

I thought to iterate each captured row and split on the basis of whitespace, but the issue is, the units have white space in between them and the value, for example….

-300.0000 mV, -100.0000 uA etc

Also one more issue is the trailing newline character, it is also been treated as a new element in the .split(” “).

Can someone please help to find some smarter way to do this?

All I want is to have the values as a separate column value.

For example in the first string,

100 becomes 1st col, 0 – 2nd, PASS – 3rd, Continuity_PPMU_mV – 4th, etc…

Thanks.

Edit:

The raw data somewhat look like this –

Site Number:    
     0,  1,  2,  3

Device#: 1-4

********************************************************************* 
FT45434HAP PQF64 Test @ RHC 
********************************************************************* 
---------------------------Continuity Test--------------------------- 

 Number     Site  Result   Test Name                 Pin          Channel   Low            Measured       High           Force          Loc
 100         0    PASS     Continuity_PPMU_mV        XSCI         140       -1.0000 V      -427.9508 mV   -300.0000 mV   -100.0000 uA   0         
 100         1    PASS     Continuity_PPMU_mV        XSCI         12        -1.0000 V      -430.3089 mV   -300.0000 mV   -100.0000 uA   0         
 100         2    PASS     Continuity_PPMU_mV        XSCI         76        -1.0000 V      -430.7492 mV   -300.0000 mV   -100.0000 uA   0         
 100         3    PASS     Continuity_PPMU_mV        XSCI         204       -1.0000 V      -431.0482 mV   -300.0000 mV   -100.0000 uA   0         
 101         0    PASS     Continuity_PPMU_mV        XSCO         139       -1.0000 V      -456.0359 mV   -300.0000 mV   -100.0000 uA   0         
 101         1    PASS     Continuity_PPMU_mV        XSCO         11        -1.0000 V      -458.0605 mV   -300.0000 mV   -100.0000 uA   0         
 101         2    PASS     Continuity_PPMU_mV        XSCO         75        -1.0000 V      -457.8564 mV   -300.0000 mV   -100.0000 uA   0         

EDIT:

The top rows are not fixed, they are dynamically generated. Also, some other text data can appear in between the relevant data, like between two useful rows. So, I don’t think skipping rows will work here.

Advertisement

Answer

  • Read the file and look for the row the starts with 'Number', and then append those rows after that to data.
  • In the data rows, only the units are separated by a space.
  • It’s better to have the unit separate from the numeric value, so we can split the rows on spaces.
  • Create a new header, with new columns for the units.
  • This will allow the numeric values to be interpreted as floats.
import pandas as pd
import seaborn as sns

# read the file in
data = list()
with open('test.txt', 'r') as f:
    rows = f.readlines()
    flag = False  # flag to True once the header row with Number is found
    for row in rows:
        row = row.strip()
        if row.startswith('Number'):
            flag = True
            continue  # after the header row is found, skip it
        if flag:
            data.append(row.split())  # append rows after the header to data

# create a custom header where the unites have been added as column headers
header = ['Number', 'Site', 'Result', 'Test_Name', 'Pin', 'Channel', 'Low', 'U1', 'Measured', 'U2', 'High', 'U3', 'Force', 'U4', 'Loc']

# create the dataframe
df = pd.DataFrame(data, columns=header)

# save to csv
df.to_csv('file.csv', index=False)

# convert columns to numeric dtypes
df = df.apply(pd.to_numeric, errors='ignore')

# scale the columns as per their units
df.Measured = df.Measured.div(1000)
df.High = df.High.div(1000)
df.Force = df.Force.div(100000)

# display(df)
   Number  Site Result           Test_Name   Pin  Channel  Low U1  Measured  U2  High  U3  Force  U4  Loc
0     100     0   PASS  Continuity_PPMU_mV  XSCI      140 -1.0  V -0.427951  mV  -0.3  mV -0.001  uA    0
1     100     1   PASS  Continuity_PPMU_mV  XSCI       12 -1.0  V -0.430309  mV  -0.3  mV -0.001  uA    0
2     100     2   PASS  Continuity_PPMU_mV  XSCI       76 -1.0  V -0.430749  mV  -0.3  mV -0.001  uA    0
3     100     3   PASS  Continuity_PPMU_mV  XSCI      204 -1.0  V -0.431048  mV  -0.3  mV -0.001  uA    0
4     101     0   PASS  Continuity_PPMU_mV  XSCO      139 -1.0  V -0.456036  mV  -0.3  mV -0.001  uA    0
5     101     1   PASS  Continuity_PPMU_mV  XSCO       11 -1.0  V -0.458060  mV  -0.3  mV -0.001  uA    0
6     101     2   PASS  Continuity_PPMU_mV  XSCO       75 -1.0  V -0.457856  mV  -0.3  mV -0.001  uA    0

# plot
ax = sns.lineplot(data=df.iloc[:, 6:-2])
ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left')

enter image description here

Advertisement