Issue with conversion of text data into a dataframe

I have a text file where I have several lines and between them, some data which I need to convert to the dataframe(useful data).

I iterated the text file line by line and captured the useful data with the help of a regex.

Something like this,

pattern = r'^(s)(d+)(s+)(d)(s+)(w+)(s+)(w+)(.*)'
capture_data = []

with open(file,'r') as file_obj:
            lineList = file_obj.readlines()
            for line in lineList:
                info_list = re.search(pattern, line)
                if info_list is not None:
                    capture_data.append(line)

JavaScript
​x
 
pattern = r'^(s)(d+)(s+)(d)(s+)(w+)(s+)(w+)(.*)'
capture_data = []
​
with open(file,'r') as file_obj:
            lineList = file_obj.readlines()
            for line in lineList:
                info_list = re.search(pattern, line)
                if info_list is not None:
                    capture_data.append(line)
​

The data captured look like this

' 100         0    PASS     Continuity_PPMU_mV        XSCI         140       -1.0000 V      -427.9508 mV   -300.0000 mV   -100.0000 uA   0         n'

' 100         1    PASS     Continuity_PPMU_mV        XSCI         12        -1.0000 V      -430.3089 mV   -300.0000 mV   -100.0000 uA   0         n'

JavaScript
 
' 100         0    PASS     Continuity_PPMU_mV        XSCI         140       -1.0000 V      -427.9508 mV   -300.0000 mV   -100.0000 uA   0         n'
​
' 100         1    PASS     Continuity_PPMU_mV        XSCI         12        -1.0000 V      -430.3089 mV   -300.0000 mV   -100.0000 uA   0         n'
​

I thought to iterate each captured row and split on the basis of whitespace, but the issue is, the units have white space in between them and the value, for example….

-300.0000 mV, -100.0000 uA etc

Also one more issue is the trailing newline character, it is also been treated as a new element in the .split(” “).

Can someone please help to find some smarter way to do this?

All I want is to have the values as a separate column value.

For example in the first string,

100 becomes 1st col, 0 – 2nd, PASS – 3rd, Continuity_PPMU_mV – 4th, etc…

Thanks.

Edit:

The raw data somewhat look like this –

Site Number:    
     0,  1,  2,  3

Device#: 1-4

********************************************************************* 
FT45434HAP PQF64 Test @ RHC 
********************************************************************* 
---------------------------Continuity Test--------------------------- 

 Number     Site  Result   Test Name                 Pin          Channel   Low            Measured       High           Force          Loc
 100         0    PASS     Continuity_PPMU_mV        XSCI         140       -1.0000 V      -427.9508 mV   -300.0000 mV   -100.0000 uA   0         
 100         1    PASS     Continuity_PPMU_mV        XSCI         12        -1.0000 V      -430.3089 mV   -300.0000 mV   -100.0000 uA   0         
 100         2    PASS     Continuity_PPMU_mV        XSCI         76        -1.0000 V      -430.7492 mV   -300.0000 mV   -100.0000 uA   0         
 100         3    PASS     Continuity_PPMU_mV        XSCI         204       -1.0000 V      -431.0482 mV   -300.0000 mV   -100.0000 uA   0         
 101         0    PASS     Continuity_PPMU_mV        XSCO         139       -1.0000 V      -456.0359 mV   -300.0000 mV   -100.0000 uA   0         
 101         1    PASS     Continuity_PPMU_mV        XSCO         11        -1.0000 V      -458.0605 mV   -300.0000 mV   -100.0000 uA   0         
 101         2    PASS     Continuity_PPMU_mV        XSCO         75        -1.0000 V      -457.8564 mV   -300.0000 mV   -100.0000 uA   0

JavaScript
 
Site Number:    
     0,  1,  2,  3
​
Device#: 1-4
​
********************************************************************* 
FT45434HAP PQF64 Test @ RHC 
********************************************************************* 
---------------------------Continuity Test--------------------------- 
​
 Number     Site  Result   Test Name                 Pin          Channel   Low            Measured       High           Force          Loc
 100         0    PASS     Continuity_PPMU_mV        XSCI         140       -1.0000 V      -427.9508 mV   -300.0000 mV   -100.0000 uA   0         
 100         1    PASS     Continuity_PPMU_mV        XSCI         12        -1.0000 V      -430.3089 mV   -300.0000 mV   -100.0000 uA   0         
 100         2    PASS     Continuity_PPMU_mV        XSCI         76        -1.0000 V      -430.7492 mV   -300.0000 mV   -100.0000 uA   0         
 100         3    PASS     Continuity_PPMU_mV        XSCI         204       -1.0000 V      -431.0482 mV   -300.0000 mV   -100.0000 uA   0         
 101         0    PASS     Continuity_PPMU_mV        XSCO         139       -1.0000 V      -456.0359 mV   -300.0000 mV   -100.0000 uA   0         
 101         1    PASS     Continuity_PPMU_mV        XSCO         11        -1.0000 V      -458.0605 mV   -300.0000 mV   -100.0000 uA   0         
 101         2    PASS     Continuity_PPMU_mV        XSCO         75        -1.0000 V      -457.8564 mV   -300.0000 mV   -100.0000 uA   0         
​

EDIT:

The top rows are not fixed, they are dynamically generated. Also, some other text data can appear in between the relevant data, like between two useful rows. So, I don’t think skipping rows will work here.

Answer

Read the file and look for the row the starts with 'Number', and then append those rows after that to data.
In the data rows, only the units are separated by a space.
It’s better to have the unit separate from the numeric value, so we can split the rows on spaces.
Create a new header, with new columns for the units.
This will allow the numeric values to be interpreted as floats.

import pandas as pd
import seaborn as sns

# read the file in
data = list()
with open('test.txt', 'r') as f:
    rows = f.readlines()
    flag = False  # flag to True once the header row with Number is found
    for row in rows:
        row = row.strip()
        if row.startswith('Number'):
            flag = True
            continue  # after the header row is found, skip it
        if flag:
            data.append(row.split())  # append rows after the header to data

# create a custom header where the unites have been added as column headers
header = ['Number', 'Site', 'Result', 'Test_Name', 'Pin', 'Channel', 'Low', 'U1', 'Measured', 'U2', 'High', 'U3', 'Force', 'U4', 'Loc']

# create the dataframe
df = pd.DataFrame(data, columns=header)

# save to csv
df.to_csv('file.csv', index=False)

# convert columns to numeric dtypes
df = df.apply(pd.to_numeric, errors='ignore')

# scale the columns as per their units
df.Measured = df.Measured.div(1000)
df.High = df.High.div(1000)
df.Force = df.Force.div(100000)

# display(df)
   Number  Site Result           Test_Name   Pin  Channel  Low U1  Measured  U2  High  U3  Force  U4  Loc
0     100     0   PASS  Continuity_PPMU_mV  XSCI      140 -1.0  V -0.427951  mV  -0.3  mV -0.001  uA    0
1     100     1   PASS  Continuity_PPMU_mV  XSCI       12 -1.0  V -0.430309  mV  -0.3  mV -0.001  uA    0
2     100     2   PASS  Continuity_PPMU_mV  XSCI       76 -1.0  V -0.430749  mV  -0.3  mV -0.001  uA    0
3     100     3   PASS  Continuity_PPMU_mV  XSCI      204 -1.0  V -0.431048  mV  -0.3  mV -0.001  uA    0
4     101     0   PASS  Continuity_PPMU_mV  XSCO      139 -1.0  V -0.456036  mV  -0.3  mV -0.001  uA    0
5     101     1   PASS  Continuity_PPMU_mV  XSCO       11 -1.0  V -0.458060  mV  -0.3  mV -0.001  uA    0
6     101     2   PASS  Continuity_PPMU_mV  XSCO       75 -1.0  V -0.457856  mV  -0.3  mV -0.001  uA    0

# plot
ax = sns.lineplot(data=df.iloc[:, 6:-2])
ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left')

JavaScript
 
import pandas as pd
import seaborn as sns
​
# read the file in
data = list()
with open('test.txt', 'r') as f:
    rows = f.readlines()
    flag = False  # flag to True once the header row with Number is found
    for row in rows:
        row = row.strip()
        if row.startswith('Number'):
            flag = True
            continue  # after the header row is found, skip it
        if flag:
            data.append(row.split())  # append rows after the header to data
​
# create a custom header where the unites have been added as column headers
header = ['Number', 'Site', 'Result', 'Test_Name', 'Pin', 'Channel', 'Low', 'U1', 'Measured', 'U2', 'High', 'U3', 'Force', 'U4', 'Loc']
​
# create the dataframe
df = pd.DataFrame(data, columns=header)
​
# save to csv
df.to_csv('file.csv', index=False)
​
# convert columns to numeric dtypes
df = df.apply(pd.to_numeric, errors='ignore')
​
# scale the columns as per their units
df.Measured = df.Measured.div(1000)
df.High = df.High.div(1000)
df.Force = df.Force.div(100000)
​
# display(df)
   Number  Site Result           Test_Name   Pin  Channel  Low U1  Measured  U2  High  U3  Force  U4  Loc
0     100     0   PASS  Continuity_PPMU_mV  XSCI      140 -1.0  V -0.427951  mV  -0.3  mV -0.001  uA    0
1     100     1   PASS  Continuity_PPMU_mV  XSCI       12 -1.0  V -0.430309  mV  -0.3  mV -0.001  uA    0
2     100     2   PASS  Continuity_PPMU_mV  XSCI       76 -1.0  V -0.430749  mV  -0.3  mV -0.001  uA    0
3     100     3   PASS  Continuity_PPMU_mV  XSCI      204 -1.0  V -0.431048  mV  -0.3  mV -0.001  uA    0
4     101     0   PASS  Continuity_PPMU_mV  XSCO      139 -1.0  V -0.456036  mV  -0.3  mV -0.001  uA    0
5     101     1   PASS  Continuity_PPMU_mV  XSCO       11 -1.0  V -0.458060  mV  -0.3  mV -0.001  uA    0
6     101     2   PASS  Continuity_PPMU_mV  XSCO       75 -1.0  V -0.457856  mV  -0.3  mV -0.001  uA    0
​
# plot
ax = sns.lineplot(data=df.iloc[:, 6:-2])
ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
​

Advertisement

Answer