I have a text file where I have several lines and between them, some data which I need to convert to the dataframe(useful data).
I iterated the text file line by line and captured the useful data with the help of a regex.
Something like this,
pattern = r'^(s)(d+)(s+)(d)(s+)(w+)(s+)(w+)(.*)'
capture_data = []
with open(file,'r') as file_obj:
            lineList = file_obj.readlines()
            for line in lineList:
                info_list = re.search(pattern, line)
                if info_list is not None:
                    capture_data.append(line)
The data captured look like this
' 100 0 PASS Continuity_PPMU_mV XSCI 140 -1.0000 V -427.9508 mV -300.0000 mV -100.0000 uA 0 n' ' 100 1 PASS Continuity_PPMU_mV XSCI 12 -1.0000 V -430.3089 mV -300.0000 mV -100.0000 uA 0 n'
I thought to iterate each captured row and split on the basis of whitespace, but the issue is, the units have white space in between them and the value, for example….
-300.0000 mV, -100.0000 uA etc
Also one more issue is the trailing newline character, it is also been treated as a new element in the .split(” “).
Can someone please help to find some smarter way to do this?
All I want is to have the values as a separate column value.
For example in the first string,
100 becomes 1st col, 0 – 2nd, PASS – 3rd, Continuity_PPMU_mV – 4th, etc…
Thanks.
Edit:
The raw data somewhat look like this –
Site Number:    
     0,  1,  2,  3
Device#: 1-4
********************************************************************* 
FT45434HAP PQF64 Test @ RHC 
********************************************************************* 
---------------------------Continuity Test--------------------------- 
 Number     Site  Result   Test Name                 Pin          Channel   Low            Measured       High           Force          Loc
 100         0    PASS     Continuity_PPMU_mV        XSCI         140       -1.0000 V      -427.9508 mV   -300.0000 mV   -100.0000 uA   0         
 100         1    PASS     Continuity_PPMU_mV        XSCI         12        -1.0000 V      -430.3089 mV   -300.0000 mV   -100.0000 uA   0         
 100         2    PASS     Continuity_PPMU_mV        XSCI         76        -1.0000 V      -430.7492 mV   -300.0000 mV   -100.0000 uA   0         
 100         3    PASS     Continuity_PPMU_mV        XSCI         204       -1.0000 V      -431.0482 mV   -300.0000 mV   -100.0000 uA   0         
 101         0    PASS     Continuity_PPMU_mV        XSCO         139       -1.0000 V      -456.0359 mV   -300.0000 mV   -100.0000 uA   0         
 101         1    PASS     Continuity_PPMU_mV        XSCO         11        -1.0000 V      -458.0605 mV   -300.0000 mV   -100.0000 uA   0         
 101         2    PASS     Continuity_PPMU_mV        XSCO         75        -1.0000 V      -457.8564 mV   -300.0000 mV   -100.0000 uA   0         
EDIT:
The top rows are not fixed, they are dynamically generated. Also, some other text data can appear in between the relevant data, like between two useful rows. So, I don’t think skipping rows will work here.
Advertisement
Answer
- Read the file and look for the row the starts with 'Number', and then append those rows after that todata.
- In the data rows, only the units are separated by a space.
- It’s better to have the unit separate from the numeric value, so we can split the rows on spaces.
- Create a new header, with new columns for the units.
- This will allow the numeric values to be interpreted as floats.
import pandas as pd
import seaborn as sns
# read the file in
data = list()
with open('test.txt', 'r') as f:
    rows = f.readlines()
    flag = False  # flag to True once the header row with Number is found
    for row in rows:
        row = row.strip()
        if row.startswith('Number'):
            flag = True
            continue  # after the header row is found, skip it
        if flag:
            data.append(row.split())  # append rows after the header to data
# create a custom header where the unites have been added as column headers
header = ['Number', 'Site', 'Result', 'Test_Name', 'Pin', 'Channel', 'Low', 'U1', 'Measured', 'U2', 'High', 'U3', 'Force', 'U4', 'Loc']
# create the dataframe
df = pd.DataFrame(data, columns=header)
# save to csv
df.to_csv('file.csv', index=False)
# convert columns to numeric dtypes
df = df.apply(pd.to_numeric, errors='ignore')
# scale the columns as per their units
df.Measured = df.Measured.div(1000)
df.High = df.High.div(1000)
df.Force = df.Force.div(100000)
# display(df)
   Number  Site Result           Test_Name   Pin  Channel  Low U1  Measured  U2  High  U3  Force  U4  Loc
0     100     0   PASS  Continuity_PPMU_mV  XSCI      140 -1.0  V -0.427951  mV  -0.3  mV -0.001  uA    0
1     100     1   PASS  Continuity_PPMU_mV  XSCI       12 -1.0  V -0.430309  mV  -0.3  mV -0.001  uA    0
2     100     2   PASS  Continuity_PPMU_mV  XSCI       76 -1.0  V -0.430749  mV  -0.3  mV -0.001  uA    0
3     100     3   PASS  Continuity_PPMU_mV  XSCI      204 -1.0  V -0.431048  mV  -0.3  mV -0.001  uA    0
4     101     0   PASS  Continuity_PPMU_mV  XSCO      139 -1.0  V -0.456036  mV  -0.3  mV -0.001  uA    0
5     101     1   PASS  Continuity_PPMU_mV  XSCO       11 -1.0  V -0.458060  mV  -0.3  mV -0.001  uA    0
6     101     2   PASS  Continuity_PPMU_mV  XSCO       75 -1.0  V -0.457856  mV  -0.3  mV -0.001  uA    0
# plot
ax = sns.lineplot(data=df.iloc[:, 6:-2])
ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
 
						