I have a dataset of jobs with columns such as “Title”, “Description”, and “City”, plus a “Best Jobs” column. “Best Jobs” is the output and has two values (Yes, No): Yes means the job is part time and No means the job is full time. I want to train a machine learning model where X (the feature columns) will be Title, Description, etc., and the label will be “Best Jobs”. However, I do not know how to train a model on string columns. Please help me with this.
import numpy as np
import pandas as pd
import os, sys
from sklearn.preprocessing import MinMaxScaler
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
df = pd.read_csv("machinelearning-new-best-gar-jobs.csv", engine='python', encoding='mac_roman')
df.head()
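# Treat blank descriptions as missing, then drop those rows.
# Assigning the result back avoids relying on an in-place edit of a column selection.
df['Job description'] = df['Job description'].replace(' ', np.nan)
df = df.dropna(subset=['Job description'])
df.isnull().sum()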
Then I will convert the label (BestJobs) to the integers 1 and 0:
df['BestJobs'] = (df['BestJobs'] == 'Yes').astype(int)  # changing yes to 1 and no to 0
print(df['BestJobs'].value_counts())
Which model should I apply to get this done?
Answer
I think you can probably use just two columns, “Job description” and “Best Jobs”, to train the model. It then becomes a text classification problem, like classifying movie reviews as either positive or negative. You can preprocess the job description text and train a neural network on it.
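To make the text classification idea concrete, here is a minimal sketch. It swaps in a simpler baseline than a neural network: TF-IDF features fed to logistic regression with scikit-learn. The toy `descriptions` and `labels` lists are made up for illustration and only stand in for the “Job description” and “BestJobs” columns, so treat this as a starting point rather than a tuned solution.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for df['Job description'] and df['BestJobs'].
descriptions = [
    "part time barista weekend shifts flexible hours",
    "part time retail assistant evening shifts",
    "part time tutor flexible hours after school",
    "part time delivery driver weekend work",
    "full time software engineer permanent position",
    "full time accountant permanent salaried role",
    "full time project manager permanent contract",
    "full time nurse permanent hospital position",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = Yes (part time), 0 = No (full time)

# The vectorizer turns each string into a numeric TF-IDF vector,
# which is what lets a standard classifier train on a text column.
model = make_pipeline(
    TfidfVectorizer(lowercase=True),
    LogisticRegression(max_iter=1000),
)
model.fit(descriptions, labels)

# Classify two unseen (also made-up) descriptions.
predictions = model.predict([
    "part time cashier weekend shifts",
    "full time analyst permanent position",
])
print(predictions)
```

With the real data, the inputs would be the “Job description” column and the 0/1 “BestJobs” label, ideally split with `train_test_split` so the accuracy is measured on rows the model has not seen. The same vectorized features could also be passed to another classifier, such as the `XGBClassifier` already imported in the question.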
The basic idea is that you may only need a few of the features to train your model, rather than processing every column you have. For the preprocessing step, you can refer to this blog post: https://medium.com/analytics-vidhya/text-preprocessing-for-nlp-natural-language-processing-beginners-to-master-fd82dfecf95 (Text Preprocessing for NLP).
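As a rough illustration of what text preprocessing can look like, here is a small sketch using only the Python standard library. The `clean_text` helper and the tiny `STOP_WORDS` set are hypothetical names invented for this example, not something taken from the blog post, and a real project would likely use a fuller stop-word list.

```python
import re

# A small, hypothetical stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "for", "of", "to", "in", "we"}

def clean_text(text):
    """Lowercase the text, replace non-letters with spaces, and drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # digits and punctuation become spaces
    tokens = text.split()
    return " ".join(token for token in tokens if token not in STOP_WORDS)

print(clean_text("We are hiring a Part-Time Barista (20 hrs/week)!"))
```

On a DataFrame, a function like this could be applied to the text column with `df['Job description'].apply(clean_text)` before the text is vectorized.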
Hope it is helpful for you!