

This version is a Streamlit app that lets the user provide the same arguments that would otherwise be passed as command-line arguments. It uses the Streamlit library to create an interactive web app in which the user enters the file path, the train/test split size, and the threshold for the number of missing values per record. The user can then click the "Process Data" button to run the script and preprocess the data.


import pandas as pd
import streamlit as st
from sklearn.model_selection import train_test_split


def load_and_convert_data(file_path):
    df = pd.read_csv(file_path)
    # Initialize dictionary to track string to numeric conversions
    conversions = {}
    # Convert string values to numeric and track conversions in dictionary
    ...
    return df, conversions


def handle_missing_values(df, threshold):
    # Drop records with more than threshold missing values
    df = df.dropna(thresh=len(df.columns) - threshold)
    # Impute the remaining missing values with each column's median
    return df.fillna(df.median(numeric_only=True))


def split_data(df, test_size):
    return train_test_split(df, test_size=test_size)


st.set_page_config(page_title="Data Preprocessing", page_icon=":guardsman:", layout="wide")

file_path = st.text_input("Enter the path/name of the dataset csv file: ")
test_size = st.number_input("Enter the train/test split size (decimal between 0 and 1): ", step=0.01, value=0.2)
threshold = st.number_input("Enter the threshold for the number of missing values per record: ", step=1, value=1)

# Run the preprocessing only when the user clicks the button
if st.button("Process Data"):
    df, conversions = load_and_convert_data(file_path)
    df = handle_missing_values(df, threshold)
    train_df, test_df = split_data(df, test_size)
    st.success("Data preprocessing completed!")
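The script's comments describe converting string values to numbers while tracking the conversions in a dictionary, and dropping records that exceed the missing-value threshold. The following is a minimal sketch of how those two steps could look with pandas alone. The function names `convert_strings_to_numeric` and `drop_then_impute` are hypothetical, and the use of `pd.factorize` for the encoding is an assumption; the original script may encode strings differently.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def convert_strings_to_numeric(df):
    """Replace each non-numeric column with integer codes and record the mapping.

    Assumption: integer codes from pd.factorize stand in for whatever
    encoding the original script used.
    """
    df = df.copy()  # leave the caller's frame untouched
    conversions = {}
    for col in df.columns:
        if is_numeric_dtype(df[col]):
            continue
        codes, uniques = pd.factorize(df[col])
        # factorize marks missing values with -1; keep them missing.
        df[col] = pd.Series(codes, index=df.index).where(codes != -1)
        conversions[col] = {value: code for code, value in enumerate(uniques)}
    return df, conversions


def drop_then_impute(df, threshold):
    """Drop rows with more than `threshold` missing values, then fill the rest."""
    # thresh is the minimum number of non-missing values a row needs to survive.
    kept = df.dropna(thresh=len(df.columns) - threshold)
    # Fill the remaining gaps with each column's median.
    return kept.fillna(kept.median(numeric_only=True))
```

Calling `convert_strings_to_numeric(df)` returns the converted frame together with a dictionary such as `{"color": {"red": 0, "blue": 1}}`, which can be used later to map the codes back to the original strings. Dropping before imputing matters here: if the gaps were filled first, no row would ever exceed the threshold and the drop step would have no effect.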
