String & File Processing

Summary

A list of data pipes useful for processing the contents of data files.

Data Pre-processing

File to Stream of Lines

fileToStream
  • Type: DataPipe[String, Stream[String]]
  • Result: Converts a text file, given as a file path string, into a Stream[String] of its lines.
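
A minimal sketch of the behaviour described above, written as a plain Scala function rather than a DynaML DataPipe. The object name `FileToStreamSketch` and the eager read are assumptions made for illustration; the library's own implementation may differ. On Scala 2.13 and later, `Stream` is deprecated in favour of `LazyList`.

```scala
import scala.io.Source

// Hypothetical stand-in, not the DynaML implementation.
object FileToStreamSketch {
  // Reads every line of the file at `path` and returns them as a Stream.
  def fileToStream(path: String): Stream[String] = {
    val source = Source.fromFile(path)
    // Materialise the lines before closing so the file handle is released.
    try source.getLines().toList.toStream
    finally source.close()
  }
}
```

Reading eagerly keeps the example safe to close immediately; a lazy variant would have to keep the file open until the stream is fully consumed.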

Write Stream of Lines to File

streamToFile(fileName: String)
  • Type: DataPipe[Stream[String], Unit]
  • Result: Writes a stream of lines to the file specified by fileName
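
The write step can be sketched in the same spirit. This is an illustration using a plain function, and it assumes the target file is overwritten and each line is newline-terminated; neither detail is stated above, and `StreamToFileSketch` is a hypothetical name.

```scala
import java.io.PrintWriter

// Hypothetical stand-in, not the DynaML implementation.
object StreamToFileSketch {
  // Returns a function that writes each line to `fileName`, one per row.
  // Assumption: an existing file is overwritten rather than appended to.
  def streamToFile(fileName: String): Stream[String] => Unit =
    lines => {
      val writer = new PrintWriter(fileName)
      try lines.foreach(line => writer.println(line))
      finally writer.close()
    }
}
```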

Drop first line in Stream

dropHead
  • Type: DataPipe[Stream[String], Stream[String]]
  • Result: Drops the first element of a Stream[String]
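
Dropping the first line is typically how a header row is skipped. The sketch below shows the idea with a plain function; `DropHeadSketch` is a hypothetical name, and using `drop(1)` instead of `tail` is a choice made here so that an empty stream does not throw.

```scala
// Hypothetical stand-in, not the DynaML implementation.
object DropHeadSketch {
  // Drops the first line; drop(1) also handles an empty stream safely.
  val dropHead: Stream[String] => Stream[String] = _.drop(1)
}
```

For example, applying it to the lines `id,value`, `1,2.5`, `2,3.1` leaves only the two data rows.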

Replace Occurrences of a String

replace(original, newString)
  • Type: DataPipe[Stream[String], Stream[String]]
  • Result: Replaces all occurrences of a regular expression or string in a Stream of String with a specified replacement string.
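
A small sketch of line-wise replacement, assuming `original` is interpreted as a Java regular expression (as `String.replaceAll` does). It is written as a plain function with the hypothetical name `ReplaceSketch`, not as the library's pipe.

```scala
// Hypothetical stand-in, not the DynaML implementation.
object ReplaceSketch {
  // Applies a regex replacement to every line of the stream.
  def replace(original: String, newString: String): Stream[String] => Stream[String] =
    lines => lines.map(_.replaceAll(original, newString))
}
```

For instance, `ReplaceSketch.replace("\\s+", ",")` turns runs of white space into single commas on each line.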

Replace White Spaces

replaceWhiteSpaces
  • Type: DataPipe[Stream[String], Stream[String]]
  • Result: Replace all white space characters in a stream of lines.

Remove Trailing White Spaces

  • Type: DataPipe[Stream[String], Stream[String]]
  • Result: Trims leading and trailing white space from every line.
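
Trimming can be illustrated with `String.trim` mapped over the stream. The name `TrimLinesSketch` is hypothetical and the block is only a sketch of the described behaviour.

```scala
// Hypothetical stand-in, not the DynaML implementation.
object TrimLinesSketch {
  // Removes leading and trailing white space from each line;
  // white space inside a line is left untouched.
  val trim: Stream[String] => Stream[String] = _.map(_.trim)
}
```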

Remove Missing Records

removeMissingLines
  • Type: DataPipe[Stream[String], Stream[String]]
  • Result: Remove all lines/records which contain missing values
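
What counts as a missing value is not specified above. The sketch below assumes comma-separated records in which a missing value shows up as an empty or blank field; that criterion, and the name `RemoveMissingSketch`, are assumptions for illustration only.

```scala
// Hypothetical stand-in, not the DynaML implementation.
object RemoveMissingSketch {
  // Assumption: fields are comma-separated and a missing value is an empty
  // (or whitespace-only) field. The limit -1 keeps trailing empty fields.
  val removeMissingLines: Stream[String] => Stream[String] =
    _.filter(line => line.split(",", -1).forall(_.trim.nonEmpty))
}
```

With this rule, `1,2,3` is kept while `4,,6` and `7,8,` are dropped.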

Create Train/Test splits

splitTrainingTest(num_training, num_test)
  • Type: DataPipe[(Stream[(DenseVector[Double], Double)], Stream[(DenseVector[Double], Double)]), (Stream[(DenseVector[Double], Double)], Stream[(DenseVector[Double], Double)])]
  • Result: Extracts a subset of the data into a Tuple2 that can be used as a training/test pair for model learning and evaluation.
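
The shape of this pipe can be sketched with a generic feature type standing in for Breeze's `DenseVector[Double]`. How the subset is chosen is not stated above; the sketch assumes the first `numTraining` records of the first stream and the last `numTest` records of the second. `SplitSketch` and `Labeled` are hypothetical names.

```scala
// Hypothetical stand-in, not the DynaML implementation.
object SplitSketch {
  // A stream of (features, target) pairs; F stands in for DenseVector[Double].
  type Labeled[F] = Stream[(F, Double)]

  // Assumption: training data is taken from the front of the first stream
  // and test data from the end of the second stream.
  def splitTrainingTest[F](numTraining: Int, numTest: Int)
      : ((Labeled[F], Labeled[F])) => (Labeled[F], Labeled[F]) =
    data => (data._1.take(numTraining), data._2.takeRight(numTest))
}
```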
