Name: Thanmayi Akkala
Email: takkal2@uic.edu
UIN: 650556907
The LLM Encoder Project is designed to process large-scale text corpora using a parallel distributed system architecture. This project utilizes Hadoop's MapReduce framework to handle data tokenization, word frequency calculation, and Word2Vec-based token embedding generation. The primary goal of the system is to generate token embeddings that capture the semantic meaning of words in the corpus and identify tokens that are semantically similar using cosine similarity.
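For reference, the cosine similarity used to compare token embeddings can be computed as in the following minimal Scala sketch; the object and method names are illustrative, not the project's actual API:

```scala
// Minimal sketch of cosine similarity between two embedding vectors.
// Names are illustrative stand-ins for whatever the project uses internally.
object CosineSimilaritySketch {
  // cosine(a, b) = dot(a, b) / (||a|| * ||b||); returns 0.0 for a zero vector.
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "embeddings must have the same dimension")
    val dot   = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
  }

  def main(args: Array[String]): Unit =
    // Vectors pointing in the same direction score 1.0 (up to float error).
    println(cosine(Array(1.0, 2.0, 3.0), Array(2.0, 4.0, 6.0)))
}
```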
The system processes text data efficiently by leveraging the MapReduce paradigm for distributed data handling. The project is designed to scale and handle massive datasets, making it particularly suited for cloud environments like Amazon EMR. By using a combination of tokenization and deep learning-based Word2Vec embeddings, this system provides meaningful word embeddings and relationships between tokens, which are valuable for various natural language processing tasks.
This project is implemented in Scala (compiled for Scala 2.13) and targets Java 11 to ensure compatibility with modern cloud environments, and it uses key technologies such as Hadoop 3.3.6 for MapReduce and Deeplearning4j for the Word2Vec embeddings.
This project uses SBT (Scala Build Tool) to manage dependencies and compile the project.
build.sbt includes the following dependencies (a minimal sketch follows the list):
- org.apache.hadoop: MapReduce and Hadoop Common libraries.
- org.deeplearning4j: the Word2Vec implementation.
- jtokkit: tokenization.
- logback and slf4j: logging.
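A build.sbt along these lines would pull in the stack above; Hadoop 3.3.6 and Scala 2.13 match this project, while the remaining version numbers are assumptions for illustration:

```scala
// Illustrative build.sbt sketch. Hadoop 3.3.6 and Scala 2.13 match this project;
// the other version numbers are assumptions, not pinned by this README.
ThisBuild / scalaVersion := "2.13.12"

libraryDependencies ++= Seq(
  "org.apache.hadoop"  % "hadoop-common"                % "3.3.6",
  "org.apache.hadoop"  % "hadoop-mapreduce-client-core" % "3.3.6",
  "org.deeplearning4j" % "deeplearning4j-nlp"           % "1.0.0-M2.1", // Word2Vec
  "com.knuddels"       % "jtokkit"                      % "1.0.0",      // tokenization
  "ch.qos.logback"     % "logback-classic"              % "1.4.14",     // logging
  "org.slf4j"          % "slf4j-api"                    % "2.0.9"
)
```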
Clone the Project:

Build the Project:
- Ensure build.sbt is properly configured with all dependencies.
- Run sbt clean assembly to compile and build the project into a JAR file (the assembly task comes from the sbt-assembly plugin; see the sketch below).
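The assembly task is provided by the sbt-assembly plugin, enabled in project/plugins.sbt; the plugin version below is an assumption:

```scala
// project/plugins.sbt — provides the `assembly` task (plugin version is an assumption)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.1.5")
```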
Running the Code in IntelliJ:
- Open the LLMEncoderDriver.scala file and run its main method, or launch it from sbt. Example:

sbt "runMain com.thanu.llm.LLMEncoderDriver <input file path> <output directory> <output_directory_2>"

- Alternatively, select Run or Debug from IntelliJ's menu to start the process.

Running with Hadoop Locally:
hadoop jar target/scala-2.13/CloudLLMProject-assembly.jar com.thanu.llm.LLMEncoderDriver <input-path> <output-path-1> <output-path-2>
Here, <input-path> is the path to your input text file, and <output-path-1> and <output-path-2> are the directories where the output files will be written.

Running on Amazon EMR:
hadoop jar s3://<your-bucket-name>/CloudLLMProject-assembly.jar com.thanu.llm.LLMEncoderDriver s3://<your-bucket-name>/<input-path> s3://<your-bucket-name>/<output-path-1> s3://<your-bucket-name>/<output-path-2>
The tests are under src/test/scala. They can all be run at once with sbt test, or individually by passing a test class name, for example:

sbt "testOnly *Word2VecMapperTest"
The first mapper/reducer pair emits each token together with its number of occurrences; a minimal sketch of this stage follows.
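The sketch below uses Hadoop's mapreduce API; the class names and the whitespace split are illustrative stand-ins, since the actual pipeline tokenizes with JTokkit:

```scala
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Mapper, Reducer}

// Illustrative token-count stage, not the project's actual source.
class TokenCountMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()

  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
    // Emit (token, 1) for every token in the input line.
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { token =>
      word.set(token)
      context.write(word, one)
    }
}

class TokenCountReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    // Sum the partial counts for this token and emit (token, total).
    var sum = 0
    values.forEach(v => sum += v.get())
    context.write(key, new IntWritable(sum))
  }
}
```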
The Word2Vec mapper and reducer emit each token with its corresponding embedding, followed by its most similar tokens, in the output file.
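The standalone sketch below shows the Deeplearning4j Word2Vec calls this stage builds on: training a model, looking up a token's embedding, and querying its nearest (cosine-similar) tokens. The input path, query token, and hyperparameters are assumptions for illustration:

```scala
import org.deeplearning4j.models.word2vec.Word2Vec
import org.deeplearning4j.text.sentenceiterator.BasicLineIterator
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory

// Minimal Word2Vec usage sketch; not the project's actual mapper/reducer code.
object Word2VecSketch {
  def main(args: Array[String]): Unit = {
    val iter = new BasicLineIterator("input.txt") // hypothetical local corpus
    val vec = new Word2Vec.Builder()
      .minWordFrequency(1)
      .layerSize(100) // embedding dimension
      .iterate(iter)
      .tokenizerFactory(new DefaultTokenizerFactory())
      .build()
    vec.fit() // train the embeddings over the corpus

    val embedding = vec.getWordVector("data")   // the token's embedding vector
    val similar   = vec.wordsNearest("data", 5) // top-5 cosine-similar tokens
    println(s"dim = ${embedding.length}, nearest = $similar")
  }
}
```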
After deploying on EMR, once the step status is Completed, the output folders passed as the output_1 and output_2 arguments are created, along with the corresponding output files.
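Each output directory will typically contain Hadoop's standard artifacts: a _SUCCESS marker and one part-r-NNNNN file per reducer, for example:

```
s3://<your-bucket-name>/<output-path-1>/_SUCCESS
s3://<your-bucket-name>/<output-path-1>/part-r-00000
```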