Cleaning data and writing 14,000 rows to the database

I needed car data for a Ruby on Rails project. I found a GitHub repository where someone generously shared 14,000 rows of car model, makes, and year.

The challenge was to import it into a Rails database running Postgresql.

I used Ruby to format each row to an array and then exported that into a CSV. Here’s how I did it.

Cleaning Data with Ruby

I created a new file called car_types.txt in the Rails project/lib folder.

I selected a few records from the github repo and pasted them into the car_types.txt file. We’ll process a few records now and wait until the end to process the rest.

I often use a debugger when I’m cleaning up data. Since I’m working in a Ruby on Rails project, I’ll be using byebug debugger.

In Rails, we use seeds.rb file to populate the database with test data. It makes sense to do data clean up there as well.

Let’s open up the car_types.txt file with the File class in Ruby. We’re going to read data line by line and transform it into CSV format.'lib/car_types.txt').each do |line|

I’m going set a byebug breakpoint so that the application halts within the block and we can see what is happening.'lib/car_types.txt').each do |line|
# Run the seeds.rb from the console.
rails db:seed
Byebug halts runtime and you can start to modify data and see changes.
# In byebug mode.
line # "(1926, 'Chrysler', 'Imperial'),\n"
# We want to get each line looking like this:
# "1926,Chrysler,Imperial"
#create an empty array to store each line of record.
cars = []'lib/car_types.txt').each do |line|
  # Remove the left parenthesis
  line.gsub!("(", "")
  # Remove the right parenthesis
  line.gsub!(")", "")
  # Remove the single quotes
  line.gsub!("'", "")
  # Remove the new line (\n)
  # Remove the comma at the end.
  line = line[0...-1]
  # Remove the any spaces.
  line.gsub!(", ", ",")
  #line is now properly formatted.
  line # "1926, Chrysler, Imperial"
  #Add the line into  the cars array.

We successfully formatted each line and wrote all the results to the cars array. Let’s write that array to a csv file.

Let’s create a file car_list_new.csv in the project/lib folder.

I won’t mention byebug with every example I show here but I highly encourage you to explore your code by using it to set breakpoints and see what Ruby is doing.

Create a block which will open the new car_list_new.csv file and write the formatted car data to it.'lib/car_list_new.csv', 'a') do |file|

Open car_list_new.csv and you’ll see csv formatted car data on each line.

1950,Hillman,Minx Magnificent

Writing CSV to Postgresql

Let’s write the results to the database. In seeds.db, we’ll open car_list_new.csv and read each line within the block.'lib/car_list_new.csv').each do |line|
  line # "Year,Make,Model\n"
  #Let's remove the new line (\n) and split each item into an array.
  vehicle = line.chomp.split(",")
  #Create a record in the database from each line.
  Car.create(year: vehicle[0].to_i, make: vehicle[1], model: vehicle[2])

I view the database with Postico and see the populated car records.

Viewing the cars table with Postico.

Inserting all car records

Let’s insert all 14,000 cars in the database. Copy and paste all the car records into car_types.txt and delete all text from car_list_new.csv file. Before we run all our code in seeds.db, let’s clear out all the records from the car table by running Car.delete_all.

Here’s how the final seeds.db file looks like.


cars = []'lib/car_types.txt').each do |line|
  line.gsub!("(", "")
  line.gsub!(")", "")
  line.gsub!("'", "")
  line = line[0...-1]
  line.gsub!(", ", ",")
  cars << line
end'lib/car_list_new.csv').each do |line|
  vehicle = line.chomp.split(",")
  Car.create(year: vehicle[0].to_i, make: vehicle[1], model: vehicle[2])

It took about 15 seconds to populate the database with over 14,000 rows of car data.

The way I’m writing to the database is inefficient. It’s slow. We’re pinging the database to create a row after reading each line of records. That’s over 14,000 hits to the database. You can use the activerecord-import gem to bring the time down to 2 seconds.

However, it doesn’t matter. I’m using this data in development environment and waiting 15 seconds isn’t a deal breaker. As a side benefit, I don’t have to use extra gems.

Share on facebook
Share on linkedin
Share on reddit
Share on twitter
Nikita Kazakov
Nikita Kazakov

Software Developer and Product Owner.