Cleaning data and writing 14,000 rows to the database

I needed car data for a Ruby on Rails project. I found a GitHub repository where someone generously shared 14,000 rows of car model, makes, and year.

The challenge was to import it into a Rails database running Postgresql.

I used Ruby to format each row to an array and then exported that into a CSV. Here’s how I did it.

Cleaning Data with Ruby

I created a new file called car_types.txt in the Rails project/lib folder.

I selected a few records from the github repo and pasted them into the car_types.txt file. We’ll process a few records now and wait until the end to process the rest.

I often use a debugger when I’m cleaning up data. Since I’m working in a Ruby on Rails project, I’ll be using byebug debugger.

In Rails, we use seeds.rb file to populate the database with test data. It makes sense to do data clean up there as well.

Let’s open up the car_types.txt file with the File class in Ruby. We’re going to read data line by line and transform it into CSV format.

File.open('lib/car_types.txt').each do |line|
end

I’m going set a byebug breakpoint so that the application halts within the File.open block and we can see what is happening.

File.open('lib/car_types.txt').each do |line|
  byebug
end
# Run the seeds.rb from the console.
rails db:seed
Byebug halts runtime and you can start to modify data and see changes.
# In byebug mode.
line # "(1926, 'Chrysler', 'Imperial'),\n"
# We want to get each line looking like this:
# "1926,Chrysler,Imperial"
#create an empty array to store each line of record.
cars = []

File.open('lib/car_types.txt').each do |line|
  line
  # Remove the left parenthesis
  line.gsub!("(", "")
  # Remove the right parenthesis
  line.gsub!(")", "")
  # Remove the single quotes
  line.gsub!("'", "")
  # Remove the new line (\n)
  line.chomp!
  # Remove the comma at the end.
  line = line[0...-1]
  # Remove the any spaces.
  line.gsub!(", ", ",")
  #line is now properly formatted.
  line # "1926, Chrysler, Imperial"
  #Add the line into  the cars array.
  cars.push(line)
end

We successfully formatted each line and wrote all the results to the cars array. Let’s write that array to a csv file.

Let’s create a file car_list_new.csv in the project/lib folder.

I won’t mention byebug with every example I show here but I highly encourage you to explore your code by using it to set breakpoints and see what Ruby is doing.

Create a File.open block which will open the new car_list_new.csv file and write the formatted car data to it.

File.open('lib/car_list_new.csv', 'a') do |file|
  file.write(line)
  file.write("\n")
end

Open car_list_new.csv and you’ll see csv formatted car data on each line.

1926,Chrysler,Imperial
1948,Citroën,2CV
1950,Hillman,Minx Magnificent
1953,Chevrolet,Corvette
1954,Chevrolet,Corvette
1954,Cadillac,Fleetwood
1955,Chevrolet,Corvette
1955,Ford,Thunderbird
1956,Chevrolet,Corvette
1957,Chevrolet,Corvette
1957,BMW,600
1958,Chevrolet,Corvette
1958,BMW,600
1958,Ford,Thunderbird

Writing CSV to Postgresql

Let’s write the results to the database. In seeds.db, we’ll open car_list_new.csv and read each line within the block.

File.open('lib/car_list_new.csv').each do |line|
  line # "Year,Make,Model\n"
  #Let's remove the new line (\n) and split each item into an array.
  vehicle = line.chomp.split(",")
  #Create a record in the database from each line.
  Car.create(year: vehicle[0].to_i, make: vehicle[1], model: vehicle[2])
end

I view the database with Postico and see the populated car records.

Viewing the cars table with Postico.

Inserting all car records

Let’s insert all 14,000 cars in the database. Copy and paste all the car records into car_types.txt and delete all text from car_list_new.csv file. Before we run all our code in seeds.db, let’s clear out all the records from the car table by running Car.delete_all.

Here’s how the final seeds.db file looks like.

Car.delete_all

cars = []
File.open('lib/car_types.txt').each do |line|
  line
  line.gsub!("(", "")
  line.gsub!(")", "")
  line.gsub!("'", "")
  line.chomp!
  line = line[0...-1]
  line.gsub!(", ", ",")
  cars << line
end

File.open('lib/car_list_new.csv').each do |line|
  vehicle = line.chomp.split(",")
  Car.create(year: vehicle[0].to_i, make: vehicle[1], model: vehicle[2])
end

It took about 15 seconds to populate the database with over 14,000 rows of car data.

The way I’m writing to the database is inefficient. It’s slow. We’re pinging the database to create a row after reading each line of records. That’s over 14,000 hits to the database. You can use the activerecord-import gem to bring the time down to 2 seconds.

However, it doesn’t matter. I’m using this data in development environment and waiting 15 seconds isn’t a deal breaker. As a side benefit, I don’t have to use extra gems.

Share on facebook
Share
Share on linkedin
Share
Share on reddit
Reddit
Share on twitter
Twitter
Nikita Kazakov
Nikita Kazakov

Software Developer and Product Owner.