Methodology (text)

H-1B Prevailing Wage Claims vs. Reality:
What They Say, Versus What They Pay

By Robert Hill, CUNY, Murphy Institute for Worker Education

Methodology (text)

Acquisition of the Data

The source of the data is the Foreign Labor Certification Data Center website at http://www.flcdatacenter.com/CaseH1B.aspx. This website is the public distribution site for the current and archived LCAs going back to 2001. This study uses the data for FY2010 as published in Microsoft Access “.mdb” format.

RH_H-1B_F06_FLCDC_link

Figure 6: FLCDC Online Wage Library iCert data link.

When opened in Microsoft Access, the downloaded file proves to be a record of 335,328 LCAs stored as a single table named “H1B_FY2010.” 

RH_H-1B_F07_H1B_FY2010_Table

Figure 7: Example of the full data table for the FY 2010 H-1B data.

This table of records represents all LCAs from all categories of job for fiscal year 2010, and it also includes those that were denied or withdrawn. It also includes fields that are not of interest. To sort through the data and produce sub-sets of data that could be used to replicate Miano’s 2007 research, a set of queries was written to extract the data and export it to Microsoft Excel format.

RH_H-1B_F08_Sample_query

Figure 8: Example of an Access query to export data to Excel.

Miano’s 2007 study was drawn from LCA data that used the older 3-digit LCA occupational codes. For his purposes, he focused his attention on the “Computer Related Occupation” codes, as follows:

  • 030: Occupations in Systems Analysis and Programming
  • 031: Occupations in Data Communications and Networks
  • 032: Occupations in Computer System User Support
  • 033: Occupations in Computer System Technical Support
  • 039: Other Computer-Related Occupations

In the corresponding SOC code system, the relevant codes are those in the “15-xxxx.xx” series, as shown in the table below. For the purposes of this study, the ten SOC codes listed below are used for comparison because these are the jobs codes that directly tie the LCAs to the OES data. 

  • 15-1011: Computer and Information Scientists, Research
  • 15-1021: Computer Programmers
  • 15-1031: Computer Software Engineers, Applications
  • 15-1032: Computer Software Engineers, Systems Software
  • 15-1041: Computer Support Specialists
  • 15-1051: Computer Systems Analysts
  • 15-1061: Database Administrators
  • 15-1071: Network and Computer Systems Administrators
  • 15-1081: Network Systems and Data Communications Analysts
  • 15-1099: Computer Specialists, All Other

Further, these are the SOC codes for which the Bureau of Labor Statistics maintains wage estimate data, as available on their website.

RH_H-1B_F09_Computer_Math_Science_Occupation_codes

Figure 9: Listing of SOC codes on the BLS site, with links to data, published May 2009.

There were a few hundred instances where the queries returned SOC codes that fell outside of those listed above. These outlier SOC codes were:

  • 15-1022: Computer Programmers
  • 15-1023: Computer Programmers
  • 15-1034: Computer Software Engineers, Applications, Non R&D
  • 15-1035: Computer Software Engineers, Applications, R&D
  • 15-1036: Computer Software Engineers, Systems Software, Non RD&
  • 15-1037: Computer Software Engineers, Systems Software, R&D

As there were only a few hundred LCAs with these SOC codes, and as there is no OES data available for them, these relatively few records were left outside of the calculations. 

Each of the 10 queries from the database file was exported to one of 10 Excel files for further processing.

Details of how the data was acquired and extracted, including the formal Structured Query Language (SQL) statement and definitions of the fields extracted, can be found in Appendix A: Data Acquisition Details.

Transformation and Grooming of the Data

Once exported from Access to Excel format, each SOC code’s returned data was groomed with a number of formatting treatments to make it usable for calculations and to cast it in a standardized format.

One problem that had to be overcome was transforming the character type of the wage data. When data is stored in a database file, certain decisions are made about its “metadata” characteristics. For instance, metadata about whether the data is text, or an integer, or a floating point number is stored along with the data itself. It was discovered that the numbers representing the wage data, both LCA and prevailing, were stored in the database file as “text.” To the human eye, the numbers looked like numbers, but to the computer, the numbers looked like text and were unavailable for even simple calculations such as addition and multiplication. Because there were hundreds of thousands of data points with this fundamental flaw, a method was required to convert them from “stored-as-text” values to “stored-as-integer” values. Once this was accomplished, mathematical calculations could be performed on the wage data.

The original numbers representing the wage data were also stored in a variety of decimal formats. A standardized number format was adopted to transform numbers into integers, with no decimal places, and using a comma to separate groups of thousands. E.g., “75600.00” was transformed to “75,600” without the use of a dollar sign. This transformation was done for all wage columns in all ten Excel files.

 Details of how the data was transformed and groomed can be found in Appendix B: Transformation and Grooming of the Data, Details.

Removing Outliers By Calculating and Sorting By Parameters

Once calculations were possible, a set of functions was applied to each of the column of wage data so that outliers could be found and removed from the calculations. The following seven functions were calculated.

  • Min: This is the smallest number in the column, used to check for and eliminate negative numbers from the calculations.
  • Max: This is the largest number in the column, used to check for and eliminate any value over $300,000 as an outlier. 
  • Median: This is the number that represents the midpoint of the set of wage values. It is an actual value in the column. 50% of the values in the column should be larger than the median, and 50% smaller.
  • Mean: This is the number that represents the average of the set of wage values. It is a calculated value, not necessarily found in the set. If the median and the mean are close to each other in value, then the likelihood that there are significant outliers is low.
  • Standard Deviation: This is a statistical function that is used to assess how “normal” the distribution is. This value is supplied in the event that others find it useful.
  • Total Number of LCAs (count): This number was calculated to provide the number of LCAs included in each set of calculations.
  • Total Number of Workers (sum): This number was calculated to provide the number of workers included in each set of calculations.

These functions were applied to the following four columns of data.

  • LCA_CASE_WAGE_RATE_FROM: The primary data point used in this calculation. The vast majority of LCAs were for a single worker at a single wage. This value is what the employer proposed to pay the H-1B worker.
  • LCA_CASE_WAGE_RATE_TO: In some cases, LCAs showed wages as a range, in which case there was a upper limit provided. This data point was not used for any main calculations, but was used to eliminate the row of data from the main calculations in the “From Only” condition.
  • PW_1: The claimed prevailing wage for the position being filled. This value was used as the primary data point for prevailing wage calculations.
  • PW_2:  In a few cases, LCAs showed prevailing wages as a range. This data point was not used for any main calculations.

Figure 10, below demonstrates a typical screen shot of spreadsheet with its sorting functions.

RH_H-1B_F10_Numbers_above_bar_excluded

Figure 10: Example of wage values excluded from calculations.

Line 8 from Figure 10 demonstrates a LCA claim that was removed from all calculations on the basis of it being an outlier. This LCA for a “Web Developer” claimed that the prevailing wage for the job was $4,634,200 but that they intended to pay the H-1B worker a salary of $5,000,000. While this is possible, it is highly unlikely to be true, to say the least. It is much more likely that this simply represents a clerical error; the correct values should probably be $46,342 and $50,000, respectively. And, since this outlier was two orders of magnitude larger than expected, i.e. 100 times larger, leaving it in the calculations would have significantly skewed the averages.

Rather than applying any corrections to the suspect data, they were simply removed from the calculations. Maximum LCA_WAGE_FROM and PW_1 values that were above $300,000 were sought out and moved to be above the grey bar representing the column headers. Then, calculations were automatically reperformed on all data below the grey header bar. This process was repeated until all Maximum values were brought below $300,000. The data was left in the spreadsheets for review, but removed from the calculations.

Dividing the Data Into “From” and “From Only” Branches

One factor that potentially affected results was whether the LCA provided only a LCA_Wage_From value or also a LCA_Wage_To value. The vast majority of LCAs were for one worker each, and for this worker the employers provided a single wage value. That wage value was recorded in the LCA_Wage_From field, as illustrated in Table 1 below.

rH_H-1B_T01_LCA_Wage_Examples

Many thousands of the LCAs also included a value in the LCA_WAGE_TO column. This value, if present, indicated an upper limit to the wages that the employers intended to pay the H-1B workers. In the examples above, the wage information for the Java Developer can be reasonably assumed to be $55,000, and the wage information for the two Oracle developers could reasonably be assumed to be $96,000. However, how are we to calculate the salary information for any one of the four Systems Architects or 10 Customer Support workers? 

One option considered was to average the LCA_WAGE_FROM and LCA_WAGE_TO valued for use in the calculations, but this was rejected because it represented changing the raw data.  Ultimately, the solution chosen was to run the calculations on two branches of the data, one “From” and the other “From Only.” 

  • “From” condition: All validated LCAs were included in the calculations.
  • “From ” ONLY condition: Only those validated LCAs that had ONLY a LCA_WAGE_FROM value were included in the calculations. All LCAs that also included a LCA_WAGE_TO value were simply ignored.

The sample size of the “From” ONLY condition is smaller than that for the “From” condition, but it should still be sizeable enough to be valid. At any rate, comparisons can be made after-the-fact to determine whether there is a substantial effect in differentiating the two “From” conditions.

Dividing the Data Into “Non-Weighted” and “Weighted” Branches

Prior to the data analysis, it was not known whether “weighting” the data would reveal significant differences in the patterns revealed, so a decision was made to create both weighted and non-weighted conditions and compare them after analysis. 

The term “weighting” refers to converting single LCAs that represent multiple job positions into multiple LCAs that each represent a single job. For instance, consider the following LCA for five entry-level Java developers in Table 2 below. In the non-weighted condition, the salary information for this LCA is counted in the calculations ONCE, even though there are potentially (but not definitely) FIVE hires available. 

rH_H-1B_T02_Non-Weighted_Condition

In the weighted condition, this LCA for five entry-level Java developers was converted so that the wage information was counted in the calculations FIVE TIMES, as in Table 3 below.

rH_H-1B_T03_Weighted_Condition

Essentially, conducting analyses on both non-weighted and weighted conditions allowed for accounting for any POSSIBLE variation from the difference between the two conditions.

Four-Way Matrix Of Data Conditions

Ultimately, there were to be four variations of the data. The same calculations were run on each of the four variations.  The four dataset variations are listed and explained as follows:

rH_H-1B_T04a_Four_Way_Matrix

“From,” non-weighted: 

  • In this condition, the individual LCAs are all treated as equals, even though some are for more than one worker. 
  • An LCA for one 15-1031 worker at a salary of $50,000 is given the same significance as one that is the same in every respect EXCEPT that it is for 120 workers. 
  • The value of $50,000 is figured into the calculations only once. 
  • Even though some LCAs had only a “From” value while others had both a “From” and a “To” value, all LCAs were included in the calculations. The “To” values were ignored.

“From,” weighted: 

  • In this condition, each LCA is each copied by the number of workers called for in the LCA.
  • An LCA for 120 15-1031 workers at a salary of $50,000 is copied into 120 LCAs, (and 8 workers copied 8 times, etc.)
  • The value of $50,000 is figured into the calculations for as many workers as called for. 
  • Even though some LCAs had only a “From” value while others had both a “From” and a “To” value, all LCAs were included in the calculations. The “To” values were ignored.

“From” ONLY, non-weighted:

  • In this condition, the individual LCAs are all treated as equals, even though some are for more than one worker. 
  • An LCA for one 15-1031 worker at a salary of $50,000 is given the same significance as one that is the same in every respect EXCEPT that it is for 120 workers. 
  • The value of $50,000 is figured into the calculations only once. 
  • ONLY those LCAs that list ONLY an LCA “From” wage (and leave the LCA To field blank) are factored into the calculations. All LCAs that list a LCA “To” wage are discarded.

“From” ONLY, weighted: 

  • In this condition, each LCA is each copied by the number of workers called for in the LCA.
  • An LCA for 120 15-1031 workers at a salary of $50,000 is copied into 120 LCAs, (and 8 workers copied 8 times, etc.)
  • The value of $50,000 is figured into the calculations for as many workers as called for. 
  • ONLY those LCAs that list ONLY an LCA “From” wage (and leave the LCA To field blank) are factored into the calculations. All LCAs that list a LCA “To” wage are discarded.

 

 

[Introduction] [History] [LCA  / H-1B Process] [For And Against] [Previous Research] [The iCert System] [Methodology (text)] [Appendix A] [Appendix B] [Methodology (videos)] [Results] [Discussion] [Conclusions] [Downloadable Files] [External Links]

© 2011, Robert Hill, http://roberthill.org